]> www.infradead.org Git - users/willy/xarray.git/log
users/willy/xarray.git
2 months agonet: phy: micrel: Replace hardcoded pages with defines
Horatiu Vultur [Mon, 18 Aug 2025 07:51:20 +0000 (09:51 +0200)]
net: phy: micrel: Replace hardcoded pages with defines

The functions lan_*_page_reg gets as a second parameter the page
where the register is. In all the functions the page was hardcoded.
Replace the hardcoded values with defines to make it more clear
what are those parameters.

Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Link: https://patch.msgid.link/20250818075121.1298170-4-horatiu.vultur@microchip.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2 months agonet: phy: micrel: Introduce lanphy_modify_page_reg
Horatiu Vultur [Mon, 18 Aug 2025 07:51:19 +0000 (09:51 +0200)]
net: phy: micrel: Introduce lanphy_modify_page_reg

As the name suggests this function modifies the register in an
extended page. It has the same parameters as phy_modify_mmd.
This function was introduce because there are many places in the
code where the registers was read then the value was modified and
written back. So replace all this code with this function to make
it clear.

Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Link: https://patch.msgid.link/20250818075121.1298170-3-horatiu.vultur@microchip.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2 months agonet: phy: micrel: Start using PHY_ID_MATCH_MODEL
Horatiu Vultur [Mon, 18 Aug 2025 07:51:18 +0000 (09:51 +0200)]
net: phy: micrel: Start using PHY_ID_MATCH_MODEL

Start using PHY_ID_MATCH_MODEL for all the drivers.
While at this add also PHY_ID_KSZ8041RNLI to micrel_tbl.

It is safe to change the current of 0x00fffff0 to PHY_ID_MATCH_MODEL
because all the micrel PHYs have in MSB a value of 0.

Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20250818075121.1298170-2-horatiu.vultur@microchip.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2 months agoMerge tag 'mlx5-next-vhca-id' of git://git.kernel.org/pub/scm/linux/kernel/git/mellan...
Paolo Abeni [Thu, 21 Aug 2025 08:18:46 +0000 (10:18 +0200)]
Merge tag 'mlx5-next-vhca-id' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux

Saeed Mahameed says:

====================
mlx5-next-vhca-id

A preparation patchset for adjacent function vports.

Adjacent functions can delegate their SR-IOV VFs to sibling PFs,
allowing for more flexible and scalable management in multi-host and
ECPF-to-host scenarios. Adjacent vports can be managed by the management
PF via their unique vhca id and can't be managed by function index as the
index can conflict with the local vports/vfs.

This series provides:

- Use the cached vcha id instead of querying it every time from fw
- Query hca cap using vhca id instead of function id when FW supports it
- Add HW capabilities and required definitions for adjacent function vports

* tag 'mlx5-next-vhca-id' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
  {rdma,net}/mlx5: export mlx5_vport_get_vhca_id
  net/mlx5: E-Switch, Set/Query hca cap via vhca id
  net/mlx5: E-Switch, Cache vport vhca id on first cap query
  net/mlx5: mlx5_ifc, Add hardware definitions needed for adjacent vports
====================

Link: https://patch.msgid.link/20250815194901.298689-1-saeed@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2 months agonet: pktgen: Use min()/min_t() to improve pktgen_finalize_skb()
Thorsten Blum [Fri, 15 Aug 2025 15:33:33 +0000 (17:33 +0200)]
net: pktgen: Use min()/min_t() to improve pktgen_finalize_skb()

Use min() and min_t() to improve pktgen_finalize_skb() and avoid
calculating 'datalen / frags' twice.

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Link: https://patch.msgid.link/20250815153334.295431-3-thorsten.blum@linux.dev
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2 months agonet: openvswitch: Use for_each_cpu() where appropriate
Yury Norov (NVIDIA) [Mon, 18 Aug 2025 17:28:05 +0000 (13:28 -0400)]
net: openvswitch: Use for_each_cpu() where appropriate

Due to legacy reasons, openswitch code opencodes for_each_cpu() to make
sure that CPU0 is always considered.

Since commit c4b2bf6b4a35 ("openvswitch: Optimize operations for OvS
flow_stats."), the corresponding  flow->cpu_used_mask is initialized
such that CPU0 is explicitly set.

So, switch the code to using plain for_each_cpu().

Suggested-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: Yury Norov (NVIDIA) <yury.norov@gmail.com>
Acked-by: Ilya Maximets <i.maximets@ovn.org>
Link: https://patch.msgid.link/20250818172806.189325-1-yury.norov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agoselftests/net: packetdrill: Support single protocol test.
Kuniyuki Iwashima [Tue, 19 Aug 2025 23:14:57 +0000 (23:14 +0000)]
selftests/net: packetdrill: Support single protocol test.

Currently, we cannot write IPv4 or IPv6 specific packetdrill tests
as ksft_runner.sh runs each .pkt file for both protocols.

Let's support single protocol test by checking --ip_version in the
.pkt file.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250819231527.1427361-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: set net.core.rmem_max and net.core.wmem_max to 4 MB
Eric Dumazet [Tue, 19 Aug 2025 17:40:30 +0000 (17:40 +0000)]
net: set net.core.rmem_max and net.core.wmem_max to 4 MB

SO_RCVBUF and SO_SNDBUF have limited range today, unless
distros or system admins change rmem_max and wmem_max.

Even iproute2 uses 1 MB SO_RCVBUF which is capped by
the kernel.

Decouple [rw]mem_max and [rw]mem_default and increase
[rw]mem_max to 4 MB.

Before:

$ sysctl net.core.rmem_default net.core.rmem_max net.core.wmem_default net.core.wmem_max
net.core.rmem_default = 212992
net.core.rmem_max = 212992
net.core.wmem_default = 212992
net.core.wmem_max = 212992

After:

$ sysctl net.core.rmem_default net.core.rmem_max net.core.wmem_default net.core.wmem_max
net.core.rmem_default = 212992
net.core.rmem_max = 4194304
net.core.wmem_default = 212992
net.core.wmem_max = 4194304

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://patch.msgid.link/20250819174030.1986278-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agoMerge branch 'bnxt_en-updates-for-net-next'
Jakub Kicinski [Thu, 21 Aug 2025 02:34:11 +0000 (19:34 -0700)]
Merge branch 'bnxt_en-updates-for-net-next'

Michael Chan says:

====================
bnxt_en: Updates for net-next

The first patch is the FW interface update, followed by 3 patches to
support the expanded pcie v2 structure for ethtool -d.  The last patch
adds a Hyper-V PCI ID for the 5760X chips (Thor2).

v1: https://lore.kernel.org/20250818004940.5663-1-michael.chan@broadcom.com
====================

Link: https://patch.msgid.link/20250819163919.104075-1-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agobnxt_en: Add Hyper-V VF ID
Pavan Chebbi [Tue, 19 Aug 2025 16:39:19 +0000 (09:39 -0700)]
bnxt_en: Add Hyper-V VF ID

VFs of the P7 chip family created by Hyper-V will have the device ID of
0x181b.

Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20250819163919.104075-6-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agobnxt_en: Add pcie_ctx_v2 support for ethtool -d
Shruti Parab [Tue, 19 Aug 2025 16:39:18 +0000 (09:39 -0700)]
bnxt_en: Add pcie_ctx_v2 support for ethtool -d

Add support to dump the expanded v2 struct that contains PCIE read/write
latency and credit histogram data.

Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Signed-off-by: Shruti Parab <shruti.parab@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20250819163919.104075-5-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agobnxt_en: Add pcie_stat_len to struct bp
Shruti Parab [Tue, 19 Aug 2025 16:39:17 +0000 (09:39 -0700)]
bnxt_en: Add pcie_stat_len to struct bp

Add this length field to capture the length of the pcie stats structure
supported by the FW.  This length will be determined in
bnxt_ethtool_init().  The minimum of this FW length and the length
known to the driver will determine the actual ethtool -d length.

Suggested-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Shruti Parab <shruti.parab@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20250819163919.104075-4-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agobnxt_en: Refactor bnxt_get_regs()
Shruti Parab [Tue, 19 Aug 2025 16:39:16 +0000 (09:39 -0700)]
bnxt_en: Refactor bnxt_get_regs()

Separate the code that sends the FW message to retrieve pcie stats into
a new helper function.  The caller of the helper will call hwrm_req_hold()
beforehand and so the caller will call hwrm_req_drop() in all cases
afterwards.  This helper will be useful when adding the support for the
larger struct pcie_ctx_hw_stats_v2.

Signed-off-by: Shruti Parab <shruti.parab@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Link: https://patch.msgid.link/20250819163919.104075-3-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agobnxt_en: hsi: Update FW interface to 1.10.3.133
Michael Chan [Tue, 19 Aug 2025 16:39:15 +0000 (09:39 -0700)]
bnxt_en: hsi: Update FW interface to 1.10.3.133

The major change is struct pcie_ctx_hw_stats_v2 which has new latency
histograms added.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20250819163919.104075-2-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agoselftests: rtnetlink: print device info on preferred_lft test failure
Hangbin Liu [Tue, 19 Aug 2025 07:47:49 +0000 (07:47 +0000)]
selftests: rtnetlink: print device info on preferred_lft test failure

Even with slowwait used to avoid system sleep in the preferred_lft test,
failures can still occur after long runtimes.

Print the device address info when the test fails to provide better
troubleshooting data.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://patch.msgid.link/20250819074749.388064-1-liuhangbin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agoselftests: net: bpf_offload: print loaded programs on mismatch
Hangbin Liu [Tue, 19 Aug 2025 07:33:48 +0000 (07:33 +0000)]
selftests: net: bpf_offload: print loaded programs on mismatch

The test sometimes fails due to an unexpected number of loaded programs. e.g

  FAIL: 2 BPF programs loaded, expected 1
    File "/usr/libexec/kselftests/net/./bpf_offload.py", line 940, in <module>
      progs = bpftool_prog_list(expected=1)
    File "/usr/libexec/kselftests/net/./bpf_offload.py", line 187, in bpftool_prog_list
      fail(True, "%d BPF programs loaded, expected %d" %
    File "/usr/libexec/kselftests/net/./bpf_offload.py", line 89, in fail
      tb = "".join(traceback.extract_stack().format())

However, the logs do not show which programs were actually loaded, making it
difficult to debug the failure.

Add printing of the loaded programs when a mismatch is detected to help
troubleshoot such errors. The list is printed on a new line to avoid breaking
the current log format.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://patch.msgid.link/20250819073348.387972-1-liuhangbin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agoselftests/net/socket.c: removed warnings from unused returns
Alex Tran [Tue, 19 Aug 2025 02:52:27 +0000 (19:52 -0700)]
selftests/net/socket.c: removed warnings from unused returns

socket.c: In function ‘run_tests’:
socket.c:59:25: warning: ignoring return value of ‘strerror_r’ \
declared with attribute ‘warn_unused_result’ [-Wunused-result]
59 | strerror_r(-s->expect, err_string1, ERR_STRING_SZ);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

socket.c:60:25: warning: ignoring return value of ‘strerror_r’ \
declared with attribute ‘warn_unused_result’ [-Wunused-result]
60 | strerror_r(errno, err_string2, ERR_STRING_SZ);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

socket.c:73:33: warning: ignoring return value of ‘strerror_r’ \
declared with attribute ‘warn_unused_result’ [-Wunused-result]
73 | strerror_r(errno, err_string1, ERR_STRING_SZ);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

changelog:
v2
- const char* messages and fixed patch warnings of max 75 chars
  per line

Signed-off-by: Alex Tran <alex.t.tran@gmail.com>
Link: https://patch.msgid.link/20250819025227.239885-1-alex.t.tran@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: avoid one loop iteration in __skb_splice_bits
Pengtao He [Tue, 19 Aug 2025 02:15:51 +0000 (10:15 +0800)]
net: avoid one loop iteration in __skb_splice_bits

If *len is equal to 0 at the beginning of __splice_segment
it returns true directly. But when decreasing *len from
a positive number to 0 in __splice_segment, it returns false.
The __skb_splice_bits needs to call __splice_segment again.

Recheck *len if it changes, return true in time.
Reduce unnecessary calls to __splice_segment.

Signed-off-by: Pengtao He <hept.hept.hept@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250819021551.8361-1-hept.hept.hept@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agoMerge branch 'sctp-convert-to-use-crypto-lib-and-upgrade-cookie-auth'
Jakub Kicinski [Wed, 20 Aug 2025 02:36:29 +0000 (19:36 -0700)]
Merge branch 'sctp-convert-to-use-crypto-lib-and-upgrade-cookie-auth'

Eric Biggers says:

====================
sctp: Convert to use crypto lib, and upgrade cookie auth

This series converts SCTP chunk and cookie authentication to use the
crypto library API instead of crypto_shash.  This is much simpler (the
diffstat should speak for itself), and also faster too.  In addition,
this series upgrades the cookie authentication to use HMAC-SHA256.

I've tested that kernels with this series applied can continue to
communicate using SCTP with older ones, in either direction, using any
choice of None, HMAC-SHA1, or HMAC-SHA256 chunk authentication.
====================

Link: https://patch.msgid.link/20250818205426.30222-1-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agosctp: Stop accepting md5 and sha1 for net.sctp.cookie_hmac_alg
Eric Biggers [Mon, 18 Aug 2025 20:54:26 +0000 (13:54 -0700)]
sctp: Stop accepting md5 and sha1 for net.sctp.cookie_hmac_alg

The upgrade of the cookie authentication algorithm to HMAC-SHA256 kept
some backwards compatibility for the net.sctp.cookie_hmac_alg sysctl by
still accepting the values 'md5' and 'sha1'.  Those algorithms are no
longer actually used, but rather those values were just treated as
requests to enable cookie authentication.

As requested at
https://lore.kernel.org/netdev/CADvbK_fmCRARc8VznH8cQa-QKaCOQZ6yFbF=1-VDK=zRqv_cXw@mail.gmail.com/
and https://lore.kernel.org/netdev/20250818084345.708ac796@kernel.org/ ,
go further and start rejecting 'md5' and 'sha1' completely.

Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Link: https://patch.msgid.link/20250818205426.30222-6-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agosctp: Convert cookie authentication to use HMAC-SHA256
Eric Biggers [Mon, 18 Aug 2025 20:54:25 +0000 (13:54 -0700)]
sctp: Convert cookie authentication to use HMAC-SHA256

Convert SCTP cookies to use HMAC-SHA256, instead of the previous choice
of the legacy algorithms HMAC-MD5 and HMAC-SHA1.  Simplify and optimize
the code by using the HMAC-SHA256 library instead of crypto_shash, and
by preparing the HMAC key when it is generated instead of per-operation.

This doesn't break compatibility, since the cookie format is an
implementation detail, not part of the SCTP protocol itself.

Note that the cookie size doesn't change either.  The HMAC field was
already 32 bytes, even though previously at most 20 bytes were actually
compared.  32 bytes exactly fits an untruncated HMAC-SHA256 value.  So,
although we could safely truncate the MAC to something slightly shorter,
for now just keep the cookie size the same.

I also considered SipHash, but that would generate only 8-byte MACs.  An
8-byte MAC *might* suffice here.  However, there's quite a lot of
information in the SCTP cookies: more than in TCP SYN cookies.  So
absent an analysis that occasional forgeries of all that information is
okay in SCTP, I errored on the side of caution.

Remove HMAC-MD5 and HMAC-SHA1 as options, since the new HMAC-SHA256
option is just better.  It's faster as well as more secure.  For
example, benchmarking on x86_64, cookie authentication is now nearly 3x
as fast as the previous default choice and implementation of HMAC-MD5.

Also just make the kernel always support cookie authentication if SCTP
is supported at all, rather than making it optional in the build.  (It
was sort of optional before, but it didn't really work properly.  E.g.,
a kernel with CONFIG_SCTP_COOKIE_HMAC_MD5=n still supported HMAC-MD5
cookie authentication if CONFIG_CRYPTO_HMAC and CONFIG_CRYPTO_MD5
happened to be enabled in the kconfig for other reasons.)

Acked-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Link: https://patch.msgid.link/20250818205426.30222-5-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agosctp: Use HMAC-SHA1 and HMAC-SHA256 library for chunk authentication
Eric Biggers [Mon, 18 Aug 2025 20:54:24 +0000 (13:54 -0700)]
sctp: Use HMAC-SHA1 and HMAC-SHA256 library for chunk authentication

For SCTP chunk authentication, use the HMAC-SHA1 and HMAC-SHA256 library
functions instead of crypto_shash.  This is simpler and faster.  There's
no longer any need to pre-allocate 'crypto_shash' objects; the SCTP code
now simply calls into the HMAC code directly.

As part of this, make SCTP always support both HMAC-SHA1 and
HMAC-SHA256.  Previously, it only guaranteed support for HMAC-SHA1.
However, HMAC-SHA256 tended to be supported too anyway, as it was
supported if CONFIG_CRYPTO_SHA256 was enabled elsewhere in the kconfig.

Acked-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Link: https://patch.msgid.link/20250818205426.30222-4-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agosctp: Fix MAC comparison to be constant-time
Eric Biggers [Mon, 18 Aug 2025 20:54:23 +0000 (13:54 -0700)]
sctp: Fix MAC comparison to be constant-time

To prevent timing attacks, MACs need to be compared in constant time.
Use the appropriate helper function for this.

Fixes: bbd0d59809f9 ("[SCTP]: Implement the receive and verification of AUTH chunk")
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Link: https://patch.msgid.link/20250818205426.30222-3-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agoselftests: net: Explicitly enable CONFIG_CRYPTO_SHA1 for IPsec
Eric Biggers [Mon, 18 Aug 2025 20:54:22 +0000 (13:54 -0700)]
selftests: net: Explicitly enable CONFIG_CRYPTO_SHA1 for IPsec

xfrm_policy.sh, nft_flowtable.sh, and vrf-xfrm-tests.sh use 'ip xfrm'
with SHA-1, either 'auth sha1' or 'auth-trunc hmac(sha1)'.  That
requires CONFIG_CRYPTO_SHA1, which CONFIG_INET_ESP intentionally doesn't
select (as per its help text).  Previously, the config for these tests
relied on CONFIG_CRYPTO_SHA1 being selected by the unrelated option
CONFIG_IP_SCTP.  Since CONFIG_IP_SCTP is being changed to no longer do
that, instead add CONFIG_CRYPTO_SHA1 to the configs explicitly.

Reported-by: Paolo Abeni <pabeni@redhat.com>
Closes: https://lore.kernel.org/r/766e4508-aaba-4cdc-92b4-e116e52ae13b@redhat.com
Suggested-by: Florian Westphal <fw@strlen.de>
Acked-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Link: https://patch.msgid.link/20250818205426.30222-2-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agoMerge branch 'net-memcg-gather-memcg-code-under-config_memcg'
Jakub Kicinski [Wed, 20 Aug 2025 02:21:01 +0000 (19:21 -0700)]
Merge branch 'net-memcg-gather-memcg-code-under-config_memcg'

Kuniyuki Iwashima says:

====================
net-memcg: Gather memcg code under CONFIG_MEMCG.

This series converts most sk->sk_memcg access to helper functions
under CONFIG_MEMCG and finally defines sk_memcg under CONFIG_MEMCG.

This is v5 of the series linked below but without core changes
that decoupled memcg and global socket memory accounting.

I will defer the changes to a follow-up series that will use BPF
to store a flag in sk->sk_memcg.

Overview of the series:

  patch 1 is a trivial fix for MPTCP
  patch 2 ~ 9 move sk->sk_memcg accesses to a single place
  patch 10 moves sk_memcg under CONFIG_MEMCG

v4: https://lore.kernel.org/20250814200912.1040628-1-kuniyu@google.com
v3: https://lore.kernel.org/20250812175848.512446-1-kuniyu@google.com
v2: https://lore.kernel.org/20250811173116.2829786-1-kuniyu@google.com
v1: https://lore.kernel.org/20250721203624.3807041-1-kuniyu@google.com
====================

Link: https://patch.msgid.link/20250815201712.1745332-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: Define sk_memcg under CONFIG_MEMCG.
Kuniyuki Iwashima [Fri, 15 Aug 2025 20:16:18 +0000 (20:16 +0000)]
net: Define sk_memcg under CONFIG_MEMCG.

Except for sk_clone_lock(), all accesses to sk->sk_memcg
is done under CONFIG_MEMCG.

As a bonus, let's define sk->sk_memcg under CONFIG_MEMCG.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Link: https://patch.msgid.link/20250815201712.1745332-11-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet-memcg: Pass struct sock to mem_cgroup_sk_under_memory_pressure().
Kuniyuki Iwashima [Fri, 15 Aug 2025 20:16:17 +0000 (20:16 +0000)]
net-memcg: Pass struct sock to mem_cgroup_sk_under_memory_pressure().

We will store a flag in the lowest bit of sk->sk_memcg.

Then, we cannot pass the raw pointer to mem_cgroup_under_socket_pressure().

Let's pass struct sock to it and rename the function to match other
functions starting with mem_cgroup_sk_.

Note that the helper is moved to sock.h to use mem_cgroup_from_sk().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Link: https://patch.msgid.link/20250815201712.1745332-10-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet-memcg: Pass struct sock to mem_cgroup_sk_(un)?charge().
Kuniyuki Iwashima [Fri, 15 Aug 2025 20:16:16 +0000 (20:16 +0000)]
net-memcg: Pass struct sock to mem_cgroup_sk_(un)?charge().

We will store a flag in the lowest bit of sk->sk_memcg.

Then, we cannot pass the raw pointer to mem_cgroup_charge_skmem()
and mem_cgroup_uncharge_skmem().

Let's pass struct sock to the functions.

While at it, they are renamed to match other functions starting
with mem_cgroup_sk_.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Link: https://patch.msgid.link/20250815201712.1745332-9-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet-memcg: Introduce mem_cgroup_sk_enabled().
Kuniyuki Iwashima [Fri, 15 Aug 2025 20:16:15 +0000 (20:16 +0000)]
net-memcg: Introduce mem_cgroup_sk_enabled().

The socket memcg feature is enabled by a static key and
only works for non-root cgroup.

We check both conditions in many places.

Let's factorise it as a helper function.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Link: https://patch.msgid.link/20250815201712.1745332-8-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet-memcg: Introduce mem_cgroup_from_sk().
Kuniyuki Iwashima [Fri, 15 Aug 2025 20:16:14 +0000 (20:16 +0000)]
net-memcg: Introduce mem_cgroup_from_sk().

We will store a flag in the lowest bit of sk->sk_memcg.

Then, directly dereferencing sk->sk_memcg will be illegal, and we
do not want to allow touching the raw sk->sk_memcg in many places.

Let's introduce mem_cgroup_from_sk().

Other places accessing the raw sk->sk_memcg will be converted later.

Note that we cannot define the helper as an inline function in
memcontrol.h as we cannot access any fields of struct sock there
due to circular dependency, so it is placed in sock.h.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Link: https://patch.msgid.link/20250815201712.1745332-7-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: Clean up __sk_mem_raise_allocated().
Kuniyuki Iwashima [Fri, 15 Aug 2025 20:16:13 +0000 (20:16 +0000)]
net: Clean up __sk_mem_raise_allocated().

In __sk_mem_raise_allocated(), charged is initialised as true due
to the weird condition removed in the previous patch.

It makes the variable unreliable by itself, so we have to check
another variable, memcg, in advance.

Also, we will factorise the common check below for memcg later.

    if (mem_cgroup_sockets_enabled && sk->sk_memcg)

As a prep, let's initialise charged as false and memcg as NULL.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Link: https://patch.msgid.link/20250815201712.1745332-6-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: Call trace_sock_exceed_buf_limit() for memcg failure with SK_MEM_RECV.
Kuniyuki Iwashima [Fri, 15 Aug 2025 20:16:12 +0000 (20:16 +0000)]
net: Call trace_sock_exceed_buf_limit() for memcg failure with SK_MEM_RECV.

Initially, trace_sock_exceed_buf_limit() was invoked when
__sk_mem_raise_allocated() failed due to the memcg limit or the
global limit.

However, commit d6f19938eb031 ("net: expose sk wmem in
sock_exceed_buf_limit tracepoint") somehow suppressed the event
only when memcg failed to charge for SK_MEM_RECV, although the
memcg failure for SK_MEM_SEND still triggers the event.

Let's restore the event for SK_MEM_RECV.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Link: https://patch.msgid.link/20250815201712.1745332-5-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agotcp: Simplify error path in inet_csk_accept().
Kuniyuki Iwashima [Fri, 15 Aug 2025 20:16:11 +0000 (20:16 +0000)]
tcp: Simplify error path in inet_csk_accept().

When an error occurs in inet_csk_accept(), what we should do is
only call release_sock() and set the errno to arg->err.

But the path jumps to another label, which introduces unnecessary
initialisation and tests for newsk.

Let's simplify the error path and remove the redundant NULL
checks for newsk.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Link: https://patch.msgid.link/20250815201712.1745332-4-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agomptcp: Use tcp_under_memory_pressure() in mptcp_epollin_ready().
Kuniyuki Iwashima [Fri, 15 Aug 2025 20:16:10 +0000 (20:16 +0000)]
mptcp: Use tcp_under_memory_pressure() in mptcp_epollin_ready().

Some conditions used in mptcp_epollin_ready() are the same as
tcp_under_memory_pressure().

We will modify tcp_under_memory_pressure() in the later patch.

Let's use tcp_under_memory_pressure() instead.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Link: https://patch.msgid.link/20250815201712.1745332-3-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agomptcp: Fix up subflow's memcg when CONFIG_SOCK_CGROUP_DATA=n.
Kuniyuki Iwashima [Fri, 15 Aug 2025 20:16:09 +0000 (20:16 +0000)]
mptcp: Fix up subflow's memcg when CONFIG_SOCK_CGROUP_DATA=n.

When sk_alloc() allocates a socket, mem_cgroup_sk_alloc() sets
sk->sk_memcg based on the current task.

MPTCP subflow socket creation is triggered from userspace or
an in-kernel worker.

In the latter case, sk->sk_memcg is not what we want.  So, we fix
it up from the parent socket's sk->sk_memcg in mptcp_attach_cgroup().

Although the code is placed under #ifdef CONFIG_MEMCG, it is buried
under #ifdef CONFIG_SOCK_CGROUP_DATA.

The two configs are orthogonal.  If CONFIG_MEMCG is enabled without
CONFIG_SOCK_CGROUP_DATA, the subflow's memory usage is not charged
correctly.

Let's move the code out of the wrong ifdef guard.

Note that sk->sk_memcg is freed in sk_prot_free() and the parent
sk holds the refcnt of memcg->css here, so we don't need to use
css_tryget().

Fixes: 3764b0c5651e3 ("mptcp: attach subflow socket to parent cgroup")
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Link: https://patch.msgid.link/20250815201712.1745332-2-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agoMerge branch 'stmmac-stop-silently-dropping-bad-checksum-packets'
Jakub Kicinski [Wed, 20 Aug 2025 01:33:09 +0000 (18:33 -0700)]
Merge branch 'stmmac-stop-silently-dropping-bad-checksum-packets'

Oleksij Rempel says:

====================
stmmac: stop silently dropping bad checksum packets

this series reworks how stmmac handles receive checksum offload
(CoE) errors on dwmac4.

At present, when CoE is enabled, the hardware silently discards any
frame that fails checksum validation. These packets never reach the
driver and are not accounted in the generic drop statistics. They are
only visible in the stmmac-specific counters as "payload error" or
"header error" packets, which makes it harder to debug or monitor
network issues.

Following discussion [1], the driver is reworked to propagate checksum
error information up to the stack. With these changes, CoE stays
enabled, but frames that fail hardware validation are no longer dropped
in hardware. Instead, the driver marks them with CHECKSUM_NONE so the
network stack can validate, drop, and properly account them in the
standard drop statistics.

[1] https://lore.kernel.org/all/20250625132117.1b3264e8@kernel.org/
====================

Link: https://patch.msgid.link/20250818090217.2789521-1-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: stmmac: dwmac4: stop hardware from dropping checksum-error packets
Oleksij Rempel [Mon, 18 Aug 2025 09:02:17 +0000 (11:02 +0200)]
net: stmmac: dwmac4: stop hardware from dropping checksum-error packets

Tell the MAC not to discard frames that fail TCP/IP checksum
validation.

By default, when the hardware checksum engine (CoE) is enabled,
dwmac4 silently drops any packet where the offload engine detects
a checksum error. These frames are not reported to the driver and
are not counted in any statistics as dropped packets.

Set the MTL_OP_MODE_DIS_TCP_EF bit when initializing the Rx channel so
that all packets are delivered, even if they failed hardware checksum
validation. CoE remains enabled, but instead of dropping such frames,
the driver propagates the error status and marks the skb with
CHECKSUM_NONE. This allows the stack to verify and drop the packet
while updating statistics.

This change follows the decision made in the discussion:
Link: https://lore.kernel.org/all/20250625132117.1b3264e8@kernel.org/
It depends on the previous patches that added proper error propagation
in the Rx path.

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://patch.msgid.link/20250818090217.2789521-4-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: stmmac: dwmac4: report Rx checksum errors in status
Oleksij Rempel [Mon, 18 Aug 2025 09:02:16 +0000 (11:02 +0200)]
net: stmmac: dwmac4: report Rx checksum errors in status

Propagate hardware checksum failures from the descriptor parser to the
caller.

Currently, dwmac4_wrback_get_rx_status() updates stats when the Rx
descriptor signals an IP header or payload checksum error, but it does
not reflect this in its return value. The higher-level stmmac_rx() code
therefore cannot tell that hardware checksum validation failed.

Set the csum_none flag in the returned status when either
RDES1_IP_HDR_ERROR or RDES1_IP_PAYLOAD_ERROR is present. This aligns
dwmac4 with enh_desc_coe_rdes0() and lets stmmac_rx() mark the skb as
CHECKSUM_NONE for software verification.

This is a preparatory step for disabling the hardware filter that drops
frames which do not pass checksum validation.

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://patch.msgid.link/20250818090217.2789521-3-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: stmmac: Correctly handle Rx checksum offload errors
Oleksij Rempel [Mon, 18 Aug 2025 09:02:15 +0000 (11:02 +0200)]
net: stmmac: Correctly handle Rx checksum offload errors

The stmmac_rx function would previously set skb->ip_summed to
CHECKSUM_UNNECESSARY if hardware checksum offload (CoE) was enabled
and the packet was of a known IP ethertype.

However, this logic failed to check if the hardware had actually
reported a checksum error. The hardware status, indicating a header or
payload checksum failure, was being ignored at this stage. This could
cause corrupt packets to be passed up the network stack as valid.

This patch corrects the logic by checking the `csum_none` status flag,
which is set when the hardware reports a checksum error. If this flag
is set, skb->ip_summed is now correctly set to CHECKSUM_NONE,
ensuring the kernel's network stack will perform its own validation and
properly handle the corrupt packet.

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://patch.msgid.link/20250818090217.2789521-2-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agoMerge branch 'there-are-a-cleancode-and-a-parameter-check-for-hns3-driver'
Jakub Kicinski [Wed, 20 Aug 2025 01:11:18 +0000 (18:11 -0700)]
Merge branch 'there-are-a-cleancode-and-a-parameter-check-for-hns3-driver'

Jijie Shao says:

====================
There are a cleancode and a parameter check for hns3 driver

This patchset includes:
 1. a parameter check omitted from fix code in net branch
   https://lore.kernel.org/all/20250723072900.GV2459@horms.kernel.org/
 2. a small clean code
====================

Link: https://patch.msgid.link/20250815100414.949752-1-shaojijie@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: hns3: change the function return type from int to bool
Jijie Shao [Fri, 15 Aug 2025 10:04:14 +0000 (18:04 +0800)]
net: hns3: change the function return type from int to bool

hclge_only_alloc_priv_buff() only return true or false,
So, change the function return type from integer to boolean.

Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Link: https://patch.msgid.link/20250815100414.949752-3-shaojijie@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: hns3: add parameter check for tx_copybreak and tx_spare_buf_size
Jijie Shao [Fri, 15 Aug 2025 10:04:13 +0000 (18:04 +0800)]
net: hns3: add parameter check for tx_copybreak and tx_spare_buf_size

Since the driver always enables tx bounce buffer,
there are minimum values for `copybreak` and `tx_spare_buf_size`.

This patch will check and reject configurations
with values smaller than these minimums.

Closes: https://lore.kernel.org/all/20250723072900.GV2459@horms.kernel.org/
Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Link: https://patch.msgid.link/20250815100414.949752-2-shaojijie@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: phy: realtek: enable serdes option mode for RTL8226-CG
Markus Stockhausen [Fri, 15 Aug 2025 08:20:09 +0000 (04:20 -0400)]
net: phy: realtek: enable serdes option mode for RTL8226-CG

The RTL8226-CG can make use of the serdes option mode feature to
dynamically switch between SGMII and 2500base-X. From what is
known the setup sequence is much simpler with no magic values.

Convert the exiting config_init() into a helper that configures
the PHY depending on generation 1 or 2. Call the helper from two
separated new config_init() functions.

Finally convert the phy_driver specs of the RTL8226-CG to make
use of the new configuration and switch over to the extended
read_status() function to dynamically change the interface
according to the serdes mode.

Remark! The logic could be simpler if the serdes mode could be
set before all other generation 2 magic values. Due to missing
RTL8221B test hardware the mmd command order was kept.

Tested on Zyxel XGS1210-12.

Signed-off-by: Markus Stockhausen <markus.stockhausen@gmx.de>
Link: https://patch.msgid.link/20250815082009.3678865-1-markus.stockhausen@gmx.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agoipv6: ip6_gre: replace strcpy with strscpy for tunnel name
Miguel García [Mon, 18 Aug 2025 22:02:03 +0000 (00:02 +0200)]
ipv6: ip6_gre: replace strcpy with strscpy for tunnel name

Replace the strcpy() call that copies the device name into
tunnel->parms.name with strscpy(), to avoid potential overflow
and guarantee NULL termination. This uses the two-argument
form of strscpy(), where the destination size is inferred
from the array type.

Destination is tunnel->parms.name (size IFNAMSIZ).

Tested in QEMU (Alpine rootfs):
 - Created IPv6 GRE tunnels over loopback
 - Assigned overlay IPv6 addresses
 - Verified bidirectional ping through the tunnel
 - Changed tunnel parameters at runtime (`ip -6 tunnel change`)

Signed-off-by: Miguel García <miguelgarciaroman8@gmail.com>
Link: https://patch.msgid.link/20250818220203.899338-1-miguelgarciaroman8@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agoMerge branch 'net-convert-to-skb_dstref_steal-and-skb_dstref_restore'
Jakub Kicinski [Wed, 20 Aug 2025 00:53:30 +0000 (17:53 -0700)]
Merge branch 'net-convert-to-skb_dstref_steal-and-skb_dstref_restore'

Stanislav Fomichev says:

====================
net: Convert to skb_dstref_steal and skb_dstref_restore

To diagnose and prevent issues similar to [0], emit warning
(CONFIG_DEBUG_NET) from skb_dst_set and skb_dst_set_noref when
overwriting non-null reference-counted entry. Two new helpers
are added to handle special cases where the entry needs to be
reset and restored: skb_dstref_steal/skb_dstref_restore. The bulk of
the patches in the series converts manual _skb_refst manipulations
to these new helpers.

0: https://lore.kernel.org/netdev/20250723224625.1340224-1-sdf@fomichev.me/T/#u
====================

Link: https://patch.msgid.link/20250818154032.3173645-1-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: Add skb_dst_check_unset
Stanislav Fomichev [Mon, 18 Aug 2025 15:40:32 +0000 (08:40 -0700)]
net: Add skb_dst_check_unset

To prevent dst_entry leaks, add warning when the non-NULL dst_entry
is rewritten.

Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250818154032.3173645-8-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agochtls: Convert to skb_dst_reset
Stanislav Fomichev [Mon, 18 Aug 2025 15:40:31 +0000 (08:40 -0700)]
chtls: Convert to skb_dst_reset

Going forward skb_dst_set will assert that skb dst_entry
is empty during skb_dst_set. skb_dstref_steal is added to reset
existing entry without doing refcnt. Chelsio driver is
doing extra dst management via skb_dst_set(NULL). Replace
these calls with skb_dstref_steal.

Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250818154032.3173645-7-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agostaging: octeon: Convert to skb_dst_drop
Stanislav Fomichev [Mon, 18 Aug 2025 15:40:30 +0000 (08:40 -0700)]
staging: octeon: Convert to skb_dst_drop

Instead of doing dst_release and skb_dst_set, do skb_dst_drop which
should do the right thing.

Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250818154032.3173645-6-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: Switch to skb_dstref_steal/skb_dstref_restore for ip_route_input callers
Stanislav Fomichev [Mon, 18 Aug 2025 15:40:29 +0000 (08:40 -0700)]
net: Switch to skb_dstref_steal/skb_dstref_restore for ip_route_input callers

Going forward skb_dst_set will assert that skb dst_entry
is empty during skb_dst_set. skb_dstref_steal is added to reset
existing entry without doing refcnt. skb_dstref_restore should
be used to restore the previous entry. Convert icmp_route_lookup
and ip_options_rcv_srr to these helpers. Add extra call to
skb_dstref_reset to icmp_route_lookup to clear the ip_route_input
entry.

Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250818154032.3173645-5-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonetfilter: Switch to skb_dstref_steal to clear dst_entry
Stanislav Fomichev [Mon, 18 Aug 2025 15:40:28 +0000 (08:40 -0700)]
netfilter: Switch to skb_dstref_steal to clear dst_entry

Going forward skb_dst_set will assert that skb dst_entry
is empty during skb_dst_set. skb_dstref_steal is added to reset
existing entry without doing refcnt. Switch to skb_dstref_steal
in ip[6]_route_me_harder and add a comment on why it's safe
to skip skb_dstref_restore.

Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250818154032.3173645-4-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agoxfrm: Switch to skb_dstref_steal to clear dst_entry
Stanislav Fomichev [Mon, 18 Aug 2025 15:40:27 +0000 (08:40 -0700)]
xfrm: Switch to skb_dstref_steal to clear dst_entry

Going forward skb_dst_set will assert that skb dst_entry
is empty during skb_dst_set. skb_dstref_steal is added to reset
existing entry without doing refcnt. Switch to skb_dstref_steal
in __xfrm_route_forward and add a comment on why it's safe
to skip skb_dstref_restore.

Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250818154032.3173645-3-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: Add skb_dstref_steal and skb_dstref_restore
Stanislav Fomichev [Mon, 18 Aug 2025 15:40:26 +0000 (08:40 -0700)]
net: Add skb_dstref_steal and skb_dstref_restore

Going forward skb_dst_set will assert that skb dst_entry
is empty during skb_dst_set to prevent potential leaks. There
are few places that still manually manage dst_entry not using
the helpers. Convert them to the following new helpers:
- skb_dstref_steal that resets dst_entry and returns previous dst_entry
  value
- skb_dstref_restore that restores dst_entry previously reset via
  skb_dstref_steal

Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250818154032.3173645-2-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agoMerge branch 'net-speedup-some-nexthop-handling-when-having-a-lot-of-nexthops'
Jakub Kicinski [Wed, 20 Aug 2025 00:50:35 +0000 (17:50 -0700)]
Merge branch 'net-speedup-some-nexthop-handling-when-having-a-lot-of-nexthops'

Christoph Paasch says:

====================
net: Speedup some nexthop handling when having A LOT of nexthops

Configuring a very large number of nexthops is fairly possible within
a reasonable time-frame. But, certain netlink commands can become
extremely slow.

This series addresses some of these, namely dumping and removing
nexthops.

v1: https://lore.kernel.org/20250724-nexthop_dump-v1-1-6b43fffd5bac@openai.com
====================

Link: https://patch.msgid.link/20250816-nexthop_dump-v2-0-491da3462118@openai.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: When removing nexthops, don't call synchronize_net if it is not necessary
Christoph Paasch [Sat, 16 Aug 2025 23:12:49 +0000 (16:12 -0700)]
net: When removing nexthops, don't call synchronize_net if it is not necessary

When removing a nexthop, commit
90f33bffa382 ("nexthops: don't modify published nexthop groups") added a
call to synchronize_rcu() (later changed to _net()) to make sure
everyone sees the new nexthop-group before the rtnl-lock is released.

When one wants to delete a large number of groups and nexthops, it is
fastest to first flush the groups (ip nexthop flush groups) and then
flush the nexthops themselves (ip -6 nexthop flush). As that way the
groups don't need to be rebalanced.

However, `ip -6 nexthop flush` will still take a long time if there is
a very large number of nexthops because of the call to
synchronize_net(). Now, if there are no more groups, there is no point
in calling synchronize_net(). So, let's skip that entirely by checking
if nh->grp_list is empty.

This gives us a nice speedup:

BEFORE:
=======

$ time sudo ip -6 nexthop flush
Dump was interrupted and may be inconsistent.
Flushed 2097152 nexthops

real 1m45.345s
user 0m0.001s
sys 0m0.005s

$ time sudo ip -6 nexthop flush
Dump was interrupted and may be inconsistent.
Flushed 4194304 nexthops

real 3m10.430s
user 0m0.002s
sys 0m0.004s

AFTER:
======

$ time sudo ip -6 nexthop flush
Dump was interrupted and may be inconsistent.
Flushed 2097152 nexthops

real 0m17.545s
user 0m0.003s
sys 0m0.003s

$ time sudo ip -6 nexthop flush
Dump was interrupted and may be inconsistent.
Flushed 4194304 nexthops

real 0m35.823s
user 0m0.002s
sys 0m0.004s

Signed-off-by: Christoph Paasch <cpaasch@openai.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250816-nexthop_dump-v2-2-491da3462118@openai.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: Make nexthop-dumps scale linearly with the number of nexthops
Christoph Paasch [Sat, 16 Aug 2025 23:12:48 +0000 (16:12 -0700)]
net: Make nexthop-dumps scale linearly with the number of nexthops

When we have a (very) large number of nexthops, they do not fit within a
single message. rtm_dump_walk_nexthops() thus will be called repeatedly
and ctx->idx is used to avoid dumping the same nexthops again.

The approach in which we avoid dumping the same nexthops is by basically
walking the entire nexthop rb-tree from the left-most node until we find
a node whose id is >= s_idx. That does not scale well.

Instead of this inefficient approach, rather go directly through the
tree to the nexthop that should be dumped (the one whose nh_id >=
s_idx). This allows us to find the relevant node in O(log(n)).

We have quite a nice improvement with this:

Before:
=======

--> ~1M nexthops:
$ time ~/libnl/src/nl-nh-list | wc -l
1050624

real 0m21.080s
user 0m0.666s
sys 0m20.384s

--> ~2M nexthops:
$ time ~/libnl/src/nl-nh-list | wc -l
2101248

real 1m51.649s
user 0m1.540s
sys 1m49.908s

After:
======

--> ~1M nexthops:
$ time ~/libnl/src/nl-nh-list | wc -l
1050624

real 0m1.157s
user 0m0.926s
sys 0m0.259s

--> ~2M nexthops:
$ time ~/libnl/src/nl-nh-list | wc -l
2101248

real 0m2.763s
user 0m2.042s
sys 0m0.776s

Signed-off-by: Christoph Paasch <cpaasch@openai.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250816-nexthop_dump-v2-1-491da3462118@openai.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agoselftests: drv-net: ncdevmem: make configure_channels() support combined channels
Jakub Kicinski [Fri, 15 Aug 2025 23:15:13 +0000 (16:15 -0700)]
selftests: drv-net: ncdevmem: make configure_channels() support combined channels

ncdevmem tests that the kernel correctly rejects attempts
to deactivate queues with MPs bound.

Make the configure_channels() test support combined channels.
Currently it tries to set the queue counts to rx N tx N-1,
which only makes sense for devices which have IRQs per ring
type. Most modern devices used combined IRQs/channels with
both Rx and Tx queues. Since the math is total Rx == combined+Rx
setting Rx when combined is non-zero will be increasing the total
queue count, not decreasing as the test intends.

Note that the test would previously also try to set the Tx
ring count to Rx - 1, for some reason. Which would be 0
if the device has only 2 queues configured.

With this change (device with 2 queues):
  setting channel count rx:1 tx:1
  YNL set channels: Kernel error: 'requested channel counts are too low for existing memory provider setting (2)'

Reviewed-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20250815231513.381652-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agoselftests: drv-net: tso: increase the retransmit threshold
Jakub Kicinski [Fri, 15 Aug 2025 22:41:00 +0000 (15:41 -0700)]
selftests: drv-net: tso: increase the retransmit threshold

We see quite a few flakes during the TSO test against virtualized
devices in NIPA. There's often 10-30 retransmissions during the
test. Sometimes as many as 100. Set the retransmission threshold
at 1/4th of the wire frame target.

Link: https://patch.msgid.link/20250815224100.363438-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: ethernet: stmmac: dwmac-rk: Make the clk_phy could be used for external phy
Chaoyi Chen [Fri, 15 Aug 2025 02:35:15 +0000 (10:35 +0800)]
net: ethernet: stmmac: dwmac-rk: Make the clk_phy could be used for external phy

For external phy, clk_phy should be optional, and some external phy
need the clock input from clk_phy. This patch adds support for setting
clk_phy for external phy.

Signed-off-by: David Wu <david.wu@rock-chips.com>
Signed-off-by: Chaoyi Chen <chaoyi.chen@rock-chips.com>
Link: https://patch.msgid.link/20250815023515.114-1-kernel@airkyi.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2 months agoselftests: drv-net: test the napi init state
Jakub Kicinski [Fri, 15 Aug 2025 01:33:14 +0000 (18:33 -0700)]
selftests: drv-net: test the napi init state

Test that threaded state (in the persistent NAPI config) gets updated
even when NAPI with given ID is not allocated at the time.

This test is validating commit ccba9f6baa90 ("net: update NAPI threaded
config even for disabled NAPIs").

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Joe Damato <joe@dama.to>
Link: https://patch.msgid.link/20250815013314.2237512-1-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2 months agonet: mana: Use page pool fragments for RX buffers instead of full pages to improve...
Dipayaan Roy [Thu, 14 Aug 2025 14:04:10 +0000 (07:04 -0700)]
net: mana: Use page pool fragments for RX buffers instead of full pages to improve memory efficiency.

This patch enhances RX buffer handling in the mana driver by allocating
pages from a page pool and slicing them into MTU-sized fragments, rather
than dedicating a full page per packet. This approach is especially
beneficial on systems with large base page sizes like 64KB.

Key improvements:

- Proper integration of page pool for RX buffer allocations.
- MTU-sized buffer slicing to improve memory utilization.
- Reduce overall per Rx queue memory footprint.
- Automatic fallback to full-page buffers when:
   * Jumbo frames are enabled (MTU > PAGE_SIZE / 2).
   * The XDP path is active, to avoid complexities with fragment reuse.

Testing on VMs with 64KB pages shows around 200% throughput improvement.
Memory efficiency is significantly improved due to reduced wastage in page
allocations. Example: We are now able to fit 35 rx buffers in a single 64kb
page for MTU size of 1500, instead of 1 rx buffer per page previously.

Tested:

- iperf3, iperf2, and nttcp benchmarks.
- Jumbo frames with MTU 9000.
- Native XDP programs (XDP_PASS, XDP_DROP, XDP_TX, XDP_REDIRECT) for
  testing the XDP path in driver.
- Memory leak detection (kmemleak).
- Driver load/unload, reboot, and stress scenarios.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Saurabh Sengar <ssengar@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
Link: https://patch.msgid.link/20250814140410.GA22089@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2 months agonet: airoha: Add wlan flowtable TX offload
Lorenzo Bianconi [Thu, 14 Aug 2025 07:51:16 +0000 (09:51 +0200)]
net: airoha: Add wlan flowtable TX offload

Introduce support to offload the traffic received on the ethernet NIC
and forwarded to the wireless one using HW Packet Processor Engine (PPE)
capabilities.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/20250814-airoha-en7581-wlan-tx-offload-v1-1-72e0a312003e@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2 months agoMerge branch 'net-macb-add-taprio-traffic-scheduling-support'
Paolo Abeni [Tue, 19 Aug 2025 10:13:05 +0000 (12:13 +0200)]
Merge branch 'net-macb-add-taprio-traffic-scheduling-support'

Vineeth Karumanchi says:

====================
net: macb: Add TAPRIO traffic scheduling support

Implement Time-Aware Traffic Scheduling (TAPRIO) offload support
for Cadence MACB/GEM ethernet controllers to enable IEEE 802.1Qbv
compliant time-sensitive networking (TSN) capabilities.

Key features implemented:
- Complete TAPRIO qdisc offload infrastructure with TC_SETUP_QDISC_TAPRIO
- Hardware-accelerated time-based gate control for multiple queues
- Enhanced Scheduled Traffic (ENST) register configuration and management
- Gate state scheduling with configurable start times, on/off intervals
- Support for cycle-time based traffic scheduling with validation
- Hardware capability detection via MACB_CAPS_QBV flag
- Robust error handling and parameter validation
- Queue-specific timing register programming
  (ENST_START_TIME, ENST_ON_TIME, ENST_OFF_TIME)

Changes include:
- Add enst_ns_to_hw_units(): Converts nanoseconds to hardware units
- Add enst_max_hw_interval(): Returns max interval for given speed
- Add macb_taprio_setup_replace() for TAPRIO configuration
- Add macb_taprio_destroy() for cleanup and reset
- Add macb_setup_tc() as TC offload entry point
- Enable NETIF_F_HW_TC feature for QBV-capable hardware
- Add ENST register offsets to queue configuration

The implementation validates timing constraints against hardware limits,
supports per-queue gate mask configuration, and provides comprehensive
logging for debugging and monitoring. Hardware registers are programmed
atomically with proper locking to ensure consistent state.

Tested on Xilinx Versal platforms with QBV-capable MACB controllers.

Signed-off-by: Vineeth Karumanchi <vineeth.karumanchi@amd.com>
====================

Link: https://patch.msgid.link/20250814071058.3062453-1-vineeth.karumanchi@amd.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2 months agonet: macb: Add capability-based QBV detection and Versal support
Vineeth Karumanchi [Thu, 14 Aug 2025 07:10:58 +0000 (12:40 +0530)]
net: macb: Add capability-based QBV detection and Versal support

The 'exclude_qbv' bit in the designcfg_debug1 register varies across
MACB/GEM IP revisions, making direct probing unreliable for detecting
QBV support. This patch introduces a capability-based approach for
consistent QBV feature identification across the IP family.

Platform support updates:
- Establish foundation for QBV detection in TAPRIO implementation
- Enable MACB_CAPS_QBV for Xilinx Versal platform configuration
- Fix capability line wrapping, ensuring code stays within 80 columns

Signed-off-by: Vineeth Karumanchi <vineeth.karumanchi@amd.com>
Link: https://patch.msgid.link/20250814071058.3062453-3-vineeth.karumanchi@amd.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2 months agonet: macb: Add TAPRIO traffic scheduling support
Vineeth Karumanchi [Thu, 14 Aug 2025 07:10:57 +0000 (12:40 +0530)]
net: macb: Add TAPRIO traffic scheduling support

Implement Time-Aware Traffic Scheduling (TAPRIO) offload support
for Cadence MACB/GEM ethernet controllers to enable IEEE 802.1Qbv
compliant time-sensitive networking (TSN) capabilities.

Key Features:

1. Enhanced Scheduled Traffic (ENST) Register Management
   - Per-queue ENST registers: ENST_START_TIME, ENST_ON_TIME, ENST_OFF_TIME
   - Centralized control via ENST_CONTROL for gate enable/disable
   - Infrastructure enhancements:
     * Extended macb_queue structure with ENST timing control registers
     * Mapped ENST register offsets into queue management framework
     * Introduced macb_queue_enst_config for per-entry TC configuration
   - Timing conversion utility:
     * enst_ns_to_hw_units(): Converts nanoseconds to hardware units
     * Timing values are programmed as hardware units based on link speed
     * Conversion formula: time_bytes = time_ns / divisor
     * Speed-specific divisors: 1Gbps=8, 100Mbps=80, 10Mbps=800
   - Hardware limit utility:
     * enst_max_hw_interval(): Returns max interval for given speed

2. TAPRIO Configuration via "tc qdisc replace"
   - macb_taprio_setup_replace(): Configures TAPRIO hardware offload
   - Parameter validation checks performed:
     * TC entry limit validation against available hardware queues
     * Base time non-negativity enforcement
     * Speed-adaptive timing constraint verification
     * Cycle time vs. total gate time consistency checks
     * Single-queue gate mask enforcement per scheduling entry
   - Programming sequence:
     * GEM doesn't support changing ENST registers if ENST is enabled,
       hence disable ENST before programming
     * Atomic timing register configuration (START_TIME, ON_TIME, OFF_TIME)
     * Enable queues via ENST_CONTROL

3. TAPRIO Cleanup via "tc qdisc destroy"
   - macb_taprio_destroy(): Safely removes TAPRIO configuration
   - Restores default queue behavior
   - Cleanup steps:
     * Reset TC state
     * Disable ENST
     * Clear timing registers
     * Ensure atomic updates with locking

4. Traffic Control Offload Infrastructure
   - macb_setup_taprio(): TAPRIO command dispatcher
     * Verifies hardware support
     * Handles runtime suspend state
   - macb_setup_tc(): TC_SETUP_QDISC_TAPRIO entry point
   - Supports REPLACE and DESTROY operations

Tested on Xilinx Versal platforms with QBV-capable MACB controllers.

Signed-off-by: Vineeth Karumanchi <vineeth.karumanchi@amd.com>
Link: https://patch.msgid.link/20250814071058.3062453-2-vineeth.karumanchi@amd.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2 months agoMerge branch 'eth-fbnic-add-xdp-support-for-fbnic'
Paolo Abeni [Tue, 19 Aug 2025 08:51:20 +0000 (10:51 +0200)]
Merge branch 'eth-fbnic-add-xdp-support-for-fbnic'

Mohsin Bashir says:

====================
eth: fbnic: Add XDP support for fbnic

This patch series introduces basic XDP support for fbnic. To enable this,
it also includes preparatory changes such as making the HDS threshold
configurable via ethtool, updating headroom for fbnic, tracking
frag state in shinfo, and prefetching the first cacheline of data.

V3: https://lore.kernel.org/netdev/20250812220150.161848-1-mohsin.bashr@gmail.com/
V2: https://lore.kernel.org/netdev/20250811211338.857992-1-mohsin.bashr@gmail.com/
V1: https://lore.kernel.org/netdev/20250723145926.4120434-1-mohsin.bashr@gmail.com/
====================

Link: https://patch.msgid.link/20250813221319.3367670-1-mohsin.bashr@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2 months agoeth: fbnic: Report XDP stats via ethtool
Mohsin Bashir [Wed, 13 Aug 2025 22:13:19 +0000 (15:13 -0700)]
eth: fbnic: Report XDP stats via ethtool

Add support to collect XDP stats via ethtool API. We record
packets and bytes sent, and packets dropped on the XDP_TX path.

ethtool -S eth0 | grep xdp | grep -v "0"
     xdp_tx_queue_13_packets: 2
     xdp_tx_queue_13_bytes: 16126

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Mohsin Bashir <mohsin.bashr@gmail.com>
Link: https://patch.msgid.link/20250813221319.3367670-10-mohsin.bashr@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2 months agoeth: fbnic: Collect packet statistics for XDP
Mohsin Bashir [Wed, 13 Aug 2025 22:13:18 +0000 (15:13 -0700)]
eth: fbnic: Collect packet statistics for XDP

Add support for XDP statistics collection and reporting via rtnl_link
and netdev_queue API.

For XDP programs without frags support, fbnic requires MTU to be less
than the HDS threshold. If an over-sized frame is received, the frame
is dropped and recorded as rx_length_errors reported via ip stats to
highlight that this is an error.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Mohsin Bashir <mohsin.bashr@gmail.com>
Link: https://patch.msgid.link/20250813221319.3367670-9-mohsin.bashr@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2 months agoeth: fbnic: Add support for XDP_TX action
Mohsin Bashir [Wed, 13 Aug 2025 22:13:17 +0000 (15:13 -0700)]
eth: fbnic: Add support for XDP_TX action

Add support for XDP_TX action and cleaning the associated work.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Mohsin Bashir <mohsin.bashr@gmail.com>
Link: https://patch.msgid.link/20250813221319.3367670-8-mohsin.bashr@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2 months agoeth: fbnic: Add support for XDP queues
Mohsin Bashir [Wed, 13 Aug 2025 22:13:16 +0000 (15:13 -0700)]
eth: fbnic: Add support for XDP queues

Add support for allocating XDP_TX queues and configuring ring support.
FBNIC has been designed with XDP support in mind. Each Tx queue has 2
submission queues and one completion queue, with the expectation that
one of the submission queues will be used by the stack, and the other
by XDP. XDP queues are populated by XDP_TX and start from index 128
in the TX queue array.
The support for XDP_TX is added in the next patch.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Mohsin Bashir <mohsin.bashr@gmail.com>
Link: https://patch.msgid.link/20250813221319.3367670-7-mohsin.bashr@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2 months agoeth: fbnic: Add XDP pass, drop, abort support
Mohsin Bashir [Wed, 13 Aug 2025 22:13:15 +0000 (15:13 -0700)]
eth: fbnic: Add XDP pass, drop, abort support

Add basic support for attaching an XDP program to the device and support
for PASS/DROP/ABORT actions. In fbnic, buffers are always mapped as
DMA_BIDIRECTIONAL.

The BPF program pointer can be read either on a per-packet basis or on a
per-NAPI poll basis. Both approaches are functionally equivalent, in the
current code. Stick to per-packet as it limits number of arguments we need
to pass around.

On the XDP hot path, check that packets with fragments are only allowed
when multi-buffer support is enabled for the XDP program. Ideally, this
check should not be necessary because ndo_bpf verifies that for XDP
programs without multi-buff support, MTU is less than the hds_thresh.
However, the MTU currently does not enforce the receive size which would
require cleaning up the data path and bouncing the link. For practical
reasons, prioritize the ability to enter and exit BPF mode with different
MTU sizes without requiring a full reconfig.

Testing:

Hook a simple XDP program that passes all the packets destined for a
specific port

iperf3 -c 192.168.1.10 -P 5 -p 12345
Connecting to host 192.168.1.10, port 12345
[  5] local 192.168.1.9 port 46702 connected to 192.168.1.10 port 12345
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
- - - - - - - - - - - - - - - - - - - - - - - - -
[SUM]   1.00-2.00   sec  3.86 GBytes  33.2 Gbits/sec    0

XDP_DROP:
Hook an XDP program that drops packets destined for a specific port

 iperf3 -c 192.168.1.10 -P 5 -p 12345
^C- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[SUM]   0.00-0.00   sec  0.00 Bytes  0.00 bits/sec    0       sender
[SUM]   0.00-0.00   sec  0.00 Bytes  0.00 bits/sec            receiver
iperf3: interrupt - the client has terminated

XDP with HDS:

- Validate XDP attachment failure when HDS is low
   ~] ethtool -G eth0 hds-thresh 512
   ~] sudo ip link set eth0 xdpdrv obj xdp_pass_12345.o sec xdp
   ~] Error: fbnic: MTU too high, or HDS threshold is too low for single
      buffer XDP.

- Validate successful XDP attachment when HDS threshold is appropriate
  ~] ethtool -G eth0 hds-thresh 1536
  ~] sudo ip link set eth0 xdpdrv obj xdp_pass_12345.o sec xdp

- Validate when the XDP program is attached, changing HDS thresh to a
  lower value fails
  ~] ethtool -G eth0 hds-thresh 512
  ~] netlink error: fbnic: Use higher HDS threshold or multi-buf capable
     program

- Validate HDS thresh does not matter when xdp frags support is
  available
  ~] ethtool -G eth0 hds-thresh 512
  ~] sudo ip link set eth0 xdpdrv obj xdp_pass_mb_12345.o sec xdp.frags

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Mohsin Bashir <mohsin.bashr@gmail.com>
Link: https://patch.msgid.link/20250813221319.3367670-6-mohsin.bashr@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2 months agoeth: fbnic: Prefetch packet headers on Rx
Mohsin Bashir [Wed, 13 Aug 2025 22:13:14 +0000 (15:13 -0700)]
eth: fbnic: Prefetch packet headers on Rx

Issue a prefetch for the start of the buffer on Rx to try to avoid cache
miss on packet headers.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Mohsin Bashir <mohsin.bashr@gmail.com>
Link: https://patch.msgid.link/20250813221319.3367670-5-mohsin.bashr@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2 months agoeth: fbnic: Use shinfo to track frags state on Rx
Mohsin Bashir [Wed, 13 Aug 2025 22:13:13 +0000 (15:13 -0700)]
eth: fbnic: Use shinfo to track frags state on Rx

Remove local fields that track frags state and instead store this
information directly in the shinfo struct. This change is necessary
because the current implementation can lead to inaccuracies in certain
scenarios, such as when using XDP multi-buff support. Specifically, the
XDP program may update nr_frags without updating the local variables,
resulting in an inconsistent state.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Mohsin Bashir <mohsin.bashr@gmail.com>
Link: https://patch.msgid.link/20250813221319.3367670-4-mohsin.bashr@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2 months agoeth: fbnic: Update Headroom
Mohsin Bashir [Wed, 13 Aug 2025 22:13:12 +0000 (15:13 -0700)]
eth: fbnic: Update Headroom

Fbnic currently reserves a minimum of 64B headroom, but this is
insufficient for inserting additional headers (e.g., IPV6) via XDP, as
only 24 bytes are available for adjustment. To address this limitation,
increase the headroom to a larger value while ensuring better page use.
Although the resulting headroom (192B) is smaller than the recommended
value (256B), forcing the headroom to 256B would require aligning to
256B (as opposed to the current 128B), which can push the max headroom
to 511B.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Mohsin Bashir <mohsin.bashr@gmail.com>
Link: https://patch.msgid.link/20250813221319.3367670-3-mohsin.bashr@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2 months agoeth: fbnic: Add support for HDS configuration
Mohsin Bashir [Wed, 13 Aug 2025 22:13:11 +0000 (15:13 -0700)]
eth: fbnic: Add support for HDS configuration

Add support for configuring the header data split threshold.
For fbnic, the tcp data split support is enabled all the time.

Fbnic supports a maximum buffer size of 4KB. However, the reservation
for the headroom, tailroom, and padding reduce the max header size
accordingly.

  ethtool_hds -g eth0
  Ring parameters for eth0:
  Pre-set maximums:
  ...
  HDS thresh: 3584
  Current hardware settings:
  ...
  HDS thresh: 1536

Verify hds tests in ksft-net-drv are passing

  ksft-net-drv]# ./drivers/net/hds.py
  TAP version 13
  1..13
  ok 1 hds.get_hds
  ok 2 hds.get_hds_thresh
  ok 3 hds.set_hds_disable # SKIP disabling of HDS not supported by ...
  ...
  ...
  ok 12 hds.ioctl_set_xdp
  ok 13 hds.ioctl_enabled_set_xdp
  \# Totals: pass:12 fail:0 xfail:0 xpass:0 skip:1 error:0

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Mohsin Bashir <mohsin.bashr@gmail.com>
Link: https://patch.msgid.link/20250813221319.3367670-2-mohsin.bashr@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2 months agoMerge branch 'net-stmmac-eee-and-wol-cleanups'
Jakub Kicinski [Tue, 19 Aug 2025 01:10:14 +0000 (18:10 -0700)]
Merge branch 'net-stmmac-eee-and-wol-cleanups'

Russell King says:

====================
net: stmmac: EEE and WoL cleanups

This series contains a series of cleanup patches for the EEE and WoL
code in stmmac, prompted by issues raised during the last three weeks.
====================

Link: https://patch.msgid.link/aJ8avIp8DBAckgMc@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: stmmac: explain the phylink_speed_down() call in stmmac_release()
Russell King (Oracle) [Fri, 15 Aug 2025 11:32:21 +0000 (12:32 +0100)]
net: stmmac: explain the phylink_speed_down() call in stmmac_release()

The call to phylink_speed_down() looks odd on the face of it. Add a
comment to explain why this call is there. phylink_speed_up() is
always called in __stmmac_open(), and already has a comment.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/E1umsfV-008vKv-1O@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: stmmac: add helpers to indicate WoL enable status
Russell King (Oracle) [Fri, 15 Aug 2025 11:32:15 +0000 (12:32 +0100)]
net: stmmac: add helpers to indicate WoL enable status

Add two helpers to abstract the WoL enable status at the PHY and MAC to
make the code easier to read.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/E1umsfP-008vKp-U1@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: stmmac: use core wake IRQ support
Russell King (Oracle) [Fri, 15 Aug 2025 11:32:10 +0000 (12:32 +0100)]
net: stmmac: use core wake IRQ support

The PM core provides management of wake IRQs along side setting the
device wake enable state. In order to use this, we need to register
the interrupt used to wakeup the system using devm_pm_set_wake_irq()
or dev_pm_set_wake_irq(). The core will then enable or disable IRQ
wake state on this interrupt as appropriate, depending on the
device_set_wakeup_enable() state. device_set_wakeup_enable() does not
care about having balanced enable/disable calls.

Make use of this functionality, rather than explicitly managing the
IRQ enable state in the set_wol() ethtool op. This removes the IRQ
wake state management from stmmac.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/E1umsfK-008vKj-Pw@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: stmmac: remove unnecessary "stmmac: wakeup enable" print
Russell King (Oracle) [Fri, 15 Aug 2025 11:32:05 +0000 (12:32 +0100)]
net: stmmac: remove unnecessary "stmmac: wakeup enable" print

Printing "stmmac: wakeup enable" to the kernel log isn't useful - it
doesn't identify the adapter, and is effectively nothing more than a
debugging print. This information can be discovered by looking at
/sys/device.../power/wakeup as the device_set_wakeup_enable() call
updates this sysfs file.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1umsfF-008vKc-Kt@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: stmmac: remove redundant WoL option validation
Russell King (Oracle) [Fri, 15 Aug 2025 11:32:00 +0000 (12:32 +0100)]
net: stmmac: remove redundant WoL option validation

The core ethtool API validates the WoL options passed from userspace
against the support which the driver reports from its get_wol() method,
returning EINVAL if an unsupported mode is requested.

Therefore, there is no need for stmmac to implement its own validation.
Remove this unnecessary code.

See ethnl_set_wol() in net/ethtool/wol.c and ethtool_set_wol() in
net/ethtool/ioctl.c.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1umsfA-008vKW-H1@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: stmmac: remove write-only mac->pmt
Russell King (Oracle) [Fri, 15 Aug 2025 11:31:55 +0000 (12:31 +0100)]
net: stmmac: remove write-only mac->pmt

mac_device_info->pmt is only ever written, nothing reads it. Remove
this struct member.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1umsf5-008vKQ-DT@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: stmmac: remove unnecessary checks in ethtool eee ops
Russell King (Oracle) [Fri, 15 Aug 2025 11:31:50 +0000 (12:31 +0100)]
net: stmmac: remove unnecessary checks in ethtool eee ops

Phylink will check whether the MAC supports the LPI methods in
struct phylink_mac_ops, and return -EOPNOTSUPP if the LPI capabilities
are not provided. stmmac doesn't provide LPI capabilities if
priv->dma_cap.eee is not set.

Therefore, checking the state of priv->dma_cap.eee in the ethtool ops
and returning -EOPNOTSUPP is redundant - let phylink handle this.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1umsf0-008vKK-A3@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agoamd-xgbe: Configure and retrieve 'tx-usecs' for Tx coalescing
Vishal Badole [Sat, 16 Aug 2025 14:19:41 +0000 (19:49 +0530)]
amd-xgbe: Configure and retrieve 'tx-usecs' for Tx coalescing

Ethtool has advanced with additional configurable options, but the
current driver does not support tx-usecs configuration using Ethtool.

Add support to configure and retrieve 'tx-usecs' using ethtool, which
specifies the wait time before servicing an interrupt for Tx coalescing.

Signed-off-by: Vishal Badole <Vishal.Badole@amd.com>
Acked-by: Shyam Sundar S K <Shyam-sundar.S-k@amd.com>
Link: https://patch.msgid.link/20250816141941.126054-1-Vishal.Badole@amd.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agoMerge branch 'net-use-vmalloc_array-to-simplify-code'
Jakub Kicinski [Tue, 19 Aug 2025 00:49:54 +0000 (17:49 -0700)]
Merge branch 'net-use-vmalloc_array-to-simplify-code'

Qianfeng Rong says:

====================
net: use vmalloc_array() to simplify code

Remove array_size() calls and replace vmalloc() with vmalloc_array() to
simplify the code and maintain consistency with existing kmalloc_array()
usage.

vmalloc_array() is also optimized better, resulting in less instructions
being used [1].

[1]: https://lore.kernel.org/lkml/abc66ec5-85a4-47e1-9759-2f60ab111971@vivo.com/
====================

Link: https://patch.msgid.link/20250816090659.117699-1-rongqianfeng@vivo.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agoppp: use vmalloc_array() to simplify code
Qianfeng Rong [Sat, 16 Aug 2025 09:06:54 +0000 (17:06 +0800)]
ppp: use vmalloc_array() to simplify code

Remove array_size() calls and replace vmalloc() with vmalloc_array() in
bsd_alloc().

vmalloc_array() is also optimized better, resulting in less instructions
being used.

Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com>
Link: https://patch.msgid.link/20250816090659.117699-4-rongqianfeng@vivo.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonfp: flower: use vmalloc_array() to simplify code
Qianfeng Rong [Sat, 16 Aug 2025 09:06:53 +0000 (17:06 +0800)]
nfp: flower: use vmalloc_array() to simplify code

Remove array_size() calls and replace vmalloc() with vmalloc_array() in
nfp_flower_metadata_init().  vmalloc_array() is also optimized better,
resulting in less instructions being used.

Place 'NFP_FL_STATS_ELEM_RS' with the sizeof() parameter as the second
argument to vmalloc_array() to avoid -Wcalloc-transposed-args compilation
warnings.

Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com>
Link: https://patch.msgid.link/20250816090659.117699-3-rongqianfeng@vivo.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agoeth: intel: use vmalloc_array() to simplify code
Qianfeng Rong [Sat, 16 Aug 2025 09:06:52 +0000 (17:06 +0800)]
eth: intel: use vmalloc_array() to simplify code

Remove array_size() calls and replace vmalloc() with vmalloc_array() to
simplify the code and maintain consistency with existing kmalloc_array()
usage.

vmalloc_array() is also optimized better, resulting in less instructions
being used.

Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com>
Link: https://patch.msgid.link/20250816090659.117699-2-rongqianfeng@vivo.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agodocs: netdev: refine the clean-up patch examples
Jakub Kicinski [Fri, 15 Aug 2025 16:52:42 +0000 (09:52 -0700)]
docs: netdev: refine the clean-up patch examples

We discourage sending trivial patches to clean up checkpatch warnings.
There are other tools which lead to patches of similarly low value
like some coccicheck warnings. The warnings are useful for new code
but fixing them in the existing code base is a waste of review time.

Broaden the example given in the doc a little bit.

Link: https://patch.msgid.link/20250815165242.124240-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agoppp: mppe: Use SHA-1 library instead of crypto_shash
Eric Biggers [Fri, 15 Aug 2025 02:07:05 +0000 (19:07 -0700)]
ppp: mppe: Use SHA-1 library instead of crypto_shash

Now that a SHA-1 library API is available, use it instead of
crypto_shash.  This is simpler and faster.

Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Link: https://patch.msgid.link/20250815020705.23055-1-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agoeth: nfp: Remove u64_stats_update_begin()/end() for stats fetch
Li RongQing [Fri, 15 Aug 2025 01:56:19 +0000 (09:56 +0800)]
eth: nfp: Remove u64_stats_update_begin()/end() for stats fetch

This place is fetching the stats, u64_stats_update_begin()/end()
should not be used, and the fetcher of stats is in the same
context as the updater of the stats, so don't need any protection

Signed-off-by: Li RongQing <lirongqing@baidu.com>
Link: https://patch.msgid.link/20250815015619.2713-1-lirongqing@baidu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonfc: s3fwrn5: Use SHA-1 library instead of crypto_shash
Eric Biggers [Fri, 15 Aug 2025 02:23:29 +0000 (19:23 -0700)]
nfc: s3fwrn5: Use SHA-1 library instead of crypto_shash

Now that a SHA-1 library API is available, use it instead of
crypto_shash.  This is simpler and faster.

Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://patch.msgid.link/20250815022329.28672-1-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agoMerge branch 'net-dsa-move-ks8995-phy-driver-to-dsa'
Jakub Kicinski [Tue, 19 Aug 2025 00:26:12 +0000 (17:26 -0700)]
Merge branch 'net-dsa-move-ks8995-phy-driver-to-dsa'

Linus Walleij says:

====================
net: dsa: Move ks8995 "phy" driver to DSA

After we concluded that the KS8995 is a DSA switch, see
commit a0f29a07b654a50ebc9b070ef6dcb3219c4de867
it is time to move the driver to it's right place under
DSA.

Developing full support for the custom tagging, but we
can make sure the driver does the job it did as a "phy",
act as a switch with individually represented ports.

This patch series achieves that first step so the
current device tree bindings produces working set-ups
and paves the way for custom tagging.
====================

Link: https://patch.msgid.link/20250813-ks8995-to-dsa-v1-0-75c359ede3a5@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: dsa: ks8995: Add basic switch set-up
Linus Walleij [Wed, 13 Aug 2025 21:43:06 +0000 (23:43 +0200)]
net: dsa: ks8995: Add basic switch set-up

We start to extend the KS8995 driver by simply registering it
as a DSA device and implementing a few switch callbacks for
STP set-up and such to begin with.

No special tags or other advanced stuff: we use DSA_TAG_NONE
and rely on the default set-up in the switch with the special
DSA tags turned off. This makes the switch wire up properly
to all its PHY's and simple bridge traffic works properly.

After this the bridge DT bindings are respected, ports and their
PHYs get connected to the switch and react appropriately through
the phylib when cables are plugged in etc.

Tested like this in a hacky OpenWrt image:

Bring up conduit interface manually:
ixp4xx_eth c8009000.ethernet eth0: eth0: link up,
  speed 100 Mb/s, full duplex

spi-ks8995 spi0.0: enable port 0
spi-ks8995 spi0.0: set KS8995_REG_PC2 for port 0 to 06
spi-ks8995 spi0.0 lan1: configuring for phy/mii link mode
spi-ks8995 spi0.0 lan1: Link is Up - 100Mbps/Full - flow control rx/tx

PING 169.254.1.1 (169.254.1.1): 56 data bytes
64 bytes from 169.254.1.1: seq=0 ttl=64 time=1.629 ms
64 bytes from 169.254.1.1: seq=1 ttl=64 time=0.951 ms

I also tested SSH from the device to the host and it works fine.

It also works fine to ping the device from the host and to SSH
into the device from the host.

This brings the ks8995 driver to a reasonable state where it can
be used from the current device tree bindings and the existing
device trees in the kernel.

Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20250813-ks8995-to-dsa-v1-4-75c359ede3a5@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: dsa: ks8995: Delete sysfs register access
Linus Walleij [Wed, 13 Aug 2025 21:43:05 +0000 (23:43 +0200)]
net: dsa: ks8995: Delete sysfs register access

There is some sysfs file to read and write registers randomly
in the ks8995 driver.

The contemporary way to achieve the same thing is to implement
regmap abstractions, if needed. Delete this and implement
regmap later if we want to be able to inspect individual registers.

Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20250813-ks8995-to-dsa-v1-3-75c359ede3a5@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: dsa: ks8995: Add proper RESET delay
Linus Walleij [Wed, 13 Aug 2025 21:43:04 +0000 (23:43 +0200)]
net: dsa: ks8995: Add proper RESET delay

According to the datasheet we need to wait 100us before accessing
any registers in the KS8995 after a reset de-assertion.

Add this delay, if and only if we obtained a GPIO descriptor,
otherwise it is just a pointless delay.

Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20250813-ks8995-to-dsa-v1-2-75c359ede3a5@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: dsa: Move KS8995 to the DSA subsystem
Linus Walleij [Wed, 13 Aug 2025 21:43:03 +0000 (23:43 +0200)]
net: dsa: Move KS8995 to the DSA subsystem

By reading the datasheets for the KS8995 it is obvious that this
is a 100 Mbit DSA switch.

Let us start the refactoring by moving it to the DSA subsystem to
preserve development history.

Verified that the chip still probes the same after this patch
provided CONFIG_HAVE_NET_DSA, CONFIG_NET_DSA and CONFIG_DSA_KS8995
are selected.

Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20250813-ks8995-to-dsa-v1-1-75c359ede3a5@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agodt-bindings: net: realtek,rtl82xx: document wakeup-source property
Russell King (Oracle) [Wed, 13 Aug 2025 11:21:19 +0000 (12:21 +0100)]
dt-bindings: net: realtek,rtl82xx: document wakeup-source property

The RTL8211F PHY has two modes for a single INTB/PMEB pin:

1. INTB mode, where it signals interrupts to the CPU, which can
   include wake-on-LAN events.
2. PMEB mode, where it only signals a wake-on-LAN event, which
   may either be a latched logic low until software manually
   clears the WoL state, or pulsed mode.

In the case of (1), there is no way to know whether the interrupt to
which the PHY is connected is capable of waking the system. In the
case of (2), there would be no interrupt property in the PHY's DT
description, and thus there is nothing to describe whether the pin is
even wired to anything such as a power management controller.

There is a "wakeup-source" property which can be applied to any device
- see Documentation/devicetree/bindings/power/wakeup-source.txt

Case 1 above matches example 2 in this document, and case 2 above
matches example 3. Therefore, it seems reasonable to make use of this
existing specification, albiet it hasn't been converted to YAML.

Document the wakeup-source property in the device description, which
will indicate that the PHY has been wired up in such a way that it
can wake the system from a low power state.

We will use this in a rewrite of the existing broken Wake-on-Lan code
that was merged during the 6.16 merge window to support case 1. Case 2
can be added to the driver later without needing to further alter the
DT description. To be clear, the existing Wake-on-Lan code that was
recently merged has multiple functional issues.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Acked-by: Conor Dooley <conor.dooley@microchip.com>
Link: https://patch.msgid.link/E1um9Xj-008kBx-72@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: phy: realtek: fix RTL8211F wake-on-lan support
Russell King (Oracle) [Wed, 13 Aug 2025 10:04:45 +0000 (11:04 +0100)]
net: phy: realtek: fix RTL8211F wake-on-lan support

Implement Wake-on-Lan for RTL8211F correctly. The existing
implementation has multiple issues:

1. It assumes that Wake-on-Lan can always be used, whether or not the
   interrupt is wired, and whether or not the interrupt is capable of
   waking the system. This breaks the ability for MAC drivers to detect
   whether the PHY WoL is functional.
2. switching the interrupt pin in the .set_wol() method to PMEB mode
   immediately silences link-state interrupts, which breaks phylib
   when interrupts are being used rather than polling mode.
3. the code claiming to "reset WOL status" was doing nothing of the
   sort. Bit 15 in page 0xd8a register 17 controls WoL reset, and
   needs to be pulsed low to reset the WoL state. This bit was always
   written as '1', resulting in no reset.
4. not resetting WoL state results in the PMEB pin remaining asserted,
   which in turn leads to an interrupt storm. Only resetting the WoL
   state in .set_wol() is not sufficient.
5. PMEB mode does not allow software detection of the wake-up event as
   there is no status bit to indicate we received the WoL packet.
6. across reboots of at least the Jetson Xavier NX system, the WoL
   configuration is preserved.

Fix all of these issues by essentially rewriting the support. We:
1. clear the WoL event enable register at probe time.
2. detect whether we can support wake-up by having a valid interrupt,
   and the "wakeup-source" property in DT. If we can, then we mark
   the MDIO device as wakeup capable, and associate the interrupt
   with the wakeup source.
3. arrange for the get_wol() and set_wol() implementations to handle
   the case where the MDIO device has not been marked as wakeup
   capable (thereby returning no WoL support, and refusing to enable
   WoL support.)
4. avoid switching to PMEB mode, instead using INTB mode with the
   interrupt enable, reconfiguring the interrupt enables at suspend
   time, and restoring their original state at resume time (we track
   the state of the interrupt enable register in .config_intr()
   register.)
5. move WoL reset from .set_wol() to the suspend function to ensure
   that WoL state is cleared prior to suspend. This is necessary
   after the PME interrupt has been enabled as a second WoL packet
   will not re-raise a previously cleared PME interrupt.
6. when a PME interrupt (for wakeup) is asserted, pass this to the
   PM wakeup so it knows which device woke the system.

This fixes WoL support in the Realtek RTL8211F driver when used on the
nVidia Jetson Xavier NX platform, and needs to be applied before stmmac
patches which allow these platforms to forward the ethtool WoL commands
to the Realtek PHY.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/E1um8Ld-008jxD-Mc@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agoMerge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next...
Jakub Kicinski [Sat, 16 Aug 2025 01:13:41 +0000 (18:13 -0700)]
Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue

Tony Nguyen says:

====================
ice: implement SRIOV VF Active-Active LAG

Dave Ertman says:

Implement support for SRIOV VFs over an Active-Active link aggregate.
The same restrictions apply as are in place for the support of
Active-Backup bonds.

- the two interfaces must be on the same NIC
- the FW LLDP engine needs to be disabled
- the DDP package that supports VF LAG must be loaded on device
- the two interfaces must have the same QoS config
- only the first interface added to the bond will have VF support
- the interface with VFs must be in switchdev mode

With the additional requirement of
- the version of the FW on the NIC needs to have VF Active/Active support

The balancing of traffic between the two interfaces is done on a queue
basis. Taking the queues allocated to all of the VFs as a whole, one
half of them will be distributed to each interface. When a link goes
down, then the queues allocated to the down interface will migrate to
the active port. When the down port comes back up, then the same
queues as were originally assigned there will be moved back.

Patch 1 cleans up void pointer casts
Patch 2 utilizes bool over u8 when appropriate
Patch 3 adds a driver prefix to a LAG define
Patch 4 pre-move a function to reduce delta in implementation patch
Patch 5 cleanup variable initialization in declaration block
Patch 6 cleanup capability parsing for LAG feature
Patch 7 is the implementation of the new functionality

* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
  ice: Implement support for SRIOV VFs across Active/Active bonds
  ice: cleanup capabilities evaluation
  ice: Cleanup variable initialization in LAG code
  ice: move LAG function in code to prepare for Active-Active
  ice: Add driver specific prefix to LAG defines
  ice: replace u8 elements with bool where appropriate
  ice: Remove casts on void pointers in LAG code
====================

Link: https://patch.msgid.link/20250814230855.128068-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: Space: Replace memset(0) + strscpy() with strscpy_pad()
Thorsten Blum [Thu, 14 Aug 2025 18:05:14 +0000 (20:05 +0200)]
net: Space: Replace memset(0) + strscpy() with strscpy_pad()

Replace memset(0) followed by strscpy() with strscpy_pad() to improve
netdev_boot_setup_add(). This avoids zeroing the memory before copying
the string and ensures the destination buffer is only written to once,
simplifying the code and improving efficiency.

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Link: https://patch.msgid.link/20250814180514.251000-2-thorsten.blum@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>