]> www.infradead.org Git - users/dwmw2/linux.git/log
users/dwmw2/linux.git
3 months agoipv6: Set cfg.ifa_flags before device lookup in inet6_rtm_newaddr().
Kuniyuki Iwashima [Wed, 15 Jan 2025 08:06:05 +0000 (17:06 +0900)]
ipv6: Set cfg.ifa_flags before device lookup in inet6_rtm_newaddr().

We will convert inet6_rtm_newaddr() to per-netns RTNL.

Except for IFA_F_OPTIMISTIC, cfg.ifa_flags can be set before
__dev_get_by_index().

Let's move ifa_flags setup before __dev_get_by_index() so that
we can set ifa_flags without RTNL.

Also, now it's moved before tb[IFA_CACHEINFO] in preparing for
the next patch.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115080608.28127-9-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoipv6: Pass dev to inet6_addr_add().
Kuniyuki Iwashima [Wed, 15 Jan 2025 08:06:04 +0000 (17:06 +0900)]
ipv6: Pass dev to inet6_addr_add().

inet6_addr_add() is called from inet6_rtm_newaddr() and
addrconf_add_ifaddr().

inet6_addr_add() looks up dev by __dev_get_by_index(), but
it's already done in inet6_rtm_newaddr().

Let's move the 2nd lookup to addrconf_add_ifaddr() and pass
dev to inet6_addr_add().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115080608.28127-8-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoipv6: Convert inet6_ioctl() to per-netns RTNL.
Kuniyuki Iwashima [Wed, 15 Jan 2025 08:06:03 +0000 (17:06 +0900)]
ipv6: Convert inet6_ioctl() to per-netns RTNL.

These functions are called from inet6_ioctl() with a socket's netns
and hold RTNL.

  * SIOCSIFADDR    : addrconf_add_ifaddr()
  * SIOCDIFADDR    : addrconf_del_ifaddr()
  * SIOCSIFDSTADDR : addrconf_set_dstaddr()

Let's use rtnl_net_lock().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115080608.28127-7-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoipv6: Hold rtnl_net_lock() in addrconf_init() and addrconf_cleanup().
Kuniyuki Iwashima [Wed, 15 Jan 2025 08:06:02 +0000 (17:06 +0900)]
ipv6: Hold rtnl_net_lock() in addrconf_init() and addrconf_cleanup().

addrconf_init() holds RTNL for blackhole_netdev, which is the global
device in init_net.

addrconf_cleanup() holds RTNL to clean up devices in init_net too.

Let's use rtnl_net_lock(&init_net) there.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115080608.28127-6-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoipv6: Hold rtnl_net_lock() in addrconf_dad_work().
Kuniyuki Iwashima [Wed, 15 Jan 2025 08:06:01 +0000 (17:06 +0900)]
ipv6: Hold rtnl_net_lock() in addrconf_dad_work().

addrconf_dad_work() is per-address work and holds RTNL internally.

We can fetch netns as dev_net(ifp->idev->dev).

Let's use rtnl_net_lock().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115080608.28127-5-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoipv6: Hold rtnl_net_lock() in addrconf_verify_work().
Kuniyuki Iwashima [Wed, 15 Jan 2025 08:06:00 +0000 (17:06 +0900)]
ipv6: Hold rtnl_net_lock() in addrconf_verify_work().

addrconf_verify_work() is per-netns work to call addrconf_verify_rtnl()
under RTNL.

Let's use rtnl_net_lock().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115080608.28127-4-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoipv6: Convert net.ipv6.conf.${DEV}.XXX sysctl to per-netns RTNL.
Kuniyuki Iwashima [Wed, 15 Jan 2025 08:05:59 +0000 (17:05 +0900)]
ipv6: Convert net.ipv6.conf.${DEV}.XXX sysctl to per-netns RTNL.

net.ipv6.conf.${DEV}.XXX sysctl are changed under RTNL:

  * forwarding
  * ignore_routes_with_linkdown
  * disable_ipv6
  * proxy_ndp
  * addr_gen_mode
  * stable_secret
  * disable_policy

Let's use rtnl_net_lock() there.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115080608.28127-3-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoipv6: Add __in6_dev_get_rtnl_net().
Kuniyuki Iwashima [Wed, 15 Jan 2025 08:05:58 +0000 (17:05 +0900)]
ipv6: Add __in6_dev_get_rtnl_net().

We will convert rtnl_lock() with rtnl_net_lock(), and we want to
convert __in6_dev_get() too.

__in6_dev_get() uses rcu_dereference_rtnl(), but as written in its
comment, rtnl_dereference() or rcu_dereference() is preferable.

Let's add __in6_dev_get_rtnl_net() that uses rtnl_net_dereference().

We can add the RCU version helper later if needed.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115080608.28127-2-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: stmmac: Drop redundant skb_mark_for_recycle() for SKB frags
Furong Xu [Fri, 17 Jan 2025 06:28:05 +0000 (14:28 +0800)]
net: stmmac: Drop redundant skb_mark_for_recycle() for SKB frags

After commit df542f669307 ("net: stmmac: Switch to zero-copy in
non-XDP RX path"), SKBs are always marked for recycle, it is redundant
to mark SKBs more than once when new frags are appended.

Signed-off-by: Furong Xu <0x1207@gmail.com>
Link: https://patch.msgid.link/20250117062805.192393-1-0x1207@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: mii: Fix the Speed display when the network cable is not connected
Xiangqian Zhang [Fri, 17 Jan 2025 09:46:03 +0000 (17:46 +0800)]
net: mii: Fix the Speed display when the network cable is not connected

Two different models of usb card, the drivers are r8152 and asix. If no
network cable is connected, Speed = 10Mb/s. This problem is repeated in
linux 3.10, 4.19, 5.4, 6.12. This problem also exists on the latest
kernel. Both drivers call mii_ethtool_get_link_ksettings,
but the value of cmd->base.speed in this
function can only be SPEED_1000 or SPEED_100 or SPEED_10.
When the network cable is not connected, set cmd->base.speed
=SPEED_UNKNOWN.

Signed-off-by: Xiangqian Zhang <zhangxiangqian@kylinos.cn>
Link: https://patch.msgid.link/20250117094603.4192594-1-zhangxiangqian@kylinos.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agosysctl net: Remove macro checks for CONFIG_SYSCTL
Denis Kirjanov [Sun, 19 Jan 2025 13:42:53 +0000 (16:42 +0300)]
sysctl net: Remove macro checks for CONFIG_SYSCTL

Since dccp and llc makefiles already check sysctl code
compilation with xxx-$(CONFIG_SYSCTL)
we can drop the checks

Signed-off-by: Denis Kirjanov <kirjanov@gmail.com>
Link: https://patch.msgid.link/20250119134254.19250-1-kirjanov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge tag 'nf-next-25-01-19' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilt...
Jakub Kicinski [Mon, 20 Jan 2025 19:59:25 +0000 (11:59 -0800)]
Merge tag 'nf-next-25-01-19' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next

Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following batch contains Netfilter updates for net-next:

1) Unbreak set size settings for rbtree set backend, intervals in
   rbtree are represented as two elements, this detailed is leaked
   to userspace leading to bogus ENOSPC from control plane.

2) Remove dead code in br_netfilter's br_nf_pre_routing_finish()
   due to never matching error when looking up for route,
   from Antoine Tenart.

3) Simplify check for device already in use in flowtable,
   from Phil Sutter.

4) Three patches to restore interface name field in struct nft_hook
   and use it, this is to prepare for wildcard interface support.
   From Phil Sutter.

5) Do not remove netdev basechain when last device is gone, this is
   for consistency with the flowtable behaviour. This allows for netdev
   basechains without devices. Another patch to simplify netdev event
   notifier after this update. Also from Phil.

6) Two patches to add missing spinlock when flowtable updates TCP
   state flags, from Florian Westphal.

7) Simplify __nf_ct_refresh_acct() by removing skbuff parameter,
   also from Florian.

8) Flowtable gc now extends ct timeout for offloaded flow. This
   is to address a possible race that leads to handing over flow
   to classic path with long ct timeouts.

9) Tear down flow if cached rt_mtu is stale, before this patch,
   packet is handed over to classic path but flow entry still remained
   in place.

10) Revisit the flowtable teardown strategy, which was originally
    designed to release flowtable hardware entries early. Add a new
    CLOSING flag that still allows hardware to release entries when
    fin/rst is seen, but keeps the flow entry in place when the
    TCP connection is closed. Release flow after timeout or when a new
    syn packet is seen for TCP reopen scenario.

* tag 'nf-next-25-01-19' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
  netfilter: flowtable: add CLOSING state
  netfilter: flowtable: teardown flow if cached mtu is stale
  netfilter: conntrack: rework offload nf_conn timeout extension logic
  netfilter: conntrack: remove skb argument from nf_ct_refresh
  netfilter: nft_flow_offload: update tcp state flags under lock
  netfilter: nft_flow_offload: clear tcp MAXACK flag before moving to slowpath
  netfilter: nf_tables: Simplify chain netdev notifier
  netfilter: nf_tables: Tolerate chains with no remaining hooks
  netfilter: nf_tables: Compare netdev hooks based on stored name
  netfilter: nf_tables: Use stored ifname in netdev hook dumps
  netfilter: nf_tables: Store user-defined hook ifname
  netfilter: nf_tables: Flowtable hook's pf value never varies
  netfilter: br_netfilter: remove unused conditional and dead code
  netfilter: nf_tables: fix set size with rbtree backend
====================

Link: https://patch.msgid.link/20250119172051.8261-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge branch 'net-ethtool-fixes-for-hds-threshold'
Jakub Kicinski [Mon, 20 Jan 2025 19:45:01 +0000 (11:45 -0800)]
Merge branch 'net-ethtool-fixes-for-hds-threshold'

Jakub Kicinski says:

====================
net: ethtool: fixes for HDS threshold

Quick follow up on the HDS threshold work, since the merge window
is upon us.

Fix the bnxt implementation to apply the settings right away,
because we update the parameters _after_ configuring HW user
needed to reconfig the device twice to get the settings to stick.

For this I took the liberty of moving the config to a separate
struct. This follows my original thinking for the queue API.
It should also fit more neatly into how many drivers which
support safe config update operate. Drivers can allocate
new objects using the "pending" struct.

netdevsim:

  KTAP version 1
  1..7
  ok 1 hds.get_hds
  ok 2 hds.get_hds_thresh
  ok 3 hds.set_hds_disable
  ok 4 hds.set_hds_enable
  ok 5 hds.set_hds_thresh_zero
  ok 6 hds.set_hds_thresh_max
  ok 7 hds.set_hds_thresh_gt
  # Totals: pass:7 fail:0 xfail:0 xpass:0 skip:0 error:0

bnxt:

  KTAP version 1
  1..7
  ok 1 hds.get_hds
  ok 2 hds.get_hds_thresh
  ok 3 hds.set_hds_disable # SKIP disabling of HDS not supported by the device
  ok 4 hds.set_hds_enable
  ok 5 hds.set_hds_thresh_zero
  ok 6 hds.set_hds_thresh_max
  ok 7 hds.set_hds_thresh_gt
  # Totals: pass:6 fail:0 xfail:0 xpass:0 skip:1 error:0

v1: https://lore.kernel.org/20250117194815.1514410-1-kuba@kernel.org
====================

Link: https://patch.msgid.link/20250119020518.1962249-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoeth: bnxt: update header sizing defaults
Jakub Kicinski [Sun, 19 Jan 2025 02:05:17 +0000 (18:05 -0800)]
eth: bnxt: update header sizing defaults

300-400B RPC requests are fairly common. With the current default
of 256B HDS threshold bnxt ends up splitting those, lowering PCIe
bandwidth efficiency and increasing the number of memory allocation.

Increase the HDS threshold to fit 4 buffers in a 4k page.
This works out to 640B as the threshold on a typical kernel confing.
This change increases the performance for a microbenchmark which
receives 400B RPCs and sends empty responses by 4.5%.
Admittedly this is just a single benchmark, but 256B works out to
just 6 (so 2 more) packets per head page, because shinfo size
dominates the headers.

Now that we use page pool for the header pages I was also tempted
to default rx_copybreak to 0, but in synthetic testing the copybreak
size doesn't seem to make much difference.

Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20250119020518.1962249-8-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoeth: bnxt: allocate enough buffer space to meet HDS threshold
Jakub Kicinski [Sun, 19 Jan 2025 02:05:16 +0000 (18:05 -0800)]
eth: bnxt: allocate enough buffer space to meet HDS threshold

Now that we can configure HDS threshold separately from the rx_copybreak
HDS threshold may be higher than rx_copybreak.

We need to make sure that we have enough space for the headers.

Fixes: 6b43673a25c3 ("bnxt_en: add support for hds-thresh ethtool command")
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20250119020518.1962249-7-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: ethtool: populate the default HDS params in the core
Jakub Kicinski [Sun, 19 Jan 2025 02:05:15 +0000 (18:05 -0800)]
net: ethtool: populate the default HDS params in the core

The core has the current HDS config, it can pre-populate the values
for the drivers. While at it, remove the zero-setting in netdevsim.
Zero are the default values since the config is zalloc'ed.

Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20250119020518.1962249-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoeth: bnxt: apply hds_thrs settings correctly
Jakub Kicinski [Sun, 19 Jan 2025 02:05:14 +0000 (18:05 -0800)]
eth: bnxt: apply hds_thrs settings correctly

Use the pending config for hds_thrs. Core will only update the "current"
one after we return success. Without this change 2 reconfigs would be
required for the setting to reach the device.

Fixes: 6b43673a25c3 ("bnxt_en: add support for hds-thresh ethtool command")
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20250119020518.1962249-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: provide pending ring configuration in net_device
Jakub Kicinski [Sun, 19 Jan 2025 02:05:13 +0000 (18:05 -0800)]
net: provide pending ring configuration in net_device

Record the pending configuration in net_device struct.
ethtool core duplicates the current config and the specific
handlers (for now just ringparam) can modify it.

Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20250119020518.1962249-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: ethtool: store netdev in a temp variable in ethnl_default_set_doit()
Jakub Kicinski [Sun, 19 Jan 2025 02:05:12 +0000 (18:05 -0800)]
net: ethtool: store netdev in a temp variable in ethnl_default_set_doit()

For ease of review of the next patch store the dev pointer
on the stack, instead of referring to req_info.dev every time.

No functional changes.

Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20250119020518.1962249-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: move HDS config from ethtool state
Jakub Kicinski [Sun, 19 Jan 2025 02:05:11 +0000 (18:05 -0800)]
net: move HDS config from ethtool state

Separate the HDS config from the ethtool state struct.
The HDS config contains just simple parameters, not state.
Having it as a separate struct will make it easier to clone / copy
and also long term potentially make it per-queue.

Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20250119020518.1962249-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge branch 'af_unix-set-skb-drop-reason-in-every-kfree_skb-path'
Jakub Kicinski [Mon, 20 Jan 2025 19:27:44 +0000 (11:27 -0800)]
Merge branch 'af_unix-set-skb-drop-reason-in-every-kfree_skb-path'

Kuniyuki Iwashima says:

====================
af_unix: Set skb drop reason in every kfree_skb() path.

There is a potential user for skb drop reason for AF_UNIX.

This series replaces some kfree_skb() in connect() and
sendmsg() paths and sets skb drop reason for the rest of
kfree_skb() in AF_UNIX.

Link: https://lore.kernel.org/netdev/CAAf2ycmZHti95WaBR3s+L5Epm1q7sXmvZ-EqCK=-oZj=45tOwQ@mail.gmail.com/
v2: https://lore.kernel.org/20250112040810.14145-1-kuniyu@amazon.com/
v1: https://lore.kernel.org/20250110092641.85905-1-kuniyu@amazon.com/
====================

Link: https://patch.msgid.link/20250116053441.5758-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoaf_unix: Use consume_skb() in connect() and sendmsg().
Kuniyuki Iwashima [Thu, 16 Jan 2025 05:34:42 +0000 (14:34 +0900)]
af_unix: Use consume_skb() in connect() and sendmsg().

This is based on Donald Hunter's patch.

These functions could fail for various reasons, sometimes
triggering kfree_skb().

  * unix_stream_connect() : connect()
  * unix_stream_sendmsg() : sendmsg()
  * queue_oob()           : sendmsg(MSG_OOB)
  * unix_dgram_sendmsg()  : sendmsg()

Such kfree_skb() is tied to the errno of connect() and
sendmsg(), and we need not define skb drop reasons.

Let's use consume_skb() not to churn kfree_skb() events.

Link: https://lore.kernel.org/netdev/eb30b164-7f86-46bf-a5d3-0f8bda5e9398@redhat.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250116053441.5758-10-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoaf_unix: Reuse out_pipe label in unix_stream_sendmsg().
Kuniyuki Iwashima [Thu, 16 Jan 2025 05:34:41 +0000 (14:34 +0900)]
af_unix: Reuse out_pipe label in unix_stream_sendmsg().

This is a follow-up of commit d460b04bc452 ("af_unix: Clean up
error paths in unix_stream_sendmsg().").

If we initialise skb with NULL in unix_stream_sendmsg(), we can
reuse the existing out_pipe label for the SEND_SHUTDOWN check.

Let's rename it and adjust the existing label as out_pipe_lock.

While at it, size and data_len are moved to the while loop scope.

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250116053441.5758-9-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoaf_unix: Set drop reason in unix_dgram_disconnected().
Kuniyuki Iwashima [Thu, 16 Jan 2025 05:34:40 +0000 (14:34 +0900)]
af_unix: Set drop reason in unix_dgram_disconnected().

unix_dgram_disconnected() is called from two places:

  1. when a connect()ed socket dis-connect()s or re-connect()s to
     another socket

  2. when sendmsg() fails because the peer socket that the client
     has connect()ed to has been close()d

Then, the client's recv queue is purged to remove all messages from
the old peer socket.

Let's define a new drop reason for that case.

  # echo 1 > /sys/kernel/tracing/events/skb/kfree_skb/enable

  # python3
  >>> from socket import *
  >>>
  >>> # s1 has a message from s2
  >>> s1, s2 = socketpair(AF_UNIX, SOCK_DGRAM)
  >>> s2.send(b'hello world')
  >>>
  >>> # re-connect() drops the message from s2
  >>> s3 = socket(AF_UNIX, SOCK_DGRAM)
  >>> s3.bind('')
  >>> s1.connect(s3.getsockname())

  # cat /sys/kernel/tracing/trace_pipe
     python3-250 ... kfree_skb: ... location=skb_queue_purge_reason+0xdc/0x110 reason: UNIX_DISCONNECT

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250116053441.5758-8-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoaf_unix: Set drop reason in unix_stream_read_skb().
Kuniyuki Iwashima [Thu, 16 Jan 2025 05:34:39 +0000 (14:34 +0900)]
af_unix: Set drop reason in unix_stream_read_skb().

unix_stream_read_skb() is called when BPF SOCKMAP reads some data
from a socket in the map.

SOCKMAP does not support MSG_OOB, and reading OOB results in a drop.

Let's set drop reasons respectively.

  * SOCKET_CLOSE  : the socket in SOCKMAP was close()d
  * UNIX_SKIP_OOB : OOB was read from the socket in SOCKMAP

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250116053441.5758-7-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoaf_unix: Set drop reason in manage_oob().
Kuniyuki Iwashima [Thu, 16 Jan 2025 05:34:38 +0000 (14:34 +0900)]
af_unix: Set drop reason in manage_oob().

AF_UNIX SOCK_STREAM socket supports MSG_OOB.

When OOB data is sent to a socket, recv() will break at that point.

If the next recv() does not have MSG_OOB, the normal data following
the OOB data is returned.

Then, the OOB skb is dropped.

Let's define a new drop reason for that case in manage_oob().

  # echo 1 > /sys/kernel/tracing/events/skb/kfree_skb/enable

  # python3
  >>> from socket import *
  >>> s1, s2 = socketpair(AF_UNIX)
  >>> s1.send(b'a', MSG_OOB)
  >>> s1.send(b'b')
  >>> s2.recv(2)
  b'b'

  # cat /sys/kernel/tracing/trace_pipe
  ...
     python3-223 ... kfree_skb: ... location=unix_stream_read_generic+0x59e/0xc20 reason: UNIX_SKIP_OOB

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250116053441.5758-6-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoaf_unix: Set drop reason in __unix_gc().
Kuniyuki Iwashima [Thu, 16 Jan 2025 05:34:37 +0000 (14:34 +0900)]
af_unix: Set drop reason in __unix_gc().

Inflight file descriptors by SCM_RIGHTS hold references to the
struct file.

AF_UNIX sockets could hold references to each other, forming
reference cycles.

Once such sockets are close()d without the fd recv()ed, they
will be unaccessible from userspace but remain in kernel.

__unix_gc() garbage-collects skb with the dead file descriptors
and frees them by __skb_queue_purge().

Let's set SKB_DROP_REASON_SOCKET_CLOSE there.

  # echo 1 > /sys/kernel/tracing/events/skb/kfree_skb/enable

  # python3
  >>> from socket import *
  >>> from array import array
  >>>
  >>> # Create a reference cycle
  >>> s1 = socket(AF_UNIX, SOCK_DGRAM)
  >>> s1.bind('')
  >>> s1.sendmsg([b"nop"], [(SOL_SOCKET, SCM_RIGHTS, array("i", [s1.fileno()]))], 0, s1.getsockname())
  >>> s1.close()
  >>>
  >>> # Trigger GC
  >>> s2 = socket(AF_UNIX)
  >>> s2.close()

  # cat /sys/kernel/tracing/trace_pipe
  ...
     kworker/u16:2-42 ... kfree_skb: ... location=__unix_gc+0x4ad/0x580 reason: SOCKET_CLOSE

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250116053441.5758-5-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoaf_unix: Set drop reason in unix_sock_destructor().
Kuniyuki Iwashima [Thu, 16 Jan 2025 05:34:36 +0000 (14:34 +0900)]
af_unix: Set drop reason in unix_sock_destructor().

unix_sock_destructor() is called as sk->sk_destruct() just before
the socket is actually freed.

Let's use SKB_DROP_REASON_SOCKET_CLOSE for skb_queue_purge().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250116053441.5758-4-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoaf_unix: Set drop reason in unix_release_sock().
Kuniyuki Iwashima [Thu, 16 Jan 2025 05:34:35 +0000 (14:34 +0900)]
af_unix: Set drop reason in unix_release_sock().

unix_release_sock() is called when the last refcnt of struct file
is released.

Let's define a new drop reason SKB_DROP_REASON_SOCKET_CLOSE and
set it for kfree_skb() in unix_release_sock().

  # echo 1 > /sys/kernel/tracing/events/skb/kfree_skb/enable

  # python3
  >>> from socket import *
  >>> s1, s2 = socketpair(AF_UNIX)
  >>> s1.send(b'hello world')
  >>> s2.close()

  # cat /sys/kernel/tracing/trace_pipe
  ...
     python3-280 ... kfree_skb: ... protocol=0 location=unix_release_sock+0x260/0x420 reason: SOCKET_CLOSE

To be precise, unix_release_sock() is also called for a new child
socket in unix_stream_connect() when something fails, but the new
sk does not have skb in the recv queue then and no event is logged.

Note that only tcp_inbound_ao_hash() uses a similar drop reason,
SKB_DROP_REASON_TCP_CLOSE, and this can be generalised later.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250116053441.5758-3-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: dropreason: Gather SOCKET_ drop reasons.
Kuniyuki Iwashima [Thu, 16 Jan 2025 05:34:34 +0000 (14:34 +0900)]
net: dropreason: Gather SOCKET_ drop reasons.

The following patch adds a new drop reason starting with
the SOCKET_ prefix.

Let's gather the existing SOCKET_ reasons.

Note that the order is not part of uAPI.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250116053441.5758-2-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoselftests/net/ipsec: Fix Null pointer dereference in rtattr_pack()
Liu Ye [Thu, 16 Jan 2025 01:30:37 +0000 (09:30 +0800)]
selftests/net/ipsec: Fix Null pointer dereference in rtattr_pack()

Address Null pointer dereference / undefined behavior in rtattr_pack
(note that size is 0 in the bad case).

Flagged by cppcheck as:
    tools/testing/selftests/net/ipsec.c:230:25: warning: Possible null pointer
    dereference: payload [nullPointer]
    memcpy(RTA_DATA(attr), payload, size);
                           ^
    tools/testing/selftests/net/ipsec.c:1618:54: note: Calling function 'rtattr_pack',
    4th argument 'NULL' value is 0
    if (rtattr_pack(&req.nh, sizeof(req), XFRMA_IF_ID, NULL, 0)) {
                                                       ^
    tools/testing/selftests/net/ipsec.c:230:25: note: Null pointer dereference
    memcpy(RTA_DATA(attr), payload, size);
                           ^
Signed-off-by: Liu Ye <liuye@kylinos.cn>
Link: https://patch.msgid.link/20250116013037.29470-1-liuye@kylinos.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: phy: realtek: HWMON support for standalone versions of RTL8221B and RTL8251
Aleksander Jan Bajkowski [Fri, 17 Jan 2025 22:24:21 +0000 (23:24 +0100)]
net: phy: realtek: HWMON support for standalone versions of RTL8221B and RTL8251

HWMON support has been added for the RTL8221/8251 PHYs integrated together
with the MAC inside the RTL8125/8126 chips. This patch extends temperature
reading support for standalone variants of the mentioned PHYs.

I don't know whether the earlier revisions of the RTL8226 also have a
built-in temperature sensor, so they have been skipped for now.

Tested on RTL8221B-VB-CG.

Signed-off-by: Aleksander Jan Bajkowski <olek2@wp.pl>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 months agonet: macsec: Add endianness annotations in salt struct
Ales Nezbeda [Fri, 17 Jan 2025 11:22:28 +0000 (12:22 +0100)]
net: macsec: Add endianness annotations in salt struct

This change resolves warning produced by sparse tool as currently
there is a mismatch between normal generic type in salt and endian
annotated type in macsec driver code. Endian annotated types should
be used here.

Sparse output:
warning: restricted ssci_t degrades to integer
warning: incorrect type in assignment (different base types)
    expected restricted ssci_t [usertype] ssci
    got unsigned int
warning: restricted __be64 degrades to integer
warning: incorrect type in assignment (different base types)
    expected restricted __be64 [usertype] pn
    got unsigned long long

Signed-off-by: Ales Nezbeda <anezbeda@redhat.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 months agotipc: re-order conditions in tipc_crypto_key_rcv()
Dan Carpenter [Fri, 17 Jan 2025 09:36:14 +0000 (12:36 +0300)]
tipc: re-order conditions in tipc_crypto_key_rcv()

On a 32bit system the "keylen + sizeof(struct tipc_aead_key)" math could
have an integer wrapping issue.  It doesn't matter because the "keylen"
is checked on the next line, but just to make life easier for static
analysis tools, let's re-order these conditions and avoid the integer
overflow.

Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 months agonet: phylink: always do a major config when attaching a SFP PHY
Russell King (Oracle) [Fri, 17 Jan 2025 08:44:25 +0000 (08:44 +0000)]
net: phylink: always do a major config when attaching a SFP PHY

Background: https://lore.kernel.org/r/20250107123615.161095-1-ericwouds@gmail.com

Since adding negotiation of in-band capabilities, it is no longer
sufficient to just look at the MLO_AN_xxx mode and PHY interface to
decide whether to do a major configuration, since the result now
depends on the capabilities of the attaching PHY.

Always trigger a major configuration in this case.

Testing log: https://lore.kernel.org/r/f20c9744-3953-40e7-a9c9-5534b25d2e2a@gmail.com

Reported-by: Eric Woudstra <ericwouds@gmail.com>
Tested-by: Eric Woudstra <ericwouds@gmail.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 months agonet: appletalk: Drop aarp_send_probe_phase1()
谢致邦 (XIE Zhibang) [Fri, 17 Jan 2025 01:41:40 +0000 (01:41 +0000)]
net: appletalk: Drop aarp_send_probe_phase1()

aarp_send_probe_phase1() used to work by calling ndo_do_ioctl of
appletalk drivers ltpc or cops, but these two drivers have been removed
since the following commits:
commit 03dcb90dbf62 ("net: appletalk: remove Apple/Farallon LocalTalk PC
support")
commit 00f3696f7555 ("net: appletalk: remove cops support")

Thus aarp_send_probe_phase1() no longer works, so drop it. (found by
code inspection)

Signed-off-by: 谢致邦 (XIE Zhibang) <Yeking@Red54.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 months agodsa: Use str_enable_disable-like helpers
Krzysztof Kozlowski [Wed, 15 Jan 2025 19:47:03 +0000 (20:47 +0100)]
dsa: Use str_enable_disable-like helpers

Replace ternary (condition ? "enable" : "disable") syntax with helpers
from string_choices.h because:
1. Simple function call with one argument is easier to read.  Ternary
   operator has three arguments and with wrapping might lead to quite
   long code.
2. Is slightly shorter thus also easier to read.
3. It brings uniformity in the text - same string.
4. Allows deduping by the linker, which results in a smaller binary
   file.

Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 months agonet: sched: refine software bypass handling in tc_run
Xin Long [Wed, 15 Jan 2025 14:27:54 +0000 (09:27 -0500)]
net: sched: refine software bypass handling in tc_run

This patch addresses issues with filter counting in block (tcf_block),
particularly for software bypass scenarios, by introducing a more
accurate mechanism using useswcnt.

Previously, filtercnt and skipswcnt were introduced by:

  Commit 2081fd3445fe ("net: sched: cls_api: add filter counter") and
  Commit f631ef39d819 ("net: sched: cls_api: add skip_sw counter")

  filtercnt tracked all tp (tcf_proto) objects added to a block, and
  skipswcnt counted tp objects with the skipsw attribute set.

The problem is: a single tp can contain multiple filters, some with skipsw
and others without. The current implementation fails in the case:

  When the first filter in a tp has skipsw, both skipswcnt and filtercnt
  are incremented, then adding a second filter without skipsw to the same
  tp does not modify these counters because tp->counted is already set.

  This results in bypass software behavior based solely on skipswcnt
  equaling filtercnt, even when the block includes filters without
  skipsw. Consequently, filters without skipsw are inadvertently bypassed.

To address this, the patch introduces useswcnt in block to explicitly count
tp objects containing at least one filter without skipsw. Key changes
include:

  Whenever a filter without skipsw is added, its tp is marked with usesw
  and counted in useswcnt. tc_run() now uses useswcnt to determine software
  bypass, eliminating reliance on filtercnt and skipswcnt.

  This refined approach prevents software bypass for blocks containing
  mixed filters, ensuring correct behavior in tc_run().

Additionally, as atomic operations on useswcnt ensure thread safety and
tp->lock guards access to tp->usesw and tp->counted, the broader lock
down_write(&block->cb_lock) is no longer required in tc_new_tfilter(),
and this resolves a performance regression caused by the filter counting
mechanism during parallel filter insertions.

  The improvement can be demonstrated using the following script:

  # cat insert_tc_rules.sh

    tc qdisc add dev ens1f0np0 ingress
    for i in $(seq 16); do
        taskset -c $i tc -b rules_$i.txt &
    done
    wait

  Each of rules_$i.txt files above includes 100000 tc filter rules to a
  mlx5 driver NIC ens1f0np0.

  Without this patch:

  # time sh insert_tc_rules.sh

    real    0m50.780s
    user    0m23.556s
    sys     4m13.032s

  With this patch:

  # time sh insert_tc_rules.sh

    real    0m17.718s
    user    0m7.807s
    sys     3m45.050s

Fixes: 047f340b36fc ("net: sched: make skip_sw actually skip software")
Reported-by: Shuang Li <shuali@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Reviewed-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Tested-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 months agonetfilter: flowtable: add CLOSING state
Pablo Neira Ayuso [Mon, 13 Jan 2025 23:50:38 +0000 (00:50 +0100)]
netfilter: flowtable: add CLOSING state

tcp rst/fin packet triggers an immediate teardown of the flow which
results in sending flows back to the classic forwarding path.

This behaviour was introduced by:

  da5984e51063 ("netfilter: nf_flow_table: add support for sending flows back to the slow path")
  b6f27d322a0a ("netfilter: nf_flow_table: tear down TCP flows if RST or FIN was seen")

whose goal is to expedite removal of flow entries from the hardware
table. Before these patches, the flow was released after the flow entry
timed out.

However, this approach leads to packet races when restoring the
conntrack state as well as late flow re-offload situations when the TCP
connection is ending.

This patch adds a new CLOSING state that is is entered when tcp rst/fin
packet is seen. This allows for an early removal of the flow entry from
the hardware table. But the flow entry still remains in software, so tcp
packets to shut down the flow are not sent back to slow path.

If syn packet is seen from this new CLOSING state, then this flow enters
teardown state, ct state is set to TCP_CONNTRACK_CLOSE state and packet
is sent to slow path, so this TCP reopen scenario can be handled by
conntrack. TCP_CONNTRACK_CLOSE provides a small timeout that aims at
quickly releasing this stale entry from the conntrack table.

Moreover, skip hardware re-offload from flowtable software packet if the
flow is in CLOSING state.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
3 months agonetfilter: flowtable: teardown flow if cached mtu is stale
Pablo Neira Ayuso [Mon, 13 Jan 2025 23:50:37 +0000 (00:50 +0100)]
netfilter: flowtable: teardown flow if cached mtu is stale

Tear down the flow entry in the unlikely case that the interface mtu
changes, this gives the flow a chance to refresh the cached mtu,
otherwise such refresh does not occur until flow entry expires.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
3 months agonetfilter: conntrack: rework offload nf_conn timeout extension logic
Florian Westphal [Mon, 13 Jan 2025 23:50:36 +0000 (00:50 +0100)]
netfilter: conntrack: rework offload nf_conn timeout extension logic

Offload nf_conn entries may not see traffic for a very long time.

To prevent incorrect 'ct is stale' checks during nf_conntrack table
lookup, the gc worker extends the timeout nf_conn entries marked for
offload to a large value.

The existing logic suffers from a few problems.

Garbage collection runs without locks, its unlikely but possible
that @ct is removed right after the 'offload' bit test.

In that case, the timeout of a new/reallocated nf_conn entry will
be increased.

Prevent this by obtaining a reference count on the ct object and
re-check of the confirmed and offload bits.

If those are not set, the ct is being removed, skip the timeout
extension in this case.

Parallel teardown is also problematic:
 cpu1                                cpu2
 gc_worker
                                     calls flow_offload_teardown()
 tests OFFLOAD bit, set
                                     clear OFFLOAD bit
                                     ct->timeout is repaired (e.g. set to timeout[UDP_CT_REPLIED])
 nf_ct_offload_timeout() called
 expire value is fetched
 <INTERRUPT>
-> NF_CT_DAY timeout for flow that isn't offloaded
(and might not see any further packets).

Use cmpxchg: if ct->timeout was repaired after the 2nd 'offload bit' test
passed, then ct->timeout will only be updated of ct->timeout was not
altered in between.

As we already have a gc worker for flowtable entries, ct->timeout repair
can be handled from the flowtable gc worker.

This avoids having flowtable specific logic in the conntrack core
and avoids checking entries that were never offloaded.

This allows to remove the nf_ct_offload_timeout helper.
Its safe to use in the add case, but not on teardown.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
3 months agonetfilter: conntrack: remove skb argument from nf_ct_refresh
Florian Westphal [Mon, 13 Jan 2025 23:50:35 +0000 (00:50 +0100)]
netfilter: conntrack: remove skb argument from nf_ct_refresh

Its not used (and could be NULL), so remove it.
This allows to use nf_ct_refresh in places where we don't have
an skb without having to double-check that skb == NULL would be safe.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
3 months agonetfilter: nft_flow_offload: update tcp state flags under lock
Florian Westphal [Mon, 13 Jan 2025 23:50:34 +0000 (00:50 +0100)]
netfilter: nft_flow_offload: update tcp state flags under lock

The conntrack entry is already public, there is a small chance that another
CPU is handling a packet in reply direction and racing with the tcp state
update.

Move this under ct spinlock.

This is done once, when ct is about to be offloaded, so this should
not result in a noticeable performance hit.

Fixes: 8437a6209f76 ("netfilter: nft_flow_offload: set liberal tracking mode for tcp")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
3 months agonetfilter: nft_flow_offload: clear tcp MAXACK flag before moving to slowpath
Florian Westphal [Mon, 13 Jan 2025 23:50:33 +0000 (00:50 +0100)]
netfilter: nft_flow_offload: clear tcp MAXACK flag before moving to slowpath

This state reset is racy, no locks are held here.

Since commit
8437a6209f76 ("netfilter: nft_flow_offload: set liberal tracking mode for tcp"),
the window checks are disabled for normal data packets, but MAXACK flag
is checked when validating TCP resets.

Clear the flag so tcp reset validation checks are ignored.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
3 months agonetfilter: nf_tables: Simplify chain netdev notifier
Phil Sutter [Thu, 9 Jan 2025 17:31:37 +0000 (18:31 +0100)]
netfilter: nf_tables: Simplify chain netdev notifier

With conditional chain deletion gone, callback code simplifies: Instead
of filling an nft_ctx object, just pass basechain to the per-chain
function. Also plain list_for_each_entry() is safe now.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
3 months agonetfilter: nf_tables: Tolerate chains with no remaining hooks
Phil Sutter [Thu, 9 Jan 2025 17:31:36 +0000 (18:31 +0100)]
netfilter: nf_tables: Tolerate chains with no remaining hooks

Do not drop a netdev-family chain if the last interface it is registered
for vanishes. Users dumping and storing the ruleset upon shutdown to
restore it upon next boot may otherwise lose the chain and all contained
rules. They will still lose the list of devices, a later patch will fix
that. For now, this aligns the event handler's behaviour with that for
flowtables.
The controversal situation at netns exit should be no problem here:
event handler will unregister the hooks, core nftables cleanup code will
drop the chain itself.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
3 months agonetfilter: nf_tables: Compare netdev hooks based on stored name
Phil Sutter [Thu, 9 Jan 2025 17:31:35 +0000 (18:31 +0100)]
netfilter: nf_tables: Compare netdev hooks based on stored name

The 1:1 relationship between nft_hook and nf_hook_ops is about to break,
so choose the stored ifname to uniquely identify hooks.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
3 months agonetfilter: nf_tables: Use stored ifname in netdev hook dumps
Phil Sutter [Thu, 9 Jan 2025 17:31:34 +0000 (18:31 +0100)]
netfilter: nf_tables: Use stored ifname in netdev hook dumps

The stored ifname and ops.dev->name may deviate after creation due to
interface name changes. Prefer the more deterministic stored name in
dumps which also helps avoiding inadvertent changes to stored ruleset
dumps.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
3 months agonetfilter: nf_tables: Store user-defined hook ifname
Phil Sutter [Thu, 9 Jan 2025 17:31:33 +0000 (18:31 +0100)]
netfilter: nf_tables: Store user-defined hook ifname

Prepare for hooks with NULL ops.dev pointer (due to non-existent device)
and store the interface name and length as specified by the user upon
creation. No functional change intended.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
3 months agonetfilter: nf_tables: Flowtable hook's pf value never varies
Phil Sutter [Thu, 9 Jan 2025 17:31:32 +0000 (18:31 +0100)]
netfilter: nf_tables: Flowtable hook's pf value never varies

When checking for duplicate hooks in nft_register_flowtable_net_hooks(),
comparing ops.pf value is pointless as it is always NFPROTO_NETDEV with
flowtable hooks.

Dropping the check leaves the search identical to the one in
nft_hook_list_find() so call that function instead of open coding.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
3 months agonetfilter: br_netfilter: remove unused conditional and dead code
Antoine Tenart [Thu, 9 Jan 2025 09:37:09 +0000 (10:37 +0100)]
netfilter: br_netfilter: remove unused conditional and dead code

The SKB_DROP_REASON_IP_INADDRERRORS drop reason is never returned from
any function, as such it cannot be returned from the ip_route_input call
tree. The 'reason != SKB_DROP_REASON_IP_INADDRERRORS' conditional is
thus always true.

Looking back at history, commit 50038bf38e65 ("net: ip: make
ip_route_input() return drop reasons") changed the ip_route_input
returned value check in br_nf_pre_routing_finish from -EHOSTUNREACH to
SKB_DROP_REASON_IP_INADDRERRORS. It turns out -EHOSTUNREACH could not be
returned either from the ip_route_input call tree and this since commit
251da4130115 ("ipv4: Cache ip_error() routes even when not
forwarding.").

Not a fix as this won't change the behavior. While at it use
kfree_skb_reason.

Signed-off-by: Antoine Tenart <atenart@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
3 months agonetfilter: nf_tables: fix set size with rbtree backend
Pablo Neira Ayuso [Mon, 6 Jan 2025 22:40:50 +0000 (23:40 +0100)]
netfilter: nf_tables: fix set size with rbtree backend

The existing rbtree implementation uses singleton elements to represent
ranges, however, userspace provides a set size according to the number
of ranges in the set.

Adjust provided userspace set size to the number of singleton elements
in the kernel by multiplying the range by two.

Check if the no-match all-zero element is already in the set, in such
case release one slot in the set size.

Fixes: 0ed6389c483d ("netfilter: nf_tables: rename set implementations")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
3 months agoMerge tag 'batadv-next-pullrequest-20250117' of git://git.open-mesh.org/linux-merge
Jakub Kicinski [Sun, 19 Jan 2025 01:57:31 +0000 (17:57 -0800)]
Merge tag 'batadv-next-pullrequest-20250117' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
This cleanup patchset includes the following patches:

 - bump version strings, by Simon Wunderlich

 - Reorder includes for distributed-arp-table.c, by Sven Eckelmann

 - Fix translation table change handling, by Remi Pommarel (2 patches)

 - Map VID 0 to untagged TT VLAN, by Sven Eckelmann

 - Update MAINTAINERS/mailmap e-mail addresses, by the respective authors
   (4 patches)

 - netlink: reduce duplicate code by returning interfaces,
   by Linus Lüssing

* tag 'batadv-next-pullrequest-20250117' of git://git.open-mesh.org/linux-merge:
  batman-adv: netlink: reduce duplicate code by returning interfaces
  MAINTAINERS: mailmap: add entries for Antonio Quartulli
  mailmap: add entries for Sven Eckelmann
  mailmap: add entries for Simon Wunderlich
  MAINTAINERS: update email address of Marek Linder
  batman-adv: Map VID 0 to untagged TT VLAN
  batman-adv: Don't keep redundant TT change events
  batman-adv: Remove atomic usage for tt.local_changes
  batman-adv: Reorder includes for distributed-arp-table.c
  batman-adv: Start new development cycle
====================

Link: https://patch.msgid.link/20250117123910.219278-1-sw@simonwunderlich.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge tag 'for-net-next-2025-01-15' of git://git.kernel.org/pub/scm/linux/kernel...
Jakub Kicinski [Sun, 19 Jan 2025 01:51:57 +0000 (17:51 -0800)]
Merge tag 'for-net-next-2025-01-15' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next

Luiz Augusto von Dentz says:

====================
bluetooth-next pull request for net-next:

 - btusb: Add new VID/PID 13d3/3610 for MT7922
 - btusb: Add new VID/PID 13d3/3628 for MT7925
 - btusb: Add MT7921e device 13d3:3576
 - btusb: Add RTL8851BE device 13d3:3600
 - btusb: Add ID 0x2c7c:0x0130 for Qualcomm WCN785x
 - btusb: add sysfs attribute to control USB alt setting
 - qca: Expand firmware-name property
 - qca: Fix poor RF performance for WCN6855
 - L2CAP: handle NULL sock pointer in l2cap_sock_alloc
 - Allow reset via sysfs
 - ISO: Allow BIG re-sync
 - dt-bindings: Utilize PMU abstraction for WCN6750
 - MGMT: Mark LL Privacy as stable

* tag 'for-net-next-2025-01-15' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next: (23 commits)
  Bluetooth: MGMT: Fix slab-use-after-free Read in mgmt_remove_adv_monitor_sync
  Bluetooth: qca: Fix poor RF performance for WCN6855
  Bluetooth: Allow reset via sysfs
  Bluetooth: Get rid of cmd_timeout and use the reset callback
  Bluetooth: Remove the cmd timeout count in btusb
  Bluetooth: Use str_enable_disable-like helpers
  Bluetooth: btmtk: Remove resetting mt7921 before downloading the fw
  Bluetooth: L2CAP: handle NULL sock pointer in l2cap_sock_alloc
  Bluetooth: btusb: Add RTL8851BE device 13d3:3600
  dt-bindings: bluetooth: Utilize PMU abstraction for WCN6750
  Bluetooth: btusb: Add MT7921e device 13d3:3576
  Bluetooth: btrtl: check for NULL in btrtl_setup_realtek()
  Bluetooth: btbcm: Fix NULL deref in btbcm_get_board_name()
  Bluetooth: qca: Expand firmware-name to load specific rampatch
  Bluetooth: qca: Update firmware-name to support board specific nvm
  dt-bindings: net: bluetooth: qca: Expand firmware-name property
  Bluetooth: btusb: Add new VID/PID 13d3/3628 for MT7925
  Bluetooth: btusb: Add new VID/PID 13d3/3610 for MT7922
  Bluetooth: btusb: add sysfs attribute to control USB alt setting
  Bluetooth: btusb: Add ID 0x2c7c:0x0130 for Qualcomm WCN785x
  ...
====================

Link: https://patch.msgid.link/20250117213203.3921910-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge tag 'wireless-next-2025-01-17' of git://git.kernel.org/pub/scm/linux/kernel...
Jakub Kicinski [Sun, 19 Jan 2025 01:46:53 +0000 (17:46 -0800)]
Merge tag 'wireless-next-2025-01-17' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next

Kalle Valo says:

====================
wireless-next patches for v6.14

Most likely the last "new features" pull request for v6.14 and this is
a bigger one. Multi-Link Operation (MLO) work continues both in stack
in drivers. Few new devices supported and usual fixes all over.

Major changes:

cfg80211
 * Emergency Preparedness Communication Services (EPCS) station mode support

mac80211
 * an option to filter a sta from being flushed
 * some support for RX Operating Mode Indication (OMI) power saving
 * support for adding and removing station links for MLO

iwlwifi
 * new device ids
 * rework firmware error handling and restart

rtw88
 * RTL8812A: RFE type 2 support
 * LED support

rtw89
 * variant info to support RTL8922AE-VS

mt76
 * mt7996: single wiphy multiband support (preparation for MLO)
 * mt7996: support for more variants
 * mt792x: P2P_DEVICE support
 * mt7921u: TP-Link TXE50UH support

ath12k
 * enable MLO for QCN9274 (although it seems to be broken with dual
   band devices)
 * MLO radar detection support
 * debugfs: transmit buffer OFDMA, AST entry and puncture stats

* tag 'wireless-next-2025-01-17' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (322 commits)
  wifi: brcmfmac: fix NULL pointer dereference in brcmf_txfinalize()
  wifi: rtw88: add RTW88_LEDS depends on LEDS_CLASS to Kconfig
  wifi: wilc1000: unregister wiphy only after netdev registration
  wifi: cfg80211: adjust allocation of colocated AP data
  wifi: mac80211: fix memory leak in ieee80211_mgd_assoc_ml_reconf()
  wifi: ath12k: fix key cache handling
  wifi: ath12k: Fix uninitialized variable access in ath12k_mac_allocate() function
  wifi: ath12k: Remove ath12k_get_num_hw() helper function
  wifi: ath12k: Refactor the ath12k_hw get helper function argument
  wifi: ath12k: Refactor ath12k_hw set helper function argument
  wifi: mt76: mt7996: add implicit beamforming support for mt7992
  wifi: mt76: mt7996: fix beacon command during disabling
  wifi: mt76: mt7996: fix ldpc setting
  wifi: mt76: mt7996: fix definition of tx descriptor
  wifi: mt76: connac: adjust phy capabilities based on band constraints
  wifi: mt76: mt7996: fix incorrect indexing of MIB FW event
  wifi: mt76: mt7996: fix HE Phy capability
  wifi: mt76: mt7996: fix the capability of reception of EHT MU PPDU
  wifi: mt76: mt7996: add max mpdu len capability
  wifi: mt76: mt7921: avoid undesired changes of the preset regulatory domain
  ...
====================

Link: https://patch.msgid.link/20250117203529.72D45C4CEDD@smtp.kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: introduce netdev_napi_exit()
Eric Dumazet [Fri, 17 Jan 2025 23:21:13 +0000 (23:21 +0000)]
net: introduce netdev_napi_exit()

After 1b23cdbd2bbc ("net: protect netdev->napi_list with netdev_lock()")
it makes sense to iterate through dev->napi_list while holding
the device lock.

Also call synchronize_net() at most one time.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250117232113.1612899-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: phy: remove leftovers from switch to linkmode bitmaps
Heiner Kallweit [Thu, 16 Jan 2025 21:38:40 +0000 (22:38 +0100)]
net: phy: remove leftovers from switch to linkmode bitmaps

We have some leftovers from the switch to linkmode bitmaps which
- have never been used
- are not used any longer
- have no user outside phy_device.c
So remove them.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://patch.msgid.link/5493b96e-88bb-4230-a911-322659ec5167@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: destroy dev->lock later in free_netdev()
Eric Dumazet [Fri, 17 Jan 2025 22:46:26 +0000 (22:46 +0000)]
net: destroy dev->lock later in free_netdev()

syzbot complained that free_netdev() was calling netif_napi_del()
after dev->lock mutex has been destroyed.

This fires a warning for CONFIG_DEBUG_MUTEXES=y builds.

Move mutex_destroy(&dev->lock) near the end of free_netdev().

[1]
DEBUG_LOCKS_WARN_ON(lock->magic != lock)
 WARNING: CPU: 0 PID: 5971 at kernel/locking/mutex.c:564 __mutex_lock_common kernel/locking/mutex.c:564 [inline]
 WARNING: CPU: 0 PID: 5971 at kernel/locking/mutex.c:564 __mutex_lock+0xdac/0xee0 kernel/locking/mutex.c:735
Modules linked in:
CPU: 0 UID: 0 PID: 5971 Comm: syz-executor Not tainted 6.13.0-rc7-syzkaller-01131-g8d20dcda404d #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 12/27/2024
 RIP: 0010:__mutex_lock_common kernel/locking/mutex.c:564 [inline]
 RIP: 0010:__mutex_lock+0xdac/0xee0 kernel/locking/mutex.c:735
Code: 0f b6 04 38 84 c0 0f 85 1a 01 00 00 83 3d 6f 40 4c 04 00 75 19 90 48 c7 c7 60 84 0a 8c 48 c7 c6 00 85 0a 8c e8 f5 dc 91 f5 90 <0f> 0b 90 90 90 e9 c7 f3 ff ff 90 0f 0b 90 e9 29 f8 ff ff 90 0f 0b
RSP: 0018:ffffc90003317580 EFLAGS: 00010246
RAX: ee0f97edaf7b7d00 RBX: ffff8880299f8cb0 RCX: ffff8880323c9e00
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffffc90003317710 R08: ffffffff81602ac2 R09: 1ffff110170c519a
R10: dffffc0000000000 R11: ffffed10170c519b R12: 0000000000000000
R13: 0000000000000000 R14: 1ffff92000662ec4 R15: dffffc0000000000
FS:  000055557a046500(0000) GS:ffff8880b8600000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fd581d46ff8 CR3: 000000006f870000 CR4: 00000000003526f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
  netdev_lock include/linux/netdevice.h:2691 [inline]
  __netif_napi_del include/linux/netdevice.h:2829 [inline]
  netif_napi_del include/linux/netdevice.h:2848 [inline]
  free_netdev+0x2d9/0x610 net/core/dev.c:11621
  netdev_run_todo+0xf21/0x10d0 net/core/dev.c:11189
  nsim_destroy+0x3c3/0x620 drivers/net/netdevsim/netdev.c:1028
  __nsim_dev_port_del+0x14b/0x1b0 drivers/net/netdevsim/dev.c:1428
  nsim_dev_port_del_all drivers/net/netdevsim/dev.c:1440 [inline]
  nsim_dev_reload_destroy+0x28a/0x490 drivers/net/netdevsim/dev.c:1661
  nsim_drv_remove+0x58/0x160 drivers/net/netdevsim/dev.c:1676
  device_remove drivers/base/dd.c:567 [inline]

Fixes: 1b23cdbd2bbc ("net: protect netdev->napi_list with netdev_lock()")
Reported-by: syzbot+85ff1051228a04613a32@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/678add43.050a0220.303755.0016.GAE@google.com/T/#u
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250117224626.1427577-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoeth: bnxt: fix string truncation warning in FW version
Jakub Kicinski [Fri, 17 Jan 2025 18:37:26 +0000 (10:37 -0800)]
eth: bnxt: fix string truncation warning in FW version

W=1 builds with gcc 14.2.1 report:

drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c:4193:32: error: ‘%s’ directive output may be truncated writing up to 31 bytes into a region of size 27 [-Werror=format-truncation=]
 4193 |                          "/pkg %s", buf);

It's upset that we let buf be full length but then we use 5
characters for "/pkg ".

The builds is also clear with clang version 19.1.5 now.

Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20250117183726.1481524-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agomptcp: sysctl: add syn_retrans_before_tcp_fallback
Matthieu Baerts (NGI0) [Fri, 17 Jan 2025 17:40:31 +0000 (18:40 +0100)]
mptcp: sysctl: add syn_retrans_before_tcp_fallback

The number of SYN + MPC retransmissions before falling back to TCP was
fixed to 2. This is certainly a good default value, but having a fixed
number can be a problem in some environments.

The current behaviour means that if all packets are dropped, there will
be:

- The initial SYN + MPC

- 2 retransmissions with MPC

- The next ones will be without MPTCP.

So typically ~3 seconds before falling back to TCP. In some networks
where some temporally blackholes are unfortunately frequent, or when a
client tries to initiate connections while the network is not ready yet,
this can cause new connections not to have MPTCP connections.

In such environments, it is now possible to increase the number of SYN
retransmissions with MPTCP options to make sure MPTCP is used.

Interesting values are:

- 0: the first retransmission will be done without MPTCP options: quite
     aggressive, but also a higher risk of detecting false-positive
     MPTCP blackholes.

- >= 128: all SYN retransmissions will keep the MPTCP options: back to
          the < 6.12 behaviour.

The default behaviour is not changed here.

Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250117-net-next-mptcp-syn_retrans_before_tcp_fallback-v1-1-ab4b187099b0@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge branch 'net-xilinx-axienet-enable-adaptive-irq-coalescing-with-dim'
Jakub Kicinski [Sun, 19 Jan 2025 01:19:04 +0000 (17:19 -0800)]
Merge branch 'net-xilinx-axienet-enable-adaptive-irq-coalescing-with-dim'

Sean Anderson says:

====================
net: xilinx: axienet: Report an error for bad coalesce settings
====================

Link: https://patch.msgid.link/20250116232954.2696930-1-sean.anderson@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: xilinx: axienet: Report an error for bad coalesce settings
Sean Anderson [Thu, 16 Jan 2025 23:29:50 +0000 (18:29 -0500)]
net: xilinx: axienet: Report an error for bad coalesce settings

Instead of silently ignoring invalid/unsupported settings, report an
error. Additionally, relax the check for non-zero usecs to apply only
when it will be used (i.e. when frames != 1).

Signed-off-by: Sean Anderson <sean.anderson@linux.dev>
Reviewed by: Shannon Nelson <shannon.nelson@amd.com>
Link: https://patch.msgid.link/20250116232954.2696930-3-sean.anderson@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: xilinx: axienet: Add some symbolic constants for IRQ delay timer
Sean Anderson [Thu, 16 Jan 2025 23:29:49 +0000 (18:29 -0500)]
net: xilinx: axienet: Add some symbolic constants for IRQ delay timer

Instead of using literals, add some symbolic constants for the IRQ delay
timer calculation.

Signed-off-by: Sean Anderson <sean.anderson@linux.dev>
Reviewed by: Shannon Nelson <shannon.nelson@amd.com>
Link: https://patch.msgid.link/20250116232954.2696930-2-sean.anderson@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonfc: mrvl: Don't use "proxy" headers
Andy Shevchenko [Thu, 16 Jan 2025 15:30:37 +0000 (17:30 +0200)]
nfc: mrvl: Don't use "proxy" headers

Update header inclusions to follow IWYU (Include What You Use)
principle.

In this case replace of_gpio.h, which is subject to remove by the GPIOLIB
subsystem, with the respective headers that are being used by the driver.

Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://patch.msgid.link/20250116153119.148097-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonfc: st21nfca: Remove unused of_gpio.h
Andy Shevchenko [Thu, 16 Jan 2025 15:27:28 +0000 (17:27 +0200)]
nfc: st21nfca: Remove unused of_gpio.h

of_gpio.h is deprecated and subject to remove. The drivers in question
don't use it, simply remove the unused header.

Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge branch 'ethtool-get_ts_stats-for-dsa-and-ocelot-driver'
Jakub Kicinski [Sat, 18 Jan 2025 04:01:11 +0000 (20:01 -0800)]
Merge branch 'ethtool-get_ts_stats-for-dsa-and-ocelot-driver'

Vladimir Oltean says:

====================
ethtool get_ts_stats() for DSA and ocelot driver

After a recent patch set with fixes and general restructuring, Jakub asked
for the Felix DSA driver to start reporting standardized statistics
for hardware timestamping:
https://lore.kernel.org/netdev/20241207180640.12da60ed@kernel.org/

Testing follows the same procedure as in the aforementioned series, with PTP
packet loss induced through taprio:

$ ethtool -I --show-time-stamping swp3
Time stamping parameters for swp3:
Capabilities:
        hardware-transmit
        software-transmit
        hardware-receive
        software-receive
        software-system-clock
        hardware-raw-clock
PTP Hardware Clock: 1
Hardware Transmit Timestamp Modes:
        off
        on
        onestep-sync
Hardware Receive Filter Modes:
        none
        ptpv2-l4-event
        ptpv2-l2-event
        ptpv2-event
Statistics:
  tx_pkts: 14591
  tx_lost: 85
  tx_err: 0

v1: https://lore.kernel.org/20241213140852.1254063-1-vladimir.oltean@nxp.com/
====================

Link: https://patch.msgid.link/20250116104628.123555-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: dsa: felix: report timestamping stats from the ocelot library
Vladimir Oltean [Thu, 16 Jan 2025 10:46:28 +0000 (12:46 +0200)]
net: dsa: felix: report timestamping stats from the ocelot library

Make the linkage between the DSA user port ethtool_ops :: get_ts_info
and the implementation from the Ocelot switch library.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20250116104628.123555-5-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: mscc: ocelot: add TX timestamping statistics
Vladimir Oltean [Thu, 16 Jan 2025 10:46:27 +0000 (12:46 +0200)]
net: mscc: ocelot: add TX timestamping statistics

Add an u64 hardware timestamping statistics structure for each ocelot
port. Export a function from the common switch library for reporting
them to ethtool. This is called by the ocelot switchdev front-end for
now.

Note that for the switchdev driver, we report the one-step PTP packets
as unconfirmed, even though in principle, for some transmission
mechanisms like FDMA, we may be able to confirm transmission and bump
the "pkts" counter in ocelot_fdma_tx_cleanup() instead. I don't have
access to hardware which uses the switchdev front-end, and I've kept the
implementation simple.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20250116104628.123555-4-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: dsa: implement get_ts_stats ethtool operation for user ports
Vladimir Oltean [Thu, 16 Jan 2025 10:46:26 +0000 (12:46 +0200)]
net: dsa: implement get_ts_stats ethtool operation for user ports

Integrate with the standard infrastructure for reporting hardware packet
timestamping statistics.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20250116104628.123555-3-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: ethtool: ts: add separate counter for unconfirmed one-step TX timestamps
Vladimir Oltean [Thu, 16 Jan 2025 10:46:25 +0000 (12:46 +0200)]
net: ethtool: ts: add separate counter for unconfirmed one-step TX timestamps

For packets with two-step timestamp requests, the hardware timestamp
comes back to the driver through a confirmation mechanism of sorts,
which allows the driver to confidently bump the successful "pkts"
counter.

For one-step PTP, the NIC is supposed to autonomously insert its
hardware TX timestamp in the packet headers while simultaneously
transmitting it. There may be a confirmation that this was done
successfully, or there may not.

None of the current drivers which implement ethtool_ops :: get_ts_stats()
also support HWTSTAMP_TX_ONESTEP_SYNC or HWTSTAMP_TX_ONESTEP_SYNC, so it
is a bit unclear which model to follow. But there are NICs, such as DSA,
where there is no transmit confirmation at all. Here, it would be wrong /
misleading to increment the successful "pkts" counter, because one-step
PTP packets can be dropped on TX just like any other packets.

So introduce a special counter which signifies "yes, an attempt was made,
but we don't know whether it also exited the port or not". I expect that
for one-step PTP packets where a confirmation is available, the "pkts"
counter would be bumped.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20250116104628.123555-2-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge branch 'mlxsw-move-tx-header-handling-to-pci-driver'
Jakub Kicinski [Sat, 18 Jan 2025 03:58:01 +0000 (19:58 -0800)]
Merge branch 'mlxsw-move-tx-header-handling-to-pci-driver'

Petr Machata says:

====================
mlxsw: Move Tx header handling to PCI driver

Amit Cohen writes:

Tx header should be added to all packets transmitted from the CPU to
Spectrum ASICs. Historically, handling this header was added as a driver
function, as Tx header is different between Spectrum and Switch-X.

From May 2021, there is no support for SwitchX-2 ASIC, and all the relevant
code was removed.

For now, there is no justification to handle Tx header as part of
spectrum.c, we can handle this as part of PCI, in skb_transmit().

This change will also be useful when XDP support will be added to mlxsw,
as for XDP_TX and XDP_REDIRECT actions, Tx header should be added before
transmitting the packet.

Patch set overview:
Patches #1-#2 add structure to store Tx header info and initialize it
Patch #3 moves definitions of Tx header fields to txheader.h
Patch #4 moves Tx header handling to PCI driver
Patch #5 removes unnecessary attribute
====================

Link: https://patch.msgid.link/cover.1737044384.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agomlxsw: Do not store Tx header length as driver parameter
Amit Cohen [Thu, 16 Jan 2025 16:38:18 +0000 (17:38 +0100)]
mlxsw: Do not store Tx header length as driver parameter

Tx header handling was moved to PCI code, as there is no several drivers
which configure Tx header differently. Tx header length is stored as driver
parameter, this is not really necessary as it always stores the same value.
Remove this field and use the macro MLXSW_TXHDR_LEN explicitly.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Link: https://patch.msgid.link/1fb7b3f007de4d311e559c8a954b673d0895d5e9.1737044384.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agomlxsw: Move Tx header handling to PCI driver
Amit Cohen [Thu, 16 Jan 2025 16:38:17 +0000 (17:38 +0100)]
mlxsw: Move Tx header handling to PCI driver

Tx header should be added to all packets transmitted from the CPU to
Spectrum ASICs. Historically, handling this header was added as a driver
function, as Tx header is different between Spectrum and Switch-X. See
SwitchX implementation in commit 31557f0f9755 ("mlxsw: Introduce
Mellanox SwitchX-2 ASIC support"). From May 2021, there is no support
for SwitchX-2 ASIC, and all the relevant code was removed.

For now, there is no justification to handle Tx header as part of
spectrum.c, we can handle this as part of PCI, in skb_transmit().

A future patch set will add support for XDP in mlxsw driver, to support
XDP_TX and XDP_REDIRECT actions, Tx header should be added before
transmitting the packet. As preparation for this, move Tx header handling
to PCI driver, so then XDP code will not have to call API from spectrum.c.
This also improves the code as now Tx header is pushed just before
transmitting, so it is not done from many flows which might miss something.

Note that for PTP, we should configure Tx header differently, use the
fields from mlxsw_txhdr_info to configure the packets correctly in PCI
driver. Handle VLAN tagging in switch driver, verify that packet which
should be transmitted as data is tagged, otherwise, tag it.

Remove the calls for thxdr_construct() functions, as now this is done as
part of skb_transmit().

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Link: https://patch.msgid.link/293a81e6f7d59a8ec9f9592edb7745536649ff11.1737044384.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agomlxsw: Define Tx header fields in txheader.h
Amit Cohen [Thu, 16 Jan 2025 16:38:16 +0000 (17:38 +0100)]
mlxsw: Define Tx header fields in txheader.h

The next patch will move Tx header constructing to pci.c. As preparation,
move the definitions of Tx header fields from spectrum.c to txheader.h,
so pci.c will include this header and can access the fields.

Remove 'etclass' which is not used.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Link: https://patch.msgid.link/2250b5cb3998ab4850fc8251c3a0f5926d32e194.1737044384.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agomlxsw: Initialize txhdr_info according to PTP operations
Amit Cohen [Thu, 16 Jan 2025 16:38:15 +0000 (17:38 +0100)]
mlxsw: Initialize txhdr_info according to PTP operations

A next patch will construct Tx header as part of pci.c. The switch driver
(mlxsw_spectrum.ko) should encapsulate all the differences between the
different ASICs and the bus driver (mlxsw_pci.ko) should remain unaware.

As preparation, add the relevant info as part of mlxsw_txhdr_info
structure, so later bus driver will merely construct the Tx header based on
information passed from the switch driver.

Most of the packets are transmitted as control packets, but PTP packets in
Spectrum-2 and Spectrum-3 should be handled differently. The driver
transmits them as data packets, and the default VLAN tag (4095) is added if
the packet is not already tagged.

Extend PTP operations to store a boolean which indicates whether packets
should be transmitted as data packets. Set it for Spectrum-2 and
Spectrum-3 only. Extend mlxsw_txhdr_info to store fields which will be used
later to construct Tx header. Initialize such fields according to the new
boolean which is stored in PTP operations.

Note that for now, mlxsw_txhdr_info structure is initialized, but not used,
a next patch will use it.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Link: https://patch.msgid.link/efcaacd4bedef524e840a0c29f96cebf2c4bc0e0.1737044384.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agomlxsw: Add mlxsw_txhdr_info structure
Amit Cohen [Thu, 16 Jan 2025 16:38:14 +0000 (17:38 +0100)]
mlxsw: Add mlxsw_txhdr_info structure

mlxsw_tx_info structure is used to store information that is needed to
process Tx completions when Tx time stamps are requested. A next patch
will move Tx header handling from spectrum.c to pci.c. For that, some
additional fields which are related to Tx should be passed to pci driver.

As preparation, create an extended structure, called mlxsw_txhdr_info,
and store mlxsw_tx_info inside. The new fields should not be added to
mlxsw_tx_info structure as it is stored in the SKB control block which is
of limited size.

The next patch will extend the new structure with some fields which are
needed in order to construct Tx header.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Link: https://patch.msgid.link/93aed1961f046f79f46869bab37a3faa5027751d.1737044384.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet/mlx5: fix unintentional sign extension on shift of dest_attr->vport.vhca_id
Colin Ian King [Thu, 16 Jan 2025 18:17:00 +0000 (18:17 +0000)]
net/mlx5: fix unintentional sign extension on shift of dest_attr->vport.vhca_id

Shifting dest_attr->vport.vhca_id << 16 results in a promotion from an
unsigned 16 bit integer to a 32 bit signed integer, this is then sign
extended to a 64 bit unsigned long on 64 bitarchitectures. If vhca_id is
greater than 0x7fff then this leads to a sign extended result where all
the upper 32 bits of idx are set to 1. Fix this by casting vhca_id
to the same type as idx before performing the shift.

Fixes: 8e2e08a6d1e0 ("net/mlx5: fs, add support for dest vport HWS action")
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Acked-by: Moshe Shemesh <moshe@nvidia.com>
Link: https://patch.msgid.link/20250116181700.96437-1-colin.i.king@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: tc: improve qdisc error messages
John Ousterhout [Thu, 16 Jan 2025 19:56:41 +0000 (11:56 -0800)]
net: tc: improve qdisc error messages

The existing error message ("Invalid qdisc name") is confusing
because it suggests that there is no qdisc with the given name. In
fact, the name does refer to a valid qdisc, but it doesn't match
the kind of an existing qdisc being modified or replaced. The
new error message provides more detail to eliminate confusion.

Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20250116195642.2794-1-ouster@cs.stanford.edu
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next...
Jakub Kicinski [Sat, 18 Jan 2025 03:45:34 +0000 (19:45 -0800)]
Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue

Tony Nguyen says:

====================
ice: support FW Recovery Mode

Konrad Knitter says:

Enable update of card in FW Recovery Mode

* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
  ice: support FW Recovery Mode
  devlink: add devl guard
  pldmfw: enable selected component update
====================

Link: https://patch.msgid.link/20250116212059.1254349-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agohv_netvsc: Replace one-element array with flexible array member
Thorsten Blum [Thu, 16 Jan 2025 21:19:32 +0000 (22:19 +0100)]
hv_netvsc: Replace one-element array with flexible array member

Replace the deprecated one-element array with a modern flexible array
member in the struct nvsp_1_message_send_receive_buffer_complete.

Use struct_size_t(,,1) instead of sizeof() to maintain the same size.

Compile-tested only.

Link: https://github.com/KSPP/linux/issues/79
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Tested-by: Roman Kisel <romank@linux.microsoft.com>
Reviewed-by: Roman Kisel <romank@linux.microsoft.com>
Link: https://patch.msgid.link/20250116211932.139564-2-thorsten.blum@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agogtp: Prepare ip4_route_output_gtp() to .flowi4_tos conversion.
Guillaume Nault [Thu, 16 Jan 2025 13:39:38 +0000 (14:39 +0100)]
gtp: Prepare ip4_route_output_gtp() to .flowi4_tos conversion.

Use inet_sk_dscp() to get the socket DSCP value as dscp_t, instead of
ip_sock_rt_tos() which returns a __u8. This will ease the conversion
of fl4->flowi4_tos to dscp_t, which now just becomes a matter of
dropping the inet_dscp_to_dsfield() call.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/06bdb310a075355ff059cd32da2efc29a63981c9.1737034675.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agodccp: Prepare dccp_v4_route_skb() to .flowi4_tos conversion.
Guillaume Nault [Thu, 16 Jan 2025 13:10:16 +0000 (14:10 +0100)]
dccp: Prepare dccp_v4_route_skb() to .flowi4_tos conversion.

Use inet_sk_dscp() to get the socket DSCP value as dscp_t, instead of
ip_sock_rt_tos() which returns a __u8. This will ease the conversion
of fl4->flowi4_tos to dscp_t, which now just becomes a matter of
dropping the inet_dscp_to_dsfield() call.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/208dc5ca28bb5595d7a545de026bba18b1d63bda.1737032802.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoselftests: net: give up on the cmsg_time accuracy on slow machines
Jakub Kicinski [Thu, 16 Jan 2025 02:01:05 +0000 (18:01 -0800)]
selftests: net: give up on the cmsg_time accuracy on slow machines

Commit b9d5f5711dd8 ("selftests: net: increase the delay for relative
cmsg_time.sh test") widened the accepted value range 8x but we still
see flakes (at a rate of around 7%).

Return XFAIL for the most timing sensitive test on slow machines.

Before:

  # ./cmsg_time.sh
    Case UDPv4  - TXTIME rel returned '8074us - 7397us < 4000', expected 'OK'
  FAIL - 1/36 cases failed

After:

  # ./cmsg_time.sh
    Case UDPv4  - TXTIME rel returned '1123us - 941us < 500', expected 'OK' (XFAIL)
    Case UDPv6  - TXTIME rel returned '1227us - 776us < 500', expected 'OK' (XFAIL)
  OK

Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250116020105.931338-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agobatman-adv: netlink: reduce duplicate code by returning interfaces
Linus Lüssing [Mon, 13 Jan 2025 19:31:39 +0000 (20:31 +0100)]
batman-adv: netlink: reduce duplicate code by returning interfaces

Reduce duplicate code by using netlink helpers which return the
soft/hard interface directly. Instead of returning an interface index
which we are typically not interested in.

Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
3 months agoMerge branch 'add-perout-library-for-rds-ptp-supported-phys'
Jakub Kicinski [Fri, 17 Jan 2025 01:27:59 +0000 (17:27 -0800)]
Merge branch 'add-perout-library-for-rds-ptp-supported-phys'

Divya Koppera says:

====================
Add PEROUT library for RDS PTP supported phys

Adds support for PEROUT library, where phy can generate
periodic output signal on supported pin out.
====================

Link: https://patch.msgid.link/20250115090634.12941-1-divya.koppera@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: phy: microchip_rds_ptp : Add PEROUT feature library for RDS PTP supported Microc...
Divya Koppera [Wed, 15 Jan 2025 09:06:34 +0000 (14:36 +0530)]
net: phy: microchip_rds_ptp : Add PEROUT feature library for RDS PTP supported Microchip phys

Adds PEROUT feature for RDS PTP supported phys where
we can generate periodic output signal on supported
pin out

Signed-off-by: Divya Koppera <divya.koppera@microchip.com>
Link: https://patch.msgid.link/20250115090634.12941-4-divya.koppera@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: phy: microchip_t1: Enable pin out specific to lan887x phy for PEROUT signal
Divya Koppera [Wed, 15 Jan 2025 09:06:33 +0000 (14:36 +0530)]
net: phy: microchip_t1: Enable pin out specific to lan887x phy for PEROUT signal

Adds support for enabling pin out that is required
to generate periodic output signal on lan887x phy.

Signed-off-by: Divya Koppera <divya.koppera@microchip.com>
Link: https://patch.msgid.link/20250115090634.12941-3-divya.koppera@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: phy: microchip_rds_ptp: Header file library changes for PEROUT
Divya Koppera [Wed, 15 Jan 2025 09:06:32 +0000 (14:36 +0530)]
net: phy: microchip_rds_ptp: Header file library changes for PEROUT

This ptp header file library changes will cover PEROUT
macros that are required to generate periodic output
from pin out

Signed-off-by: Divya Koppera <divya.koppera@microchip.com>
Link: https://patch.msgid.link/20250115090634.12941-2-divya.koppera@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoselftests/net: packetdrill: make tcp buf limited timing tests benign
Jakub Kicinski [Wed, 15 Jan 2025 23:21:29 +0000 (15:21 -0800)]
selftests/net: packetdrill: make tcp buf limited timing tests benign

The following tests are failing on debug kernels:

  tcp_tcp_info_tcp-info-rwnd-limited.pkt
  tcp_tcp_info_tcp-info-sndbuf-limited.pkt

with reports like:

      assert 19000 <= tcpi_sndbuf_limited <= 21000, tcpi_sndbuf_limited; \
  AssertionError: 18000

and:

      assert 348000 <= tcpi_busy_time <= 360000, tcpi_busy_time
  AssertionError: 362000

Extend commit 912d6f669725 ("selftests/net: packetdrill: report benign
debug flakes as xfail") to cover them.

Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250115232129.845884-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge branch 'net-add-phylink-managed-eee-support'
Jakub Kicinski [Fri, 17 Jan 2025 01:23:02 +0000 (17:23 -0800)]
Merge branch 'net-add-phylink-managed-eee-support'

Russell King says:

====================
net: add phylink managed EEE support

Adding managed EEE support to phylink has been on the cards ever since
the idea in phylib was mooted. This overly large series attempts to do
so. I've included all the patches as it's important to get the driver
patches out there.

Patch 1 adds a definition for the clock stop capable bit in the PCS
MMD status register.

Patch 2 adds a phylib API to query whether the PHY allows the transmit
xMII clock to be stopped while in LPI mode. This capability is for MAC
drivers to save power when LPI is active, to allow them to stop their
transmit clock.

Patch 3 extracts a phylink internal helper for determining whether the
link is up.

Patch 4 adds basic phylink managed EEE support. Two new MAC APIs are
added, to enable and disable LPI. The enable method is passed the LPI
timer setting which it is expected to program into the hardware, and
also a flag ehther the transmit clock should be stopped.

I have taken the decision to make enable_tx_lpi() to return an error
code, but not do much with it other than report it - the intention
being that we can later use it to extend functionality if needed
without reworking loads of drivers.

I have also dropped the validation/limitation of the LPI timer, and
left that in the driver code prior to calling phylink_ethtool_set_eee().

The remainder of the patches convert mvneta, lan743x and stmmac, and
add support for mvneta.
====================

Link: https://patch.msgid.link/Z4gdtOaGsBhQCZXn@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: stmmac: convert to phylink managed EEE support
Russell King (Oracle) [Wed, 15 Jan 2025 20:43:08 +0000 (20:43 +0000)]
net: stmmac: convert to phylink managed EEE support

Convert stmmac to use phylink managed EEE support rather than delving
into phylib:

1. Move the stmmac_eee_init() calls out of mac_link_down() and
   mac_link_up() methods into the new mac_{enable,disable}_lpi()
   methods. We leave the calls to stmmac_set_eee_pls() in place as
   these change bits which tell the EEE hardware when the link came
   up or down, and is used for a separate hardware timer. However,
   symmetrically conditionalise this with priv->dma_cap.eee.

2. Update the current LPI timer each time LPI is enabled - which we
   need for software-timed LPI.

3. With phylink managed EEE, phylink manages the receive clock stop
   configuration via phylink_config.eee_rx_clk_stop_enable. Set this
   appropriately which makes the call to phy_eee_rx_clock_stop()
   redundant.

4. From what I can work out, all supported interfaces support LPI
   signalling on stmmac (there's no restriction implemented.) It
   also appears to support LPI at all full duplex speeds at or over
   100M. Set these capabilities.

5. The default timer appears to be derived from a module parameter.
   Set this the same, although we keep code that reconfigures the
   timer in stmmac_init_phy().

6. Remove the direct call to phy_support_eee(), which phylink will do
   on the drivers behalf if phylink_config.eee_enabled_default is set.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/E1tYAEG-0014QH-9O@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: lan743x: convert to phylink managed EEE
Russell King (Oracle) [Wed, 15 Jan 2025 20:43:03 +0000 (20:43 +0000)]
net: lan743x: convert to phylink managed EEE

Convert lan743x to phylink managed EEE:
- Set the lpi_capabilties.
- Move the call to lan743x_mac_eee_enable() into the enable/disable
  tx_lpi functions.
- Ensure that EEEEN is clear during probe.
- Move the setting of the LPI timer into mac_enable_tx_lpi().
- Move reading of LPI timer to phylink initialisation to set the
  default timer value.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/E1tYAEB-0014QB-4s@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: lan743x: use netdev in lan743x_phylink_mac_link_down()
Russell King (Oracle) [Wed, 15 Jan 2025 20:42:57 +0000 (20:42 +0000)]
net: lan743x: use netdev in lan743x_phylink_mac_link_down()

Use the netdev that we already have in lan743x_phylink_mac_link_down().

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/E1tYAE5-0014Q5-Up@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: mvpp2: add EEE implementation
Russell King (Oracle) [Wed, 15 Jan 2025 20:42:52 +0000 (20:42 +0000)]
net: mvpp2: add EEE implementation

Add EEE support for mvpp2, using phylink's EEE implementation, which
means we just need to implement the two methods for LPI control, and
with the initial configuration. Only SGMII mode is supported, so only
100M and 1G speeds.

Disabling LPI requires clearing a single bit. Enabling LPI needs a full
configuration of several values, as the timer values are dependent on
the MAC operating speed.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/E1tYAE0-0014Pz-R9@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: mvneta: convert to phylink EEE implementation
Russell King (Oracle) [Wed, 15 Jan 2025 20:42:47 +0000 (20:42 +0000)]
net: mvneta: convert to phylink EEE implementation

Convert mvneta to use phylink's EEE implementation by implementing the
two LPI control methods, and adding the initial configuration and
capabilities.

Although disabling LPI requires clearing a single bit, for safety we
clear the manual mode and force bits to ensure that auto mode will be
used.

Enabling LPI needs a full configuration of several values, as the timer
values are dependent on the MAC operating speed, as per the original
code.

As Armada 388 states that EEE is only supported in "SGMII" modes, mark
this in lpi_interfaces. Testing with RGMII on the Clearfog platform
indicates that the receive path fails to detect LPI over RGMII.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/E1tYADv-0014Pt-NO@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: phylink: add EEE management
Russell King (Oracle) [Wed, 15 Jan 2025 20:42:42 +0000 (20:42 +0000)]
net: phylink: add EEE management

Add EEE management to phylink, making use of the phylib implementation.
This will only be used where a MAC driver populates the methods and
capabilities bitfield, otherwise we keep our old behaviour.

Phylink will keep track of the EEE configuration, including the clock
stop abilities at each end of the MAC to PHY link, programming the PHY
appropriately and preserving the LPI configuration should the PHY go
away.

Phylink will call into the MAC driver when LPI needs to be enabled or
disabled, with the requirement that the MAC have LPI disabled prior
to the netdev being brought up (in other words, it will only call
mac_disable_tx_lpi() if it has already called mac_enable_tx_lpi().)

Support for phylink managed EEE is enabled by populating both tx_lpi
MAC operations method pointers, and filling in both LPI interfaces
and capabilities. If the methods are provided but the LPI interfaces
or capabilities remain empty, this indicates to phylink that EEE is
implemented by the driver but the hardware it is driving does not
support EEE, and thus the ethtool set_eee() and get_eee() methods will
return EOPNOTSUPP.

No validation of the LPI timer value is performed by this patch.

For interface modes which do not support LPI, we make no attempt to
manipulate the phylib EEE advertisement, but instead refuse to
activate LPI at the MAC, noting it at debug message level.

We also restrict the advertisement and reported userspace support
linkmode masks according to the lpi_capabilities provided to
phylink by the MAC driver.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/E1tYADq-0014Pn-J1@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: phylink: add phylink_link_is_up() helper
Russell King (Oracle) [Wed, 15 Jan 2025 20:42:37 +0000 (20:42 +0000)]
net: phylink: add phylink_link_is_up() helper

Add a helper to determine whether the link is up or down. Currently
this is only used in one location, but becomes necessary to test
when reconfiguring EEE.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/E1tYADl-0014Ph-EV@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: phy: add support for querying PHY clock stop capability
Russell King (Oracle) [Wed, 15 Jan 2025 20:42:32 +0000 (20:42 +0000)]
net: phy: add support for querying PHY clock stop capability

Add support for querying whether the PHY allows the transmit xMII clock
to be stopped while in LPI mode. This will be used by phylink to pass
to the MAC driver so it can configure the generation of the xMII clock
appropriately.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/E1tYADg-0014Pb-AJ@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: mdio: add definition for clock stop capable bit
Russell King (Oracle) [Wed, 15 Jan 2025 20:42:27 +0000 (20:42 +0000)]
net: mdio: add definition for clock stop capable bit

Add a definition for the clock stop capable bit in the PCS MMD. This
bit indicates whether the MAC is able to stop the transmit xMII clock
while it is signalling LPI.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/E1tYADb-0014PV-6T@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge branch 'dev-covnert-dev_change_name-to-per-netns-rtnl'
Jakub Kicinski [Fri, 17 Jan 2025 01:20:52 +0000 (17:20 -0800)]
Merge branch 'dev-covnert-dev_change_name-to-per-netns-rtnl'

Kuniyuki Iwashima says:

====================
dev: Covnert dev_change_name() to per-netns RTNL.

Patch 1 adds a missing netdev_rename_lock in dev_change_name()
and Patch 2 removes unnecessary devnet_rename_sem there.

Patch 3 replaces RTNL with rtnl_net_lock() in dev_ifsioc(),
and now dev_change_name() is always called under per-netns RTNL.

Given it's close to -rc8 and Patch 1 touches the trivial unlikely
path, can Patch 1 go into net-next ?  Otherwise I'll post Patch 2 & 3
separately in the next cycle.
====================

Link: https://patch.msgid.link/20250115095545.52709-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>