]> www.infradead.org Git - linux.git/log
linux.git
5 months agonet: use skb_crc32c() in skb_crc32c_csum_help()
Eric Biggers [Mon, 19 May 2025 17:50:05 +0000 (10:50 -0700)]
net: use skb_crc32c() in skb_crc32c_csum_help()

Instead of calling __skb_checksum() with a skb_checksum_ops struct that
does CRC32C, just call the new function skb_crc32c().  This is faster
and simpler.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://patch.msgid.link/20250519175012.36581-4-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agonet: add skb_crc32c()
Eric Biggers [Mon, 19 May 2025 17:50:04 +0000 (10:50 -0700)]
net: add skb_crc32c()

Add skb_crc32c(), which calculates the CRC32C of a sk_buff.  It will
replace __skb_checksum(), which unnecessarily supports arbitrary
checksums.  Compared to __skb_checksum(), skb_crc32c():

   - Uses the correct type for CRC32C values (u32, not __wsum).

   - Does not require the caller to provide a skb_checksum_ops struct.

   - Is faster because it does not use indirect calls and does not use
     the very slow crc32c_combine().

According to commit 2817a336d4d5 ("net: skb_checksum: allow custom
update/combine for walking skb") which added __skb_checksum(), the
original motivation for the abstraction layer was to avoid code
duplication for CRC32C and other checksums in the future.  However:

   - No additional checksums showed up after CRC32C.  __skb_checksum()
     is only used with the "regular" net checksum and CRC32C.

   - Indirect calls are expensive.  Commit 2544af0344ba ("net: avoid
     indirect calls in L4 checksum calculation") worked around this
     using the INDIRECT_CALL_1 macro. But that only avoided the indirect
     call for the net checksum, and at the cost of an extra branch.

   - The checksums use different types (__wsum and u32), causing casts
     to be needed.

   - It made the checksums of fragments be combined (rather than
     chained) for both checksums, despite this being highly
     counterproductive for CRC32C due to how slow crc32c_combine() is.
     This can clearly be seen in commit 4c2f24549644 ("sctp: linearize
     early if it's not GSO") which tried to work around this performance
     bug.  With a dedicated function for each checksum, we can instead
     just use the proper strategy for each checksum.

As shown by the following tables, the new function skb_crc32c() is
faster than __skb_checksum(), with the improvement varying greatly from
5% to 2500% depending on the case.  The largest improvements come from
fragmented packets, mainly due to eliminating the inefficient
crc32c_combine().  But linear packets are improved too, especially
shorter ones, mainly due to eliminating indirect calls.  These
benchmarks were done on AMD Zen 5.  On that CPU, Linux uses IBRS instead
of retpoline; an even greater improvement might be seen with retpoline:

    Linear sk_buffs

        Length in bytes    __skb_checksum cycles    skb_crc32c cycles
        ===============    =====================    =================
                     64                       43                   18
                    256                       94                   77
                   1420                      204                  161
                  16384                     1735                 1642

    Nonlinear sk_buffs (even split between head and one fragment)

        Length in bytes    __skb_checksum cycles    skb_crc32c cycles
        ===============    =====================    =================
                     64                      579                   22
                    256                      829                   77
                   1420                     1506                  194
                  16384                     4365                 1682

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://patch.msgid.link/20250519175012.36581-3-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agonet: introduce CONFIG_NET_CRC32C
Eric Biggers [Mon, 19 May 2025 17:50:03 +0000 (10:50 -0700)]
net: introduce CONFIG_NET_CRC32C

Add a hidden kconfig symbol NET_CRC32C that will group together the
functions that calculate CRC32C checksums of packets, so that these
don't have to be built into NET-enabled kernels that don't need them.

Make skb_crc32c_csum_help() (which is called only when IP_SCTP is
enabled) conditional on this symbol, and make IP_SCTP select it.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://patch.msgid.link/20250519175012.36581-2-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agoMerge branch 'tools-ynl-gen-add-support-for-inherited-selector-and-therefore-tc'
Jakub Kicinski [Wed, 21 May 2025 22:38:43 +0000 (15:38 -0700)]
Merge branch 'tools-ynl-gen-add-support-for-inherited-selector-and-therefore-tc'

Jakub Kicinski says:

====================
tools: ynl-gen: add support for "inherited" selector and therefore TC

Add C codegen support for constructs needed by TC, namely passing
sub-message selector from a lower nest, and sub-messages with
fixed headers.
====================

Link: https://patch.msgid.link/20250520161916.413298-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agotools: ynl: add a sample for TC
Jakub Kicinski [Tue, 20 May 2025 16:19:16 +0000 (09:19 -0700)]
tools: ynl: add a sample for TC

Add a very simple TC dump sample with decoding of fq_codel attrs:

  # ./tools/net/ynl/samples/tc
        dummy0: fq_codel  limit: 10240p target: 5ms new_flow_cnt: 0

proving that selector passing (for stats) works.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250520161916.413298-13-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agonetlink: specs: tc: add qdisc dump to TC spec
Jakub Kicinski [Tue, 20 May 2025 16:19:15 +0000 (09:19 -0700)]
netlink: specs: tc: add qdisc dump to TC spec

Hook TC qdisc dump in the TC qdisc get, it only supported doit
until now and dumping will be used by the sample code.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250520161916.413298-12-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agotools: ynl: enable codegen for TC
Jakub Kicinski [Tue, 20 May 2025 16:19:14 +0000 (09:19 -0700)]
tools: ynl: enable codegen for TC

We are ready to support most of TC. Enable C code gen.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250520161916.413298-11-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agotools: ynl-gen: support weird sub-message formats
Jakub Kicinski [Tue, 20 May 2025 16:19:13 +0000 (09:19 -0700)]
tools: ynl-gen: support weird sub-message formats

TC uses all possible sub-message formats:
 - nested attrs
 - fixed headers + nested attrs
 - fixed headers
 - empty

Nested attrs are already supported for rt-link. Add support
for remaining 3. The empty and fixed headers ones are fairly
trivial, we can fake a Binary or Flags type instead of a Nest.

For fixed headers + nest we need to teach nest parsing and
nest put to handle fixed headers.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250520161916.413298-10-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agotools: ynl-gen: support local attrs in _multi_parse
Jakub Kicinski [Tue, 20 May 2025 16:19:12 +0000 (09:19 -0700)]
tools: ynl-gen: support local attrs in _multi_parse

The _multi_parse() helper calls the _attr_get() method of each attr,
but it only respects what code the helper wants to emit, not what
local variables it needs. Local variables will soon be needed,
support them.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250520161916.413298-9-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agotools: ynl-gen: move fixed header info from RenderInfo to Struct
Jakub Kicinski [Tue, 20 May 2025 16:19:11 +0000 (09:19 -0700)]
tools: ynl-gen: move fixed header info from RenderInfo to Struct

RenderInfo describes a request-response exchange. Struct describes
a parsed attribute set. For ease of parsing sub-messages with
fixed headers move fixed header info from RenderInfo to Struct.
No functional changes.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250520161916.413298-8-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agotools: ynl-gen: support passing selector to a nest
Jakub Kicinski [Tue, 20 May 2025 16:19:10 +0000 (09:19 -0700)]
tools: ynl-gen: support passing selector to a nest

In rtnetlink all submessages had the selector at the same level
of nesting as the submessage. We could refer to the relevant
attribute from the current struct. In TC, stats are one level
of nesting deeper than "kind". Teach the code-gen about structs
which need to be passed a selector by the caller for parsing.

Because structs are "topologically sorted" one pass of propagating
the selectors down is enough.

For generating netlink message we depend on the presence bits
so no selector passing needed there.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250520161916.413298-7-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agonetlink: specs: tc: drop the family name prefix from attrs
Jakub Kicinski [Tue, 20 May 2025 16:19:09 +0000 (09:19 -0700)]
netlink: specs: tc: drop the family name prefix from attrs

All attribute sets and messages are prefixed with tc-.
The C codegen also adds the family name to all structs.
We end up with names like struct tc_tc_act_attrs.
Remove the tc- prefixes to shorten the names.
This should not impact Python as the attr set names
are never exposed to user, they are only used to refer
to things internally, in the encoder / decoder.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250520161916.413298-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agonetlink: specs: tc: add C naming info
Jakub Kicinski [Tue, 20 May 2025 16:19:08 +0000 (09:19 -0700)]
netlink: specs: tc: add C naming info

Add naming info needed by C code gen.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250520161916.413298-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agonetlink: specs: tc: use tc-gact instead of tc-gen as struct name
Jakub Kicinski [Tue, 20 May 2025 16:19:07 +0000 (09:19 -0700)]
netlink: specs: tc: use tc-gact instead of tc-gen as struct name

There is a define in the uAPI header called tc_gen which expands
to the "generic" TC action fields. This helps other actions include
the base fields without having to deal with nested structs.

A couple of actions (sample, gact) do not define extra fields,
so the spec used a common tc-gen struct for both of them.
Unfortunately this struct does not exist in C. Let's use gact's
(generic act's) struct for basic actions.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250520161916.413298-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agonetlink: specs: tc: remove duplicate nests
Jakub Kicinski [Tue, 20 May 2025 16:19:06 +0000 (09:19 -0700)]
netlink: specs: tc: remove duplicate nests

tc-act-stats-attrs and tca-stats-attrs are almost identical.
The only difference is that the latter has sub-message decoding
for app, rather than declaring it as a binary attr.

tc-act-police-attrs and tc-police-attrs are identical but for
the TODO annotations.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250520161916.413298-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agotools: ynl-gen: add makefile deps for neigh
Jakub Kicinski [Tue, 20 May 2025 16:19:05 +0000 (09:19 -0700)]
tools: ynl-gen: add makefile deps for neigh

Kory is reporting build issues after recent additions to YNL
if the system headers are old.

Link: https://lore.kernel.org/20250519164949.597d6e92@kmaincent-XPS-13-7390
Reported-by: Kory Maincent <kory.maincent@bootlin.com>
Fixes: 0939a418b3b0 ("tools: ynl: submsg: reverse parse / error reporting")
Tested-by: Kory Maincent <kory.maincent@bootlin.com>
Link: https://patch.msgid.link/20250520161916.413298-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agoMerge branch 'net-airoha-add-per-flow-stats-support-to-hw-flowtable-offloading'
Jakub Kicinski [Wed, 21 May 2025 03:00:54 +0000 (20:00 -0700)]
Merge branch 'net-airoha-add-per-flow-stats-support-to-hw-flowtable-offloading'

Lorenzo Bianconi says:

====================
net: airoha: Add per-flow stats support to hw flowtable offloading

Introduce per-flow stats accounting to the flowtable hw offload in the
airoha_eth driver. Flow stats are split in the PPE and NPU modules:
- PPE: accounts for high 32bit of per-flow stats
- NPU: accounts for low 32bit of per-flow stats

v1: https://lore.kernel.org/20250514-airoha-en7581-flowstats-v1-0-c00ede12a2ca@kernel.org
====================

Link: https://patch.msgid.link/20250516-airoha-en7581-flowstats-v2-0-06d5fbf28984@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agonet: airoha: ppe: Disable packet keepalive
Lorenzo Bianconi [Fri, 16 May 2025 08:00:01 +0000 (10:00 +0200)]
net: airoha: ppe: Disable packet keepalive

Since netfilter flowtable entries are now refreshed by flow-stats
polling, we can disable hw packet keepalive used to periodically send
packets belonging to offloaded flows to the kernel in order to refresh
flowtable entries.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250516-airoha-en7581-flowstats-v2-3-06d5fbf28984@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agonet: airoha: Add FLOW_CLS_STATS callback support
Lorenzo Bianconi [Fri, 16 May 2025 08:00:00 +0000 (10:00 +0200)]
net: airoha: Add FLOW_CLS_STATS callback support

Introduce per-flow stats accounting to the flowtable hw offload in
the airoha_eth driver. Flow stats are split in the PPE and NPU modules:
- PPE: accounts for high 32bit of per-flow stats
- NPU: accounts for low 32bit of per-flow stats

FLOW_CLS_STATS can be enabled or disabled at compile time.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250516-airoha-en7581-flowstats-v2-2-06d5fbf28984@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agonet: airoha: npu: Move memory allocation in airoha_npu_send_msg() caller
Lorenzo Bianconi [Fri, 16 May 2025 07:59:59 +0000 (09:59 +0200)]
net: airoha: npu: Move memory allocation in airoha_npu_send_msg() caller

Move ppe_mbox_data struct memory allocation from airoha_npu_send_msg
routine to the caller one. This is a preliminary patch to enable wlan NPU
offloading and flow counter stats support.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250516-airoha-en7581-flowstats-v2-1-06d5fbf28984@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agoMerge branch 'ipv6-follow-up-for-rtnl-free-rtm_newroute-series'
Jakub Kicinski [Wed, 21 May 2025 02:18:26 +0000 (19:18 -0700)]
Merge branch 'ipv6-follow-up-for-rtnl-free-rtm_newroute-series'

Kuniyuki Iwashima says:

====================
ipv6: Follow up for RTNL-free RTM_NEWROUTE series.

Patch 1 removes rcu_read_lock() in fib6_get_table().
Patch 2 removes rtnl_is_held arg for lwtunnel_valid_encap_type(), which
 was short-term fix and is no longer used.
Patch 3 fixes RCU vs GFP_KERNEL report by syzkaller.
Patch 4~7 reverts GFP_ATOMIC uses to GFP_KERNEL.

v1: https://lore.kernel.org/20250514201943.74456-1-kuniyu@amazon.com
====================

Link: https://patch.msgid.link/20250516022759.44392-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agoipv6: Revert two per-cpu var allocation for RTM_NEWROUTE.
Kuniyuki Iwashima [Fri, 16 May 2025 02:27:23 +0000 (19:27 -0700)]
ipv6: Revert two per-cpu var allocation for RTM_NEWROUTE.

These two commits preallocated two per-cpu variables in
ip6_route_info_create() as fib_nh_common_init() and fib6_nh_init()
were expected to be called under RCU.

  * commit d27b9c40dbd6 ("ipv6: Preallocate nhc_pcpu_rth_output in
    ip6_route_info_create().")
  * commit 5720a328c3e9 ("ipv6: Preallocate rt->fib6_nh->rt6i_pcpu in
    ip6_route_info_create().")

Now these functions can be called without RCU and can use GFP_KERNEL.

Let's revert the commits.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250516022759.44392-8-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agoipv6: Pass gfp_flags down to ip6_route_info_create_nh().
Kuniyuki Iwashima [Fri, 16 May 2025 02:27:22 +0000 (19:27 -0700)]
ipv6: Pass gfp_flags down to ip6_route_info_create_nh().

Since commit c4837b9853e5 ("ipv6: Split ip6_route_info_create()."),
ip6_route_info_create_nh() uses GFP_ATOMIC as it was expected to be
called under RCU.

Now, we can call it without RCU and use GFP_KERNEL.

Let's pass gfp_flags to ip6_route_info_create_nh().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250516022759.44392-7-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agoRevert "ipv6: Factorise ip6_route_multipath_add()."
Kuniyuki Iwashima [Fri, 16 May 2025 02:27:21 +0000 (19:27 -0700)]
Revert "ipv6: Factorise ip6_route_multipath_add()."

Commit 71c0efb6d12f ("ipv6: Factorise ip6_route_multipath_add().") split
a loop in ip6_route_multipath_add() so that we can put rcu_read_lock()
between ip6_route_info_create() and ip6_route_info_create_nh().

We no longer need to do so as ip6_route_info_create_nh() does not require
RCU now.

Let's revert the commit to simplify the code.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250516022759.44392-6-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agoRevert "ipv6: sr: switch to GFP_ATOMIC flag to allocate memory during seg6local LWT...
Kuniyuki Iwashima [Fri, 16 May 2025 02:27:20 +0000 (19:27 -0700)]
Revert "ipv6: sr: switch to GFP_ATOMIC flag to allocate memory during seg6local LWT setup"

The previous patch fixed the same issue mentioned in
commit 14a0087e7236 ("ipv6: sr: switch to GFP_ATOMIC
flag to allocate memory during seg6local LWT setup").

Let's revert it.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Link: https://patch.msgid.link/20250516022759.44392-5-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agoipv6: Narrow down RCU critical section in inet6_rtm_newroute().
Kuniyuki Iwashima [Fri, 16 May 2025 02:27:19 +0000 (19:27 -0700)]
ipv6: Narrow down RCU critical section in inet6_rtm_newroute().

Commit 169fd62799e8 ("ipv6: Get rid of RTNL for SIOCADDRT and
RTM_NEWROUTE.") added rcu_read_lock() covering
ip6_route_info_create_nh() and __ip6_ins_rt() to guarantee that
nexthop and netdev will not go away.

However, as reported by syzkaller [0], ip_tun_build_state() calls
dst_cache_init() with GFP_KERNEL during the RCU critical section.

ip6_route_info_create_nh() fetches nexthop or netdev depending on
whether RTA_NH_ID is set, and struct fib6_info holds a refcount
of either of them by nexthop_get() or netdev_get_by_index().

netdev_get_by_index() looks up a dev and calls dev_hold() under RCU.

So, we need RCU only around nexthop_find_by_id() and nexthop_get()
( and a few more nexthop code).

Let's add rcu_read_lock() there and remove rcu_read_lock() in
ip6_route_add() and ip6_route_multipath_add().

Now these functions called from fib6_add() need RCU:

  - inet6_rt_notify()
  - fib6_drop_pcpu_from() (via fib6_purge_rt())
  - rt6_flush_exceptions() (via fib6_purge_rt())
  - ip6_ignore_linkdown() (via rt6_multipath_rebalance())

All callers of inet6_rt_notify() need RCU, so rcu_read_lock() is
added there.

[0]:
[ BUG: Invalid wait context ]
6.15.0-rc4-syzkaller-00746-g836b313a14a3 #0 Tainted: G W
 ----------------------------
syz-executor234/5832 is trying to lock:
ffffffff8e021688 (pcpu_alloc_mutex){+.+.}-{4:4}, at:
pcpu_alloc_noprof+0x284/0x16b0 mm/percpu.c:1782
other info that might help us debug this:
context-{5:5}
1 lock held by syz-executor234/5832:
 0: ffffffff8df3b860 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire
include/linux/rcupdate.h:331 [inline]
 0: ffffffff8df3b860 (rcu_read_lock){....}-{1:3}, at: rcu_read_lock
include/linux/rcupdate.h:841 [inline]
 0: ffffffff8df3b860 (rcu_read_lock){....}-{1:3}, at:
ip6_route_add+0x4d/0x2f0 net/ipv6/route.c:3913
stack backtrace:
CPU: 0 UID: 0 PID: 5832 Comm: syz-executor234 Tainted: G W
6.15.0-rc4-syzkaller-00746-g836b313a14a3 #0 PREEMPT(full)
Tainted: [W]=WARN
Hardware name: Google Google Compute Engine/Google Compute Engine,
BIOS Google 04/29/2025
Call Trace:
 <TASK>
 dump_stack_lvl+0x189/0x250 lib/dump_stack.c:120
 print_lock_invalid_wait_context kernel/locking/lockdep.c:4831 [inline]
 check_wait_context kernel/locking/lockdep.c:4903 [inline]
 __lock_acquire+0xbcf/0xd20 kernel/locking/lockdep.c:5185
 lock_acquire+0x120/0x360 kernel/locking/lockdep.c:5866
 __mutex_lock_common kernel/locking/mutex.c:601 [inline]
 __mutex_lock+0x182/0xe80 kernel/locking/mutex.c:746
 pcpu_alloc_noprof+0x284/0x16b0 mm/percpu.c:1782
 dst_cache_init+0x37/0xc0 net/core/dst_cache.c:145
 ip_tun_build_state+0x193/0x6b0 net/ipv4/ip_tunnel_core.c:687
 lwtunnel_build_state+0x381/0x4c0 net/core/lwtunnel.c:137
 fib_nh_common_init+0x129/0x460 net/ipv4/fib_semantics.c:635
 fib6_nh_init+0x15e4/0x2030 net/ipv6/route.c:3669
 ip6_route_info_create_nh+0x139/0x870 net/ipv6/route.c:3866
 ip6_route_add+0xf6/0x2f0 net/ipv6/route.c:3915
 inet6_rtm_newroute+0x284/0x1c50 net/ipv6/route.c:5732
 rtnetlink_rcv_msg+0x7cc/0xb70 net/core/rtnetlink.c:6955
 netlink_rcv_skb+0x219/0x490 net/netlink/af_netlink.c:2534
 netlink_unicast_kernel net/netlink/af_netlink.c:1313 [inline]
 netlink_unicast+0x758/0x8d0 net/netlink/af_netlink.c:1339
 netlink_sendmsg+0x805/0xb30 net/netlink/af_netlink.c:1883
 sock_sendmsg_nosec net/socket.c:712 [inline]
 __sock_sendmsg+0x219/0x270 net/socket.c:727
 ____sys_sendmsg+0x505/0x830 net/socket.c:2566
 ___sys_sendmsg+0x21f/0x2a0 net/socket.c:2620
 __sys_sendmsg net/socket.c:2652 [inline]
 __do_sys_sendmsg net/socket.c:2657 [inline]
 __se_sys_sendmsg net/socket.c:2655 [inline]
 __x64_sys_sendmsg+0x19b/0x260 net/socket.c:2655
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xf6/0x210 arch/x86/entry/syscall_64.c:94

Fixes: 169fd62799e8 ("ipv6: Get rid of RTNL for SIOCADDRT and RTM_NEWROUTE.")
Reported-by: syzbot+bcc12d6799364500fbec@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=bcc12d6799364500fbec
Reported-by: Eric Dumazet <edumazet@google.com>
Closes: https://lore.kernel.org/netdev/CANn89i+r1cGacVC_6n3-A-WSkAa_Nr+pmxJ7Gt+oP-P9by2aGw@mail.gmail.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250516022759.44392-4-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agoinet: Remove rtnl_is_held arg of lwtunnel_valid_encap_type(_attr)?().
Kuniyuki Iwashima [Fri, 16 May 2025 02:27:18 +0000 (19:27 -0700)]
inet: Remove rtnl_is_held arg of lwtunnel_valid_encap_type(_attr)?().

Commit f130a0cc1b4f ("inet: fix lwtunnel_valid_encap_type() lock
imbalance") added the rtnl_is_held argument as a temporary fix while
I'm converting nexthop and IPv6 routing table to per-netns RTNL or RCU.

Now all callers of lwtunnel_valid_encap_type() do not hold RTNL.

Let's remove the argument.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250516022759.44392-3-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agoipv6: Remove rcu_read_lock() in fib6_get_table().
Kuniyuki Iwashima [Fri, 16 May 2025 02:27:17 +0000 (19:27 -0700)]
ipv6: Remove rcu_read_lock() in fib6_get_table().

Once allocated, the IPv6 routing table is not freed until
netns is dismantled.

fib6_get_table() uses rcu_read_lock() while iterating
net->ipv6.fib_table_hash[], but it's not needed and
rather confusing.

Because some callers have this pattern,

  table = fib6_get_table();

  rcu_read_lock();
  /* ... use table here ... */
  rcu_read_unlock();

  [ See: addrconf_get_prefix_route(), ip6_route_del(),
         rt6_get_route_info(), rt6_get_dflt_router() ]

and this looks illegal but is actually safe.

Let's remove rcu_read_lock() in fib6_get_table() and pass true
to the last argument of hlist_for_each_entry_rcu() to bypass
the RCU check.

Note that protection is not needed but RCU helper is used to
avoid data-race.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250516022759.44392-2-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agoMerge branch 'net-bcmgenet-64bit-stats-and-expose-more-stats-in-ethtool'
Jakub Kicinski [Wed, 21 May 2025 01:36:37 +0000 (18:36 -0700)]
Merge branch 'net-bcmgenet-64bit-stats-and-expose-more-stats-in-ethtool'

Zak Kemble says:

====================
net: bcmgenet: 64bit stats and expose more stats in ethtool

Hi, this patchset updates the bcmgenet driver with new 64bit statistics via
ndo_get_stats64 and rtnl_link_stats64, now reports hardware discarded
packets in the rx_missed_errors stat and exposes more stats in ethtool.

v1: https://lore.kernel.org/20250513144107.1989-1-zakkemble@gmail.com
v2: https://lore.kernel.org/20250515145142.1415-1-zakkemble@gmail.com
====================

Link: https://patch.msgid.link/20250519113257.1031-1-zakkemble@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agonet: bcmgenet: expose more stats in ethtool
Zak Kemble [Mon, 19 May 2025 11:32:57 +0000 (12:32 +0100)]
net: bcmgenet: expose more stats in ethtool

Expose more per-queue and overall stats in ethtool

Signed-off-by: Zak Kemble <zakkemble@gmail.com>
Tested-by: Florian Fainelli <florian.fainelli@broadcom.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20250519113257.1031-4-zakkemble@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agonet: bcmgenet: count hw discarded packets in missed stat
Zak Kemble [Mon, 19 May 2025 11:32:56 +0000 (12:32 +0100)]
net: bcmgenet: count hw discarded packets in missed stat

Hardware discarded packets are now counted in their own missed stat
instead of being lumped in with general errors.

Signed-off-by: Zak Kemble <zakkemble@gmail.com>
Tested-by: Florian Fainelli <florian.fainelli@broadcom.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20250519113257.1031-3-zakkemble@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agonet: bcmgenet: switch to use 64bit statistics
Zak Kemble [Mon, 19 May 2025 11:32:55 +0000 (12:32 +0100)]
net: bcmgenet: switch to use 64bit statistics

Update the driver to use ndo_get_stats64, rtnl_link_stats64 and
u64_stats_t counters for statistics.

Signed-off-by: Zak Kemble <zakkemble@gmail.com>
Tested-by: Florian Fainelli <florian.fainelli@broadcom.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20250519113257.1031-2-zakkemble@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agoMerge branch 'net-phy-fixed_phy-simplifications-and-improvements'
Jakub Kicinski [Wed, 21 May 2025 01:17:45 +0000 (18:17 -0700)]
Merge branch 'net-phy-fixed_phy-simplifications-and-improvements'

Heiner Kallweit says:

====================
net: phy: fixed_phy: simplifications and improvements

This series includes two types of changes:
- All callers pass PHY_POLL, therefore remove irq argument
- constify the passed struct fixed_phy_status *status
====================

Link: https://patch.msgid.link/4d4c468e-300d-42c7-92a1-eabbdb6be748@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agonet: phy: fixed_phy: constify status argument where possible
Heiner Kallweit [Sat, 17 May 2025 20:37:29 +0000 (22:37 +0200)]
net: phy: fixed_phy: constify status argument where possible

Constify the passed struct fixed_phy_status *status where possible.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/d1764b62-8538-408b-a4e3-b63715481a38@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agonet: phy: fixed_phy: remove irq argument from fixed_phy_register
Heiner Kallweit [Sat, 17 May 2025 20:35:56 +0000 (22:35 +0200)]
net: phy: fixed_phy: remove irq argument from fixed_phy_register

All callers pass PHY_POLL, therefore remove irq argument from
fixed_phy_register().

Note: I keep the irq argument in fixed_phy_add_gpiod() for now,
for the case that somebody may want to use a GPIO interrupt in
the future, by e.g. adding a call to fwnode_irq_get() to
fixed_phy_get_gpiod().

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/31cdb232-a5e9-4997-a285-cb9a7d208124@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agonet: phy: fixed_phy: remove irq argument from fixed_phy_add
Heiner Kallweit [Sat, 17 May 2025 20:34:32 +0000 (22:34 +0200)]
net: phy: fixed_phy: remove irq argument from fixed_phy_add

All callers pass PHY_POLL, therefore remove irq argument from
fixed_phy_add().

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Acked-by: Greg Ungerer <gerg@linux-m68k.org>
Link: https://patch.msgid.link/b3b9b3bc-c310-4a54-b376-c909c83575de@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agonet: let lockdep compare instance locks
Jakub Kicinski [Sat, 17 May 2025 20:08:10 +0000 (13:08 -0700)]
net: let lockdep compare instance locks

AFAIU always returning -1 from lockdep's compare function
basically disables checking of dependencies between given
locks. Try to be a little more precise about what guarantees
that instance locks won't deadlock.

Right now we only nest them under protection of rtnl_lock.
Mostly in unregister_netdevice_many() and dev_close_many().

Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250517200810.466531-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agoselftests: net: Fix spellings
Sumanth Gavini [Sat, 17 May 2025 03:25:33 +0000 (20:25 -0700)]
selftests: net: Fix spellings

Fix "withouth" to "without"
Fix "instaces" to "instances"

Signed-off-by: Sumanth Gavini <sumanth.gavini@yahoo.com>
Reviewed-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Link: https://patch.msgid.link/20250517032535.1176351-1-sumanth.gavini@yahoo.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agoselftests: nci: Fix "Electrnoics" to "Electronics"
Sumanth Gavini [Sat, 17 May 2025 01:59:37 +0000 (18:59 -0700)]
selftests: nci: Fix "Electrnoics" to "Electronics"

Fix misspelling reported by codespell

Signed-off-by: Sumanth Gavini <sumanth.gavini@yahoo.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250517020003.1159640-1-sumanth.gavini@yahoo.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agoselftests: net: validate team flags propagation
Stanislav Fomichev [Fri, 16 May 2025 23:22:05 +0000 (16:22 -0700)]
selftests: net: validate team flags propagation

Cover three recent cases:
1. missing ops locking for the lowers during netdev_sync_lower_features
2. missing locking for dev_set_promiscuity (plus netdev_ops_assert_locked
   with a comment on why/when it's needed)
3. rcu lock during team_change_rx_flags

Verified that each one triggers when the respective fix is reverted.
Not sure about the placement, but since it all relies on teaming,
added to the teaming directory.

One ugly bit is that I add NETIF_F_LRO to netdevsim; there is no way
to trigger netdev_sync_lower_features without it.

Signed-off-by: Stanislav Fomichev <stfomichev@gmail.com>
Link: https://patch.msgid.link/20250516232205.539266-1-stfomichev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agoeth: fbnic: Replace kzalloc/fbnic_fw_init_cmpl with fbnic_fw_alloc_cmpl
Lee Trager [Fri, 16 May 2025 16:46:41 +0000 (09:46 -0700)]
eth: fbnic: Replace kzalloc/fbnic_fw_init_cmpl with fbnic_fw_alloc_cmpl

Replace the pattern of calling and validating kzalloc then
fbnic_fw_init_cmpl with a single function, fbnic_fw_alloc_cmpl.

Signed-off-by: Lee Trager <lee@trager.us>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20250516164804.741348-1-lee@trager.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agoMerge branch 'add-built-in-2-5g-ethernet-phy-support-on-mt7988'
Jakub Kicinski [Wed, 21 May 2025 01:11:06 +0000 (18:11 -0700)]
Merge branch 'add-built-in-2-5g-ethernet-phy-support-on-mt7988'

Sky Huang says:

====================
Add built-in 2.5G ethernet phy support on MT7988

From: Sky Huang <skylake.huang@mediatek.com>

This patchset adds support for built-in 2.5Gphy on MT7988, sort file
and config sequence in related Kconfig and Makefile.
====================

Link: https://patch.msgid.link/20250516102327.2014531-1-SkyLake.Huang@mediatek.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agonet: phy: mediatek: add driver for built-in 2.5G ethernet PHY on MT7988
Sky Huang [Fri, 16 May 2025 10:23:27 +0000 (18:23 +0800)]
net: phy: mediatek: add driver for built-in 2.5G ethernet PHY on MT7988

Add support for internal 2.5Gphy on MT7988. This driver will load
necessary firmware and add appropriate time delay to make sure
that firmware works stably. The firmware loading procedure takes
about 11ms in this driver.

Signed-off-by: Sky Huang <skylake.huang@mediatek.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20250516102327.2014531-3-SkyLake.Huang@mediatek.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agonet: phy: mediatek: Sort config and file names in Kconfig and Makefile
Sky Huang [Fri, 16 May 2025 10:23:26 +0000 (18:23 +0800)]
net: phy: mediatek: Sort config and file names in Kconfig and Makefile

Sort config and file names in Kconfig and Makefile in
drivers/net/phy/mediatek/ in alphabetical order.

Signed-off-by: Sky Huang <skylake.huang@mediatek.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20250516102327.2014531-2-SkyLake.Huang@mediatek.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agosctp: Do not wake readers in __sctp_write_space()
Petr Malat [Fri, 16 May 2025 08:17:28 +0000 (10:17 +0200)]
sctp: Do not wake readers in __sctp_write_space()

Function __sctp_write_space() doesn't set poll key, which leads to
ep_poll_callback() waking up all waiters, not only these waiting
for the socket being writable. Set the key properly using
wake_up_interruptible_poll(), which is preferred over the sync
variant, as writers are not woken up before at least half of the
queue is available. Also, TCP does the same.

Signed-off-by: Petr Malat <oss@malat.biz>
Acked-by: Xin Long <lucien.xin@gmail.com>
Link: https://patch.msgid.link/20250516081727.1361451-1-oss@malat.biz
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agonet: phy: realtek: add RTL8127-internal PHY
ChunHao Lin [Fri, 16 May 2025 05:56:22 +0000 (13:56 +0800)]
net: phy: realtek: add RTL8127-internal PHY

RTL8127-internal PHY is RTL8261C which is a integrated 10Gbps PHY with ID
0x001cc890. It follows the code path of RTL8125/RTL8126 internal NBase-T
PHY.

Signed-off-by: ChunHao Lin <hau@realtek.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20250516055622.3772-1-hau@realtek.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agonet: enetc: fix the error handling in enetc4_pf_netdev_create()
Wei Fang [Fri, 16 May 2025 05:27:34 +0000 (13:27 +0800)]
net: enetc: fix the error handling in enetc4_pf_netdev_create()

Fix the handling of err_wq_init and err_reg_netdev paths in
enetc4_pf_netdev_create() function.

Fixes: 6c5bafba347b ("net: enetc: add MAC filtering for i.MX95 ENETC PF")
Signed-off-by: Wei Fang <wei.fang@nxp.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250516052734.3624191-1-wei.fang@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agoocteontx2-pf: Add tracepoint for NIX_PARSE_S
Subbaraya Sundeep [Thu, 15 May 2025 17:44:08 +0000 (23:14 +0530)]
octeontx2-pf: Add tracepoint for NIX_PARSE_S

The NIX_PARSE_S structure populated by hardware in the
NIX RX CQE has parsing information for the received packet.
A tracepoint to dump the all words of NIX_PARSE_S
is helpful in debugging packet parser.

Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
Link: https://patch.msgid.link/1747331048-15347-1-git-send-email-sbhatta@marvell.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 months agonet: phy: make mdio consumer / device layer a separate module
Heiner Kallweit [Thu, 15 May 2025 08:11:54 +0000 (10:11 +0200)]
net: phy: make mdio consumer / device layer a separate module

After having factored out the provider part from mdio_bus.c, we can
make the mdio consumer / device layer a separate module. This also
allows to remove Kconfig symbol MDIO_DEVICE.
The module init / exit functions from mdio_bus.c no longer have to be
called from phy_device.c. The link order defined in
drivers/net/phy/Makefile ensures that init / exit functions are called
in the right order.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://patch.msgid.link/dba6b156-5748-44ce-b5e2-e8dc2fcee5a7@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 months agoMerge branch 'queue_api-reduce-risk-of-name-collision-over-txq'
Jakub Kicinski [Tue, 20 May 2025 03:09:07 +0000 (20:09 -0700)]
Merge branch 'queue_api-reduce-risk-of-name-collision-over-txq'

Gur Stavi says:

====================
queue_api: reduce risk of name collision over txq

Rename local variable in macros from txq to _txq.
When macro parameter get_desc is expended it is likely to have a txq
token that refers to a different txq variable at the caller's site.
====================

Link: https://patch.msgid.link/cover.1747559621.git.gur.stavi@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agoqueue_api: reduce risk of name collision over txq
Gur Stavi [Sun, 18 May 2025 10:00:54 +0000 (13:00 +0300)]
queue_api: reduce risk of name collision over txq

Rename local variable in macros from txq to _txq.
When macro parameter get_desc is expended it is likely to have a txq
token that refers to a different txq variable at the caller's site.

Signed-off-by: Gur Stavi <gur.stavi@huawei.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/95b60d218f004308486d92ed17c8cc6f28bac09d.1747559621.git.gur.stavi@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agoMerge branch '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next...
Jakub Kicinski [Tue, 20 May 2025 03:07:41 +0000 (20:07 -0700)]
Merge branch '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue

Tony Nguyen says:

====================
idpf: add initial PTP support

Milena Olech says:

This patch series introduces support for Precision Time Protocol (PTP) to
Intel(R) Infrastructure Data Path Function (IDPF) driver. PTP feature is
supported when the PTP capability is negotiated with the Control
Plane (CP). IDPF creates a PTP clock and sets a set of supported
functions.

During the PTP initialization, IDPF requests a set of PTP capabilities
and receives a writeback from the CP with the set of supported options.
These options are:
- get time of the PTP clock
- set the time of the PTP clock
- adjust the PTP clock
- Tx timestamping

Each feature is considered to have direct access, where the operations
on PCIe BAR registers are allowed, or the mailbox access, where the
virtchnl messages are used to perform any PTP action. Mailbox access
means that PTP requests are sent to the CP through dedicated secondary
mailbox and the CP reads/writes/modifies desired resource - PTP Clock
or Tx timestamp registers.

Tx timestamp capabilities are negotiated only for vports that have
UPLINK_VPORT flag set by the CP. Capabilities provide information about
the number of available Tx timestamp latches, their indexes and size of
the Tx timestamp value. IDPF requests Tx timestamp by setting the
TSYN bit and the requested timestamp index in the context descriptor for
the PTP packets. When the completion tag for that packet is received,
IDPF schedules a worker to read the Tx timestamp value.

* '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
  idpf: add support for Rx timestamping
  idpf: add Tx timestamp flows
  idpf: add Tx timestamp capabilities negotiation
  idpf: add PTP clock configuration
  idpf: add mailbox access to read PTP clock time
  idpf: negotiate PTP capabilities and get PTP clock
  idpf: move virtchnl structures to the header file
  virtchnl: add PTP virtchnl definitions
  idpf: add initial PTP support
  idpf: change the method for mailbox workqueue allocation
====================

Link: https://patch.msgid.link/20250516170645.1172700-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agoselftests: drv-net: Fix "envirnoments" to "environments"
Sumanth Gavini [Fri, 16 May 2025 22:51:48 +0000 (15:51 -0700)]
selftests: drv-net: Fix "envirnoments" to "environments"

Fix misspelling reported by codespell

Signed-off-by: Sumanth Gavini <sumanth.gavini@yahoo.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250516225156.1122058-1-sumanth.gavini@yahoo.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agonet: netlink: reduce extack cookie size
Johannes Berg [Fri, 16 May 2025 11:59:27 +0000 (13:59 +0200)]
net: netlink: reduce extack cookie size

Seems like the extack cookie hasn't found any users outside
of wireless, which always uses nl_set_extack_cookie_u64().
Thus, allocating 20 bytes for it is pointless, reduce that
to 8 bytes, and add a BUILD_BUG_ON() to ensure it's enough
(obviously it is, for a u64, but in case it changes again.)

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://patch.msgid.link/20250516115927.38209-2-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agoMerge tag 'ovpn-net-next-20250515' of https://github.com/OpenVPN/ovpn-net-next
David S. Miller [Mon, 19 May 2025 11:10:43 +0000 (12:10 +0100)]
Merge tag 'ovpn-net-next-20250515' of https://github.com/OpenVPN/ovpn-net-next

Antonio Quartulli says:

====================
ovpn: pull request for net-next: ovpn 2025-05-15

this is a new version of the previous pull request.
These time I have removed the fixes that we are still discussing,
so that we don't hold the entire series back.

There is a new fix though: it's about properly checking the return value
of skb_to_sgvec_nomark(). I spotted the issue while testing pings larger
than the iface's MTU on a TCP VPN connection.

I have added various Closes and Link tags where applicable, so
that we have references to GitHub tickets and other public discussions.

Since I have resent the PR, I have also added Andrew's Reviewed-by to
the first patch.

Please pull or let me know if something should be changed!
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
Patchset highlights:
- update MAINTAINERS entry for ovpn
- extend selftest with more cases
- avoid crash in selftest in case of getaddrinfo() failure
- fix ndo_start_xmit return value on error
- set ignore_df flag for IPv6 packets
- drop useless reg_state check in keepalive worker
- retain skb's dst when entering xmit function
- fix check on skb_to_sgvec_nomark() return value

5 months agoMerge branch 'vsock-test-improve-sigpipe-test-reliability'
Jakub Kicinski [Sat, 17 May 2025 01:01:34 +0000 (18:01 -0700)]
Merge branch 'vsock-test-improve-sigpipe-test-reliability'

Stefano Garzarella says:

====================
vsock/test: improve sigpipe test reliability

Running the tests continuously I noticed that sometimes the sigpipe
test would fail due to a race between the control message of the test
and the vsock transport messages.

While I was at it I also improved the test by checking the errno we
expect.

v1: https://lore.kernel.org/20250508142005.135857-1-sgarzare@redhat.com
====================

Link: https://patch.msgid.link/20250514141927.159456-1-sgarzare@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agovsock/test: check also expected errno on sigpipe test
Stefano Garzarella [Wed, 14 May 2025 14:19:27 +0000 (16:19 +0200)]
vsock/test: check also expected errno on sigpipe test

In the sigpipe test, we expect send() to fail, but we do not check if
send() fails with the errno we expect (EPIPE).

Add this check and repeat the send() in case of EINTR as we do in other
tests.

Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://patch.msgid.link/20250514141927.159456-4-sgarzare@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agovsock/test: retry send() to avoid occasional failure in sigpipe test
Stefano Garzarella [Wed, 14 May 2025 14:19:26 +0000 (16:19 +0200)]
vsock/test: retry send() to avoid occasional failure in sigpipe test

When the other peer calls shutdown(SHUT_RD), there is a chance that
the send() call could occur before the message carrying the close
information arrives over the transport. In such cases, the send()
might still succeed. To avoid this race, let's retry the send() call
a few times, ensuring the test is more reliable.

Sleep a little before trying again to avoid flooding the other peer
and filling its receive buffer, causing false-negative.

Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://patch.msgid.link/20250514141927.159456-3-sgarzare@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agovsock/test: add timeout_usleep() to allow sleeping in timeout sections
Stefano Garzarella [Wed, 14 May 2025 14:19:25 +0000 (16:19 +0200)]
vsock/test: add timeout_usleep() to allow sleeping in timeout sections

The timeout API uses signals, so we have documented not to use sleep(),
but we can use nanosleep(2) since POSIX.1 explicitly specifies that it
does not interact with signals.

Let's provide timeout_usleep() for that.

Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://patch.msgid.link/20250514141927.159456-2-sgarzare@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agoMerge branch 'tools-ynl-gen-support-sub-messages-and-rt-link'
Jakub Kicinski [Fri, 16 May 2025 23:32:08 +0000 (16:32 -0700)]
Merge branch 'tools-ynl-gen-support-sub-messages-and-rt-link'

Jakub Kicinski says:

====================
tools: ynl-gen: support sub-messages and rt-link

Sub-messages are how we express "polymorphism" in YNL. Donald added
the support to specs and Python a while back, support them in C, too.
Sub-message is a nest, but the interpretation of the attribute types
within that nest depends on a value of another attribute. For example
in rt-link the "kind" attribute contains the link type (veth, bonding,
etc.) and based on that the right enum has to be applied to interpret
link-specific attributes.

The last message is probably the most interesting to look at, as it
adds a fairly advanced sample.

This patch only contains enough support for rtnetlink, we will need
a little more complexity to support TC, where sub-messages may contain
fixed headers, and where the selector may be in a different nest than
the submessage.
====================

Link: https://patch.msgid.link/20250515231650.1325372-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agotools: ynl: add a sample for rt-link
Jakub Kicinski [Thu, 15 May 2025 23:16:50 +0000 (16:16 -0700)]
tools: ynl: add a sample for rt-link

Add a fairly complete example of rt-link usage. If run without any
arguments it simply lists the interfaces and some of their attrs.
If run with an arg it tries to create and delete a netkit device.

 1 # ./tools/net/ynl/samples/rt-link 1
 2 Trying to create a Netkit interface
 3 Testing error message for policy being bad:
 4     Kernel error: 'Provided default xmit policy not supported' (bad attribute: .linkinfo.data(netkit).policy)
 5   1:               lo: mtu 65536
 6   2:           wlp0s1: mtu  1500
 7   3:          enp0s13: mtu  1500
 8   4:           dummy0: mtu  1500  kind dummy     altname one two
 9   5:              nk0: mtu  1500  kind netkit    primary 0  policy forward
10   6:              nk1: mtu  1500  kind netkit    primary 1  policy blackhole
11 Trying to delete a Netkit interface (ifindex 6)

Sample creates the device first, it sets an invalid value for a netkit
attribute to trigger reverse parsing. Line 4 shows the error with the
attribute path correctly generated by YNL.

Then sample fixes the bad attribute and re-issues the request, with
NLM_F_ECHO set. This flag causes the notification to be looped back
to the initiating socket (our socket). Sample parses this notification
to save the ifindex of the created netkit.

Sample then proceeds to list the devices. Line 8 above shows a dummy
device with two alt names. Lines 9 and 10 show the netkit devices
the sample itself created.

The "primary" and "policy" attrs are from inside the netkit submsg.
The string values are auto-generated for the enums by YNL.

To clean up sample deletes the interface it created (line 11).

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250515231650.1325372-10-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agotools: ynl: enable codegen for all rt- families
Jakub Kicinski [Thu, 15 May 2025 23:16:49 +0000 (16:16 -0700)]
tools: ynl: enable codegen for all rt- families

Switch from including Classic netlink families one by one to excluding.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250515231650.1325372-9-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agotools: ynl: submsg: reverse parse / error reporting
Jakub Kicinski [Thu, 15 May 2025 23:16:48 +0000 (16:16 -0700)]
tools: ynl: submsg: reverse parse / error reporting

Reverse parsing lets YNL convert bad and missing attr pointers
from extack into a string like "missing attribute nest1.nest2.attr_name".
It's a feature that's unique to YNL C AFAIU (even the Python YNL
can't do nested reverse parsing). Add support for reverse-parsing
of sub-messages.

To simplify the logic and the code annotate the type policies
with extra metadata. Mark the selectors and the messages with
the information we need. We assume that key / selector always
precedes the sub-message while parsing (and also if there are
multiple sub-messages like in rt-link they are interleaved
selector 1 ... submsg 1 ... selector 2 .. submsg 2, not
selector 1 ... selector 2 ... submsg 1 ... submsg 2).

The rt-link sample in a subsequent changes shows reverse parsing
of sub-messages in action.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250515231650.1325372-8-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agotools: ynl-gen: submsg: support parsing and rendering sub-messages
Jakub Kicinski [Thu, 15 May 2025 23:16:47 +0000 (16:16 -0700)]
tools: ynl-gen: submsg: support parsing and rendering sub-messages

Adjust parsing and rendering appropriately to make sub-messages work.
Rendering is pretty trivial, as the submsg -> netlink conversion looks
like rendering a nest in which only one attr was set. Only trick
is that we use the enum value of the sub-message rather than the nest
as the type, and effectively skip one layer of nesting. A real double
nested struct would look like this:

  [SELECTOR]
  [SUBMSG]
    [NEST]
      [MSG1-ATTR]

A submsg "is" the nest so by skipping I mean:

  [SELECTOR]
  [SUBMSG]
    [MSG1-ATTR]

There is no extra validation in YNL if caller has set the selector
matching the submsg type (e.g. link type = "macvlan" but the nest
attrs are set to carry "veth"). Let the kernel handle that.

Parsing side is a little more specialized as we need to render and
insert a new kind of function which switches between what to parse
based on the selector. But code isn't too complicated.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250515231650.1325372-7-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agotools: ynl-gen: submsg: render the structs
Jakub Kicinski [Thu, 15 May 2025 23:16:46 +0000 (16:16 -0700)]
tools: ynl-gen: submsg: render the structs

The easiest (or perhaps only sane) way to support submessages in C
is to treat them as if they were nests. Build fake attributes to
that effect in the codegen. Render the submsg as a big nest of all
possible values.

With this in place the main missing part is to hook in the switch
which selects how to parse based on the key.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250515231650.1325372-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agotools: ynl-gen: submsg: plumb thru an empty type
Jakub Kicinski [Thu, 15 May 2025 23:16:45 +0000 (16:16 -0700)]
tools: ynl-gen: submsg: plumb thru an empty type

Hook in handling of sub-messages, for now treat them as ignored attrs.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250515231650.1325372-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agotools: ynl-gen: prepare for submsg structs
Jakub Kicinski [Thu, 15 May 2025 23:16:44 +0000 (16:16 -0700)]
tools: ynl-gen: prepare for submsg structs

Prepare for constructing Struct() instances which represent
sub-messages rather than nested attributes.
Restructure the code / indentation to more easily insert
a case where nested reference comes from annotation other
than the 'nested-attributes' property. Make sure we don't
construct the Struct() object from scratch in multiple
places as the constructor will soon have more arguments.

This should cause no functional change.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250515231650.1325372-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agotools: ynl-gen: factor out the annotation of pure nested struct
Jakub Kicinski [Thu, 15 May 2025 23:16:43 +0000 (16:16 -0700)]
tools: ynl-gen: factor out the annotation of pure nested struct

We're about to add some code here for sub-messages.
Factor out the nest-related logic to make the code readable.
No functional change.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250515231650.1325372-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agonetlink: specs: rt-link: add C naming info for ovpn
Jakub Kicinski [Thu, 15 May 2025 23:16:42 +0000 (16:16 -0700)]
netlink: specs: rt-link: add C naming info for ovpn

C naming info for OVPN which was added since I adjusted
the existing attrs. Also add missing reference to a header needed
for a bridge struct.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250515231650.1325372-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agonet: phy: microchip: document where the LAN88xx PHYs are used
Oleksij Rempel [Thu, 15 May 2025 08:20:51 +0000 (10:20 +0200)]
net: phy: microchip: document where the LAN88xx PHYs are used

The driver uses the name LAN88xx for PHYs with phy_id = 0x0007c132. But
with this placeholder name no documentation can be found on the net.

Document the fact that these PHYs are build into the LAN7800 and LAN7850
USB/Ethernet controllers.

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20250515082051.2644450-1-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agonet: phy: fixed_phy: remove fixed_phy_register_with_gpiod
Heiner Kallweit [Wed, 14 May 2025 18:54:29 +0000 (20:54 +0200)]
net: phy: fixed_phy: remove fixed_phy_register_with_gpiod

Since its introduction 6 yrs ago this functions has never had a user.
So remove it.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://patch.msgid.link/ccbeef28-65ae-4e28-b1db-816c44338dee@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agonet: rfs: add sock_rps_delete_flow() helper
Eric Dumazet [Thu, 15 May 2025 10:03:54 +0000 (10:03 +0000)]
net: rfs: add sock_rps_delete_flow() helper

RFS can exhibit lower performance for workloads using short-lived
flows and a small set of 4-tuple.

This is often the case for load-testers, using a pair of hosts,
if the server has a single listener port.

Typical use case :

Server : tcp_crr -T128 -F1000 -6 -U -l30 -R 14250
Client : tcp_crr -T128 -F1000 -6 -U -l30 -c -H server | grep local_throughput

This is because RFS global hash table contains stale information,
when the same RSS key is recycled for another socket and another cpu.

Make sure to undo the changes and go back to initial state when
a flow is disconnected.

Performance of the above test is increased by 22 %,
going from 372604 transactions per second to 457773.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Octavian Purdila <tavip@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://patch.msgid.link/20250515100354.3339920-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agor8169: add support for RTL8127A
ChunHao Lin [Thu, 15 May 2025 09:53:03 +0000 (17:53 +0800)]
r8169: add support for RTL8127A

This adds support for 10Gbs chip RTL8127A.

Signed-off-by: ChunHao Lin <hau@realtek.com>
Reviewed-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://patch.msgid.link/20250515095303.3138-1-hau@realtek.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agonet: dlink: add synchronization for stats update
Moon Yeounsu [Thu, 15 May 2025 07:53:31 +0000 (16:53 +0900)]
net: dlink: add synchronization for stats update

This patch synchronizes code that accesses from both user-space
and IRQ contexts. The `get_stats()` function can be called from both
context.

`dev->stats.tx_errors` and `dev->stats.collisions` are also updated
in the `tx_errors()` function. Therefore, these fields must also be
protected by synchronized.

There is no code that accessses `dev->stats.tx_errors` between the
previous and updated lines, so the updating point can be moved.

Signed-off-by: Moon Yeounsu <yyyynoom@gmail.com>
Link: https://patch.msgid.link/20250515075333.48290-1-yyyynoom@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agonet/mlx5e: Reuse per-RQ XDP buffer to avoid stack zeroing overhead
Carolina Jubran [Wed, 14 May 2025 20:03:52 +0000 (23:03 +0300)]
net/mlx5e: Reuse per-RQ XDP buffer to avoid stack zeroing overhead

CONFIG_INIT_STACK_ALL_ZERO introduces a performance cost by
zero-initializing all stack variables on function entry. The mlx5 XDP
RX path previously allocated a struct mlx5e_xdp_buff on the stack per
received CQE, resulting in measurable performance degradation under
this config.

This patch reuses a mlx5e_xdp_buff stored in the mlx5e_rq struct,
avoiding per-CQE stack allocations and repeated zeroing.

With this change, XDP_DROP and XDP_TX performance matches that of
kernels built without CONFIG_INIT_STACK_ALL_ZERO.

Performance was measured on a ConnectX-6Dx using a single RX channel
(1 CPU at 100% usage) at ~50 Mpps. The baseline results were taken from
net-next-6.15.

Stack zeroing disabled:
- XDP_DROP:
    * baseline:                     31.47 Mpps
    * baseline + per-RQ allocation: 32.31 Mpps (+2.68%)

- XDP_TX:
    * baseline:                     12.41 Mpps
    * baseline + per-RQ allocation: 12.95 Mpps (+4.30%)

Stack zeroing enabled:
- XDP_DROP:
    * baseline:                     24.32 Mpps
    * baseline + per-RQ allocation: 32.27 Mpps (+32.7%)

- XDP_TX:
    * baseline:                     11.80 Mpps
    * baseline + per-RQ allocation: 12.24 Mpps (+3.72%)

Reported-by: Sebastiano Miano <mianosebastiano@gmail.com>
Reported-by: Samuel Dobron <sdobron@redhat.com>
Link: https://lore.kernel.org/all/CAMENy5pb8ea+piKLg5q5yRTMZacQqYWAoVLE1FE9WhQPq92E0g@mail.gmail.com/
Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://patch.msgid.link/1747253032-663457-1-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agonet: phy: mediatek: do not require syscon compatible for pio property
Frank Wunderlich [Sat, 10 May 2025 17:49:32 +0000 (19:49 +0200)]
net: phy: mediatek: do not require syscon compatible for pio property

Current implementation requires syscon compatible for pio property
which is used for driving the switch leds on mt7988.

Replace syscon_regmap_lookup_by_phandle with of_parse_phandle and
device_node_to_regmap to get the regmap already assigned by pinctrl
driver.

Signed-off-by: Frank Wunderlich <frank-w@public-files.de>
Link: https://patch.msgid.link/20250510174933.154589-1-linux@fw-web.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agoidpf: add support for Rx timestamping
Milena Olech [Wed, 16 Apr 2025 12:19:25 +0000 (14:19 +0200)]
idpf: add support for Rx timestamping

Add Rx timestamp function when the Rx timestamp value is read directly
from the Rx descriptor. In order to extend the Rx timestamp value to 64
bit in hot path, the PHC time is cached in the receive groups.
Add supported Rx timestamp modes.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Milena Olech <milena.olech@intel.com>
Tested-by: YiFei Zhu <zhuyifei@google.com>
Tested-by: Mina Almasry <almasrymina@google.com>
Tested-by: Samuel Salin <Samuel.salin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
5 months agoidpf: add Tx timestamp flows
Milena Olech [Wed, 16 Apr 2025 12:19:23 +0000 (14:19 +0200)]
idpf: add Tx timestamp flows

Add functions to request Tx timestamp for the PTP packets, read the Tx
timestamp when the completion tag for that packet is being received,
extend the Tx timestamp value and set the supported timestamping modes.

Tx timestamp is requested for the PTP packets by setting a TSYN bit and
index value in the Tx context descriptor. The driver assumption is that
the Tx timestamp value is ready to be read when the completion tag is
received. Then the driver schedules delayed work and the Tx timestamp
value read is requested through virtchnl message. At the end, the Tx
timestamp value is extended to 64-bit and provided back to the skb.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Co-developed-by: Josh Hay <joshua.a.hay@intel.com>
Signed-off-by: Josh Hay <joshua.a.hay@intel.com>
Signed-off-by: Milena Olech <milena.olech@intel.com>
Tested-by: Samuel Salin <Samuel.salin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
5 months agoidpf: add Tx timestamp capabilities negotiation
Milena Olech [Wed, 16 Apr 2025 12:19:20 +0000 (14:19 +0200)]
idpf: add Tx timestamp capabilities negotiation

Tx timestamp capabilities are negotiated for the uplink Vport.
Driver receives information about the number of available Tx timestamp
latches, the size of Tx timestamp value and the set of indexes used
for Tx timestamping.

Add function to get the Tx timestamp capabilities and parse the uplink
vport flag.

Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Co-developed-by: Emil Tantilov <emil.s.tantilov@intel.com>
Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
Co-developed-by: Pavan Kumar Linga <pavan.kumar.linga@intel.com>
Signed-off-by: Pavan Kumar Linga <pavan.kumar.linga@intel.com>
Signed-off-by: Milena Olech <milena.olech@intel.com>
Tested-by: Samuel Salin <Samuel.salin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
5 months agoidpf: add PTP clock configuration
Milena Olech [Wed, 16 Apr 2025 12:19:18 +0000 (14:19 +0200)]
idpf: add PTP clock configuration

PTP clock configuration operations - set time, adjust time and adjust
frequency are required to control the clock and maintain synchronization
process.

Extend get PTP capabilities function to request for the clock adjustments
and add functions to enable these actions using dedicated virtchnl
messages.

Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Milena Olech <milena.olech@intel.com>
Tested-by: Mina Almasry <almasrymina@google.com>
Tested-by: Samuel Salin <Samuel.salin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
5 months agoidpf: add mailbox access to read PTP clock time
Milena Olech [Wed, 16 Apr 2025 12:19:11 +0000 (14:19 +0200)]
idpf: add mailbox access to read PTP clock time

When the access to read PTP clock is specified as mailbox, the driver
needs to send virtchnl message to perform PTP actions. Message is sent
using idpf_mbq_opc_send_msg_to_peer_drv mailbox opcode, with the parameters
received during PTP capabilities negotiation.

Add functions to recognize PTP messages, move them to dedicated secondary
mailbox, read the PTP clock time and cross timestamp using mailbox
messages.

Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Milena Olech <milena.olech@intel.com>
Tested-by: Samuel Salin <Samuel.salin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
5 months agoidpf: negotiate PTP capabilities and get PTP clock
Milena Olech [Wed, 16 Apr 2025 12:19:08 +0000 (14:19 +0200)]
idpf: negotiate PTP capabilities and get PTP clock

PTP capabilities are negotiated using virtchnl command. Add get
capabilities function, direct access to read the PTP clock.
Set initial PTP capabilities exposed to the stack.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Milena Olech <milena.olech@intel.com>
Tested-by: Willem de Bruijn <willemb@google.com>
Tested-by: Mina Almasry <almasrymina@google.com>
Tested-by: Samuel Salin <Samuel.salin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
5 months agoidpf: move virtchnl structures to the header file
Milena Olech [Wed, 16 Apr 2025 12:19:06 +0000 (14:19 +0200)]
idpf: move virtchnl structures to the header file

Move virtchnl structures to the header file to expose them for the PTP
virtchnl file.

Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Milena Olech <milena.olech@intel.com>
Tested-by: Mina Almasry <almasrymina@google.com>
Tested-by: Samuel Salin <Samuel.salin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
5 months agovirtchnl: add PTP virtchnl definitions
Milena Olech [Wed, 16 Apr 2025 12:19:04 +0000 (14:19 +0200)]
virtchnl: add PTP virtchnl definitions

PTP capabilities are negotiated using virtchnl commands. There are two
available modes of the PTP support: direct and mailbox. When the direct
access to PTP resources is negotiated, virtchnl messages returns a set
of registers that allow read/write directly. When the mailbox access to
PTP resources is negotiated, virtchnl messages are used to access
PTP clock and to read the timestamp values.

Virtchnl API covers both modes and exposes a set of PTP capabilities.

Using virtchnl API, the driver recognizes also HW abilities - maximum
adjustment of the clock and the basic increment value.

Additionally, API allows to configure the secondary mailbox, dedicated
exclusively for PTP purposes.

Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Milena Olech <milena.olech@intel.com>
Tested-by: Mina Almasry <almasrymina@google.com>
Tested-by: Samuel Salin <Samuel.salin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
5 months agoidpf: add initial PTP support
Milena Olech [Wed, 16 Apr 2025 12:19:02 +0000 (14:19 +0200)]
idpf: add initial PTP support

PTP feature is supported if the VIRTCHNL2_CAP_PTP is negotiated during the
capabilities recognition. Initial PTP support includes PTP initialization
and registration of the clock.

Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Milena Olech <milena.olech@intel.com>
Tested-by: Mina Almasry <almasrymina@google.com>
Tested-by: Samuel Salin <Samuel.salin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
5 months agoidpf: change the method for mailbox workqueue allocation
Milena Olech [Wed, 16 Apr 2025 12:19:00 +0000 (14:19 +0200)]
idpf: change the method for mailbox workqueue allocation

Since workqueues are created per CPU, the works scheduled to this
workqueues are run on the CPU they were assigned. It may result in
overloaded CPU that is not able to handle virtchnl messages in
relatively short time. Allocating workqueue with WQ_UNBOUND and
WQ_HIGHPRI flags allows scheduler to queue virtchl messages on less loaded
CPUs, what eliminates delays.

Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Milena Olech <milena.olech@intel.com>
Tested-by: Samuel Salin <Samuel.salin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
5 months agonet: stmmac: convert to ndo_hwtstamp_get() and ndo_hwtstamp_set()
Vladimir Oltean [Wed, 14 May 2025 14:32:49 +0000 (17:32 +0300)]
net: stmmac: convert to ndo_hwtstamp_get() and ndo_hwtstamp_set()

New timestamping API was introduced in commit 66f7223039c0 ("net: add
NDOs for configuring hardware timestamping") from kernel v6.6.

It is time to convert the stmmac driver to the new API, so that
timestamping configuration can be removed from the ndo_eth_ioctl()
path completely.

The existing timestamping calls are guarded by netif_running(). For
stmmac_hwtstamp_get() that is probably unnecessary, since no hardware
access is performed. But for stmmac_hwtstamp_set() I've preserved it,
since at least some IPs probably need pm_runtime_resume_and_get() to
access registers, which is otherwise called by __stmmac_open().

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://patch.msgid.link/20250514143249.1808377-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agonet: lan743x: implement ndo_hwtstamp_get()
Vladimir Oltean [Wed, 14 May 2025 15:19:30 +0000 (18:19 +0300)]
net: lan743x: implement ndo_hwtstamp_get()

Permit programs such as "hwtstamp_ctl -i eth0" to retrieve the current
timestamping configuration of the NIC, rather than returning "Device
driver does not have support for non-destructive SIOCGHWTSTAMP."

The driver configures all channels with the same timestamping settings.
On TX, retrieve the settings of the first channel, those should be
representative for the entire NIC. On RX, save the filter settings in a
new adapter field.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://patch.msgid.link/20250514151931.1988047-2-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agonet: lan743x: convert to ndo_hwtstamp_set()
Vladimir Oltean [Wed, 14 May 2025 15:19:29 +0000 (18:19 +0300)]
net: lan743x: convert to ndo_hwtstamp_set()

New timestamping API was introduced in commit 66f7223039c0 ("net: add
NDOs for configuring hardware timestamping") from kernel v6.6.

It is time to convert the lan743x driver to the new API, so that
timestamping configuration can be removed from the ndo_eth_ioctl()
path completely.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://patch.msgid.link/20250514151931.1988047-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agotools: ynl-gen: array-nest: support arrays of nests
Jakub Kicinski [Tue, 13 May 2025 22:20:11 +0000 (15:20 -0700)]
tools: ynl-gen: array-nest: support arrays of nests

TC needs arrays of nests, but just a put for now.
Fairly straightforward addition.

Link: https://patch.msgid.link/20250513222011.844106-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agonet: sched: uapi: add more sanely named duplicate defines
Jakub Kicinski [Tue, 13 May 2025 22:17:52 +0000 (15:17 -0700)]
net: sched: uapi: add more sanely named duplicate defines

The TCA_FLOWER_KEY_CFM enum has a UNSPEC and MAX with _OPT
in the name, but the real attributes don't. Add a MAX that
more reasonably matches the attrs.

The PAD in TCA_TAPRIO is the only attr which doesn't have
_ATTR in it, perhaps signifying that it's not a real attr?
If so interesting idea in abstract but it makes codegen painful.

Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250513221752.843102-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agoMerge branch 'tcp-receive-side-improvements'
Jakub Kicinski [Thu, 15 May 2025 18:30:11 +0000 (11:30 -0700)]
Merge branch 'tcp-receive-side-improvements'

Eric Dumazet says:

====================
tcp: receive side improvements

We have set tcp_rmem[2] to 15 MB for about 8 years at Google,
but had some issues for high speed flows on very small RTT.

TCP rx autotuning has a tendency to overestimate the RTT,
thus tp->rcvq_space.space and sk->sk_rcvbuf.

This makes TCP receive queues much bigger than necessary,
to a point cpu caches are evicted before application can
copy the data, on cpus using DDIO.

This series aims to fix this.

- First patch adds tcp_rcvbuf_grow() tracepoint, which was very
  convenient to study the various issues fixed in this series.

- Seven patches fix receiver autotune issues.

- Two patches fix sender side issues.

- Final patch increases tcp_rmem[2] so that TCP speed over WAN
  can meet modern needs.

Tested on a 200Gbit NIC, average max throughput of a single flow:

Before:
 73593 Mbit.

After:
 122514 Mbit.
====================

Link: https://patch.msgid.link/20250513193919.1089692-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agotcp: increase tcp_rmem[2] to 32 MB
Eric Dumazet [Tue, 13 May 2025 19:39:19 +0000 (19:39 +0000)]
tcp: increase tcp_rmem[2] to 32 MB

Last change to tcp_rmem[2] happened in 2012, in commit b49960a05e32
("tcp: change tcp_adv_win_scale and tcp_rmem[2]")

TCP performance on WAN is mostly limited by tcp_rmem[2] for receivers.

After this series improvements, it is time to increase the default.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250513193919.1089692-12-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agotcp: always use tcp_limit_output_bytes limitation
Eric Dumazet [Tue, 13 May 2025 19:39:18 +0000 (19:39 +0000)]
tcp: always use tcp_limit_output_bytes limitation

This partially reverts commit c73e5807e4f6 ("tcp: tsq: no longer use
limit_output_bytes for paced flows")

Overriding the tcp_limit_output_bytes sysctl value
for FQ enabled flows has the following problem:

It allows TCP to queue around 2 ms worth of data per flow,
defeating tcp_rcv_rtt_update() accuracy on the receiver,
forcing it to increase sk->sk_rcvbuf even if the real
RTT is around 100 us.

After this change, we keep enough packets in flight to fill
the pipe, and let receive queues small enough to get
good cache behavior (cpu caches and/or NIC driver page pools).

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250513193919.1089692-11-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agotcp: increase tcp_limit_output_bytes default value to 4MB
Eric Dumazet [Tue, 13 May 2025 19:39:17 +0000 (19:39 +0000)]
tcp: increase tcp_limit_output_bytes default value to 4MB

Last change happened in 2018 with commit c73e5807e4f6
("tcp: tsq: no longer use limit_output_bytes for paced flows")

Modern NIC speeds got a 4x increase since then.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250513193919.1089692-10-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agotcp: skip big rtt sample if receive queue is not empty
Eric Dumazet [Tue, 13 May 2025 19:39:16 +0000 (19:39 +0000)]
tcp: skip big rtt sample if receive queue is not empty

tcp_rcv_rtt_update() role is to keep an estimation
of RTT (tp->rcv_rtt_est.rtt_us) for receivers.

If an application is too slow to drain the TCP receive
queue, it is better to leave the RTT estimation small,
so that tcp_rcv_space_adjust() does not inflate
tp->rcvq_space.space and sk->sk_rcvbuf.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250513193919.1089692-9-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agotcp: always seek for minimal rtt in tcp_rcv_rtt_update()
Eric Dumazet [Tue, 13 May 2025 19:39:15 +0000 (19:39 +0000)]
tcp: always seek for minimal rtt in tcp_rcv_rtt_update()

tcp_rcv_rtt_update() goal is to maintain an estimation of the RTT
in tp->rcv_rtt_est.rtt_us, used by tcp_rcv_space_adjust()

When TCP TS are enabled, tcp_rcv_rtt_update() is using
EWMA to smooth the samples.

Change this to immediately latch the incoming value if it
is lower than tp->rcv_rtt_est.rtt_us, so that tcp_rcv_space_adjust()
does not overshoot tp->rcvq_space.space and sk->sk_rcvbuf.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250513193919.1089692-8-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agotcp: fix initial tp->rcvq_space.space value for passive TS enabled flows
Eric Dumazet [Tue, 13 May 2025 19:39:14 +0000 (19:39 +0000)]
tcp: fix initial tp->rcvq_space.space value for passive TS enabled flows

tcp_rcv_state_process() must tweak tp->advmss for TS enabled flows
before the call to tcp_init_transfer() / tcp_init_buffer_space().

Otherwise tp->rcvq_space.space is off by 120 bytes
(TCP_INIT_CWND * TCPOLEN_TSTAMP_ALIGNED).

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Wei Wang <weiwan@google.com>
Link: https://patch.msgid.link/20250513193919.1089692-7-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agotcp: remove zero TCP TS samples for autotuning
Eric Dumazet [Tue, 13 May 2025 19:39:13 +0000 (19:39 +0000)]
tcp: remove zero TCP TS samples for autotuning

For TCP flows using ms RFC 7323 timestamp granularity
tcp_rcv_rtt_update() can be fed with 1 ms samples, breaking
TCP autotuning for data center flows with sub ms RTT.

Instead, rely on the window based samples, fed by tcp_rcv_rtt_measure()

tcp_rcvbuf_grow() for a 10 second TCP_STREAM sesssion now looks saner.
We can see rcvbuf is kept at a reasonable value.

  222.234976: tcp:tcp_rcvbuf_grow: time=348 rtt_us=330 copied=110592 inq=0 space=40960 ooo=0 scaling_ratio=230 rcvbuf=131072 ...
  222.235276: tcp:tcp_rcvbuf_grow: time=300 rtt_us=288 copied=126976 inq=0 space=110592 ooo=0 scaling_ratio=230 rcvbuf=246187 ...
  222.235569: tcp:tcp_rcvbuf_grow: time=294 rtt_us=288 copied=184320 inq=0 space=126976 ooo=0 scaling_ratio=230 rcvbuf=282659 ...
  222.235833: tcp:tcp_rcvbuf_grow: time=264 rtt_us=244 copied=373760 inq=0 space=184320 ooo=0 scaling_ratio=230 rcvbuf=410312 ...
  222.236142: tcp:tcp_rcvbuf_grow: time=308 rtt_us=219 copied=424960 inq=20480 space=373760 ooo=0 scaling_ratio=230 rcvbuf=832022 ...
  222.236378: tcp:tcp_rcvbuf_grow: time=236 rtt_us=219 copied=692224 inq=49152 space=404480 ooo=0 scaling_ratio=230 rcvbuf=900407 ...
  222.236602: tcp:tcp_rcvbuf_grow: time=225 rtt_us=219 copied=730112 inq=49152 space=643072 ooo=0 scaling_ratio=230 rcvbuf=1431534 ...
  222.237050: tcp:tcp_rcvbuf_grow: time=229 rtt_us=219 copied=1160192 inq=49152 space=680960 ooo=0 scaling_ratio=230 rcvbuf=1515876 ...
  222.237618: tcp:tcp_rcvbuf_grow: time=305 rtt_us=218 copied=2228224 inq=49152 space=1111040 ooo=0 scaling_ratio=230 rcvbuf=2473271 ...
  222.238591: tcp:tcp_rcvbuf_grow: time=224 rtt_us=218 copied=3063808 inq=360448 space=2179072 ooo=0 scaling_ratio=230 rcvbuf=4850803 ...
  222.240647: tcp:tcp_rcvbuf_grow: time=260 rtt_us=218 copied=2752512 inq=0 space=2703360 ooo=0 scaling_ratio=230 rcvbuf=6017914 ...
  222.243535: tcp:tcp_rcvbuf_grow: time=224 rtt_us=218 copied=2834432 inq=49152 space=2752512 ooo=0 scaling_ratio=230 rcvbuf=6127331 ...
  222.245108: tcp:tcp_rcvbuf_grow: time=240 rtt_us=218 copied=2883584 inq=49152 space=2785280 ooo=0 scaling_ratio=230 rcvbuf=6200275 ...
  222.245333: tcp:tcp_rcvbuf_grow: time=224 rtt_us=218 copied=2859008 inq=0 space=2834432 ooo=0 scaling_ratio=230 rcvbuf=6309692 ...
  222.301021: tcp:tcp_rcvbuf_grow: time=222 rtt_us=218 copied=2883584 inq=0 space=2859008 ooo=0 scaling_ratio=230 rcvbuf=6364400 ...
  222.989242: tcp:tcp_rcvbuf_grow: time=225 rtt_us=218 copied=2899968 inq=0 space=2883584 ooo=0 scaling_ratio=230 rcvbuf=6419108 ...
  224.139553: tcp:tcp_rcvbuf_grow: time=224 rtt_us=218 copied=3014656 inq=65536 space=2899968 ooo=0 scaling_ratio=230 rcvbuf=6455580 ...
  224.584608: tcp:tcp_rcvbuf_grow: time=232 rtt_us=218 copied=3014656 inq=49152 space=2949120 ooo=0 scaling_ratio=230 rcvbuf=6564997 ...
  230.145560: tcp:tcp_rcvbuf_grow: time=223 rtt_us=218 copied=2981888 inq=0 space=2965504 ooo=0 scaling_ratio=230 rcvbuf=6601469 ...

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Wei Wang <weiwan@google.com>
Link: https://patch.msgid.link/20250513193919.1089692-6-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agotcp: add receive queue awareness in tcp_rcv_space_adjust()
Eric Dumazet [Tue, 13 May 2025 19:39:12 +0000 (19:39 +0000)]
tcp: add receive queue awareness in tcp_rcv_space_adjust()

If the application can not drain fast enough a TCP socket queue,
tcp_rcv_space_adjust() can overestimate tp->rcvq_space.space.

Then sk->sk_rcvbuf can grow and hit tcp_rmem[2] for no good reason.

Fix this by taking into acount the number of available bytes.

Keeping sk->sk_rcvbuf at the right size allows better cache efficiency.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Wei Wang <weiwan@google.com>
Link: https://patch.msgid.link/20250513193919.1089692-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>