www.infradead.org Git - users/jedix/linux-maple.git/log

]> www.infradead.org Git - users/jedix/linux-maple.git/log

projects / users / jedix / linux-maple.git / log

Phil Sutter [Wed, 21 May 2025 20:44:22 +0000 (22:44 +0200)]

netfilter: nf_tables: Introduce functions freeing nft_hook objects

Pointless wrappers around kfree() for now, prep work for an embedded
list of nf_hook_ops.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

commit | commitdiff | tree

Florian Westphal [Thu, 22 May 2025 13:49:34 +0000 (15:49 +0200)]

netfilter: nf_tables: add packets conntrack state to debug trace info

Add the minimal relevant info needed for userspace ("nftables monitor
trace") to provide the conntrack view of the packet:

- state (new, related, established)
- direction (original, reply)
- status (e.g., if connection is subject to dnat)
- id (allows to query ctnetlink for remaining conntrack state info)

Example:
trace id a62 inet filter PRE_RAW packet: iif "enp0s3" ether [..]
[..]
trace id a62 inet filter PRE_MANGLE conntrack: ct direction original ct state new ct id 32
trace id a62 inet filter PRE_MANGLE packet: [..]
[..]
trace id a62 inet filter IN conntrack: ct direction original ct state new ct status dnat-done ct id 32
[..]

In this case one can see that while NAT is active, the new connection
isn't subject to a translation.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

commit | commitdiff | tree

Florian Westphal [Thu, 22 May 2025 13:49:33 +0000 (15:49 +0200)]

netfilter: conntrack: make nf_conntrack_id callable without a module dependency

While nf_conntrack_id() doesn't need any functionaliy from conntrack, it
does reside in nf_conntrack_core.c -- callers add a module
dependency on conntrack.

Followup patch will need to compute the conntrack id from nf_tables_trace.c
to include it in nf_trace messages emitted to userspace via netlink.

I don't want to introduce a module dependency between nf_tables and
conntrack for this.

Since trace is slowpath, the added indirection is ok.

One alternative is to move nf_conntrack_id to the netfilter/core.c,
but I don't see a compelling reason so far.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

commit | commitdiff | tree

Sebastian Andrzej Siewior [Mon, 12 May 2025 10:28:46 +0000 (12:28 +0200)]

netfilter: nf_dup_netdev: Move the recursion counter struct netdev_xmit

nf_dup_skb_recursion is a per-CPU variable and relies on disabled BH for its
locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT
this data structure requires explicit locking.

Move nf_dup_skb_recursion to struct netdev_xmit, provide wrappers.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

commit | commitdiff | tree

Sebastian Andrzej Siewior [Mon, 12 May 2025 10:28:45 +0000 (12:28 +0200)]

netfilter: nft_inner: Use nested-BH locking for nft_pcpu_tun_ctx

nft_pcpu_tun_ctx is a per-CPU variable and relies on disabled BH for its
locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT
this data structure requires explicit locking.

Make a struct with a nft_inner_tun_ctx member (original
nft_pcpu_tun_ctx) and a local_lock_t and use local_lock_nested_bh() for
locking. This change adds only lockdep coverage and does not alter the
functional behaviour for !PREEMPT_RT.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

commit | commitdiff | tree

Sebastian Andrzej Siewior [Mon, 12 May 2025 10:28:44 +0000 (12:28 +0200)]

netfilter: nf_dup{4, 6}: Move duplication check to task_struct

nf_skb_duplicated is a per-CPU variable and relies on disabled BH for its
locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT
this data structure requires explicit locking.

Due to the recursion involved, the simplest change is to make it a
per-task variable.

Move the per-CPU variable nf_skb_duplicated to task_struct and name it
in_nf_duplicate. Add it to the existing bitfield so it doesn't use
additional memory.

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Valentin Schneider <vschneid@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

commit | commitdiff | tree

Fernando Fernandez Mancera [Wed, 21 May 2025 09:41:08 +0000 (11:41 +0200)]

netfilter: nft_tunnel: fix geneve_opt dump

When dumping a nft_tunnel with more than one geneve_opt configured the
netlink attribute hierarchy should be as follow:

NFTA_TUNNEL_KEY_OPTS
|
|--NFTA_TUNNEL_KEY_OPTS_GENEVE
|  |
|  |--NFTA_TUNNEL_KEY_GENEVE_CLASS
|  |--NFTA_TUNNEL_KEY_GENEVE_TYPE
|  |--NFTA_TUNNEL_KEY_GENEVE_DATA
|
|--NFTA_TUNNEL_KEY_OPTS_GENEVE
|  |
|  |--NFTA_TUNNEL_KEY_GENEVE_CLASS
|  |--NFTA_TUNNEL_KEY_GENEVE_TYPE
|  |--NFTA_TUNNEL_KEY_GENEVE_DATA
|
|--NFTA_TUNNEL_KEY_OPTS_GENEVE
...

Otherwise, userspace tools won't be able to fetch the geneve options
configured correctly.

Fixes: 925d844696d9 ("netfilter: nft_tunnel: add support for geneve opts")
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

commit | commitdiff | tree

Florian Westphal [Wed, 21 May 2025 09:38:49 +0000 (11:38 +0200)]

selftests: netfilter: nft_fib.sh: add type and oif tests with and without VRFs

Replace the existing VRF test with a more comprehensive one.

It tests following combinations:
- fib type (returns address type, e.g. unicast)
- fib oif (route output interface index
- both with and without 'iif' keyword (changes result, e.g.
  'fib daddr type local' will be true when the destination address
  is configured on the local machine, but
  'fib daddr . iif type local' will only be true when the destination
  address is configured on the incoming interface.

Add all types of addresses to test with for both ipv4 and ipv6:
- local address on the incoming interface
- local address on another interface
- local address on another interface thats part of a vrf
- address on another host

The ruleset stores obtained results from 'fib' in nftables sets and
then queries the sets to check that it has the expected results.

Perform one pass while packets are coming in on interface NOT part of
a VRF and then again when it was added and make sure fib returns the
expected routes and address types for the various addresses in the
setup.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

commit | commitdiff | tree

Florian Westphal [Wed, 21 May 2025 09:38:48 +0000 (11:38 +0200)]

netfilter: nf_tables: nft_fib: consistent l3mdev handling

fib has two modes:
1. Obtain output device according to source or destination address
2. Obtain the type of the address, e.g. local, unicast, multicast.

'fib daddr type' should return 'local' if the address is configured
in this netns or unicast otherwise.

'fib daddr . iif type' should return 'local' if the address is configured
on the input interface or unicast otherwise, i.e. more restrictive.

However, if the interface is part of a VRF, then 'fib daddr type'
returns unicast even if the address is configured on the incoming
interface.

This is broken for both ipv4 and ipv6.

In the ipv4 case, inet_dev_addr_type must only be used if the
'iif' or 'oif' (strict mode) was requested.

Else inet_addr_type_dev_table() needs to be used and the correct
dev argument must be passed as well so the correct fib (vrf) table
is used.

In the ipv6 case, the bug is similar, without strict mode, dev is NULL
so .flowi6_l3mdev will be set to 0.

Add a new 'nft_fib_l3mdev_master_ifindex_rcu()' helper and use that
to init the .l3mdev structure member.

For ipv6, use it from nft_fib6_flowi_init() which gets called from
both the 'type' and the 'route' mode eval functions.

This provides consistent behaviour for all modes for both ipv4 and ipv6:
If strict matching is requested, the input respectively output device
of the netfilter hooks is used.

Otherwise, use skb->dev to obtain the l3mdev ifindex.

Without this, most type checks in updated nft_fib.sh selftest fail:

  FAIL: did not find veth0 . 10.9.9.1 . local in fibtype4
  FAIL: did not find veth0 . dead:1::1 . local in fibtype6
  FAIL: did not find veth0 . dead:9::1 . local in fibtype6
  FAIL: did not find tvrf . 10.0.1.1 . local in fibtype4
  FAIL: did not find tvrf . 10.9.9.1 . local in fibtype4
  FAIL: did not find tvrf . dead:1::1 . local in fibtype6
  FAIL: did not find tvrf . dead:9::1 . local in fibtype6
  FAIL: fib expression address types match (iif in vrf)

(fib errounously returns 'unicast' for all of them, even
though all of these addresses are local to the vrf).

Fixes: f6d0cbcf09c5 ("netfilter: nf_tables: add fib expression")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

commit | commitdiff | tree

Florian Westphal [Wed, 21 May 2025 09:38:47 +0000 (11:38 +0200)]

netfilter: nf_tables: nft_fib_ipv6: fix VRF ipv4/ipv6 result discrepancy

With a VRF, ipv4 and ipv6 FIB expression behave differently.

   fib daddr . iif oif

Will return the input interface name for ipv4, but the real device
for ipv6.  Example:

If VRF device name is tvrf and real (incoming) device is veth0.
First round is ok, both ipv4 and ipv6 will yield 'veth0'.

But in the second round (incoming device will be set to "tvrf"), ipv4
will yield "tvrf" whereas ipv6 returns "veth0" for the second round too.

This makes ipv6 behave like ipv4.

A followup patch will add a test case for this, without this change
it will fail with:
  get element inet t fibif6iif { tvrf . dead:1::99 . tvrf }
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  FAIL: did not find tvrf . dead:1::99 . tvrf in fibif6iif

Alternatively we could either not do anything at all or change
ipv4 to also return the lower/real device, however, nft (userspace)
doc says "iif: if fib lookup provides a route then check its output
interface is identical to the packets input interface." which is what
the nft fib ipv4 behaviour is.

Fixes: f6d0cbcf09c5 ("netfilter: nf_tables: add fib expression")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

commit | commitdiff | tree

Florian Westphal [Wed, 21 May 2025 09:38:46 +0000 (11:38 +0200)]

selftests: netfilter: move fib vrf test to nft_fib.sh

It was located in conntrack_vrf.sh because that already had the VRF bits.
Lets not add to this and move it to nft_fib.sh where this belongs.

No functional changes for the subtest intended.
The subtest is limited, it only covered 'fib oif'
(route output interface query) when the incoming interface is part
of a VRF.

Next we can extend it to cover 'fib type' for VRFs and also check fib
results when there is an unrelated VRF in same netns.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

commit | commitdiff | tree

Florian Westphal [Wed, 21 May 2025 09:38:45 +0000 (11:38 +0200)]

selftests: netfilter: nft_fib.sh: add 'type' mode tests

fib can either lookup the interface id/name of the output interface that
would be used for the given address, or it can check for the type of the
address according to the fib, e.g. local, unicast, multicast and so on.

This can be used to e.g. make a locally configured address only reachable
through its interface.

Example: given eth0:10.1.1.1 and eth1:10.1.2.1 then 'fib daddr type' for
10.1.1.1 arriving on eth1 will be 'local', but 'fib daddr . iif type' is
expected to return 'unicast', whereas 'fib daddr' and 'fib daddr . iif'
are expected to indicate 'local' if such a packet arrives on eth0.

So far nft_fib.sh only covered oif/oifname, not type.

Repeat tests both with default and a policy (ip rule) based setup.

Also try to run all remaining tests even if a subtest has failed.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

commit | commitdiff | tree

Florian Westphal [Fri, 16 May 2025 14:12:13 +0000 (16:12 +0200)]

netfilter: xtables: support arpt_mark and ipv6 optstrip for iptables-nft only builds

Its now possible to build a kernel that has no support for the classic
xtables get/setsockopt interfaces and builtin tables.

In this case, we have CONFIG_IP6_NF_MANGLE=n and
CONFIG_IP_NF_ARPTABLES=n.

For optstript, the ipv6 code is so small that we can enable it if
netfilter ipv6 support exists. For mark, check if either classic
arptables or NFT_ARP_COMPAT is set.

Fixes: a9525c7f6219 ("netfilter: xtables: allow xtables-nft only builds")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

commit | commitdiff | tree

Florian Westphal [Tue, 6 May 2025 13:07:11 +0000 (15:07 +0200)]

selftests: netfilter: nft_concat_range.sh: add coverage for 4bit group representation

Pipapo supports a more compact '4 bit group' format that is chosen when
the memory needed for the default exceeds a threshold (2mb).

Add coverage for those code paths, the existing tests use small sets that
are handled by the default representation.

This comes with a test script run-time increase, but I think its ok:

normal: 2m35s -> 3m9s
debug: 3m24s -> 5m29s (with KSFT_MACHINE_SLOW=yes).

Cc: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

commit | commitdiff | tree

Jakub Kicinski [Tue, 20 May 2025 03:09:07 +0000 (20:09 -0700)]

Merge branch 'queue_api-reduce-risk-of-name-collision-over-txq'

Gur Stavi says:

====================
queue_api: reduce risk of name collision over txq

Rename local variable in macros from txq to _txq.
When macro parameter get_desc is expended it is likely to have a txq
token that refers to a different txq variable at the caller's site.
====================

Link: https://patch.msgid.link/cover.1747559621.git.gur.stavi@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Gur Stavi [Sun, 18 May 2025 10:00:54 +0000 (13:00 +0300)]

queue_api: reduce risk of name collision over txq

Rename local variable in macros from txq to _txq.
When macro parameter get_desc is expended it is likely to have a txq
token that refers to a different txq variable at the caller's site.

Signed-off-by: Gur Stavi <gur.stavi@huawei.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/95b60d218f004308486d92ed17c8cc6f28bac09d.1747559621.git.gur.stavi@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Jakub Kicinski [Tue, 20 May 2025 03:07:41 +0000 (20:07 -0700)]

Merge branch '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue

Tony Nguyen says:

====================
idpf: add initial PTP support

Milena Olech says:

This patch series introduces support for Precision Time Protocol (PTP) to
Intel(R) Infrastructure Data Path Function (IDPF) driver. PTP feature is
supported when the PTP capability is negotiated with the Control
Plane (CP). IDPF creates a PTP clock and sets a set of supported
functions.

During the PTP initialization, IDPF requests a set of PTP capabilities
and receives a writeback from the CP with the set of supported options.
These options are:
- get time of the PTP clock
- set the time of the PTP clock
- adjust the PTP clock
- Tx timestamping

Each feature is considered to have direct access, where the operations
on PCIe BAR registers are allowed, or the mailbox access, where the
virtchnl messages are used to perform any PTP action. Mailbox access
means that PTP requests are sent to the CP through dedicated secondary
mailbox and the CP reads/writes/modifies desired resource - PTP Clock
or Tx timestamp registers.

Tx timestamp capabilities are negotiated only for vports that have
UPLINK_VPORT flag set by the CP. Capabilities provide information about
the number of available Tx timestamp latches, their indexes and size of
the Tx timestamp value. IDPF requests Tx timestamp by setting the
TSYN bit and the requested timestamp index in the context descriptor for
the PTP packets. When the completion tag for that packet is received,
IDPF schedules a worker to read the Tx timestamp value.

* '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
  idpf: add support for Rx timestamping
  idpf: add Tx timestamp flows
  idpf: add Tx timestamp capabilities negotiation
  idpf: add PTP clock configuration
  idpf: add mailbox access to read PTP clock time
  idpf: negotiate PTP capabilities and get PTP clock
  idpf: move virtchnl structures to the header file
  virtchnl: add PTP virtchnl definitions
  idpf: add initial PTP support
  idpf: change the method for mailbox workqueue allocation
====================

Link: https://patch.msgid.link/20250516170645.1172700-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Sumanth Gavini [Fri, 16 May 2025 22:51:48 +0000 (15:51 -0700)]

selftests: drv-net: Fix "envirnoments" to "environments"

Fix misspelling reported by codespell

Signed-off-by: Sumanth Gavini <sumanth.gavini@yahoo.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250516225156.1122058-1-sumanth.gavini@yahoo.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Johannes Berg [Fri, 16 May 2025 11:59:27 +0000 (13:59 +0200)]

net: netlink: reduce extack cookie size

Seems like the extack cookie hasn't found any users outside
of wireless, which always uses nl_set_extack_cookie_u64().
Thus, allocating 20 bytes for it is pointless, reduce that
to 8 bytes, and add a BUILD_BUG_ON() to ensure it's enough
(obviously it is, for a u64, but in case it changes again.)

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://patch.msgid.link/20250516115927.38209-2-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

David S. Miller [Mon, 19 May 2025 11:10:43 +0000 (12:10 +0100)]

Merge tag 'ovpn-net-next-20250515' of https://github.com/OpenVPN/ovpn-net-next

Antonio Quartulli says:

====================
ovpn: pull request for net-next: ovpn 2025-05-15

this is a new version of the previous pull request.
These time I have removed the fixes that we are still discussing,
so that we don't hold the entire series back.

There is a new fix though: it's about properly checking the return value
of skb_to_sgvec_nomark(). I spotted the issue while testing pings larger
than the iface's MTU on a TCP VPN connection.

I have added various Closes and Link tags where applicable, so
that we have references to GitHub tickets and other public discussions.

Since I have resent the PR, I have also added Andrew's Reviewed-by to
the first patch.

Please pull or let me know if something should be changed!
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
Patchset highlights:
- update MAINTAINERS entry for ovpn
- extend selftest with more cases
- avoid crash in selftest in case of getaddrinfo() failure
- fix ndo_start_xmit return value on error
- set ignore_df flag for IPv6 packets
- drop useless reg_state check in keepalive worker
- retain skb's dst when entering xmit function
- fix check on skb_to_sgvec_nomark() return value

commit | commitdiff | tree

Jakub Kicinski [Sat, 17 May 2025 01:01:34 +0000 (18:01 -0700)]

Merge branch 'vsock-test-improve-sigpipe-test-reliability'

Stefano Garzarella says:

====================
vsock/test: improve sigpipe test reliability

Running the tests continuously I noticed that sometimes the sigpipe
test would fail due to a race between the control message of the test
and the vsock transport messages.

While I was at it I also improved the test by checking the errno we
expect.

v1: https://lore.kernel.org/20250508142005.135857-1-sgarzare@redhat.com
====================

Link: https://patch.msgid.link/20250514141927.159456-1-sgarzare@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Stefano Garzarella [Wed, 14 May 2025 14:19:27 +0000 (16:19 +0200)]

vsock/test: check also expected errno on sigpipe test

In the sigpipe test, we expect send() to fail, but we do not check if
send() fails with the errno we expect (EPIPE).

Add this check and repeat the send() in case of EINTR as we do in other
tests.

Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://patch.msgid.link/20250514141927.159456-4-sgarzare@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Stefano Garzarella [Wed, 14 May 2025 14:19:26 +0000 (16:19 +0200)]

vsock/test: retry send() to avoid occasional failure in sigpipe test

When the other peer calls shutdown(SHUT_RD), there is a chance that
the send() call could occur before the message carrying the close
information arrives over the transport. In such cases, the send()
might still succeed. To avoid this race, let's retry the send() call
a few times, ensuring the test is more reliable.

Sleep a little before trying again to avoid flooding the other peer
and filling its receive buffer, causing false-negative.

Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://patch.msgid.link/20250514141927.159456-3-sgarzare@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Stefano Garzarella [Wed, 14 May 2025 14:19:25 +0000 (16:19 +0200)]

vsock/test: add timeout_usleep() to allow sleeping in timeout sections

The timeout API uses signals, so we have documented not to use sleep(),
but we can use nanosleep(2) since POSIX.1 explicitly specifies that it
does not interact with signals.

Let's provide timeout_usleep() for that.

Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://patch.msgid.link/20250514141927.159456-2-sgarzare@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Jakub Kicinski [Fri, 16 May 2025 23:32:08 +0000 (16:32 -0700)]

Merge branch 'tools-ynl-gen-support-sub-messages-and-rt-link'

Jakub Kicinski says:

====================
tools: ynl-gen: support sub-messages and rt-link

Sub-messages are how we express "polymorphism" in YNL. Donald added
the support to specs and Python a while back, support them in C, too.
Sub-message is a nest, but the interpretation of the attribute types
within that nest depends on a value of another attribute. For example
in rt-link the "kind" attribute contains the link type (veth, bonding,
etc.) and based on that the right enum has to be applied to interpret
link-specific attributes.

The last message is probably the most interesting to look at, as it
adds a fairly advanced sample.

This patch only contains enough support for rtnetlink, we will need
a little more complexity to support TC, where sub-messages may contain
fixed headers, and where the selector may be in a different nest than
the submessage.
====================

Link: https://patch.msgid.link/20250515231650.1325372-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Jakub Kicinski [Thu, 15 May 2025 23:16:50 +0000 (16:16 -0700)]

tools: ynl: add a sample for rt-link

Add a fairly complete example of rt-link usage. If run without any
arguments it simply lists the interfaces and some of their attrs.
If run with an arg it tries to create and delete a netkit device.

1 # ./tools/net/ynl/samples/rt-link 1
2 Trying to create a Netkit interface
3 Testing error message for policy being bad:
4     Kernel error: 'Provided default xmit policy not supported' (bad attribute: .linkinfo.data(netkit).policy)
5   1:               lo: mtu 65536
6   2:           wlp0s1: mtu  1500
7   3:          enp0s13: mtu  1500
8   4:           dummy0: mtu  1500  kind dummy     altname one two
9   5:              nk0: mtu  1500  kind netkit    primary 0  policy forward
10   6:              nk1: mtu  1500  kind netkit    primary 1  policy blackhole
11 Trying to delete a Netkit interface (ifindex 6)

Sample creates the device first, it sets an invalid value for a netkit
attribute to trigger reverse parsing. Line 4 shows the error with the
attribute path correctly generated by YNL.

Then sample fixes the bad attribute and re-issues the request, with
NLM_F_ECHO set. This flag causes the notification to be looped back
to the initiating socket (our socket). Sample parses this notification
to save the ifindex of the created netkit.

Sample then proceeds to list the devices. Line 8 above shows a dummy
device with two alt names. Lines 9 and 10 show the netkit devices
the sample itself created.

The "primary" and "policy" attrs are from inside the netkit submsg.
The string values are auto-generated for the enums by YNL.

To clean up sample deletes the interface it created (line 11).

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250515231650.1325372-10-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Jakub Kicinski [Thu, 15 May 2025 23:16:49 +0000 (16:16 -0700)]

tools: ynl: enable codegen for all rt- families

Switch from including Classic netlink families one by one to excluding.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250515231650.1325372-9-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Jakub Kicinski [Thu, 15 May 2025 23:16:48 +0000 (16:16 -0700)]

tools: ynl: submsg: reverse parse / error reporting

Reverse parsing lets YNL convert bad and missing attr pointers
from extack into a string like "missing attribute nest1.nest2.attr_name".
It's a feature that's unique to YNL C AFAIU (even the Python YNL
can't do nested reverse parsing). Add support for reverse-parsing
of sub-messages.

To simplify the logic and the code annotate the type policies
with extra metadata. Mark the selectors and the messages with
the information we need. We assume that key / selector always
precedes the sub-message while parsing (and also if there are
multiple sub-messages like in rt-link they are interleaved
selector 1 ... submsg 1 ... selector 2 .. submsg 2, not
selector 1 ... selector 2 ... submsg 1 ... submsg 2).

The rt-link sample in a subsequent changes shows reverse parsing
of sub-messages in action.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250515231650.1325372-8-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Jakub Kicinski [Thu, 15 May 2025 23:16:47 +0000 (16:16 -0700)]

tools: ynl-gen: submsg: support parsing and rendering sub-messages

Adjust parsing and rendering appropriately to make sub-messages work.
Rendering is pretty trivial, as the submsg -> netlink conversion looks
like rendering a nest in which only one attr was set. Only trick
is that we use the enum value of the sub-message rather than the nest
as the type, and effectively skip one layer of nesting. A real double
nested struct would look like this:

  [SELECTOR]
  [SUBMSG]
    [NEST]
      [MSG1-ATTR]

A submsg "is" the nest so by skipping I mean:

  [SELECTOR]
  [SUBMSG]
    [MSG1-ATTR]

There is no extra validation in YNL if caller has set the selector
matching the submsg type (e.g. link type = "macvlan" but the nest
attrs are set to carry "veth"). Let the kernel handle that.

Parsing side is a little more specialized as we need to render and
insert a new kind of function which switches between what to parse
based on the selector. But code isn't too complicated.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250515231650.1325372-7-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Jakub Kicinski [Thu, 15 May 2025 23:16:46 +0000 (16:16 -0700)]

tools: ynl-gen: submsg: render the structs

The easiest (or perhaps only sane) way to support submessages in C
is to treat them as if they were nests. Build fake attributes to
that effect in the codegen. Render the submsg as a big nest of all
possible values.

With this in place the main missing part is to hook in the switch
which selects how to parse based on the key.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250515231650.1325372-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Jakub Kicinski [Thu, 15 May 2025 23:16:45 +0000 (16:16 -0700)]

tools: ynl-gen: submsg: plumb thru an empty type

Hook in handling of sub-messages, for now treat them as ignored attrs.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250515231650.1325372-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Jakub Kicinski [Thu, 15 May 2025 23:16:44 +0000 (16:16 -0700)]

tools: ynl-gen: prepare for submsg structs

Prepare for constructing Struct() instances which represent
sub-messages rather than nested attributes.
Restructure the code / indentation to more easily insert
a case where nested reference comes from annotation other
than the 'nested-attributes' property. Make sure we don't
construct the Struct() object from scratch in multiple
places as the constructor will soon have more arguments.

This should cause no functional change.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250515231650.1325372-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Jakub Kicinski [Thu, 15 May 2025 23:16:43 +0000 (16:16 -0700)]

tools: ynl-gen: factor out the annotation of pure nested struct

We're about to add some code here for sub-messages.
Factor out the nest-related logic to make the code readable.
No functional change.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250515231650.1325372-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Jakub Kicinski [Thu, 15 May 2025 23:16:42 +0000 (16:16 -0700)]

netlink: specs: rt-link: add C naming info for ovpn

C naming info for OVPN which was added since I adjusted
the existing attrs. Also add missing reference to a header needed
for a bridge struct.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250515231650.1325372-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Oleksij Rempel [Thu, 15 May 2025 08:20:51 +0000 (10:20 +0200)]

net: phy: microchip: document where the LAN88xx PHYs are used

The driver uses the name LAN88xx for PHYs with phy_id = 0x0007c132. But
with this placeholder name no documentation can be found on the net.

Document the fact that these PHYs are build into the LAN7800 and LAN7850
USB/Ethernet controllers.

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20250515082051.2644450-1-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Heiner Kallweit [Wed, 14 May 2025 18:54:29 +0000 (20:54 +0200)]

net: phy: fixed_phy: remove fixed_phy_register_with_gpiod

Since its introduction 6 yrs ago this functions has never had a user.
So remove it.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://patch.msgid.link/ccbeef28-65ae-4e28-b1db-816c44338dee@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Eric Dumazet [Thu, 15 May 2025 10:03:54 +0000 (10:03 +0000)]

net: rfs: add sock_rps_delete_flow() helper

RFS can exhibit lower performance for workloads using short-lived
flows and a small set of 4-tuple.

This is often the case for load-testers, using a pair of hosts,
if the server has a single listener port.

Typical use case :

Server : tcp_crr -T128 -F1000 -6 -U -l30 -R 14250
Client : tcp_crr -T128 -F1000 -6 -U -l30 -c -H server | grep local_throughput

This is because RFS global hash table contains stale information,
when the same RSS key is recycled for another socket and another cpu.

Make sure to undo the changes and go back to initial state when
a flow is disconnected.

Performance of the above test is increased by 22 %,
going from 372604 transactions per second to 457773.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Octavian Purdila <tavip@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://patch.msgid.link/20250515100354.3339920-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

ChunHao Lin [Thu, 15 May 2025 09:53:03 +0000 (17:53 +0800)]

r8169: add support for RTL8127A

This adds support for 10Gbs chip RTL8127A.

Signed-off-by: ChunHao Lin <hau@realtek.com>
Reviewed-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://patch.msgid.link/20250515095303.3138-1-hau@realtek.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Moon Yeounsu [Thu, 15 May 2025 07:53:31 +0000 (16:53 +0900)]

net: dlink: add synchronization for stats update

This patch synchronizes code that accesses from both user-space
and IRQ contexts. The `get_stats()` function can be called from both
context.

`dev->stats.tx_errors` and `dev->stats.collisions` are also updated
in the `tx_errors()` function. Therefore, these fields must also be
protected by synchronized.

There is no code that accessses `dev->stats.tx_errors` between the
previous and updated lines, so the updating point can be moved.

Signed-off-by: Moon Yeounsu <yyyynoom@gmail.com>
Link: https://patch.msgid.link/20250515075333.48290-1-yyyynoom@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Carolina Jubran [Wed, 14 May 2025 20:03:52 +0000 (23:03 +0300)]

net/mlx5e: Reuse per-RQ XDP buffer to avoid stack zeroing overhead

CONFIG_INIT_STACK_ALL_ZERO introduces a performance cost by
zero-initializing all stack variables on function entry. The mlx5 XDP
RX path previously allocated a struct mlx5e_xdp_buff on the stack per
received CQE, resulting in measurable performance degradation under
this config.

This patch reuses a mlx5e_xdp_buff stored in the mlx5e_rq struct,
avoiding per-CQE stack allocations and repeated zeroing.

With this change, XDP_DROP and XDP_TX performance matches that of
kernels built without CONFIG_INIT_STACK_ALL_ZERO.

Performance was measured on a ConnectX-6Dx using a single RX channel
(1 CPU at 100% usage) at ~50 Mpps. The baseline results were taken from
net-next-6.15.

Stack zeroing disabled:
- XDP_DROP:
    * baseline:                     31.47 Mpps
    * baseline + per-RQ allocation: 32.31 Mpps (+2.68%)

- XDP_TX:
    * baseline:                     12.41 Mpps
    * baseline + per-RQ allocation: 12.95 Mpps (+4.30%)

Stack zeroing enabled:
- XDP_DROP:
    * baseline:                     24.32 Mpps
    * baseline + per-RQ allocation: 32.27 Mpps (+32.7%)

- XDP_TX:
    * baseline:                     11.80 Mpps
    * baseline + per-RQ allocation: 12.24 Mpps (+3.72%)

Reported-by: Sebastiano Miano <mianosebastiano@gmail.com>
Reported-by: Samuel Dobron <sdobron@redhat.com>
Link: https://lore.kernel.org/all/CAMENy5pb8ea+piKLg5q5yRTMZacQqYWAoVLE1FE9WhQPq92E0g@mail.gmail.com/
Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://patch.msgid.link/1747253032-663457-1-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Frank Wunderlich [Sat, 10 May 2025 17:49:32 +0000 (19:49 +0200)]

net: phy: mediatek: do not require syscon compatible for pio property

Current implementation requires syscon compatible for pio property
which is used for driving the switch leds on mt7988.

Replace syscon_regmap_lookup_by_phandle with of_parse_phandle and
device_node_to_regmap to get the regmap already assigned by pinctrl
driver.

Signed-off-by: Frank Wunderlich <frank-w@public-files.de>
Link: https://patch.msgid.link/20250510174933.154589-1-linux@fw-web.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Milena Olech [Wed, 16 Apr 2025 12:19:25 +0000 (14:19 +0200)]

idpf: add support for Rx timestamping

Add Rx timestamp function when the Rx timestamp value is read directly
from the Rx descriptor. In order to extend the Rx timestamp value to 64
bit in hot path, the PHC time is cached in the receive groups.
Add supported Rx timestamp modes.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Milena Olech <milena.olech@intel.com>
Tested-by: YiFei Zhu <zhuyifei@google.com>
Tested-by: Mina Almasry <almasrymina@google.com>
Tested-by: Samuel Salin <Samuel.salin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

commit | commitdiff | tree

Milena Olech [Wed, 16 Apr 2025 12:19:23 +0000 (14:19 +0200)]

idpf: add Tx timestamp flows

Add functions to request Tx timestamp for the PTP packets, read the Tx
timestamp when the completion tag for that packet is being received,
extend the Tx timestamp value and set the supported timestamping modes.

Tx timestamp is requested for the PTP packets by setting a TSYN bit and
index value in the Tx context descriptor. The driver assumption is that
the Tx timestamp value is ready to be read when the completion tag is
received. Then the driver schedules delayed work and the Tx timestamp
value read is requested through virtchnl message. At the end, the Tx
timestamp value is extended to 64-bit and provided back to the skb.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Co-developed-by: Josh Hay <joshua.a.hay@intel.com>
Signed-off-by: Josh Hay <joshua.a.hay@intel.com>
Signed-off-by: Milena Olech <milena.olech@intel.com>
Tested-by: Samuel Salin <Samuel.salin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

commit | commitdiff | tree

Milena Olech [Wed, 16 Apr 2025 12:19:20 +0000 (14:19 +0200)]

idpf: add Tx timestamp capabilities negotiation

Tx timestamp capabilities are negotiated for the uplink Vport.
Driver receives information about the number of available Tx timestamp
latches, the size of Tx timestamp value and the set of indexes used
for Tx timestamping.

Add function to get the Tx timestamp capabilities and parse the uplink
vport flag.

Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Co-developed-by: Emil Tantilov <emil.s.tantilov@intel.com>
Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
Co-developed-by: Pavan Kumar Linga <pavan.kumar.linga@intel.com>
Signed-off-by: Pavan Kumar Linga <pavan.kumar.linga@intel.com>
Signed-off-by: Milena Olech <milena.olech@intel.com>
Tested-by: Samuel Salin <Samuel.salin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

commit | commitdiff | tree

Milena Olech [Wed, 16 Apr 2025 12:19:18 +0000 (14:19 +0200)]

idpf: add PTP clock configuration

PTP clock configuration operations - set time, adjust time and adjust
frequency are required to control the clock and maintain synchronization
process.

Extend get PTP capabilities function to request for the clock adjustments
and add functions to enable these actions using dedicated virtchnl
messages.

Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Milena Olech <milena.olech@intel.com>
Tested-by: Mina Almasry <almasrymina@google.com>
Tested-by: Samuel Salin <Samuel.salin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

commit | commitdiff | tree

Milena Olech [Wed, 16 Apr 2025 12:19:11 +0000 (14:19 +0200)]

idpf: add mailbox access to read PTP clock time

When the access to read PTP clock is specified as mailbox, the driver
needs to send virtchnl message to perform PTP actions. Message is sent
using idpf_mbq_opc_send_msg_to_peer_drv mailbox opcode, with the parameters
received during PTP capabilities negotiation.

Add functions to recognize PTP messages, move them to dedicated secondary
mailbox, read the PTP clock time and cross timestamp using mailbox
messages.

Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Milena Olech <milena.olech@intel.com>
Tested-by: Samuel Salin <Samuel.salin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

commit | commitdiff | tree

Milena Olech [Wed, 16 Apr 2025 12:19:08 +0000 (14:19 +0200)]

idpf: negotiate PTP capabilities and get PTP clock

PTP capabilities are negotiated using virtchnl command. Add get
capabilities function, direct access to read the PTP clock.
Set initial PTP capabilities exposed to the stack.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Milena Olech <milena.olech@intel.com>
Tested-by: Willem de Bruijn <willemb@google.com>
Tested-by: Mina Almasry <almasrymina@google.com>
Tested-by: Samuel Salin <Samuel.salin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

commit | commitdiff | tree

Milena Olech [Wed, 16 Apr 2025 12:19:06 +0000 (14:19 +0200)]

idpf: move virtchnl structures to the header file

Move virtchnl structures to the header file to expose them for the PTP
virtchnl file.

Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Milena Olech <milena.olech@intel.com>
Tested-by: Mina Almasry <almasrymina@google.com>
Tested-by: Samuel Salin <Samuel.salin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

commit | commitdiff | tree

Milena Olech [Wed, 16 Apr 2025 12:19:04 +0000 (14:19 +0200)]

virtchnl: add PTP virtchnl definitions

PTP capabilities are negotiated using virtchnl commands. There are two
available modes of the PTP support: direct and mailbox. When the direct
access to PTP resources is negotiated, virtchnl messages returns a set
of registers that allow read/write directly. When the mailbox access to
PTP resources is negotiated, virtchnl messages are used to access
PTP clock and to read the timestamp values.

Virtchnl API covers both modes and exposes a set of PTP capabilities.

Using virtchnl API, the driver recognizes also HW abilities - maximum
adjustment of the clock and the basic increment value.

Additionally, API allows to configure the secondary mailbox, dedicated
exclusively for PTP purposes.

Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Milena Olech <milena.olech@intel.com>
Tested-by: Mina Almasry <almasrymina@google.com>
Tested-by: Samuel Salin <Samuel.salin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

commit | commitdiff | tree

Milena Olech [Wed, 16 Apr 2025 12:19:02 +0000 (14:19 +0200)]

idpf: add initial PTP support

PTP feature is supported if the VIRTCHNL2_CAP_PTP is negotiated during the
capabilities recognition. Initial PTP support includes PTP initialization
and registration of the clock.

Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Milena Olech <milena.olech@intel.com>
Tested-by: Mina Almasry <almasrymina@google.com>
Tested-by: Samuel Salin <Samuel.salin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

commit | commitdiff | tree

Milena Olech [Wed, 16 Apr 2025 12:19:00 +0000 (14:19 +0200)]

idpf: change the method for mailbox workqueue allocation

Since workqueues are created per CPU, the works scheduled to this
workqueues are run on the CPU they were assigned. It may result in
overloaded CPU that is not able to handle virtchnl messages in
relatively short time. Allocating workqueue with WQ_UNBOUND and
WQ_HIGHPRI flags allows scheduler to queue virtchl messages on less loaded
CPUs, what eliminates delays.

Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Milena Olech <milena.olech@intel.com>
Tested-by: Samuel Salin <Samuel.salin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

commit | commitdiff | tree

Vladimir Oltean [Wed, 14 May 2025 14:32:49 +0000 (17:32 +0300)]

net: stmmac: convert to ndo_hwtstamp_get() and ndo_hwtstamp_set()

New timestamping API was introduced in commit 66f7223039c0 ("net: add
NDOs for configuring hardware timestamping") from kernel v6.6.

It is time to convert the stmmac driver to the new API, so that
timestamping configuration can be removed from the ndo_eth_ioctl()
path completely.

The existing timestamping calls are guarded by netif_running(). For
stmmac_hwtstamp_get() that is probably unnecessary, since no hardware
access is performed. But for stmmac_hwtstamp_set() I've preserved it,
since at least some IPs probably need pm_runtime_resume_and_get() to
access registers, which is otherwise called by __stmmac_open().

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://patch.msgid.link/20250514143249.1808377-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Vladimir Oltean [Wed, 14 May 2025 15:19:30 +0000 (18:19 +0300)]

net: lan743x: implement ndo_hwtstamp_get()

Permit programs such as "hwtstamp_ctl -i eth0" to retrieve the current
timestamping configuration of the NIC, rather than returning "Device
driver does not have support for non-destructive SIOCGHWTSTAMP."

The driver configures all channels with the same timestamping settings.
On TX, retrieve the settings of the first channel, those should be
representative for the entire NIC. On RX, save the filter settings in a
new adapter field.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://patch.msgid.link/20250514151931.1988047-2-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Vladimir Oltean [Wed, 14 May 2025 15:19:29 +0000 (18:19 +0300)]

net: lan743x: convert to ndo_hwtstamp_set()

New timestamping API was introduced in commit 66f7223039c0 ("net: add
NDOs for configuring hardware timestamping") from kernel v6.6.

It is time to convert the lan743x driver to the new API, so that
timestamping configuration can be removed from the ndo_eth_ioctl()
path completely.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://patch.msgid.link/20250514151931.1988047-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Jakub Kicinski [Tue, 13 May 2025 22:20:11 +0000 (15:20 -0700)]

tools: ynl-gen: array-nest: support arrays of nests

TC needs arrays of nests, but just a put for now.
Fairly straightforward addition.

Link: https://patch.msgid.link/20250513222011.844106-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Jakub Kicinski [Tue, 13 May 2025 22:17:52 +0000 (15:17 -0700)]

net: sched: uapi: add more sanely named duplicate defines

The TCA_FLOWER_KEY_CFM enum has a UNSPEC and MAX with _OPT
in the name, but the real attributes don't. Add a MAX that
more reasonably matches the attrs.

The PAD in TCA_TAPRIO is the only attr which doesn't have
_ATTR in it, perhaps signifying that it's not a real attr?
If so interesting idea in abstract but it makes codegen painful.

Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250513221752.843102-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Jakub Kicinski [Thu, 15 May 2025 18:30:11 +0000 (11:30 -0700)]

Merge branch 'tcp-receive-side-improvements'

Eric Dumazet says:

====================
tcp: receive side improvements

We have set tcp_rmem[2] to 15 MB for about 8 years at Google,
but had some issues for high speed flows on very small RTT.

TCP rx autotuning has a tendency to overestimate the RTT,
thus tp->rcvq_space.space and sk->sk_rcvbuf.

This makes TCP receive queues much bigger than necessary,
to a point cpu caches are evicted before application can
copy the data, on cpus using DDIO.

This series aims to fix this.

- First patch adds tcp_rcvbuf_grow() tracepoint, which was very
convenient to study the various issues fixed in this series.

- Seven patches fix receiver autotune issues.

- Two patches fix sender side issues.

- Final patch increases tcp_rmem[2] so that TCP speed over WAN
can meet modern needs.

Tested on a 200Gbit NIC, average max throughput of a single flow:

Before:
73593 Mbit.

After:
122514 Mbit.
====================

Link: https://patch.msgid.link/20250513193919.1089692-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Eric Dumazet [Tue, 13 May 2025 19:39:19 +0000 (19:39 +0000)]

tcp: increase tcp_rmem[2] to 32 MB

Last change to tcp_rmem[2] happened in 2012, in commit b49960a05e32
("tcp: change tcp_adv_win_scale and tcp_rmem[2]")

TCP performance on WAN is mostly limited by tcp_rmem[2] for receivers.

After this series improvements, it is time to increase the default.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250513193919.1089692-12-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Eric Dumazet [Tue, 13 May 2025 19:39:18 +0000 (19:39 +0000)]

tcp: always use tcp_limit_output_bytes limitation

This partially reverts commit c73e5807e4f6 ("tcp: tsq: no longer use
limit_output_bytes for paced flows")

Overriding the tcp_limit_output_bytes sysctl value
for FQ enabled flows has the following problem:

It allows TCP to queue around 2 ms worth of data per flow,
defeating tcp_rcv_rtt_update() accuracy on the receiver,
forcing it to increase sk->sk_rcvbuf even if the real
RTT is around 100 us.

After this change, we keep enough packets in flight to fill
the pipe, and let receive queues small enough to get
good cache behavior (cpu caches and/or NIC driver page pools).

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250513193919.1089692-11-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Eric Dumazet [Tue, 13 May 2025 19:39:17 +0000 (19:39 +0000)]

tcp: increase tcp_limit_output_bytes default value to 4MB

Last change happened in 2018 with commit c73e5807e4f6
("tcp: tsq: no longer use limit_output_bytes for paced flows")

Modern NIC speeds got a 4x increase since then.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250513193919.1089692-10-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Eric Dumazet [Tue, 13 May 2025 19:39:16 +0000 (19:39 +0000)]

tcp: skip big rtt sample if receive queue is not empty

tcp_rcv_rtt_update() role is to keep an estimation
of RTT (tp->rcv_rtt_est.rtt_us) for receivers.

If an application is too slow to drain the TCP receive
queue, it is better to leave the RTT estimation small,
so that tcp_rcv_space_adjust() does not inflate
tp->rcvq_space.space and sk->sk_rcvbuf.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250513193919.1089692-9-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Eric Dumazet [Tue, 13 May 2025 19:39:15 +0000 (19:39 +0000)]

tcp: always seek for minimal rtt in tcp_rcv_rtt_update()

tcp_rcv_rtt_update() goal is to maintain an estimation of the RTT
in tp->rcv_rtt_est.rtt_us, used by tcp_rcv_space_adjust()

When TCP TS are enabled, tcp_rcv_rtt_update() is using
EWMA to smooth the samples.

Change this to immediately latch the incoming value if it
is lower than tp->rcv_rtt_est.rtt_us, so that tcp_rcv_space_adjust()
does not overshoot tp->rcvq_space.space and sk->sk_rcvbuf.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250513193919.1089692-8-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Eric Dumazet [Tue, 13 May 2025 19:39:14 +0000 (19:39 +0000)]

tcp: fix initial tp->rcvq_space.space value for passive TS enabled flows

tcp_rcv_state_process() must tweak tp->advmss for TS enabled flows
before the call to tcp_init_transfer() / tcp_init_buffer_space().

Otherwise tp->rcvq_space.space is off by 120 bytes
(TCP_INIT_CWND * TCPOLEN_TSTAMP_ALIGNED).

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Wei Wang <weiwan@google.com>
Link: https://patch.msgid.link/20250513193919.1089692-7-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Eric Dumazet [Tue, 13 May 2025 19:39:13 +0000 (19:39 +0000)]

tcp: remove zero TCP TS samples for autotuning

For TCP flows using ms RFC 7323 timestamp granularity
tcp_rcv_rtt_update() can be fed with 1 ms samples, breaking
TCP autotuning for data center flows with sub ms RTT.

Instead, rely on the window based samples, fed by tcp_rcv_rtt_measure()

tcp_rcvbuf_grow() for a 10 second TCP_STREAM sesssion now looks saner.
We can see rcvbuf is kept at a reasonable value.

  222.234976: tcp:tcp_rcvbuf_grow: time=348 rtt_us=330 copied=110592 inq=0 space=40960 ooo=0 scaling_ratio=230 rcvbuf=131072 ...
  222.235276: tcp:tcp_rcvbuf_grow: time=300 rtt_us=288 copied=126976 inq=0 space=110592 ooo=0 scaling_ratio=230 rcvbuf=246187 ...
  222.235569: tcp:tcp_rcvbuf_grow: time=294 rtt_us=288 copied=184320 inq=0 space=126976 ooo=0 scaling_ratio=230 rcvbuf=282659 ...
  222.235833: tcp:tcp_rcvbuf_grow: time=264 rtt_us=244 copied=373760 inq=0 space=184320 ooo=0 scaling_ratio=230 rcvbuf=410312 ...
  222.236142: tcp:tcp_rcvbuf_grow: time=308 rtt_us=219 copied=424960 inq=20480 space=373760 ooo=0 scaling_ratio=230 rcvbuf=832022 ...
  222.236378: tcp:tcp_rcvbuf_grow: time=236 rtt_us=219 copied=692224 inq=49152 space=404480 ooo=0 scaling_ratio=230 rcvbuf=900407 ...
  222.236602: tcp:tcp_rcvbuf_grow: time=225 rtt_us=219 copied=730112 inq=49152 space=643072 ooo=0 scaling_ratio=230 rcvbuf=1431534 ...
  222.237050: tcp:tcp_rcvbuf_grow: time=229 rtt_us=219 copied=1160192 inq=49152 space=680960 ooo=0 scaling_ratio=230 rcvbuf=1515876 ...
  222.237618: tcp:tcp_rcvbuf_grow: time=305 rtt_us=218 copied=2228224 inq=49152 space=1111040 ooo=0 scaling_ratio=230 rcvbuf=2473271 ...
  222.238591: tcp:tcp_rcvbuf_grow: time=224 rtt_us=218 copied=3063808 inq=360448 space=2179072 ooo=0 scaling_ratio=230 rcvbuf=4850803 ...
  222.240647: tcp:tcp_rcvbuf_grow: time=260 rtt_us=218 copied=2752512 inq=0 space=2703360 ooo=0 scaling_ratio=230 rcvbuf=6017914 ...
  222.243535: tcp:tcp_rcvbuf_grow: time=224 rtt_us=218 copied=2834432 inq=49152 space=2752512 ooo=0 scaling_ratio=230 rcvbuf=6127331 ...
  222.245108: tcp:tcp_rcvbuf_grow: time=240 rtt_us=218 copied=2883584 inq=49152 space=2785280 ooo=0 scaling_ratio=230 rcvbuf=6200275 ...
  222.245333: tcp:tcp_rcvbuf_grow: time=224 rtt_us=218 copied=2859008 inq=0 space=2834432 ooo=0 scaling_ratio=230 rcvbuf=6309692 ...
  222.301021: tcp:tcp_rcvbuf_grow: time=222 rtt_us=218 copied=2883584 inq=0 space=2859008 ooo=0 scaling_ratio=230 rcvbuf=6364400 ...
  222.989242: tcp:tcp_rcvbuf_grow: time=225 rtt_us=218 copied=2899968 inq=0 space=2883584 ooo=0 scaling_ratio=230 rcvbuf=6419108 ...
  224.139553: tcp:tcp_rcvbuf_grow: time=224 rtt_us=218 copied=3014656 inq=65536 space=2899968 ooo=0 scaling_ratio=230 rcvbuf=6455580 ...
  224.584608: tcp:tcp_rcvbuf_grow: time=232 rtt_us=218 copied=3014656 inq=49152 space=2949120 ooo=0 scaling_ratio=230 rcvbuf=6564997 ...
  230.145560: tcp:tcp_rcvbuf_grow: time=223 rtt_us=218 copied=2981888 inq=0 space=2965504 ooo=0 scaling_ratio=230 rcvbuf=6601469 ...

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Wei Wang <weiwan@google.com>
Link: https://patch.msgid.link/20250513193919.1089692-6-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Eric Dumazet [Tue, 13 May 2025 19:39:12 +0000 (19:39 +0000)]

tcp: add receive queue awareness in tcp_rcv_space_adjust()

If the application can not drain fast enough a TCP socket queue,
tcp_rcv_space_adjust() can overestimate tp->rcvq_space.space.

Then sk->sk_rcvbuf can grow and hit tcp_rmem[2] for no good reason.

Fix this by taking into acount the number of available bytes.

Keeping sk->sk_rcvbuf at the right size allows better cache efficiency.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Wei Wang <weiwan@google.com>
Link: https://patch.msgid.link/20250513193919.1089692-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Eric Dumazet [Tue, 13 May 2025 19:39:11 +0000 (19:39 +0000)]

tcp: adjust rcvbuf in presence of reorders

This patch takes care of the needed provisioning
when incoming packets are stored in the out of order queue.

This part was not implemented in the correct way, we need
to decouple it from tcp_rcv_space_adjust() logic.

Without it, stalls in the pipe could happen.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250513193919.1089692-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Eric Dumazet [Tue, 13 May 2025 19:39:10 +0000 (19:39 +0000)]

tcp: fix sk_rcvbuf overshoot

Current autosizing in tcp_rcv_space_adjust() is too aggressive.

Instead of betting on possible losses and over estimate BDP,
it is better to only account for slow start.

The following patch is then adding a more precise tuning
in the events of packet losses.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250513193919.1089692-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Eric Dumazet [Tue, 13 May 2025 19:39:09 +0000 (19:39 +0000)]

tcp: add tcp_rcvbuf_grow() tracepoint

Provide a new tracepoint to better understand
tcp_rcv_space_adjust() (currently broken) behavior.

Call it only when tcp_rcv_space_adjust() has a chance
to make a change.

I chose to leave trace_tcp_rcv_space_adjust() as is,
because commit 6163849d289b ("net: introduce a new tracepoint
for tcp_rcv_space_adjust") intent was to get it called after
each data delivery to user space.

Tested:

Pair of hosts in the same rack. Ideally, sk->sk_rcvbuf should be kept small.

echo "4096 131072 33554432" >/proc/sys/net/ipv4/tcp_rmem
./netserver
perf record -C10 -e tcp:tcp_rcvbuf_grow sleep 30

<launch from client : netperf -H server -T,10>

Trace for a TS enabled TCP flow (with standard ms granularity)

perf script // We can see that sk_rcvbuf is growing very fast to tcp_mem[2]
  260.500397: tcp:tcp_rcvbuf_grow: time=291 rtt_us=274 copied=110592 inq=0 space=41080 ooo=0 scaling_ratio=230 rcvbuf=131072 ...
  260.501333: tcp:tcp_rcvbuf_grow: time=555 rtt_us=364 copied=333824 inq=0 space=110592 ooo=0 scaling_ratio=230 rcvbuf=1399144 ...
  260.501664: tcp:tcp_rcvbuf_grow: time=331 rtt_us=330 copied=798720 inq=0 space=333824 ooo=0 scaling_ratio=230 rcvbuf=4110551 ...
  260.502003: tcp:tcp_rcvbuf_grow: time=340 rtt_us=330 copied=1040384 inq=49152 space=798720 ooo=0 scaling_ratio=230 rcvbuf=7006410 ...
  260.502483: tcp:tcp_rcvbuf_grow: time=479 rtt_us=330 copied=2658304 inq=49152 space=1040384 ooo=0 scaling_ratio=230 rcvbuf=7006410 ...
  260.502899: tcp:tcp_rcvbuf_grow: time=416 rtt_us=413 copied=4026368 inq=147456 space=2658304 ooo=0 scaling_ratio=230 rcvbuf=24622616 ...
  260.504233: tcp:tcp_rcvbuf_grow: time=493 rtt_us=487 copied=4800512 inq=196608 space=4026368 ooo=0 scaling_ratio=230 rcvbuf=24622616 ...
  260.504792: tcp:tcp_rcvbuf_grow: time=559 rtt_us=551 copied=5672960 inq=49152 space=4800512 ooo=0 scaling_ratio=230 rcvbuf=24622616 ...
  260.506614: tcp:tcp_rcvbuf_grow: time=610 rtt_us=607 copied=6688768 inq=180224 space=5672960 ooo=0 scaling_ratio=230 rcvbuf=24622616 ...
  260.507280: tcp:tcp_rcvbuf_grow: time=666 rtt_us=656 copied=6868992 inq=49152 space=6688768 ooo=0 scaling_ratio=230 rcvbuf=24622616 ...
  260.507979: tcp:tcp_rcvbuf_grow: time=699 rtt_us=699 copied=7000064 inq=0 space=6868992 ooo=0 scaling_ratio=230 rcvbuf=24622616 ...
  260.508681: tcp:tcp_rcvbuf_grow: time=703 rtt_us=699 copied=7208960 inq=0 space=7000064 ooo=0 scaling_ratio=230 rcvbuf=24622616 ...
  260.509426: tcp:tcp_rcvbuf_grow: time=744 rtt_us=737 copied=7569408 inq=0 space=7208960 ooo=0 scaling_ratio=230 rcvbuf=24622616 ...
  260.510213: tcp:tcp_rcvbuf_grow: time=787 rtt_us=770 copied=7880704 inq=49152 space=7569408 ooo=0 scaling_ratio=230 rcvbuf=24622616 ...
  260.511013: tcp:tcp_rcvbuf_grow: time=801 rtt_us=798 copied=8339456 inq=0 space=7880704 ooo=0 scaling_ratio=230 rcvbuf=24622616 ...
  260.511860: tcp:tcp_rcvbuf_grow: time=847 rtt_us=824 copied=8601600 inq=49152 space=8339456 ooo=0 scaling_ratio=230 rcvbuf=24622616 ...
  260.512710: tcp:tcp_rcvbuf_grow: time=850 rtt_us=846 copied=8814592 inq=65536 space=8601600 ooo=0 scaling_ratio=230 rcvbuf=24622616 ...
  260.514428: tcp:tcp_rcvbuf_grow: time=871 rtt_us=865 copied=8855552 inq=49152 space=8814592 ooo=0 scaling_ratio=230 rcvbuf=24622616 ...
  260.515333: tcp:tcp_rcvbuf_grow: time=905 rtt_us=882 copied=9228288 inq=49152 space=8855552 ooo=0 scaling_ratio=230 rcvbuf=24622616 ...
  260.516237: tcp:tcp_rcvbuf_grow: time=905 rtt_us=896 copied=9371648 inq=49152 space=9228288 ooo=0 scaling_ratio=230 rcvbuf=24622616 ...
  260.517149: tcp:tcp_rcvbuf_grow: time=911 rtt_us=909 copied=9543680 inq=49152 space=9371648 ooo=0 scaling_ratio=230 rcvbuf=24622616 ...
  260.518070: tcp:tcp_rcvbuf_grow: time=921 rtt_us=921 copied=9793536 inq=0 space=9543680 ooo=0 scaling_ratio=230 rcvbuf=24622616 ...
  260.520895: tcp:tcp_rcvbuf_grow: time=948 rtt_us=947 copied=10203136 inq=114688 space=9793536 ooo=0 scaling_ratio=230 rcvbuf=24622616 ...
  260.521853: tcp:tcp_rcvbuf_grow: time=959 rtt_us=954 copied=10293248 inq=57344 space=10203136 ooo=0 scaling_ratio=230 rcvbuf=24691992 ...
  260.522818: tcp:tcp_rcvbuf_grow: time=964 rtt_us=959 copied=10330112 inq=0 space=10293248 ooo=0 scaling_ratio=230 rcvbuf=24691992 ...
  260.524760: tcp:tcp_rcvbuf_grow: time=979 rtt_us=969 copied=10633216 inq=49152 space=10330112 ooo=0 scaling_ratio=230 rcvbuf=24691992 ...
  260.526709: tcp:tcp_rcvbuf_grow: time=975 rtt_us=973 copied=12013568 inq=163840 space=10633216 ooo=0 scaling_ratio=230 rcvbuf=25136755 ...
  260.527694: tcp:tcp_rcvbuf_grow: time=985 rtt_us=976 copied=12025856 inq=32768 space=12013568 ooo=0 scaling_ratio=230 rcvbuf=33554432 ...
  260.530655: tcp:tcp_rcvbuf_grow: time=991 rtt_us=986 copied=12050432 inq=98304 space=12025856 ooo=0 scaling_ratio=230 rcvbuf=33554432 ...
  260.533626: tcp:tcp_rcvbuf_grow: time=993 rtt_us=989 copied=12124160 inq=0 space=12050432 ooo=0 scaling_ratio=230 rcvbuf=33554432 ...
  260.538606: tcp:tcp_rcvbuf_grow: time=1000 rtt_us=994 copied=12222464 inq=49152 space=12124160 ooo=0 scaling_ratio=230 rcvbuf=33554432 ...
  260.545605: tcp:tcp_rcvbuf_grow: time=1005 rtt_us=998 copied=12263424 inq=81920 space=12222464 ooo=0 scaling_ratio=230 rcvbuf=33554432 ...
  260.553626: tcp:tcp_rcvbuf_grow: time=1005 rtt_us=999 copied=12320768 inq=12288 space=12263424 ooo=0 scaling_ratio=230 rcvbuf=33554432 ...
  260.589749: tcp:tcp_rcvbuf_grow: time=1001 rtt_us=1000 copied=12398592 inq=16384 space=12320768 ooo=0 scaling_ratio=230 rcvbuf=33554432 ...
  260.806577: tcp:tcp_rcvbuf_grow: time=1010 rtt_us=1000 copied=12402688 inq=32768 space=12398592 ooo=0 scaling_ratio=230 rcvbuf=33554432 ...
  261.002386: tcp:tcp_rcvbuf_grow: time=1002 rtt_us=1000 copied=12419072 inq=98304 space=12402688 ooo=0 scaling_ratio=230 rcvbuf=33554432 ...
  261.803432: tcp:tcp_rcvbuf_grow: time=1013 rtt_us=1000 copied=12468224 inq=49152 space=12419072 ooo=0 scaling_ratio=230 rcvbuf=33554432 ...
  261.829533: tcp:tcp_rcvbuf_grow: time=1004 rtt_us=1000 copied=12615680 inq=0 space=12468224 ooo=0 scaling_ratio=230 rcvbuf=33554432 ...
  265.505435: tcp:tcp_rcvbuf_grow: time=1007 rtt_us=1000 copied=12632064 inq=32768 space=12615680 ooo=0 scaling_ratio=230 rcvbuf=33554432 ...

We also see rtt_us going gradually to 1000 usec, causing massive overshoot.

Trace for a usec TS enabled TCP flow (us granularity)

perf script // We can see that sk_rcvbuf is growing to a smaller value,
               thanks to tight rtt_us values.
1509.273955: tcp:tcp_rcvbuf_grow: time=396 rtt_us=377 copied=110592 inq=0 space=41080 ooo=0 scaling_ratio=230 rcvbuf=131072 ...
1509.274366: tcp:tcp_rcvbuf_grow: time=412 rtt_us=365 copied=129024 inq=0 space=110592 ooo=0 scaling_ratio=230 rcvbuf=1399144 ...
1509.274738: tcp:tcp_rcvbuf_grow: time=372 rtt_us=355 copied=194560 inq=0 space=129024 ooo=0 scaling_ratio=230 rcvbuf=1399144 ...
1509.275020: tcp:tcp_rcvbuf_grow: time=282 rtt_us=257 copied=401408 inq=0 space=194560 ooo=0 scaling_ratio=230 rcvbuf=1399144 ...
1509.275190: tcp:tcp_rcvbuf_grow: time=170 rtt_us=144 copied=741376 inq=229376 space=401408 ooo=0 scaling_ratio=230 rcvbuf=3021625 ...
1509.275300: tcp:tcp_rcvbuf_grow: time=110 rtt_us=110 copied=1146880 inq=65536 space=741376 ooo=0 scaling_ratio=230 rcvbuf=4642390 ...
1509.275449: tcp:tcp_rcvbuf_grow: time=149 rtt_us=106 copied=1310720 inq=737280 space=1146880 ooo=0 scaling_ratio=230 rcvbuf=5498637 ...
1509.275560: tcp:tcp_rcvbuf_grow: time=111 rtt_us=107 copied=1388544 inq=430080 space=1310720 ooo=0 scaling_ratio=230 rcvbuf=5498637 ...
1509.275674: tcp:tcp_rcvbuf_grow: time=114 rtt_us=113 copied=1495040 inq=421888 space=1388544 ooo=0 scaling_ratio=230 rcvbuf=5498637 ...
1509.275800: tcp:tcp_rcvbuf_grow: time=126 rtt_us=126 copied=1572864 inq=77824 space=1495040 ooo=0 scaling_ratio=230 rcvbuf=5498637 ...
1509.275968: tcp:tcp_rcvbuf_grow: time=168 rtt_us=161 copied=1863680 inq=172032 space=1572864 ooo=0 scaling_ratio=230 rcvbuf=5498637 ...
1509.276129: tcp:tcp_rcvbuf_grow: time=161 rtt_us=161 copied=1941504 inq=204800 space=1863680 ooo=0 scaling_ratio=230 rcvbuf=5782790 ...
1509.276288: tcp:tcp_rcvbuf_grow: time=159 rtt_us=158 copied=1990656 inq=131072 space=1941504 ooo=0 scaling_ratio=230 rcvbuf=5782790 ...
1509.276900: tcp:tcp_rcvbuf_grow: time=228 rtt_us=226 copied=2883584 inq=266240 space=1990656 ooo=0 scaling_ratio=230 rcvbuf=5782790 ...
1509.277819: tcp:tcp_rcvbuf_grow: time=242 rtt_us=236 copied=3022848 inq=0 space=2883584 ooo=0 scaling_ratio=230 rcvbuf=12316197 ...
1509.278072: tcp:tcp_rcvbuf_grow: time=253 rtt_us=247 copied=3055616 inq=49152 space=3022848 ooo=0 scaling_ratio=230 rcvbuf=12316197 ...
1509.279560: tcp:tcp_rcvbuf_grow: time=268 rtt_us=264 copied=3133440 inq=180224 space=3055616 ooo=0 scaling_ratio=230 rcvbuf=12316197 ...
1509.279833: tcp:tcp_rcvbuf_grow: time=274 rtt_us=270 copied=3424256 inq=0 space=3133440 ooo=0 scaling_ratio=230 rcvbuf=12316197 ...
1509.282187: tcp:tcp_rcvbuf_grow: time=277 rtt_us=273 copied=3465216 inq=180224 space=3424256 ooo=0 scaling_ratio=230 rcvbuf=12316197 ...
1509.284685: tcp:tcp_rcvbuf_grow: time=292 rtt_us=292 copied=3481600 inq=147456 space=3465216 ooo=0 scaling_ratio=230 rcvbuf=12316197 ...
1509.284983: tcp:tcp_rcvbuf_grow: time=297 rtt_us=295 copied=3702784 inq=45056 space=3481600 ooo=0 scaling_ratio=230 rcvbuf=12316197 ...
1509.285596: tcp:tcp_rcvbuf_grow: time=311 rtt_us=310 copied=3723264 inq=40960 space=3702784 ooo=0 scaling_ratio=230 rcvbuf=12316197 ...
1509.285909: tcp:tcp_rcvbuf_grow: time=313 rtt_us=304 copied=3846144 inq=196608 space=3723264 ooo=0 scaling_ratio=230 rcvbuf=12316197 ...
1509.291654: tcp:tcp_rcvbuf_grow: time=322 rtt_us=311 copied=3960832 inq=49152 space=3846144 ooo=0 scaling_ratio=230 rcvbuf=12316197 ...
1509.291986: tcp:tcp_rcvbuf_grow: time=333 rtt_us=330 copied=4075520 inq=360448 space=3960832 ooo=0 scaling_ratio=230 rcvbuf=12316197 ...
1509.292319: tcp:tcp_rcvbuf_grow: time=332 rtt_us=332 copied=4079616 inq=65536 space=4075520 ooo=0 scaling_ratio=230 rcvbuf=12316197 ...
1509.292666: tcp:tcp_rcvbuf_grow: time=348 rtt_us=347 copied=4177920 inq=212992 space=4079616 ooo=0 scaling_ratio=230 rcvbuf=12316197 ...
1509.293015: tcp:tcp_rcvbuf_grow: time=349 rtt_us=345 copied=4276224 inq=262144 space=4177920 ooo=0 scaling_ratio=230 rcvbuf=12316197 ...
1509.293371: tcp:tcp_rcvbuf_grow: time=356 rtt_us=346 copied=4415488 inq=49152 space=4276224 ooo=0 scaling_ratio=230 rcvbuf=12316197 ...
1509.515798: tcp:tcp_rcvbuf_grow: time=424 rtt_us=411 copied=4833280 inq=81920 space=4415488 ooo=0 scaling_ratio=230 rcvbuf=12316197 ...

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Wei Wang <weiwan@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://patch.msgid.link/20250513193919.1089692-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Jakub Kicinski [Thu, 15 May 2025 18:22:32 +0000 (11:22 -0700)]

Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Cross-merge networking fixes after downstream PR (net-6.15-rc7).

Conflicts:

tools/testing/selftests/drivers/net/hw/ncdevmem.c
  97c4e094a4b2 ("tests/ncdevmem: Fix double-free of queue array")
  2f1a805f32ba ("selftests: ncdevmem: Implement devmem TCP TX")
https://lore.kernel.org/20250514122900.1e77d62d@canb.auug.org.au

Adjacent changes:

net/core/devmem.c
net/core/devmem.h
  0afc44d8cdf6 ("net: devmem: fix kernel panic when netlink socket close after module unload")
  bd61848900bf ("net: devmem: Implement TX path")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Linus Torvalds [Thu, 15 May 2025 17:40:49 +0000 (10:40 -0700)]

Merge tag 'net-6.15-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
"Including fixes from Bluetooth and wireless.

  A few more fixes for the locking changes trickling in. Nothing too
  alarming, I suspect those will continue for another release. Other
  than that things are slowing down nicely.

  Current release - fix to a fix:

   - Bluetooth: hci_event: use key encryption size when its known

   - tools: ynl-gen: allow multi-attr without nested-attributes again

  Current release - regressions:

   - locking fixes:
      - lock lower level devices when updating features
      - eth: bnxt_en: bring back rtnl_lock() in the bnxt_open() path
      - devmem: fix panic when Netlink socket closes after module unload

  Current release - new code bugs:

   - eth: txgbe: fixes for FW communication on new AML devices

  Previous releases - always broken:

   - sched: flush gso_skb list too during ->change(), avoid potential
     null-deref on reconfig

   - wifi: mt76: disable NAPI on driver removal

   - hv_netvsc: fix error 'nvsp_rndis_pkt_complete error status: 2'"

* tag 'net-6.15-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (44 commits)
  net: devmem: fix kernel panic when netlink socket close after module unload
  tsnep: fix timestamping with a stacked DSA driver
  net/tls: fix kernel panic when alloc_page failed
  bnxt_en: bring back rtnl_lock() in the bnxt_open() path
  mlxsw: spectrum_router: Fix use-after-free when deleting GRE net devices
  wifi: mac80211: Set n_channels after allocating struct cfg80211_scan_request
  octeontx2-pf: Do not reallocate all ntuple filters
  wifi: mt76: mt7925: fix missing hdr_trans_tlv command for broadcast wtbl
  wifi: mt76: disable napi on driver removal
  Drivers: hv: vmbus: Remove vmbus_sendpacket_pagebuffer()
  hv_netvsc: Remove rmsg_pgcnt
  hv_netvsc: Preserve contiguous PFN grouping in the page buffer array
  hv_netvsc: Use vmbus_sendpacket_mpb_desc() to send VMBus messages
  Drivers: hv: Allow vmbus_sendpacket_mpb_desc() to create multiple ranges
  octeontx2-af: Fix CGX Receive counters
  net: ethernet: mtk_eth_soc: fix typo for declaration MT7988 ESW capability
  net: libwx: Fix FW mailbox unknown command
  net: libwx: Fix FW mailbox reply timeout
  net: txgbe: Fix to calculate EEPROM checksum for AML devices
  octeontx2-pf: macsec: Fix incorrect max transmit size in TX secy
  ...

commit | commitdiff | tree

Taehee Yoo [Wed, 14 May 2025 15:40:28 +0000 (15:40 +0000)]

net: devmem: fix kernel panic when netlink socket close after module unload

Kernel panic occurs when a devmem TCP socket is closed after NIC module
is unloaded.

This is Devmem TCP unregistration scenarios. number is an order.
(a)netlink socket close    (b)pp destroy    (c)uninstall    result
1                          2                3               OK
1                          3                2               (d)Impossible
2                          1                3               OK
3                          1                2               (e)Kernel panic
2                          3                1               (d)Impossible
3                          2                1               (d)Impossible

(a) netdev_nl_sock_priv_destroy() is called when devmem TCP socket is
    closed.
(b) page_pool_destroy() is called when the interface is down.
(c) mp_ops->uninstall() is called when an interface is unregistered.
(d) There is no scenario in mp_ops->uninstall() is called before
    page_pool_destroy().
    Because unregister_netdevice_many_notify() closes interfaces first
    and then calls mp_ops->uninstall().
(e) netdev_nl_sock_priv_destroy() accesses struct net_device to acquire
    netdev_lock().
    But if the interface module has already been removed, net_device
    pointer is invalid, so it causes kernel panic.

In summary, there are only 3 possible scenarios.
A. sk close -> pp destroy -> uninstall.
B. pp destroy -> sk close -> uninstall.
C. pp destroy -> uninstall -> sk close.

Case C is a kernel panic scenario.

In order to fix this problem, It makes mp_dmabuf_devmem_uninstall() set
binding->dev to NULL.
It indicates an bound net_device was unregistered.

It makes netdev_nl_sock_priv_destroy() do not acquire netdev_lock()
if binding->dev is NULL.

A new binding->lock is added to protect a dev of a binding.
So, lock ordering is like below.
priv->lock
netdev_lock(dev)
binding->lock

Tests:
Scenario A:
    ./ncdevmem -s 192.168.1.4 -c 192.168.1.2 -f $interface -l -p 8000 \
        -v 7 -t 1 -q 1 &
    pid=$!
    sleep 10
    kill $pid
    ip link set $interface down
    modprobe -rv $module

Scenario B:
    ./ncdevmem -s 192.168.1.4 -c 192.168.1.2 -f $interface -l -p 8000 \
        -v 7 -t 1 -q 1 &
    pid=$!
    sleep 10
    ip link set $interface down
    kill $pid
    modprobe -rv $module

Scenario C:
    ./ncdevmem -s 192.168.1.4 -c 192.168.1.2 -f $interface -l -p 8000 \
        -v 7 -t 1 -q 1 &
    pid=$!
    sleep 10
    modprobe -rv $module
    sleep 5
    kill $pid

Splat looks like:
Oops: general protection fault, probably for non-canonical address 0xdffffc001fffa9f7: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN NOPTI
KASAN: probably user-memory-access in range [0x00000000fffd4fb8-0x00000000fffd4fbf]
CPU: 0 UID: 0 PID: 2041 Comm: ncdevmem Tainted: G    B   W           6.15.0-rc1+ #2 PREEMPT(undef)  0947ec89efa0fd68838b78e36aa1617e97ff5d7f
Tainted: [B]=BAD_PAGE, [W]=WARN
RIP: 0010:__mutex_lock (./include/linux/sched.h:2244 kernel/locking/mutex.c:400 kernel/locking/mutex.c:443 kernel/locking/mutex.c:605 kernel/locking/mutex.c:746)
Code: ea 03 80 3c 02 00 0f 85 4f 13 00 00 49 8b 1e 48 83 e3 f8 74 6a 48 b8 00 00 00 00 00 fc ff df 48 8d 7b 34 48 89 fa 48 c1 ea 03 <0f> b6 f
RSP: 0018:ffff88826f7ef730 EFLAGS: 00010203
RAX: dffffc0000000000 RBX: 00000000fffd4f88 RCX: ffffffffaa9bc811
RDX: 000000001fffa9f7 RSI: 0000000000000008 RDI: 00000000fffd4fbc
RBP: ffff88826f7ef8b0 R08: 0000000000000000 R09: ffffed103e6aa1a4
R10: 0000000000000007 R11: ffff88826f7ef442 R12: fffffbfff669f65e
R13: ffff88812a830040 R14: ffff8881f3550d20 R15: 00000000fffd4f88
FS:  0000000000000000(0000) GS:ffff888866c05000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000563bed0cb288 CR3: 00000001a7c98000 CR4: 00000000007506f0
PKRU: 55555554
Call Trace:
<TASK>
...
netdev_nl_sock_priv_destroy (net/core/netdev-genl.c:953 (discriminator 3))
genl_release (net/netlink/genetlink.c:653 net/netlink/genetlink.c:694 net/netlink/genetlink.c:705)
...
netlink_release (net/netlink/af_netlink.c:737)
...
__sock_release (net/socket.c:647)
sock_close (net/socket.c:1393)

Fixes: 1d22d3060b9b ("net: drop rtnl_lock for queue_mgmt operations")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250514154028.1062909-1-ap420073@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Gerhard Engleder [Wed, 14 May 2025 19:56:57 +0000 (21:56 +0200)]

tsnep: fix timestamping with a stacked DSA driver

This driver is susceptible to a form of the bug explained in commit
c26a2c2ddc01 ("gianfar: Fix TX timestamping with a stacked DSA driver")
and in Documentation/networking/timestamping.rst section "Other caveats
for MAC drivers", specifically it timestamps any skb which has
SKBTX_HW_TSTAMP, and does not consider if timestamping has been enabled
in adapter->hwtstamp_config.tx_type.

Evaluate the proper TX timestamping condition only once on the TX
path (in tsnep_xmit_frame_ring()) and store the result in an additional
TX entry flag. Evaluate the new TX entry flag in the TX confirmation path
(in tsnep_tx_poll()).

This way SKBTX_IN_PROGRESS is set by the driver as required, but never
evaluated. SKBTX_IN_PROGRESS shall not be evaluated as it can be set
by a stacked DSA driver and evaluating it would lead to unwanted
timestamps.

Fixes: 403f69bbdbad ("tsnep: Add TSN endpoint Ethernet MAC driver")
Suggested-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20250514195657.25874-1-gerhard@engleder-embedded.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Chen Ni [Wed, 14 May 2025 06:40:53 +0000 (14:40 +0800)]

net: prestera: Use to_delayed_work()

Use to_delayed_work() instead of open-coding it.

Signed-off-by: Chen Ni <nichen@iscas.ac.cn>
Link: https://patch.msgid.link/20250514064053.2513921-1-nichen@iscas.ac.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Chen Ni [Wed, 14 May 2025 07:24:19 +0000 (15:24 +0800)]

net/mlx5: Use to_delayed_work()

Use to_delayed_work() instead of open-coding it.

Signed-off-by: Chen Ni <nichen@iscas.ac.cn>
Acked-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250514072419.2707578-1-nichen@iscas.ac.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Pengtao He [Wed, 14 May 2025 13:20:13 +0000 (21:20 +0800)]

net/tls: fix kernel panic when alloc_page failed

We cannot set frag_list to NULL pointer when alloc_page failed.
It will be used in tls_strp_check_queue_ok when the next time
tls_strp_read_sock is called.

This is because we don't reset full_len in tls_strp_flush_anchor_copy()
so the recv path will try to continue handling the partial record
on the next call but we dettached the rcvq from the frag list.
Alternative fix would be to reset full_len.

Unable to handle kernel NULL pointer dereference
at virtual address 0000000000000028
Call trace:
tls_strp_check_rcv+0x128/0x27c
tls_strp_data_ready+0x34/0x44
tls_data_ready+0x3c/0x1f0
tcp_data_ready+0x9c/0xe4
tcp_data_queue+0xf6c/0x12d0
tcp_rcv_established+0x52c/0x798

Fixes: 84c61fe1a75b ("tls: rx: do not use the standard strparser")
Signed-off-by: Pengtao He <hept.hept.hept@gmail.com>
Link: https://patch.msgid.link/20250514132013.17274-1-hept.hept.hept@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Jakub Kicinski [Thu, 15 May 2025 14:19:49 +0000 (07:19 -0700)]

Merge tag 'wireless-2025-05-15' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless

Johannes Berg says:

====================
Couple of stragglers:
- mac80211: fix syzbot/ubsan in scan counted-by
- mt76: fix NAPI handling on driver remove
- mt67: fix multicast/ipv6 receive

* tag 'wireless-2025-05-15' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
  wifi: mac80211: Set n_channels after allocating struct cfg80211_scan_request
  wifi: mt76: mt7925: fix missing hdr_trans_tlv command for broadcast wtbl
  wifi: mt76: disable napi on driver removal
====================

Link: https://patch.msgid.link/20250515121749.61912-4-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Michael Chan [Wed, 14 May 2025 06:29:08 +0000 (23:29 -0700)]

bnxt_en: bring back rtnl_lock() in the bnxt_open() path

Error recovery, PCIe AER, resume, and TX timeout will invoke bnxt_open()
with netdev_lock only.  This will cause RTNL assert failure in
netif_set_real_num_tx_queues(), netif_set_real_num_tx_queues(),
and netif_set_real_num_tx_queues().

Example error recovery assert:

RTNL: assertion failed at net/core/dev.c (3178)
WARNING: CPU: 3 PID: 3392 at net/core/dev.c:3178 netif_set_real_num_tx_queues+0x1fd/0x210

Call Trace:
<TASK>
? __pfx_bnxt_msix+0x10/0x10 [bnxt_en]
__bnxt_open_nic+0x1ef/0xb20 [bnxt_en]
bnxt_open+0xda/0x130 [bnxt_en]
bnxt_fw_reset_task+0x21f/0x780 [bnxt_en]
process_scheduled_works+0x9d/0x400

For now, bring back rtnl_lock() in all these code paths that can invoke
bnxt_open().  In the bnxt_queue_start() error path, we don't have
rtnl_lock held so we just change it to call netif_close() instead of
bnxt_reset_task() for simplicity.  This error path is unlikely so it
should be fine.

Fixes: 004b5008016a ("eth: bnxt: remove most dependencies on RTNL")
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20250514062908.2766677-1-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Ido Schimmel [Wed, 14 May 2025 12:48:05 +0000 (14:48 +0200)]

mlxsw: spectrum_router: Fix use-after-free when deleting GRE net devices

The driver only offloads neighbors that are constructed on top of net
devices registered by it or their uppers (which are all Ethernet). The
device supports GRE encapsulation and decapsulation of forwarded
traffic, but the driver will not offload dummy neighbors constructed on
top of GRE net devices as they are not uppers of its net devices:

# ip link add name gre1 up type gre tos inherit local 192.0.2.1 remote 198.51.100.1
# ip neigh add 0.0.0.0 lladdr 0.0.0.0 nud noarp dev gre1
$ ip neigh show dev gre1 nud noarp
0.0.0.0 lladdr 0.0.0.0 NOARP

(Note that the neighbor is not marked with 'offload')

When the driver is reloaded and the existing configuration is replayed,
the driver does not perform the same check regarding existing neighbors
and offloads the previously added one:

# devlink dev reload pci/0000:01:00.0
$ ip neigh show dev gre1 nud noarp
0.0.0.0 lladdr 0.0.0.0 offload NOARP

If the neighbor is later deleted, the driver will ignore the
notification (given the GRE net device is not its upper) and will
therefore keep referencing freed memory, resulting in a use-after-free
[1] when the net device is deleted:

# ip neigh del 0.0.0.0 lladdr 0.0.0.0 dev gre1
# ip link del dev gre1

Fix by skipping neighbor replay if the net device for which the replay
is performed is not our upper.

[1]
BUG: KASAN: slab-use-after-free in mlxsw_sp_neigh_entry_update+0x1ea/0x200
Read of size 8 at addr ffff888155b0e420 by task ip/2282
[...]
Call Trace:
<TASK>
dump_stack_lvl+0x6f/0xa0
print_address_description.constprop.0+0x6f/0x350
print_report+0x108/0x205
kasan_report+0xdf/0x110
mlxsw_sp_neigh_entry_update+0x1ea/0x200
mlxsw_sp_router_rif_gone_sync+0x2a8/0x440
mlxsw_sp_rif_destroy+0x1e9/0x750
mlxsw_sp_netdevice_ipip_ol_event+0x3c9/0xdc0
mlxsw_sp_router_netdevice_event+0x3ac/0x15e0
notifier_call_chain+0xca/0x150
call_netdevice_notifiers_info+0x7f/0x100
unregister_netdevice_many_notify+0xc8c/0x1d90
rtnl_dellink+0x34e/0xa50
rtnetlink_rcv_msg+0x6fb/0xb70
netlink_rcv_skb+0x131/0x360
netlink_unicast+0x426/0x710
netlink_sendmsg+0x75a/0xc20
__sock_sendmsg+0xc1/0x150
____sys_sendmsg+0x5aa/0x7b0
___sys_sendmsg+0xfc/0x180
__sys_sendmsg+0x121/0x1b0
do_syscall_64+0xbb/0x1d0
entry_SYSCALL_64_after_hwframe+0x4b/0x53

Fixes: 8fdb09a7674c ("mlxsw: spectrum_router: Replay neighbours when RIF is made")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://patch.msgid.link/c53c02c904fde32dad484657be3b1477884e9ad6.1747225701.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Paolo Abeni [Thu, 15 May 2025 13:23:42 +0000 (15:23 +0200)]

Merge branch 'net-cover-more-per-cpu-storage-with-local-nested-bh-locking'

Sebastian Andrzej Siewior says:

====================
net: Cover more per-CPU storage with local nested BH locking

I was looking at the build-time defined per-CPU variables in net/ and
added the needed local-BH-locks in order to be able to remove the
current per-CPU lock in local_bh_disable() on PREMPT_RT.

The work is not yet complete, I just wanted to post what I have so far
instead of sitting on it.

v3: https://lore.kernel.org/all/20250430124758.1159480-1-bigeasy@linutronix.de/
v2: https://lore.kernel.org/all/20250414160754.503321-1-bigeasy@linutronix.de
v1: https://lore.kernel.org/all/20250309144653.825351-1-bigeasy@linutronix.de
====================

Link: https://patch.msgid.link/20250512092736.229935-1-bigeasy@linutronix.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

commit | commitdiff | tree

Sebastian Andrzej Siewior [Mon, 12 May 2025 09:27:36 +0000 (11:27 +0200)]

rds: Use nested-BH locking for rds_page_remainder

rds_page_remainder is a per-CPU variable and relies on disabled BH for its
locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT
this data structure requires explicit locking.

Add a local_lock_t to the data structure and use
local_lock_nested_bh() for locking. This change adds only lockdep
coverage and does not alter the functional behaviour for !PREEMPT_RT.

Cc: Allison Henderson <allison.henderson@oracle.com>
Cc: linux-rdma@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20250512092736.229935-16-bigeasy@linutronix.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

commit | commitdiff | tree

Sebastian Andrzej Siewior [Mon, 12 May 2025 09:27:35 +0000 (11:27 +0200)]

rds: Acquire per-CPU pointer within BH disabled section

rds_page_remainder_alloc() obtains the current CPU with get_cpu() while
disabling preemption. Then the CPU number is used to access the per-CPU
data structure via per_cpu().

This can be optimized by relying on local_bh_disable() to provide a
stable CPU number/ avoid migration and then using this_cpu_ptr() to
retrieve the data structure.

Cc: Allison Henderson <allison.henderson@oracle.com>
Cc: linux-rdma@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20250512092736.229935-15-bigeasy@linutronix.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

commit | commitdiff | tree

Sebastian Andrzej Siewior [Mon, 12 May 2025 09:27:34 +0000 (11:27 +0200)]

rds: Disable only bottom halves in rds_page_remainder_alloc()

rds_page_remainder_alloc() is invoked from a preemptible context or a
tasklet. There is no need to disable interrupts for locking.

Use local_bh_disable() instead of local_irq_save() for locking.

Cc: Allison Henderson <allison.henderson@oracle.com>
Cc: linux-rdma@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20250512092736.229935-14-bigeasy@linutronix.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

commit | commitdiff | tree

Sebastian Andrzej Siewior [Mon, 12 May 2025 09:27:33 +0000 (11:27 +0200)]

mptcp: Use nested-BH locking for hmac_storage

mptcp_delegated_actions is a per-CPU variable and relies on disabled BH for its
locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT this data
structure requires explicit locking.

Add a local_lock_t to the data structure and use local_lock_nested_bh() for
locking. This change adds only lockdep coverage and does not alter the
functional behaviour for !PREEMPT_RT.

Cc: Matthieu Baerts <matttbe@kernel.org>
Cc: Mat Martineau <martineau@kernel.org>
Cc: Geliang Tang <geliang@kernel.org>
Cc: mptcp@lists.linux.dev
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20250512092736.229935-13-bigeasy@linutronix.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

commit | commitdiff | tree

Sebastian Andrzej Siewior [Mon, 12 May 2025 09:27:32 +0000 (11:27 +0200)]

net/sched: Use nested-BH locking for sch_frag_data_storage

sch_frag_data_storage is a per-CPU variable and relies on disabled BH
for its locking. Without per-CPU locking in local_bh_disable() on
PREEMPT_RT this data structure requires explicit locking.

Add local_lock_t to the struct and use local_lock_nested_bh() for locking.
This change adds only lockdep coverage and does not alter the functional
behaviour for !PREEMPT_RT.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20250512092736.229935-12-bigeasy@linutronix.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

commit | commitdiff | tree

Sebastian Andrzej Siewior [Mon, 12 May 2025 09:27:31 +0000 (11:27 +0200)]

net/sched: act_mirred: Move the recursion counter struct netdev_xmit

mirred_nest_level is a per-CPU variable and relies on disabled BH for its
locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT
this data structure requires explicit locking.

Move mirred_nest_level to struct netdev_xmit as u8, provide wrappers.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://patch.msgid.link/20250512092736.229935-11-bigeasy@linutronix.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

commit | commitdiff | tree

Sebastian Andrzej Siewior [Mon, 12 May 2025 09:27:30 +0000 (11:27 +0200)]

openvswitch: Move ovs_frag_data_storage into the struct ovs_pcpu_storage

ovs_frag_data_storage is a per-CPU variable and relies on disabled BH for its
locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT
this data structure requires explicit locking.

Move ovs_frag_data_storage into the struct ovs_pcpu_storage which already
provides locking for the structure.

Cc: Aaron Conole <aconole@redhat.com>
Cc: Eelco Chaudron <echaudro@redhat.com>
Cc: Ilya Maximets <i.maximets@ovn.org>
Cc: dev@openvswitch.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Aaron Conole <aconole@redhat.com>
Link: https://patch.msgid.link/20250512092736.229935-10-bigeasy@linutronix.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

commit | commitdiff | tree

Sebastian Andrzej Siewior [Mon, 12 May 2025 09:27:29 +0000 (11:27 +0200)]

openvswitch: Use nested-BH locking for ovs_pcpu_storage

ovs_pcpu_storage is a per-CPU variable and relies on disabled BH for its
locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT
this data structure requires explicit locking.
The data structure can be referenced recursive and there is a recursion
counter to avoid too many recursions.

Add a local_lock_t to the data structure and use
local_lock_nested_bh() for locking. Add an owner of the struct which is
the current task and acquire the lock only if the structure is not owned
by the current task.

Cc: Aaron Conole <aconole@redhat.com>
Cc: Eelco Chaudron <echaudro@redhat.com>
Cc: Ilya Maximets <i.maximets@ovn.org>
Cc: dev@openvswitch.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Aaron Conole <aconole@redhat.com>
Link: https://patch.msgid.link/20250512092736.229935-9-bigeasy@linutronix.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

commit | commitdiff | tree

Sebastian Andrzej Siewior [Mon, 12 May 2025 09:27:28 +0000 (11:27 +0200)]

openvswitch: Merge three per-CPU structures into one

exec_actions_level is a per-CPU integer allocated at compile time.
action_fifos and flow_keys are per-CPU pointer and have their data
allocated at module init time.
There is no gain in splitting it, once the module is allocated, the
structures are allocated.

Merge the three per-CPU variables into ovs_pcpu_storage, adapt callers.

Cc: Aaron Conole <aconole@redhat.com>
Cc: Eelco Chaudron <echaudro@redhat.com>
Cc: Ilya Maximets <i.maximets@ovn.org>
Cc: dev@openvswitch.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Aaron Conole <aconole@redhat.com>
Link: https://patch.msgid.link/20250512092736.229935-8-bigeasy@linutronix.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

commit | commitdiff | tree

Sebastian Andrzej Siewior [Mon, 12 May 2025 09:27:27 +0000 (11:27 +0200)]

xfrm: Use nested-BH locking for nat_keepalive_sk_ipv[46]

nat_keepalive_sk_ipv[46] is a per-CPU variable and relies on disabled BH
for its locking. Without per-CPU locking in local_bh_disable() on
PREEMPT_RT this data structure requires explicit locking.

Use sock_bh_locked which has a sock pointer and a local_lock_t. Use
local_lock_nested_bh() for locking. This change adds only lockdep
coverage and does not alter the functional behaviour for !PREEMPT_RT.

Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20250512092736.229935-7-bigeasy@linutronix.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

commit | commitdiff | tree

Sebastian Andrzej Siewior [Mon, 12 May 2025 09:27:26 +0000 (11:27 +0200)]

xdp: Use nested-BH locking for system_page_pool

system_page_pool is a per-CPU variable and relies on disabled BH for its
locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT
this data structure requires explicit locking.

Make a struct with a page_pool member (original system_page_pool) and a
local_lock_t and use local_lock_nested_bh() for locking. This change
adds only lockdep coverage and does not alter the functional behaviour
for !PREEMPT_RT.

Cc: Andrew Lunn <andrew+netdev@lunn.ch>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20250512092736.229935-6-bigeasy@linutronix.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

commit | commitdiff | tree

Sebastian Andrzej Siewior [Mon, 12 May 2025 09:27:25 +0000 (11:27 +0200)]

ipv6: sr: Use nested-BH locking for hmac_storage

hmac_storage is a per-CPU variable and relies on disabled BH for its
locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT
this data structure requires explicit locking.

Add a local_lock_t to the data structure and use
local_lock_nested_bh() for locking. This change adds only lockdep
coverage and does not alter the functional behaviour for !PREEMPT_RT.

Cc: David Ahern <dsahern@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20250512092736.229935-5-bigeasy@linutronix.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

commit | commitdiff | tree

Sebastian Andrzej Siewior [Mon, 12 May 2025 09:27:24 +0000 (11:27 +0200)]

ipv4/route: Use this_cpu_inc() for stats on PREEMPT_RT

The statistics are incremented with raw_cpu_inc() assuming it always
happens with bottom half disabled. Without per-CPU locking in
local_bh_disable() on PREEMPT_RT this is no longer true.

Use this_cpu_inc() on PREEMPT_RT for the increment to not worry about
preemption.

Cc: David Ahern <dsahern@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20250512092736.229935-4-bigeasy@linutronix.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

commit | commitdiff | tree

Sebastian Andrzej Siewior [Mon, 12 May 2025 09:27:23 +0000 (11:27 +0200)]

net: dst_cache: Use nested-BH locking for dst_cache::cache

dst_cache::cache is a per-CPU variable and relies on disabled BH for its
locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT
this data structure requires explicit locking.

Add a local_lock_t to the data structure and use
local_lock_nested_bh() for locking. This change adds only lockdep
coverage and does not alter the functional behaviour for !PREEMPT_RT.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20250512092736.229935-3-bigeasy@linutronix.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

commit | commitdiff | tree

Sebastian Andrzej Siewior [Mon, 12 May 2025 09:27:22 +0000 (11:27 +0200)]

net: page_pool: Don't recycle into cache on PREEMPT_RT

With preemptible softirq and no per-CPU locking in local_bh_disable() on
PREEMPT_RT the consumer can be preempted while a skb is returned.

Avoid the race by disabling the recycle into the cache on PREEMPT_RT.

Cc: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20250512092736.229935-2-bigeasy@linutronix.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

commit | commitdiff | tree

Johannes Berg [Thu, 15 May 2025 11:42:48 +0000 (13:42 +0200)]

Merge tag 'mt76-fixes-2025-05-15' of https://github.com/nbd168/wireless

Felix Fietkau says:
===================
mt76 fix for 6.15

- disable napi on driver removal to fix warning
- fix multicast rx regression on mt7925
===================

Link: https://patch.msgid.link/3b526d06-b717-4d47-817c-a9f47b796a31@nbd.name/
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

commit | commitdiff | tree

Paolo Abeni [Thu, 15 May 2025 11:26:20 +0000 (13:26 +0200)]

Merge branch 'octeontx2-improve-mailbox-tracing'

Subbaraya Sundeep says:

====================
octeontx2: Improve mailbox tracing

Octeontx2 VF,PF and AF devices communicate using hardware
shared mailbox region where VFs can only to talk to its PFs
and PFs can only talk to AF. AF does the entire resource management
for all PFs and VFs. The shared mbox region is used for synchronous
requests (requests from PF to AF or VF to PF) and async notifications
(notifications from AF to PFs or PF to VFs). Sending a request to AF
from VF involves various stages like
1. VF allocates message in shared region
2. Triggers interrupt to PF
3. PF upon receiving interrupt from VF will copy the message
from VF<->PF region to PF<->AF region
4. Triggers interrupt to AF
5. AF processes it and writes response in PF<->AF region
6. Triggers interrupt to PF
7. PF copies responses from PF<->AF region to VF<->PF region
8. Triggers interrupt to Vf
9. VF reads response in VF<->PF region

Due to various stages involved, Tracepoints are used in mailbox code for
debugging. Existing tracepoints need some improvements so that maximum
information can be inferred from trace logs during an issue.
This patchset tries to enhance existing tracepoints and also adds
a couple of tracepoints.
====================

Link: https://patch.msgid.link/1747136408-30685-1-git-send-email-sbhatta@marvell.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

commit | commitdiff | tree

Subbaraya Sundeep [Tue, 13 May 2025 11:40:08 +0000 (17:10 +0530)]

octeontx2: Add new tracepoint otx2_msg_status

Apart from netdev interface Octeontx2 PF does the following:
1. Sends its own requests to AF and receives responses from AF.
2. Receives async messages from AF.
3. Forwards VF requests to AF, sends respective responses from AF to VFs.
4. Sends async messages to VFs.
This patch adds new tracepoint otx2_msg_status to display the status
of PF wrt mailbox handling.

Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
Link: https://patch.msgid.link/1747136408-30685-5-git-send-email-sbhatta@marvell.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

commit | commitdiff | tree

Subbaraya Sundeep [Tue, 13 May 2025 11:40:07 +0000 (17:10 +0530)]

octeontx2: Add pcifunc also to mailbox tracepoints

This patch adds pcifunc which represents PF and VF device to the
tracepoints otx2_msg_alloc, otx2_msg_send, otx2_msg_process so that
it is easier to correlate which device allocated the message, which
device forwarded it and which device processed that message.
Also add message id in otx2_msg_send tracepoint to check which
message is sent at any point of time from a device.

Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
Link: https://patch.msgid.link/1747136408-30685-4-git-send-email-sbhatta@marvell.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

commit | commitdiff | tree

Subbaraya Sundeep [Tue, 13 May 2025 11:40:06 +0000 (17:10 +0530)]

octeontx2-af: Display names for CPT and UP messages

Mailbox UP messages and CPT messages names are not being
displayed with their names in trace log files. Add those
messages too in otx2_mbox_id2name.

Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
Link: https://patch.msgid.link/1747136408-30685-3-git-send-email-sbhatta@marvell.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

commit | commitdiff | tree

Subbaraya Sundeep [Tue, 13 May 2025 11:40:05 +0000 (17:10 +0530)]

octeontx2-af: convert dev_dbg to tracepoint in mbox

Use tracepoint instead of dev_dbg since the entire
mailbox code uses tracepoints for debugging.

Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
Link: https://patch.msgid.link/1747136408-30685-2-git-send-email-sbhatta@marvell.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Unnamed repository; edit this file 'description' to name the repository.

RSS Atom