www.infradead.org Git - users/jedix/linux-maple.git/log

net/mlx5e: Convert mlx5 netdevs to instance locking

This patch convert mlx5 to use the new netdev instance lock in addition
to the pre-existing state_lock (and the RTNL).

mlx5e_priv.state_lock was already used throughout mlx5 to protect
against concurrent state modifications on the same netdev, usually in
addition to the RTNL. The new netdev instance lock will eventually
replace it, but for now, it is acquired in addition to the existing
locks in the order RTNL -> instance lock -> state_lock.

All three netdev types handled by mlx5 are converted to the new style of
locking, because they share a lot of code related to initializing
channels and dealing with NAPI, so it's better to convert all three
rather than introduce different assumptions deep in the call stack
depending on the type of device.

Because of the nature of the call graphs in mlx5, it wasn't possible to
incrementally convert parts of the driver to use the new lock, since
either all call paths into NAPI have to possess the new lock if the
*_locked variants are used, or none of them can have the lock.

One area which required extra care is the interaction between closing
channels and devlink health reporter tasks.
Previously, the recovery tasks were unconditionally acquiring the
RTNL, which could lead to deadlocks in these scenarios:

T1: mlx5e_close (== .ndo_stop(), has RTNL) -> mlx5e_close_locked
-> mlx5e_close_channels -> mlx5e_ptp_close
-> mlx5e_ptp_close_queues -> mlx5e_ptp_close_txqsqs
-> mlx5e_ptp_close_txqsq
-> cancel_work_sync(&ptpsq->report_unhealthy_work) waits for

T2: mlx5e_ptpsq_unhealthy_work -> mlx5e_reporter_tx_ptpsq_unhealthy
-> mlx5e_health_report -> devlink_health_report
-> devlink_health_reporter_recover
-> mlx5e_tx_reporter_ptpsq_unhealthy_recover which does:
rtnl_lock(); => Deadlock.

Another similar instance of this is:
T1: mlx5e_close (== .ndo_stop(), has RTNL) -> mlx5e_close_locked
-> mlx5e_close_channels -> mlx5e_ptp_close
-> mlx5e_ptp_close_queues -> mlx5e_ptp_close_txqsqs
-> mlx5e_ptp_close_txqsq
-> cancel_work_sync(&sq->recover_work) waits for

T2: mlx5e_tx_err_cqe_work -> mlx5e_reporter_tx_err_cqe
-> mlx5e_health_report -> devlink_health_report
-> devlink_health_reporter_recover
-> mlx5e_tx_reporter_err_cqe_recover which does:
rtnl_lock(); => Another deadlock.

Fix that by using the same pattern previously done in
mlx5e_tx_timeout_work, where the RTNL was repeatedly tried to be
acquired until either:
a) it is successfully acquired or
b) there's no need for the work to be done any more (channel is being
closed).

Now, for all three recovery tasks, the instance lock is repeatedly tried
to be acquired until successful or the channel/SQ is closed.
As a side-effect, drop the !test_bit(MLX5E_STATE_OPENED, &priv->state)
check from mlx5e_tx_timeout_work, it's weaker than
!test_bit(MLX5E_STATE_CHANNELS_ACTIVE, &priv->state) and unnecessary.

Future patches will introduce new call paths (from netdev queue
management ops) which can close channels (and call cancel_work_sync on
the recovery tasks) without the RTNL lock and only with the netdev
instance lock.

Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1747829342-1018757-6-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: Don't drop RTNL during firmware flash

There's no explanation in the original commit of why that was done, but
presumably flashing takes a long time and holding RTNL for so long
blocks other interactions with the netdev layer.

However, the stack is moving towards netdev instance locking and
dropping and reacquiring RTNL in the context of flashing introduces
locking ordering issues: RTNL must be acquired before the netdev
instance lock and released after it.

This patch therefore takes the simpler approach by no longer dropping
and reacquiring the RTNL, as soon RTNL for ethtool will be removed,
leaving only the instance lock to protect against races.

Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1747829342-1018757-5-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

IB/IPoIB: Allow using netdevs that require the instance lock

After the last patch removing vlan_rwsem, it is an incremental step to
allow ipoib to work with netdevs that require the instance lock.
In several places, netdev_lock() is changed to netdev_lock_ops_to_full()
which takes care of not acquiring the lock again when the netdev is
already locked.

In ipoib_ib_tx_timeout_work() and __ipoib_ib_dev_flush() for HEAVY
flushes, the netdev lock is acquired/released. This is needed because
these functions end up calling .ndo_stop()/.ndo_open() on subinterfaces,
and the device may expect the netdev instance lock to be held.

ipoib_set_mode() now explicitly acquires ops lock while manipulating the
features, mtu and tx queues.

Finally, ipoib_napi_enable()/ipoib_napi_disable() now use the *_locked
variants of the napi_enable()/napi_disable() calls and optionally
acquire the netdev lock themselves depending on the dev they operate on.

Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1747829342-1018757-4-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

IB/IPoIB: Replace vlan_rwsem with the netdev instance lock

vlan_rwsem was added more than a decade ago to work around a deadlock
involving the original mutex being acquired twice, once from the wq.
Subsequent changes then tweaked it to partially protect access to
ipoib_dev_priv->child_intfs together with the RTNL. Flushing the wq
synchronously was also since then refactored to happen separately.

This semaphore unfortunately prevents updating ipoib to work with
devices that require the netdev lock, because of lock ordering issues
between RTNL, vlan_rwsem and the netdev instance locks of parent and
child devices.

To uncomplicate things, this commit replaces vlan_rwsem with the netdev
instance lock of the parent device. Both parent child_intfs list and the
children's list membership in it require holding the parent netdev
instance lock.

All call paths were carefully reviewed and no-longer-needed ASSERT_RTNL
calls were dropped. Some non-trivial changes:
- ipoib_match_gid_pkey_addr() now only acquires the instance lock and
  iterates through child_intfs for the first level of recursion (the
  parent), as it's not possible to have multiple levels of nested
  subinterfaces.
- ipoib_open() and ipoib_stop() schedule tasks on the global workqueue
  to open/stop child interfaces to avoid potentially acquiring nested
  netdev instance locks. To avoid the device going away between the task
  scheduling and execution, netdev_hold/netdev_put are used.

Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1747829342-1018757-3-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

IB/IPoIB: Enqueue separate work_structs for each flushed interface

Previously, flushing a netdevice involved first flushing all child
devices from the flush task itself. That requires holding the lock that
protects the list for the entire duration of the flush.

This poses a problem when converting from vlan_rwsem to the netdev
instance lock (next patch), because holding the parent lock while
trying to acquire a child lock makes lockdep unhappy, rightfully.

Fix this by splitting a big flush task into individual flush tasks
(all are already created in their respective ipoib_dev_priv structs)
and defining a helper function to enqueue all of them while holding the
list lock.

In ipoib_set_mac, the function is not used and the task is enqueued
directly, because in the subsequent patches locking is changed and this
function may be called with the netdev instance lock held.

This is effectively a noop, the wq is single-threaded and ordered and
will execute the same flush operations in the same order as before.

Furthermore, there should be no new races because
ipoib_parent_unregister_pre() calls flush_workqueue() after stopping new
work generation to wait for pending work to complete. flush_workqueue()
waits for all currently enqueued work to finish before returning.

Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1747829342-1018757-2-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

eth: bnxt: fix deadlock when xdp is attached or detached

When xdp is attached or detached, dev->ndo_bpf() is called by
do_setlink(), and it acquires netdev_lock() if needed.
Unlike other drivers, the bnxt driver is protected by netdev_lock while
xdp is attached/detached because it sets dev->request_ops_lock to true.

So, the bnxt_xdp(), that is callback of ->ndo_bpf should not acquire
netdev_lock().
But the xdp_features_{set | clear}_redirect_target() was changed to
acquire netdev_lock() internally.
It causes a deadlock.
To fix this problem, bnxt driver should use
xdp_features_{set | clear}_redirect_target_locked() instead.

Splat looks like:
============================================
WARNING: possible recursive locking detected
6.15.0-rc6+ #1 Not tainted
--------------------------------------------
bpftool/1745 is trying to acquire lock:
ffff888131b85038 (&dev->lock){+.+.}-{4:4}, at: xdp_features_set_redirect_target+0x1f/0x80

but task is already holding lock:
ffff888131b85038 (&dev->lock){+.+.}-{4:4}, at: do_setlink.constprop.0+0x24e/0x35d0

other info that might help us debug this:
Possible unsafe locking scenario:

       CPU0
       ----
  lock(&dev->lock);
  lock(&dev->lock);

*** DEADLOCK ***

May be due to missing lock nesting notation

3 locks held by bpftool/1745:
#0: ffffffffa56131c8 (rtnl_mutex){+.+.}-{4:4}, at: rtnl_setlink+0x1fe/0x570
#1: ffffffffaafa75a0 (&net->rtnl_mutex){+.+.}-{4:4}, at: rtnl_setlink+0x236/0x570
#2: ffff888131b85038 (&dev->lock){+.+.}-{4:4}, at: do_setlink.constprop.0+0x24e/0x35d0

stack backtrace:
CPU: 1 UID: 0 PID: 1745 Comm: bpftool Not tainted 6.15.0-rc6+ #1 PREEMPT(undef)
Hardware name: ASUS System Product Name/PRIME Z690-P D4, BIOS 0603 11/01/2021
Call Trace:
<TASK>
dump_stack_lvl+0x7a/0xd0
print_deadlock_bug+0x294/0x3d0
__lock_acquire+0x153b/0x28f0
lock_acquire+0x184/0x340
? xdp_features_set_redirect_target+0x1f/0x80
__mutex_lock+0x1ac/0x18a0
? xdp_features_set_redirect_target+0x1f/0x80
? xdp_features_set_redirect_target+0x1f/0x80
? __pfx_bnxt_rx_page_skb+0x10/0x10 [bnxt_en
? __pfx___mutex_lock+0x10/0x10
? __pfx_netdev_update_features+0x10/0x10
? bnxt_set_rx_skb_mode+0x284/0x540 [bnxt_en
? __pfx_bnxt_set_rx_skb_mode+0x10/0x10 [bnxt_en
? xdp_features_set_redirect_target+0x1f/0x80
xdp_features_set_redirect_target+0x1f/0x80
bnxt_xdp+0x34e/0x730 [bnxt_en 11cbcce8fa11cff1dddd7ef358d6219e4ca9add3]
dev_xdp_install+0x3f4/0x830
? __pfx_bnxt_xdp+0x10/0x10 [bnxt_en 11cbcce8fa11cff1dddd7ef358d6219e4ca9add3]
? __pfx_dev_xdp_install+0x10/0x10
dev_xdp_attach+0x560/0xf70
dev_change_xdp_fd+0x22d/0x280
do_setlink.constprop.0+0x2989/0x35d0
? __pfx_do_setlink.constprop.0+0x10/0x10
? lock_acquire+0x184/0x340
? find_held_lock+0x32/0x90
? rtnl_setlink+0x236/0x570
? rcu_is_watching+0x11/0xb0
? trace_contention_end+0xdc/0x120
? __mutex_lock+0x946/0x18a0
? __pfx___mutex_lock+0x10/0x10
? __lock_acquire+0xa95/0x28f0
? rcu_is_watching+0x11/0xb0
? rcu_is_watching+0x11/0xb0
? cap_capable+0x172/0x350
rtnl_setlink+0x2cd/0x570

Fixes: 03df156dd3a6 ("xdp: double protect netdev->xdp_flags with netdev->lock")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20250520071155.2462843-1-ap420073@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: Add support for providing the PTP hardware source in tsinfo

Multi-PTP source support within a network topology has been merged,
but the hardware timestamp source is not yet exposed to users.
Currently, users only see the PTP index, which does not indicate
whether the timestamp comes from a PHY or a MAC.

Add support for reporting the hwtstamp source using a
hwtstamp-source field, alongside hwtstamp-phyindex, to describe
the origin of the hardware timestamp.

Remove HWTSTAMP_SOURCE_UNSPEC enum value as it is not used at all.

Signed-off-by: Kory Maincent <kory.maincent@bootlin.com>
Link: https://patch.msgid.link/20250519-feature_ptp_source-v4-1-5d10e19a0265@bootlin.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'net-mlx5-hws-set-of-fixes-and-adjustments'

Tariq Toukan says:

====================
net/mlx5: HWS, set of fixes and adjustments

This patch series by Yevgeny and Vlad introduces a set of steering fixes
and adjustments.
====================

Link: https://patch.msgid.link/1747766802-958178-1-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: HWS, handle modify header actions dependency

Having adjacent accelerated modify header actions (so-called
pattern-argument actions) may result in inconsistent outcome.
These inconsistencies can take the form of writes to the same
field or a read coupled with a write to the same field. The
solution is to detect such dependencies and insert nops between
the offending actions.

The existing implementation had a few issues, which pretty much
required a complete rewrite of the code that handles these
dependencies.

In the new implementation we're doing the following:

* Checking any two adjacent actions for conflicts (not just
  odd-even pairs).
* Marking 'set' and 'add' action fields as destination, rather
  than source, for the purposes of checking for conflicts.
* Checking all types of actions ('add', 'set', 'copy') for
  dependencies.
* Managing offsets of the args in the buffer - copy the action
  args to the right place in the buffer.
* Checking that after inserting nops we're still within the number
  of supported actions - return an error otherwise.

Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1747766802-958178-5-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: HWS, fix typo - 'nope' to 'nop'

Fix typo - rename 'nope_locations' to 'nop_locations', which describes
the locations of 'nop' actions. To shorten the lines, this renaming
also required some refactoring.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1747766802-958178-4-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: HWS, register reformat actions with fw

Hardware steering handles actions differently from firmware, but for
termination rules that use encapsulation the firmware needs to be aware
of the action.

Fix this by registering reformat actions with the firmware the first
time this is needed. To do this, add a third possible owner for an
action, and also a lock to protect against registration of the same
action from different threads.

Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1747766802-958178-3-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: SWS, fix reformat id error handling

The firmware reformat id is a u32 and can't safely be returned as an
int. Because the functions also need a way to signal error, prefer to
return the id as an output parameter and keep the return code only for
success/error.

While we're at it, also extract some duplicate code to fetch the
reformat id from a more generic struct pkt_reformat.

Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1747766802-958178-2-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: add debug checks in ____napi_schedule() and napi_poll()

While tracking an IDPF bug, I found that idpf_vport_splitq_napi_poll()
was not following NAPI rules.

It can indeed return @budget after napi_complete() has been called.

Add two debug conditions in networking core to hopefully catch
this kind of bugs sooner.

IDPF bug will be fixed in a separate patch.

[   72.441242] repoll requested for device eth1 idpf_vport_splitq_napi_poll [idpf] but napi is not scheduled.
[   72.446291] list_del corruption. next->prev should be ff31783d93b14040, but was ff31783d93b10080. (next=ff31783d93b10080)
[   72.446659] kernel BUG at lib/list_debug.c:67!
[   72.446816] Oops: invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC NOPTI
[   72.447031] CPU: 156 UID: 0 PID: 16258 Comm: ip Tainted: G        W           6.15.0-dbg-DEV #1944 NONE
[   72.447340] Tainted: [W]=WARN
[   72.447702] RIP: 0010:__list_del_entry_valid_or_report (lib/list_debug.c:65)
[   72.450630] Call Trace:
[   72.450720]  <IRQ>
[   72.450797] net_rx_action (include/linux/list.h:215 include/linux/list.h:287 net/core/dev.c:7385 net/core/dev.c:7516)
[   72.450928] ? lock_release (kernel/locking/lockdep.c:?)
[   72.451059] ? clockevents_program_event (kernel/time/clockevents.c:?)
[   72.451222] handle_softirqs (kernel/softirq.c:579)
[   72.451356] ? do_softirq (kernel/softirq.c:480)
[   72.451480] ? idpf_vc_xn_exec (drivers/net/ethernet/intel/idpf/idpf_virtchnl.c:462) idpf
[   72.451635] do_softirq (kernel/softirq.c:480)
[   72.451750]  </IRQ>
[   72.451828]  <TASK>
[   72.451905] __local_bh_enable_ip (kernel/softirq.c:?)
[   72.452051] idpf_vc_xn_exec (drivers/net/ethernet/intel/idpf/idpf_virtchnl.c:462) idpf
[   72.452210] idpf_send_delete_queues_msg (drivers/net/ethernet/intel/idpf/idpf_virtchnl.c:2083) idpf
[   72.452390] idpf_vport_stop (drivers/net/ethernet/intel/idpf/idpf_lib.c:837 drivers/net/ethernet/intel/idpf/idpf_lib.c:868) idpf
[   72.452541] ? idpf_vport_stop (include/linux/bottom_half.h:? include/linux/netdevice.h:4762 drivers/net/ethernet/intel/idpf/idpf_lib.c:855) idpf
[   72.452695] idpf_initiate_soft_reset (drivers/net/ethernet/intel/idpf/idpf_lib.c:?) idpf
[   72.452867] idpf_change_mtu (drivers/net/ethernet/intel/idpf/idpf_lib.c:2189) idpf
[   72.453015] netif_set_mtu_ext (net/core/dev.c:9437)
[   72.453157] ? packet_notifier (include/linux/rcupdate.h:331 include/linux/rcupdate.h:841 net/packet/af_packet.c:4240)
[   72.453292] netif_set_mtu (net/core/dev.c:9515)
[   72.453416] dev_set_mtu (net/core/dev_api.c:?)
[   72.453534] bond_change_mtu (drivers/net/bonding/bond_main.c:4833)
[   72.453666] netif_set_mtu_ext (net/core/dev.c:9437)
[   72.453803] do_setlink (net/core/rtnetlink.c:3116)
[   72.453925] ? rtnl_newlink (net/core/rtnetlink.c:3901)
[   72.454055] ? rtnl_newlink (net/core/rtnetlink.c:3901)
[   72.454185] ? rtnl_newlink (net/core/rtnetlink.c:3901)
[   72.454314] ? trace_contention_end (include/trace/events/lock.h:122)
[   72.454467] ? __mutex_lock (arch/x86/include/asm/preempt.h:85 kernel/locking/mutex.c:611 kernel/locking/mutex.c:746)
[   72.454597] ? cap_capable (include/trace/events/capability.h:26)
[   72.454721] ? security_capable (security/security.c:?)
[   72.454857] rtnl_newlink (net/core/rtnetlink.c:?)
[   72.454982] ? lock_is_held_type (kernel/locking/lockdep.c:5599 kernel/locking/lockdep.c:5938)
[   72.455121] ? __lock_acquire (kernel/locking/lockdep.c:?)
[   72.455256] ? __change_page_attr_set_clr (arch/x86/mm/pat/set_memory.c:685)
[   72.455438] ? __lock_acquire (kernel/locking/lockdep.c:?)
[   72.455582] ? rtnetlink_rcv_msg (include/linux/rcupdate.h:331 include/linux/rcupdate.h:841 net/core/rtnetlink.c:6885)
[   72.455721] ? lock_acquire (kernel/locking/lockdep.c:5866)
[   72.455848] ? rtnetlink_rcv_msg (include/linux/rcupdate.h:331 include/linux/rcupdate.h:841 net/core/rtnetlink.c:6885)
[   72.455987] ? lock_release (kernel/locking/lockdep.c:?)
[   72.456117] ? rcu_read_unlock (include/linux/rcupdate.h:341 include/linux/rcupdate.h:871)
[   72.456249] ? __pfx_rtnl_newlink (net/core/rtnetlink.c:3956)
[   72.456388] rtnetlink_rcv_msg (net/core/rtnetlink.c:6955)
[   72.456526] ? rtnetlink_rcv_msg (include/linux/rcupdate.h:331 include/linux/rcupdate.h:841 net/core/rtnetlink.c:6885)
[   72.456671] ? lock_acquire (kernel/locking/lockdep.c:5866)
[   72.456802] ? net_generic (include/linux/rcupdate.h:331 include/linux/rcupdate.h:841 include/net/netns/generic.h:45)
[   72.456929] ? __pfx_rtnetlink_rcv_msg (net/core/rtnetlink.c:6858)
[   72.457082] netlink_rcv_skb (net/netlink/af_netlink.c:2534)
[   72.457212] netlink_unicast (net/netlink/af_netlink.c:1313)
[   72.457344] netlink_sendmsg (net/netlink/af_netlink.c:1883)
[   72.457476] __sock_sendmsg (net/socket.c:712)
[   72.457602] ____sys_sendmsg (net/socket.c:?)
[   72.457735] ? _copy_from_user (arch/x86/include/asm/uaccess_64.h:126 arch/x86/include/asm/uaccess_64.h:134 arch/x86/include/asm/uaccess_64.h:141 include/linux/uaccess.h:178 lib/usercopy.c:18)
[   72.457875] ___sys_sendmsg (net/socket.c:2620)
[   72.458042] ? __call_rcu_common (arch/x86/include/asm/irqflags.h:42 arch/x86/include/asm/irqflags.h:119 arch/x86/include/asm/irqflags.h:159 kernel/rcu/tree.c:3107)
[   72.458185] ? mntput_no_expire (include/linux/rcupdate.h:331 include/linux/rcupdate.h:841 fs/namespace.c:1457)
[   72.458324] ? lock_acquire (kernel/locking/lockdep.c:5866)
[   72.458451] ? mntput_no_expire (include/linux/rcupdate.h:331 include/linux/rcupdate.h:841 fs/namespace.c:1457)
[   72.458588] ? lock_release (kernel/locking/lockdep.c:?)
[   72.458718] ? mntput_no_expire (include/linux/rcupdate.h:331 include/linux/rcupdate.h:841 fs/namespace.c:1457)
[   72.458856] __x64_sys_sendmsg (net/socket.c:2652)
[   72.458997] ? do_syscall_64 (arch/x86/include/asm/irqflags.h:42 arch/x86/include/asm/irqflags.h:119 include/linux/entry-common.h:198 arch/x86/entry/syscall_64.c:90)
[   72.459136] do_syscall_64 (arch/x86/entry/syscall_64.c:?)
[   72.459259] ? exc_page_fault (arch/x86/mm/fault.c:1542)
[   72.459387] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
[   72.459555] RIP: 0033:0x7fd15f17cbd0

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250520121908.1805732-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/enic: Allow at least 8 RQs to always be used

Enic started using netif_get_num_default_rss_queues() to set the number
of RQs used in commit cc94d6c4d40c ("enic: Adjust used MSI-X
wq/rq/cq/interrupt resources in a more robust way")

This resulted in machines with less than 16 cpus using less than 8 RQs.
Allow enic to use at least 8 RQs no matter how many cpus are in the
machine to not impact existing enic workloads after a kernel upgrade.

Reviewed-by: John Daley <johndale@cisco.com>
Reviewed-by: Satish Kharat <satishkh@cisco.com>
Signed-off-by: Nelson Escobar <neescoba@cisco.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Link: https://patch.msgid.link/20250521-enic_min_8rq-v1-1-691bd2353273@cisco.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

hinic3: module initialization and tx/rx logic

This is [1/3] part of hinic3 Ethernet driver initial submission.
With this patch hinic3 is a valid kernel module but non-functional
driver.

The driver parts contained in this patch:
Module initialization.
PCI driver registration but with empty id_table.
Auxiliary driver registration.
Net device_ops registration but open/stop are empty stubs.
tx/rx logic.

All major data structures of the driver are fully introduced with the
code that uses them but without their initialization code that requires
management interface with the hw.

Co-developed-by: Xin Guo <guoxin09@huawei.com>
Signed-off-by: Xin Guo <guoxin09@huawei.com>
Signed-off-by: Fan Gong <gongfan1@huawei.com>
Co-developed-by: Gur Stavi <gur.stavi@huawei.com>
Signed-off-by: Gur Stavi <gur.stavi@huawei.com>
Link: https://patch.msgid.link/76a137ffdfe115c737c2c224f0c93b60ba53cc16.1747736586.git.gur.stavi@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

nfc: Correct Samsung "Electronics" spelling in copyright headers

Fix the misspelling of "Electronics" in copyright headers across:
- s3fwrn5 driver
- virtual_ncidev driver

Signed-off-by: Sumanth Gavini <sumanth.gavini@yahoo.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://patch.msgid.link/20250520072119.176018-1-sumanth.gavini@yahoo.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

emulex/benet: correct command version selection in be_cmd_get_stats()

Logic here always sets hdr->version to 2 if it is not a BE3 or Lancer chip,
even if it is BE2. Use 'else if' to prevent multiple assignments, setting
version 0 for BE2, version 1 for BE3 and Lancer, and version 2 for others.
Fixes potential incorrect version setting when BE2_chip and
BE3_chip/lancer_chip checks could both be true.

Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Link: https://patch.msgid.link/20250519141731.691136-1-alok.a.tiwari@oracle.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-phy-add-support-for-new-aeonsemi-phys'

Christian Marangi says:

====================
net: phy: Add support for new Aeonsemi PHYs

Add support for new Aeonsemi 10G C45 PHYs. These PHYs intergate an IPC
to setup some configuration and require special handling to sync with
the parity bit. The parity bit is a way the IPC use to follow correct
order of command sent.

Supported PHYs AS21011JB1, AS21011PB1, AS21010JB1, AS21010PB1,
AS21511JB1, AS21511PB1, AS21510JB1, AS21510PB1, AS21210JB1,
AS21210PB1 that all register with the PHY ID 0x7500 0x7500
before the firmware is loaded.

The big special thing about this PHY is that it does provide
a generic PHY ID in C45 register that change to the correct one
one the firmware is loaded.

In practice:
- MMD 0x7 ID 0x7500 0x9410 -> FW LOAD -> ID 0x7500 0x9422

To handle this, we operate on .match_phy_device where
we check the PHY ID, if the ID match the generic one,
we load the firmware and we return 0 (PHY driver doesn't
match). Then PHY core will try the next PHY driver in the list
and this time the PHY is correctly filled in and we register
for it.

To help in the matching and not modify part of the PHY device
struct, .match_phy_device is extended to provide also the
current phy_driver is trying to match for. This add the
extra benefits that some other PHY can simplify their
.match_phy_device OP.
====================

Link: https://patch.msgid.link/20250517201353.5137-1-ansuelsmth@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dt-bindings: net: Document support for Aeonsemi PHYs

Add Aeonsemi PHYs and the requirement of a firmware to correctly work.
Also document the max number of LEDs supported and what PHY ID expose
when no firmware is loaded.

Supported PHYs AS21011JB1, AS21011PB1, AS21010JB1, AS21010PB1,
AS21511JB1, AS21511PB1, AS21510JB1, AS21510PB1, AS21210JB1,
AS21210PB1 that all register with the PHY ID 0x7500 0x9410 on C45
registers before the firmware is loaded.

Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
Link: https://patch.msgid.link/20250517201353.5137-7-ansuelsmth@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: Add support for Aeonsemi AS21xxx PHYs

Add support for Aeonsemi AS21xxx 10G C45 PHYs. These PHYs integrate
an IPC to setup some configuration and require special handling to
sync with the parity bit. The parity bit is a way the IPC use to
follow correct order of command sent.

Supported PHYs AS21011JB1, AS21011PB1, AS21010JB1, AS21010PB1,
AS21511JB1, AS21511PB1, AS21510JB1, AS21510PB1, AS21210JB1,
AS21210PB1 that all register with the PHY ID 0x7500 0x7510
before the firmware is loaded.

They all support up to 5 LEDs with various HW mode supported.

While implementing it was found some strange coincidence with using the
same logic for implementing C22 in MMD regs in Broadcom PHYs.

For reference here the AS21xxx PHY name logic:

AS21x1xxB1
    ^ ^^
    | |J: Supports SyncE/PTP
    | |P: No SyncE/PTP support
    | 1: Supports 2nd Serdes
    | 2: Not 2nd Serdes support
    0: 10G, 5G, 2.5G
    5: 5G, 2.5G
    2: 2.5G

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
Link: https://patch.msgid.link/20250517201353.5137-6-ansuelsmth@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: introduce genphy_match_phy_device()

Introduce new API, genphy_match_phy_device(), to provide a way to check
to match a PHY driver for a PHY device based on the info stored in the
PHY device struct.

The function generalize the logic used in phy_bus_match() to check the
PHY ID whether if C45 or C22 ID should be used for matching.

This is useful for custom .match_phy_device function that wants to use
the generic logic under some condition. (example a PHY is already setup
and provide the correct PHY ID)

Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
Link: https://patch.msgid.link/20250517201353.5137-5-ansuelsmth@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: nxp-c45-tja11xx: simplify .match_phy_device OP

Simplify .match_phy_device OP by using a generic function and using the
new phy_id PHY driver info instead of hardcoding the matching PHY ID
with new variant for macsec and no_macsec PHYs.

Also make use of PHY_ID_MATCH_MODEL macro and drop PHY_ID_MASK define to
introduce phy_id and phy_id_mask again in phy_driver struct.

Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
Link: https://patch.msgid.link/20250517201353.5137-4-ansuelsmth@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: bcm87xx: simplify .match_phy_device OP

Simplify .match_phy_device OP by using a generic function and using the
new phy_id PHY driver info instead of hardcoding the matching PHY ID.

Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
Link: https://patch.msgid.link/20250517201353.5137-3-ansuelsmth@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: pass PHY driver to .match_phy_device OP

Pass PHY driver pointer to .match_phy_device OP in addition to phydev.
Having access to the PHY driver struct might be useful to check the
PHY ID of the driver is being matched for in case the PHY ID scanned in
the phydev is not consistent.

A scenario for this is a PHY that change PHY ID after a firmware is
loaded, in such case, the PHY ID stored in PHY device struct is not
valid anymore and PHY will manually scan the ID in the match_phy_device
function.

Having the PHY driver info is also useful for those PHY driver that
implement multiple simple .match_phy_device OP to match specific MMD PHY
ID. With this extra info if the parsing logic is the same, the matching
function can be generalized by using the phy_id in the PHY driver
instead of hardcoding.

Rust wrapper callback is updated to align to the new match_phy_device
arguments.

Suggested-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
Reviewed-by: Benno Lossin <lossin@kernel.org> # for Rust
Reviewed-by: FUJITA Tomonori <fujita.tomonori@gmail.com>
Link: https://patch.msgid.link/20250517201353.5137-2-ansuelsmth@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: libwx: Fix log level

There is a log should be printed as info level, not error level.

Fixes: 9bfd65980f8d ("net: libwx: Add sriov api for wangxun nics")
Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Link: https://patch.msgid.link/67409DB57B87E2F0+20250519063357.21164-1-jiawenwu@trustnetic.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

rtase: Use min() instead of min_t()

Use min() instead of min_t() to avoid the possibility of casting to the
wrong type.

Signed-off-by: Justin Lai <justinlai0215@realtek.com>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250520042031.9297-1-justinlai0215@realtek.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-faster-and-simpler-crc32c-computation'

Eric Biggers says:

====================
net: faster and simpler CRC32C computation

Update networking code that computes the CRC32C of packets to just call
crc32c() without unnecessary abstraction layers. The result is faster
and simpler code.

Patches 1-7 add skb_crc32c() and remove the overly-abstracted and
inefficient __skb_checksum().

Patches 8-10 replace skb_copy_and_hash_datagram_iter() with
skb_copy_and_crc32c_datagram_iter(), eliminating the unnecessary use of
the crypto layer. This unblocks the conversion of nvme-tcp to call
crc32c() directly instead of using the crypto layer, which patch 9 does.

v1: https://lore.kernel.org/20250511004110.145171-1-ebiggers@kernel.org
====================

Link: https://patch.msgid.link/20250519175012.36581-1-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: remove skb_copy_and_hash_datagram_iter()

Now that skb_copy_and_hash_datagram_iter() is no longer used, remove it.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://patch.msgid.link/20250519175012.36581-11-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

nvme-tcp: use crc32c() and skb_copy_and_crc32c_datagram_iter()

Now that the crc32c() library function directly takes advantage of
architecture-specific optimizations and there also now exists a function
skb_copy_and_crc32c_datagram_iter(), it is unnecessary to go through the
crypto_ahash API. Just use those functions. This is much simpler, and
it also improves performance due to eliminating the crypto API overhead.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://patch.msgid.link/20250519175012.36581-10-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: add skb_copy_and_crc32c_datagram_iter()

Since skb_copy_and_hash_datagram_iter() is used only with CRC32C, the
crypto_ahash abstraction provides no value.  Add
skb_copy_and_crc32c_datagram_iter() which just calls crc32c() directly.

This is faster and simpler.  It also doesn't have the weird dependency
issue where skb_copy_and_hash_datagram_iter() depends on
CONFIG_CRYPTO_HASH=y without that being expressed explicitly in the
kconfig (presumably because it was too heavyweight for NET to select).
The new function is conditional on the hidden boolean symbol NET_CRC32C,
which selects CRC32.  So it gets compiled only when something that
actually needs CRC32C packet checksums is enabled, it has no implicit
dependency, and it doesn't depend on the heavyweight crypto layer.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://patch.msgid.link/20250519175012.36581-9-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

lib/crc32: remove unused support for CRC32C combination

crc32c_combine() and crc32c_shift() are no longer used (except by the
KUnit test that tests them), and their current implementation is very
slow. Remove them.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://patch.msgid.link/20250519175012.36581-8-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: fold __skb_checksum() into skb_checksum()

Now that the only remaining caller of __skb_checksum() is
skb_checksum(), fold __skb_checksum() into skb_checksum(). This makes
struct skb_checksum_ops unnecessary, so remove that too and simply do
the "regular" net checksum. It also makes the wrapper functions
csum_partial_ext() and csum_block_add_ext() unnecessary, so remove those
too and just use the underlying functions.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://patch.msgid.link/20250519175012.36581-7-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

sctp: use skb_crc32c() instead of __skb_checksum()

Make sctp_compute_cksum() just use the new function skb_crc32c(),
instead of calling __skb_checksum() with a skb_checksum_ops struct that
does CRC32C. This is faster and simpler.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://patch.msgid.link/20250519175012.36581-6-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

RDMA/siw: use skb_crc32c() instead of __skb_checksum()

Instead of calling __skb_checksum() with a skb_checksum_ops struct that
does CRC32C, just call the new function skb_crc32c(). This is faster
and simpler.

Acked-by: Leon Romanovsky <leon@kernel.org>
Reviewed-by: Bernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://patch.msgid.link/20250519175012.36581-5-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: use skb_crc32c() in skb_crc32c_csum_help()

Instead of calling __skb_checksum() with a skb_checksum_ops struct that
does CRC32C, just call the new function skb_crc32c(). This is faster
and simpler.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://patch.msgid.link/20250519175012.36581-4-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: add skb_crc32c()

Add skb_crc32c(), which calculates the CRC32C of a sk_buff.  It will
replace __skb_checksum(), which unnecessarily supports arbitrary
checksums.  Compared to __skb_checksum(), skb_crc32c():

   - Uses the correct type for CRC32C values (u32, not __wsum).

   - Does not require the caller to provide a skb_checksum_ops struct.

   - Is faster because it does not use indirect calls and does not use
     the very slow crc32c_combine().

According to commit 2817a336d4d5 ("net: skb_checksum: allow custom
update/combine for walking skb") which added __skb_checksum(), the
original motivation for the abstraction layer was to avoid code
duplication for CRC32C and other checksums in the future.  However:

   - No additional checksums showed up after CRC32C.  __skb_checksum()
     is only used with the "regular" net checksum and CRC32C.

   - Indirect calls are expensive.  Commit 2544af0344ba ("net: avoid
     indirect calls in L4 checksum calculation") worked around this
     using the INDIRECT_CALL_1 macro. But that only avoided the indirect
     call for the net checksum, and at the cost of an extra branch.

   - The checksums use different types (__wsum and u32), causing casts
     to be needed.

   - It made the checksums of fragments be combined (rather than
     chained) for both checksums, despite this being highly
     counterproductive for CRC32C due to how slow crc32c_combine() is.
     This can clearly be seen in commit 4c2f24549644 ("sctp: linearize
     early if it's not GSO") which tried to work around this performance
     bug.  With a dedicated function for each checksum, we can instead
     just use the proper strategy for each checksum.

As shown by the following tables, the new function skb_crc32c() is
faster than __skb_checksum(), with the improvement varying greatly from
5% to 2500% depending on the case.  The largest improvements come from
fragmented packets, mainly due to eliminating the inefficient
crc32c_combine().  But linear packets are improved too, especially
shorter ones, mainly due to eliminating indirect calls.  These
benchmarks were done on AMD Zen 5.  On that CPU, Linux uses IBRS instead
of retpoline; an even greater improvement might be seen with retpoline:

    Linear sk_buffs

        Length in bytes    __skb_checksum cycles    skb_crc32c cycles
        ===============    =====================    =================
                     64                       43                   18
                    256                       94                   77
                   1420                      204                  161
                  16384                     1735                 1642

    Nonlinear sk_buffs (even split between head and one fragment)

        Length in bytes    __skb_checksum cycles    skb_crc32c cycles
        ===============    =====================    =================
                     64                      579                   22
                    256                      829                   77
                   1420                     1506                  194
                  16384                     4365                 1682

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://patch.msgid.link/20250519175012.36581-3-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: introduce CONFIG_NET_CRC32C

Add a hidden kconfig symbol NET_CRC32C that will group together the
functions that calculate CRC32C checksums of packets, so that these
don't have to be built into NET-enabled kernels that don't need them.

Make skb_crc32c_csum_help() (which is called only when IP_SCTP is
enabled) conditional on this symbol, and make IP_SCTP select it.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://patch.msgid.link/20250519175012.36581-2-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'tools-ynl-gen-add-support-for-inherited-selector-and-therefore-tc'

Jakub Kicinski says:

====================
tools: ynl-gen: add support for "inherited" selector and therefore TC

Add C codegen support for constructs needed by TC, namely passing
sub-message selector from a lower nest, and sub-messages with
fixed headers.
====================

Link: https://patch.msgid.link/20250520161916.413298-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tools: ynl: add a sample for TC

Add a very simple TC dump sample with decoding of fq_codel attrs:

# ./tools/net/ynl/samples/tc
dummy0: fq_codel limit: 10240p target: 5ms new_flow_cnt: 0

proving that selector passing (for stats) works.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250520161916.413298-13-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netlink: specs: tc: add qdisc dump to TC spec

Hook TC qdisc dump in the TC qdisc get, it only supported doit
until now and dumping will be used by the sample code.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250520161916.413298-12-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tools: ynl: enable codegen for TC

We are ready to support most of TC. Enable C code gen.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250520161916.413298-11-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tools: ynl-gen: support weird sub-message formats

TC uses all possible sub-message formats:
- nested attrs
- fixed headers + nested attrs
- fixed headers
- empty

Nested attrs are already supported for rt-link. Add support
for remaining 3. The empty and fixed headers ones are fairly
trivial, we can fake a Binary or Flags type instead of a Nest.

For fixed headers + nest we need to teach nest parsing and
nest put to handle fixed headers.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250520161916.413298-10-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tools: ynl-gen: support local attrs in _multi_parse

The _multi_parse() helper calls the _attr_get() method of each attr,
but it only respects what code the helper wants to emit, not what
local variables it needs. Local variables will soon be needed,
support them.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250520161916.413298-9-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tools: ynl-gen: move fixed header info from RenderInfo to Struct

RenderInfo describes a request-response exchange. Struct describes
a parsed attribute set. For ease of parsing sub-messages with
fixed headers move fixed header info from RenderInfo to Struct.
No functional changes.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250520161916.413298-8-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tools: ynl-gen: support passing selector to a nest

In rtnetlink all submessages had the selector at the same level
of nesting as the submessage. We could refer to the relevant
attribute from the current struct. In TC, stats are one level
of nesting deeper than "kind". Teach the code-gen about structs
which need to be passed a selector by the caller for parsing.

Because structs are "topologically sorted" one pass of propagating
the selectors down is enough.

For generating netlink message we depend on the presence bits
so no selector passing needed there.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250520161916.413298-7-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netlink: specs: tc: drop the family name prefix from attrs

All attribute sets and messages are prefixed with tc-.
The C codegen also adds the family name to all structs.
We end up with names like struct tc_tc_act_attrs.
Remove the tc- prefixes to shorten the names.
This should not impact Python as the attr set names
are never exposed to user, they are only used to refer
to things internally, in the encoder / decoder.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250520161916.413298-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netlink: specs: tc: add C naming info

Add naming info needed by C code gen.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250520161916.413298-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netlink: specs: tc: use tc-gact instead of tc-gen as struct name

There is a define in the uAPI header called tc_gen which expands
to the "generic" TC action fields. This helps other actions include
the base fields without having to deal with nested structs.

A couple of actions (sample, gact) do not define extra fields,
so the spec used a common tc-gen struct for both of them.
Unfortunately this struct does not exist in C. Let's use gact's
(generic act's) struct for basic actions.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250520161916.413298-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netlink: specs: tc: remove duplicate nests

tc-act-stats-attrs and tca-stats-attrs are almost identical.
The only difference is that the latter has sub-message decoding
for app, rather than declaring it as a binary attr.

tc-act-police-attrs and tc-police-attrs are identical but for
the TODO annotations.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250520161916.413298-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tools: ynl-gen: add makefile deps for neigh

Kory is reporting build issues after recent additions to YNL
if the system headers are old.

Link: https://lore.kernel.org/20250519164949.597d6e92@kmaincent-XPS-13-7390
Reported-by: Kory Maincent <kory.maincent@bootlin.com>
Fixes: 0939a418b3b0 ("tools: ynl: submsg: reverse parse / error reporting")
Tested-by: Kory Maincent <kory.maincent@bootlin.com>
Link: https://patch.msgid.link/20250520161916.413298-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-airoha-add-per-flow-stats-support-to-hw-flowtable-offloading'

Lorenzo Bianconi says:

====================
net: airoha: Add per-flow stats support to hw flowtable offloading

Introduce per-flow stats accounting to the flowtable hw offload in the
airoha_eth driver. Flow stats are split in the PPE and NPU modules:
- PPE: accounts for high 32bit of per-flow stats
- NPU: accounts for low 32bit of per-flow stats

v1: https://lore.kernel.org/20250514-airoha-en7581-flowstats-v1-0-c00ede12a2ca@kernel.org
====================

Link: https://patch.msgid.link/20250516-airoha-en7581-flowstats-v2-0-06d5fbf28984@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: airoha: ppe: Disable packet keepalive

Since netfilter flowtable entries are now refreshed by flow-stats
polling, we can disable hw packet keepalive used to periodically send
packets belonging to offloaded flows to the kernel in order to refresh
flowtable entries.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250516-airoha-en7581-flowstats-v2-3-06d5fbf28984@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: airoha: Add FLOW_CLS_STATS callback support

Introduce per-flow stats accounting to the flowtable hw offload in
the airoha_eth driver. Flow stats are split in the PPE and NPU modules:
- PPE: accounts for high 32bit of per-flow stats
- NPU: accounts for low 32bit of per-flow stats

FLOW_CLS_STATS can be enabled or disabled at compile time.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250516-airoha-en7581-flowstats-v2-2-06d5fbf28984@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: airoha: npu: Move memory allocation in airoha_npu_send_msg() caller

Move ppe_mbox_data struct memory allocation from airoha_npu_send_msg
routine to the caller one. This is a preliminary patch to enable wlan NPU
offloading and flow counter stats support.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250516-airoha-en7581-flowstats-v2-1-06d5fbf28984@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'ipv6-follow-up-for-rtnl-free-rtm_newroute-series'

Kuniyuki Iwashima says:

====================
ipv6: Follow up for RTNL-free RTM_NEWROUTE series.

Patch 1 removes rcu_read_lock() in fib6_get_table().
Patch 2 removes rtnl_is_held arg for lwtunnel_valid_encap_type(), which
was short-term fix and is no longer used.
Patch 3 fixes RCU vs GFP_KERNEL report by syzkaller.
Patch 4~7 reverts GFP_ATOMIC uses to GFP_KERNEL.

v1: https://lore.kernel.org/20250514201943.74456-1-kuniyu@amazon.com
====================

Link: https://patch.msgid.link/20250516022759.44392-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: Revert two per-cpu var allocation for RTM_NEWROUTE.

These two commits preallocated two per-cpu variables in
ip6_route_info_create() as fib_nh_common_init() and fib6_nh_init()
were expected to be called under RCU.

  * commit d27b9c40dbd6 ("ipv6: Preallocate nhc_pcpu_rth_output in
    ip6_route_info_create().")
  * commit 5720a328c3e9 ("ipv6: Preallocate rt->fib6_nh->rt6i_pcpu in
    ip6_route_info_create().")

Now these functions can be called without RCU and can use GFP_KERNEL.

Let's revert the commits.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250516022759.44392-8-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: Pass gfp_flags down to ip6_route_info_create_nh().

Since commit c4837b9853e5 ("ipv6: Split ip6_route_info_create()."),
ip6_route_info_create_nh() uses GFP_ATOMIC as it was expected to be
called under RCU.

Now, we can call it without RCU and use GFP_KERNEL.

Let's pass gfp_flags to ip6_route_info_create_nh().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250516022759.44392-7-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Revert "ipv6: Factorise ip6_route_multipath_add()."

Commit 71c0efb6d12f ("ipv6: Factorise ip6_route_multipath_add().") split
a loop in ip6_route_multipath_add() so that we can put rcu_read_lock()
between ip6_route_info_create() and ip6_route_info_create_nh().

We no longer need to do so as ip6_route_info_create_nh() does not require
RCU now.

Let's revert the commit to simplify the code.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250516022759.44392-6-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Revert "ipv6: sr: switch to GFP_ATOMIC flag to allocate memory during seg6local LWT setup"

The previous patch fixed the same issue mentioned in
commit 14a0087e7236 ("ipv6: sr: switch to GFP_ATOMIC
flag to allocate memory during seg6local LWT setup").

Let's revert it.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Link: https://patch.msgid.link/20250516022759.44392-5-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: Narrow down RCU critical section in inet6_rtm_newroute().

Commit 169fd62799e8 ("ipv6: Get rid of RTNL for SIOCADDRT and
RTM_NEWROUTE.") added rcu_read_lock() covering
ip6_route_info_create_nh() and __ip6_ins_rt() to guarantee that
nexthop and netdev will not go away.

However, as reported by syzkaller [0], ip_tun_build_state() calls
dst_cache_init() with GFP_KERNEL during the RCU critical section.

ip6_route_info_create_nh() fetches nexthop or netdev depending on
whether RTA_NH_ID is set, and struct fib6_info holds a refcount
of either of them by nexthop_get() or netdev_get_by_index().

netdev_get_by_index() looks up a dev and calls dev_hold() under RCU.

So, we need RCU only around nexthop_find_by_id() and nexthop_get()
( and a few more nexthop code).

Let's add rcu_read_lock() there and remove rcu_read_lock() in
ip6_route_add() and ip6_route_multipath_add().

Now these functions called from fib6_add() need RCU:

  - inet6_rt_notify()
  - fib6_drop_pcpu_from() (via fib6_purge_rt())
  - rt6_flush_exceptions() (via fib6_purge_rt())
  - ip6_ignore_linkdown() (via rt6_multipath_rebalance())

All callers of inet6_rt_notify() need RCU, so rcu_read_lock() is
added there.

[0]:
[ BUG: Invalid wait context ]
6.15.0-rc4-syzkaller-00746-g836b313a14a3 #0 Tainted: G W
----------------------------
syz-executor234/5832 is trying to lock:
ffffffff8e021688 (pcpu_alloc_mutex){+.+.}-{4:4}, at:
pcpu_alloc_noprof+0x284/0x16b0 mm/percpu.c:1782
other info that might help us debug this:
context-{5:5}
1 lock held by syz-executor234/5832:
0: ffffffff8df3b860 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire
include/linux/rcupdate.h:331 [inline]
0: ffffffff8df3b860 (rcu_read_lock){....}-{1:3}, at: rcu_read_lock
include/linux/rcupdate.h:841 [inline]
0: ffffffff8df3b860 (rcu_read_lock){....}-{1:3}, at:
ip6_route_add+0x4d/0x2f0 net/ipv6/route.c:3913
stack backtrace:
CPU: 0 UID: 0 PID: 5832 Comm: syz-executor234 Tainted: G W
6.15.0-rc4-syzkaller-00746-g836b313a14a3 #0 PREEMPT(full)
Tainted: [W]=WARN
Hardware name: Google Google Compute Engine/Google Compute Engine,
BIOS Google 04/29/2025
Call Trace:
<TASK>
dump_stack_lvl+0x189/0x250 lib/dump_stack.c:120
print_lock_invalid_wait_context kernel/locking/lockdep.c:4831 [inline]
check_wait_context kernel/locking/lockdep.c:4903 [inline]
__lock_acquire+0xbcf/0xd20 kernel/locking/lockdep.c:5185
lock_acquire+0x120/0x360 kernel/locking/lockdep.c:5866
__mutex_lock_common kernel/locking/mutex.c:601 [inline]
__mutex_lock+0x182/0xe80 kernel/locking/mutex.c:746
pcpu_alloc_noprof+0x284/0x16b0 mm/percpu.c:1782
dst_cache_init+0x37/0xc0 net/core/dst_cache.c:145
ip_tun_build_state+0x193/0x6b0 net/ipv4/ip_tunnel_core.c:687
lwtunnel_build_state+0x381/0x4c0 net/core/lwtunnel.c:137
fib_nh_common_init+0x129/0x460 net/ipv4/fib_semantics.c:635
fib6_nh_init+0x15e4/0x2030 net/ipv6/route.c:3669
ip6_route_info_create_nh+0x139/0x870 net/ipv6/route.c:3866
ip6_route_add+0xf6/0x2f0 net/ipv6/route.c:3915
inet6_rtm_newroute+0x284/0x1c50 net/ipv6/route.c:5732
rtnetlink_rcv_msg+0x7cc/0xb70 net/core/rtnetlink.c:6955
netlink_rcv_skb+0x219/0x490 net/netlink/af_netlink.c:2534
netlink_unicast_kernel net/netlink/af_netlink.c:1313 [inline]
netlink_unicast+0x758/0x8d0 net/netlink/af_netlink.c:1339
netlink_sendmsg+0x805/0xb30 net/netlink/af_netlink.c:1883
sock_sendmsg_nosec net/socket.c:712 [inline]
__sock_sendmsg+0x219/0x270 net/socket.c:727
____sys_sendmsg+0x505/0x830 net/socket.c:2566
___sys_sendmsg+0x21f/0x2a0 net/socket.c:2620
__sys_sendmsg net/socket.c:2652 [inline]
__do_sys_sendmsg net/socket.c:2657 [inline]
__se_sys_sendmsg net/socket.c:2655 [inline]
__x64_sys_sendmsg+0x19b/0x260 net/socket.c:2655
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xf6/0x210 arch/x86/entry/syscall_64.c:94

Fixes: 169fd62799e8 ("ipv6: Get rid of RTNL for SIOCADDRT and RTM_NEWROUTE.")
Reported-by: syzbot+bcc12d6799364500fbec@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=bcc12d6799364500fbec
Reported-by: Eric Dumazet <edumazet@google.com>
Closes: https://lore.kernel.org/netdev/CANn89i+r1cGacVC_6n3-A-WSkAa_Nr+pmxJ7Gt+oP-P9by2aGw@mail.gmail.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250516022759.44392-4-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

inet: Remove rtnl_is_held arg of lwtunnel_valid_encap_type(_attr)?().

Commit f130a0cc1b4f ("inet: fix lwtunnel_valid_encap_type() lock
imbalance") added the rtnl_is_held argument as a temporary fix while
I'm converting nexthop and IPv6 routing table to per-netns RTNL or RCU.

Now all callers of lwtunnel_valid_encap_type() do not hold RTNL.

Let's remove the argument.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250516022759.44392-3-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: Remove rcu_read_lock() in fib6_get_table().

Once allocated, the IPv6 routing table is not freed until
netns is dismantled.

fib6_get_table() uses rcu_read_lock() while iterating
net->ipv6.fib_table_hash[], but it's not needed and
rather confusing.

Because some callers have this pattern,

  table = fib6_get_table();

  rcu_read_lock();
  /* ... use table here ... */
  rcu_read_unlock();

  [ See: addrconf_get_prefix_route(), ip6_route_del(),
         rt6_get_route_info(), rt6_get_dflt_router() ]

and this looks illegal but is actually safe.

Let's remove rcu_read_lock() in fib6_get_table() and pass true
to the last argument of hlist_for_each_entry_rcu() to bypass
the RCU check.

Note that protection is not needed but RCU helper is used to
avoid data-race.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250516022759.44392-2-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-bcmgenet-64bit-stats-and-expose-more-stats-in-ethtool'

Zak Kemble says:

====================
net: bcmgenet: 64bit stats and expose more stats in ethtool

Hi, this patchset updates the bcmgenet driver with new 64bit statistics via
ndo_get_stats64 and rtnl_link_stats64, now reports hardware discarded
packets in the rx_missed_errors stat and exposes more stats in ethtool.

v1: https://lore.kernel.org/20250513144107.1989-1-zakkemble@gmail.com
v2: https://lore.kernel.org/20250515145142.1415-1-zakkemble@gmail.com
====================

Link: https://patch.msgid.link/20250519113257.1031-1-zakkemble@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: bcmgenet: expose more stats in ethtool

Expose more per-queue and overall stats in ethtool

Signed-off-by: Zak Kemble <zakkemble@gmail.com>
Tested-by: Florian Fainelli <florian.fainelli@broadcom.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20250519113257.1031-4-zakkemble@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: bcmgenet: count hw discarded packets in missed stat

Hardware discarded packets are now counted in their own missed stat
instead of being lumped in with general errors.

Signed-off-by: Zak Kemble <zakkemble@gmail.com>
Tested-by: Florian Fainelli <florian.fainelli@broadcom.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20250519113257.1031-3-zakkemble@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: bcmgenet: switch to use 64bit statistics

Update the driver to use ndo_get_stats64, rtnl_link_stats64 and
u64_stats_t counters for statistics.

Signed-off-by: Zak Kemble <zakkemble@gmail.com>
Tested-by: Florian Fainelli <florian.fainelli@broadcom.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20250519113257.1031-2-zakkemble@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-phy-fixed_phy-simplifications-and-improvements'

Heiner Kallweit says:

====================
net: phy: fixed_phy: simplifications and improvements

This series includes two types of changes:
- All callers pass PHY_POLL, therefore remove irq argument
- constify the passed struct fixed_phy_status *status
====================

Link: https://patch.msgid.link/4d4c468e-300d-42c7-92a1-eabbdb6be748@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: fixed_phy: constify status argument where possible

Constify the passed struct fixed_phy_status *status where possible.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/d1764b62-8538-408b-a4e3-b63715481a38@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: fixed_phy: remove irq argument from fixed_phy_register

All callers pass PHY_POLL, therefore remove irq argument from
fixed_phy_register().

Note: I keep the irq argument in fixed_phy_add_gpiod() for now,
for the case that somebody may want to use a GPIO interrupt in
the future, by e.g. adding a call to fwnode_irq_get() to
fixed_phy_get_gpiod().

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/31cdb232-a5e9-4997-a285-cb9a7d208124@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: fixed_phy: remove irq argument from fixed_phy_add

All callers pass PHY_POLL, therefore remove irq argument from
fixed_phy_add().

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Acked-by: Greg Ungerer <gerg@linux-m68k.org>
Link: https://patch.msgid.link/b3b9b3bc-c310-4a54-b376-c909c83575de@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: let lockdep compare instance locks

AFAIU always returning -1 from lockdep's compare function
basically disables checking of dependencies between given
locks. Try to be a little more precise about what guarantees
that instance locks won't deadlock.

Right now we only nest them under protection of rtnl_lock.
Mostly in unregister_netdevice_many() and dev_close_many().

Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250517200810.466531-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: net: Fix spellings

Fix "withouth" to "without"
Fix "instaces" to "instances"

Signed-off-by: Sumanth Gavini <sumanth.gavini@yahoo.com>
Reviewed-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Link: https://patch.msgid.link/20250517032535.1176351-1-sumanth.gavini@yahoo.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: nci: Fix "Electrnoics" to "Electronics"

Fix misspelling reported by codespell

Signed-off-by: Sumanth Gavini <sumanth.gavini@yahoo.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250517020003.1159640-1-sumanth.gavini@yahoo.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: net: validate team flags propagation

Cover three recent cases:
1. missing ops locking for the lowers during netdev_sync_lower_features
2. missing locking for dev_set_promiscuity (plus netdev_ops_assert_locked
with a comment on why/when it's needed)
3. rcu lock during team_change_rx_flags

Verified that each one triggers when the respective fix is reverted.
Not sure about the placement, but since it all relies on teaming,
added to the teaming directory.

One ugly bit is that I add NETIF_F_LRO to netdevsim; there is no way
to trigger netdev_sync_lower_features without it.

Signed-off-by: Stanislav Fomichev <stfomichev@gmail.com>
Link: https://patch.msgid.link/20250516232205.539266-1-stfomichev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

eth: fbnic: Replace kzalloc/fbnic_fw_init_cmpl with fbnic_fw_alloc_cmpl

Replace the pattern of calling and validating kzalloc then
fbnic_fw_init_cmpl with a single function, fbnic_fw_alloc_cmpl.

Signed-off-by: Lee Trager <lee@trager.us>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20250516164804.741348-1-lee@trager.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'add-built-in-2-5g-ethernet-phy-support-on-mt7988'

Sky Huang says:

====================
Add built-in 2.5G ethernet phy support on MT7988

From: Sky Huang <skylake.huang@mediatek.com>

This patchset adds support for built-in 2.5Gphy on MT7988, sort file
and config sequence in related Kconfig and Makefile.
====================

Link: https://patch.msgid.link/20250516102327.2014531-1-SkyLake.Huang@mediatek.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: mediatek: add driver for built-in 2.5G ethernet PHY on MT7988

Add support for internal 2.5Gphy on MT7988. This driver will load
necessary firmware and add appropriate time delay to make sure
that firmware works stably. The firmware loading procedure takes
about 11ms in this driver.

Signed-off-by: Sky Huang <skylake.huang@mediatek.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20250516102327.2014531-3-SkyLake.Huang@mediatek.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: mediatek: Sort config and file names in Kconfig and Makefile

Sort config and file names in Kconfig and Makefile in
drivers/net/phy/mediatek/ in alphabetical order.

Signed-off-by: Sky Huang <skylake.huang@mediatek.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20250516102327.2014531-2-SkyLake.Huang@mediatek.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

sctp: Do not wake readers in __sctp_write_space()

Function __sctp_write_space() doesn't set poll key, which leads to
ep_poll_callback() waking up all waiters, not only these waiting
for the socket being writable. Set the key properly using
wake_up_interruptible_poll(), which is preferred over the sync
variant, as writers are not woken up before at least half of the
queue is available. Also, TCP does the same.

Signed-off-by: Petr Malat <oss@malat.biz>
Acked-by: Xin Long <lucien.xin@gmail.com>
Link: https://patch.msgid.link/20250516081727.1361451-1-oss@malat.biz
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: realtek: add RTL8127-internal PHY

RTL8127-internal PHY is RTL8261C which is a integrated 10Gbps PHY with ID
0x001cc890. It follows the code path of RTL8125/RTL8126 internal NBase-T
PHY.

Signed-off-by: ChunHao Lin <hau@realtek.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20250516055622.3772-1-hau@realtek.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: enetc: fix the error handling in enetc4_pf_netdev_create()

Fix the handling of err_wq_init and err_reg_netdev paths in
enetc4_pf_netdev_create() function.

Fixes: 6c5bafba347b ("net: enetc: add MAC filtering for i.MX95 ENETC PF")
Signed-off-by: Wei Fang <wei.fang@nxp.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250516052734.3624191-1-wei.fang@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

octeontx2-pf: Add tracepoint for NIX_PARSE_S

The NIX_PARSE_S structure populated by hardware in the
NIX RX CQE has parsing information for the received packet.
A tracepoint to dump the all words of NIX_PARSE_S
is helpful in debugging packet parser.

Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
Link: https://patch.msgid.link/1747331048-15347-1-git-send-email-sbhatta@marvell.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: phy: make mdio consumer / device layer a separate module

After having factored out the provider part from mdio_bus.c, we can
make the mdio consumer / device layer a separate module. This also
allows to remove Kconfig symbol MDIO_DEVICE.
The module init / exit functions from mdio_bus.c no longer have to be
called from phy_device.c. The link order defined in
drivers/net/phy/Makefile ensures that init / exit functions are called
in the right order.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://patch.msgid.link/dba6b156-5748-44ce-b5e2-e8dc2fcee5a7@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'queue_api-reduce-risk-of-name-collision-over-txq'

Gur Stavi says:

====================
queue_api: reduce risk of name collision over txq

Rename local variable in macros from txq to _txq.
When macro parameter get_desc is expended it is likely to have a txq
token that refers to a different txq variable at the caller's site.
====================

Link: https://patch.msgid.link/cover.1747559621.git.gur.stavi@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

queue_api: reduce risk of name collision over txq

Rename local variable in macros from txq to _txq.
When macro parameter get_desc is expended it is likely to have a txq
token that refers to a different txq variable at the caller's site.

Signed-off-by: Gur Stavi <gur.stavi@huawei.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/95b60d218f004308486d92ed17c8cc6f28bac09d.1747559621.git.gur.stavi@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue

Tony Nguyen says:

====================
idpf: add initial PTP support

Milena Olech says:

This patch series introduces support for Precision Time Protocol (PTP) to
Intel(R) Infrastructure Data Path Function (IDPF) driver. PTP feature is
supported when the PTP capability is negotiated with the Control
Plane (CP). IDPF creates a PTP clock and sets a set of supported
functions.

During the PTP initialization, IDPF requests a set of PTP capabilities
and receives a writeback from the CP with the set of supported options.
These options are:
- get time of the PTP clock
- set the time of the PTP clock
- adjust the PTP clock
- Tx timestamping

Each feature is considered to have direct access, where the operations
on PCIe BAR registers are allowed, or the mailbox access, where the
virtchnl messages are used to perform any PTP action. Mailbox access
means that PTP requests are sent to the CP through dedicated secondary
mailbox and the CP reads/writes/modifies desired resource - PTP Clock
or Tx timestamp registers.

Tx timestamp capabilities are negotiated only for vports that have
UPLINK_VPORT flag set by the CP. Capabilities provide information about
the number of available Tx timestamp latches, their indexes and size of
the Tx timestamp value. IDPF requests Tx timestamp by setting the
TSYN bit and the requested timestamp index in the context descriptor for
the PTP packets. When the completion tag for that packet is received,
IDPF schedules a worker to read the Tx timestamp value.

* '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
  idpf: add support for Rx timestamping
  idpf: add Tx timestamp flows
  idpf: add Tx timestamp capabilities negotiation
  idpf: add PTP clock configuration
  idpf: add mailbox access to read PTP clock time
  idpf: negotiate PTP capabilities and get PTP clock
  idpf: move virtchnl structures to the header file
  virtchnl: add PTP virtchnl definitions
  idpf: add initial PTP support
  idpf: change the method for mailbox workqueue allocation
====================

Link: https://patch.msgid.link/20250516170645.1172700-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: drv-net: Fix "envirnoments" to "environments"

Fix misspelling reported by codespell

Signed-off-by: Sumanth Gavini <sumanth.gavini@yahoo.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250516225156.1122058-1-sumanth.gavini@yahoo.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: netlink: reduce extack cookie size

Seems like the extack cookie hasn't found any users outside
of wireless, which always uses nl_set_extack_cookie_u64().
Thus, allocating 20 bytes for it is pointless, reduce that
to 8 bytes, and add a BUILD_BUG_ON() to ensure it's enough
(obviously it is, for a u64, but in case it changes again.)

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://patch.msgid.link/20250516115927.38209-2-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'ovpn-net-next-20250515' of https://github.com/OpenVPN/ovpn-net-next

Antonio Quartulli says:

====================
ovpn: pull request for net-next: ovpn 2025-05-15

this is a new version of the previous pull request.
These time I have removed the fixes that we are still discussing,
so that we don't hold the entire series back.

There is a new fix though: it's about properly checking the return value
of skb_to_sgvec_nomark(). I spotted the issue while testing pings larger
than the iface's MTU on a TCP VPN connection.

I have added various Closes and Link tags where applicable, so
that we have references to GitHub tickets and other public discussions.

Since I have resent the PR, I have also added Andrew's Reviewed-by to
the first patch.

Please pull or let me know if something should be changed!
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
Patchset highlights:
- update MAINTAINERS entry for ovpn
- extend selftest with more cases
- avoid crash in selftest in case of getaddrinfo() failure
- fix ndo_start_xmit return value on error
- set ignore_df flag for IPv6 packets
- drop useless reg_state check in keepalive worker
- retain skb's dst when entering xmit function
- fix check on skb_to_sgvec_nomark() return value

Merge branch 'vsock-test-improve-sigpipe-test-reliability'

Stefano Garzarella says:

====================
vsock/test: improve sigpipe test reliability

Running the tests continuously I noticed that sometimes the sigpipe
test would fail due to a race between the control message of the test
and the vsock transport messages.

While I was at it I also improved the test by checking the errno we
expect.

v1: https://lore.kernel.org/20250508142005.135857-1-sgarzare@redhat.com
====================

Link: https://patch.msgid.link/20250514141927.159456-1-sgarzare@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

vsock/test: check also expected errno on sigpipe test

In the sigpipe test, we expect send() to fail, but we do not check if
send() fails with the errno we expect (EPIPE).

Add this check and repeat the send() in case of EINTR as we do in other
tests.

Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://patch.msgid.link/20250514141927.159456-4-sgarzare@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

vsock/test: retry send() to avoid occasional failure in sigpipe test

When the other peer calls shutdown(SHUT_RD), there is a chance that
the send() call could occur before the message carrying the close
information arrives over the transport. In such cases, the send()
might still succeed. To avoid this race, let's retry the send() call
a few times, ensuring the test is more reliable.

Sleep a little before trying again to avoid flooding the other peer
and filling its receive buffer, causing false-negative.

Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://patch.msgid.link/20250514141927.159456-3-sgarzare@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

vsock/test: add timeout_usleep() to allow sleeping in timeout sections

The timeout API uses signals, so we have documented not to use sleep(),
but we can use nanosleep(2) since POSIX.1 explicitly specifies that it
does not interact with signals.

Let's provide timeout_usleep() for that.

Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://patch.msgid.link/20250514141927.159456-2-sgarzare@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'tools-ynl-gen-support-sub-messages-and-rt-link'

Jakub Kicinski says:

====================
tools: ynl-gen: support sub-messages and rt-link

Sub-messages are how we express "polymorphism" in YNL. Donald added
the support to specs and Python a while back, support them in C, too.
Sub-message is a nest, but the interpretation of the attribute types
within that nest depends on a value of another attribute. For example
in rt-link the "kind" attribute contains the link type (veth, bonding,
etc.) and based on that the right enum has to be applied to interpret
link-specific attributes.

The last message is probably the most interesting to look at, as it
adds a fairly advanced sample.

This patch only contains enough support for rtnetlink, we will need
a little more complexity to support TC, where sub-messages may contain
fixed headers, and where the selector may be in a different nest than
the submessage.
====================

Link: https://patch.msgid.link/20250515231650.1325372-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tools: ynl: add a sample for rt-link

Add a fairly complete example of rt-link usage. If run without any
arguments it simply lists the interfaces and some of their attrs.
If run with an arg it tries to create and delete a netkit device.

1 # ./tools/net/ynl/samples/rt-link 1
2 Trying to create a Netkit interface
3 Testing error message for policy being bad:
4     Kernel error: 'Provided default xmit policy not supported' (bad attribute: .linkinfo.data(netkit).policy)
5   1:               lo: mtu 65536
6   2:           wlp0s1: mtu  1500
7   3:          enp0s13: mtu  1500
8   4:           dummy0: mtu  1500  kind dummy     altname one two
9   5:              nk0: mtu  1500  kind netkit    primary 0  policy forward
10   6:              nk1: mtu  1500  kind netkit    primary 1  policy blackhole
11 Trying to delete a Netkit interface (ifindex 6)

Sample creates the device first, it sets an invalid value for a netkit
attribute to trigger reverse parsing. Line 4 shows the error with the
attribute path correctly generated by YNL.

Then sample fixes the bad attribute and re-issues the request, with
NLM_F_ECHO set. This flag causes the notification to be looped back
to the initiating socket (our socket). Sample parses this notification
to save the ifindex of the created netkit.

Sample then proceeds to list the devices. Line 8 above shows a dummy
device with two alt names. Lines 9 and 10 show the netkit devices
the sample itself created.

The "primary" and "policy" attrs are from inside the netkit submsg.
The string values are auto-generated for the enums by YNL.

To clean up sample deletes the interface it created (line 11).

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250515231650.1325372-10-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tools: ynl: enable codegen for all rt- families

Switch from including Classic netlink families one by one to excluding.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250515231650.1325372-9-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tools: ynl: submsg: reverse parse / error reporting

Reverse parsing lets YNL convert bad and missing attr pointers
from extack into a string like "missing attribute nest1.nest2.attr_name".
It's a feature that's unique to YNL C AFAIU (even the Python YNL
can't do nested reverse parsing). Add support for reverse-parsing
of sub-messages.

To simplify the logic and the code annotate the type policies
with extra metadata. Mark the selectors and the messages with
the information we need. We assume that key / selector always
precedes the sub-message while parsing (and also if there are
multiple sub-messages like in rt-link they are interleaved
selector 1 ... submsg 1 ... selector 2 .. submsg 2, not
selector 1 ... selector 2 ... submsg 1 ... submsg 2).

The rt-link sample in a subsequent changes shows reverse parsing
of sub-messages in action.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250515231650.1325372-8-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tools: ynl-gen: submsg: support parsing and rendering sub-messages

Adjust parsing and rendering appropriately to make sub-messages work.
Rendering is pretty trivial, as the submsg -> netlink conversion looks
like rendering a nest in which only one attr was set. Only trick
is that we use the enum value of the sub-message rather than the nest
as the type, and effectively skip one layer of nesting. A real double
nested struct would look like this:

  [SELECTOR]
  [SUBMSG]
    [NEST]
      [MSG1-ATTR]

A submsg "is" the nest so by skipping I mean:

  [SELECTOR]
  [SUBMSG]
    [MSG1-ATTR]

There is no extra validation in YNL if caller has set the selector
matching the submsg type (e.g. link type = "macvlan" but the nest
attrs are set to carry "veth"). Let the kernel handle that.

Parsing side is a little more specialized as we need to render and
insert a new kind of function which switches between what to parse
based on the selector. But code isn't too complicated.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250515231650.1325372-7-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tools: ynl-gen: submsg: render the structs

The easiest (or perhaps only sane) way to support submessages in C
is to treat them as if they were nests. Build fake attributes to
that effect in the codegen. Render the submsg as a big nest of all
possible values.

With this in place the main missing part is to hook in the switch
which selects how to parse based on the key.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250515231650.1325372-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tools: ynl-gen: submsg: plumb thru an empty type

Hook in handling of sub-messages, for now treat them as ignored attrs.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250515231650.1325372-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>