www.infradead.org Git - users/hch/misc.git/log
3 months ago Merge branch 'enic-set-link-speed-only-after-link-up'
Jakub Kicinski [Thu, 9 Jan 2025 20:27:10 +0000 (12:27 -0800)]
Merge branch 'enic-set-link-speed-only-after-link-up'

John Daley says:

====================
enic: Set link speed only after link up
====================

Link: https://patch.msgid.link/20250107214159.18807-1-johndale@cisco.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months ago enic: Fix typo in comment in table indexed by link speed
John Daley [Tue, 7 Jan 2025 21:41:59 +0000 (13:41 -0800)]
enic: Fix typo in comment in table indexed by link speed

The RX adaptive interrupt moderation table is indexed by link speed
range, where the last row of the table is the catch-all for all link
speeds greater than 10Gbps. The comment said 10 - 40Gbps, but since
there are now adapters with link speeds greater than 40Gbps, the comment is now
wrong and should indicate it applies to all speeds greater than 10Gbps.
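
The lookup that the comment documents can be sketched as follows. This is
a minimal Python model, not the driver's C code: the 4 Gbps boundary, the
range values and the names MOD_RANGE and mod_range_for_speed are
assumptions made for illustration. Only the rule that the last row catches
every speed above 10Gbps comes from the text above.

```python
# Hypothetical table: (upper speed bound in Mbps, (low, high) range values).
# The last row has no upper bound, so it catches every speed above 10 Gbps.
MOD_RANGE = [
    (4_000, (0, 0)),    # 0 - 4 Gbps (assumed boundary and values)
    (10_000, (0, 3)),   # 4 - 10 Gbps (assumed values)
    (None, (3, 6)),     # > 10 Gbps, catch-all (assumed values)
]


def mod_range_for_speed(speed_mbps):
    """Return the (low, high) pair for a link speed given in Mbps."""
    for upper, rng in MOD_RANGE:
        if upper is None or speed_mbps <= upper:
            return rng
    raise AssertionError("unreachable: the last row is a catch-all")
```

Under these assumed values a 40 Gbps link and a 100 Gbps link both resolve
to the last row, which is why a "10 - 40Gbps" label would be misleading.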

Co-developed-by: Nelson Escobar <neescoba@cisco.com>
Signed-off-by: Nelson Escobar <neescoba@cisco.com>
Co-developed-by: Satish Kharat <satishkh@cisco.com>
Signed-off-by: Satish Kharat <satishkh@cisco.com>
Signed-off-by: John Daley <johndale@cisco.com>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Link: https://patch.msgid.link/20250107214159.18807-4-johndale@cisco.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months ago enic: Obtain the Link speed only after the link comes up
John Daley [Tue, 7 Jan 2025 21:41:58 +0000 (13:41 -0800)]
enic: Obtain the Link speed only after the link comes up

The link speed is obtained in the RX adaptive coalescing function. It
was being called at probe time when the link may not be up. Change the
call to run after the link comes up.

The impact of not getting the correct link speed was that the low end of
the adaptive interrupt range was always being set to 0 which could have
caused a slight increase in the number of RX interrupts.

Co-developed-by: Nelson Escobar <neescoba@cisco.com>
Signed-off-by: Nelson Escobar <neescoba@cisco.com>
Co-developed-by: Satish Kharat <satishkh@cisco.com>
Signed-off-by: Satish Kharat <satishkh@cisco.com>
Signed-off-by: John Daley <johndale@cisco.com>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Link: https://patch.msgid.link/20250107214159.18807-3-johndale@cisco.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months ago enic: Move RX coalescing set function
John Daley [Tue, 7 Jan 2025 21:41:57 +0000 (13:41 -0800)]
enic: Move RX coalescing set function

Move the function used for setting the RX coalescing range to before
the function that checks the link status. It needs to be called from
there instead of from the probe function.

There is no functional change.

Co-developed-by: Nelson Escobar <neescoba@cisco.com>
Signed-off-by: Nelson Escobar <neescoba@cisco.com>
Co-developed-by: Satish Kharat <satishkh@cisco.com>
Signed-off-by: Satish Kharat <satishkh@cisco.com>
Signed-off-by: John Daley <johndale@cisco.com>
Link: https://patch.msgid.link/20250107214159.18807-2-johndale@cisco.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months ago dt-bindings: net: qcom,ipa: Use recommended MBN firmware format in DTS example
Krzysztof Kozlowski [Wed, 8 Jan 2025 12:02:42 +0000 (13:02 +0100)]
dt-bindings: net: qcom,ipa: Use recommended MBN firmware format in DTS example

All Qualcomm firmwares uploaded to linux-firmware are in MBN format,
instead of split MDT.  No functional changes, just correct the DTS
example so people will not rely on unaccepted files.

Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Reviewed-by: Alex Elder <elder@kernel.org>
Link: https://patch.msgid.link/20250108120242.156201-1-krzysztof.kozlowski@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months ago Merge branch 'net-make-sure-we-retain-napi-ordering-on-netdev-napi_list'
Paolo Abeni [Thu, 9 Jan 2025 14:33:10 +0000 (15:33 +0100)]
Merge branch 'net-make-sure-we-retain-napi-ordering-on-netdev-napi_list'

Jakub Kicinski says:

====================
net: make sure we retain NAPI ordering on netdev->napi_list

I promised Eric to remove the rtnl protection of the NAPI list.
When I sat down to implement it over the break, I realized that
the recently added NAPI ID retention will break the list ordering
assumption we have in netlink dump. The ordering used to happen
"naturally", because we'd always add NAPIs at the head of the
list, and assign a new monotonically increasing ID.

Before the first patch of this series we'd still only add at
the head of the list but now the newly added NAPI may inherit
from its config an ID lower than something else already on the list.

The fix is in the first patch, the rest is netdevsim churn to test it.
I'm posting this for net-next, because AFAICT the problem can't
be triggered in net, given the very limited queue API adoption.

v2:
 - [patch 2] allocate the array with kcalloc() instead of kvcalloc()
 - [patch 2] set GFP_KERNEL_ACCOUNT when allocating queues
 - [patch 6] don't null-check page pool before page_pool_destroy()
 - [patch 6] controled -> controlled
 - [patch 7] change mode to 0200
 - [patch 7] reorder removal to be inverse of add
 - [patch 7] fix the spaces vs tabs
v1: https://lore.kernel.org/20250103185954.1236510-1-kuba@kernel.org
====================

Link: https://patch.msgid.link/20250107160846.2223263-1-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 months ago selftests: net: test listing NAPI vs queue resets
Jakub Kicinski [Tue, 7 Jan 2025 16:08:46 +0000 (08:08 -0800)]
selftests: net: test listing NAPI vs queue resets

Test listing netdevsim NAPIs before and after a single queue
has been reset (and NAPIs re-added).

Start from resetting the middle queue because edge cases
(first / last) may actually be less likely to trigger bugs.

  # ./tools/testing/selftests/net/nl_netdev.py
  KTAP version 1
  1..4
  ok 1 nl_netdev.empty_check
  ok 2 nl_netdev.lo_check
  ok 3 nl_netdev.page_pool_check
  ok 4 nl_netdev.napi_list_check
  # Totals: pass:4 fail:0 xfail:0 xpass:0 skip:0 error:0

Reviewed-by: Willem de Bruijn <willemb@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 months ago netdevsim: add debugfs-triggered queue reset
Jakub Kicinski [Tue, 7 Jan 2025 16:08:45 +0000 (08:08 -0800)]
netdevsim: add debugfs-triggered queue reset

Support triggering queue reset via debugfs for an upcoming test.

Reviewed-by: Willem de Bruijn <willemb@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 months ago netdevsim: add queue management API support
Jakub Kicinski [Tue, 7 Jan 2025 16:08:44 +0000 (08:08 -0800)]
netdevsim: add queue management API support

Add queue management API support. We need a way to reset queues
to test NAPI reordering, the queue management API provides a
handy scaffolding for that.

Reviewed-by: Willem de Bruijn <willemb@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 months ago netdevsim: add queue alloc/free helpers
Jakub Kicinski [Tue, 7 Jan 2025 16:08:43 +0000 (08:08 -0800)]
netdevsim: add queue alloc/free helpers

We'll need the code to allocate and free queues in the queue management
API, factor it out.

Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 months ago netdevsim: allocate rqs individually
Jakub Kicinski [Tue, 7 Jan 2025 16:08:42 +0000 (08:08 -0800)]
netdevsim: allocate rqs individually

Make nsim->rqs an array of pointers and allocate them individually
so that we can swap them out one by one.

Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 months ago netdevsim: support NAPI config
Jakub Kicinski [Tue, 7 Jan 2025 16:08:41 +0000 (08:08 -0800)]
netdevsim: support NAPI config

Link the NAPI instances to their configs. This will be needed to test
that NAPI config doesn't break list ordering.

Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 months ago netdev: define NETDEV_INTERNAL
Jakub Kicinski [Tue, 7 Jan 2025 16:08:40 +0000 (08:08 -0800)]
netdev: define NETDEV_INTERNAL

Linus suggested during one of the past maintainer summits (in context of
a DMA_BUF discussion) that symbol namespaces can be used to prevent
unwelcome but in-tree code from using all exported functions.
Create a namespace for netdev.

Export netdev_rx_queue_restart(), drivers may want to use it since
it gives them a simple and safe way to restart a queue to apply
config changes. But it's both too low level and too actively developed
to be used outside netdev.

Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 months ago net: make sure we retain NAPI ordering on netdev->napi_list
Jakub Kicinski [Tue, 7 Jan 2025 16:08:39 +0000 (08:08 -0800)]
net: make sure we retain NAPI ordering on netdev->napi_list

Netlink code depends on NAPI instances being sorted by ID on
the netdev list for dump continuation. We need to be able to
find the position on the list where we left off if dump does
not fit in a single skb, and in the meantime NAPI instances
can come and go.

This was trivially true when we were assigning a new ID to every
new NAPI instance. Since we added the NAPI config API, we try
to retain the ID previously used for the same queue, but still
add the new NAPI instance at the start of the list.

This is fine if we reset the entire netdev and all NAPIs get
removed and added back. If a driver replaces a NAPI instance
during an operation like DEVMEM queue reset, or recreates
a subset of NAPI instances in other ways we may end up with
broken ordering, and therefore Netlink dumps with either
missing or duplicated entries.
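
The failure mode can be reproduced with a small Python model. It is only
a sketch under stated assumptions: the list holds bare NAPI IDs in
descending order, the dump resumes by skipping every entry whose ID is
not below the last one it reported, and the names add_at_head, add_sorted
and dump_all are invented for this illustration rather than taken from
the kernel.

```python
def add_at_head(napi_list, napi_id):
    """Old behaviour: always link the new NAPI at the head of the list."""
    napi_list.insert(0, napi_id)


def add_sorted(napi_list, napi_id):
    """Keep the list in descending ID order when adding a NAPI."""
    pos = 0
    while pos < len(napi_list) and napi_list[pos] > napi_id:
        pos += 1
    napi_list.insert(pos, napi_id)


def dump_all(napi_list, budget=2):
    """Dump IDs in chunks of `budget`, resuming after the last reported ID.

    Each pass of the outer loop models one skb: it walks the list from the
    head and skips entries assumed to have been reported already.
    """
    seen = []
    last_id = None
    while True:
        chunk = []
        for napi_id in napi_list:
            if last_id is not None and napi_id >= last_id:
                continue  # treated as already dumped
            if len(chunk) == budget:
                break     # this skb is full, continue in the next one
            chunk.append(napi_id)
            last_id = napi_id
        if not chunk:
            return seen
        seen.extend(chunk)
```

In this model, re-adding ID 8194 at the head of [8196, 8195, 8193] makes
the dump return only 8194 and 8193, while a sorted insert keeps all four
entries visible.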

At this stage the problem is theoretical. Only two drivers
support the queue API, bnxt and gve. gve recreates NAPIs during
queue reset, but it doesn't support NAPI config.
bnxt supports NAPI config but doesn't recreate instances
during reset.

We need to save the ID in the config as soon as it is assigned
because otherwise the new NAPI will not know what ID it will
get at enable time, at the time it is being added.

Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 months ago net: hsr: remove synchronize_rcu() from hsr_add_port()
Eric Dumazet [Tue, 7 Jan 2025 14:47:01 +0000 (14:47 +0000)]
net: hsr: remove synchronize_rcu() from hsr_add_port()

A synchronize_rcu() was added by mistake in commit
c5a759117210 ("net/hsr: Use list_head (and rcu) instead
of array for slave devices.")

RCU does not require observing a grace period after
list_add_tail_rcu().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250107144701.503884-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 months ago net: no longer reset transport_header in __netif_receive_skb_core()
Eric Dumazet [Tue, 7 Jan 2025 14:43:42 +0000 (14:43 +0000)]
net: no longer reset transport_header in __netif_receive_skb_core()

In commit 66e4c8d95008 ("net: warn if transport header was not set")
I added a debug check in skb_transport_header() to detect
if a caller expects the transport_header to be set to a meaningful
value by a prior code path.

Unfortunately, __netif_receive_skb_core() resets the transport header
to the same value as the network header, defeating this check
in receive paths.

Pretending the transport and network headers are the same
is usually wrong.

This patch removes this reset for CONFIG_DEBUG_NET=y builds
to let fuzzers and CI find bugs.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250107144342.499759-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 months ago dt-bindings: net: Correct indentation and style in DTS example
Krzysztof Kozlowski [Tue, 7 Jan 2025 12:56:13 +0000 (13:56 +0100)]
dt-bindings: net: Correct indentation and style in DTS example

DTS example in the bindings should be indented with 2- or 4-spaces and
aligned with opening '- |', so correct any differences like 3-spaces or
mixtures of 2- and 4-spaces in one binding.

No functional changes here, but saves some comments during reviews of
new patches built on existing code.

Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Acked-by: Marc Kleine-Budde <mkl@pengutronix.de> # for net/can
Reviewed-by: Roger Quadros <rogerq@kernel.org> # for ti,k3-am654-*
Acked-by: Florian Fainelli <florian.fainelli@broadcom.com> # net/brcm,*
Reviewed-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Link: https://patch.msgid.link/20250107125613.211478-1-krzysztof.kozlowski@linaro.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 months ago netlink: add IPv6 anycast join/leave notifications
Yuyang Huang [Tue, 7 Jan 2025 11:43:55 +0000 (20:43 +0900)]
netlink: add IPv6 anycast join/leave notifications

This change introduces a mechanism for notifying userspace
applications about changes to IPv6 anycast addresses via netlink. It
includes:

* Addition and deletion of IPv6 anycast addresses are reported using
  RTM_NEWANYCAST and RTM_DELANYCAST.
* A new netlink group (RTNLGRP_IPV6_ACADDR) for subscribing to these
  notifications.

This enables user space applications (e.g. ip monitor) to efficiently
track anycast addresses through netlink messages, improving metrics
collection and system monitoring. It also unlocks the potential for
advanced anycast management in user space, such as hardware offload
control and fine-grained network control.

Cc: Maciej Żenczykowski <maze@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: Yuyang Huang <yuyanghuang@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250107114355.1766086-1-yuyanghuang@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 months ago net/mlx5: use do_aux_work for PHC overflow checks
Vadim Fedorenko [Tue, 7 Jan 2025 10:48:12 +0000 (02:48 -0800)]
net/mlx5: use do_aux_work for PHC overflow checks

The overflow_work uses the system wq to do overflow checks and updates
for the PHC device timecounter, and the system wq might be overwhelmed
by other tasks. But there is a dedicated kthread in the PTP subsystem
designed for such things. This patch changes the work queue to properly
align with the PTP subsystem and to avoid overloading the system work
queue. The adjfine() function acts the same way as the overflow check
worker, so we can postpone the PTP aux worker till the next overflow
period after adjfine() was called.

Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
Acked-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250107104812.380225-1-vadfed@meta.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 months ago net: stmmac: Unexport stmmac_rx_offset() from stmmac.h
Furong Xu [Tue, 7 Jan 2025 07:54:48 +0000 (15:54 +0800)]
net: stmmac: Unexport stmmac_rx_offset() from stmmac.h

stmmac_rx_offset() is referenced in stmmac_main.c only,
let's move it to stmmac_main.c.

Drop the inline keyword while at it; it is better to let the compiler
decide.

Compile tested only.
No functional change intended.

Signed-off-by: Furong Xu <0x1207@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20250107075448.4039925-1-0x1207@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 months ago r8169: add support for RTL8125BP rev.b
ChunHao Lin [Tue, 7 Jan 2025 06:43:55 +0000 (14:43 +0800)]
r8169: add support for RTL8125BP rev.b

Add support for RTL8125BP rev.b. Its XID is 0x689. This chip supports
DASH and its dash type is "RTL_DASH_25_BP".

Signed-off-by: ChunHao Lin <hau@realtek.com>
Reviewed-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://patch.msgid.link/20250107064355.104711-1-hau@realtek.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 months ago selftests: drv-net: test drivers sleeping in ndo_get_stats64
Jakub Kicinski [Tue, 7 Jan 2025 02:29:32 +0000 (18:29 -0800)]
selftests: drv-net: test drivers sleeping in ndo_get_stats64

Most of our tests use rtnetlink to read device stats, so they
don't expose the drivers much to paths in which device stats
are read under RCU. Add tests which hammer procfs reads to
make sure drivers:
 - don't sleep while reporting stats,
 - can handle parallel reads,
 - can handle device going down while reading.
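
A procfs hammer of this kind can be sketched in Python as below. This is
not the selftest's code: it assumes the stats are read from /proc/net/dev,
and parse_proc_net_dev, hammer and read_proc_net_dev are hypothetical
helper names used only for this illustration.

```python
import threading


def parse_proc_net_dev(text):
    """Parse /proc/net/dev content into {ifname: (rx_bytes, tx_bytes)}."""
    stats = {}
    for line in text.splitlines()[2:]:      # skip the two header lines
        name, _, fields = line.partition(":")
        values = fields.split()
        if len(values) < 16:                # 8 RX counters + 8 TX counters
            continue
        # Counter 0 is RX bytes, counter 8 is TX bytes.
        stats[name.strip()] = (int(values[0]), int(values[8]))
    return stats


def hammer(read_fn, threads=4, iterations=100):
    """Call read_fn() from several threads in parallel.

    Returns the total number of reads that were parsed successfully.
    """
    counts = [0] * threads

    def worker(idx):
        for _ in range(iterations):
            parse_proc_net_dev(read_fn())
            counts[idx] += 1

    workers = [threading.Thread(target=worker, args=(i,))
               for i in range(threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return sum(counts)


def read_proc_net_dev():
    """Read the real file; only meaningful on a Linux host."""
    with open("/proc/net/dev") as f:
        return f.read()
```

On a Linux machine, hammer(read_proc_net_dev) would issue the parallel
reads; bringing the device down and up in another process while it runs
would approximate the third case in the list.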

Set ifname on the env class in NetDrvEnv, we already do that
in NetDrvEpEnv.

  KTAP version 1
  1..7
  ok 1 stats.check_pause
  ok 2 stats.check_fec
  ok 3 stats.pkt_byte_sum
  ok 4 stats.qstat_by_ifindex
  ok 5 stats.check_down
  ok 6 stats.procfs_hammer
  # completed up/down cycles: 6
  ok 7 stats.procfs_downup_hammer
  # Totals: pass:7 fail:0 xfail:0 xpass:0 skip:0 error:0

Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250107022932.2087744-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months ago Merge branch 'intel-wired-lan-driver-updates-2025-01-06-igb-igc-ixgbe-ixgbevf-i40e...
Jakub Kicinski [Wed, 8 Jan 2025 02:16:02 +0000 (18:16 -0800)]
Merge branch 'intel-wired-lan-driver-updates-2025-01-06-igb-igc-ixgbe-ixgbevf-i40e-fm10k'

Tony Nguyen says:

====================
Intel Wired LAN Driver Updates 2025-01-06 (igb, igc, ixgbe, ixgbevf, i40e, fm10k)

For igb:

Sriram Yagnaraman and Kurt Kanzenbach add support for AF_XDP
zero-copy.

Original cover letter:
The first couple of patches add helper functions to prepare for AF_XDP
zero-copy support which comes in the last couple of patches, one each
for RX and TX paths.

As mentioned in v1 patchset [0], I don't have access to an actual IGB
device to provide correct performance numbers. I have used Intel 82576EB
emulator in QEMU [1] to test the changes to IGB driver.

The tests use one isolated vCPU for RX/TX and one isolated vCPU for the
xdp-sock application [2]. Hope these measurements provide at least
some indication of the increase in performance when using ZC, especially
in the TX path. It would be awesome if someone with a real IGB NIC can
test the patch.

AF_XDP performance using 64 byte packets in Kpps.
Benchmark:  XDP-SKB  XDP-DRV  XDP-DRV(ZC)
rxdrop      220      235      350
txpush      1.000    1.000    410
l2fwd       1.000    1.000    200

AF_XDP performance using 1500 byte packets in Kpps.
Benchmark:  XDP-SKB  XDP-DRV  XDP-DRV(ZC)
rxdrop      200      210      310
txpush      1.000    1.000    410
l2fwd       0.900    1.000    160
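
As a worked reading of the rxdrop rows only, the relative gain of
zero-copy over XDP-DRV can be computed as below. The helper name
zc_speedup is made up for this note, and the txpush and l2fwd rows are
left out because their non-ZC values are hard to interpret.

```python
# rxdrop results copied from the two tables, in Kpps, keyed by packet size.
RXDROP_KPPS = {
    64: {"XDP-SKB": 220, "XDP-DRV": 235, "XDP-DRV(ZC)": 350},
    1500: {"XDP-SKB": 200, "XDP-DRV": 210, "XDP-DRV(ZC)": 310},
}


def zc_speedup(packet_size):
    """Ratio of zero-copy throughput to XDP-DRV throughput for rxdrop."""
    row = RXDROP_KPPS[packet_size]
    return row["XDP-DRV(ZC)"] / row["XDP-DRV"]
```

For both packet sizes the ratio comes out a little under 1.5, keeping in
mind that the numbers were measured on an emulated device.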

[0]: https://lore.kernel.org/intel-wired-lan/20230704095915.9750-1-sriram.yagnaraman@est.tech/
[1]: https://www.qemu.org/docs/master/system/devices/igb.html
[2]: https://github.com/xdp-project/bpf-examples/tree/master/AF_XDP-example

Subsequent changes and information can be found here:
https://lore.kernel.org/intel-wired-lan/20241018-b4-igb_zero_copy-v9-0-da139d78d796@linutronix.de/

Yue Haibing converts use of ERR_PTR return to traditional error code
which resolves a smatch warning.

For igc:

Song Yoong Siang allows for the XDP program to be hot-swapped.

Yue Haibing converts use of ERR_PTR return to traditional error code
which resolves a smatch warning.

Joe Damato links IRQs and queues to NAPI instances to allow for
reporting via netdev-genl API.

For ixgbe:

Yue Haibing converts use of ERR_PTR return to traditional error code
which resolves a smatch warning.

For ixgbevf:

Yue Haibing converts use of ERR_PTR return to traditional error code
which resolves a smatch warning.

For i40e:

Alex implements "mdd-auto-reset-vf" private flag to automatically reset
VFs when encountering an MDD event.

For fm10k:

Dr. David Alan Gilbert removes an unused function.
====================

Link: https://patch.msgid.link/20250106221929.956999-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months ago intel/fm10k: Remove unused fm10k_iov_msg_mac_vlan_pf
Dr. David Alan Gilbert [Mon, 6 Jan 2025 22:19:23 +0000 (14:19 -0800)]
intel/fm10k: Remove unused fm10k_iov_msg_mac_vlan_pf

fm10k_iov_msg_mac_vlan_pf() has been unused since 2017's
commit 1f5c27e52857 ("fm10k: use the MAC/VLAN queue for VF<->PF MAC/VLAN
requests")

Remove it.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250106221929.956999-16-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months ago igc: Link queues to NAPI instances
Joe Damato [Mon, 6 Jan 2025 22:19:22 +0000 (14:19 -0800)]
igc: Link queues to NAPI instances

Link queues to NAPI instances via netdev-genl API so that users can
query this information with netlink. Handle a few cases in the driver:
  1. Link/unlink the NAPIs when XDP is enabled/disabled
  2. Handle IGC_FLAG_QUEUE_PAIRS enabled and disabled

Example output when IGC_FLAG_QUEUE_PAIRS is enabled:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
                         --dump queue-get --json='{"ifindex": 2}'

[{'id': 0, 'ifindex': 2, 'napi-id': 8193, 'type': 'rx'},
 {'id': 1, 'ifindex': 2, 'napi-id': 8194, 'type': 'rx'},
 {'id': 2, 'ifindex': 2, 'napi-id': 8195, 'type': 'rx'},
 {'id': 3, 'ifindex': 2, 'napi-id': 8196, 'type': 'rx'},
 {'id': 0, 'ifindex': 2, 'napi-id': 8193, 'type': 'tx'},
 {'id': 1, 'ifindex': 2, 'napi-id': 8194, 'type': 'tx'},
 {'id': 2, 'ifindex': 2, 'napi-id': 8195, 'type': 'tx'},
 {'id': 3, 'ifindex': 2, 'napi-id': 8196, 'type': 'tx'}]

Since IGC_FLAG_QUEUE_PAIRS is enabled, you'll note that the same NAPI ID
is present for both rx and tx queues at the same index, for example
index 0:

{'id': 0, 'ifindex': 2, 'napi-id': 8193, 'type': 'rx'},
{'id': 0, 'ifindex': 2, 'napi-id': 8193, 'type': 'tx'},

To test IGC_FLAG_QUEUE_PAIRS disabled, a test system was booted using
the grub command line option "maxcpus=2" to force
igc_set_interrupt_capability to disable IGC_FLAG_QUEUE_PAIRS.

Example output when IGC_FLAG_QUEUE_PAIRS is disabled:

$ lscpu | grep "On-line CPU"
On-line CPU(s) list:      0,2

$ ethtool -l enp86s0  | tail -5
Current hardware settings:
RX: n/a
TX: n/a
Other: 1
Combined: 2

$ cat /proc/interrupts  | grep enp
 144: [...] enp86s0
 145: [...] enp86s0-rx-0
 146: [...] enp86s0-rx-1
 147: [...] enp86s0-tx-0
 148: [...] enp86s0-tx-1

1 "other" IRQ, and 2 IRQs for each of RX and Tx, so we expect netlink to
report 4 IRQs with unique NAPI IDs:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
                         --dump napi-get --json='{"ifindex": 2}'
[{'id': 8196, 'ifindex': 2, 'irq': 148},
 {'id': 8195, 'ifindex': 2, 'irq': 147},
 {'id': 8194, 'ifindex': 2, 'irq': 146},
 {'id': 8193, 'ifindex': 2, 'irq': 145}]

Now we examine which queues these NAPIs are associated with, expecting
that since IGC_FLAG_QUEUE_PAIRS is disabled each RX and TX queue will
have its own NAPI instance:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
                         --dump queue-get --json='{"ifindex": 2}'
[{'id': 0, 'ifindex': 2, 'napi-id': 8193, 'type': 'rx'},
 {'id': 1, 'ifindex': 2, 'napi-id': 8194, 'type': 'rx'},
 {'id': 0, 'ifindex': 2, 'napi-id': 8195, 'type': 'tx'},
 {'id': 1, 'ifindex': 2, 'napi-id': 8196, 'type': 'tx'}]
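
The two queue-get outputs above can be checked mechanically by grouping
queues per NAPI ID. The snippet below is a small sketch: the data is
copied from the outputs in this message, and queues_by_napi is a
hypothetical helper, not part of the ynl tooling.

```python
from collections import defaultdict


def queues_by_napi(entries):
    """Map each NAPI ID to the sorted list of (type, queue id) it serves."""
    groups = defaultdict(list)
    for entry in entries:
        groups[entry["napi-id"]].append((entry["type"], entry["id"]))
    return {napi: sorted(queues) for napi, queues in groups.items()}


# Output with IGC_FLAG_QUEUE_PAIRS enabled.
PAIRED = [
    {'id': 0, 'ifindex': 2, 'napi-id': 8193, 'type': 'rx'},
    {'id': 1, 'ifindex': 2, 'napi-id': 8194, 'type': 'rx'},
    {'id': 2, 'ifindex': 2, 'napi-id': 8195, 'type': 'rx'},
    {'id': 3, 'ifindex': 2, 'napi-id': 8196, 'type': 'rx'},
    {'id': 0, 'ifindex': 2, 'napi-id': 8193, 'type': 'tx'},
    {'id': 1, 'ifindex': 2, 'napi-id': 8194, 'type': 'tx'},
    {'id': 2, 'ifindex': 2, 'napi-id': 8195, 'type': 'tx'},
    {'id': 3, 'ifindex': 2, 'napi-id': 8196, 'type': 'tx'},
]

# Output with IGC_FLAG_QUEUE_PAIRS disabled.
UNPAIRED = [
    {'id': 0, 'ifindex': 2, 'napi-id': 8193, 'type': 'rx'},
    {'id': 1, 'ifindex': 2, 'napi-id': 8194, 'type': 'rx'},
    {'id': 0, 'ifindex': 2, 'napi-id': 8195, 'type': 'tx'},
    {'id': 1, 'ifindex': 2, 'napi-id': 8196, 'type': 'tx'},
]
```

With pairs enabled every NAPI ID maps to one rx and one tx queue at the
same index; with pairs disabled every NAPI ID maps to exactly one queue.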

Signed-off-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Vitaly Lifshits <vitaly.lifshits@intel.com>
Tested-by: Avigail Dahan <avigailx.dahan@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250106221929.956999-15-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months ago igc: Link IRQs to NAPI instances
Joe Damato [Mon, 6 Jan 2025 22:19:21 +0000 (14:19 -0800)]
igc: Link IRQs to NAPI instances

Link IRQs to NAPI instances via netdev-genl API so that users can query
this information with netlink.

Compare the output of /proc/interrupts (noting that IRQ 128 is the
"other" IRQ which does not appear to have a NAPI instance):

$ cat /proc/interrupts | grep enp86s0 | cut --delimiter=":" -f1
 128
 129
 130
 131
 132

The output from netlink shows the mapping of NAPI IDs to IRQs (again
noting that 128 is absent as it is the "other" IRQ):

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
                         --dump napi-get --json='{"ifindex": 2}'

[{'defer-hard-irqs': 0,
  'gro-flush-timeout': 0,
  'id': 8196,
  'ifindex': 2,
  'irq': 132},
 {'defer-hard-irqs': 0,
  'gro-flush-timeout': 0,
  'id': 8195,
  'ifindex': 2,
  'irq': 131},
 {'defer-hard-irqs': 0,
  'gro-flush-timeout': 0,
  'id': 8194,
  'ifindex': 2,
  'irq': 130},
 {'defer-hard-irqs': 0,
  'gro-flush-timeout': 0,
  'id': 8193,
  'ifindex': 2,
  'irq': 129}]
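
The comparison described above can also be done programmatically. The
sketch below uses the IRQ numbers and napi-get fields shown in this
message; irqs_without_napi is a hypothetical helper name.

```python
def irqs_without_napi(proc_irqs, napi_entries):
    """Return IRQs listed in /proc/interrupts that no NAPI instance reports."""
    napi_irqs = {entry["irq"] for entry in napi_entries}
    return sorted(set(proc_irqs) - napi_irqs)


# IRQ numbers from the /proc/interrupts listing.
PROC_IRQS = [128, 129, 130, 131, 132]

# napi-get output, reduced to the fields needed here.
NAPI_GET = [
    {'id': 8196, 'ifindex': 2, 'irq': 132},
    {'id': 8195, 'ifindex': 2, 'irq': 131},
    {'id': 8194, 'ifindex': 2, 'irq': 130},
    {'id': 8193, 'ifindex': 2, 'irq': 129},
]
```

The only IRQ left over is 128, the "other" IRQ.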

Signed-off-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Vitaly Lifshits <vitaly.lifshits@intel.com>
Tested-by: Avigail Dahan <avigailx.dahan@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250106221929.956999-14-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months ago i40e: add ability to reset VF for Tx and Rx MDD events
Aleksandr Loktionov [Mon, 6 Jan 2025 22:19:20 +0000 (14:19 -0800)]
i40e: add ability to reset VF for Tx and Rx MDD events

Implement "mdd-auto-reset-vf" priv-flag to handle Tx and Rx MDD events for VFs.
This flag is also used in other network adapters like ICE.

Usage:
- "on"  - The problematic VF will be automatically reset
  if a malformed descriptor is detected.
- "off" - The problematic VF will be disabled.

In cases where a VF sends malformed packets classified as malicious, it can
cause the Tx queue to freeze, rendering it unusable for several minutes. When
an MDD event occurs, this new implementation allows for a graceful VF reset to
quickly restore operational state.

Currently, VF queues are disabled if an MDD event occurs. This patch adds the
ability to reset the VF if a Tx or Rx MDD event occurs. It also includes MDD
event logging throttling to avoid dmesg pollution and unifies the format of
Tx and Rx MDD messages.

Note: Standard message rate limiting functions like dev_info_ratelimited()
do not meet our requirements. Custom rate limiting is implemented,
please see the code for details.

Co-developed-by: Jan Sokolowski <jan.sokolowski@intel.com>
Signed-off-by: Jan Sokolowski <jan.sokolowski@intel.com>
Co-developed-by: Padraig J Connolly <padraig.j.connolly@intel.com>
Signed-off-by: Padraig J Connolly <padraig.j.connolly@intel.com>
Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Michal Schmidt <mschmidt@redhat.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250106221929.956999-13-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months ago ixgbevf: Fix passing 0 to ERR_PTR in ixgbevf_run_xdp()
Yue Haibing [Mon, 6 Jan 2025 22:19:19 +0000 (14:19 -0800)]
ixgbevf: Fix passing 0 to ERR_PTR in ixgbevf_run_xdp()

ixgbevf_run_xdp() converts a custom XDP action to a negative error code
cast to the sk_buff pointer type, which is then checked with IS_ERR in
ixgbevf_clean_rx_irq(). Remove this error pointer handling and use a
plain int return value instead.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Chandan Kumar Rout <chandanx.rout@intel.com> (A Contingent Worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250106221929.956999-12-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months ago ixgbe: Fix passing 0 to ERR_PTR in ixgbe_run_xdp()
Yue Haibing [Mon, 6 Jan 2025 22:19:18 +0000 (14:19 -0800)]
ixgbe: Fix passing 0 to ERR_PTR in ixgbe_run_xdp()

ixgbe_run_xdp() converts a custom XDP action to a negative error code
cast to the sk_buff pointer type, which is then checked with IS_ERR in
ixgbe_clean_rx_irq(). Remove this error pointer handling and use a
plain int return value instead.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Chandan Kumar Rout <chandanx.rout@intel.com> (A Contingent Worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250106221929.956999-11-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months ago igb: Fix passing 0 to ERR_PTR in igb_run_xdp()
Yue Haibing [Mon, 6 Jan 2025 22:19:17 +0000 (14:19 -0800)]
igb: Fix passing 0 to ERR_PTR in igb_run_xdp()

igb_run_xdp() converts a custom XDP action to a negative error code
cast to the sk_buff pointer type, which is then checked with IS_ERR in
igb_clean_rx_irq(). Remove this error pointer handling and use a plain
int return value instead.

Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Chandan Kumar Rout <chandanx.rout@intel.com> (A Contingent Worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250106221929.956999-10-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months ago igc: Fix passing 0 to ERR_PTR in igc_xdp_run_prog()
Yue Haibing [Mon, 6 Jan 2025 22:19:16 +0000 (14:19 -0800)]
igc: Fix passing 0 to ERR_PTR in igc_xdp_run_prog()

igc_xdp_run_prog() converts a custom XDP action to a negative error code
cast to the sk_buff pointer type, which is then checked with IS_ERR in
igc_clean_rx_irq(). Remove this error pointer handling and use a plain
int return value instead to fix this smatch warning:

drivers/net/ethernet/intel/igc/igc_main.c:2533
 igc_xdp_run_prog() warn: passing zero to 'ERR_PTR'
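
Why passing zero is a problem can be shown with a small Python model of
the kernel's error-pointer helpers. It is an illustration only: pointers
are modelled as 64-bit integers, the action values XDP_PASS = 0 and
XDP_CONSUMED = 1 are assumptions, and run_xdp_old / run_xdp_new are
hypothetical names standing in for the driver functions.

```python
MAX_ERRNO = 4095
PTR_MASK = (1 << 64) - 1    # model a 64-bit pointer as an unsigned integer


def ERR_PTR(err):
    """Encode an errno-style value as a pointer value."""
    return err & PTR_MASK


def IS_ERR(ptr):
    """True only for the top MAX_ERRNO values of the address space."""
    return ptr >= (-MAX_ERRNO & PTR_MASK)


# Assumed action codes for this sketch.
XDP_PASS = 0
XDP_CONSUMED = 1


def run_xdp_old(result):
    """Old style: smuggle the action through an error pointer."""
    return ERR_PTR(-result)


def run_xdp_new(result):
    """New style: return the action as a plain int."""
    return result
```

In the model, the pass action turns into pointer value 0, that is NULL,
which IS_ERR() does not treat as an error. Callers therefore have to rely
on a NULL that was produced by ERR_PTR(0), which is the pattern smatch
warns about. A plain int return avoids the encoding altogether.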

Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Avigail Dahan <avigailx.dahan@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250106221929.956999-9-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months ago igc: Allow hot-swapping XDP program
Song Yoong Siang [Mon, 6 Jan 2025 22:19:15 +0000 (14:19 -0800)]
igc: Allow hot-swapping XDP program

Currently, the driver would always close and reopen the network interface
when setting/removing the XDP program, regardless of the presence of XDP
resources. This could cause unnecessary disruptions.

To avoid this, introduce a check to determine if there is a need to
close and reopen the interface, allowing for seamless hot-swapping of
XDP programs.
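
One plausible shape for such a check is sketched below in Python. It is
an assumption about the logic, not the driver's code: it supposes that
XDP resources only need to be set up or torn down when XDP switches
between "no program" and "some program", and needs_reopen is a
hypothetical name.

```python
def needs_reopen(if_running, old_prog, new_prog):
    """Decide whether the interface must be closed and reopened.

    A reopen is assumed to be needed only when the interface is running
    and XDP is being turned on or off; replacing one program with another
    leaves the XDP resources in place.
    """
    xdp_toggled = (old_prog is None) != (new_prog is None)
    return if_running and xdp_toggled
```

Under this assumption, replacing one loaded program with another on a
running interface returns False, which is the hot-swap case.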

Signed-off-by: Song Yoong Siang <yoong.siang.song@intel.com>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Tested-by: Avigail Dahan <avigailx.dahan@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250106221929.956999-8-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoigb: Add AF_XDP zero-copy Tx support
Sriram Yagnaraman [Mon, 6 Jan 2025 22:19:14 +0000 (14:19 -0800)]
igb: Add AF_XDP zero-copy Tx support

Add support for AF_XDP zero-copy transmit path.

A new TX buffer type IGB_TYPE_XSK is introduced to indicate that the Tx
frame was allocated from the xsk buff pool, so igb_clean_tx_ring() and
igb_clean_tx_irq() can clean the buffers correctly based on type.

igb_xmit_zc() performs the actual packet transmit when AF_XDP zero-copy is
enabled. We share the TX ring between slow path, XDP and AF_XDP
zero-copy, so we use the netdev queue lock to ensure mutual exclusion.

Signed-off-by: Sriram Yagnaraman <sriram.yagnaraman@est.tech>
[Kurt: Set olinfo_status in igb_xmit_zc() so that frames are transmitted,
       Use READ_ONCE() for xsk_pool and check Tx disabled and carrier in
       igb_xmit_zc(), Add FIXME for RS bit]
Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250106221929.956999-7-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoigb: Add AF_XDP zero-copy Rx support
Sriram Yagnaraman [Mon, 6 Jan 2025 22:19:13 +0000 (14:19 -0800)]
igb: Add AF_XDP zero-copy Rx support

Add support for AF_XDP zero-copy receive path.

When AF_XDP zero-copy is enabled, the rx buffers are allocated from the
xsk buff pool using igb_alloc_rx_buffers_zc().

Use xsk_pool_get_rx_frame_size() to set SRRCTL rx buf size when zero-copy
is enabled.

Signed-off-by: Sriram Yagnaraman <sriram.yagnaraman@est.tech>
[Kurt: Port to v6.12 and provide napi_id for xdp_rxq_info_reg(),
       RCT, remove NETDEV_XDP_ACT_XSK_ZEROCOPY, update NTC handling,
       READ_ONCE() xsk_pool, likelyfy for XDP_REDIRECT case]
Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250106221929.956999-6-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoigb: Add XDP finalize and stats update functions
Kurt Kanzenbach [Mon, 6 Jan 2025 22:19:12 +0000 (14:19 -0800)]
igb: Add XDP finalize and stats update functions

Move XDP finalize and Rx statistics update into separate functions. This
way, they can be reused by the XDP and XDP/ZC code later.

Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250106221929.956999-5-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoigb: Introduce XSK data structures and helpers
Sriram Yagnaraman [Mon, 6 Jan 2025 22:19:11 +0000 (14:19 -0800)]
igb: Introduce XSK data structures and helpers

Add the following ring flag:
- IGB_RING_FLAG_TX_DISABLED (when xsk pool is being set up)

Add an xdp_buff array for use with XSK receive batch API, and a pointer
to xsk_pool in igb_adapter.

Add enable/disable functions for TX and RX rings.
Add enable/disable functions for XSK pool.
Add xsk wakeup function.

None of the above functionality will be active until
NETDEV_XDP_ACT_XSK_ZEROCOPY is advertised in netdev->xdp_features.

Signed-off-by: Sriram Yagnaraman <sriram.yagnaraman@est.tech>
[Kurt: Add READ/WRITE_ONCE(), synchronize_net(),
       remove IGB_RING_FLAG_AF_XDP_ZC]
Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250106221929.956999-4-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoigb: Introduce igb_xdp_is_enabled()
Sriram Yagnaraman [Mon, 6 Jan 2025 22:19:10 +0000 (14:19 -0800)]
igb: Introduce igb_xdp_is_enabled()

Introduce igb_xdp_is_enabled() to check if an XDP program is assigned to
the device. Use that wherever xdp_prog is read and evaluated.

Signed-off-by: Sriram Yagnaraman <sriram.yagnaraman@est.tech>
[Kurt: Split patches and use READ_ONCE()]
Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250106221929.956999-3-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoigb: Remove static qualifiers
Sriram Yagnaraman [Mon, 6 Jan 2025 22:19:09 +0000 (14:19 -0800)]
igb: Remove static qualifiers

Remove the static qualifiers from the following functions so that they
can be called from the XSK-specific file added in later patches:
- igb_xdp_tx_queue_mapping()
- igb_xdp_ring_update_tail()
- igb_clean_tx_ring()
- igb_clean_rx_ring()
- igb_xdp_xmit_back()
- igb_process_skb_fields()

While at it, inline igb_xdp_tx_queue_mapping() and
igb_xdp_ring_update_tail(). These functions are small enough and used in
XDP hot paths.

Signed-off-by: Sriram Yagnaraman <sriram.yagnaraman@est.tech>
[Kurt: Split patches, inline small XDP functions]
Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250106221929.956999-2-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge branch 'tools-ynl-decode-link-types-present-in-tests'
Jakub Kicinski [Wed, 8 Jan 2025 02:07:55 +0000 (18:07 -0800)]
Merge branch 'tools-ynl-decode-link-types-present-in-tests'

Jakub Kicinski says:

====================
tools: ynl: decode link types present in tests

Using a kernel built for the net selftest target to run drivers/net
tests currently fails, because the net kernel automatically spawns
a handful of tunnel devices which YNL can't decode.

Fill in those missing link types in rt_link. We need to extend subset
support a bit for it to work.

v1: https://lore.kernel.org/20250105012523.1722231-1-kuba@kernel.org
====================

Link: https://patch.msgid.link/20250107022820.2087101-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonetlink: specs: rt_link: decode ip6tnl, vti and vti6 link attrs
Jakub Kicinski [Tue, 7 Jan 2025 02:28:20 +0000 (18:28 -0800)]
netlink: specs: rt_link: decode ip6tnl, vti and vti6 link attrs

Some of our tests load vti and ip6tnl so not being able to decode
the link attrs gets in the way of using Python YNL for testing.

Decode link attributes for ip6tnl, vti and vti6.

ip6tnl uses IFLA_IPTUN_FLAGS as u32, while ipv4 and sit expect
a u16 attribute, so we have a (first?) subset type override...

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250107022820.2087101-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agotools: ynl: print some information about attribute we can't parse
Jakub Kicinski [Tue, 7 Jan 2025 02:28:19 +0000 (18:28 -0800)]
tools: ynl: print some information about attribute we can't parse

When parsing throws an exception, one often has to work out from first
principles which attribute couldn't be parsed. For families with
large message parsing trees, like rtnetlink, guessing the
attribute can be hard.

Print a bit of information as the exception travels out, e.g.:

  # when dumping rt links
  Error decoding 'flags' from 'linkinfo-ip6tnl-attrs'
  Error decoding 'data' from 'linkinfo-attrs'
  Error decoding 'linkinfo' from 'link-attrs'
  Traceback (most recent call last):
    File "/home/kicinski/linux/./tools/net/ynl/cli.py", line 119, in <module>
      main()
    File "/home/kicinski/linux/./tools/net/ynl/cli.py", line 100, in main
      reply = ynl.dump(args.dump, attrs)
    File "/home/kicinski/linux/tools/net/ynl/lib/ynl.py", line 1064, in dump
      return self._op(method, vals, dump=True)
    File "/home/kicinski/linux/tools/net/ynl/lib/ynl.py", line 1058, in _op
      return self._ops(ops)[0]
    File "/home/kicinski/linux/tools/net/ynl/lib/ynl.py", line 1045, in _ops
      rsp_msg = self._decode(decoded.raw_attrs, op.attr_set.name)
    File "/home/kicinski/linux/tools/net/ynl/lib/ynl.py", line 738, in _decode
      subdict = self._decode(NlAttrs(attr.raw), attr_spec['nested-attributes'], search_attrs)
    File "/home/kicinski/linux/tools/net/ynl/lib/ynl.py", line 763, in _decode
      decoded = self._decode_sub_msg(attr, attr_spec, search_attrs)
    File "/home/kicinski/linux/tools/net/ynl/lib/ynl.py", line 714, in _decode_sub_msg
      subdict = self._decode(NlAttrs(attr.raw, offset), msg_format.attr_set)
    File "/home/kicinski/linux/tools/net/ynl/lib/ynl.py", line 749, in _decode
      decoded = attr.as_scalar(attr_spec['type'], attr_spec.byte_order)
    File "/home/kicinski/linux/tools/net/ynl/lib/ynl.py", line 147, in as_scalar
      return format.unpack(self.raw)[0]
  struct.error: unpack requires a buffer of 2 bytes

The Traceback is what we would previously see; the "Error..."
messages are new. We print a message per level (in the stack
order). Printing a single combined message gets tricky quickly
given sub-messages etc.
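
The per-level reporting described above can be sketched as follows. This
is a toy model, not the ynl code: decode(), the spec layout and the raw
input are hypothetical, and messages are collected in a list instead of
being printed so that they are easy to inspect.

```python
import struct


def decode(raw, spec, set_name, log):
    """Decode a toy nested attribute tree.

    spec maps an attribute name to a struct format (leaf) or to a
    (nested set name, nested spec) tuple. raw maps an attribute name
    to bytes (leaf) or to a dict (nested attributes).
    """
    out = {}
    for name, value in raw.items():
        try:
            kind = spec[name]
            if isinstance(kind, tuple):
                nested_set, nested_spec = kind
                out[name] = decode(value, nested_spec, nested_set, log)
            else:
                out[name] = struct.unpack(kind, value)[0]
        except Exception:
            # One message per level, emitted as the exception travels out.
            log.append(f"Error decoding '{name}' from '{set_name}'")
            raise
    return out


# 'flags' is declared as u16 ("=H") but the payload is 4 bytes long.
spec = {"linkinfo": ("linkinfo-attrs",
                     {"data": ("linkinfo-ip6tnl-attrs", {"flags": "=H"})})}
raw = {"linkinfo": {"data": {"flags": b"\x00\x00\x03\x00"}}}

log = []
try:
    decode(raw, spec, "link-attrs", log)
except struct.error:
    pass  # the original exception still propagates to the caller
```

Each level only knows its own attribute and set name, so no level has to
assemble the full path, which is the difficulty with a single combined
message.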

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250107022820.2087101-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agotools: ynl: correctly handle overrides of fields in subset
Jakub Kicinski [Tue, 7 Jan 2025 02:28:18 +0000 (18:28 -0800)]
tools: ynl: correctly handle overrides of fields in subset

We stated in documentation [1] and previous discussions [2]
that the need for overriding fields in members of subsets
is anticipated. Implement it.

Since each attr is now a new object we need to make sure
that the modifications are propagated. Specifically C codegen
wants to annotate which attrs are used in requests and replies
to generate the right validation artifacts.

[1] https://docs.kernel.org/next/userspace-api/netlink/specs.html#subset-of
[2] https://lore.kernel.org/netdev/20231004171350.1f59cd1d@kernel.org/
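
As a rough illustration of the idea (not the actual ynl implementation),
a subset member can be resolved into a new object that starts from the
parent definition and applies whatever fields the subset overrides. The
function resolve_subset() and the attribute names below are hypothetical.

```python
def resolve_subset(parent_attrs, subset_attrs):
    """Return subset attrs as new dicts: parent definition plus overrides."""
    parents = {attr["name"]: attr for attr in parent_attrs}
    resolved = []
    for member in subset_attrs:
        merged = dict(parents[member["name"]])  # copy, parent stays untouched
        merged.update(member)                   # fields set in the subset win
        resolved.append(merged)
    return resolved


# Hypothetical parent set and a subset that overrides the type of 'flags'.
parent = [{"name": "flags", "type": "u16"}, {"name": "ttl", "type": "u8"}]
subset = [{"name": "flags", "type": "u32"}, {"name": "ttl"}]

resolved = resolve_subset(parent, subset)
```

Because every resolved attr is a separate object, any annotation made on
it later (for example "used in requests") is not visible on the parent
unless it is propagated explicitly.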

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250107022820.2087101-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoif_vlan: fix kdoc warnings
Jakub Kicinski [Mon, 6 Jan 2025 17:46:20 +0000 (09:46 -0800)]
if_vlan: fix kdoc warnings

While merging net to net-next I noticed that the kdoc above
__vlan_get_protocol_offset() has the wrong function name.
Fix that and all the other kdoc warnings in this file.

Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250106174620.1855269-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge branch 'net-dsa-cleanup-eee-part-2'
Jakub Kicinski [Wed, 8 Jan 2025 02:06:19 +0000 (18:06 -0800)]
Merge branch 'net-dsa-cleanup-eee-part-2'

Russell King says:

====================
net: dsa: cleanup EEE (part 2)

This is part 2 of the DSA EEE cleanups, removing code that has become
dead as a result of the EEE management that phylib now does.

Patch 1 removes the useless setting of tx_lpi parameters in the
ksz driver.

Patch 2 does the same for mt753x.

Patch 3 removes the DSA core code that calls the get_mac_eee() operation.
This needs to be done before removing the implementations because doing
otherwise would cause dsa_user_get_eee() to return -EOPNOTSUPP.

Patches 4..8 remove the trivial get_mac_eee() implementations from DSA
drivers.

Patch 9 finally removes the get_mac_eee() method from struct
dsa_switch_ops.
====================

Link: https://patch.msgid.link/Z3vDwwsHSxH5D6Pm@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: dsa: remove get_mac_eee() method
Russell King (Oracle) [Mon, 6 Jan 2025 11:59:24 +0000 (11:59 +0000)]
net: dsa: remove get_mac_eee() method

The get_mac_eee() is no longer called by the core DSA code, nor are
there any implementations of this method. Remove it.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1tUllU-007UzL-KV@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: dsa: qca: remove qca8k_get_mac_eee()
Russell King (Oracle) [Mon, 6 Jan 2025 11:59:19 +0000 (11:59 +0000)]
net: dsa: qca: remove qca8k_get_mac_eee()

qca8k_get_mac_eee() is no longer called by the core DSA code. Remove it.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1tUllP-007UzF-Gk@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: dsa: mv88e6xxx: remove mv88e6xxx_get_mac_eee()
Russell King (Oracle) [Mon, 6 Jan 2025 11:59:14 +0000 (11:59 +0000)]
net: dsa: mv88e6xxx: remove mv88e6xxx_get_mac_eee()

mv88e6xxx_get_mac_eee() is no longer called by the core DSA code.
Remove it.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1tUllK-007Uz9-D7@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: dsa: mt753x: remove ksz_get_mac_eee()
Russell King (Oracle) [Mon, 6 Jan 2025 11:59:09 +0000 (11:59 +0000)]
net: dsa: mt753x: remove ksz_get_mac_eee()

mt753x_get_mac_eee() is no longer called by the core DSA code. Remove
it.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Chester A. Unal <chester.a.unal@arinc9.com>
Link: https://patch.msgid.link/E1tUllF-007Uz3-95@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: dsa: ksz: remove ksz_get_mac_eee()
Russell King (Oracle) [Mon, 6 Jan 2025 11:59:04 +0000 (11:59 +0000)]
net: dsa: ksz: remove ksz_get_mac_eee()

ksz_get_mac_eee() is no longer called by the core DSA code. Remove it.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1tUllA-007Uyx-4o@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: dsa: b53/bcm_sf2: remove b53_get_mac_eee()
Russell King (Oracle) [Mon, 6 Jan 2025 11:58:59 +0000 (11:58 +0000)]
net: dsa: b53/bcm_sf2: remove b53_get_mac_eee()

b53_get_mac_eee() is no longer called by the core DSA code. Remove it.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1tUll5-007Uyr-1U@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: dsa: no longer call ds->ops->get_mac_eee()
Russell King (Oracle) [Mon, 6 Jan 2025 11:58:53 +0000 (11:58 +0000)]
net: dsa: no longer call ds->ops->get_mac_eee()

All implementations of get_mac_eee() now just return zero without doing
anything useful. Remove the call to this method in preparation for
removing the method from each DSA driver.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1tUlkz-007Uyl-UA@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: dsa: mt753x: remove setting of tx_lpi parameters
Russell King (Oracle) [Mon, 6 Jan 2025 11:58:48 +0000 (11:58 +0000)]
net: dsa: mt753x: remove setting of tx_lpi parameters

dsa_user_get_eee() calls the DSA switch get_mac_eee() method followed
by phylink_ethtool_get_eee(), which goes on to call
phy_ethtool_get_eee(). This overwrites all members of the passed
ethtool_keee, which means anything written by the DSA switch
get_mac_eee() method will be discarded.

Remove setting any members in mt753x_get_mac_eee().

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Chester A. Unal <chester.a.unal@arinc9.com>
Link: https://patch.msgid.link/E1tUlku-007Uyc-RP@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: dsa: ksz: remove setting of tx_lpi parameters
Russell King (Oracle) [Mon, 6 Jan 2025 11:58:43 +0000 (11:58 +0000)]
net: dsa: ksz: remove setting of tx_lpi parameters

dsa_user_get_eee() calls the DSA switch get_mac_eee() method followed
by phylink_ethtool_get_eee(), which goes on to call
phy_ethtool_get_eee(). This overwrites all members of the passed
ethtool_keee, which means anything written by the DSA switch
get_mac_eee() method will be discarded.

Remove setting any members in ksz_get_mac_eee().

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1tUlkp-007UyW-OR@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge branch 'net-hold-per-netns-rtnl-during-netdev-notifier-registration'
Jakub Kicinski [Wed, 8 Jan 2025 01:49:24 +0000 (17:49 -0800)]
Merge branch 'net-hold-per-netns-rtnl-during-netdev-notifier-registration'

Kuniyuki Iwashima says:

====================
net: Hold per-netns RTNL during netdev notifier registration.

This series adds per-netns RTNL for registration of the global
and per-netns netdev notifiers.

v1: https://lore.kernel.org/netdev/20250104063735.36945-1-kuniyu@amazon.com/
====================

Link: https://patch.msgid.link/20250106070751.63146-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: Hold rtnl_net_lock() in (un)?register_netdevice_notifier_dev_net().
Kuniyuki Iwashima [Mon, 6 Jan 2025 07:07:51 +0000 (16:07 +0900)]
net: Hold rtnl_net_lock() in (un)?register_netdevice_notifier_dev_net().

(un)?register_netdevice_notifier_dev_net() hold RTNL before triggering
the notifier for all netdev in the netns.

Let's convert the RTNL to rtnl_net_lock().

Note that move_netdevice_notifiers_dev_net() is assumed to be (but not
yet) protected by per-netns RTNL of both src and dst netns; we need to
convert wireless and hyperv drivers that call dev_change_net_namespace().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250106070751.63146-4-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: Hold rtnl_net_lock() in (un)?register_netdevice_notifier_net().
Kuniyuki Iwashima [Mon, 6 Jan 2025 07:07:50 +0000 (16:07 +0900)]
net: Hold rtnl_net_lock() in (un)?register_netdevice_notifier_net().

(un)?register_netdevice_notifier_net() hold RTNL before triggering the
notifier for all netdev in the netns.

Let's convert the RTNL to rtnl_net_lock().

Note that the per-netns netdev notifier is protected by per-netns RTNL.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250106070751.63146-3-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: Hold __rtnl_net_lock() in (un)?register_netdevice_notifier().
Kuniyuki Iwashima [Mon, 6 Jan 2025 07:07:49 +0000 (16:07 +0900)]
net: Hold __rtnl_net_lock() in (un)?register_netdevice_notifier().

(un)?register_netdevice_notifier() hold pernet_ops_rwsem and RTNL,
iterate all netns, and trigger the notifier for all netdev.

Let's hold __rtnl_net_lock() before triggering the notifier.

Note that we will need protection for netdev_chain when RTNL is
removed.  (e.g. blocking_notifier conversion [0] with a lockdep
annotation [1])

Link: https://lore.kernel.org/netdev/20250104063735.36945-2-kuniyu@amazon.com/
Link: https://lore.kernel.org/netdev/20250105075957.67334-1-kuniyu@amazon.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250106070751.63146-2-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoixgbevf: Remove unused ixgbevf_hv_mbx_ops
Dr. David Alan Gilbert [Sun, 5 Jan 2025 12:28:47 +0000 (12:28 +0000)]
ixgbevf: Remove unused ixgbevf_hv_mbx_ops

The const struct ixgbevf_hv_mbx_ops was added in 2016 as part of
commit c6d45171d706 ("ixgbevf: Support Windows hosts (Hyper-V)")
but has remained unused.

The functions it references are still referenced elsewhere.

Remove it.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Link: https://patch.msgid.link/20250105122847.27341-1-linux@treblig.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: watchdog: rename __dev_watchdog_up() and dev_watchdog_down()
Eric Dumazet [Sun, 5 Jan 2025 09:09:24 +0000 (09:09 +0000)]
net: watchdog: rename __dev_watchdog_up() and dev_watchdog_down()

In commit d7811e623dd4 ("[NET]: Drop tx lock in dev_watchdog_up")
dev_watchdog_up() became a simple wrapper for __netdev_watchdog_up().

Herbert also said: "In 2.6.19 we can eliminate the unnecessary
__dev_watchdog_up and replace it with dev_watchdog_up."

This patch consolidates things to have only two functions, with
a common prefix.

- netdev_watchdog_up(), exported for the sake of one freescale driver.
  This replaces __netdev_watchdog_up() and dev_watchdog_up().

- netdev_watchdog_down(), static to net/sched/sch_generic.c
  This replaces dev_watchdog_down().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20250105090924.1661822-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf...
Jakub Kicinski [Tue, 7 Jan 2025 23:39:09 +0000 (15:39 -0800)]
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2025-01-07

We've added 7 non-merge commits during the last 32 day(s) which contain
a total of 11 files changed, 190 insertions(+), 103 deletions(-).

The main changes are:

1) Migrate the test_xdp_meta.sh BPF selftest into test_progs
   framework, from Bastien Curutchet.

2) Add ability to configure head/tailroom for netkit devices,
   from Daniel Borkmann.

3) Fixes and improvements to the xdp_hw_metadata selftest,
   from Song Yoong Siang.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
  selftests/bpf: Extend netkit tests to validate set {head,tail}room
  netkit: Add add netkit {head,tail}room to rt_link.yaml
  netkit: Allow for configuring needed_{head,tail}room
  selftests/bpf: Migrate test_xdp_meta.sh into xdp_context_test_run.c
  selftests/bpf: test_xdp_meta: Rename BPF sections
  selftests/bpf: Enable Tx hwtstamp in xdp_hw_metadata
  selftests/bpf: Actuate tx_metadata_len in xdp_hw_metadata
====================

Link: https://patch.msgid.link/20250107130908.143644-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agobridge: Make br_is_nd_neigh_msg() accept pointer to "const struct sk_buff"
Ted Chen [Sat, 4 Jan 2025 08:38:46 +0000 (16:38 +0800)]
bridge: Make br_is_nd_neigh_msg() accept pointer to "const struct sk_buff"

The sk_buff struct in br_is_nd_neigh_msg() is never modified. Mark it as
const.

Signed-off-by: Ted Chen <znscnchen@gmail.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20250104083846.71612-1-znscnchen@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 months agoMerge branch 'dev-hold-per-netns-rtnl-in-register-netdev'
Paolo Abeni [Tue, 7 Jan 2025 12:45:58 +0000 (13:45 +0100)]
Merge branch 'dev-hold-per-netns-rtnl-in-register-netdev'

Kuniyuki Iwashima says:

====================
dev: Hold per-netns RTNL in register_netdev().

Patch 1 adds rtnl_net_lock_killable() and Patch 2 uses it in
register_netdev() and converts it and unregister_netdev() to
per-netns RTNL.

With this and the netdev notifier series [0], ASSERT_RTNL_NET()
for NETDEV_REGISTER [1] did not fire on a simple QEMU setup
such as e1000 + x86_64_defconfig + CONFIG_DEBUG_NET_SMALL_RTNL.

[0]: https://lore.kernel.org/netdev/20250104063735.36945-1-kuniyu@amazon.com/

[1]:
---8<---
diff --git a/net/core/rtnl_net_debug.c b/net/core/rtnl_net_debug.c
index f406045cbd0e..c0c30929002e 100644
--- a/net/core/rtnl_net_debug.c
+++ b/net/core/rtnl_net_debug.c
@@ -21,7 +21,6 @@ static int rtnl_net_debug_event(struct notifier_block *nb,
  case NETDEV_DOWN:
  case NETDEV_REBOOT:
  case NETDEV_CHANGE:
- case NETDEV_REGISTER:
  case NETDEV_UNREGISTER:
  case NETDEV_CHANGEMTU:
  case NETDEV_CHANGEADDR:
@@ -60,19 +59,10 @@ static int rtnl_net_debug_event(struct notifier_block *nb,
  ASSERT_RTNL();
  break;

- /* Once an event fully supports RTNL_NET, move it here
-  * and remove "if (0)" below.
-  *
-  * case NETDEV_XXX:
-  * ASSERT_RTNL_NET(net);
-  * break;
-  */
- }
-
- /* Just to avoid unused-variable error for dev and net. */
- if (0)
+ case NETDEV_REGISTER:
  ASSERT_RTNL_NET(net);
+ break;
+ }

  return NOTIFY_DONE;
 }
---8<---
====================

Link: https://patch.msgid.link/20250104082149.48493-1-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 months agodev: Hold per-netns RTNL in (un)?register_netdev().
Kuniyuki Iwashima [Sat, 4 Jan 2025 08:21:49 +0000 (17:21 +0900)]
dev: Hold per-netns RTNL in (un)?register_netdev().

Let's hold per-netns RTNL of dev_net(dev) in register_netdev()
and unregister_netdev().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 months agortnetlink: Add rtnl_net_lock_killable().
Kuniyuki Iwashima [Sat, 4 Jan 2025 08:21:48 +0000 (17:21 +0900)]
rtnetlink: Add rtnl_net_lock_killable().

rtnl_lock_killable() is used only in register_netdev()
and will be converted to per-netns RTNL.

Let's unexport it and add the corresponding helper.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 months agoeth: fbnic: update fbnic_poll return value
Mohsin Bashir [Sat, 4 Jan 2025 01:53:16 +0000 (17:53 -0800)]
eth: fbnic: update fbnic_poll return value

In cases where the work done is less than the budget, `fbnic_poll`
returns 0. This affects the tracing of `napi_poll`. Return the work
done instead, which improves manual tracing. The following is a snippet
of the `napi_poll` tracepoint results before and after the change.

Before:
@[10]: 1
...
@[64]: 208175
@[0]: 2128008

After:
@[56]: 86
@[48]: 222
...
@[5]: 1885756
@[6]: 1933841
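
The shift in the histogram can be modelled in a few lines. This is a toy
model, not the driver code: poll_old(), poll_new() and the sample work
counts are hypothetical, and it assumes the handler returns the budget
whenever the budget is used up.

```python
from collections import Counter


def poll_old(work_done, budget):
    # Old behaviour: anything below the budget is reported as 0.
    return budget if work_done >= budget else 0


def poll_new(work_done, budget):
    # New behaviour: report the work actually done.
    return budget if work_done >= budget else work_done


budget = 64
samples = [64, 5, 6, 5, 70, 0]  # hypothetical per-poll work counts

old_hist = Counter(poll_old(w, budget) for w in samples)
new_hist = Counter(poll_new(w, budget) for w in samples)
```

In the old histogram every partial poll collapses into the 0 bucket,
while the new one keeps a separate bucket per work count.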

Signed-off-by: Mohsin Bashir <mohsin.bashr@gmail.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Link: https://patch.msgid.link/20250104015316.3192946-1-mohsin.bashr@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 months agoMerge branch 'net-airoha-add-qdisc-offload-support'
Paolo Abeni [Tue, 7 Jan 2025 11:32:53 +0000 (12:32 +0100)]
Merge branch 'net-airoha-add-qdisc-offload-support'

Lorenzo Bianconi says:

====================
net: airoha: Add Qdisc offload support

Introduce support for ETS and HTB Qdisc offload available on the Airoha
EN7581 ethernet controller.
====================

Link: https://patch.msgid.link/20250103-airoha-en7581-qdisc-offload-v1-0-608a23fa65d5@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 months agonet: airoha: Add sched HTB offload support
Lorenzo Bianconi [Fri, 3 Jan 2025 12:17:05 +0000 (13:17 +0100)]
net: airoha: Add sched HTB offload support

Introduce support for HTB Qdisc offload available in the Airoha EN7581
ethernet controller. EN7581 can offload only one level of HTB leaves.
Each HTB leaf represents a QoS channel supported by the EN7581 SoC.
The typical use case is to create an HTB leaf per QoS channel to rate
limit the egress traffic, and to attach an ETS Qdisc to each HTB leaf in
order to enforce traffic prioritization.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 months agonet: airoha: Add sched ETS offload support
Lorenzo Bianconi [Fri, 3 Jan 2025 12:17:04 +0000 (13:17 +0100)]
net: airoha: Add sched ETS offload support

Introduce support for ETS Qdisc offload available on the Airoha EN7581
ethernet controller. In order to be effective, the ETS Qdisc must be
configured as a leaf of an HTB Qdisc (HTB Qdisc offload will be added in
the following patch). The ETS Qdisc available on the EN7581 ethernet
controller supports at most 8 concurrent bands (QoS queues). We can
enable an ETS Qdisc for each available QoS channel.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 months agonet: airoha: Introduce ndo_select_queue callback
Lorenzo Bianconi [Fri, 3 Jan 2025 12:17:03 +0000 (13:17 +0100)]
net: airoha: Introduce ndo_select_queue callback

Airoha EN7581 SoC supports 32 Tx DMA rings used to feed packets to QoS
channels. Each channel supports 8 QoS queues where the user can apply
QoS scheduling policies. In a similar way, the user can configure hw
rate shaping for each QoS channel.
Introduce the ndo_select_queue callback in order to select the tx queue
based on QoS channel and QoS queue. In particular, for a DSA device,
select the QoS channel according to the DSA user port index; otherwise
rely on the port id. Select the QoS queue based on the skb priority.
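
The (channel, queue) to Tx queue arithmetic can be sketched as below.
This is only a model under stated assumptions, not the driver code: the
even split of the 32 rings into channels of 8 queues, the modulo wrapping
and the select_queue() helper are all hypothetical.

```python
NUM_TX_RINGS = 32       # from the commit message
QUEUES_PER_CHANNEL = 8  # from the commit message
# Assumption: the rings are split evenly across the QoS channels.
NUM_CHANNELS = NUM_TX_RINGS // QUEUES_PER_CHANNEL


def select_queue(channel_src, priority):
    """Map a channel source (DSA user port index or port id) and an skb
    priority to a Tx queue index. The modulo wrapping is an assumption."""
    channel = channel_src % NUM_CHANNELS
    queue = priority % QUEUES_PER_CHANNEL
    return channel * QUEUES_PER_CHANNEL + queue


txq = select_queue(channel_src=1, priority=3)
```

With this layout each QoS channel owns a contiguous block of 8 Tx queues,
so the channel and the queue can both be recovered from the Tx queue
index.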

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 months agonet: airoha: Enable Tx drop capability for each Tx DMA ring
Lorenzo Bianconi [Fri, 3 Jan 2025 12:17:02 +0000 (13:17 +0100)]
net: airoha: Enable Tx drop capability for each Tx DMA ring

This is a preliminary patch in order to enable hw Qdisc offloading.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 months agoselftests/net: packetdrill: report benign debug flakes as xfail
Willem de Bruijn [Fri, 3 Jan 2025 11:31:14 +0000 (06:31 -0500)]
selftests/net: packetdrill: report benign debug flakes as xfail

A few recently added packetdrill tests that are known to be time sensitive
(e.g., because they test timestamping) occasionally fail in debug mode:
https://netdev.bots.linux.dev/contest.html?executor=vmksft-packetdrill-dbg

These failures are well understood. Correctness of the tests is
verified in non-debug mode. Continue running in debug mode also, to
keep coverage with debug instrumentation.

But, only in debug mode, mark these tests with well understood
timing issues as XFAIL (known failing) rather than FAIL when failing.

Introduce an allow list xfail_list with known cases.

Expand the ktap infrastructure with XFAIL support.
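
The reporting rule described above can be modelled as follows. This is a
sketch of the policy only, not the selftest code: classify() and the test
names are hypothetical.

```python
def classify(name, passed, debug_mode, xfail_list):
    """Return the result to report for one test run."""
    if passed:
        return "PASS"
    # Only in debug mode, and only for allow-listed tests, is a failure
    # downgraded to XFAIL.
    if debug_mode and name in xfail_list:
        return "XFAIL"
    return "FAIL"


xfail_list = {"timing_sensitive_test"}  # hypothetical allow list entry

results = [
    classify("timing_sensitive_test", False, True, xfail_list),
    classify("timing_sensitive_test", False, False, xfail_list),
    classify("other_test", False, True, xfail_list),
    classify("other_test", True, True, xfail_list),
]
```

A failure of an allow-listed test outside debug mode is still reported as
FAIL, which keeps the non-debug run authoritative for correctness.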

Fixes: eab35989cc37 ("selftests/net: packetdrill: import tcp/fast_recovery, tcp/nagle, tcp/timestamping")
Reported-by: Jakub Kicinski <kuba@kernel.org>
Closes: https://lore.kernel.org/netdev/20241218100013.0c698629@kernel.org/
Signed-off-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250103113142.129251-1-willemdebruijn.kernel@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 months agonet: stmmac: Set dma_sync_size to zero for discarded frames
Furong Xu [Fri, 3 Jan 2025 09:37:33 +0000 (17:37 +0800)]
net: stmmac: Set dma_sync_size to zero for discarded frames

If a frame is going to be discarded by the driver, the driver never
touches it and its cache lines never become dirty, so
page_pool_recycle_direct() wastes CPU cycles by needlessly calling
page_pool_dma_sync_for_device() to sync the entire frame.
page_pool_put_page() with sync_size set to 0 is the proper method.

Signed-off-by: Furong Xu <0x1207@gmail.com>
Link: https://patch.msgid.link/20250103093733.3872939-1-0x1207@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 months agoocteontx2-pf: mcs: Remove dead code and semi-colon from rsrc_name()
Nihar Chaithanya [Sat, 4 Jan 2025 17:19:15 +0000 (22:49 +0530)]
octeontx2-pf: mcs: Remove dead code and semi-colon from rsrc_name()

Every case in the switch-block ends with a return statement, and the
default: branch handles the cases where rsrc_type is invalid and
returns "Unknown". This makes the return statement at the end of the
function unreachable and redundant.
The semi-colon after the switch-block's curly braces is not required either.

Remove the semi-colon after the switch-block's curly braces and the
return statement at the end of the function.

This issue was reported by Coverity Scan.

Signed-off-by: Nihar Chaithanya <niharchaithanya@gmail.com>
Link: https://patch.msgid.link/20250104171905.13293-1-niharchaithanya@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
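
The shape of the cleanup described in the rsrc_name() commit above can be
sketched in a few lines of C. This is an illustration only: the enum values
and the returned strings are hypothetical, since the log does not show the
driver's actual resource types.

```c
/* Hypothetical resource types; the driver's real enum is not shown in the log. */
enum rsrc_type {
	RSRC_FLOWID,
	RSRC_SECY,
	RSRC_SC,
};

/* Every branch returns, so nothing placed after the switch can be reached. */
static const char *rsrc_name(enum rsrc_type type)
{
	switch (type) {
	case RSRC_FLOWID:
		return "FlowID";
	case RSRC_SECY:
		return "SecY";
	case RSRC_SC:
		return "SC";
	default:
		return "Unknown";
	}
	/* No trailing return, and no ';' after the closing brace of the switch. */
}
```

Because the default: branch already covers invalid values, the function has
a defined result for every input without a fallback return.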
3 months agonfc: st21nfca: Drop unneeded null check in st21nfca_tx_work()
Krzysztof Kozlowski [Sat, 4 Jan 2025 14:20:43 +0000 (15:20 +0100)]
nfc: st21nfca: Drop unneeded null check in st21nfca_tx_work()

Variable 'info' is obtained via container_of() of struct work_struct, so
it cannot be NULL.  Simplify the code and resolve the Smatch warning:

  drivers/nfc/st21nfca/dep.c:119 st21nfca_tx_work() warn: can 'info' even be NULL?

Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250104142043.116045-1-krzysztof.kozlowski@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
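
Why a container_of() result cannot be NULL, as the st21nfca commit above
argues, can be shown with a userspace sketch. The macro below is a simplified
stand-in for the kernel's container_of(), and struct dep_info and
tx_work_handler() are hypothetical names, not the driver's own.

```c
#include <stddef.h>

/* Simplified userspace stand-in for the kernel's container_of() macro. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-ins for the kernel and driver structures. */
struct work_struct {
	int pending;
};

struct dep_info {
	int tx_count;
	struct work_struct tx_work;
};

/*
 * The result is the work pointer minus a constant offset, so it is
 * non-NULL whenever work points at a member of a real dep_info.
 */
static int tx_work_handler(struct work_struct *work)
{
	struct dep_info *info = container_of(work, struct dep_info, tx_work);

	/* No "if (!info) return;" is needed here. */
	return ++info->tx_count;
}
```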
3 months agoMerge branch 'mlx5-hardware-steering-part-2'
Jakub Kicinski [Tue, 7 Jan 2025 00:33:44 +0000 (16:33 -0800)]
Merge branch 'mlx5-hardware-steering-part-2'

Tariq Toukan says:

====================
mlx5 Hardware Steering part 2

This series contains HWS code cleanups, enhancements, bug fixes, and
additions. Note that some of these patches are fixing bugs in existing
code, but we submit them without 'Fixes' tag to avoid the unnecessary
burden for stable releases, as HWS still couldn't be enabled.

Patches 1-5:
HWS, various code cleanups and enhancements

Patches 6-14:
HWS, various bug fixes and additions

Patch 15:
HWS, setting timeout on polling
====================

Link: https://patch.msgid.link/20250102181415.1477316-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet/mlx5: HWS, set timeout on polling for completion
Yevgeny Kliteynik [Thu, 2 Jan 2025 18:14:14 +0000 (20:14 +0200)]
net/mlx5: HWS, set timeout on polling for completion

Consolidate BWC polling for completion into one function
and set a time limit on the loop that polls for completion.
The time limit can be hit only if there is some issue with FW/PCI/HW,
such as FW being stuck, a PCI issue, etc.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Itamar Gozlan <igozlan@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250102181415.1477316-16-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
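
A bounded polling loop of the kind the commit above describes might look like
the following userspace sketch. It is not the driver's code: poll_fn,
poll_for_completion() and done_after() are hypothetical names, and the
wall-clock timespec_get() is used only for portability (a monotonic clock
would be the better choice in real code).

```c
#include <errno.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical callback: returns true once a completion has arrived. */
typedef bool (*poll_fn)(void *ctx);

/* Current wall-clock time in milliseconds. */
static long long now_ms(void)
{
	struct timespec ts;

	timespec_get(&ts, TIME_UTC);
	return (long long)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/* Poll until completion or until timeout_ms has elapsed, whichever is first. */
static int poll_for_completion(poll_fn poll, void *ctx, long long timeout_ms)
{
	long long deadline = now_ms() + timeout_ms;

	for (;;) {
		if (poll(ctx))
			return 0;
		if (now_ms() >= deadline)
			return -ETIMEDOUT;
	}
}

/* Example callback: reports completion after a fixed number of polls. */
static bool done_after(void *ctx)
{
	int *remaining = ctx;

	return --(*remaining) <= 0;
}
```

With a single helper like this, every caller gets the same upper bound on how
long a stuck device can keep it spinning.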
3 months agonet/mlx5: HWS, support flow sampler destination
Vlad Dogaru [Thu, 2 Jan 2025 18:14:13 +0000 (20:14 +0200)]
net/mlx5: HWS, support flow sampler destination

Since sampler isn't currently supported via HWS, use a FW island
that forwards any packets to the supplied sampler.

Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250102181415.1477316-15-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet/mlx5: HWS, use the right size when writing arg data
Yevgeny Kliteynik [Thu, 2 Jan 2025 18:14:12 +0000 (20:14 +0200)]
net/mlx5: HWS, use the right size when writing arg data

The wrong size was used when writing arg data. Use the right size.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Itamar Gozlan <igozlan@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250102181415.1477316-14-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet/mlx5: HWS, handle returned error value in pool alloc
Vlad Dogaru [Thu, 2 Jan 2025 18:14:11 +0000 (20:14 +0200)]
net/mlx5: HWS, handle returned error value in pool alloc

Handle all negative return values as errors, not just -1.
The code previously treated -ENOMEM (and potentially other negative
values) as valid segment numbers, leading to incorrect behavior.
This fix ensures that any negative return value is treated as an error.

Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250102181415.1477316-13-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
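
The difference between checking for -1 and checking for any negative value,
as described in the pool alloc commit above, can be illustrated as follows.
pool_alloc_segment() and both checks are hypothetical stand-ins, not the
driver's functions.

```c
#include <errno.h>

/* Hypothetical allocator: returns a segment index >= 0, or a negative errno. */
static int pool_alloc_segment(int free_segments)
{
	if (free_segments <= 0)
		return -ENOMEM;
	return free_segments - 1;
}

/* Old-style check: only -1 counts as failure, so -ENOMEM slips through. */
static int is_error_old(int seg)
{
	return seg == -1;
}

/* Fixed check: any negative value is an error. */
static int is_error_new(int seg)
{
	return seg < 0;
}
```

With the old check, a return of -ENOMEM would go on to be used as a segment
number, which is the incorrect behavior the commit message mentions.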
3 months agonet/mlx5: HWS, fix definer's HWS_SET32 macro for negative offset
Yevgeny Kliteynik [Thu, 2 Jan 2025 18:14:10 +0000 (20:14 +0200)]
net/mlx5: HWS, fix definer's HWS_SET32 macro for negative offset

When the bit offset passed to the HWS_SET32 macro is negative,
UBSAN complains about the shift-out-of-bounds:

  UBSAN: shift-out-of-bounds in
  drivers/net/ethernet/mellanox/mlx5/core/steering/hws/definer.c:177:2
  shift exponent -8 is negative

Fixes: 74a778b4a63f ("net/mlx5: HWS, added definers handling")
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Erez Shitrit <erezsh@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250102181415.1477316-12-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
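
Shifting by a negative count is undefined behaviour in C, which is what UBSAN
reports in the commit above. The sketch below shows one generic way to avoid
it by turning a negative offset into a right shift. It is an assumption made
for illustration and is not the HWS_SET32 fix itself; place_bits() is a
hypothetical helper.

```c
#include <stdint.h>

/*
 * Place value at a signed bit offset within a 32-bit word. A negative
 * offset becomes a right shift by its magnitude, so the shift count is
 * never negative. Offsets are assumed to lie in the range -31..31.
 */
static uint32_t place_bits(uint32_t value, int bit_off)
{
	if (bit_off >= 0)
		return value << bit_off;
	return value >> -bit_off;
}
```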
3 months agonet/mlx5: HWS, separate SQ that HWS uses from the usual traffic SQs
Yevgeny Kliteynik [Thu, 2 Jan 2025 18:14:09 +0000 (20:14 +0200)]
net/mlx5: HWS, separate SQ that HWS uses from the usual traffic SQs

Mark the HWS SQ as 'non_wire' so that 'Flow Update' flow
won't mix with network traffic.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Itamar Gozlan <igozlan@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250102181415.1477316-11-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet/mlx5: HWS, num_of_rules counter on matcher should be atomic
Yevgeny Kliteynik [Thu, 2 Jan 2025 18:14:08 +0000 (20:14 +0200)]
net/mlx5: HWS, num_of_rules counter on matcher should be atomic

Rule counter in matcher's struct is used in two places:

1. As a heuristic to decide when the number of rules has crossed a
certain percentage threshold and the matcher should be resized.
We don't mind here if the number is off by 1-2 due to concurrency.

2. When destroying matcher, the counter value is checked and the
user is warned if it is not 0. Here we lock all the queues, so the
counter will be correct.

We don't always need the *exact* number, but we do need this
number not to be corrupted, which is what happens when the
counter isn't atomic and is updated by different threads.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Erez Shitrit <erezsh@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250102181415.1477316-10-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
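
The two uses of the counter described in the commit above can be sketched
with C11 atomics in userspace. The kernel has its own atomic types, so this
is only an analogy: struct matcher and the helper names below are
hypothetical.

```c
#include <stdatomic.h>

/* Hypothetical matcher with a rule counter that several threads update. */
struct matcher {
	atomic_uint num_of_rules;
};

/* Atomic read-modify-write: concurrent updates cannot corrupt the value. */
static void matcher_rule_added(struct matcher *m)
{
	atomic_fetch_add_explicit(&m->num_of_rules, 1, memory_order_relaxed);
}

static void matcher_rule_removed(struct matcher *m)
{
	atomic_fetch_sub_explicit(&m->num_of_rules, 1, memory_order_relaxed);
}

/* Resize heuristic: a slightly stale reading is acceptable here. */
static int matcher_needs_resize(struct matcher *m, unsigned int capacity,
				unsigned int percent)
{
	unsigned int n = atomic_load_explicit(&m->num_of_rules,
					      memory_order_relaxed);

	return n * 100 >= capacity * percent;
}
```

Relaxed ordering is enough in this sketch because the counter is not used to
publish other data; only the integrity of the count itself matters.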
3 months agonet/mlx5: HWS, reduce memory consumption of a matcher struct
Yevgeny Kliteynik [Thu, 2 Jan 2025 18:14:07 +0000 (20:14 +0200)]
net/mlx5: HWS, reduce memory consumption of a matcher struct

Instead of having a large array of action templates allocated with
kmalloc, use a smaller array and allocate it with kvmalloc.

The size of the array represents the max number of AT attach
operations for the same matcher. This number is not expected
to be very high. In any case, when the limit is reached, the
next attempt to attach a new AT will result in the creation of a
new matcher and the move of all the rules to that matcher.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Erez Shitrit <erezsh@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250102181415.1477316-9-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet/mlx5: HWS, remove wrong deletion of the miss table list
Yevgeny Kliteynik [Thu, 2 Jan 2025 18:14:06 +0000 (20:14 +0200)]
net/mlx5: HWS, remove wrong deletion of the miss table list

Remove wrong cleanup of the old miss table list and
simplify the error flow in the function.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Itamar Gozlan <igozlan@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250102181415.1477316-8-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet/mlx5: HWS, change error flow on matcher disconnect
Yevgeny Kliteynik [Thu, 2 Jan 2025 18:14:05 +0000 (20:14 +0200)]
net/mlx5: HWS, change error flow on matcher disconnect

Currently, when a firmware failure occurs during the matcher disconnect
flow, the error flow of the function reconnects the matcher and returns
an error. The calling function then continues running and eventually
frees the matcher that is being disconnected.
This leads to a case where we have a freed matcher on the matchers list,
which in turn leads to use-after-free and an eventual crash.

This patch fixes that by not trying to reconnect the matcher back when
some FW command fails during disconnect.

Note that we're dealing here with a FW error. We can't overcome this
problem. This might lead to a bad steering state (e.g. wrong connection
between matchers), and will also lead to resource leakage, as is
the case with any other error handling during resource destruction.

However, the goal here is to allow the driver to continue and not crash
the machine with use-after-free error.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Itamar Gozlan <igozlan@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250102181415.1477316-7-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet/mlx5: HWS, add error message on failure to move rules
Yevgeny Kliteynik [Thu, 2 Jan 2025 18:14:04 +0000 (20:14 +0200)]
net/mlx5: HWS, add error message on failure to move rules

Add error message for failure to move rules from
old matcher to new one during rehash.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Itamar Gozlan <igozlan@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250102181415.1477316-6-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet/mlx5: HWS, simplify allocations as we support only FDB
Yevgeny Kliteynik [Thu, 2 Jan 2025 18:14:03 +0000 (20:14 +0200)]
net/mlx5: HWS, simplify allocations as we support only FDB

In pools, STCs and actions, there is no need to allocate an array for the
various table types, as HWS is used to manage only FDB flow tables.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Erez Shitrit <erezsh@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250102181415.1477316-5-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet/mlx5: HWS, denote how refcounts are protected
Yevgeny Kliteynik [Thu, 2 Jan 2025 18:14:02 +0000 (20:14 +0200)]
net/mlx5: HWS, denote how refcounts are protected

Some HWS structs have refcounts that are just u32.
Comment how they are protected and add '__must_hold()'
annotation where applicable.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Erez Shitrit <erezsh@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250102181415.1477316-4-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet/mlx5: HWS, remove implementation of unused FW commands
Yevgeny Kliteynik [Thu, 2 Jan 2025 18:14:01 +0000 (20:14 +0200)]
net/mlx5: HWS, remove implementation of unused FW commands

Remove functions that manage alias objects - they are not used.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Itamar Gozlan <igozlan@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250102181415.1477316-3-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet/mlx5: HWS, remove the use of duplicated structs
Yevgeny Kliteynik [Thu, 2 Jan 2025 18:14:00 +0000 (20:14 +0200)]
net/mlx5: HWS, remove the use of duplicated structs

Remove the HWS definitions of structs that are already defined
in mlx5_ifc.h, and fix the usage of these structs.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Itamar Gozlan <igozlan@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250102181415.1477316-2-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge branch 'net-pcs-add-supported_interfaces-bitmap-for-pcs'
Jakub Kicinski [Tue, 7 Jan 2025 00:26:16 +0000 (16:26 -0800)]
Merge branch 'net-pcs-add-supported_interfaces-bitmap-for-pcs'

Russell King says:

====================
net: pcs: add supported_interfaces bitmap for PCS

This series adds supported_interfaces for PCS, which gives MAC code
a way to determine the interface modes that the PCS supports without
having to implement functions such as xpcs_get_interfaces(), or
workarounds such as in

 https://lore.kernel.org/20241213090526.71516-3-maxime.chevallier@bootlin.com

Patch 1 adds the new bitmask to struct phylink_pcs, and code within
phylink to validate that the PCS returned by the MAC driver supports
the interface mode - but only if this bitmask is non-empty.

Patches 2 through 4 fill in the interface modes for XPCS, Mediatek LynxI
and Lynx PCS.

Patch 5 adds support to stmmac to make use of this bitmask when filling
in phylink_config.supported_interfaces, eliminating the call to
xpcs_get_interfaces.

As xpcs_get_interfaces() is now unused outside of pcs-xpcs.c, patch 6
makes this function static and removes it from the header file.
====================

Link: https://patch.msgid.link/Z3fG9oTY9F9fCYHv@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: pcs: xpcs: make xpcs_get_interfaces() static
Russell King (Oracle) [Fri, 3 Jan 2025 11:16:56 +0000 (11:16 +0000)]
net: pcs: xpcs: make xpcs_get_interfaces() static

xpcs_get_interfaces() should no longer be used outside of the XPCS
code, so make it static.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/E1tTffk-007Roi-JM@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: stmmac: use PCS supported_interfaces
Russell King (Oracle) [Fri, 3 Jan 2025 11:16:51 +0000 (11:16 +0000)]
net: stmmac: use PCS supported_interfaces

Use the PCS' supported_interfaces member to build the MAC level
supported_interfaces bitmap.

Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1tTfff-007Roc-Ff@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: pcs: lynx: fill in PCS supported_interfaces
Russell King (Oracle) [Fri, 3 Jan 2025 11:16:46 +0000 (11:16 +0000)]
net: pcs: lynx: fill in PCS supported_interfaces

Fill in the new PCS supported_interfaces member with the interfaces
that Lynx supports.

Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1tTffa-007RoV-Bo@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: pcs: mtk-lynxi: fill in PCS supported_interfaces
Russell King (Oracle) [Fri, 3 Jan 2025 11:16:41 +0000 (11:16 +0000)]
net: pcs: mtk-lynxi: fill in PCS supported_interfaces

Fill in the new PCS supported_interfaces member with the interfaces
that the Mediatek LynxI supports.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Acked-by: Daniel Golle <daniel@makrotopia.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/E1tTffV-007RoP-8D@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: pcs: xpcs: fill in PCS supported_interfaces
Russell King (Oracle) [Fri, 3 Jan 2025 11:16:36 +0000 (11:16 +0000)]
net: pcs: xpcs: fill in PCS supported_interfaces

Fill in the new PCS supported_interfaces member with the interfaces
that XPCS supports.

Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1tTffQ-007RoJ-4u@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: phylink: add support for PCS supported_interfaces bitmap
Russell King (Oracle) [Fri, 3 Jan 2025 11:16:31 +0000 (11:16 +0000)]
net: phylink: add support for PCS supported_interfaces bitmap

Add support for the PCS to specify which interfaces it supports, which
can be used by MAC drivers to build the main supported_interfaces
bitmap. Phylink also validates that the PCS returned by the MAC driver
supports the interface that the MAC was asked for.

An empty supported_interfaces bitmap from the PCS indicates that it
does not provide this information, and we handle that appropriately.

Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1tTffL-007RoD-1Y@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
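
The validation rule in the phylink commit above, including the special
meaning of an empty bitmap, can be sketched as a small predicate. This is a
simplified model: phylink uses its own bitmap helpers and interface
enumeration, while enum iface_mode, IFACE_BIT() and pcs_supports() here are
hypothetical.

```c
#include <stdbool.h>

/* Hypothetical interface modes, one bit each in an unsigned long bitmap. */
enum iface_mode {
	IFACE_SGMII,
	IFACE_1000BASEX,
	IFACE_2500BASEX,
};

#define IFACE_BIT(mode) (1UL << (mode))

/*
 * An empty bitmap means the PCS does not advertise its modes, so the
 * check is skipped and the requested interface is accepted.
 */
static bool pcs_supports(unsigned long supported, enum iface_mode mode)
{
	if (supported == 0)
		return true;
	return (supported & IFACE_BIT(mode)) != 0;
}
```

Treating the empty bitmap as "no information" lets PCS drivers that have not
been converted yet keep working unchanged.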
3 months agonet: hsr: remove one synchronize_rcu() from hsr_del_port()
Eric Dumazet [Fri, 3 Jan 2025 10:11:48 +0000 (10:11 +0000)]
net: hsr: remove one synchronize_rcu() from hsr_del_port()

Use kfree_rcu() instead of synchronize_rcu()+kfree().

This might allow syzbot to fuzz HSR a bit faster...

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250103101148.3594545-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoax25: rcu protect dev->ax25_ptr
Eric Dumazet [Fri, 3 Jan 2025 21:05:14 +0000 (21:05 +0000)]
ax25: rcu protect dev->ax25_ptr

syzbot found a lockdep issue [1].

We should remove the ax25 RTNL dependency in ax25_setsockopt().

This should also fix a variety of possible UAF in ax25.

[1]

WARNING: possible circular locking dependency detected
6.13.0-rc3-syzkaller-00762-g9268abe611b0 #0 Not tainted
------------------------------------------------------
syz.5.1818/12806 is trying to acquire lock:
 ffffffff8fcb3988 (rtnl_mutex){+.+.}-{4:4}, at: ax25_setsockopt+0xa55/0xe90 net/ax25/af_ax25.c:680

but task is already holding lock:
 ffff8880617ac258 (sk_lock-AF_AX25){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1618 [inline]
 ffff8880617ac258 (sk_lock-AF_AX25){+.+.}-{0:0}, at: ax25_setsockopt+0x209/0xe90 net/ax25/af_ax25.c:574

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (sk_lock-AF_AX25){+.+.}-{0:0}:
        lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849
        lock_sock_nested+0x48/0x100 net/core/sock.c:3642
        lock_sock include/net/sock.h:1618 [inline]
        ax25_kill_by_device net/ax25/af_ax25.c:101 [inline]
        ax25_device_event+0x24d/0x580 net/ax25/af_ax25.c:146
        notifier_call_chain+0x1a5/0x3f0 kernel/notifier.c:85
       __dev_notify_flags+0x207/0x400
        dev_change_flags+0xf0/0x1a0 net/core/dev.c:9026
        dev_ifsioc+0x7c8/0xe70 net/core/dev_ioctl.c:563
        dev_ioctl+0x719/0x1340 net/core/dev_ioctl.c:820
        sock_do_ioctl+0x240/0x460 net/socket.c:1234
        sock_ioctl+0x626/0x8e0 net/socket.c:1339
        vfs_ioctl fs/ioctl.c:51 [inline]
        __do_sys_ioctl fs/ioctl.c:906 [inline]
        __se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892
        do_syscall_x64 arch/x86/entry/common.c:52 [inline]
        do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
       entry_SYSCALL_64_after_hwframe+0x77/0x7f

-> #0 (rtnl_mutex){+.+.}-{4:4}:
        check_prev_add kernel/locking/lockdep.c:3161 [inline]
        check_prevs_add kernel/locking/lockdep.c:3280 [inline]
        validate_chain+0x18ef/0x5920 kernel/locking/lockdep.c:3904
        __lock_acquire+0x1397/0x2100 kernel/locking/lockdep.c:5226
        lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849
        __mutex_lock_common kernel/locking/mutex.c:585 [inline]
        __mutex_lock+0x1ac/0xee0 kernel/locking/mutex.c:735
        ax25_setsockopt+0xa55/0xe90 net/ax25/af_ax25.c:680
        do_sock_setsockopt+0x3af/0x720 net/socket.c:2324
        __sys_setsockopt net/socket.c:2349 [inline]
        __do_sys_setsockopt net/socket.c:2355 [inline]
        __se_sys_setsockopt net/socket.c:2352 [inline]
        __x64_sys_setsockopt+0x1ee/0x280 net/socket.c:2352
        do_syscall_x64 arch/x86/entry/common.c:52 [inline]
        do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
       entry_SYSCALL_64_after_hwframe+0x77/0x7f

other info that might help us debug this:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(sk_lock-AF_AX25);
                               lock(rtnl_mutex);
                               lock(sk_lock-AF_AX25);
  lock(rtnl_mutex);

 *** DEADLOCK ***

1 lock held by syz.5.1818/12806:
  #0: ffff8880617ac258 (sk_lock-AF_AX25){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1618 [inline]
  #0: ffff8880617ac258 (sk_lock-AF_AX25){+.+.}-{0:0}, at: ax25_setsockopt+0x209/0xe90 net/ax25/af_ax25.c:574

stack backtrace:
CPU: 1 UID: 0 PID: 12806 Comm: syz.5.1818 Not tainted 6.13.0-rc3-syzkaller-00762-g9268abe611b0 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
Call Trace:
 <TASK>
  __dump_stack lib/dump_stack.c:94 [inline]
  dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
  print_circular_bug+0x13a/0x1b0 kernel/locking/lockdep.c:2074
  check_noncircular+0x36a/0x4a0 kernel/locking/lockdep.c:2206
  check_prev_add kernel/locking/lockdep.c:3161 [inline]
  check_prevs_add kernel/locking/lockdep.c:3280 [inline]
  validate_chain+0x18ef/0x5920 kernel/locking/lockdep.c:3904
  __lock_acquire+0x1397/0x2100 kernel/locking/lockdep.c:5226
  lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849
  __mutex_lock_common kernel/locking/mutex.c:585 [inline]
  __mutex_lock+0x1ac/0xee0 kernel/locking/mutex.c:735
  ax25_setsockopt+0xa55/0xe90 net/ax25/af_ax25.c:680
  do_sock_setsockopt+0x3af/0x720 net/socket.c:2324
  __sys_setsockopt net/socket.c:2349 [inline]
  __do_sys_setsockopt net/socket.c:2355 [inline]
  __se_sys_setsockopt net/socket.c:2352 [inline]
  __x64_sys_setsockopt+0x1ee/0x280 net/socket.c:2352
  do_syscall_x64 arch/x86/entry/common.c:52 [inline]
  do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f7b62385d29

Fixes: c433570458e4 ("ax25: fix a use-after-free in ax25_fillin_cb()")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250103210514.87290-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agosctp: Prepare sctp_v4_get_dst() to dscp_t conversion.
Guillaume Nault [Thu, 2 Jan 2025 16:34:18 +0000 (17:34 +0100)]
sctp: Prepare sctp_v4_get_dst() to dscp_t conversion.

Define inet_sk_dscp() to get a dscp_t value from struct inet_sock, so
that sctp_v4_get_dst() can easily set ->flowi4_tos from a dscp_t
variable. For the SCTP_DSCP_SET_MASK case, we can just use
inet_dsfield_to_dscp() to get a dscp_t value.

Then, when converting ->flowi4_tos from __u8 to dscp_t, we'll just have
to drop the inet_dscp_to_dsfield() conversion function.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Acked-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/1a645f4a0bc60ad18e7c0916642883ce8a43c013.1735835456.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
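
The idea behind a dedicated dscp_t type in the commit above can be
approximated in userspace with a wrapper struct. This sketch does not
reproduce the kernel's definitions: DSCP_MASK, the struct-based dscp_t and
both helpers are illustrative assumptions. It relies only on the fact that
the DSCP occupies the upper six bits of the DS field.

```c
#include <stdint.h>

/* The DSCP is the upper six bits of the DS field; the low two bits are ECN. */
#define DSCP_MASK 0xfc

/* A distinct wrapper type, so a raw byte cannot be passed by accident. */
typedef struct {
	uint8_t v;
} dscp_t;

/* Convert a raw DS field byte to a DSCP value, dropping the ECN bits. */
static dscp_t dsfield_to_dscp(uint8_t dsfield)
{
	return (dscp_t){ .v = dsfield & DSCP_MASK };
}

/* Convert back to a raw byte for fields that still store one. */
static uint8_t dscp_to_dsfield(dscp_t dscp)
{
	return dscp.v;
}
```

Once every consumer takes the typed value directly, the conversion back to a
raw byte is no longer needed, which matches the follow-up step the commit
message anticipates.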