]> www.infradead.org Git - users/hch/misc.git/log
users/hch/misc.git
3 months agogtp: Prepare ip4_route_output_gtp() to .flowi4_tos conversion.
Guillaume Nault [Thu, 16 Jan 2025 13:39:38 +0000 (14:39 +0100)]
gtp: Prepare ip4_route_output_gtp() to .flowi4_tos conversion.

Use inet_sk_dscp() to get the socket DSCP value as dscp_t, instead of
ip_sock_rt_tos() which returns a __u8. This will ease the conversion
of fl4->flowi4_tos to dscp_t, which now just becomes a matter of
dropping the inet_dscp_to_dsfield() call.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/06bdb310a075355ff059cd32da2efc29a63981c9.1737034675.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agodccp: Prepare dccp_v4_route_skb() to .flowi4_tos conversion.
Guillaume Nault [Thu, 16 Jan 2025 13:10:16 +0000 (14:10 +0100)]
dccp: Prepare dccp_v4_route_skb() to .flowi4_tos conversion.

Use inet_sk_dscp() to get the socket DSCP value as dscp_t, instead of
ip_sock_rt_tos() which returns a __u8. This will ease the conversion
of fl4->flowi4_tos to dscp_t, which now just becomes a matter of
dropping the inet_dscp_to_dsfield() call.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/208dc5ca28bb5595d7a545de026bba18b1d63bda.1737032802.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoselftests: net: give up on the cmsg_time accuracy on slow machines
Jakub Kicinski [Thu, 16 Jan 2025 02:01:05 +0000 (18:01 -0800)]
selftests: net: give up on the cmsg_time accuracy on slow machines

Commit b9d5f5711dd8 ("selftests: net: increase the delay for relative
cmsg_time.sh test") widened the accepted value range 8x but we still
see flakes (at a rate of around 7%).

Return XFAIL for the most timing sensitive test on slow machines.

Before:

  # ./cmsg_time.sh
    Case UDPv4  - TXTIME rel returned '8074us - 7397us < 4000', expected 'OK'
  FAIL - 1/36 cases failed

After:

  # ./cmsg_time.sh
    Case UDPv4  - TXTIME rel returned '1123us - 941us < 500', expected 'OK' (XFAIL)
    Case UDPv6  - TXTIME rel returned '1227us - 776us < 500', expected 'OK' (XFAIL)
  OK

Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250116020105.931338-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge branch 'add-perout-library-for-rds-ptp-supported-phys'
Jakub Kicinski [Fri, 17 Jan 2025 01:27:59 +0000 (17:27 -0800)]
Merge branch 'add-perout-library-for-rds-ptp-supported-phys'

Divya Koppera says:

====================
Add PEROUT library for RDS PTP supported phys

Adds support for PEROUT library, where phy can generate
periodic output signal on supported pin out.
====================

Link: https://patch.msgid.link/20250115090634.12941-1-divya.koppera@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: phy: microchip_rds_ptp : Add PEROUT feature library for RDS PTP supported Microc...
Divya Koppera [Wed, 15 Jan 2025 09:06:34 +0000 (14:36 +0530)]
net: phy: microchip_rds_ptp : Add PEROUT feature library for RDS PTP supported Microchip phys

Adds PEROUT feature for RDS PTP supported phys where
we can generate periodic output signal on supported
pin out

Signed-off-by: Divya Koppera <divya.koppera@microchip.com>
Link: https://patch.msgid.link/20250115090634.12941-4-divya.koppera@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: phy: microchip_t1: Enable pin out specific to lan887x phy for PEROUT signal
Divya Koppera [Wed, 15 Jan 2025 09:06:33 +0000 (14:36 +0530)]
net: phy: microchip_t1: Enable pin out specific to lan887x phy for PEROUT signal

Adds support for enabling pin out that is required
to generate periodic output signal on lan887x phy.

Signed-off-by: Divya Koppera <divya.koppera@microchip.com>
Link: https://patch.msgid.link/20250115090634.12941-3-divya.koppera@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: phy: microchip_rds_ptp: Header file library changes for PEROUT
Divya Koppera [Wed, 15 Jan 2025 09:06:32 +0000 (14:36 +0530)]
net: phy: microchip_rds_ptp: Header file library changes for PEROUT

This ptp header file library changes will cover PEROUT
macros that are required to generate periodic output
from pin out

Signed-off-by: Divya Koppera <divya.koppera@microchip.com>
Link: https://patch.msgid.link/20250115090634.12941-2-divya.koppera@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoselftests/net: packetdrill: make tcp buf limited timing tests benign
Jakub Kicinski [Wed, 15 Jan 2025 23:21:29 +0000 (15:21 -0800)]
selftests/net: packetdrill: make tcp buf limited timing tests benign

The following tests are failing on debug kernels:

  tcp_tcp_info_tcp-info-rwnd-limited.pkt
  tcp_tcp_info_tcp-info-sndbuf-limited.pkt

with reports like:

      assert 19000 <= tcpi_sndbuf_limited <= 21000, tcpi_sndbuf_limited; \
  AssertionError: 18000

and:

      assert 348000 <= tcpi_busy_time <= 360000, tcpi_busy_time
  AssertionError: 362000

Extend commit 912d6f669725 ("selftests/net: packetdrill: report benign
debug flakes as xfail") to cover them.

Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250115232129.845884-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge branch 'net-add-phylink-managed-eee-support'
Jakub Kicinski [Fri, 17 Jan 2025 01:23:02 +0000 (17:23 -0800)]
Merge branch 'net-add-phylink-managed-eee-support'

Russell King says:

====================
net: add phylink managed EEE support

Adding managed EEE support to phylink has been on the cards ever since
the idea in phylib was mooted. This overly large series attempts to do
so. I've included all the patches as it's important to get the driver
patches out there.

Patch 1 adds a definition for the clock stop capable bit in the PCS
MMD status register.

Patch 2 adds a phylib API to query whether the PHY allows the transmit
xMII clock to be stopped while in LPI mode. This capability is for MAC
drivers to save power when LPI is active, to allow them to stop their
transmit clock.

Patch 3 extracts a phylink internal helper for determining whether the
link is up.

Patch 4 adds basic phylink managed EEE support. Two new MAC APIs are
added, to enable and disable LPI. The enable method is passed the LPI
timer setting which it is expected to program into the hardware, and
also a flag ehther the transmit clock should be stopped.

I have taken the decision to make enable_tx_lpi() to return an error
code, but not do much with it other than report it - the intention
being that we can later use it to extend functionality if needed
without reworking loads of drivers.

I have also dropped the validation/limitation of the LPI timer, and
left that in the driver code prior to calling phylink_ethtool_set_eee().

The remainder of the patches convert mvneta, lan743x and stmmac, and
add support for mvneta.
====================

Link: https://patch.msgid.link/Z4gdtOaGsBhQCZXn@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: stmmac: convert to phylink managed EEE support
Russell King (Oracle) [Wed, 15 Jan 2025 20:43:08 +0000 (20:43 +0000)]
net: stmmac: convert to phylink managed EEE support

Convert stmmac to use phylink managed EEE support rather than delving
into phylib:

1. Move the stmmac_eee_init() calls out of mac_link_down() and
   mac_link_up() methods into the new mac_{enable,disable}_lpi()
   methods. We leave the calls to stmmac_set_eee_pls() in place as
   these change bits which tell the EEE hardware when the link came
   up or down, and is used for a separate hardware timer. However,
   symmetrically conditionalise this with priv->dma_cap.eee.

2. Update the current LPI timer each time LPI is enabled - which we
   need for software-timed LPI.

3. With phylink managed EEE, phylink manages the receive clock stop
   configuration via phylink_config.eee_rx_clk_stop_enable. Set this
   appropriately which makes the call to phy_eee_rx_clock_stop()
   redundant.

4. From what I can work out, all supported interfaces support LPI
   signalling on stmmac (there's no restriction implemented.) It
   also appears to support LPI at all full duplex speeds at or over
   100M. Set these capabilities.

5. The default timer appears to be derived from a module parameter.
   Set this the same, although we keep code that reconfigures the
   timer in stmmac_init_phy().

6. Remove the direct call to phy_support_eee(), which phylink will do
   on the drivers behalf if phylink_config.eee_enabled_default is set.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/E1tYAEG-0014QH-9O@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: lan743x: convert to phylink managed EEE
Russell King (Oracle) [Wed, 15 Jan 2025 20:43:03 +0000 (20:43 +0000)]
net: lan743x: convert to phylink managed EEE

Convert lan743x to phylink managed EEE:
- Set the lpi_capabilties.
- Move the call to lan743x_mac_eee_enable() into the enable/disable
  tx_lpi functions.
- Ensure that EEEEN is clear during probe.
- Move the setting of the LPI timer into mac_enable_tx_lpi().
- Move reading of LPI timer to phylink initialisation to set the
  default timer value.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/E1tYAEB-0014QB-4s@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: lan743x: use netdev in lan743x_phylink_mac_link_down()
Russell King (Oracle) [Wed, 15 Jan 2025 20:42:57 +0000 (20:42 +0000)]
net: lan743x: use netdev in lan743x_phylink_mac_link_down()

Use the netdev that we already have in lan743x_phylink_mac_link_down().

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/E1tYAE5-0014Q5-Up@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: mvpp2: add EEE implementation
Russell King (Oracle) [Wed, 15 Jan 2025 20:42:52 +0000 (20:42 +0000)]
net: mvpp2: add EEE implementation

Add EEE support for mvpp2, using phylink's EEE implementation, which
means we just need to implement the two methods for LPI control, and
with the initial configuration. Only SGMII mode is supported, so only
100M and 1G speeds.

Disabling LPI requires clearing a single bit. Enabling LPI needs a full
configuration of several values, as the timer values are dependent on
the MAC operating speed.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/E1tYAE0-0014Pz-R9@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: mvneta: convert to phylink EEE implementation
Russell King (Oracle) [Wed, 15 Jan 2025 20:42:47 +0000 (20:42 +0000)]
net: mvneta: convert to phylink EEE implementation

Convert mvneta to use phylink's EEE implementation by implementing the
two LPI control methods, and adding the initial configuration and
capabilities.

Although disabling LPI requires clearing a single bit, for safety we
clear the manual mode and force bits to ensure that auto mode will be
used.

Enabling LPI needs a full configuration of several values, as the timer
values are dependent on the MAC operating speed, as per the original
code.

As Armada 388 states that EEE is only supported in "SGMII" modes, mark
this in lpi_interfaces. Testing with RGMII on the Clearfog platform
indicates that the receive path fails to detect LPI over RGMII.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/E1tYADv-0014Pt-NO@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: phylink: add EEE management
Russell King (Oracle) [Wed, 15 Jan 2025 20:42:42 +0000 (20:42 +0000)]
net: phylink: add EEE management

Add EEE management to phylink, making use of the phylib implementation.
This will only be used where a MAC driver populates the methods and
capabilities bitfield, otherwise we keep our old behaviour.

Phylink will keep track of the EEE configuration, including the clock
stop abilities at each end of the MAC to PHY link, programming the PHY
appropriately and preserving the LPI configuration should the PHY go
away.

Phylink will call into the MAC driver when LPI needs to be enabled or
disabled, with the requirement that the MAC have LPI disabled prior
to the netdev being brought up (in other words, it will only call
mac_disable_tx_lpi() if it has already called mac_enable_tx_lpi().)

Support for phylink managed EEE is enabled by populating both tx_lpi
MAC operations method pointers, and filling in both LPI interfaces
and capabilities. If the methods are provided but the LPI interfaces
or capabilities remain empty, this indicates to phylink that EEE is
implemented by the driver but the hardware it is driving does not
support EEE, and thus the ethtool set_eee() and get_eee() methods will
return EOPNOTSUPP.

No validation of the LPI timer value is performed by this patch.

For interface modes which do not support LPI, we make no attempt to
manipulate the phylib EEE advertisement, but instead refuse to
activate LPI at the MAC, noting it at debug message level.

We also restrict the advertisement and reported userspace support
linkmode masks according to the lpi_capabilities provided to
phylink by the MAC driver.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/E1tYADq-0014Pn-J1@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: phylink: add phylink_link_is_up() helper
Russell King (Oracle) [Wed, 15 Jan 2025 20:42:37 +0000 (20:42 +0000)]
net: phylink: add phylink_link_is_up() helper

Add a helper to determine whether the link is up or down. Currently
this is only used in one location, but becomes necessary to test
when reconfiguring EEE.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/E1tYADl-0014Ph-EV@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: phy: add support for querying PHY clock stop capability
Russell King (Oracle) [Wed, 15 Jan 2025 20:42:32 +0000 (20:42 +0000)]
net: phy: add support for querying PHY clock stop capability

Add support for querying whether the PHY allows the transmit xMII clock
to be stopped while in LPI mode. This will be used by phylink to pass
to the MAC driver so it can configure the generation of the xMII clock
appropriately.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/E1tYADg-0014Pb-AJ@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: mdio: add definition for clock stop capable bit
Russell King (Oracle) [Wed, 15 Jan 2025 20:42:27 +0000 (20:42 +0000)]
net: mdio: add definition for clock stop capable bit

Add a definition for the clock stop capable bit in the PCS MMD. This
bit indicates whether the MAC is able to stop the transmit xMII clock
while it is signalling LPI.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/E1tYADb-0014PV-6T@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge branch 'dev-covnert-dev_change_name-to-per-netns-rtnl'
Jakub Kicinski [Fri, 17 Jan 2025 01:20:52 +0000 (17:20 -0800)]
Merge branch 'dev-covnert-dev_change_name-to-per-netns-rtnl'

Kuniyuki Iwashima says:

====================
dev: Covnert dev_change_name() to per-netns RTNL.

Patch 1 adds a missing netdev_rename_lock in dev_change_name()
and Patch 2 removes unnecessary devnet_rename_sem there.

Patch 3 replaces RTNL with rtnl_net_lock() in dev_ifsioc(),
and now dev_change_name() is always called under per-netns RTNL.

Given it's close to -rc8 and Patch 1 touches the trivial unlikely
path, can Patch 1 go into net-next ?  Otherwise I'll post Patch 2 & 3
separately in the next cycle.
====================

Link: https://patch.msgid.link/20250115095545.52709-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agodev: Hold rtnl_net_lock() for dev_ifsioc().
Kuniyuki Iwashima [Wed, 15 Jan 2025 09:55:45 +0000 (18:55 +0900)]
dev: Hold rtnl_net_lock() for dev_ifsioc().

Basically, dev_ifsioc() operates on the passed single netns (except
for netdev notifier chains with lower/upper devices for which we will
need more changes).

Let's hold rtnl_net_lock() for dev_ifsioc().

Now that NETDEV_CHANGENAME is always triggered under rtnl_net_lock()
of the device's netns. (do_setlink() and dev_ifsioc())

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250115095545.52709-4-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agodev: Remove devnet_rename_sem.
Kuniyuki Iwashima [Wed, 15 Jan 2025 09:55:44 +0000 (18:55 +0900)]
dev: Remove devnet_rename_sem.

devnet_rename_sem is no longer used since commit
0840556e5a3a ("net: Protect dev->name by seqlock.").

Also, RTNL serialises dev_change_name().

Let's remove devnet_rename_sem.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250115095545.52709-3-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agodev: Acquire netdev_rename_lock before restoring dev->name in dev_change_name().
Kuniyuki Iwashima [Wed, 15 Jan 2025 09:55:43 +0000 (18:55 +0900)]
dev: Acquire netdev_rename_lock before restoring dev->name in dev_change_name().

The cited commit forgot to add netdev_rename_lock in one of the
error paths in dev_change_name().

Let's hold netdev_rename_lock before restoring the old dev->name.

Fixes: 0840556e5a3a ("net: Protect dev->name by seqlock.")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250115095545.52709-2-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoselftests: drv-net-hw: inject pp_alloc_fail errors in the right place
John Daley [Wed, 15 Jan 2025 18:13:12 +0000 (10:13 -0800)]
selftests: drv-net-hw: inject pp_alloc_fail errors in the right place

The tool pp_alloc_fail.py tested error recovery by injecting errors
into the function page_pool_alloc_pages(). The page pool allocation
function page_pool_dev_alloc() does not end up calling
page_pool_alloc_pages(). page_pool_alloc_netmems() seems to be the
function that is called by all of the page pool alloc functions in
the API, so move error injection to that function instead.

Signed-off-by: John Daley <johndale@cisco.com>
Link: https://patch.msgid.link/20250115181312.3544-2-johndale@cisco.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoipv4: Prepare inet_rtm_getroute() to .flowi4_tos conversion.
Guillaume Nault [Wed, 15 Jan 2025 12:44:52 +0000 (13:44 +0100)]
ipv4: Prepare inet_rtm_getroute() to .flowi4_tos conversion.

Store rtm->rtm_tos in a dscp_t variable, which can then be used for
setting fl4.flowi4_tos and also be passed as parameter of
ip_route_input_rcu().

The .flowi4_tos field is going to be converted to dscp_t to ensure ECN
bits aren't erroneously taken into account during route lookups. Having
a dscp_t variable available will simplify that conversion, as we'll
just have to drop the inet_dscp_to_dsfield() call.

Note that we can't just convert rtm->rtm_tos to dscp_t because this
structure is exported to user space.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/7bc1c7dc47ad1393569095d334521fae59af5bc7.1736944951.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agogre: Prepare ipgre_open() to .flowi4_tos conversion.
Guillaume Nault [Wed, 15 Jan 2025 11:53:55 +0000 (12:53 +0100)]
gre: Prepare ipgre_open() to .flowi4_tos conversion.

Use ip4h_dscp() to get the tunnel DSCP option as dscp_t, instead of
manually masking the raw tos field with INET_DSCP_MASK. This will ease
the conversion of fl4->flowi4_tos to dscp_t, which just becomes a
matter of dropping the inet_dscp_to_dsfield() call.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/6c05a11afdc61530f1a4505147e0909ad51feb15.1736941806.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Jakub Kicinski [Thu, 16 Jan 2025 18:30:22 +0000 (10:30 -0800)]
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Cross-merge networking fixes after downstream PR (net-6.13-rc8).

Conflicts:

drivers/net/ethernet/realtek/r8169_main.c
  1f691a1fc4be ("r8169: remove redundant hwmon support")
  152d00a91396 ("r8169: simplify setting hwmon attribute visibility")
https://lore.kernel.org/20250115122152.760b4e8d@canb.auug.org.au

Adjacent changes:

drivers/net/ethernet/broadcom/bnxt/bnxt.c
  152f4da05aee ("bnxt_en: add support for rx-copybreak ethtool command")
  f0aa6a37a3db ("eth: bnxt: always recalculate features after XDP clearing, fix null-deref")

drivers/net/ethernet/intel/ice/ice_type.h
  50327223a8bb ("ice: add lock to protect low latency interface")
  dc26548d729e ("ice: Fix quad registers read on E825")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge tag 'net-6.13-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Linus Torvalds [Thu, 16 Jan 2025 17:09:44 +0000 (09:09 -0800)]
Merge tag 'net-6.13-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
 "Notably this includes fixes for a few regressions spotted very
  recently. No known outstanding ones.

  Current release - regressions:

   - core: avoid CFI problems with sock priv helpers

   - xsk: bring back busy polling support

   - netpoll: ensure skb_pool list is always initialized

  Current release - new code bugs:

   - core: make page_pool_ref_netmem work with net iovs

   - ipv4: route: fix drop reason being overridden in
     ip_route_input_slow

   - udp: make rehash4 independent in udp_lib_rehash()

  Previous releases - regressions:

   - bpf: fix bpf_sk_select_reuseport() memory leak

   - openvswitch: fix lockup on tx to unregistering netdev with carrier

   - mptcp: be sure to send ack when mptcp-level window re-opens

   - eth:
      - bnxt: always recalculate features after XDP clearing, fix
        null-deref
      - mlx5: fix sub-function add port error handling
      - fec: handle page_pool_dev_alloc_pages error

  Previous releases - always broken:

   - vsock: some fixes due to transport de-assignment

   - eth:
      - ice: fix E825 initialization
      - mlx5e: fix inversion dependency warning while enabling IPsec
        tunnel
      - gtp: destroy device along with udp socket's netns dismantle.
      - xilinx: axienet: Fix IRQ coalescing packet count overflow"

* tag 'net-6.13-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (44 commits)
  netdev: avoid CFI problems with sock priv helpers
  net/mlx5e: Always start IPsec sequence number from 1
  net/mlx5e: Rely on reqid in IPsec tunnel mode
  net/mlx5e: Fix inversion dependency warning while enabling IPsec tunnel
  net/mlx5: Clear port select structure when fail to create
  net/mlx5: SF, Fix add port error handling
  net/mlx5: Fix a lockdep warning as part of the write combining test
  net/mlx5: Fix RDMA TX steering prio
  net: make page_pool_ref_netmem work with net iovs
  net: ethernet: xgbe: re-add aneg to supported features in PHY quirks
  net: pcs: xpcs: actively unset DW_VR_MII_DIG_CTRL1_2G5_EN for 1G SGMII
  net: pcs: xpcs: fix DW_VR_MII_DIG_CTRL1_2G5_EN bit being set for 1G SGMII w/o inband
  selftests: net: Adapt ethtool mq tests to fix in qdisc graft
  net: fec: handle page_pool_dev_alloc_pages error
  net: netpoll: ensure skb_pool list is always initialized
  net: xilinx: axienet: Fix IRQ coalescing packet count overflow
  nfp: bpf: prevent integer overflow in nfp_bpf_event_output()
  selftests: mptcp: avoid spurious errors on disconnect
  mptcp: fix spurious wake-up on under memory pressure
  mptcp: be sure to send ack when mptcp-level window re-opens
  ...

3 months agoMerge tag 'pm-6.13-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Linus Torvalds [Thu, 16 Jan 2025 17:04:10 +0000 (09:04 -0800)]
Merge tag 'pm-6.13-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
 "Update the documentation of cpuidle governors that does not match the
  code any more after previous functional changes (Rafael Wysocki) and
  fix up the cpufreq Kconfig file broken inadvertently by a previous
  update (Viresh Kumar)"

* tag 'pm-6.13-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpufreq: Move endif to the end of Kconfig file
  cpuidle: teo: Update documentation after previous changes
  cpuidle: menu: Update documentation after previous changes

3 months agoMerge tag 'acpi-6.13-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael...
Linus Torvalds [Thu, 16 Jan 2025 17:02:10 +0000 (09:02 -0800)]
Merge tag 'acpi-6.13-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull ACPI fix from Rafael Wysocki:
 "Prevent acpi_video_device_EDID() from returning a pointer to a memory
  region that should not be passed to kfree() which causes one of its
  users to crash randomly on attempts to free it (Chris Bainbridge)"

* tag 'acpi-6.13-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  ACPI: video: Fix random crashes due to bad kfree()

3 months agoMerge tag 'for-6.13-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave...
Linus Torvalds [Thu, 16 Jan 2025 16:54:33 +0000 (08:54 -0800)]
Merge tag 'for-6.13-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fix from David Sterba:

 - handle d_path() errors when canonicalizing device mapper paths during
   device scan

* tag 'for-6.13-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: add the missing error handling inside get_canonical_dev_path

3 months agoMerge branch 'pm-cpufreq'
Rafael J. Wysocki [Thu, 16 Jan 2025 14:36:41 +0000 (15:36 +0100)]
Merge branch 'pm-cpufreq'

Merge a cpufreq fix for 6.13:

 - Fix cpufreq Kconfig breakage after previous changes (Viresh Kumar).

* pm-cpufreq:
  cpufreq: Move endif to the end of Kconfig file

3 months agonetdev: avoid CFI problems with sock priv helpers
Jakub Kicinski [Wed, 15 Jan 2025 16:14:36 +0000 (08:14 -0800)]
netdev: avoid CFI problems with sock priv helpers

Li Li reports that casting away callback type may cause issues
for CFI. Let's generate a small wrapper for each callback,
to make sure compiler sees the anticipated types.

Reported-by: Li Li <dualli@chromium.org>
Link: https://lore.kernel.org/CANBPYPjQVqmzZ4J=rVQX87a9iuwmaetULwbK_5_3YWk2eGzkaA@mail.gmail.com
Fixes: 170aafe35cb9 ("netdev: support binding dma-buf to netdevice")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20250115161436.648646-1-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 months agoMerge branch 'mlx5-misc-fixes-2025-01-15'
Paolo Abeni [Thu, 16 Jan 2025 11:45:50 +0000 (12:45 +0100)]
Merge branch 'mlx5-misc-fixes-2025-01-15'

Tariq Toukan says:

====================
mlx5 misc fixes 2025-01-15

This patchset provides misc bug fixes from the team to the mlx5 core and
Eth drivers.
====================

Link: https://patch.msgid.link/20250115113910.1990174-1-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 months agonet/mlx5e: Always start IPsec sequence number from 1
Leon Romanovsky [Wed, 15 Jan 2025 11:39:10 +0000 (13:39 +0200)]
net/mlx5e: Always start IPsec sequence number from 1

According to RFC4303, section "3.3.3. Sequence Number Generation",
the first packet sent using a given SA will contain a sequence
number of 1.

This is applicable to both ESN and non-ESN mode, which was not covered
in commit mentioned in Fixes line.

Fixes: 3d42c8cc67a8 ("net/mlx5e: Ensure that IPsec sequence packet number starts from 1")
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 months agonet/mlx5e: Rely on reqid in IPsec tunnel mode
Leon Romanovsky [Wed, 15 Jan 2025 11:39:09 +0000 (13:39 +0200)]
net/mlx5e: Rely on reqid in IPsec tunnel mode

All packet offloads SAs have reqid in it to make sure they have
corresponding policy. While it is not strictly needed for transparent
mode, it is extremely important in tunnel mode. In that mode, policy and
SAs have different match criteria.

Policy catches the whole subnet addresses, and SA catches the tunnel gateways
addresses. The source address of such tunnel is not known during egress packet
traversal in flow steering as it is added only after successful encryption.

As reqid is required for packet offload and it is unique for every SA,
we can safely rely on it only.

The output below shows the configured egress policy and SA by strongswan:

[leonro@vm ~]$ sudo ip x s
src 192.169.101.2 dst 192.169.101.1
        proto esp spi 0xc88b7652 reqid 1 mode tunnel
        replay-window 0 flag af-unspec esn
        aead rfc4106(gcm(aes)) 0xe406a01083986e14d116488549094710e9c57bc6 128
        anti-replay esn context:
         seq-hi 0x0, seq 0x0, oseq-hi 0x0, oseq 0x0
         replay_window 1, bitmap-length 1
         00000000
        crypto offload parameters: dev eth2 dir out mode packet

[leonro@064 ~]$ sudo ip x p
src 192.170.0.0/16 dst 192.170.0.0/16
        dir out priority 383615 ptype main
        tmpl src 192.169.101.2 dst 192.169.101.1
                proto esp spi 0xc88b7652 reqid 1 mode tunnel
        crypto offload parameters: dev eth2 mode packet

Fixes: b3beba1fb404 ("net/mlx5e: Allow policies with reqid 0, to support IKE policy holes")
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 months agonet/mlx5e: Fix inversion dependency warning while enabling IPsec tunnel
Leon Romanovsky [Wed, 15 Jan 2025 11:39:08 +0000 (13:39 +0200)]
net/mlx5e: Fix inversion dependency warning while enabling IPsec tunnel

Attempt to enable IPsec packet offload in tunnel mode in debug kernel
generates the following kernel panic, which is happening due to two
issues:
1. In SA add section, the should be _bh() variant when marking SA mode.
2. There is not needed flush_workqueue in SA delete routine. It is not
needed as at this stage as it is removed from SADB and the running work
will be canceled later in SA free.

 =====================================================
 WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
 6.12.0+ #4 Not tainted
 -----------------------------------------------------
 charon/1337 [HC0[0]:SC0[4]:HE1:SE0] is trying to acquire:
 ffff88810f365020 (&xa->xa_lock#24){+.+.}-{3:3}, at: mlx5e_xfrm_del_state+0xca/0x1e0 [mlx5_core]

 and this task is already holding:
 ffff88813e0f0d48 (&x->lock){+.-.}-{3:3}, at: xfrm_state_delete+0x16/0x30
 which would create a new lock dependency:
  (&x->lock){+.-.}-{3:3} -> (&xa->xa_lock#24){+.+.}-{3:3}

 but this new dependency connects a SOFTIRQ-irq-safe lock:
  (&x->lock){+.-.}-{3:3}

 ... which became SOFTIRQ-irq-safe at:
   lock_acquire+0x1be/0x520
   _raw_spin_lock_bh+0x34/0x40
   xfrm_timer_handler+0x91/0xd70
   __hrtimer_run_queues+0x1dd/0xa60
   hrtimer_run_softirq+0x146/0x2e0
   handle_softirqs+0x266/0x860
   irq_exit_rcu+0x115/0x1a0
   sysvec_apic_timer_interrupt+0x6e/0x90
   asm_sysvec_apic_timer_interrupt+0x16/0x20
   default_idle+0x13/0x20
   default_idle_call+0x67/0xa0
   do_idle+0x2da/0x320
   cpu_startup_entry+0x50/0x60
   start_secondary+0x213/0x2a0
   common_startup_64+0x129/0x138

 to a SOFTIRQ-irq-unsafe lock:
  (&xa->xa_lock#24){+.+.}-{3:3}

 ... which became SOFTIRQ-irq-unsafe at:
 ...
   lock_acquire+0x1be/0x520
   _raw_spin_lock+0x2c/0x40
   xa_set_mark+0x70/0x110
   mlx5e_xfrm_add_state+0xe48/0x2290 [mlx5_core]
   xfrm_dev_state_add+0x3bb/0xd70
   xfrm_add_sa+0x2451/0x4a90
   xfrm_user_rcv_msg+0x493/0x880
   netlink_rcv_skb+0x12e/0x380
   xfrm_netlink_rcv+0x6d/0x90
   netlink_unicast+0x42f/0x740
   netlink_sendmsg+0x745/0xbe0
   __sock_sendmsg+0xc5/0x190
   __sys_sendto+0x1fe/0x2c0
   __x64_sys_sendto+0xdc/0x1b0
   do_syscall_64+0x6d/0x140
   entry_SYSCALL_64_after_hwframe+0x4b/0x53

 other info that might help us debug this:

  Possible interrupt unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   lock(&xa->xa_lock#24);
                                local_irq_disable();
                                lock(&x->lock);
                                lock(&xa->xa_lock#24);
   <Interrupt>
     lock(&x->lock);

  *** DEADLOCK ***

 2 locks held by charon/1337:
  #0: ffffffff87f8f858 (&net->xfrm.xfrm_cfg_mutex){+.+.}-{4:4}, at: xfrm_netlink_rcv+0x5e/0x90
  #1: ffff88813e0f0d48 (&x->lock){+.-.}-{3:3}, at: xfrm_state_delete+0x16/0x30

 the dependencies between SOFTIRQ-irq-safe lock and the holding lock:
 -> (&x->lock){+.-.}-{3:3} ops: 29 {
    HARDIRQ-ON-W at:
                     lock_acquire+0x1be/0x520
                     _raw_spin_lock_bh+0x34/0x40
                     xfrm_alloc_spi+0xc0/0xe60
                     xfrm_alloc_userspi+0x5f6/0xbc0
                     xfrm_user_rcv_msg+0x493/0x880
                     netlink_rcv_skb+0x12e/0x380
                     xfrm_netlink_rcv+0x6d/0x90
                     netlink_unicast+0x42f/0x740
                     netlink_sendmsg+0x745/0xbe0
                     __sock_sendmsg+0xc5/0x190
                     __sys_sendto+0x1fe/0x2c0
                     __x64_sys_sendto+0xdc/0x1b0
                     do_syscall_64+0x6d/0x140
                     entry_SYSCALL_64_after_hwframe+0x4b/0x53
    IN-SOFTIRQ-W at:
                     lock_acquire+0x1be/0x520
                     _raw_spin_lock_bh+0x34/0x40
                     xfrm_timer_handler+0x91/0xd70
                     __hrtimer_run_queues+0x1dd/0xa60
                     hrtimer_run_softirq+0x146/0x2e0
                     handle_softirqs+0x266/0x860
                     irq_exit_rcu+0x115/0x1a0
                     sysvec_apic_timer_interrupt+0x6e/0x90
                     asm_sysvec_apic_timer_interrupt+0x16/0x20
                     default_idle+0x13/0x20
                     default_idle_call+0x67/0xa0
                     do_idle+0x2da/0x320
                     cpu_startup_entry+0x50/0x60
                     start_secondary+0x213/0x2a0
                     common_startup_64+0x129/0x138
    INITIAL USE at:
                    lock_acquire+0x1be/0x520
                    _raw_spin_lock_bh+0x34/0x40
                    xfrm_alloc_spi+0xc0/0xe60
                    xfrm_alloc_userspi+0x5f6/0xbc0
                    xfrm_user_rcv_msg+0x493/0x880
                    netlink_rcv_skb+0x12e/0x380
                    xfrm_netlink_rcv+0x6d/0x90
                    netlink_unicast+0x42f/0x740
                    netlink_sendmsg+0x745/0xbe0
                    __sock_sendmsg+0xc5/0x190
                    __sys_sendto+0x1fe/0x2c0
                    __x64_sys_sendto+0xdc/0x1b0
                    do_syscall_64+0x6d/0x140
                    entry_SYSCALL_64_after_hwframe+0x4b/0x53
  }
  ... key      at: [<ffffffff87f9cd20>] __key.18+0x0/0x40

 the dependencies between the lock to be acquired
  and SOFTIRQ-irq-unsafe lock:
 -> (&xa->xa_lock#24){+.+.}-{3:3} ops: 9 {
    HARDIRQ-ON-W at:
                     lock_acquire+0x1be/0x520
                     _raw_spin_lock_bh+0x34/0x40
                     mlx5e_xfrm_add_state+0xc5b/0x2290 [mlx5_core]
                     xfrm_dev_state_add+0x3bb/0xd70
                     xfrm_add_sa+0x2451/0x4a90
                     xfrm_user_rcv_msg+0x493/0x880
                     netlink_rcv_skb+0x12e/0x380
                     xfrm_netlink_rcv+0x6d/0x90
                     netlink_unicast+0x42f/0x740
                     netlink_sendmsg+0x745/0xbe0
                     __sock_sendmsg+0xc5/0x190
                     __sys_sendto+0x1fe/0x2c0
                     __x64_sys_sendto+0xdc/0x1b0
                     do_syscall_64+0x6d/0x140
                     entry_SYSCALL_64_after_hwframe+0x4b/0x53
    SOFTIRQ-ON-W at:
                     lock_acquire+0x1be/0x520
                     _raw_spin_lock+0x2c/0x40
                     xa_set_mark+0x70/0x110
                     mlx5e_xfrm_add_state+0xe48/0x2290 [mlx5_core]
                     xfrm_dev_state_add+0x3bb/0xd70
                     xfrm_add_sa+0x2451/0x4a90
                     xfrm_user_rcv_msg+0x493/0x880
                     netlink_rcv_skb+0x12e/0x380
                     xfrm_netlink_rcv+0x6d/0x90
                     netlink_unicast+0x42f/0x740
                     netlink_sendmsg+0x745/0xbe0
                     __sock_sendmsg+0xc5/0x190
                     __sys_sendto+0x1fe/0x2c0
                     __x64_sys_sendto+0xdc/0x1b0
                     do_syscall_64+0x6d/0x140
                     entry_SYSCALL_64_after_hwframe+0x4b/0x53
    INITIAL USE at:
                    lock_acquire+0x1be/0x520
                    _raw_spin_lock_bh+0x34/0x40
                    mlx5e_xfrm_add_state+0xc5b/0x2290 [mlx5_core]
                    xfrm_dev_state_add+0x3bb/0xd70
                    xfrm_add_sa+0x2451/0x4a90
                    xfrm_user_rcv_msg+0x493/0x880
                    netlink_rcv_skb+0x12e/0x380
                    xfrm_netlink_rcv+0x6d/0x90
                    netlink_unicast+0x42f/0x740
                    netlink_sendmsg+0x745/0xbe0
                    __sock_sendmsg+0xc5/0x190
                    __sys_sendto+0x1fe/0x2c0
                    __x64_sys_sendto+0xdc/0x1b0
                    do_syscall_64+0x6d/0x140
                    entry_SYSCALL_64_after_hwframe+0x4b/0x53
  }
  ... key      at: [<ffffffffa078ff60>] __key.48+0x0/0xfffffffffff210a0 [mlx5_core]
  ... acquired at:
    __lock_acquire+0x30a0/0x5040
    lock_acquire+0x1be/0x520
    _raw_spin_lock_bh+0x34/0x40
    mlx5e_xfrm_del_state+0xca/0x1e0 [mlx5_core]
    xfrm_dev_state_delete+0x90/0x160
    __xfrm_state_delete+0x662/0xae0
    xfrm_state_delete+0x1e/0x30
    xfrm_del_sa+0x1c2/0x340
    xfrm_user_rcv_msg+0x493/0x880
    netlink_rcv_skb+0x12e/0x380
    xfrm_netlink_rcv+0x6d/0x90
    netlink_unicast+0x42f/0x740
    netlink_sendmsg+0x745/0xbe0
    __sock_sendmsg+0xc5/0x190
    __sys_sendto+0x1fe/0x2c0
    __x64_sys_sendto+0xdc/0x1b0
    do_syscall_64+0x6d/0x140
    entry_SYSCALL_64_after_hwframe+0x4b/0x53

 stack backtrace:
 CPU: 7 UID: 0 PID: 1337 Comm: charon Not tainted 6.12.0+ #4
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
 Call Trace:
  <TASK>
  dump_stack_lvl+0x74/0xd0
  check_irq_usage+0x12e8/0x1d90
  ? print_shortest_lock_dependencies_backwards+0x1b0/0x1b0
  ? check_chain_key+0x1bb/0x4c0
  ? __lockdep_reset_lock+0x180/0x180
  ? check_path.constprop.0+0x24/0x50
  ? mark_lock+0x108/0x2fb0
  ? print_circular_bug+0x9b0/0x9b0
  ? mark_lock+0x108/0x2fb0
  ? print_usage_bug.part.0+0x670/0x670
  ? check_prev_add+0x1c4/0x2310
  check_prev_add+0x1c4/0x2310
  __lock_acquire+0x30a0/0x5040
  ? lockdep_set_lock_cmp_fn+0x190/0x190
  ? lockdep_set_lock_cmp_fn+0x190/0x190
  lock_acquire+0x1be/0x520
  ? mlx5e_xfrm_del_state+0xca/0x1e0 [mlx5_core]
  ? lockdep_hardirqs_on_prepare+0x400/0x400
  ? __xfrm_state_delete+0x5f0/0xae0
  ? lock_downgrade+0x6b0/0x6b0
  _raw_spin_lock_bh+0x34/0x40
  ? mlx5e_xfrm_del_state+0xca/0x1e0 [mlx5_core]
  mlx5e_xfrm_del_state+0xca/0x1e0 [mlx5_core]
  xfrm_dev_state_delete+0x90/0x160
  __xfrm_state_delete+0x662/0xae0
  xfrm_state_delete+0x1e/0x30
  xfrm_del_sa+0x1c2/0x340
  ? xfrm_get_sa+0x250/0x250
  ? check_chain_key+0x1bb/0x4c0
  xfrm_user_rcv_msg+0x493/0x880
  ? copy_sec_ctx+0x270/0x270
  ? check_chain_key+0x1bb/0x4c0
  ? lockdep_set_lock_cmp_fn+0x190/0x190
  ? lockdep_set_lock_cmp_fn+0x190/0x190
  netlink_rcv_skb+0x12e/0x380
  ? copy_sec_ctx+0x270/0x270
  ? netlink_ack+0xd90/0xd90
  ? netlink_deliver_tap+0xcd/0xb60
  xfrm_netlink_rcv+0x6d/0x90
  netlink_unicast+0x42f/0x740
  ? netlink_attachskb+0x730/0x730
  ? lock_acquire+0x1be/0x520
  netlink_sendmsg+0x745/0xbe0
  ? netlink_unicast+0x740/0x740
  ? __might_fault+0xbb/0x170
  ? netlink_unicast+0x740/0x740
  __sock_sendmsg+0xc5/0x190
  ? fdget+0x163/0x1d0
  __sys_sendto+0x1fe/0x2c0
  ? __x64_sys_getpeername+0xb0/0xb0
  ? do_user_addr_fault+0x856/0xe30
  ? lock_acquire+0x1be/0x520
  ? __task_pid_nr_ns+0x117/0x410
  ? lock_downgrade+0x6b0/0x6b0
  __x64_sys_sendto+0xdc/0x1b0
  ? lockdep_hardirqs_on_prepare+0x284/0x400
  do_syscall_64+0x6d/0x140
  entry_SYSCALL_64_after_hwframe+0x4b/0x53
 RIP: 0033:0x7f7d31291ba4
 Code: 7d e8 89 4d d4 e8 4c 42 f7 ff 44 8b 4d d0 4c 8b 45 c8 89 c3 44 8b 55 d4 8b 7d e8 b8 2c 00 00 00 48 8b 55 d8 48 8b 75 e0 0f 05 <48> 3d 00 f0 ff ff 77 34 89 df 48 89 45 e8 e8 99 42 f7 ff 48 8b 45
 RSP: 002b:00007f7d2ccd94f0 EFLAGS: 00000297 ORIG_RAX: 000000000000002c
 RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f7d31291ba4
 RDX: 0000000000000028 RSI: 00007f7d2ccd96a0 RDI: 000000000000000a
 RBP: 00007f7d2ccd9530 R08: 00007f7d2ccd9598 R09: 000000000000000c
 R10: 0000000000000000 R11: 0000000000000297 R12: 0000000000000028
 R13: 00007f7d2ccd9598 R14: 00007f7d2ccd96a0 R15: 00000000000000e1
  </TASK>

Fixes: 4c24272b4e2b ("net/mlx5e: Listen to ARP events to update IPsec L2 headers in tunnel mode")
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 months agonet/mlx5: Clear port select structure when fail to create
Mark Zhang [Wed, 15 Jan 2025 11:39:07 +0000 (13:39 +0200)]
net/mlx5: Clear port select structure when fail to create

Clear the port select structure on error so no stale values left after
definers are destroyed. That's because the mlx5_lag_destroy_definers()
always try to destroy all lag definers in the tt_map, so in the flow
below lag definers get double-destroyed and cause kernel crash:

  mlx5_lag_port_sel_create()
    mlx5_lag_create_definers()
      mlx5_lag_create_definer()     <- Failed on tt 1
        mlx5_lag_destroy_definers() <- definers[tt=0] gets destroyed
  mlx5_lag_port_sel_create()
    mlx5_lag_create_definers()
      mlx5_lag_create_definer()     <- Failed on tt 0
        mlx5_lag_destroy_definers() <- definers[tt=0] gets double-destroyed

 Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008
 Mem abort info:
   ESR = 0x0000000096000005
   EC = 0x25: DABT (current EL), IL = 32 bits
   SET = 0, FnV = 0
   EA = 0, S1PTW = 0
   FSC = 0x05: level 1 translation fault
 Data abort info:
   ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000
   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
 user pgtable: 64k pages, 48-bit VAs, pgdp=0000000112ce2e00
 [0000000000000008] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
 Internal error: Oops: 0000000096000005 [#1] PREEMPT SMP
 Modules linked in: iptable_raw bonding ip_gre ip6_gre gre ip6_tunnel tunnel6 geneve ip6_udp_tunnel udp_tunnel ipip tunnel4 ip_tunnel rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) ib_uverbs(OE) mlx5_fwctl(OE) fwctl(OE) mlx5_core(OE) mlxdevm(OE) ib_core(OE) mlxfw(OE) memtrack(OE) mlx_compat(OE) openvswitch nsh nf_conncount psample xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter bridge stp llc netconsole overlay efi_pstore sch_fq_codel zram ip_tables crct10dif_ce qemu_fw_cfg fuse ipv6 crc_ccitt [last unloaded: mlx_compat(OE)]
  CPU: 3 UID: 0 PID: 217 Comm: kworker/u53:2 Tainted: G           OE      6.11.0+ #2
  Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
  Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
  Workqueue: mlx5_lag mlx5_do_bond_work [mlx5_core]
  pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
  pc : mlx5_del_flow_rules+0x24/0x2c0 [mlx5_core]
  lr : mlx5_lag_destroy_definer+0x54/0x100 [mlx5_core]
  sp : ffff800085fafb00
  x29: ffff800085fafb00 x28: ffff0000da0c8000 x27: 0000000000000000
  x26: ffff0000da0c8000 x25: ffff0000da0c8000 x24: ffff0000da0c8000
  x23: ffff0000c31f81a0 x22: 0400000000000000 x21: ffff0000da0c8000
  x20: 0000000000000000 x19: 0000000000000001 x18: 0000000000000000
  x17: 0000000000000000 x16: 0000000000000000 x15: 0000ffff8b0c9350
  x14: 0000000000000000 x13: ffff800081390d18 x12: ffff800081dc3cc0
  x11: 0000000000000001 x10: 0000000000000b10 x9 : ffff80007ab7304c
  x8 : ffff0000d00711f0 x7 : 0000000000000004 x6 : 0000000000000190
  x5 : ffff00027edb3010 x4 : 0000000000000000 x3 : 0000000000000000
  x2 : ffff0000d39b8000 x1 : ffff0000d39b8000 x0 : 0400000000000000
  Call trace:
   mlx5_del_flow_rules+0x24/0x2c0 [mlx5_core]
   mlx5_lag_destroy_definer+0x54/0x100 [mlx5_core]
   mlx5_lag_destroy_definers+0xa0/0x108 [mlx5_core]
   mlx5_lag_port_sel_create+0x2d4/0x6f8 [mlx5_core]
   mlx5_activate_lag+0x60c/0x6f8 [mlx5_core]
   mlx5_do_bond_work+0x284/0x5c8 [mlx5_core]
   process_one_work+0x170/0x3e0
   worker_thread+0x2d8/0x3e0
   kthread+0x11c/0x128
   ret_from_fork+0x10/0x20
  Code: a9025bf5 aa0003f6 a90363f7 f90023f9 (f9400400)
  ---[ end trace 0000000000000000 ]---

Fixes: dc48516ec7d3 ("net/mlx5: Lag, add support to create definers for LAG")
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 months agonet/mlx5: SF, Fix add port error handling
Chris Mi [Wed, 15 Jan 2025 11:39:06 +0000 (13:39 +0200)]
net/mlx5: SF, Fix add port error handling

If failed to add SF, error handling doesn't delete the SF from the
SF table. But the hw resources are deleted. So when unload driver,
hw resources will be deleted again. Firmware will report syndrome
0x68def3 which means "SF is not allocated can not deallocate".

Fix it by delete SF from SF table if failed to add SF.

Fixes: 2597ee190b4e ("net/mlx5: Call mlx5_sf_id_erase() once in mlx5_sf_dealloc()")
Signed-off-by: Chris Mi <cmi@nvidia.com>
Reviewed-by: Shay Drori <shayd@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 months agonet/mlx5: Fix a lockdep warning as part of the write combining test
Yishai Hadas [Wed, 15 Jan 2025 11:39:05 +0000 (13:39 +0200)]
net/mlx5: Fix a lockdep warning as part of the write combining test

Fix a lockdep warning [1] observed during the write combining test.

The warning indicates a potential nested lock scenario that could lead
to a deadlock.

However, this is a false positive alarm because the SF lock and its
parent lock are distinct ones.

The lockdep confusion arises because the locks belong to the same object
class (i.e., struct mlx5_core_dev).

To resolve this, the code has been refactored to avoid taking both
locks. Instead, only the parent lock is acquired.

[1]
raw_ethernet_bw/2118 is trying to acquire lock:
[  213.619032] ffff88811dd75e08 (&dev->wc_state_lock){+.+.}-{3:3}, at:
               mlx5_wc_support_get+0x18c/0x210 [mlx5_core]
[  213.620270]
[  213.620270] but task is already holding lock:
[  213.620943] ffff88810b585e08 (&dev->wc_state_lock){+.+.}-{3:3}, at:
               mlx5_wc_support_get+0x10c/0x210 [mlx5_core]
[  213.622045]
[  213.622045] other info that might help us debug this:
[  213.622778]  Possible unsafe locking scenario:
[  213.622778]
[  213.623465]        CPU0
[  213.623815]        ----
[  213.624148]   lock(&dev->wc_state_lock);
[  213.624615]   lock(&dev->wc_state_lock);
[  213.625071]
[  213.625071]  *** DEADLOCK ***
[  213.625071]
[  213.625805]  May be due to missing lock nesting notation
[  213.625805]
[  213.626522] 4 locks held by raw_ethernet_bw/2118:
[  213.627019]  #0: ffff88813f80d578 (&uverbs_dev->disassociate_srcu){.+.+}-{0:0},
                at: ib_uverbs_ioctl+0xc4/0x170 [ib_uverbs]
[  213.628088]  #1: ffff88810fb23930 (&file->hw_destroy_rwsem){.+.+}-{3:3},
                at: ib_init_ucontext+0x2d/0xf0 [ib_uverbs]
[  213.629094]  #2: ffff88810fb23878 (&file->ucontext_lock){+.+.}-{3:3},
                at: ib_init_ucontext+0x49/0xf0 [ib_uverbs]
[  213.630106]  #3: ffff88810b585e08 (&dev->wc_state_lock){+.+.}-{3:3},
                at: mlx5_wc_support_get+0x10c/0x210 [mlx5_core]
[  213.631185]
[  213.631185] stack backtrace:
[  213.631718] CPU: 1 UID: 0 PID: 2118 Comm: raw_ethernet_bw Not tainted
               6.12.0-rc7_internal_net_next_mlx5_89a0ad0 #1
[  213.632722] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
               rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[  213.633785] Call Trace:
[  213.634099]
[  213.634393]  dump_stack_lvl+0x7e/0xc0
[  213.634806]  print_deadlock_bug+0x278/0x3c0
[  213.635265]  __lock_acquire+0x15f4/0x2c40
[  213.635712]  lock_acquire+0xcd/0x2d0
[  213.636120]  ? mlx5_wc_support_get+0x18c/0x210 [mlx5_core]
[  213.636722]  ? mlx5_ib_enable_lb+0x24/0xa0 [mlx5_ib]
[  213.637277]  __mutex_lock+0x81/0xda0
[  213.637697]  ? mlx5_wc_support_get+0x18c/0x210 [mlx5_core]
[  213.638305]  ? mlx5_wc_support_get+0x18c/0x210 [mlx5_core]
[  213.638902]  ? rcu_read_lock_sched_held+0x3f/0x70
[  213.639400]  ? mlx5_wc_support_get+0x18c/0x210 [mlx5_core]
[  213.640016]  mlx5_wc_support_get+0x18c/0x210 [mlx5_core]
[  213.640615]  set_ucontext_resp+0x68/0x2b0 [mlx5_ib]
[  213.641144]  ? debug_mutex_init+0x33/0x40
[  213.641586]  mlx5_ib_alloc_ucontext+0x18e/0x7b0 [mlx5_ib]
[  213.642145]  ib_init_ucontext+0xa0/0xf0 [ib_uverbs]
[  213.642679]  ib_uverbs_handler_UVERBS_METHOD_GET_CONTEXT+0x95/0xc0
                [ib_uverbs]
[  213.643426]  ? _copy_from_user+0x46/0x80
[  213.643878]  ib_uverbs_cmd_verbs+0xa6b/0xc80 [ib_uverbs]
[  213.644426]  ? ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0x130/0x130
               [ib_uverbs]
[  213.645213]  ? __lock_acquire+0xa99/0x2c40
[  213.645675]  ? lock_acquire+0xcd/0x2d0
[  213.646101]  ? ib_uverbs_ioctl+0xc4/0x170 [ib_uverbs]
[  213.646625]  ? reacquire_held_locks+0xcf/0x1f0
[  213.647102]  ? do_user_addr_fault+0x45d/0x770
[  213.647586]  ib_uverbs_ioctl+0xe0/0x170 [ib_uverbs]
[  213.648102]  ? ib_uverbs_ioctl+0xc4/0x170 [ib_uverbs]
[  213.648632]  __x64_sys_ioctl+0x4d3/0xaa0
[  213.649060]  ? do_user_addr_fault+0x4a8/0x770
[  213.649528]  do_syscall_64+0x6d/0x140
[  213.649947]  entry_SYSCALL_64_after_hwframe+0x4b/0x53
[  213.650478] RIP: 0033:0x7fa179b0737b
[  213.650893] Code: ff ff ff 85 c0 79 9b 49 c7 c4 ff ff ff ff 5b 5d 4c
               89 e0 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8
               10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d
               7d 2a 0f 00 f7 d8 64 89 01 48
[  213.652619] RSP: 002b:00007ffd2e6d46e8 EFLAGS: 00000246 ORIG_RAX:
               0000000000000010
[  213.653390] RAX: ffffffffffffffda RBX: 00007ffd2e6d47f8 RCX:
               00007fa179b0737b
[  213.654084] RDX: 00007ffd2e6d47e0 RSI: 00000000c0181b01 RDI:
               0000000000000003
[  213.654767] RBP: 00007ffd2e6d47c0 R08: 00007fa1799be010 R09:
               0000000000000002
[  213.655453] R10: 00007ffd2e6d4960 R11: 0000000000000246 R12:
               00007ffd2e6d487c
[  213.656170] R13: 0000000000000027 R14: 0000000000000001 R15:
               00007ffd2e6d4f70

Fixes: d98995b4bf98 ("net/mlx5: Reimplement write combining test")
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 months agonet/mlx5: Fix RDMA TX steering prio
Patrisious Haddad [Wed, 15 Jan 2025 11:39:04 +0000 (13:39 +0200)]
net/mlx5: Fix RDMA TX steering prio

User added steering rules at RDMA_TX were being added to the first prio,
which is the counters prio.
Fix that so that they are correctly added to the BYPASS_PRIO instead.

Fixes: 24670b1a3166 ("net/mlx5: Add support for RDMA TX steering")
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 months agoMerge branch 'net-stmmac-rx-performance-improvement'
Paolo Abeni [Thu, 16 Jan 2025 11:14:24 +0000 (12:14 +0100)]
Merge branch 'net-stmmac-rx-performance-improvement'

Furong Xu says:

====================
net: stmmac: RX performance improvement

This series improves RX performance a lot, ~40% TCP RX throughput boost
has been observed with DWXGMAC CORE 3.20a running on Cortex-A65 CPUs:
from 2.18 Gbits/sec increased to 3.06 Gbits/sec.
====================

Link: https://patch.msgid.link/cover.1736910454.git.0x1207@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 months agonet: stmmac: Convert prefetch() to net_prefetch() for received frames
Furong Xu [Wed, 15 Jan 2025 03:27:05 +0000 (11:27 +0800)]
net: stmmac: Convert prefetch() to net_prefetch() for received frames

The size of DMA descriptors is 32 bytes at most.
net_prefetch() for received frames, and keep prefetch() for descriptors.

This patch brings ~4.8% driver performance improvement in a TCP RX
throughput test with iPerf tool on a single isolated Cortex-A65 CPU
core, 2.92 Gbits/sec increased to 3.06 Gbits/sec.

Suggested-by: Joe Damato <jdamato@fastly.com>
Signed-off-by: Furong Xu <0x1207@gmail.com>
Reviewed-by: Yanteng Si <si.yanteng@linux.dev>
Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 months agonet: stmmac: Optimize cache prefetch in RX path
Furong Xu [Wed, 15 Jan 2025 03:27:04 +0000 (11:27 +0800)]
net: stmmac: Optimize cache prefetch in RX path

Current code prefetches cache lines for the received frame first, and
then dma_sync_single_for_cpu() against this frame, this is wrong.
Cache prefetch should be triggered after dma_sync_single_for_cpu().

This patch brings ~2.8% driver performance improvement in a TCP RX
throughput test with iPerf tool on a single isolated Cortex-A65 CPU
core, 2.84 Gbits/sec increased to 2.92 Gbits/sec.

Signed-off-by: Furong Xu <0x1207@gmail.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Yanteng Si <si.yanteng@linux.dev>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 months agonet: stmmac: Set page_pool_params.max_len to a precise size
Furong Xu [Wed, 15 Jan 2025 03:27:03 +0000 (11:27 +0800)]
net: stmmac: Set page_pool_params.max_len to a precise size

DMA engine will always write no more than dma_buf_sz bytes of a received
frame into a page buffer, the remaining spaces are unused or used by CPU
exclusively.
Setting page_pool_params.max_len to almost the full size of page(s) helps
nothing more, but wastes more CPU cycles on cache maintenance.

For a standard MTU of 1500, then dma_buf_sz is assigned to 1536, and this
patch brings ~16.9% driver performance improvement in a TCP RX
throughput test with iPerf tool on a single isolated Cortex-A65 CPU
core, from 2.43 Gbits/sec increased to 2.84 Gbits/sec.

Signed-off-by: Furong Xu <0x1207@gmail.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Yanteng Si <si.yanteng@linux.dev>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 months agonet: stmmac: Switch to zero-copy in non-XDP RX path
Furong Xu [Wed, 15 Jan 2025 03:27:02 +0000 (11:27 +0800)]
net: stmmac: Switch to zero-copy in non-XDP RX path

Avoid memcpy in non-XDP RX path by marking all allocated SKBs to
be recycled in the upper network stack.

This patch brings ~11.5% driver performance improvement in a TCP RX
throughput test with iPerf tool on a single isolated Cortex-A65 CPU
core, from 2.18 Gbits/sec increased to 2.43 Gbits/sec.

Signed-off-by: Furong Xu <0x1207@gmail.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com>
Reviewed-by: Yanteng Si <si.yanteng@linux.dev>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 months agoMerge branch 'net-mlx5e-ct-add-support-for-hardware-steering'
Jakub Kicinski [Thu, 16 Jan 2025 03:28:07 +0000 (19:28 -0800)]
Merge branch 'net-mlx5e-ct-add-support-for-hardware-steering'

Tariq Toukan says:

====================
net/mlx5e: CT: Add support for hardware steering

This series start with one more HWS patch by Yevgeny, followed by
patches that add support for connection tracking in hardware steering
mode. It consists of:
- patch #2 hooks up the CT ops for the new mode in the right places.
- patch #3 moves a function into a common file, so it can be reused.
- patch #4 uses the HWS API to implement connection tracking.

The main advantage of hardware steering compared to software steering is
vastly improved performance when adding/removing/updating rules.  Using
the T-Rex traffic generator to initiate multi-million UDP flows per
second, a kernel running with these patches was able to offload ~600K
unique UDP flows per second, a number around ~7x larger than software
steering was able to achieve on the same hardware (256-thread AMD EPYC,
512 GB RAM, ConnectX 7 b2b).
====================

Link: https://patch.msgid.link/20250114130646.1937192-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet/mlx5e: CT: Offload connections with hardware steering rules
Cosmin Ratiu [Tue, 14 Jan 2025 13:06:46 +0000 (15:06 +0200)]
net/mlx5e: CT: Offload connections with hardware steering rules

This is modeled similar to how software steering works:
- a reference-counted matcher is maintained for each
  combination of nat/no_nat x ipv4/ipv6 x tcp/udp/gre.
- adding a rule involves finding+referencing or creating a corresponding
  matcher, then actually adding a rule.
- updating rules is implemented using the bwc_rule update API, which can
  change a rule's actions without touching the match value.

By using a T-Rex traffic generator to initiate multi-million UDP flows
per second, a kernel running with these patches on the RX side was able
to offload ~600K flows per second, which is about ~7x larger than what
software steering could do on the same hardware (256-thread AMD EPYC,
512 GB RAM, ConnectX-7 b2b).

Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Jianbo Liu <jianbol@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250114130646.1937192-5-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet/mlx5e: CT: Make mlx5_ct_fs_smfs_ct_validate_flow_rule reusable
Cosmin Ratiu [Tue, 14 Jan 2025 13:06:45 +0000 (15:06 +0200)]
net/mlx5e: CT: Make mlx5_ct_fs_smfs_ct_validate_flow_rule reusable

This function checks whether a flow_rule has the right flow dissector
keys and masks used for a connection tracking flow offload. It is
currently used locally by the tc_ct smfs module, but is about to be used
from another place, so this commit moves it to a better place, renames
it to mlx5e_tc_ct_is_valid_flow_rule and drops the unused fs argument.

Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Jianbo Liu <jianbol@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250114130646.1937192-4-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet/mlx5e: CT: Add initial support for Hardware Steering
Cosmin Ratiu [Tue, 14 Jan 2025 13:06:44 +0000 (15:06 +0200)]
net/mlx5e: CT: Add initial support for Hardware Steering

Connection tracking can offload tuple matches to the NIC either via
firmware commands (when the steering mode is dmfs or offload support is
disabled due to eswitch being set to legacy) or via software-managed
flow steering (smfs).

This commit adds stub operations for a third mode, hardware-managed flow
steering. This is enabled when both CONFIG_MLX5_TC_CT and
CONFIG_MLX5_HW_STEERING are enabled.

Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Jianbo Liu <jianbol@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250114130646.1937192-3-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet/mlx5: HWS, rework the check if matcher size can be increased
Yevgeny Kliteynik [Tue, 14 Jan 2025 13:06:43 +0000 (15:06 +0200)]
net/mlx5: HWS, rework the check if matcher size can be increased

When checking if the matcher size can be increased, check both
match and action RTCs. Also, consider the increasing step - check
that it won't cause the new matcher size to become unsupported.

Additionally, since we're using '+ 1' for action RTC size yet
again, define it as macro and use in all the required places.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250114130646.1937192-2-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge branch 'net-reduce-rtnl-pressure-in-unregister_netdevice'
Jakub Kicinski [Thu, 16 Jan 2025 03:17:07 +0000 (19:17 -0800)]
Merge branch 'net-reduce-rtnl-pressure-in-unregister_netdevice'

Eric Dumazet says:

====================
net: reduce RTNL pressure in unregister_netdevice()

One major source of RTNL contention resides in unregister_netdevice()

Due to RCU protection of various network structures, and
unregister_netdevice() being a synchronous function,
it is calling potentially slow functions while holding RTNL.

I think we can release RTNL in two points, so that three
slow functions are called while RTNL can be used
by other threads.

v1: https://lore.kernel.org/netdev/20250107130906.098fc8d6@kernel.org/T/#m398c95f5778e1ff70938e079d3c4c43c050ad2a6
====================

Link: https://patch.msgid.link/20250114205531.967841-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: reduce RTNL hold duration in unregister_netdevice_many_notify() (part 2)
Eric Dumazet [Tue, 14 Jan 2025 20:55:31 +0000 (20:55 +0000)]
net: reduce RTNL hold duration in unregister_netdevice_many_notify() (part 2)

One synchronize_net() call is currently done while holding RTNL.

This is source of RTNL contention in workloads adding and deleting
many network namespaces per second, because synchronize_rcu()
and synchronize_rcu_expedited() can use 60+ ms in some cases.

For cleanup_net() use, temporarily release RTNL
while calling the last synchronize_net().

This should be safe, because devices are no longer visible
to other threads after unlist_netdevice() call
and setting dev->reg_state to NETREG_UNREGISTERING.

In any case, the new netdev_lock() / netdev_unlock()
infrastructure that we are adding should allow
to fix potential issues, with a combination
of a per-device mutex and dev->reg_state awareness.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jesse Brandeburg <jbrandeburg@cloudflare.com>
Link: https://patch.msgid.link/20250114205531.967841-6-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: reduce RTNL hold duration in unregister_netdevice_many_notify() (part 1)
Eric Dumazet [Tue, 14 Jan 2025 20:55:30 +0000 (20:55 +0000)]
net: reduce RTNL hold duration in unregister_netdevice_many_notify() (part 1)

Two synchronize_net() calls are currently done while holding RTNL.

This is source of RTNL contention in workloads adding and deleting
many network namespaces per second, because synchronize_rcu()
and synchronize_rcu_expedited() can use 60+ ms in some cases.

For cleanup_net() use, temporarily release RTNL
while calling the last synchronize_net().

This should be safe, because devices are no longer visible
to other threads at this point.

In any case, the new netdev_lock() / netdev_unlock()
infrastructure that we are adding should allow
to fix potential issues, with a combination
of a per-device mutex and dev->reg_state awareness.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jesse Brandeburg <jbrandeburg@cloudflare.com>
Link: https://patch.msgid.link/20250114205531.967841-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: no longer hold RTNL while calling flush_all_backlogs()
Eric Dumazet [Tue, 14 Jan 2025 20:55:29 +0000 (20:55 +0000)]
net: no longer hold RTNL while calling flush_all_backlogs()

flush_all_backlogs() is called from unregister_netdevice_many_notify()
as part of netdevice dismantles.

This is currently called under RTNL, and can last up to 50 ms
on busy hosts.

There is no reason to hold RTNL at this stage, if our caller
is cleanup_net() : netns are no more visible, devices
are in NETREG_UNREGISTERING state and no other thread
could mess our state while RTNL is temporarily released.

In order to provide isolation, this patch provides a separate
'net_todo_list' for cleanup_net().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jesse Brandeburg <jbrandeburg@cloudflare.com>
Link: https://patch.msgid.link/20250114205531.967841-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: no longer assume RTNL is held in flush_all_backlogs()
Eric Dumazet [Tue, 14 Jan 2025 20:55:28 +0000 (20:55 +0000)]
net: no longer assume RTNL is held in flush_all_backlogs()

flush_all_backlogs() uses per-cpu and static data to hold its
temporary data, on the assumption it is called under RTNL
protection.

Following patch in the series will break this assumption.

Use instead a dynamically allocated piece of memory.

In the unlikely case the allocation fails,
use a boot-time allocated memory.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jesse Brandeburg <jbrandeburg@cloudflare.com>
Link: https://patch.msgid.link/20250114205531.967841-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: expedite synchronize_net() for cleanup_net()
Eric Dumazet [Tue, 14 Jan 2025 20:55:27 +0000 (20:55 +0000)]
net: expedite synchronize_net() for cleanup_net()

cleanup_net() is the single thread responsible
for netns dismantles, and a serious bottleneck.

Before we can get per-netns RTNL, make sure
all synchronize_net() called from this thread
are using rcu_synchronize_expedited().

v3: deal with CONFIG_NET_NS=n

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jesse Brandeburg <jbrandeburg@cloudflare.com>
Link: https://patch.msgid.link/20250114205531.967841-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge branch 'net-use-netdev-lock-to-protect-napi'
Jakub Kicinski [Thu, 16 Jan 2025 03:13:36 +0000 (19:13 -0800)]
Merge branch 'net-use-netdev-lock-to-protect-napi'

Jakub Kicinski says:

====================
net: use netdev->lock to protect NAPI

We recently added a lock member to struct net_device, with a vague
plan to start using it to protect netdev-local state, removing
the need to take rtnl_lock for new configuration APIs.

Lay some groundwork and use this lock for protecting NAPI APIs.

v1: https://lore.kernel.org/20250114035118.110297-1-kuba@kernel.org
====================

Link: https://patch.msgid.link/20250115035319.559603-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonetdev-genl: remove rtnl_lock protection from NAPI ops
Jakub Kicinski [Wed, 15 Jan 2025 03:53:19 +0000 (19:53 -0800)]
netdev-genl: remove rtnl_lock protection from NAPI ops

NAPI lifetime, visibility and config are all fully under
netdev_lock protection now.

Reviewed-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115035319.559603-12-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: protect NAPI config fields with netdev_lock()
Jakub Kicinski [Wed, 15 Jan 2025 03:53:18 +0000 (19:53 -0800)]
net: protect NAPI config fields with netdev_lock()

Protect the following members of netdev and napi by netdev_lock:
 - defer_hard_irqs,
 - gro_flush_timeout,
 - irq_suspend_timeout.

The first two are written via sysfs (which this patch switches
to new lock), and netdev genl which holds both netdev and rtnl locks.

irq_suspend_timeout is only written by netdev genl.

Reviewed-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115035319.559603-11-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: protect napi->irq with netdev_lock()
Jakub Kicinski [Wed, 15 Jan 2025 03:53:17 +0000 (19:53 -0800)]
net: protect napi->irq with netdev_lock()

Take netdev_lock() in netif_napi_set_irq(). All NAPI "control fields"
are now protected by that lock (most of the other ones are set during
napi add/del). The napi_hash_node is fully protected by the hash
spin lock, but close enough for the kdoc...

Reviewed-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115035319.559603-10-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: protect threaded status of NAPI with netdev_lock()
Jakub Kicinski [Wed, 15 Jan 2025 03:53:16 +0000 (19:53 -0800)]
net: protect threaded status of NAPI with netdev_lock()

Now that NAPI instances can't come and go without holding
netdev->lock we can trivially switch from rtnl_lock() to
netdev_lock() for setting netdev->threaded via sysfs.

Note that since we do not lock netdev_lock around sysfs
calls in the core we don't have to "trylock" like we do
with rtnl_lock.

Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115035319.559603-9-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: make netdev netlink ops hold netdev_lock()
Jakub Kicinski [Wed, 15 Jan 2025 03:53:15 +0000 (19:53 -0800)]
net: make netdev netlink ops hold netdev_lock()

In prep for dropping rtnl_lock, start locking netdev->lock in netlink
genl ops. We need to be using netdev->up instead of flags & IFF_UP.

We can remove the RCU lock protection for the NAPI since NAPI list
is protected by netdev->lock already.

Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115035319.559603-8-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: protect NAPI enablement with netdev_lock()
Jakub Kicinski [Wed, 15 Jan 2025 03:53:14 +0000 (19:53 -0800)]
net: protect NAPI enablement with netdev_lock()

Wrap napi_enable() / napi_disable() with netdev_lock().
Provide the "already locked" flavor of the API.

iavf needs the usual adjustment. A number of drivers call
napi_enable() under a spin lock, so they have to be modified
to take netdev_lock() first, then spin lock then call
napi_enable_locked().

Protecting napi_enable() implies that napi->napi_id is protected
by netdev_lock().

Acked-by: Francois Romieu <romieu@fr.zoreil.com> # via-velocity
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115035319.559603-7-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: protect netdev->napi_list with netdev_lock()
Jakub Kicinski [Wed, 15 Jan 2025 03:53:13 +0000 (19:53 -0800)]
net: protect netdev->napi_list with netdev_lock()

Hold netdev->lock when NAPIs are getting added or removed.
This will allow safe access to NAPI instances of a net_device
without rtnl_lock.

Create a family of helpers which assume the lock is already taken.
Switch iavf to them, as it makes extensive use of netdev->lock,
already.

Reviewed-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115035319.559603-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: add netdev->up protected by netdev_lock()
Jakub Kicinski [Wed, 15 Jan 2025 03:53:12 +0000 (19:53 -0800)]
net: add netdev->up protected by netdev_lock()

Some uAPI (netdev netlink) hide net_device's sub-objects while
the interface is down to ensure uniform behavior across drivers.
To remove the rtnl_lock dependency from those uAPIs we need a way
to safely tell if the device is down or up.

Add an indication of whether device is open or closed, protected
by netdev->lock. The semantics are the same as IFF_UP, but taking
netdev_lock around every write to ->flags would be a lot of code
churn.

We don't want to blanket the entire open / close path by netdev_lock,
because it will prevent us from applying it to specific structures -
core helpers won't be able to take that lock from any function
called by the drivers on open/close paths.

So the state of the flag is "pessimistic", as in it may report false
negatives, but never false positives.

Reviewed-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115035319.559603-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: add helpers for lookup and walking netdevs under netdev_lock()
Jakub Kicinski [Wed, 15 Jan 2025 03:53:11 +0000 (19:53 -0800)]
net: add helpers for lookup and walking netdevs under netdev_lock()

Add helpers for accessing netdevs under netdev_lock().
There's some careful handling needed to find the device and lock it
safely, without it getting unregistered, and without taking rtnl_lock
(the latter being the whole point of the new locking, after all).

Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115035319.559603-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: make netdev_lock() protect netdev->reg_state
Jakub Kicinski [Wed, 15 Jan 2025 03:53:10 +0000 (19:53 -0800)]
net: make netdev_lock() protect netdev->reg_state

Protect writes to netdev->reg_state with netdev_lock().
From now on holding netdev_lock() is sufficient to prevent
the net_device from getting unregistered, so code which
wants to hold just a single netdev around no longer needs
to hold rtnl_lock.

We do not protect the NETREG_UNREGISTERED -> NETREG_RELEASED
transition. We'd need to move mutex_destroy(netdev->lock)
to .release, but the real reason is that trying to stop
the unregistration process mid-way would be unsafe / crazy.
Taking references on such devices is not safe, either.
So the intended semantics are to lock REGISTERED devices.

Reviewed-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115035319.559603-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: add netdev_lock() / netdev_unlock() helpers
Jakub Kicinski [Wed, 15 Jan 2025 03:53:09 +0000 (19:53 -0800)]
net: add netdev_lock() / netdev_unlock() helpers

Add helpers for locking the netdev instance, use it in drivers
and the shaper code. This will make grepping for the lock usage
much easier, as we extend the lock to cover more fields.

Reviewed-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Link: https://patch.msgid.link/20250115035319.559603-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: make page_pool_ref_netmem work with net iovs
Pavel Begunkov [Wed, 8 Jan 2025 22:06:22 +0000 (14:06 -0800)]
net: make page_pool_ref_netmem work with net iovs

page_pool_ref_netmem() should work with either netmem representation, but
currently it casts to a page with netmem_to_page(), which will fail with
net iovs. Use netmem_get_pp_ref_count_ref() instead.

Fixes: 8ab79ed50cf1 ("page_pool: devmem support")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: David Wei <dw@davidwei.uk>
Link: https://lore.kernel.org/20250108220644.3528845-2-dw@davidwei.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: wwan: iosm: Fix hibernation by re-binding the driver around it
Maciej S. Szmigiero [Wed, 8 Jan 2025 23:33:50 +0000 (00:33 +0100)]
net: wwan: iosm: Fix hibernation by re-binding the driver around it

Currently, the driver is seriously broken with respect to the
hibernation (S4): after image restore the device is back into
IPC_MEM_EXEC_STAGE_BOOT (which AFAIK means bootloader stage) and needs
full re-launch of the rest of its firmware, but the driver restore
handler treats the device as merely sleeping and just sends it a
wake-up command.

This wake-up command times out but device nodes (/dev/wwan*) remain
accessible.
However attempting to use them causes the bootloader to crash and
enter IPC_MEM_EXEC_STAGE_CD_READY stage (which apparently means "a crash
dump is ready").

It seems that the device cannot be re-initialized from this crashed
stage without toggling some reset pin (on my test platform that's
apparently what the device _RST ACPI method does).

While it would theoretically be possible to rewrite the driver to tear
down the whole MUX / IPC layers on hibernation (so the bootloader does
not crash from improper access) and then re-launch the device on
restore this would require significant refactoring of the driver
(believe me, I've tried), since there are quite a few assumptions
hard-coded in the driver about the device never being partially
de-initialized (like channels other than devlink cannot be closed,
for example).
Probably this would also need some programming guide for this hardware.

Considering that the driver seems orphaned [1] and other people are
hitting this issue too [2] fix it by simply unbinding the PCI driver
before hibernation and re-binding it after restore, much like
USB_QUIRK_RESET_RESUME does for USB devices that exhibit a similar
problem.

Tested on XMM7360 in HP EliteBook 855 G7 both with s2idle (which uses
the existing suspend / resume handlers) and S4 (which uses the new code).

[1]: https://lore.kernel.org/all/c248f0b4-2114-4c61-905f-466a786bdebb@leemhuis.info/
[2]:
https://github.com/xmm7360/xmm7360-pci/issues/211#issuecomment-1804139413

Reviewed-by: Sergey Ryazanov <ryazanov.s.a@gmail.com>
Signed-off-by: Maciej S. Szmigiero <mail@maciej.szmigiero.name>
Link: https://patch.msgid.link/e60287ebdb0ab54c4075071b72568a40a75d0205.1736372610.git.mail@maciej.szmigiero.name
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next...
Jakub Kicinski [Thu, 16 Jan 2025 01:38:04 +0000 (17:38 -0800)]
Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue

Tony Nguyen says:

====================
Intel Wired LAN Driver Updates 2025-01-08 (ice)

This series contains updates to ice driver only.

Przemek reworks implementation so that ice_init_hw() is called before
ice_adapter initialization. The motivation is to have ability to act
on the number of PFs in ice_adapter initialization. This is not done
here but the code is also a bit cleaner.

Michal adds priority to be considered when matching recipes for proper
differentiation.

Konrad adds devlink health reporting for firmware generated events.

R Sundar utilizes string helpers over open coded versions.

Jake adds implementation to utilize a lower latency interface to program
PHY timer when supported.

Additional information can be found on the original cover letter:

  https://lore.kernel.org/intel-wired-lan/20241216145453.333745-1-anton.nadezhdin@intel.com/

Karol adds and allows for different PTP delay values to be used per pin.

* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
  ice: Add in/out PTP pin delays
  ice: implement low latency PHY timer updates
  ice: check low latency PHY timer update firmware capability
  ice: add lock to protect low latency interface
  ice: rename TS_LL_READ* macros to REG_LL_PROXY_H_*
  ice: use read_poll_timeout_atomic in ice_read_phy_tstamp_ll_e810
  ice: use string choice helpers
  ice: add fw and port health reporters
  ice: add recipe priority check in search
  ice: ice_probe: init ice_adapter after HW init
  ice: minor: rename goto labels from err to unroll
  ice: split ice_init_hw() out from ice_init_dev()
  ice: c827: move wait for FW to ice_init_hw()
====================

Link: https://patch.msgid.link/20250115000844.714530-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoinet: ipmr: fix data-races
Eric Dumazet [Tue, 14 Jan 2025 22:10:49 +0000 (22:10 +0000)]
inet: ipmr: fix data-races

Following fields of 'struct mr_mfc' can be updated
concurrently (no lock protection) from ip_mr_forward()
and ip6_mr_forward()

- bytes
- pkt
- wrong_if
- lastuse

They also can be read from other functions.

Convert bytes, pkt and wrong_if to atomic_long_t,
and use READ_ONCE()/WRITE_ONCE() for lastuse.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250114221049.1190631-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge branch 'bnxt_en-implement-tcp-data-split-and-thresh-option'
Jakub Kicinski [Wed, 15 Jan 2025 22:42:14 +0000 (14:42 -0800)]
Merge branch 'bnxt_en-implement-tcp-data-split-and-thresh-option'

Taehee Yoo says:

====================
bnxt_en: implement tcp-data-split and thresh option

This series implements hds-thresh ethtool command.
This series also implements backend of tcp-data-split and
hds-thresh ethtool command for bnxt_en driver.
These ethtool commands are mandatory options for device memory TCP.

NICs that use the bnxt_en driver support tcp-data-split feature named
HDS(header-data-split).
But there is no implementation for the HDS to enable by ethtool.
Only getting the current HDS status is implemented and the HDS is just
automatically enabled only when either LRO, HW-GRO, or JUMBO is enabled.
The hds_threshold follows the rx-copybreak value but it wasn't
changeable.

Currently, bnxt_en driver enables tcp-data-split by default but not
always work.
There is hds_threshold value, which indicates that a packet size is
larger than this value, a packet will be split into header and data.
hds_threshold value has been 256, which is a default value of
rx-copybreak value too.
The rx-copybreak value hasn't been allowed to change so the
hds_threshold too.

This patchset decouples hds_threshold and rx-copybreak first.
and make tcp-data-split, rx-copybreak, and
hds-thresh configurable independently.

But the default configuration is the same.
The default value of rx-copybreak is 256 and default
hds-thresh is also 256.

The behavior of rx-copybreak will probably be changed in almost all
drivers. If HDS is not enabled, rx-copybreak copies both header and
payload from a page.
But if HDS is enabled, rx-copybreak copies only header from the first
page.
Due to this change, it may need to disable(set to 0) rx-copybreak when
the HDS is required.

There are several related options.
TPA(HW-GRO, LRO), JUMBO, jumbo_thresh(firmware command), and Aggregation
Ring.

The aggregation ring is fundamental to these all features.
When gro/lro/jumbo packets are received, NIC receives the first packet
from the normal ring.
follow packets come from the aggregation ring.

These features are working regardless of HDS.
If HDS is enabled, the first packet contains the header only, and the
following packets contain only payload.
So, HW-GRO/LRO is working regardless of HDS.

There is another threshold value, which is jumbo_thresh.
This is very similar to hds_thresh, but jumbo thresh doesn't split
header and data.
It just split the first and following data based on length.
When NIC receives 1500 sized packet, and jumbo_thresh is 256(default, but
follows rx-copybreak),
the first data is 256 and the following packet size is 1500-256.

Before this patch, at least if one of GRO, LRO, and JUMBO flags is
enabled, the Aggregation ring will be enabled.
If the Aggregation ring is enabled, both hds_threshold and
jumbo_thresh are set to the default value of rx-copybreak.

So, GRO, LRO, JUMBO frames, they larger than 256 bytes, they will
be split into header and data if the protocol is TCP or UDP.
for the other protocol, jumbo_thresh works instead of hds_thresh.

This means that tcp-data-split relies on the GRO, LRO, and JUMBO flags.
But by this patch, tcp-data-split no longer relies on these flags.
If the tcp-data-split is enabled, the Aggregation ring will be
enabled.
Also, hds_threshold no longer follows rx-copybreak value, it will
be set to the hds-thresh value by user-space, but the
default value is still 256.

If the protocol is TCP or UDP and the HDS is disabled and Aggregation
ring is enabled, a packet will be split into several pieces due to
jumbo_thresh.

When single buffer XDP is attached, tcp-data-split is automatically
disabled.

LRO, GRO, and JUMBO are tested with BCM57414, BCM57504 and the firmware
version is 230.0.157.0.
I couldn't find any specification about minimum and maximum value
of hds_threshold, but from my test result, it was about 0 ~ 1023.
It means, over 1023 sized packets will be split into header and data if
tcp-data-split is enabled regardless of hds_treshold value.
When hds_threshold is 1500 and received packet size is 1400, HDS should
not be activated, but it is activated.
The maximum value of hds-thresh value is 256 because it
has been working. It was decided very conservatively.

I checked out the tcp-data-split(HDS) works independently of GRO, LRO,
JUMBO.
Also, I checked out tcp-data-split should be disabled automatically
when XDP is attached and disallowed to enable it again while XDP is
attached. I tested ranged values from min to max for
hds-thresh and rx-copybreak, and it works.
hds-thresh from 0 to 256, and rx-copybreak 0 to 256.
When testing this patchset, I checked skb->data, skb->data_len, and
nr_frags values.

By this patchset, bnxt_en driver supports a force enable tcp-data-split,
but it doesn't support for disable tcp-data-split.
When tcp-data-split is explicitly enabled, HDS works always.
When tcp-data-split is unknown, it depends on the current
configuration of LRO/GRO/JUMBO.

1/10 patch adds a new hds_config member in the ethtool_netdev_state.
It indicates that what tcp-data-split value is really updated from
userspace.
So the driver can distinguish a passed tcp-data-split value is
came from user or driver itself.

2/10 patch adds hds-thresh command in the ethtool.
This threshold value indicates if a received packet size is larger
than this threshold, the packet's header and payload will be split.
Example:
   # ethtool -G <interface name> hds-thresh <value>
This option can not be used when tcp-data-split is disabled or not
supported.
   # ethtool -G enp14s0f0np0 tcp-data-split on hds-thresh 256
   # ethtool -g enp14s0f0np0
   Ring parameters for enp14s0f0np0:
   Pre-set maximums:
   ...
   Current hardware settings:
   ...
   TCP data split:         on
   HDS thresh:  256

3/10, 4/10 add condition checks for devmem and ethtool.
If tcp-data-split is disabled or threshold value is not zero, setup of
devmem will be failed.
Also, tcp-data-split and hds-thresh will not be changed
while devmem is running.

5/10 add condition checks for netdev core.
It disallows setup single buffer XDP program when tcp-data-split is
enabled.

6/10 patch implements .{set, get}_tunable() in the bnxt_en.
The bnxt_en driver has been supporting the rx-copybreak feature but is
not configurable, Only the default rx-copybreak value has been working.
So, it changes the bnxt_en driver to be able to configure
the rx-copybreak value.

7/10 patch adds an implementation of tcp-data-split ethtool
command.
The HDS relies on the Aggregation ring, which is automatically enabled
when either LRO, GRO, or large mtu is configured.
So, if the Aggregation ring is enabled, HDS is automatically enabled by
it.

8/10 patch adds the implementation of hds-thresh logic
in the bnxt_en driver.
The default value is 256, which used to be the default rx-copybreak
value.

9/10 add HDS feature implementation for netdevsim.
HDS feature is not common so far. Only a few NICs support this feature.
There is no way to test HDS core-API unless we have proper hw NIC.
In order to test HDS core-API without  hw NIC, netdevsim can be used.
It implements HDS control and data plane for netdevsim.

10/10 add selftest for HDS(tcp-data-split and HDS-thresh).
The tcp-data-split tests are the same with
`ethtool -G tcp-data-split <on | auto>`
HDS-thresh tests are same with `ethtool -G eth0 hds-thresh <0 - MAX>`

This series is tested with BCM57504 and netdevsim.
====================

Link: https://patch.msgid.link/20250114142852.3364986-1-ap420073@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoselftest: net-drv: hds: add test for HDS feature
Taehee Yoo [Tue, 14 Jan 2025 14:28:52 +0000 (14:28 +0000)]
selftest: net-drv: hds: add test for HDS feature

HDS/HDS-thresh features were updated/implemented. so add some tests for
these features.

HDS tests are the same with `ethtool -G eth0 tcp-data-split <on | off |
auto >` but `auto` depends on driver specification.
So, it doesn't include `auto` case.

HDS-thresh tests are same with `ethtool -G eth0 hds-thresh <0 - MAX>`
It includes both 0 and MAX cases. It also includes exceed case, MAX + 1.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Link: https://patch.msgid.link/20250114142852.3364986-11-ap420073@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonetdevsim: add HDS feature
Taehee Yoo [Tue, 14 Jan 2025 14:28:51 +0000 (14:28 +0000)]
netdevsim: add HDS feature

HDS options(tcp-data-split, hds-thresh) have dependencies between other
features like XDP. Basic dependencies are checked in the core API.
netdevsim is very useful to check basic dependencies.

The default tcp-data-split mode is UNKNOWN but netdevsim driver
returns ENABLED when ethtool dumps tcp-data-split mode.
The default value of HDS threshold is 0 and the maximum value is 1024.

ethtool shows like this.

ethtool -g eni1np1
Ring parameters for eni1np1:
Pre-set maximums:
...
HDS thresh:             1024
Current hardware settings:
...
TCP data split:         on
HDS thresh:             0

ethtool -G eni1np1 tcp-data-split on hds-thresh 1024
ethtool -g eni1np1
Ring parameters for eni1np1:
Pre-set maximums:
...
HDS thresh:             1024
Current hardware settings:
...
TCP data split:         on
HDS thresh:             1024

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Link: https://patch.msgid.link/20250114142852.3364986-10-ap420073@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agobnxt_en: add support for hds-thresh ethtool command
Taehee Yoo [Tue, 14 Jan 2025 14:28:50 +0000 (14:28 +0000)]
bnxt_en: add support for hds-thresh ethtool command

The bnxt_en driver has configured the hds_threshold value automatically
when TPA is enabled based on the rx-copybreak default value.
Now the hds-thresh ethtool command is added, so it adds an
implementation of hds-thresh option.

Configuration of the hds-thresh is applied only when
the tcp-data-split is enabled. The default value of
hds-thresh is 256, which is the default value of
rx-copybreak, which used to be the hds_thresh value.

The maximum hds-thresh is 1023.

   # Example:
   # ethtool -G enp14s0f0np0 tcp-data-split on hds-thresh 256
   # ethtool -g enp14s0f0np0
   Ring parameters for enp14s0f0np0:
   Pre-set maximums:
   ...
   HDS thresh:  1023
   Current hardware settings:
   ...
   TCP data split:         on
   HDS thresh:  256

Tested-by: Stanislav Fomichev <sdf@fomichev.me>
Tested-by: Andy Gospodarek <gospo@broadcom.com>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20250114142852.3364986-9-ap420073@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agobnxt_en: add support for tcp-data-split ethtool command
Taehee Yoo [Tue, 14 Jan 2025 14:28:49 +0000 (14:28 +0000)]
bnxt_en: add support for tcp-data-split ethtool command

NICs that uses bnxt_en driver supports tcp-data-split feature by the
name of HDS(header-data-split).
But there is no implementation for the HDS to enable by ethtool.
Only getting the current HDS status is implemented and The HDS is just
automatically enabled only when either LRO, HW-GRO, or JUMBO is enabled.
The hds_threshold follows rx-copybreak value. and it was unchangeable.

This implements `ethtool -G <interface name> tcp-data-split <value>`
command option.
The value can be <on> and <auto>.
The value is <auto> and one of LRO/GRO/JUMBO is enabled, HDS is
automatically enabled and all LRO/GRO/JUMBO are disabled, HDS is
automatically disabled.

HDS feature relies on the aggregation ring.
So, if HDS is enabled, the bnxt_en driver initializes the aggregation ring.
This is the reason why BNXT_FLAG_AGG_RINGS contains HDS condition.

Acked-by: Jakub Kicinski <kuba@kernel.org>
Tested-by: Stanislav Fomichev <sdf@fomichev.me>
Tested-by: Andy Gospodarek <gospo@broadcom.com>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20250114142852.3364986-8-ap420073@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agobnxt_en: add support for rx-copybreak ethtool command
Taehee Yoo [Tue, 14 Jan 2025 14:28:48 +0000 (14:28 +0000)]
bnxt_en: add support for rx-copybreak ethtool command

The bnxt_en driver supports rx-copybreak, but it couldn't be set by
userspace. Only the default value(256) has worked.
This patch makes the bnxt_en driver support following command.
`ethtool --set-tunable <devname> rx-copybreak <value> ` and
`ethtool --get-tunable <devname> rx-copybreak`.

By this patch, hds_threshol is set to the rx-copybreak value.
But it will be set by `ethtool -G eth0 hds-thresh N`
in the next patch.

Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Brett Creeley <brett.creeley@amd.com>
Tested-by: Stanislav Fomichev <sdf@fomichev.me>
Tested-by: Andy Gospodarek <gospo@broadcom.com>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20250114142852.3364986-7-ap420073@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: disallow setup single buffer XDP when tcp-data-split is enabled.
Taehee Yoo [Tue, 14 Jan 2025 14:28:47 +0000 (14:28 +0000)]
net: disallow setup single buffer XDP when tcp-data-split is enabled.

When a single buffer XDP is attached, NIC should guarantee only single
page packets will be received.
tcp-data-split feature splits packets into header and payload. single
buffer XDP can't handle it properly.
So attaching single buffer XDP should be disallowed when tcp-data-split
is enabled.

Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Link: https://patch.msgid.link/20250114142852.3364986-6-ap420073@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: ethtool: add ring parameter filtering
Taehee Yoo [Tue, 14 Jan 2025 14:28:46 +0000 (14:28 +0000)]
net: ethtool: add ring parameter filtering

While the devmem is running, the tcp-data-split and
hds-thresh configuration should not be changed.
If user tries to change tcp-data-split and threshold value while the
devmem is running, it fails and shows extack message.

Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Tested-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Link: https://patch.msgid.link/20250114142852.3364986-5-ap420073@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: devmem: add ring parameter filtering
Taehee Yoo [Tue, 14 Jan 2025 14:28:45 +0000 (14:28 +0000)]
net: devmem: add ring parameter filtering

If driver doesn't support ring parameter or tcp-data-split configuration
is not sufficient, the devmem should not be set up.
Before setup the devmem, tcp-data-split should be ON and hds-thresh
value should be 0.

Tested-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Link: https://patch.msgid.link/20250114142852.3364986-4-ap420073@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: ethtool: add support for configuring hds-thresh
Taehee Yoo [Tue, 14 Jan 2025 14:28:44 +0000 (14:28 +0000)]
net: ethtool: add support for configuring hds-thresh

The hds-thresh option configures the threshold value of
the header-data-split.
If a received packet size is larger than this threshold value, a packet
will be split into header and payload.
The header indicates TCP and UDP header, but it depends on driver spec.
The bnxt_en driver supports HDS(Header-Data-Split) configuration at
FW level, affecting TCP and UDP too.
So, If hds-thresh is set, it affects UDP and TCP packets.

Example:
   # ethtool -G <interface name> hds-thresh <value>

   # ethtool -G enp14s0f0np0 tcp-data-split on hds-thresh 256
   # ethtool -g enp14s0f0np0
   Ring parameters for enp14s0f0np0:
   Pre-set maximums:
   ...
   HDS thresh:  1023
   Current hardware settings:
   ...
   TCP data split:         on
   HDS thresh:  256

The default/min/max values are not defined in the ethtool so the drivers
should define themself.
The 0 value means that all TCP/UDP packets' header and payload
will be split.

Tested-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Link: https://patch.msgid.link/20250114142852.3364986-3-ap420073@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: ethtool: add hds_config member in ethtool_netdev_state
Taehee Yoo [Tue, 14 Jan 2025 14:28:43 +0000 (14:28 +0000)]
net: ethtool: add hds_config member in ethtool_netdev_state

When tcp-data-split is UNKNOWN mode, drivers arbitrarily handle it.
For example, bnxt_en driver automatically enables if at least one of
LRO/GRO/JUMBO is enabled.
If tcp-data-split is UNKNOWN and LRO is enabled, a driver returns
ENABLES of tcp-data-split, not UNKNOWN.
So, `ethtool -g eth0` shows tcp-data-split is enabled.

The problem is in the setting situation.
In the ethnl_set_rings(), it first calls get_ringparam() to get the
current driver's config.
At that moment, if driver's tcp-data-split config is UNKNOWN, it returns
ENABLE if LRO/GRO/JUMBO is enabled.
Then, it sets values from the user and driver's current config to
kernel_ethtool_ringparam.
Last it calls .set_ringparam().
The driver, especially bnxt_en driver receives
ETHTOOL_TCP_DATA_SPLIT_ENABLED.
But it can't distinguish whether it is set by the user or just the
current config.

When user updates ring parameter, the new hds_config value is updated
and current hds_config value is stored to old_hdsconfig.
Driver's .set_ringparam() callback can distinguish a passed
tcp-data-split value is came from user explicitly.
If .set_ringparam() is failed, hds_config is rollbacked immediately.

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Link: https://patch.msgid.link/20250114142852.3364986-2-ap420073@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: loopback: Hold rtnl_net_lock() in blackhole_netdev_init().
Kuniyuki Iwashima [Tue, 14 Jan 2025 08:13:52 +0000 (17:13 +0900)]
net: loopback: Hold rtnl_net_lock() in blackhole_netdev_init().

blackhole_netdev is the global device in init_net.

Let's hold rtnl_net_lock(&init_net) in blackhole_netdev_init().

While at it, the unnecessary dev_net_set() call is removed, which
is done in alloc_netdev_mqs().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250114081352.47404-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoselftests/net/forwarding: teamd command not found
Alessandro Zanni [Tue, 14 Jan 2025 00:33:16 +0000 (01:33 +0100)]
selftests/net/forwarding: teamd command not found

Running "make kselftest TARGETS=net/forwarding" results in
multiple ccurrences of the same error:
- ./lib.sh: line 787: teamd: command not found

This patch adds the variable $REQUIRE_TEAMD in every test that uses the
command teamd and checks the $REQUIRE_TEAMD variable in the file "lib.sh"
to skip the test if the command is not installed.

Signed-off-by: Alessandro Zanni <alessandro.zanni87@gmail.com>
Link: https://patch.msgid.link/20250114003323.97207-1-alessandro.zanni87@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge branch 'eth-fbnic-add-hardware-monitoring-support'
Jakub Kicinski [Wed, 15 Jan 2025 22:14:15 +0000 (14:14 -0800)]
Merge branch 'eth-fbnic-add-hardware-monitoring-support'

Sanman Pradhan says:

====================
eth: fbnic: Add hardware monitoring support

This patch series adds hardware monitoring support to the fbnic driver.
It implements support for reading temperature and voltage sensors via
firmware requests, and exposes this data through the HWMON interface.

The series is structured as follows:

Patch 1: Adds completion infrastructure for firmware requests
Patch 2: Implements TSENE sensor message handling
Patch 3: Adds HWMON interface support

Output:
$ ls -l /sys/class/hwmon/hwmon1/
total 0
lrwxrwxrwx 1 root root    0 Sep 10 00:00 device -> ../../../0000:01:00.0
-r--r--r-- 1 root root 4096 Sep 10 00:00 in0_input
-r--r--r-- 1 root root 4096 Sep 10 00:00 name
lrwxrwxrwx 1 root root    0 Sep 10 00:00 subsystem -> ../../../../../../class/hwmon
-r--r--r-- 1 root root 4096 Sep 10 00:00 temp1_input
-rw-r--r-- 1 root root 4096 Sep 10 00:00 uevent

$ cat /sys/class/hwmon/hwmon1/temp1_input
40480
$ cat /sys/class/hwmon/hwmon1/in0_input
750
====================

Link: https://patch.msgid.link/20250114000705.2081288-1-sanman.p211993@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoeth: fbnic: Add hardware monitoring support via HWMON interface
Sanman Pradhan [Tue, 14 Jan 2025 00:07:05 +0000 (16:07 -0800)]
eth: fbnic: Add hardware monitoring support via HWMON interface

This patch adds support for hardware monitoring to the fbnic driver,
allowing for temperature and voltage sensor data to be exposed to
userspace via the HWMON interface. The driver registers a HWMON device
and provides callbacks for reading sensor data, enabling system
admins to monitor the health and operating conditions of fbnic.

Signed-off-by: Sanman Pradhan <sanman.p211993@gmail.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Link: https://patch.msgid.link/20250114000705.2081288-4-sanman.p211993@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoeth: fbnic: hwmon: Add support for reading temperature and voltage sensors
Sanman Pradhan [Tue, 14 Jan 2025 00:07:04 +0000 (16:07 -0800)]
eth: fbnic: hwmon: Add support for reading temperature and voltage sensors

Add support for reading temperature and voltage sensor data from firmware
by implementing a new TSENE message type and response parsing. This adds
message handler infrastructure to transmit sensor read requests and parse
responses. The sensor data will be exposed through the driver's hwmon interface.

Signed-off-by: Sanman Pradhan <sanman.p211993@gmail.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Link: https://patch.msgid.link/20250114000705.2081288-3-sanman.p211993@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoeth: fbnic: hwmon: Add completion infrastructure for firmware requests
Sanman Pradhan [Tue, 14 Jan 2025 00:07:03 +0000 (16:07 -0800)]
eth: fbnic: hwmon: Add completion infrastructure for firmware requests

Add infrastructure to support firmware request/response handling with
completions. Add a completion structure to track message state including
message type for matching, completion for waiting for response, and
result for error propagation. Use existing spinlock to protect the writes.
The data from the various response types will be added to the "union u"
by subsequent commits.

Signed-off-by: Sanman Pradhan <sanman.p211993@gmail.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Link: https://patch.msgid.link/20250114000705.2081288-2-sanman.p211993@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge branch 'net-lan969x-add-fdma-support'
Jakub Kicinski [Wed, 15 Jan 2025 22:13:36 +0000 (14:13 -0800)]
Merge branch 'net-lan969x-add-fdma-support'

Daniel Machon says:

====================
net: lan969x: add FDMA support

== Description:

This series is the last of a multi-part series, that prepares and adds
support for the new lan969x switch driver.

The upstreaming efforts has been split into multiple series:

        1) Prepare the Sparx5 driver for lan969x (merged)

        2) Add support for lan969x (same basic features as Sparx5
           provides excl. FDMA and VCAP, merged).

        3) Add lan969x VCAP functionality (merged).

        4) Add RGMII support (merged).

    --> 5) Add FDMA support.

== FDMA support:

The lan969x switch device uses the same FDMA engine as the Sparx5 switch
device, with the same number of channels etc. This means we can utilize
the newly added FDMA library, that is already in use by the lan966x and
sparx5 drivers.

As previous lan969x series, the FDMA implementation will hook into the
Sparx5 implementation where possible, however both RX and TX handling
will be done differently on lan969x and therefore requires a separate
implementation of the RX and TX path.

Details are in the commit description of the individual patches

== Patch breakdown:

Patch #1: Enable FDMA support on lan969x
Patch #2: Split start()/stop() functions
Patch #3: Activate TX FDMA in start()
Patch #4: Ops out a few functions that differ on the two platforms
Patch #5: Add FDMA implementation for lan969x

v1: https://lore.kernel.org/20250109-sparx5-lan969x-switch-driver-5-v1-0-13d6d8451e63@microchip.com
====================

Link: https://patch.msgid.link/20250113-sparx5-lan969x-switch-driver-5-v2-0-c468f02fd623@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: lan969x: add FDMA implementation
Daniel Machon [Mon, 13 Jan 2025 19:36:09 +0000 (20:36 +0100)]
net: lan969x: add FDMA implementation

The lan969x switch device supports manual frame injection and extraction
to and from the switch core, using a number of injection and extraction
queues.  This technique is currently supported, but delivers poor
performance compared to Frame DMA (FDMA).

This lan969x implementation of FDMA, hooks into the existing FDMA for
Sparx5, but requires its own RX and TX handling, as lan969x does not
support the same native cache coherency that Sparx5 does. Effectively,
this means that we are going to use the DMA mapping API for mapping and
unmapping TX buffers. The RX loop will utilize the page pool API for
efficient RX handling. Other than that, the implementation is largely
the same, and utilizes the FDMA library for DCB and DB handling.

Some numbers:

Manual injection/extraction (before this series):

// iperf3 -c 1.0.1.1

[ ID] Interval           Transfer     Bitrate
[  5]   0.00-10.02  sec   345 MBytes   289 Mbits/sec  sender
[  5]   0.00-10.06  sec   345 MBytes   288 Mbits/sec  receiver

FDMA (after this series):

// iperf3 -c 1.0.1.1

[ ID] Interval           Transfer     Bitrate
[  5]   0.00-10.03  sec  1.10 GBytes   940 Mbits/sec  sender
[  5]   0.00-10.07  sec  1.10 GBytes   936 Mbits/sec  receiver

Reviewed-by: Steen Hegelund <Steen.Hegelund@microchip.com>
Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
Link: https://patch.msgid.link/20250113-sparx5-lan969x-switch-driver-5-v2-5-c468f02fd623@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: sparx5: ops out certain FDMA functions
Daniel Machon [Mon, 13 Jan 2025 19:36:08 +0000 (20:36 +0100)]
net: sparx5: ops out certain FDMA functions

We are going to implement the RX  and TX paths a bit differently on
lan969x and therefore need to introduce new ops for FDMA functions:
init, deinit, xmit and poll. Assign the Sparx5 equivalents for these and
update the code throughout. Also add a 'struct net_device' argument to
the xmit() function, as we will be needing that for lan969x.

Reviewed-by: Steen Hegelund <Steen.Hegelund@microchip.com>
Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
Link: https://patch.msgid.link/20250113-sparx5-lan969x-switch-driver-5-v2-4-c468f02fd623@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: sparx5: activate FDMA tx in start()
Daniel Machon [Mon, 13 Jan 2025 19:36:07 +0000 (20:36 +0100)]
net: sparx5: activate FDMA tx in start()

The function sparx5_fdma_tx_activate() is responsible for configuring
the TX FDMA instance and activating the channel. TX activation has
previously been done in the xmit() function, when the first frame is
transmitted. Now that we have separate functions for starting and
stopping the FDMA, it seems reasonable to move the TX activation to the
start function. This change has no implications on the functionality.

Reviewed-by: Steen Hegelund <Steen.Hegelund@microchip.com>
Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
Link: https://patch.msgid.link/20250113-sparx5-lan969x-switch-driver-5-v2-3-c468f02fd623@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: sparx5: split sparx5_fdma_{start(),stop()}
Daniel Machon [Mon, 13 Jan 2025 19:36:06 +0000 (20:36 +0100)]
net: sparx5: split sparx5_fdma_{start(),stop()}

The two functions: sparx5_fdma_{start(),stop()} are responsible for a
number of things, namely: allocation and initialization of FDMA buffers,
activation FDMA channels in hardware and activation of the NAPI
instance.

This patch splits the buffer allocation and initialization into init and
deinit functions, and the channel and NAPI activation into start and
stop functions. This serves two purposes: 1) the start() and stop()
functions can be reused for lan969x and 2) prepares for future MTU
change support, where we must be able to stop and start the FDMA
channels and NAPI instance, without free'ing and reallocating the FDMA
buffers.

Reviewed-by: Steen Hegelund <Steen.Hegelund@microchip.com>
Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
Link: https://patch.msgid.link/20250113-sparx5-lan969x-switch-driver-5-v2-2-c468f02fd623@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: sparx5: enable FDMA on lan969x
Daniel Machon [Mon, 13 Jan 2025 19:36:05 +0000 (20:36 +0100)]
net: sparx5: enable FDMA on lan969x

In a previous series, we made sure that FDMA was not initialized and
started on lan969x. Now that we are going to support it, undo that
change. In addition, make sure the chip ID check is only applicable on
Sparx5, as this is a check that is only relevant on this platform.

Reviewed-by: Steen Hegelund <Steen.Hegelund@microchip.com>
Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
Link: https://patch.msgid.link/20250113-sparx5-lan969x-switch-driver-5-v2-1-c468f02fd623@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: ethernet: xgbe: re-add aneg to supported features in PHY quirks
Heiner Kallweit [Sun, 12 Jan 2025 21:59:59 +0000 (22:59 +0100)]
net: ethernet: xgbe: re-add aneg to supported features in PHY quirks

In 4.19, before the switch to linkmode bitmaps, PHY_GBIT_FEATURES
included feature bits for aneg and TP/MII ports.

 SUPPORTED_TP | \
 SUPPORTED_MII)

 SUPPORTED_10baseT_Full)

 SUPPORTED_100baseT_Full)

 SUPPORTED_1000baseT_Full)

 PHY_100BT_FEATURES | \
 PHY_DEFAULT_FEATURES)

 PHY_1000BT_FEATURES)

Referenced commit expanded PHY_GBIT_FEATURES, silently removing
PHY_DEFAULT_FEATURES. The removed part can be re-added by using
the new PHY_GBIT_FEATURES definition.
Not clear to me is why nobody seems to have noticed this issue.

I stumbled across this when checking what it takes to make
phy_10_100_features_array et al private to phylib.

Fixes: d0939c26c53a ("net: ethernet: xgbe: expand PHY_GBIT_FEAUTRES")
Cc: stable@vger.kernel.org
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://patch.msgid.link/46521973-7738-4157-9f5e-0bb6f694acba@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge branch 'net-phylink-fix-pcs-without-autoneg'
Jakub Kicinski [Wed, 15 Jan 2025 21:23:32 +0000 (13:23 -0800)]
Merge branch 'net-phylink-fix-pcs-without-autoneg'

Russell King says:

====================
net: phylink: fix PCS without autoneg

Eric Woudstra reported that a PCS attached using 2500base-X does not
see link when phylink is using in-band mode, but autoneg is disabled,
despite there being a valid 2500base-X signal being received. We have
these settings:

act_link_an_mode = MLO_AN_INBAND
pcs_neg_mode = PHYLINK_PCS_NEG_INBAND_DISABLED

Eric diagnosed it to phylink_decode_c37_word() setting state->link
false because the full-duplex bit isn't set in the non-existent link
partner advertisement word (which doesn't exist because in-band
autoneg is disabled!)

The test in phylink_mii_c22_pcs_decode_state() is supposed to catch
this state, but since we converted PCS to use neg_mode, testing the
Autoneg in the local advertisement is no longer sufficient - we need
to be looking at the neg_mode, which currently isn't provided.

We need to provide this via the .pcs_get_state() method, and this
will require modifying all PCS implementations to add the extra
argument to this method.

Patch 1 uses the PCS neg_mode in phylink_mac_pcs_get_state() to correct
the now obsolute usage of the Autoneg bit in the advertisement.

Patch 2 passes neg_mode into the .pcs_get_state() method, and updates
all users.

Patch 3 adds neg_mode as an argument to the various clause 22 state
decoder functions in phylink, modifying drivers to pass the neg_mode
through.

Patch 4 makes use of phylink_mii_c22_pcs_decode_state() rather than
using the Autoneg bit in the advertising field.

Patch 5 may be required for Eric's case - it ensures that we report
the correct state for interface types that we support only one set
of modes for when autoneg is disabled.
====================

Link: https://patch.msgid.link/Z4TbR93B-X8A8iHe@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: phylink: provide fixed state for 1000base-X and 2500base-X
Russell King (Oracle) [Mon, 13 Jan 2025 09:22:44 +0000 (09:22 +0000)]
net: phylink: provide fixed state for 1000base-X and 2500base-X

When decoding clause 22 state, if in-band is disabled and using either
1000base-X or 2500base-X, rather than reporting link-down, we know the
speed, and we only support full duplex. Pause modes taken from XPCS.

This fixes a problem reported by Eric Woudstra.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/E1tXGei-000EtL-Fn@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: phylink: use neg_mode in phylink_mii_c22_pcs_decode_state()
Russell King (Oracle) [Mon, 13 Jan 2025 09:22:39 +0000 (09:22 +0000)]
net: phylink: use neg_mode in phylink_mii_c22_pcs_decode_state()

Rather than using the state of the Autoneg bit, which is unreliable
with the new PCS neg mode support, use the passed neg_mode to decide
whether to decode the link partner advertisement data.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/E1tXGed-000EtF-CN@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: phylink: pass neg_mode into c22 state decoder
Russell King (Oracle) [Mon, 13 Jan 2025 09:22:34 +0000 (09:22 +0000)]
net: phylink: pass neg_mode into c22 state decoder

Pass the current neg_mode into phylink_mii_c22_pcs_get_state() and
phylink_mii_c22_pcs_decode_state(). Update all users of phylink PCS
that use these functions.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/E1tXGeY-000Et9-8g@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>