www.infradead.org Git - users/jedix/linux-maple.git/log

eth: fbnic: wrap tx queue stats in a struct

The queue stats struct is used for Rx and Tx queues. Wrap
the Tx stats in a struct and a union, so that we can reuse
the same space for Rx stats on Rx queues.

This also makes it easy to add an assert to the stat handling
code to catch new stats not being aggregated on shutdown.

Acked-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20250211181356.580800-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: report csum_complete via qstats

Commit 13c7c941e729 ("netdev: add qstat for csum complete") reserved
the entry for csum complete in the qstats uAPI. Start reporting this
value now that we have a driver which needs it.

Reviewed-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20250211181356.580800-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'use-phylib-for-reset-randomization-and-adjustable-polling'

Oleksij Rempel says:

====================
Use PHYlib for reset randomization and adjustable polling

This patch set tackles a DP83TG720 reset lock issue and improves PHY
polling. Rather than adding a separate polling worker to randomize PHY
resets, I chose to extend the PHYlib framework - which already handles
most of the needed functionality - with adjustable polling. This
approach not only addresses the DP83TG720-specific problem (where
synchronized resets can lock the link) but also lays the groundwork for
optimizing PHY stats polling across all PHY drivers. With generic PHY
stats coming in, we can adjust the polling interval based on hardware
characteristics, such as using longer intervals for PHYs with stable HW
counters or shorter ones for high-speed links prone to counter
overflows.

Patch version changes are tracked in separate patches.
====================

Link: https://patch.msgid.link/20250210082358.200751-1-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: dp83tg720: Add randomized polling intervals for link detection

Address the limitations of the DP83TG720 PHY, which cannot reliably
detect or report a stable link state. To handle this, the PHY must be
periodically reset when the link is down. However, synchronized reset
intervals between the PHY and its link partner can result in a deadlock,
preventing the link from re-establishing.

This change introduces a randomized polling interval when the link is
down to desynchronize resets between link partners.

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20250210082358.200751-3-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: Add support for driver-specific next update time

Introduce the `phy_get_next_update_time` function to allow PHY drivers
to dynamically determine the time (in jiffies) until the next state
update event. This enables more flexible and adaptive polling intervals
based on the link state or other conditions.

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20250210082358.200751-2-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'rate-management-on-traffic-classes-misc'

Tariq Toukan says:

====================
mlx5 misc

Patches 1-3 by William reduce the memory consumption for representors to
achieve better scalability.

Patches 4-5 by Akiva expose ICM memory consumption per function.

Patches 6-8 expose helpful information on RSS resources in devlink RX
reporter diagnose.

Patches 9-10 are simple enhancements by Alex Lazar.
====================

Link: https://patch.msgid.link/20250209101716.112774-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: XDP, Enable TX side XDP multi-buffer support

In XDP scenarios, fragmented packets can occur if the MTU is larger
than the page size, even when the packet size fits within the linear
part.
If XDP multi-buffer support is disabled, the fragmented part won't be
handled in the TX flow, leading to packet drops.

Since XDP multi-buffer support is always available, this commit removes
the conditional check for enabling it.
This ensures that XDP multi-buffer support is always enabled,
regardless of the `is_xdp_mb` parameter, and guarantees the handling of
fragmented packets in such scenarios.

Signed-off-by: Alexei Lazar <alazar@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250209101716.112774-16-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: Extend Ethtool loopback selftest to support non-linear SKB

Current loopback test validation ignores non-linear SKB case in
the SKB access, which can lead to failures in scenarios such as
when HW GRO is enabled.
Linearize the SKB so both cases will be handled.

Signed-off-by: Alexei Lazar <alazar@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250209101716.112774-15-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: Expose RSS via devlink rx reporter diagnose

Underneath "rx resources" tag expose RSS diagnostic information. For
each RSS expose its rqtn, TIRs and inner TIRs.

$ devlink health diagnose auxiliary/mlx5_core.eth.0/65535 reporter rx
.......
RSS:
    Index: 0 rqtn: 0
    TIRs Numbers:
        tt: TT_IPV4_TCP tirn: 0
        tt: TT_IPV6_TCP tirn: 1
        tt: TT_IPV4_UDP tirn: 2
        tt: TT_IPV6_UDP tirn: 3
        tt: TT_IPV4_IPSEC_AH tirn: 4
        tt: TT_IPV6_IPSEC_AH tirn: 5
        tt: TT_IPV4_IPSEC_ESP tirn: 6
        tt: TT_IPV6_IPSEC_ESP tirn: 7
        tt: TT_IPV4 tirn: 8
        tt: TT_IPV6 tirn: 9
    Inner TIRs Numbers:
        tt: TT_IPV4_TCP tirn: 10
        tt: TT_IPV6_TCP tirn: 11
        tt: TT_IPV4_UDP tirn: 12
        tt: TT_IPV6_UDP tirn: 13
        tt: TT_IPV4_IPSEC_AH tirn: 14
        tt: TT_IPV6_IPSEC_AH tirn: 15
        tt: TT_IPV4_IPSEC_ESP tirn: 16
        tt: TT_IPV6_IPSEC_ESP tirn: 17
        tt: TT_IPV4 tirn: 18
        tt: TT_IPV6 tirn: 19
    Index: 2 rqtn: 27
    TIRs Numbers:
        tt: TT_IPV6_TCP tirn: 46

Signed-off-by: Amir Tzin <amirtz@nvidia.com>
Reviewed-by: Aya Levin <ayal@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250209101716.112774-14-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: Add direct TIRs to devlink rx reporter diagnose

Add "RX resources" tag to the output of rx reporter diagnose callback.
Underneath add tag for direct TIRs, for each TIR expose its tirn and
the corresponding rqtn.

$ sudo devlink health diagnose auxiliary/mlx5_core.eth.0/65535 reporter rx
....
rx resources:
   Direct TIRs:
       ix: 0 tirn: 20 rqtn: 1
       ix: 1 tirn: 21 rqtn: 2
       ix: 2 tirn: 22 rqtn: 3
       ix: 3 tirn: 23 rqtn: 4
       ix: 4 tirn: 24 rqtn: 5
       ix: 5 tirn: 25 rqtn: 6
       ix: 6 tirn: 26 rqtn: 7
       ix: 7 tirn: 27 rqtn: 8
       ix: 8 tirn: 28 rqtn: 9
       ix: 9 tirn: 29 rqtn: 10
       ix: 10 tirn: 30 rqtn: 11
       ix: 11 tirn: 31 rqtn: 12
       ix: 12 tirn: 32 rqtn: 13
       ix: 13 tirn: 33 rqtn: 14
       ix: 14 tirn: 34 rqtn: 15
       ix: 15 tirn: 35 rqtn: 16
       ix: 16 tirn: 36 rqtn: 17
       ix: 17 tirn: 37 rqtn: 18
       ix: 18 tirn: 38 rqtn: 19
       ix: 19 tirn: 39 rqtn: 20
       ix: 20 tirn: 40 rqtn: 21
       ix: 21 tirn: 41 rqtn: 22
       ix: 22 tirn: 42 rqtn: 23
       ix: 23 tirn: 43 rqtn: 24

Signed-off-by: Amir Tzin <amirtz@nvidia.com>
Reviewed-by: Aya Levin <ayal@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250209101716.112774-13-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: Move RQs diagnose to a dedicated function

Move rx reporter RQs diagnose from mlx5e_rx_reporter_diagnose() to a
dedicated function. This change is a preparation for the following
series which extends diagnose output for the rx reporter. While at it,
also pass a mlx5e_priv pointer to
mlx5e_rx_reporter_diagnose_common_config() as this is the argument the
latter actually needs.

Signed-off-by: Amir Tzin <amirtz@nvidia.com>
Reviewed-by: Aya Levin <ayal@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250209101716.112774-12-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: Expose ICM consumption per function

ICM is a portion of the host's memory assigned to a function by the OS
through requests made by the NIC's firmware.

PF ICM consumption can be accessed directly, while VF/SF ICM consumption
can be accessed through their representors in switchdev mode.

The value is exposed to the user in granularity of 4KB through the vnic
health reporter as follows:

$ devlink health diagnose pci/0000:08:00.0 reporter vnic
vNIC env counters:
     total_error_queues: 0 send_queue_priority_update_flow: 0
     comp_eq_overrun: 0 async_eq_overrun: 0 cq_overrun: 0
     invalid_command: 0 quota_exceeded_command: 0
     nic_receive_steering_discard: 0 icm_consumption: 1032

Signed-off-by: Akiva Goldberger <agoldberger@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250209101716.112774-11-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: Rename and move mlx5_esw_query_vport_vhca_id

Rename mlx5_esw_query_vport_vhca_id to mlx5_vport_get_vhca_id and move
it to vport file. Also, add function declaration to mlx5_core header
file. This better represents the function's usage and allows for it to
be called from other parts of the mlx5_core driver.

Signed-off-by: Akiva Goldberger <agoldberger@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250209101716.112774-10-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: set the tx_queue_len for pfifo_fast

By default, the mq netdev creates a pfifo_fast qdisc. On a
system with 16 core, the pfifo_fast with 3 bands consumes
16 * 3 * 8 (size of pointer) * 1024 (default tx queue len)
= 393KB. The patch sets the tx qlen to representor default
value, 128 (1<<MLX5E_REP_PARAMS_DEF_LOG_SQ_SIZE), which
consumes 16 * 3 * 8 * 128 = 49KB, saving 344KB for each
representor at ECPF.

Signed-off-by: William Tu <witu@nvidia.com>
Reviewed-by: Daniel Jurgens <danielj@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Link: https://patch.msgid.link/20250209101716.112774-9-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: reduce rep rxq depth to 256 for ECPF

By experiments, a single queue representor netdev consumes kernel
memory around 2.8MB, and 1.8MB out of the 2.8MB is due to page
pool for the RXQ. Scaling to a thousand representors consumes 2.8GB,
which becomes a memory pressure issue for embedded devices such as
BlueField-2 16GB / BlueField-3 32GB memory.

Since representor netdevs mostly handles miss traffic, and ideally,
most of the traffic will be offloaded, reduce the default non-uplink
rep netdev's RXQ default depth from 1024 to 256 if mdev is ecpf eswitch
manager. This saves around 1MB of memory per regular RQ,
(1024 - 256) * 2KB, allocated from page pool.

With rxq depth of 256, the netlink page pool tool reports
$./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
--dump page-pool-get
{'id': 277,
  'ifindex': 9,
  'inflight': 128,
  'inflight-mem': 786432,
  'napi-id': 775}]

This is due to mtu 1500 + headroom consumes half pages, so 256 rxq
entries consumes around 128 pages (thus create a page pool with
size 128), shown above at inflight.

Note that each netdev has multiple types of RQs, including
Regular RQ, XSK, PTP, Drop, Trap RQ. Since non-uplink representor
only supports regular rq, this patch only changes the regular RQ's
default depth.

Signed-off-by: William Tu <witu@nvidia.com>
Reviewed-by: Bodong Wang <bodong@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Link: https://patch.msgid.link/20250209101716.112774-8-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: reduce the max log mpwrq sz for ECPF and reps

For the ECPF and representors, reduce the max MPWRQ size from 256KB (18)
to 128KB (17). This prepares the later patch for saving representor
memory.

With Striding RQ, there is a minimum of 4 MPWQEs. So with 128KB of max
MPWRQ size, the minimal memory is 4 * 128KB = 512KB. When creating page
pool, consider 1500 mtu, the minimal page pool size will be 512KB/4KB =
128 pages = 256 rx ring entries (2 entries per page).

Before this patch, setting RX ringsize (ethtool -G rx) to 256 causes
driver to allocate page pool size more than it needs due to max MPWRQ
is 256KB (18). Ex: 4 * 256KB = 1MB, 1MB/4KB = 256 pages, but actually
128 pages is good enough. Reducing the max MPWRQ to 128KB fixes the
limitation.

Signed-off-by: William Tu <witu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Link: https://patch.msgid.link/20250209101716.112774-7-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue

Tony Nguyen says:

====================
Intel Wired LAN Driver Updates 2025-02-10 (ice, igc, e1000e)

For ice:

Karol, Jake, and Michal add PTP support for E830 devices. Karol
refactors and cleans up PTP code. Jake allows for a common
cross-timestamp implementation to be shared for all devices and
Michal adds E830 support.

Mateusz cleans up initial Flow Director rule creation to loop rather
than duplicate repeated similar calls.

For igc:

Siang adjust calls to remove need for close and open calls on loading
XDP program.

For e1000e:

Gerhard Engleder batches register writes for writing multicast table
on real-time kernels.

* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
  e1000e: Fix real-time violations on link up
  igc: Avoid unnecessary link down event in XDP_SETUP_PROG process
  ice: refactor ice_fdir_create_dflt_rules() function
  ice: Implement PTP support for E830 devices
  ice: Refactor ice_ptp_init_tx_*
  ice: Add unified ice_capture_crosststamp
  ice: Process TSYN IRQ in a separate function
  ice: Use FIELD_PREP for timestamp values
  ice: Remove unnecessary ice_is_e8xx() functions
  ice: Don't check device type when checking GNSS presence
====================

Link: https://patch.msgid.link/20250210192352.3799673-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'sfc-support-devlink-flash'

Edward Cree says:

====================
sfc: support devlink flash

Allow upgrading device firmware on Solarflare NICs through standard tools.
====================

Link: https://patch.msgid.link/cover.1739186252.git.ecree.xilinx@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

sfc: document devlink flash support

Update the information in sfc's devlink documentation including
support for firmware update with devlink flash.
Also update the help text for CONFIG_SFC_MTD, as it is no longer
strictly required for firmware updates.

Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Link: https://patch.msgid.link/3476b0ef04a0944f03e0b771ec8ed1a9c70db4dc.1739186253.git.ecree.xilinx@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

sfc: deploy devlink flash images to NIC over MCDI

Use MC_CMD_NVRAM_* wrappers to write the firmware to the partition
identified from the image header.

Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Link: https://patch.msgid.link/a746335876b621b3e54cf4e49948148e349a1745.1739186253.git.ecree.xilinx@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

sfc: extend NVRAM MCDI handlers

Support variable write-alignment, and background updates. The latter
allows other MCDI to continue while the device is processing an
MC_CMD_NVRAM_UPDATE_FINISH, since this can take a long time owing to
e.g. cryptographic signature verification.
Expose these handlers in mcdi.h, and build them even when
CONFIG_SFC_MTD=n, so they can be used for devlink flash in a
subsequent patch.

Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Link: https://patch.msgid.link/de3d9e14fee69e15d95b46258401a93b75659f78.1739186253.git.ecree.xilinx@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

sfc: parse headers of devlink flash images

This parsing is necessary to obtain the metadata which will be
used in a subsequent patch to write the image to the device.

Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Link: https://patch.msgid.link/65318300f3f1b1462925f917f7c0d0ac833955ae.1739186253.git.ecree.xilinx@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'use-hwmon_channel_info-macro-to-simplify-code'

Huisong Li says:

====================
Use HWMON_CHANNEL_INFO macro to simplify code

The HWMON_CHANNEL_INFO macro is provided by hwmon.h and used widely by many
other drivers. This series use HWMON_CHANNEL_INFO macro to simplify code
in net subsystem.

Note: These patches do not depend on each other. Put them togeter just for
belonging to the same subsystem.
====================

Link: https://patch.msgid.link/20250210054710.12855-1-lihuisong@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: aquantia: Use HWMON_CHANNEL_INFO macro to simplify code

Use HWMON_CHANNEL_INFO macro to simplify code.

Signed-off-by: Huisong Li <lihuisong@huawei.com>
Link: https://patch.msgid.link/20250210054710.12855-6-lihuisong@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: marvell10g: Use HWMON_CHANNEL_INFO macro to simplify code

Use HWMON_CHANNEL_INFO macro to simplify code.

Signed-off-by: Huisong Li <lihuisong@huawei.com>
Link: https://patch.msgid.link/20250210054710.12855-5-lihuisong@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: marvell: Use HWMON_CHANNEL_INFO macro to simplify code

Use HWMON_CHANNEL_INFO macro to simplify code.

Signed-off-by: Huisong Li <lihuisong@huawei.com>
Link: https://patch.msgid.link/20250210054710.12855-4-lihuisong@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: nfp: Use HWMON_CHANNEL_INFO macro to simplify code

Use HWMON_CHANNEL_INFO macro to simplify code.

Signed-off-by: Huisong Li <lihuisong@huawei.com>
Link: https://patch.msgid.link/20250210054710.12855-3-lihuisong@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: aquantia: Use HWMON_CHANNEL_INFO macro to simplify code

Use HWMON_CHANNEL_INFO macro to simplify code.

Signed-off-by: Huisong Li <lihuisong@huawei.com>
Link: https://patch.msgid.link/20250210054710.12855-2-lihuisong@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: wwan: t7xx: don't include '<linux/pm_wakeup.h>' directly

The header clearly states that it does not want to be included directly,
only via '<linux/(platform_)?device.h>'. Which is already present, so
delete the superfluous include.

Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Acked-by: Sergey Ryazanov <ryazanov.s.a@gmail.com>
Link: https://patch.msgid.link/20250210113710.52071-2-wsa+renesas@sang-engineering.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: broadcom: don't include '<linux/pm_wakeup.h>' directly

The header clearly states that it does not want to be included directly,
only via '<linux/(platform_)?device.h>'. Replace the include accordingly.

Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Acked-by: Florian Fainelli <florian.fainelli@broadcom.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20250210113658.52019-2-wsa+renesas@sang-engineering.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: freescale: ucc_geth: remove unused PHY_INIT_TIMEOUT and PHY_CHANGE_TIME

Both definitions are unused. Last users have been removed with:

1577ecef7666 ("netdev: Merge UCC and gianfar MDIO bus drivers")
728de4c927a3 ("ucc_geth: migrate ucc_geth to phylib")

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Link: https://patch.msgid.link/62e9429b-57e0-42ec-96a5-6a89553f441d@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: remove unused PHY_INIT_TIMEOUT and PHY_FORCE_TIMEOUT

Both definitions are unused. Last users have been removed with:

f3ba9d490d6e ("net: s6gmac: remove driver")
2bd229df5e2e ("net: phy: remove state PHY_FORCING")

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/f8e7b8ed-a665-41ad-b0ce-cbfdb65262ef@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

hamradio: baycom: replace strcpy() with strscpy()

The strcpy() function has been deprecated and replaced with strscpy().
There is an effort to make this change treewide:
https://github.com/KSPP/linux/issues/88.

Signed-off-by: Ethan Carter Edwards <ethan@ethancedwards.com>
Reviewed-by: Dan Carpenter <dan.carpenter@linaro.org>
Link: https://patch.msgid.link/3qo3fbrak7undfgocsi2s74v4uyjbylpdqhie4dohfoh4welfn@joq7up65ug6v
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

blackhole_dev: convert self-test to KUnit

Convert this very simple smoke test to a KUnit test.

Add a missing `htons` call that was spotted[0] by kernel test robot
<lkp@intel.com> after initial conversion to KUnit.

Link: https://lore.kernel.org/oe-kbuild-all/202502090223.qCYMBjWT-lkp@intel.com/
Signed-off-by: Tamir Duberstein <tamird@gmail.com>
Link: https://patch.msgid.link/20250208-blackholedev-kunit-convert-v2-1-182db9bd56ec@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-phy-rename-eee_broken_mode'

Heiner Kallweit says:

====================
net: phy: rename eee_broken_mode

eee_broken_mode is used also if an EEE mode isn't actually broken
but e.g. not supported by MAC. So rename it.

This is split out from a bigger series that needs more rework.
====================

Link: https://patch.msgid.link/d7924d4e-49b0-4182-831f-73c558d4425e@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: rename phy_set_eee_broken to phy_disable_eee_mode

Consider that an EEE mode may not be broken but simply not supported
by the MAC, and rename function phy_set_eee_broken().

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/30deb630-3f6b-4ffb-a1e6-a9736021f43a@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: rename eee_broken_modes to eee_disabled_modes

This bitmap is used also if the MAC doesn't support an EEE mode.
So the mode isn't necessarily broken in the PHY. Therefore rename
the bitmap.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/6cd11422-dd67-4c87-a642-308de694af92@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'tcp-allow-to-reduce-max-rto'

Eric Dumazet says:

====================
tcp: allow to reduce max RTO

This is a followup of a discussion started 6 months ago
by Jason Xing.

Some applications want to lower the time between each
retransmit attempts.

TCP_KEEPINTVL and TCP_KEEPCNT socket options don't
work around the issue.

This series adds:

- a new TCP level socket option (TCP_RTO_MAX_MS)
- a new sysctl (/proc/sys/net/ipv4/tcp_rto_max_ms)

Admins and/or applications can now change the max rto value
at their own risk.

Link: https://lore.kernel.org/netdev/20240715033118.32322-1-kerneljasonxing@gmail.com/T/
====================

Link: https://patch.msgid.link/20250207152830.2527578-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

tcp: add tcp_rto_max_ms sysctl

Previous patch added a TCP_RTO_MAX_MS socket option
to tune a TCP socket max RTO value.

Many setups prefer to change a per netns sysctl.

This patch adds /proc/sys/net/ipv4/tcp_rto_max_ms

Its initial value is 120000 (120 seconds).

Keep in mind that a decrease of tcp_rto_max_ms
means shorter overall timeouts, unless tcp_retries2
sysctl is increased.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

tcp: add the ability to control max RTO

Currently, TCP stack uses a constant (120 seconds)
to limit the RTO value exponential growth.

Some applications want to set a lower value.

Add TCP_RTO_MAX_MS socket option to set a value (in ms)
between 1 and 120 seconds.

It is discouraged to change the socket rto max on a live
socket, as it might lead to unexpected disconnects.

Following patch is adding a netns sysctl to control the
default value at socket creation time.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

tcp: use tcp_reset_xmit_timer()

In order to reduce TCP_RTO_MAX occurrences, replace:

inet_csk_reset_xmit_timer(sk, what, when, TCP_RTO_MAX)

With:

tcp_reset_xmit_timer(sk, what, when, false);

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

tcp: add a @pace_delay parameter to tcp_reset_xmit_timer()

We want to factorize calls to inet_csk_reset_xmit_timer(),
to ease TCP_RTO_MAX change.

Current users want to add tcp_pacing_delay(sk)
to the timeout.

Remaining calls to inet_csk_reset_xmit_timer()
do not add the pacing delay. Following patch
will convert them, passing false for @pace_delay.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

tcp: remove tcp_reset_xmit_timer() @max_when argument

All callers use TCP_RTO_MAX, we can factorize this constant,
becoming a variable soon.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'mptcp-pm-misc-cleanups-part-2'

Matthieu Baerts says:

====================
mptcp: pm: misc cleanups, part 2

These cleanups lead the way to the unification of the path-manager
interfaces, and allow future extensions. The following patches are not
all linked to each others, but are all related to the path-managers.

- Patch 1: drop unneeded parameter in a function helper.

- Patch 2: clearer NL error message when an NL attribute is missing.

- Patch 3: more precise NL messages by avoiding 'this or that is NOK'.

- Patch 4: improve too vague or missing NL err messages.

- Patch 5: use GENL_REQ_ATTR_CHECK to look for mandatory NL attributes.

- Patch 6: avoid overriding the error message.

- Patch 7: check all mandatory NL attributes with GENL_REQ_ATTR_CHECK.

- Patch 8: use NL_SET_ERR_MSG_ATTR instead of GENL_SET_ERR_MSG

- Patch 9: move doit callbacks used for both PM to pm.c.

- Patch 10: drop another unneeded parameter in a function helper.

- Patch 11: share the ID parsing code for the 'get_addr' callback.

- Patch 12: share sending NL code for the 'get_addr' callback.

- Patch 13: drop yet another unneeded parameter in a function helper.

- Patch 14: pick the usual structure type for the remote address.

- Patch 15: share the local addr parsing code for the 'set_flags' cb.

The behaviour when there are no errors should then not be modified.

Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
====================

v2: https://lore.kernel.org/r/20250117-net-next-mptcp-pm-misc-cleanup-2-v2-0-61d4fe0586e8@kernel.org
v1: https://lore.kernel.org/r/20250116-net-next-mptcp-pm-misc-cleanup-2-v1-0-c0b43f18fe06@kernel.org

Link: https://patch.msgid.link/20250207-net-next-mptcp-pm-misc-cleanup-2-v3-0-71753ed957de@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

mptcp: pm: add local parameter for set_flags

This patch updates the interfaces set_flags to reduce repetitive
code, adds a new parameter 'local' for them.

The local address is parsed in public helper mptcp_pm_nl_set_flags_doit(),
then pass it to mptcp_pm_nl_set_flags() and mptcp_userspace_pm_set_flags().

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

mptcp: pm: change rem type of set_flags

Generally, in the path manager interfaces, the local address is defined
as an mptcp_pm_addr_entry type address, while the remote address is
defined as an mptcp_addr_info type one:

(struct mptcp_pm_addr_entry *local, struct mptcp_addr_info *remote)

But the set_flags() interface uses two mptcp_pm_addr_entry type
parameters.

This patch changes the second one to mptcp_addr_info type and use helper
mptcp_pm_parse_addr() to parse it instead of using mptcp_pm_parse_entry().

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

mptcp: pm: drop skb parameter of set_flags

The first parameter 'skb' in mptcp_pm_nl_set_flags() is only used to
obtained the network namespace, which can also be obtained through the
second parameters 'info' by using genl_info_net() helper.

This patch drops these useless parameters 'skb' in all three set_flags()
interfaces.

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

mptcp: pm: reuse sending nlmsg code in get_addr

The netlink messages are sent both in mptcp_pm_nl_get_addr() and
mptcp_userspace_pm_get_addr(), this makes the code somewhat repetitive.
This is because the netlink PM and userspace PM use different locks to
protect the address entry that needs to be sent via the netlink message.
The former uses rcu read lock, and the latter uses msk->pm.lock.

The current get_addr() flow looks like this:

lock();
entry = get_entry();
send_nlmsg(entry);
unlock();

After holding the lock, get the entry from the list, send the entry, and
finally release the lock.

This patch changes the process by getting the entry while holding the lock,
then making a copy of the entry so that the lock can be released. Finally,
the copy of the entry is sent without locking:

lock();
entry = get_entry();
*copy = *entry;
unlock();

send_nlmsg(copy);

This way we can reuse the send_nlmsg() code in get_addr() interfaces
between the netlink PM and userspace PM. They only need to implement their
own get_addr() interfaces to hold the different locks, get the entry from
the different lists, then release the locks.

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

mptcp: pm: add id parameter for get_addr

The address id is parsed both in mptcp_pm_nl_get_addr() and
mptcp_userspace_pm_get_addr(), this makes the code somewhat repetitive.

So this patch adds a new parameter 'id' for all get_addr() interfaces.
The address id is only parsed in mptcp_pm_nl_get_addr_doit(), then pass
it to both mptcp_pm_nl_get_addr() and mptcp_userspace_pm_get_addr().

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

mptcp: pm: drop skb parameter of get_addr

The first parameters 'skb' of get_addr() interfaces are now useless
since mptcp_userspace_pm_get_sock() helper is used. This patch drops
these useless parameters of them.

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

mptcp: pm: make three pm wrappers static

Three netlink functions:

mptcp_pm_nl_get_addr_doit()
mptcp_pm_nl_get_addr_dumpit()
mptcp_pm_nl_set_flags_doit()

are generic, implemented for each PM, in-kernel PM and userspace PM. It's
clearer to move them from pm_netlink.c to pm.c.

And the linked three path manager wrappers

mptcp_pm_get_addr()
mptcp_pm_dump_addr()
mptcp_pm_set_flags()

can be changed as static functions, no need to export them in protocol.h.

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

mptcp: pm: use NL_SET_ERR_MSG_ATTR when possible

Instead of only returning a text message with GENL_SET_ERR_MSG(),
NL_SET_ERR_MSG_ATTR() can help the userspace developers by also
reporting which attribute is faulty.

When the error is specific to an attribute, NL_SET_ERR_MSG_ATTR() is now
used. The error messages have not been modified in this commit.

Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

mptcp: pm: mark missing address attributes

mptcp_pm_parse_entry() will check if the given attribute is defined. If
not, it will return a generic error: "missing address info".

It might then not be clear for the userspace developer which attribute
is missing, especially when the command takes multiple addresses.

By using GENL_REQ_ATTR_CHECK(), the userspace will get a hint about
which attribute is missing, making thing clearer. Note that this is what
was already done for most of the other MPTCP NL commands, this patch
simply adds the missing ones.

Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

mptcp: pm: remove duplicated error messages

mptcp_pm_parse_entry() and mptcp_pm_parse_addr() will already set a
error message in case of parsing issue.

Then, no need to override this error message with another less precise
one: "error parsing address".

Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

mptcp: pm: userspace: use GENL_REQ_ATTR_CHECK

A more general way to check if MPTCP_PM_ATTR_* exists in 'info'
is to use GENL_REQ_ATTR_CHECK(info, MPTCP_PM_ATTR_*) instead of
directly reading info->attrs[MPTCP_PM_ATTR_*] and then checking
if it's NULL.

So this patch uses GENL_REQ_ATTR_CHECK() for userspace PM in
mptcp_pm_nl_announce_doit(), mptcp_pm_nl_remove_doit(),
mptcp_pm_nl_subflow_create_doit(), mptcp_pm_nl_subflow_destroy_doit()
and mptcp_userspace_pm_get_sock().

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

mptcp: pm: improve error messages

Some error messages were:

- too generic: "missing input", "invalid request"

- not precise enough: "limit greater than maximum" but what's the max?

- missing: subflow not found, or connect error.

This can be easily improved by being more precise, or adding new error
messages.

Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

mptcp: pm: more precise error messages

Some errors reported by the userspace PM were vague: "this or that is
invalid".

It is easier for the userspace to know which part is wrong, instead of
having to guess that.

While at it, in mptcp_userspace_pm_set_flags() move the parsing after
the check linked to the local attribute.

Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

mptcp: pm: userspace: flags: clearer msg if no remote addr

Since its introduction in commit 892f396c8e68 ("mptcp: netlink: issue
MP_PRIO signals from userspace PMs"), it was mandatory to specify the
remote address, because of the 'if (rem->addr.family == AF_UNSPEC)'
check done later one.

In theory, this attribute can be optional, but it sounds better to be
precise to avoid sending the MP_PRIO on the wrong subflow, e.g. if there
are multiple subflows attached to the same local ID. This can be relaxed
later on if there is a need to act on multiple subflows with one
command.

For the moment, the check to see if attr_rem is NULL can be removed,
because mptcp_pm_parse_entry() will do this check as well, no need to do
that differently here.

Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

mptcp: pm: drop info of userspace_pm_remove_id_zero_address

The only use of 'info' parameter of userspace_pm_remove_id_zero_address()
is to set an error message into it.

Plus, this helper will only fail when it cannot find any subflows with a
local address ID 0.

This patch drops this parameter and sets the error message where this
function is called in mptcp_pm_nl_remove_doit().

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

selftests/net: Add selftest for IPv4 RTM_GETMULTICAST support

This change introduces a new selftest case to verify the functionality
of dumping IPv4 multicast addresses using the RTM_GETMULTICAST netlink
message. The test utilizes the ynl library to interact with the
netlink interface and validate that the kernel correctly reports the
joined IPv4 multicast addresses.

To run the test, execute the following command:

$ vng -v --user root --cpus 16 -- \
make -C tools/testing/selftests TARGETS=net \
TEST_PROGS=rtnetlink.py TEST_GEN_PROGS="" run_tests

Cc: Maciej Żenczykowski <maze@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: Yuyang Huang <yuyanghuang@google.com>
Link: https://patch.msgid.link/20250207110836.2407224-2-yuyanghuang@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

netlink: support dumping IPv4 multicast addresses

Extended RTM_GETMULTICAST to support dumping joined IPv4 multicast
addresses, in addition to the existing IPv6 functionality. This allows
userspace applications to retrieve both IPv4 and IPv6 multicast
addresses through similar netlink command and then monitor future
changes by registering to RTNLGRP_IPV4_MCADDR and RTNLGRP_IPV6_MCADDR.

Cc: Maciej Żenczykowski <maze@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuyang Huang <yuyanghuang@google.com>
Link: https://patch.msgid.link/20250207110836.2407224-1-yuyanghuang@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: fec: Refactor MAC reset to function

The core is reset both in `fec_restart()` (called on link-up) and
`fec_stop()` (going to sleep, driver remove etc.). These two functions
had their separate implementations, which was at first only a register
write and a `udelay()` (and the accompanying block comment). However,
since then we got soft-reset (MAC disable) and Wake-on-LAN support, which
meant that these implementations diverged, often causing bugs.

For instance, as of now, `fec_stop()` does not check for
`FEC_QUIRK_NO_HARD_RESET`, meaning the MII/RMII mode is cleared on eg.
a PM power-down event; and `fec_restart()` missed the refactor renaming
the "magic" constant `1` to `FEC_ECR_RESET`.

To harmonize current implementations, and eliminate this source of
potential future bugs, refactor implementation to a common function.

Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Csókás, Bence <csokas.bence@prolan.hu>
Link: https://patch.msgid.link/20250207121255.161146-2-csokas.bence@prolan.hu
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

mlxsw: Enable Tx checksum offload

The device is able to checksum plain TCP / UDP packets over IPv4 / IPv6
when the 'ipcs' bit in the send descriptor is set. Advertise support for
the 'NETIF_F_IP{,6}_CSUM' features in net devices registered by the
driver and VLAN uppers and set the 'ipcs' bit when the stack requests Tx
checksum offload.

Note that the device also calculates the IPv4 checksum, but it first
zeroes the current checksum so there should not be any difference
compared to the checksum calculated by the kernel.

On SN5600 (Spectrum-4) there is about 10% improvement in Tx packet rate
with 1400 byte packets when using pktgen.

Tested on Spectrum-{1,2,3,4} with all the combinations of IPv4 / IPv6,
TCP / UDP, with and without VLAN.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/8dc86c95474ce10572a0fa83b8adb0259558e982.1738950446.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: drv-net: add helper for path resolution

Refering to C binaries from Python code is going to be a common
need. Add a helper to convert from path in relation to the test.
Meaning, if the test is in the same directory as the binary, the
call would be simply: cfg.rpath("binary").

The helper name "rpath" is not great. I can't think of a better
name that would be accurate yet concise.

Reviewed-by: Petr Machata <petrm@nvidia.com>
Link: https://patch.msgid.link/20250207184140.1730466-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: drv-net: factor out a DrvEnv base class

We have separate Env classes for local tests and tests with a remote
endpoint. Make it easier to share the code by creating a base class.
Make env loading a method of this class.

Reviewed-by: Petr Machata <petrm@nvidia.com>
Link: https://patch.msgid.link/20250207184140.1730466-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: drv-net: remove an unnecessary libmnl include

ncdevmem doesn't need libmnl, remove the unnecessary include.

Since YNL doesn't depend on libmnl either, any more, it's actually
possible to build selftests without having libmnl installed.

Reviewed-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20250207183119.1721424-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'fib-rules-convert-rtm_newrule-and-rtm_delrule-to-per-netns-rtnl'

Kuniyuki Iwashima says:

====================
fib: rules: Convert RTM_NEWRULE and RTM_DELRULE to per-netns RTNL.

Patch 1 ~ 2 are small cleanup, and patch 3 ~ 8 make fib_nl_newrule()
and fib_nl_delrule() hold per-netns RTNL.

v1: https://lore.kernel.org/20250206084629.16602-1-kuniyu@amazon.com
====================

Link: https://patch.msgid.link/20250207072502.87775-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: fib_rules: Convert RTM_DELRULE to per-netns RTNL.

fib_nl_delrule() is the doit() handler for RTM_DELRULE but also called
from vrf_newlink() in case something fails in vrf_add_fib_rules().

In the latter case, RTNL is already held and the 4th arg is true.

Let's hold per-netns RTNL in fib_delrule() if rtnl_held is false.

Now we can place ASSERT_RTNL_NET() in call_fib_rule_notifiers().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250207072502.87775-9-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: fib_rules: Add error_free label in fib_delrule().

We will hold RTNL just before calling fib_nl2rule_rtnl() in
fib_delrule() and release it before kfree(nlrule).

Let's add a new rule to make the following change cleaner.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250207072502.87775-8-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: fib_rules: Convert RTM_NEWRULE to per-netns RTNL.

fib_nl_newrule() is the doit() handler for RTM_NEWRULE but also called
from vrf_newlink().

In the latter case, RTNL is already held and the 4th arg is true.

Let's hold per-netns RTNL in fib_newrule() if rtnl_held is false.

Note that we call fib_rule_get() before releasing per-netns RTNL to call
notify_rule_change() without RTNL and prevent freeing the new rule.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250207072502.87775-7-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: fib_rules: Factorise fib_newrule() and fib_delrule().

fib_nl_newrule() / fib_nl_delrule() is the doit() handler for
RTM_NEWRULE / RTM_DELRULE but also called from vrf_newlink().

Currently, we hold RTNL on both paths but will not on the former.

Also, we set dev_net(dev)->rtnl to skb->sk in vrf_fib_rule() because
fib_nl_newrule() / fib_nl_delrule() fetch net as sock_net(skb->sk).

Let's Factorise the two functions and pass net and rtnl_held flag.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250207072502.87775-6-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ip: fib_rules: Fetch net from fib_rule in fib[46]_rule_configure().

The following patch will not set skb->sk from VRF path.

Let's fetch net from fib_rule->fr_net instead of sock_net(skb->sk)
in fib[46]_rule_configure().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250207072502.87775-5-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: fib_rules: Split fib_nl2rule().

We will move RTNL down to fib_nl_newrule() and fib_nl_delrule().

Some operations in fib_nl2rule() require RTNL: fib_default_rule_pref()
and __dev_get_by_name().

Let's split the RTNL parts as fib_nl2rule_rtnl().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250207072502.87775-4-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: fib_rules: Pass net to fib_nl2rule() instead of skb.

skb is not used in fib_nl2rule() other than sock_net(skb->sk),
which is already available in callers, fib_nl_newrule() and
fib_nl_delrule().

Let's pass net directly to fib_nl2rule().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250207072502.87775-3-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: fib_rules: Don't check net in rule_exists() and rule_find().

fib_nl_newrule() / fib_nl_delrule() looks up struct fib_rules_ops
in sock_net(skb->sk) and calls rule_exists() / rule_find() respectively.

fib_nl_newrule() creates a new rule and links it to the found ops, so
struct fib_rule never belongs to a different netns's ops->rules_list.

Let's remove redundant netns check in rule_exists() and rule_find().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250207072502.87775-2-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'tun-unify-vnet-implementation'

Akihiko Odaki says:

====================
tun: Unify vnet implementation

When I implemented virtio's hash-related features to tun/tap [1],
I found tun/tap does not fill the entire region reserved for the virtio
header, leaving some uninitialized hole in the middle of the buffer
after read()/recvmesg().

This series fills the uninitialized hole. More concretely, the
num_buffers field will be initialized with 1, and the other fields will
be inialized with 0. Setting the num_buffers field to 1 is mandated by
virtio 1.0 [2].

The change to virtio header is preceded by another change that refactors
tun and tap to unify their virtio-related code.

[1]: https://lore.kernel.org/r/20241008-rss-v5-0-f3cf68df005d@daynix.com
[2]: https://lore.kernel.org/r/20241227084256-mutt-send-email-mst@kernel.org/
====================

Link: https://patch.msgid.link/20250207-tun-v6-0-fb49cf8b103e@daynix.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tap: Use tun's vnet-related code

tun and tap implements the same vnet-related features so reuse the code.

Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250207-tun-v6-7-fb49cf8b103e@daynix.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tap: Keep hdr_len in tap_get_user()

hdr_len is repeatedly used so keep it in a local variable.

Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250207-tun-v6-6-fb49cf8b103e@daynix.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tun: Extract the vnet handling code

The vnet handling code will be reused by tap.

Functions are renamed to ensure that their names contain "vnet" to
clarify that they are part of the decoupled vnet handling code.

Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250207-tun-v6-5-fb49cf8b103e@daynix.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tun: Decouple vnet handling

Decouple the vnet handling code so that we can reuse it for tap.

Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250207-tun-v6-4-fb49cf8b103e@daynix.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tun: Decouple vnet from tun_struct

Decouple vnet-related functions from tun_struct so that we can reuse
them for tap in the future.

Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250207-tun-v6-3-fb49cf8b103e@daynix.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tun: Keep hdr_len in tun_get_user()

hdr_len is repeatedly used so keep it in a local variable.

Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250207-tun-v6-2-fb49cf8b103e@daynix.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tun: Refactor CONFIG_TUN_VNET_CROSS_LE

Check IS_ENABLED(CONFIG_TUN_VNET_CROSS_LE) to save some lines and make
future changes easier.

Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250207-tun-v6-1-fb49cf8b103e@daynix.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-xilinx-axienet-enable-adaptive-irq-coalescing-with-dim'

Sean Anderson says:

====================
net: xilinx: axienet: Enable adaptive IRQ coalescing with DIM

To improve performance without sacrificing latency under low load,
enable DIM. While I appreciate not having to write the library myself, I
do think there are many unusual aspects to DIM, as detailed in the last
patch.
====================

Link: https://patch.msgid.link/20250206201036.1516800-1-sean.anderson@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: xilinx: axienet: Enable adaptive IRQ coalescing with DIM

The default RX IRQ coalescing settings of one IRQ per packet can represent
a significant CPU load. However, increasing the coalescing unilaterally
can result in undesirable latency under low load. Adaptive IRQ
coalescing with DIM offers a way to adjust the coalescing settings based
on load.

This device only supports "CQE" mode [1], where each packet resets the
timer. Therefore, an interrupt is fired either when we receive
coalesce_count_rx packets or when the interface is idle for
coalesce_usec_rx. With this in mind, consider the following scenarios:

Link saturated
    Here we want to set coalesce_count_rx to a large value, in order to
    coalesce more packets and reduce CPU load. coalesce_usec_rx should
    be set to at least the time for one packet. Otherwise the link will
    be "idle" and we will get an interrupt for each packet anyway.

Bursts of packets
    Each burst should be coalesced into a single interrupt, although it
    may be prudent to reduce coalesce_count_rx for better latency.
    coalesce_usec_rx should be set to at least the time for one packet
    so bursts are coalesced. However, additional time beyond the packet
    time will just increase latency at the end of a burst.

Sporadic packets
    Due to low load, we can set coalesce_count_rx to 1 in order to
    reduce latency to the minimum. coalesce_usec_rx does not matter in
    this case.

Based on this analysis, I expected the CQE profiles to look something
like

usec =  0, pkts = 1   // Low load
usec = 16, pkts = 4
usec = 16, pkts = 16
usec = 16, pkts = 64
usec = 16, pkts = 256 // High load

Where usec is set to 16 to be a few us greater than the 12.3 us packet
time of a 1500 MTU packet at 1 GBit/s. However, the CQE profile is
instead

usec =  2, pkts = 256 // Low load
usec =  8, pkts = 128
usec = 16, pkts =  64
usec = 32, pkts =  64
usec = 64, pkts =  64 // High load

I found this very surprising. The number of coalesced packets
*decreases* as load increases. But as load increases we have more
opportunities to coalesce packets without affecting latency as much.
Additionally, the profile *increases* the usec as the load increases.
But as load increases, the gaps between packets will tend to become
smaller, making it possible to *decrease* usec for better latency at the
end of a "burst".

I consider the default CQE profile unsuitable for this NIC. Therefore,
we use the first profile outlined in this commit instead.
coalesce_usec_rx is set to 16 by default, but the user can customize it.
This may be necessary if they are using jumbo frames. I think adjusting
the profile times based on the link speed/mtu would be good improvement
for generic DIM.

In addition to the above profile problems, I noticed the following
additional issues with DIM while testing:

- DIM tends to "wander" when at low load, since the performance gradient
  is pretty flat. If you only have 10p/ms anyway then adjusting the
  coalescing settings will not affect throughput very much.
- DIM takes a long time to adjust back to low indices when load is
  decreased following a period of high load. This is because it only
  re-evaluates its settings once every 64 interrupts. However, at low
  load 64 interrupts can be several seconds.

Finally: performance. This patch increases receive throughput with
iperf3 from 840 Mbits/sec to 938 Mbits/sec, decreases interrupts from
69920/sec to 316/sec, and decreases CPU utilization (4x Cortex-A53) from
43% to 9%.

[1] Who names this stuff?

Signed-off-by: Sean Anderson <sean.anderson@linux.dev>
Reviewed by: Shannon Nelson <shannon.nelson@amd.com>
Link: https://patch.msgid.link/20250206201036.1516800-5-sean.anderson@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: xilinx: axienet: Get coalesce parameters from driver state

The cr variables now contain the same values as the control registers
themselves. Extract/calculate the values from the variables instead of
saving the user-specified values. This allows us to remove some
bookeeping, and also lets the user know what the actual coalesce
settings are.

Signed-off-by: Sean Anderson <sean.anderson@linux.dev>
Reviewed by: Shannon Nelson <shannon.nelson@amd.com>
Link: https://patch.msgid.link/20250206201036.1516800-4-sean.anderson@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: xilinx: axienet: Support adjusting coalesce settings while running

In preparation for adaptive IRQ coalescing, we first need to support
adjusting the settings at runtime. The existing code doesn't require any
locking because

- dma_start is the only function that modifies rx/tx_dma_cr. It is
  always called with IRQs and NAPI disabled, so nothing else is touching
  the hardware.
- The IRQs don't race with poll, since the latter is a softirq.
- The IRQs don't race with dma_stop since they both just clear the
  control registers.
- dma_stop doesn't race with poll since the former is called with NAPI
  disabled.

However, once we introduce another function that modifies rx/tx_dma_cr,
we need to have some locking to prevent races. Introduce two locks to
protect these variables and their registers.

The control register values are now generated where the coalescing
settings are set. Converting coalescing settings to control register
values may require sleeping because of clk_get_rate. However, the
read/modify/write of the control registers themselves can't sleep
because it needs to happen in IRQ context. By pre-calculating the
control register values, we avoid introducing an additional mutex.

Since axienet_dma_start writes the control settings when it runs, we
don't bother updating the CR registers when rx/tx_dma_started is false.
This prevents any issues from writing to the control registers in the
middle of a reset sequence.

Signed-off-by: Sean Anderson <sean.anderson@linux.dev>
Reviewed-by: Shannon Nelson <shannon.nelson@amd.com>
Link: https://patch.msgid.link/20250206201036.1516800-3-sean.anderson@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: xilinx: axienet: Combine CR calculation

Combine the common parts of the CR calculations for better code reuse.
While we're at it, simplify the code a bit.

Signed-off-by: Sean Anderson <sean.anderson@linux.dev>
Reviewed-by: Shannon Nelson <shannon.nelson@amd.com>
Link: https://patch.msgid.link/20250206201036.1516800-2-sean.anderson@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

r8152: add vendor/device ID pair for Dell Alienware AW1022z

The Dell AW1022z is an RTL8156B based 2.5G Ethernet controller.

Add the vendor and product ID values to the driver. This makes Ethernet
work with the adapter.

Signed-off-by: Aleksander Jan Bajkowski <olek2@wp.pl>
Link: https://patch.msgid.link/20250206224033.980115-1-olek2@wp.pl
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'xsk-the-lost-bits-from-chapter-iii'

Alexander Lobakin says:

====================
xsk: the lost bits from Chapter III

Before introducing libeth_xdp, we need to add a couple more generic
helpers. Notably:

* 01: add generic loop unrolling hint helpers;
* 04: add helper to get both xdp_desc's DMA address and metadata
  pointer in one go, saving several cycles and hotpath object
  code size in drivers (especially when unrolling).

Bonus:

* 02, 03: convert two drivers which were using custom macros to
  generic unrolled_count() (trivial, no object code changes).
====================

Link: https://patch.msgid.link/20250206182630.3914318-1-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

xsk: add helper to get &xdp_desc's DMA and meta pointer in one go

Currently, when your driver supports XSk Tx metadata and you want to
send an XSk frame, you need to do the following:

* call external xsk_buff_raw_get_dma();
* call inline xsk_buff_get_metadata(), which calls external
xsk_buff_raw_get_data() and then do some inline checks.

This effectively means that the following piece:

addr = pool->unaligned ? xp_unaligned_add_offset_to_addr(addr) : addr;

is done twice per frame, plus you have 2 external calls per frame, plus
this:

meta = pool->addrs + addr - pool->tx_metadata_len;
if (unlikely(!xsk_buff_valid_tx_metadata(meta)))

is always inlined, even if there's no meta or it's invalid.

Add xsk_buff_raw_get_ctx() (xp_raw_get_ctx() to be precise) to do that
in one go. It returns a small structure with 2 fields: DMA address,
filled unconditionally, and metadata pointer, non-NULL only if it's
present and valid. The address correction is performed only once and
you also have only 1 external call per XSk frame, which does all the
calculations and checks outside of your hotpath. You only need to
check `if (ctx.meta)` for the metadata presence.
To not copy any existing code, derive address correction and getting
virtual and DMA address into small helpers. bloat-o-meter reports no
object code changes for the existing functionality.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20250206182630.3914318-5-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ice: use generic unrolled_count() macro

ice, same as i40e, has a custom loop unrolling macros for unrolling
Tx descriptors filling on XSk xmit.
Replace ice defs with generic unrolled_count(), which is also more
convenient as it allows passing defines as its argument, not hardcoded
values, while the loop declaration will still be usual for-loop.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://patch.msgid.link/20250206182630.3914318-4-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

i40e: use generic unrolled_count() macro

i40e, as well as ice, has a custom loop unrolling macro for unrolling
Tx descriptors filling on XSk xmit.
Replace i40e defs with generic unrolled_count(), which is also more
convenient as it allows passing defines as its argument, not hardcoded
values, while the loop declaration will still be a usual for-loop.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://patch.msgid.link/20250206182630.3914318-3-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

unroll: add generic loop unroll helpers

There are cases when we need to explicitly unroll loops. For example,
cache operations, filling DMA descriptors on very high speeds etc.
Add compiler-specific attribute macros to give the compiler a hint
that we'd like to unroll a loop.
Example usage:

#define UNROLL_BATCH 8

unrolled_count(UNROLL_BATCH)
for (u32 i = 0; i < UNROLL_BATCH; i++)
op(priv, i);

Note that sometimes the compilers won't unroll loops if they think this
would have worse optimization and perf than without unrolling, and that
unroll attributes are available only starting GCC 8. For older compiler
versions, no hints/attributes will be applied.
For better unrolling/parallelization, don't have any variables that
interfere between iterations except for the iterator itself.

Co-developed-by: Jose E. Marchesi <jose.marchesi@oracle.com> # pragmas
Signed-off-by: Jose E. Marchesi <jose.marchesi@oracle.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20250206182630.3914318-2-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: dp83td510: introduce LED framework support

Add LED brightness, mode, HW control and polarity functions to enable
external LED control in the TI DP83TD510 PHY.

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://patch.msgid.link/20250205103846.2273833-1-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

e1000e: Fix real-time violations on link up

Link down and up triggers update of MTA table. This update executes many
PCIe writes and a final flush. Thus, PCIe will be blocked until all
writes are flushed. As a result, DMA transfers of other targets suffer
from delay in the range of 50us. This results in timing violations on
real-time systems during link down and up of e1000e in combination with
an Intel i3-2310E Sandy Bridge CPU.

The i3-2310E is quite old. Launched 2011 by Intel but still in use as
robot controller. The exact root cause of the problem is unclear and
this situation won't change as Intel support for this CPU has ended
years ago. Our experience is that the number of posted PCIe writes needs
to be limited at least for real-time systems. With posted PCIe writes a
much higher throughput can be generated than with PCIe reads which
cannot be posted. Thus, the load on the interconnect is much higher.
Additionally, a PCIe read waits until all posted PCIe writes are done.
Therefore, the PCIe read can block the CPU for much more than 10us if a
lot of PCIe writes were posted before. Both issues are the reason why we
are limiting the number of posted PCIe writes in row in general for our
real-time systems, not only for this driver.

A flush after a low enough number of posted PCIe writes eliminates the
delay but also increases the time needed for MTA table update. The
following measurements were done on i3-2310E with e1000e for 128 MTA
table entries:

Single flush after all writes: 106us
Flush after every write:       429us
Flush after every 2nd write:   266us
Flush after every 4th write:   180us
Flush after every 8th write:   141us
Flush after every 16th write:  121us

A flush after every 8th write delays the link up by 35us and the
negative impact to DMA transfers of other targets is still tolerable.

Execute a flush after every 8th write. This prevents overloading the
interconnect with posted writes.

Signed-off-by: Gerhard Engleder <eg@keba.com>
Link: https://lore.kernel.org/netdev/f8fe665a-5e6c-4f95-b47a-2f3281aa0e6c@lunn.ch/T/
CC: Vitaly Lifshits <vitaly.lifshits@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Tested-by: Avigail Dahan <avigailx.dahan@intel.com>
Reviewed-by: Vitaly Lifshits <vitaly.lifshits@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

igc: Avoid unnecessary link down event in XDP_SETUP_PROG process

The igc_close()/igc_open() functions are too drastic for installing a new
XDP prog because they cause undesirable link down event and device reset.

To avoid delays in Ethernet traffic, improve the XDP_SETUP_PROG process by
using the same sequence as igc_xdp_setup_pool(), which performs only the
necessary steps, as follows:
1. stop the traffic and clean buffer
2. stop NAPI
3. install the XDP program
4. resume NAPI
5. allocate buffer and resume the traffic

This patch has been tested using the 'ip link set xdpdrv' command to attach
a simple XDP prog that always returns XDP_PASS.

Before this patch, attaching xdp program will cause ptp4l to lose sync for
few seconds, as shown in ptp4l log below:
  ptp4l[198.082]: rms    4 max    8 freq   +906 +/-   2 delay    12 +/-   0
  ptp4l[199.082]: rms    3 max    4 freq   +906 +/-   3 delay    12 +/-   0
  ptp4l[199.536]: port 1 (enp2s0): link down
  ptp4l[199.536]: port 1 (enp2s0): SLAVE to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED)
  ptp4l[199.600]: selected local clock 22abbc.fffe.bb1234 as best master
  ptp4l[199.600]: port 1 (enp2s0): assuming the grand master role
  ptp4l[199.600]: port 1 (enp2s0): master state recommended in slave only mode
  ptp4l[199.600]: port 1 (enp2s0): defaultDS.priority1 probably misconfigured
  ptp4l[202.266]: port 1 (enp2s0): link up
  ptp4l[202.300]: port 1 (enp2s0): FAULTY to LISTENING on INIT_COMPLETE
  ptp4l[205.558]: port 1 (enp2s0): new foreign master 44abbc.fffe.bb2144-1
  ptp4l[207.558]: selected best master clock 44abbc.fffe.bb2144
  ptp4l[207.559]: port 1 (enp2s0): LISTENING to UNCALIBRATED on RS_SLAVE
  ptp4l[208.308]: port 1 (enp2s0): UNCALIBRATED to SLAVE on MASTER_CLOCK_SELECTED
  ptp4l[208.933]: rms  742 max 1303 freq   -195 +/- 682 delay    12 +/-   0
  ptp4l[209.933]: rms  178 max  274 freq   +387 +/- 243 delay    12 +/-   0

After this patch, attaching xdp program no longer cause ptp4l to lose sync,
as shown in ptp4l log below:
  ptp4l[201.183]: rms    1 max    3 freq   +959 +/-   1 delay     8 +/-   0
  ptp4l[202.183]: rms    1 max    3 freq   +961 +/-   2 delay     8 +/-   0
  ptp4l[203.183]: rms    2 max    3 freq   +958 +/-   2 delay     8 +/-   0
  ptp4l[204.183]: rms    3 max    5 freq   +961 +/-   3 delay     8 +/-   0
  ptp4l[205.183]: rms    2 max    4 freq   +964 +/-   3 delay     8 +/-   0

Besides, before this patch, attaching xdp program will causes flood ping to
lose 10 packets, as shown in ping statistics below:
  --- 169.254.1.2 ping statistics ---
  100000 packets transmitted, 99990 received, +6 errors, 0.01% packet loss, time 34001ms
  rtt min/avg/max/mdev = 0.028/0.301/3104.360/13.838 ms, pipe 10, ipg/ewma 0.340/0.243 ms

After this patch, attaching xdp program no longer cause flood ping to loss
any packets, as shown in ping statistics below:
  --- 169.254.1.2 ping statistics ---
  100000 packets transmitted, 100000 received, 0% packet loss, time 32326ms
  rtt min/avg/max/mdev = 0.027/0.231/19.589/0.155 ms, pipe 2, ipg/ewma 0.323/0.322 ms

On the other hand, this patch has been tested with tools/testing/selftests/
bpf/xdp_hw_metadata app to make sure AF_XDP zero-copy is working fine with
XDP Tx and Rx metadata. Below is the result of last packet after received
10000 UDP packets with interval 1 ms:
  poll: 1 (0) skip=0 fail=0 redir=10000
  xsk_ring_cons__peek: 1
  0x55881c7ef7a8: rx_desc[9999]->addr=8f110 addr=8f110 comp_addr=8f110 EoP
  rx_hash: 0xFB9BB6A3 with RSS type:0x1
  HW RX-time:   1733923136269470866 (sec:1733923136.2695) delta to User RX-time sec:0.0000 (43.280 usec)
  XDP RX-time:   1733923136269482482 (sec:1733923136.2695) delta to User RX-time sec:0.0000 (31.664 usec)
  No rx_vlan_tci or rx_vlan_proto, err=-95
  0x55881c7ef7a8: ping-pong with csum=ab19 (want 315b) csum_start=34 csum_offset=6
  0x55881c7ef7a8: complete tx idx=9999 addr=f010
  HW TX-complete-time:   1733923136269591637 (sec:1733923136.2696) delta to User TX-complete-time sec:0.0001 (108.571 usec)
  XDP RX-time:   1733923136269482482 (sec:1733923136.2695) delta to User TX-complete-time sec:0.0002 (217.726 usec)
  HW RX-time:   1733923136269470866 (sec:1733923136.2695) delta to HW TX-complete-time sec:0.0001 (120.771 usec)
  0x55881c7ef7a8: complete rx idx=10127 addr=8f110

Signed-off-by: Song Yoong Siang <yoong.siang.song@intel.com>
Tested-by: Avigail Dahan <avigailx.dahan@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

ice: refactor ice_fdir_create_dflt_rules() function

The Flow Director function ice_fdir_create_dflt_rules() calls few
times function ice_create_init_fdir_rule() each time with different
enum ice_fltr_ptype parameter. Next step is to return error code if
error occurred.

Change the code to store all necessary default rules in constant array
and call ice_create_init_fdir_rule() in the loop. It makes it easy to
extend the list of default rules in the future, without the need of
duplicate code more and more.

Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

ice: Implement PTP support for E830 devices

Add specific functions and definitions for E830 devices to enable
PTP support.

E830 devices support direct write to GLTSYN_ registers without shadow
registers and 64 bit read of PHC time.

Enable PTM for E830 device, which is required for cross timestamp and
and dependency on PCIE_PTM for ICE_HWTS.

Check X86_FEATURE_ART for E830 as it may not be present in the CPU.

Cc: Anna-Maria Behnsen <anna-maria@linutronix.de>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Co-developed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Co-developed-by: Milena Olech <milena.olech@intel.com>
Signed-off-by: Milena Olech <milena.olech@intel.com>
Co-developed-by: Paul Greenwalt <paul.greenwalt@intel.com>
Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com>
Signed-off-by: Michal Michalik <michal.michalik@intel.com>
Co-developed-by: Karol Kolacinski <karol.kolacinski@intel.com>
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

ice: Refactor ice_ptp_init_tx_*

Unify ice_ptp_init_tx_* functions for most of the MAC types except E82X.
This simplifies the code for the future use with new MAC types.

Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>