Donald Hunter [Tue, 11 Feb 2025 12:01:19 +0000 (12:01 +0000)]
tools/net/ynl: support decoding indexed arrays as enums
When decoding an indexed-array with a scalar subtype, it is currently
only possible to add a display-hint. Add support for decoding each value
as an enum.
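As an illustration only (this is not the ynl code itself), the idea of decoding each scalar of an indexed array as an enum can be sketched as follows. The LinkMode enum and the decode_indexed_array() helper are hypothetical names made up for this example.

```python
from enum import IntEnum


# Hypothetical enum standing in for one declared in a netlink spec.
class LinkMode(IntEnum):
    AUTO = 0
    FORCED = 1
    DISABLED = 2


def decode_indexed_array(values, enum_cls=None):
    """Decode every scalar of an indexed array.

    When enum_cls is given, each value is mapped to its enum name;
    values the enum does not know are passed through unchanged.
    """
    decoded = []
    for value in values:
        if enum_cls is None:
            decoded.append(value)
            continue
        try:
            decoded.append(enum_cls(value).name.lower())
        except ValueError:
            decoded.append(value)
    return decoded
```

Passing unknown values through unchanged is a design choice of this sketch, so that a newer kernel does not break an older spec.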
Jakub Kicinski [Thu, 13 Feb 2025 02:20:06 +0000 (18:20 -0800)]
Merge branch 'net: dsa: add support for phylink managed EEE'
Russell King says:
====================
net: dsa: add support for phylink managed EEE
This series adds support for phylink managed EEE to DSA, and converts
mt753x to make use of this feature.
Patch 1 implements a helper to indicate whether the MAC LPI operations
are populated (suggested by Vladimir)
Patch 2 makes the necessary changes to the core code - we retain calling
set_mac_eee(), but this method now becomes a way to merely validate the
arguments when using phylink managed EEE rather than performing any
configuration.
Patch 3 converts the mt7530 driver to use phylink managed EEE.
====================
Russell King (Oracle) [Mon, 10 Feb 2025 10:36:54 +0000 (10:36 +0000)]
net: dsa: mt7530: convert to phylink managed EEE
Convert mt7530 to use phylink managed EEE. When enabling EEE, we set
both PMCR_FORCE_EEE1G and PMCR_FORCE_EEE100 irrespective of the speed,
and clear them both when disabling.
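A minimal sketch of the set-both/clear-both behaviour described above. The bit positions are placeholders chosen for the example, not the real PMCR register layout, and apply_eee() is a hypothetical helper.

```python
# Placeholder bit positions, NOT the real mt7530 register layout.
PMCR_FORCE_EEE1G = 1 << 0
PMCR_FORCE_EEE100 = 1 << 1
PMCR_EEE_MASK = PMCR_FORCE_EEE1G | PMCR_FORCE_EEE100


def apply_eee(pmcr, enable):
    """Set both EEE bits when enabling, clear both when disabling,
    irrespective of the link speed."""
    if enable:
        return pmcr | PMCR_EEE_MASK
    return pmcr & ~PMCR_EEE_MASK
```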
Russell King (Oracle) [Mon, 10 Feb 2025 10:36:49 +0000 (10:36 +0000)]
net: dsa: allow use of phylink managed EEE support
In order to allow DSA drivers to use phylink managed EEE, we need to
change the behaviour of the DSA's .set_eee() ethtool method.
Implementation of the DSA .set_mac_eee() method becomes optional with
phylink managed EEE as it is only used to validate the EEE parameters
supplied from userspace. The rest of the EEE state management should
be left to phylink.
Note that we don't collect csum-unnecessary, just the uncommon
cases (and unnecessary is all the rest of the packets). There
is no programmatic use for these stats AFAIK, just manual debug.
====================
Jakub Kicinski [Tue, 11 Feb 2025 18:13:53 +0000 (10:13 -0800)]
eth: fbnic: wrap tx queue stats in a struct
The queue stats struct is used for Rx and Tx queues. Wrap
the Tx stats in a struct and a union, so that we can reuse
the same space for Rx stats on Rx queues.
This also makes it easy to add an assert to the stat handling
code to catch new stats not being aggregated on shutdown.
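The layout idea can be modelled with ctypes as a rough sketch. All field names below are invented for the example and do not reflect the actual fbnic structures; the size check only approximates the kind of assert the commit mentions.

```python
import ctypes

U64 = ctypes.c_uint64


class TxStats(ctypes.Structure):
    # Hypothetical Tx-only counters.
    _fields_ = [("lso", U64), ("csum_partial", U64)]


class RxStats(ctypes.Structure):
    # Hypothetical Rx-only counter.
    _fields_ = [("alloc_failed", U64)]


class DirStats(ctypes.Union):
    # Tx and Rx counters share the same storage.
    _fields_ = [("tx", TxStats), ("rx", RxStats)]


class QueueStats(ctypes.Structure):
    _fields_ = [("packets", U64), ("bytes", U64), ("dir", DirStats)]


# Counters the (hypothetical) shutdown path knows how to aggregate.
AGGREGATED_TX = ("lso", "csum_partial")


def tx_stats_fully_aggregated():
    """Size-based check: fails if a counter is added to TxStats
    without also being listed in AGGREGATED_TX."""
    return ctypes.sizeof(TxStats) == len(AGGREGATED_TX) * ctypes.sizeof(U64)
```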
Jakub Kicinski [Tue, 11 Feb 2025 18:13:52 +0000 (10:13 -0800)]
net: report csum_complete via qstats
Commit 13c7c941e729 ("netdev: add qstat for csum complete") reserved
the entry for csum complete in the qstats uAPI. Start reporting this
value now that we have a driver which needs it.
====================
Use PHYlib for reset randomization and adjustable polling
This patch set tackles a DP83TG720 reset lock issue and improves PHY
polling. Rather than adding a separate polling worker to randomize PHY
resets, I chose to extend the PHYlib framework - which already handles
most of the needed functionality - with adjustable polling. This
approach not only addresses the DP83TG720-specific problem (where
synchronized resets can lock the link) but also lays the groundwork for
optimizing PHY stats polling across all PHY drivers. With generic PHY
stats coming in, we can adjust the polling interval based on hardware
characteristics, such as using longer intervals for PHYs with stable HW
counters or shorter ones for high-speed links prone to counter
overflows.
Patch version changes are tracked in separate patches.
====================
Oleksij Rempel [Mon, 10 Feb 2025 08:23:58 +0000 (09:23 +0100)]
net: phy: dp83tg720: Add randomized polling intervals for link detection
Address the limitations of the DP83TG720 PHY, which cannot reliably
detect or report a stable link state. To handle this, the PHY must be
periodically reset when the link is down. However, synchronized reset
intervals between the PHY and its link partner can result in a deadlock,
preventing the link from re-establishing.
This change introduces a randomized polling interval when the link is
down to desynchronize resets between link partners.
Oleksij Rempel [Mon, 10 Feb 2025 08:23:57 +0000 (09:23 +0100)]
net: phy: Add support for driver-specific next update time
Introduce the `phy_get_next_update_time` function to allow PHY drivers
to dynamically determine the time (in jiffies) until the next state
update event. This enables more flexible and adaptive polling intervals
based on the link state or other conditions.
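A small model of how a driver-specific "next update time" hook could randomize polling while the link is down. The interval values and the function name next_update_delay_ms() are assumptions for illustration; the kernel works in jiffies and the driver's real values may differ.

```python
import random

# Assumed intervals, in milliseconds, chosen only for the example.
POLL_LINK_UP_MS = 1000
POLL_LINK_DOWN_MIN_MS = 350
POLL_LINK_DOWN_MAX_MS = 500


def next_update_delay_ms(link_up, rng=random):
    """Return the delay until the next PHY state update.

    With the link up a fixed interval is used. With the link down a
    random interval is picked, so that two link partners running the
    same logic are unlikely to keep resetting in lockstep.
    """
    if link_up:
        return POLL_LINK_UP_MS
    return rng.randint(POLL_LINK_DOWN_MIN_MS, POLL_LINK_DOWN_MAX_MS)
```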
Alexei Lazar [Sun, 9 Feb 2025 10:17:16 +0000 (12:17 +0200)]
net/mlx5: XDP, Enable TX side XDP multi-buffer support
In XDP scenarios, fragmented packets can occur if the MTU is larger
than the page size, even when the packet size fits within the linear
part.
If XDP multi-buffer support is disabled, the fragmented part won't be
handled in the TX flow, leading to packet drops.
Since XDP multi-buffer support is always available, this commit removes
the conditional check for enabling it.
This ensures that XDP multi-buffer support is always enabled,
regardless of the `is_xdp_mb` parameter, and guarantees the handling of
fragmented packets in such scenarios.
Alexei Lazar [Sun, 9 Feb 2025 10:17:15 +0000 (12:17 +0200)]
net/mlx5: Extend Ethtool loopback selftest to support non-linear SKB
Current loopback test validation ignores the non-linear SKB case in
the SKB access, which can lead to failures in scenarios such as
when HW GRO is enabled.
Linearize the SKB so both cases will be handled.
Amir Tzin [Sun, 9 Feb 2025 10:17:13 +0000 (12:17 +0200)]
net/mlx5e: Add direct TIRs to devlink rx reporter diagnose
Add "RX resources" tag to the output of rx reporter diagnose callback.
Underneath, add a tag for direct TIRs; for each TIR, expose its tirn and
the corresponding rqtn.
Amir Tzin [Sun, 9 Feb 2025 10:17:12 +0000 (12:17 +0200)]
net/mlx5e: Move RQs diagnose to a dedicated function
Move rx reporter RQs diagnose from mlx5e_rx_reporter_diagnose() to a
dedicated function. This change is a preparation for the following
series which extends diagnose output for the rx reporter. While at it,
also pass a mlx5e_priv pointer to
mlx5e_rx_reporter_diagnose_common_config() as this is the argument the
latter actually needs.
Akiva Goldberger [Sun, 9 Feb 2025 10:17:10 +0000 (12:17 +0200)]
net/mlx5: Rename and move mlx5_esw_query_vport_vhca_id
Rename mlx5_esw_query_vport_vhca_id to mlx5_vport_get_vhca_id and move
it to vport file. Also, add function declaration to mlx5_core header
file. This better represents the function's usage and allows for it to
be called from other parts of the mlx5_core driver.
William Tu [Sun, 9 Feb 2025 10:17:09 +0000 (12:17 +0200)]
net/mlx5e: set the tx_queue_len for pfifo_fast
By default, the mq netdev creates a pfifo_fast qdisc. On a
system with 16 cores, the pfifo_fast with 3 bands consumes
16 * 3 * 8 (size of pointer) * 1024 (default tx queue len)
= 393KB. The patch sets the tx qlen to representor default
value, 128 (1<<MLX5E_REP_PARAMS_DEF_LOG_SQ_SIZE), which
consumes 16 * 3 * 8 * 128 = 49KB, saving 344KB for each
representor at ECPF.
Signed-off-by: William Tu <witu@nvidia.com>
Reviewed-by: Daniel Jurgens <danielj@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Link: https://patch.msgid.link/20250209101716.112774-9-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
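The memory figures in the pfifo_fast commit above can be reproduced with a few lines. The helper name pfifo_fast_bytes() is made up for this example; the constants come from the commit text, and KB is taken as 1000 bytes to match the quoted numbers.

```python
POINTER_SIZE = 8  # bytes per pointer, as stated in the commit
BANDS = 3         # pfifo_fast bands


def pfifo_fast_bytes(cores, tx_queue_len):
    """Memory consumed by pfifo_fast: one ring of pointers per band per core."""
    return cores * BANDS * POINTER_SIZE * tx_queue_len


default_bytes = pfifo_fast_bytes(16, 1024)  # default tx queue len
reduced_bytes = pfifo_fast_bytes(16, 128)   # representor default
saved_bytes = default_bytes - reduced_bytes
```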
William Tu [Sun, 9 Feb 2025 10:17:08 +0000 (12:17 +0200)]
net/mlx5e: reduce rep rxq depth to 256 for ECPF
Experiments show that a single-queue representor netdev consumes around
2.8MB of kernel memory, and 1.8MB out of the 2.8MB is due to the page
pool for the RXQ. Scaling to a thousand representors consumes 2.8GB,
which becomes a memory pressure issue for embedded devices such as
BlueField-2 16GB / BlueField-3 32GB memory.
Since representor netdevs mostly handle miss traffic, and ideally,
most of the traffic will be offloaded, reduce the default non-uplink
rep netdev's RXQ default depth from 1024 to 256 if mdev is ecpf eswitch
manager. This saves around 1MB of memory per regular RQ,
(1024 - 256) * 2KB, allocated from page pool.
With rxq depth of 256, the netlink page pool tool reports
$./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
--dump page-pool-get
{'id': 277,
'ifindex': 9,
'inflight': 128,
'inflight-mem': 786432,
'napi-id': 775}]
This is because an MTU of 1500 plus headroom consumes half a page, so
256 rxq entries consume around 128 pages (hence a page pool of size 128
is created), as shown above at inflight.
Note that each netdev has multiple types of RQs, including
Regular RQ, XSK, PTP, Drop, Trap RQ. Since non-uplink representor
only supports regular rq, this patch only changes the regular RQ's
default depth.
Signed-off-by: William Tu <witu@nvidia.com>
Reviewed-by: Bodong Wang <bodong@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Link: https://patch.msgid.link/20250209101716.112774-8-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
William Tu [Sun, 9 Feb 2025 10:17:07 +0000 (12:17 +0200)]
net/mlx5e: reduce the max log mpwrq sz for ECPF and reps
For the ECPF and representors, reduce the max MPWRQ size from 256KB (18)
to 128KB (17). This prepares the later patch for saving representor
memory.
With Striding RQ, there is a minimum of 4 MPWQEs. So with 128KB of max
MPWRQ size, the minimal memory is 4 * 128KB = 512KB. When creating page
pool, consider 1500 mtu, the minimal page pool size will be 512KB/4KB =
128 pages = 256 rx ring entries (2 entries per page).
Before this patch, setting the RX ring size (ethtool -G rx) to 256 causes
the driver to allocate a larger page pool than it needs, because the max
MPWRQ size is 256KB (18). Ex: 4 * 256KB = 1MB, 1MB/4KB = 256 pages, but
actually 128 pages is good enough. Reducing the max MPWRQ to 128KB fixes
the limitation.
Signed-off-by: William Tu <witu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Link: https://patch.msgid.link/20250209101716.112774-7-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
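The page pool sizing arithmetic from the MPWRQ commit above, written out as a sketch. The function name is hypothetical; the 4KB page size, the minimum of 4 MPWQEs and the 2 entries per page (1500 MTU) are taken from the commit text.

```python
PAGE_SIZE = 4 * 1024   # 4KB pages, as used in the commit text
MIN_MPWQES = 4         # minimum number of MPWQEs with Striding RQ
ENTRIES_PER_PAGE = 2   # 1500 MTU + headroom takes half a page


def min_page_pool_pages(log_max_mpwrq_sz):
    """Minimal page pool size, in pages, for a given max MPWRQ size (log2)."""
    return MIN_MPWQES * (1 << log_max_mpwrq_sz) // PAGE_SIZE


pages_128k = min_page_pool_pages(17)            # 4 * 128KB / 4KB
pages_256k = min_page_pool_pages(18)            # 4 * 256KB / 4KB
ring_entries = pages_128k * ENTRIES_PER_PAGE    # rx ring entries served
```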
Jakub Kicinski [Wed, 12 Feb 2025 03:51:16 +0000 (19:51 -0800)]
Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
Tony Nguyen says:
====================
Intel Wired LAN Driver Updates 2025-02-10 (ice, igc, e1000e)
For ice:
Karol, Jake, and Michal add PTP support for E830 devices. Karol
refactors and cleans up PTP code. Jake allows for a common
cross-timestamp implementation to be shared for all devices and
Michal adds E830 support.
Mateusz cleans up initial Flow Director rule creation to loop rather
than duplicate repeated similar calls.
For igc:
Siang adjusts calls to remove the need for close and open calls on loading
XDP program.
For e1000e:
Gerhard Engleder batches register writes for writing multicast table
on real-time kernels.
* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
e1000e: Fix real-time violations on link up
igc: Avoid unnecessary link down event in XDP_SETUP_PROG process
ice: refactor ice_fdir_create_dflt_rules() function
ice: Implement PTP support for E830 devices
ice: Refactor ice_ptp_init_tx_*
ice: Add unified ice_capture_crosststamp
ice: Process TSYN IRQ in a separate function
ice: Use FIELD_PREP for timestamp values
ice: Remove unnecessary ice_is_e8xx() functions
ice: Don't check device type when checking GNSS presence
====================
Edward Cree [Mon, 10 Feb 2025 11:25:45 +0000 (11:25 +0000)]
sfc: document devlink flash support
Update the information in sfc's devlink documentation including
support for firmware update with devlink flash.
Also update the help text for CONFIG_SFC_MTD, as it is no longer
strictly required for firmware updates.
Edward Cree [Mon, 10 Feb 2025 11:25:43 +0000 (11:25 +0000)]
sfc: extend NVRAM MCDI handlers
Support variable write-alignment, and background updates. The latter
allows other MCDI to continue while the device is processing an
MC_CMD_NVRAM_UPDATE_FINISH, since this can take a long time owing to
e.g. cryptographic signature verification.
Expose these handlers in mcdi.h, and build them even when
CONFIG_SFC_MTD=n, so they can be used for devlink flash in a
subsequent patch.
====================
Use HWMON_CHANNEL_INFO macro to simplify code
The HWMON_CHANNEL_INFO macro is provided by hwmon.h and used widely by many
other drivers. This series uses the HWMON_CHANNEL_INFO macro to simplify
code in the net subsystem.
Note: These patches do not depend on each other. They are put together
only because they belong to the same subsystem.
====================
Wolfram Sang [Mon, 10 Feb 2025 11:37:11 +0000 (12:37 +0100)]
net: wwan: t7xx: don't include '<linux/pm_wakeup.h>' directly
The header clearly states that it does not want to be included directly,
only via '<linux/(platform_)?device.h>'. Which is already present, so
delete the superfluous include.
Ethan Carter Edwards [Sun, 9 Feb 2025 04:06:21 +0000 (23:06 -0500)]
hamradio: baycom: replace strcpy() with strscpy()
The strcpy() function has been deprecated and replaced with strscpy().
There is an effort to make this change treewide:
https://github.com/KSPP/linux/issues/88.
Paolo Abeni [Tue, 11 Feb 2025 11:46:39 +0000 (12:46 +0100)]
Merge branch 'mptcp-pm-misc-cleanups-part-2'
Matthieu Baerts says:
====================
mptcp: pm: misc cleanups, part 2
These cleanups lead the way to the unification of the path-manager
interfaces, and allow future extensions. The following patches are not
all linked to each other, but are all related to the path-managers.
- Patch 1: drop unneeded parameter in a function helper.
- Patch 2: clearer NL error message when an NL attribute is missing.
- Patch 3: more precise NL messages by avoiding 'this or that is NOK'.
- Patch 4: improve too vague or missing NL err messages.
- Patch 5: use GENL_REQ_ATTR_CHECK to look for mandatory NL attributes.
- Patch 6: avoid overriding the error message.
- Patch 7: check all mandatory NL attributes with GENL_REQ_ATTR_CHECK.
- Patch 8: use NL_SET_ERR_MSG_ATTR instead of GENL_SET_ERR_MSG
- Patch 9: move doit callbacks used for both PM to pm.c.
- Patch 10: drop another unneeded parameter in a function helper.
- Patch 11: share the ID parsing code for the 'get_addr' callback.
- Patch 12: share sending NL code for the 'get_addr' callback.
- Patch 13: drop yet another unneeded parameter in a function helper.
- Patch 14: pick the usual structure type for the remote address.
- Patch 15: share the local addr parsing code for the 'set_flags' cb.
The behaviour when there are no errors should then not be modified.
Geliang Tang [Fri, 7 Feb 2025 13:59:32 +0000 (14:59 +0100)]
mptcp: pm: change rem type of set_flags
Generally, in the path manager interfaces, the local address is defined
as an mptcp_pm_addr_entry type address, while the remote address is
defined as an mptcp_addr_info type one:
Geliang Tang [Fri, 7 Feb 2025 13:59:31 +0000 (14:59 +0100)]
mptcp: pm: drop skb parameter of set_flags
The first parameter 'skb' in mptcp_pm_nl_set_flags() is only used to
obtain the network namespace, which can also be obtained through the
second parameter 'info' by using the genl_info_net() helper.
This patch drops the useless 'skb' parameter in all three set_flags()
interfaces.
Geliang Tang [Fri, 7 Feb 2025 13:59:30 +0000 (14:59 +0100)]
mptcp: pm: reuse sending nlmsg code in get_addr
The netlink messages are sent both in mptcp_pm_nl_get_addr() and
mptcp_userspace_pm_get_addr(), which makes the code somewhat repetitive.
This is because the netlink PM and userspace PM use different locks to
protect the address entry that needs to be sent via the netlink message.
The former uses rcu read lock, and the latter uses msk->pm.lock.
After holding the lock, get the entry from the list, send the entry, and
finally release the lock.
This patch changes the process by getting the entry while holding the lock,
then making a copy of the entry so that the lock can be released. Finally,
the copy of the entry is sent without locking:
This way we can reuse the send_nlmsg() code in get_addr() interfaces
between the netlink PM and userspace PM. They only need to implement their
own get_addr() interfaces to hold the different locks, get the entry from
the different lists, then release the locks.
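The copy-then-send pattern described above can be modelled roughly as follows. This is a userspace illustration, not the MPTCP code: AddrEntry, PathManager and the dict-based send_nlmsg() are stand-ins, and a threading.Lock plays the role of either the RCU read lock or msk->pm.lock.

```python
import copy
import threading
from dataclasses import dataclass


@dataclass
class AddrEntry:
    id: int
    addr: str


class PathManager:
    """One PM flavour: owns its own lock and its own list of entries."""

    def __init__(self, entries):
        self._lock = threading.Lock()
        self._entries = {e.id: e for e in entries}

    def get_addr(self, addr_id):
        # Look the entry up under the lock, then hand back a copy so
        # the lock can be dropped before anything is sent.
        with self._lock:
            entry = self._entries.get(addr_id)
            return copy.copy(entry) if entry is not None else None


def send_nlmsg(entry):
    """Shared send step; runs without any PM lock held."""
    return {"id": entry.id, "addr": entry.addr}


def get_addr_doit(pm, addr_id):
    entry = pm.get_addr(addr_id)
    if entry is None:
        return None  # the real code would report an error here
    return send_nlmsg(entry)
```

Each PM only has to provide get_addr(); the send step is common to both.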
Geliang Tang [Fri, 7 Feb 2025 13:59:29 +0000 (14:59 +0100)]
mptcp: pm: add id parameter for get_addr
The address id is parsed both in mptcp_pm_nl_get_addr() and
mptcp_userspace_pm_get_addr(), which makes the code somewhat repetitive.
So this patch adds a new parameter 'id' for all get_addr() interfaces.
The address id is only parsed in mptcp_pm_nl_get_addr_doit(), and then
passed to both mptcp_pm_nl_get_addr() and mptcp_userspace_pm_get_addr().
Geliang Tang [Fri, 7 Feb 2025 13:59:28 +0000 (14:59 +0100)]
mptcp: pm: drop skb parameter of get_addr
The first parameter 'skb' of the get_addr() interfaces is now useless
since the mptcp_userspace_pm_get_sock() helper is used. This patch drops
it from all of them.
Instead of only returning a text message with GENL_SET_ERR_MSG(),
NL_SET_ERR_MSG_ATTR() can help the userspace developers by also
reporting which attribute is faulty.
When the error is specific to an attribute, NL_SET_ERR_MSG_ATTR() is now
used. The error messages have not been modified in this commit.
Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
mptcp_pm_parse_entry() will check if the given attribute is defined. If
not, it will return a generic error: "missing address info".
It might then not be clear for the userspace developer which attribute
is missing, especially when the command takes multiple addresses.
By using GENL_REQ_ATTR_CHECK(), the userspace will get a hint about
which attribute is missing, making things clearer. Note that this is what
was already done for most of the other MPTCP NL commands, this patch
simply adds the missing ones.
Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Geliang Tang [Fri, 7 Feb 2025 13:59:23 +0000 (14:59 +0100)]
mptcp: pm: userspace: use GENL_REQ_ATTR_CHECK
A more general way to check if MPTCP_PM_ATTR_* exists in 'info'
is to use GENL_REQ_ATTR_CHECK(info, MPTCP_PM_ATTR_*) instead of
directly reading info->attrs[MPTCP_PM_ATTR_*] and then checking
if it's NULL.
So this patch uses GENL_REQ_ATTR_CHECK() for userspace PM in
mptcp_pm_nl_announce_doit(), mptcp_pm_nl_remove_doit(),
mptcp_pm_nl_subflow_create_doit(), mptcp_pm_nl_subflow_destroy_doit()
and mptcp_userspace_pm_get_sock().
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
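A rough userspace model of why a required-attribute check is more helpful than a bare NULL test: the check records which attribute was missing so it can be reported back. This is only a sketch of the idea. req_attr_check(), the extack dict and announce() are illustrative stand-ins, and the attribute names are used as plain strings.

```python
import errno


def req_attr_check(attrs, extack, attr):
    """Return True when attr is missing, recording it in extack,
    loosely mirroring how a required-attribute check behaves."""
    if attrs.get(attr) is None:
        extack["missing_attr"] = attr
        return True
    return False


def announce(attrs):
    """Hypothetical handler requiring two attributes."""
    extack = {}
    for required in ("MPTCP_PM_ATTR_TOKEN", "MPTCP_PM_ATTR_ADDR"):
        if req_attr_check(attrs, extack, required):
            return -errno.EINVAL, extack
    return 0, extack
```

With a plain NULL test the caller would only see the error code; here the reply also names the attribute that was absent.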
mptcp: pm: userspace: flags: clearer msg if no remote addr
Since its introduction in commit 892f396c8e68 ("mptcp: netlink: issue
MP_PRIO signals from userspace PMs"), it was mandatory to specify the
remote address, because of the 'if (rem->addr.family == AF_UNSPEC)'
check done later on.
In theory, this attribute can be optional, but it sounds better to be
precise to avoid sending the MP_PRIO on the wrong subflow, e.g. if there
are multiple subflows attached to the same local ID. This can be relaxed
later on if there is a need to act on multiple subflows with one
command.
For the moment, the check to see if attr_rem is NULL can be removed,
because mptcp_pm_parse_entry() will do this check as well, no need to do
that differently here.
Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Yuyang Huang [Fri, 7 Feb 2025 11:08:36 +0000 (20:08 +0900)]
selftests/net: Add selftest for IPv4 RTM_GETMULTICAST support
This change introduces a new selftest case to verify the functionality
of dumping IPv4 multicast addresses using the RTM_GETMULTICAST netlink
message. The test utilizes the ynl library to interact with the
netlink interface and validate that the kernel correctly reports the
joined IPv4 multicast addresses.
Yuyang Huang [Fri, 7 Feb 2025 11:08:35 +0000 (20:08 +0900)]
netlink: support dumping IPv4 multicast addresses
Extended RTM_GETMULTICAST to support dumping joined IPv4 multicast
addresses, in addition to the existing IPv6 functionality. This allows
userspace applications to retrieve both IPv4 and IPv6 multicast
addresses through a similar netlink command and then monitor future
changes by registering to RTNLGRP_IPV4_MCADDR and RTNLGRP_IPV6_MCADDR.
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuyang Huang <yuyanghuang@google.com>
Link: https://patch.msgid.link/20250207110836.2407224-1-yuyanghuang@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Csókás, Bence [Fri, 7 Feb 2025 12:12:55 +0000 (13:12 +0100)]
net: fec: Refactor MAC reset to function
The core is reset both in `fec_restart()` (called on link-up) and
`fec_stop()` (going to sleep, driver remove etc.). These two functions
had their separate implementations, which was at first only a register
write and a `udelay()` (and the accompanying block comment). However,
since then we got soft-reset (MAC disable) and Wake-on-LAN support, which
meant that these implementations diverged, often causing bugs.
For instance, as of now, `fec_stop()` does not check for
`FEC_QUIRK_NO_HARD_RESET`, meaning the MII/RMII mode is cleared on e.g.
a PM power-down event; and `fec_restart()` missed the refactor renaming
the "magic" constant `1` to `FEC_ECR_RESET`.
To harmonize current implementations, and eliminate this source of
potential future bugs, refactor implementation to a common function.
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Csókás, Bence <csokas.bence@prolan.hu>
Link: https://patch.msgid.link/20250207121255.161146-2-csokas.bence@prolan.hu
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Ido Schimmel [Fri, 7 Feb 2025 18:00:44 +0000 (19:00 +0100)]
mlxsw: Enable Tx checksum offload
The device is able to checksum plain TCP / UDP packets over IPv4 / IPv6
when the 'ipcs' bit in the send descriptor is set. Advertise support for
the 'NETIF_F_IP{,6}_CSUM' features in net devices registered by the
driver and VLAN uppers and set the 'ipcs' bit when the stack requests Tx
checksum offload.
Note that the device also calculates the IPv4 checksum, but it first
zeroes the current checksum so there should not be any difference
compared to the checksum calculated by the kernel.
On SN5600 (Spectrum-4) there is about 10% improvement in Tx packet rate
with 1400 byte packets when using pktgen.
Tested on Spectrum-{1,2,3,4} with all the combinations of IPv4 / IPv6,
TCP / UDP, with and without VLAN.
Jakub Kicinski [Fri, 7 Feb 2025 18:41:40 +0000 (10:41 -0800)]
selftests: drv-net: add helper for path resolution
Referring to C binaries from Python code is going to be a common
need. Add a helper to convert from a path relative to the test.
Meaning, if the test is in the same directory as the binary, the
call would be simply: cfg.rpath("binary").
The helper name "rpath" is not great. I can't think of a better
name that would be accurate yet concise.
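One possible shape for such a helper, as a sketch only. The Env class here is a stand-in, not the selftest's actual class, and the way the test directory is determined is an assumption.

```python
from pathlib import Path


class Env:
    """Stand-in environment object; remembers where the test lives."""

    def __init__(self, test_file):
        self.test_dir = Path(test_file).resolve().parent

    def rpath(self, path):
        """Resolve a path given relative to the test's own directory."""
        return (self.test_dir / path).resolve().as_posix()
```

A test in the same directory as its C helper would then call cfg.rpath("binary") and get an absolute path regardless of the current working directory.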
Jakub Kicinski [Fri, 7 Feb 2025 18:41:39 +0000 (10:41 -0800)]
selftests: drv-net: factor out a DrvEnv base class
We have separate Env classes for local tests and tests with a remote
endpoint. Make it easier to share the code by creating a base class.
Make env loading a method of this class.
When I implemented virtio's hash-related features to tun/tap [1],
I found tun/tap does not fill the entire region reserved for the virtio
header, leaving some uninitialized hole in the middle of the buffer
after read()/recvmsg().
This series fills the uninitialized hole. More concretely, the
num_buffers field will be initialized with 1, and the other fields will
be initialized with 0. Setting the num_buffers field to 1 is mandated by
virtio 1.0 [2].
The change to virtio header is preceded by another change that refactors
tun and tap to unify their virtio-related code.
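To make the intended header contents concrete, here is a userspace sketch that builds a 12-byte virtio-net header (the virtio 1.0 layout: flags, gso_type, hdr_len, gso_size, csum_start, csum_offset, num_buffers) with every field zero except num_buffers. blank_vnet_hdr() is a name made up for this example; it is not a tun/tap function.

```python
import struct

# Little-endian virtio-net header with num_buffers:
# u8 flags, u8 gso_type, then five 16-bit fields
# (hdr_len, gso_size, csum_start, csum_offset, num_buffers).
VNET_HDR = struct.Struct("<BBHHHHH")


def blank_vnet_hdr():
    """Header with no offloads requested: all zero, num_buffers = 1."""
    return VNET_HDR.pack(0, 0, 0, 0, 0, 0, 1)
```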
====================
net: xilinx: axienet: Enable adaptive IRQ coalescing with DIM
To improve performance without sacrificing latency under low load,
enable DIM. While I appreciate not having to write the library myself, I
do think there are many unusual aspects to DIM, as detailed in the last
patch.
====================
Sean Anderson [Thu, 6 Feb 2025 20:10:36 +0000 (15:10 -0500)]
net: xilinx: axienet: Enable adaptive IRQ coalescing with DIM
The default RX IRQ coalescing settings of one IRQ per packet can represent
a significant CPU load. However, increasing the coalescing unilaterally
can result in undesirable latency under low load. Adaptive IRQ
coalescing with DIM offers a way to adjust the coalescing settings based
on load.
This device only supports "CQE" mode [1], where each packet resets the
timer. Therefore, an interrupt is fired either when we receive
coalesce_count_rx packets or when the interface is idle for
coalesce_usec_rx. With this in mind, consider the following scenarios:
Link saturated
Here we want to set coalesce_count_rx to a large value, in order to
coalesce more packets and reduce CPU load. coalesce_usec_rx should
be set to at least the time for one packet. Otherwise the link will
be "idle" and we will get an interrupt for each packet anyway.
Bursts of packets
Each burst should be coalesced into a single interrupt, although it
may be prudent to reduce coalesce_count_rx for better latency.
coalesce_usec_rx should be set to at least the time for one packet
so bursts are coalesced. However, additional time beyond the packet
time will just increase latency at the end of a burst.
Sporadic packets
Due to low load, we can set coalesce_count_rx to 1 in order to
reduce latency to the minimum. coalesce_usec_rx does not matter in
this case.
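The three scenarios can be checked with a toy model of "CQE" mode, where every packet restarts the idle timer. This is a simplification written for illustration, not the driver or the DIM library; count_interrupts() and the packet timings are assumptions.

```python
def count_interrupts(arrivals_us, coalesce_count, coalesce_usec):
    """Count interrupts for packets arriving at the given times (in usec).

    An interrupt fires when coalesce_count packets are pending, or when
    the link has been idle for coalesce_usec since the last packet.
    """
    interrupts = 0
    pending = 0
    last = None
    for t in arrivals_us:
        # The idle timer expired before this packet showed up.
        if pending and t - last >= coalesce_usec:
            interrupts += 1
            pending = 0
        pending += 1
        last = t
        if pending >= coalesce_count:
            interrupts += 1
            pending = 0
    if pending:
        interrupts += 1  # timer expiry after the final packet
    return interrupts


# Assumed packet time of 12 usec, chosen only for the example.
saturated = [i * 12 for i in range(100)]
bursts = [i * 12 for i in range(10)] + [1000 + i * 12 for i in range(10)]
sporadic = [i * 5000 for i in range(5)]
```

With these inputs, a timeout shorter than the packet time degenerates to one interrupt per packet even on a saturated link, while a timeout just above the packet time coalesces each burst into a single interrupt.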
Based on this analysis, I expected the CQE profiles to look something
like
I found this very surprising. The number of coalesced packets
*decreases* as load increases. But as load increases we have more
opportunities to coalesce packets without affecting latency as much.
Additionally, the profile *increases* the usec as the load increases.
But as load increases, the gaps between packets will tend to become
smaller, making it possible to *decrease* usec for better latency at the
end of a "burst".
I consider the default CQE profile unsuitable for this NIC. Therefore,
we use the first profile outlined in this commit instead.
coalesce_usec_rx is set to 16 by default, but the user can customize it.
This may be necessary if they are using jumbo frames. I think adjusting
the profile times based on the link speed/mtu would be a good improvement
for generic DIM.
In addition to the above profile problems, I noticed the following
additional issues with DIM while testing:
- DIM tends to "wander" when at low load, since the performance gradient
is pretty flat. If you only have 10p/ms anyway then adjusting the
coalescing settings will not affect throughput very much.
- DIM takes a long time to adjust back to low indices when load is
decreased following a period of high load. This is because it only
re-evaluates its settings once every 64 interrupts. However, at low
load 64 interrupts can be several seconds.
Finally: performance. This patch increases receive throughput with
iperf3 from 840 Mbits/sec to 938 Mbits/sec, decreases interrupts from
69920/sec to 316/sec, and decreases CPU utilization (4x Cortex-A53) from
43% to 9%.
Sean Anderson [Thu, 6 Feb 2025 20:10:35 +0000 (15:10 -0500)]
net: xilinx: axienet: Get coalesce parameters from driver state
The cr variables now contain the same values as the control registers
themselves. Extract/calculate the values from the variables instead of
saving the user-specified values. This allows us to remove some
bookeeping, and also lets the user know what the actual coalesce
settings are.
Sean Anderson [Thu, 6 Feb 2025 20:10:34 +0000 (15:10 -0500)]
net: xilinx: axienet: Support adjusting coalesce settings while running
In preparation for adaptive IRQ coalescing, we first need to support
adjusting the settings at runtime. The existing code doesn't require any
locking because
- dma_start is the only function that modifies rx/tx_dma_cr. It is
always called with IRQs and NAPI disabled, so nothing else is touching
the hardware.
- The IRQs don't race with poll, since the latter is a softirq.
- The IRQs don't race with dma_stop since they both just clear the
control registers.
- dma_stop doesn't race with poll since the former is called with NAPI
disabled.
However, once we introduce another function that modifies rx/tx_dma_cr,
we need to have some locking to prevent races. Introduce two locks to
protect these variables and their registers.
The control register values are now generated where the coalescing
settings are set. Converting coalescing settings to control register
values may require sleeping because of clk_get_rate. However, the
read/modify/write of the control registers themselves can't sleep
because it needs to happen in IRQ context. By pre-calculating the
control register values, we avoid introducing an additional mutex.
Since axienet_dma_start writes the control settings when it runs, we
don't bother updating the CR registers when rx/tx_dma_started is false.
This prevents any issues from writing to the control registers in the
middle of a reset sequence.
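The split between "compute the register value where sleeping is allowed" and "write it under a lock only if the channel is started" can be modelled as below. Everything here is a userspace stand-in: the register layout, clock rate, tick divisor and class names are invented for the sketch, and a threading.Lock plays the role of the spinlock.

```python
import threading

CLK_HZ = 100_000_000  # assumed clock rate for the example


def get_clk_rate():
    """Stand-in for a clock query that may sleep in the kernel."""
    return CLK_HZ


def calc_cr(count, usec):
    """May-sleep step: turn coalesce settings into a register value.
    The bit layout and the divisor of 125 are made up."""
    ticks = min(0xFF, usec * (get_clk_rate() // 1_000_000) // 125)
    return ((count & 0xFF) << 16) | (ticks << 24)


class DmaChannel:
    def __init__(self):
        self.lock = threading.Lock()  # models the spinlock
        self.cr = 0                   # cached control register value
        self.hw_cr = 0                # models the hardware register
        self.started = False

    def set_coalesce(self, count, usec):
        cr = calc_cr(count, usec)     # done before taking the lock
        with self.lock:
            self.cr = cr
            if self.started:          # skip the write while stopped
                self.hw_cr = cr

    def start(self):
        with self.lock:
            self.hw_cr = self.cr      # start always writes the cached value
            self.started = True
```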
Jakub Kicinski [Tue, 11 Feb 2025 01:54:45 +0000 (17:54 -0800)]
Merge branch 'xsk-the-lost-bits-from-chapter-iii'
Alexander Lobakin says:
====================
xsk: the lost bits from Chapter III
Before introducing libeth_xdp, we need to add a couple more generic
helpers. Notably:
* 01: add generic loop unrolling hint helpers;
* 04: add helper to get both xdp_desc's DMA address and metadata
pointer in one go, saving several cycles and hotpath object
code size in drivers (especially when unrolling).
Bonus:
* 02, 03: convert two drivers which were using custom macros to
generic unrolled_count() (trivial, no object code changes).
====================