www.infradead.org Git - users/willy/pagecache.git/log
Furong Xu [Wed, 15 Jan 2025 03:27:02 +0000 (11:27 +0800)]
net: stmmac: Switch to zero-copy in non-XDP RX path

Avoid memcpy in non-XDP RX path by marking all allocated SKBs to
be recycled in the upper network stack.

This patch brings a ~11.5% driver performance improvement in a TCP RX
throughput test with the iPerf tool on a single isolated Cortex-A65 CPU
core, from 2.18 Gbits/sec to 2.43 Gbits/sec.

Signed-off-by: Furong Xu <0x1207@gmail.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com>
Reviewed-by: Yanteng Si <si.yanteng@linux.dev>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
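
The quoted improvement follows directly from the two throughput figures in the message above; as a quick check of the arithmetic:

```latex
\[
\frac{2.43 - 2.18}{2.18} = \frac{0.25}{2.18} \approx 0.1147 \approx 11.5\%
\]
```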
Jakub Kicinski [Thu, 16 Jan 2025 03:28:07 +0000 (19:28 -0800)]
Merge branch 'net-mlx5e-ct-add-support-for-hardware-steering'

Tariq Toukan says:

====================
net/mlx5e: CT: Add support for hardware steering

This series starts with one more HWS patch by Yevgeny, followed by
patches that add support for connection tracking in hardware steering
mode. It consists of:
- patch #2 hooks up the CT ops for the new mode in the right places.
- patch #3 moves a function into a common file, so it can be reused.
- patch #4 uses the HWS API to implement connection tracking.

The main advantage of hardware steering compared to software steering is
vastly improved performance when adding/removing/updating rules.  Using
the T-Rex traffic generator to initiate multi-million UDP flows per
second, a kernel running with these patches was able to offload ~600K
unique UDP flows per second, roughly 7x more than software
steering was able to achieve on the same hardware (256-thread AMD EPYC,
512 GB RAM, ConnectX-7 b2b).
====================

Link: https://patch.msgid.link/20250114130646.1937192-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Cosmin Ratiu [Tue, 14 Jan 2025 13:06:46 +0000 (15:06 +0200)]
net/mlx5e: CT: Offload connections with hardware steering rules

This is modeled similarly to how software steering works:
- a reference-counted matcher is maintained for each
  combination of nat/no_nat x ipv4/ipv6 x tcp/udp/gre.
- adding a rule involves finding+referencing or creating a corresponding
  matcher, then actually adding a rule.
- updating rules is implemented using the bwc_rule update API, which can
  change a rule's actions without touching the match value.

By using a T-Rex traffic generator to initiate multi-million UDP flows
per second, a kernel running with these patches on the RX side was able
to offload ~600K flows per second, which is about 7x more than what
software steering could do on the same hardware (256-thread AMD EPYC,
512 GB RAM, ConnectX-7 b2b).

Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Jianbo Liu <jianbol@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250114130646.1937192-5-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
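
The matcher bookkeeping described in the commit above (one reference-counted matcher per nat/no_nat x ipv4/ipv6 x tcp/udp/gre combination, found or created when a rule is added) can be sketched in plain userspace C. This is only an illustrative model: every name below (ct_matcher, ct_matcher_table, ct_matcher_get, ct_matcher_put) is hypothetical and not taken from the mlx5 driver, and the locking a real driver needs around the table is left out.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical model: one slot per nat/no_nat x ipv4/ipv6 x tcp/udp/gre. */
enum ct_l3 { CT_IPV4, CT_IPV6, CT_L3_MAX };
enum ct_l4 { CT_TCP, CT_UDP, CT_GRE, CT_L4_MAX };

struct ct_matcher {
	unsigned int refcount;	/* number of rules currently using this matcher */
};

struct ct_matcher_table {
	struct ct_matcher *slot[2][CT_L3_MAX][CT_L4_MAX];	/* [nat][l3][l4] */
};

/* Find and reference the matcher for a combination, creating it on first use. */
struct ct_matcher *ct_matcher_get(struct ct_matcher_table *t, int nat,
				  enum ct_l3 l3, enum ct_l4 l4)
{
	struct ct_matcher **m = &t->slot[!!nat][l3][l4];

	if (!*m) {
		*m = calloc(1, sizeof(**m));
		if (!*m)
			return NULL;
	}
	(*m)->refcount++;
	return *m;
}

/* Drop one reference; destroy the matcher when its last rule goes away. */
void ct_matcher_put(struct ct_matcher_table *t, int nat,
		    enum ct_l3 l3, enum ct_l4 l4)
{
	struct ct_matcher **m = &t->slot[!!nat][l3][l4];

	if (*m && --(*m)->refcount == 0) {
		free(*m);
		*m = NULL;
	}
}
```

Adding a rule would call ct_matcher_get() before inserting the rule, and removing a rule would call ct_matcher_put() afterwards, so at most 2 x 2 x 3 = 12 matchers exist at any time in this model.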
Cosmin Ratiu [Tue, 14 Jan 2025 13:06:45 +0000 (15:06 +0200)]
net/mlx5e: CT: Make mlx5_ct_fs_smfs_ct_validate_flow_rule reusable

This function checks whether a flow_rule has the right flow dissector
keys and masks used for a connection tracking flow offload. It is
currently used locally by the tc_ct smfs module, but is about to be used
from another place, so this commit moves it to a better place, renames
it to mlx5e_tc_ct_is_valid_flow_rule and drops the unused fs argument.

Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Jianbo Liu <jianbol@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250114130646.1937192-4-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Cosmin Ratiu [Tue, 14 Jan 2025 13:06:44 +0000 (15:06 +0200)]
net/mlx5e: CT: Add initial support for Hardware Steering

Connection tracking can offload tuple matches to the NIC either via
firmware commands (when the steering mode is dmfs or offload support is
disabled due to eswitch being set to legacy) or via software-managed
flow steering (smfs).

This commit adds stub operations for a third mode, hardware-managed flow
steering. This is enabled when both CONFIG_MLX5_TC_CT and
CONFIG_MLX5_HW_STEERING are enabled.

Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Jianbo Liu <jianbol@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250114130646.1937192-3-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Yevgeny Kliteynik [Tue, 14 Jan 2025 13:06:43 +0000 (15:06 +0200)]
net/mlx5: HWS, rework the check if matcher size can be increased

When checking if the matcher size can be increased, check both
match and action RTCs. Also, consider the increasing step - check
that it won't cause the new matcher size to become unsupported.

Additionally, since we're using '+ 1' for action RTC size yet
again, define it as a macro and use it in all the required places.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250114130646.1937192-2-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 16 Jan 2025 03:17:07 +0000 (19:17 -0800)]
Merge branch 'net-reduce-rtnl-pressure-in-unregister_netdevice'

Eric Dumazet says:

====================
net: reduce RTNL pressure in unregister_netdevice()

One major source of RTNL contention resides in unregister_netdevice()

Due to RCU protection of various network structures, and
unregister_netdevice() being a synchronous function,
it is calling potentially slow functions while holding RTNL.

I think we can release RTNL at two points, so that three
slow functions are called while RTNL can be used
by other threads.

v1: https://lore.kernel.org/netdev/20250107130906.098fc8d6@kernel.org/T/#m398c95f5778e1ff70938e079d3c4c43c050ad2a6
====================

Link: https://patch.msgid.link/20250114205531.967841-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Dumazet [Tue, 14 Jan 2025 20:55:31 +0000 (20:55 +0000)]
net: reduce RTNL hold duration in unregister_netdevice_many_notify() (part 2)

One synchronize_net() call is currently done while holding RTNL.

This is a source of RTNL contention in workloads adding and deleting
many network namespaces per second, because synchronize_rcu()
and synchronize_rcu_expedited() can use 60+ ms in some cases.

For cleanup_net() use, temporarily release RTNL
while calling the last synchronize_net().

This should be safe, because devices are no longer visible
to other threads after unlist_netdevice() call
and setting dev->reg_state to NETREG_UNREGISTERING.

In any case, the new netdev_lock() / netdev_unlock()
infrastructure that we are adding should allow us
to fix potential issues, with a combination
of a per-device mutex and dev->reg_state awareness.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jesse Brandeburg <jbrandeburg@cloudflare.com>
Link: https://patch.msgid.link/20250114205531.967841-6-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Dumazet [Tue, 14 Jan 2025 20:55:30 +0000 (20:55 +0000)]
net: reduce RTNL hold duration in unregister_netdevice_many_notify() (part 1)

Two synchronize_net() calls are currently done while holding RTNL.

This is a source of RTNL contention in workloads adding and deleting
many network namespaces per second, because synchronize_rcu()
and synchronize_rcu_expedited() can use 60+ ms in some cases.

For cleanup_net() use, temporarily release RTNL
while calling the last synchronize_net().

This should be safe, because devices are no longer visible
to other threads at this point.

In any case, the new netdev_lock() / netdev_unlock()
infrastructure that we are adding should allow us
to fix potential issues, with a combination
of a per-device mutex and dev->reg_state awareness.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jesse Brandeburg <jbrandeburg@cloudflare.com>
Link: https://patch.msgid.link/20250114205531.967841-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Dumazet [Tue, 14 Jan 2025 20:55:29 +0000 (20:55 +0000)]
net: no longer hold RTNL while calling flush_all_backlogs()

flush_all_backlogs() is called from unregister_netdevice_many_notify()
as part of netdevice dismantles.

This is currently called under RTNL, and can last up to 50 ms
on busy hosts.

There is no reason to hold RTNL at this stage if our caller
is cleanup_net(): netns are no longer visible, devices
are in NETREG_UNREGISTERING state and no other thread
could mess with our state while RTNL is temporarily released.

In order to provide isolation, this patch provides a separate
'net_todo_list' for cleanup_net().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jesse Brandeburg <jbrandeburg@cloudflare.com>
Link: https://patch.msgid.link/20250114205531.967841-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Dumazet [Tue, 14 Jan 2025 20:55:28 +0000 (20:55 +0000)]
net: no longer assume RTNL is held in flush_all_backlogs()

flush_all_backlogs() uses per-cpu and static data to hold its
temporary data, on the assumption it is called under RTNL
protection.

The following patch in the series will break this assumption.

Instead, use a dynamically allocated piece of memory.

In the unlikely case the allocation fails,
fall back to memory allocated at boot time.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jesse Brandeburg <jbrandeburg@cloudflare.com>
Link: https://patch.msgid.link/20250114205531.967841-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
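
The "allocate dynamically, fall back to memory reserved up front" idea from the commit above can be sketched as a small userspace model. This is not the kernel code: scratch_get(), scratch_put(), the scratch_malloc hook and the buffer size are all hypothetical, and the single flag standing in for the serialization of the fallback buffer would have to be a real lock in concurrent code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define SCRATCH_SIZE 256	/* arbitrary size chosen for this example */

/* Reserved up front, so the slow path always has memory available. */
static unsigned char fallback_buf[SCRATCH_SIZE];
static bool fallback_busy;	/* concurrent code would guard this with a lock */

/* Allocator hook, so that an allocation failure can be simulated. */
void *(*scratch_malloc)(size_t) = malloc;

/* Prefer a private dynamic buffer; fall back to the reserved one. */
void *scratch_get(void)
{
	void *p = scratch_malloc(SCRATCH_SIZE);

	if (p)
		return p;
	if (fallback_busy)
		return NULL;	/* concurrent code would wait here instead */
	fallback_busy = true;
	return fallback_buf;
}

/* Release whichever buffer scratch_get() handed out. */
void scratch_put(void *p)
{
	if (p == (void *)fallback_buf)
		fallback_busy = false;
	else
		free(p);
}

/* Helper for experiments: an allocator that always fails. */
void *failing_malloc(size_t size)
{
	(void)size;
	return NULL;
}
```

Because each caller normally gets its own buffer, callers no longer depend on an outer lock to protect shared static data; only the rare fallback path needs serialization.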
Eric Dumazet [Tue, 14 Jan 2025 20:55:27 +0000 (20:55 +0000)]
net: expedite synchronize_net() for cleanup_net()

cleanup_net() is the single thread responsible
for netns dismantles, and a serious bottleneck.

Before we can get per-netns RTNL, make sure
all synchronize_net() calls made from this thread
use synchronize_rcu_expedited().

v3: deal with CONFIG_NET_NS=n

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jesse Brandeburg <jbrandeburg@cloudflare.com>
Link: https://patch.msgid.link/20250114205531.967841-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 16 Jan 2025 03:13:36 +0000 (19:13 -0800)]
Merge branch 'net-use-netdev-lock-to-protect-napi'

Jakub Kicinski says:

====================
net: use netdev->lock to protect NAPI

We recently added a lock member to struct net_device, with a vague
plan to start using it to protect netdev-local state, removing
the need to take rtnl_lock for new configuration APIs.

Lay some groundwork and use this lock for protecting NAPI APIs.

v1: https://lore.kernel.org/20250114035118.110297-1-kuba@kernel.org
====================

Link: https://patch.msgid.link/20250115035319.559603-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Wed, 15 Jan 2025 03:53:19 +0000 (19:53 -0800)]
netdev-genl: remove rtnl_lock protection from NAPI ops

NAPI lifetime, visibility and config are all fully under
netdev_lock protection now.

Reviewed-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115035319.559603-12-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Wed, 15 Jan 2025 03:53:18 +0000 (19:53 -0800)]
net: protect NAPI config fields with netdev_lock()

Protect the following members of netdev and napi by netdev_lock:
 - defer_hard_irqs,
 - gro_flush_timeout,
 - irq_suspend_timeout.

The first two are written via sysfs (which this patch switches
to the new lock) and via netdev genl, which holds both the netdev and
rtnl locks.

irq_suspend_timeout is only written by netdev genl.

Reviewed-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115035319.559603-11-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Wed, 15 Jan 2025 03:53:17 +0000 (19:53 -0800)]
net: protect napi->irq with netdev_lock()

Take netdev_lock() in netif_napi_set_irq(). All NAPI "control fields"
are now protected by that lock (most of the other ones are set during
napi add/del). The napi_hash_node is fully protected by the hash
spin lock, but close enough for the kdoc...

Reviewed-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115035319.559603-10-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Wed, 15 Jan 2025 03:53:16 +0000 (19:53 -0800)]
net: protect threaded status of NAPI with netdev_lock()

Now that NAPI instances can't come and go without holding
netdev->lock we can trivially switch from rtnl_lock() to
netdev_lock() for setting netdev->threaded via sysfs.

Note that since we do not lock netdev_lock around sysfs
calls in the core we don't have to "trylock" like we do
with rtnl_lock.

Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115035319.559603-9-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Wed, 15 Jan 2025 03:53:15 +0000 (19:53 -0800)]
net: make netdev netlink ops hold netdev_lock()

In prep for dropping rtnl_lock, start locking netdev->lock in netlink
genl ops. We need to be using netdev->up instead of flags & IFF_UP.

We can remove the RCU lock protection for the NAPI since NAPI list
is protected by netdev->lock already.

Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115035319.559603-8-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Wed, 15 Jan 2025 03:53:14 +0000 (19:53 -0800)]
net: protect NAPI enablement with netdev_lock()

Wrap napi_enable() / napi_disable() with netdev_lock().
Provide the "already locked" flavor of the API.

iavf needs the usual adjustment. A number of drivers call
napi_enable() under a spin lock, so they have to be modified
to take netdev_lock() first, then spin lock then call
napi_enable_locked().

Protecting napi_enable() implies that napi->napi_id is protected
by netdev_lock().

Acked-by: Francois Romieu <romieu@fr.zoreil.com> # via-velocity
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115035319.559603-7-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
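
The lock ordering described in the commit above (take the netdev lock first, then the driver's spin lock, then call the "already locked" flavor) can be illustrated with a single-threaded model. Since netdev->lock is a mutex, it cannot be acquired while a spin lock is held, which is why the order matters. Everything below is a hypothetical stand-in: the demo_* names are invented, and boolean flags replace the real mutex and spin lock so the ordering rules can be checked with assertions.

```c
#include <assert.h>
#include <stdbool.h>

/* Flags stand in for the real locks so ordering can be asserted. */
struct demo_netdev {
	bool lock_held;		/* models the per-device mutex */
	bool spin_held;		/* models a driver-private spin lock */
};

struct demo_napi {
	struct demo_netdev *dev;
	bool enabled;
};

void demo_netdev_lock(struct demo_netdev *dev)
{
	/* A mutex may sleep, so it must not be taken under a spin lock. */
	assert(!dev->spin_held);
	assert(!dev->lock_held);
	dev->lock_held = true;
}

void demo_netdev_unlock(struct demo_netdev *dev)
{
	assert(dev->lock_held);
	dev->lock_held = false;
}

/* "Already locked" flavor: the caller must hold the device lock. */
void demo_napi_enable_locked(struct demo_napi *n)
{
	assert(n->dev->lock_held);
	n->enabled = true;
}

/* Plain flavor: takes and releases the device lock itself. */
void demo_napi_enable(struct demo_napi *n)
{
	demo_netdev_lock(n->dev);
	demo_napi_enable_locked(n);
	demo_netdev_unlock(n->dev);
}

/* Driver path that needs its spin lock around the enable call. */
void demo_driver_up(struct demo_napi *n)
{
	demo_netdev_lock(n->dev);	/* 1. device lock first */
	n->dev->spin_held = true;	/* 2. then the spin lock */
	demo_napi_enable_locked(n);	/* 3. then the locked flavor */
	n->dev->spin_held = false;
	demo_netdev_unlock(n->dev);
}
```

In this model, calling demo_napi_enable() from inside the spin-locked region would trip the first assertion in demo_netdev_lock(), which mirrors why such drivers need the locked variant.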
Jakub Kicinski [Wed, 15 Jan 2025 03:53:13 +0000 (19:53 -0800)]
net: protect netdev->napi_list with netdev_lock()

Hold netdev->lock when NAPIs are getting added or removed.
This will allow safe access to NAPI instances of a net_device
without rtnl_lock.

Create a family of helpers which assume the lock is already taken.
Switch iavf to them, as it makes extensive use of netdev->lock,
already.

Reviewed-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115035319.559603-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Wed, 15 Jan 2025 03:53:12 +0000 (19:53 -0800)]
net: add netdev->up protected by netdev_lock()

Some uAPI (netdev netlink) hide net_device's sub-objects while
the interface is down to ensure uniform behavior across drivers.
To remove the rtnl_lock dependency from those uAPIs we need a way
to safely tell if the device is down or up.

Add an indication of whether the device is open or closed, protected
by netdev->lock. The semantics are the same as IFF_UP, but taking
netdev_lock around every write to ->flags would be a lot of code
churn.

We don't want to blanket the entire open / close path by netdev_lock,
because it will prevent us from applying it to specific structures -
core helpers won't be able to take that lock from any function
called by the drivers on open/close paths.

So the state of the flag is "pessimistic", as in it may report false
negatives, but never false positives.

Reviewed-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115035319.559603-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
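
One ordering that yields the "pessimistic" property described above (false negatives possible, false positives never) is to publish the flag only after the device is running and to retract it before teardown begins. The sketch below models just that ordering; the model_* names are hypothetical, the real flag is written under netdev->lock, and the exact placement of the writes in the kernel's open/close paths is not shown here.

```c
#include <stdbool.h>

struct model_dev {
	bool running;	/* true while the modelled device is operational */
	bool up;	/* pessimistic indication of "device is open" */
};

/* Open: start the device first, publish "up" last. */
void model_open(struct model_dev *d)
{
	d->running = true;
	d->up = true;
}

/* Close, step 1: retract "up" before any teardown happens. */
void model_close_begin(struct model_dev *d)
{
	d->up = false;
}

/* Close, step 2: actually stop the device. */
void model_close_finish(struct model_dev *d)
{
	d->running = false;
}

/* The stated property: "up" is never reported for a stopped device. */
bool model_no_false_positive(const struct model_dev *d)
{
	return !d->up || d->running;
}
```

Between model_close_begin() and model_close_finish() the flag reads false while the device is still running; that window is the false negative the commit message accepts.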
Jakub Kicinski [Wed, 15 Jan 2025 03:53:11 +0000 (19:53 -0800)]
net: add helpers for lookup and walking netdevs under netdev_lock()

Add helpers for accessing netdevs under netdev_lock().
There's some careful handling needed to find the device and lock it
safely, without it getting unregistered, and without taking rtnl_lock
(the latter being the whole point of the new locking, after all).

Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115035319.559603-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Wed, 15 Jan 2025 03:53:10 +0000 (19:53 -0800)]
net: make netdev_lock() protect netdev->reg_state

Protect writes to netdev->reg_state with netdev_lock().
From now on holding netdev_lock() is sufficient to prevent
the net_device from getting unregistered, so code which
wants to hold just a single netdev around no longer needs
to hold rtnl_lock.

We do not protect the NETREG_UNREGISTERED -> NETREG_RELEASED
transition. We'd need to move mutex_destroy(netdev->lock)
to .release, but the real reason is that trying to stop
the unregistration process mid-way would be unsafe / crazy.
Taking references on such devices is not safe, either.
So the intended semantics are to lock REGISTERED devices.

Reviewed-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115035319.559603-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
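
The intended semantics above (writes to reg_state happen under the device lock, and only REGISTERED devices are meant to be locked by users) suggest a "lock, then re-check the state" pattern. The following single-threaded model illustrates it; all model_* names are hypothetical, a boolean replaces the real mutex, and the reference counting that a real lookup needs is omitted.

```c
#include <stdbool.h>
#include <stddef.h>

enum model_reg_state {
	MODEL_UNINITIALIZED,
	MODEL_REGISTERED,
	MODEL_UNREGISTERING,
	MODEL_UNREGISTERED,
	MODEL_RELEASED,
};

struct model_netdev {
	bool locked;			/* models the per-device mutex */
	enum model_reg_state reg_state;	/* written only with the lock held */
};

void model_unlock(struct model_netdev *dev)
{
	dev->locked = false;
}

/* State changes take the lock, so a lock holder sees a stable state. */
void model_set_reg_state(struct model_netdev *dev, enum model_reg_state s)
{
	dev->locked = true;
	dev->reg_state = s;
	dev->locked = false;
}

/*
 * Lock the device and hand it back only if it is still REGISTERED.
 * On failure the lock is dropped again and NULL is returned.
 */
struct model_netdev *model_lock_registered(struct model_netdev *dev)
{
	if (!dev)
		return NULL;
	dev->locked = true;
	if (dev->reg_state != MODEL_REGISTERED) {
		dev->locked = false;
		return NULL;
	}
	return dev;
}
```

A caller that gets a non-NULL pointer back can rely on the device staying registered until it calls model_unlock(), which is the guarantee the commit message describes for netdev_lock() holders.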
Jakub Kicinski [Wed, 15 Jan 2025 03:53:09 +0000 (19:53 -0800)]
net: add netdev_lock() / netdev_unlock() helpers

Add helpers for locking the netdev instance, and use them in drivers
and the shaper code. This will make grepping for the lock usage
much easier, as we extend the lock to cover more fields.

Reviewed-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Link: https://patch.msgid.link/20250115035319.559603-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Maciej S. Szmigiero [Wed, 8 Jan 2025 23:33:50 +0000 (00:33 +0100)]
net: wwan: iosm: Fix hibernation by re-binding the driver around it

Currently, the driver is seriously broken with respect to the
hibernation (S4): after image restore the device is back into
IPC_MEM_EXEC_STAGE_BOOT (which AFAIK means bootloader stage) and needs
full re-launch of the rest of its firmware, but the driver restore
handler treats the device as merely sleeping and just sends it a
wake-up command.

This wake-up command times out but device nodes (/dev/wwan*) remain
accessible.
However attempting to use them causes the bootloader to crash and
enter IPC_MEM_EXEC_STAGE_CD_READY stage (which apparently means "a crash
dump is ready").

It seems that the device cannot be re-initialized from this crashed
stage without toggling some reset pin (on my test platform that's
apparently what the device _RST ACPI method does).

While it would theoretically be possible to rewrite the driver to tear
down the whole MUX / IPC layers on hibernation (so the bootloader does
not crash from improper access) and then re-launch the device on
restore this would require significant refactoring of the driver
(believe me, I've tried), since there are quite a few assumptions
hard-coded in the driver about the device never being partially
de-initialized (like channels other than devlink cannot be closed,
for example).
Probably this would also need some programming guide for this hardware.

Considering that the driver seems orphaned [1] and other people are
hitting this issue too [2] fix it by simply unbinding the PCI driver
before hibernation and re-binding it after restore, much like
USB_QUIRK_RESET_RESUME does for USB devices that exhibit a similar
problem.

Tested on XMM7360 in HP EliteBook 855 G7 both with s2idle (which uses
the existing suspend / resume handlers) and S4 (which uses the new code).

[1]: https://lore.kernel.org/all/c248f0b4-2114-4c61-905f-466a786bdebb@leemhuis.info/
[2]:
https://github.com/xmm7360/xmm7360-pci/issues/211#issuecomment-1804139413

Reviewed-by: Sergey Ryazanov <ryazanov.s.a@gmail.com>
Signed-off-by: Maciej S. Szmigiero <mail@maciej.szmigiero.name>
Link: https://patch.msgid.link/e60287ebdb0ab54c4075071b72568a40a75d0205.1736372610.git.mail@maciej.szmigiero.name
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 16 Jan 2025 01:38:04 +0000 (17:38 -0800)]
Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue

Tony Nguyen says:

====================
Intel Wired LAN Driver Updates 2025-01-08 (ice)

This series contains updates to ice driver only.

Przemek reworks the implementation so that ice_init_hw() is called before
ice_adapter initialization. The motivation is to have the ability to act
on the number of PFs in ice_adapter initialization. This is not done
here, but the code is also a bit cleaner.

Michal adds priority to be considered when matching recipes for proper
differentiation.

Konrad adds devlink health reporting for firmware generated events.

R Sundar utilizes string helpers over open coded versions.

Jake adds implementation to utilize a lower latency interface to program
PHY timer when supported.

Additional information can be found on the original cover letter:

  https://lore.kernel.org/intel-wired-lan/20241216145453.333745-1-anton.nadezhdin@intel.com/

Karol adds and allows for different PTP delay values to be used per pin.

* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
  ice: Add in/out PTP pin delays
  ice: implement low latency PHY timer updates
  ice: check low latency PHY timer update firmware capability
  ice: add lock to protect low latency interface
  ice: rename TS_LL_READ* macros to REG_LL_PROXY_H_*
  ice: use read_poll_timeout_atomic in ice_read_phy_tstamp_ll_e810
  ice: use string choice helpers
  ice: add fw and port health reporters
  ice: add recipe priority check in search
  ice: ice_probe: init ice_adapter after HW init
  ice: minor: rename goto labels from err to unroll
  ice: split ice_init_hw() out from ice_init_dev()
  ice: c827: move wait for FW to ice_init_hw()
====================

Link: https://patch.msgid.link/20250115000844.714530-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Dumazet [Tue, 14 Jan 2025 22:10:49 +0000 (22:10 +0000)]
inet: ipmr: fix data-races

The following fields of 'struct mr_mfc' can be updated
concurrently (no lock protection) from ip_mr_forward()
and ip6_mr_forward():

- bytes
- pkt
- wrong_if
- lastuse

They also can be read from other functions.

Convert bytes, pkt and wrong_if to atomic_long_t,
and use READ_ONCE()/WRITE_ONCE() for lastuse.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250114221049.1190631-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
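
The conversion described above can be approximated in userspace with C11 atomics: the three counters become atomic integers, and the timestamp is accessed with relaxed atomic loads and stores, which is the closest portable analogue to the kernel's READ_ONCE()/WRITE_ONCE(). This is a sketch under those assumptions; the struct and function names are hypothetical and do not mirror the kernel's struct mr_mfc layout.

```c
#include <stdatomic.h>

/* Hypothetical userspace stand-in for the four fields named above. */
struct model_mfc_stats {
	atomic_long bytes;
	atomic_long pkt;
	atomic_long wrong_if;
	_Atomic unsigned long lastuse;
};

void model_stats_init(struct model_mfc_stats *s)
{
	atomic_init(&s->bytes, 0);
	atomic_init(&s->pkt, 0);
	atomic_init(&s->wrong_if, 0);
	atomic_init(&s->lastuse, 0);
}

/* Forwarding path: may run concurrently on several CPUs without a lock. */
void model_forward(struct model_mfc_stats *s, long len, unsigned long now)
{
	atomic_fetch_add_explicit(&s->bytes, len, memory_order_relaxed);
	atomic_fetch_add_explicit(&s->pkt, 1, memory_order_relaxed);
	atomic_store_explicit(&s->lastuse, now, memory_order_relaxed);
}

/* Packet arrived on an unexpected interface. */
void model_wrong_if(struct model_mfc_stats *s)
{
	atomic_fetch_add_explicit(&s->wrong_if, 1, memory_order_relaxed);
}

/* Reader side: a single untorn load of the timestamp. */
unsigned long model_lastuse(struct model_mfc_stats *s)
{
	return atomic_load_explicit(&s->lastuse, memory_order_relaxed);
}
```

Relaxed ordering is enough here because each field is an independent statistic; no reader depends on seeing the fields update in a particular order relative to each other.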
Jakub Kicinski [Wed, 15 Jan 2025 22:42:14 +0000 (14:42 -0800)]
Merge branch 'bnxt_en-implement-tcp-data-split-and-thresh-option'

Taehee Yoo says:

====================
bnxt_en: implement tcp-data-split and thresh option

This series implements the hds-thresh ethtool command.
It also implements the backend of the tcp-data-split and
hds-thresh ethtool commands for the bnxt_en driver.
These ethtool commands are mandatory options for device memory TCP.

NICs that use the bnxt_en driver support the tcp-data-split feature,
which they call HDS (header-data-split).
But there is no way to enable HDS through ethtool.
Only reading the current HDS status is implemented, and HDS is
enabled automatically only when LRO, HW-GRO, or JUMBO is enabled.
The hds_threshold follows the rx-copybreak value, but it wasn't
changeable.

Currently, the bnxt_en driver enables tcp-data-split by default, but it
does not always work.
There is an hds_threshold value: if a packet is larger than this value,
it is split into header and data.
The hds_threshold value has been 256, which is also the default
rx-copybreak value.
The rx-copybreak value could not be changed, so neither could
hds_threshold.

This patchset first decouples hds_threshold from rx-copybreak,
and makes tcp-data-split, rx-copybreak, and
hds-thresh configurable independently.

The default configuration stays the same, however:
the default value of rx-copybreak is 256 and the default
hds-thresh is also 256.

The behavior of rx-copybreak will probably be changed in almost all
drivers. If HDS is not enabled, rx-copybreak copies both header and
payload from a page.
But if HDS is enabled, rx-copybreak copies only the header from the first
page.
Due to this change, it may be necessary to disable rx-copybreak (set it
to 0) when HDS is required.

There are several related options:
TPA (HW-GRO, LRO), JUMBO, jumbo_thresh (firmware command), and the
Aggregation Ring.

The aggregation ring is fundamental to all these features.
When GRO/LRO/JUMBO packets are received, the NIC receives the first packet
from the normal ring.
The following packets come from the aggregation ring.

These features work regardless of HDS.
If HDS is enabled, the first packet contains the header only, and the
following packets contain only payload.
So, HW-GRO/LRO works regardless of HDS.

There is another threshold value, jumbo_thresh.
It is very similar to hds_thresh, but jumbo_thresh doesn't split
header and data.
It just splits the first and following data based on length.
When the NIC receives a 1500-byte packet and jumbo_thresh is 256 (the
default, but it follows rx-copybreak),
the first data is 256 bytes and the following packet size is 1500-256.

Before this patch, if at least one of the GRO, LRO, and JUMBO flags is
enabled, the Aggregation ring is enabled.
If the Aggregation ring is enabled, both hds_threshold and
jumbo_thresh are set to the default value of rx-copybreak.

So GRO, LRO, and JUMBO frames larger than 256 bytes are
split into header and data if the protocol is TCP or UDP.
For other protocols, jumbo_thresh applies instead of hds_thresh.

This means that tcp-data-split relies on the GRO, LRO, and JUMBO flags.
With this patch, tcp-data-split no longer relies on these flags.
If tcp-data-split is enabled, the Aggregation ring will be
enabled.
Also, hds_threshold no longer follows the rx-copybreak value; it is
set to the hds-thresh value from user space, but the
default value is still 256.

If the protocol is TCP or UDP and the HDS is disabled and Aggregation
ring is enabled, a packet will be split into several pieces due to
jumbo_thresh.

When single buffer XDP is attached, tcp-data-split is automatically
disabled.

LRO, GRO, and JUMBO are tested with BCM57414 and BCM57504; the firmware
version is 230.0.157.0.
I couldn't find any specification of the minimum and maximum value
of hds_threshold, but from my test results, the range is about 0 ~ 1023.
This means that packets larger than 1023 bytes will be split into header
and data if tcp-data-split is enabled, regardless of the hds_threshold
value.
When hds_threshold is 1500 and the received packet size is 1400, HDS should
not be activated, but it is activated.
The maximum value of hds-thresh is 256 because that value is known
to work. It was chosen very conservatively.

I checked that tcp-data-split (HDS) works independently of GRO, LRO, and
JUMBO.
I also checked that tcp-data-split is disabled automatically
when XDP is attached, and that enabling it again is disallowed while XDP
is attached. I tested the range of values from min to max for
hds-thresh and rx-copybreak, and it works:
hds-thresh from 0 to 256, and rx-copybreak from 0 to 256.
When testing this patchset, I checked the skb->data, skb->data_len, and
nr_frags values.

With this patchset, the bnxt_en driver supports force-enabling
tcp-data-split, but it doesn't support disabling tcp-data-split.
When tcp-data-split is explicitly enabled, HDS always works.
When tcp-data-split is unknown, it depends on the current
configuration of LRO/GRO/JUMBO.

1/10 patch adds a new hds_config member to ethtool_netdev_state.
It records which tcp-data-split value was actually set from
userspace,
so the driver can tell whether a given tcp-data-split value
came from the user or from the driver itself.

2/10 patch adds the hds-thresh command to ethtool.
If a received packet is larger than this threshold, the packet's
header and payload will be split.
Example:
   # ethtool -G <interface name> hds-thresh <value>
This option cannot be used when tcp-data-split is disabled or not
supported.
   # ethtool -G enp14s0f0np0 tcp-data-split on hds-thresh 256
   # ethtool -g enp14s0f0np0
   Ring parameters for enp14s0f0np0:
   Pre-set maximums:
   ...
   Current hardware settings:
   ...
   TCP data split:         on
   HDS thresh:  256

3/10 and 4/10 patches add condition checks for devmem and ethtool.
If tcp-data-split is disabled or the threshold value is not zero, devmem
setup will fail.
Also, tcp-data-split and hds-thresh cannot be changed
while devmem is running.

5/10 patch adds condition checks to the netdev core.
It disallows setting up a single-buffer XDP program when tcp-data-split is
enabled.

6/10 patch implements .{set, get}_tunable() in bnxt_en.
The bnxt_en driver has supported the rx-copybreak feature, but it was
not configurable; only the default rx-copybreak value worked.
So, this patch makes the rx-copybreak value configurable in the
bnxt_en driver.

Patch 7/10 implements the tcp-data-split ethtool command.
HDS relies on the aggregation ring, which is automatically enabled
when LRO, GRO, or a large MTU is configured.
So, if the aggregation ring is enabled, HDS is automatically enabled
with it.

8/10 patch adds the implementation of hds-thresh logic
in the bnxt_en driver.
The default value is 256, which used to be the default rx-copybreak
value.

Patch 9/10 adds an HDS implementation for netdevsim.
HDS is not yet common; only a few NICs support it.
Without a suitable hardware NIC there is no way to test the HDS core
API, so netdevsim can be used instead.
The patch implements the HDS control and data plane for netdevsim.

Patch 10/10 adds a selftest for HDS (tcp-data-split and hds-thresh).
The tcp-data-split tests are equivalent to
`ethtool -G eth0 tcp-data-split <on | auto>`
The hds-thresh tests are equivalent to `ethtool -G eth0 hds-thresh <0 - MAX>`

This series is tested with BCM57504 and netdevsim.
====================

Link: https://patch.msgid.link/20250114142852.3364986-1-ap420073@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Taehee Yoo [Tue, 14 Jan 2025 14:28:52 +0000 (14:28 +0000)]
selftest: net-drv: hds: add test for HDS feature

The HDS and hds-thresh features were updated/implemented, so add some
tests for them.

The HDS tests are equivalent to `ethtool -G eth0 tcp-data-split <on | off |
auto >`, but the behaviour of `auto` depends on the driver, so the
`auto` case is not included.

The hds-thresh tests are equivalent to `ethtool -G eth0 hds-thresh <0 - MAX>`.
They cover both the 0 and MAX cases, as well as the out-of-range case,
MAX + 1.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Link: https://patch.msgid.link/20250114142852.3364986-11-ap420073@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Taehee Yoo [Tue, 14 Jan 2025 14:28:51 +0000 (14:28 +0000)]
netdevsim: add HDS feature

HDS options (tcp-data-split, hds-thresh) have dependencies on other
features like XDP. Basic dependencies are checked in the core API.
netdevsim is very useful for checking these basic dependencies.

The default tcp-data-split mode is UNKNOWN, but the netdevsim driver
returns ENABLED when ethtool dumps the tcp-data-split mode.
The default value of the HDS threshold is 0 and the maximum value is 1024.

ethtool output looks like this:

ethtool -g eni1np1
Ring parameters for eni1np1:
Pre-set maximums:
...
HDS thresh:             1024
Current hardware settings:
...
TCP data split:         on
HDS thresh:             0

ethtool -G eni1np1 tcp-data-split on hds-thresh 1024
ethtool -g eni1np1
Ring parameters for eni1np1:
Pre-set maximums:
...
HDS thresh:             1024
Current hardware settings:
...
TCP data split:         on
HDS thresh:             1024

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Link: https://patch.msgid.link/20250114142852.3364986-10-ap420073@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Taehee Yoo [Tue, 14 Jan 2025 14:28:50 +0000 (14:28 +0000)]
bnxt_en: add support for hds-thresh ethtool command

The bnxt_en driver has configured the hds_threshold value automatically,
based on the rx-copybreak default value, when TPA is enabled.
Now that the hds-thresh ethtool command has been added, this patch
implements the hds-thresh option.

The hds-thresh configuration is applied only when
tcp-data-split is enabled. The default value of
hds-thresh is 256, which is the default value of
rx-copybreak, which used to be the hds_thresh value.

The maximum hds-thresh is 1023.

   # Example:
   # ethtool -G enp14s0f0np0 tcp-data-split on hds-thresh 256
   # ethtool -g enp14s0f0np0
   Ring parameters for enp14s0f0np0:
   Pre-set maximums:
   ...
   HDS thresh:  1023
   Current hardware settings:
   ...
   TCP data split:         on
   HDS thresh:  256

Tested-by: Stanislav Fomichev <sdf@fomichev.me>
Tested-by: Andy Gospodarek <gospo@broadcom.com>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20250114142852.3364986-9-ap420073@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Taehee Yoo [Tue, 14 Jan 2025 14:28:49 +0000 (14:28 +0000)]
bnxt_en: add support for tcp-data-split ethtool command

NICs that use the bnxt_en driver support the tcp-data-split feature under
the name HDS (header-data-split).
But there is no way to enable HDS through ethtool.
Only getting the current HDS status is implemented, and HDS is
automatically enabled only when LRO, HW-GRO, or JUMBO is enabled.
The hds_threshold follows the rx-copybreak value, and it was unchangeable.

This implements the `ethtool -G <interface name> tcp-data-split <value>`
command option.
The value can be <on> or <auto>.
If the value is <auto> and one of LRO/GRO/JUMBO is enabled, HDS is
automatically enabled; if all of LRO/GRO/JUMBO are disabled, HDS is
automatically disabled.
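
The <on>/<auto> behaviour described above can be modelled roughly as
follows. This is only an illustrative Python sketch, not driver code: the
function name hds_is_active() and the string values are hypothetical.

```python
def hds_is_active(hds_config: str, lro: bool, gro: bool, jumbo: bool) -> bool:
    """Model of the effective HDS state described in the commit message."""
    if hds_config == "on":
        # Explicitly enabled by the user: HDS is always active.
        return True
    if hds_config == "auto":
        # HDS follows the features that need the aggregation ring.
        return lro or gro or jumbo
    # The commit message lists only <on> and <auto> as accepted values.
    raise ValueError(f"unsupported tcp-data-split value: {hds_config!r}")


print(hds_is_active("auto", lro=False, gro=True, jumbo=False))
print(hds_is_active("auto", lro=False, gro=False, jumbo=False))
```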

HDS feature relies on the aggregation ring.
So, if HDS is enabled, the bnxt_en driver initializes the aggregation ring.
This is the reason why BNXT_FLAG_AGG_RINGS contains HDS condition.

Acked-by: Jakub Kicinski <kuba@kernel.org>
Tested-by: Stanislav Fomichev <sdf@fomichev.me>
Tested-by: Andy Gospodarek <gospo@broadcom.com>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20250114142852.3364986-8-ap420073@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Taehee Yoo [Tue, 14 Jan 2025 14:28:48 +0000 (14:28 +0000)]
bnxt_en: add support for rx-copybreak ethtool command

The bnxt_en driver supports rx-copybreak, but it couldn't be set from
userspace. Only the default value (256) worked.
This patch makes the bnxt_en driver support the following commands:
`ethtool --set-tunable <devname> rx-copybreak <value>` and
`ethtool --get-tunable <devname> rx-copybreak`.

With this patch, hds_threshold is set to the rx-copybreak value.
It will be set by `ethtool -G eth0 hds-thresh N` instead
in the next patch.

Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Brett Creeley <brett.creeley@amd.com>
Tested-by: Stanislav Fomichev <sdf@fomichev.me>
Tested-by: Andy Gospodarek <gospo@broadcom.com>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20250114142852.3364986-7-ap420073@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Taehee Yoo [Tue, 14 Jan 2025 14:28:47 +0000 (14:28 +0000)]
net: disallow setup single buffer XDP when tcp-data-split is enabled.

When a single buffer XDP program is attached, the NIC should guarantee
that only single page packets are received.
The tcp-data-split feature splits packets into header and payload, which
a single buffer XDP program can't handle properly.
So attaching a single buffer XDP program should be disallowed when
tcp-data-split is enabled.
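
The rule above boils down to a single predicate. The Python sketch below
is only a model of that rule; the function and parameter names are
hypothetical, and "has frags" is assumed to mean that the program
declares multi-buffer support.

```python
def xdp_attach_allowed(prog_has_frags: bool, tcp_data_split_enabled: bool) -> bool:
    """Return True if attaching the XDP program would be accepted."""
    if tcp_data_split_enabled and not prog_has_frags:
        # A single buffer program cannot see a packet that was split into
        # a header buffer and a payload buffer.
        return False
    return True


print(xdp_attach_allowed(prog_has_frags=False, tcp_data_split_enabled=True))
```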

Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Link: https://patch.msgid.link/20250114142852.3364986-6-ap420073@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Taehee Yoo [Tue, 14 Jan 2025 14:28:46 +0000 (14:28 +0000)]
net: ethtool: add ring parameter filtering

While devmem is running, the tcp-data-split and
hds-thresh configuration should not be changed.
If the user tries to change the tcp-data-split or threshold value while
devmem is running, the request fails with an extack message.

Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Tested-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Link: https://patch.msgid.link/20250114142852.3364986-5-ap420073@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Taehee Yoo [Tue, 14 Jan 2025 14:28:45 +0000 (14:28 +0000)]
net: devmem: add ring parameter filtering

If the driver doesn't support ring parameters, or the tcp-data-split
configuration is not sufficient, devmem should not be set up.
Before devmem is set up, tcp-data-split should be ON and the hds-thresh
value should be 0.
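
A minimal Python sketch of these preconditions is shown below. It is a
model only: the function name, the enum and the error strings are
hypothetical and are not taken from the kernel sources.

```python
from enum import Enum
from typing import Optional


class TcpDataSplit(Enum):
    UNKNOWN = "unknown"
    DISABLED = "disabled"
    ENABLED = "enabled"


def devmem_setup_error(supports_ringparam: bool,
                       tcp_data_split: TcpDataSplit,
                       hds_thresh: int) -> Optional[str]:
    """Return None if devmem may be set up, else a reason (made-up text)."""
    if not supports_ringparam:
        return "driver does not support ring parameters"
    if tcp_data_split is not TcpDataSplit.ENABLED:
        return "tcp-data-split must be on"
    if hds_thresh != 0:
        return "hds-thresh must be 0"
    return None


print(devmem_setup_error(True, TcpDataSplit.ENABLED, 256))
```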

Tested-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Link: https://patch.msgid.link/20250114142852.3364986-4-ap420073@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Taehee Yoo [Tue, 14 Jan 2025 14:28:44 +0000 (14:28 +0000)]
net: ethtool: add support for configuring hds-thresh

The hds-thresh option configures the threshold value of
the header-data-split.
If a received packet is larger than this threshold value, the packet
is split into header and payload.
The header refers to the TCP and UDP header, but it depends on the
driver spec.
The bnxt_en driver supports HDS (Header-Data-Split) configuration at
the FW level, affecting both TCP and UDP.
So, if hds-thresh is set, it affects UDP and TCP packets.

Example:
   # ethtool -G <interface name> hds-thresh <value>

   # ethtool -G enp14s0f0np0 tcp-data-split on hds-thresh 256
   # ethtool -g enp14s0f0np0
   Ring parameters for enp14s0f0np0:
   Pre-set maximums:
   ...
   HDS thresh:  1023
   Current hardware settings:
   ...
   TCP data split:         on
   HDS thresh:  256

The default/min/max values are not defined in ethtool, so drivers
should define them themselves.
A value of 0 means that the header and payload of all TCP/UDP packets
are split.
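
As a rough illustration of the threshold rule, the Python sketch below
models the split decision. The function name should_split() is
hypothetical, and real behaviour depends on the driver and firmware.

```python
def should_split(pkt_len: int, hds_thresh: int, hds_enabled: bool = True) -> bool:
    """Return True if header and payload would go to separate buffers.

    Assumed model: a packet is split only when HDS is enabled and the
    packet is strictly larger than the threshold.
    """
    return hds_enabled and pkt_len > hds_thresh


# With a threshold of 256, a 256-byte packet stays whole and a 257-byte
# packet is split. With a threshold of 0, every non-empty packet is split.
for length, thresh in [(256, 256), (257, 256), (64, 0)]:
    print(length, thresh, should_split(length, thresh))
```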

Tested-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Link: https://patch.msgid.link/20250114142852.3364986-3-ap420073@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Taehee Yoo [Tue, 14 Jan 2025 14:28:43 +0000 (14:28 +0000)]
net: ethtool: add hds_config member in ethtool_netdev_state

When tcp-data-split is in UNKNOWN mode, drivers handle it arbitrarily.
For example, the bnxt_en driver automatically enables it if at least one
of LRO/GRO/JUMBO is enabled.
If tcp-data-split is UNKNOWN and LRO is enabled, the driver reports
tcp-data-split as ENABLED, not UNKNOWN.
So, `ethtool -g eth0` shows tcp-data-split as enabled.

The problem arises when setting ring parameters.
ethnl_set_rings() first calls get_ringparam() to get the driver's
current config.
At that moment, if the driver's tcp-data-split config is UNKNOWN, it
returns ENABLED if LRO/GRO/JUMBO is enabled.
Then, it fills kernel_ethtool_ringparam with the values from the user
and the driver's current config.
Finally, it calls .set_ringparam().
The driver, bnxt_en in particular, receives
ETHTOOL_TCP_DATA_SPLIT_ENABLED.
But it can't distinguish whether the value was set by the user or just
reflects the current config.

When the user updates a ring parameter, hds_config is updated with the
new value and the current hds_config value is stored in old_hdsconfig.
The driver's .set_ringparam() callback can then tell whether a passed
tcp-data-split value came explicitly from the user.
If .set_ringparam() fails, hds_config is rolled back immediately.
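
The update-and-rollback flow can be sketched in Python as below. This is
a simplified model, not the kernel code: the dict-based device, the
function signature and the callback are all assumptions made for
illustration.

```python
def set_rings(dev: dict, user_tcp_data_split, driver_set_ringparam) -> int:
    """Model of the hds_config handling around the driver callback."""
    old_hds_config = dev["hds_config"]
    if user_tcp_data_split is not None:
        # Record what the user explicitly asked for before calling the driver.
        dev["hds_config"] = user_tcp_data_split
    ret = driver_set_ringparam(dev)
    if ret != 0:
        # The driver rejected the change: restore the previous value.
        dev["hds_config"] = old_hds_config
    return ret


dev = {"hds_config": "unknown"}
print(set_rings(dev, "enabled", lambda d: 0), dev["hds_config"])
print(set_rings(dev, "unknown", lambda d: -22), dev["hds_config"])
```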

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Link: https://patch.msgid.link/20250114142852.3364986-2-ap420073@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Kuniyuki Iwashima [Tue, 14 Jan 2025 08:13:52 +0000 (17:13 +0900)]
net: loopback: Hold rtnl_net_lock() in blackhole_netdev_init().

blackhole_netdev is the global device in init_net.

Let's hold rtnl_net_lock(&init_net) in blackhole_netdev_init().

While at it, the unnecessary dev_net_set() call is removed, which
is done in alloc_netdev_mqs().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250114081352.47404-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Alessandro Zanni [Tue, 14 Jan 2025 00:33:16 +0000 (01:33 +0100)]
selftests/net/forwarding: teamd command not found

Running "make kselftest TARGETS=net/forwarding" results in
multiple occurrences of the same error:
- ./lib.sh: line 787: teamd: command not found

This patch adds the variable $REQUIRE_TEAMD in every test that uses the
command teamd and checks the $REQUIRE_TEAMD variable in the file "lib.sh"
to skip the test if the command is not installed.

Signed-off-by: Alessandro Zanni <alessandro.zanni87@gmail.com>
Link: https://patch.msgid.link/20250114003323.97207-1-alessandro.zanni87@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Wed, 15 Jan 2025 22:14:15 +0000 (14:14 -0800)]
Merge branch 'eth-fbnic-add-hardware-monitoring-support'

Sanman Pradhan says:

====================
eth: fbnic: Add hardware monitoring support

This patch series adds hardware monitoring support to the fbnic driver.
It implements support for reading temperature and voltage sensors via
firmware requests, and exposes this data through the HWMON interface.

The series is structured as follows:

Patch 1: Adds completion infrastructure for firmware requests
Patch 2: Implements TSENE sensor message handling
Patch 3: Adds HWMON interface support

Output:
$ ls -l /sys/class/hwmon/hwmon1/
total 0
lrwxrwxrwx 1 root root    0 Sep 10 00:00 device -> ../../../0000:01:00.0
-r--r--r-- 1 root root 4096 Sep 10 00:00 in0_input
-r--r--r-- 1 root root 4096 Sep 10 00:00 name
lrwxrwxrwx 1 root root    0 Sep 10 00:00 subsystem -> ../../../../../../class/hwmon
-r--r--r-- 1 root root 4096 Sep 10 00:00 temp1_input
-rw-r--r-- 1 root root 4096 Sep 10 00:00 uevent

$ cat /sys/class/hwmon/hwmon1/temp1_input
40480
$ cat /sys/class/hwmon/hwmon1/in0_input
750
====================
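
For reading the values shown above: the hwmon sysfs interface reports
temp*_input in millidegrees Celsius and in*_input in millivolts. The
Python sketch below converts the two sample readings; the helper names
are hypothetical.

```python
def millidegrees_to_celsius(raw: str) -> float:
    """Convert a temp*_input reading (millidegrees Celsius) to degrees."""
    return int(raw.strip()) / 1000.0


def millivolts_to_volts(raw: str) -> float:
    """Convert an in*_input reading (millivolts) to volts."""
    return int(raw.strip()) / 1000.0


print(millidegrees_to_celsius("40480\n"))  # 40.48
print(millivolts_to_volts("750\n"))        # 0.75
```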

Link: https://patch.msgid.link/20250114000705.2081288-1-sanman.p211993@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Sanman Pradhan [Tue, 14 Jan 2025 00:07:05 +0000 (16:07 -0800)]
eth: fbnic: Add hardware monitoring support via HWMON interface

This patch adds support for hardware monitoring to the fbnic driver,
allowing for temperature and voltage sensor data to be exposed to
userspace via the HWMON interface. The driver registers a HWMON device
and provides callbacks for reading sensor data, enabling system
admins to monitor the health and operating conditions of fbnic.

Signed-off-by: Sanman Pradhan <sanman.p211993@gmail.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Link: https://patch.msgid.link/20250114000705.2081288-4-sanman.p211993@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Sanman Pradhan [Tue, 14 Jan 2025 00:07:04 +0000 (16:07 -0800)]
eth: fbnic: hwmon: Add support for reading temperature and voltage sensors

Add support for reading temperature and voltage sensor data from firmware
by implementing a new TSENE message type and response parsing. This adds
message handler infrastructure to transmit sensor read requests and parse
responses. The sensor data will be exposed through the driver's hwmon interface.

Signed-off-by: Sanman Pradhan <sanman.p211993@gmail.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Link: https://patch.msgid.link/20250114000705.2081288-3-sanman.p211993@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Sanman Pradhan [Tue, 14 Jan 2025 00:07:03 +0000 (16:07 -0800)]
eth: fbnic: hwmon: Add completion infrastructure for firmware requests

Add infrastructure to support firmware request/response handling with
completions. Add a completion structure to track message state including
message type for matching, completion for waiting for response, and
result for error propagation. Use existing spinlock to protect the writes.
The data from the various response types will be added to the "union u"
by subsequent commits.
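
The request/response pattern described above can be illustrated with the
Python sketch below. It is an analogy built on threading primitives, not
the driver's implementation: the class names, the single pending slot and
the payload format are assumptions.

```python
import threading


class FwCompletion:
    """Tracks one outstanding firmware request."""

    def __init__(self, msg_type: str):
        self.msg_type = msg_type       # used to match the response
        self.done = threading.Event()  # waited on by the requester
        self.result = 0                # error code propagated to the waiter
        self.data = None               # response payload, if any


class Mailbox:
    def __init__(self):
        self._lock = threading.Lock()  # protects the pending slot
        self._pending = None

    def submit(self, cmpl: FwCompletion) -> bool:
        with self._lock:
            if self._pending is not None:
                return False           # another request is still in flight
            self._pending = cmpl
            return True

    def handle_response(self, msg_type: str, result: int, data=None) -> bool:
        with self._lock:
            cmpl = self._pending
            if cmpl is None or cmpl.msg_type != msg_type:
                return False           # nobody is waiting for this message
            cmpl.result = result
            cmpl.data = data
            self._pending = None
        cmpl.done.set()                # wake the waiter outside the lock
        return True


mbx = Mailbox()
req = FwCompletion("TSENE")
mbx.submit(req)
mbx.handle_response("TSENE", 0, {"millidegrees": 40480})
print(req.done.wait(timeout=1), req.result, req.data)
```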

Signed-off-by: Sanman Pradhan <sanman.p211993@gmail.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Link: https://patch.msgid.link/20250114000705.2081288-2-sanman.p211993@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Wed, 15 Jan 2025 22:13:36 +0000 (14:13 -0800)]
Merge branch 'net-lan969x-add-fdma-support'

Daniel Machon says:

====================
net: lan969x: add FDMA support

== Description:

This series is the last of a multi-part series, that prepares and adds
support for the new lan969x switch driver.

The upstreaming effort has been split into multiple series:

        1) Prepare the Sparx5 driver for lan969x (merged)

        2) Add support for lan969x (same basic features as Sparx5
           provides excl. FDMA and VCAP, merged).

        3) Add lan969x VCAP functionality (merged).

        4) Add RGMII support (merged).

    --> 5) Add FDMA support.

== FDMA support:

The lan969x switch device uses the same FDMA engine as the Sparx5 switch
device, with the same number of channels etc. This means we can utilize
the newly added FDMA library, that is already in use by the lan966x and
sparx5 drivers.

As in the previous lan969x series, the FDMA implementation will hook into
the Sparx5 implementation where possible. However, both RX and TX
handling will be done differently on lan969x, which therefore requires a
separate implementation of the RX and TX paths.

Details are in the commit description of the individual patches

== Patch breakdown:

Patch #1: Enable FDMA support on lan969x
Patch #2: Split start()/stop() functions
Patch #3: Activate TX FDMA in start()
Patch #4: Ops out a few functions that differ on the two platforms
Patch #5: Add FDMA implementation for lan969x

v1: https://lore.kernel.org/20250109-sparx5-lan969x-switch-driver-5-v1-0-13d6d8451e63@microchip.com
====================

Link: https://patch.msgid.link/20250113-sparx5-lan969x-switch-driver-5-v2-0-c468f02fd623@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Machon [Mon, 13 Jan 2025 19:36:09 +0000 (20:36 +0100)]
net: lan969x: add FDMA implementation

The lan969x switch device supports manual frame injection and extraction
to and from the switch core, using a number of injection and extraction
queues.  This technique is currently supported, but delivers poor
performance compared to Frame DMA (FDMA).

This lan969x implementation of FDMA, hooks into the existing FDMA for
Sparx5, but requires its own RX and TX handling, as lan969x does not
support the same native cache coherency that Sparx5 does. Effectively,
this means that we are going to use the DMA mapping API for mapping and
unmapping TX buffers. The RX loop will utilize the page pool API for
efficient RX handling. Other than that, the implementation is largely
the same, and utilizes the FDMA library for DCB and DB handling.

Some numbers:

Manual injection/extraction (before this series):

// iperf3 -c 1.0.1.1

[ ID] Interval           Transfer     Bitrate
[  5]   0.00-10.02  sec   345 MBytes   289 Mbits/sec  sender
[  5]   0.00-10.06  sec   345 MBytes   288 Mbits/sec  receiver

FDMA (after this series):

// iperf3 -c 1.0.1.1

[ ID] Interval           Transfer     Bitrate
[  5]   0.00-10.03  sec  1.10 GBytes   940 Mbits/sec  sender
[  5]   0.00-10.07  sec  1.10 GBytes   936 Mbits/sec  receiver
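
For reference, the ratio between the two sender bitrates above can be
worked out as follows (a small Python calculation; the helper name is
hypothetical):

```python
def speedup(before_mbps: float, after_mbps: float) -> float:
    """Ratio of the new bitrate to the old one."""
    return after_mbps / before_mbps


# Sender bitrates from the two iperf3 runs: 289 Mbits/sec with manual
# injection/extraction and 940 Mbits/sec with FDMA.
print(f"{speedup(289, 940):.2f}x")  # 3.25x
```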

Reviewed-by: Steen Hegelund <Steen.Hegelund@microchip.com>
Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
Link: https://patch.msgid.link/20250113-sparx5-lan969x-switch-driver-5-v2-5-c468f02fd623@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Machon [Mon, 13 Jan 2025 19:36:08 +0000 (20:36 +0100)]
net: sparx5: ops out certain FDMA functions

We are going to implement the RX  and TX paths a bit differently on
lan969x and therefore need to introduce new ops for FDMA functions:
init, deinit, xmit and poll. Assign the Sparx5 equivalents for these and
update the code throughout. Also add a 'struct net_device' argument to
the xmit() function, as we will be needing that for lan969x.

Reviewed-by: Steen Hegelund <Steen.Hegelund@microchip.com>
Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
Link: https://patch.msgid.link/20250113-sparx5-lan969x-switch-driver-5-v2-4-c468f02fd623@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Machon [Mon, 13 Jan 2025 19:36:07 +0000 (20:36 +0100)]
net: sparx5: activate FDMA tx in start()

The function sparx5_fdma_tx_activate() is responsible for configuring
the TX FDMA instance and activating the channel. TX activation has
previously been done in the xmit() function, when the first frame is
transmitted. Now that we have separate functions for starting and
stopping the FDMA, it seems reasonable to move the TX activation to the
start function. This change has no implications on the functionality.

Reviewed-by: Steen Hegelund <Steen.Hegelund@microchip.com>
Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
Link: https://patch.msgid.link/20250113-sparx5-lan969x-switch-driver-5-v2-3-c468f02fd623@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Machon [Mon, 13 Jan 2025 19:36:06 +0000 (20:36 +0100)]
net: sparx5: split sparx5_fdma_{start(),stop()}

The two functions: sparx5_fdma_{start(),stop()} are responsible for a
number of things, namely: allocation and initialization of FDMA buffers,
activation of FDMA channels in hardware and activation of the NAPI
instance.

This patch splits the buffer allocation and initialization into init and
deinit functions, and the channel and NAPI activation into start and
stop functions. This serves two purposes: 1) the start() and stop()
functions can be reused for lan969x and 2) prepares for future MTU
change support, where we must be able to stop and start the FDMA
channels and NAPI instance, without free'ing and reallocating the FDMA
buffers.
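
The benefit of the split can be seen in a small model. The Python sketch
below is purely illustrative (the class and its fields are hypothetical):
stop() followed by start() leaves the buffers untouched, which is what a
future MTU change would rely on.

```python
class FdmaModel:
    """Toy model of the init/start/stop/deinit lifecycle."""

    def __init__(self):
        self.buffers = None
        self.running = False

    def init(self, count: int = 4, size: int = 2048) -> None:
        # Allocate and initialise the buffers once.
        self.buffers = [bytearray(size) for _ in range(count)]

    def start(self) -> None:
        if self.buffers is None:
            raise RuntimeError("init() must be called before start()")
        # Stands in for activating the channels and the NAPI instance.
        self.running = True

    def stop(self) -> None:
        self.running = False

    def deinit(self) -> None:
        if self.running:
            raise RuntimeError("stop() must be called before deinit()")
        self.buffers = None


fdma = FdmaModel()
fdma.init()
before = fdma.buffers
fdma.start()
fdma.stop()
fdma.start()
print(fdma.buffers is before)  # the same buffers survive a restart
```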

Reviewed-by: Steen Hegelund <Steen.Hegelund@microchip.com>
Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
Link: https://patch.msgid.link/20250113-sparx5-lan969x-switch-driver-5-v2-2-c468f02fd623@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Machon [Mon, 13 Jan 2025 19:36:05 +0000 (20:36 +0100)]
net: sparx5: enable FDMA on lan969x

In a previous series, we made sure that FDMA was not initialized and
started on lan969x. Now that we are going to support it, undo that
change. In addition, make sure the chip ID check is only applicable on
Sparx5, as this is a check that is only relevant on this platform.

Reviewed-by: Steen Hegelund <Steen.Hegelund@microchip.com>
Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
Link: https://patch.msgid.link/20250113-sparx5-lan969x-switch-driver-5-v2-1-c468f02fd623@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Wed, 15 Jan 2025 21:23:32 +0000 (13:23 -0800)]
Merge branch 'net-phylink-fix-pcs-without-autoneg'

Russell King says:

====================
net: phylink: fix PCS without autoneg

Eric Woudstra reported that a PCS attached using 2500base-X does not
see link when phylink is using in-band mode, but autoneg is disabled,
despite there being a valid 2500base-X signal being received. We have
these settings:

act_link_an_mode = MLO_AN_INBAND
pcs_neg_mode = PHYLINK_PCS_NEG_INBAND_DISABLED

Eric diagnosed it to phylink_decode_c37_word() setting state->link
false because the full-duplex bit isn't set in the non-existent link
partner advertisement word (which doesn't exist because in-band
autoneg is disabled!)

The test in phylink_mii_c22_pcs_decode_state() is supposed to catch
this state, but since we converted PCS to use neg_mode, testing the
Autoneg in the local advertisement is no longer sufficient - we need
to be looking at the neg_mode, which currently isn't provided.

We need to provide this via the .pcs_get_state() method, and this
will require modifying all PCS implementations to add the extra
argument to this method.

Patch 1 uses the PCS neg_mode in phylink_mac_pcs_get_state() to correct
the now obsolete usage of the Autoneg bit in the advertisement.

Patch 2 passes neg_mode into the .pcs_get_state() method, and updates
all users.

Patch 3 adds neg_mode as an argument to the various clause 22 state
decoder functions in phylink, modifying drivers to pass the neg_mode
through.

Patch 4 makes use of neg_mode in phylink_mii_c22_pcs_decode_state()
rather than using the Autoneg bit in the advertising field.

Patch 5 may be required for Eric's case - it ensures that we report
the correct state for interface types that we support only one set
of modes for when autoneg is disabled.
====================

Link: https://patch.msgid.link/Z4TbR93B-X8A8iHe@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Russell King (Oracle) [Mon, 13 Jan 2025 09:22:44 +0000 (09:22 +0000)]
net: phylink: provide fixed state for 1000base-X and 2500base-X

When decoding clause 22 state, if in-band is disabled and the interface
is either 1000base-X or 2500base-X, report a fixed state rather than
link-down: we know the speed, and we only support full duplex. Pause
modes are taken from XPCS.
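
The decision described above can be sketched as a simplified Python
model. It is not the phylink code: the function name, the dict-based
state and the reduced in-band branch are assumptions, and pause handling
is left out.

```python
FIXED_SPEED_MBPS = {"1000base-x": 1000, "2500base-x": 2500}


def decode_c22_state(interface: str, link_up: bool, inband_an_enabled: bool,
                     lpa_full_duplex: bool = False) -> dict:
    """Simplified clause 22 state decode for the base-X interface modes."""
    state = {"link": link_up, "speed": None, "duplex": None}
    if not link_up:
        return state
    if inband_an_enabled:
        # With in-band autoneg, the link partner advertisement is valid
        # and must contain full duplex for the link to be usable.
        if lpa_full_duplex:
            state["speed"] = FIXED_SPEED_MBPS.get(interface)
            state["duplex"] = "full"
        else:
            state["link"] = False
    elif interface in FIXED_SPEED_MBPS:
        # No advertisement to decode: report the fixed speed and duplex.
        state["speed"] = FIXED_SPEED_MBPS[interface]
        state["duplex"] = "full"
    return state


print(decode_c22_state("2500base-x", link_up=True, inband_an_enabled=False))
```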

This fixes a problem reported by Eric Woudstra.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/E1tXGei-000EtL-Fn@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Russell King (Oracle) [Mon, 13 Jan 2025 09:22:39 +0000 (09:22 +0000)]
net: phylink: use neg_mode in phylink_mii_c22_pcs_decode_state()

Rather than using the state of the Autoneg bit, which is unreliable
with the new PCS neg mode support, use the passed neg_mode to decide
whether to decode the link partner advertisement data.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/E1tXGed-000EtF-CN@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Russell King (Oracle) [Mon, 13 Jan 2025 09:22:34 +0000 (09:22 +0000)]
net: phylink: pass neg_mode into c22 state decoder

Pass the current neg_mode into phylink_mii_c22_pcs_get_state() and
phylink_mii_c22_pcs_decode_state(). Update all users of phylink PCS
that use these functions.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/E1tXGeY-000Et9-8g@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Russell King (Oracle) [Mon, 13 Jan 2025 09:22:29 +0000 (09:22 +0000)]
net: phylink: pass neg_mode into .pcs_get_state() method

Pass the current neg_mode into the .pcs_get_state() method. Update all
users of phylink PCS.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/E1tXGeT-000Et3-4L@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Russell King (Oracle) [Mon, 13 Jan 2025 09:22:24 +0000 (09:22 +0000)]
net: phylink: use pcs_neg_mode in phylink_mac_pcs_get_state()

As in-band AN no longer just depends on MLO_AN_INBAND + Autoneg bit,
we need to take account of the pcs_neg_mode when deciding how to
initialise the speed, duplex and pause state members before calling
into the .pcs_get_state() method. Add this.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/E1tXGeO-000Esx-0r@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Wed, 15 Jan 2025 21:21:19 +0000 (13:21 -0800)]
Merge branch 'mptcp-selftests-more-debug-in-case-of-errors'

Matthieu Baerts says:

====================
mptcp: selftests: more debug in case of errors

Here are just a bunch of small improvements for the MPTCP selftests:

Patch 1: Unify error messages in simult_flows: print MIB and 'ss -Me'.

Patch 2: Unify error messages in sockopt: print MIB.

Patch 3: Move common code to print debug info to mptcp_lib.sh.

Patch 4: Use 'ss' with '-m' in case of errors.

Patch 5: Remove an unused variable.

Patch 6: Print only the size instead of size + filename again.
====================

Link: https://patch.msgid.link/20250114-net-next-mptcp-st-more-debug-err-v1-0-2ffb16a6cf35@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Matthieu Baerts (NGI0) [Tue, 14 Jan 2025 18:03:16 +0000 (19:03 +0100)]
selftests: mptcp: connect: better display the files size

'du' will print the name of the file, which was already displayed
before, e.g.

  Created /tmp/tmp.UOyy0ghfmQ (size 4703740/tmp/tmp.UOyy0ghfmQ) containing data sent by client
  Created /tmp/tmp.xq3zvFinGo (size 1391724/tmp/tmp.xq3zvFinGo) containing data sent by server

'stat' can be used instead, to display this:

  Created /tmp/tmp.UOyy0ghfmQ (size 4703740 B) containing data sent by client
  Created /tmp/tmp.xq3zvFinGo (size 1391724 B) containing data sent by server

This makes the file sizes easier to spot.

Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250114-net-next-mptcp-st-more-debug-err-v1-6-2ffb16a6cf35@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Matthieu Baerts (NGI0) [Tue, 14 Jan 2025 18:03:15 +0000 (19:03 +0100)]
selftests: mptcp: connect: remove unused variable

'cin_disconnect' is used in run_tests_disconnect(), but not
'cout_disconnect', so it is safe to drop it.

Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250114-net-next-mptcp-st-more-debug-err-v1-5-2ffb16a6cf35@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Matthieu Baerts (NGI0) [Tue, 14 Jan 2025 18:03:14 +0000 (19:03 +0100)]
selftests: mptcp: add -m with ss in case of errors

Recently, we had an issue where getting info about the memory would have
helped to better understand what went wrong.

Let's add it just in case for later.

Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250114-net-next-mptcp-st-more-debug-err-v1-4-2ffb16a6cf35@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoselftests: mptcp: move stats info in case of errors to lib.sh
Matthieu Baerts (NGI0) [Tue, 14 Jan 2025 18:03:13 +0000 (19:03 +0100)]
selftests: mptcp: move stats info in case of errors to lib.sh

A few MPTCP selftests are using the same code to print stats in case of
error. This code can then be moved to mptcp_lib.sh.

No behaviour changes intended, except to print the error in red and to
stderr, like most error messages.

Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250114-net-next-mptcp-st-more-debug-err-v1-3-2ffb16a6cf35@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoselftests: mptcp: sockopt: save nstat infos
Geliang Tang [Tue, 14 Jan 2025 18:03:12 +0000 (19:03 +0100)]
selftests: mptcp: sockopt: save nstat infos

Similar to the way nstat information is stored in the mptcp_connect.sh
and mptcp_join.sh scripts, this patch stores it in mptcp_sockopt.sh
and displays it when errors occur.

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250114-net-next-mptcp-st-more-debug-err-v1-2-2ffb16a6cf35@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoselftests: mptcp: simult_flows: unify errors msgs
Matthieu Baerts (NGI0) [Tue, 14 Jan 2025 18:03:11 +0000 (19:03 +0100)]
selftests: mptcp: simult_flows: unify errors msgs

In order to unify what is printed in case of error, similar to what is
done in mptcp_connect.sh and mptcp_join.sh, it is useful to make the
following modifications in simult_flows.sh:

- Print the rc errors at the end of the line.

- Print the MIB counters.

- Use the same ss options: add -M (MPTCP sockets) and -e (detailed
  socket information).

While at it, also print the 'max' time only in case of success,
because 'mptcp_connect.c' will already print this info in case of error,
e.g.:

  transfer slower than expected! runtime 11948 ms, expected 11921 ms

Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250114-net-next-mptcp-st-more-debug-err-v1-1-2ffb16a6cf35@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agomptcp: fix for setting remote ipv4mapped address
Geliang Tang [Tue, 14 Jan 2025 18:06:22 +0000 (19:06 +0100)]
mptcp: fix for setting remote ipv4mapped address

Commit 1c670b39cec7 ("mptcp: change local addr type of subflow_destroy")
introduced a bug in mptcp_pm_nl_subflow_destroy_doit().

ipv6_addr_set_v4mapped() should be called to set the remote ipv4 address
'addr_r.addr.s_addr' to the remote ipv6 address 'addr_r.addr6', not
'addr_l.addr.addr6', which is the local ipv6 address.
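
To make the mapping concrete, here is a small Python sketch (not the kernel
code) of what an IPv4-mapped IPv6 address looks like and of which field it
has to be stored in. The v4mapped() helper, the dictionaries standing in for
the local and remote address structures, and the example addresses are all
assumptions made for illustration.

```python
import ipaddress


def v4mapped(ipv4: str) -> ipaddress.IPv6Address:
    """Return the IPv4-mapped IPv6 address ::ffff:a.b.c.d for an IPv4 address."""
    v4 = ipaddress.IPv4Address(ipv4)
    # 80 zero bits, 16 one bits, then the 32-bit IPv4 address.
    raw = b"\x00" * 10 + b"\xff\xff" + v4.packed
    return ipaddress.IPv6Address(raw)


# Hypothetical local/remote pair (documentation addresses).
local = {"addr": "192.0.2.1", "addr6": None}
remote = {"addr": "198.51.100.7", "addr6": None}

# The remote IPv4 address must land in the *remote* IPv6 field.
remote["addr6"] = v4mapped(remote["addr"])
print(remote["addr6"])
```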

Fixes: 1c670b39cec7 ("mptcp: change local addr type of subflow_destroy")
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250114-net-next-mptcp-fix-remote-addr-v1-1-debcd84ea86f@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge branch 'net-bcm-asp2-fix-fallout-from-phylib-eee-changes'
Jakub Kicinski [Wed, 15 Jan 2025 21:17:58 +0000 (13:17 -0800)]
Merge branch 'net-bcm-asp2-fix-fallout-from-phylib-eee-changes'

Russell King says:

====================
net: bcm: asp2: fix fallout from phylib EEE changes

This series addresses the fallout from the phylib changes in the
Broadcom ASP2 driver.

The first patch uses phylib's copy of the LPI timer setting, which
means the driver no longer has to track this. It will be set in
hardware each time the adjust_link function is called when the link
is up, and will be read at initialisation time to set the current
value.

The second patch removes the driver's storage of tx_lpi_enabled,
which has become redundant since phylib managed EEE was merged. The
driver does nothing with this flag other than storing it.

The last patch converts the driver to use phylib's enable_tx_lpi
flag rather than trying to maintain its own copy.
====================

Link: https://patch.msgid.link/Z4aV3RmSZJ1WS3oR@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: bcm: asp2: convert to phylib managed EEE
Russell King (Oracle) [Tue, 14 Jan 2025 16:50:57 +0000 (16:50 +0000)]
net: bcm: asp2: convert to phylib managed EEE

Convert the Broadcom ASP2 driver to use phylib managed EEE support.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Tested-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/E1tXk81-000r4x-TS@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: bcm: asp2: remove tx_lpi_enabled
Russell King (Oracle) [Tue, 14 Jan 2025 16:50:52 +0000 (16:50 +0000)]
net: bcm: asp2: remove tx_lpi_enabled

Phylib maintains a copy of tx_lpi_enabled, which will be used to
populate the member when phy_ethtool_get_eee() is called. Therefore,
writing to this member before phy_ethtool_get_eee() will have no effect.
Remove it. Also remove setting our copy of info->eee.tx_lpi_enabled,
which becomes write-only.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Tested-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/E1tXk7w-000r4r-Pq@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: bcm: asp2: fix LPI timer handling
Russell King (Oracle) [Tue, 14 Jan 2025 16:50:47 +0000 (16:50 +0000)]
net: bcm: asp2: fix LPI timer handling

Fix the LPI timer handling in Broadcom ASP2 driver after the phylib
managed EEE patches were merged.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Tested-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/E1tXk7r-000r4l-Li@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge branch 'net-stmmac-further-eee-cleanups-and-one-fix'
Jakub Kicinski [Wed, 15 Jan 2025 02:22:06 +0000 (18:22 -0800)]
Merge branch 'net-stmmac-further-eee-cleanups-and-one-fix'

Russell King says:

====================
net: stmmac: further EEE cleanups (and one fix!)

This series continues the EEE cleanup of the stmmac driver, and
includes one fix.

As mentioned in the previous series, I wasn't entirely happy with the
"stmmac_disable_sw_eee_mode" name, so the first patch renames this to
"stmmac_stop_sw_lpi" instead, which I think better describes what this
function is doing - stopping the transmit of the LPI state because we
have a packet to send.

Patch 2 corrects the priv->eee_sw_timer_en flag when EEE has been
disabled. Currently upon disable, priv->eee_enabled is set false,
but through the weird logic that was present prior to the previous
series, priv->eee_sw_timer_en was set true. This behaviour was kept
as the previous series was cleanup, not fixes. This patch fixes this.

Having fixed priv->eee_sw_timer_en to actually indicate whether
software timed EEE mode is being used, it becomes no longer necessary
to test priv->eee_enabled in addition. Patch 3 removes the redundant
test. Patch 4 also uses priv->eee_sw_timer_en before manipulating the
software EEE state in the suspend method rather than using
priv->eee_enabled, which brings consistency.

Patch 5 provides stmmac_try_to_start_sw_lpi() which complements
stmmac_stop_sw_lpi(), and allows us to move duplicated code into one
location.

Patch 6 splits stmmac_enable_eee_mode() - one part of this function
tests whether there are any queues that have unfinished work (in
other words are busy). Separate out this code into a separate function.

Patch 7 also splits out the mod_timer() for the software EEE timer
into a separate function (the reason will be in patch 9).

Patch 8 merges the remains of stmmac_enable_eee_mode() into
stmmac_try_to_start_sw_lpi().

Patch 9 fixes the delay between transmit and entering LPI. Currently,
when cleaning the transmit queues, if we discover that we have finished
cleaning up all queues, we immediately instruct the hardware to enter
LPI mode without waiting for the LPI timer. However, we should wait for
the LPI timer to expire. Therefore, the transmit cleanup path needs
to call stmmac_restart_sw_lpi_timer() instead of
stmmac_try_to_start_sw_lpi().
====================

Link: https://patch.msgid.link/Z4T84SbaC4D-fN5y@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: stmmac: restart LPI timer after cleaning transmit descriptors
Russell King (Oracle) [Mon, 13 Jan 2025 11:46:20 +0000 (11:46 +0000)]
net: stmmac: restart LPI timer after cleaning transmit descriptors

Fix a bug in the LPI handling, where it is possible to immediately
enter LPI mode after cleaning the transmit descriptors when all queues
are empty rather than waiting for the LPI timeout to expire.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1tXItg-000MBg-TW@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: stmmac: combine stmmac_enable_eee_mode()
Russell King (Oracle) [Mon, 13 Jan 2025 11:46:15 +0000 (11:46 +0000)]
net: stmmac: combine stmmac_enable_eee_mode()

Combine stmmac_enable_eee_mode() with stmmac_try_to_start_sw_lpi()
which makes the code easier to read and the flow more logical. We
can now trivially see that if the transmit queues are busy, we
(re-)start the eee_ctrl_timer. Otherwise, if the transmit path is
not already in LPI mode, we ask the hardware to enter LPI mode.

Now that we can see better what is going on here, I believe this
shows that there is a bug with the software LPI timer implementation.

The LPI timer is supposed to define how long after the last
transmission completed before we start signalling LPI. However,
this code structure shows that if all transmit queues are empty,
and stmmac_try_to_start_sw_lpi() is called immediately after cleaning
the transmit queue, we will instruct the hardware to start signalling
LPI immediately.
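
The flow described above can be sketched as a toy model. This is Python
rather than driver C, and the class and its fields are hypothetical; only the
decision order (busy queues restart the timer, otherwise LPI is entered) is
taken from the description.

```python
class SwLpiModel:
    """Toy model of the software LPI decision; not the stmmac driver code."""

    def __init__(self) -> None:
        self.tx_busy = False            # any transmit queue with pending work
        self.tx_path_in_lpi_mode = False
        self.timer_armed = False        # stands in for eee_ctrl_timer

    def restart_sw_lpi_timer(self) -> None:
        # Models (re-)starting the timer; LPI is entered only when it expires.
        self.timer_armed = True

    def try_to_start_sw_lpi(self) -> None:
        if self.tx_busy:
            self.restart_sw_lpi_timer()
        elif not self.tx_path_in_lpi_mode:
            self.tx_path_in_lpi_mode = True  # ask hardware to enter LPI now


# Transmit cleanup has just finished and every queue is empty.
model = SwLpiModel()
model.try_to_start_sw_lpi()
entered_immediately = model.tx_path_in_lpi_mode
print(entered_immediately)
```

In this model the timer is never armed on the empty-queue path, which is the
behaviour the paragraph above identifies as the bug.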

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1tXItb-000MBa-OU@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: stmmac: provide function for restarting sw LPI timer
Russell King (Oracle) [Mon, 13 Jan 2025 11:46:10 +0000 (11:46 +0000)]
net: stmmac: provide function for restarting sw LPI timer

Provide a function that encapsulates restarting the software LPI
timer when we have determined that the transmit path is busy, or
when the EEE parameters have changed.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1tXItW-000MBU-KQ@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: stmmac: provide stmmac_eee_tx_busy()
Russell King (Oracle) [Mon, 13 Jan 2025 11:46:05 +0000 (11:46 +0000)]
net: stmmac: provide stmmac_eee_tx_busy()

Extract the code which checks whether there's still work to do on any
of the stmmac transmit queues. This will allow us to combine
stmmac_enable_eee_mode() with stmmac_try_to_start_sw_lpi() in the
next patch.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1tXItR-000MBO-GF@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: stmmac: add stmmac_try_to_start_sw_lpi()
Russell King (Oracle) [Mon, 13 Jan 2025 11:46:00 +0000 (11:46 +0000)]
net: stmmac: add stmmac_try_to_start_sw_lpi()

There are two places which call stmmac_enable_eee_mode() and follow it
immediately by modifying the expiry of priv->eee_ctrl_timer. Both code
paths are trying to enable LPI mode. Remove this duplication by
providing a function for this.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1tXItM-000MBI-CX@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: stmmac: check priv->eee_sw_timer_en in suspend path
Russell King (Oracle) [Mon, 13 Jan 2025 11:45:55 +0000 (11:45 +0000)]
net: stmmac: check priv->eee_sw_timer_en in suspend path

The suspend path uses priv->eee_enabled when cleaning up the software
timed LPI mode. Use priv->eee_sw_timer_en instead so we're consistently
using a single control for software-based timer handling.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1tXItH-000MBC-8i@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: stmmac: simplify TX cleanup decision for ending sw LPI mode
Russell King (Oracle) [Mon, 13 Jan 2025 11:45:50 +0000 (11:45 +0000)]
net: stmmac: simplify TX cleanup decision for ending sw LPI mode

As mentioned in "net: stmmac: correct priv->eee_sw_timer_en setting",
we can simplify some fast-path tests.

The transmit cleaning path checks whether EEE is enabled, the transmit
path is not in LPI mode, and that we're using software timed mode.
Since the above mentioned commit, checking whether EEE is enabled is
no longer necessary as priv->eee_sw_timer_en will be false when EEE is
disabled. Simplify this test.
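
The simplification can be checked by enumeration. The Python sketch below is
only an illustration of the boolean argument, with hypothetical function
names; it assumes the invariant stated above, namely that the software timer
flag is never true while EEE is disabled.

```python
from itertools import product


def old_test(eee_enabled: bool, in_lpi: bool, sw_timer_en: bool) -> bool:
    return eee_enabled and not in_lpi and sw_timer_en


def new_test(in_lpi: bool, sw_timer_en: bool) -> bool:
    return sw_timer_en and not in_lpi


# Enumerate only the states allowed once eee_sw_timer_en implies EEE is enabled.
valid_states = [
    (eee, lpi, sw)
    for eee, lpi, sw in product([False, True], repeat=3)
    if eee or not sw
]
equivalent = all(old_test(e, l, s) == new_test(l, s) for e, l, s in valid_states)
print(equivalent)
```

The two tests differ only in the excluded state (EEE disabled but the
software timer flag set), which is why the flag had to be corrected first.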

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1tXItC-000MB6-54@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: stmmac: correct priv->eee_sw_timer_en setting
Russell King (Oracle) [Mon, 13 Jan 2025 11:45:45 +0000 (11:45 +0000)]
net: stmmac: correct priv->eee_sw_timer_en setting

If we are disabling EEE/LPI, then we should not be enabling software
mode. The only time when we should is if EEE is active, and we are
wanting to use software-timed EEE mode.

Therefore, in the disable path of stmmac_eee_init(), ensure that
priv->eee_sw_timer_en is set false as we are going to be calling
del_timer_sync() on the timer.

This will allow us to simplify some fast-path tests in later patches.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1tXIt7-000MB0-0W@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: stmmac: rename stmmac_disable_sw_eee_mode()
Russell King (Oracle) [Mon, 13 Jan 2025 11:45:39 +0000 (11:45 +0000)]
net: stmmac: rename stmmac_disable_sw_eee_mode()

stmmac_disable_sw_eee_mode() was not a good choice for this function's
purpose - which is to stop transmitting LPI because we want to send a
packet. Rename it to stmmac_stop_sw_lpi().

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1tXIt1-000MAu-TE@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: ethernet: sunplus: Switch to ndo_eth_ioctl
谢致邦 (XIE Zhibang) [Mon, 13 Jan 2025 09:41:56 +0000 (09:41 +0000)]
net: ethernet: sunplus: Switch to ndo_eth_ioctl

The device ioctl handler no longer calls ndo_do_ioctl, but calls
ndo_eth_ioctl to handle mii ioctls since commit a76053707dbf
("dev_ioctl: split out ndo_eth_ioctl"). However, sunplus still used
ndo_do_ioctl when it was introduced. So switch to ndo_eth_ioctl.

Bad commit fd3040b9394c ("net: ethernet: Add driver for Sunplus SP7021")
was the initial driver commit, meaning that PHY IOCTLs were never
available on this driver. Therefore don't consider this as a fix.

Found by code inspection.

Signed-off-by: 谢致邦 (XIE Zhibang) <Yeking@Red54.com>
Link: https://patch.msgid.link/tencent_8CF8A72C708E96B9C7DC1AF96FEE19AF3D05@qq.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge branch 'net-ethernet-simplify-few-things'
Jakub Kicinski [Wed, 15 Jan 2025 02:04:28 +0000 (18:04 -0800)]
Merge branch 'net-ethernet-simplify-few-things'

Krzysztof Kozlowski says:

====================
net: ethernet: Simplify few things

A few code simplifications without functional impact.
Not tested on hardware.
====================

Link: https://patch.msgid.link/20250112-syscon-phandle-args-net-v1-0-3423889935f7@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: stmmac: stm32: Use syscon_regmap_lookup_by_phandle_args
Krzysztof Kozlowski [Sun, 12 Jan 2025 13:32:47 +0000 (14:32 +0100)]
net: stmmac: stm32: Use syscon_regmap_lookup_by_phandle_args

Use syscon_regmap_lookup_by_phandle_args() which is a wrapper over
syscon_regmap_lookup_by_phandle() combined with getting the syscon
argument.  Besides simpler code, this annotates within one line that the
given phandle has arguments, so grepping for code would be easier.

There is also no real benefit in printing errors on missing syscon
argument, because this is done just too late: runtime check on
static/build-time data.  Dtschema and Devicetree bindings offer the
static/build-time check for this already.

Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://patch.msgid.link/20250112-syscon-phandle-args-net-v1-5-3423889935f7@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: stmmac: sti: Use syscon_regmap_lookup_by_phandle_args
Krzysztof Kozlowski [Sun, 12 Jan 2025 13:32:46 +0000 (14:32 +0100)]
net: stmmac: sti: Use syscon_regmap_lookup_by_phandle_args

Use syscon_regmap_lookup_by_phandle_args() which is a wrapper over
syscon_regmap_lookup_by_phandle() combined with getting the syscon
argument.  Besides simpler code, this annotates within one line that the
given phandle has arguments, so grepping for code would be easier.

There is also no real benefit in printing errors on missing syscon
argument, because this is done just too late: runtime check on
static/build-time data.  Dtschema and Devicetree bindings offer the
static/build-time check for this already.

Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://patch.msgid.link/20250112-syscon-phandle-args-net-v1-4-3423889935f7@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: stmmac: imx: Use syscon_regmap_lookup_by_phandle_args
Krzysztof Kozlowski [Sun, 12 Jan 2025 13:32:45 +0000 (14:32 +0100)]
net: stmmac: imx: Use syscon_regmap_lookup_by_phandle_args

Use syscon_regmap_lookup_by_phandle_args() which is a wrapper over
syscon_regmap_lookup_by_phandle() combined with getting the syscon
argument.  Besides simpler code, this annotates within one line that the
given phandle has arguments, so grepping for code would be easier.

There is also no real benefit in printing errors on missing syscon
argument, because this is done just too late: runtime check on
static/build-time data.  Dtschema and Devicetree bindings offer the
static/build-time check for this already.

Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://patch.msgid.link/20250112-syscon-phandle-args-net-v1-3-3423889935f7@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: ti: am65-cpsw-nuss: Use syscon_regmap_lookup_by_phandle_args
Krzysztof Kozlowski [Sun, 12 Jan 2025 13:32:44 +0000 (14:32 +0100)]
net: ti: am65-cpsw-nuss: Use syscon_regmap_lookup_by_phandle_args

Use syscon_regmap_lookup_by_phandle_args() which is a wrapper over
syscon_regmap_lookup_by_phandle() combined with getting the syscon
argument.  Besides simpler code, this annotates within one line that the
given phandle has arguments, so grepping for code would be easier.

Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://patch.msgid.link/20250112-syscon-phandle-args-net-v1-2-3423889935f7@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: ti: icssg-prueth: Do not print physical memory addresses
Krzysztof Kozlowski [Sun, 12 Jan 2025 13:32:43 +0000 (14:32 +0100)]
net: ti: icssg-prueth: Do not print physical memory addresses

Debugging messages should not reveal anything about memory addresses.
This also solves arm compile test warnings:

  drivers/net/ethernet/ti/icssg/icssg_prueth_sr1.c:1034:49: error:
    format specifies type 'unsigned long long' but the argument has type 'phys_addr_t' (aka 'unsigned int') [-Werror,-Wformat]

Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Reviewed-by: MD Danish Anwar <danishanwar@ti.com>
Link: https://patch.msgid.link/20250112-syscon-phandle-args-net-v1-1-3423889935f7@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agosocket: Remove unused kernel_sendmsg_locked
Dr. David Alan Gilbert [Sun, 12 Jan 2025 13:13:18 +0000 (13:13 +0000)]
socket: Remove unused kernel_sendmsg_locked

The last use of kernel_sendmsg_locked() was removed in 2023 by
commit dc97391e6610 ("sock: Remove ->sendpage*() in favour of
sendmsg(MSG_SPLICE_PAGES)")

Remove it.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20250112131318.63753-1-linux@treblig.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: phy: Constify struct mdio_device_id
Christophe JAILLET [Sun, 12 Jan 2025 14:14:50 +0000 (15:14 +0100)]
net: phy: Constify struct mdio_device_id

'struct mdio_device_id' is not modified in these drivers.

Constifying these structures moves some data to a read-only section,
which increases overall security.

On x86_64, with allmodconfig, as an example:
Before:
======
   text    data     bss     dec     hex filename
  27014   12792       0   39806    9b7e drivers/net/phy/broadcom.o

After:
=====
   text    data     bss     dec     hex filename
  27206   12600       0   39806    9b7e drivers/net/phy/broadcom.o
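
A quick check of the figures above, written as a Python sketch: the bytes
that leave the 'data' column reappear in the 'text' column, which is where
this size output format appears to count read-only data, and the total is
unchanged. The variable names are made up for this example.

```python
# Figures copied from the size(1) output quoted above for broadcom.o.
before = {"text": 27014, "data": 12792, "bss": 0}
after = {"text": 27206, "data": 12600, "bss": 0}

moved_out_of_data = before["data"] - after["data"]
added_to_text = after["text"] - before["text"]
total_before = sum(before.values())
total_after = sum(after.values())

print(moved_out_of_data, added_to_text, total_before, total_after)
```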

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/403c381b7d9156b67ad68ffc44b8eee70c5e86a9.1736691226.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge branch 'net-phy-realtek-add-hwmon-support'
Jakub Kicinski [Tue, 14 Jan 2025 22:51:36 +0000 (14:51 -0800)]
Merge branch 'net-phy-realtek-add-hwmon-support'

Heiner Kallweit says:

====================
net: phy: realtek: add hwmon support

This adds hwmon support for the temperature sensor on RTL822x.
It's available on the standalone versions of the PHYs, and on the
internal PHYs of RTL8125B(P)/RTL8125D/RTL8126.
====================

Link: https://patch.msgid.link/7319d8f9-2d6f-4522-92e8-a8a4990042fb@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: phy: realtek: add hwmon support for temp sensor on RTL822x
Heiner Kallweit [Sat, 11 Jan 2025 20:51:24 +0000 (21:51 +0100)]
net: phy: realtek: add hwmon support for temp sensor on RTL822x

This adds hwmon support for the temperature sensor on RTL822x.
It's available on the standalone versions of the PHYs, and on
the integrated PHYs in RTL8125B/RTL8125D/RTL8126.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/ad6bfe9f-6375-4a00-84b4-bfb38a21bd71@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: phy: move realtek PHY driver to its own subdirectory
Heiner Kallweit [Sat, 11 Jan 2025 20:50:19 +0000 (21:50 +0100)]
net: phy: move realtek PHY driver to its own subdirectory

In preparation for adding a source file with hwmon support, move the
Realtek PHY driver to its own subdirectory and rename realtek.c to
realtek_main.c.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/c566551b-c915-4e34-9b33-129a6ddd6e4c@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: phy: realtek: add support for reading MDIO_MMD_VEND2 regs on RTL8125/RTL8126
Heiner Kallweit [Sat, 11 Jan 2025 20:49:31 +0000 (21:49 +0100)]
net: phy: realtek: add support for reading MDIO_MMD_VEND2 regs on RTL8125/RTL8126

RTL8125/RTL8126 don't support MMD access to the internal PHY, but
provide a mechanism to access at least all MDIO_MMD_VEND2 registers.
By exposing this mechanism, standard MMD access functions can be used
to access the MDIO_MMD_VEND2 registers.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/e821b302-5fe6-49ab-aabd-05da500581c0@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: airoha: Enforce ETS Qdisc priomap
Lorenzo Bianconi [Sun, 12 Jan 2025 18:32:45 +0000 (19:32 +0100)]
net: airoha: Enforce ETS Qdisc priomap

EN7581 SoC supports fixed QoS band priority where WRR queues have lowest
priorities with respect to SP ones.
E.g: WRR0, WRR1, .., WRRm, SP0, SP1, .., SPn

Enforce ETS Qdisc priomap according to the hw capabilities.

Suggested-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Davide Caratti <dcaratti@redhat.com>
Link: https://patch.msgid.link/20250112-airoha_ets_priomap-v1-1-fb616de159ba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: ethernet: ti: am65-cpsw: VLAN-aware CPSW only if !DSA
Alexander Sverdlin [Fri, 10 Jan 2025 12:57:35 +0000 (13:57 +0100)]
net: ethernet: ti: am65-cpsw: VLAN-aware CPSW only if !DSA

Only configure VLAN-aware CPSW mode if no port is used as DSA CPU port.
VLAN-aware mode interferes with some DSA tagging schemes and makes stacking
DSA switches downstream of CPSW impossible. Previous attempts to address
the issue are linked below.

Link: https://lore.kernel.org/netdev/20240227082815.2073826-1-s-vadapalli@ti.com/
Link: https://lore.kernel.org/linux-arm-kernel/4699400.vD3TdgH1nR@localhost/
Co-developed-by: Siddharth Vadapalli <s-vadapalli@ti.com>
Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com>
Signed-off-by: Alexander Sverdlin <alexander.sverdlin@siemens.com>
Link: https://patch.msgid.link/20250110125737.546184-1-alexander.sverdlin@siemens.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agotsnep: Link queues to NAPIs
Gerhard Engleder [Fri, 10 Jan 2025 22:39:39 +0000 (23:39 +0100)]
tsnep: Link queues to NAPIs

Use netif_queue_set_napi() to link queues to NAPI instances so that they
can be queried with netlink.

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
                         --dump queue-get --json='{"ifindex": 11}'
[{'id': 0, 'ifindex': 11, 'napi-id': 9, 'type': 'rx'},
 {'id': 1, 'ifindex': 11, 'napi-id': 10, 'type': 'rx'},
 {'id': 0, 'ifindex': 11, 'napi-id': 9, 'type': 'tx'},
 {'id': 1, 'ifindex': 11, 'napi-id': 10, 'type': 'tx'}]

Additionally use netif_napi_set_irq() to also provide NAPI interrupt
number to userspace.

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
                         --do napi-get --json='{"id": 9}'
{'defer-hard-irqs': 0,
 'gro-flush-timeout': 0,
 'id': 9,
 'ifindex': 11,
 'irq': 42,
 'irq-suspend-timeout': 0}

Providing information about queues to userspace makes sense as APIs like
XSK provide queue specific access. Also XSK busy polling relies on
queues linked to NAPIs.
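
As an example of what userspace can do with this information, the Python
sketch below groups the queues from the queue-get dump above by NAPI
instance. The data is copied from that dump; the grouping code is
illustrative only and is not part of the ynl tooling.

```python
from collections import defaultdict

# Output of the queue-get dump quoted above.
queues = [
    {"id": 0, "ifindex": 11, "napi-id": 9, "type": "rx"},
    {"id": 1, "ifindex": 11, "napi-id": 10, "type": "rx"},
    {"id": 0, "ifindex": 11, "napi-id": 9, "type": "tx"},
    {"id": 1, "ifindex": 11, "napi-id": 10, "type": "tx"},
]

# Map each NAPI instance to the (type, queue id) pairs it services.
by_napi = defaultdict(list)
for q in queues:
    by_napi[q["napi-id"]].append((q["type"], q["id"]))

for napi_id, linked in sorted(by_napi.items()):
    print(napi_id, linked)
```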

Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20250110223939.37490-1-gerhard@engleder-embedded.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoice: Add in/out PTP pin delays
Karol Kolacinski [Wed, 4 Dec 2024 09:46:11 +0000 (10:46 +0100)]
ice: Add in/out PTP pin delays

HW can have different input/output delays for each of the pins.

Currently, only E82X adapters have delay compensation based on TSPLL
config and E810 adapters have constant 1 ms compensation, both cases
only for output delays and the same one for all pins.

E825 adapters have different delays for SDP and other pins. Those
delays are also based on direction and input delays are different than
output ones. This is the main reason for moving delays to pin
description structure.

Add a field in ice_ptp_pin_desc structure to reflect that. Delay values
are based on approximate calculations of HW delays based on HW spec.

Implement external timestamp (input) delay compensation.

Remove existing definitions and wrappers for periodic output propagation
delays.

Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Sunitha Mekala <sunithax.d.mekala@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
3 months agoice: implement low latency PHY timer updates
Jacob Keller [Mon, 16 Dec 2024 14:53:32 +0000 (09:53 -0500)]
ice: implement low latency PHY timer updates

Programming the PHY registers in preparation for an increment value change
or a timer adjustment on E810 requires issuing Admin Queue commands for
each PHY register. It has been found that the firmware Admin Queue
processing occasionally has delays of tens or rarely up to hundreds of
milliseconds. This delay cascades to failures in the PTP applications which
depend on these updates being low latency.

Consider a standard PTP profile with a sync rate of 16 times per second.
This means there is ~62 milliseconds between sync messages. A complete
cycle of the PTP algorithm consists of:

1) Sync message (with Tx timestamp) from source
2) Follow-up message from source
3) Delay request (with Tx timestamp) from sink
4) Delay response (with Rx timestamp of request) from source
5) Measure instantaneous clock offset
6) Request time adjustment via CLOCK_ADJTIME system call

The Tx timestamps have a default maximum timeout of 10 milliseconds. If we
assume that the maximum possible time is used, this leaves us with ~42
milliseconds of processing time for a complete cycle.

The CLOCK_ADJTIME system call is synchronous and will block until the
driver completes its timer adjustment or frequency change.

If the writes to prepare the PHY timers get hit by a latency spike of 50
milliseconds, then the PTP application will be delayed past the point where
the next cycle should start. Packets from the next cycle may have already
arrived and are waiting on the socket.
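
The timing budget above can be worked through numerically. The Python sketch
below is one reading of the figures: it assumes the two Tx timestamps in the
cycle (the sync message and the delay request) each take the full 10
millisecond timeout. The constant names are made up for this example.

```python
SYNC_RATE_HZ = 16           # sync messages per second
TX_TSTAMP_TIMEOUT_MS = 10   # default maximum wait for one Tx timestamp
TX_TSTAMPS_PER_CYCLE = 2    # sync message and delay request
LATENCY_SPIKE_MS = 50       # Admin Queue delay from the example above

sync_interval_ms = 1000 / SYNC_RATE_HZ
budget_ms = sync_interval_ms - TX_TSTAMPS_PER_CYCLE * TX_TSTAMP_TIMEOUT_MS
overrun_ms = LATENCY_SPIKE_MS - budget_ms

print(f"interval {sync_interval_ms} ms, budget {budget_ms} ms, overrun {overrun_ms} ms")
```

Under these assumptions a single 50 millisecond spike already exceeds the
remaining processing budget, so the cycle runs into the next sync interval.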

In particular, LinuxPTP ptp4l may start complaining about missing an
announce message from the source, triggering a fault. In addition, the
clockcheck logic it uses may trigger. This clockcheck failure occurs
because the timestamp captured by hardware is compared against a reading of
CLOCK_MONOTONIC. It is assumed that the time when the Rx timestamp is
captured and the read from CLOCK_MONOTONIC are relatively close together.
This is not the case if there is a significant delay to processing the Rx
packet.

Newer firmware supports programming the PHY registers over a low latency
interface which bypasses the Admin Queue. Instead, software writes to the
REG_LL_PROXY_L and REG_LL_PROXY_H registers. Firmware reads these registers
and then programs the PHY timers.

Implement functions to use this interface when available to program the PHY
timers instead of using the Admin Queue. This avoids the Admin Queue
latency and ensures that adjustments happen within acceptable latency
bounds.

Co-developed-by: Karol Kolacinski <karol.kolacinski@intel.com>
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Anton Nadezhdin <anton.nadezhdin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
3 months agoice: check low latency PHY timer update firmware capability
Jacob Keller [Mon, 16 Dec 2024 14:53:31 +0000 (09:53 -0500)]
ice: check low latency PHY timer update firmware capability

Newer versions of firmware support programming the PHY timer via the low
latency interface exposed over REG_LL_PROXY_L and REG_LL_PROXY_H. Add
support for checking the device capabilities for this feature.

Co-developed-by: Karol Kolacinski <karol.kolacinski@intel.com>
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Milena Olech <milena.olech@intel.com>
Signed-off-by: Anton Nadezhdin <anton.nadezhdin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
3 months ago ice: add lock to protect low latency interface
Jacob Keller [Mon, 16 Dec 2024 14:53:30 +0000 (09:53 -0500)]
ice: add lock to protect low latency interface

Newer firmware for the E810 devices supports a 'low latency' interface to
interact with the PHY without using the Admin Queue. This interface is
accessed via the REG_LL_PROXY_L and REG_LL_PROXY_H registers.

Currently, this interface is only used for Tx timestamps. There are two
different mechanisms, including one which uses an interrupt for firmware to
signal completion. However, these two methods are mutually exclusive, so no
synchronization between them was necessary.

This low latency interface is being extended in future firmware to also
support programming the PHY timers. Use of the interface for PHY timers
will need synchronization to ensure there is no overlap with a Tx timestamp.

The interrupt-based response complicates the locking somewhat. We can't use
a simple spinlock. It would need to be acquired in
ice_ptp_req_tx_single_tstamp and released in
ice_ptp_complete_tx_single_tstamp. The ice_ptp_req_tx_single_tstamp
function is called from the threaded IRQ, and
ice_ptp_complete_tx_single_tstamp is called from the low latency IRQ, so we
would need to acquire the lock with IRQs disabled.

To handle this, we'll use a wait queue along with
wait_event_interruptible_locked_irq in the update flows which don't use the
interrupt.

The interrupt flow will acquire the wait queue lock, set the
ATQBAL_FLAGS_INTR_IN_PROGRESS flag, initiate the firmware low latency
request, and then release the wait queue lock.

Upon receipt of the low latency interrupt, the lock will be acquired, the
ATQBAL_FLAGS_INTR_IN_PROGRESS flag will be cleared, the firmware response
will be captured, and wake_up_locked() will be called on the wait queue.

The other flows will use wait_event_interruptible_locked_irq() to wait
until the ATQBAL_FLAGS_INTR_IN_PROGRESS flag is clear. This function checks
the condition under lock, but does not hold the lock while waiting. On
return, the lock is held, and a return of zero indicates we hold the lock
and the in-progress flag is not set.

This will ensure that threads which need to use the low latency interface
will sleep until they can acquire the lock without any pending low latency
interrupt flow interfering.
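
As an illustration only, the scheme above can be approximated in user space
with a mutex and a condition variable. This is an analogue, not the driver
code: the pthread mutex stands in for the wait queue lock, the condition
variable for the wait queue, and all names (struct ll_iface, ll_acquire and
so on) are hypothetical. Unlike wait_event_interruptible_locked_irq(), the
sketch does not model interruption by signals or IRQ disabling.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct ll_iface {
	pthread_mutex_t lock;  /* stands in for the wait queue lock */
	pthread_cond_t idle;   /* stands in for the wait queue itself */
	bool intr_in_progress; /* like ATQBAL_FLAGS_INTR_IN_PROGRESS */
	uint64_t response;     /* last captured firmware response */
};

void ll_init(struct ll_iface *ll)
{
	assert(ll);
	pthread_mutex_init(&ll->lock, NULL);
	pthread_cond_init(&ll->idle, NULL);
	ll->intr_in_progress = false;
	ll->response = 0;
}

/* Interrupt flow, request side: mark the interface busy, then unlock. */
void ll_intr_request(struct ll_iface *ll)
{
	pthread_mutex_lock(&ll->lock);
	ll->intr_in_progress = true;
	/* The firmware low latency request would be initiated here. */
	pthread_mutex_unlock(&ll->lock);
}

/* Interrupt flow, completion side: capture response, wake the waiters. */
void ll_intr_complete(struct ll_iface *ll, uint64_t fw_response)
{
	pthread_mutex_lock(&ll->lock);
	ll->intr_in_progress = false;
	ll->response = fw_response;
	pthread_cond_broadcast(&ll->idle);
	pthread_mutex_unlock(&ll->lock);
}

/* Other flows: sleep until no interrupt flow is pending. The condition
 * is checked under the lock, the lock is dropped while sleeping, and it
 * is held again on return.
 */
void ll_acquire(struct ll_iface *ll)
{
	pthread_mutex_lock(&ll->lock);
	while (ll->intr_in_progress)
		pthread_cond_wait(&ll->idle, &ll->lock);
}

void ll_release(struct ll_iface *ll)
{
	pthread_mutex_unlock(&ll->lock);
}
```

A PHY timer update in this model would call ll_acquire(), perform its
register accesses, and then call ll_release(), so it can never overlap with
a Tx timestamp request that is still waiting for its interrupt.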

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Milena Olech <milena.olech@intel.com>
Signed-off-by: Anton Nadezhdin <anton.nadezhdin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
3 months ago ice: rename TS_LL_READ* macros to REG_LL_PROXY_H_*
Jacob Keller [Mon, 16 Dec 2024 14:53:29 +0000 (09:53 -0500)]
ice: rename TS_LL_READ* macros to REG_LL_PROXY_H_*

The TS_LL_READ macros are used as part of the low latency Tx timestamp
interface. A future firmware extension will add support for performing PHY
timer updates over this interface. Using TS_LL_READ as the prefix for these
macros will be confusing once the interface is used for other purposes.

Rename the macros, using the prefix REG_LL_PROXY_H, to better clarify that
this is for the low latency interface.

Additionally, add macros for the PF_SB_ATQBAH and PF_SB_ATQBAL registers to
better clarify the contents of these registers, as PF_SB_ATQBAH contains
the low part of the Tx timestamp.

Co-developed-by: Karol Kolacinski <karol.kolacinski@intel.com>
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Milena Olech <milena.olech@intel.com>
Signed-off-by: Anton Nadezhdin <anton.nadezhdin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
3 months ago ice: use read_poll_timeout_atomic in ice_read_phy_tstamp_ll_e810
Jacob Keller [Mon, 16 Dec 2024 14:53:28 +0000 (09:53 -0500)]
ice: use read_poll_timeout_atomic in ice_read_phy_tstamp_ll_e810

The ice_read_phy_tstamp_ll_e810 function repeatedly reads the PF_SB_ATQBAL
register until the TS_LL_READ_TS bit is cleared. This is a perfect
candidate for using rd32_poll_timeout. However, the default implementation
uses a sleep-based wait. Instead, use the read_poll_timeout_atomic macro,
which is based on the non-sleeping implementation, to replace the read loop
in the ice_read_phy_tstamp_ll_e810 function.
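
The polling pattern can be sketched in user space as below. This is not the
kernel macro and not the driver function: poll_bit_clear, struct fake_reg,
fake_read and the TS_BUSY bit position are hypothetical, and the register
is simulated. The sketch busy-waits without sleeping, re-reading until the
bit clears or a CLOCK_MONOTONIC deadline passes, and does one final read
after the deadline so that a late success is not reported as a timeout.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <time.h>

#define TS_BUSY (1u << 0) /* hypothetical position of the "busy" bit */

typedef uint32_t (*reg_read_fn)(void *ctx);

/* Simulated register: reports busy for the first busy_reads reads. */
struct fake_reg {
	unsigned int busy_reads;
};

uint32_t fake_read(void *ctx)
{
	struct fake_reg *reg = ctx;

	if (reg->busy_reads) {
		reg->busy_reads--;
		return 0xab00 | TS_BUSY;
	}
	return 0xab00;
}

int64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Busy-poll until (value & mask) == 0 or timeout_ns elapses.
 * Returns 0 on success, -ETIMEDOUT otherwise; *val holds the last read.
 */
int poll_bit_clear(reg_read_fn rd, void *ctx, uint32_t mask,
		   int64_t timeout_ns, uint32_t *val)
{
	int64_t deadline;

	assert(rd && val && timeout_ns >= 0);
	deadline = now_ns() + timeout_ns;

	for (;;) {
		*val = rd(ctx);
		if (!(*val & mask))
			return 0;
		if (now_ns() > deadline) {
			/* Final read, in case the bit cleared just now. */
			*val = rd(ctx);
			return (*val & mask) ? -ETIMEDOUT : 0;
		}
	}
}
```

The caller gets the last value read back through val, which mirrors how a
polled register read typically returns both a status and the data.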

Co-developed-by: Karol Kolacinski <karol.kolacinski@intel.com>
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Anton Nadezhdin <anton.nadezhdin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>