Cosmin Ratiu [Wed, 21 May 2025 12:09:02 +0000 (15:09 +0300)]
net/mlx5e: Convert mlx5 netdevs to instance locking
This patch convert mlx5 to use the new netdev instance lock in addition
to the pre-existing state_lock (and the RTNL).
mlx5e_priv.state_lock was already used throughout mlx5 to protect
against concurrent state modifications on the same netdev, usually in
addition to the RTNL. The new netdev instance lock will eventually
replace it, but for now, it is acquired in addition to the existing
locks in the order RTNL -> instance lock -> state_lock.
All three netdev types handled by mlx5 are converted to the new style of
locking, because they share a lot of code related to initializing
channels and dealing with NAPI, so it's better to convert all three
rather than introduce different assumptions deep in the call stack
depending on the type of device.
Because of the nature of the call graphs in mlx5, it wasn't possible to
incrementally convert parts of the driver to use the new lock, since
either all call paths into NAPI have to possess the new lock if the
*_locked variants are used, or none of them can have the lock.
One area which required extra care is the interaction between closing
channels and devlink health reporter tasks.
Previously, the recovery tasks were unconditionally acquiring the
RTNL, which could lead to deadlocks in these scenarios:
Another similar instance of this is:
T1: mlx5e_close (== .ndo_stop(), has RTNL) -> mlx5e_close_locked
-> mlx5e_close_channels -> mlx5e_ptp_close
-> mlx5e_ptp_close_queues -> mlx5e_ptp_close_txqsqs
-> mlx5e_ptp_close_txqsq
-> cancel_work_sync(&sq->recover_work) waits for
T2: mlx5e_tx_err_cqe_work -> mlx5e_reporter_tx_err_cqe
-> mlx5e_health_report -> devlink_health_report
-> devlink_health_reporter_recover
-> mlx5e_tx_reporter_err_cqe_recover which does:
rtnl_lock(); => Another deadlock.
Fix that by using the same pattern previously done in
mlx5e_tx_timeout_work, where the RTNL was repeatedly tried to be
acquired until either:
a) it is successfully acquired or
b) there's no need for the work to be done any more (channel is being
closed).
Now, for all three recovery tasks, the instance lock is repeatedly tried
to be acquired until successful or the channel/SQ is closed.
As a side-effect, drop the !test_bit(MLX5E_STATE_OPENED, &priv->state)
check from mlx5e_tx_timeout_work, it's weaker than
!test_bit(MLX5E_STATE_CHANNELS_ACTIVE, &priv->state) and unnecessary.
Future patches will introduce new call paths (from netdev queue
management ops) which can close channels (and call cancel_work_sync on
the recovery tasks) without the RTNL lock and only with the netdev
instance lock.
Cosmin Ratiu [Wed, 21 May 2025 12:09:01 +0000 (15:09 +0300)]
net/mlx5e: Don't drop RTNL during firmware flash
There's no explanation in the original commit of why that was done, but
presumably flashing takes a long time and holding RTNL for so long
blocks other interactions with the netdev layer.
However, the stack is moving towards netdev instance locking and
dropping and reacquiring RTNL in the context of flashing introduces
locking ordering issues: RTNL must be acquired before the netdev
instance lock and released after it.
This patch therefore takes the simpler approach by no longer dropping
and reacquiring the RTNL, as soon RTNL for ethtool will be removed,
leaving only the instance lock to protect against races.
Cosmin Ratiu [Wed, 21 May 2025 12:09:00 +0000 (15:09 +0300)]
IB/IPoIB: Allow using netdevs that require the instance lock
After the last patch removing vlan_rwsem, it is an incremental step to
allow ipoib to work with netdevs that require the instance lock.
In several places, netdev_lock() is changed to netdev_lock_ops_to_full()
which takes care of not acquiring the lock again when the netdev is
already locked.
In ipoib_ib_tx_timeout_work() and __ipoib_ib_dev_flush() for HEAVY
flushes, the netdev lock is acquired/released. This is needed because
these functions end up calling .ndo_stop()/.ndo_open() on subinterfaces,
and the device may expect the netdev instance lock to be held.
ipoib_set_mode() now explicitly acquires ops lock while manipulating the
features, mtu and tx queues.
Finally, ipoib_napi_enable()/ipoib_napi_disable() now use the *_locked
variants of the napi_enable()/napi_disable() calls and optionally
acquire the netdev lock themselves depending on the dev they operate on.
Cosmin Ratiu [Wed, 21 May 2025 12:08:59 +0000 (15:08 +0300)]
IB/IPoIB: Replace vlan_rwsem with the netdev instance lock
vlan_rwsem was added more than a decade ago to work around a deadlock
involving the original mutex being acquired twice, once from the wq.
Subsequent changes then tweaked it to partially protect access to
ipoib_dev_priv->child_intfs together with the RTNL. Flushing the wq
synchronously was also since then refactored to happen separately.
This semaphore unfortunately prevents updating ipoib to work with
devices that require the netdev lock, because of lock ordering issues
between RTNL, vlan_rwsem and the netdev instance locks of parent and
child devices.
To uncomplicate things, this commit replaces vlan_rwsem with the netdev
instance lock of the parent device. Both parent child_intfs list and the
children's list membership in it require holding the parent netdev
instance lock.
All call paths were carefully reviewed and no-longer-needed ASSERT_RTNL
calls were dropped. Some non-trivial changes:
- ipoib_match_gid_pkey_addr() now only acquires the instance lock and
iterates through child_intfs for the first level of recursion (the
parent), as it's not possible to have multiple levels of nested
subinterfaces.
- ipoib_open() and ipoib_stop() schedule tasks on the global workqueue
to open/stop child interfaces to avoid potentially acquiring nested
netdev instance locks. To avoid the device going away between the task
scheduling and execution, netdev_hold/netdev_put are used.
Cosmin Ratiu [Wed, 21 May 2025 12:08:58 +0000 (15:08 +0300)]
IB/IPoIB: Enqueue separate work_structs for each flushed interface
Previously, flushing a netdevice involved first flushing all child
devices from the flush task itself. That requires holding the lock that
protects the list for the entire duration of the flush.
This poses a problem when converting from vlan_rwsem to the netdev
instance lock (next patch), because holding the parent lock while
trying to acquire a child lock makes lockdep unhappy, rightfully.
Fix this by splitting a big flush task into individual flush tasks
(all are already created in their respective ipoib_dev_priv structs)
and defining a helper function to enqueue all of them while holding the
list lock.
In ipoib_set_mac, the function is not used and the task is enqueued
directly, because in the subsequent patches locking is changed and this
function may be called with the netdev instance lock held.
This is effectively a noop, the wq is single-threaded and ordered and
will execute the same flush operations in the same order as before.
Furthermore, there should be no new races because
ipoib_parent_unregister_pre() calls flush_workqueue() after stopping new
work generation to wait for pending work to complete. flush_workqueue()
waits for all currently enqueued work to finish before returning.
Taehee Yoo [Tue, 20 May 2025 07:11:55 +0000 (07:11 +0000)]
eth: bnxt: fix deadlock when xdp is attached or detached
When xdp is attached or detached, dev->ndo_bpf() is called by
do_setlink(), and it acquires netdev_lock() if needed.
Unlike other drivers, the bnxt driver is protected by netdev_lock while
xdp is attached/detached because it sets dev->request_ops_lock to true.
So, the bnxt_xdp(), that is callback of ->ndo_bpf should not acquire
netdev_lock().
But the xdp_features_{set | clear}_redirect_target() was changed to
acquire netdev_lock() internally.
It causes a deadlock.
To fix this problem, bnxt driver should use
xdp_features_{set | clear}_redirect_target_locked() instead.
Splat looks like:
============================================
WARNING: possible recursive locking detected
6.15.0-rc6+ #1 Not tainted
--------------------------------------------
bpftool/1745 is trying to acquire lock: ffff888131b85038 (&dev->lock){+.+.}-{4:4}, at: xdp_features_set_redirect_target+0x1f/0x80
but task is already holding lock: ffff888131b85038 (&dev->lock){+.+.}-{4:4}, at: do_setlink.constprop.0+0x24e/0x35d0
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&dev->lock);
lock(&dev->lock);
*** DEADLOCK ***
May be due to missing lock nesting notation
3 locks held by bpftool/1745:
#0: ffffffffa56131c8 (rtnl_mutex){+.+.}-{4:4}, at: rtnl_setlink+0x1fe/0x570
#1: ffffffffaafa75a0 (&net->rtnl_mutex){+.+.}-{4:4}, at: rtnl_setlink+0x236/0x570
#2: ffff888131b85038 (&dev->lock){+.+.}-{4:4}, at: do_setlink.constprop.0+0x24e/0x35d0
Kory Maincent [Mon, 19 May 2025 08:45:05 +0000 (10:45 +0200)]
net: Add support for providing the PTP hardware source in tsinfo
Multi-PTP source support within a network topology has been merged,
but the hardware timestamp source is not yet exposed to users.
Currently, users only see the PTP index, which does not indicate
whether the timestamp comes from a PHY or a MAC.
Add support for reporting the hwtstamp source using a
hwtstamp-source field, alongside hwtstamp-phyindex, to describe
the origin of the hardware timestamp.
Remove HWTSTAMP_SOURCE_UNSPEC enum value as it is not used at all.
Having adjacent accelerated modify header actions (so-called
pattern-argument actions) may result in inconsistent outcome.
These inconsistencies can take the form of writes to the same
field or a read coupled with a write to the same field. The
solution is to detect such dependencies and insert nops between
the offending actions.
The existing implementation had a few issues, which pretty much
required a complete rewrite of the code that handles these
dependencies.
In the new implementation we're doing the following:
* Checking any two adjacent actions for conflicts (not just
odd-even pairs).
* Marking 'set' and 'add' action fields as destination, rather
than source, for the purposes of checking for conflicts.
* Checking all types of actions ('add', 'set', 'copy') for
dependencies.
* Managing offsets of the args in the buffer - copy the action
args to the right place in the buffer.
* Checking that after inserting nops we're still within the number
of supported actions - return an error otherwise.
Yevgeny Kliteynik [Tue, 20 May 2025 18:46:41 +0000 (21:46 +0300)]
net/mlx5: HWS, fix typo - 'nope' to 'nop'
Fix typo - rename 'nope_locations' to 'nop_locations', which describes
the locations of 'nop' actions. To shorten the lines, this renaming
also required some refactoring.
Vlad Dogaru [Tue, 20 May 2025 18:46:40 +0000 (21:46 +0300)]
net/mlx5: HWS, register reformat actions with fw
Hardware steering handles actions differently from firmware, but for
termination rules that use encapsulation the firmware needs to be aware
of the action.
Fix this by registering reformat actions with the firmware the first
time this is needed. To do this, add a third possible owner for an
action, and also a lock to protect against registration of the same
action from different threads.
Vlad Dogaru [Tue, 20 May 2025 18:46:39 +0000 (21:46 +0300)]
net/mlx5: SWS, fix reformat id error handling
The firmware reformat id is a u32 and can't safely be returned as an
int. Because the functions also need a way to signal error, prefer to
return the id as an output parameter and keep the return code only for
success/error.
While we're at it, also extract some duplicate code to fetch the
reformat id from a more generic struct pkt_reformat.
Nelson Escobar [Wed, 21 May 2025 01:19:29 +0000 (01:19 +0000)]
net/enic: Allow at least 8 RQs to always be used
Enic started using netif_get_num_default_rss_queues() to set the number
of RQs used in commit cc94d6c4d40c ("enic: Adjust used MSI-X
wq/rq/cq/interrupt resources in a more robust way")
This resulted in machines with less than 16 cpus using less than 8 RQs.
Allow enic to use at least 8 RQs no matter how many cpus are in the
machine to not impact existing enic workloads after a kernel upgrade.
Reviewed-by: John Daley <johndale@cisco.com> Reviewed-by: Satish Kharat <satishkh@cisco.com> Signed-off-by: Nelson Escobar <neescoba@cisco.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Link: https://patch.msgid.link/20250521-enic_min_8rq-v1-1-691bd2353273@cisco.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Fan Gong [Tue, 20 May 2025 10:26:59 +0000 (13:26 +0300)]
hinic3: module initialization and tx/rx logic
This is [1/3] part of hinic3 Ethernet driver initial submission.
With this patch hinic3 is a valid kernel module but non-functional
driver.
The driver parts contained in this patch:
Module initialization.
PCI driver registration but with empty id_table.
Auxiliary driver registration.
Net device_ops registration but open/stop are empty stubs.
tx/rx logic.
All major data structures of the driver are fully introduced with the
code that uses them but without their initialization code that requires
management interface with the hw.
Alok Tiwari [Mon, 19 May 2025 14:17:19 +0000 (07:17 -0700)]
emulex/benet: correct command version selection in be_cmd_get_stats()
Logic here always sets hdr->version to 2 if it is not a BE3 or Lancer chip,
even if it is BE2. Use 'else if' to prevent multiple assignments, setting
version 0 for BE2, version 1 for BE3 and Lancer, and version 2 for others.
Fixes potential incorrect version setting when BE2_chip and
BE3_chip/lancer_chip checks could both be true.
====================
net: phy: Add support for new Aeonsemi PHYs
Add support for new Aeonsemi 10G C45 PHYs. These PHYs intergate an IPC
to setup some configuration and require special handling to sync with
the parity bit. The parity bit is a way the IPC use to follow correct
order of command sent.
Supported PHYs AS21011JB1, AS21011PB1, AS21010JB1, AS21010PB1,
AS21511JB1, AS21511PB1, AS21510JB1, AS21510PB1, AS21210JB1,
AS21210PB1 that all register with the PHY ID 0x7500 0x7500
before the firmware is loaded.
The big special thing about this PHY is that it does provide
a generic PHY ID in C45 register that change to the correct one
one the firmware is loaded.
In practice:
- MMD 0x7 ID 0x7500 0x9410 -> FW LOAD -> ID 0x7500 0x9422
To handle this, we operate on .match_phy_device where
we check the PHY ID, if the ID match the generic one,
we load the firmware and we return 0 (PHY driver doesn't
match). Then PHY core will try the next PHY driver in the list
and this time the PHY is correctly filled in and we register
for it.
To help in the matching and not modify part of the PHY device
struct, .match_phy_device is extended to provide also the
current phy_driver is trying to match for. This add the
extra benefits that some other PHY can simplify their
.match_phy_device OP.
====================
Christian Marangi [Sat, 17 May 2025 20:13:50 +0000 (22:13 +0200)]
dt-bindings: net: Document support for Aeonsemi PHYs
Add Aeonsemi PHYs and the requirement of a firmware to correctly work.
Also document the max number of LEDs supported and what PHY ID expose
when no firmware is loaded.
Supported PHYs AS21011JB1, AS21011PB1, AS21010JB1, AS21010PB1,
AS21511JB1, AS21511PB1, AS21510JB1, AS21510PB1, AS21210JB1,
AS21210PB1 that all register with the PHY ID 0x7500 0x9410 on C45
registers before the firmware is loaded.
Christian Marangi [Sat, 17 May 2025 20:13:49 +0000 (22:13 +0200)]
net: phy: Add support for Aeonsemi AS21xxx PHYs
Add support for Aeonsemi AS21xxx 10G C45 PHYs. These PHYs integrate
an IPC to setup some configuration and require special handling to
sync with the parity bit. The parity bit is a way the IPC use to
follow correct order of command sent.
Supported PHYs AS21011JB1, AS21011PB1, AS21010JB1, AS21010PB1,
AS21511JB1, AS21511PB1, AS21510JB1, AS21510PB1, AS21210JB1,
AS21210PB1 that all register with the PHY ID 0x7500 0x7510
before the firmware is loaded.
They all support up to 5 LEDs with various HW mode supported.
While implementing it was found some strange coincidence with using the
same logic for implementing C22 in MMD regs in Broadcom PHYs.
For reference here the AS21xxx PHY name logic:
AS21x1xxB1
^ ^^
| |J: Supports SyncE/PTP
| |P: No SyncE/PTP support
| 1: Supports 2nd Serdes
| 2: Not 2nd Serdes support
0: 10G, 5G, 2.5G
5: 5G, 2.5G
2: 2.5G
Christian Marangi [Sat, 17 May 2025 20:13:48 +0000 (22:13 +0200)]
net: phy: introduce genphy_match_phy_device()
Introduce new API, genphy_match_phy_device(), to provide a way to check
to match a PHY driver for a PHY device based on the info stored in the
PHY device struct.
The function generalize the logic used in phy_bus_match() to check the
PHY ID whether if C45 or C22 ID should be used for matching.
This is useful for custom .match_phy_device function that wants to use
the generic logic under some condition. (example a PHY is already setup
and provide the correct PHY ID)
Christian Marangi [Sat, 17 May 2025 20:13:47 +0000 (22:13 +0200)]
net: phy: nxp-c45-tja11xx: simplify .match_phy_device OP
Simplify .match_phy_device OP by using a generic function and using the
new phy_id PHY driver info instead of hardcoding the matching PHY ID
with new variant for macsec and no_macsec PHYs.
Also make use of PHY_ID_MATCH_MODEL macro and drop PHY_ID_MASK define to
introduce phy_id and phy_id_mask again in phy_driver struct.
Christian Marangi [Sat, 17 May 2025 20:13:45 +0000 (22:13 +0200)]
net: phy: pass PHY driver to .match_phy_device OP
Pass PHY driver pointer to .match_phy_device OP in addition to phydev.
Having access to the PHY driver struct might be useful to check the
PHY ID of the driver is being matched for in case the PHY ID scanned in
the phydev is not consistent.
A scenario for this is a PHY that change PHY ID after a firmware is
loaded, in such case, the PHY ID stored in PHY device struct is not
valid anymore and PHY will manually scan the ID in the match_phy_device
function.
Having the PHY driver info is also useful for those PHY driver that
implement multiple simple .match_phy_device OP to match specific MMD PHY
ID. With this extra info if the parsing logic is the same, the matching
function can be generalized by using the phy_id in the PHY driver
instead of hardcoding.
Rust wrapper callback is updated to align to the new match_phy_device
arguments.
Suggested-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Christian Marangi <ansuelsmth@gmail.com> Reviewed-by: Benno Lossin <lossin@kernel.org> # for Rust Reviewed-by: FUJITA Tomonori <fujita.tomonori@gmail.com> Link: https://patch.msgid.link/20250517201353.5137-2-ansuelsmth@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
net: faster and simpler CRC32C computation
Update networking code that computes the CRC32C of packets to just call
crc32c() without unnecessary abstraction layers. The result is faster
and simpler code.
Patches 1-7 add skb_crc32c() and remove the overly-abstracted and
inefficient __skb_checksum().
Patches 8-10 replace skb_copy_and_hash_datagram_iter() with
skb_copy_and_crc32c_datagram_iter(), eliminating the unnecessary use of
the crypto layer. This unblocks the conversion of nvme-tcp to call
crc32c() directly instead of using the crypto layer, which patch 9 does.
Eric Biggers [Mon, 19 May 2025 17:50:11 +0000 (10:50 -0700)]
nvme-tcp: use crc32c() and skb_copy_and_crc32c_datagram_iter()
Now that the crc32c() library function directly takes advantage of
architecture-specific optimizations and there also now exists a function
skb_copy_and_crc32c_datagram_iter(), it is unnecessary to go through the
crypto_ahash API. Just use those functions. This is much simpler, and
it also improves performance due to eliminating the crypto API overhead.
Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://patch.msgid.link/20250519175012.36581-10-ebiggers@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Biggers [Mon, 19 May 2025 17:50:10 +0000 (10:50 -0700)]
net: add skb_copy_and_crc32c_datagram_iter()
Since skb_copy_and_hash_datagram_iter() is used only with CRC32C, the
crypto_ahash abstraction provides no value. Add
skb_copy_and_crc32c_datagram_iter() which just calls crc32c() directly.
This is faster and simpler. It also doesn't have the weird dependency
issue where skb_copy_and_hash_datagram_iter() depends on
CONFIG_CRYPTO_HASH=y without that being expressed explicitly in the
kconfig (presumably because it was too heavyweight for NET to select).
The new function is conditional on the hidden boolean symbol NET_CRC32C,
which selects CRC32. So it gets compiled only when something that
actually needs CRC32C packet checksums is enabled, it has no implicit
dependency, and it doesn't depend on the heavyweight crypto layer.
Eric Biggers [Mon, 19 May 2025 17:50:09 +0000 (10:50 -0700)]
lib/crc32: remove unused support for CRC32C combination
crc32c_combine() and crc32c_shift() are no longer used (except by the
KUnit test that tests them), and their current implementation is very
slow. Remove them.
Eric Biggers [Mon, 19 May 2025 17:50:08 +0000 (10:50 -0700)]
net: fold __skb_checksum() into skb_checksum()
Now that the only remaining caller of __skb_checksum() is
skb_checksum(), fold __skb_checksum() into skb_checksum(). This makes
struct skb_checksum_ops unnecessary, so remove that too and simply do
the "regular" net checksum. It also makes the wrapper functions
csum_partial_ext() and csum_block_add_ext() unnecessary, so remove those
too and just use the underlying functions.
Eric Biggers [Mon, 19 May 2025 17:50:07 +0000 (10:50 -0700)]
sctp: use skb_crc32c() instead of __skb_checksum()
Make sctp_compute_cksum() just use the new function skb_crc32c(),
instead of calling __skb_checksum() with a skb_checksum_ops struct that
does CRC32C. This is faster and simpler.
Eric Biggers [Mon, 19 May 2025 17:50:06 +0000 (10:50 -0700)]
RDMA/siw: use skb_crc32c() instead of __skb_checksum()
Instead of calling __skb_checksum() with a skb_checksum_ops struct that
does CRC32C, just call the new function skb_crc32c(). This is faster
and simpler.
Acked-by: Leon Romanovsky <leon@kernel.org> Reviewed-by: Bernard Metzler <bmt@zurich.ibm.com> Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://patch.msgid.link/20250519175012.36581-5-ebiggers@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Biggers [Mon, 19 May 2025 17:50:05 +0000 (10:50 -0700)]
net: use skb_crc32c() in skb_crc32c_csum_help()
Instead of calling __skb_checksum() with a skb_checksum_ops struct that
does CRC32C, just call the new function skb_crc32c(). This is faster
and simpler.
Eric Biggers [Mon, 19 May 2025 17:50:04 +0000 (10:50 -0700)]
net: add skb_crc32c()
Add skb_crc32c(), which calculates the CRC32C of a sk_buff. It will
replace __skb_checksum(), which unnecessarily supports arbitrary
checksums. Compared to __skb_checksum(), skb_crc32c():
- Uses the correct type for CRC32C values (u32, not __wsum).
- Does not require the caller to provide a skb_checksum_ops struct.
- Is faster because it does not use indirect calls and does not use
the very slow crc32c_combine().
According to commit 2817a336d4d5 ("net: skb_checksum: allow custom
update/combine for walking skb") which added __skb_checksum(), the
original motivation for the abstraction layer was to avoid code
duplication for CRC32C and other checksums in the future. However:
- No additional checksums showed up after CRC32C. __skb_checksum()
is only used with the "regular" net checksum and CRC32C.
- Indirect calls are expensive. Commit 2544af0344ba ("net: avoid
indirect calls in L4 checksum calculation") worked around this
using the INDIRECT_CALL_1 macro. But that only avoided the indirect
call for the net checksum, and at the cost of an extra branch.
- The checksums use different types (__wsum and u32), causing casts
to be needed.
- It made the checksums of fragments be combined (rather than
chained) for both checksums, despite this being highly
counterproductive for CRC32C due to how slow crc32c_combine() is.
This can clearly be seen in commit 4c2f24549644 ("sctp: linearize
early if it's not GSO") which tried to work around this performance
bug. With a dedicated function for each checksum, we can instead
just use the proper strategy for each checksum.
As shown by the following tables, the new function skb_crc32c() is
faster than __skb_checksum(), with the improvement varying greatly from
5% to 2500% depending on the case. The largest improvements come from
fragmented packets, mainly due to eliminating the inefficient
crc32c_combine(). But linear packets are improved too, especially
shorter ones, mainly due to eliminating indirect calls. These
benchmarks were done on AMD Zen 5. On that CPU, Linux uses IBRS instead
of retpoline; an even greater improvement might be seen with retpoline:
Eric Biggers [Mon, 19 May 2025 17:50:03 +0000 (10:50 -0700)]
net: introduce CONFIG_NET_CRC32C
Add a hidden kconfig symbol NET_CRC32C that will group together the
functions that calculate CRC32C checksums of packets, so that these
don't have to be built into NET-enabled kernels that don't need them.
Make skb_crc32c_csum_help() (which is called only when IP_SCTP is
enabled) conditional on this symbol, and make IP_SCTP select it.
====================
tools: ynl-gen: add support for "inherited" selector and therefore TC
Add C codegen support for constructs needed by TC, namely passing
sub-message selector from a lower nest, and sub-messages with
fixed headers.
====================
Jakub Kicinski [Tue, 20 May 2025 16:19:13 +0000 (09:19 -0700)]
tools: ynl-gen: support weird sub-message formats
TC uses all possible sub-message formats:
- nested attrs
- fixed headers + nested attrs
- fixed headers
- empty
Nested attrs are already supported for rt-link. Add support
for remaining 3. The empty and fixed headers ones are fairly
trivial, we can fake a Binary or Flags type instead of a Nest.
For fixed headers + nest we need to teach nest parsing and
nest put to handle fixed headers.
Jakub Kicinski [Tue, 20 May 2025 16:19:12 +0000 (09:19 -0700)]
tools: ynl-gen: support local attrs in _multi_parse
The _multi_parse() helper calls the _attr_get() method of each attr,
but it only respects what code the helper wants to emit, not what
local variables it needs. Local variables will soon be needed,
support them.
Jakub Kicinski [Tue, 20 May 2025 16:19:11 +0000 (09:19 -0700)]
tools: ynl-gen: move fixed header info from RenderInfo to Struct
RenderInfo describes a request-response exchange. Struct describes
a parsed attribute set. For ease of parsing sub-messages with
fixed headers move fixed header info from RenderInfo to Struct.
No functional changes.
Jakub Kicinski [Tue, 20 May 2025 16:19:10 +0000 (09:19 -0700)]
tools: ynl-gen: support passing selector to a nest
In rtnetlink all submessages had the selector at the same level
of nesting as the submessage. We could refer to the relevant
attribute from the current struct. In TC, stats are one level
of nesting deeper than "kind". Teach the code-gen about structs
which need to be passed a selector by the caller for parsing.
Because structs are "topologically sorted" one pass of propagating
the selectors down is enough.
For generating netlink message we depend on the presence bits
so no selector passing needed there.
Jakub Kicinski [Tue, 20 May 2025 16:19:09 +0000 (09:19 -0700)]
netlink: specs: tc: drop the family name prefix from attrs
All attribute sets and messages are prefixed with tc-.
The C codegen also adds the family name to all structs.
We end up with names like struct tc_tc_act_attrs.
Remove the tc- prefixes to shorten the names.
This should not impact Python as the attr set names
are never exposed to user, they are only used to refer
to things internally, in the encoder / decoder.
Jakub Kicinski [Tue, 20 May 2025 16:19:07 +0000 (09:19 -0700)]
netlink: specs: tc: use tc-gact instead of tc-gen as struct name
There is a define in the uAPI header called tc_gen which expands
to the "generic" TC action fields. This helps other actions include
the base fields without having to deal with nested structs.
A couple of actions (sample, gact) do not define extra fields,
so the spec used a common tc-gen struct for both of them.
Unfortunately this struct does not exist in C. Let's use gact's
(generic act's) struct for basic actions.
Jakub Kicinski [Tue, 20 May 2025 16:19:06 +0000 (09:19 -0700)]
netlink: specs: tc: remove duplicate nests
tc-act-stats-attrs and tca-stats-attrs are almost identical.
The only difference is that the latter has sub-message decoding
for app, rather than declaring it as a binary attr.
tc-act-police-attrs and tc-police-attrs are identical but for
the TODO annotations.
====================
net: airoha: Add per-flow stats support to hw flowtable offloading
Introduce per-flow stats accounting to the flowtable hw offload in the
airoha_eth driver. Flow stats are split in the PPE and NPU modules:
- PPE: accounts for high 32bit of per-flow stats
- NPU: accounts for low 32bit of per-flow stats
Lorenzo Bianconi [Fri, 16 May 2025 08:00:01 +0000 (10:00 +0200)]
net: airoha: ppe: Disable packet keepalive
Since netfilter flowtable entries are now refreshed by flow-stats
polling, we can disable hw packet keepalive used to periodically send
packets belonging to offloaded flows to the kernel in order to refresh
flowtable entries.
Lorenzo Bianconi [Fri, 16 May 2025 08:00:00 +0000 (10:00 +0200)]
net: airoha: Add FLOW_CLS_STATS callback support
Introduce per-flow stats accounting to the flowtable hw offload in
the airoha_eth driver. Flow stats are split in the PPE and NPU modules:
- PPE: accounts for high 32bit of per-flow stats
- NPU: accounts for low 32bit of per-flow stats
FLOW_CLS_STATS can be enabled or disabled at compile time.
Lorenzo Bianconi [Fri, 16 May 2025 07:59:59 +0000 (09:59 +0200)]
net: airoha: npu: Move memory allocation in airoha_npu_send_msg() caller
Move ppe_mbox_data struct memory allocation from airoha_npu_send_msg
routine to the caller one. This is a preliminary patch to enable wlan NPU
offloading and flow counter stats support.
====================
ipv6: Follow up for RTNL-free RTM_NEWROUTE series.
Patch 1 removes rcu_read_lock() in fib6_get_table().
Patch 2 removes rtnl_is_held arg for lwtunnel_valid_encap_type(), which
was short-term fix and is no longer used.
Patch 3 fixes RCU vs GFP_KERNEL report by syzkaller.
Patch 4~7 reverts GFP_ATOMIC uses to GFP_KERNEL.
Kuniyuki Iwashima [Fri, 16 May 2025 02:27:23 +0000 (19:27 -0700)]
ipv6: Revert two per-cpu var allocation for RTM_NEWROUTE.
These two commits preallocated two per-cpu variables in
ip6_route_info_create() as fib_nh_common_init() and fib6_nh_init()
were expected to be called under RCU.
* commit d27b9c40dbd6 ("ipv6: Preallocate nhc_pcpu_rth_output in
ip6_route_info_create().")
* commit 5720a328c3e9 ("ipv6: Preallocate rt->fib6_nh->rt6i_pcpu in
ip6_route_info_create().")
Now these functions can be called without RCU and can use GFP_KERNEL.
Kuniyuki Iwashima [Fri, 16 May 2025 02:27:22 +0000 (19:27 -0700)]
ipv6: Pass gfp_flags down to ip6_route_info_create_nh().
Since commit c4837b9853e5 ("ipv6: Split ip6_route_info_create()."),
ip6_route_info_create_nh() uses GFP_ATOMIC as it was expected to be
called under RCU.
Now, we can call it without RCU and use GFP_KERNEL.
Let's pass gfp_flags to ip6_route_info_create_nh().
Commit 71c0efb6d12f ("ipv6: Factorise ip6_route_multipath_add().") split
a loop in ip6_route_multipath_add() so that we can put rcu_read_lock()
between ip6_route_info_create() and ip6_route_info_create_nh().
We no longer need to do so as ip6_route_info_create_nh() does not require
RCU now.
Kuniyuki Iwashima [Fri, 16 May 2025 02:27:20 +0000 (19:27 -0700)]
Revert "ipv6: sr: switch to GFP_ATOMIC flag to allocate memory during seg6local LWT setup"
The previous patch fixed the same issue mentioned in
commit 14a0087e7236 ("ipv6: sr: switch to GFP_ATOMIC
flag to allocate memory during seg6local LWT setup").
Kuniyuki Iwashima [Fri, 16 May 2025 02:27:19 +0000 (19:27 -0700)]
ipv6: Narrow down RCU critical section in inet6_rtm_newroute().
Commit 169fd62799e8 ("ipv6: Get rid of RTNL for SIOCADDRT and
RTM_NEWROUTE.") added rcu_read_lock() covering
ip6_route_info_create_nh() and __ip6_ins_rt() to guarantee that
nexthop and netdev will not go away.
However, as reported by syzkaller [0], ip_tun_build_state() calls
dst_cache_init() with GFP_KERNEL during the RCU critical section.
ip6_route_info_create_nh() fetches nexthop or netdev depending on
whether RTA_NH_ID is set, and struct fib6_info holds a refcount
of either of them by nexthop_get() or netdev_get_by_index().
netdev_get_by_index() looks up a dev and calls dev_hold() under RCU.
So, we need RCU only around nexthop_find_by_id() and nexthop_get()
( and a few more nexthop code).
Let's add rcu_read_lock() there and remove rcu_read_lock() in
ip6_route_add() and ip6_route_multipath_add().
Now these functions called from fib6_add() need RCU:
Kuniyuki Iwashima [Fri, 16 May 2025 02:27:18 +0000 (19:27 -0700)]
inet: Remove rtnl_is_held arg of lwtunnel_valid_encap_type(_attr)?().
Commit f130a0cc1b4f ("inet: fix lwtunnel_valid_encap_type() lock
imbalance") added the rtnl_is_held argument as a temporary fix while
I'm converting nexthop and IPv6 routing table to per-netns RTNL or RCU.
Now all callers of lwtunnel_valid_encap_type() do not hold RTNL.
====================
net: bcmgenet: 64bit stats and expose more stats in ethtool
Hi, this patchset updates the bcmgenet driver with new 64bit statistics via
ndo_get_stats64 and rtnl_link_stats64, now reports hardware discarded
packets in the rx_missed_errors stat and exposes more stats in ethtool.
====================
net: phy: fixed_phy: simplifications and improvements
This series includes two types of changes:
- All callers pass PHY_POLL, therefore remove irq argument
- constify the passed struct fixed_phy_status *status
====================
Heiner Kallweit [Sat, 17 May 2025 20:35:56 +0000 (22:35 +0200)]
net: phy: fixed_phy: remove irq argument from fixed_phy_register
All callers pass PHY_POLL, therefore remove irq argument from
fixed_phy_register().
Note: I keep the irq argument in fixed_phy_add_gpiod() for now,
for the case that somebody may want to use a GPIO interrupt in
the future, by e.g. adding a call to fwnode_irq_get() to
fixed_phy_get_gpiod().
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Link: https://patch.msgid.link/31cdb232-a5e9-4997-a285-cb9a7d208124@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Sat, 17 May 2025 20:08:10 +0000 (13:08 -0700)]
net: let lockdep compare instance locks
AFAIU always returning -1 from lockdep's compare function
basically disables checking of dependencies between given
locks. Try to be a little more precise about what guarantees
that instance locks won't deadlock.
Right now we only nest them under protection of rtnl_lock.
Mostly in unregister_netdevice_many() and dev_close_many().
Stanislav Fomichev [Fri, 16 May 2025 23:22:05 +0000 (16:22 -0700)]
selftests: net: validate team flags propagation
Cover three recent cases:
1. missing ops locking for the lowers during netdev_sync_lower_features
2. missing locking for dev_set_promiscuity (plus netdev_ops_assert_locked
with a comment on why/when it's needed)
3. rcu lock during team_change_rx_flags
Verified that each one triggers when the respective fix is reverted.
Not sure about the placement, but since it all relies on teaming,
added to the teaming directory.
One ugly bit is that I add NETIF_F_LRO to netdevsim; there is no way
to trigger netdev_sync_lower_features without it.
Sky Huang [Fri, 16 May 2025 10:23:27 +0000 (18:23 +0800)]
net: phy: mediatek: add driver for built-in 2.5G ethernet PHY on MT7988
Add support for internal 2.5Gphy on MT7988. This driver will load
necessary firmware and add appropriate time delay to make sure
that firmware works stably. The firmware loading procedure takes
about 11ms in this driver.
Petr Malat [Fri, 16 May 2025 08:17:28 +0000 (10:17 +0200)]
sctp: Do not wake readers in __sctp_write_space()
Function __sctp_write_space() doesn't set poll key, which leads to
ep_poll_callback() waking up all waiters, not only these waiting
for the socket being writable. Set the key properly using
wake_up_interruptible_poll(), which is preferred over the sync
variant, as writers are not woken up before at least half of the
queue is available. Also, TCP does the same.
ChunHao Lin [Fri, 16 May 2025 05:56:22 +0000 (13:56 +0800)]
net: phy: realtek: add RTL8127-internal PHY
RTL8127-internal PHY is RTL8261C which is a integrated 10Gbps PHY with ID
0x001cc890. It follows the code path of RTL8125/RTL8126 internal NBase-T
PHY.
Subbaraya Sundeep [Thu, 15 May 2025 17:44:08 +0000 (23:14 +0530)]
octeontx2-pf: Add tracepoint for NIX_PARSE_S
The NIX_PARSE_S structure populated by hardware in the
NIX RX CQE has parsing information for the received packet.
A tracepoint to dump the all words of NIX_PARSE_S
is helpful in debugging packet parser.
Heiner Kallweit [Thu, 15 May 2025 08:11:54 +0000 (10:11 +0200)]
net: phy: make mdio consumer / device layer a separate module
After having factored out the provider part from mdio_bus.c, we can
make the mdio consumer / device layer a separate module. This also
allows to remove Kconfig symbol MDIO_DEVICE.
The module init / exit functions from mdio_bus.c no longer have to be
called from phy_device.c. The link order defined in
drivers/net/phy/Makefile ensures that init / exit functions are called
in the right order.
====================
queue_api: reduce risk of name collision over txq
Rename local variable in macros from txq to _txq.
When macro parameter get_desc is expended it is likely to have a txq
token that refers to a different txq variable at the caller's site.
====================
Gur Stavi [Sun, 18 May 2025 10:00:54 +0000 (13:00 +0300)]
queue_api: reduce risk of name collision over txq
Rename local variable in macros from txq to _txq.
When macro parameter get_desc is expended it is likely to have a txq
token that refers to a different txq variable at the caller's site.
Jakub Kicinski [Tue, 20 May 2025 03:07:41 +0000 (20:07 -0700)]
Merge branch '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
Tony Nguyen says:
====================
idpf: add initial PTP support
Milena Olech says:
This patch series introduces support for Precision Time Protocol (PTP) to
Intel(R) Infrastructure Data Path Function (IDPF) driver. PTP feature is
supported when the PTP capability is negotiated with the Control
Plane (CP). IDPF creates a PTP clock and sets a set of supported
functions.
During the PTP initialization, IDPF requests a set of PTP capabilities
and receives a writeback from the CP with the set of supported options.
These options are:
- get time of the PTP clock
- set the time of the PTP clock
- adjust the PTP clock
- Tx timestamping
Each feature is considered to have direct access, where the operations
on PCIe BAR registers are allowed, or the mailbox access, where the
virtchnl messages are used to perform any PTP action. Mailbox access
means that PTP requests are sent to the CP through dedicated secondary
mailbox and the CP reads/writes/modifies desired resource - PTP Clock
or Tx timestamp registers.
Tx timestamp capabilities are negotiated only for vports that have
UPLINK_VPORT flag set by the CP. Capabilities provide information about
the number of available Tx timestamp latches, their indexes and size of
the Tx timestamp value. IDPF requests Tx timestamp by setting the
TSYN bit and the requested timestamp index in the context descriptor for
the PTP packets. When the completion tag for that packet is received,
IDPF schedules a worker to read the Tx timestamp value.
* '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
idpf: add support for Rx timestamping
idpf: add Tx timestamp flows
idpf: add Tx timestamp capabilities negotiation
idpf: add PTP clock configuration
idpf: add mailbox access to read PTP clock time
idpf: negotiate PTP capabilities and get PTP clock
idpf: move virtchnl structures to the header file
virtchnl: add PTP virtchnl definitions
idpf: add initial PTP support
idpf: change the method for mailbox workqueue allocation
====================
Johannes Berg [Fri, 16 May 2025 11:59:27 +0000 (13:59 +0200)]
net: netlink: reduce extack cookie size
Seems like the extack cookie hasn't found any users outside
of wireless, which always uses nl_set_extack_cookie_u64().
Thus, allocating 20 bytes for it is pointless, reduce that
to 8 bytes, and add a BUILD_BUG_ON() to ensure it's enough
(obviously it is, for a u64, but in case it changes again.)
David S. Miller [Mon, 19 May 2025 11:10:43 +0000 (12:10 +0100)]
Merge tag 'ovpn-net-next-20250515' of https://github.com/OpenVPN/ovpn-net-next
Antonio Quartulli says:
====================
ovpn: pull request for net-next: ovpn 2025-05-15
this is a new version of the previous pull request.
These time I have removed the fixes that we are still discussing,
so that we don't hold the entire series back.
There is a new fix though: it's about properly checking the return value
of skb_to_sgvec_nomark(). I spotted the issue while testing pings larger
than the iface's MTU on a TCP VPN connection.
I have added various Closes and Link tags where applicable, so
that we have references to GitHub tickets and other public discussions.
Since I have resent the PR, I have also added Andrew's Reviewed-by to
the first patch.
Please pull or let me know if something should be changed!
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Patchset highlights:
- update MAINTAINERS entry for ovpn
- extend selftest with more cases
- avoid crash in selftest in case of getaddrinfo() failure
- fix ndo_start_xmit return value on error
- set ignore_df flag for IPv6 packets
- drop useless reg_state check in keepalive worker
- retain skb's dst when entering xmit function
- fix check on skb_to_sgvec_nomark() return value
====================
vsock/test: improve sigpipe test reliability
Running the tests continuously I noticed that sometimes the sigpipe
test would fail due to a race between the control message of the test
and the vsock transport messages.
While I was at it I also improved the test by checking the errno we
expect.
Stefano Garzarella [Wed, 14 May 2025 14:19:26 +0000 (16:19 +0200)]
vsock/test: retry send() to avoid occasional failure in sigpipe test
When the other peer calls shutdown(SHUT_RD), there is a chance that
the send() call could occur before the message carrying the close
information arrives over the transport. In such cases, the send()
might still succeed. To avoid this race, let's retry the send() call
a few times, ensuring the test is more reliable.
Sleep a little before trying again to avoid flooding the other peer
and filling its receive buffer, causing false-negative.
Stefano Garzarella [Wed, 14 May 2025 14:19:25 +0000 (16:19 +0200)]
vsock/test: add timeout_usleep() to allow sleeping in timeout sections
The timeout API uses signals, so we have documented not to use sleep(),
but we can use nanosleep(2) since POSIX.1 explicitly specifies that it
does not interact with signals.
====================
tools: ynl-gen: support sub-messages and rt-link
Sub-messages are how we express "polymorphism" in YNL. Donald added
the support to specs and Python a while back, support them in C, too.
Sub-message is a nest, but the interpretation of the attribute types
within that nest depends on a value of another attribute. For example
in rt-link the "kind" attribute contains the link type (veth, bonding,
etc.) and based on that the right enum has to be applied to interpret
link-specific attributes.
The last message is probably the most interesting to look at, as it
adds a fairly advanced sample.
This patch only contains enough support for rtnetlink, we will need
a little more complexity to support TC, where sub-messages may contain
fixed headers, and where the selector may be in a different nest than
the submessage.
====================
Jakub Kicinski [Thu, 15 May 2025 23:16:50 +0000 (16:16 -0700)]
tools: ynl: add a sample for rt-link
Add a fairly complete example of rt-link usage. If run without any
arguments it simply lists the interfaces and some of their attrs.
If run with an arg it tries to create and delete a netkit device.
1 # ./tools/net/ynl/samples/rt-link 1
2 Trying to create a Netkit interface
3 Testing error message for policy being bad:
4 Kernel error: 'Provided default xmit policy not supported' (bad attribute: .linkinfo.data(netkit).policy)
5 1: lo: mtu 65536
6 2: wlp0s1: mtu 1500
7 3: enp0s13: mtu 1500
8 4: dummy0: mtu 1500 kind dummy altname one two
9 5: nk0: mtu 1500 kind netkit primary 0 policy forward
10 6: nk1: mtu 1500 kind netkit primary 1 policy blackhole
11 Trying to delete a Netkit interface (ifindex 6)
Sample creates the device first, it sets an invalid value for a netkit
attribute to trigger reverse parsing. Line 4 shows the error with the
attribute path correctly generated by YNL.
Then sample fixes the bad attribute and re-issues the request, with
NLM_F_ECHO set. This flag causes the notification to be looped back
to the initiating socket (our socket). Sample parses this notification
to save the ifindex of the created netkit.
Sample then proceeds to list the devices. Line 8 above shows a dummy
device with two alt names. Lines 9 and 10 show the netkit devices
the sample itself created.
The "primary" and "policy" attrs are from inside the netkit submsg.
The string values are auto-generated for the enums by YNL.
To clean up sample deletes the interface it created (line 11).
Reverse parsing lets YNL convert bad and missing attr pointers
from extack into a string like "missing attribute nest1.nest2.attr_name".
It's a feature that's unique to YNL C AFAIU (even the Python YNL
can't do nested reverse parsing). Add support for reverse-parsing
of sub-messages.
To simplify the logic and the code annotate the type policies
with extra metadata. Mark the selectors and the messages with
the information we need. We assume that key / selector always
precedes the sub-message while parsing (and also if there are
multiple sub-messages like in rt-link they are interleaved
selector 1 ... submsg 1 ... selector 2 .. submsg 2, not
selector 1 ... selector 2 ... submsg 1 ... submsg 2).
The rt-link sample in a subsequent changes shows reverse parsing
of sub-messages in action.
Jakub Kicinski [Thu, 15 May 2025 23:16:47 +0000 (16:16 -0700)]
tools: ynl-gen: submsg: support parsing and rendering sub-messages
Adjust parsing and rendering appropriately to make sub-messages work.
Rendering is pretty trivial, as the submsg -> netlink conversion looks
like rendering a nest in which only one attr was set. Only trick
is that we use the enum value of the sub-message rather than the nest
as the type, and effectively skip one layer of nesting. A real double
nested struct would look like this:
[SELECTOR]
[SUBMSG]
[NEST]
[MSG1-ATTR]
A submsg "is" the nest so by skipping I mean:
[SELECTOR]
[SUBMSG]
[MSG1-ATTR]
There is no extra validation in YNL if caller has set the selector
matching the submsg type (e.g. link type = "macvlan" but the nest
attrs are set to carry "veth"). Let the kernel handle that.
Parsing side is a little more specialized as we need to render and
insert a new kind of function which switches between what to parse
based on the selector. But code isn't too complicated.
Jakub Kicinski [Thu, 15 May 2025 23:16:46 +0000 (16:16 -0700)]
tools: ynl-gen: submsg: render the structs
The easiest (or perhaps only sane) way to support submessages in C
is to treat them as if they were nests. Build fake attributes to
that effect in the codegen. Render the submsg as a big nest of all
possible values.
With this in place the main missing part is to hook in the switch
which selects how to parse based on the key.