Russell King (Oracle) [Wed, 17 Sep 2025 15:12:11 +0000 (16:12 +0100)]
net: stmmac: imx: convert to use phy_interface
Checking the IMX8MP documentation, there is no requirement for a
separate mac_interface mode definition. As mac_interface and
phy_interface will be the same, use phy_interface internally rather
than mac_interface.
Also convert the error prints to use phy_modes() so that we get a
meaningful string rather than a number for the interface mode.
Russell King (Oracle) [Wed, 17 Sep 2025 15:12:06 +0000 (16:12 +0100)]
net: stmmac: use phy_interface in stmmac_check_pcs_mode()
In the majority of cases, if not all, mac_interface and phy_interface
are the same, with the exception of some drivers that I have suggested
only use phy_interface and set mac_interface to PHY_INTERFACE_MODE_NA.
The only two that currently set mac_interface to PHY_INTERFACE_MODE_NA
are dwmac-loongson and dwmac-lpc18xx, neither of which uses RGMII or
SGMII.
In order to phase out the use of mac_interface, we need to have a path
for existing drivers so they can update to only using phy_interface
without causing regressions.
Therefore, in order to keep the "pcs" code working, we need to choose
the STMMAC integrated PCS mode based on phy_interface if mac_interface
is PHY_INTERFACE_MODE_NA.
This will allow more drivers to set mac_interface to
PHY_INTERFACE_MODE_NA without risking regressions.
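The fallback described above can be sketched as a small userspace model. This is not the driver code: the enum values, struct plat_model and pcs_effective_interface() are hypothetical stand-ins for phy_interface_t, the platform data and the selection done in stmmac_check_pcs_mode().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's phy_interface_t. */
enum iface_mode {
	IFACE_MODE_NA = 0,
	IFACE_MODE_MII,
	IFACE_MODE_RGMII,
	IFACE_MODE_SGMII,
};

/* Hypothetical stand-in for the two fields in the platform data. */
struct plat_model {
	enum iface_mode mac_interface;
	enum iface_mode phy_interface;
};

/*
 * Return the interface mode used to choose the integrated PCS mode:
 * fall back to phy_interface when mac_interface is NA.
 */
static enum iface_mode pcs_effective_interface(const struct plat_model *plat)
{
	assert(plat != NULL);

	if (plat->mac_interface == IFACE_MODE_NA)
		return plat->phy_interface;

	return plat->mac_interface;
}
```

With this shape, a glue driver that sets mac_interface to NA and phy_interface to SGMII still gets SGMII as the mode that drives the PCS choice.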
Russell King (Oracle) [Wed, 17 Sep 2025 15:12:01 +0000 (16:12 +0100)]
net: stmmac: rework mac_interface and phy_interface documentation
Based on new research, it has come to light that the comment that I
added in a014c35556b9 ("net: stmmac: clarify difference between
"interface" and "phy_interface"") is not fully correct.
Update the comment to properly describe the difference between the two.
None of the DTS files in the kernel tree mention the "mac-mode"
property, which results in mac_interface and phy_interface being the
same. Also, none of the platform glue drivers set mac_interface to
anything but PHY_INTERFACE_MODE_NA. This means that for all the
platforms known to mainline, mac_interface is either the same as
phy_interface, or it is PHY_INTERFACE_MODE_NA.
Thus, updating the definition for mac_interface in stmmac.h has no
material effect on current uses known to mainline, but the change opens
the door to cleaning up all uses.
The mlx5_devlink_total_vfs_set function branches based on per_pf_support
twice. Remove the second branch as the first one exits the function when
per_pf_support is false.
Accidentally added as part of commit a4c49611cf4f ("net/mlx5: Implement
devlink total_vfs parameter").
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/linux-rdma/aMQWenzpdjhAX4fm@stanley.mountain/
Signed-off-by: Vlad Dumitrescu <vdumitrescu@nvidia.com>
Link: https://patch.msgid.link/a6142a60-1948-439a-b0ae-ff1df26a37f8@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Zahka [Thu, 18 Sep 2025 21:27:20 +0000 (14:27 -0700)]
psp: clarify checksum behavior of psp_dev_rcv()
psp_dev_rcv() decapsulates psp headers from a received frame. This
will make any csum complete computed by the device inaccurate. Rather
than attempt to patch up skb->csum in psp_dev_rcv() just make it clear
to callers what they can expect regarding checksum complete.
Jakub Kicinski [Thu, 18 Sep 2025 18:31:19 +0000 (11:31 -0700)]
net: phy: micrel: use %pe in print format
New cocci check complains:
drivers/net/phy/micrel.c:4308:6-13: WARNING: Consider using %pe to print PTR_ERR()
drivers/net/phy/micrel.c:5742:6-13: WARNING: Consider using %pe to print PTR_ERR()
====================
address miscellaneous issues with psp_sk_get_assoc_rcu()
There were a few minor issues with psp_sk_get_assoc_rcu() identified
by Eric in his review of the initial psp series. This series addresses
them.
====================
Daniel Zahka [Thu, 18 Sep 2025 15:52:04 +0000 (08:52 -0700)]
psp: don't use flags for checking sk_state
Using flags to check sk_state only makes sense to check for a subset
of states in parallel e.g. sk_fullsock(). We are not doing that
here. Compare for individual states directly.
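The difference between the two styles of state check can be illustrated with a standalone sketch. The state numbering, the STF() macro and both helper names are hypothetical; the kernel's TCP_* and TCPF_* definitions are not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical state numbering; the kernel's TCP_* values differ. */
enum sock_state {
	ST_ESTABLISHED  = 1,
	ST_TIME_WAIT    = 2,
	ST_NEW_SYN_RECV = 3,
	ST_LISTEN       = 4,
};

/* Every state must fit into one bit of an unsigned int mask. */
static_assert(ST_LISTEN < 32, "state does not fit in the flag mask");

/* Flag form: one bit per state. */
#define STF(state) (1u << (state))

/* Subset test: one mask operation covers several states in parallel. */
static bool is_timewait_or_request(enum sock_state st)
{
	return (STF(st) & (STF(ST_TIME_WAIT) | STF(ST_NEW_SYN_RECV))) != 0;
}

/* Single-state test: a direct comparison is simpler and clearer. */
static bool is_timewait(enum sock_state st)
{
	return st == ST_TIME_WAIT;
}
```

The mask form pays off only when several states are tested at once; for one state at a time, the direct comparison says the same thing with less machinery.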
Daniel Zahka [Thu, 18 Sep 2025 15:52:03 +0000 (08:52 -0700)]
psp: fix preemptive inet_twsk() cast in psp_sk_get_assoc_rcu()
It is weird to cast to a timewait_sock before checking sk_state, even
if the use is after such a check. Remove the tw local variable, and
use inet_twsk() directly in the timewait branch.
ptp_ocp: make ptp_ocp driver compatible with PTP_EXTTS_REQUEST2
Originally the ptp_ocp driver did not strictly check flags for the external
timestamper and always activated rising edge timestamping, as it's
the only supported mode. Recent changes to ptp made it incompatible with
the PTP_EXTTS_REQUEST2 ioctl. Adjust ptp_clock_info to provide the supported
mode and be compatible with the new infra.
While at it, remove the explicit check of periodic output flags from the
driver and provide supported flags for the ptp core to check.
Dan Carpenter [Thu, 18 Sep 2025 09:48:26 +0000 (12:48 +0300)]
net: ti: icssm-prueth: unwind cleanly in probe()
This error handling triggers a Smatch warning:
drivers/net/ethernet/ti/icssm/icssm_prueth.c:1574 icssm_prueth_probe()
warn: 'prueth->pru1' is an error pointer or valid
The warning is harmless because the pru_rproc_put() function has an
IS_ERR_OR_NULL() check built in. However, there is a small bug if
syscon_regmap_lookup_by_phandle() fails. In that case we should call
of_node_put() on eth0_node and eth1_node.
It's a little bit easier to re-write this code to only free things which
we know have been allocated successfully.
Fixes: 511f6c1ae093 ("net: ti: icssm-prueth: Adds ICSSM Ethernet driver")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Parvathi Pudi <parvathi@couthit.com>
Link: https://patch.msgid.link/aMvVagz8aBRxMvFn@stanley.mountain
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
net/mlx5e: Support RSS for IPSec offload
The series by Jianbo uses a new firmware feature to identify the inner
protocol of decrypted packets, adding new flow groups and steering rules
to redirect them for proper L4-based RSS. This ensures traffic is spread
across multiple CPU cores.
====================
Jianbo Liu [Thu, 18 Sep 2025 07:19:23 +0000 (10:19 +0300)]
net/mlx5e: Add flow rules for the decrypted ESP packets
The previous commit introduced two new flow groups to enable L4 RSS
for decrypted IPsec traffic. This commit implements the logic to
populate these groups with the necessary steering rules.
The rules are created dynamically whenever the first IPSec offload
rule is configured via the xfrm subsystem and the decryption tables
for RX are created. Each rule matches a specific decrypted traffic
type based on its ip version (or ethertype) and outer/inner
l4_type_ext, directing it to the appropriate L4 RSS-enabled TIR.
The lifecycle of these steering rules is tied directly to the RX
tables. They are deleted when the RX tables are destroyed.
Jianbo Liu [Thu, 18 Sep 2025 07:19:22 +0000 (10:19 +0300)]
net/mlx5e: Add flow groups for the packets decrypted by crypto offload
When using IPsec crypto offload, the hardware decrypts the packet
payload but preserves the ESP header. This prevents the standard RSS
mechanism from accessing the inner L4 (TCP/UDP) headers. As a result,
the RSS hash is calculated only on the outer L3 IP headers, causing
all traffic for a given IPsec tunnel to be directed to a single queue,
leading to poor traffic distribution.
Newer firmware introduces the ability to match on l4_type_ext, which
exposes the L4 protocol type following an ESP header. This allows the
driver to create steering rules that can identify the inner protocols
of decrypted packets.
This commit leverages this new capability to improve traffic
distribution. It adds two new flow groups to steer decrypted packets
to dedicated TIRs that were configured to perform RSS on the inner L4
headers.
These groups are inserted after the standard L4 group and before the
group that handles undecrypted ESP packets added in this series. The
first new group matches decrypted packets based on the outer IP
version (or ethertype) and l4_type_ext. The second new group matches
decrypted tunneled packets based on the inner IP version and
l4_type_ext. Eight new traffic types are also defined to support this
functionality.
Jianbo Liu [Thu, 18 Sep 2025 07:19:21 +0000 (10:19 +0300)]
net/mlx5e: Recirculate decrypted packets into TTC table
In the commit 5e466345291a ("net/mlx5e: IPsec: Add IPsec steering in
local NIC RX"), the decrypted packets are handled in RX error flow
table. There is only one rule in the table, which forwards packets to
the default ESP TIR.
This patch updates the design to allow RSS after decryption. For ESP
traffic, SPI and IP addresses are the fields selected for RSS hash,
and it's common that only one SPI is configured in RX direction, so
RSS can't work properly as all the packets are hashed to one key in
this case. To take advantage of RSS and improve performance, the
decrypted packets need to be forwarded back to TTC table, where RSS
can work based on the decrypted packet types.
Jianbo Liu [Thu, 18 Sep 2025 07:19:20 +0000 (10:19 +0300)]
net/mlx5: Change TTC rules to match on undecrypted ESP packets
The TTC (Traffic Type Classifier) table classifies the traffic and
steers packets to TIRs, where RSS works based on the hash calculated
from the selected packet fields. For AH/ESP packets, SPI and IP
addresses are the fields used to calculate the hash value for RSS. So,
it's hard to distribute packets to different receiving queues as there
is usually only one SPI in that direction.
IPSec hardware offloads, namely crypto offload and full (packet) offload,
were introduced later. For crypto offload, the hardware does encryption,
decryption and authentication, and the kernel does the rest. The kernel
always sends/receives formatted ESP packets with plaintext data instead
of ciphertext; all other fields are unmodified. For full
offload, the hardware takes care of almost everything, and the kernel just
sends/receives packets without any IPSec headers.
Currently, all packets with ESP protocols are forwarded to IPSec
offload tables if IPSec rules are configured. In a downstream patch,
the decrypted packets will be recirculated to TTC table, in order to
use RSS, which does the hash on L4 fields after IPSec headers are
stripped by full offload. So those packets handled by crypto offload
must be filtered out, as they still have the ESP headers but clearly do
not need to be decrypted again. To do that, ipsec_next_header is added
for the packet matching, as it is valid only after passing through
IPSec decryption.
Fix typo in PPE_IP_PROTO_CHK_IPV4_MASK and PPE_IP_PROTO_CHK_IPV6_MASK
register mask definitions. This is not a real problem since this
register is not actually used in the current codebase.
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ChunHao Lin [Thu, 18 Sep 2025 02:34:25 +0000 (10:34 +0800)]
r8169: set EEE speed down ratio to 1
EEE speed down means slowing down the MAC MCU clock. It is not part of
the spec; it is a Realtek-specific power saving feature. But enabling it
may cause issues such as packet drops or interrupt loss. Different
hardware may have different issues.
EEE speed down ratio (mac ocp 0xe056[7:4]) is used to set the EEE speed
down rate. The larger this value is, the more power is saved. But it
actually saves less power than we expected and, as mentioned above,
it impacts compatibility. So set the ratio to 1 (mac ocp 0xe056[7:4] = 0),
which means no speed down, to improve compatibility.
mptcp: reset blackhole on success with non-loopback ifaces
When a first MPTCP connection gets successfully established after a
blackhole period, 'active_disable_times' was supposed to be reset when
this connection was done via any non-loopback interfaces.
Unfortunately, the opposite condition was checked: only reset when the
connection was established via a loopback interface. Fix this by
simply checking the opposite condition.
This is similar to what is done with TCP FastOpen, see
tcp_fastopen_active_disable_ofo_check().
This patch is a follow-up of a previous discussion linked to commit 893c49a78d9f ("mptcp: Use __sk_dst_get() and dst_dev_rcu() in
mptcp_active_enable()."), see [1].
Daniel Machon [Wed, 17 Sep 2025 11:49:43 +0000 (13:49 +0200)]
net: sparx5/lan969x: Add support for ethtool pause parameters
Implement get_pauseparam() and set_pauseparam() ethtool operations for
Sparx5 ports. This allows users to query and configure IEEE 802.3x
pause frame settings via:
ethtool -a ethX
ethtool -A ethX rx on|off tx on|off autoneg on|off
The driver delegates pause parameter handling to phylink through
phylink_ethtool_get_pauseparam() and phylink_ethtool_set_pauseparam().
The underlying configuration of pause frame generation and reception is
already implemented in the driver; this patch only wires it up to the
standard ethtool interface, making the feature accessible to userspace.
net: phy: micrel: Add Fast link failure support for lan8842
Add support for fast link failure for lan8842. When this is enabled, the
PHY will detect link down immediately (~1ms). The disadvantage is that
even small instabilities might be reported as link down.
Therefore add this feature as a tunable configuration so the user can
decide when to enable it. By default it is not enabled.
net: phy: clear link parameters on admin link down
When a PHY is halted (e.g. `ip link set dev lan2 down`), several
fields in struct phy_device may still reflect the last active
connection. This leads to ethtool showing stale values even though
the link is down.
Reset selected fields in _phy_state_machine() when transitioning
to PHY_HALTED and the link was previously up:
- speed/duplex -> UNKNOWN, but only in autoneg mode (in forced mode
these fields carry configuration, not status)
- master_slave_state -> UNKNOWN if previously supported
- mdix -> INVALID (state only, same meaning as "unknown")
- lp_advertising -> always cleared
The cleanup is skipped if the PHY is in PHY_ERROR state, so the
last values remain available for diagnostics.
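The reset rules listed above can be modeled in a few lines of userspace C. This is a sketch under assumptions, not the phylib code: struct phy_model, the M_* constants and phy_model_halt() are hypothetical, and treating a PHY already in the error state as "skip everything" is a simplification of the state machine.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's constants; values are arbitrary. */
enum { M_SPEED_UNKNOWN = -1, M_DUPLEX_UNKNOWN = -1 };
enum m_state { M_RUNNING, M_HALTED, M_ERROR };
enum m_ms { M_MS_UNSUPPORTED, M_MS_UNKNOWN, M_MS_MASTER, M_MS_SLAVE };
enum m_mdix { M_MDIX_INVALID, M_MDIX_MDI, M_MDIX_MDI_X };

/* Hypothetical, reduced model of struct phy_device. */
struct phy_model {
	enum m_state state;
	bool link;			/* link was up before the halt */
	bool autoneg;			/* false means forced mode */
	int speed;
	int duplex;
	enum m_ms master_slave_state;
	enum m_mdix mdix;
	unsigned long lp_advertising;	/* link partner abilities, as a bitmask */
};

/* Model of an administrative halt that clears stale link status. */
static void phy_model_halt(struct phy_model *phy)
{
	bool was_up;

	assert(phy != NULL);

	/* Keep the last values for diagnostics when the PHY is in error. */
	if (phy->state == M_ERROR)
		return;

	was_up = phy->link;
	phy->state = M_HALTED;
	phy->link = false;
	if (!was_up)
		return;

	/* In forced mode speed/duplex are configuration, so keep them. */
	if (phy->autoneg) {
		phy->speed = M_SPEED_UNKNOWN;
		phy->duplex = M_DUPLEX_UNKNOWN;
	}
	if (phy->master_slave_state != M_MS_UNSUPPORTED)
		phy->master_slave_state = M_MS_UNKNOWN;
	phy->mdix = M_MDIX_INVALID;
	phy->lp_advertising = 0;
}
```

Keeping speed and duplex in forced mode matters because userspace reads the same fields back as the configured values.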
The CPTS module of CPSW supports hardware timestamping of PTPv1 packets. Update
the "hwtstamp_rx_filters" of the CPSW driver to enable timestamping of received
PTPv1 packets. Also update the advertised capability to include PTPv1.
Cross-merge networking fixes after downstream PR (net-6.17-rc7).
No conflicts.
Adjacent changes:
drivers/net/ethernet/mellanox/mlx5/core/en/fs.h
  9536fbe10c9d ("net/mlx5e: Add PSP steering in local NIC RX")
  7601a0a46216 ("net/mlx5e: Add a miss level for ipsec crypto offload")
- net: clear sk->sk_ino in sk_set_socket(sk, NULL), fix CRIU
Previous releases - regressions:
- bonding: set random address only when slaves already exist
- rxrpc: fix untrusted unsigned subtract
- eth:
- ice: fix Rx page leak on multi-buffer frames
- mlx5: don't return mlx5_link_info table when speed is unknown
Previous releases - always broken:
- tls: make sure to abort the stream if headers are bogus
- tcp: fix null-deref when using TCP-AO with TCP_REPAIR
- dpll: fix skipping last entry in clock quality level reporting
- eth: qed: don't collect too many protection override GRC elements,
fix memory corruption"
* tag 'net-6.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (51 commits)
octeontx2-pf: Fix use-after-free bugs in otx2_sync_tstamp()
cnic: Fix use-after-free bugs in cnic_delete_task
devlink rate: Remove unnecessary 'static' from a couple places
MAINTAINERS: update sundance entry
net: liquidio: fix overflow in octeon_init_instr_queue()
net: clear sk->sk_ino in sk_set_socket(sk, NULL)
Revert "net/mlx5e: Update and set Xon/Xoff upon port speed set"
selftests: tls: test skb copy under mem pressure and OOB
tls: make sure to abort the stream if headers are bogus
selftest: packetdrill: Add tcp_fastopen_server_reset-after-disconnect.pkt.
tcp: Clear tcp_sk(sk)->fastopen_rsk in tcp_disconnect().
octeon_ep: fix VF MAC address lifecycle handling
selftests: bonding: add vlan over bond testing
bonding: don't set oif to bond dev when getting NS target destination
net: rfkill: gpio: Fix crash due to dereferencering uninitialized pointer
net/mlx5e: Add a miss level for ipsec crypto offload
net/mlx5e: Harden uplink netdev access against device unbind
MAINTAINERS: make the DPLL entry cover drivers
doc/netlink: Fix typos in operation attributes
igc: don't fail igc_probe() on LED setup error
...
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"These are mostly Oliver's Arm changes: lock ordering fixes for the
vGIC, and reverts for a buggy attempt to avoid RCU stalls on large
VMs.
Arm:
- Invalidate nested MMUs upon freeing the PGD to avoid WARNs when
visiting from an MMU notifier
- Fixes to the TLB match process and TLB invalidation range for
managing the VCNR pseudo-TLB
- Prevent SPE from erroneously profiling guests due to UNKNOWN reset
values in PMSCR_EL1
- Fix save/restore of host MDCR_EL2 to account for eagerly
programming at vcpu_load() on VHE systems
- Correct lock ordering when dealing with VGIC LPIs, avoiding
scenarios where an xarray's spinlock was nested with a *raw*
spinlock
- Permit stage-2 read permission aborts which are possible in the
case of NV depending on the guest hypervisor's stage-2 translation
- Call raw_spin_unlock() instead of the internal spinlock API
- Fix parameter ordering when assigning VBAR_EL1
- Reverted a couple of fixes for RCU stalls when destroying a stage-2
page table.
There appears to be some nasty refcounting / UAF issues lurking in
those patches and the band-aid we tried to apply didn't hold.
s390:
- mm fixes, including userfaultfd bug fix
x86:
- Sync the vTPR from the local APIC to the VMCB even when AVIC is
active.
This fixes a bug where host updates to the vTPR, e.g. via
KVM_SET_LAPIC or emulation of a guest access, are lost and result
in interrupt delivery issues in the guest"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: SVM: Sync TPR from LAPIC into VMCB::V_TPR even if AVIC is active
Revert "KVM: arm64: Split kvm_pgtable_stage2_destroy()"
Revert "KVM: arm64: Reschedule as needed when destroying the stage-2 page-tables"
KVM: arm64: vgic: fix incorrect spinlock API usage
KVM: arm64: Remove stage 2 read fault check
KVM: arm64: Fix parameter ordering for VBAR_EL1 assignment
KVM: arm64: nv: Fix incorrect VNCR invalidation range calculation
KVM: arm64: vgic-v3: Indicate vgic_put_irq() may take LPI xarray lock
KVM: arm64: vgic-v3: Don't require IRQs be disabled for LPI xarray lock
KVM: arm64: vgic-v3: Erase LPIs from xarray outside of raw spinlocks
KVM: arm64: Spin off release helper from vgic_put_irq()
KVM: arm64: vgic-v3: Use bare refcount for VGIC LPIs
KVM: arm64: vgic: Drop stale comment on IRQ active state
KVM: arm64: VHE: Save and restore host MDCR_EL2 value correctly
KVM: arm64: Initialize PMSCR_EL1 when in VHE
KVM: arm64: nv: fix VNCR TLB ASID match logic for non-Global entries
KVM: s390: Fix FOLL_*/FAULT_FLAG_* confusion
KVM: s390: Fix incorrect usage of mmu_notifier_register()
KVM: s390: Fix access to unavailable adapter indicator pages during postcopy
KVM: arm64: Mark freed S2 MMUs as invalid
Merge tag 'platform-drivers-x86-v6.17-4' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86
Pull x86 platform driver fixes from Ilpo Järvinen:
"Fixes and new HW support:
- amd/pmc: Add MECHREVO Yilong15Pro to spurious_8042 list
- amd/pmf: Support new ACPI ID AMDI0108
- asus-wmi: Re-add extra keys to ignore_key_wlan quirk
- oxpec: Add support for AOKZOE A1X and OneXPlayer X1Pro EVA-02"
* tag 'platform-drivers-x86-v6.17-4' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86:
platform/x86: asus-wmi: Re-add extra keys to ignore_key_wlan quirk
platform/x86/amd/pmf: Support new ACPI ID AMDI0108
platform/x86: oxpec: Add support for AOKZOE A1X
platform/x86: oxpec: Add support for OneXPlayer X1Pro EVA-02
platform/x86/amd/pmc: Add MECHREVO Yilong15Pro to spurious_8042 list
Merge tag 'uml-for-6.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux
Pull UML fixes from Johannes Berg:
"A few fixes for UML, which I'd meant to send earlier but then forgot.
All of them are pretty long-standing issues that are either not really
happening (the UAF), in rarely used code (the FD buffer issue), or an
issue only for some host configurations (the executable stack):
- mark stack not executable to work on more modern systems with
selinux
* tag 'uml-for-6.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux:
um: Fix FD copy size in os_rcv_fd_msg()
um: virtio_uml: Fix use-after-free after put_device in probe
um: Don't mark stack executable
octeontx2-pf: Fix use-after-free bugs in otx2_sync_tstamp()
The original code relies on cancel_delayed_work() in otx2_ptp_destroy(),
which does not ensure that the delayed work item synctstamp_work has fully
completed if it was already running. This leads to use-after-free scenarios
where otx2_ptp is deallocated by otx2_ptp_destroy(), while synctstamp_work
remains active and attempts to dereference otx2_ptp in otx2_sync_tstamp().
Furthermore, because synctstamp_work is cyclic, the likelihood of
triggering the bug is non-negligible.
A typical race condition is illustrated below:
CPU 0 (cleanup) | CPU 1 (delayed work callback)
otx2_remove() |
otx2_ptp_destroy() | otx2_sync_tstamp()
cancel_delayed_work() |
kfree(ptp) |
| ptp = container_of(...); //UAF
| ptp-> //UAF
Replace cancel_delayed_work() with cancel_delayed_work_sync() to ensure
that the delayed work item is properly canceled before the otx2_ptp is
deallocated.
This bug was initially identified through static analysis. To reproduce
and test it, I simulated the OcteonTX2 PCI device in QEMU and introduced
artificial delays within the otx2_sync_tstamp() function to increase the
likelihood of triggering the bug.
Fixes: 2958d17a8984 ("octeontx2-pf: Add support for ptp 1-step mode on CN10K silicon")
Signed-off-by: Duoming Zhou <duoming@zju.edu.cn>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The original code uses cancel_delayed_work() in cnic_cm_stop_bnx2x_hw(),
which does not guarantee that the delayed work item 'delete_task' has
fully completed if it was already running. Additionally, because the
delayed work item is cyclic, the flush_workqueue() in cnic_cm_stop_bnx2x_hw() only
blocks and waits for work items that were already queued to the
workqueue prior to its invocation. Any work items submitted after
flush_workqueue() is called are not included in the set of tasks that the
flush operation awaits. This means that after the cyclic work items have
finished executing, a delayed work item may still exist in the workqueue.
This leads to use-after-free scenarios where the cnic_dev is deallocated
by cnic_free_dev(), while delete_task remains active and attempts to
dereference cnic_dev in cnic_delete_task().
A typical race condition is illustrated below:
CPU 0 (cleanup) | CPU 1 (delayed work callback)
cnic_netdev_event() |
cnic_stop_hw() | cnic_delete_task()
cnic_cm_stop_bnx2x_hw() | ...
cancel_delayed_work() | /* the queue_delayed_work()
flush_workqueue() | executes after flush_workqueue()*/
| queue_delayed_work()
cnic_free_dev(dev)//free | cnic_delete_task() //new instance
| dev = cp->dev; //use
Replace cancel_delayed_work() with cancel_delayed_work_sync() to ensure
that the cyclic delayed work item is properly canceled and that any
ongoing execution of the work item completes before the cnic_dev is
deallocated. Furthermore, since cancel_delayed_work_sync() uses
__flush_work(work, true) to synchronously wait for any currently
executing instance of the work item to finish, the flush_workqueue()
becomes redundant and should be removed.
This bug was identified through static analysis. To reproduce the issue
and validate the fix, I simulated the cnic PCI device in QEMU and
introduced intentional delays — such as inserting calls to ssleep()
within the cnic_delete_task() function — to increase the likelihood
of triggering the bug.
devlink rate: Remove unnecessary 'static' from a couple places
devlink_rate_node_get_by_name() and devlink_rate_nodes_destroy() have a
couple of unnecessary static variables for iterating over devlink rates.
This could lead to races/corruption/unhappiness if two concurrent
operations execute the same function.
Remove 'static' from both. It's amazing this was missed for 4+ years.
While at it, I confirmed there are no more examples of this mistake in
net/ with 1, 2 or 3 levels of indentation.
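The shape of the fix can be shown with a minimal list lookup. This is an illustrative userspace sketch: struct rate_node_model and rate_node_get_by_name() are hypothetical and do not mirror the devlink data structures.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical singly linked list node standing in for a devlink rate. */
struct rate_node_model {
	const char *name;
	struct rate_node_model *next;
};

/* Look up a node by name; returns NULL when no node matches. */
static struct rate_node_model *
rate_node_get_by_name(struct rate_node_model *head, const char *name)
{
	/*
	 * The iterator is an automatic variable, so every call gets its own.
	 * Declared 'static', a single iterator would be shared by all
	 * callers, and two concurrent lookups could overwrite each
	 * other's position in the list.
	 */
	struct rate_node_model *node;

	assert(name != NULL);

	for (node = head; node; node = node->next) {
		if (strcmp(node->name, name) == 0)
			return node;
	}
	return NULL;
}
```

Nothing about the loop changes when the storage class is dropped; only the lifetime and sharing of the iterator do.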
net: liquidio: fix overflow in octeon_init_instr_queue()
The expression `(conf->instr_type == 64) << iq_no` can overflow because
`iq_no` may be as high as 64 (`CN23XX_MAX_RINGS_PER_PF`). Casting the
operand to `u64` ensures correct 64-bit arithmetic.
Fixes: f21fb3ed364b ("Add support of Cavium Liquidio ethernet adapters")
Signed-off-by: Alexey Nepomnyashih <sdl@nppct.ru>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
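The overflow in the liquidio fix can be reproduced in isolation. The sketch below is a userspace illustration; instr64_bit() is a hypothetical helper, not a function from the driver. The result of `==` has type int, so without the cast the shift is done in 32-bit signed arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Build a per-queue bit: bit iq_no is set when the instruction size is 64.
 * The comparison yields an int, so it is widened to 64 bits before the
 * shift. Without the cast, a shift count of 31 or more would overflow
 * the 32-bit int.
 */
static uint64_t instr64_bit(unsigned int instr_type, unsigned int iq_no)
{
	/* Shifting a 64-bit value by 64 or more is undefined in C. */
	assert(iq_no < 64);

	return (uint64_t)(instr_type == 64) << iq_no;
}
```

For example, instr64_bit(64, 40) yields a value with only bit 40 set, which a plain int shift could not represent.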
Revert "net/mlx5e: Update and set Xon/Xoff upon port speed set"
This reverts commit d24341740fe48add8a227a753e68b6eedf4b385a.
It causes errors when trying to configure QoS, as well as
loss of L2 connectivity (on multi-host devices).
Reported-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/20250910170011.70528106@kernel.org
Fixes: d24341740fe4 ("net/mlx5e: Update and set Xon/Xoff upon port speed set")
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patchset introduces a new dedicated ethtool_ops callback,
.get_rx_ring_count, which enables drivers to provide the number of RX
rings directly, improving efficiency and clarity in RX ring queries and
RSS configuration.
A number of drivers implement the .get_rxnfc callback just to report the ring
count, so having a proper callback makes sense and simplifies .get_rxnfc
(in some cases removing it completely).
This has been suggested by Jakub, and follows the same idea as RXFH
driver callbacks [1].
This also ports virtio_net to this new callback. Once there is consensus
on this approach, I can start moving the drivers to this new callback.
net: virtio_net: add get_rxrings ethtool callback for RX ring queries
Replace the existing virtnet_get_rxnfc callback with a dedicated
virtnet_get_rxrings implementation to provide the number of RX rings
directly via the new ethtool_ops get_rx_ring_count pointer.
This simplifies the RX ring count retrieval and aligns virtio_net with
the new ethtool API for querying RX ring parameters.
net: ethtool: use the new helper in rss_set_prep_indir()
Refactor rss_set_prep_indir() to utilize the new
ethtool_get_rx_ring_count() helper for determining the number of RX
rings, replacing the direct use of get_rxnfc with ETHTOOL_GRXRINGS.
This ensures compatibility with both legacy and new ethtool_ops
interfaces by transparently multiplexing between them.
net: ethtool: update set_rxfh_indir to use ethtool_get_rx_ring_count helper
Modify ethtool_set_rxfh_indir() to use the new ethtool_get_rx_ring_count()
helper function for retrieving the number of RX rings instead of
directly calling get_rxnfc with ETHTOOL_GRXRINGS.
This way, we can leverage the new helper if it is available in ethtool_ops.
net: ethtool: update set_rxfh to use ethtool_get_rx_ring_count helper
Modify ethtool_set_rxfh() to use the new ethtool_get_rx_ring_count()
helper function for retrieving the number of RX rings instead of
directly calling get_rxnfc with ETHTOOL_GRXRINGS.
This way, we can leverage the new helper if it is available in ethtool_ops.
net: ethtool: add get_rx_ring_count callback to optimize RX ring queries
Add a new optional get_rx_ring_count callback in ethtool_ops to allow
drivers to provide the number of RX rings directly without going through
the full get_rxnfc flow classification interface.
Create ethtool_get_rx_ring_count() to use .get_rx_ring_count if
available, falling back to get_rxnfc() otherwise. It needs to be
non-static, given it will be called by other ethtool functions later,
such as those calling get_rxfh().
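The "use the new callback if present, otherwise fall back" pattern can be sketched as below. This is a reduced userspace model, not the ethtool core: struct ops_model, struct netdev_model, the legacy callback signature and the demo_* functions are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct netdev_model;

/* Hypothetical, reduced model of ethtool_ops with only two callbacks. */
struct ops_model {
	/* New, optional: return the RX ring count directly. */
	unsigned int (*get_rx_ring_count)(struct netdev_model *dev);
	/* Legacy: fill *rings and return 0, or return a negative errno. */
	int (*get_rx_rings_legacy)(struct netdev_model *dev,
				   unsigned int *rings);
};

struct netdev_model {
	const struct ops_model *ops;
	unsigned int nrings;
};

/* Return the RX ring count, or a negative error code. */
static int model_get_rx_ring_count(struct netdev_model *dev)
{
	unsigned int rings = 0;
	int ret;

	assert(dev != NULL && dev->ops != NULL);

	/* Prefer the dedicated callback when the driver provides it. */
	if (dev->ops->get_rx_ring_count)
		return (int)dev->ops->get_rx_ring_count(dev);

	if (!dev->ops->get_rx_rings_legacy)
		return -EOPNOTSUPP;

	ret = dev->ops->get_rx_rings_legacy(dev, &rings);
	if (ret < 0)
		return ret;

	return (int)rings;
}

/* Demo driver callbacks, one per interface style. */
static unsigned int demo_direct_count(struct netdev_model *dev)
{
	return dev->nrings;
}

static int demo_legacy_count(struct netdev_model *dev, unsigned int *rings)
{
	*rings = dev->nrings;
	return 0;
}
```

Callers see one helper regardless of which callback the driver implements, which is what lets drivers migrate one at a time.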
net: ethtool: add support for ETHTOOL_GRXRINGS ioctl
This patch adds handling for the ETHTOOL_GRXRINGS ioctl command in the
ethtool ioctl dispatcher. It introduces a new helper function
ethtool_get_rxrings() that calls the driver's get_rxnfc() callback with
appropriate parameters to retrieve the number of RX rings supported
by the device.
By explicitly handling ETHTOOL_GRXRINGS, userspace queries through
ethtool can now obtain RX ring information in a structured manner.
In this patch, ethtool_get_rxrings() is simply a copy of
ethtool_get_rxnfc().
net: ethtool: pass the num of RX rings directly to ethtool_copy_validate_indir
Modify ethtool_copy_validate_indir() and callers to validate indirection
table entries against the number of RX rings as an integer instead of
accessing rx_rings->data.
This will be useful in the future, given that struct ethtool_rxnfc might
not exist for native GRXRINGS call.
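Validating against a plain integer can be sketched as follows. This is a userspace illustration with a hypothetical helper name; it does not reproduce ethtool_copy_validate_indir(), which also copies the table from userspace.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Check that every indirection table entry names an existing RX ring.
 * The ring count is passed as a plain integer, so the caller does not
 * need a struct ethtool_rxnfc just to carry it.
 */
static int validate_indir_model(const uint32_t *indir, size_t size,
				unsigned int num_rx_rings)
{
	assert(indir != NULL || size == 0);

	for (size_t i = 0; i < size; i++) {
		if (indir[i] >= num_rx_rings)
			return -EINVAL;
	}
	return 0;
}
```

With four rings, a table of {0, 1, 2, 3} passes and any entry of 4 or above is rejected.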
Eric Dumazet [Thu, 18 Sep 2025 11:35:46 +0000 (11:35 +0000)]
psp: rename our psp_dev_destroy()
psp_dev_destroy() was already used in drivers/crypto/ccp/psp-dev.c
Use psp_dev_free() instead, to avoid a link error when
CRYPTO_DEV_SP_CCP=y
Fixes: 00c94ca2b99e ("psp: base PSP device support")
Closes: https://lore.kernel.org/netdev/CANn89i+ZdBDEV6TE=Nw5gn9ycTzWw4mZOpPuCswgwEsrgOyNnw@mail.gmail.com/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Daniel Zahka <daniel.zahka@gmail.com>
Link: https://patch.msgid.link/20250918113546.177946-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Paolo Abeni [Thu, 18 Sep 2025 11:09:44 +0000 (13:09 +0200)]
Merge branch 'bnxt_en-updates-for-net-next'
Michael Chan says:
====================
bnxt_en: Updates for net-next
This series includes some code clean-ups and optimizations. New features
include 2 new backing store memory types to collect FW logs for core
dumps, dynamic SRIOV resource allocations for RoCE, and ethtool tunable
for PFC watchdog.
v2: Drop patch #4. The patch makes the code different from the original
bnxt_hwrm_func_backing_store_cfg_v2() that allows instance_bmap to have
bits that are not contiguous. It is safer to keep the original code.
Michael Chan [Wed, 17 Sep 2025 04:08:39 +0000 (21:08 -0700)]
bnxt_en: Implement ethtool .set_tunable() for ETHTOOL_PFC_PREVENTION_TOUT
Support the setting of the tunable if it is supported by firmware.
The supported range is 0 to the maximum msec value reported by
firmware. PFC_STORM_PREVENTION_AUTO is also supported and 0 means it
is disabled.
bnxt_en: Support for RoCE resources dynamically shared within VFs.
Add support for dynamic RoCE SRIOV resource configuration. Instead of
statically dividing the RoCE resources by the number of VFs, provide
the maximum resources and let the FW dynamically distribute them to the
VFs on the fly.
Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Anantha Prabhu <anantha.prabhu@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20250917040839.1924698-8-michael.chan@broadcom.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
bnxt_en: Add fw log trace support for 5731X/5741X chips
These older chips now support the fw log traces via backing store
qcaps_v2. No other backing store memory types are supported besides
the fw trace types.
Reviewed-by: Hongguang Gao <hongguang.gao@broadcom.com>
Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Signed-off-by: Shruti Parab <shruti.parab@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20250917040839.1924698-6-michael.chan@broadcom.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Michael Chan [Wed, 17 Sep 2025 04:08:33 +0000 (21:08 -0700)]
bnxt_en: Improve bnxt_backing_store_cfg_v2()
Improve the logic that determines the last_type in this function.
The different context memory types are configured in a loop. The
last_type signals the last context memory type to be configured
which requires the ALL_DONE flag to be set for the FW.
The existing logic makes some assumptions that TIM is the last_type
when RDMA is enabled or FTQM is the last_type when only L2 is
enabled. Improve it to just search for the last_type so that we
don't need to make these assumptions that won't necessarily be true
for future devices.
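
As a rough illustration of "just search for the last_type", here is a Python model, assuming each context memory type is represented by a boolean saying whether it will be configured. find_last_type() and configure_all() are hypothetical helpers, not the driver's functions.

```python
def find_last_type(valid_types):
    """Return the index of the last type that will be configured, or None."""
    last = None
    for idx, valid in enumerate(valid_types):
        if valid:
            last = idx
    return last


def configure_all(valid_types):
    """Return (type_index, all_done) pairs in configuration order.

    Only the final configured type carries the ALL_DONE indication.
    """
    last = find_last_type(valid_types)
    return [(idx, idx == last) for idx, valid in enumerate(valid_types) if valid]
```

The point of the search is that no particular type (TIM, FTQM, ...) is hard-coded as the final one.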
Kalesh AP [Wed, 17 Sep 2025 04:08:32 +0000 (21:08 -0700)]
bnxt_en: Optimize bnxt_sriov_disable()
bnxt_sriov_disable() is invoked from 2 places:
1. When the user deletes the VFs.
2. During the unload of the PF driver instance.
Inside bnxt_sriov_disable(), the driver invokes
bnxt_restore_pf_fw_resources() which in turn causes a close/open_nic().
There is no harm doing this in the unload path, although it is inefficient
and unnecessary.
Optimize the function to make it more efficient.
Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com> Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Link: https://patch.msgid.link/20250917040839.1924698-4-michael.chan@broadcom.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Kalesh AP [Wed, 17 Sep 2025 04:08:31 +0000 (21:08 -0700)]
bnxt_en: Remove unnecessary VF check in bnxt_hwrm_nvm_req()
The driver registers the supported configuration parameters with the
devlink stack only on the PF using devlink_params_register().
Hence there is no need for a VF check inside bnxt_hwrm_nvm_req().
Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com> Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com> Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Link: https://patch.msgid.link/20250917040839.1924698-3-michael.chan@broadcom.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Kalesh AP [Wed, 17 Sep 2025 04:08:30 +0000 (21:08 -0700)]
bnxt_en: Drop redundant if block in bnxt_dl_flash_update()
The devlink stack has sanity checks and it invokes flash_update()
only if it is supported by the driver. The VF driver does not
advertise the support for flash_update in struct devlink_ops.
This makes the if condition inside bnxt_dl_flash_update() redundant.
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Link: https://patch.msgid.link/20250917040839.1924698-2-michael.chan@broadcom.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jakub Kicinski [Wed, 17 Sep 2025 00:28:13 +0000 (17:28 -0700)]
tls: make sure to abort the stream if headers are bogus
Normally we wait for the socket to buffer up the whole record
before we service it. If the socket has a tiny buffer, however,
we read out the data sooner, to prevent connection stalls.
Make sure that we abort the connection when we find out late
that the record is actually invalid. Retrying the parsing is
fine in itself but since we copy some more data each time
before we parse we can overflow the allocated skb space.
Constructing a scenario in which we're under pressure without
enough data in the socket to parse the length upfront is quite
hard. syzbot figured out a way to do this by serving us the header
in small OOB sends, and then filling in the recvbuf with a large
normal send.
Make sure that tls_rx_msg_size() aborts strp; if we reach
an invalid record there's really no way to recover.
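
The "abort instead of retry" idea can be modelled in a few lines of Python. This is only a sketch under stated assumptions: RecordParser is a hypothetical class, the 5-byte header layout follows the TLS record header, and MAX_RECORD_LEN is an illustrative bound rather than the kernel's exact limit.

```python
import struct

HEADER_LEN = 5                 # type (1) + version (2) + length (2)
MAX_RECORD_LEN = 16384 + 256   # illustrative upper bound (assumption)


class RecordParser:
    def __init__(self):
        self.buf = bytearray()
        self.aborted = False

    def feed(self, chunk: bytes) -> str:
        # Once aborted, never copy more data: retrying would only grow buf.
        if self.aborted:
            return "aborted"
        self.buf += chunk
        if len(self.buf) < HEADER_LEN:
            return "need-more"
        _type, _version, length = struct.unpack_from("!BHH", self.buf)
        if length > MAX_RECORD_LEN:
            # Bogus header found late: give up on the stream for good.
            self.aborted = True
            self.buf.clear()
            return "aborted"
        if len(self.buf) < HEADER_LEN + length:
            return "need-more"
        return "record-ready"
```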
Reported-by: Lee Jones <lee@kernel.org> Fixes: 84c61fe1a75b ("tls: rx: do not use the standard strparser") Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20250917002814.1743558-1-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
==================
add basic PSP encryption for TCP connections
This is v13 of the PSP RFC [1] posted by Jakub Kicinski one year
ago. General developments since v1 include a fork of packetdrill [2]
with support for PSP added, as well as some test cases, and an
implementation of PSP key exchange and connection upgrade [3]
integrated into the fbthrift RPC library. Both [2] and [3] have been
tested on server platforms with PSP-capable CX7 NICs. Below is the
cover letter from the original RFC:
Add support for PSP encryption of TCP connections.
PSP is a protocol out of Google:
https://github.com/google/psp/blob/main/doc/PSP_Arch_Spec.pdf
which shares some similarities with IPsec. I added some more info
in the first patch so I'll keep it short here.
The protocol can work in multiple modes including tunneling.
But I'm mostly interested in using it as TLS replacement because
of its superior offload characteristics. So this patch does three
things:
- it adds "core" PSP code
PSP is offload-centric, and requires some additional care and
feeding, so first chunk of the code exposes device info.
This part can be reused by PSP implementations in xfrm, tunneling etc.
- TCP integration TLS style
Reuse some of the existing concepts from TLS offload, such as
attaching crypto state to a socket, marking skbs as "decrypted",
egress validation. PSP does not prescribe key exchange protocols.
To use PSP as a more efficient TLS offload we intend to perform
a TLS handshake ("inline" in the same TCP connection) and negotiate
switching to PSP based on capabilities of both endpoints.
This is also why I'm not including a software implementation.
Nobody would use it in production, software TLS is faster,
it has larger crypto records.
- mlx5 implementation
That's mostly other people's work, not 100% sure those folks
consider it ready hence the RFC in the title. But it works :)
Not posted, but queued on a branch [4], are follow-up pieces:
- standard stats
- netdevsim implementation and tests
Comments we intend to defer to future series:
- we prefer to keep the version field in the tx-assoc netlink
request, because it makes parsing keys require less state early
on, but we are willing to change in the next version of this
series.
- using a static branch to wrap psp_enqueue_set_decrypted() and
other functions called from tcp.
- using INDIRECT_CALL for tls/psp in sk_validate_xmit_skb(). We
prefer to address this in a dedicated patch series, so that this
series does not need to modify the way tls_validate_xmit_skb() is
declared and stubbed out.
Links: https://patch.msgid.link/20250917000954.859376-1-daniel.zahka@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
* add-basic-psp-encryption-for-tcp-connections:
net/mlx5e: Implement PSP key_rotate operation
net/mlx5e: Add Rx data path offload
psp: provide decapsulation and receive helper for drivers
net/mlx5e: Configure PSP Rx flow steering rules
net/mlx5e: Add PSP steering in local NIC RX
net/mlx5e: Implement PSP Tx data path
psp: provide encapsulation helper for drivers
net/mlx5e: Implement PSP operations .assoc_add and .assoc_del
net/mlx5e: Support PSP offload functionality
psp: track generations of device key
net: psp: update the TCP MSS to reflect PSP packet overhead
net: psp: add socket security association code
net: tcp: allow tcp_timewait_sock to validate skbs before handing to device
net: move sk_validate_xmit_skb() to net/core/dev.c
psp: add op for rotation of device key
tcp: add datapath logic for PSP with inline key exchange
net: modify core data structures for PSP datapath support
psp: base PSP device support
psp: add documentation
Raed Salem [Wed, 17 Sep 2025 00:09:46 +0000 (17:09 -0700)]
net/mlx5e: Implement PSP key_rotate operation
Implement the .key_rotate operation which, when invoked, causes the HW to use
a new master key to derive PSP spi/key pairs in compliance with the PSP spec.
Signed-off-by: Raed Salem <raeds@nvidia.com> Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com> Link: https://patch.msgid.link/20250917000954.859376-20-daniel.zahka@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Raed Salem [Wed, 17 Sep 2025 00:09:45 +0000 (17:09 -0700)]
net/mlx5e: Add Rx data path offload
In the receive flow, inspect received packets for a PSP offload indication
using the cqe. For PSP-offloaded packets, set the SKB PSP metadata (i.e. spi,
header length and key generation number) for further processing by the stack.
Signed-off-by: Raed Salem <raeds@nvidia.com> Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com> Link: https://patch.msgid.link/20250917000954.859376-19-daniel.zahka@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Raed Salem [Wed, 17 Sep 2025 00:09:44 +0000 (17:09 -0700)]
psp: provide decapsulation and receive helper for drivers
Create psp_dev_rcv(), which drivers can call to psp decapsulate and attach
a psp_skb_ext to an skb.
psp_dev_rcv() only supports what the PSP architecture specification
refers to as "transport mode" packets, where the L3 header is either
IPv6 or IPv4.
Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Raed Salem <raeds@nvidia.com> Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Co-developed-by: Daniel Zahka <daniel.zahka@gmail.com> Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250917000954.859376-18-daniel.zahka@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Raed Salem [Wed, 17 Sep 2025 00:09:43 +0000 (17:09 -0700)]
net/mlx5e: Configure PSP Rx flow steering rules
Set the Rx PSP flow steering rule where a PSP packet is identified and
decrypted using the dedicated UDP destination port number 1000. If the packet
is decrypted, a PSP marker and syndrome are added to the metadata so SW can
use them later on in the Rx data path.
The rule is set as part of init_rx netdev profile implementation.
Signed-off-by: Raed Salem <raeds@nvidia.com> Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com> Link: https://patch.msgid.link/20250917000954.859376-17-daniel.zahka@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Raed Salem [Wed, 17 Sep 2025 00:09:42 +0000 (17:09 -0700)]
net/mlx5e: Add PSP steering in local NIC RX
Introduce decrypt FT, the RX error FT, and the default rules.
The PSP RX decrypt flow table is pointed to by the TTC
(Traffic Type Classifier) UDP steering rules.
The decrypt flow table has two flow groups. The first flow group
keeps the decrypt steering rule, which is always programmed and recognizes
PSP packets by the dedicated UDP destination port number 1000. If the
packet is decrypted, a PSP marker is set in metadata_regB[30].
The second flow group has a default rule to forward all non-offloaded
PSP packets to the TTC UDP default RSS TIR.
The RX error flow table is the destination of the decrypt steering rules in
the PSP RX decrypt flow table. It has two fixed rules. The first has a single
copy action that copies psp_syndrome to metadata_regB[23:29]. The PSP marker
and syndrome are used to filter out non-PSP packets and to return the PSP
crypto offload status in the Rx flow. The marker is used to identify such
packets in the driver so that the driver can set the SKB PSP metadata. The
destination of the RX error flow table is the TTC UDP default RSS TIR. The
second rule drops packets that failed to be decrypted (for example, when an
illegal or expired SPI is used).
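
To make the bit layout concrete, here is a Python sketch of how software could pull the marker and syndrome back out of a 32-bit metadata_regB value. The bit positions come from the text above; decode_regb() and the constant names are hypothetical and do not reflect the driver's actual helpers.

```python
PSP_MARKER_BIT = 30        # metadata_regB[30]
PSP_SYNDROME_SHIFT = 23    # metadata_regB[23:29], 7 bits wide
PSP_SYNDROME_MASK = 0x7F


def decode_regb(reg_b: int):
    """Return (marker_set, syndrome) extracted from a metadata_regB value."""
    marker = bool((reg_b >> PSP_MARKER_BIT) & 1)
    syndrome = (reg_b >> PSP_SYNDROME_SHIFT) & PSP_SYNDROME_MASK
    return marker, syndrome
```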
Signed-off-by: Raed Salem <raeds@nvidia.com> Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com> Link: https://patch.msgid.link/20250917000954.859376-16-daniel.zahka@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Raed Salem [Wed, 17 Sep 2025 00:09:41 +0000 (17:09 -0700)]
net/mlx5e: Implement PSP Tx data path
Set up PSP offload on Tx data path based on whether skb indicates that it is
intended for PSP or not. Support driver side encapsulation of the UDP
headers, PSP headers, and PSP trailer for the PSP traffic that will be
encrypted by the NIC.
Signed-off-by: Raed Salem <raeds@nvidia.com> Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com> Link: https://patch.msgid.link/20250917000954.859376-15-daniel.zahka@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Raed Salem [Wed, 17 Sep 2025 00:09:40 +0000 (17:09 -0700)]
psp: provide encapsulation helper for drivers
Create a new function psp_encapsulate(), which takes a TCP packet and
PSP encapsulates it according to the "Transport Mode Packet Format"
section of the PSP Architecture Specification.
psp_encapsulate() does not push a PSP trailer onto the skb. Both IPv6
and IPv4 are supported. Virtualization cookie is not included.
Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Raed Salem <raeds@nvidia.com> Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Co-developed-by: Daniel Zahka <daniel.zahka@gmail.com> Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250917000954.859376-14-daniel.zahka@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Raed Salem [Wed, 17 Sep 2025 00:09:39 +0000 (17:09 -0700)]
net/mlx5e: Implement PSP operations .assoc_add and .assoc_del
Implement .assoc_add and .assoc_del PSP operations used in the tx control
path. Allocate the relevant hardware resources when a new key is registered
using .assoc_add. Destroy the key when .assoc_del is called. Use an atomic
counter to keep track of the current number of keys being used by the
device.
Signed-off-by: Raed Salem <raeds@nvidia.com> Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com> Link: https://patch.msgid.link/20250917000954.859376-13-daniel.zahka@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Raed Salem [Wed, 17 Sep 2025 00:09:38 +0000 (17:09 -0700)]
net/mlx5e: Support PSP offload functionality
Add PSP offload related structs, layouts, and enumerations. Implement
.set_config and .rx_spi_alloc PSP device operations. Driver does not
need to make use of the .set_config operation. Stub .assoc_add and
.assoc_del PSP operations.
Introduce the MLX5_EN_PSP configuration option for enabling PSP offload
support on mlx5 devices.
Signed-off-by: Raed Salem <raeds@nvidia.com> Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com> Link: https://patch.msgid.link/20250917000954.859376-12-daniel.zahka@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jakub Kicinski [Wed, 17 Sep 2025 00:09:37 +0000 (17:09 -0700)]
psp: track generations of device key
There is a (somewhat theoretical in absence of multi-host support)
possibility that another entity will rotate the key and we won't
know. This may lead to accepting packets with matching SPI but
which used different crypto keys than we expected.
The PSP Architecture specification mentions that an implementation
should track device key generation when device keys are managed by the
NIC. Some PSP implementations may opt to include this key generation
state in decryption metadata each time a device key is used to decrypt
a packet. If that is the case, that key generation counter can also be
used when policy checking a decrypted skb against a psp_assoc. This is
an optional feature that is not explicitly part of the PSP spec, but
can provide additional security in the case where an attacker may have
the ability to force key rotations faster than rekeying can occur.
Since we're tracking "key generations" more explicitly now,
maintain different lists for associations from different generations.
This way we can catch stale associations (the user space should
listen to rotation notifications and change the keys).
Drivers can "opt out" of generation tracking by setting
the generation value to 0.
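
A toy Python model of per-generation association lists may help here. It is a sketch only: PspDeviceModel and generation_matches() are hypothetical, and the model assumes associations from the immediately previous generation stay usable until the next rotation, which is how the rotation patch in this series describes the cut-off.

```python
from collections import defaultdict


class PspDeviceModel:
    def __init__(self):
        self.generation = 1
        self.assocs = defaultdict(list)   # generation -> association names

    def add_assoc(self, name):
        self.assocs[self.generation].append(name)

    def rotate(self):
        """Rotate the device key; return associations that became stale."""
        self.generation += 1
        stale = []
        for gen in [g for g in self.assocs if g < self.generation - 1]:
            stale.extend(self.assocs.pop(gen))
        return stale


def generation_matches(assoc_gen: int, skb_gen: int) -> bool:
    """Policy check; a reported generation of 0 means the driver opted out."""
    return skb_gen == 0 or assoc_gen == skb_gen
```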
Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250917000954.859376-11-daniel.zahka@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jakub Kicinski [Wed, 17 Sep 2025 00:09:36 +0000 (17:09 -0700)]
net: psp: update the TCP MSS to reflect PSP packet overhead
PSP eats 40B of header space. Adjust MSS appropriately.
We can either modify tcp_mtu_to_mss() / tcp_mss_to_mtu()
or reuse icsk_ext_hdr_len. The former option is more TCP
specific and has runtime overhead. The latter is a bit
of a hack as PSP is not an ext_hdr. If one squints hard
enough, UDP encap is just a more practical version of
IPv6 exthdr, so go with the latter. Happy to change.
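
The effect on MSS is plain arithmetic. The Python sketch below is a simplified model, not the kernel's tcp_mtu_to_mss(): mtu_to_mss() is a hypothetical helper that ignores TCP options, and the 40-byte figure is taken from the text above.

```python
PSP_OVERHEAD = 40   # bytes of header space, per the commit message


def mtu_to_mss(mtu, net_hdr_len, tcp_hdr_len=20, ext_hdr_len=0):
    """Simplified MSS: MTU minus network, TCP and extension header bytes."""
    return mtu - net_hdr_len - tcp_hdr_len - ext_hdr_len


# IPv6 (40-byte header) on a 1500-byte MTU, without and with PSP.
plain_mss = mtu_to_mss(1500, 40)
psp_mss = mtu_to_mss(1500, 40, ext_hdr_len=PSP_OVERHEAD)
```

Accounting for PSP through the ext_hdr_len slot lowers the MSS by exactly the PSP overhead without touching the conversion formula itself.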
Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250917000954.859376-10-daniel.zahka@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jakub Kicinski [Wed, 17 Sep 2025 00:09:35 +0000 (17:09 -0700)]
net: psp: add socket security association code
Add the ability to install PSP Rx and Tx crypto keys on TCP
connections. Netlink ops are provided for both operations.
Rx side combines allocating a new Rx key and installing it
on the socket. Theoretically these are separate actions,
but in practice they will always be used one after the
other. We can add distinct "alloc" and "install" ops later.
Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Co-developed-by: Daniel Zahka <daniel.zahka@gmail.com> Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250917000954.859376-9-daniel.zahka@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Daniel Zahka [Wed, 17 Sep 2025 00:09:34 +0000 (17:09 -0700)]
net: tcp: allow tcp_timewait_sock to validate skbs before handing to device
Provide a callback to validate skb's originating from tcp timewait
socks before passing to the device layer. Full socks have a
sk_validate_xmit_skb member for checking that a device is capable of
performing offloads required for transmitting an skb. With psp, tcp
timewait socks will inherit the crypto state from their corresponding
full socks. Any ACKs or RSTs that originate from a tcp timewait sock
carrying psp state should be psp encapsulated.
Daniel Zahka [Wed, 17 Sep 2025 00:09:33 +0000 (17:09 -0700)]
net: move sk_validate_xmit_skb() to net/core/dev.c
Move definition of sk_validate_xmit_skb() from net/core/sock.c to
net/core/dev.c.
This change is in preparation of the next patch, where
sk_validate_xmit_skb() will need to cast sk to a tcp_timewait_sock *,
and access member fields. Including linux/tcp.h from linux/sock.h
creates a circular dependency, and dev.c is the only current call site
of this function.
Jakub Kicinski [Wed, 17 Sep 2025 00:09:32 +0000 (17:09 -0700)]
psp: add op for rotation of device key
Rotating the device key is a key part of the PSP protocol design.
Some external daemon needs to do it once a day, or so.
Add a netlink op to perform this operation.
Add a notification group for informing users that key has been
rotated and they should rekey (next rotation will cut them off).
Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250917000954.859376-6-daniel.zahka@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jakub Kicinski [Wed, 17 Sep 2025 00:09:31 +0000 (17:09 -0700)]
tcp: add datapath logic for PSP with inline key exchange
Add validation points and state propagation to support PSP key
exchange inline, on TCP connections. The expectation is that
application will use some well established mechanism like TLS
handshake to establish a secure channel over the connection and
if both endpoints are PSP-capable - exchange and install PSP keys.
Because the connection can exist in PSP-unsecured and PSP-secured
states, we need to make sure that there are no race conditions or
retransmission leaks.
On Tx - mark packets with the skb->decrypted bit when a PSP key is
installed at enqueue time. Drivers should only encrypt packets with
this bit set. This prevents retransmissions from getting encrypted when
the original transmission was not. Similarly to TLS, we'll use
sk->sk_validate_xmit_skb to make sure PSP skbs can't "escape"
via a PSP-unaware device without being encrypted.
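
The Tx rule can be illustrated with a small Python model. Everything here (Skb, SockModel, driver_should_encrypt(), validate_xmit()) is a hypothetical stand-in for the kernel structures, meant only to show why the decision is latched at enqueue time.

```python
from dataclasses import dataclass


@dataclass
class Skb:
    payload: bytes
    decrypted: bool = False   # models the skb->decrypted bit


class SockModel:
    def __init__(self):
        self.psp_key_installed = False
        self.write_queue = []

    def enqueue(self, payload: bytes) -> Skb:
        # The bit is set once, at enqueue time, and never changes afterwards.
        skb = Skb(payload, decrypted=self.psp_key_installed)
        self.write_queue.append(skb)
        return skb


def driver_should_encrypt(skb: Skb) -> bool:
    return skb.decrypted


def validate_xmit(skb: Skb, dev_supports_psp: bool):
    """Return the skb, or None when a PSP skb would leave unencrypted."""
    if skb.decrypted and not dev_supports_psp:
        return None
    return skb
```

A packet queued before the key was installed keeps decrypted == False, so a later retransmission of it still goes out as cleartext, matching the original transmission.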
On Rx - validation is done under socket lock. This moves the validation
point later than xfrm, for example. Please see the documentation patch
for more details on the flow of securing a connection, but for
the purpose of this patch what's important is that we want to
enforce the invariant that once connection is secured any skb
in the receive queue has been encrypted with PSP.
Add GRO and coalescing checks to prevent PSP authenticated data from
being combined with cleartext data, or data with non-matching PSP
state. On Rx, check skb's with psp_skb_coalesce_diff() at points
before psp_sk_rx_policy_check(). After skb's are policy checked and on
the socket receive queue, skb_cmp_decrypted() is sufficient for
checking for coalescable PSP state. On Tx, tcp_write_collapse_fence()
should be called when transitioning a socket into PSP Tx state to
prevent data sent as cleartext from being coalesced with PSP
encapsulated data.
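
A minimal Python sketch of the coalescing rule follows. It assumes a packet's PSP state can be modelled as None for cleartext or an (spi, generation) tuple; psp_coalesce_diff() and coalesce() are hypothetical names that only imitate the role of the kernel helpers mentioned above.

```python
def psp_coalesce_diff(state_a, state_b) -> bool:
    """True when two packets must NOT be merged (PSP state differs)."""
    return state_a != state_b


def coalesce(packets):
    """Merge adjacent (psp_state, payload) packets with identical PSP state."""
    merged = []
    for state, payload in packets:
        if merged and not psp_coalesce_diff(merged[-1][0], state):
            merged[-1] = (state, merged[-1][1] + payload)
        else:
            merged.append((state, payload))
    return merged
```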
This change only adds the validation points, for ease of review.
A subsequent change will add the ability to install keys and flesh
out the enforcement logic.
Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Co-developed-by: Daniel Zahka <daniel.zahka@gmail.com> Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250917000954.859376-5-daniel.zahka@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jakub Kicinski [Wed, 17 Sep 2025 00:09:30 +0000 (17:09 -0700)]
net: modify core data structures for PSP datapath support
Add pointers to psp data structures to core networking structs,
and an SKB extension to carry the PSP information from the drivers
to the socket layer.
Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Co-developed-by: Daniel Zahka <daniel.zahka@gmail.com> Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250917000954.859376-4-daniel.zahka@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jakub Kicinski [Wed, 17 Sep 2025 00:09:29 +0000 (17:09 -0700)]
psp: base PSP device support
Add a netlink family for PSP and allow drivers to register support.
The "PSP device" is its own object. This allows us to perform more
flexible reference counting / lifetime control than if PSP information
was part of net_device. In the future we should also be able
to "delegate" PSP access to software devices, such as *vlan, veth
or netkit more easily.
Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250917000954.859376-3-daniel.zahka@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jakub Kicinski [Wed, 17 Sep 2025 00:09:28 +0000 (17:09 -0700)]
psp: add documentation
Add documentation of things which belong in the docs rather
than commit messages.
Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250917000954.859376-2-daniel.zahka@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jakub Kicinski [Tue, 16 Sep 2025 23:14:20 +0000 (16:14 -0700)]
eth: fbnic: add OTP health reporter
OTP memory ("fuses") are used for secure boot and anti-rollback
protection. The OTP memory is ECC protected. Check for its health
periodically to notice when the chip is starting to go bad.
Jakub Kicinski [Tue, 16 Sep 2025 23:14:16 +0000 (16:14 -0700)]
eth: fbnic: support allocating FW completions with extra space
Support allocating extra space after the FW completion.
This makes it easy to pass extra variable size buffer space
to FW response handlers without worrying about synchronization
(completion itself is already refcounted).
Jakub Kicinski [Tue, 16 Sep 2025 23:14:15 +0000 (16:14 -0700)]
eth: fbnic: reprogram TCAMs after FW crash
FW may mess with the TCAM after it boots, to try to restore
the traffic flow to the BMC (it may not be aware that the host
is already up). Make sure that we reprogram the TCAMs after
detecting a crash.
Jakub Kicinski [Tue, 16 Sep 2025 23:14:14 +0000 (16:14 -0700)]
eth: fbnic: factor out clearing the action TCAM
We'll want to wipe the driver TCAM state after FW crash, to force
a re-programming. Factor out the clearing logic. Remove the
micro-optimization that skips clearing the BMC entry twice; clearing it
twice doesn't hurt.
Jakub Kicinski [Tue, 16 Sep 2025 23:14:13 +0000 (16:14 -0700)]
eth: fbnic: use fw uptime to detect fw crashes
Currently we only detect FW crashes when it stops responding
to heartbeat messages. FW has a watchdog which will reset it
in case of crashes. Use FW uptime sent in the ownership and
heartbeat messages to detect that the watchdog has fired
(uptime went down).
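
The detection rule is simply "uptime must never decrease". A Python sketch, with FwUptimeTracker as a hypothetical name for illustration:

```python
class FwUptimeTracker:
    """Flag a FW restart when a reported uptime is lower than the last one."""

    def __init__(self):
        self.last_uptime = None

    def update(self, uptime: int) -> bool:
        crashed = self.last_uptime is not None and uptime < self.last_uptime
        self.last_uptime = uptime
        return crashed
```

Feeding it the uptime from each ownership or heartbeat message returns True exactly on the first message after the watchdog has reset the firmware.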
====================
udp: increase RX performance under stress
This series is the result of careful analysis of the UDP stack,
to optimize the receive side, especially when one or several
UDP sockets are under a DDOS attack.
I have measured a 47 % increase of throughput when using
IPv6 UDP packets with 120 bytes of payload, under DDOS.
16 cpus are receiving traffic targeting a single socket.
Even after adding NUMA aware drop counters, we were suffering
from false sharing between packet producers and the consumer.
1) First four patches shrink struct ipv6_pinfo and
reorganize its fields to get a more efficient TX path.
They should also benefit TCP, by removing one cache line miss.
2) Patches 5 & 6 change how sk->sk_rmem_alloc is read and updated.
They reduce spinlock contention on the busylock.
3) Patches 7 & 8 change the ordering of sk_backlog (including
sk_rmem_alloc) sk_receive_queue and sk_drop_counters for
better data locality.
4) Patch 9 removes the hashed array of spinlocks in favor of
a per-udp-socket one.
5) Final patch adopts skb_attempt_defer_free(), after TCP got
good results with it.
====================
Eric Dumazet [Tue, 16 Sep 2025 16:09:51 +0000 (16:09 +0000)]
udp: use skb_attempt_defer_free()
Move skb freeing from udp recvmsg() path to the cpu
which allocated/received it, as TCP did in linux-5.17.
This increases max throughput by 20% to 30%, depending
on number of BH producers.
Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Willem de Bruijn <willemb@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250916160951.541279-11-edumazet@google.com Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>