The pcs-rzn1-miic driver makes use of ARRAY_SIZE(), BIT() and GENMASK()
macros but does not explicitly include the headers where they are
defined. Add the missing <linux/array_size.h> and <linux/bits.h>
includes.
dt-bindings: net: pcs: renesas,rzn1-miic: Add RZ/T2H and RZ/N2H support
Add device tree binding support for RZ/T2H and RZ/N2H SoCs to the
existing RZ/N1 MIIC converter binding. These SoCs share similar MIIC
functionality but have architectural differences that require schema
updates.
Add new compatible strings "renesas,r9a09g077-miic" for RZ/T2H and
"renesas,r9a09g087-miic" for RZ/N2H, with the latter falling back to
the RZ/T2H variant. The new SoCs require reset support with two reset
lines for converter register reset and converter reset, which are not
present on RZ/N1.
Update port configurations to accommodate the different architectures.
RZ/N1 supports 5 ports numbered 1-5 with complex input mappings
covering indices 0-13, while RZ/T2H and RZ/N2H support 4 ports
numbered 0-3 with simplified input mappings covering indices 0-8.
Extend the switch port configuration property to support value 0 for
the new SoCs.
Add a new dt-bindings header file with media interface connection
matrix constants that map GMAC, ESC, and ETHSW ports to numeric
identifiers for use with RZ/T2H and RZ/N2H device trees.
Update DT schema validation to ensure proper port numbering and input
mappings per SoC variant.
tcp: reorganize tcp_sock_write_txrx group for variables later
Use the first 3-byte hole at the beginning of the tcp_sock_write_txrx
group for 'noneagle'/'rate_app_limited' to fill in the existing hole
in later patches. Therefore, the group size of tcp_sock_write_txrx is
reduced from 92 + 4 to 91 + 4. In addition, the group size of
tcp_sock_write_rx is changed to 96 to fit in the pahole outcome.
Below are the trimmed pahole outcomes before and after this patch:
Ilpo Järvinen [Thu, 11 Sep 2025 11:06:30 +0000 (13:06 +0200)]
tcp: fast path functions later
The following patch will use tcp_ecn_mode_accecn(),
TCP_ACCECN_CEP_INIT_OFFSET, TCP_ACCECN_CEP_ACE_MASK in
__tcp_fast_path_on() to make new flag for AccECN.
net: phy: clear EEE runtime state in PHY_HALTED/PHY_ERROR
Clear EEE runtime flags when the PHY transitions to HALTED or ERROR
and the state machine drops the link. This avoids stale EEE state being
reported via ethtool after the PHY is stopped or hits an error.
This change intentionally only clears software runtime flags and avoids
MDIO accesses in HALTED/ERROR. A follow-up patch will address other
link state variables.
Suggested-by: Russell King (Oracle) <linux@armlinux.org.uk> Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://patch.msgid.link/20250912132000.1598234-1-o.rempel@pengutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Victor Nogueira [Fri, 12 Sep 2025 15:46:16 +0000 (12:46 -0300)]
selftests/tc-testing: Adapt tc police action tests for Gb rounding changes
For the tc police action, iproute2 rounds up mtu and burst sizes to a
higher order representation. For example, if the user specifies the default
mtu for a police action instance (4294967295 bytes), iproute2 will output
it as 4096Mb when this action instance is dumped. After Jay's changes [1],
iproute2 will round up to Gb, so 4096Mb becomes 4Gb. With that in mind,
fix police's tc test output so that it works both with the current
iproute2 version and Jay's.
Signed-off-by: Victor Nogueira <victor@mojatatu.com> Reviewed-by: Cong Wang <cwang@multikernel.io> Reviewed-by: Jay Vosburgh <jay.vosburgh@canonical.com> Link: https://patch.msgid.link/20250912154616.67489-1-victor@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
dpll: zl3073x: Add support for devlink flash
Add functionality for accessing device hardware registers, loading
firmware bundles, and accessing the device's internal flash memory,
and use it to implement the devlink flash functionality.
====================
Ivan Vecera [Tue, 9 Sep 2025 09:15:32 +0000 (11:15 +0200)]
dpll: zl3073x: Implement devlink flash callback
Use the introduced functionality to read firmware files and flash their
contents into the device's internal flash memory to implement the devlink
flash update callback.
Ivan Vecera [Tue, 9 Sep 2025 09:15:30 +0000 (11:15 +0200)]
dpll: zl3073x: Add firmware loading functionality
Add functionality for loading firmware files provided by the vendor
to be flashed into the device's internal flash memory. The firmware
consists of several components, such as the firmware executable itself,
chip-specific customizations, and configuration files.
The firmware file contains at least a flash utility, which is executed
on the device side, and one or more flashable components. Each component
has its own specific properties, such as the address where it should be
loaded during flashing, one or more destination flash pages, and
the flashing method that should be used.
Ivan Vecera [Tue, 9 Sep 2025 09:15:29 +0000 (11:15 +0200)]
dpll: zl3073x: Add low-level flash functions
To implement the devlink device flash functionality, the driver needs
to access both the device memory and the internal flash memory. The flash
memory is accessed using a device-specific program (called the flash
utility). This flash utility must be downloaded by the driver into
the device memory and then executed by the device CPU. Once running,
the flash utility provides a flash API to access the flash memory itself.
During this operation, the normal functionality provided by the standard
firmware is not available. Therefore, the driver must ensure that DPLL
callbacks and monitoring functions are not executed during the flash
operation.
Add all necessary functions for downloading the utility to device memory,
entering and exiting flash mode, and performing flash operations.
Ivan Vecera [Tue, 9 Sep 2025 09:15:28 +0000 (11:15 +0200)]
dpll: zl3073x: Add functions to access hardware registers
Besides the device host registers that are directly accessible, there
are also hardware registers that can be accessed indirectly via specific
host registers.
Add register definitions for accessing hardware registers and provide
helper functions for working with them. Additionally, extend the number
of pages in the regmap configuration to 256, as the host registers used
for accessing hardware registers are located on page 255.
Add support for hardware PPS (Pulse Per Second) output to the
AMD XGBE driver. The implementation enables flexible periodic
output mode, exposing it via the PTP per_out interface.
The driver supports configuring PPS output using the standard
PTP subsystem, allowing precise periodic signal generation for
time synchronization applications.
The feature has been verified using the testptp tool and
oscilloscope.
Shenwei Wang [Wed, 10 Sep 2025 18:52:11 +0000 (13:52 -0500)]
net: fec: enable the Jumbo frame support for i.MX8QM
Certain i.MX SoCs, such as i.MX8QM and i.MX8QXP, feature enhanced
FEC hardware that supports Ethernet Jumbo frames with packet sizes
up to 16K bytes.
When Jumbo frames are supported, the TX FIFO may not be large enough
to hold an entire frame. To handle this, the FIFO is configured to
operate in cut-through mode when the frame size exceeds
(PKT_MAXBUF_SIZE - ETH_HLEN - ETH_FCS_LEN), which allows transmission
to begin once the FIFO reaches a certain threshold.
Shenwei Wang [Wed, 10 Sep 2025 18:52:10 +0000 (13:52 -0500)]
net: fec: add change_mtu to support dynamic buffer allocation
Add a fec_change_mtu() handler to recalculate the pagepool_order based on
the new_mtu value. And update the rx_frame_size accordingly when
pagepool_order changes.
MTU changes are only allowed when the adater is not running.
Shenwei Wang [Wed, 10 Sep 2025 18:52:09 +0000 (13:52 -0500)]
net: fec: add rx_frame_size to support configurable RX length
Add a new rx_frame_size member in the fec_enet_private structure to
track the RX buffer size. On the Jumbo frame enabled system, the value
will be recalculated whenever the MTU is updated, allowing the driver
to allocate RX buffer efficiently.
Configure the TRUNC_FL (Frame Truncation Length) based on the smaller
value between max_buf_size and the rx_frame_size to maintain consistent
RX error behavior, regardless of whether Jumbo frames are enabled.
Shenwei Wang [Wed, 10 Sep 2025 18:52:08 +0000 (13:52 -0500)]
net: fec: update MAX_FL based on the current MTU
Configure the MAX_FL (Maximum Frame Length) register according to the
current MTU value, which ensures that packets exceeding the configured MTU
trigger an RX error.
Shenwei Wang [Wed, 10 Sep 2025 18:52:07 +0000 (13:52 -0500)]
net: fec: add pagepool_order to support variable page size
Add a new pagepool_order member in the fec_enet_private struct
to allow dynamic configuration of page size for an instance. This
change clears the hardcoded page size assumptions.
Shenwei Wang [Wed, 10 Sep 2025 18:52:06 +0000 (13:52 -0500)]
net: fec: use a member variable for maximum buffer size
Refactor code to support Jumbo frame functionality by adding a member
variable in the fec_enet_private structure to store PKT_MAXBUF_SIZE.
Remove the OPT_FRAME_SIZE and define a new macro OPT_ARCH_HAS_MAX_FL to
indicate architectures that support configurable maximum frame length.
And update the MAX_FL register value to max_buf_size when
OPT_ARCH_HAS_MAX_FL is defined as 1.
Jakub Kicinski [Sun, 14 Sep 2025 20:13:12 +0000 (13:13 -0700)]
Merge branch '10GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
Tony Nguyen says:
====================
Fwlog support in ixgbe
Michal Swiatkowski says:
Firmware logging is a feature that allow user to dump firmware log using
debugfs interface. It is supported on device that can handle specific
firmware ops. It is true for ice and ixgbe driver.
Prepare code from ice driver to be moved to the library code and reuse
it in ixgbe driver.
* '10GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
ixgbe: fwlog support for e610
ice, libie: move fwlog code to libie
ice: reregister fwlog after driver reinit
ice: prepare for moving file to libie
ice: move debugfs code to fwlog
libie, ice: move fwlog admin queue to libie
ice: drop driver specific structure from fwlog code
ice: check for PF number outside the fwlog code
ice: move out debugfs init from fwlog
ice: allow calling custom send function in fwlog
ice: add pdev into fwlog structure and use it for logging
ice: introduce ice_fwlog structure
ice: drop ice_pf_fwlog_update_module()
ice: move get_fwlog_data() to fwlog file
ice: make fwlog functions static
====================
Jakub Kicinski [Sun, 14 Sep 2025 20:00:56 +0000 (13:00 -0700)]
Merge branch 'pru-icssm-ethernet-driver'
Parvathi Pudi says:
====================
PRU-ICSSM Ethernet Driver
The Programmable Real-Time Unit Industrial Communication Sub-system (PRU-ICSS)
is available on the TI SOCs in two flavors: Gigabit ICSS (ICSSG) and the older
Megabit ICSS (ICSSM).
Support for ICSSG Dual-EMAC mode has already been mainlined [1] and the
fundamental components/drivers such as PRUSS driver, Remoteproc driver,
PRU-ICSS INTC, and PRU-ICSS IEP drivers are already available in the mainline
Linux kernel. The current set of patch series builds on top of these components
and introduces changes to support the Dual-EMAC using ICSSM on the TI AM57xx,
AM437x and AM335x devices.
AM335x, AM437x and AM57xx devices may have either one or two PRU-ICSS instances
with two 32-bit RISC PRU cores. Each PRU core has (a) dedicated Ethernet interface
(MII, MDIO), timers, capture modules, and serial communication interfaces, and
(b) dedicated data and instruction RAM as well as shared RAM for inter PRU
communication within the PRU-ICSS.
These patches add support for basic RX and TX functionality over PRU Ethernet
ports in Dual-EMAC mode.
Further, note that these are the initial set of patches for a single instance of
PRU-ICSS Ethernet. Additional features such as Ethtool support, VLAN Filtering,
Multicast Filtering, Promiscuous mode, Storm prevention, Interrupt coalescing,
Linux PTP (ptp4l) Ordinary clock and Switch mode support for AM335x, AM437x
and AM57x along with support for a second instance of PRU-ICSS on AM57x
will be posted subsequently.
The patches presented in this series have gone through the patch verification
tools and no warnings or errors are reported. Sample test logs obtained from AM33x,
AM43x and AM57x verifying the functionality on Linux next kernel are available here:
Roger Quadros [Fri, 12 Sep 2025 11:53:28 +0000 (17:23 +0530)]
net: ti: icssm-prueth: Adds link detection, RX and TX support.
Changes corresponding to link configuration such as speed and duplexity.
IRQ and handler initializations are performed for packet reception.Firmware
receives the packet from the wire and stores it into OCMC queue. Next, it
notifies the CPU via interrupt. Upon receiving the interrupt CPU will
service the IRQ and packet will be processed by pushing the newly allocated
SKB to upper layers.
When the user application want to transmit a packet, it will invoke
sys_send() which will in turn invoke the PRUETH driver, then it will write
the packet into OCMC queues. PRU firmware will pick up the packet and
transmit it on to the wire.
Reviewed-by: Mohan Reddy Putluru <pmohan@couthit.com> Signed-off-by: Roger Quadros <rogerq@ti.com> Signed-off-by: Andrew F. Davis <afd@ti.com> Signed-off-by: Basharath Hussain Khaja <basharath@couthit.com> Signed-off-by: Parvathi Pudi <parvathi@couthit.com> Link: https://patch.msgid.link/20250912115443.529856-5-parvathi@couthit.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Roger Quadros [Fri, 12 Sep 2025 11:53:27 +0000 (17:23 +0530)]
net: ti: icssm-prueth: Adds PRUETH HW and SW configuration
Updates for MII_RT hardware peripheral configuration such as RX and TX
configuration for PRU0 and PRU1, frame sizes, and MUX config.
Updates for PRU-ICSS firmware register configuration and DRAM, SRAM and
OCMC memory initialization, which will be used in the runtime for packet
reception and transmission.
DUAL-EMAC memory allocation for software queues and its supporting
components such as the buffer descriptors and queue descriptors. These
software queues are placed in OCMC memory and are shared with CPU by
PRU-ICSS for packet receive and transmit.
All declarations and macros are being used from common header file
for various protocols.
Reviewed-by: Mohan Reddy Putluru <pmohan@couthit.com> Signed-off-by: Roger Quadros <rogerq@ti.com> Signed-off-by: Andrew F. Davis <afd@ti.com> Signed-off-by: Basharath Hussain Khaja <basharath@couthit.com> Signed-off-by: Parvathi Pudi <parvathi@couthit.com> Link: https://patch.msgid.link/20250912115443.529856-4-parvathi@couthit.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Roger Quadros [Fri, 12 Sep 2025 10:44:51 +0000 (16:14 +0530)]
net: ti: icssm-prueth: Adds ICSSM Ethernet driver
Updates Kernel configuration to enable PRUETH driver and its dependencies
along with makefile changes to add the new PRUETH driver.
Changes includes init and deinit of ICSSM PRU Ethernet driver including
net dev registration and firmware loading for DUAL-MAC mode running on
PRU-ICSS2 instance.
Changes also includes link handling, PRU booting, default firmware loading
and PRU stopping using existing remoteproc driver APIs.
Reviewed-by: Mohan Reddy Putluru <pmohan@couthit.com> Signed-off-by: Roger Quadros <rogerq@ti.com> Signed-off-by: Andrew F. Davis <afd@ti.com> Signed-off-by: Basharath Hussain Khaja <basharath@couthit.com> Signed-off-by: Parvathi Pudi <parvathi@couthit.com> Link: https://patch.msgid.link/20250912104741.528721-3-parvathi@couthit.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
dt-bindings: net: ti: Adds DUAL-EMAC mode support on PRU-ICSS2 for AM57xx, AM43xx and AM33xx SOCs
Documentation update for the newly added "pruss2_eth" device tree
node and its dependencies along with compatibility for PRU-ICSS
Industrial Ethernet Peripheral (IEP), PRU-ICSS Enhanced Capture
(eCAP) peripheral and using YAML binding document for AM57xx SoCs.
This series cleans up the hardware timestamping / PTP initialisation
and cleanup code in the stmmac driver. Several key points in no
particular order:
1. Golden rule: unregister first, then release resources.
stmmac_release_ptp didn't do this.
2. Avoid leaking resources - __stmmac_open() failure leaves the
timestamping support initialised, but stops its clock. Also
violates (1).
3. Avoid double-release of resources - stmmac_open() followed by
stmmac_xdp_open() failing results in the PTP clock prepare and
enable counts being released, and if the interface is then
brought down, they are incorrectly released again. As XDP
doesn't gain any additional prepare/enables on the PTP clock,
remove this incorrect cleanup.
4. Changing the MTU of the interface is disruptive to PTP, and
remains so as long as. This is not fixed by this series (too
invasive at the moment.)
5. Avoid exporting functions that aren't used...
6. Avoid unnecessary runtime PM state manipulations (no point
manipulating this when MTU changes).
7. Make the PTP/timestamping initialisation more readable - no
point calling functions in the same file from one callsite
that return error codes from one location in the called function,
to only have the sole callee print messages depending on that
return code. Also simplifying the mess in stmmac_hw_setup().
Also placing support checks in a better location. Also getting
rid of the "ptp_register" boolean through this restructuring.
v1 was tested by Gatien CHEVALLIER - thanks.
====================
Russell King (Oracle) [Thu, 11 Sep 2025 11:10:28 +0000 (12:10 +0100)]
net: stmmac: move timestamping/ptp init to stmmac_hw_setup() caller
Move the call to stmmac_init_timestamping() or stmmac_setup_ptp() out
of stmmac_hw_setup() to its caller after stmmac_hw_setup() has
successfully completed. This slightly changes the ordering during
setup, but should be safe to do.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Russell King (Oracle) [Thu, 11 Sep 2025 11:10:18 +0000 (12:10 +0100)]
net: stmmac: add stmmac_setup_ptp()
Add a function to setup PTP, which will enable the clock, initialise
the timestamping, and register with the PTP clock subsystem. Call this
when we want to register the PTP clock in stmmac_hw_setup(), otherwise
just call the
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Russell King (Oracle) [Thu, 11 Sep 2025 11:10:13 +0000 (12:10 +0100)]
net: stmmac: rename stmmac_init_ptp()
Changes to the stmmac driver to fix various issues with PTP have made
stmmac_init_ptp() less about initialising the entire PTP block, and
now primarily deals with the packet timestamping support. The exception
to this is ptp_clk_freq_config(), which is an odditiy. It remains
as stmmac_init_ptp() is used both at .ndo_open() time and in the
resume paths.
However, restructuring this code to make it more easily readable makes
the continued use of "init_ptp" confusing.
In preparation to cleaning up the (re-)initialisation of timestamping,
rename the existing stmmac_init_ptp() to stmmac_init_timestamping()
which better reflects its functionality.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Russell King (Oracle) [Thu, 11 Sep 2025 11:10:02 +0000 (12:10 +0100)]
net: stmmac: add __stmmac_release() to complement __stmmac_open()
Rename stmmac_release() to __stmmac_release(), providing a new
stmmac_release() method. Update stmmac_change_mtu() to use
__stmmac_release(). Move the runtime PM handling into stmmac_open()
and stmmac_release().
This avoids stmmac_change_mtu() needlessly fiddling with the runtime
PM state, and will allow future changes to remove code from
__stmmac_open() and __stmmac_release() that should only happen when
the net device is administratively brought up or down.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Nothing outside of stmmac_main.c makes use of
stmmac_init_tstamp_counter(), so there's no point exporting it for
modules, or even having it non-static. Remove the export and make
it static.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Neither stmmac_xdp_release() nor the normal paths of stmmac_xdp_open()
touch clk_ptp_ref, so stmmac_xdp_open() should not be doing anything
with this clock. However, in its error path, it calls
stmmac_hw_teardown() which disables and unprepares this clock, which
can lead to the clock state becoming unbalanced when the netdev is
taken administratively down.
Remove this call to stmmac_hw_teardown(), and as this is the last user
of this function, remove the function as well.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Russell King (Oracle) [Thu, 11 Sep 2025 11:09:47 +0000 (12:09 +0100)]
net: stmmac: fix PTP error cleanup in __stmmac_open()
The cleanup function for stmmac_setup_ptp() is stmmac_release_ptp()
which entirely undoes the effects of stmmac_setup_ptp() by
unregistering the PTP device and then stopping the PTP clock,
whereas stmmac_hw_teardown() will only stop the PTP clock while
leaving the PTP device registered.
This can lead to a kernel oops - if __stmmac_open() fails after
registering the PTP clock, the PTP device will remain registered,
and if the module is removed, subsequent PTP device accesses will
lead to a kernel oops.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Russell King (Oracle) [Thu, 11 Sep 2025 11:09:42 +0000 (12:09 +0100)]
net: stmmac: disable PTP clock after unregistering PTP
Follow the principle of unpublish from userspace and then teardown
resources.
Disable the PTP clock only after unregistering with the PTP subsystem,
which ensures that we only stop the clock that ticks the timesource
after we have removed the PTP device.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Russell King (Oracle) [Thu, 11 Sep 2025 11:09:37 +0000 (12:09 +0100)]
net: stmmac: ptp: improve handling of aux_ts_lock lifetime
The aux_ts_lock mutex is only required while the PTP clock has been
successfully registered.
stmmac_ptp_register() does not return any errors (as we don't wish to
prevent the netdev being opened if PTP fails), stmmac_ptp_unregister()
was coded to allow it to be called irrespective of whether PTP was
successfully registered or not.
Arrange for the aux_ts_lock mutex to be destroyed if the PTP clock
is not functional during stmmac_ptp_register(), and only destroy it
in stmmac_ptp_unregister() if we had a PTP clock registered.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
mv88e6xxx as accumulated some unused data structures and code over the
years. This series removes it and simplifies the code. See the patches
for each change.
====================
Russell King (Oracle) [Thu, 11 Sep 2025 10:46:43 +0000 (11:46 +0100)]
net: dsa: mv88e6xxx: remove unused support for PPS event capture
mv88e6352_config_eventcap() is documented as handling both EXTTS and
PPS capture modes, but nothing ever calls it for PPS capture. Remove
the unused PPS capture mode support, and the now unused
MV88E6XXX_TAI_EVENT_STATUS_CAP_TRIG definition.
Russell King (Oracle) [Thu, 11 Sep 2025 10:46:38 +0000 (11:46 +0100)]
net: dsa: mv88e6xxx: remove chip->evcap_config
evcap_config is only read and written in mv88e6352_config_eventcap(),
so it makes little sense to store it in the global chip struct. Make
it a local variable instead.
Russell King (Oracle) [Thu, 11 Sep 2025 10:46:28 +0000 (11:46 +0100)]
net: dsa: mv88e6xxx: remove mv88e6250_ptp_ops
mv88e6250_ptp_ops and mv88e6352_ptp_ops are identical since commit 7e3c18097a70 ("net: dsa: mv88e6xxx: read cycle counter period from
hardware"). Remove the unnecessary duplication.
net/cls_cgroup: Fix task_get_classid() during qdisc run
During recent testing with the netem qdisc to inject delays into TCP
traffic, we observed that our CLS BPF program failed to function correctly
due to incorrect classid retrieval from task_get_classid(). The issue
manifests in the following call stack:
The problem occurs when multiple tasks share a single qdisc. In such cases,
__qdisc_run() may transmit skbs created by different tasks. Consequently,
task_get_classid() retrieves an incorrect classid since it references the
current task's context rather than the skb's originating task.
Given that dev_queue_xmit() always executes with bh disabled, we can use
softirq_count() instead to obtain the correct classid.
The simple steps to reproduce this issue:
1. Add network delay to the network interface:
such as: tc qdisc add dev bond0 root netem delay 1.5ms
2. Build two distinct net_cls cgroups, each with a network-intensive task
3. Initiate parallel TCP streams from both tasks to external servers.
Under this specific condition, the issue reliably occurs. The kernel
eventually dequeues an SKB that originated from Task-A while executing in
the context of Task-B.
It is worth noting that it will change the established behavior for a
slightly different scenario:
<sock S is created by task A>
<class ID for task A is changed>
<skb is created by sock S xmit and classified>
prior to this patch the skb will be classified with the 'new' task A
classid, now with the old/original one. The bpf_get_cgroup_classid_curr()
function is a more appropriate choice for this case.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Thomas Graf <tgraf@suug.ch> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20250902062933.30087-1-laoar.shao@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: mana: Reduce waiting time if HWC not responding
If HW Channel (HWC) is not responding, reduce the waiting time, so further
steps will fail quickly.
This will prevent getting stuck for a long time (30 minutes or more), for
example, during unloading while HWC is not responding.
Eric Dumazet [Tue, 9 Sep 2025 12:19:42 +0000 (12:19 +0000)]
net: use NUMA drop counters for softnet_data.dropped
Hosts under DOS attack can suffer from false sharing
in enqueue_to_backlog() : atomic_inc(&sd->dropped).
This is because sd->dropped can be touched from many cpus,
possibly residing on different NUMA nodes.
Generalize the sk_drop_counters infrastucture
added in commit c51613fa276f ("net: add sk->sk_drop_counters")
and use it to replace softnet_data.dropped
with NUMA friendly softnet_data.drop_counters.
This adds 64 bytes per cpu, maybe more in the future
if we increase the number of counters (currently 2)
per 'struct numa_drop_counters'.
====================
Add GMAC support for Renesas RZ/{T2H, N2H} SoCs
This series adds support for the Ethernet MAC (GMAC) IP present on
the Renesas RZ/T2H and RZ/N2H SoCs.
While these SoCs use the same Synopsys DesignWare MAC IP (version 5.20) as
the existing RZ/V2H(P), the hardware is synthesized with different options
that require driver and binding updates:
- 8 RX/TX queue pairs instead of 4 (requiring 19 interrupts vs 11)
- Different clock requirements (3 clocks vs 7)
- Different reset handling (2 named resets vs 1 unnamed)
- Split header feature enabled
- GMAC connected through a MIIC PCS on RZ/T2H
The series first updates the generic dwmac binding to accommodate the
higher interrupt count, then extends the Renesas-specific binding with
a to document both SoCs.
The driver changes prepare for multi-SoC support by introducing OF match
data for per-SoC configuration, then add RZ/T2H support including PCS
integration through the existing RZN1 MIIC driver.
Note this patch series is dependent on the PCS driver [0]
(not a build dependency).
[0] https://lore.kernel.org/all/20250904114204.4148520-1-prabhakar.mahadev-lad.rj@bp.renesas.com/
====================
net: stmmac: dwmac-renesas-gbeth: Add support for RZ/T2H SoC
Extend the Renesas GBETH stmmac glue driver to support the RZ/T2H SoC,
where the GMAC is connected through a MIIC PCS. Introduce a new
`has_pcs` flag in `struct renesas_gbeth_of_data` to indicate when PCS
handling is required.
When enabled, the driver parses the `pcs-handle` phandle, creates a PCS
instance with `miic_create()`, and wires it into phylink. Proper cleanup
is done with `miic_destroy()`. New init/exit/select hooks are added to
`plat_stmmacenet_data` for PCS integration.
Update Kconfig to select `PCS_RZN1_MIIC` when building the Renesas GBETH
driver so the PCS support is always available.
net: stmmac: dwmac-renesas-gbeth: Use OF data for configuration
Prepare for adding RZ/T2H SoC support by making the driver configuration
selectable via OF match data. While the RZ/V2H(P) and RZ/T2H use the same
version of the Synopsys DesignWare MAC (version 5.20), the hardware is
synthesized with different options. To accommodate these differences,
introduce a struct holding per-SoC configuration such as clock list,
number of clocks, TX clock rate control, and STMMAC flags, and retrieve
it from the device tree match entry during probe.
dt-bindings: net: renesas,rzv2h-gbeth: Document Renesas RZ/T2H and RZ/N2H SoCs
Add device tree binding support for the Gigabit Ethernet MAC (GMAC) IP
on Renesas RZ/T2H and RZ/N2H SoCs. While these SoCs use the same
Synopsys DesignWare MAC version 5.20 as RZ/V2H, they are synthesized
with different hardware configurations.
Add new compatible strings "renesas,r9a09g077-gbeth" for RZ/T2H and
"renesas,r9a09g087-gbeth" for RZ/N2H, with the latter using RZ/T2H as
fallback since they share identical GMAC IP.
Update the schema to handle hardware differences between SoC variants.
RZ/T2H requires only 3 clocks compared to 7 on RZ/V2H, supports 8 RX/TX
queue pairs instead of 4, and needs 2 reset controls with reset-names
property versus a single unnamed reset. RZ/T2H also has the split header
feature enabled which is disabled on RZ/V2H.
Add support for an optional pcs-handle property to connect the GMAC to
the MIIC PCS converter on RZ/T2H. Use conditional schema validation to
enforce the correct clock, reset, and interrupt configurations per SoC
variant.
Extend the base snps,dwmac.yaml schema to accommodate the increased
interrupt count, supporting up to 19 interrupts and extending the
rx-queue and tx-queue interrupt name patterns to cover queues 0-7.
Jonas Rebmann [Thu, 11 Sep 2025 08:29:03 +0000 (10:29 +0200)]
net: phy: micrel: Update Kconfig help text
This driver by now supports 17 different Microchip (formerly known as
Micrel) chips: KSZ9021, KSZ9031, KSZ9131, KSZ8001, KS8737, KSZ8021,
KSZ8031, KSZ8041, KSZ8051, KSZ8061, KSZ8081, KSZ8873MLL, KSZ886X,
KSZ9477, LAN8814, LAN8804 and LAN8841.
Support for the VSC8201 was removed in commit 51f932c4870f ("micrel phy
driver - updated(1)")
Update the help text to reflect that, list families instead of models to
ease future maintenance.
Jakub Kicinski [Sat, 13 Sep 2025 00:06:25 +0000 (17:06 -0700)]
Merge tag 'nf-next-25-09-11' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Florian Westphal says:
====================
netfilter: updates for net-next
1) Don't respond to ICMP_UNREACH errors with another ICMP_UNREACH
error.
2) Support fetching the current bridge ethernet address.
This allows a more flexible approach to packet redirection
on bridges without need to use hardcoded addresses. From
Fernando Fernandez Mancera.
3) Zap a few no-longer needed conditionals from ipvs packet path
and convert to READ/WRITE_ONCE to avoid KCSAN warnings.
From Zhang Tengfei.
4) Remove a no-longer-used macro argument in ipset, from Zhen Ni.
* tag 'nf-next-25-09-11' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: nf_reject: don't reply to icmp error messages
ipvs: Use READ_ONCE/WRITE_ONCE for ipvs->enable
netfilter: nft_meta_bridge: introduce NFT_META_BRI_IIFHWADDR support
netfilter: ipset: Remove unused htable_bits in macro ahash_region
selftest:net: fixed spelling mistakes
====================
Russell King [Wed, 10 Sep 2025 12:50:46 +0000 (13:50 +0100)]
net: mvneta: add support for hardware timestamps
Add support for hardware timestamps in (e.g.) the PHY by calling
skb_tx_timestamp() as close as reasonably possible to the point that
the hardware is instructed to send the queued packets.
udp_tunnel: use netdev_warn() instead of netdev_WARN()
netdev_WARN() uses WARN/WARN_ON to print a backtrace along with
file and line information. In this case, udp_tunnel_nic_register()
returning an error is just a failed operation, not a kernel bug.
udp_tunnel_nic_register() can fail due to a memory allocation
failure (kzalloc() or udp_tunnel_nic_alloc()).
This is a normal runtime error and not a kernel bug.
Replace netdev_WARN() with netdev_warn() accordingly.
====================
tcp: Destroy TCP-AO, TCP-MD5 keys in .sk_destruct()
On one side a minor/cosmetic issue, especially nowadays when
TCP-AO/TCP-MD5 signature verification failures aren't logged to dmesg.
Yet, I think worth addressing for two reasons:
- unsigned RST gets ignored by the peer and the connection is alive for
longer (keep-alive interval)
- netstat counters increase and trace events report that trusted BGP peer
is sending unsigned/incorrectly signed segments, which can ring alarm
on monitoring.
====================
Now that the destruction of info/keys is delayed until the socket
destructor, it's safe to use kfree() without an RCU callback.
The socket is in TCP_CLOSE state either because it never left it,
or it's already closed and the refcounter is zero. In any way,
no one can discover it anymore, it's safe to release memory
straight away.
tcp: Destroy TCP-AO, TCP-MD5 keys in .sk_destruct()
Currently there are a couple of minor issues with destroying the keys
tcp_v4_destroy_sock():
1. The socket is yet in TCP bind buckets, making it reachable for
incoming segments [on another CPU core], potentially available to send
late FIN/ACK/RST replies.
2. There is at least one code path, where tcp_done() is called before
sending RST [kudos to Bob for investigation]. This is a case of
a server, that finished sending its data and just called close().
The socket is in TCP_FIN_WAIT2 and has RCV_SHUTDOWN (set by
__tcp_close())
Note the signed RSTs later in the dump - those are sent by the server
when the fin-wait socket gets removed from hash buckets, by
the listener socket.
Instead of destroying AO/MD5 info and their keys in inet_csk_destroy_sock(),
slightly delay it until the actual socket .sk_destruct(). As shutdown'ed
socket can yet send non-data replies, they should be signed in order for
the peer to process them. Now it also matches how AO/MD5 gets destructed
for TIME-WAIT sockets (in tcp_twsk_destructor()).
This seems optimal for TCP-MD5, while for TCP-AO it seems to have an
open problem: once RST get sent and socket gets actually destructed,
there is no information on the initial sequence numbers. So, in case
this last RST gets lost in the network, the server's listener socket
won't be able to properly sign another RST. Nothing in RFC 1122
prescribes keeping any local state after non-graceful reset.
Luckily, BGP are known to use keep alive(s).
While the issue is quite minor/cosmetic, these days monitoring network
counters is a common practice and getting invalid signed segments from
a trusted BGP peer can get customers worried.
====================
bridge: Allow keeping local FDB entries only on VLAN 0
The bridge FDB contains one local entry per port per VLAN, for the MAC of
the port in question, and likewise for the bridge itself. This allows
bridge to locally receive and punt "up" any packets whose destination MAC
address matches that of one of the bridge interfaces or of the bridge
itself.
The number of these local "service" FDB entries grows linearly with number
of bridge-global VLAN memberships, but that in turn will tend to grow
quadratically with number of ports and per-port VLAN memberships. While
that does not cause issues during forwarding lookups, it does make dumps
impractically slow.
As an example, with 100 interfaces, each on 4K VLANs, a full dump of FDB
that just contains these 400K local entries, takes 6.5s. That's _without_
considering iproute2 formatting overhead, this is just how long it takes to
walk the FDB (repeatedly), serialize it into netlink messages, and parse
the messages back in userspace.
This is to illustrate that with growing number of ports and VLANs, the time
required to dump this repetitive information blows up. Arguably 4K VLANs
per interface is not a very realistic configuration, but then modern
switches can instead have several hundred interfaces, and we have fielded
requests for >1K VLAN memberships per port among customers.
FDB entries are currently all kept on a single linked list, and then
dumping uses this linked list to walk all entries and dump them in order.
When the message buffer is full, the iteration is cut short, and later
restarted. Of course, to restart the iteration, it's first necessary to
walk the already-dumped front part of the list before starting dumping
again. So one possibility is to organize the FDB entries in different
structure more amenable to walk restarts.
One option is to walk directly the hash table. The advantage is that no
auxiliary structure needs to be introduced. With a rough sketch of this
approach, the above scenario gets dumped in not quite 3 s, saving over 50 %
of time. However hash table iteration requires maintaining an active cursor
that must be collected when the dump is aborted. It looks like that would
require changes in the NDO protocol to allow to run this cleanup. Moreover,
on hash table resize the iteration is simply restarted. FDB dumps are
currently not guaranteed to correspond to any one particular state: entries
can be missed, or be duplicated. But with hash table iteration we would get
that plus the much less graceful resize behavior, where swaths of FDB are
duplicated.
Another option is to maintain the FDB entries in a red-black tree. We have
a PoC of this approach on hand, and the above scenario is dumped in about
2.5 s. Still not as snappy as we'd like it, but better than the hash table.
However the savings come at the expense of a more expensive insertion, and
require locking during dumps, which blocks insertion.
The upside of these approaches is that they provide benefits whatever the
FDB contents. But it does not seem like either of these is workable.
However we intend to clean up the RB tree PoC and present it for
consideration later on in case the trade-offs are considered acceptable.
Yet another option might be to use in-kernel FDB filtering, and to filter
the local entries when dumping. Unfortunately, this does not help all that
much either, because the linked-list walk still needs to happen. Also, with
the obvious filtering interface built around ndm_flags / ndm_state
filtering, one can't just exclude pure local entries in one query. One
needs to dump all non-local entries first, and then to get permanent
entries in another run filter local & added_by_user. I.e. one needs to pay
the iteration overhead twice, and then integrate the result in userspace.
To get significant savings, one would need a very specific knob like "dump,
but skip/only include local entries". But if we are adding a local-specific
knobs, maybe let's have an option to just not duplicate them in the first
place.
All this FDB duplication is there merely to make things snappy during
forwarding. But high-radix switches with thousands of VLANs typically do
not process much traffic in the SW datapath at all, but rather offload vast
majority of it. So we could exchange some of the runtime performance for a
neater FDB.
To that end, in this patchset, introduce a new bridge option,
BR_BOOLOPT_FDB_LOCAL_VLAN_0, which when enabled, has local FDB entries
installed only on VLAN 0, instead of duplicating them across all VLANs.
Then to maintain the local termination behavior, on FDB miss, the bridge
does a second lookup on VLAN 0.
Enabling this option changes the bridge behavior in expected ways. Since
the entries are only kept on VLAN 0, FDB get, flush and dump will not
perceive them on non-0 VLANs. And deleting the VLAN 0 entry affects
forwarding on all VLANs.
This patchset is loosely based on a privately circulated patch by Nikolay
Aleksandrov.
The patchset progresses as follows:
- Patch #1 introduces a bridge option to enable the above feature. Then
patches #2 to #5 gradually patch the bridge to do the right thing when
the option is enabled. Finally patch #6 adds the UAPI knob and the code
for when the feature is enabled or disabled.
- Patches #7, #8 and #9 contain fixes and improvements to selftest
libraries
- Patch #10 contains a new selftest
====================
Usually the autodefer helpers in lib.sh are expected to be run in context
where success is the expected outcome. However when using them for feature
detection, failure can legitimately occur. But the failed command still
schedules a cleanup, which will likely fail again.
Instead, only schedule deferred cleanup when the positive command succeeds.
This way of organizing the cleanup has the added benefit that now the
return code from these functions reflects whether the command passed.
Petr Machata [Thu, 4 Sep 2025 17:07:25 +0000 (19:07 +0200)]
selftests: defer: Introduce DEFER_PAUSE_ON_FAIL
The fact that all cleanup (ideally) goes through the defer framework makes
debugging of these commands a bit tricky. However, this also gives us a
nice point to place a hook along the lines of PAUSE_ON_FAIL. When the
environment variable DEFER_PAUSE_ON_FAIL is set, and a cleanup command
results in non-zero exit status, show a bit of debuginfo and give the user
an opportunity to interrupt the execution altogether.
Petr Machata [Thu, 4 Sep 2025 17:07:24 +0000 (19:07 +0200)]
selftests: defer: Allow spaces in arguments of deferred commands
Currently the way deferred commands are stored and invoked causes any
whitespace to act as an argument separator when the command is executed.
To make it possible to use spaces in deferred commands, store the commands
quoted, and then eval the string prior to execution.
Petr Machata [Thu, 4 Sep 2025 17:07:23 +0000 (19:07 +0200)]
net: bridge: Introduce UAPI for BR_BOOLOPT_FDB_LOCAL_VLAN_0
The previous patches introduced a new option, BR_BOOLOPT_FDB_LOCAL_VLAN_0.
When enabled, it has local FDB entries installed only on VLAN 0, instead of
duplicating them across all VLANs.
In this patch, add the corresponding UAPI toggle, and the code for turning
the feature on and off.
Petr Machata [Thu, 4 Sep 2025 17:07:22 +0000 (19:07 +0200)]
net: bridge: BROPT_FDB_LOCAL_VLAN_0: Skip local FDBs on VLAN creation
When BROPT_FDB_LOCAL_VLAN_0 is enabled, the local FDB entries for the
member ports as well as the bridge itself should not be created per-VLAN,
but instead only on VLAN 0.
Thus when a VLAN is added for a port or the bridge itself, a local FDB
entry with the corresponding address should not be added when in the VLAN-0
mode.
Petr Machata [Thu, 4 Sep 2025 17:07:21 +0000 (19:07 +0200)]
net: bridge: BROPT_FDB_LOCAL_VLAN_0: On bridge changeaddr, skip per-VLAN FDBs
When BROPT_FDB_LOCAL_VLAN_0 is enabled, the local FDB entries for the
bridge itself should not be created per-VLAN, but instead only on VLAN 0.
When the bridge address changes, the local FDB entries need to be updated,
which is done in br_fdb_change_mac_address().
Bail out early when in VLAN-0 mode, so that the per-VLAN FDB entries are
not created. The per-VLAN walk is only done afterwards.
Petr Machata [Thu, 4 Sep 2025 17:07:20 +0000 (19:07 +0200)]
net: bridge: BROPT_FDB_LOCAL_VLAN_0: On port changeaddr, skip per-VLAN FDBs
When BROPT_FDB_LOCAL_VLAN_0 is enabled, the local FDB entries for member
ports should not be created per-VLAN, but instead only on VLAN 0. When the
member port address changes, the local FDB entries need to be updated,
which is done in br_fdb_changeaddr().
Under the VLAN-0 mode, only one local FDB entry will ever be added for a
port's address, and that on VLAN 0. Thus bail out of the delete loop early.
For the same reason, also skip adding the per-VLAN entries.
Petr Machata [Thu, 4 Sep 2025 17:07:19 +0000 (19:07 +0200)]
net: bridge: BROPT_FDB_LOCAL_VLAN_0: Look up FDB on VLAN 0 on miss
When BROPT_FDB_LOCAL_VLAN_0 is enabled, the local FDB entries for the
member ports as well as the bridge itself should not be created per-VLAN,
but instead only on VLAN 0.
That means that br_handle_frame_finish() needs to make two lookups: the
primary lookup on an appropriate VLAN, and when that misses, a lookup on
VLAN 0.
Have the second lookup only accept local MAC addresses. Turning this into a
generic second-lookup feature is not the goal.
Petr Machata [Thu, 4 Sep 2025 17:07:18 +0000 (19:07 +0200)]
net: bridge: Introduce BROPT_FDB_LOCAL_VLAN_0
The following patches will gradually introduce the ability of the bridge
to look up local FDB entries on VLAN 0 instead of using the VLAN indicated
by a packet.
In this patch, just introduce the option itself, with which the feature
will be linked.
Stanislav Fomichev [Wed, 10 Sep 2025 16:24:29 +0000 (09:24 -0700)]
net: devmem: expose tcp_recvmsg_locked errors
tcp_recvmsg_dmabuf can export the following errors:
- EFAULT when linear copy fails
- ETOOSMALL when cmsg put fails
- ENODEV if one of the frags is readable
- ENOMEM on xarray failures
But they are all ignored and replaced by EFAULT in the caller
(tcp_recvmsg_locked). Expose real error to the userspace to
add more transparency on what specifically fails.
In non-devmem case (skb_copy_datagram_msg) doing `if (!copied)
copied=-EFAULT` is ok because skb_copy_datagram_msg can return only EFAULT.
Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250910162429.4127997-1-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David Hildenbrand [Wed, 10 Sep 2025 01:36:43 +0000 (03:36 +0200)]
wireguard: selftests: remove CONFIG_SPARSEMEM_VMEMMAP=y from qemu kernel config
It's no longer user-selectable (and the default was already "y"), so
let's just drop it.
It was never really relevant to the wireguard selftests either way.
Cc: Shuah Khan <shuah@kernel.org> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Link: https://patch.msgid.link/20250910013644.4153708-4-Jason@zx2c4.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
wireguard: queueing: always return valid online CPU in wg_cpumask_choose_online()
The function gets number of online CPUS, and uses it to search for
Nth cpu in cpu_online_mask.
If id == num_online_cpus() - 1, and one CPU gets offlined between
calling num_online_cpus() -> cpumask_nth(), there's a chance for
cpumask_nth() to find nothing and return >= nr_cpu_ids.
The caller code in __queue_work() tries to avoid that by checking the
returned CPU against WORK_CPU_UNBOUND, which is NR_CPUS. It's not the
same as '>= nr_cpu_ids'. On a typical Ubuntu desktop, NR_CPUS is 8192,
while nr_cpu_ids is the actual number of possible CPUs, say 8.
The non-existing cpu may later be passed to rcu_dereference() and
corrupt the logic. Fix it by switching from 'if' to 'while'.
Suggested-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Yury Norov (NVIDIA) <yury.norov@gmail.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Link: https://patch.msgid.link/20250910013644.4153708-3-Jason@zx2c4.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
wg_cpumask_choose_online() opencodes cpumask_nth(). Use it and make the
function significantly simpler. While there, fix opencoded cpu_online()
too.
Signed-off-by: Yury Norov <yury.norov@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Link: https://patch.msgid.link/20250910013644.4153708-2-Jason@zx2c4.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-Wflex-array-member-not-at-end was introduced in GCC-14, and we are
getting ready to enable it, globally.
Move the conflicting declaration to the end of the corresponding
structure. Notice that `struct ip_tunnel_info` is a flexible
structure, this is a structure that contains a flexible-array
member.
Fix the following warning:
drivers/net/geneve.c:56:33: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/aMBK78xT2fUnpwE5@kspp Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In a system with high real-time requirements, the timeout mechanism of
ordinary timers with jiffies granularity is insufficient to meet the
demands for real-time performance. Meanwhile, the optimization of CPU
usage with af_packet is quite significant. Use hrtimer instead of timer
to help compensate for the shortcomings in real-time performance.
In HZ=100 or HZ=250 system, the update of TP_STATUS_USER is not real-time
enough, with fluctuations reaching over 8ms (on a system with HZ=250).
This is unacceptable in some high real-time systems that require timely
processing of network packets. By replacing it with hrtimer, if a timeout
of 2ms is set, the update of TP_STATUS_USER can be stabilized to within
3 ms.
====================
net: af_packet: Use hrtimer to do the retire operation
In a system with high real-time requirements, the timeout mechanism of
ordinary timers with jiffies granularity is insufficient to meet the
demands for real-time performance. Meanwhile, the optimization of CPU
usage with af_packet is quite significant. Use hrtimer instead of timer
to help compensate for the shortcomings in real-time performance.
In HZ=100 or HZ=250 system, the update of TP_STATUS_USER is not real-time
enough, with fluctuations reaching over 8ms (on a system with HZ=250).
This is unacceptable in some high real-time systems that require timely
processing of network packets. By replacing it with hrtimer, if a timeout
of 2ms is set, the update of TP_STATUS_USER can be stabilized to within
3 ms.
Delete delete_blk_timer field, because hrtimer_cancel will check and wait
until the timer callback return and ensure never enter callback again.
Simplify the logic related to setting timeout, only update the hrtimer
expire time within the hrtimer callback, no longer update the expire time
in prb_open_block which is called by tpacket_rcv or timer callback.
Reasons why NOT update hrtimer in prb_open_block:
1) It will increase complexity to distinguish the two caller scenario.
2) hrtimer_cancel and hrtimer_start need to be called if you want to update
TMO of an already enqueued hrtimer, leading to complex shutdown logic.
One side effect of NOT update hrtimer when called by tpacket_rcv is that
a newly opened block triggered by tpacket_rcv may be retired earlier than
expected. On the other hand, if timeout is updated in prb_open_block, the
frequent reception of network packets that leads to prb_open_block being
called may cause hrtimer to be removed and enqueued repeatedly.
The retire hrtimer expiration is unconditional and periodic. If there are
numerous packet sockets on the system, please set an appropriate timeout
to avoid frequent enqueueing of hrtimers.
kactive_blk_num (K) is only incremented on block close.
In timer callback prb_retire_rx_blk_timer_expired, except delete_blk_timer
is true, last_kactive_blk_num (L) is set to match kactive_blk_num (K) in
all cases. L is also set to match K in prb_open_block.
The only case K not equal to L is when scheduled by tpacket_rcv
and K is just incremented on block close but no new block could be opened,
so that it does not call prb_open_block in prb_dispatch_next_block.
This patch modifies the prb_retire_rx_blk_timer_expired function by simply
removing the check for L == K. This patch just provides another checkpoint
to thaw the might-be-frozen block in any case. It doesn't have any effect
because __packet_lookup_frame_in_block() has the same logic and does it
again without this patch when detecting the ring is frozen. The patch only
advances checking the status of the ring.
The daughter driver rcar_gen4_ptp used by both rswitch and rtsn where
upstreamed with support for possible different memory layouts on
different users. With all Gen4 boards upstream no such setup is
documented.
There are other issues related to how the rcar_gen4_ptp driver is shared
between multiple useres that needs to be cleaned up. But that will be a
larger work. So before that get some simple fixes done.
Patch 1/3 and 2/3 removes the support to allow different register
layouts on different SoCs by looking up offsets at runtime with a much
simpler interface. The new interface computes the offsets at compile
time.
While patch 3/3 is a drive-by patch taking a spurs comment and making a
lockdep check of it.
There is no intentional functional change in this series just cleaning
up in preparation of larger works to follow.
====================
With the support for multiple register layout removed all support
structures can be removed from the header file. Covert to a simpler
structure using defines for the register offsets.
There is no functional change, only switching from looking up offsets at
runtime to compile time.
net: ethernet: renesas: rcar_gen4_ptp: Remove different memory layout
When upstreaming the Gen4 PTP support for R-Car S4 the possibility for
different memory layouts on other Gen4 SoCs was build in. It turns out
this is not needed and instead needlessly makes the driver harder to
read, remove the support code that would have allowed different memory
layouts.
This change only deals with the public functions used by other drivers,
follow up work will clean up the rcar_gen4_ptp internals.
Daniel Palmer [Sun, 7 Sep 2025 06:43:49 +0000 (15:43 +0900)]
eth: 8139too: Make 8139TOO_PIO depend on !NO_IOPORT_MAP
When 8139too is probing and 8139TOO_PIO=y it will call pci_iomap_range()
and from there __pci_ioport_map() for the PCI IO space.
If HAS_IOPORT_MAP=n and NO_GENERIC_PCI_IOPORT_MAP=n, like it is on my
m68k config, __pci_ioport_map() becomes NULL, pci_iomap_range() will
always fail and the driver will complain it couldn't map the PIO space
and return an error.
NO_IOPORT_MAP seems to cover the case where what 8139too is trying
to do cannot ever work so make 8139TOO_PIO depend on being it false
and avoid creating an unusable driver.