Hangbin Liu [Sat, 2 Dec 2023 02:00:57 +0000 (10:00 +0800)]
selftests/net: add lib.sh
Add a lib.sh for net selftests. This file can be used to define commonly
used variables and functions. Some commonly used functions can be moved
from forwarding/lib.sh to this lib file. e.g. busywait().
Add function setup_ns() for user to create unique namespaces with given
prefix name.
Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
====================
Intel Wired LAN Driver Updates 2023-12-01 (ice)
This series contains updates to ice driver only.
Konrad provides temperature reporting via hwmon.
Arkadiusz adds reporting of Clock Generation Unit (CGU) information via
devlink info.
Pawel adjusts error messaging for ntuple filters to account for additional
possibility for encountering an error.
Karol ensures that all timestamps occurring around reset are processed and
renames some E822 functions to convey additional usage for E823 devices.
Jake provides mechanism to ensure that all timestamps on E822 devices
are processed.
The following are changes since commit 15bc81212f593fbd7bda787598418b931842dc14:
octeon_ep: set backpressure watermark for RX queues
and are available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue 100GbE
====================
Karol Kolacinski [Fri, 1 Dec 2023 18:08:44 +0000 (10:08 -0800)]
ice: Rename E822 to E82X
When code is applicable for both E822 and E823 devices, rename it from
E822 to E82X.
ICE_PHY_PER_NAC_E822 was unused, so just remove it.
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jacob Keller [Fri, 1 Dec 2023 18:08:43 +0000 (10:08 -0800)]
ice: periodically kick Tx timestamp interrupt
The E822 hardware for Tx timestamping keeps track of how many
outstanding timestamps are still in the PHY memory block. It will not
generate a new interrupt to the MAC until all of the timestamps in the
region have been read.
If somehow all the available data is not read, but the driver has exited
its interrupt routine already, the PHY will not generate a new interrupt
even if new timestamp data is captured. Because no interrupt is
generated, the driver never processes the timestamp data. This state
results in a permanent failure for all future Tx timestamps.
It is not clear how the driver and hardware could enter this state.
However, if it does, there is currently no recovery mechanism.
Add a recovery mechanism via the periodic PTP work thread which invokes
ice_ptp_periodic_work(). Introduce a new check,
ice_ptp_maybe_trigger_tx_interrupt() which checks the PHY timestamp
ready bitmask. If any bits are set, trigger a software interrupt by
writing to PFINT_OICR.
Once triggered, the main timestamp processing thread will read through
the PHY data and clear the outstanding timestamp data. Once cleared, new
data should trigger interrupts as expected.
This should allow recovery from such a state rather than leaving the
device in a state where we cannot process Tx timestamps.
It is possible that this function checks for timestamp data
simultaneously with the interrupt, and it might trigger additional
unnecessary interrupts. This will cause a small amount of additional
processing.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com> Reviewed-by: Andrii Staikov <andrii.staikov@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Karol Kolacinski [Fri, 1 Dec 2023 18:08:42 +0000 (10:08 -0800)]
ice: Re-enable timestamping correctly after reset
During reset, TX_TSYN interrupt should be processed as it may process
timestamps in brief moments before and after reset.
Timestamping should be enabled on VSIs at the end of reset procedure.
On ice_get_phy_tx_tstamp_ready error, interrupt should not be rearmed
because error only happens on resets.
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Pawel Kaminski [Fri, 1 Dec 2023 18:08:41 +0000 (10:08 -0800)]
ice: Improve logs for max ntuple errors
Supported number of ntuple filters affect also maximum location value that
can be provided to ethtool command. Update error message to provide info
about max supported value.
Fix double spaces in the error messages.
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: Pawel Kaminski <pawel.kaminski@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Arkadiusz Kubalewski [Fri, 1 Dec 2023 18:08:40 +0000 (10:08 -0800)]
ice: add CGU info to devlink info callback
If Clock Generation Unit is present on NIC board user shall know its
details.
Provide the devlink info callback with a new:
- fixed type object (cgu.id) indicating hardware variant of onboard CGU,
- running type object (fw.cgu) consisting of CGU id, config and firmware
versions.
These information shall be known for debugging purposes.
Test (on NIC board with CGU)
$ devlink dev info <bus_name>/<dev_name> | grep cgu
cgu.id 36
fw.cgu 8032.16973825.6021
Test (on NIC board without CGU)
$ devlink dev info <bus_name>/<dev_name> | grep cgu -c
0
Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com> Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Tested on Intel Corporation Ethernet Controller E810-C for SFP
Co-developed-by: Marcin Domagala <marcinx.domagala@intel.com> Signed-off-by: Marcin Domagala <marcinx.domagala@intel.com> Co-developed-by: Eric Joyner <eric.joyner@intel.com> Signed-off-by: Eric Joyner <eric.joyner@intel.com> Reviewed-by: Marcin Szycik <marcin.szycik@linux.intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Konrad Knitter <konrad.knitter@intel.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
When dealing with action arrays in act_api it's natural to ask if they
are always contiguous (no NULL pointers in between). Yes, they are in
all cases so far, so make use of the already present tcf_act_for_each_action
macro to explicitly document this assumption.
There was an instance where it was not, but it was refactorable (patch 2)
to make the array contiguous.
Pedro Tammela [Fri, 1 Dec 2023 17:50:15 +0000 (14:50 -0300)]
net/sched: act_api: use tcf_act_for_each_action in tcf_idr_insert_many
The actions array is contiguous, so stop processing whenever a NULL
is found. This is already the assumption for tcf_action_destroy[1],
which is called from tcf_actions_init.
In tcf_action_add, when putting the reference for the bound actions
it assigns NULLs to just created actions passing a non contiguous
array to tcf_action_put_many.
Refactor the code so the actions array is always contiguous.
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Paolo Abeni [Tue, 5 Dec 2023 09:48:03 +0000 (10:48 +0100)]
Merge branch 'doc-update-bridge-doc'
Hangbin Liu says:
====================
Doc: update bridge doc
The current bridge kernel doc is too old. It only pointed to the
linuxfoundation wiki page which lacks of the new features.
Here let's start the new bridge document and put all the bridge info
so new developers and users could catch up the last bridge status soon.
v3 -> v4:
- Patch01: Reference and borrow definitions from the IEEE 802.1Q-2022 standard
for bridge (Stephen Hemminger)
- Patch04: Remind that kAPI is unstable. Add back sysfs part, but only note
that sysfs is deprecated. (Stephen Hemminger, Florian Fainelli)
- Patch05: Mention the RSTP and IEEE 802.1D developing info. (Stephen Hemminger)
- Some other grammar fixes.
v2 -> v3:
- Split the bridge doc update and adding kAPI/uAPI field to 2 part (Nikolay Aleksandrov)
- Update bridge and bridge enum descriptions (Nikolay Aleksandrov)
- Add user space stp help for STP doc (Vladimir Oltean)
v1 -> v2:
- Update bridge and bridge port enum descriptions (Vladimir Oltean)
RFCv3 -> v1:
- Fix up various typos, grammar and technical issues (Nikolay Aleksandrov)
RFCv2 -> RFCv3:
- Update netfilter part (Florian Westphal)
- Break the one large patch in to multiparts for easy reviewing. Please tell
me if I break it too much.. (Nikolay Aleksandrov)
- Update the description of each enum and doc (Nikolay Aleksandrov)
- Add more descriptions for STP/Multicast/VLAN.
RFCv1 -> RFCv2:
- Drop the python tool that generate iproute man page from kernel doc
====================
Hangbin Liu [Fri, 1 Dec 2023 08:19:48 +0000 (16:19 +0800)]
docs: bridge: add switchdev doc
Add switchdev part for bridge document.
Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Hangbin Liu [Fri, 1 Dec 2023 08:19:46 +0000 (16:19 +0800)]
docs: bridge: add VLAN doc
Add VLAN part for bridge document.
Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jakub Kicinski [Tue, 5 Dec 2023 02:37:40 +0000 (18:37 -0800)]
Merge branch 'net-stmmac-est-implementation'
Rohan G Thomas says:
====================
net: stmmac: EST implementation
This patchset extends EST interrupt handling support to DWXGMAC IP
followed by refactoring of EST implementation. Added a separate
module for EST and moved all EST related functions to the new module.
Also added support for EST cycle-time-extension.
====================
Rohan G Thomas [Fri, 1 Dec 2023 05:52:52 +0000 (13:52 +0800)]
net: stmmac: Add support for EST cycle-time-extension
Add support for cycle-time-extension. TER GCL-register needs to be
updated with the cycle-time-extension. Width of TER register is EST
time interval width + 7 bits.
Rohan G Thomas [Fri, 1 Dec 2023 05:52:51 +0000 (13:52 +0800)]
net: stmmac: Refactor EST implementation
Refactor EST implementation by moving common code for DWMAC4 and
DWXGMAC IPs into a separate EST module. EST implementation for DWMAC4
and DWXGMAC differs only for CSR base address, PTOV field offset
width, and PTOV clock multiplier value.
Thanks, Serge Semin and Jakub Kicinski for the suggestions on
refactoring EST implementation into a separate EST module.
Rohan G Thomas [Fri, 1 Dec 2023 05:52:50 +0000 (13:52 +0800)]
net: stmmac: xgmac: EST interrupts handling
Enabled the following EST related interrupts:
1) Constant Gate Control Error (CGCE)
2) Head-of-Line Blocking due to Scheduling (HLBS)
3) Head-of-Line Blocking due to Frame Size (HLBF)
4) Base Time Register error (BTRE)
5) Switch to S/W owned list Complete (SWLC)
Also, add EST errors into the ethtool statistic.
The commit e49aa315cb01 ("net: stmmac: EST interrupts handling and
error reporting") and commit 9f298959191b ("net: stmmac: Add EST
errors into ethtool statistic") add EST interrupts handling and error
reporting support to DWMAC4 core. This patch enables the same support
for XGMAC.
Ravi Gunasekaran [Fri, 1 Dec 2023 13:20:33 +0000 (18:50 +0530)]
net: ethernet: ti: davinci_mdio: Update K3 SoCs list for errata i2329
The errata i2329 affects all the currently available silicon revisions of
AM62x, AM64x, AM65x, J7200, J721E and J721S2. So remove the revision
string from the SoC list.
The silicon revisions affected by the errata i2329 can be found under
the MDIO module in the "Advisories by Modules" section of each
SoC errata document listed below
====================
Introduce queue and NAPI support in netdev-genl (Was: Introduce NAPI queues support)
Add the capability to export the following via netdev-genl interface:
- queue information supported by the device
- NAPI information supported by the device
Introduce support for associating queue and NAPI instance.
Extend the netdev_genl generic netlink family for netdev
with queue and NAPI data.
The queue parameters exposed are:
- queue index
- queue type
- ifindex
- NAPI id associated with the queue
Additional rx and tx queue parameters can be exposed in follow up
patches by stashing them in netdev queue structures. XDP queue type
can also be supported in future.
The NAPI fields exposed are:
- NAPI id
- NAPI device ifindex
- Interrupt number associated with the NAPI instance
- PID for the NAPI thread
This series only supports 'get' ability for retrieving
certain queue and NAPI attributes. The 'set' ability for
configuring queue and associated NAPI instance via netdev-genl
will be submitted as a separate patch series.
Amritha Nambiar [Fri, 1 Dec 2023 23:28:56 +0000 (15:28 -0800)]
netdev-genl: Add netlink framework functions for napi
Implement the netdev netlink framework functions for
napi support. The netdev structure tracks all the napi
instances and napi fields. The napi instances and associated
parameters can be retrieved this way.
====================
bnxt_en: Support new 5760X P7 devices
This series completes the basic support for the new 5760X P7 devices
with new PCI IDs added in the last patch.
Thie first patch fixes a backing store issue introduced in the last
patchset last week. The 2nd patch is the new firmware interface
required to support the new chips. The next few patches are doorbell
changes, refactoring, and new hardware interface structures. New
changes to support packet reception including TPA are added in patch 10.
The next 4 patches are ethernet link related changes to support the
new chip.
====================
Michael Chan [Fri, 1 Dec 2023 22:39:21 +0000 (14:39 -0800)]
bnxt_en: Support new firmware link parameters
Newer firmware supporting PAM4 112Gbps speeds use new parameters in
firmware message structures. Detect the new firmware capability and
add basic logic to report and store these new fields.
Michael Chan [Fri, 1 Dec 2023 22:39:20 +0000 (14:39 -0800)]
bnxt_en: Refactor ethtool speeds logic
Add helper functions to refactor the logic that converts firmware
speed masks to ethtool speeds. Pass the phy_flags to
bnxt_get_ethtool_speeds() and the call chain. The refactoring and the
phy_flags will be needed when adding support for the new speeds in the
next patches.
Michael Chan [Fri, 1 Dec 2023 22:39:19 +0000 (14:39 -0800)]
bnxt_en: Add support for new RX and TPA_START completion types for P7
These new completion types are supported on the new P7 chips.
These new types have commonalities with the legacy types. After
the refactoring, we mainly have to add new functions to handle the
the new meta data formats and the RX hash information in the new
types.
Michael Chan [Fri, 1 Dec 2023 22:39:18 +0000 (14:39 -0800)]
bnxt_en: Refactor and refine bnxt_tpa_start() and bnxt_tpa_end().
Refactor bnxt_tpa_start() by adding bnxt_tpa_metadata() to gather the
metadata from the TPA_START completion. This makes it easier to
support the new P7 chip which has a modified TPA_START completion
structure with different metadata formats. We also add vlan_valid
and cfa_code_valid fields to the bnxt_tpa_info structure so that the
VLAN and VF rep logic can be common for all chips. The VLAN metadata
is now collected in bnxt_tpa_start() only when it is valid and the
vlan_valid field will be set. bnxt_tpa_end() can now use common VLAN
logic for all chips.
Michael Chan [Fri, 1 Dec 2023 22:39:17 +0000 (14:39 -0800)]
bnxt_en: Refactor RX VLAN acceleration logic.
Refactor the logic in the RX path that checks for the accelerated VLAN
tag by adding a new function. This will make it easier to support
the new receive logic on P7 chips.
Ajit Khaparde [Fri, 1 Dec 2023 22:39:15 +0000 (14:39 -0800)]
bnxt_en: Refactor RSS capability fields
Add a new rss_cap field in the per device struct bnxt and move all
the RSS capability fields there. It will be easier to add new RSS
capabilities for the new P7 chips.
Michael Chan [Fri, 1 Dec 2023 22:39:14 +0000 (14:39 -0800)]
bnxt_en: Implement the new toggle bit doorbell mechanism on P7 chips
The new chip family passes the Toggle bits to the driver in the NQE
notification. The driver now stores this value and sends it back to
hardware when it re-arms the RX and TX CQs. Together with the earlier
patch that guarantees the driver will only re-arm the CQ at the end of
NAPI polling if it has seen a new NQE, this method allows the hardware
to detect any dropped doorbells.
Hongguang Gao [Fri, 1 Dec 2023 22:39:13 +0000 (14:39 -0800)]
bnxt_en: Consolidate DB offset calculation
The doorbell offset on P5 chips is hard coded. On the new P7 chips,
it is returned by the firmware. Simplify the logic that determines
this offset and store it in a new db_offset field in struct bnxt.
Also, provide this offset to the RoCE driver in struct bnxt_en_dev.
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Hongguang Gao <hongguang.gao@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Link: https://lore.kernel.org/r/20231201223924.26955-5-michael.chan@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Michael Chan [Fri, 1 Dec 2023 22:39:12 +0000 (14:39 -0800)]
bnxt_en: Define basic P7 macros
Repurpose the BNXT_FLAG_CHIP_SR2 flag by renaming it to
BNXT_FLAG_CHIP_P7 since the SR2 chip never went to production. The SR2
statictics structure is also renamed for the P7 chip. Define the basic
P7 doorbell bits (Epoch. Toggle, etc) and implement the Epoch bit
logic. The next higher bit beyond the legal doorbell mask is the
Epoch bit used for doorbells on P7 chips. This bit is used by the
chip to detect dropped doorbells.
The 57608 chip ID belonging to the P7 family is also defined. Note
that the PCI ID is not added until the last patch in the series.
Michael Chan [Fri, 1 Dec 2023 22:39:11 +0000 (14:39 -0800)]
bnxt_en: Update firmware interface to 1.10.3.15
This updated interface supports the new 5760X P7 chip family. It has
the changes to support the new link speeds/modes and other changes
for the basic L2 features.
Michael Chan [Fri, 1 Dec 2023 22:39:10 +0000 (14:39 -0800)]
bnxt_en: Fix backing store V2 logic
The current code determines the last backing store valid type during
bnxt_hwrm_func_backing_store_qcaps_v2(). In effect, the last type
is determined based on what firmware advertises. The more correct
way is to determine it based on what the driver is configuring. The
driver may not configure all the backing store types advertised by
firmware.
Move the logic to determine the last type to bnxt_backing_store_cfg_v2().
We need to pass the legacy enable flags to the function in case only
the legacy types are being configured.
Guillaume Nault [Fri, 1 Dec 2023 14:49:52 +0000 (15:49 +0100)]
tcp: Dump bound-only sockets in inet_diag.
Walk the hashinfo->bhash2 table so that inet_diag can dump TCP sockets
that are bound but haven't yet called connect() or listen().
The code is inspired by the ->lhash2 loop. However there's no manual
test of the source port, since this kind of filtering is already
handled by inet_diag_bc_sk(). Also, a maximum of 16 sockets are dumped
at a time, to avoid running with bh disabled for too long.
There's no TCP state for bound but otherwise inactive sockets. Such
sockets normally map to TCP_CLOSE. However, "ss -l", which is supposed
to only dump listening sockets, actually requests the kernel to dump
sockets in either the TCP_LISTEN or TCP_CLOSE states. To avoid dumping
bound-only sockets with "ss -l", we therefore need to define a new
pseudo-state (TCP_BOUND_INACTIVE) that user space will be able to set
explicitly.
With an IPv4, an IPv6 and an IPv6-only socket, bound respectively to
40000, 64000, 60000, an updated version of iproute2 could work as
follow:
$ ss -t state bound-inactive
Recv-Q Send-Q Local Address:Port Peer Address:Port Process
0 0 0.0.0.0:40000 0.0.0.0:*
0 0 [::]:60000 [::]:*
0 0 *:64000 *:*
Some Micrel phys define a specific rmii-ref clock (added in 2014) while
the generic phy binding specifies an unnamed clock for ethernet phys.
This allows Micrel phys to use both, so as to keep the phys not using
the named rmii-ref clock to conform to the generic binding while allowing
them to enable a supplying clock, when the phy is not supplied by a
dedicated oscillator.
====================
Heiko Stuebner [Fri, 1 Dec 2023 15:01:31 +0000 (16:01 +0100)]
net: phy: micrel: allow usage of generic ethernet-phy clock
The generic ethernet-phy binding allows describing an external clock since
commit 350b7a258f20 ("dt-bindings: net: phy: Document support for external PHY clk")
for cases where the phy is not supplied by an oscillator but instead
by a clock from the host system.
And the old named "rmii-ref" clock from 2014 is only specified for phys
of the KSZ8021, KSZ8031, KSZ8081, KSZ8091 types.
So allow retrieving and enabling the optional generic clock on phys that
do not provide a rmii-ref clock.
Signed-off-by: Heiko Stuebner <heiko.stuebner@cherry.de> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/20231201150131.326766-3-heiko@sntech.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Heiko Stuebner [Fri, 1 Dec 2023 15:01:30 +0000 (16:01 +0100)]
net: phy: micrel: use devm_clk_get_optional_enabled for the rmii-ref clock
While the external clock input will most likely be enabled, it's not
guaranteed and clk_get_rate in some suppliers will even just return
valid results when the clock is running.
So use devm_clk_get_optional_enabled to retrieve and enable the clock
in one go.
Signed-off-by: Heiko Stuebner <heiko.stuebner@cherry.de> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/20231201150131.326766-2-heiko@sntech.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jiri Pirko [Fri, 1 Dec 2023 18:01:54 +0000 (19:01 +0100)]
docs: netlink: add NLMSG_DONE message format for doit actions
In case NLMSG_DONE message is sent as a reply to doit action, multiple
kernel implementation do not send anything else than struct nlmsghdr.
Add this note to the Netlink intro documentation.
Suman Ghosh [Thu, 30 Nov 2023 03:43:24 +0000 (09:13 +0530)]
octeontx2-pf: TC flower offload support for mirror
This patch extends TC flower offload support for mirroring ingress
traffic to a different PF/VF. Below is an example command,
'tc filter add dev eth1 ingress protocol ip flower src_ip <ip-addr>
skip_sw action mirred ingress mirror dev eth2'
Signed-off-by: Suman Ghosh <sumang@marvell.com> Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Suman Ghosh [Thu, 30 Nov 2023 03:43:23 +0000 (09:13 +0530)]
octeontx2-af: Add new mbox to support multicast/mirror offload
A new mailbox is added to support offloading of multicast/mirror
functionality. The mailbox also supports dynamic updation of the
multicast/mirror list.
Signed-off-by: Suman Ghosh <sumang@marvell.com> Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sat, 2 Dec 2023 22:24:37 +0000 (22:24 +0000)]
Merge branch 'net-cacheline-optimizations'
Coco Li says:
====================
Analyze and Reorganize core Networking Structs to optimize cacheline consumption
Currently, variable-heavy structs in the networking stack is organized
chronologically, logically and sometimes by cacheline access.
This patch series attempts to reorganize the core networking stack
variables to minimize cacheline consumption during the phase of data
transfer. Specifically, we looked at the TCP/IP stack and the fast
path definition in TCP.
For documentation purposes, we also added new files for each core data
structure we considered, although not all ended up being modified due
to the amount of existing cacheline they span in the fast path. In
the documentation, we recorded all variables we identified on the
fast path and the reasons. We also hope that in the future when
variables are added/modified, the document can be referred to and
updated accordingly to reflect the latest variable organization.
Tested:
Our tests were run with neper tcp_rr using tcp traffic. The tests have $cpu
number of threads and variable number of flows (see below).
Tests were run on 6.5-rc1
Efficiency is computed as cpu seconds / throughput (one tcp_rr round trip).
The following result shows efficiency delta before and after the patch
series is applied.
v8 changes:
1. Update net_device_read_txrx cache group maximum
2. Update MAINTAINERS for documentations
3. Skip __cache_group variables in scripts/kernel-doc
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Coco Li [Wed, 29 Nov 2023 07:27:54 +0000 (07:27 +0000)]
netns-ipv4: reorganize netns_ipv4 fast path variables
Reorganize fast path variables on tx-txrx-rx order.
Fastpath cacheline ends after sysctl_tcp_rmem.
There are only read-only variables here. (write is on the control path
and not considered in this case)
Below data generated with pahole on x86 architecture.
Fast path variables span cache lines before change: 4
Fast path variables span cache lines after change: 2
Suggested-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Wei Wang <weiwan@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Coco Li <lixiaoyan@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Coco Li [Wed, 29 Nov 2023 07:27:53 +0000 (07:27 +0000)]
cache: enforce cache groups
Set up build time warnings to safeguard against future header changes of
organized structs.
Warning includes:
1) whether all variables are still in the same cache group
2) whether all the cache groups have the sum of the members size (in the
maximum condition, including all members defined in configs)
The __cache_group* variables are ignored in kernel-doc check in the
various header files they appear in to enforce the cache groups.
Suggested-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Coco Li <lixiaoyan@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Coco Li [Wed, 29 Nov 2023 07:27:52 +0000 (07:27 +0000)]
Documentations: Analyze heavily used Networking related structs
Analyzed a few structs in the networking stack by looking at variables
within them that are used in the TCP/IP fast path.
Fast path is defined as TCP path where data is transferred from sender to
receiver unidirectionally. It doesn't include phases other than
TCP_ESTABLISHED, nor does it look at error paths.
We hope to re-organizing variables that span many cachelines whose fast
path variables are also spread out, and this document can help future
developers keep networking fast path cachelines small.
Optimized_cacheline field is computed as
(Fastpath_Bytes/L3_cacheline_size_x86), and not the actual organized
results (see patches to come for these).
Note how there isn't much improvement space for inet_sock and
Inet_connection_sock because sk and icsk_inet respectively takes up so
much of the struct that rest of the variables become a small portion of
the struct size.
So, we decided to reorganize tcp_sock, net_device, netns_ipv4
Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Coco Li <lixiaoyan@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Niklas Söderlund [Wed, 29 Nov 2023 11:11:42 +0000 (12:11 +0100)]
net: ethernet: renesas: rcar_gen4_ptp: Depend on PTP_1588_CLOCK
When breaking out the Gen4 gPTP support to its own module the dependency
on the PTP_1588_CLOCK framework was left as optional and only stated for
the driver using the module. This leads to issues when doing
COMPILE_TEST of RENESAS_GEN4_PTP separately and PTP_1588_CLOCK is built
as a module and the other as a built-in. Add an explicit depend on
PTP_1588_CLOCK.
While at it remove the optional support for PTP_1588_CLOCK from
RENESAS_ETHER_SWITCH as the driver unconditionally calls the Gen4 gPTP
module and thus also requires the PTP_1588_CLOCK framework.
Reported-by: Arnd Bergmann <arnd@arndb.de> Fixes: 8c1c66235e03 ("net: ethernet: renesas: rcar_gen4_ptp: Break out to module") Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Link: https://lore.kernel.org/r/20231129111142.3322667-1-niklas.soderlund+renesas@ragnatech.se Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Document the IPA on the SM8650 Platform which uses version 5.5.1,
which is a minor revision of v5.5 found on SM8550, thus we can
use the SM8550 bindings as fallback since it shares the same
register mappings.
Shinas Rasheed [Wed, 29 Nov 2023 05:31:31 +0000 (21:31 -0800)]
octeon_ep: set backpressure watermark for RX queues
Set backpressure watermark for hardware RX queues. Backpressure
gets triggered when the available buffers of a hardware RX queue
falls below the set watermark. This backpressure will propagate
to packet processing pipeline in the OCTEON card, so that the host
receives fewer packets and prevents packet dropping at host.
Signed-off-by: Shinas Rasheed <srasheed@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Dan Carpenter [Tue, 28 Nov 2023 13:13:19 +0000 (16:13 +0300)]
octeon_ep: Fix error code in probe()
Set the error code if octep_ctrl_net_get_mtu() fails. Currently the code
returns success.
Fixes: 0a5f8534e398 ("octeon_ep: get max rx packet length from firmware") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Sathesh B Edara <sedara@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Shinas Rasheed [Wed, 29 Nov 2023 04:53:48 +0000 (20:53 -0800)]
octeon_ep: support OCTEON CN98 devices
Add PCI Endpoint NIC support for Octeon CN98 devices.
CN98 devices are part of Octeon 9 family products with
similar PCI NIC characteristics to CN93, already supported
driver.
Add CN98 card to the device id table, as well
as support differences in the register fields and
certain usage scenarios such as unload.
Andrew Halaney [Mon, 27 Nov 2023 21:41:10 +0000 (15:41 -0600)]
net: phy: mdio_device: Reset device only when necessary
Currently the phy reset sequence is as shown below for a
devicetree described mdio phy on boot:
1. Assert the phy_device's reset as part of registering
2. Deassert the phy_device's reset as part of registering
3. Deassert the phy_device's reset as part of phy_probe
4. Deassert the phy_device's reset as part of phy_hw_init
The extra two deasserts include waiting the deassert delay afterwards,
which is adding unnecessary delay.
This applies to both possible types of resets (reset controller
reference and a reset gpio) that can be used.
Here's some snipped tracing output using the following command line
params "trace_event=gpio:* trace_options=stacktrace" illustrating
the reset handling and where its coming from:
There's a lot of paths where the device is getting its reset
asserted and deasserted. Let's track the state and only actually
do the assert/deassert when it changes.
We've added 30 non-merge commits during the last 7 day(s) which contain
a total of 58 files changed, 1598 insertions(+), 154 deletions(-).
The main changes are:
1) Add initial TX metadata implementation for AF_XDP with support in mlx5
and stmmac drivers. Two types of offloads are supported right now, that
is, TX timestamp and TX checksum offload, from Stanislav Fomichev with
stmmac implementation from Song Yoong Siang.
2) Change BPF verifier logic to validate global subprograms lazily instead
of unconditionally before the main program, so they can be guarded using
BPF CO-RE techniques, from Andrii Nakryiko.
3) Add BPF link_info support for uprobe multi link along with bpftool
integration for the latter, from Jiri Olsa.
4) Use pkg-config in BPF selftests to determine ld flags which is
in particular needed for linking statically, from Akihiko Odaki.
5) Fix a few BPF selftest failures to adapt to the upcoming LLVM18,
from Yonghong Song.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (30 commits)
bpf/tests: Remove duplicate JSGT tests
selftests/bpf: Add TX side to xdp_hw_metadata
selftests/bpf: Convert xdp_hw_metadata to XDP_USE_NEED_WAKEUP
selftests/bpf: Add TX side to xdp_metadata
selftests/bpf: Add csum helpers
selftests/xsk: Support tx_metadata_len
xsk: Add option to calculate TX checksum in SW
xsk: Validate xsk_tx_metadata flags
xsk: Document tx_metadata_len layout
net: stmmac: Add Tx HWTS support to XDP ZC
net/mlx5e: Implement AF_XDP TX timestamp and checksum offload
tools: ynl: Print xsk-features from the sample
xsk: Add TX timestamp and TX checksum offload support
xsk: Support tx_metadata_len
selftests/bpf: Use pkg-config for libelf
selftests/bpf: Override PKG_CONFIG for static builds
selftests/bpf: Choose pkg-config for the target
bpftool: Add support to display uprobe_multi links
selftests/bpf: Add link_info test for uprobe_multi link
selftests/bpf: Use bpf_link__destroy in fill_link_info tests
...
====================
Conflicts:
Documentation/netlink/specs/netdev.yaml: 839ff60df3ab ("net: page_pool: add nlspec for basic access to page pools") 48eb03dd2630 ("xsk: Add TX timestamp and TX checksum offload support")
https://lore.kernel.org/all/20231201094705.1ee3cab8@canb.auug.org.au/
While at it also regen, tree is dirty after: 48eb03dd2630 ("xsk: Add TX timestamp and TX checksum offload support")
looks like code wasn't re-rendered after "render-max" was removed.
- dpaa2: recycle the RX buffer only after all processing done
- rswitch: fix missing dev_kfree_skb_any() in error path
Previous releases - always broken:
- ipv4: fix uaf issue when receiving igmp query packet
- wifi: mac80211: fix debugfs deadlock at device removal time
- bpf:
- sockmap: af_unix stream sockets need to hold ref for pair sock
- netdevsim: don't accept device bound programs
- selftests: fix a char signedness issue
- dsa: mv88e6xxx: fix marvell 6350 probe crash
- octeontx2-pf: restore TC ingress police rules when interface is up
- wangxun: fix memory leak on msix entry
- ravb: keep reverse order of operations in ravb_remove()"
* tag 'net-6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (51 commits)
net: ravb: Keep reverse order of operations in ravb_remove()
net: ravb: Stop DMA in case of failures on ravb_open()
net: ravb: Start TX queues after HW initialization succeeded
net: ravb: Make write access to CXR35 first before accessing other EMAC registers
net: ravb: Use pm_runtime_resume_and_get()
net: ravb: Check return value of reset_control_deassert()
net: libwx: fix memory leak on msix entry
ice: Fix VF Reset paths when interface in a failed over aggregate
bpf, sockmap: Add af_unix test with both sockets in map
bpf, sockmap: af_unix stream sockets need to hold ref for pair sock
tools: ynl-gen: always construct struct ynl_req_state
ethtool: don't propagate EOPNOTSUPP from dumps
ravb: Fix races between ravb_tx_timeout_work() and net related ops
r8169: prevent potential deadlock in rtl8169_close
r8169: fix deadlock on RTL8125 in jumbo mtu mode
neighbour: Fix __randomize_layout crash in struct neighbour
octeontx2-pf: Restore TC ingress police rules when interface is up
octeontx2-pf: Fix adding mbox work queue entry when num_vfs > 64
net: stmmac: xgmac: Disable FPE MMC interrupts
octeontx2-af: Fix possible buffer overflow
...
MMC host:
- cqhci: Fix CQE error recovery path
- sdhci-pci-gli: Fix initialization of LPM
- sdhci-sprd: Fix enabling/disabling of the vqmmc regulator"
* tag 'mmc-v6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
mmc: sdhci-sprd: Fix vqmmc not shutting down after the card was pulled
mmc: sdhci-pci-gli: Disable LPM during initialization
mmc: cqhci: Fix task clearing in CQE error recovery
mmc: cqhci: Warn of halt or task clear failure
mmc: block: Retry commands in CQE error recovery
mmc: block: Be sure to wait while busy in CQE error recovery
mmc: cqhci: Increase recovery halt timeout
mmc: block: Do not lose cache flush during CQE error recovery
Linus Torvalds [Thu, 30 Nov 2023 22:57:08 +0000 (07:57 +0900)]
Merge tag 'efi-urgent-for-v6.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi
Pull EFI fix from Ard Biesheuvel:
- Fix for EFI unaccepted memory handling
* tag 'efi-urgent-for-v6.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
efi/unaccepted: Fix off-by-one when checking for overlapping ranges
Eric Dumazet [Thu, 30 Nov 2023 09:22:59 +0000 (09:22 +0000)]
net: page_pool: fix general protection fault in page_pool_unlist
syzbot was able to trigger a crash [1] in page_pool_unlist()
page_pool_list() only inserts a page pool into a netdev page pool list
if a netdev was set in params.
Even if the kzalloc() call in page_pool_create happens to initialize
pool->user.list, I chose to be more explicit in page_pool_list()
adding one INIT_HLIST_NODE().
We could test in page_pool_unlist() if netdev was set,
but since netdev can be changed to lo, it seems more robust to
check if pool->user.list is hashed before calling hlist_del().
Fixes: 083772c9f972 ("net: page_pool: record pools per netdev") Reported-and-tested-by: syzbot+f9f8efb58a4db2ca98d0@syzkaller.appspotmail.com Signed-off-by: Eric Dumazet <edumazet@google.com> Tested-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/20231130092259.3797753-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
net: ethernet: Convert to platform remove callback returning void
in (implicit) v1 of this series
(https://lore.kernel.org/netdev/20231117091655.872426-1-u.kleine-koenig@pengutronix.de)
I tried to address the resource leaks in the three cpsw drivers. However
this is hard to get right without being able to test the changes. So
here comes a series that just converts all drivers below
drivers/net/ethernet to use .remove_new() and adds a comment about the
potential leaks for someone else to fix the problem.
See commit 5c5a7680e67b ("platform: Provide a remove callback that
returns no value") for an extended explanation and the eventual goal.
The TL;DR; is to prevent bugs like the three noticed here.
Note this series results in no change of behaviour apart from improving
the error message for the three cpsw drivers from
remove callback returned a non-zero value. This will be ignored.
to
Failed to resume device (-ESOMETHING)
====================
Uwe Kleine-König [Tue, 28 Nov 2023 17:38:28 +0000 (18:38 +0100)]
net: ethernet: ezchip: Convert to platform remove callback returning void
The .remove() callback for a platform driver returns an int which makes
many driver authors wrongly assume it's possible to do error handling by
returning an error code. However the value returned is ignored (apart
from emitting a warning) and this typically results in resource leaks.
To improve here there is a quest to make the remove callback return
void. In the first step of this quest all drivers are converted to
.remove_new(), which already returns void. Eventually after all drivers
are converted, .remove_new() will be renamed to .remove().
Trivially convert this driver from always returning zero in the remove
callback to the void returning variant.
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The .remove() callback for a platform driver returns an int which makes
many driver authors wrongly assume it's possible to do error handling by
returning an error code. However the value returned is ignored (apart
from emitting a warning) and this typically results in resource leaks.
To improve here there is a quest to make the remove callback return
void. In the first step of this quest all drivers are converted to
.remove_new(), which already returns void. Eventually after all drivers
are converted, .remove_new() will be renamed to .remove().
Replace the error path returning a non-zero value by an error message
and a comment that there is more to do. With that this patch results in
no change of behaviour in this driver apart from improving the error
message.
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Reviewed-by: Roger Quadros <rogerq@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>