Michal Swiatkowski [Thu, 19 Oct 2023 17:32:22 +0000 (10:32 -0700)]
ice: set MSI-X vector count on VF
Implement ops needed to set MSI-X vector count on VF.
sriov_get_vf_total_msix() should return total number of MSI-X that can
be used by the VFs. Return the value set by devlink resources API
(pf->req_msix.vf).
sriov_set_msix_vec_count() will set number of MSI-X on particular VF.
Disable VF register mapping, rebuild VSI with new MSI-X and queues
values and enable new VF register mapping.
For best performance set number of queues equal to number of MSI-X.
Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Michal Swiatkowski [Thu, 19 Oct 2023 17:32:21 +0000 (10:32 -0700)]
ice: add bitmap to track VF MSI-X usage
Create a bitmap to track MSI-X usage for VFs. The bitmap has the size of
total MSI-X amount on device, because at init time the amount of MSI-X
used by VFs isn't known.
The bitmap is used in a follow-up patchset to provide a contiguous block
of MSI-X indexes for each created VF.
Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
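The idea of handing each VF a contiguous block of MSI-X indexes from a device-wide bitmap can be sketched as below. This is a minimal user-space illustration, not the ice driver's code: `DEMO_TOTAL_MSIX`, `struct demo_msix_map`, `demo_msix_alloc_range()` and `demo_msix_free_range()` are hypothetical names, and a plain bool array stands in for the kernel bitmap helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_TOTAL_MSIX 64 /* hypothetical device-wide MSI-X count */

/* One flag per MSI-X index; true means the index is in use. */
struct demo_msix_map {
	bool used[DEMO_TOTAL_MSIX];
};

/*
 * Find the first run of `count` free indexes, mark it used and return
 * its first index. Returns -1 when no contiguous run is available.
 */
int demo_msix_alloc_range(struct demo_msix_map *map, size_t count)
{
	size_t start, i;

	assert(map != NULL);
	if (count == 0 || count > DEMO_TOTAL_MSIX)
		return -1;

	for (start = 0; start + count <= DEMO_TOTAL_MSIX; start++) {
		for (i = 0; i < count; i++)
			if (map->used[start + i])
				break;
		if (i == count) {
			/* Whole run is free: claim it. */
			for (i = 0; i < count; i++)
				map->used[start + i] = true;
			return (int)start;
		}
	}
	return -1;
}

/* Release a previously allocated run so another VF can reuse it. */
void demo_msix_free_range(struct demo_msix_map *map, size_t start, size_t count)
{
	size_t i;

	assert(map != NULL);
	for (i = 0; i < count && start + i < DEMO_TOTAL_MSIX; i++)
		map->used[start + i] = false;
}
```

Sizing the map to the whole device, as the commit describes, means the allocator never needs to know up front how many vectors the VFs will consume.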
Michal Swiatkowski [Thu, 19 Oct 2023 17:32:20 +0000 (10:32 -0700)]
ice: implement num_msix field per VF
Store the amount of MSI-X per VF instead of storing it in pf struct. It
is used to calculate number of q_vectors (and queues) for VF VSI.
This is necessary because with follow up changes the number of MSI-X can
be different between VFs. Use it instead of using pf->vf_msix value in
all cases.
Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Michal Swiatkowski [Thu, 19 Oct 2023 17:32:18 +0000 (10:32 -0700)]
ice: add drop rule matching on not active lport
Inactive LAG port should not receive any packets, as it can cause adding
invalid FDBs (bridge offload). Add a drop rule matching on inactive lport
in LAG.
Reviewed-by: Simon Horman <horms@kernel.org> Co-developed-by: Marcin Szycik <marcin.szycik@intel.com> Signed-off-by: Marcin Szycik <marcin.szycik@intel.com> Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Przemek Kitszel [Thu, 19 Oct 2023 17:32:17 +0000 (10:32 -0700)]
ice: remove unused ice_flow_entry fields
Remove ::entry and ::entry_sz fields of &ice_flow_entry,
as they were never set.
Suggested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Thu, 19 Oct 2023 15:28:15 +0000 (08:28 -0700)]
ethtool: untangle the linkmode and ethtool headers
Commit 26c5334d344d ("ethtool: Add forced speed to supported link
modes maps") added a dependency between ethtool.h and linkmode.h.
The dependency in the opposite direction already exists so the
new code was inserted in an awkward place.
The reason for ethtool.h to include linkmode.h is that
ethtool_forced_speed_maps_init() is a static inline helper.
That's not really necessary.
Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Paul Greenwalt <paul.greenwalt@intel.com> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Heng Guo [Thu, 19 Oct 2023 01:20:53 +0000 (09:20 +0800)]
net: fix IPSTATS_MIB_OUTPKGS increment in OutForwDatagrams.
Reproduce environment:
A network of 3 Linux VMs is connected as below:
VM1<---->VM2(latest kernel 6.5.0-rc7)<---->VM3
VM1: eth0 ip: 192.168.122.207 MTU 1500
VM2: eth0 ip: 192.168.122.208, eth1 ip: 192.168.123.224 MTU 1500
VM3: eth0 ip: 192.168.123.240 MTU 1500
Reproduce:
VM1 sends 1400 bytes of UDP data to VM3 using scapy with flags=0.
scapy command:
send(IP(dst="192.168.123.240",flags=0)/UDP()/str('0'*1400),count=1,
inter=1.000000)
Result:
Before IP data is sent.
----------------------------------------------------------------------
root@qemux86-64:~# cat /proc/net/snmp
Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors
ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests
OutDiscards OutNoRoutes ReasmTimeout ReasmReqds ReasmOKs ReasmFails
FragOKs FragFails FragCreates
Ip: 1 64 11 0 3 4 0 0 4 7 0 0 0 0 0 0 0 0 0
......
----------------------------------------------------------------------
After IP data is sent.
----------------------------------------------------------------------
root@qemux86-64:~# cat /proc/net/snmp
Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors
ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests
OutDiscards OutNoRoutes ReasmTimeout ReasmReqds ReasmOKs ReasmFails
FragOKs FragFails FragCreates
Ip: 1 64 12 0 3 5 0 0 4 8 0 0 0 0 0 0 0 0 0
......
----------------------------------------------------------------------
"ForwDatagrams" increases from 4 to 5 and "OutRequests" also increases
from 7 to 8.
Issue description and patch:
IPSTATS_MIB_OUTPKTS("OutRequests") is counted with IPSTATS_MIB_OUTOCTETS
("OutOctets") in ip_finish_output2().
According to RFC 4293, "OutOctets" is counted together with "OutTransmits",
not with "OutRequests". "OutRequests" does not include any datagrams counted
in "ForwDatagrams".
ipSystemStatsOutOctets OBJECT-TYPE
DESCRIPTION
"The total number of octets in IP datagrams delivered to the
lower layers for transmission. Octets from datagrams
counted in ipIfStatsOutTransmits MUST be counted here.
ipSystemStatsOutRequests OBJECT-TYPE
DESCRIPTION
"The total number of IP datagrams that local IP user-
protocols (including ICMP) supplied to IP in requests for
transmission. Note that this counter does not include any
datagrams counted in ipSystemStatsOutForwDatagrams.
So this patch redefines IPSTATS_MIB_OUTPKTS as "OutTransmits" and adds
IPSTATS_MIB_OUTREQUESTS for "OutRequests".
Add IPSTATS_MIB_OUTREQUESTS counter in __ip_local_out() for ipv4 and add
IPSTATS_MIB_OUT counter in ip6_finish_output2() for ipv6.
Signed-off-by: Heng Guo <heng.guo@windriver.com> Reviewed-by: Kun Song <Kun.Song@windriver.com> Reviewed-by: Filip Pudak <filip.pudak@windriver.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
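The counter split described above can be modelled with a small user-space sketch. It is an illustration of the RFC 4293 semantics only, not the kernel's SNMP MIB code: `struct demo_ip_mib` and the `demo_ip_*()` helpers are hypothetical names. Locally originated datagrams bump "OutRequests", forwarded ones bump "ForwDatagrams", and both paths bump "OutTransmits" and "OutOctets" when handed to the lower layer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct demo_ip_mib {
	uint64_t out_requests;   /* "OutRequests": locally originated datagrams */
	uint64_t forw_datagrams; /* "ForwDatagrams": forwarded datagrams */
	uint64_t out_transmits;  /* "OutTransmits": datagrams given to lower layers */
	uint64_t out_octets;     /* "OutOctets": octets given to lower layers */
};

/* Common transmit step shared by the local and the forwarding path. */
void demo_ip_finish_output(struct demo_ip_mib *mib, uint32_t len)
{
	assert(mib != NULL);
	mib->out_transmits++;
	mib->out_octets += len;
}

/* A local user protocol (UDP, ICMP, ...) asks IP to send a datagram. */
void demo_ip_local_out(struct demo_ip_mib *mib, uint32_t len)
{
	mib->out_requests++;
	demo_ip_finish_output(mib, len);
}

/* A received datagram is forwarded; it must not count as an OutRequest. */
void demo_ip_forward(struct demo_ip_mib *mib, uint32_t len)
{
	mib->forw_datagrams++;
	demo_ip_finish_output(mib, len);
}
```

With this split, the forwarding test from the reproduction would raise "ForwDatagrams" and "OutTransmits" while leaving "OutRequests" unchanged.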
David S. Miller [Fri, 20 Oct 2023 10:50:46 +0000 (11:50 +0100)]
Merge branch 'ksz886x-forced-link-modes'
Oleksij Rempel says:
====================
fix forced link mode for KSZ886X switches
changes v3:
- squash patch 1 and 2
- use genphy_config_aneg() instead of genphy_setup_forced()
changes v2:
- address kernel test robot warning
- change comment explaining clearing of KSZ886X_CTRL_FORCE_LINK bit
- s/PHY we create/PHY will create/
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Oleksij Rempel [Thu, 19 Oct 2023 11:14:59 +0000 (13:14 +0200)]
net: phy: micrel: Fix forced link mode for KSZ886X switches
Address a link speed detection issue in KSZ886X PHY driver when in
forced link mode. Previously, link partners like "ASIX AX88772B"
with KSZ8873 could fall back to 10Mbit instead of configured 100Mbit.
The issue arises as KSZ886X PHY continues sending Fast Link Pulses (FLPs)
even with autonegotiation off, misleading link partners in autoneg mode,
leading to incorrect link speed detection.
Now, when autonegotiation is disabled, the driver sets the link state
forcefully using KSZ886X_CTRL_FORCE_LINK bit. This action, beyond just
disabling autonegotiation, makes the PHY state more reliably detected by
link partners using parallel detection, thus fixing the link speed
misconfiguration.
With autonegotiation enabled, link state is not forced, allowing proper
autonegotiation process participation.
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Divya Koppera <divya.koppera@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
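The rule described above (force the link only while autonegotiation is off) amounts to setting or clearing one control bit. The sketch below shows that logic in isolation, under assumptions: `DEMO_CTRL_FORCE_LINK` and its bit position are placeholders chosen for illustration, and `demo_phy_ctrl_for_aneg()` is a hypothetical helper, not a function from micrel.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit position, chosen for illustration only. */
#define DEMO_CTRL_FORCE_LINK (1u << 3)

static_assert(DEMO_CTRL_FORCE_LINK <= UINT16_MAX,
	      "bit must fit the 16-bit control register");

/*
 * Return the new control register value for the requested mode:
 * the force-link bit is cleared with autonegotiation enabled and
 * set with autonegotiation disabled. Other bits are preserved.
 */
uint16_t demo_phy_ctrl_for_aneg(uint16_t ctrl, bool autoneg_enabled)
{
	if (autoneg_enabled)
		return (uint16_t)(ctrl & ~DEMO_CTRL_FORCE_LINK);
	return (uint16_t)(ctrl | DEMO_CTRL_FORCE_LINK);
}
```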
Oleksij Rempel [Thu, 19 Oct 2023 11:14:58 +0000 (13:14 +0200)]
net: dsa: microchip: ksz8: Enable MIIM PHY Control reg access
Provide access to MIIM PHY Control register (Reg. 31) through
ksz8_r_phy_ctrl() and ksz8_w_phy_ctrl() functions. Necessary for
upcoming micrel.c patch to address forced link mode configuration.
Closes: https://lore.kernel.org/oe-kbuild-all/202310112224.iYgvjBUy-lkp@intel.com/ Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 20 Oct 2023 10:47:51 +0000 (11:47 +0100)]
Merge branch 'mlxsw-lag-table-allocation'
Petr Machata says:
====================
mlxsw: Move allocation of LAG table to the driver
PGT is an in-HW table that maps addresses to sets of ports. Then when some
HW process needs a set of ports as an argument, instead of embedding the
actual set in the dynamic configuration, what gets configured is the
address referencing the set. The HW then works with the appropriate PGT
entry.
Within the PGT is placed a LAG table. That is a contiguous block of PGT
memory where each entry describes which ports are members of the
corresponding LAG port.
The PGT is split to two parts: one managed by the FW, and one managed by
the driver. Historically, the FW part included also the LAG table, referred
to as FW LAG mode. Giving the responsibility for placement of the LAG table
to the driver, referred to as SW LAG mode, makes the whole system more
flexible. The FW currently supports both FW and SW LAG modes. To shed
complexity, the FW should in the future only support SW LAG mode.
Hence this patchset, where support for placement of LAG is added to mlxsw.
There are FW versions out there that do not support SW LAG mode, and on
Spectrum-1 in particular, there is no plan to support it at all. mlxsw will
therefore have to support both modes of operation.
Another aspect is that at least on Spectrum-1, there are FW versions out
there that claim to support driver-placed LAG table, but then reject or
ignore configurations enabling the same. The driver thus has to have a say
in whether an attempt to configure SW LAG mode should even be done.
The feature is therefore expressed in terms of "does the driver prefer SW
LAG mode?", and "what LAG mode the PCI module managed to configure the FW
with". This is unlike current flood mode configuration, where the driver
can give a strict value, and that's what gets configured. But it gives a
chance to the driver to determine whether LAG mode should be enabled at
all.
The "does the driver prefer SW LAG mode?" bit is expressed as a boolean
lag_mode_prefer_sw. The reason for this is largely another feature that
will be introduced in a follow-up patchset: support for CFF flood mode. The
driver currently requires that the FW be configured with what is called
controlled flood mode. But on capable systems, CFF would be preferred. So
there are two values in flight: the preferred flood mode, and the fallback.
This could be expressed with an array of flood modes ordered by preference,
but that looks like an overkill in comparison. This flag/value model is
then reused for LAG mode as well, except the fallback value is absent and
implied to be FW, because there are no other values to choose from.
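The flag/value model for LAG mode can be sketched as follows. This is a simplified illustration under assumptions: `enum demo_lag_mode`, `struct demo_fw_caps` and `demo_lag_mode_negotiate()` are hypothetical names, and a single capability bit stands in for whatever the PCI module actually queries from the FW.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum demo_lag_mode {
	DEMO_LAG_MODE_FW = 0, /* FW places the LAG table (implied fallback) */
	DEMO_LAG_MODE_SW = 1, /* driver places the LAG table */
};

/* Hypothetical summary of what the FW reports about itself. */
struct demo_fw_caps {
	bool sw_lag_supported;
};

/*
 * The driver only states a preference. SW mode is chosen when the
 * driver prefers it and the FW supports it; otherwise the result is
 * the FW mode, since there is no other value to fall back to.
 */
enum demo_lag_mode demo_lag_mode_negotiate(bool prefer_sw,
					   const struct demo_fw_caps *caps)
{
	assert(caps != NULL);
	if (prefer_sw && caps->sw_lag_supported)
		return DEMO_LAG_MODE_SW;
	return DEMO_LAG_MODE_FW;
}
```

A driver for a chip whose FW misreports the capability would simply leave the preference unset, so no attempt to configure SW mode is made.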
The patchset progresses as follows:
- Patches #1 to #5 adjust reg.h and cmd.h with new register fields,
constants and remarks.
- Patches #6 and #7 add the ability to request SW LAG mode and to query the
LAG mode that was actually negotiated. This is where the abovementioned
lag_mode_prefer_sw flag is added.
- Patches #7 to #9 generalize PGT allocations to make it possible to
allocate the LAG table, which is done in patch #10.
- In patch #11, toggle lag_mode_prefer_sw on Spectrum-2 and above, which
makes the newly-added code live.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Thu, 19 Oct 2023 10:27:20 +0000 (12:27 +0200)]
mlxsw: spectrum: Set SW LAG mode on Spectrum>1
On Spectrum-2, Spectrum-3 and Spectrum-4 machines, request SW
responsibility for placement of the LAG table.
On Spectrum-1, some FW versions claim to support lag_mode field despite
quietly ignoring any settings made to that field. Thus refrain from
attempting to configure lag_mode on those systems at all.
Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Thu, 19 Oct 2023 10:27:19 +0000 (12:27 +0200)]
mlxsw: spectrum: Allocate LAG table when in SW LAG mode
In this patch, if the LAG mode is SW, allocate the LAG table and configure
SGCR to indicate where it was allocated.
We use the default "DDD" (for dynamic data duplication) layout of the LAG
table. In the DDD mode, the membership information for each LAG is copied
in 8 PGT entries. This is done for performance reasons. The LAG table then
needs to be allocated on an address aligned to 8. Deal with this by
moving the LAG init ahead so that the LAG table is allocated at address 0.
Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
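The alignment arithmetic behind the DDD layout can be sketched as below. The helper names and `DEMO_LAG_DDD_COPIES` are hypothetical, and the table-size formula (8 entries per LAG) is an assumption derived from the description above rather than taken from the driver.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_LAG_DDD_COPIES 8u /* copies of each LAG's membership entry */

/* Round addr up to the next multiple of align (a power of two). */
uint32_t demo_align_up(uint32_t addr, uint32_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (addr + align - 1) & ~(align - 1);
}

/* PGT entries needed when each LAG's membership is stored 8 times. */
uint32_t demo_lag_table_entries(uint32_t max_lags)
{
	return max_lags * DEMO_LAG_DDD_COPIES;
}

/* First usable base address for the LAG table at or after first_free. */
uint32_t demo_lag_table_base(uint32_t first_free)
{
	return demo_align_up(first_free, DEMO_LAG_DDD_COPIES);
}
```

Allocating the LAG table before anything else sidesteps the rounding entirely: address 0 is already a multiple of 8, so no PGT entries are lost to padding.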
Petr Machata [Thu, 19 Oct 2023 10:27:18 +0000 (12:27 +0200)]
mlxsw: spectrum_pgt: Generalize PGT allocation
PGT blocks are allocated through the function
mlxsw_sp_pgt_mid_alloc_range(). The interface assumes that the caller knows
which piece of PGT exactly they want to get. That was fine while the FID
code was the only client allocating blocks of PGT. However for SW-allocated
LAG table, there will be an additional client: mlxsw_sp_lag_init(). The
interface should therefore be changed to not require particular
coordinates, but to take just the requested size, allocate the block
wherever, and give back the PGT address.
In this patch, change the interface accordingly. Initialize FID family's
pgt_base from the result of the PGT allocation (note that mlxsw makes a
copy of the family structure, so what gets initialized is not actually the
global structure). Drop the now-unnecessary pgt_base initializations and
the corresponding defines.
Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
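The interface change described above, from "give me this exact range" to "give me a range of this size and tell me where it is", can be sketched with a trivial bump allocator. This is an illustration only: `struct demo_pgt` and `demo_pgt_alloc_range()` are hypothetical, and the real allocator in spectrum_pgt.c is not assumed to work this way internally.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct demo_pgt {
	uint32_t size;      /* total number of PGT entries */
	uint32_t next_free; /* first entry not yet handed out */
};

/*
 * Allocate `count` consecutive entries. The caller passes only the
 * size; the chosen base address is returned through p_base.
 */
int demo_pgt_alloc_range(struct demo_pgt *pgt, uint32_t count,
			 uint32_t *p_base)
{
	assert(pgt != NULL && p_base != NULL);
	if (count == 0)
		return -EINVAL;
	if (count > pgt->size - pgt->next_free)
		return -ENOSPC;

	*p_base = pgt->next_free;
	pgt->next_free += count;
	return 0;
}
```

Each client (the LAG init code, then each FID family) stores the returned base instead of relying on compile-time coordinates.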
Petr Machata [Thu, 19 Oct 2023 10:27:17 +0000 (12:27 +0200)]
mlxsw: spectrum_fid: Allocate PGT for the whole FID family in one go
PGT blocks are allocated through the function
mlxsw_sp_pgt_mid_alloc_range(). The interface assumes that the caller knows
which piece of PGT exactly they want to get. That was fine while the FID
code was the only client allocating blocks of PGT. However for SW-allocated
LAG table, there will be an additional client: mlxsw_sp_lag_init(). The
interface should therefore be changed to not require particular
coordinates, but to take just the requested size, allocate the block
wherever, and give back the PGT address.
The current FID mode has one place where PGT address can be stored: the FID
family's pgt_base. The allocation scheme should therefore be changed from
allocating a block per FID flood table, to allocating a block per FID
family.
Do just that in this patch.
The per-family allocation is going to be useful for another related feature
as well: the CFF mode.
Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Thu, 19 Oct 2023 10:27:16 +0000 (12:27 +0200)]
mlxsw: pci: Permit toggling LAG mode
Add to struct mlxsw_config_profile a field lag_mode_prefer_sw for the
driver to indicate that SW LAG mode should be configured if possible. Add
to the PCI module code to set lag_mode as appropriate.
Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Thu, 19 Oct 2023 10:27:15 +0000 (12:27 +0200)]
mlxsw: core, pci: Add plumbing related to LAG mode
lag_mode describes where the responsibility for LAG table placement lies:
SW or FW. The bus module determines whether LAG is supported, can configure
it if it is, and knows what (if any) configuration has been applied.
Therefore add a bus callback to determine the configured LAG mode. Also add
to core an API to query it.
The LAG mode is for now kept at the default value of 0 for FW-managed. The
code to actually toggle it will be added later.
Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Thu, 19 Oct 2023 10:27:13 +0000 (12:27 +0200)]
mlxsw: cmd: Add CONFIG_PROFILE.{set_, }lag_mode
Add CONFIG_PROFILE.lag_mode, which serves for moving responsibility for
placement of the LAG table from FW to SW. Whether lag_mode should be
configured is determined by CONFIG_PROFILE.set_lag_mode, which is added as well.
Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Thu, 19 Oct 2023 10:27:12 +0000 (12:27 +0200)]
mlxsw: cmd: Fix omissions in CONFIG_PROFILE field names in comments
A number of CONFIG_PROFILE fields' comments refer to a field named like
cmd_mbox_config_* instead of cmd_mbox_config_profile_*. Correct these
omissions.
Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Thu, 19 Oct 2023 10:27:11 +0000 (12:27 +0200)]
mlxsw: reg: Add SGCR.lag_lookup_pgt_base
Add SGCR.lag_lookup_pgt_base, which is used for configuring the base
address of the LAG table within the PGT table for cases when the driver
is responsible for the table placement.
Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Thu, 19 Oct 2023 10:27:10 +0000 (12:27 +0200)]
mlxsw: reg: Drop SGCR.llb
SGCR, Switch General Configuration Register, has not been used since
commit b0d80c013b04 ("mlxsw: Remove Mellanox SwitchX-2 ASIC support"). We will
need the register again shortly, so instead of dropping it and
reintroducing it later, just drop the sole unused field.
Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 20 Oct 2023 10:43:36 +0000 (11:43 +0100)]
Merge branch 'netlink-auto-integers'
Jakub Kicinski says:
====================
netlink: add variable-length / auto integers
Add netlink support for "common" / variable-length / auto integers
which are carried at the message level as either 4B or 8B depending
on the exact value. This saves space and will hopefully decrease
the number of instances where we realize that we needed more bits
after uAPI is set in stone. It also loosens the alignment requirements,
avoiding the need for padding.
This mini-series is a fuller version of the previous RFC:
https://lore.kernel.org/netdev/20121204.130914.1457976839967676240.davem@davemloft.net/
No user included here. I have tested (and will use) it
in the upcoming page pool API but the assumption is that
it will be widely applicable. So sending without a user.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Wed, 18 Oct 2023 21:39:21 +0000 (14:39 -0700)]
netlink: specs: add support for auto-sized scalars
Support uint / sint types in specs and YNL.
Signed-off-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Wed, 18 Oct 2023 21:39:20 +0000 (14:39 -0700)]
netlink: add variable-length / auto integers
We currently push everyone to use padding to align 64b values
in netlink. Un-padded nla_put_u64() doesn't even exist any more.
The story behind this possibly starts with this thread:
https://lore.kernel.org/netdev/20121204.130914.1457976839967676240.davem@davemloft.net/
where DaveM was concerned about the alignment of a structure
containing 64b stats. If user space tries to access such struct
directly:
lack of alignment may become problematic for some architectures.
These days we most often put every single member in a separate
attribute, meaning that the code above would use a helper like
nla_get_u64(), which can deal with alignment internally.
Even for arches which don't have good unaligned access - access
aligned to 4B should be pretty efficient.
Kernel and well known libraries deal with unaligned input already.
Padded 64b is quite space-inefficient (64b + pad means at worst 16B
per attr vs 32b which takes 8B). It is also more typing:
if (nla_put_u64_pad(rsp, NETDEV_A_SOMETHING_SOMETHING,
value, NETDEV_A_SOMETHING_PAD))
Create a new attribute type which will use 32 bits at netlink
level if value is small enough (probably most of the time?),
and (4B-aligned) 64 bits otherwise. Kernel API is just:
if (nla_put_uint(rsp, NETDEV_A_SOMETHING_SOMETHING, value))
Calling this new type "just" sint / uint with no specific size
will hopefully also make people more comfortable with using it.
Currently telling people "don't use u8, you may need the bits,
and netlink will round up to 4B, anyway" is the #1 comment
we give to newcomers.
Signed-off-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
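The size selection for an auto-sized unsigned integer can be sketched in user space as below. It models only the attribute payload (not the netlink attribute header) and uses hypothetical helper names, `demo_put_uint()` and `demo_get_uint()`; it is not the kernel's nla_put_uint() implementation. Values travel in host byte order, and memcpy() is used so that a 4B-aligned 64-bit payload is read safely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Write value into buf using 4 bytes when it fits in 32 bits and
 * 8 bytes otherwise. buf must have room for 8 bytes. Returns the
 * number of payload bytes written.
 */
size_t demo_put_uint(uint8_t *buf, uint64_t value)
{
	assert(buf != NULL);
	if (value <= UINT32_MAX) {
		uint32_t v32 = (uint32_t)value;

		memcpy(buf, &v32, sizeof(v32));
		return sizeof(v32);
	}
	memcpy(buf, &value, sizeof(value));
	return sizeof(value);
}

/*
 * Read a payload of len bytes back into a 64-bit value. The reader
 * picks the width from the payload length. Returns 0 on success and
 * -1 for any length other than 4 or 8.
 */
int demo_get_uint(const uint8_t *buf, size_t len, uint64_t *out)
{
	assert(buf != NULL && out != NULL);
	if (len == sizeof(uint32_t)) {
		uint32_t v32;

		memcpy(&v32, buf, sizeof(v32));
		*out = v32;
		return 0;
	}
	if (len == sizeof(uint64_t)) {
		memcpy(out, buf, sizeof(*out));
		return 0;
	}
	return -1;
}
```

The receiver always works with a 64-bit value, so the on-wire width stays an encoding detail rather than part of the uAPI.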
David S. Miller [Fri, 20 Oct 2023 10:34:51 +0000 (11:34 +0100)]
Merge branch 'devlink-errors-fmsg'
Przemek Kitszel says:
====================
devlink: retain error in struct devlink_fmsg
Extend devlink fmsg to retain error (patch 1),
so drivers could omit error checks after devlink_fmsg_*() (patches 2-10),
and finally enforce future uses to follow this practice by change to
return void (patch 11)
Przemek Kitszel [Wed, 18 Oct 2023 20:26:47 +0000 (22:26 +0200)]
devlink: convert most of devlink_fmsg_*() to return void
Since struct devlink_fmsg retains error by now (see 1st patch of this
series), there is no longer need to keep returning it in each call.
This is a separate commit to allow per-driver conversion to stop using
those return values.
Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Przemek Kitszel [Wed, 18 Oct 2023 20:26:46 +0000 (22:26 +0200)]
staging: qlge: devlink health: use retained error fmsg API
Drop unneeded error checking.
devlink_fmsg_*() family of functions is now retaining errors,
so there is no need to check for them after each call.
Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Przemek Kitszel [Wed, 18 Oct 2023 20:26:45 +0000 (22:26 +0200)]
qed: devlink health: use retained error fmsg API
Drop unneeded error checking.
devlink_fmsg_*() family of functions is now retaining errors,
so there is no need to check for them after each call.
Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Przemek Kitszel [Wed, 18 Oct 2023 20:26:39 +0000 (22:26 +0200)]
pds_core: devlink health: use retained error fmsg API
Drop unneeded error checking.
devlink_fmsg_*() family of functions is now retaining errors,
so there is no need to check for them after each call.
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Shannon Nelson <shannon.nelson@amd.com> Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Przemek Kitszel [Wed, 18 Oct 2023 20:26:38 +0000 (22:26 +0200)]
netdevsim: devlink health: use retained error fmsg API
Drop unneeded error checking.
devlink_fmsg_*() family of functions is now retaining errors,
so there is no need to check for them after each call.
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Przemek Kitszel [Wed, 18 Oct 2023 20:26:37 +0000 (22:26 +0200)]
devlink: retain error in struct devlink_fmsg
Retain error value in struct devlink_fmsg, to relieve drivers from
checking it after each call.
Note that fmsg is an in-memory builder/buffer of formatted message,
so it's not the case that half baked message was sent somewhere.
We could find following scheme in multiple drivers:
err = devlink_fmsg_obj_nest_start(fmsg);
if (err)
return err;
err = devlink_fmsg_string_pair_put(fmsg, "src", src);
if (err)
return err;
err = devlink_fmsg_something(fmsg, foo, bar);
if (err)
return err;
// and so on...
err = devlink_fmsg_obj_nest_end(fmsg);
With retaining error API that translates to:
devlink_fmsg_obj_nest_start(fmsg);
devlink_fmsg_string_pair_put(fmsg, "src", src);
devlink_fmsg_something(fmsg, foo, bar);
// and so on...
devlink_fmsg_obj_nest_end(fmsg);
Which means we check for an error only when it is time to send.
Possible error scenarios are developer error (API misuse) and memory
exhaustion, both cases are good candidates to choose readability
over fastest possible exit.
Note that this patch keeps returning errors, to allow per-driver conversion
to the new API, but those return values are no longer needed at this point.
This commit itself is an illustration of benefits for the dev-user,
more of it will be in separate commits of the series.
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
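The retained-error pattern itself is easy to show outside of devlink. The sketch below is a generic user-space illustration with hypothetical names (`struct demo_fmsg`, `demo_fmsg_put()`, `demo_fmsg_finish()`) and a fixed-size buffer standing in for the real message storage; it is not the devlink implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

struct demo_fmsg {
	int err;      /* first error seen, 0 if none */
	size_t len;   /* bytes used in buf, excluding the terminator */
	char buf[64]; /* in-memory message under construction */
};

void demo_fmsg_init(struct demo_fmsg *f)
{
	assert(f != NULL);
	f->err = 0;
	f->len = 0;
	f->buf[0] = '\0';
}

/*
 * Append a string. Once an error has been recorded, every later call
 * is a no-op, so callers need not check after each step.
 */
void demo_fmsg_put(struct demo_fmsg *f, const char *s)
{
	size_t n;

	assert(f != NULL && s != NULL);
	if (f->err)
		return;

	n = strlen(s);
	if (n > sizeof(f->buf) - 1 - f->len) {
		f->err = -ENOMEM; /* stands in for memory exhaustion */
		return;
	}
	memcpy(f->buf + f->len, s, n);
	f->len += n;
	f->buf[f->len] = '\0';
}

/* Single check at the point where the message would be sent. */
int demo_fmsg_finish(const struct demo_fmsg *f)
{
	assert(f != NULL);
	return f->err;
}
```

Because the message is only a buffer until it is sent, a failed build leaves nothing half-delivered; the caller sees the first error at demo_fmsg_finish().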
====================
tools: ynl-gen: support full range of min/max checks
YNL code gen currently supports only very simple range checks
within the range of s16. Add support for full range of u64 / s64
which is good to have, and will be even more important with uint / sint.
====================
Jakub Kicinski [Wed, 18 Oct 2023 16:39:16 +0000 (09:39 -0700)]
tools: ynl-gen: support full range of min/max checks for integer values
Extend the support to full range of min/max checks.
None of the existing YNL families required complex integer validation.
The support is less than trivial, because we try to keep struct nla_policy
tiny, so the min/max members it holds in place are s16. This means we can only
express checks in range of s16. For larger ranges we need to define
a structure and link it in the policy.
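The two representations can be sketched as follows: small ranges stay inline as s16 values, larger ones are referenced through a pointer to a separate structure. All names here (`struct demo_range64`, `struct demo_policy`, `demo_policy_make()`, `demo_policy_check()`) are hypothetical and only model the idea; the layout of the real struct nla_policy is not reproduced.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct demo_range64 {
	int64_t min;
	int64_t max;
};

enum demo_range_kind { DEMO_RANGE_INLINE, DEMO_RANGE_PTR };

struct demo_policy {
	enum demo_range_kind kind;
	union {
		struct {
			int16_t min;
			int16_t max;
		} inl;                          /* fits in the policy itself */
		const struct demo_range64 *ptr; /* linked external range */
	} u;
};

/* Build a policy, keeping the range inline whenever s16 can hold it. */
struct demo_policy demo_policy_make(const struct demo_range64 *r)
{
	struct demo_policy p;

	assert(r != NULL && r->min <= r->max);
	if (r->min >= INT16_MIN && r->max <= INT16_MAX) {
		p.kind = DEMO_RANGE_INLINE;
		p.u.inl.min = (int16_t)r->min;
		p.u.inl.max = (int16_t)r->max;
	} else {
		p.kind = DEMO_RANGE_PTR;
		p.u.ptr = r; /* r must outlive the policy */
	}
	return p;
}

/* Validate a value against whichever representation the policy uses. */
bool demo_policy_check(const struct demo_policy *p, int64_t v)
{
	assert(p != NULL);
	if (p->kind == DEMO_RANGE_INLINE)
		return v >= p->u.inl.min && v <= p->u.inl.max;
	return v >= p->u.ptr->min && v <= p->u.ptr->max;
}
```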
Jakub Kicinski [Wed, 18 Oct 2023 16:39:15 +0000 (09:39 -0700)]
tools: ynl-gen: track attribute use
For range validation we'll need to know if any individual
attribute is used on input (i.e. whether we will generate
a policy for it). Track this information.
Linus Torvalds [Thu, 19 Oct 2023 19:08:18 +0000 (12:08 -0700)]
Merge tag 'net-6.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from bluetooth, netfilter, WiFi.
Feels like an up-tick in regression fixes, mostly for older releases.
The hfsc fix, tcp_disconnect() and Intel WWAN fixes stand out as
fairly clear-cut user reported regressions. The mlx5 DMA bug was
causing strife for s390x folks. The fixes themselves are not
particularly scary, tho. No open investigations / outstanding reports
at the time of writing.
Current release - regressions:
- eth: mlx5: perform DMA operations in the right locations, make
devices usable on s390x, again
- sched: sch_hfsc: upgrade 'rt' to 'sc' when it becomes an inner
curve, previous fix of rejecting invalid config broke some scripts
- rfkill: reduce data->mtx scope in rfkill_fop_open, avoid deadlock
- revert "ethtool: Fix mod state of verbose no_mask bitset", needs
more work
Current release - new code bugs:
- tcp: fix listen() warning with v4-mapped-v6 address
Previous releases - regressions:
- tcp: allow tcp_disconnect() again when threads are waiting, it was
denied to plug a constant source of bugs but turns out .NET depends
on it
- eth: mlx5: fix double-free if buffer refill fails under OOM
- revert "net: wwan: iosm: enable runtime pm support for 7560", it's
causing regressions and the WWAN team at Intel disappeared
- tcp: tsq: relax tcp_small_queue_check() when rtx queue contains a
single skb, fix single-stream perf regression on some devices
Previous releases - always broken:
- Bluetooth:
- fix issues in legacy BR/EDR PIN code pairing
- correctly bounds check and pad HCI_MON_NEW_INDEX name
- netfilter:
- more fixes / follow ups for the large "commit protocol" rework,
which went in as a fix to 6.5
- fix null-derefs on netlink attrs which user may not pass in
- tcp: fix excessive TLP and RACK timeouts from HZ rounding (bless
Debian for keeping HZ=250 alive)
- net: more strict VIRTIO_NET_HDR_GSO_UDP_L4 validation, prevent
letting frankenstein UDP super-frames from getting into the stack
- net: fix interface altnames when ifc moves to a new namespace
- eth: qed: fix the size of the RX buffers
- mptcp: avoid sending RST when closing the initial subflow"
* tag 'net-6.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (94 commits)
Revert "ethtool: Fix mod state of verbose no_mask bitset"
selftests: mptcp: join: no RST when rm subflow/addr
mptcp: avoid sending RST when closing the initial subflow
mptcp: more conservative check for zero probes
tcp: check mptcp-level constraints for backlog coalescing
selftests: mptcp: join: correctly check for no RST
net: ti: icssg-prueth: Fix r30 CMDs bitmasks
selftests: net: add very basic test for netdev names and namespaces
net: move altnames together with the netdevice
net: avoid UAF on deleted altname
net: check for altname conflicts when changing netdev's netns
net: fix ifname in netlink ntf during netns move
net: ethernet: ti: Fix mixed module-builtin object
net: phy: bcm7xxx: Add missing 16nm EPHY statistics
ipv4: fib: annotate races around nh->nh_saddr_genid and nh->nh_saddr
tcp_bpf: properly release resources on error paths
net/sched: sch_hfsc: upgrade 'rt' to 'sc' when it becomes a inner curve
net: mdio-mux: fix C45 access returning -EIO after API change
tcp: tsq: relax tcp_small_queue_check() when rtx queue contains a single skb
octeon_ep: update BQL sent bytes before ringing doorbell
...
Linus Torvalds [Thu, 19 Oct 2023 18:02:28 +0000 (11:02 -0700)]
Merge tag 'loongarch-fixes-6.6-3' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson
Pull LoongArch fixes from Huacai Chen:
"Fix 4-level pagetable building, disable WUC for pgprot_writecombine()
like ioremap_wc(), use correct annotation for exception handlers, and
a trivial cleanup"
* tag 'loongarch-fixes-6.6-3' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson:
LoongArch: Disable WUC for pgprot_writecombine() like ioremap_wc()
LoongArch: Replace kmap_atomic() with kmap_local_page() in copy_user_highpage()
LoongArch: Export symbol invalid_pud_table for modules building
LoongArch: Use SYM_CODE_* to annotate exception handlers
Linus Torvalds [Thu, 19 Oct 2023 17:53:31 +0000 (10:53 -0700)]
Merge tag 'slab-fixes-for-6.6-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab
Pull slab fix from Vlastimil Babka:
- stable fix to prevent kernel warnings with KASAN_HW_TAGS on arm64
due to improperly resolved kmalloc alignment restrictions (Catalin
Marinas)
* tag 'slab-fixes-for-6.6-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
mm: slab: Do not create kmalloc caches smaller than arch_slab_minalign()
Linus Torvalds [Thu, 19 Oct 2023 16:37:41 +0000 (09:37 -0700)]
Merge tag 'v6.6-rc7.vfs.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fix from Christian Brauner:
"An openat() call from io_uring triggering an audit call can apparently
cause the refcount of struct filename to be incremented from multiple
threads concurrently during async execution, triggering a refcount
underflow and hitting a BUG_ON(). That bug has been lurking around
since at least v5.16 apparently.
Switch to an atomic counter to fix that. The underflow check is
downgraded from a BUG_ON() to a WARN_ON_ONCE() but we could easily
remove that check altogether tbh"
* tag 'v6.6-rc7.vfs.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
audit,io_uring: io_uring openat triggers audit reference count underflow
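The vfs fix above replaces a plain reference counter with an atomic one and turns the underflow check into a warning. A minimal userspace sketch of that idea follows, using C11 atomics rather than the kernel's atomic_t API; struct name_ref and its helpers are hypothetical names chosen for illustration and are not the kernel's struct filename code.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for an object shared between threads. */
struct name_ref {
	atomic_int refcnt;
};

static void name_ref_init(struct name_ref *n)
{
	atomic_init(&n->refcnt, 1);	/* the creator holds one reference */
}

static void name_ref_get(struct name_ref *n)
{
	atomic_fetch_add(&n->refcnt, 1);
}

/* Returns true when the caller dropped the last reference. */
static bool name_ref_put(struct name_ref *n)
{
	if (atomic_load(&n->refcnt) == 0) {
		/* Underflow: warn and bail out instead of aborting. */
		fprintf(stderr, "name_ref: refcount underflow\n");
		return false;
	}
	/* atomic_fetch_sub() returns the value before the decrement. */
	return atomic_fetch_sub(&n->refcnt, 1) == 1;
}
```

Because the increment and decrement are single atomic operations, two threads taking or dropping references at the same time cannot lose an update, which is the failure mode the pull message describes.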
It was reported that this fix breaks the possibility to remove existing WoL
flags. For example:
~$ ethtool lan2
...
Supports Wake-on: pg
Wake-on: d
...
~$ ethtool -s lan2 wol gp
~$ ethtool lan2
...
Wake-on: pg
...
~$ ethtool -s lan2 wol d
~$ ethtool lan2
...
Wake-on: pg
...
This worked correctly before this commit because we were always updating
a zero bitmap (since commit 6699170376ab ("ethtool: fix application of
verbose no_mask bitset"), that is) so that the rest was left zero
naturally. But now the 1->0 change (old_val is true, bit not present in
netlink nest) no longer works.
Reported-by: Oleksij Rempel <o.rempel@pengutronix.de> Reported-by: Michal Kubecek <mkubecek@suse.cz> Closes: https://lore.kernel.org/netdev/20231019095140.l6fffnszraeb6iiw@lion.mk-sys.cz/ Cc: stable@vger.kernel.org Fixes: 108a36d07c01 ("ethtool: Fix mod state of verbose no_mask bitset") Signed-off-by: Kory Maincent <kory.maincent@bootlin.com> Reviewed-by: Michal Kubecek <mkubecek@suse.cz> Link: https://lore.kernel.org/r/20231019-feature_ptp_bitset_fix-v1-1-70f3c429a221@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Linus Torvalds [Thu, 19 Oct 2023 16:10:18 +0000 (09:10 -0700)]
Merge tag 'ntfs3_for_6.6' of https://github.com/Paragon-Software-Group/linux-ntfs3
Pull ntfs3 fixes from Konstantin Komarov:
- memory leak
- some logic errors, NULL dereferences
- some code was refactored
- more sanity checks
* tag 'ntfs3_for_6.6' of https://github.com/Paragon-Software-Group/linux-ntfs3:
fs/ntfs3: Avoid possible memory leak
fs/ntfs3: Fix directory element type detection
fs/ntfs3: Fix possible null-pointer dereference in hdr_find_e()
fs/ntfs3: Fix OOB read in ntfs_init_from_boot
fs/ntfs3: fix panic about slab-out-of-bounds caused by ntfs_list_ea()
fs/ntfs3: Fix NULL pointer dereference on error in attr_allocate_frame()
fs/ntfs3: Fix possible NULL-ptr-deref in ni_readpage_cmpr()
fs/ntfs3: Do not allow to change label if volume is read-only
fs/ntfs3: Add more info into /proc/fs/ntfs3/<dev>/volinfo
fs/ntfs3: Refactoring and comments
fs/ntfs3: Fix alternative boot searching
fs/ntfs3: Allow repeated call to ntfs3_put_sbi
fs/ntfs3: Use inode_set_ctime_to_ts instead of inode_set_ctime
fs/ntfs3: Fix shift-out-of-bounds in ntfs_fill_super
fs/ntfs3: fix deadlock in mark_as_free_ex
fs/ntfs3: Add more attributes checks in mi_enum_attr()
fs/ntfs3: Use kvmalloc instead of kmalloc(... __GFP_NOWARN)
fs/ntfs3: Write immediately updated ntfs state
fs/ntfs3: Add ckeck in ni_update_parent()
Geliang Tang [Wed, 18 Oct 2023 18:23:55 +0000 (11:23 -0700)]
mptcp: avoid sending RST when closing the initial subflow
When closing the first subflow, the MPTCP protocol unconditionally
calls tcp_disconnect(), which in turn generates a reset if the subflow
is established.
That is unexpected and different from what MPTCP does with MPJ
subflows, where resets are generated only on FASTCLOSE and other edge
scenarios.
We can't reuse for the first subflow the code already in place for MPJ
subflows, as MPTCP cleans those up completely via a tcp_close() call,
while it must keep the first subflow socket alive for later re-use, due
to implementation constraints.
This patch adds a new helper, __mptcp_subflow_disconnect(), that
encapsulates logic similar to tcp_close(), issuing a reset only when
the MPTCP_CF_FASTCLOSE flag is set, and performing a clean shutdown
otherwise.
Fixes: c2b2ae3925b6 ("mptcp: handle correctly disconnect() failures") Cc: stable@vger.kernel.org Reviewed-by: Matthieu Baerts <matttbe@kernel.org> Co-developed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20231018-send-net-20231018-v1-4-17ecb002e41d@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Paolo Abeni [Wed, 18 Oct 2023 18:23:54 +0000 (11:23 -0700)]
mptcp: more conservative check for zero probes
Christoph reported that the MPTCP protocol can find the subflow-level
write queue unexpectedly not empty while crafting a zero-window probe,
hitting a warning:
Paolo Abeni [Wed, 18 Oct 2023 18:23:53 +0000 (11:23 -0700)]
tcp: check mptcp-level constraints for backlog coalescing
The MPTCP protocol can acquire the subflow-level socket lock and
cause the tcp backlog usage. When inserting new skbs into the
backlog, the stack will try to coalesce them.
Currently, we have no check in place to ensure that such coalescing
will respect the MPTCP-level DSS, and that may cause data stream
corruption, as reported by Christoph.
Address the issue by adding the relevant admission check for coalescing
in tcp_add_backlog().
Note the issue is not easy to reproduce, as the MPTCP protocol tries
hard to avoid acquiring the subflow-level socket lock.
Fixes: 648ef4b88673 ("mptcp: Implement MPTCP receive path") Cc: stable@vger.kernel.org Reported-by: Christoph Paasch <cpaasch@apple.com> Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/420 Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20231018-send-net-20231018-v1-2-17ecb002e41d@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Matthieu Baerts [Wed, 18 Oct 2023 18:23:52 +0000 (11:23 -0700)]
selftests: mptcp: join: correctly check for no RST
The commit mentioned below was more tolerant of the number of RSTs seen
during a test because in some uncontrollable situations, multiple RSTs
can be generated.
But it did not take into account the case where no RST is expected:
the validation then no longer reported issues for the 0 RST case
because it is not possible to have fewer than 0 RSTs in the counter. This
patch fixes the issue by adding a specific condition.
Fixes: 6bf41020b72b ("selftests: mptcp: update and extend fastclose test-cases") Cc: stable@vger.kernel.org Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts <matttbe@kernel.org> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20231018-send-net-20231018-v1-1-17ecb002e41d@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
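The condition described in the commit above can be sketched as follows. This is an illustration in C, not the selftest's actual shell code; rst_count_ok() and its parameter names are hypothetical.

```c
#include <stdbool.h>

/*
 * Hypothetical check: 'expected' is the number of RSTs the test
 * anticipates, 'seen' is the value read from the counter.
 */
static bool rst_count_ok(unsigned int seen, unsigned int expected)
{
	/* Specific condition: when no RST is expected, any RST is a failure. */
	if (expected == 0)
		return seen == 0;

	/* Tolerant case: extra RSTs are accepted, missing ones are not. */
	return seen >= expected;
}
```

Without the first branch, a "seen >= expected" test with expected == 0 passes for every counter value, which is why the 0 RST case stopped reporting issues.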
The bitmasks for EMAC_PORT_DISABLE and EMAC_PORT_FORWARD r30 commands are
wrong in the driver.
Update the bitmasks of these commands to the correct ones as used by the
ICSSG firmware. These bitmasks are backwards compatible and work with
any ICSSG firmware version.
Fixes: e9b4ece7d74b ("net: ti: icssg-prueth: Add Firmware config and classification APIs.") Signed-off-by: MD Danish Anwar <danishanwar@ti.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/20231018150715.3085380-1-danishanwar@ti.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ivan Vecera [Wed, 18 Oct 2023 12:35:55 +0000 (14:35 +0200)]
i40e: Align devlink info versions with ice driver and add docs
Align the devlink info versions with the ice driver: change the 'fw.mgmt'
version to a 2-digit version [major.minor], add 'fw.mgmt.build',
which reports the mgmt firmware build number, and use 'fw.psid.api'
for the NVM format version instead of the incorrect 'fw.psid'.
Additionally add missing i40e devlink documentation.
Fixes: 5a423552e0d9 ("i40e: Add handler for devlink .info_get") Signed-off-by: Ivan Vecera <ivecera@redhat.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://lore.kernel.org/r/20231018123558.552453-1-ivecera@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This is a series of clean-ups related to ensuring that child node
schemas are constrained to not allow undefined properties. Typically,
that means just adding additionalProperties or unevaluatedProperties as
appropriate. The DSA/switch schemas turned out to be a bit more
involved, so there are some more fixes and a bit of restructuring in them.
====================
Rob Herring [Mon, 16 Oct 2023 21:44:27 +0000 (16:44 -0500)]
dt-bindings: net: dsa: Drop 'ethernet-ports' node properties
Constraints on 'ethernet-ports' node properties are already defined by the
reference to ethernet-switch.yaml, so they can be dropped from the DSA
schema.
Signed-off-by: Rob Herring <robh@kernel.org> Acked-by: Arınç ÜNAL <arinc.unal@arinc9.com> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Linus Walleij <linus.walleij@linaro.org> Link: https://lore.kernel.org/r/20231016-dt-net-cleanups-v1-8-a525a090b444@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Rob Herring [Mon, 16 Oct 2023 21:44:26 +0000 (16:44 -0500)]
dt-bindings: net: mscc,vsc7514-switch: Simplify DSA and switch references
The mscc,vsc7514-switch schema doesn't add any custom port properties,
so it can just reference ethernet-switch.yaml#/$defs/base and
dsa.yaml#/$defs/ethernet-ports instead of the base file and can skip
defining port nodes.
Signed-off-by: Rob Herring <robh@kernel.org> Acked-by: Arınç ÜNAL <arinc.unal@arinc9.com> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Linus Walleij <linus.walleij@linaro.org> Link: https://lore.kernel.org/r/20231016-dt-net-cleanups-v1-7-a525a090b444@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Rob Herring [Mon, 16 Oct 2023 21:44:24 +0000 (16:44 -0500)]
dt-bindings: net: ethernet-switch: Rename $defs "base" to 'ethernet-ports'
The name "base" is misleading as the definition is for a complete schema
definition without additional properties allowed, not a "base class".
Align the name with the one used in dsa.yaml. This schema file without any
json pointer path is the base schema which can be extended.
There are not yet any references to $defs/base to update.
Signed-off-by: Rob Herring <robh@kernel.org> Acked-by: Arınç ÜNAL <arinc.unal@arinc9.com> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Linus Walleij <linus.walleij@linaro.org> Link: https://lore.kernel.org/r/20231016-dt-net-cleanups-v1-5-a525a090b444@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The '$defs/ethernet-ports' schema is referenced by schemas defining a
child node 'ethernet-ports', but this schema misses the
'ethernet-ports' node. It would work if referring schemas made a
reference like this:
Rob Herring [Mon, 16 Oct 2023 21:44:21 +0000 (16:44 -0500)]
dt-bindings: net: renesas: Drop ethernet-phy node schema
What's connected on the MDIO bus is outside the scope of the binding for
ethernet controller's MDIO bus unless it's a fixed internal device, so
drop the node name and reference to ethernet-phy.yaml.
Rob Herring [Mon, 16 Oct 2023 21:44:20 +0000 (16:44 -0500)]
dt-bindings: net: Add missing (unevaluated|additional)Properties on child node schemas
Just as unevaluatedProperties or additionalProperties are required at
the top level of schemas, they should (and will) also be required for
child node schemas. That ensures only documented properties are
present for any node.
Add unevaluatedProperties or additionalProperties as appropriate.
Signed-off-by: Rob Herring <robh@kernel.org> Acked-by: Arınç ÜNAL <arinc.unal@arinc9.com> Acked-by: Jernej Skrabec <jernej.skrabec@gmail.com> Reviewed-by: Gerhard Engleder <gerhard@engleder-embedded.com> Acked-by: Linus Walleij <linus.walleij@linaro.org> Link: https://lore.kernel.org/r/20231016-dt-net-cleanups-v1-1-a525a090b444@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Linus Torvalds [Thu, 19 Oct 2023 15:56:01 +0000 (08:56 -0700)]
Merge tag 'for-6.6-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
"Fix a bug in chunk size decision that could lead to suboptimal
placement and filling patterns"
* tag 'for-6.6-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix stripe length calculation for non-zoned data chunk allocation
====================
net: fix bugs in device netns-move and rename
Daniel reported issues with the uevents generated during netdev
namespace move, if the netdev is getting renamed at the same time.
While the issue that he actually cares about is not fixed here,
there is a bunch of seemingly obvious other bugs in this code.
Fix the purely networking bugs while the discussion around
the uevent fix is still ongoing.
====================
Jakub Kicinski [Wed, 18 Oct 2023 01:38:16 +0000 (18:38 -0700)]
net: move altnames together with the netdevice
The altname nodes are currently not moved to the new netns
when netdevice itself moves:
[ ~]# ip netns add test
[ ~]# ip -netns test link add name eth0 type dummy
[ ~]# ip -netns test link property add dev eth0 altname some-name
[ ~]# ip -netns test link show dev some-name
2: eth0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 1e:67:ed:19:3d:24 brd ff:ff:ff:ff:ff:ff
altname some-name
[ ~]# ip -netns test link set dev eth0 netns 1
[ ~]# ip link
...
3: eth0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 02:40:88:62:ec:b8 brd ff:ff:ff:ff:ff:ff
altname some-name
[ ~]# ip li show dev some-name
Device "some-name" does not exist.
Remove them from the hash table when device is unlisted
and add back when listed again.
Fixes: 36fbf1e52bd3 ("net: rtnetlink: add linkprop commands to add and delete alternative ifnames") Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jakub Kicinski [Wed, 18 Oct 2023 01:38:15 +0000 (18:38 -0700)]
net: avoid UAF on deleted altname
Altnames are accessed under RCU (dev_get_by_name_rcu())
but freed by kfree() with no synchronization point.
Each node has one or two allocations (node and a variable-size
name, sometimes the name is netdev->name). Adding rcu_heads
here is a bit tedious. Besides most code which unlists the names
already has rcu barriers - so take the simpler approach of adding
synchronize_rcu(). Note that the one on the unregistration path
(which matters more) is removed by the next fix.
Fixes: ff92741270bf ("net: introduce name_node struct to be used in hashlist") Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jakub Kicinski [Wed, 18 Oct 2023 01:38:14 +0000 (18:38 -0700)]
net: check for altname conflicts when changing netdev's netns
It's currently possible to create an altname conflicting
with an altname or real name of another device by creating
it in another netns and moving it over:
[ ~]$ ip link add dev eth0 type dummy
[ ~]$ ip netns add test
[ ~]$ ip -netns test link add dev ethX netns test type dummy
[ ~]$ ip -netns test link property add dev ethX altname eth0
[ ~]$ ip -netns test link set dev ethX netns 1
[ ~]$ ip link
...
3: eth0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 02:40:88:62:ec:b8 brd ff:ff:ff:ff:ff:ff
...
5: ethX: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 26:b7:28:78:38:0f brd ff:ff:ff:ff:ff:ff
altname eth0
Create a macro for walking the altnames; this hopefully makes
it clearer that the list we walk contains only altnames,
which is otherwise not entirely intuitive.
Fixes: 36fbf1e52bd3 ("net: rtnetlink: add linkprop commands to add and delete alternative ifnames") Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jakub Kicinski [Wed, 18 Oct 2023 01:38:13 +0000 (18:38 -0700)]
net: fix ifname in netlink ntf during netns move
dev_get_valid_name() overwrites the netdev's name on success.
This makes it hard to use in prepare-commit-like fashion,
where we do validation first, and "commit" to the change
later.
Factor out a helper which lets us save the new name to a buffer.
Use it to fix the problem of notification on netns move having
incorrect name:
5: eth0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default
link/ether be:4d:58:f9:d5:40 brd ff:ff:ff:ff:ff:ff
6: eth1: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default
link/ether 1e:4a:34:36:e3:cd brd ff:ff:ff:ff:ff:ff
[ ~]# ip link set dev eth0 netns 1 name eth1
ip monitor inside netns:
Deleted inet eth0
Deleted inet6 eth0
Deleted 5: eth1: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default
link/ether be:4d:58:f9:d5:40 brd ff:ff:ff:ff:ff:ff new-netnsid 0 new-ifindex 7
Name is reported as eth1 in old netns for ifindex 5, already renamed.
Fixes: d90310243fd7 ("net: device name allocation cleanups") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This series is intended to restore the original performance
of stmmac on some routers/devices that use the stmmac driver to handle
gigabit traffic.
More information is in patch 3. This cover letter shows the results
and improvements of the following changes.
The move to hrtimer for the TX timer and commit 8fce33317023 ("net: stmmac:
Rework coalesce timer and fix multi-queue races") caused a big performance
regression on this kind of device.
This was observed on ipq806x, which after kernel 4.19 couldn't handle
gigabit speed anymore.
The series is currently applied and tested in OpenWrt SNAPSHOT
and gives a great performance increase (the scenario is a qca8k switch +
stmmac dwmac1000). Some good comparisons can be found here [1].
The comparison is between a swconfig scenario (where DSA tagging is not
used, so there is very low CPU impact in handling traffic) and a DSA
scenario where tagging is used and there is a minimal impact on the CPU.
As can be seen, even with DSA in place we have better performance.
Other users observed that an SQM scenario with the cake scheduler
also improved in the order of 100mbps (this scenario is CPU limited, and
any increase in performance comes from removing load from the CPU).
This has been in use for at least 15 days without any complaint or bug
report about queue timeouts (that was the case with v1, before the
additional patch was added; it only appeared in real-world tests and not
in iperf tests).
Christian Marangi [Wed, 18 Oct 2023 12:35:50 +0000 (14:35 +0200)]
net: stmmac: increase TX coalesce timer to 5ms
Commit 8fce33317023 ("net: stmmac: Rework coalesce timer and fix
multi-queue races") decreased the TX coalesce timer from 40ms to 1ms.
This caused a performance regression on some targets (reported
at least on ipq806x) in the order of 600mbps, dropping from
gigabit handling to only 200mbps.
The problem was identified as the TX timer getting armed too many times.
While this was fixed and improved in another commit, performance can be
improved even further by increasing the timer delay a bit, moving from
1ms to 5ms.
The value is a good balance between battery saving, by preventing too
many interrupts from being generated, and permitting good performance for
internet-oriented devices.
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Christian Marangi [Wed, 18 Oct 2023 12:35:49 +0000 (14:35 +0200)]
net: stmmac: move TX timer arm after DMA enable
Move the TX timer arm call to after the DMA interrupt is enabled again.
The TX timer arm function changed logic and is now skipped if a napi is
already scheduled. By moving the TX timer arm call after DMA is enabled,
the arm is correctly skipped if a DMA interrupt has fired and a napi
has been scheduled again.
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Christian Marangi [Wed, 18 Oct 2023 12:35:48 +0000 (14:35 +0200)]
net: stmmac: improve TX timer arm logic
There is currently a problem with the TX timer getting armed multiple
unnecessary times, causing a big performance regression on some devices
that suffer from expensive handling of hrtimer rearm.
The use of the TX timer is an old implementation that predates the napi
implementation and the interrupt enable/disable handling.
Because stmmac is very old code, the TX timer was never re-evaluated
with this new implementation and was kept there, causing the
performance regression. The performance regression started to appear
with kernel version 4.19, with 8fce33317023 ("net: stmmac: Rework coalesce
timer and fix multi-queue races"), where the timer was reduced to 1ms,
causing it to be armed 40 times more often than before.
Decreasing the timer made the problem more visible and caused a
regression in the order of 600-700mbps on some devices (the regression
was noticed on ipq806x).
The problem is that handling the hrtimer on some targets is
expensive and recent kernels arm the timer many more times.
One proposed solution was reverting the hrtimer change and using
mod_timer, but such a solution would still hide the real problem in the
current implementation.
To fix the regression, apply some additional logic and skip arming the
timer when not needed.
Arm the timer ONLY if a napi is not already scheduled. Running the timer
is redundant since the same function (stmmac_tx_clean) will run in the
napi TX poll. Also try to cancel any timer if a napi is scheduled, to
prevent a redundant run of the TX call.
With this new logic the original performance is restored while
still using the hrtimer.
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
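The arming rule from the commit above can be modelled in a few lines. This is a simplified userspace sketch, not the stmmac code: struct txq_model, its fields and tx_timer_arm() are hypothetical, and the hrtimer and napi are reduced to plain flags.

```c
#include <stdbool.h>

/* Hypothetical model of one TX queue; not the driver's real structures. */
struct txq_model {
	bool napi_scheduled;	/* a napi poll is already pending */
	bool timer_armed;	/* the coalesce timer is pending */
	unsigned int arm_count;	/* how many times the timer was (re)armed */
};

static void tx_timer_arm(struct txq_model *q)
{
	if (q->napi_scheduled) {
		/*
		 * The napi TX poll will run the same cleanup, so the timer
		 * is redundant: cancel it if pending and do not arm it.
		 */
		q->timer_armed = false;
		return;
	}

	q->timer_armed = true;
	q->arm_count++;
}
```

In this model the expensive operation, counted by arm_count, only happens when no poll is pending, which is the saving the commit message describes.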
Christian Marangi [Wed, 18 Oct 2023 12:35:47 +0000 (14:35 +0200)]
net: introduce napi_is_scheduled helper
We currently have napi_if_scheduled_mark_missed, which can be used to
check if napi is scheduled, but it does more than simply checking
and returning a bool. Some drivers already implement a custom function to
check if napi is scheduled.
Drop these custom functions and introduce napi_is_scheduled, which simply
checks atomically whether napi is scheduled.
Update any driver and code that implements a similar check to
use this new helper instead.
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
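As an illustration of what such a helper does, here is a userspace sketch using C11 atomics. The names struct napi_model, MODEL_STATE_SCHED and napi_model_is_scheduled() are hypothetical; the kernel helper operates on struct napi_struct with the kernel's own bit operations, which are not reproduced here.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical "scheduled" bit in a state word. */
#define MODEL_STATE_SCHED (1UL << 0)

struct napi_model {
	atomic_ulong state;
};

/* Read-only check: returns the flag without modifying any state. */
static bool napi_model_is_scheduled(struct napi_model *n)
{
	return (atomic_load(&n->state) & MODEL_STATE_SCHED) != 0;
}
```

The point of a dedicated helper is that the check has no side effects, unlike a function that also marks a poll as missed.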
Michal Schmidt [Wed, 18 Oct 2023 11:15:27 +0000 (13:15 +0200)]
iavf: delete unused iavf_mac_info fields
'san_addr' and 'mac_fcoeq' members of struct iavf_mac_info are unused.
'type' is write-only. Delete all three.
The function iavf_set_mac_type that sets 'type' also checks if the PCI
vendor ID is Intel. This is unnecessary. Delete the whole function.
If in the future there's a need for the MAC type (or other PCI
ID-dependent data), I would prefer to use .driver_data in iavf_pci_tbl[]
for this purpose.
Signed-off-by: Michal Schmidt <mschmidt@redhat.com> Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://lore.kernel.org/r/20231018111527.78194-1-mschmidt@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
====================
net: stmmac: use correct PPS input indexing
The stmmac can have 0 to 4 auxiliary snapshot input channels, which can be
used for capturing external triggers with respect to the eqos PTP timer.
Previously when enabling the auxiliary snapshot, an invalid request was
written to the hardware register, except for the Intel variant of this
driver, where the only snapshot available was hardcoded.
Patch 1 of this series cleans up the netdev_dbg message indicating
the auxiliary snapshot being {en,dis}abled. No functional changes here.
Patch 2 of this series writes the correct PPS input indexing to the
hardware registers instead of a previously used fixed value.
Patch 3 of this series removes a field member from plat_stmmacenet_data
that is no longer needed.
Patch 4 of this series prepares Patch 5 by protecting the snapshot
enabled flag with the aux_ts_lock mutex.
Patch 5 of this series adds a temporary workaround, since at the moment
the driver can handle only a single auxiliary snapshot at a time.
Previously the driver silently dropped the previous configuration and
enabled the new one. Now, if a snapshot is already enabled and userspace
tries to enable another without first disabling the currently enabled
snapshot, the driver issues a netdev_err and returns an error code
indicating the device is busy.
This series is a "never worked, doesn't hurt anyone" touchup to the PPS
capture for non-intel variants of the dwmac driver.
====================
Johannes Zink [Wed, 18 Oct 2023 07:09:57 +0000 (09:09 +0200)]
net: stmmac: do not silently change auxiliary snapshot capture channel
Even though the hardware theoretically supports up to 4 simultaneous
auxiliary snapshot capture channels, the stmmac driver supports only
a single active channel at a time.
Previously, on a PTP_CLK_REQ_EXTTS request, any active
auxiliary snapshot capture channel was silently dropped and the new
channel was activated.
Instead of silently changing the state for all consumers, log an error
and return -EBUSY if a channel is already in use in order to signal to
userspace to disable the currently active channel before enabling another one.
Signed-off-by: Johannes Zink <j.zink@pengutronix.de> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
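The behaviour change in the commit above can be sketched as a small state machine. This is a hypothetical userspace model: struct aux_snapshot_model and aux_snapshot_set() are invented for illustration, and the locking and error logging done by the driver are omitted.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the single-channel limitation. */
struct aux_snapshot_model {
	bool enabled;		/* a capture channel is currently active */
	unsigned int index;	/* which channel is active */
};

static int aux_snapshot_set(struct aux_snapshot_model *s,
			    unsigned int index, bool on)
{
	if (on) {
		/* Refuse instead of silently replacing the active channel. */
		if (s->enabled)
			return -EBUSY;
		s->enabled = true;
		s->index = index;
	} else {
		s->enabled = false;
	}
	return 0;
}
```

A caller that wants to switch channels therefore has to disable the active one first and then enable the new one.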
Johannes Zink [Wed, 18 Oct 2023 07:09:56 +0000 (09:09 +0200)]
net: stmmac: ptp: stmmac_enable(): move change of plat->flags into mutex
This is a preparation patch. The next patch will check if an external TS
is active and return with an error. So we have to move the change of the
plat->flags that tracks if external timestamping is enabled after that
check.
Prepare for this change and move the plat->flags change into the mutex
and the if (on).
Signed-off-by: Johannes Zink <j.zink@pengutronix.de> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Johannes Zink [Wed, 18 Oct 2023 07:09:55 +0000 (09:09 +0200)]
net: stmmac: intel: remove unnecessary field struct plat_stmmacenet_data::ext_snapshot_num
Do not store bitmask for enabling AUX_SNAPSHOT0. The previous commit
("net: stmmac: use correct PPS capture input index") takes care of
calculating the proper bit mask from the request data's extts.index field,
which is 0 if not explicitly specified otherwise.
Signed-off-by: Johannes Zink <j.zink@pengutronix.de> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Johannes Zink [Wed, 18 Oct 2023 07:09:54 +0000 (09:09 +0200)]
net: stmmac: use correct PPS capture input index
The stmmac supports up to 4 auxiliary snapshots that can be enabled by
setting the appropriate bits in the PTP_ACR bitfield.
Previously, as of commit f4da56529da6 ("net: stmmac: Add support for
external trigger timestamping"), a fixed value was written to this
bitfield instead of the appropriate bitmask.
Now the correct bit is set according to the ptp_clock_request.extts.index
passed as a parameter to stmmac_enable().
Signed-off-by: Johannes Zink <j.zink@pengutronix.de> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
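Deriving the enable bit from the channel index, as the commit above describes, might look like the sketch below. The macro names and the bit position of channel 0 are assumptions made for illustration; they are not taken from the driver or the hardware documentation.

```c
#include <stdint.h>

/* Assumed position of the enable bit for channel 0 (illustrative only). */
#define MODEL_ATSEN_BASE_SHIFT 4u
/* The commit message states that up to 4 auxiliary snapshots exist. */
#define MODEL_NUM_CHANNELS 4u

/* Returns the enable mask for one channel, or 0 for an invalid index. */
static uint32_t aux_snapshot_enable_mask(unsigned int index)
{
	if (index >= MODEL_NUM_CHANNELS)
		return 0;
	return (uint32_t)1 << (MODEL_ATSEN_BASE_SHIFT + index);
}
```

Writing a fixed value instead corresponds to ignoring the index argument, so every request would address the same channel regardless of what userspace asked for.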
With CONFIG_TI_K3_AM65_CPSW_NUSS=y and CONFIG_TI_ICSSG_PRUETH=m,
k3-cppi-desc-pool.o is linked to a module and also to vmlinux even though
the expected CFLAGS are different between builtins and modules.
The build system is complaining about the following:
k3-cppi-desc-pool.o is added to multiple modules: icssg-prueth
ti-am65-cpsw-nuss
Introduce the new module, k3-cppi-desc-pool, to provide the common
functions to ti-am65-cpsw-nuss and icssg-prueth.
Gan Yi Fang [Wed, 18 Oct 2023 03:08:02 +0000 (11:08 +0800)]
net: stmmac: Remove redundant checking for rx_coalesce_usecs
The datatype of rx_coalesce_usecs is u32, so it is always greater than or
equal to zero. The previous check skipped the value 0; this patch removes
the check so that the value 0 is handled too. The resulting change in
behaviour, where a value of 0 now causes an error, is not a problem
because 0 is out of range for rx_coalesce_usecs.
Jakub Kicinski [Wed, 18 Oct 2023 01:07:58 +0000 (18:07 -0700)]
docs: networking: document multi-RSS context
There seems to be no docs for the concept of multiple RSS
contexts and how to configure it. I had to explain it three
times recently, the last one being the charm, document it.
Paolo Abeni [Thu, 19 Oct 2023 08:59:42 +0000 (10:59 +0200)]
Merge branch 'rswitch-add-pm-ops'
Yoshihiro Shimoda says:
====================
rswitch: Add PM ops
This patch is based on the latest net-next.git / next branch.
After applying this patch together with the following patches, the system can
enter/exit Suspend to Idle without any error:
https://git.kernel.org/pub/scm/linux/kernel/git/phy/linux-phy.git/commit/?h=next&id=aa4c0bbf820ddb9dd8105a403aa12df57b9e5129
https://git.kernel.org/pub/scm/linux/kernel/git/phy/linux-phy.git/commit/?h=next&id=1a5361189b7acac15b9b086b2300a11b7aa84c06
====================
Yoshihiro Shimoda [Tue, 17 Oct 2023 11:34:02 +0000 (20:34 +0900)]
rswitch: Add PM ops
Add PM ops for Suspend to Idle. When the system is suspended,
the Ethernet Serdes's clock is stopped. So, this driver needs
to re-initialize the Ethernet Serdes with phy_init() in
renesas_eth_sw_resume(). Otherwise, a timeout happens in phy_power_on().
Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Yoshihiro Shimoda [Tue, 17 Oct 2023 11:34:01 +0000 (20:34 +0900)]
rswitch: Use unsigned int for port related array index
Array index should not be negative, so modify the condition of
rswitch_for_each_enabled_port_continue_reverse() macro, and then
use unsigned int instead.
Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
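The reason the loop condition has to change along with the index type can be shown with a small sketch. The function below is hypothetical and is not the rswitch macro; it only demonstrates a reverse walk that stays correct with an unsigned index.

```c
/*
 * Count the enabled entries in enabled[0 .. start-1], walking backwards.
 * With an unsigned index, a condition such as "i >= 0" is always true,
 * so the test is done before the decrement instead.
 */
static unsigned int count_enabled_below(const int *enabled, unsigned int start)
{
	unsigned int i = start;
	unsigned int n = 0;

	while (i-- > 0) {
		if (enabled[i])
			n++;
	}
	return n;
}
```

When start is 0 the body never runs, so enabled[] is never indexed with a wrapped-around value.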
Jakub Kicinski [Thu, 19 Oct 2023 01:17:50 +0000 (18:17 -0700)]
Merge tag 'nf-23-10-18' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Florian Westphal says:
====================
netfilter: updates for net
First patch, from Phil Sutter, reduces number of audit notifications
when userspace requests to re-set stateful objects.
This change also comes with a selftest update.
Second patch, also from Phil, moves the nftables audit selftest
to its own netns to avoid interference with the init netns.
Third patch, from Pablo Neira, fixes an inconsistency with the "rbtree"
set backend: When set element X has expired, a request to delete element
X should fail (like with all other backends).
Finally, patch four, also from Pablo, reverts a recent attempt to speed
up abort of a large pending update with the "pipapo" set backend.
It could cause stray references to remain in the set, which then
results in a double-free.
* tag 'nf-23-10-18' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nf_tables: revert do not remove elements if set backend implements .abort
netfilter: nft_set_rbtree: .deactivate fails if element has expired
selftests: netfilter: Run nft_audit.sh in its own netns
netfilter: nf_tables: audit log object reset once per table
====================
Jakub Kicinski [Thu, 19 Oct 2023 01:14:25 +0000 (18:14 -0700)]
Merge tag 'wireless-2023-10-18' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless
Johannes Berg says:
====================
A few more fixes:
* prevent value bounce/glitch in rfkill GPIO probe
* fix lockdep report in rfkill
* fix error path leak in mac80211 key handling
* use system_unbound_wq for wiphy work since it
can take longer
* tag 'wireless-2023-10-18' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
net: rfkill: reduce data->mtx scope in rfkill_fop_open
net: rfkill: gpio: prevent value glitch during probe
wifi: mac80211: fix error path key leak
wifi: cfg80211: use system_unbound_wq for wiphy work
====================
The .probe() function would allocate the necessary space and ensure that
the library call sizes the number of statistics but the callbacks
necessary to fetch the name and values were not wired up.
Reported-by: Justin Chen <justin.chen@broadcom.com> Fixes: f68d08c437f9 ("net: phy: bcm7xxx: Add EPHY entry for 72165") Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20231017205119.416392-1-florian.fainelli@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 6759 Comm: kworker/u4:15 Not tainted 6.6.0-rc4-syzkaller-00029-gcbf3a2cb156a #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/06/2023
Workqueue: wg-kex-wg1 wg_packet_handshake_send_worker
Fixes: 436c3b66ec98 ("ipv4: Invalidate nexthop cache nh_saddr more correctly.") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20231017192304.82626-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
Intel Wired LAN Driver Updates 2023-10-17
This series contains cleanups for all the Intel drivers relating to their
use of format specifiers and the use of strncpy.
Jesse fixes various -Wformat warnings across all the Intel networking drivers,
including various cases where a "%s" string format specifier is preferred,
and using kasprintf instead of snprintf.
Justin replaces all of the uses of the now deprecated strncpy with a more
modern string function, primarily strscpy.
====================
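To illustrate why strscpy is preferred over strncpy, here is a minimal
userspace sketch of the strscpy contract: the destination is always
NUL-terminated when the size is non-zero, and truncation is reported as
-E2BIG instead of passing silently. The function model_strscpy() is a
hypothetical stand-in written for this example. It is not the kernel
implementation and does not come from the series itself.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical userspace model of the strscpy() contract (an assumption
 * for illustration, not kernel code). Copies at most size - 1 characters
 * from src, always terminates dst, and returns the number of characters
 * copied or -E2BIG if src did not fit.
 */
static long model_strscpy(char *dst, const char *src, size_t size)
{
	size_t i;

	assert(dst != NULL && src != NULL);

	if (size == 0)
		return -E2BIG;	/* no room even for the terminator */

	for (i = 0; i < size - 1 && src[i] != '\0'; i++)
		dst[i] = src[i];
	dst[i] = '\0';

	/* If src still has characters left, the copy was truncated. */
	return src[i] == '\0' ? (long)i : -E2BIG;
}
```

By contrast, strncpy leaves the destination unterminated when the source
is at least as long as the buffer, and gives the caller no indication
that truncation happened.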