]> www.infradead.org Git - nvme.git/log
nvme.git
10 months agomptcp: change local addr type of subflow_destroy
Geliang Tang [Fri, 13 Dec 2024 19:52:57 +0000 (20:52 +0100)]
mptcp: change local addr type of subflow_destroy

Generally, in the path manager interfaces, the local address is defined as
an mptcp_pm_addr_entry type address, while the remote address is defined as
an mptcp_addr_info type one:

        (struct mptcp_pm_addr_entry *local, struct mptcp_addr_info *remote)

But subflow_destroy() interface uses two mptcp_addr_info type parameters.
This patch changes the first one to mptcp_pm_addr_entry type and use helper
mptcp_pm_parse_entry() to parse it instead of using mptcp_pm_parse_addr().

This patch doesn't change the behaviour of the code, just refactoring.

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20241213-net-next-mptcp-pm-misc-cleanup-v1-6-ddb6d00109a8@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agomptcp: drop free_list for deleting entries
Geliang Tang [Fri, 13 Dec 2024 19:52:56 +0000 (20:52 +0100)]
mptcp: drop free_list for deleting entries

mptcp_pm_remove_addrs() actually only deletes one address, which does
not match its name. This patch renames it to mptcp_pm_remove_addr_entry()
and changes the parameter "rm_list" to "entry".

With the help of mptcp_pm_remove_addr_entry(), it's no longer necessary to
move the entry to be deleted to free_list and then traverse the list to
delete the entry, which is not allowed in BPF. The entry can be directly
deleted through list_del_rcu() and sock_kfree_s() now.

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20241213-net-next-mptcp-pm-misc-cleanup-v1-5-ddb6d00109a8@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agomptcp: move mptcp_pm_remove_addrs into pm_userspace
Geliang Tang [Fri, 13 Dec 2024 19:52:55 +0000 (20:52 +0100)]
mptcp: move mptcp_pm_remove_addrs into pm_userspace

Since mptcp_pm_remove_addrs() is only called from the userspace PM, this
patch moves it into pm_userspace.c.

For this, lookup_subflow_by_saddr() and remove_anno_list_by_saddr()
helpers need to be exported in protocol.h. Also add "mptcp_" prefix for
these helpers.

Here, mptcp_pm_remove_addrs() is not changed to a static function because
it will be used in BPF Path Manager.

This patch doesn't change the behaviour of the code, just refactoring.

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20241213-net-next-mptcp-pm-misc-cleanup-v1-4-ddb6d00109a8@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agomptcp: add mptcp_userspace_pm_get_sock helper
Geliang Tang [Fri, 13 Dec 2024 19:52:54 +0000 (20:52 +0100)]
mptcp: add mptcp_userspace_pm_get_sock helper

Each userspace pm netlink function uses nla_get_u32() to get the msk
token value, then pass it to mptcp_token_get_sock() to get the msk.
Finally check whether userspace PM is selected on this msk. It makes
sense to wrap them into a helper, named mptcp_userspace_pm_get_sock(),
to do this.

This patch doesn't change the behaviour of the code, just refactoring.

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20241213-net-next-mptcp-pm-misc-cleanup-v1-3-ddb6d00109a8@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agomptcp: add mptcp_for_each_userspace_pm_addr macro
Geliang Tang [Fri, 13 Dec 2024 19:52:53 +0000 (20:52 +0100)]
mptcp: add mptcp_for_each_userspace_pm_addr macro

Similar to mptcp_for_each_subflow() macro, this patch adds a new macro
mptcp_for_each_userspace_pm_addr() for userspace PM to iterate over the
address entries on the local address list userspace_pm_local_addr_list
of the mptcp socket.

This patch doesn't change the behaviour of the code, just refactoring.

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20241213-net-next-mptcp-pm-misc-cleanup-v1-2-ddb6d00109a8@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agomptcp: add mptcp_userspace_pm_lookup_addr helper
Geliang Tang [Fri, 13 Dec 2024 19:52:52 +0000 (20:52 +0100)]
mptcp: add mptcp_userspace_pm_lookup_addr helper

Like __lookup_addr() helper in pm_netlink.c, a new helper
mptcp_userspace_pm_lookup_addr() is also defined in pm_userspace.c.
It looks up the corresponding mptcp_pm_addr_entry address in
userspace_pm_local_addr_list through the passed "addr" parameter
and returns the found address entry.

This helper can be used in mptcp_userspace_pm_delete_local_addr(),
mptcp_userspace_pm_set_flags(), mptcp_userspace_pm_get_local_id()
and mptcp_userspace_pm_is_backup() to simplify the code.

Please note that with this change now list_for_each_entry() is used in
mptcp_userspace_pm_append_new_local_addr(), not list_for_each_entry_safe(),
but that's OK to do so because mptcp_userspace_pm_lookup_addr() only
returns an entry from the list, the list hasn't been modified here.

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20241213-net-next-mptcp-pm-misc-cleanup-v1-1-ddb6d00109a8@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agonet: wan: framer: Simplify API framer_provider_simple_of_xlate() implementation
Zijun Hu [Fri, 13 Dec 2024 12:09:11 +0000 (20:09 +0800)]
net: wan: framer: Simplify API framer_provider_simple_of_xlate() implementation

Simplify framer_provider_simple_of_xlate() implementation by API
class_find_device_by_of_node().

Also correct comments to mark its parameter @dev as unused instead of
@args in passing.

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20241213-net_fix-v2-1-6d06130d630f@quicinc.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agogve: Convert timeouts to secs_to_jiffies()
Easwar Hariharan [Thu, 12 Dec 2024 17:33:01 +0000 (17:33 +0000)]
gve: Convert timeouts to secs_to_jiffies()

Commit b35108a51cf7 ("jiffies: Define secs_to_jiffies()") introduced
secs_to_jiffies(). As the value here is a multiple of 1000, use
secs_to_jiffies() instead of msecs_to_jiffies to avoid the multiplication.

This is converted using scripts/coccinelle/misc/secs_to_jiffies.cocci with
the following Coccinelle rules:

@@ constant C; @@

- msecs_to_jiffies(C * 1000)
+ secs_to_jiffies(C)

@@ constant C; @@

- msecs_to_jiffies(C * MSEC_PER_SEC)
+ secs_to_jiffies(C)

Signed-off-by: Easwar Hariharan <eahariha@linux.microsoft.com>
Reviewed-by: Praveen Kaligineedi <pkaligineedi@google.com>
Link: https://patch.msgid.link/20241212-netdev-converge-secs-to-jiffies-v4-1-6dac97a6d6ab@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agonetlink: specs: add uint, sint to netlink-raw schema
Donald Hunter [Fri, 13 Dec 2024 11:08:27 +0000 (11:08 +0000)]
netlink: specs: add uint, sint to netlink-raw schema

Add uint, sint to the list of attr types in the netlink-raw schema. This
fixes the rt_link spec which had a uint attr added in commit
f858cc9eed5b ("net: add IFLA_MAX_PACING_OFFLOAD_HORIZON device attribute")

Signed-off-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20241213110827.32250-1-donald.hunter@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agoocteontx2-af: fix build regression without CONFIG_DCB
Arnd Bergmann [Fri, 13 Dec 2024 08:32:18 +0000 (09:32 +0100)]
octeontx2-af: fix build regression without CONFIG_DCB

When DCB is disabled, the pfc_en struct member cannot be accessed:

drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c: In function 'otx2_is_pfc_enabled':
drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c:22:48: error: 'struct otx2_nic' has no member named 'pfc_en'
   22 |         return IS_ENABLED(CONFIG_DCB) && !!pfvf->pfc_en;
      |                                                ^~
drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c: In function 'otx2_nix_config_bp':
drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c:1755:33: error: 'IEEE_8021QAZ_MAX_TCS' undeclared (first use in this function)
 1755 |                 req->chan_cnt = IEEE_8021QAZ_MAX_TCS;
      |                                 ^~~~~~~~~~~~~~~~~~~~

Move the member out of the #ifdef block to avoid putting back another
check in the source file and add the missing include file unconditionally.

Fixes: a7ef63dbd588 ("octeontx2-af: Disable backpressure between CPT and NIX")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20241213083228.2645757-1-arnd@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agoionic: remove the unused nb_work
Brett Creeley [Thu, 12 Dec 2024 21:20:42 +0000 (13:20 -0800)]
ionic: remove the unused nb_work

Remove the empty and unused nb_work and associated
ionic_lif_notify_work() function.

v2: separated from previous net patch

Link: https://lore.kernel.org/netdev/20241210174828.69525-2-shannon.nelson@amd.com/
Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Link: https://patch.msgid.link/20241212212042.9348-1-shannon.nelson@amd.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agoethernet: Make OA_TC6 config symbol invisible
Geert Uytterhoeven [Thu, 12 Dec 2024 09:11:43 +0000 (10:11 +0100)]
ethernet: Make OA_TC6 config symbol invisible

Commit aa58bec064ab1622 ("net: ethernet: oa_tc6: implement register
write operation") introduced a library that implements the OPEN Alliance
TC6 10BASE-T1x MAC-PHY Serial Interface protocol for supporting
10BASE-T1x MAC-PHYs.

There is no need to ask the user about enabling this library, as all
drivers that use it select the OA_TC6 symbol.  Hence make the symbol
invisible, unless when compile-testing.

Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/3b600550745af10ab7d7c3526353931c1d39f641.1733994552.git.geert+renesas@glider.be
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agonet: phylink: improve phylink_sfp_config_phy() error message with missing PHY driver
Vladimir Oltean [Thu, 12 Dec 2024 14:08:34 +0000 (16:08 +0200)]
net: phylink: improve phylink_sfp_config_phy() error message with missing PHY driver

It seems that phylink does not support driving PHYs in SFP modules using
the Generic PHY or Generic Clause 45 PHY driver. I've come to this
conclusion after analyzing these facts:

- sfp_sm_probe_phy(), who is our caller here, first calls
  phy_device_register() and then sfp_add_phy() -> ... ->
  phylink_sfp_connect_phy().

- phydev->supported is populated by phy_probe()

- phy_probe() is usually called synchronously from phy_device_register()
  via phy_bus_match(), if a precise device driver is found for the PHY.
  In that case, phydev->supported has a good chance of being set to a
  non-zero mask.

- There is an exceptional case for the PHYs for which phy_bus_match()
  didn't find a driver. Those devices sit for a while without a driver,
  then phy_attach_direct() force-binds the genphy_c45_driver or
  genphy_driver to them. Again, this triggers phy_probe() and renders
  a good chance of phydev->supported being populated, assuming
  compatibility with genphy_read_abilities() or
  genphy_c45_pma_read_abilities().

- phylink_sfp_config_phy() does not support the exceptional case of
  retrieving phydev->supported from the Generic PHY driver, due to its
  code flow. It expects the phydev->supported mask to already be
  non-empty, because it first calls phylink_validate() on it, and only
  calls phylink_attach_phy() if that succeeds. Thus, phylink_attach_phy()
  -> phy_attach_direct() has no chance of running.

It is not my wish to change the state of affairs by altering the code
flow, but merely to document the limitation rather than have the current
unspecific error:

[   61.800079] mv88e6085 d0032004.mdio-mii:12 sfp: validation with support 00,00000000,00000000,00000000 failed: -EINVAL
[   61.820743] sfp sfp: sfp_add_phy failed: -EINVAL

On the premise that an empty phydev->supported is going to make
phylink_validate() fail anyway, and that this is caused by a missing PHY
driver, it would be more informative to single out that case, undercut
the entire phylink_sfp_config_phy() call, including phylink_validate(),
and print a more specific message for this common gotcha:

[   37.076403] mv88e6085 d0032004.mdio-mii:12 sfp: PHY i2c:sfp:16 (id 0x01410cc2) has no driver loaded
[   37.089157] mv88e6085 d0032004.mdio-mii:12 sfp: Drivers which handle known common cases: CONFIG_BCM84881_PHY, CONFIG_MARVELL_PHY
[   37.108047] sfp sfp: sfp_add_phy failed: -EINVAL

Link: https://lore.kernel.org/netdev/20241113144229.3ff4bgsalvj7spb7@skbuf/
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/20241212140834.278894-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agonet: ena: Fix incorrect indentation
Shay Agroskin [Thu, 12 Dec 2024 11:59:08 +0000 (13:59 +0200)]
net: ena: Fix incorrect indentation

The assignment was accidentally aligned to the string one line before.
This was raised by the kernel bot.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202412101739.umNl7yYu-lkp@intel.com/
Signed-off-by: David Arinzon <darinzon@amazon.com>
Signed-off-by: Shay Agroskin <shayagr@amazon.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20241212115910.2485851-1-shayagr@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agoipv4: output metric as unsigned int
Maximilian Güntner [Thu, 12 Dec 2024 16:19:11 +0000 (17:19 +0100)]
ipv4: output metric as unsigned int

adding a route metric greater than 0x7fff_ffff leads to an
unintended wrap when printing the underlying u32 as an
unsigned int (`%d`) thus incorrectly rendering the metric
as negative.  Formatting using `%u` corrects the issue.

Signed-off-by: Maximilian Güntner <code@mguentner.de>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20241212161911.51598-1-code@mguentner.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agoMerge branch 'dp83822-gpio2'
David S. Miller [Sun, 15 Dec 2024 21:12:37 +0000 (21:12 +0000)]
Merge branch 'dp83822-gpio2'

Dimitri Fedrau says:

====================
net: phy: dp83822: Add support for GPIO2 clock output

The DP83822 has several clock configuration options for pins GPIO1, GPIO2
and GPIO3. Clock options include:
  - MAC IF clock
  - XI clock
  - Free-Running clock
  - Recovered clock
This patch adds support for GPIO2, the support for GPIO1 and GPIO3 can be
easily added if needed. Code and device tree bindings are derived from
dp83867 which has a similar feature.

Signed-off-by: Dimitri Fedrau <dimitri.fedrau@liebherr.com>
---
Changes in v3:
- Dropped <dt-bindings/net/ti-dp83822.h>
- Moved defines from <dt-bindings/net/ti-dp83822.h> to dp83822.c
- Switched to enum of type string for property ti,gpio2-clk-out and added
  explanation for values, added example.
- Link to v2: https://lore.kernel.org/r/20241211-dp83822-gpio2-clk-out-v2-0-614a54f6acab@liebherr.com

Changes in v2:
- Move MII_DP83822_IOCTRL2 before MII_DP83822_GENCFG
- List case statements together, and have one break at the end.
- Move dp83822->set_gpio2_clk_out = true at the end of the validation
- Link to v1: https://lore.kernel.org/r/20241209-dp83822-gpio2-clk-out-v1-0-fd3c8af59ff5@liebherr.com
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
10 months agonet: phy: dp83822: Add support for GPIO2 clock output
Dimitri Fedrau [Thu, 12 Dec 2024 08:44:07 +0000 (09:44 +0100)]
net: phy: dp83822: Add support for GPIO2 clock output

The GPIO2 pin on the DP83822 can be configured as clock output. Add support
for configuration via DT.

Signed-off-by: Dimitri Fedrau <dimitri.fedrau@liebherr.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 months agodt-bindings: net: dp83822: Add support for GPIO2 clock output
Dimitri Fedrau [Thu, 12 Dec 2024 08:44:06 +0000 (09:44 +0100)]
dt-bindings: net: dp83822: Add support for GPIO2 clock output

The GPIO2 pin on the DP83822 can be configured as clock output. Add
binding to support this feature.

Signed-off-by: Dimitri Fedrau <dimitri.fedrau@liebherr.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 months agonetlink: add IGMP/MLD join/leave notifications
Yuyang Huang [Wed, 11 Dec 2024 08:22:41 +0000 (17:22 +0900)]
netlink: add IGMP/MLD join/leave notifications

This change introduces netlink notifications for multicast address
changes. The following features are included:
* Addition and deletion of multicast addresses are reported using
  RTM_NEWMULTICAST and RTM_DELMULTICAST messages with AF_INET and
  AF_INET6.
* Two new notification groups: RTNLGRP_IPV4_MCADDR and
  RTNLGRP_IPV6_MCADDR are introduced for receiving these events.

This change allows user space applications (e.g., ip monitor) to
efficiently track multicast group memberships by listening for netlink
events. Previously, applications relied on inefficient polling of
procfs, introducing delays. With netlink notifications, applications
receive realtime updates on multicast group membership changes,
enabling more precise metrics collection and system monitoring. 

This change also unlocks the potential for implementing a wide range
of sophisticated multicast related features in user space by allowing
applications to combine kernel provided multicast address information
with user space data and communicate decisions back to the kernel for
more fine grained control. This mechanism can be used for various
purposes, including multicast filtering, IGMP/MLD offload, and
IGMP/MLD snooping.

Cc: Maciej Żenczykowski <maze@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Co-developed-by: Patrick Ruddy <pruddy@vyatta.att-mail.com>
Signed-off-by: Patrick Ruddy <pruddy@vyatta.att-mail.com>
Link: https://lore.kernel.org/r/20180906091056.21109-1-pruddy@vyatta.att-mail.com
Signed-off-by: Yuyang Huang <yuyanghuang@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 months agonet: stmmac: Drop redundant dwxgmac_tc_ops variable
Furong Xu [Thu, 12 Dec 2024 03:33:25 +0000 (11:33 +0800)]
net: stmmac: Drop redundant dwxgmac_tc_ops variable

dwmac510_tc_ops and dwxgmac_tc_ops are completely identical,
keep dwmac510_tc_ops to provide better backward compatibility.

Signed-off-by: Furong Xu <0x1207@gmail.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Link: https://patch.msgid.link/20241212033325.282817-1-0x1207@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agoMerge branch 'devmem-tcp-fixes'
Jakub Kicinski [Fri, 13 Dec 2024 02:49:11 +0000 (18:49 -0800)]
Merge branch 'devmem-tcp-fixes'

Mina Almasry says:

====================
devmem TCP fixes

Couple unrelated devmem TCP fixes bundled in a series for some
convenience.

- fix naming and provide page_pool_alloc_netmem for fragged
netmem.

- fix issues with dma-buf dma addresses being potentially
passed to dma_sync_for_* helpers.
====================

Link: https://patch.msgid.link/20241211212033.1684197-1-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agopage_pool: disable sync for cpu for dmabuf memory provider
Mina Almasry [Wed, 11 Dec 2024 21:20:31 +0000 (21:20 +0000)]
page_pool: disable sync for cpu for dmabuf memory provider

dmabuf dma-addresses should not be dma_sync'd for CPU/device. Typically
its the driver responsibility to dma_sync for CPU, but the driver should
not dma_sync for CPU if the netmem is actually coming from a dmabuf
memory provider.

The page_pool already exposes a helper for dma_sync_for_cpu:
page_pool_dma_sync_for_cpu. Upgrade this existing helper to handle
netmem, and have it skip dma_sync if the memory is from a dmabuf memory
provider. Drivers should migrate to using this helper when adding
support for netmem.

Also minimize the impact on the dma syncing performance for pages. Special
case the dma-sync path for pages to not go through the overhead checks
for dma-syncing and conversion to netmem.

Cc: Alexander Lobakin <aleksander.lobakin@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Signed-off-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20241211212033.1684197-5-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agopage_pool: Set `dma_sync` to false for devmem memory provider
Samiullah Khawaja [Wed, 11 Dec 2024 21:20:30 +0000 (21:20 +0000)]
page_pool: Set `dma_sync` to false for devmem memory provider

Move the `dma_map` and `dma_sync` checks to `page_pool_init` to make
them generic. Set dma_sync to false for devmem memory provider because
the dma_sync APIs should not be used for dma_buf backed devmem memory
provider.

Cc: Jason Gunthorpe <jgg@ziepe.ca>
Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
Signed-off-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20241211212033.1684197-4-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agonet: page_pool: create page_pool_alloc_netmem
Mina Almasry [Wed, 11 Dec 2024 21:20:29 +0000 (21:20 +0000)]
net: page_pool: create page_pool_alloc_netmem

Create page_pool_alloc_netmem to be the mirror of page_pool_alloc.

This enables drivers that want currently use page_pool_alloc to
transition to netmem by converting the call sites to
page_pool_alloc_netmem.

Signed-off-by: Mina Almasry <almasrymina@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20241211212033.1684197-3-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agonet: page_pool: rename page_pool_alloc_netmem to *_netmems
Mina Almasry [Wed, 11 Dec 2024 21:20:28 +0000 (21:20 +0000)]
net: page_pool: rename page_pool_alloc_netmem to *_netmems

page_pool_alloc_netmem (without an s) was the mirror of
page_pool_alloc_pages (with an s), which was confusing.

Rename to page_pool_alloc_netmems so it's the mirror of
page_pool_alloc_pages.

Signed-off-by: Mina Almasry <almasrymina@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20241211212033.1684197-2-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agoMerge branch 'xdp-a-fistful-of-generic-changes-pt-ii'
Jakub Kicinski [Fri, 13 Dec 2024 02:23:08 +0000 (18:23 -0800)]
Merge branch 'xdp-a-fistful-of-generic-changes-pt-ii'

Alexander Lobakin says:

====================
xdp: a fistful of generic changes pt. II (part)

XDP for idpf is currently 5.5 chapters:
* convert Rx to libeth;
* convert Tx and stats to libeth;
* generic XDP and XSk code changes;
* generic XDP and XSk code additions (you are here);
* actual XDP for idpf via new libeth_xdp;
* XSk for idpf (via ^).

Part III.2.1 does the following:
* allows mixing pages from several Page Pools within one XDP frame;
* optimizes &xdp_frame structure and removes no-more-used field;

Everything is prereq for libeth_xdp, but will be useful standalone
as well: faster xdp_return_frame_bulk() and xdp_frame fields access.
====================

Link: https://patch.msgid.link/20241211172649.761483-1-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agoskbuff: allow 2-4-argument skb_frag_dma_map()
Alexander Lobakin [Wed, 11 Dec 2024 17:26:47 +0000 (18:26 +0100)]
skbuff: allow 2-4-argument skb_frag_dma_map()

skb_frag_dma_map(dev, frag, 0, skb_frag_size(frag), DMA_TO_DEVICE)
is repeated across dozens of drivers and really wants a shorthand.
Add a macro which will count args and handle all possible number
from 2 to 5. Semantics:

skb_frag_dma_map(dev, frag) ->
__skb_frag_dma_map(dev, frag, 0, skb_frag_size(frag), DMA_TO_DEVICE)

skb_frag_dma_map(dev, frag, offset) ->
__skb_frag_dma_map(dev, frag, offset, skb_frag_size(frag) - offset,
   DMA_TO_DEVICE)

skb_frag_dma_map(dev, frag, offset, size) ->
__skb_frag_dma_map(dev, frag, offset, size, DMA_TO_DEVICE)

skb_frag_dma_map(dev, frag, offset, size, dir) ->
__skb_frag_dma_map(dev, frag, offset, size, dir)

No object code size changes for the existing callers. Users passing
less arguments also won't have bigger size comparing to the full
equivalent call.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20241211172649.761483-11-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agoxdp: make __xdp_return() MP-agnostic
Alexander Lobakin [Wed, 11 Dec 2024 17:26:40 +0000 (18:26 +0100)]
xdp: make __xdp_return() MP-agnostic

Currently, __xdp_return() takes pointer to the virtual memory to free
a buffer. Apart from that this sometimes provokes redundant
data <--> page conversions, taking data pointer effectively prevents
lots of XDP code to support non-page-backed buffers, as there's no
mapping for the non-host memory (data is always NULL).
Just convert it to always take netmem reference. For
xdp_return_{buff,frame*}(), this chops off one page_address() per each
frag and adds one virt_to_netmem() (same as virt_to_page()) per header
buffer. For __xdp_return() itself, it removes one virt_to_page() for
MEM_TYPE_PAGE_POOL and another one for MEM_TYPE_PAGE_ORDER0, adding
one page_address() for [not really common nowadays]
MEM_TYPE_PAGE_SHARED, but the main effect is that the abovementioned
functions won't die or memleak anymore if the frame has non-host memory
attached and will correctly free those.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20241211172649.761483-4-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agoxdp: get rid of xdp_frame::mem.id
Alexander Lobakin [Wed, 11 Dec 2024 17:26:39 +0000 (18:26 +0100)]
xdp: get rid of xdp_frame::mem.id

Initially, xdp_frame::mem.id was used to search for the corresponding
&page_pool to return the page correctly.
However, after that struct page was extended to have a direct pointer
to its PP (netmem has it as well), further keeping of this field makes
no sense. xdp_return_frame_bulk() still used it to do a lookup, and
this leftover is now removed.
Remove xdp_frame::mem and replace it with ::mem_type, as only memory
type still matters and we need to know it to be able to free the frame
correctly.
As a cute side effect, we can now make every scalar field in &xdp_frame
of 4 byte width, speeding up accesses to them.

Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20241211172649.761483-3-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agopage_pool: allow mixing PPs within one bulk
Alexander Lobakin [Wed, 11 Dec 2024 17:26:38 +0000 (18:26 +0100)]
page_pool: allow mixing PPs within one bulk

The main reason for this change was to allow mixing pages from different
&page_pools within one &xdp_buff/&xdp_frame. Why not? With stuff like
devmem and io_uring zerocopy Rx, it's required to have separate PPs for
header buffers and payload buffers.
Adjust xdp_return_frame_bulk() and page_pool_put_netmem_bulk(), so that
they won't be tied to a particular pool. Let the latter create a
separate bulk of pages which's PP is different from the first netmem of
the bulk and process it after the main loop.
This greatly optimizes xdp_return_frame_bulk(): no more hashtable
lookups and forced flushes on PP mismatch. Also make
xdp_flush_frame_bulk() inline, as it's just one if + function call + one
u32 read, not worth extending the call ladder.

Co-developed-by: Toke Høiland-Jørgensen <toke@redhat.com> # iterative
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Suggested-by: Jakub Kicinski <kuba@kernel.org> # while (count)
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20241211172649.761483-2-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agonet_sched: sch_cake: Add drop reasons
Toke Høiland-Jørgensen [Wed, 11 Dec 2024 10:17:09 +0000 (11:17 +0100)]
net_sched: sch_cake: Add drop reasons

Add three qdisc-specific drop reasons and use them in sch_cake:

 1) SKB_DROP_REASON_QDISC_OVERLIMIT
    Whenever the total queue limit for a qdisc instance is exceeded
    and a packet is dropped to make room.

 2) SKB_DROP_REASON_QDISC_CONGESTED
    Whenever a packet is dropped by the qdisc AQM algorithm because
    congestion is detected.

 3) SKB_DROP_REASON_CAKE_FLOOD
    Whenever a packet is dropped by the flood protection part of the
    CAKE AQM algorithm (BLUE).

Also use the existing SKB_DROP_REASON_QUEUE_PURGE in cake_clear_tin().

Reasons show up as:

perf record -a -e skb:kfree_skb sleep 1; perf script

          iperf3     665 [005]   848.656964: skb:kfree_skb: skbaddr=0xffff98168a333500 rx_sk=(nil) protocol=34525 location=__dev_queue_xmit+0x10f0 reason: QDISC_OVERLIMIT
         swapper       0 [001]   909.166055: skb:kfree_skb: skbaddr=0xffff98168280cee0 rx_sk=(nil) protocol=34525 location=cake_dequeue+0x5ef reason: QDISC_CONGESTED

Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Dave Taht <dave.taht@gmail.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/20241211-cake-drop-reason-v2-1-920afadf4d1b@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Jakub Kicinski [Thu, 5 Dec 2024 19:48:58 +0000 (11:48 -0800)]
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Cross-merge networking fixes after downstream PR (net-6.13-rc3).

No conflicts or adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agoMerge tag 'net-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Linus Torvalds [Thu, 12 Dec 2024 19:28:05 +0000 (11:28 -0800)]
Merge tag 'net-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from bluetooth, netfilter and wireless.

  Current release - fix to a fix:

   - rtnetlink: fix error code in rtnl_newlink()

   - tipc: fix NULL deref in cleanup_bearer()

  Current release - regressions:

   - ip: fix warning about invalid return from in ip_route_input_rcu()

  Current release - new code bugs:

   - udp: fix L4 hash after reconnect

   - eth: lan969x: fix cyclic dependency between modules

   - eth: bnxt_en: fix potential crash when dumping FW log coredump

  Previous releases - regressions:

   - wifi: mac80211:
      - fix a queue stall in certain cases of channel switch
      - wake the queues in case of failure in resume

   - splice: do not checksum AF_UNIX sockets

   - virtio_net: fix BUG()s in BQL support due to incorrect accounting
     of purged packets during interface stop

   - eth:
      - stmmac: fix TSO DMA API mis-usage causing oops
      - bnxt_en: fixes for HW GRO: GSO type on 5750X chips and oops
        due to incorrect aggregation ID mask on 5760X chips

  Previous releases - always broken:

   - Bluetooth: improve setsockopt() handling of malformed user input

   - eth: ocelot: fix PTP timestamping in presence of packet loss

   - ptp: kvm: x86: avoid "fail to initialize ptp_kvm" when simply not
     supported"

* tag 'net-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (81 commits)
  net: dsa: tag_ocelot_8021q: fix broken reception
  net: dsa: microchip: KSZ9896 register regmap alignment to 32 bit boundaries
  net: renesas: rswitch: fix initial MPIC register setting
  Bluetooth: btmtk: avoid UAF in btmtk_process_coredump
  Bluetooth: iso: Fix circular lock in iso_conn_big_sync
  Bluetooth: iso: Fix circular lock in iso_listen_bis
  Bluetooth: SCO: Add support for 16 bits transparent voice setting
  Bluetooth: iso: Fix recursive locking warning
  Bluetooth: iso: Always release hdev at the end of iso_listen_bis
  Bluetooth: hci_event: Fix using rcu_read_(un)lock while iterating
  Bluetooth: hci_core: Fix sleeping function called from invalid context
  team: Fix feature propagation of NETIF_F_GSO_ENCAP_ALL
  team: Fix initial vlan_feature set in __team_compute_features
  bonding: Fix feature propagation of NETIF_F_GSO_ENCAP_ALL
  bonding: Fix initial {vlan,mpls}_feature set in bond_compute_features
  net, team, bonding: Add netdev_base_features helper
  net/sched: netem: account for backlog updates from child qdisc
  net: dsa: felix: fix stuck CPU-injected packets with short taprio windows
  splice: do not checksum AF_UNIX sockets
  net: usb: qmi_wwan: add Telit FE910C04 compositions
  ...

10 months agoMerge tag 'for-net-2024-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/bluet...
Jakub Kicinski [Thu, 12 Dec 2024 15:10:39 +0000 (07:10 -0800)]
Merge tag 'for-net-2024-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth

Luiz Augusto von Dentz says:

====================
bluetooth pull request for net:

 - SCO: Fix transparent voice setting
 - ISO: Locking fixes
 - hci_core: Fix sleeping function called from invalid context
 - hci_event: Fix using rcu_read_(un)lock while iterating
 - btmtk: avoid UAF in btmtk_process_coredump

* tag 'for-net-2024-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
  Bluetooth: btmtk: avoid UAF in btmtk_process_coredump
  Bluetooth: iso: Fix circular lock in iso_conn_big_sync
  Bluetooth: iso: Fix circular lock in iso_listen_bis
  Bluetooth: SCO: Add support for 16 bits transparent voice setting
  Bluetooth: iso: Fix recursive locking warning
  Bluetooth: iso: Always release hdev at the end of iso_listen_bis
  Bluetooth: hci_event: Fix using rcu_read_(un)lock while iterating
  Bluetooth: hci_core: Fix sleeping function called from invalid context
  Bluetooth: Improve setsockopt() handling of malformed user input
====================

Link: https://patch.msgid.link/20241212142806.2046274-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agonet: dsa: tag_ocelot_8021q: fix broken reception
Robert Hodaszi [Wed, 11 Dec 2024 14:47:41 +0000 (15:47 +0100)]
net: dsa: tag_ocelot_8021q: fix broken reception

The blamed commit changed the dsa_8021q_rcv() calling convention to
accept pre-populated source_port and switch_id arguments. If those are
not available, as in the case of tag_ocelot_8021q, the arguments must be
pre-initialized with -1.

Due to the bug of passing uninitialized arguments in tag_ocelot_8021q,
dsa_8021q_rcv() does not detect that it needs to populate the
source_port and switch_id, and this makes dsa_conduit_find_user() fail,
which leads to packet loss on reception.

Fixes: dcfe7673787b ("net: dsa: tag_sja1105: absorb logic for not overwriting precise info into dsa_8021q_rcv()")
Signed-off-by: Robert Hodaszi <robert.hodaszi@digi.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20241211144741.1415758-1-robert.hodaszi@digi.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agonet: dsa: microchip: KSZ9896 register regmap alignment to 32 bit boundaries
Jesse Van Gavere [Wed, 11 Dec 2024 09:29:32 +0000 (10:29 +0100)]
net: dsa: microchip: KSZ9896 register regmap alignment to 32 bit boundaries

Commit 8d7ae22ae9f8 ("net: dsa: microchip: KSZ9477 register regmap
alignment to 32 bit boundaries") fixed an issue whereby regmap_reg_range
did not allow writes as 32 bit words to KSZ9477 PHY registers, this fix
for KSZ9896 is adapted from there as the same errata is present in
KSZ9896C as "Module 5: Certain PHY registers must be written as pairs
instead of singly" the explanation below is likewise taken from this
commit.

The commit provided code
to apply "Module 6: Certain PHY registers must be written as pairs instead
of singly" errata for KSZ9477 as this chip for certain PHY registers
(0xN120 to 0xN13F, N=1,2,3,4,5) must be accessed as 32 bit words instead
of 16 or 8 bit access.
Otherwise, adjacent registers (no matter if reserved or not) are
overwritten with 0x0.

Without this patch some registers (e.g. 0x113c or 0x1134) required for 32
bit access are out of valid regmap ranges.

As a result, following error is observed and KSZ9896 is not properly
configured:

ksz-switch spi1.0: can't rmw 32bit reg 0x113c: -EIO
ksz-switch spi1.0: can't rmw 32bit reg 0x1134: -EIO
ksz-switch spi1.0 lan1 (uninitialized): failed to connect to PHY: -EIO
ksz-switch spi1.0 lan1 (uninitialized): error -5 setting up PHY for tree 0, switch 0, port 0

The solution is to modify regmap_reg_range to allow accesses with 4 bytes
boundaries.

Fixes: 5c844d57aa78 ("net: dsa: microchip: fix writes to phy registers >= 0x10")
Signed-off-by: Jesse Van Gavere <jesse.vangavere@scioteq.com>
Link: https://patch.msgid.link/20241211092932.26881-1-jesse.vangavere@scioteq.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agonet: renesas: rswitch: fix initial MPIC register setting
Nikita Yushchenko [Wed, 11 Dec 2024 05:30:12 +0000 (10:30 +0500)]
net: renesas: rswitch: fix initial MPIC register setting

MPIC.PIS must be set per phy interface type.
MPIC.LSC must be set per speed.

Do that strictly per datasheet, instead of hardcoding MPIC.PIS to GMII.

Fixes: 3590918b5d07 ("net: ethernet: renesas: Add support for "Ethernet Switch"")
Signed-off-by: Nikita Yushchenko <nikita.yoush@cogentembedded.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Link: https://patch.msgid.link/20241211053012.368914-1-nikita.yoush@cogentembedded.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
10 months agoBluetooth: btmtk: avoid UAF in btmtk_process_coredump
Thadeu Lima de Souza Cascardo [Tue, 10 Dec 2024 19:36:10 +0000 (16:36 -0300)]
Bluetooth: btmtk: avoid UAF in btmtk_process_coredump

hci_devcd_append may lead to the release of the skb, so it cannot be
accessed once it is called.

==================================================================
BUG: KASAN: slab-use-after-free in btmtk_process_coredump+0x2a7/0x2d0 [btmtk]
Read of size 4 at addr ffff888033cfabb0 by task kworker/0:3/82

CPU: 0 PID: 82 Comm: kworker/0:3 Tainted: G     U             6.6.40-lockdep-03464-g1d8b4eb3060e #1 b0b3c1cc0c842735643fb411799d97921d1f688c
Hardware name: Google Yaviks_Ufs/Yaviks_Ufs, BIOS Google_Yaviks_Ufs.15217.552.0 05/07/2024
Workqueue: events btusb_rx_work [btusb]
Call Trace:
 <TASK>
 dump_stack_lvl+0xfd/0x150
 print_report+0x131/0x780
 kasan_report+0x177/0x1c0
 btmtk_process_coredump+0x2a7/0x2d0 [btmtk 03edd567dd71a65958807c95a65db31d433e1d01]
 btusb_recv_acl_mtk+0x11c/0x1a0 [btusb 675430d1e87c4f24d0c1f80efe600757a0f32bec]
 btusb_rx_work+0x9e/0xe0 [btusb 675430d1e87c4f24d0c1f80efe600757a0f32bec]
 worker_thread+0xe44/0x2cc0
 kthread+0x2ff/0x3a0
 ret_from_fork+0x51/0x80
 ret_from_fork_asm+0x1b/0x30
 </TASK>

Allocated by task 82:
 stack_trace_save+0xdc/0x190
 kasan_set_track+0x4e/0x80
 __kasan_slab_alloc+0x4e/0x60
 kmem_cache_alloc+0x19f/0x360
 skb_clone+0x132/0xf70
 btusb_recv_acl_mtk+0x104/0x1a0 [btusb]
 btusb_rx_work+0x9e/0xe0 [btusb]
 worker_thread+0xe44/0x2cc0
 kthread+0x2ff/0x3a0
 ret_from_fork+0x51/0x80
 ret_from_fork_asm+0x1b/0x30

Freed by task 1733:
 stack_trace_save+0xdc/0x190
 kasan_set_track+0x4e/0x80
 kasan_save_free_info+0x28/0xb0
 ____kasan_slab_free+0xfd/0x170
 kmem_cache_free+0x183/0x3f0
 hci_devcd_rx+0x91a/0x2060 [bluetooth]
 worker_thread+0xe44/0x2cc0
 kthread+0x2ff/0x3a0
 ret_from_fork+0x51/0x80
 ret_from_fork_asm+0x1b/0x30

The buggy address belongs to the object at ffff888033cfab40
 which belongs to the cache skbuff_head_cache of size 232
The buggy address is located 112 bytes inside of
 freed 232-byte region [ffff888033cfab40ffff888033cfac28)

The buggy address belongs to the physical page:
page:00000000a174ba93 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x33cfa
head:00000000a174ba93 order:1 entire_mapcount:0 nr_pages_mapped:0 pincount:0
anon flags: 0x4000000000000840(slab|head|zone=1)
page_type: 0xffffffff()
raw: 4000000000000840 ffff888100848a00 0000000000000000 0000000000000001
raw: 0000000000000000 0000000080190019 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff888033cfaa80: fb fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc
 ffff888033cfab00: fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb fb
>ffff888033cfab80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                     ^
 ffff888033cfac00: fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc
 ffff888033cfac80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================

Check if we need to call hci_devcd_complete before calling
hci_devcd_append. That requires that we check data->cd_info.cnt >=
MTK_COREDUMP_NUM instead of data->cd_info.cnt > MTK_COREDUMP_NUM, as we
increment data->cd_info.cnt only once the call to hci_devcd_append
succeeds.

Fixes: 0b7015132878 ("Bluetooth: btusb: mediatek: add MediaTek devcoredump support")
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
10 months agoBluetooth: iso: Fix circular lock in iso_conn_big_sync
Iulia Tanasescu [Mon, 9 Dec 2024 09:42:18 +0000 (11:42 +0200)]
Bluetooth: iso: Fix circular lock in iso_conn_big_sync

This fixes the circular locking dependency warning below, by reworking
iso_sock_recvmsg, to ensure that the socket lock is always released
before calling a function that locks hdev.

[  561.670344] ======================================================
[  561.670346] WARNING: possible circular locking dependency detected
[  561.670349] 6.12.0-rc6+ #26 Not tainted
[  561.670351] ------------------------------------------------------
[  561.670353] iso-tester/3289 is trying to acquire lock:
[  561.670355] ffff88811f600078 (&hdev->lock){+.+.}-{3:3},
               at: iso_conn_big_sync+0x73/0x260 [bluetooth]
[  561.670405]
               but task is already holding lock:
[  561.670407] ffff88815af58258 (sk_lock-AF_BLUETOOTH){+.+.}-{0:0},
               at: iso_sock_recvmsg+0xbf/0x500 [bluetooth]
[  561.670450]
               which lock already depends on the new lock.

[  561.670452]
               the existing dependency chain (in reverse order) is:
[  561.670453]
               -> #2 (sk_lock-AF_BLUETOOTH){+.+.}-{0:0}:
[  561.670458]        lock_acquire+0x7c/0xc0
[  561.670463]        lock_sock_nested+0x3b/0xf0
[  561.670467]        bt_accept_dequeue+0x1a5/0x4d0 [bluetooth]
[  561.670510]        iso_sock_accept+0x271/0x830 [bluetooth]
[  561.670547]        do_accept+0x3dd/0x610
[  561.670550]        __sys_accept4+0xd8/0x170
[  561.670553]        __x64_sys_accept+0x74/0xc0
[  561.670556]        x64_sys_call+0x17d6/0x25f0
[  561.670559]        do_syscall_64+0x87/0x150
[  561.670563]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  561.670567]
               -> #1 (sk_lock-AF_BLUETOOTH-BTPROTO_ISO){+.+.}-{0:0}:
[  561.670571]        lock_acquire+0x7c/0xc0
[  561.670574]        lock_sock_nested+0x3b/0xf0
[  561.670577]        iso_sock_listen+0x2de/0xf30 [bluetooth]
[  561.670617]        __sys_listen_socket+0xef/0x130
[  561.670620]        __x64_sys_listen+0xe1/0x190
[  561.670623]        x64_sys_call+0x2517/0x25f0
[  561.670626]        do_syscall_64+0x87/0x150
[  561.670629]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  561.670632]
               -> #0 (&hdev->lock){+.+.}-{3:3}:
[  561.670636]        __lock_acquire+0x32ad/0x6ab0
[  561.670639]        lock_acquire.part.0+0x118/0x360
[  561.670642]        lock_acquire+0x7c/0xc0
[  561.670644]        __mutex_lock+0x18d/0x12f0
[  561.670647]        mutex_lock_nested+0x1b/0x30
[  561.670651]        iso_conn_big_sync+0x73/0x260 [bluetooth]
[  561.670687]        iso_sock_recvmsg+0x3e9/0x500 [bluetooth]
[  561.670722]        sock_recvmsg+0x1d5/0x240
[  561.670725]        sock_read_iter+0x27d/0x470
[  561.670727]        vfs_read+0x9a0/0xd30
[  561.670731]        ksys_read+0x1a8/0x250
[  561.670733]        __x64_sys_read+0x72/0xc0
[  561.670736]        x64_sys_call+0x1b12/0x25f0
[  561.670738]        do_syscall_64+0x87/0x150
[  561.670741]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  561.670744]
               other info that might help us debug this:

[  561.670745] Chain exists of:
&hdev->lock --> sk_lock-AF_BLUETOOTH-BTPROTO_ISO --> sk_lock-AF_BLUETOOTH

[  561.670751]  Possible unsafe locking scenario:

[  561.670753]        CPU0                    CPU1
[  561.670754]        ----                    ----
[  561.670756]   lock(sk_lock-AF_BLUETOOTH);
[  561.670758]                                lock(sk_lock
                                              AF_BLUETOOTH-BTPROTO_ISO);
[  561.670761]                                lock(sk_lock-AF_BLUETOOTH);
[  561.670764]   lock(&hdev->lock);
[  561.670767]
                *** DEADLOCK ***

Fixes: 07a9342b94a9 ("Bluetooth: ISO: Send BIG Create Sync via hci_sync")
Signed-off-by: Iulia Tanasescu <iulia.tanasescu@nxp.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
10 months agoBluetooth: iso: Fix circular lock in iso_listen_bis
Iulia Tanasescu [Mon, 9 Dec 2024 09:42:17 +0000 (11:42 +0200)]
Bluetooth: iso: Fix circular lock in iso_listen_bis

This fixes the circular locking dependency warning below, by
releasing the socket lock before enterning iso_listen_bis, to
avoid any potential deadlock with hdev lock.

[   75.307983] ======================================================
[   75.307984] WARNING: possible circular locking dependency detected
[   75.307985] 6.12.0-rc6+ #22 Not tainted
[   75.307987] ------------------------------------------------------
[   75.307987] kworker/u81:2/2623 is trying to acquire lock:
[   75.307988] ffff8fde1769da58 (sk_lock-AF_BLUETOOTH-BTPROTO_ISO)
               at: iso_connect_cfm+0x253/0x840 [bluetooth]
[   75.308021]
               but task is already holding lock:
[   75.308022] ffff8fdd61a10078 (&hdev->lock)
               at: hci_le_per_adv_report_evt+0x47/0x2f0 [bluetooth]
[   75.308053]
               which lock already depends on the new lock.

[   75.308054]
               the existing dependency chain (in reverse order) is:
[   75.308055]
               -> #1 (&hdev->lock){+.+.}-{3:3}:
[   75.308057]        __mutex_lock+0xad/0xc50
[   75.308061]        mutex_lock_nested+0x1b/0x30
[   75.308063]        iso_sock_listen+0x143/0x5c0 [bluetooth]
[   75.308085]        __sys_listen_socket+0x49/0x60
[   75.308088]        __x64_sys_listen+0x4c/0x90
[   75.308090]        x64_sys_call+0x2517/0x25f0
[   75.308092]        do_syscall_64+0x87/0x150
[   75.308095]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   75.308098]
               -> #0 (sk_lock-AF_BLUETOOTH-BTPROTO_ISO){+.+.}-{0:0}:
[   75.308100]        __lock_acquire+0x155e/0x25f0
[   75.308103]        lock_acquire+0xc9/0x300
[   75.308105]        lock_sock_nested+0x32/0x90
[   75.308107]        iso_connect_cfm+0x253/0x840 [bluetooth]
[   75.308128]        hci_connect_cfm+0x6c/0x190 [bluetooth]
[   75.308155]        hci_le_per_adv_report_evt+0x27b/0x2f0 [bluetooth]
[   75.308180]        hci_le_meta_evt+0xe7/0x200 [bluetooth]
[   75.308206]        hci_event_packet+0x21f/0x5c0 [bluetooth]
[   75.308230]        hci_rx_work+0x3ae/0xb10 [bluetooth]
[   75.308254]        process_one_work+0x212/0x740
[   75.308256]        worker_thread+0x1bd/0x3a0
[   75.308258]        kthread+0xe4/0x120
[   75.308259]        ret_from_fork+0x44/0x70
[   75.308261]        ret_from_fork_asm+0x1a/0x30
[   75.308263]
               other info that might help us debug this:

[   75.308264]  Possible unsafe locking scenario:

[   75.308264]        CPU0                CPU1
[   75.308265]        ----                ----
[   75.308265]   lock(&hdev->lock);
[   75.308267]                            lock(sk_lock-
                                                AF_BLUETOOTH-BTPROTO_ISO);
[   75.308268]                            lock(&hdev->lock);
[   75.308269]   lock(sk_lock-AF_BLUETOOTH-BTPROTO_ISO);
[   75.308270]
                *** DEADLOCK ***

[   75.308271] 4 locks held by kworker/u81:2/2623:
[   75.308272]  #0: ffff8fdd66e52148 ((wq_completion)hci0#2){+.+.}-{0:0},
                at: process_one_work+0x443/0x740
[   75.308276]  #1: ffffafb488b7fe48 ((work_completion)(&hdev->rx_work)),
                at: process_one_work+0x1ce/0x740
[   75.308280]  #2: ffff8fdd61a10078 (&hdev->lock){+.+.}-{3:3}
                at: hci_le_per_adv_report_evt+0x47/0x2f0 [bluetooth]
[   75.308304]  #3: ffffffffb6ba4900 (rcu_read_lock){....}-{1:2},
                at: hci_connect_cfm+0x29/0x190 [bluetooth]

Fixes: 02171da6e86a ("Bluetooth: ISO: Add hcon for listening bis sk")
Signed-off-by: Iulia Tanasescu <iulia.tanasescu@nxp.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
10 months agoBluetooth: SCO: Add support for 16 bits transparent voice setting
Frédéric Danis [Thu, 5 Dec 2024 15:51:59 +0000 (16:51 +0100)]
Bluetooth: SCO: Add support for 16 bits transparent voice setting

The voice setting is used by sco_connect() or sco_conn_defer_accept()
after being set by sco_sock_setsockopt().

The PCM part of the voice setting is used for offload mode through PCM
chipset port.
This commits add support for mSBC 16 bits offloading, i.e. audio data
not transported over HCI.

The BCM4349B1 supports 16 bits transparent data on its I2S port.
If BT_VOICE_TRANSPARENT is used when accepting a SCO connection, this
gives only garbage audio while using BT_VOICE_TRANSPARENT_16BIT gives
correct audio.
This has been tested with connection to iPhone 14 and Samsung S24.

Fixes: ad10b1a48754 ("Bluetooth: Add Bluetooth socket voice option")
Signed-off-by: Frédéric Danis <frederic.danis@collabora.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
10 months agoBluetooth: iso: Fix recursive locking warning
Iulia Tanasescu [Wed, 4 Dec 2024 12:28:49 +0000 (14:28 +0200)]
Bluetooth: iso: Fix recursive locking warning

This updates iso_sock_accept to use nested locking for the parent
socket, to avoid lockdep warnings caused because the parent and
child sockets are locked by the same thread:

[   41.585683] ============================================
[   41.585688] WARNING: possible recursive locking detected
[   41.585694] 6.12.0-rc6+ #22 Not tainted
[   41.585701] --------------------------------------------
[   41.585705] iso-tester/3139 is trying to acquire lock:
[   41.585711] ffff988b29530a58 (sk_lock-AF_BLUETOOTH)
               at: bt_accept_dequeue+0xe3/0x280 [bluetooth]
[   41.585905]
               but task is already holding lock:
[   41.585909] ffff988b29533a58 (sk_lock-AF_BLUETOOTH)
               at: iso_sock_accept+0x61/0x2d0 [bluetooth]
[   41.586064]
               other info that might help us debug this:
[   41.586069]  Possible unsafe locking scenario:

[   41.586072]        CPU0
[   41.586076]        ----
[   41.586079]   lock(sk_lock-AF_BLUETOOTH);
[   41.586086]   lock(sk_lock-AF_BLUETOOTH);
[   41.586093]
                *** DEADLOCK ***

[   41.586097]  May be due to missing lock nesting notation

[   41.586101] 1 lock held by iso-tester/3139:
[   41.586107]  #0: ffff988b29533a58 (sk_lock-AF_BLUETOOTH)
                at: iso_sock_accept+0x61/0x2d0 [bluetooth]

Fixes: ccf74f2390d6 ("Bluetooth: Add BTPROTO_ISO socket type")
Signed-off-by: Iulia Tanasescu <iulia.tanasescu@nxp.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
10 months agoBluetooth: iso: Always release hdev at the end of iso_listen_bis
Iulia Tanasescu [Wed, 4 Dec 2024 12:28:48 +0000 (14:28 +0200)]
Bluetooth: iso: Always release hdev at the end of iso_listen_bis

Since hci_get_route holds the device before returning, the hdev
should be released with hci_dev_put at the end of iso_listen_bis
even if the function returns with an error.

Fixes: 02171da6e86a ("Bluetooth: ISO: Add hcon for listening bis sk")
Signed-off-by: Iulia Tanasescu <iulia.tanasescu@nxp.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
10 months agoBluetooth: hci_event: Fix using rcu_read_(un)lock while iterating
Luiz Augusto von Dentz [Wed, 4 Dec 2024 16:40:59 +0000 (11:40 -0500)]
Bluetooth: hci_event: Fix using rcu_read_(un)lock while iterating

The usage of rcu_read_(un)lock while inside list_for_each_entry_rcu is
not safe since for the most part entries fetched this way shall be
treated as rcu_dereference:

Note that the value returned by rcu_dereference() is valid
only within the enclosing RCU read-side critical section [1]_.
For example, the following is **not** legal::

rcu_read_lock();
p = rcu_dereference(head.next);
rcu_read_unlock();
x = p->address; /* BUG!!! */
rcu_read_lock();
y = p->data; /* BUG!!! */
rcu_read_unlock();

Fixes: a0bfde167b50 ("Bluetooth: ISO: Add support for connecting multiple BISes")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
10 months agoBluetooth: hci_core: Fix sleeping function called from invalid context
Luiz Augusto von Dentz [Tue, 3 Dec 2024 21:07:32 +0000 (16:07 -0500)]
Bluetooth: hci_core: Fix sleeping function called from invalid context

This reworks hci_cb_list to not use mutex hci_cb_list_lock to avoid bugs
like the bellow:

BUG: sleeping function called from invalid context at kernel/locking/mutex.c:585
in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 5070, name: kworker/u9:2
preempt_count: 0, expected: 0
RCU nest depth: 1, expected: 0
4 locks held by kworker/u9:2/5070:
 #0: ffff888015be3948 ((wq_completion)hci0#2){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3229 [inline]
 #0: ffff888015be3948 ((wq_completion)hci0#2){+.+.}-{0:0}, at: process_scheduled_works+0x8e0/0x1770 kernel/workqueue.c:3335
 #1: ffffc90003b6fd00 ((work_completion)(&hdev->rx_work)){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3230 [inline]
 #1: ffffc90003b6fd00 ((work_completion)(&hdev->rx_work)){+.+.}-{0:0}, at: process_scheduled_works+0x91b/0x1770 kernel/workqueue.c:3335
 #2: ffff8880665d0078 (&hdev->lock){+.+.}-{3:3}, at: hci_le_create_big_complete_evt+0xcf/0xae0 net/bluetooth/hci_event.c:6914
 #3: ffffffff8e132020 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:298 [inline]
 #3: ffffffff8e132020 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:750 [inline]
 #3: ffffffff8e132020 (rcu_read_lock){....}-{1:2}, at: hci_le_create_big_complete_evt+0xdb/0xae0 net/bluetooth/hci_event.c:6915
CPU: 0 PID: 5070 Comm: kworker/u9:2 Not tainted 6.8.0-syzkaller-08073-g480e035fc4c7 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024
Workqueue: hci0 hci_rx_work
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0x241/0x360 lib/dump_stack.c:114
 __might_resched+0x5d4/0x780 kernel/sched/core.c:10187
 __mutex_lock_common kernel/locking/mutex.c:585 [inline]
 __mutex_lock+0xc1/0xd70 kernel/locking/mutex.c:752
 hci_connect_cfm include/net/bluetooth/hci_core.h:2004 [inline]
 hci_le_create_big_complete_evt+0x3d9/0xae0 net/bluetooth/hci_event.c:6939
 hci_event_func net/bluetooth/hci_event.c:7514 [inline]
 hci_event_packet+0xa53/0x1540 net/bluetooth/hci_event.c:7569
 hci_rx_work+0x3e8/0xca0 net/bluetooth/hci_core.c:4171
 process_one_work kernel/workqueue.c:3254 [inline]
 process_scheduled_works+0xa00/0x1770 kernel/workqueue.c:3335
 worker_thread+0x86d/0xd70 kernel/workqueue.c:3416
 kthread+0x2f0/0x390 kernel/kthread.c:388
 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:243
 </TASK>

Reported-by: syzbot+2fb0835e0c9cefc34614@syzkaller.appspotmail.com
Tested-by: syzbot+2fb0835e0c9cefc34614@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=2fb0835e0c9cefc34614
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
10 months agoMerge branch 'net-smc-two-features-for-smc-r'
Paolo Abeni [Thu, 12 Dec 2024 12:50:20 +0000 (13:50 +0100)]
Merge branch 'net-smc-two-features-for-smc-r'

Guangguan Wang says:

====================
net/smc: Two features for smc-r

v2: https://lore.kernel.org/netdev/20241202125203.48821-1-guangguan.wang@linux.alibaba.com/
v1: https://lore.kernel.org/oe-kbuild-all/202411282154.DjX7ilwF-lkp@intel.com/
====================

Link: https://patch.msgid.link/20241211023055.89610-1-guangguan.wang@linux.alibaba.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
10 months agonet/smc: support ipv4 mapped ipv6 addr client for smc-r v2
Guangguan Wang [Wed, 11 Dec 2024 02:30:55 +0000 (10:30 +0800)]
net/smc: support ipv4 mapped ipv6 addr client for smc-r v2

AF_INET6 is not supported for smc-r v2 client before, even if the
ipv6 addr is ipv4 mapped. Thus, when using AF_INET6, smc-r connection
will fallback to tcp, especially for java applications running smc-r.
This patch support ipv4 mapped ipv6 addr client for smc-r v2. Clients
using real global ipv6 addr is still not supported yet.

Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Reviewed-by: Wen Gu <guwen@linux.alibaba.com>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
Reviewed-by: D. Wythe <alibuda@linux.alibaba.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Reviewed-by: Halil Pasic <pasic@linux.ibm.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
10 months agonet/smc: support SMC-R V2 for rdma devices with max_recv_sge equals to 1
Guangguan Wang [Wed, 11 Dec 2024 02:30:54 +0000 (10:30 +0800)]
net/smc: support SMC-R V2 for rdma devices with max_recv_sge equals to 1

For SMC-R V2, llc msg can be larger than SMC_WR_BUF_SIZE, thus every
recv wr has 2 sges, the first sge with length SMC_WR_BUF_SIZE is for
V1/V2 compatible llc/cdc msg, and the second sge with length
SMC_WR_BUF_V2_SIZE-SMC_WR_TX_SIZE is for V2 specific llc msg, like
SMC_LLC_DELETE_RKEY and SMC_LLC_ADD_LINK for SMC-R V2. The memory
buffer in the second sge is shared by all recv wr in one link and
all link in one lgr for saving memory usage purpose.

But not all RDMA devices with max_recv_sge greater than 1. Thus SMC-R
V2 can not support on such RDMA devices and SMC_CLC_DECL_INTERR fallback
happens because of the failure of create qp.

This patch introduce the support for SMC-R V2 on RDMA devices with
max_recv_sge equals to 1. Every recv wr has only one sge with individual
buffer whose size is SMC_WR_BUF_V2_SIZE once the RDMA device's max_recv_sge
equals to 1. It may use more memory, but it is better than
SMC_CLC_DECL_INTERR fallback.

Co-developed-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
10 months agoMerge tag 'nf-24-12-11' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Paolo Abeni [Thu, 12 Dec 2024 12:11:38 +0000 (13:11 +0100)]
Merge tag 'nf-24-12-11' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Fix bogus test reports in rpath.sh selftest by adding permanent
   neighbor entries, from Phil Sutter.

2) Lockdep reports possible ABBA deadlock in xt_IDLETIMER, fix it by
   removing sysfs out of the mutex section, also from Phil Sutter.

3) It is illegal to release basechain via RCU callback, for several
   reasons. Keep it simple and safe by calling synchronize_rcu() instead.
   This is a partially reverting a botched recent attempt of me to fix
   this basechain release path on netdevice removal.
   From Florian Westphal.

netfilter pull request 24-12-11

* tag 'nf-24-12-11' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: nf_tables: do not defer rule destruction via call_rcu
  netfilter: IDLETIMER: Fix for possible ABBA deadlock
  selftests: netfilter: Stabilize rpath.sh
====================

Link: https://patch.msgid.link/20241211230130.176937-1-pablo@netfilter.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
10 months agoselftests: forwarding: add a pvid_change test to bridge_vlan_unaware
Vladimir Oltean [Tue, 10 Dec 2024 23:35:40 +0000 (01:35 +0200)]
selftests: forwarding: add a pvid_change test to bridge_vlan_unaware

Historically, DSA drivers have seen problems with the model in which
bridge VLANs work, particularly with them being offloaded to switchdev
asynchronously relative to when they become active (vlan_filtering=1).

This switchdev API peculiarity was papered over by commit 2ea7a679ca2a
("net: dsa: Don't add vlans when vlan filtering is disabled"), which
introduced other problems, fixed by commit 54a0ed0df496 ("net: dsa:
provide an option for drivers to always receive bridge VLANs") through
an opt-in ds->configure_vlan_while_not_filtering bool (which later
became an opt-out).

The point is that some DSA drivers still skip VLAN configuration while
VLAN-unaware, and there is a desire to get rid of that behavior.

It's hard to deduce from the wording "at least one corner case" what
Andrew saw, but my best guess is that there is a discrepancy of meaning
between bridge pvid and hardware port pvid which caused breakage.

On one side, the Linux bridge with vlan_filtering=0 is completely
VLAN-unaware, and will accept and process a packet the same way
irrespective of the VLAN groups on the ports or the bridge itself
(there may not even be a pvid, and this makes no difference).

On the other hand, DSA switches still do VLAN processing internally,
even with vlan_filtering disabled, but they are expected to classify all
packets to the port pvid. That pvid shouldn't be confused with the
bridge pvid, and there lies the problem.

When a switch port is under a VLAN-unaware bridge, the hardware pvid
must be explicitly managed by the driver to classify all received
packets to it, regardless of bridge VLAN groups. When under a VLAN-aware
bridge, the hardware pvid must be synchronized to the bridge port pvid.
To do this correctly, the pattern is unfortunately a bit complicated,
and involves hooking the pvid change logic into quite a few places
(the ones that change the input variables which determine the value to
use as hardware pvid for a port). See mv88e6xxx_port_commit_pvid(),
sja1105_commit_pvid(), ocelot_port_set_pvid() etc.

The point is that not all drivers used to do that, especially in older
kernels. If a driver is to blindly program a bridge pvid VLAN received
from switchdev while it's VLAN-unaware, this might in turn change the
hardware pvid used by a VLAN-unaware bridge port, which might result in
packet loss depending which other ports have that pvid too (in that same
note, it might also go unnoticed).

To capture that condition, it is sufficient to take a VLAN-unaware
bridge and change the [VLAN-aware] bridge pvid on a single port, to a
VID that isn't present on any other port. This shouldn't have absolutely
any effect on packet classification or forwarding. However, broken
drivers will take the bait, and change their PVID to 3, causing packet
loss.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20241210233541.1401837-1-vladimir.oltean@nxp.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
10 months agoMerge branch 'ionic-minor-code-updates'
Paolo Abeni [Thu, 12 Dec 2024 11:06:35 +0000 (12:06 +0100)]
Merge branch 'ionic-minor-code-updates'

Shannon Nelson says:

====================
ionic: minor code updates

These are a few updates to the ionic driver, mostly for handling
newer/faster QSFP connectors.
====================

Link: https://patch.msgid.link/20241210183045.67878-1-shannon.nelson@amd.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
10 months agoionic: add support for QSFP_PLUS_CMIS
Shannon Nelson [Tue, 10 Dec 2024 18:30:45 +0000 (10:30 -0800)]
ionic: add support for QSFP_PLUS_CMIS

Teach the driver to recognize and decode the sfp pid
SFF8024_ID_QSFP_PLUS_CMIS correctly.

Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
10 months agoionic: add speed defines for 200G and 400G
Shannon Nelson [Tue, 10 Dec 2024 18:30:44 +0000 (10:30 -0800)]
ionic: add speed defines for 200G and 400G

Add higher speed defines to the ionic_if.h API and decode them
in the ethtool get_link_ksettings callback.

Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
10 months agoionic: Translate IONIC_RC_ENOSUPP to EOPNOTSUPP
Brett Creeley [Tue, 10 Dec 2024 18:30:43 +0000 (10:30 -0800)]
ionic: Translate IONIC_RC_ENOSUPP to EOPNOTSUPP

Instead of reporting -EINVAL when IONIC_RC_ENOSUPP is returned use
the -EOPNOTSUPP value. This aligns better since the FW only returns
IONIC_RC_ENOSUPP when operations aren't supported not when invalid
values are used.

Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
10 months agoionic: Use VLAN_ETH_HLEN when possible
Brett Creeley [Tue, 10 Dec 2024 18:30:42 +0000 (10:30 -0800)]
ionic: Use VLAN_ETH_HLEN when possible

Replace when ETH_HLEN and VLAN_HLEN are used together with
VLAN_ETH_HLEN since it's the same value and uses 1 define
instead of 2.

Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
10 months agoionic: add asic codes to firmware interface file
Shannon Nelson [Tue, 10 Dec 2024 18:30:41 +0000 (10:30 -0800)]
ionic: add asic codes to firmware interface file

Now that the firmware has learned how to properly report
the asic type id, add the values to our interface file.

The sharp-eyed reviewers will catch that the CAPRI value
changed here from 0 to 1.  This comes with the FW actually
defining it correctly.  This is safe for us to change as
nothing actually uses that value yet.

Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
10 months agoteam: Fix feature propagation of NETIF_F_GSO_ENCAP_ALL
Daniel Borkmann [Tue, 10 Dec 2024 14:12:45 +0000 (15:12 +0100)]
team: Fix feature propagation of NETIF_F_GSO_ENCAP_ALL

Similar to bonding driver, add NETIF_F_GSO_ENCAP_ALL to TEAM_VLAN_FEATURES
in order to support slave devices which propagate NETIF_F_GSO_UDP_TUNNEL &
NETIF_F_GSO_UDP_TUNNEL_CSUM as vlan_features.

Fixes: 3625920b62c3 ("teaming: fix vlan_features computing")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Nikolay Aleksandrov <razor@blackwall.org>
Cc: Ido Schimmel <idosch@idosch.org>
Cc: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://patch.msgid.link/20241210141245.327886-5-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
10 months agoteam: Fix initial vlan_feature set in __team_compute_features
Daniel Borkmann [Tue, 10 Dec 2024 14:12:44 +0000 (15:12 +0100)]
team: Fix initial vlan_feature set in __team_compute_features

Similarly as with bonding, fix the calculation of vlan_features to reuse
netdev_base_features() in order derive the set in the same way as
ndo_fix_features before iterating through the slave devices to refine the
feature set.

Fixes: 3625920b62c3 ("teaming: fix vlan_features computing")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Nikolay Aleksandrov <razor@blackwall.org>
Cc: Ido Schimmel <idosch@idosch.org>
Cc: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://patch.msgid.link/20241210141245.327886-4-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
10 months agobonding: Fix feature propagation of NETIF_F_GSO_ENCAP_ALL
Daniel Borkmann [Tue, 10 Dec 2024 14:12:43 +0000 (15:12 +0100)]
bonding: Fix feature propagation of NETIF_F_GSO_ENCAP_ALL

Drivers like mlx5 expose NIC's vlan_features such as
NETIF_F_GSO_UDP_TUNNEL & NETIF_F_GSO_UDP_TUNNEL_CSUM which are
later not propagated when the underlying devices are bonded and
a vlan device created on top of the bond.

Right now, the more cumbersome workaround for this is to create
the vlan on top of the mlx5 and then enslave the vlan devices
to a bond.

To fix this, add NETIF_F_GSO_ENCAP_ALL to BOND_VLAN_FEATURES
such that bond_compute_features() can probe and propagate the
vlan_features from the slave devices up to the vlan device.

Given the following bond:

  # ethtool -i enp2s0f{0,1}np{0,1}
  driver: mlx5_core
  [...]

  # ethtool -k enp2s0f0np0 | grep udp
  tx-udp_tnl-segmentation: on
  tx-udp_tnl-csum-segmentation: on
  tx-udp-segmentation: on
  rx-udp_tunnel-port-offload: on
  rx-udp-gro-forwarding: off

  # ethtool -k enp2s0f1np1 | grep udp
  tx-udp_tnl-segmentation: on
  tx-udp_tnl-csum-segmentation: on
  tx-udp-segmentation: on
  rx-udp_tunnel-port-offload: on
  rx-udp-gro-forwarding: off

  # ethtool -k bond0 | grep udp
  tx-udp_tnl-segmentation: on
  tx-udp_tnl-csum-segmentation: on
  tx-udp-segmentation: on
  rx-udp_tunnel-port-offload: off [fixed]
  rx-udp-gro-forwarding: off

Before:

  # ethtool -k bond0.100 | grep udp
  tx-udp_tnl-segmentation: off [requested on]
  tx-udp_tnl-csum-segmentation: off [requested on]
  tx-udp-segmentation: on
  rx-udp_tunnel-port-offload: off [fixed]
  rx-udp-gro-forwarding: off

After:

  # ethtool -k bond0.100 | grep udp
  tx-udp_tnl-segmentation: on
  tx-udp_tnl-csum-segmentation: on
  tx-udp-segmentation: on
  rx-udp_tunnel-port-offload: off [fixed]
  rx-udp-gro-forwarding: off

Various users have run into this reporting performance issues when
configuring Cilium in vxlan tunneling mode and having the combination
of bond & vlan for the core devices connecting the Kubernetes cluster
to the outside world.

Fixes: a9b3ace44c7d ("bonding: fix vlan_features computing")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Nikolay Aleksandrov <razor@blackwall.org>
Cc: Ido Schimmel <idosch@idosch.org>
Cc: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://patch.msgid.link/20241210141245.327886-3-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
10 months agobonding: Fix initial {vlan,mpls}_feature set in bond_compute_features
Daniel Borkmann [Tue, 10 Dec 2024 14:12:42 +0000 (15:12 +0100)]
bonding: Fix initial {vlan,mpls}_feature set in bond_compute_features

If a bonding device has slave devices, then the current logic to derive
the feature set for the master bond device is limited in that flags which
are fully supported by the underlying slave devices cannot be propagated
up to vlan devices which sit on top of bond devices. Instead, these get
blindly masked out via current NETIF_F_ALL_FOR_ALL logic.

vlan_features and mpls_features should reuse netdev_base_features() in
order derive the set in the same way as ndo_fix_features before iterating
through the slave devices to refine the feature set.

Fixes: a9b3ace44c7d ("bonding: fix vlan_features computing")
Fixes: 2e770b507ccd ("net: bonding: Inherit MPLS features from slave devices")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Nikolay Aleksandrov <razor@blackwall.org>
Cc: Ido Schimmel <idosch@idosch.org>
Cc: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://patch.msgid.link/20241210141245.327886-2-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
10 months agonet, team, bonding: Add netdev_base_features helper
Daniel Borkmann [Tue, 10 Dec 2024 14:12:41 +0000 (15:12 +0100)]
net, team, bonding: Add netdev_base_features helper

Both bonding and team driver have logic to derive the base feature
flags before iterating over their slave devices to refine the set
via netdev_increment_features().

Add a small helper netdev_base_features() so this can be reused
instead of having it open-coded multiple times.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Nikolay Aleksandrov <razor@blackwall.org>
Cc: Ido Schimmel <idosch@idosch.org>
Cc: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20241210141245.327886-1-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
10 months agoMerge branch 'net-dsa-cleanup-eee-part-1'
Jakub Kicinski [Thu, 12 Dec 2024 04:29:41 +0000 (20:29 -0800)]
Merge branch 'net-dsa-cleanup-eee-part-1'

Russell King says:

====================
net: dsa: cleanup EEE (part 1)

First part of DSA EEE cleanups.

Patch 1 removes a useless test that is always false. dp->pl will always
be set for user ports, so !dp->pl in the EEE methods will always be
false.

Patch 2 adds support for a new DSA support_eee() method, which tells
DSA whether the DSA driver supports EEE, and thus whether the ethtool
set_eee() and get_eee() methods should return -EOPNOTSUPP.

Patch 3 adds a trivial implementation for this new method which
indicates that EEE is supported.

Patches 4..8 adds implementations for .supports_eee() to all drivers
that support EEE in some form.

Patch 9 switches the core DSA code to require a .supports_eee()
implementation if DSA is supported. Any DSA driver that doesn't
implement this method after this patch will not support the ethtool
EEE methods.

Part 2 will remove the (now) useless .get_mac_eee() DSA operation.
====================

Link: https://patch.msgid.link/Z1hNkEb13FMuDQiY@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agonet: dsa: require .support_eee() method to be implemented
Russell King (Oracle) [Tue, 10 Dec 2024 14:18:52 +0000 (14:18 +0000)]
net: dsa: require .support_eee() method to be implemented

Now that we have updated all drivers, switch DSA to require an
implementation of the .support_eee() method for EEE to be usable,
rather than defaulting to being permissive when not implemented.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Link: https://patch.msgid.link/E1tL14e-006cZy-AT@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agonet: dsa: ksz: implement .support_eee() method
Russell King (Oracle) [Tue, 10 Dec 2024 14:18:47 +0000 (14:18 +0000)]
net: dsa: ksz: implement .support_eee() method

Implement the .support_eee() method by reusing the ksz_validate_eee()
method as a template, renaming the function, changing the return type
and values, and removing it from the ksz_set_mac_eee() and
ksz_get_mac_eee() methods.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Link: https://patch.msgid.link/E1tL14Z-006cZs-6o@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agonet: dsa: mv88e6xxx: implement .support_eee() method
Russell King (Oracle) [Tue, 10 Dec 2024 14:18:42 +0000 (14:18 +0000)]
net: dsa: mv88e6xxx: implement .support_eee() method

Implement the .support_eee() method by using the generic helper as all
user ports support EEE.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Link: https://patch.msgid.link/E1tL14U-006cZm-2K@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agonet: dsa: qca8k: implement .support_eee() method
Russell King (Oracle) [Tue, 10 Dec 2024 14:18:36 +0000 (14:18 +0000)]
net: dsa: qca8k: implement .support_eee() method

Implement the .support_eee() method by using the generic helper as all
user ports support EEE.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Link: https://patch.msgid.link/E1tL14O-006cZg-VM@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agonet: dsa: mt753x: implement .support_eee() method
Russell King (Oracle) [Tue, 10 Dec 2024 14:18:31 +0000 (14:18 +0000)]
net: dsa: mt753x: implement .support_eee() method

Implement the .support_eee() method by using the generic helper as all
user ports support EEE.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Reviewed-by: Arınç ÜNAL <arinc.unal@arinc9.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Link: https://patch.msgid.link/E1tL14J-006cZa-Rh@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agonet: dsa: b53/bcm_sf2: implement .support_eee() method
Russell King (Oracle) [Tue, 10 Dec 2024 14:18:26 +0000 (14:18 +0000)]
net: dsa: b53/bcm_sf2: implement .support_eee() method

Implement the .support_eee() method to indicate that EEE is not
supported by two switch variants, rather than making these checks in
the .set_mac_eee() and .get_mac_eee() methods.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Link: https://patch.msgid.link/E1tL14E-006cZU-Nc@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agonet: dsa: provide implementation of .support_eee()
Russell King (Oracle) [Tue, 10 Dec 2024 14:18:21 +0000 (14:18 +0000)]
net: dsa: provide implementation of .support_eee()

Provide a trivial implementation for the .support_eee() method which
switch drivers can use to simply indicate that they support EEE on
all their user ports.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Link: https://patch.msgid.link/E1tL149-006cZJ-JJ@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agonet: dsa: add hook to determine whether EEE is supported
Russell King (Oracle) [Tue, 10 Dec 2024 14:18:16 +0000 (14:18 +0000)]
net: dsa: add hook to determine whether EEE is supported

Add a hook to determine whether the switch supports EEE. This will
return false if the switch does not, or true if it does. If the
method is not implemented, we assume (currently) that the switch
supports EEE.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Link: https://patch.msgid.link/E1tL144-006cZD-El@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agonet: dsa: remove check for dp->pl in EEE methods
Russell King (Oracle) [Tue, 10 Dec 2024 14:18:11 +0000 (14:18 +0000)]
net: dsa: remove check for dp->pl in EEE methods

When user ports are initialised, a phylink instance is always created,
and so dp->pl will always be non-NULL. The EEE methods are only used
for user ports, so checking for dp->pl to be NULL makes no sense. No
other phylink-calling method implements similar checks in DSA. Remove
this unnecessary check.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Link: https://patch.msgid.link/E1tL13z-006cZ7-BZ@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agonet/sched: netem: account for backlog updates from child qdisc
Martin Ottens [Tue, 10 Dec 2024 13:14:11 +0000 (14:14 +0100)]
net/sched: netem: account for backlog updates from child qdisc

In general, 'qlen' of any classful qdisc should keep track of the
number of packets that the qdisc itself and all of its children holds.
In case of netem, 'qlen' only accounts for the packets in its internal
tfifo. When netem is used with a child qdisc, the child qdisc can use
'qdisc_tree_reduce_backlog' to inform its parent, netem, about created
or dropped SKBs. This function updates 'qlen' and the backlog statistics
of netem, but netem does not account for changes made by a child qdisc.
'qlen' then indicates the wrong number of packets in the tfifo.
If a child qdisc creates new SKBs during enqueue and informs its parent
about this, netem's 'qlen' value is increased. When netem dequeues the
newly created SKBs from the child, the 'qlen' in netem is not updated.
If 'qlen' reaches the configured sch->limit, the enqueue function stops
working, even though the tfifo is not full.

Reproduce the bug:
Ensure that the sender machine has GSO enabled. Configure netem as root
qdisc and tbf as its child on the outgoing interface of the machine
as follows:
$ tc qdisc add dev <oif> root handle 1: netem delay 100ms limit 100
$ tc qdisc add dev <oif> parent 1:0 tbf rate 50Mbit burst 1542 latency 50ms

Send bulk TCP traffic out via this interface, e.g., by running an iPerf3
client on the machine. Check the qdisc statistics:
$ tc -s qdisc show dev <oif>

Statistics after 10s of iPerf3 TCP test before the fix (note that
netem's backlog > limit, netem stopped accepting packets):
qdisc netem 1: root refcnt 2 limit 1000 delay 100ms
 Sent 2767766 bytes 1848 pkt (dropped 652, overlimits 0 requeues 0)
 backlog 4294528236b 1155p requeues 0
qdisc tbf 10: parent 1:1 rate 50Mbit burst 1537b lat 50ms
 Sent 2767766 bytes 1848 pkt (dropped 327, overlimits 7601 requeues 0)
 backlog 0b 0p requeues 0

Statistics after the fix:
qdisc netem 1: root refcnt 2 limit 1000 delay 100ms
 Sent 37766372 bytes 24974 pkt (dropped 9, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc tbf 10: parent 1:1 rate 50Mbit burst 1537b lat 50ms
 Sent 37766372 bytes 24974 pkt (dropped 327, overlimits 96017 requeues 0)
 backlog 0b 0p requeues 0

tbf segments the GSO SKBs (tbf_segment) and updates the netem's 'qlen'.
The interface fully stops transferring packets and "locks". In this case,
the child qdisc and tfifo are empty, but 'qlen' indicates the tfifo is at
its limit and no more packets are accepted.

This patch adds a counter for the entries in the tfifo. Netem's 'qlen' is
only decreased when a packet is returned by its dequeue function, and not
during enqueuing into the child qdisc. External updates to 'qlen' are thus
accounted for and only the behavior of the backlog statistics changes. As
in other qdiscs, 'qlen' then keeps track of  how many packets are held in
netem and all of its children. As before, sch->limit remains as the
maximum number of packets in the tfifo. The same applies to netem's
backlog statistics.

Fixes: 50612537e9ab ("netem: fix classful handling")
Signed-off-by: Martin Ottens <martin.ottens@fau.de>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20241210131412.1837202-1-martin.ottens@fau.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agoMerge tag 'batadv-net-pullrequest-20241210' of git://git.open-mesh.org/linux-merge
Jakub Kicinski [Thu, 12 Dec 2024 04:25:59 +0000 (20:25 -0800)]
Merge tag 'batadv-net-pullrequest-20241210' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
Here are some batman-adv bugfixes:

 - fix TT unitialized data and size limit issues, by Remi Pommarel
  (3 patches)

* tag 'batadv-net-pullrequest-20241210' of git://git.open-mesh.org/linux-merge:
  batman-adv: Do not let TT changes list grows indefinitely
  batman-adv: Remove uninitialized data in full table TT response
  batman-adv: Do not send uninitialized TT changes
====================

Link: https://patch.msgid.link/20241210135024.39068-1-sw@simonwunderlich.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agonet: dsa: felix: fix stuck CPU-injected packets with short taprio windows
Vladimir Oltean [Tue, 10 Dec 2024 13:26:40 +0000 (15:26 +0200)]
net: dsa: felix: fix stuck CPU-injected packets with short taprio windows

With this port schedule:

tc qdisc replace dev $send_if parent root handle 100 taprio \
num_tc 8 queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \
map 0 1 2 3 4 5 6 7 \
base-time 0 cycle-time 10000 \
sched-entry S 01 1250 \
sched-entry S 02 1250 \
sched-entry S 04 1250 \
sched-entry S 08 1250 \
sched-entry S 10 1250 \
sched-entry S 20 1250 \
sched-entry S 40 1250 \
sched-entry S 80 1250 \
flags 2

ptp4l would fail to take TX timestamps of Pdelay_Resp messages like this:

increasing tx_timestamp_timeout may correct this issue, but it is likely caused by a driver bug
ptp4l[4134.168]: port 2: send peer delay response failed

It turns out that the driver can't take their TX timestamps because it
can't transmit them in the first place. And there's nothing special
about the Pdelay_Resp packets - they're just regular 68 byte packets.
But with this taprio configuration, the switch would refuse to send even
the ETH_ZLEN minimum packet size.

This should have definitely not been the case. When applying the taprio
config, the driver prints:

mscc_felix 0000:00:00.5: port 0 tc 0 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 132 octets including FCS
mscc_felix 0000:00:00.5: port 0 tc 1 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 132 octets including FCS
mscc_felix 0000:00:00.5: port 0 tc 2 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 132 octets including FCS
mscc_felix 0000:00:00.5: port 0 tc 3 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 132 octets including FCS
mscc_felix 0000:00:00.5: port 0 tc 4 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 132 octets including FCS
mscc_felix 0000:00:00.5: port 0 tc 5 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 132 octets including FCS
mscc_felix 0000:00:00.5: port 0 tc 6 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 132 octets including FCS
mscc_felix 0000:00:00.5: port 0 tc 7 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 132 octets including FCS

and thus, everything under 132 bytes - ETH_FCS_LEN should have been sent
without problems. Yet it's not.

For the forwarding path, the configuration is fine, yet packets injected
from Linux get stuck with this schedule no matter what.

The first hint that the static guard bands are the cause of the problem
is that reverting Michael Walle's commit 297c4de6f780 ("net: dsa: felix:
re-enable TAS guard band mode") made things work. It must be that the
guard bands are calculated incorrectly.

I remembered that there is a magic constant in the driver, set to 33 ns
for no logical reason other than experimentation, which says "never let
the static guard bands get so large as to leave less than this amount of
remaining space in the time slot, because the queue system will refuse
to schedule packets otherwise, and they will get stuck". I had a hunch
that my previous experimentally-determined value was only good for
packets coming from the forwarding path, and that the CPU injection path
needed more.

I came to the new value of 35 ns through binary search, after seeing
that with 544 ns (the bit time required to send the Pdelay_Resp packet
at gigabit) it works. Again, this is purely experimental, there's no
logic and the manual doesn't say anything.

The new driver prints for this schedule look like this:

mscc_felix 0000:00:00.5: port 0 tc 0 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 131 octets including FCS
mscc_felix 0000:00:00.5: port 0 tc 1 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 131 octets including FCS
mscc_felix 0000:00:00.5: port 0 tc 2 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 131 octets including FCS
mscc_felix 0000:00:00.5: port 0 tc 3 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 131 octets including FCS
mscc_felix 0000:00:00.5: port 0 tc 4 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 131 octets including FCS
mscc_felix 0000:00:00.5: port 0 tc 5 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 131 octets including FCS
mscc_felix 0000:00:00.5: port 0 tc 6 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 131 octets including FCS
mscc_felix 0000:00:00.5: port 0 tc 7 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 131 octets including FCS

So yes, the maximum MTU is now even smaller by 1 byte than before.
This is maybe counter-intuitive, but makes more sense with a diagram of
one time slot.

Before:

 Gate open                                   Gate close
 |                                                    |
 v           1250 ns total time slot duration         v
 <---------------------------------------------------->
 <----><---------------------------------------------->
  33 ns            1217 ns static guard band
  useful

 Gate open                                   Gate close
 |                                                    |
 v           1250 ns total time slot duration         v
 <---------------------------------------------------->
 <-----><--------------------------------------------->
  35 ns            1215 ns static guard band
  useful

The static guard band implemented by this switch hardware directly
determines the maximum allowable MTU for that traffic class. The larger
it is, the earlier the switch will stop scheduling frames for
transmission, because otherwise they might overrun the gate close time
(and avoiding that is the entire purpose of Michael's patch).
So, we now have guard bands smaller by 2 ns, thus, in this particular
case, we lose a byte of the maximum MTU.

Fixes: 11afdc6526de ("net: dsa: felix: tc-taprio intervals smaller than MTU should send at least one packet")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Michael Walle <mwalle@kernel.org>
Link: https://patch.msgid.link/20241210132640.3426788-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agonet: fec: use phydev->eee_cfg.tx_lpi_timer
Russell King (Oracle) [Tue, 10 Dec 2024 12:38:26 +0000 (12:38 +0000)]
net: fec: use phydev->eee_cfg.tx_lpi_timer

Rather than maintaining a private copy of the LPI timer, make use of
the LPI timer maintained by phylib. In any case, phylib overwrites the
value of tx_lpi_timer set by the driver in phy_ethtool_get_eee().

Note that feb->eee.tx_lpi_timer is initialised to zero, which is just
the same with phylib's copy, so there should be no functional change.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Tested-by: Wei Fang <wei.fang@nxp.com>
Link: https://patch.msgid.link/E1tKzVS-006c67-IJ@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agomlxsw: spectrum_flower: Do not allow mixing sample and mirror actions
Ido Schimmel [Tue, 10 Dec 2024 09:45:37 +0000 (10:45 +0100)]
mlxsw: spectrum_flower: Do not allow mixing sample and mirror actions

The device does not support multiple mirror actions per rule and the
driver rejects such configuration:

 # tc filter add dev swp1 ingress pref 1 proto ip flower skip_sw action mirred egress mirror dev swp2 action mirred egress mirror dev swp3
 Error: mlxsw_spectrum: Multiple mirror actions per rule are not supported.
 We have an error talking to the kernel

Internally, the sample action is implemented by the device by mirroring
to the CPU port. Therefore, mixing sample and mirror actions in a single
rule does not work correctly and results in the last action effect.

Solve by rejecting such misconfiguration:

 # tc filter add dev swp1 ingress pref 1 proto ip flower skip_sw action mirred egress mirror dev swp2 action sample rate 100 group 1
 Error: mlxsw_spectrum: Sample action after mirror action is not supported.
 We have an error talking to the kernel

 # tc filter add dev swp1 ingress pref 1 proto ip flower skip_sw action sample rate 100 group 1 action mirred egress mirror dev swp2
 Error: mlxsw_spectrum: Mirror action after sample action is not supported.
 We have an error talking to the kernel

Reported-by: Vladyslav Mykhaliuk <vmykhaliuk@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Link: https://patch.msgid.link/d6c979914e8706dbe1dedbaf29ffffb0b8d71166.1733822570.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agosplice: do not checksum AF_UNIX sockets
Frederik Deweerdt [Tue, 10 Dec 2024 05:06:48 +0000 (21:06 -0800)]
splice: do not checksum AF_UNIX sockets

When `skb_splice_from_iter` was introduced, it inadvertently added
checksumming for AF_UNIX sockets. This resulted in significant
slowdowns, for example when using sendfile over unix sockets.

Using the test code in [1] in my test setup (2G single core qemu),
the client receives a 1000M file in:
- without the patch: 1482ms (+/- 36ms)
- with the patch: 652.5ms (+/- 22.9ms)

This commit addresses the issue by marking checksumming as unnecessary in
`unix_stream_sendmsg`

Cc: stable@vger.kernel.org
Signed-off-by: Frederik Deweerdt <deweerdt.lkml@gmail.com>
Fixes: 2e910b95329c ("net: Add a function to splice pages into an skbuff for MSG_SPLICE_PAGES")
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/Z1fMaHkRf8cfubuE@xiberoa
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agonet: usb: qmi_wwan: add Telit FE910C04 compositions
Daniele Palmas [Mon, 9 Dec 2024 15:18:21 +0000 (16:18 +0100)]
net: usb: qmi_wwan: add Telit FE910C04 compositions

Add the following Telit FE910C04 compositions:

0x10c0: rmnet + tty (AT/NMEA) + tty (AT) + tty (diag)
T:  Bus=02 Lev=01 Prnt=03 Port=06 Cnt=01 Dev#= 13 Spd=480  MxCh= 0
D:  Ver= 2.00 Cls=00(>ifc ) Sub=00 Prot=00 MxPS=64 #Cfgs=  1
P:  Vendor=1bc7 ProdID=10c0 Rev=05.15
S:  Manufacturer=Telit Cinterion
S:  Product=FE910
S:  SerialNumber=f71b8b32
C:  #Ifs= 4 Cfg#= 1 Atr=e0 MxPwr=500mA
I:  If#= 0 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=ff Prot=50 Driver=qmi_wwan
E:  Ad=01(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms
E:  Ad=81(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms
E:  Ad=82(I) Atr=03(Int.) MxPS=   8 Ivl=32ms
I:  If#= 1 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=ff Prot=60 Driver=option
E:  Ad=02(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms
E:  Ad=83(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms
E:  Ad=84(I) Atr=03(Int.) MxPS=  10 Ivl=32ms
I:  If#= 2 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=ff Prot=40 Driver=option
E:  Ad=03(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms
E:  Ad=85(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms
E:  Ad=86(I) Atr=03(Int.) MxPS=  10 Ivl=32ms
I:  If#= 3 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=ff Prot=30 Driver=option
E:  Ad=04(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms
E:  Ad=87(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms

0x10c4: rmnet + tty (AT) + tty (AT) + tty (diag)
T:  Bus=02 Lev=01 Prnt=03 Port=06 Cnt=01 Dev#= 14 Spd=480  MxCh= 0
D:  Ver= 2.00 Cls=00(>ifc ) Sub=00 Prot=00 MxPS=64 #Cfgs=  1
P:  Vendor=1bc7 ProdID=10c4 Rev=05.15
S:  Manufacturer=Telit Cinterion
S:  Product=FE910
S:  SerialNumber=f71b8b32
C:  #Ifs= 4 Cfg#= 1 Atr=e0 MxPwr=500mA
I:  If#= 0 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=ff Prot=50 Driver=qmi_wwan
E:  Ad=01(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms
E:  Ad=81(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms
E:  Ad=82(I) Atr=03(Int.) MxPS=   8 Ivl=32ms
I:  If#= 1 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=ff Prot=40 Driver=option
E:  Ad=02(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms
E:  Ad=83(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms
E:  Ad=84(I) Atr=03(Int.) MxPS=  10 Ivl=32ms
I:  If#= 2 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=ff Prot=40 Driver=option
E:  Ad=03(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms
E:  Ad=85(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms
E:  Ad=86(I) Atr=03(Int.) MxPS=  10 Ivl=32ms
I:  If#= 3 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=ff Prot=30 Driver=option
E:  Ad=04(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms
E:  Ad=87(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms

0x10c8: rmnet + tty (AT) + tty (diag) + DPL (data packet logging) + adb
T:  Bus=02 Lev=01 Prnt=03 Port=06 Cnt=01 Dev#= 17 Spd=480  MxCh= 0
D:  Ver= 2.00 Cls=00(>ifc ) Sub=00 Prot=00 MxPS=64 #Cfgs=  1
P:  Vendor=1bc7 ProdID=10c8 Rev=05.15
S:  Manufacturer=Telit Cinterion
S:  Product=FE910
S:  SerialNumber=f71b8b32
C:  #Ifs= 5 Cfg#= 1 Atr=e0 MxPwr=500mA
I:  If#= 0 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=ff Prot=50 Driver=qmi_wwan
E:  Ad=01(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms
E:  Ad=81(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms
E:  Ad=82(I) Atr=03(Int.) MxPS=   8 Ivl=32ms
I:  If#= 1 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=ff Prot=40 Driver=option
E:  Ad=02(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms
E:  Ad=83(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms
E:  Ad=84(I) Atr=03(Int.) MxPS=  10 Ivl=32ms
I:  If#= 2 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=ff Prot=30 Driver=option
E:  Ad=03(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms
E:  Ad=85(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms
I:  If#= 3 Alt= 0 #EPs= 1 Cls=ff(vend.) Sub=ff Prot=80 Driver=(none)
E:  Ad=86(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms
I:  If#= 4 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=42 Prot=01 Driver=(none)
E:  Ad=04(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms
E:  Ad=87(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms

Signed-off-by: Daniele Palmas <dnlplm@gmail.com>
Link: https://patch.msgid.link/20241209151821.3688829-1-dnlplm@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agoMerge branch 'mana-fix-few-memory-leaks-in-mana_gd_setup_irqs'
Jakub Kicinski [Thu, 12 Dec 2024 04:21:06 +0000 (20:21 -0800)]
Merge branch 'mana-fix-few-memory-leaks-in-mana_gd_setup_irqs'

Maxim Levitsky says:

====================
MANA: Fix few memory leaks in mana_gd_setup_irqs

Fix 2 minor memory leaks in the mana driver,
introduced by commit
====================

Link: https://patch.msgid.link/20241209175751.287738-1-mlevitsk@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agonet: mana: Fix irq_contexts memory leak in mana_gd_setup_irqs
Maxim Levitsky [Mon, 9 Dec 2024 17:57:51 +0000 (12:57 -0500)]
net: mana: Fix irq_contexts memory leak in mana_gd_setup_irqs

gc->irq_contexts is not freeded if one of the later operations
fail.

Suggested-by: Michael Kelley <mhklinux@outlook.com>
Fixes: 8afefc361209 ("net: mana: Assigning IRQ affinity on HT cores")
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Reviewed-by: Saurabh Sengar <ssengar@linux.microsoft.com>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Link: https://patch.msgid.link/20241209175751.287738-3-mlevitsk@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agonet: mana: Fix memory leak in mana_gd_setup_irqs
Maxim Levitsky [Mon, 9 Dec 2024 17:57:50 +0000 (12:57 -0500)]
net: mana: Fix memory leak in mana_gd_setup_irqs

Commit 8afefc361209 ("net: mana: Assigning IRQ affinity on HT cores")
added memory allocation in mana_gd_setup_irqs of 'irqs' but the code
doesn't free this temporary array in the success path.

This was caught by kmemleak.

Fixes: 8afefc361209 ("net: mana: Assigning IRQ affinity on HT cores")
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Reviewed-by: Saurabh Sengar <ssengar@linux.microsoft.com>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Link: https://patch.msgid.link/20241209175751.287738-2-mlevitsk@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agoMerge branch 'make-time-wait-reuse-delay-deterministic-and-configurable'
Jakub Kicinski [Thu, 12 Dec 2024 04:17:38 +0000 (20:17 -0800)]
Merge branch 'make-time-wait-reuse-delay-deterministic-and-configurable'

Jakub Sitnicki says:

====================
Make TIME-WAIT reuse delay deterministic and configurable

This patch set is an effort to enable faster reuse of TIME-WAIT sockets.
We have recently talked about the motivation and the idea at Plumbers [1].

Experiment in production
------------------------

We are restarting our experiment on a small set of production nodes as the
code has slightly changed since v1 [2], and there are still a few weeks of
development window to soak the changes. We will report back if we observe
any regressions.

Packetdrill tests
-----------------

The packetdrill tests for TIME-WAIT reuse [3] did not change since v1.
Although we are not touching PAWS code any more, I would still like to add
tests to cover PAWS reject after TW reuse. This, however, requires patching
packetdrill as I mentioned in the last cover letter [2].

[1] https://lpc.events/event/18/contributions/1962/
[2] https://lore.kernel.org/r/20241113-jakub-krn-909-poc-msec-tw-tstamp-v2-0-b0a335247304@cloudflare.com
[3] https://github.com/google/packetdrill/pull/90

v1: https://lore.kernel.org/20241204-jakub-krn-909-poc-msec-tw-tstamp-v1-0-8b54467a0f34@cloudflare.com
RFCv2: https://lore.kernel.org/20241113-jakub-krn-909-poc-msec-tw-tstamp-v2-0-b0a335247304@cloudflare.com
RFCv1: https://lore.kernel.org/20240819-jakub-krn-909-poc-msec-tw-tstamp-v1-1-6567b5006fbe@cloudflare.com
====================

Link: https://patch.msgid.link/20241209-jakub-krn-909-poc-msec-tw-tstamp-v2-0-66aca0eed03e@cloudflare.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agotcp: Add sysctl to configure TIME-WAIT reuse delay
Jakub Sitnicki [Mon, 9 Dec 2024 19:38:04 +0000 (20:38 +0100)]
tcp: Add sysctl to configure TIME-WAIT reuse delay

Today we have a hardcoded delay of 1 sec before a TIME-WAIT socket can be
reused by reopening a connection. This is a safe choice based on an
assumption that the other TCP timestamp clock frequency, which is unknown
to us, may be as low as 1 Hz (RFC 7323, section 5.4).

However, this means that in the presence of short lived connections with an
RTT of couple of milliseconds, the time during which a 4-tuple is blocked
from reuse can be orders of magnitude longer that the connection lifetime.
Combined with a reduced pool of ephemeral ports, when using
IP_LOCAL_PORT_RANGE to share an egress IP address between hosts [1], the
long TIME-WAIT reuse delay can lead to port exhaustion, where all available
4-tuples are tied up in TIME-WAIT state.

Turn the reuse delay into a per-netns setting so that sysadmins can make
more aggressive assumptions about remote TCP timestamp clock frequency and
shorten the delay in order to allow connections to reincarnate faster.

Note that applications can completely bypass the TIME-WAIT delay protection
already today by locking the local port with bind() before connecting. Such
immediate connection reuse may result in PAWS failing to detect old
duplicate segments, leaving us with just the sequence number check as a
safety net.

This new configurable offers a trade off where the sysadmin can balance
between the risk of PAWS detection failing to act versus exhausting ports
by having sockets tied up in TIME-WAIT state for too long.

[1] https://lpc.events/event/16/contributions/1349/

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20241209-jakub-krn-909-poc-msec-tw-tstamp-v2-2-66aca0eed03e@cloudflare.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agotcp: Measure TIME-WAIT reuse delay with millisecond precision
Jakub Sitnicki [Mon, 9 Dec 2024 19:38:03 +0000 (20:38 +0100)]
tcp: Measure TIME-WAIT reuse delay with millisecond precision

Prepare ground for TIME-WAIT socket reuse with subsecond delay.

Today the last TS.Recent update timestamp, recorded in seconds and stored
tp->ts_recent_stamp and tw->tw_ts_recent_stamp fields, has two purposes.

Firstly, it is used to track the age of the last recorded TS.Recent value
to detect when that value becomes outdated due to potential wrap-around of
the other TCP timestamp clock (RFC 7323, section 5.5).

For this purpose a second-based timestamp is completely sufficient as even
in the worst case scenario of a peer using a high resolution microsecond
timestamp, the wrap-around interval is ~36 minutes long.

Secondly, it serves as a threshold value for allowing TIME-WAIT socket
reuse. A TIME-WAIT socket can be reused only once the virtual 1 Hz clock,
ktime_get_seconds, is past the TS.Recent update timestamp.

The purpose behind delaying the TIME-WAIT socket reuse is to wait for the
other TCP timestamp clock to tick at least once before reusing the
connection. It is only then that the PAWS mechanism for the reopened
connection can detect old duplicate segments from the previous connection
incarnation (RFC 7323, appendix B.2).

In this case using a timestamp with second resolution not only blocks the
way toward allowing faster TIME-WAIT reuse after shorter subsecond delay,
but also makes it impossible to reliably delay TW reuse by one second.

As Eric Dumazet has pointed out [1], due to timestamp rounding, the TW
reuse delay will actually be between (0, 1] seconds, and 0.5 seconds on
average. We delay TW reuse for one full second only when last TS.Recent
update coincides with our virtual 1 Hz clock tick.

Considering the above, introduce a dedicated field to store a millisecond
timestamp of transition into the TIME-WAIT state. Place it in an existing
4-byte hole inside inet_timewait_sock structure to avoid an additional
memory cost.

Use the new timestamp to (i) reliably delay TIME-WAIT reuse by one second,
and (ii) prepare for configurable subsecond reuse delay in the subsequent
change.

We assume here that a full one second delay was the original intention in
[2] because it accounts for the worst case scenario of the other TCP using
the slowest recommended 1 Hz timestamp clock.

A more involved alternative would be to change the resolution of the last
TS.Recent update timestamp, tw->tw_ts_recent_stamp, to milliseconds.

[1] https://lore.kernel.org/netdev/CANn89iKB4GFd8sVzCbRttqw_96o3i2wDhX-3DraQtsceNGYwug@mail.gmail.com/
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b8439924316d5bcb266d165b93d632a4b4b859af

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20241209-jakub-krn-909-poc-msec-tw-tstamp-v2-1-66aca0eed03e@cloudflare.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agoMerge branch 'ipv6-mcast-add-data-race-annotations'
Jakub Kicinski [Thu, 12 Dec 2024 04:15:30 +0000 (20:15 -0800)]
Merge branch 'ipv6-mcast-add-data-race-annotations'

Eric Dumazet says:

====================
ipv6: mcast: add data-race annotations

ipv6_chk_mcast_addr() and igmp6_mcf_seq_show() are reading
fields under RCU. Add missing annotations.
====================

Link: https://patch.msgid.link/20241210183352.86530-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agoipv6: mcast: annotate data-race around psf->sf_count[MCAST_XXX]
Eric Dumazet [Tue, 10 Dec 2024 18:33:52 +0000 (18:33 +0000)]
ipv6: mcast: annotate data-race around psf->sf_count[MCAST_XXX]

psf->sf_count[MCAST_XXX] fields are read locklessly from
ipv6_chk_mcast_addr() and igmp6_mcf_seq_show().

Add READ_ONCE() and WRITE_ONCE() annotations accordingly.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20241210183352.86530-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agoipv6: mcast: annotate data-races around mc->mca_sfcount[MCAST_EXCLUDE]
Eric Dumazet [Tue, 10 Dec 2024 18:33:51 +0000 (18:33 +0000)]
ipv6: mcast: annotate data-races around mc->mca_sfcount[MCAST_EXCLUDE]

mc->mca_sfcount[MCAST_EXCLUDE] is read locklessly from
ipv6_chk_mcast_addr().

Add READ_ONCE() and WRITE_ONCE() annotations accordingly.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20241210183352.86530-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agoipv6: mcast: reduce ipv6_chk_mcast_addr() indentation
Eric Dumazet [Tue, 10 Dec 2024 18:33:50 +0000 (18:33 +0000)]
ipv6: mcast: reduce ipv6_chk_mcast_addr() indentation

Add a label and two gotos to shorten lines by two tabulations,
to ease code review of following patches.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20241210183352.86530-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agoMerge branch 'lib-packing-introduce-and-use-un-pack_fields'
Jakub Kicinski [Thu, 12 Dec 2024 04:13:29 +0000 (20:13 -0800)]
Merge branch 'lib-packing-introduce-and-use-un-pack_fields'

Jacob Keller says:

====================
lib: packing: introduce and use (un)pack_fields

This series improves the packing library with a new API for packing or
unpacking a large number of fields at once with minimal code footprint. The
API is then used to replace bespoke packing logic in the ice driver,
preparing it to handle unpacking in the future. Finally, the ice driver has
a few other cleanups related to the packing logic.

The pack_fields and unpack_fields functions have the following improvements
over the existing pack() and unpack() API:

 1. Packing or unpacking a large number of fields takes significantly less
    code. This significantly reduces the .text size for an increase in the
    .data size which is much smaller.

 2. The unpacked data can be stored in sizes smaller than u64 variables.
    This reduces the storage requirement both for runtime data structures,
    and for the rodata defining the fields. This scales with the number of
    fields used.

 3. Most of the error checking is done at compile time, rather than
    runtime, via CHECK_PACKED_FIELD macros.

The actual packing and unpacking code still uses the u64 size
variables. However, these are converted to the appropriate field sizes when
storing or reading the data from the buffer.

This version now uses significantly improved macro checks, thanks to the
work of Vladimir. We now only need 300 lines of macro for the generated
checks. In addition, each new check only requires 4 lines of code for its
macro implementation and 1 extra line in the CHECK_PACKED_FIELDS macro.
This is significantly better than previous versions which required ~2700
lines.

The CHECK_PACKED_FIELDS macro uses __builtin_choose_expr to select the
appropriately sized CHECK_PACKED_FIELDS_N macro. This enables directly
adding CHECK_PACKED_FIELDS calls into the pack_fields and unpack_fields
macros. Drivers no longer need to call the CHECK_PACKED_FIELDS_N macros
directly, and we do not need to modify Kbuild or introduce multiple CONFIG
options.

The code for the CHECK_PACKED_FIELDS_(0..50) and CHECK_PACKED_FIELDS itself
can be generated from the C program in scripts/gen_packed_field_checks.c.
This little C program may be used in the future to update the checks to
more sizes if a driver with more than 50 fields appears in the future.
The total amount of required code is now much smaller, and we don't
anticipate needing to increase the size very often. Thus, it makes sense to
simply commit the result directly instead of attempting to modify Kbuild to
automatically generate it.

This version uses the 5-argument format of pack_fields and unpack_fields,
with the size of the packed buffer passed as one of the arguments. We do
enforce that the compiler can tell its a constant using
__builtin_constant_p(), ensuring that the size checks are handled at
compile time. We could reduce these to 4 arguments and require that the
passed in pbuf be of a type which has the appropriate size. I opted against
that because it makes the API less flexible and a bit less natural to use
in existing code.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
v9: https://lore.kernel.org/20241204-packing-pack-fields-and-ice-implementation-v9-0-81c8f2bd7323@intel.com
v8: https://lore.kernel.org/20241203-packing-pack-fields-and-ice-implementation-v8-0-2ed68edfe583@intel.com
v7: https://lore.kernel.org/20241202-packing-pack-fields-and-ice-implementation-v7-0-ed22e38e6c65@intel.com
v6: https://lore.kernel.org/20241118-packing-pack-fields-and-ice-implementation-v6-0-6af8b658a6c3@intel.com
v5: https://lore.kernel.org/20241111-packing-pack-fields-and-ice-implementation-v5-0-80c07349e6b7@intel.com
v4: https://lore.kernel.org/20241108-packing-pack-fields-and-ice-implementation-v4-0-81a9f42c30e5@intel.com
v3: https://lore.kernel.org/20241107-packing-pack-fields-and-ice-implementation-v3-0-27c566ac2436@intel.com
v2: https://lore.kernel.org/20241025-packing-pack-fields-and-ice-implementation-v2-0-734776c88e40@intel.com
v1: https://lore.kernel.org/20241011-packing-pack-fields-and-ice-implementation-v1-0-d9b1f7500740@intel.com
====================

Link: https://patch.msgid.link/20241210-packing-pack-fields-and-ice-implementation-v10-0-ee56a47479ac@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agoice: cleanup Rx queue context programming functions
Jacob Keller [Tue, 10 Dec 2024 20:27:19 +0000 (12:27 -0800)]
ice: cleanup Rx queue context programming functions

The ice_copy_rxq_ctx_to_hw() and ice_write_rxq_ctx() functions perform some
defensive checks which are typically frowned upon by kernel style
guidelines.

In particular, NULL checks on buffers which point to the stack are
discouraged, especially when the functions are static and only called once.
Checks of this sort only serve to hide potential programming error, as we
will not produce the normal crash dump on a NULL access.

In addition, ice_copy_rxq_ctx_to_hw() cannot fail in another way, so could
be made void.

Future support for VF Live Migration will need to introduce an inverse
function for reading Rx queue context from HW registers to unpack it, as
well as functions to pack and unpack Tx queue context from HW.

Rather than copying these style issues into the new functions, lets first
cleanup the existing code.

For the ice_copy_rxq_ctx_to_hw() function:

 * Move the Rx queue index check out of this function.
 * Convert the function to a void return.
 * Use a simple int variable instead of a u8 for the for loop index, and
   initialize it inside the for loop.
 * Update the function description to better align with kernel doc style.

For the ice_write_rxq_ctx() function:

 * Move the Rx queue index check into this function.
 * Update the function description with a Returns: to align with kernel doc
   style.

These changes align the existing write functions to current kernel
style, and will align with the style of the new functions added when we
implement live migration in a future series.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20241210-packing-pack-fields-and-ice-implementation-v10-10-ee56a47479ac@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agoice: move prefetch enable to ice_setup_rx_ctx
Jacob Keller [Tue, 10 Dec 2024 20:27:18 +0000 (12:27 -0800)]
ice: move prefetch enable to ice_setup_rx_ctx

The ice_write_rxq_ctx() function is responsible for programming the Rx
Queue context into hardware. It receives the configuration in unpacked form
via the ice_rlan_ctx structure.

This function unconditionally modifies the context to set the prefetch
enable bit. This was done by commit c31a5c25bb19 ("ice: Always set prefena
when configuring an Rx queue"). Setting this bit makes sense, since
prefetching descriptors is almost always the preferred behavior.

However, the ice_write_rxq_ctx() function is not the place that actually
defines the queue context. We initialize the Rx Queue context in
ice_setup_rx_ctx(). It is surprising to have the Rx queue context changed
by a function who's responsibility is to program the given context to
hardware.

Following the principle of least surprise, move the setting of the prefetch
enable bit out of ice_write_rxq_ctx() and into the ice_setup_rx_ctx().

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20241210-packing-pack-fields-and-ice-implementation-v10-9-ee56a47479ac@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agoice: reduce size of queue context fields
Jacob Keller [Tue, 10 Dec 2024 20:27:17 +0000 (12:27 -0800)]
ice: reduce size of queue context fields

The ice_rlan_ctx and ice_tlan_ctx structures have some fields which are
intentionally sized larger than necessary relative to the packed sizes the
data must fit into. This was done because the original ice_set_ctx()
function and its helpers did not correctly handle packing when the packed
bits straddled a byte. This is no longer the case with the use of the
<linux/packing.h> implementation.

Save some bytes in these structures by sizing the variables to the number
of bytes the actual bitpacked fields fit into.

There are a couple of gaps left in the structure, which is a result of the
fields being in the order they appear in the packed bit layout, but where
alignment forces some extra gaps. We could fix this, saving ~8 bytes from
each structure. However, these structures are not used heavily, and the
resulting savings is minimal:

$ bloat-o-meter ice-before-reorder.ko ice-after-reorder.ko
add/remove: 0/0 grow/shrink: 1/1 up/down: 26/-70 (-44)
Function                                     old     new   delta
ice_vsi_cfg_txq                             1873    1899     +26
ice_setup_rx_ctx.constprop                  1529    1459     -70
Total: Before=1459555, After=1459511, chg -0.00%

Thus, the fields are left in the same order as the packed bit layout,
despite the gaps this causes.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20241210-packing-pack-fields-and-ice-implementation-v10-8-ee56a47479ac@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agoice: use <linux/packing.h> for Tx and Rx queue context data
Jacob Keller [Tue, 10 Dec 2024 20:27:16 +0000 (12:27 -0800)]
ice: use <linux/packing.h> for Tx and Rx queue context data

The ice driver needs to write the Tx and Rx queue context when programming
Tx and Rx queues. This is currently done using some bespoke custom logic
via the ice_set_ctx() and its helper functions, along with bit position
definitions in the ice_tlan_ctx_info and ice_rlan_ctx_info structures.

This logic does work, but is problematic for several reasons:

1) ice_set_ctx requires a helper function for each byte size being packed,
   as it uses a separate function to pack u8, u16, u32, and u64 fields.
   This requires 4 functions which contain near-duplicate logic with the
   types changed out.

2) The logic in the ice_pack_ctx_word, ice_pack_ctx_dword, and
   ice_pack_ctx_qword does not handle values which straddle alignment
   boundaries very well. This requires that several fields in the
   ice_tlan_ctx_info and ice_rlan_ctx_info be a size larger than their bit
   size should require.

3) Future support for live migration will require adding unpacking
   functions to take the packed hardware context and unpack it into the
   ice_rlan_ctx and ice_tlan_ctx structures. Implementing this would
   require implementing ice_get_ctx, and its associated helper functions,
   which essentially doubles the amount of code required.

The Linux kernel has had a packing library that can handle this logic since
commit 554aae35007e ("lib: Add support for generic packing operations").
The library was recently extended with support for packing or unpacking an
array of fields, with a similar structure as the ice_ctx_ele structure.

Replace the ice-specific ice_set_ctx() logic with the recently added
pack_fields and packed_field_s infrastructure from <linux/packing.h>

For API simplicity, the Tx and Rx queue context are programmed using
separate ice_pack_txq_ctx() and ice_pack_rxq_ctx(). This avoids needing to
export the packed_field_s arrays. The functions can pointers to the
appropriate ice_txq_ctx_buf_t and ice_rxq_ctx_buf_t types, ensuring that
only buffers of the appropriate size are passed.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20241210-packing-pack-fields-and-ice-implementation-v10-7-ee56a47479ac@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agoice: use structures to keep track of queue context size
Jacob Keller [Tue, 10 Dec 2024 20:27:15 +0000 (12:27 -0800)]
ice: use structures to keep track of queue context size

The ice Tx and Rx queue context are currently stored as arrays of bytes
with defined size (ICE_RXQ_CTX_SZ and ICE_TXQ_CTX_SZ). The packed queue
context is often passed to other functions as a simple u8 * pointer, which
does not allow tracking the size. This makes the queue context API easy to
misuse, as you can pass an arbitrary u8 array or pointer.

Introduce wrapper typedefs which use a __packed structure that has the
proper fixed size for the Tx and Rx context buffers. This enables the
compiler to track the size of the value and ensures that passing the wrong
buffer size will be detected by the compiler.

The existing APIs do not benefit much from this change, however the
wrapping structures will be used to simplify the arguments of new packing
functions based on the recently introduced pack_fields API.

Co-developed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20241210-packing-pack-fields-and-ice-implementation-v10-6-ee56a47479ac@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agoice: remove int_q_state from ice_tlan_ctx
Jacob Keller [Tue, 10 Dec 2024 20:27:14 +0000 (12:27 -0800)]
ice: remove int_q_state from ice_tlan_ctx

The int_q_state field of the ice_tlan_ctx structure represents the internal
queue state. However, we never actually need to assign this or read this
during normal operation. In fact, trying to unpack it would not be possible
as it is larger than a u64. Remove this field from the ice_tlan_ctx
structure, and remove its packing field from the ice_tlan_ctx_info array.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20241210-packing-pack-fields-and-ice-implementation-v10-5-ee56a47479ac@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agolib: packing: document recently added APIs
Jacob Keller [Tue, 10 Dec 2024 20:27:13 +0000 (12:27 -0800)]
lib: packing: document recently added APIs

Extend the documentation for the packing library, covering the intended use
for the recently added APIs. This includes the pack() and unpack() macros,
as well as the pack_fields() and unpack_fields() macros.

Add a note that the packing() API is now deprecated in favor of pack() and
unpack().

For the pack_fields() and unpack_fields() APIs, explain the rationale for
when a driver may want to select this API. Provide an example which shows
how to define the fields and call the pack_fields() and unpack_fields()
macros.

Co-developed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20241210-packing-pack-fields-and-ice-implementation-v10-4-ee56a47479ac@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agolib: packing: add pack_fields() and unpack_fields()
Vladimir Oltean [Tue, 10 Dec 2024 20:27:12 +0000 (12:27 -0800)]
lib: packing: add pack_fields() and unpack_fields()

This is new API which caters to the following requirements:

- Pack or unpack a large number of fields to/from a buffer with a small
  code footprint. The current alternative is to open-code a large number
  of calls to pack() and unpack(), or to use packing() to reduce that
  number to half. But packing() is not const-correct.

- Use unpacked numbers stored in variables smaller than u64. This
  reduces the rodata footprint of the stored field arrays.

- Perform error checking at compile time, rather than runtime, and return
  void from the API functions. Because the C preprocessor can't generate
  variable length code (loops), this is a bit tricky to do with macros.

  To handle this, implement macros which sanity check the packed field
  definitions based on their size. Finally, a single macro with a chain of
  __builtin_choose_expr() is used to select the appropriate macros. We
  enforce the use of ascending or descending order to avoid O(N^2) scaling
  when checking for overlap. Note that the macros are written with care to
  ensure that the compilers can correctly evaluate the resulting code at
  compile time. In particular, care was taken with avoiding too many nested
  statement expressions. Nested statement expressions trip up some
  compilers, especially when passing down variables created in previous
  statement expressions.

  There are two key design choices intended to keep the overall macro code
  size small. First, the definition of each CHECK_PACKED_FIELDS_N macro is
  implemented recursively, by calling the N-1 macro. This avoids needing
  the code to repeat multiple times.

  Second, the CHECK_PACKED_FIELD macro enforces that the fields in the
  array are sorted in order. This allows checking for overlap only with
  neighboring fields, rather than the general overlap case where each field
  would need to be checked against other fields.

  The overlap checks use the first two fields to determine the order of the
  remaining fields, thus allowing either ascending or descending order.
  This enables drivers the flexibility to keep the fields ordered in which
  ever order most naturally fits their hardware design and its associated
  documentation.

  The CHECK_PACKED_FIELDS macro is directly called from within pack_fields
  and unpack_fields, ensuring that all drivers using the API receive the
  benefits of the compile-time checks. Users do not need to directly call
  any of the macros directly.

  The CHECK_PACKED_FIELDS and its helper macros CHECK_PACKED_FIELDS_(0..50)
  are generated using a simple C program in scripts/gen_packed_field_checks.c
  This program can be compiled on demand and executed to generate the
  macro code in include/linux/packing.h. This will aid in the event that a
  driver needs more than 50 fields. The generator can be updated with a new
  size, and used to update the packing.h header file. In practice, the ice
  driver will need to support 27 fields, and the sja1105 driver will need
  to support 0 fields. This on-demand generation avoids the need to modify
  Kbuild. We do not anticipate the maximum number of fields to grow very
  often.

- Reduced rodata footprint for the storage of the packed field arrays.
  To that end, we have struct packed_field_u8 and packed_field_u16, which
  define the fields with the associated type. More can be added as
  needed (unlikely for now). On these types, the same generic pack_fields()
  and unpack_fields() API can be used, thanks to the new C11 _Generic()
  selection feature, which can call pack_fields_u8() or pack_fields_16(),
  depending on the type of the "fields" array - a simplistic form of
  polymorphism. It is evaluated at compile time which function will actually
  be called.

Over time, packing() is expected to be completely replaced either with
pack() or with pack_fields().

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Co-developed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20241210-packing-pack-fields-and-ice-implementation-v10-3-ee56a47479ac@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agolib: packing: demote truncation error in pack() to a warning in __pack()
Vladimir Oltean [Tue, 10 Dec 2024 20:27:11 +0000 (12:27 -0800)]
lib: packing: demote truncation error in pack() to a warning in __pack()

Most of the sanity checks in pack() and unpack() can be covered at
compile time. There is only one exception, and that is truncation of the
uval during a pack() operation.

We'd like the error-less __pack() to catch that condition as well. But
at the same time, it is currently the responsibility of consumer drivers
(currently just sja1105) to print anything at all when this error
occurs, and then discard the return code.

We can just print a loud warning in the library code and continue with
the truncated __pack() operation. In practice, having the warning is
very important, see commit 24deec6b9e4a ("net: dsa: sja1105: disallow
C45 transactions on the BASE-TX MDIO bus") where the bug was caught
exactly by noticing this print.

Add the first print to the packing library, and at the same time remove
the print for the same condition from the sja1105 driver, to avoid
double printing.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20241210-packing-pack-fields-and-ice-implementation-v10-2-ee56a47479ac@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agolib: packing: create __pack() and __unpack() variants without error checking
Vladimir Oltean [Tue, 10 Dec 2024 20:27:10 +0000 (12:27 -0800)]
lib: packing: create __pack() and __unpack() variants without error checking

A future variant of the API, which works on arrays of packed_field
structures, will make most of these checks redundant. The idea will be
that we want to perform sanity checks at compile time, not once
for every function call.

Introduce new variants of pack() and unpack(), which elide the sanity
checks, assuming that the input was pre-sanitized.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20241210-packing-pack-fields-and-ice-implementation-v10-1-ee56a47479ac@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 months agoisdn: Remove unused get_Bprotocol4id()
Dr. David Alan Gilbert [Wed, 11 Dec 2024 00:58:02 +0000 (00:58 +0000)]
isdn: Remove unused get_Bprotocol4id()

get_Bprotocol4id() was added in 2008 in
commit 1b2b03f8e514 ("Add mISDN core files")
but hasn't been used.

Remove it.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Link: https://patch.msgid.link/20241211005802.258279-1-linux@treblig.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>