From bbe4b41259a3e255a16d795486d331c1670b4e75 Mon Sep 17 00:00:00 2001 From: Petr Machata Date: Mon, 9 Dec 2024 12:05:31 +0100 Subject: [PATCH 01/16] Documentation: networking: Add a caveat to nexthop_compat_mode sysctl net.ipv4.nexthop_compat_mode was added when nexthop objects were added to provide the view of nexthop objects through the usual lens of the route UAPI. As nexthop objects evolved, the information provided through this lens became incomplete. For example, details of resilient nexthop groups are obviously omitted. Now that 16-bit nexthop group weights are a thing, the 8-bit UAPI cannot convey the >8-bit weight accurately. Instead of inventing workarounds for an obsolete interface, just document the expectations of inaccuracy. Fixes: b72a6a7ab957 ("net: nexthop: Increase weight to u16") Signed-off-by: Petr Machata Reviewed-by: Ido Schimmel Reviewed-by: David Ahern Link: https://patch.msgid.link/b575e32399ccacd09079b2a218255164535123bd.1733740749.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski --- Documentation/networking/ip-sysctl.rst | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst index eacf8983e230..dcbb6f6caf6d 100644 --- a/Documentation/networking/ip-sysctl.rst +++ b/Documentation/networking/ip-sysctl.rst @@ -2170,6 +2170,12 @@ nexthop_compat_mode - BOOLEAN understands the new API, this sysctl can be disabled to achieve full performance benefits of the new API by disabling the nexthop expansion and extraneous notifications. + + Note that as a backward-compatible mode, dumping of modern features + might be incomplete or wrong. For example, resilient groups will not be + shown as such, but rather as just a list of next hops. Also weights that + do not fit into 8 bits will show incorrectly. + Default: true (backward compat mode) fib_notify_on_flag_change - INTEGER -- 2.51.0 From 06d64ab46f19ac12f59a1d2aa8cd196b2e4edb5b Mon Sep 17 00:00:00 2001 From: MoYuanhao Date: Mon, 9 Dec 2024 13:28:14 +0100 Subject: [PATCH 02/16] tcp: check space before adding MPTCP SYN options Ensure there is enough space before adding MPTCP options in tcp_syn_options(). Without this check, 'remaining' could underflow, and causes issues. If there is not enough space, MPTCP should not be used. Signed-off-by: MoYuanhao Fixes: cec37a6e41aa ("mptcp: Handle MP_CAPABLE options for outgoing connections") Cc: stable@vger.kernel.org Acked-by: Matthieu Baerts (NGI0) [ Matt: Add Fixes, cc Stable, update Description ] Signed-off-by: Matthieu Baerts (NGI0) Reviewed-by: Eric Dumazet Link: https://patch.msgid.link/20241209-net-mptcp-check-space-syn-v1-1-2da992bb6f74@kernel.org Signed-off-by: Jakub Kicinski --- net/ipv4/tcp_output.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 5485a70b5fe5..0e5b9a654254 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -883,8 +883,10 @@ static unsigned int tcp_syn_options(struct sock *sk, struct sk_buff *skb, unsigned int size; if (mptcp_syn_options(sk, skb, &size, &opts->mptcp)) { - opts->options |= OPTION_MPTCP; - remaining -= size; + if (remaining >= size) { + opts->options |= OPTION_MPTCP; + remaining -= size; + } } } -- 2.51.0 From 5cb099902b6b6292b3a85ffa1bb844e0ba195945 Mon Sep 17 00:00:00 2001 From: Nikita Yushchenko Date: Sun, 8 Dec 2024 14:50:01 +0500 Subject: [PATCH 03/16] net: renesas: rswitch: fix possible early skb release When sending frame split into multiple descriptors, hardware processes descriptors one by one, including writing back DT values. The first descriptor could be already marked as completed when processing of next descriptors for the same frame is still in progress. Although only the last descriptor is configured to generate interrupt, completion of the first descriptor could be noticed by the driver when handling interrupt for the previous frame. Currently, driver stores skb in the entry that corresponds to the first descriptor. This results into skb could be unmapped and freed when hardware did not complete the send yet. This opens a window for corrupting the data being sent. Fix this by saving skb in the entry that corresponds to the last descriptor used to send the frame. Fixes: d2c96b9d5f83 ("net: rswitch: Add jumbo frames handling for TX") Signed-off-by: Nikita Yushchenko Reviewed-by: Yoshihiro Shimoda Link: https://patch.msgid.link/20241208095004.69468-2-nikita.yoush@cogentembedded.com Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/renesas/rswitch.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/renesas/rswitch.c b/drivers/net/ethernet/renesas/rswitch.c index 8d18dae4d8fb..f4fc2e7212c3 100644 --- a/drivers/net/ethernet/renesas/rswitch.c +++ b/drivers/net/ethernet/renesas/rswitch.c @@ -1681,8 +1681,9 @@ static netdev_tx_t rswitch_start_xmit(struct sk_buff *skb, struct net_device *nd if (dma_mapping_error(ndev->dev.parent, dma_addr_orig)) goto err_kfree; - gq->skbs[gq->cur] = skb; - gq->unmap_addrs[gq->cur] = dma_addr_orig; + /* Stored the skb at the last descriptor to avoid skb free before hardware completes send */ + gq->skbs[(gq->cur + nr_desc - 1) % gq->ring_size] = skb; + gq->unmap_addrs[(gq->cur + nr_desc - 1) % gq->ring_size] = dma_addr_orig; /* DT_FSTART should be set at last. So, this is reverse order. */ for (i = nr_desc; i-- > 0; ) { -- 2.51.0 From 0c9547e6ccf40455b0574cf589be3b152a3edf5b Mon Sep 17 00:00:00 2001 From: Nikita Yushchenko Date: Sun, 8 Dec 2024 14:50:02 +0500 Subject: [PATCH 04/16] net: renesas: rswitch: fix race window between tx start and complete If hardware is already transmitting, it can start handling the descriptor being written to immediately after it observes updated DT field, before the queue is kicked by a write to GWTRC. If the start_xmit() execution is preempted at unfortunate moment, this transmission can complete, and interrupt handled, before gq->cur gets updated. With the current implementation of completion, this will cause the last entry not completed. Fix that by changing completion loop to check DT values directly, instead of depending on gq->cur. Fixes: 3590918b5d07 ("net: ethernet: renesas: Add support for "Ethernet Switch"") Signed-off-by: Nikita Yushchenko Reviewed-by: Yoshihiro Shimoda Link: https://patch.msgid.link/20241208095004.69468-3-nikita.yoush@cogentembedded.com Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/renesas/rswitch.c | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/drivers/net/ethernet/renesas/rswitch.c b/drivers/net/ethernet/renesas/rswitch.c index f4fc2e7212c3..cdf4d6ccccb0 100644 --- a/drivers/net/ethernet/renesas/rswitch.c +++ b/drivers/net/ethernet/renesas/rswitch.c @@ -862,13 +862,10 @@ static void rswitch_tx_free(struct net_device *ndev) struct rswitch_ext_desc *desc; struct sk_buff *skb; - for (; rswitch_get_num_cur_queues(gq) > 0; - gq->dirty = rswitch_next_queue_index(gq, false, 1)) { - desc = &gq->tx_ring[gq->dirty]; - if ((desc->desc.die_dt & DT_MASK) != DT_FEMPTY) - break; - + desc = &gq->tx_ring[gq->dirty]; + while ((desc->desc.die_dt & DT_MASK) == DT_FEMPTY) { dma_rmb(); + skb = gq->skbs[gq->dirty]; if (skb) { rdev->ndev->stats.tx_packets++; @@ -879,7 +876,10 @@ static void rswitch_tx_free(struct net_device *ndev) dev_kfree_skb_any(gq->skbs[gq->dirty]); gq->skbs[gq->dirty] = NULL; } + desc->desc.die_dt = DT_EEMPTY; + gq->dirty = rswitch_next_queue_index(gq, false, 1); + desc = &gq->tx_ring[gq->dirty]; } } @@ -1685,6 +1685,8 @@ static netdev_tx_t rswitch_start_xmit(struct sk_buff *skb, struct net_device *nd gq->skbs[(gq->cur + nr_desc - 1) % gq->ring_size] = skb; gq->unmap_addrs[(gq->cur + nr_desc - 1) % gq->ring_size] = dma_addr_orig; + dma_wmb(); + /* DT_FSTART should be set at last. So, this is reverse order. */ for (i = nr_desc; i-- > 0; ) { desc = &gq->tx_ring[rswitch_next_queue_index(gq, true, i)]; @@ -1695,8 +1697,6 @@ static netdev_tx_t rswitch_start_xmit(struct sk_buff *skb, struct net_device *nd goto err_unmap; } - wmb(); /* gq->cur must be incremented after die_dt was set */ - gq->cur = rswitch_next_queue_index(gq, true, nr_desc); rswitch_modify(rdev->addr, GWTRC(gq->index), 0, BIT(gq->index % 32)); -- 2.51.0 From bb617328bafa1023d8e9c25a25345a564c66c14f Mon Sep 17 00:00:00 2001 From: Nikita Yushchenko Date: Sun, 8 Dec 2024 14:50:03 +0500 Subject: [PATCH 05/16] net: renesas: rswitch: fix leaked pointer on error path If error path is taken while filling descriptor for a frame, skb pointer is left in the entry. Later, on the ring entry reuse, the same entry could be used as a part of a multi-descriptor frame, and skb for that new frame could be stored in a different entry. Then, the stale pointer will reach the completion routine, and passed to the release operation. Fix that by clearing the saved skb pointer at the error path. Fixes: d2c96b9d5f83 ("net: rswitch: Add jumbo frames handling for TX") Signed-off-by: Nikita Yushchenko Reviewed-by: Yoshihiro Shimoda Link: https://patch.msgid.link/20241208095004.69468-4-nikita.yoush@cogentembedded.com Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/renesas/rswitch.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/net/ethernet/renesas/rswitch.c b/drivers/net/ethernet/renesas/rswitch.c index cdf4d6ccccb0..6d6912b9c16e 100644 --- a/drivers/net/ethernet/renesas/rswitch.c +++ b/drivers/net/ethernet/renesas/rswitch.c @@ -1703,6 +1703,7 @@ static netdev_tx_t rswitch_start_xmit(struct sk_buff *skb, struct net_device *nd return ret; err_unmap: + gq->skbs[(gq->cur + nr_desc - 1) % gq->ring_size] = NULL; dma_unmap_single(ndev->dev.parent, dma_addr_orig, skb->len, DMA_TO_DEVICE); err_kfree: -- 2.51.0 From 66b7e9f85b8459c823b11e9af69dbf4be5eb6be8 Mon Sep 17 00:00:00 2001 From: Nikita Yushchenko Date: Sun, 8 Dec 2024 14:50:04 +0500 Subject: [PATCH 06/16] net: renesas: rswitch: avoid use-after-put for a device tree node The device tree node saved in the rswitch_device structure is used at several driver locations. So passing this node to of_node_put() after the first use is wrong. Move of_node_put() for this node to exit paths. Fixes: b46f1e579329 ("net: renesas: rswitch: Simplify struct phy * handling") Signed-off-by: Nikita Yushchenko Reviewed-by: Yoshihiro Shimoda Link: https://patch.msgid.link/20241208095004.69468-5-nikita.yoush@cogentembedded.com Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/renesas/rswitch.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/renesas/rswitch.c b/drivers/net/ethernet/renesas/rswitch.c index 6d6912b9c16e..ebcdf2917b83 100644 --- a/drivers/net/ethernet/renesas/rswitch.c +++ b/drivers/net/ethernet/renesas/rswitch.c @@ -1891,7 +1891,6 @@ static int rswitch_device_alloc(struct rswitch_private *priv, unsigned int index rdev->np_port = rswitch_get_port_node(rdev); rdev->disabled = !rdev->np_port; err = of_get_ethdev_address(rdev->np_port, ndev); - of_node_put(rdev->np_port); if (err) { if (is_valid_ether_addr(rdev->etha->mac_addr)) eth_hw_addr_set(ndev, rdev->etha->mac_addr); @@ -1921,6 +1920,7 @@ out_txdmac: out_rxdmac: out_get_params: + of_node_put(rdev->np_port); netif_napi_del(&rdev->napi); free_netdev(ndev); @@ -1934,6 +1934,7 @@ static void rswitch_device_free(struct rswitch_private *priv, unsigned int index rswitch_txdmac_free(ndev); rswitch_rxdmac_free(ndev); + of_node_put(rdev->np_port); netif_napi_del(&rdev->napi); free_netdev(ndev); } -- 2.51.0 From 3dd002f20098b9569f8fd7f8703f364571e2e975 Mon Sep 17 00:00:00 2001 From: Nikita Yushchenko Date: Mon, 9 Dec 2024 16:32:04 +0500 Subject: [PATCH 07/16] net: renesas: rswitch: handle stop vs interrupt race Currently the stop routine of rswitch driver does not immediately prevent hardware from continuing to update descriptors and requesting interrupts. It can happen that when rswitch_stop() executes the masking of interrupts from the queues of the port being closed, napi poll for that port is already scheduled or running on a different CPU. When execution of this napi poll completes, it will unmask the interrupts. And unmasked interrupt can fire after rswitch_stop() returns from napi_disable() call. Then, the handler won't mask it, because napi_schedule_prep() will return false, and interrupt storm will happen. This can't be fixed by making rswitch_stop() call napi_disable() before masking interrupts. In this case, the interrupt storm will happen if interrupt fires between napi_disable() and masking. Fix this by checking for priv->opened_ports bit when unmasking interrupts after napi poll. For that to be consistent, move priv->opened_ports changes into spinlock-protected areas, and reorder other operations in rswitch_open() and rswitch_stop() accordingly. Signed-off-by: Nikita Yushchenko Reviewed-by: Yoshihiro Shimoda Fixes: 3590918b5d07 ("net: ethernet: renesas: Add support for "Ethernet Switch"") Link: https://patch.msgid.link/20241209113204.175015-1-nikita.yoush@cogentembedded.com Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/renesas/rswitch.c | 33 ++++++++++++++------------ 1 file changed, 18 insertions(+), 15 deletions(-) diff --git a/drivers/net/ethernet/renesas/rswitch.c b/drivers/net/ethernet/renesas/rswitch.c index ebcdf2917b83..6ec714789573 100644 --- a/drivers/net/ethernet/renesas/rswitch.c +++ b/drivers/net/ethernet/renesas/rswitch.c @@ -908,8 +908,10 @@ retry: if (napi_complete_done(napi, budget - quota)) { spin_lock_irqsave(&priv->lock, flags); - rswitch_enadis_data_irq(priv, rdev->tx_queue->index, true); - rswitch_enadis_data_irq(priv, rdev->rx_queue->index, true); + if (test_bit(rdev->port, priv->opened_ports)) { + rswitch_enadis_data_irq(priv, rdev->tx_queue->index, true); + rswitch_enadis_data_irq(priv, rdev->rx_queue->index, true); + } spin_unlock_irqrestore(&priv->lock, flags); } @@ -1538,20 +1540,20 @@ static int rswitch_open(struct net_device *ndev) struct rswitch_device *rdev = netdev_priv(ndev); unsigned long flags; - phy_start(ndev->phydev); + if (bitmap_empty(rdev->priv->opened_ports, RSWITCH_NUM_PORTS)) + iowrite32(GWCA_TS_IRQ_BIT, rdev->priv->addr + GWTSDIE); napi_enable(&rdev->napi); - netif_start_queue(ndev); spin_lock_irqsave(&rdev->priv->lock, flags); + bitmap_set(rdev->priv->opened_ports, rdev->port, 1); rswitch_enadis_data_irq(rdev->priv, rdev->tx_queue->index, true); rswitch_enadis_data_irq(rdev->priv, rdev->rx_queue->index, true); spin_unlock_irqrestore(&rdev->priv->lock, flags); - if (bitmap_empty(rdev->priv->opened_ports, RSWITCH_NUM_PORTS)) - iowrite32(GWCA_TS_IRQ_BIT, rdev->priv->addr + GWTSDIE); + phy_start(ndev->phydev); - bitmap_set(rdev->priv->opened_ports, rdev->port, 1); + netif_start_queue(ndev); return 0; }; @@ -1563,7 +1565,16 @@ static int rswitch_stop(struct net_device *ndev) unsigned long flags; netif_tx_stop_all_queues(ndev); + + phy_stop(ndev->phydev); + + spin_lock_irqsave(&rdev->priv->lock, flags); + rswitch_enadis_data_irq(rdev->priv, rdev->tx_queue->index, false); + rswitch_enadis_data_irq(rdev->priv, rdev->rx_queue->index, false); bitmap_clear(rdev->priv->opened_ports, rdev->port, 1); + spin_unlock_irqrestore(&rdev->priv->lock, flags); + + napi_disable(&rdev->napi); if (bitmap_empty(rdev->priv->opened_ports, RSWITCH_NUM_PORTS)) iowrite32(GWCA_TS_IRQ_BIT, rdev->priv->addr + GWTSDID); @@ -1576,14 +1587,6 @@ static int rswitch_stop(struct net_device *ndev) kfree(ts_info); } - spin_lock_irqsave(&rdev->priv->lock, flags); - rswitch_enadis_data_irq(rdev->priv, rdev->tx_queue->index, false); - rswitch_enadis_data_irq(rdev->priv, rdev->rx_queue->index, false); - spin_unlock_irqrestore(&rdev->priv->lock, flags); - - phy_stop(ndev->phydev); - napi_disable(&rdev->napi); - return 0; }; -- 2.51.0 From 15bfb14727bcf5f72b099338fcdb5aca631f9197 Mon Sep 17 00:00:00 2001 From: Simon Horman Date: Tue, 10 Dec 2024 13:47:44 +0000 Subject: [PATCH 08/16] MAINTAINERS: Add ethtool.h to NETWORKING [GENERAL] This is part of an effort to assign a section in MAINTAINERS to header files related to Networking. In this case the files named ethool.h. Signed-off-by: Simon Horman Link: https://patch.msgid.link/20241210-mnt-ethtool-h-v1-1-2a40b567939d@kernel.org Signed-off-by: Jakub Kicinski --- MAINTAINERS | 2 ++ 1 file changed, 2 insertions(+) diff --git a/MAINTAINERS b/MAINTAINERS index f84ec3572a5d..16c76ecca3f9 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -16335,6 +16335,7 @@ F: Documentation/networking/ F: Documentation/networking/net_cachelines/ F: Documentation/process/maintainer-netdev.rst F: Documentation/userspace-api/netlink/ +F: include/linux/ethtool.h F: include/linux/framer/framer-provider.h F: include/linux/framer/framer.h F: include/linux/in.h @@ -16349,6 +16350,7 @@ F: include/linux/rtnetlink.h F: include/linux/seq_file_net.h F: include/linux/skbuff* F: include/net/ +F: include/uapi/linux/ethtool.h F: include/uapi/linux/genetlink.h F: include/uapi/linux/hsr_netlink.h F: include/uapi/linux/in.h -- 2.51.0 From bb1e3eb57d2cc38951f9a9f1b8c298ced175798f Mon Sep 17 00:00:00 2001 From: Maxim Levitsky Date: Mon, 9 Dec 2024 12:57:50 -0500 Subject: [PATCH 09/16] net: mana: Fix memory leak in mana_gd_setup_irqs Commit 8afefc361209 ("net: mana: Assigning IRQ affinity on HT cores") added memory allocation in mana_gd_setup_irqs of 'irqs' but the code doesn't free this temporary array in the success path. This was caught by kmemleak. Fixes: 8afefc361209 ("net: mana: Assigning IRQ affinity on HT cores") Signed-off-by: Maxim Levitsky Reviewed-by: Michal Swiatkowski Reviewed-by: Kalesh AP Reviewed-by: Saurabh Sengar Reviewed-by: Yury Norov Link: https://patch.msgid.link/20241209175751.287738-2-mlevitsk@redhat.com Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/microsoft/mana/gdma_main.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c index 550be6dc305d..5babd217c7ed 100644 --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c @@ -1375,6 +1375,7 @@ static int mana_gd_setup_irqs(struct pci_dev *pdev) gc->max_num_msix = nvec; gc->num_msix_usable = nvec; cpus_read_unlock(); + kfree(irqs); return 0; free_irq: -- 2.51.0 From 9a5beb6ca6305de5c5210efab0702ea79b62eb39 Mon Sep 17 00:00:00 2001 From: Maxim Levitsky Date: Mon, 9 Dec 2024 12:57:51 -0500 Subject: [PATCH 10/16] net: mana: Fix irq_contexts memory leak in mana_gd_setup_irqs gc->irq_contexts is not freeded if one of the later operations fail. Suggested-by: Michael Kelley Fixes: 8afefc361209 ("net: mana: Assigning IRQ affinity on HT cores") Signed-off-by: Maxim Levitsky Reviewed-by: Michal Swiatkowski Reviewed-by: Kalesh AP Reviewed-by: Saurabh Sengar Reviewed-by: Yury Norov Link: https://patch.msgid.link/20241209175751.287738-3-mlevitsk@redhat.com Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/microsoft/mana/gdma_main.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c index 5babd217c7ed..2dc0c6ad54be 100644 --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c @@ -1318,7 +1318,7 @@ static int mana_gd_setup_irqs(struct pci_dev *pdev) GFP_KERNEL); if (!gc->irq_contexts) { err = -ENOMEM; - goto free_irq_vector; + goto free_irq_array; } for (i = 0; i < nvec; i++) { @@ -1388,8 +1388,9 @@ free_irq: } kfree(gc->irq_contexts); - kfree(irqs); gc->irq_contexts = NULL; +free_irq_array: + kfree(irqs); free_irq_vector: cpus_read_unlock(); pci_free_irq_vectors(pdev); -- 2.51.0 From 3b58b53a26598209a7ad8259a5114ce71f7c3d64 Mon Sep 17 00:00:00 2001 From: Daniele Palmas Date: Mon, 9 Dec 2024 16:18:21 +0100 Subject: [PATCH 11/16] net: usb: qmi_wwan: add Telit FE910C04 compositions Add the following Telit FE910C04 compositions: 0x10c0: rmnet + tty (AT/NMEA) + tty (AT) + tty (diag) T: Bus=02 Lev=01 Prnt=03 Port=06 Cnt=01 Dev#= 13 Spd=480 MxCh= 0 D: Ver= 2.00 Cls=00(>ifc ) Sub=00 Prot=00 MxPS=64 #Cfgs= 1 P: Vendor=1bc7 ProdID=10c0 Rev=05.15 S: Manufacturer=Telit Cinterion S: Product=FE910 S: SerialNumber=f71b8b32 C: #Ifs= 4 Cfg#= 1 Atr=e0 MxPwr=500mA I: If#= 0 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=ff Prot=50 Driver=qmi_wwan E: Ad=01(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms E: Ad=81(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms E: Ad=82(I) Atr=03(Int.) MxPS= 8 Ivl=32ms I: If#= 1 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=ff Prot=60 Driver=option E: Ad=02(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms E: Ad=83(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms E: Ad=84(I) Atr=03(Int.) MxPS= 10 Ivl=32ms I: If#= 2 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=ff Prot=40 Driver=option E: Ad=03(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms E: Ad=85(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms E: Ad=86(I) Atr=03(Int.) MxPS= 10 Ivl=32ms I: If#= 3 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=ff Prot=30 Driver=option E: Ad=04(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms E: Ad=87(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms 0x10c4: rmnet + tty (AT) + tty (AT) + tty (diag) T: Bus=02 Lev=01 Prnt=03 Port=06 Cnt=01 Dev#= 14 Spd=480 MxCh= 0 D: Ver= 2.00 Cls=00(>ifc ) Sub=00 Prot=00 MxPS=64 #Cfgs= 1 P: Vendor=1bc7 ProdID=10c4 Rev=05.15 S: Manufacturer=Telit Cinterion S: Product=FE910 S: SerialNumber=f71b8b32 C: #Ifs= 4 Cfg#= 1 Atr=e0 MxPwr=500mA I: If#= 0 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=ff Prot=50 Driver=qmi_wwan E: Ad=01(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms E: Ad=81(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms E: Ad=82(I) Atr=03(Int.) MxPS= 8 Ivl=32ms I: If#= 1 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=ff Prot=40 Driver=option E: Ad=02(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms E: Ad=83(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms E: Ad=84(I) Atr=03(Int.) MxPS= 10 Ivl=32ms I: If#= 2 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=ff Prot=40 Driver=option E: Ad=03(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms E: Ad=85(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms E: Ad=86(I) Atr=03(Int.) MxPS= 10 Ivl=32ms I: If#= 3 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=ff Prot=30 Driver=option E: Ad=04(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms E: Ad=87(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms 0x10c8: rmnet + tty (AT) + tty (diag) + DPL (data packet logging) + adb T: Bus=02 Lev=01 Prnt=03 Port=06 Cnt=01 Dev#= 17 Spd=480 MxCh= 0 D: Ver= 2.00 Cls=00(>ifc ) Sub=00 Prot=00 MxPS=64 #Cfgs= 1 P: Vendor=1bc7 ProdID=10c8 Rev=05.15 S: Manufacturer=Telit Cinterion S: Product=FE910 S: SerialNumber=f71b8b32 C: #Ifs= 5 Cfg#= 1 Atr=e0 MxPwr=500mA I: If#= 0 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=ff Prot=50 Driver=qmi_wwan E: Ad=01(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms E: Ad=81(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms E: Ad=82(I) Atr=03(Int.) MxPS= 8 Ivl=32ms I: If#= 1 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=ff Prot=40 Driver=option E: Ad=02(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms E: Ad=83(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms E: Ad=84(I) Atr=03(Int.) MxPS= 10 Ivl=32ms I: If#= 2 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=ff Prot=30 Driver=option E: Ad=03(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms E: Ad=85(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms I: If#= 3 Alt= 0 #EPs= 1 Cls=ff(vend.) Sub=ff Prot=80 Driver=(none) E: Ad=86(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms I: If#= 4 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=42 Prot=01 Driver=(none) E: Ad=04(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms E: Ad=87(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms Signed-off-by: Daniele Palmas Link: https://patch.msgid.link/20241209151821.3688829-1-dnlplm@gmail.com Signed-off-by: Jakub Kicinski --- drivers/net/usb/qmi_wwan.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/net/usb/qmi_wwan.c b/drivers/net/usb/qmi_wwan.c index 0c011d8f5d4d..9fe7f704a2f7 100644 --- a/drivers/net/usb/qmi_wwan.c +++ b/drivers/net/usb/qmi_wwan.c @@ -1365,6 +1365,9 @@ static const struct usb_device_id products[] = { {QMI_QUIRK_SET_DTR(0x1bc7, 0x10a0, 0)}, /* Telit FN920C04 */ {QMI_QUIRK_SET_DTR(0x1bc7, 0x10a4, 0)}, /* Telit FN920C04 */ {QMI_QUIRK_SET_DTR(0x1bc7, 0x10a9, 0)}, /* Telit FN920C04 */ + {QMI_QUIRK_SET_DTR(0x1bc7, 0x10c0, 0)}, /* Telit FE910C04 */ + {QMI_QUIRK_SET_DTR(0x1bc7, 0x10c4, 0)}, /* Telit FE910C04 */ + {QMI_QUIRK_SET_DTR(0x1bc7, 0x10c8, 0)}, /* Telit FE910C04 */ {QMI_FIXED_INTF(0x1bc7, 0x1100, 3)}, /* Telit ME910 */ {QMI_FIXED_INTF(0x1bc7, 0x1101, 3)}, /* Telit ME910 dual modem */ {QMI_FIXED_INTF(0x1bc7, 0x1200, 5)}, /* Telit LE920 */ -- 2.51.0 From 6bd8614fc2d076fc21b7488c9f279853960964e2 Mon Sep 17 00:00:00 2001 From: Frederik Deweerdt Date: Mon, 9 Dec 2024 21:06:48 -0800 Subject: [PATCH 12/16] splice: do not checksum AF_UNIX sockets When `skb_splice_from_iter` was introduced, it inadvertently added checksumming for AF_UNIX sockets. This resulted in significant slowdowns, for example when using sendfile over unix sockets. Using the test code in [1] in my test setup (2G single core qemu), the client receives a 1000M file in: - without the patch: 1482ms (+/- 36ms) - with the patch: 652.5ms (+/- 22.9ms) This commit addresses the issue by marking checksumming as unnecessary in `unix_stream_sendmsg` Cc: stable@vger.kernel.org Signed-off-by: Frederik Deweerdt Fixes: 2e910b95329c ("net: Add a function to splice pages into an skbuff for MSG_SPLICE_PAGES") Reviewed-by: Kuniyuki Iwashima Reviewed-by: Eric Dumazet Reviewed-by: Joe Damato Link: https://patch.msgid.link/Z1fMaHkRf8cfubuE@xiberoa Signed-off-by: Jakub Kicinski --- net/unix/af_unix.c | 1 + 1 file changed, 1 insertion(+) diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c index 001ccc55ef0f..6b1762300443 100644 --- a/net/unix/af_unix.c +++ b/net/unix/af_unix.c @@ -2313,6 +2313,7 @@ static int unix_stream_sendmsg(struct socket *sock, struct msghdr *msg, fds_sent = true; if (unlikely(msg->msg_flags & MSG_SPLICE_PAGES)) { + skb->ip_summed = CHECKSUM_UNNECESSARY; err = skb_splice_from_iter(skb, &msg->msg_iter, size, sk->sk_allocation); if (err < 0) { -- 2.51.0 From acfcdb78d5d4cdb78e975210c8825b9a112463f6 Mon Sep 17 00:00:00 2001 From: Vladimir Oltean Date: Tue, 10 Dec 2024 15:26:40 +0200 Subject: [PATCH 13/16] net: dsa: felix: fix stuck CPU-injected packets with short taprio windows With this port schedule: tc qdisc replace dev $send_if parent root handle 100 taprio \ num_tc 8 queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \ map 0 1 2 3 4 5 6 7 \ base-time 0 cycle-time 10000 \ sched-entry S 01 1250 \ sched-entry S 02 1250 \ sched-entry S 04 1250 \ sched-entry S 08 1250 \ sched-entry S 10 1250 \ sched-entry S 20 1250 \ sched-entry S 40 1250 \ sched-entry S 80 1250 \ flags 2 ptp4l would fail to take TX timestamps of Pdelay_Resp messages like this: increasing tx_timestamp_timeout may correct this issue, but it is likely caused by a driver bug ptp4l[4134.168]: port 2: send peer delay response failed It turns out that the driver can't take their TX timestamps because it can't transmit them in the first place. And there's nothing special about the Pdelay_Resp packets - they're just regular 68 byte packets. But with this taprio configuration, the switch would refuse to send even the ETH_ZLEN minimum packet size. This should have definitely not been the case. When applying the taprio config, the driver prints: mscc_felix 0000:00:00.5: port 0 tc 0 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 132 octets including FCS mscc_felix 0000:00:00.5: port 0 tc 1 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 132 octets including FCS mscc_felix 0000:00:00.5: port 0 tc 2 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 132 octets including FCS mscc_felix 0000:00:00.5: port 0 tc 3 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 132 octets including FCS mscc_felix 0000:00:00.5: port 0 tc 4 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 132 octets including FCS mscc_felix 0000:00:00.5: port 0 tc 5 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 132 octets including FCS mscc_felix 0000:00:00.5: port 0 tc 6 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 132 octets including FCS mscc_felix 0000:00:00.5: port 0 tc 7 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 132 octets including FCS and thus, everything under 132 bytes - ETH_FCS_LEN should have been sent without problems. Yet it's not. For the forwarding path, the configuration is fine, yet packets injected from Linux get stuck with this schedule no matter what. The first hint that the static guard bands are the cause of the problem is that reverting Michael Walle's commit 297c4de6f780 ("net: dsa: felix: re-enable TAS guard band mode") made things work. It must be that the guard bands are calculated incorrectly. I remembered that there is a magic constant in the driver, set to 33 ns for no logical reason other than experimentation, which says "never let the static guard bands get so large as to leave less than this amount of remaining space in the time slot, because the queue system will refuse to schedule packets otherwise, and they will get stuck". I had a hunch that my previous experimentally-determined value was only good for packets coming from the forwarding path, and that the CPU injection path needed more. I came to the new value of 35 ns through binary search, after seeing that with 544 ns (the bit time required to send the Pdelay_Resp packet at gigabit) it works. Again, this is purely experimental, there's no logic and the manual doesn't say anything. The new driver prints for this schedule look like this: mscc_felix 0000:00:00.5: port 0 tc 0 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 131 octets including FCS mscc_felix 0000:00:00.5: port 0 tc 1 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 131 octets including FCS mscc_felix 0000:00:00.5: port 0 tc 2 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 131 octets including FCS mscc_felix 0000:00:00.5: port 0 tc 3 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 131 octets including FCS mscc_felix 0000:00:00.5: port 0 tc 4 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 131 octets including FCS mscc_felix 0000:00:00.5: port 0 tc 5 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 131 octets including FCS mscc_felix 0000:00:00.5: port 0 tc 6 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 131 octets including FCS mscc_felix 0000:00:00.5: port 0 tc 7 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 131 octets including FCS So yes, the maximum MTU is now even smaller by 1 byte than before. This is maybe counter-intuitive, but makes more sense with a diagram of one time slot. Before: Gate open Gate close | | v 1250 ns total time slot duration v <----------------------------------------------------> <----><----------------------------------------------> 33 ns 1217 ns static guard band useful Gate open Gate close | | v 1250 ns total time slot duration v <----------------------------------------------------> <-----><---------------------------------------------> 35 ns 1215 ns static guard band useful The static guard band implemented by this switch hardware directly determines the maximum allowable MTU for that traffic class. The larger it is, the earlier the switch will stop scheduling frames for transmission, because otherwise they might overrun the gate close time (and avoiding that is the entire purpose of Michael's patch). So, we now have guard bands smaller by 2 ns, thus, in this particular case, we lose a byte of the maximum MTU. Fixes: 11afdc6526de ("net: dsa: felix: tc-taprio intervals smaller than MTU should send at least one packet") Signed-off-by: Vladimir Oltean Reviewed-by: Michael Walle Link: https://patch.msgid.link/20241210132640.3426788-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski --- drivers/net/dsa/ocelot/felix_vsc9959.c | 17 +++++++++++------ 1 file changed, 11 insertions(+), 6 deletions(-) diff --git a/drivers/net/dsa/ocelot/felix_vsc9959.c b/drivers/net/dsa/ocelot/felix_vsc9959.c index 0102a82e88cc..940f1b71226d 100644 --- a/drivers/net/dsa/ocelot/felix_vsc9959.c +++ b/drivers/net/dsa/ocelot/felix_vsc9959.c @@ -24,7 +24,7 @@ #define VSC9959_NUM_PORTS 6 #define VSC9959_TAS_GCL_ENTRY_MAX 63 -#define VSC9959_TAS_MIN_GATE_LEN_NS 33 +#define VSC9959_TAS_MIN_GATE_LEN_NS 35 #define VSC9959_VCAP_POLICER_BASE 63 #define VSC9959_VCAP_POLICER_MAX 383 #define VSC9959_SWITCH_PCI_BAR 4 @@ -1056,11 +1056,15 @@ static void vsc9959_mdio_bus_free(struct ocelot *ocelot) mdiobus_free(felix->imdio); } -/* The switch considers any frame (regardless of size) as eligible for - * transmission if the traffic class gate is open for at least 33 ns. +/* The switch considers any frame (regardless of size) as eligible + * for transmission if the traffic class gate is open for at least + * VSC9959_TAS_MIN_GATE_LEN_NS. + * * Overruns are prevented by cropping an interval at the end of the gate time - * slot for which egress scheduling is blocked, but we need to still keep 33 ns - * available for one packet to be transmitted, otherwise the port tc will hang. + * slot for which egress scheduling is blocked, but we need to still keep + * VSC9959_TAS_MIN_GATE_LEN_NS available for one packet to be transmitted, + * otherwise the port tc will hang. + * * This function returns the size of a gate interval that remains available for * setting the guard band, after reserving the space for one egress frame. */ @@ -1303,7 +1307,8 @@ static void vsc9959_tas_guard_bands_update(struct ocelot *ocelot, int port) * per-tc static guard band lengths, so it reduces the * useful gate interval length. Therefore, be careful * to calculate a guard band (and therefore max_sdu) - * that still leaves 33 ns available in the time slot. + * that still leaves VSC9959_TAS_MIN_GATE_LEN_NS + * available in the time slot. */ max_sdu = div_u64(remaining_gate_len_ps, picos_per_byte); /* A TC gate may be completely closed, which is a -- 2.51.0 From f8d4bc455047cf3903cd6f85f49978987dbb3027 Mon Sep 17 00:00:00 2001 From: Martin Ottens Date: Tue, 10 Dec 2024 14:14:11 +0100 Subject: [PATCH 14/16] net/sched: netem: account for backlog updates from child qdisc In general, 'qlen' of any classful qdisc should keep track of the number of packets that the qdisc itself and all of its children holds. In case of netem, 'qlen' only accounts for the packets in its internal tfifo. When netem is used with a child qdisc, the child qdisc can use 'qdisc_tree_reduce_backlog' to inform its parent, netem, about created or dropped SKBs. This function updates 'qlen' and the backlog statistics of netem, but netem does not account for changes made by a child qdisc. 'qlen' then indicates the wrong number of packets in the tfifo. If a child qdisc creates new SKBs during enqueue and informs its parent about this, netem's 'qlen' value is increased. When netem dequeues the newly created SKBs from the child, the 'qlen' in netem is not updated. If 'qlen' reaches the configured sch->limit, the enqueue function stops working, even though the tfifo is not full. Reproduce the bug: Ensure that the sender machine has GSO enabled. Configure netem as root qdisc and tbf as its child on the outgoing interface of the machine as follows: $ tc qdisc add dev root handle 1: netem delay 100ms limit 100 $ tc qdisc add dev parent 1:0 tbf rate 50Mbit burst 1542 latency 50ms Send bulk TCP traffic out via this interface, e.g., by running an iPerf3 client on the machine. Check the qdisc statistics: $ tc -s qdisc show dev Statistics after 10s of iPerf3 TCP test before the fix (note that netem's backlog > limit, netem stopped accepting packets): qdisc netem 1: root refcnt 2 limit 1000 delay 100ms Sent 2767766 bytes 1848 pkt (dropped 652, overlimits 0 requeues 0) backlog 4294528236b 1155p requeues 0 qdisc tbf 10: parent 1:1 rate 50Mbit burst 1537b lat 50ms Sent 2767766 bytes 1848 pkt (dropped 327, overlimits 7601 requeues 0) backlog 0b 0p requeues 0 Statistics after the fix: qdisc netem 1: root refcnt 2 limit 1000 delay 100ms Sent 37766372 bytes 24974 pkt (dropped 9, overlimits 0 requeues 0) backlog 0b 0p requeues 0 qdisc tbf 10: parent 1:1 rate 50Mbit burst 1537b lat 50ms Sent 37766372 bytes 24974 pkt (dropped 327, overlimits 96017 requeues 0) backlog 0b 0p requeues 0 tbf segments the GSO SKBs (tbf_segment) and updates the netem's 'qlen'. The interface fully stops transferring packets and "locks". In this case, the child qdisc and tfifo are empty, but 'qlen' indicates the tfifo is at its limit and no more packets are accepted. This patch adds a counter for the entries in the tfifo. Netem's 'qlen' is only decreased when a packet is returned by its dequeue function, and not during enqueuing into the child qdisc. External updates to 'qlen' are thus accounted for and only the behavior of the backlog statistics changes. As in other qdiscs, 'qlen' then keeps track of how many packets are held in netem and all of its children. As before, sch->limit remains as the maximum number of packets in the tfifo. The same applies to netem's backlog statistics. Fixes: 50612537e9ab ("netem: fix classful handling") Signed-off-by: Martin Ottens Acked-by: Jamal Hadi Salim Link: https://patch.msgid.link/20241210131412.1837202-1-martin.ottens@fau.de Signed-off-by: Jakub Kicinski --- net/sched/sch_netem.c | 22 ++++++++++++++++------ 1 file changed, 16 insertions(+), 6 deletions(-) diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c index fe6fed291a7b..71ec9986ed37 100644 --- a/net/sched/sch_netem.c +++ b/net/sched/sch_netem.c @@ -79,6 +79,8 @@ struct netem_sched_data { struct sk_buff *t_head; struct sk_buff *t_tail; + u32 t_len; + /* optional qdisc for classful handling (NULL at netem init) */ struct Qdisc *qdisc; @@ -383,6 +385,7 @@ static void tfifo_reset(struct Qdisc *sch) rtnl_kfree_skbs(q->t_head, q->t_tail); q->t_head = NULL; q->t_tail = NULL; + q->t_len = 0; } static void tfifo_enqueue(struct sk_buff *nskb, struct Qdisc *sch) @@ -412,6 +415,7 @@ static void tfifo_enqueue(struct sk_buff *nskb, struct Qdisc *sch) rb_link_node(&nskb->rbnode, parent, p); rb_insert_color(&nskb->rbnode, &q->t_root); } + q->t_len++; sch->q.qlen++; } @@ -518,7 +522,7 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch, 1<q.qlen >= sch->limit)) { + if (unlikely(q->t_len >= sch->limit)) { /* re-link segs, so that qdisc_drop_all() frees them all */ skb->next = segs; qdisc_drop_all(skb, sch, to_free); @@ -702,8 +706,8 @@ static struct sk_buff *netem_dequeue(struct Qdisc *sch) tfifo_dequeue: skb = __qdisc_dequeue_head(&sch->q); if (skb) { - qdisc_qstats_backlog_dec(sch, skb); deliver: + qdisc_qstats_backlog_dec(sch, skb); qdisc_bstats_update(sch, skb); return skb; } @@ -719,8 +723,7 @@ deliver: if (time_to_send <= now && q->slot.slot_next <= now) { netem_erase_head(q, skb); - sch->q.qlen--; - qdisc_qstats_backlog_dec(sch, skb); + q->t_len--; skb->next = NULL; skb->prev = NULL; /* skb->dev shares skb->rbnode area, @@ -747,16 +750,21 @@ deliver: if (net_xmit_drop_count(err)) qdisc_qstats_drop(sch); qdisc_tree_reduce_backlog(sch, 1, pkt_len); + sch->qstats.backlog -= pkt_len; + sch->q.qlen--; } goto tfifo_dequeue; } + sch->q.qlen--; goto deliver; } if (q->qdisc) { skb = q->qdisc->ops->dequeue(q->qdisc); - if (skb) + if (skb) { + sch->q.qlen--; goto deliver; + } } qdisc_watchdog_schedule_ns(&q->watchdog, @@ -766,8 +774,10 @@ deliver: if (q->qdisc) { skb = q->qdisc->ops->dequeue(q->qdisc); - if (skb) + if (skb) { + sch->q.qlen--; goto deliver; + } } return NULL; } -- 2.51.0 From d2516c3a53705f783bb6868df0f4a2b977898a71 Mon Sep 17 00:00:00 2001 From: Daniel Borkmann Date: Tue, 10 Dec 2024 15:12:41 +0100 Subject: [PATCH 15/16] net, team, bonding: Add netdev_base_features helper Both bonding and team driver have logic to derive the base feature flags before iterating over their slave devices to refine the set via netdev_increment_features(). Add a small helper netdev_base_features() so this can be reused instead of having it open-coded multiple times. Signed-off-by: Daniel Borkmann Cc: Nikolay Aleksandrov Cc: Ido Schimmel Cc: Jiri Pirko Reviewed-by: Hangbin Liu Reviewed-by: Nikolay Aleksandrov Link: https://patch.msgid.link/20241210141245.327886-1-daniel@iogearbox.net Signed-off-by: Paolo Abeni --- drivers/net/bonding/bond_main.c | 4 +--- drivers/net/team/team_core.c | 3 +-- include/linux/netdev_features.h | 7 +++++++ 3 files changed, 9 insertions(+), 5 deletions(-) diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index 49dd4fe195e5..42c835c60cd8 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -1520,9 +1520,7 @@ static netdev_features_t bond_fix_features(struct net_device *dev, struct slave *slave; mask = features; - - features &= ~NETIF_F_ONE_FOR_ALL; - features |= NETIF_F_ALL_FOR_ALL; + features = netdev_base_features(features); bond_for_each_slave(bond, slave, iter) { features = netdev_increment_features(features, diff --git a/drivers/net/team/team_core.c b/drivers/net/team/team_core.c index a1b27b69f010..1df062c67640 100644 --- a/drivers/net/team/team_core.c +++ b/drivers/net/team/team_core.c @@ -2011,8 +2011,7 @@ static netdev_features_t team_fix_features(struct net_device *dev, netdev_features_t mask; mask = features; - features &= ~NETIF_F_ONE_FOR_ALL; - features |= NETIF_F_ALL_FOR_ALL; + features = netdev_base_features(features); rcu_read_lock(); list_for_each_entry_rcu(port, &team->port_list, list) { diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h index 66e7d26b70a4..11be70a7929f 100644 --- a/include/linux/netdev_features.h +++ b/include/linux/netdev_features.h @@ -253,4 +253,11 @@ static inline int find_next_netdev_feature(u64 feature, unsigned long start) NETIF_F_GSO_UDP_TUNNEL | \ NETIF_F_GSO_UDP_TUNNEL_CSUM) +static inline netdev_features_t netdev_base_features(netdev_features_t features) +{ + features &= ~NETIF_F_ONE_FOR_ALL; + features |= NETIF_F_ALL_FOR_ALL; + return features; +} + #endif /* _LINUX_NETDEV_FEATURES_H */ -- 2.51.0 From d064ea7fe2a24938997b5e88e6b61cbb0a4bb906 Mon Sep 17 00:00:00 2001 From: Daniel Borkmann Date: Tue, 10 Dec 2024 15:12:42 +0100 Subject: [PATCH 16/16] bonding: Fix initial {vlan,mpls}_feature set in bond_compute_features If a bonding device has slave devices, then the current logic to derive the feature set for the master bond device is limited in that flags which are fully supported by the underlying slave devices cannot be propagated up to vlan devices which sit on top of bond devices. Instead, these get blindly masked out via current NETIF_F_ALL_FOR_ALL logic. vlan_features and mpls_features should reuse netdev_base_features() in order derive the set in the same way as ndo_fix_features before iterating through the slave devices to refine the feature set. Fixes: a9b3ace44c7d ("bonding: fix vlan_features computing") Fixes: 2e770b507ccd ("net: bonding: Inherit MPLS features from slave devices") Signed-off-by: Daniel Borkmann Cc: Nikolay Aleksandrov Cc: Ido Schimmel Cc: Jiri Pirko Reviewed-by: Nikolay Aleksandrov Reviewed-by: Hangbin Liu Link: https://patch.msgid.link/20241210141245.327886-2-daniel@iogearbox.net Signed-off-by: Paolo Abeni --- drivers/net/bonding/bond_main.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index 42c835c60cd8..320dd71392ef 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -1563,8 +1563,9 @@ static void bond_compute_features(struct bonding *bond) if (!bond_has_slaves(bond)) goto done; - vlan_features &= NETIF_F_ALL_FOR_ALL; - mpls_features &= NETIF_F_ALL_FOR_ALL; + + vlan_features = netdev_base_features(vlan_features); + mpls_features = netdev_base_features(mpls_features); bond_for_each_slave(bond, slave, iter) { vlan_features = netdev_increment_features(vlan_features, -- 2.51.0