From 0d15a26b247d25cd012134bf8825128fedb15cc9 Mon Sep 17 00:00:00 2001 From: MD Danish Anwar Date: Thu, 24 Apr 2025 15:23:16 +0530 Subject: [PATCH 01/16] net: ti: icssg-prueth: Add ICSSG FW Stats The ICSSG firmware maintains set of stats called PA_STATS. Currently the driver only dumps 4 stats. Add support for dumping more stats. The offset for different stats are defined as MACROs in icssg_switch_map.h file. All the offsets are for Slice0. Slice1 offsets are slice0 + 4. The offset calculation is taken care while reading the stats in emac_update_hardware_stats(). The statistics are documented in Documentation/networking/device_drivers/icssg_prueth.rst Reviewed-by: Simon Horman Signed-off-by: MD Danish Anwar Link: https://patch.msgid.link/20250424095316.2643573-1-danishanwar@ti.com Signed-off-by: Jakub Kicinski --- .../device_drivers/ethernet/index.rst | 1 + .../ethernet/ti/icssg_prueth.rst | 56 ++++++++++++++++++ drivers/net/ethernet/ti/icssg/icssg_common.c | 24 +++++++- drivers/net/ethernet/ti/icssg/icssg_prueth.h | 2 +- drivers/net/ethernet/ti/icssg/icssg_stats.c | 8 +-- drivers/net/ethernet/ti/icssg/icssg_stats.h | 58 ++++++++++++------- .../net/ethernet/ti/icssg/icssg_switch_map.h | 33 +++++++++++ 7 files changed, 151 insertions(+), 31 deletions(-) create mode 100644 Documentation/networking/device_drivers/ethernet/ti/icssg_prueth.rst diff --git a/Documentation/networking/device_drivers/ethernet/index.rst b/Documentation/networking/device_drivers/ethernet/index.rst index 05d822b904b4..f9ed93c1da35 100644 --- a/Documentation/networking/device_drivers/ethernet/index.rst +++ b/Documentation/networking/device_drivers/ethernet/index.rst @@ -55,6 +55,7 @@ Contents: ti/cpsw_switchdev ti/am65_nuss_cpsw_switchdev ti/tlan + ti/icssg_prueth wangxun/txgbe wangxun/ngbe diff --git a/Documentation/networking/device_drivers/ethernet/ti/icssg_prueth.rst b/Documentation/networking/device_drivers/ethernet/ti/icssg_prueth.rst new file mode 100644 index 000000000000..da21ddf431bb --- /dev/null +++ b/Documentation/networking/device_drivers/ethernet/ti/icssg_prueth.rst @@ -0,0 +1,56 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============================================== +Texas Instruments ICSSG PRUETH ethernet driver +============================================== + +:Version: 1.0 + +ICSSG Firmware +============== + +Every ICSSG core has two Programmable Real-Time Unit(PRUs), two auxiliary +Real-Time Transfer Unit (RTUs), and two Transmit Real-Time Transfer Units +(TX_PRUs). Each one of these runs its own firmware. The firmwares combnined are +referred as ICSSG Firmware. + +Firmware Statistics +=================== + +The ICSSG firmware maintains certain statistics which are dumped by the driver +via ``ethtool -S `` + +These statistics are as follows, + + - ``FW_RTU_PKT_DROP``: Diagnostic error counter which increments when RTU drops a locally injected packet due to port being disabled or rule violation. + - ``FW_Q0_OVERFLOW``: TX overflow counter for queue0 + - ``FW_Q1_OVERFLOW``: TX overflow counter for queue1 + - ``FW_Q2_OVERFLOW``: TX overflow counter for queue2 + - ``FW_Q3_OVERFLOW``: TX overflow counter for queue3 + - ``FW_Q4_OVERFLOW``: TX overflow counter for queue4 + - ``FW_Q5_OVERFLOW``: TX overflow counter for queue5 + - ``FW_Q6_OVERFLOW``: TX overflow counter for queue6 + - ``FW_Q7_OVERFLOW``: TX overflow counter for queue7 + - ``FW_DROPPED_PKT``: This counter is incremented when a packet is dropped at PRU because of rule violation. + - ``FW_RX_ERROR``: Incremented if there was a CRC error or Min/Max frame error at PRU + - ``FW_RX_DS_INVALID``: Incremented when RTU detects Data Status invalid condition + - ``FW_TX_DROPPED_PACKET``: Counter for packets dropped via TX Port + - ``FW_TX_TS_DROPPED_PACKET``: Counter for packets with TS flag dropped via TX Port + - ``FW_INF_PORT_DISABLED``: Incremented when RX frame is dropped due to port being disabled + - ``FW_INF_SAV``: Incremented when RX frame is dropped due to Source Address violation + - ``FW_INF_SA_DL``: Incremented when RX frame is dropped due to Source Address being in the denylist + - ``FW_INF_PORT_BLOCKED``: Incremented when RX frame is dropped due to port being blocked and frame being a special frame + - ``FW_INF_DROP_TAGGED`` : Incremented when RX frame is dropped for being tagged + - ``FW_INF_DROP_PRIOTAGGED``: Incremented when RX frame is dropped for being priority tagged + - ``FW_INF_DROP_NOTAG``: Incremented when RX frame is dropped for being untagged + - ``FW_INF_DROP_NOTMEMBER``: Incremented when RX frame is dropped for port not being member of VLAN + - ``FW_RX_EOF_SHORT_FRMERR``: Incremented if End Of Frame (EOF) task is scheduled without seeing RX_B1 + - ``FW_RX_B0_DROP_EARLY_EOF``: Incremented when frame is dropped due to Early EOF + - ``FW_TX_JUMBO_FRM_CUTOFF``: Incremented when frame is cut off to prevent packet size > 2000 Bytes + - ``FW_RX_EXP_FRAG_Q_DROP``: Incremented when express frame is received in the same queue as the previous fragment + - ``FW_RX_FIFO_OVERRUN``: RX fifo overrun counter + - ``FW_CUT_THR_PKT``: Incremented when a packet is forwarded using Cut-Through forwarding method + - ``FW_HOST_RX_PKT_CNT``: Number of valid packets sent by Rx PRU to Host on PSI + - ``FW_HOST_TX_PKT_CNT``: Number of valid packets copied by RTU0 to Tx queues + - ``FW_HOST_EGRESS_Q_PRE_OVERFLOW``: Host Egress Q (Pre-emptible) Overflow Counter + - ``FW_HOST_EGRESS_Q_EXP_OVERFLOW``: Host Egress Q (Pre-emptible) Overflow Counter diff --git a/drivers/net/ethernet/ti/icssg/icssg_common.c b/drivers/net/ethernet/ti/icssg/icssg_common.c index b4be76e13a2f..79f2d86acf21 100644 --- a/drivers/net/ethernet/ti/icssg/icssg_common.c +++ b/drivers/net/ethernet/ti/icssg/icssg_common.c @@ -1318,10 +1318,28 @@ void icssg_ndo_get_stats64(struct net_device *ndev, stats->rx_over_errors = emac_get_stat_by_name(emac, "rx_over_errors"); stats->multicast = emac_get_stat_by_name(emac, "rx_multicast_frames"); - stats->rx_errors = ndev->stats.rx_errors; - stats->rx_dropped = ndev->stats.rx_dropped; + stats->rx_errors = ndev->stats.rx_errors + + emac_get_stat_by_name(emac, "FW_RX_ERROR") + + emac_get_stat_by_name(emac, "FW_RX_EOF_SHORT_FRMERR") + + emac_get_stat_by_name(emac, "FW_RX_B0_DROP_EARLY_EOF") + + emac_get_stat_by_name(emac, "FW_RX_EXP_FRAG_Q_DROP") + + emac_get_stat_by_name(emac, "FW_RX_FIFO_OVERRUN"); + stats->rx_dropped = ndev->stats.rx_dropped + + emac_get_stat_by_name(emac, "FW_DROPPED_PKT") + + emac_get_stat_by_name(emac, "FW_INF_PORT_DISABLED") + + emac_get_stat_by_name(emac, "FW_INF_SAV") + + emac_get_stat_by_name(emac, "FW_INF_SA_DL") + + emac_get_stat_by_name(emac, "FW_INF_PORT_BLOCKED") + + emac_get_stat_by_name(emac, "FW_INF_DROP_TAGGED") + + emac_get_stat_by_name(emac, "FW_INF_DROP_PRIOTAGGED") + + emac_get_stat_by_name(emac, "FW_INF_DROP_NOTAG") + + emac_get_stat_by_name(emac, "FW_INF_DROP_NOTMEMBER"); stats->tx_errors = ndev->stats.tx_errors; - stats->tx_dropped = ndev->stats.tx_dropped; + stats->tx_dropped = ndev->stats.tx_dropped + + emac_get_stat_by_name(emac, "FW_RTU_PKT_DROP") + + emac_get_stat_by_name(emac, "FW_TX_DROPPED_PACKET") + + emac_get_stat_by_name(emac, "FW_TX_TS_DROPPED_PACKET") + + emac_get_stat_by_name(emac, "FW_TX_JUMBO_FRM_CUTOFF"); } EXPORT_SYMBOL_GPL(icssg_ndo_get_stats64); diff --git a/drivers/net/ethernet/ti/icssg/icssg_prueth.h b/drivers/net/ethernet/ti/icssg/icssg_prueth.h index b6be4aa57a61..23c465f1ce7f 100644 --- a/drivers/net/ethernet/ti/icssg/icssg_prueth.h +++ b/drivers/net/ethernet/ti/icssg/icssg_prueth.h @@ -54,7 +54,7 @@ #define ICSSG_MAX_RFLOWS 8 /* per slice */ -#define ICSSG_NUM_PA_STATS 4 +#define ICSSG_NUM_PA_STATS 32 #define ICSSG_NUM_MIIG_STATS 60 /* Number of ICSSG related stats */ #define ICSSG_NUM_STATS (ICSSG_NUM_MIIG_STATS + ICSSG_NUM_PA_STATS) diff --git a/drivers/net/ethernet/ti/icssg/icssg_stats.c b/drivers/net/ethernet/ti/icssg/icssg_stats.c index 6f0edae38ea2..e8241e998aa9 100644 --- a/drivers/net/ethernet/ti/icssg/icssg_stats.c +++ b/drivers/net/ethernet/ti/icssg/icssg_stats.c @@ -11,7 +11,6 @@ #define ICSSG_TX_PACKET_OFFSET 0xA0 #define ICSSG_TX_BYTE_OFFSET 0xEC -#define ICSSG_FW_STATS_BASE 0x0248 static u32 stats_base[] = { 0x54c, /* Slice 0 stats start */ 0xb18, /* Slice 1 stats start */ @@ -46,9 +45,8 @@ void emac_update_hardware_stats(struct prueth_emac *emac) if (prueth->pa_stats) { for (i = 0; i < ARRAY_SIZE(icssg_all_pa_stats); i++) { - reg = ICSSG_FW_STATS_BASE + - icssg_all_pa_stats[i].offset * - PRUETH_NUM_MACS + slice * sizeof(u32); + reg = icssg_all_pa_stats[i].offset + + slice * sizeof(u32); regmap_read(prueth->pa_stats, reg, &val); emac->pa_stats[i] += val; } @@ -80,7 +78,7 @@ int emac_get_stat_by_name(struct prueth_emac *emac, char *stat_name) if (emac->prueth->pa_stats) { for (i = 0; i < ARRAY_SIZE(icssg_all_pa_stats); i++) { if (!strcmp(icssg_all_pa_stats[i].name, stat_name)) - return emac->pa_stats[icssg_all_pa_stats[i].offset / sizeof(u32)]; + return emac->pa_stats[i]; } } diff --git a/drivers/net/ethernet/ti/icssg/icssg_stats.h b/drivers/net/ethernet/ti/icssg/icssg_stats.h index e88b919f532c..5ec0b38e0c67 100644 --- a/drivers/net/ethernet/ti/icssg/icssg_stats.h +++ b/drivers/net/ethernet/ti/icssg/icssg_stats.h @@ -155,24 +155,10 @@ static const struct icssg_miig_stats icssg_all_miig_stats[] = { ICSSG_MIIG_STATS(tx_bytes, true), }; -/** - * struct pa_stats_regs - ICSSG Firmware maintained PA Stats register - * @fw_rx_cnt: Number of valid packets sent by Rx PRU to Host on PSI - * @fw_tx_cnt: Number of valid packets copied by RTU0 to Tx queues - * @fw_tx_pre_overflow: Host Egress Q (Pre-emptible) Overflow Counter - * @fw_tx_exp_overflow: Host Egress Q (Express) Overflow Counter - */ -struct pa_stats_regs { - u32 fw_rx_cnt; - u32 fw_tx_cnt; - u32 fw_tx_pre_overflow; - u32 fw_tx_exp_overflow; -}; - -#define ICSSG_PA_STATS(field) \ -{ \ - #field, \ - offsetof(struct pa_stats_regs, field), \ +#define ICSSG_PA_STATS(field) \ +{ \ + #field, \ + field, \ } struct icssg_pa_stats { @@ -181,10 +167,38 @@ struct icssg_pa_stats { }; static const struct icssg_pa_stats icssg_all_pa_stats[] = { - ICSSG_PA_STATS(fw_rx_cnt), - ICSSG_PA_STATS(fw_tx_cnt), - ICSSG_PA_STATS(fw_tx_pre_overflow), - ICSSG_PA_STATS(fw_tx_exp_overflow), + ICSSG_PA_STATS(FW_RTU_PKT_DROP), + ICSSG_PA_STATS(FW_Q0_OVERFLOW), + ICSSG_PA_STATS(FW_Q1_OVERFLOW), + ICSSG_PA_STATS(FW_Q2_OVERFLOW), + ICSSG_PA_STATS(FW_Q3_OVERFLOW), + ICSSG_PA_STATS(FW_Q4_OVERFLOW), + ICSSG_PA_STATS(FW_Q5_OVERFLOW), + ICSSG_PA_STATS(FW_Q6_OVERFLOW), + ICSSG_PA_STATS(FW_Q7_OVERFLOW), + ICSSG_PA_STATS(FW_DROPPED_PKT), + ICSSG_PA_STATS(FW_RX_ERROR), + ICSSG_PA_STATS(FW_RX_DS_INVALID), + ICSSG_PA_STATS(FW_TX_DROPPED_PACKET), + ICSSG_PA_STATS(FW_TX_TS_DROPPED_PACKET), + ICSSG_PA_STATS(FW_INF_PORT_DISABLED), + ICSSG_PA_STATS(FW_INF_SAV), + ICSSG_PA_STATS(FW_INF_SA_DL), + ICSSG_PA_STATS(FW_INF_PORT_BLOCKED), + ICSSG_PA_STATS(FW_INF_DROP_TAGGED), + ICSSG_PA_STATS(FW_INF_DROP_PRIOTAGGED), + ICSSG_PA_STATS(FW_INF_DROP_NOTAG), + ICSSG_PA_STATS(FW_INF_DROP_NOTMEMBER), + ICSSG_PA_STATS(FW_RX_EOF_SHORT_FRMERR), + ICSSG_PA_STATS(FW_RX_B0_DROP_EARLY_EOF), + ICSSG_PA_STATS(FW_TX_JUMBO_FRM_CUTOFF), + ICSSG_PA_STATS(FW_RX_EXP_FRAG_Q_DROP), + ICSSG_PA_STATS(FW_RX_FIFO_OVERRUN), + ICSSG_PA_STATS(FW_CUT_THR_PKT), + ICSSG_PA_STATS(FW_HOST_RX_PKT_CNT), + ICSSG_PA_STATS(FW_HOST_TX_PKT_CNT), + ICSSG_PA_STATS(FW_HOST_EGRESS_Q_PRE_OVERFLOW), + ICSSG_PA_STATS(FW_HOST_EGRESS_Q_EXP_OVERFLOW), }; #endif /* __NET_TI_ICSSG_STATS_H */ diff --git a/drivers/net/ethernet/ti/icssg/icssg_switch_map.h b/drivers/net/ethernet/ti/icssg/icssg_switch_map.h index 424a7e945ea8..490a9cc06fb0 100644 --- a/drivers/net/ethernet/ti/icssg/icssg_switch_map.h +++ b/drivers/net/ethernet/ti/icssg/icssg_switch_map.h @@ -231,4 +231,37 @@ /* Start of 32 bits PA_STAT counters */ #define PA_STAT_32b_START_OFFSET 0x0080 +#define FW_RTU_PKT_DROP 0x0088 +#define FW_Q0_OVERFLOW 0x0090 +#define FW_Q1_OVERFLOW 0x0098 +#define FW_Q2_OVERFLOW 0x00A0 +#define FW_Q3_OVERFLOW 0x00A8 +#define FW_Q4_OVERFLOW 0x00B0 +#define FW_Q5_OVERFLOW 0x00B8 +#define FW_Q6_OVERFLOW 0x00C0 +#define FW_Q7_OVERFLOW 0x00C8 +#define FW_DROPPED_PKT 0x00F8 +#define FW_RX_ERROR 0x0100 +#define FW_RX_DS_INVALID 0x0108 +#define FW_TX_DROPPED_PACKET 0x0110 +#define FW_TX_TS_DROPPED_PACKET 0x0118 +#define FW_INF_PORT_DISABLED 0x0120 +#define FW_INF_SAV 0x0128 +#define FW_INF_SA_DL 0x0130 +#define FW_INF_PORT_BLOCKED 0x0138 +#define FW_INF_DROP_TAGGED 0x0140 +#define FW_INF_DROP_PRIOTAGGED 0x0148 +#define FW_INF_DROP_NOTAG 0x0150 +#define FW_INF_DROP_NOTMEMBER 0x0158 +#define FW_RX_EOF_SHORT_FRMERR 0x0188 +#define FW_RX_B0_DROP_EARLY_EOF 0x0190 +#define FW_TX_JUMBO_FRM_CUTOFF 0x0198 +#define FW_RX_EXP_FRAG_Q_DROP 0x01A0 +#define FW_RX_FIFO_OVERRUN 0x01A8 +#define FW_CUT_THR_PKT 0x01B0 +#define FW_HOST_RX_PKT_CNT 0x0248 +#define FW_HOST_TX_PKT_CNT 0x0250 +#define FW_HOST_EGRESS_Q_PRE_OVERFLOW 0x0258 +#define FW_HOST_EGRESS_Q_EXP_OVERFLOW 0x0260 + #endif /* __NET_TI_ICSSG_SWITCH_MAP_H */ -- 2.51.0 From 32607a332cfea5a4b2a185f3e3d605a9bf4f8df0 Mon Sep 17 00:00:00 2001 From: Willem de Bruijn Date: Thu, 24 Apr 2025 10:35:18 -0400 Subject: [PATCH 02/16] ipv4: prefer multipath nexthop that matches source address With multipath routes, try to ensure that packets leave on the device that is associated with the source address. Avoid the following tcpdump example: veth0 Out IP 10.1.0.2.38640 > 10.2.0.3.8000: Flags [S] veth1 Out IP 10.1.0.2.38648 > 10.2.0.3.8000: Flags [S] Which can happen easily with the most straightforward setup: ip addr add 10.0.0.1/24 dev veth0 ip addr add 10.1.0.1/24 dev veth1 ip route add 10.2.0.3 nexthop via 10.0.0.2 dev veth0 \ nexthop via 10.1.0.2 dev veth1 This is apparently considered WAI, based on the comment in ip_route_output_key_hash_rcu: * 2. Moreover, we are allowed to send packets with saddr * of another iface. --ANK It may be ok for some uses of multipath, but not all. For instance, when using two ISPs, a router may drop packets with unknown source. The behavior occurs because tcp_v4_connect makes three route lookups when establishing a connection: 1. ip_route_connect calls to select a source address, with saddr zero. 2. ip_route_connect calls again now that saddr and daddr are known. 3. ip_route_newports calls again after a source port is also chosen. With a route with multiple nexthops, each lookup may make a different choice depending on available entropy to fib_select_multipath. So it is possible for 1 to select the saddr from the first entry, but 3 to select the second entry. Leading to the above situation. Address this by preferring a match that matches the flowi4 saddr. This will make 2 and 3 make the same choice as 1. Continue to update the backup choice until a choice that matches saddr is found. Do this in fib_select_multipath itself, rather than passing an fl4_oif constraint, to avoid changing non-multipath route selection. Commit e6b45241c57a ("ipv4: reset flowi parameters on route connect") shows how that may cause regressions. Also read ipv4.sysctl_fib_multipath_use_neigh only once. No need to refresh in the loop. This does not happen in IPv6, which performs only one lookup. Signed-off-by: Willem de Bruijn Reviewed-by: David Ahern Reviewed-by: Eric Dumazet Reviewed-by: Ido Schimmel Link: https://patch.msgid.link/20250424143549.669426-2-willemdebruijn.kernel@gmail.com Signed-off-by: Paolo Abeni --- include/net/ip_fib.h | 3 ++- net/ipv4/fib_semantics.c | 39 +++++++++++++++++++++++++-------------- net/ipv4/route.c | 2 +- 3 files changed, 28 insertions(+), 16 deletions(-) diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h index e3864b74e92a..48bb3cf41469 100644 --- a/include/net/ip_fib.h +++ b/include/net/ip_fib.h @@ -574,7 +574,8 @@ static inline u32 fib_multipath_hash_from_keys(const struct net *net, int fib_check_nh(struct net *net, struct fib_nh *nh, u32 table, u8 scope, struct netlink_ext_ack *extack); -void fib_select_multipath(struct fib_result *res, int hash); +void fib_select_multipath(struct fib_result *res, int hash, + const struct flowi4 *fl4); void fib_select_path(struct net *net, struct fib_result *res, struct flowi4 *fl4, const struct sk_buff *skb); diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c index 5326f1501af0..2371f311a1e1 100644 --- a/net/ipv4/fib_semantics.c +++ b/net/ipv4/fib_semantics.c @@ -2170,34 +2170,45 @@ static bool fib_good_nh(const struct fib_nh *nh) return !!(state & NUD_VALID); } -void fib_select_multipath(struct fib_result *res, int hash) +void fib_select_multipath(struct fib_result *res, int hash, + const struct flowi4 *fl4) { struct fib_info *fi = res->fi; struct net *net = fi->fib_net; - bool first = false; + bool found = false; + bool use_neigh; + __be32 saddr; if (unlikely(res->fi->nh)) { nexthop_path_fib_result(res, hash); return; } + use_neigh = READ_ONCE(net->ipv4.sysctl_fib_multipath_use_neigh); + saddr = fl4 ? fl4->saddr : 0; + change_nexthops(fi) { - if (READ_ONCE(net->ipv4.sysctl_fib_multipath_use_neigh)) { - if (!fib_good_nh(nexthop_nh)) - continue; - if (!first) { - res->nh_sel = nhsel; - res->nhc = &nexthop_nh->nh_common; - first = true; - } + if (use_neigh && !fib_good_nh(nexthop_nh)) + continue; + + if (!found) { + res->nh_sel = nhsel; + res->nhc = &nexthop_nh->nh_common; + found = !saddr || nexthop_nh->nh_saddr == saddr; } if (hash > atomic_read(&nexthop_nh->fib_nh_upper_bound)) continue; - res->nh_sel = nhsel; - res->nhc = &nexthop_nh->nh_common; - return; + if (!saddr || nexthop_nh->nh_saddr == saddr) { + res->nh_sel = nhsel; + res->nhc = &nexthop_nh->nh_common; + return; + } + + if (found) + return; + } endfor_nexthops(fi); } #endif @@ -2212,7 +2223,7 @@ void fib_select_path(struct net *net, struct fib_result *res, if (fib_info_num_path(res->fi) > 1) { int h = fib_multipath_hash(net, fl4, skb, NULL); - fib_select_multipath(res, h); + fib_select_multipath(res, h, fl4); } else #endif diff --git a/net/ipv4/route.c b/net/ipv4/route.c index 49cffbe83802..e5e4c71be3af 100644 --- a/net/ipv4/route.c +++ b/net/ipv4/route.c @@ -2154,7 +2154,7 @@ ip_mkroute_input(struct sk_buff *skb, struct fib_result *res, if (res->fi && fib_info_num_path(res->fi) > 1) { int h = fib_multipath_hash(res->fi->fib_net, NULL, skb, hkeys); - fib_select_multipath(res, h); + fib_select_multipath(res, h, NULL); IPCB(skb)->flags |= IPSKB_MULTIPATH; } #endif -- 2.51.0 From 65e9024643c7512ade3aedbb341e11d77ed7abc2 Mon Sep 17 00:00:00 2001 From: Willem de Bruijn Date: Thu, 24 Apr 2025 10:35:19 -0400 Subject: [PATCH 03/16] ip: load balance tcp connections to single dst addr and port Load balance new TCP connections across nexthops also when they connect to the same service at a single remote address and port. This affects only port-based multipath hashing: fib_multipath_hash_policy 1 or 3. Local connections must choose both a source address and port when connecting to a remote service, in ip_route_connect. This "chicken-and-egg problem" (commit 2d7192d6cbab ("ipv4: Sanitize and simplify ip_route_{connect,newports}()")) is resolved by first selecting a source address, by looking up a route using the zero wildcard source port and address. As a result multiple connections to the same destination address and port have no entropy in fib_multipath_hash. This is not a problem when forwarding, as skb-based hashing has a 4-tuple. Nor when establishing UDP connections, as autobind there selects a port before reaching ip_route_connect. Load balance also TCP, by using a random port in fib_multipath_hash. Port assignment in inet_hash_connect is not atomic with ip_route_connect. Thus ports are unpredictable, effectively random. Implementation details: Do not actually pass a random fl4_sport, as that affects not only hashing, but routing more broadly, and can match a source port based policy route, which existing wildcard port 0 will not. Instead, define a new wildcard flowi flag that is used only for hashing. Selecting a random source is equivalent to just selecting a random hash entirely. But for code clarity, follow the normal 4-tuple hash process and only update this field. fib_multipath_hash can be reached with zero sport from other code paths, so explicitly pass this flowi flag, rather than trying to infer this case in the function itself. Signed-off-by: Willem de Bruijn Reviewed-by: David Ahern Reviewed-by: Eric Dumazet Reviewed-by: Ido Schimmel Link: https://patch.msgid.link/20250424143549.669426-3-willemdebruijn.kernel@gmail.com Signed-off-by: Paolo Abeni --- include/net/flow.h | 1 + include/net/route.h | 3 +++ net/ipv4/route.c | 13 ++++++++++--- net/ipv6/route.c | 13 ++++++++++--- net/ipv6/tcp_ipv6.c | 2 ++ 5 files changed, 26 insertions(+), 6 deletions(-) diff --git a/include/net/flow.h b/include/net/flow.h index 2a3f0c42f092..a1839c278d87 100644 --- a/include/net/flow.h +++ b/include/net/flow.h @@ -39,6 +39,7 @@ struct flowi_common { #define FLOWI_FLAG_ANYSRC 0x01 #define FLOWI_FLAG_KNOWN_NH 0x02 #define FLOWI_FLAG_L3MDEV_OIF 0x04 +#define FLOWI_FLAG_ANY_SPORT 0x08 __u32 flowic_secid; kuid_t flowic_uid; __u32 flowic_multipath_hash; diff --git a/include/net/route.h b/include/net/route.h index c605fd5ec0c0..8e39aa822cf9 100644 --- a/include/net/route.h +++ b/include/net/route.h @@ -326,6 +326,9 @@ static inline void ip_route_connect_init(struct flowi4 *fl4, __be32 dst, if (inet_test_bit(TRANSPARENT, sk)) flow_flags |= FLOWI_FLAG_ANYSRC; + if (IS_ENABLED(CONFIG_IP_ROUTE_MULTIPATH) && !sport) + flow_flags |= FLOWI_FLAG_ANY_SPORT; + flowi4_init_output(fl4, oif, READ_ONCE(sk->sk_mark), ip_sock_rt_tos(sk), ip_sock_rt_scope(sk), protocol, flow_flags, dst, src, dport, sport, sk->sk_uid); diff --git a/net/ipv4/route.c b/net/ipv4/route.c index e5e4c71be3af..507b2e5dec50 100644 --- a/net/ipv4/route.c +++ b/net/ipv4/route.c @@ -2037,8 +2037,12 @@ static u32 fib_multipath_custom_hash_fl4(const struct net *net, hash_keys.addrs.v4addrs.dst = fl4->daddr; if (hash_fields & FIB_MULTIPATH_HASH_FIELD_IP_PROTO) hash_keys.basic.ip_proto = fl4->flowi4_proto; - if (hash_fields & FIB_MULTIPATH_HASH_FIELD_SRC_PORT) - hash_keys.ports.src = fl4->fl4_sport; + if (hash_fields & FIB_MULTIPATH_HASH_FIELD_SRC_PORT) { + if (fl4->flowi4_flags & FLOWI_FLAG_ANY_SPORT) + hash_keys.ports.src = (__force __be16)get_random_u16(); + else + hash_keys.ports.src = fl4->fl4_sport; + } if (hash_fields & FIB_MULTIPATH_HASH_FIELD_DST_PORT) hash_keys.ports.dst = fl4->fl4_dport; @@ -2093,7 +2097,10 @@ int fib_multipath_hash(const struct net *net, const struct flowi4 *fl4, hash_keys.control.addr_type = FLOW_DISSECTOR_KEY_IPV4_ADDRS; hash_keys.addrs.v4addrs.src = fl4->saddr; hash_keys.addrs.v4addrs.dst = fl4->daddr; - hash_keys.ports.src = fl4->fl4_sport; + if (fl4->flowi4_flags & FLOWI_FLAG_ANY_SPORT) + hash_keys.ports.src = (__force __be16)get_random_u16(); + else + hash_keys.ports.src = fl4->fl4_sport; hash_keys.ports.dst = fl4->fl4_dport; hash_keys.basic.ip_proto = fl4->flowi4_proto; } diff --git a/net/ipv6/route.c b/net/ipv6/route.c index d0351e95d916..aa6b45bd3515 100644 --- a/net/ipv6/route.c +++ b/net/ipv6/route.c @@ -2492,8 +2492,12 @@ static u32 rt6_multipath_custom_hash_fl6(const struct net *net, hash_keys.basic.ip_proto = fl6->flowi6_proto; if (hash_fields & FIB_MULTIPATH_HASH_FIELD_FLOWLABEL) hash_keys.tags.flow_label = (__force u32)flowi6_get_flowlabel(fl6); - if (hash_fields & FIB_MULTIPATH_HASH_FIELD_SRC_PORT) - hash_keys.ports.src = fl6->fl6_sport; + if (hash_fields & FIB_MULTIPATH_HASH_FIELD_SRC_PORT) { + if (fl6->flowi6_flags & FLOWI_FLAG_ANY_SPORT) + hash_keys.ports.src = (__force __be16)get_random_u16(); + else + hash_keys.ports.src = fl6->fl6_sport; + } if (hash_fields & FIB_MULTIPATH_HASH_FIELD_DST_PORT) hash_keys.ports.dst = fl6->fl6_dport; @@ -2547,7 +2551,10 @@ u32 rt6_multipath_hash(const struct net *net, const struct flowi6 *fl6, hash_keys.control.addr_type = FLOW_DISSECTOR_KEY_IPV6_ADDRS; hash_keys.addrs.v6addrs.src = fl6->saddr; hash_keys.addrs.v6addrs.dst = fl6->daddr; - hash_keys.ports.src = fl6->fl6_sport; + if (fl6->flowi6_flags & FLOWI_FLAG_ANY_SPORT) + hash_keys.ports.src = (__force __be16)get_random_u16(); + else + hash_keys.ports.src = fl6->fl6_sport; hash_keys.ports.dst = fl6->fl6_dport; hash_keys.basic.ip_proto = fl6->flowi6_proto; } diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c index 7dcb33f879ee..e8e68a142649 100644 --- a/net/ipv6/tcp_ipv6.c +++ b/net/ipv6/tcp_ipv6.c @@ -267,6 +267,8 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr, fl6.flowi6_mark = sk->sk_mark; fl6.fl6_dport = usin->sin6_port; fl6.fl6_sport = inet->inet_sport; + if (IS_ENABLED(CONFIG_IP_ROUTE_MULTIPATH) && !fl6.fl6_sport) + fl6.flowi6_flags = FLOWI_FLAG_ANY_SPORT; fl6.flowi6_uid = sk->sk_uid; opt = rcu_dereference_protected(np->opt, lockdep_sock_is_held(sk)); -- 2.51.0 From 4d0dac499bf384fe3f42acc30906d304c3499dd8 Mon Sep 17 00:00:00 2001 From: Willem de Bruijn Date: Thu, 24 Apr 2025 10:35:20 -0400 Subject: [PATCH 04/16] selftests/net: test tcp connection load balancing Verify that TCP connections use both routes when connecting multiple times to a remote service over a two nexthop multipath route. Use socat to create the connections. Use tc prio + tc filter to count routes taken, counting SYN packets across the two egress devices. Also verify that the saddr matches that of the device. To avoid flaky tests when testing inherently randomized behavior, set a low bar and pass if even a single SYN is observed on each device. Signed-off-by: Willem de Bruijn Reviewed-by: Ido Schimmel Tested-by: Ido Schimmel Link: https://patch.msgid.link/20250424143549.669426-4-willemdebruijn.kernel@gmail.com Signed-off-by: Paolo Abeni --- tools/testing/selftests/net/fib_tests.sh | 120 ++++++++++++++++++++++- tools/testing/selftests/net/lib.sh | 24 +++++ 2 files changed, 143 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/net/fib_tests.sh b/tools/testing/selftests/net/fib_tests.sh index 3ea6f886a210..c58dc4ac2810 100755 --- a/tools/testing/selftests/net/fib_tests.sh +++ b/tools/testing/selftests/net/fib_tests.sh @@ -11,7 +11,7 @@ TESTS="unregister down carrier nexthop suppress ipv6_notify ipv4_notify \ ipv6_rt ipv4_rt ipv6_addr_metric ipv4_addr_metric ipv6_route_metrics \ ipv4_route_metrics ipv4_route_v6_gw rp_filter ipv4_del_addr \ ipv6_del_addr ipv4_mangle ipv6_mangle ipv4_bcast_neigh fib6_gc_test \ - ipv4_mpath_list ipv6_mpath_list" + ipv4_mpath_list ipv6_mpath_list ipv4_mpath_balance ipv6_mpath_balance" VERBOSE=0 PAUSE_ON_FAIL=no @@ -1085,6 +1085,35 @@ route_setup() set +e } +forwarding_cleanup() +{ + cleanup_ns $ns3 + + route_cleanup +} + +# extend route_setup with an ns3 reachable through ns2 over both devices +forwarding_setup() +{ + forwarding_cleanup + + route_setup + + setup_ns ns3 + + ip link add veth5 netns $ns3 type veth peer name veth6 netns $ns2 + ip -netns $ns3 link set veth5 up + ip -netns $ns2 link set veth6 up + + ip -netns $ns3 -4 addr add dev veth5 172.16.105.1/24 + ip -netns $ns2 -4 addr add dev veth6 172.16.105.2/24 + ip -netns $ns3 -4 route add 172.16.100.0/22 via 172.16.105.2 + + ip -netns $ns3 -6 addr add dev veth5 2001:db8:105::1/64 nodad + ip -netns $ns2 -6 addr add dev veth6 2001:db8:105::2/64 nodad + ip -netns $ns3 -6 route add 2001:db8:101::/33 via 2001:db8:105::2 +} + # assumption is that basic add of a single path route works # otherwise just adding an address on an interface is broken ipv6_rt_add() @@ -2600,6 +2629,93 @@ ipv6_mpath_list_test() route_cleanup } +tc_set_flower_counter__saddr_syn() { + tc_set_flower_counter $1 $2 $3 "src_ip $4 ip_proto tcp tcp_flags 0x2" +} + +ip_mpath_balance_dep_check() +{ + if [ ! -x "$(command -v socat)" ]; then + echo "socat command not found. Skipping test" + return 1 + fi + + if [ ! -x "$(command -v jq)" ]; then + echo "jq command not found. Skipping test" + return 1 + fi +} + +ip_mpath_balance() { + local -r ipver=$1 + local -r daddr=$2 + local -r num_conn=20 + + for i in $(seq 1 $num_conn); do + ip netns exec $ns3 socat $ipver TCP-LISTEN:8000 STDIO >/dev/null & + sleep 0.02 + echo -n a | ip netns exec $ns1 socat $ipver STDIO TCP:$daddr:8000 + done + + local -r syn0="$(tc_get_flower_counter $ns1 veth1)" + local -r syn1="$(tc_get_flower_counter $ns1 veth3)" + local -r syns=$((syn0+syn1)) + + [ "$VERBOSE" = "1" ] && echo "multipath: syns seen: ($syn0,$syn1)" + + [[ $syns -ge $num_conn ]] && [[ $syn0 -gt 0 ]] && [[ $syn1 -gt 0 ]] +} + +ipv4_mpath_balance_test() +{ + echo + echo "IPv4 multipath load balance test" + + ip_mpath_balance_dep_check || return 1 + forwarding_setup + + $IP route add 172.16.105.1 \ + nexthop via 172.16.101.2 \ + nexthop via 172.16.103.2 + + ip netns exec $ns1 \ + sysctl -q -w net.ipv4.fib_multipath_hash_policy=1 + + tc_set_flower_counter__saddr_syn $ns1 4 veth1 172.16.101.1 + tc_set_flower_counter__saddr_syn $ns1 4 veth3 172.16.103.1 + + ip_mpath_balance -4 172.16.105.1 + + log_test $? 0 "IPv4 multipath loadbalance" + + forwarding_cleanup +} + +ipv6_mpath_balance_test() +{ + echo + echo "IPv6 multipath load balance test" + + ip_mpath_balance_dep_check || return 1 + forwarding_setup + + $IP route add 2001:db8:105::1\ + nexthop via 2001:db8:101::2 \ + nexthop via 2001:db8:103::2 + + ip netns exec $ns1 \ + sysctl -q -w net.ipv6.fib_multipath_hash_policy=1 + + tc_set_flower_counter__saddr_syn $ns1 6 veth1 2001:db8:101::1 + tc_set_flower_counter__saddr_syn $ns1 6 veth3 2001:db8:103::1 + + ip_mpath_balance -6 "[2001:db8:105::1]" + + log_test $? 0 "IPv6 multipath loadbalance" + + forwarding_cleanup +} + ################################################################################ # usage @@ -2683,6 +2799,8 @@ do fib6_gc_test|ipv6_gc) fib6_gc_test;; ipv4_mpath_list) ipv4_mpath_list_test;; ipv6_mpath_list) ipv6_mpath_list_test;; + ipv4_mpath_balance) ipv4_mpath_balance_test;; + ipv6_mpath_balance) ipv6_mpath_balance_test;; help) echo "Test names: $TESTS"; exit 0;; esac diff --git a/tools/testing/selftests/net/lib.sh b/tools/testing/selftests/net/lib.sh index 701905eeff66..7e1e56318625 100644 --- a/tools/testing/selftests/net/lib.sh +++ b/tools/testing/selftests/net/lib.sh @@ -270,6 +270,30 @@ tc_rule_handle_stats_get() .options.actions[0].stats$selector" } +# attach a qdisc with two children match/no-match and a flower filter to match +tc_set_flower_counter() { + local -r ns=$1 + local -r ipver=$2 + local -r dev=$3 + local -r flower_expr=$4 + + tc -n $ns qdisc add dev $dev root handle 1: prio bands 2 \ + priomap 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 + + tc -n $ns qdisc add dev $dev parent 1:1 handle 11: pfifo + tc -n $ns qdisc add dev $dev parent 1:2 handle 12: pfifo + + tc -n $ns filter add dev $dev parent 1: protocol ipv$ipver \ + flower $flower_expr classid 1:2 +} + +tc_get_flower_counter() { + local -r ns=$1 + local -r dev=$2 + + tc -n $ns -j -s qdisc show dev $dev handle 12: | jq .[0].packets +} + ret_set_ksft_status() { local ksft_status=$1; shift -- 2.51.0 From fca6170f5a039543fa5f390f1895fde503b80f46 Mon Sep 17 00:00:00 2001 From: Kees Cook Date: Fri, 25 Apr 2025 23:05:30 -0700 Subject: [PATCH 05/16] ipv4: fib: Fix fib_info_hash_alloc() allocation type In preparation for making the kmalloc family of allocators type aware, we need to make sure that the returned type from the allocation matches the type of the variable being assigned. (Before, the allocator would always return "void *", which can be implicitly cast to any pointer type.) This was allocating many sizeof(struct hlist_head *) when it actually wanted sizeof(struct hlist_head). Luckily these are the same size. Adjust the allocation type to match the assignment. Signed-off-by: Kees Cook Reviewed-by: Simon Horman Reviewed-by: David Ahern Reviewed-by: Kuniyuki Iwashima Link: https://patch.msgid.link/20250426060529.work.873-kees@kernel.org Signed-off-by: Jakub Kicinski --- net/ipv4/fib_semantics.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c index 2371f311a1e1..03959c60d128 100644 --- a/net/ipv4/fib_semantics.c +++ b/net/ipv4/fib_semantics.c @@ -365,7 +365,7 @@ static struct hlist_head *fib_info_laddrhash_bucket(const struct net *net, static struct hlist_head *fib_info_hash_alloc(unsigned int hash_bits) { /* The second half is used for prefsrc */ - return kvcalloc((1 << hash_bits) * 2, sizeof(struct hlist_head *), + return kvcalloc((1 << hash_bits) * 2, sizeof(struct hlist_head), GFP_KERNEL); } -- 2.51.0 From 2eea791a75542651c97c3ca8d34d5a594867094a Mon Sep 17 00:00:00 2001 From: Kees Cook Date: Fri, 25 Apr 2025 23:07:13 -0700 Subject: [PATCH 06/16] pds_core: Allocate pdsc_viftype_defaults copy with ARRAY_SIZE() In preparation for making the kmalloc family of allocators type aware, we need to make sure that the returned type from the allocation matches the type of the variable being assigned. (Before, the allocator would always return "void *", which can be implicitly cast to any pointer type.) This is allocating a copy of pdsc_viftype_defaults, which is an array of struct pdsc_viftype. To correctly return "struct pdsc_viftype *" in the future, adjust the allocation to allocating ARRAY_SIZE-many entries. The resulting allocation size is the same. Signed-off-by: Kees Cook Reviewed-by: Shannon Nelson Link: https://patch.msgid.link/20250426060712.work.575-kees@kernel.org Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/amd/pds_core/core.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/amd/pds_core/core.c b/drivers/net/ethernet/amd/pds_core/core.c index 9512aa4083f0..0e408f990f0b 100644 --- a/drivers/net/ethernet/amd/pds_core/core.c +++ b/drivers/net/ethernet/amd/pds_core/core.c @@ -414,7 +414,8 @@ static int pdsc_viftypes_init(struct pdsc *pdsc) { enum pds_core_vif_types vt; - pdsc->viftype_status = kzalloc(sizeof(pdsc_viftype_defaults), + pdsc->viftype_status = kcalloc(ARRAY_SIZE(pdsc_viftype_defaults), + sizeof(*pdsc->viftype_status), GFP_KERNEL); if (!pdsc->viftype_status) return -ENOMEM; -- 2.51.0 From 01cbf838c775a2096b8b0704b499012d867332bb Mon Sep 17 00:00:00 2001 From: Kees Cook Date: Fri, 25 Apr 2025 23:07:58 -0700 Subject: [PATCH 07/16] net/mlx4_core: Adjust allocation type for buddy->bits In preparation for making the kmalloc family of allocators type aware, we need to make sure that the returned type from the allocation matches the type of the variable being assigned. (Before, the allocator would always return "void *", which can be implicitly cast to any pointer type.) The assigned type is "unsigned long **", but the returned type will be "long **". These are the same size allocation (pointer size) but the types do not match. Adjust the allocation type to match the assignment. Signed-off-by: Kees Cook Reviewed-by: Simon Horman Reviewed-by: Tariq Toukan Link: https://patch.msgid.link/20250426060757.work.865-kees@kernel.org Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/mellanox/mlx4/mr.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/mellanox/mlx4/mr.c b/drivers/net/ethernet/mellanox/mlx4/mr.c index d7444782bfdd..698a5d1f0d7e 100644 --- a/drivers/net/ethernet/mellanox/mlx4/mr.c +++ b/drivers/net/ethernet/mellanox/mlx4/mr.c @@ -106,7 +106,7 @@ static int mlx4_buddy_init(struct mlx4_buddy *buddy, int max_order) buddy->max_order = max_order; spin_lock_init(&buddy->lock); - buddy->bits = kcalloc(buddy->max_order + 1, sizeof(long *), + buddy->bits = kcalloc(buddy->max_order + 1, sizeof(*buddy->bits), GFP_KERNEL); buddy->num_free = kcalloc(buddy->max_order + 1, sizeof(*buddy->num_free), GFP_KERNEL); -- 2.51.0 From c636eed60958875e6499043bcb32f69abf24314c Mon Sep 17 00:00:00 2001 From: Kees Cook Date: Fri, 25 Apr 2025 23:08:42 -0700 Subject: [PATCH 08/16] nfp: xsk: Adjust allocation type for nn->dp.xsk_pools In preparation for making the kmalloc family of allocators type aware, we need to make sure that the returned type from the allocation matches the type of the variable being assigned. (Before, the allocator would always return "void *", which can be implicitly cast to any pointer type.) The assigned type "struct xsk_buff_pool **", but the returned type will be "struct xsk_buff_pool ***". These are the same allocation size (pointer size), but the types don't match. Adjust the allocation type to match the assignment. Signed-off-by: Kees Cook Reviewed-by: Simon Horman Link: https://patch.msgid.link/20250426060841.work.016-kees@kernel.org Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/netronome/nfp/nfp_net_common.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c index 95514fabadf2..28997ddab966 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c @@ -2538,7 +2538,7 @@ nfp_net_alloc(struct pci_dev *pdev, const struct nfp_dev_info *dev_info, nn->dp.num_r_vecs, num_online_cpus()); nn->max_r_vecs = nn->dp.num_r_vecs; - nn->dp.xsk_pools = kcalloc(nn->max_r_vecs, sizeof(nn->dp.xsk_pools), + nn->dp.xsk_pools = kcalloc(nn->max_r_vecs, sizeof(*nn->dp.xsk_pools), GFP_KERNEL); if (!nn->dp.xsk_pools) { err = -ENOMEM; -- 2.51.0 From 5fe6530cd54b8647e71ff450f54f16b7f4c68066 Mon Sep 17 00:00:00 2001 From: Kees Cook Date: Fri, 25 Apr 2025 23:18:59 -0700 Subject: [PATCH 09/16] ptp: ocp: Add const to bp->attr_group allocation type In preparation for making the kmalloc family of allocators type aware, we need to make sure that the returned type from the allocation matches the type of the variable being assigned. (Before, the allocator would always return "void *", which can be implicitly cast to any pointer type.) The assigned type is "const struct attribute_group **", but the returned type, while technically matching, will be not const qualified. As there is no general way to safely add const qualifiers, adjust the allocation type to match the assignment. Signed-off-by: Kees Cook Reviewed-by: Vadim Fedorenko Link: https://patch.msgid.link/20250426061858.work.470-kees@kernel.org Signed-off-by: Jakub Kicinski --- drivers/ptp/ptp_ocp.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/ptp/ptp_ocp.c b/drivers/ptp/ptp_ocp.c index faf6e027f89a..ed5968a3ea5a 100644 --- a/drivers/ptp/ptp_ocp.c +++ b/drivers/ptp/ptp_ocp.c @@ -2372,7 +2372,7 @@ ptp_ocp_attr_group_add(struct ptp_ocp *bp, if (attr_tbl[i].cap & bp->fw_cap) count++; - bp->attr_group = kcalloc(count + 1, sizeof(struct attribute_group *), + bp->attr_group = kcalloc(count + 1, sizeof(*bp->attr_group), GFP_KERNEL); if (!bp->attr_group) return -ENOMEM; -- 2.51.0 From 187e0216366f3573c894514cf41267df843efd49 Mon Sep 17 00:00:00 2001 From: David Wei Date: Sat, 26 Apr 2025 12:55:24 -0700 Subject: [PATCH 10/16] io_uring/zcrx: selftests: use rand_port() Use rand_port() and stop hard coding port 9999. Signed-off-by: David Wei Reviewed-by: Joe Damato Link: https://patch.msgid.link/20250426195525.1906774-2-dw@davidwei.uk Signed-off-by: Jakub Kicinski --- .../selftests/drivers/net/hw/iou-zcrx.py | 48 ++++++++++++------- 1 file changed, 31 insertions(+), 17 deletions(-) diff --git a/tools/testing/selftests/drivers/net/hw/iou-zcrx.py b/tools/testing/selftests/drivers/net/hw/iou-zcrx.py index 48b3d27cf472..a19550419771 100755 --- a/tools/testing/selftests/drivers/net/hw/iou-zcrx.py +++ b/tools/testing/selftests/drivers/net/hw/iou-zcrx.py @@ -5,7 +5,7 @@ import re from os import path from lib.py import ksft_run, ksft_exit from lib.py import NetDrvEpEnv -from lib.py import bkg, cmd, defer, ethtool, wait_port_listen +from lib.py import bkg, cmd, defer, ethtool, rand_port, wait_port_listen def _get_current_settings(cfg): @@ -28,14 +28,14 @@ def _create_rss_ctx(cfg, chans): return (ctx_id, defer(ethtool, f"-X {cfg.ifname} delete context {ctx_id}", host=cfg.remote)) -def _set_flow_rule(cfg, chan): - output = ethtool(f"-N {cfg.ifname} flow-type tcp6 dst-port 9999 action {chan}", host=cfg.remote).stdout +def _set_flow_rule(cfg, port, chan): + output = ethtool(f"-N {cfg.ifname} flow-type tcp6 dst-port {port} action {chan}", host=cfg.remote).stdout values = re.search(r'ID (\d+)', output).group(1) return int(values) -def _set_flow_rule_rss(cfg, chan): - output = ethtool(f"-N {cfg.ifname} flow-type tcp6 dst-port 9999 action {chan}", host=cfg.remote).stdout +def _set_flow_rule_rss(cfg, port, chan): + output = ethtool(f"-N {cfg.ifname} flow-type tcp6 dst-port {port} action {chan}", host=cfg.remote).stdout values = re.search(r'ID (\d+)', output).group(1) return int(values) @@ -47,22 +47,27 @@ def test_zcrx(cfg) -> None: if combined_chans < 2: raise KsftSkipEx('at least 2 combined channels required') (rx_ring, hds_thresh) = _get_current_settings(cfg) + port = rand_port() ethtool(f"-G {cfg.ifname} tcp-data-split on", host=cfg.remote) defer(ethtool, f"-G {cfg.ifname} tcp-data-split auto", host=cfg.remote) + ethtool(f"-G {cfg.ifname} hds-thresh 0", host=cfg.remote) defer(ethtool, f"-G {cfg.ifname} hds-thresh {hds_thresh}", host=cfg.remote) + ethtool(f"-G {cfg.ifname} rx 64", host=cfg.remote) defer(ethtool, f"-G {cfg.ifname} rx {rx_ring}", host=cfg.remote) + ethtool(f"-X {cfg.ifname} equal {combined_chans - 1}", host=cfg.remote) defer(ethtool, f"-X {cfg.ifname} default", host=cfg.remote) - flow_rule_id = _set_flow_rule(cfg, combined_chans - 1) + + flow_rule_id = _set_flow_rule(cfg, port, combined_chans - 1) defer(ethtool, f"-N {cfg.ifname} delete {flow_rule_id}", host=cfg.remote) - rx_cmd = f"{cfg.bin_remote} -s -p 9999 -i {cfg.ifname} -q {combined_chans - 1}" - tx_cmd = f"{cfg.bin_local} -c -h {cfg.remote_addr_v['6']} -p 9999 -l 12840" + rx_cmd = f"{cfg.bin_remote} -s -p {port} -i {cfg.ifname} -q {combined_chans - 1}" + tx_cmd = f"{cfg.bin_local} -c -h {cfg.remote_addr_v['6']} -p {port} -l 12840" with bkg(rx_cmd, host=cfg.remote, exit_wait=True): - wait_port_listen(9999, proto="tcp", host=cfg.remote) + wait_port_listen(port, proto="tcp", host=cfg.remote) cmd(tx_cmd) @@ -73,22 +78,27 @@ def test_zcrx_oneshot(cfg) -> None: if combined_chans < 2: raise KsftSkipEx('at least 2 combined channels required') (rx_ring, hds_thresh) = _get_current_settings(cfg) + port = rand_port() ethtool(f"-G {cfg.ifname} tcp-data-split on", host=cfg.remote) defer(ethtool, f"-G {cfg.ifname} tcp-data-split auto", host=cfg.remote) + ethtool(f"-G {cfg.ifname} hds-thresh 0", host=cfg.remote) defer(ethtool, f"-G {cfg.ifname} hds-thresh {hds_thresh}", host=cfg.remote) + ethtool(f"-G {cfg.ifname} rx 64", host=cfg.remote) defer(ethtool, f"-G {cfg.ifname} rx {rx_ring}", host=cfg.remote) + ethtool(f"-X {cfg.ifname} equal {combined_chans - 1}", host=cfg.remote) defer(ethtool, f"-X {cfg.ifname} default", host=cfg.remote) - flow_rule_id = _set_flow_rule(cfg, combined_chans - 1) + + flow_rule_id = _set_flow_rule(cfg, port, combined_chans - 1) defer(ethtool, f"-N {cfg.ifname} delete {flow_rule_id}", host=cfg.remote) - rx_cmd = f"{cfg.bin_remote} -s -p 9999 -i {cfg.ifname} -q {combined_chans - 1} -o 4" - tx_cmd = f"{cfg.bin_local} -c -h {cfg.remote_addr_v['6']} -p 9999 -l 4096 -z 16384" + rx_cmd = f"{cfg.bin_remote} -s -p {port} -i {cfg.ifname} -q {combined_chans - 1} -o 4" + tx_cmd = f"{cfg.bin_local} -c -h {cfg.remote_addr_v['6']} -p {port} -l 4096 -z 16384" with bkg(rx_cmd, host=cfg.remote, exit_wait=True): - wait_port_listen(9999, proto="tcp", host=cfg.remote) + wait_port_listen(port, proto="tcp", host=cfg.remote) cmd(tx_cmd) @@ -99,24 +109,28 @@ def test_zcrx_rss(cfg) -> None: if combined_chans < 2: raise KsftSkipEx('at least 2 combined channels required') (rx_ring, hds_thresh) = _get_current_settings(cfg) + port = rand_port() ethtool(f"-G {cfg.ifname} tcp-data-split on", host=cfg.remote) defer(ethtool, f"-G {cfg.ifname} tcp-data-split auto", host=cfg.remote) + ethtool(f"-G {cfg.ifname} hds-thresh 0", host=cfg.remote) defer(ethtool, f"-G {cfg.ifname} hds-thresh {hds_thresh}", host=cfg.remote) + ethtool(f"-G {cfg.ifname} rx 64", host=cfg.remote) defer(ethtool, f"-G {cfg.ifname} rx {rx_ring}", host=cfg.remote) + ethtool(f"-X {cfg.ifname} equal {combined_chans - 1}", host=cfg.remote) defer(ethtool, f"-X {cfg.ifname} default", host=cfg.remote) (ctx_id, delete_ctx) = _create_rss_ctx(cfg, combined_chans) - flow_rule_id = _set_flow_rule_rss(cfg, ctx_id) + flow_rule_id = _set_flow_rule_rss(cfg, port, ctx_id) defer(ethtool, f"-N {cfg.ifname} delete {flow_rule_id}", host=cfg.remote) - rx_cmd = f"{cfg.bin_remote} -s -p 9999 -i {cfg.ifname} -q {combined_chans - 1}" - tx_cmd = f"{cfg.bin_local} -c -h {cfg.remote_addr_v['6']} -p 9999 -l 12840" + rx_cmd = f"{cfg.bin_remote} -s -p {port} -i {cfg.ifname} -q {combined_chans - 1}" + tx_cmd = f"{cfg.bin_local} -c -h {cfg.remote_addr_v['6']} -p {port} -l 12840" with bkg(rx_cmd, host=cfg.remote, exit_wait=True): - wait_port_listen(9999, proto="tcp", host=cfg.remote) + wait_port_listen(port, proto="tcp", host=cfg.remote) cmd(tx_cmd) -- 2.51.0 From 6fbb4d3f7262771c376d1176e04811645d3c0c7b Mon Sep 17 00:00:00 2001 From: David Wei Date: Sat, 26 Apr 2025 12:55:25 -0700 Subject: [PATCH 11/16] io_uring/zcrx: selftests: parse json from ethtool -g Parse JSON from ethtool -g instead of parsing text output. Signed-off-by: David Wei Reviewed-by: Joe Damato Link: https://patch.msgid.link/20250426195525.1906774-3-dw@davidwei.uk Signed-off-by: Jakub Kicinski --- tools/testing/selftests/drivers/net/hw/iou-zcrx.py | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/tools/testing/selftests/drivers/net/hw/iou-zcrx.py b/tools/testing/selftests/drivers/net/hw/iou-zcrx.py index a19550419771..aef43f82edd5 100755 --- a/tools/testing/selftests/drivers/net/hw/iou-zcrx.py +++ b/tools/testing/selftests/drivers/net/hw/iou-zcrx.py @@ -9,10 +9,8 @@ from lib.py import bkg, cmd, defer, ethtool, rand_port, wait_port_listen def _get_current_settings(cfg): - output = ethtool(f"-g {cfg.ifname}", host=cfg.remote).stdout - rx_ring = re.findall(r'RX:\s+(\d+)', output) - hds_thresh = re.findall(r'HDS thresh:\s+(\d+)', output) - return (int(rx_ring[1]), int(hds_thresh[1])) + output = ethtool(f"-g {cfg.ifname}", json=True, host=cfg.remote)[0] + return (output['rx'], output['hds-thresh']) def _get_combined_channels(cfg): -- 2.51.0 From 2b06aa2bcfb4eee0c5dc882ff8936913b1e6d44f Mon Sep 17 00:00:00 2001 From: Alexander Duyck Date: Sun, 27 Apr 2025 12:59:49 -0700 Subject: [PATCH 12/16] net: phylink: Drop unused defines for SUPPORTED/ADVERTISED_INTERFACES The defines for SUPPORTED_INTERFACES and ADVERTISED_INTERFACES both appear to be unused. I couldn't find anything that actually references them in the original diff that added them and it seems like they have persisted despite using deprecated defines that aren't supposed to be used as per the ethtool.h header that defines the bits they are composed of. Since they are unused, and not supposed to be used anymore I am just dropping the lines of code since they seem to just be occupying space. Signed-off-by: Alexander Duyck Reviewed-by: Russell King (Oracle) Link: https://patch.msgid.link/174578398922.1580647.9720643128205980455.stgit@ahduyck-xeon-server.home.arpa Signed-off-by: Jakub Kicinski --- drivers/net/phy/phylink.c | 7 ------- 1 file changed, 7 deletions(-) diff --git a/drivers/net/phy/phylink.c b/drivers/net/phy/phylink.c index 1bdd5d8bb5b0..0faa3d97e06b 100644 --- a/drivers/net/phy/phylink.c +++ b/drivers/net/phy/phylink.c @@ -24,13 +24,6 @@ #include "sfp.h" #include "swphy.h" -#define SUPPORTED_INTERFACES \ - (SUPPORTED_TP | SUPPORTED_MII | SUPPORTED_FIBRE | \ - SUPPORTED_BNC | SUPPORTED_AUI | SUPPORTED_Backplane) -#define ADVERTISED_INTERFACES \ - (ADVERTISED_TP | ADVERTISED_MII | ADVERTISED_FIBRE | \ - ADVERTISED_BNC | ADVERTISED_AUI | ADVERTISED_Backplane) - enum { PHYLINK_DISABLE_STOPPED, PHYLINK_DISABLE_LINK, -- 2.51.0 From eed848871c96d4b5a7b06307755b75abd0cc7a06 Mon Sep 17 00:00:00 2001 From: David Howells Date: Mon, 28 Apr 2025 11:22:06 +0100 Subject: [PATCH 13/16] crypto/krb5: Fix change to use SG miter to use offset The recent patch to make the rfc3961 simplified code use sg_miter rather than manually walking the scatterlist to hash the contents of a buffer described by that scatterlist failed to take the starting offset into account. This is indicated by the selftests reporting: krb5: Running aes128-cts-hmac-sha256-128 mic krb5: !!! TESTFAIL crypto/krb5/selftest.c:446 krb5: MIC mismatch Fix this by calling sg_miter_skip() before doing the loop to advance by the offset. This only affects packet signing modes and not full encryption in RxGK because, for full encryption, the message digest is handled inside the authenc and krb5enc drivers. Note: Nothing in linus/master uses the krb5lib, though the bug is there. It is used by AF_RXRPC's RxGK implementation in -next, no need to backport. Fixes: da6f9bf40ac2 ("crypto: krb5 - Use SG miter instead of doing it by hand") Reported-by: Marc Dionne Signed-off-by: David Howells cc: Chuck Lever cc: Simon Horman cc: linux-afs@lists.infradead.org Acked-by: Herbert Xu Link: https://patch.msgid.link/3824017.1745835726@warthog.procyon.org.uk Signed-off-by: Jakub Kicinski --- crypto/krb5/rfc3961_simplified.c | 1 + 1 file changed, 1 insertion(+) diff --git a/crypto/krb5/rfc3961_simplified.c b/crypto/krb5/rfc3961_simplified.c index 79180d28baa9..e49cbdec7c40 100644 --- a/crypto/krb5/rfc3961_simplified.c +++ b/crypto/krb5/rfc3961_simplified.c @@ -89,6 +89,7 @@ int crypto_shash_update_sg(struct shash_desc *desc, struct scatterlist *sg, sg_miter_start(&miter, sg, sg_nents(sg), SG_MITER_FROM_SG | SG_MITER_LOCAL); + sg_miter_skip(&miter, offset); for (i = 0; i < len; i += n) { sg_miter_next(&miter); n = min(miter.length, len - i); -- 2.51.0 From ebaebc5eaf431f7daeeb0a6788a8480d42136df1 Mon Sep 17 00:00:00 2001 From: Bui Quang Minh Date: Sat, 26 Apr 2025 15:12:19 +0700 Subject: [PATCH 14/16] xsk: respect the offsets when copying frags In commit 560d958c6c68 ("xsk: add generic XSk &xdp_buff -> skb conversion"), we introduce a helper to convert zerocopy xdp_buff to skb. However, in the frag copy, we mistakenly ignore the frag's offset. This commit adds the missing offset when copying frags in xdp_copy_frags_from_zc(). This function is not used anywhere so no backport is needed. Signed-off-by: Bui Quang Minh Link: https://patch.msgid.link/20250426081220.40689-2-minhquangbui99@gmail.com Signed-off-by: Jakub Kicinski --- net/core/xdp.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/net/core/xdp.c b/net/core/xdp.c index ea819764ae39..89111c68e545 100644 --- a/net/core/xdp.c +++ b/net/core/xdp.c @@ -698,7 +698,8 @@ static noinline bool xdp_copy_frags_from_zc(struct sk_buff *skb, nr_frags = xinfo->nr_frags; for (u32 i = 0; i < nr_frags; i++) { - u32 len = skb_frag_size(&xinfo->frags[i]); + const skb_frag_t *frag = &xinfo->frags[i]; + u32 len = skb_frag_size(frag); u32 offset, truesize = len; netmem_ref netmem; @@ -708,8 +709,8 @@ static noinline bool xdp_copy_frags_from_zc(struct sk_buff *skb, return false; } - memcpy(__netmem_address(netmem), - __netmem_address(xinfo->frags[i].netmem), + memcpy(__netmem_address(netmem) + offset, + __netmem_address(frag->netmem) + skb_frag_off(frag), LARGEST_ALIGN(len)); __skb_fill_netmem_desc_noacc(sinfo, i, netmem, offset, len); -- 2.51.0 From 7ead4405e06f67bf2163fd4708a6ce0a495b1dca Mon Sep 17 00:00:00 2001 From: Bui Quang Minh Date: Sat, 26 Apr 2025 15:12:20 +0700 Subject: [PATCH 15/16] xsk: convert xdp_copy_frags_from_zc() to use page_pool_dev_alloc() This commit makes xdp_copy_frags_from_zc() use page allocation API page_pool_dev_alloc() instead of page_pool_dev_alloc_netmem() to avoid possible confusion of the returned value. Suggested-by: Jakub Kicinski Signed-off-by: Bui Quang Minh Link: https://patch.msgid.link/20250426081220.40689-3-minhquangbui99@gmail.com Signed-off-by: Jakub Kicinski --- net/core/xdp.c | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/net/core/xdp.c b/net/core/xdp.c index 89111c68e545..4e91c7790671 100644 --- a/net/core/xdp.c +++ b/net/core/xdp.c @@ -701,21 +701,21 @@ static noinline bool xdp_copy_frags_from_zc(struct sk_buff *skb, const skb_frag_t *frag = &xinfo->frags[i]; u32 len = skb_frag_size(frag); u32 offset, truesize = len; - netmem_ref netmem; + struct page *page; - netmem = page_pool_dev_alloc_netmem(pp, &offset, &truesize); - if (unlikely(!netmem)) { + page = page_pool_dev_alloc(pp, &offset, &truesize); + if (unlikely(!page)) { sinfo->nr_frags = i; return false; } - memcpy(__netmem_address(netmem) + offset, - __netmem_address(frag->netmem) + skb_frag_off(frag), + memcpy(page_address(page) + offset, + skb_frag_page(frag) + skb_frag_off(frag), LARGEST_ALIGN(len)); - __skb_fill_netmem_desc_noacc(sinfo, i, netmem, offset, len); + __skb_fill_page_desc_noacc(sinfo, i, page, offset, len); tsize += truesize; - pfmemalloc |= netmem_is_pfmemalloc(netmem); + pfmemalloc |= page_is_pfmemalloc(page); } xdp_update_skb_shared_info(skb, nr_frags, xinfo->xdp_frags_size, -- 2.51.0 From aa6dcab1ea92a8a1f5a2ff7dec825f25eeebf17d Mon Sep 17 00:00:00 2001 From: Aryan Srivastava Date: Tue, 29 Apr 2025 09:49:20 +1200 Subject: [PATCH 16/16] net: phy: aquantia: fix commenting format Comment was erroneously added with /**, amend this to use /* as it is not a kernel-doc. Reported-by: kernel test robot Closes: https://lore.kernel.org/oe-kbuild-all/202504262247.1UBrDBVN-lkp@intel.com/ Signed-off-by: Aryan Srivastava Reviewed-by: Simon Horman Link: https://patch.msgid.link/20250428214920.813038-1-aryan.srivastava@alliedtelesis.co.nz Signed-off-by: Jakub Kicinski --- drivers/net/phy/aquantia/aquantia_main.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/drivers/net/phy/aquantia/aquantia_main.c b/drivers/net/phy/aquantia/aquantia_main.c index 08b1c9cc902b..77a48635d7bf 100644 --- a/drivers/net/phy/aquantia/aquantia_main.c +++ b/drivers/net/phy/aquantia/aquantia_main.c @@ -516,8 +516,7 @@ static int aqr105_read_status(struct phy_device *phydev) if (!phydev->link || phydev->autoneg == AUTONEG_DISABLE) return 0; - /** - * The status register is not immediately correct on line side link up. + /* The status register is not immediately correct on line side link up. * Poll periodically until it reflects the correct ON state. * Only return fail for read error, timeout defaults to OFF state. */ @@ -634,8 +633,7 @@ static int aqr107_read_status(struct phy_device *phydev) if (!phydev->link || phydev->autoneg == AUTONEG_DISABLE) return 0; - /** - * The status register is not immediately correct on line side link up. + /* The status register is not immediately correct on line side link up. * Poll periodically until it reflects the correct ON state. * Only return fail for read error, timeout defaults to OFF state. */ -- 2.51.0