Eric Woudstra [Tue, 25 Feb 2025 20:15:09 +0000 (21:15 +0100)]
net: ethernet: mtk_ppe_offload: Allow QinQ, double ETH_P_8021Q only
mtk_foe_entry_set_vlan() in mtk_ppe.c already supports double vlan
tagging, but mtk_flow_offload_replace() in mtk_ppe_offload.c only allows
a single vlan tag, optionally in combination with pppoe and dsa tags.
However, mtk_foe_entry_set_vlan() only allows setting the vlan id.
The protocol cannot be set; it is always ETH_P_8021Q, for both inner and
outer tags. This patch adds QinQ support to mtk_flow_offload_replace(),
only for the case where both inner and outer tags are ETH_P_8021Q.
Only PPPoE-in-Q (as before) and Q-in-Q are allowed. A combination
of PPPoE and Q-in-Q is not allowed.
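A minimal sketch of the check this implies in mtk_flow_offload_replace(),
assuming a VLAN-tag counter and per-tag protocol fields (the variable
names here are hypothetical, not taken from the patch):

    /* Double tagging is only offloadable when both tags are
     * ETH_P_8021Q, since mtk_foe_entry_set_vlan() can only set the
     * VLAN id, never the protocol.
     */
    if (num_vlans == 2 &&
        (vlan_proto[0] != htons(ETH_P_8021Q) ||
         vlan_proto[1] != htons(ETH_P_8021Q)))
            return -EOPNOTSUPP;

    /* A combination of PPPoE and Q-in-Q stays unsupported. */
    if (num_vlans == 2 && has_pppoe)
            return -EOPNOTSUPP;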
====================
bpf: cpumap: enable GRO for XDP_PASS frames
Several months ago, I was looking through my old XDP hints tree[0]
to check whether some patches not directly related to hints could be
sent standalone. Roughly at the same time, Daniel appeared and asked[1]
about GRO for cpumap from that tree.
Currently, cpumap uses its own kthread which processes cpumap-redirected
frames in batches of 8, without any weighting (but with rescheduling
points). The resulting skbs get passed to the stack via
netif_receive_skb_list(), which means no GRO happens.
Even though we can't currently pass checksum status from the drivers,
in many cases GRO performs better than listified Rx without
aggregation, as confirmed by tests.
In order to enable GRO in cpumap, we need to do the following:
* patches 1-2: decouple the GRO struct from the NAPI struct and allow
using it out of a NAPI entity within the kernel core code;
* patch 3: switch cpumap from netif_receive_skb_list() to
gro_receive_skb().
Additional improvements:
* patch 4: optimize XDP_PASS in cpumap by using arrays instead of linked
lists;
* patch 5-6: introduce and use a function to get skbs from the NAPI
percpu caches in bulk, not one at a time;
* patch 7-8: use that function in veth as well and remove the one
now superseded by it.
My trafficgen UDP GRO tests, small frame sizes:
                GRO off    GRO on
    baseline      2.7        N/A     Mpps
    patch 3       2.3        4       Mpps
    patch 8       2.4        4.7     Mpps
    1...3 diff    -17        +48     %
    1...8 diff    -11        +74     %
Daniel reported from +14%[2] to +18%[3] of throughput in neper's TCP RR
tests. On my system, however, the same test gave me up to +100%.
Note that there's a series from Lorenzo[4] which achieves the same, but
in a different way. During the discussions, the approach using a
standalone GRO instance was preferred over the threaded NAPI.
Alexander Lobakin [Tue, 25 Feb 2025 17:17:50 +0000 (18:17 +0100)]
xdp: remove xdp_alloc_skb_bulk()
The only user was veth, which now uses napi_skb_cache_get_bulk().
It's now preferred over a direct allocation and is exported as
well, so remove this one.
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Alexander Lobakin [Tue, 25 Feb 2025 17:17:49 +0000 (18:17 +0100)]
veth: use napi_skb_cache_get_bulk() instead of xdp_alloc_skb_bulk()
Now that we can bulk-allocate skbs from the NAPI cache, use
napi_skb_cache_get_bulk() in veth as well instead of allocating directly
from the kmem caches. veth uses NAPI and GRO, so this is both
context-safe and beneficial.
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Alexander Lobakin [Tue, 25 Feb 2025 17:17:48 +0000 (18:17 +0100)]
bpf: cpumap: switch to napi_skb_cache_get_bulk()
Now that cpumap uses GRO, which drops unused skb heads to the NAPI
cache, use napi_skb_cache_get_bulk() to try to reuse cached entries
and lower MM layer pressure. Always disable BH before checking and
running the cpumap-pinned XDP prog, and don't re-enable it in between
that and allocating an skb bulk, as we can access the NAPI caches only
from BH context.
The better GRO aggregates packets, the fewer new skbs will be allocated.
If an aggregated skb contains 16 frags, this means 15 skbs were returned
to the cache, so the next 15 skbs will be built without allocating
anything.
The same trafficgen UDP GRO test now shows:
                   GRO off    GRO on
    threaded GRO     2.3        4      Mpps
    thr bulk GRO     2.4        4.7    Mpps
Alexander Lobakin [Tue, 25 Feb 2025 17:17:47 +0000 (18:17 +0100)]
net: skbuff: introduce napi_skb_cache_get_bulk()
Add a function to get an array of skbs from the NAPI percpu cache.
It's supposed to be a drop-in replacement for
kmem_cache_alloc_bulk(skbuff_head_cache, GFP_ATOMIC) and
xdp_alloc_skb_bulk(GFP_ATOMIC). The difference (apart from the
requirement to call it only from BH context) is that it tries to use
as many NAPI cache entries for skbs as possible, allocating new
ones only if needed.
The logic is as follows:
* there are enough skbs in the cache: decache them and return them to
the caller;
* not enough: try refilling the cache first. If there are now enough
skbs, return;
* still not enough: try allocating skbs directly to the output array
with %GFP_ZERO, maybe we'll be able to get some. If there are now
enough, return;
* still not enough: return as many as we were able to obtain.
Most of the time, if called from the NAPI polling loop, the first case
will be true, sometimes (rarely) the second one. The third and the
fourth -- only under heavy memory pressure.
It can save significant amounts of CPU cycles if there are GRO cycles
and/or Tx completion cycles (anything that descends to
napi_skb_cache_put()) happening on this CPU.
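A condensed sketch of that logic (simplified from the description above;
the real function also zeroes the returned heads, per the %GFP_ZERO note):

    u32 napi_skb_cache_get_bulk(void **skbs, u32 n)
    {
            struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);

            /* cases 2-3: cache is short on heads, refill it with one
             * bulk slab call before decaching
             */
            if (nc->skb_count < n)
                    nc->skb_count += kmem_cache_alloc_bulk(net_hotdata.skbuff_cache,
                                                           GFP_ATOMIC, n - nc->skb_count,
                                                           &nc->skb_cache[nc->skb_count]);

            /* case 4: partial refill, hand out what we managed to get */
            n = min(n, nc->skb_count);

            /* case 1: decache n heads into the caller's array */
            nc->skb_count -= n;
            memcpy(skbs, &nc->skb_cache[nc->skb_count], n * sizeof(void *));

            return n;
    }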
Tested-by: Daniel Xu <dxu@dxuuu.xyz>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Alexander Lobakin [Tue, 25 Feb 2025 17:17:46 +0000 (18:17 +0100)]
bpf: cpumap: reuse skb array instead of a linked list to chain skbs
cpumap still uses linked lists to store a list of skbs to pass to the
stack. Now that we don't use listified Rx in favor of
napi_gro_receive(), a linked list is just unneeded overhead.
Inside the polling loop, we already have an array of skbs. Let's reuse
it for skbs passed to cpumap (generic XDP) and keep them there in case
of XDP_PASS when a program is installed to the map itself. Don't list
regular xdp_frames after converting them to skbs either; store them
in the mentioned array (but *before* generic skbs, as the latter have
lower priority) and call gro_receive_skb() for each array element once
they're all converted.
Tested-by: Daniel Xu <dxu@dxuuu.xyz>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Alexander Lobakin [Tue, 25 Feb 2025 17:17:45 +0000 (18:17 +0100)]
bpf: cpumap: switch to GRO from netif_receive_skb_list()
cpumap has its own BH context based on a kthread. It has a sane batch
size of 8 frames per cycle.
GRO can be used here on its own. Adjust cpumap calls to the upper stack
to use the GRO API instead of netif_receive_skb_list(), which processes
skbs in batches but doesn't involve the GRO layer at all.
In plenty of tests, GRO performs better than listified receiving even
given that it has to calculate full frame checksums on the CPU.
As GRO passes the skbs to the upper stack in batches of
@gro_normal_batch, i.e. 8 by default, and skb->dev points to the
device where the frame comes from, it is enough to disable the GRO
netdev feature on it to completely restore the original behaviour:
untouched frames will be bulked and passed to the upper stack in
batches of 8, as they were with netif_receive_skb_list().
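In the cpumap kthread this boils down to roughly the following (a sketch;
the name of the per-map-entry GRO field is assumed):

    /* before: no GRO involvement at all */
    netif_receive_skb_list(&list);

    /* after: feed each skb through the standalone GRO instance, then
     * flush it so aggregated and untouched frames reach the stack
     */
    for (i = 0; i < n; i++)
            gro_receive_skb(&rcpu->gro, skbs[i]);
    gro_flush(&rcpu->gro, false);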
Tested-by: Daniel Xu <dxu@dxuuu.xyz>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Alexander Lobakin [Tue, 25 Feb 2025 17:17:44 +0000 (18:17 +0100)]
net: gro: expose GRO init/cleanup to use outside of NAPI
Make the GRO init and cleanup functions global to be able to use GRO
without a NAPI instance. Together with the already-global gro_flush(),
GRO is now fully usable standalone.
New functions are not exported, since they're not supposed to be used
outside of the kernel core code.
Tested-by: Daniel Xu <dxu@dxuuu.xyz>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Alexander Lobakin [Tue, 25 Feb 2025 17:17:43 +0000 (18:17 +0100)]
net: gro: decouple GRO from the NAPI layer
In fact, these two are not tied closely to each other. The only
requirements for GRO are to use it in BH context and to have some
sane limit on the packet batches; e.g. NAPI has the limit of its
budget (64/8/etc.).
Move the purely GRO fields into a new structure, &gro_node. Embed it
into &napi_struct and adjust all the references.
gro_node::cached_napi_id is effectively the same as
napi_struct::napi_id, but is used on the GRO hotpath to mark skbs.
napi_struct::napi_id is now purely a control path field.
Three Ethernet drivers use napi_gro_flush(), which was never really
meant to be exported, so move it to <net/gro.h> and add that include
to those drivers. napi_gro_receive() is used in more than 100 drivers,
so keep it in <linux/netdevice.h>.
This does not make GRO ready to use outside of the NAPI context
yet.
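The resulting split looks roughly like this (a simplified sketch; the
real structures carry more fields):

    struct gro_node {
            unsigned long           bitmask;
            struct gro_list         hash[GRO_HASH_BUCKETS];
            struct list_head        rx_list;        /* GRO_NORMAL batch */
            u32                     rx_count;
            u32                     cached_napi_id; /* hotpath skb marking */
    };

    struct napi_struct {
            ...
            struct gro_node         gro;            /* embedded GRO state */
            unsigned int            napi_id;        /* control path only */
            ...
    };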
Tested-by: Daniel Xu <dxu@dxuuu.xyz>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Arnd Bergmann [Tue, 25 Feb 2025 08:57:14 +0000 (09:57 +0100)]
pktgen: avoid unused-const-variable warning
When extra warnings are enabled, some configurations build pktgen
without CONFIG_XFRM, which leaves a static const variable unused:
net/core/pktgen.c:213:1: error: unused variable 'F_IPSEC' [-Werror,-Wunused-const-variable]
213 | PKT_FLAGS
| ^~~~~~~~~
net/core/pktgen.c:197:2: note: expanded from macro 'PKT_FLAGS'
197 | pf(IPSEC) /* ipsec on for flows */ \
| ^~~~~~~~~
This could be fixed by marking the variable as __maybe_unused, or by
making the one use visible to the compiler by slightly rearranging the
#ifdef blocks. The second variant looks slightly nicer here, so use
that.
====================
net: napi: add CPU affinity to napi->config
Drivers usually need to re-apply the user-set IRQ affinity to their IRQs
after reset. However, since there can be only one IRQ affinity notifier
for each IRQ, registering IRQ notifiers conflicts with the aRFS rmap
management in the core (which also registers separate IRQ affinity
notifiers).
Move the IRQ affinity management to the napi struct. This way we can have
a unified IRQ notifier to re-apply the user-set affinity and also manage
the ARFS rmaps.
The first patch moves the aRFS rmap management to core. It also adds the
IRQ affinity mask to napi_config and re-applies the mask after reset.
Patches 2, 4 and 5 use the new API for ena, ice and idpf drivers.
The ice driver does not always delete the NAPIs before releasing the
IRQs. The third patch makes sure the driver removes the IRQ number along
with the queue when the NAPIs are disabled. Without this, the next
patches in this series would free the IRQ before releasing the IRQ
notifier (which generates warnings).
Jakub Kicinski [Mon, 24 Feb 2025 23:22:27 +0000 (16:22 -0700)]
selftests: drv-net: add tests for napi IRQ affinity notifiers
Add tests to check that the napi retains its IRQ after down/up, after
multiple changes in the number of rx queues, and after
attaching/releasing an XDP program.
Tested on ice and idpf:
# NETIF=<iface> tools/testing/selftests/drivers/net/hw/irq.py
KTAP version 1
1..4
ok 1 irq.check_irqs_reported
ok 2 irq.check_reconfig_queues
ok 3 irq.check_reconfig_xdp
ok 4 irq.check_down
# Totals: pass:4 fail:0 xfail:0 xpass:0 skip:0 error:0
Ahmed Zaki [Mon, 24 Feb 2025 23:22:22 +0000 (16:22 -0700)]
net: move aRFS rmap management and CPU affinity to core
A common task for most drivers is to remember the user-set CPU affinity
of their IRQs. On each netdev reset, the driver should re-assign the
user's settings to the IRQs. Unify this task across all drivers by
moving the CPU affinity to napi->config.
However, to move the CPU affinity to core, we also need to move aRFS
rmap management since aRFS uses its own IRQ notifiers.
For the aRFS, add a new netdev flag "rx_cpu_rmap_auto". Drivers supporting
aRFS should set the flag via netif_enable_cpu_rmap() and core will allocate
and manage the aRFS rmaps. Freeing the rmap is also done by core when the
netdev is freed. For better IRQ affinity management, move the IRQ rmap
notifier inside the napi_struct and add new notify.notify and
notify.release functions: netif_irq_cpu_rmap_notify() and
netif_napi_affinity_release().
Now that the aRFS rmap management is in core, add the CPU affinity mask
to napi_config. To delegate the CPU affinity management to the core,
drivers must:
1 - set the new netdev flag "irq_affinity_auto":
netif_enable_irq_affinity(netdev)
2 - create the napi with persistent config:
netif_napi_add_config()
3 - bind an IRQ to the napi instance: netif_napi_set_irq()
the core will then make sure to re-apply the user-set affinity to the
napi's IRQ.
The default IRQ mask is set to one CPU, starting from the closest NUMA
node.
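In driver terms, the delegation amounts to roughly this (a sketch; the
q-> names are hypothetical and call sites vary per driver):

    /* once at probe: opt in to core-managed IRQ affinity */
    netif_enable_irq_affinity(netdev);

    /* per queue vector: persistent NAPI config, then bind the IRQ */
    netif_napi_add_config(netdev, &q->napi, driver_poll, q->index);
    netif_napi_set_irq(&q->napi, q->irq);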
Hangbin Liu [Tue, 25 Feb 2025 03:39:14 +0000 (03:39 +0000)]
bonding: report duplicate MAC address in all situations
Normally, a bond uses the MAC address of the first added slave as the
bond's MAC address. The bond will also set the active slave's MAC
address to the bond's address if fail_over_mac is set to none (0) or
follow (2).
When the first slave is removed, the bond will still use the removed slave’s
MAC address, which can lead to a duplicate MAC address and potentially cause
issues with the switch. To avoid confusion, let's warn the user in all
situations, including when fail_over_mac is set to 2 or not in active-backup
mode.
Willem de Bruijn [Tue, 25 Feb 2025 02:33:55 +0000 (21:33 -0500)]
net: skb: free up one bit in tx_flags
The linked series wants to add skb tx completion timestamps.
That needs a bit in skb_shared_info.tx_flags, but all are in use.
A per-skb bit is only needed for features that are configured on a
per packet basis. Per socket features can be read from sk->sk_tsflags.
Per packet tsflags can be set in sendmsg using cmsg, but only those in
SOF_TIMESTAMPING_TX_RECORD_MASK.
Per packet tsflags can also be set without cmsg by sandwiching a
send in between two setsockopts:
val |= SOF_TIMESTAMPING_$FEATURE;
setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val));
write(fd, buf, sz);
val &= ~SOF_TIMESTAMPING_$FEATURE;
setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val));
Changing a datapath test from skb_shinfo(skb)->tx_flags to
skb->sk->sk_tsflags can change behavior in that case, as the tx_flags
is written before the second setsockopt updates sk_tsflags.
Therefore, only bits can be reclaimed that cannot be set by cmsg and
are also highly unlikely to be used to target individual packets
otherwise.
Free up the bit currently used for SKBTX_HW_TSTAMP_USE_CYCLES. This
selects between clock and free running counter source for HW TX
timestamps. It is probable that all packets of the same socket will
always use the same source.
====================
expand cmsg_ipv6.sh with ipv4 support
Expand IPV6_TCLASS to also cover IP_TOS.
Expand IPV6_HOPLIMIT to also cover IP_TTL.
A series of two patches for basic readability (patch 1 is a noop),
and so that git does not interpret code changes + file rename as
a whole file del + add.
====================
Willem de Bruijn [Tue, 25 Feb 2025 02:23:59 +0000 (21:23 -0500)]
selftests/net: expand cmsg_ipv6.sh with ipv4
Expand IPV6_TCLASS to also cover IP_TOS.
Expand IPV6_HOPLIMIT to also cover IP_TTL.
Expand cmsg_sender.c to allow setting IPv4 setsockopts.
Also rename struct v6 to cmsg to match its expanded scope.
Don't bother updating all occurrences of tclass and hoplimit.
Rename cmsg_ipv6.sh to cmsg_ip.sh to match the expanded scope.
Be careful around the subtle API difference between TCLASS and TOS.
IP_TOS includes ECN bits. Add a test to verify that these are masked
when making routing decisions.
Eric Dumazet [Tue, 25 Feb 2025 17:10:48 +0000 (17:10 +0000)]
tcp: be less liberal in TSEcr received while in SYN_RECV state
Yong-Hao Zou mentioned that linux was not strict as other OS in 3WHS,
for flows using TCP TS option (RFC 7323)
As hinted by an old comment in tcp_check_req(),
we can check the TSEcr value in the incoming packet corresponds
to one of the SYNACK TSval values we have sent.
In this patch, I record the oldest and most recent values
that SYNACK packets have used.
Send a challenge ACK if we receive a TSEcr outside
of this range, and increase a new SNMP counter.
Due to TCP fastopen implementation, do not apply yet these checks
for fastopen flows.
v2: No longer use req->num_timeout, but treq->snt_tsval_first
to detect when first SYNACK is prepared. This means
we make sure to not send an initial zero TSval.
Make sure MPTCP and TCP selftests are passing.
Change MIB name to TcpExtTSEcrRejected
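The resulting check is conceptually (a sketch; 32-bit wraparound handling
is omitted, and the MIB enum name is assumed from TcpExtTSEcrRejected):

    /* TSEcr must echo a TSval that one of our SYNACKs carried */
    if (tsecr < treq->snt_tsval_first || tsecr > treq->snt_tsval_last) {
            NET_INC_STATS(sock_net(sk), LINUX_MIB_TSECRREJECTED);
            /* send a challenge ACK instead of completing the 3WHS */
    }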
John Daley [Mon, 24 Feb 2025 23:43:50 +0000 (15:43 -0800)]
enic: add dependency on Page Pool
The driver was not configured to select page_pool, causing a compile
error if the page pool module was not already selected.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202502211253.3XRosM9I-lkp@intel.com/
Fixes: d24cb52b2d8a ("enic: Use the Page Pool API for RX")
Reviewed-by: Nelson Escobar <neescoba@cisco.com>
Signed-off-by: John Daley <johndale@cisco.com>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20250224234350.23157-2-johndale@cisco.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Wed, 26 Feb 2025 02:40:04 +0000 (18:40 -0800)]
Merge branch 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
Tariq Toukan says:
====================
mlx5-next updates 2025-02-24
The following pull-request contains common mlx5 updates
for your *net-next* tree.
* 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
net/mlx5: Change POOL_NEXT_SIZE define value and make it global
net/mlx5: Add new health syndrome error and crr bit offset
====================
Jakub Kicinski [Wed, 26 Feb 2025 02:31:07 +0000 (18:31 -0800)]
Merge branch 'symmetric-or-xor-rss-hash'
Gal Pressman says:
====================
Symmetric OR-XOR RSS hash
Add support for a new type of input_xfrm: Symmetric OR-XOR.
Symmetric OR-XOR computes the hash input as follows:
(SRC_IP | DST_IP, SRC_IP ^ DST_IP, SRC_PORT | DST_PORT, SRC_PORT ^ DST_PORT)
Configuration is done through the ethtool -x/X commands.
For mlx5, the default is already a symmetric hash; this series now
exposes it to userspace and allows enabling/disabling the feature.
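Concretely, the fields fed to the RSS hash become (a sketch of the
transform itself, not driver code):

    /* OR and XOR are commutative, so swapping source and destination
     * yields identical hash input -> both directions of a flow land
     * on the same RX queue
     */
    hash_in[0] = src_ip | dst_ip;
    hash_in[1] = src_ip ^ dst_ip;
    hash_in[2] = src_port | dst_port;
    hash_in[3] = src_port ^ dst_port;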
Gal Pressman [Mon, 24 Feb 2025 17:44:16 +0000 (19:44 +0200)]
selftests: drv-net-hw: Add a test for symmetric RSS hash
Add a selftest that verifies symmetric RSS hash is working as intended.
The test runs iterations of traffic, swapping the src/dst UDP ports, and
verifies that the same RX queue is receiving the traffic in both cases.
====================
net: stmmac: dwc-qos: clean up clock initialisation
My single v1 patch has become two patches as a result of the build
error, as it appears this code uses "data" differently from others.
v2 still produced build warnings despite local builds being clean,
so v3 addresses those.
The first patch brings some consistency with other drivers, naming
local variables that refer to struct plat_stmmacenet_data as
"plat_dat", as is used elsewhere in this driver.
====================
Russell King (Oracle) [Mon, 24 Feb 2025 16:53:15 +0000 (16:53 +0000)]
net: stmmac: dwc-qos: clean up clock initialisation
Clean up the clock initialisation by providing a helper to find a
named clock in the bulk clocks, and provide the name of the stmmac
clock in match data so we can locate the stmmac clock in generic
code.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Tested-by: Thierry Reding <treding@nvidia.com>
Link: https://patch.msgid.link/E1tmbhj-004vSz-Pt@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Russell King (Oracle) [Mon, 24 Feb 2025 16:53:10 +0000 (16:53 +0000)]
net: stmmac: dwc-qos: name struct plat_stmmacenet_data consistently
Most of the stmmac driver uses "plat_dat" to name the platform data,
and "data" is used for the glue driver data. Make dwc-qos follow
this pattern to avoid silly mistakes.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Tested-by: Thierry Reding <treding@nvidia.com>
Link: https://patch.msgid.link/E1tmbhe-004vSt-M3@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jonas Gottlieb [Mon, 24 Feb 2025 09:44:27 +0000 (10:44 +0100)]
Add OVN to `rtnetlink.h`
- The Open Virtual Network (OVN) routing netlink handler uses ID 84
- Will also add to `/etc/iproute2/rt_protos` once this is accepted
- For more information: https://github.com/ovn-org/ovn
Hangbin Liu [Mon, 24 Feb 2025 09:40:13 +0000 (09:40 +0000)]
selftests/net: ensure mptcp is enabled in netns
Some distributions may not enable MPTCP by default. All other MPTCP tests
source mptcp_lib.sh to ensure MPTCP is enabled before testing. However,
the ip_local_port_range test is the only one that does not include this
step.
Let's also ensure MPTCP is enabled in netns for ip_local_port_range so
that it passes on all distributions.
Suggested-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250224094013.13159-1-liuhangbin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Hariprasad Kelam [Mon, 24 Feb 2025 03:56:03 +0000 (09:26 +0530)]
Octeontx2-af: RPM: Register driver with PCI subsys IDs
Although the PCI device ID and Vendor ID for the RPM (MAC) block
have remained the same across Octeon CN10K and the next-generation
CN20K silicon, Hardware architecture has changed (NIX mapped RPMs
and RFOE Mapped RPMs).
Add PCI Subsystem IDs to the device table to ensure that this driver
can be probed from NIX mapped RPM devices only.
Jakub Kicinski [Tue, 25 Feb 2025 02:23:46 +0000 (18:23 -0800)]
Merge branch 'mptcp-pm-misc-cleanups-part-3'
Matthieu Baerts says:
====================
mptcp: pm: misc cleanups, part 3
These cleanups lead the way to the unification of the path-manager
interfaces, and allow future extensions. The following patches are not
all linked to each others, but are all related to the path-managers,
except the last three.
- Patch 1: remove unused returned value in mptcp_nl_set_flags().
- Patch 2: new flag: avoid iterating over all connections if not needed.
- Patch 3: add a build check making sure there is enough space in cb-ctx.
- Patch 4: new mptcp_pm_genl_fill_addr helper to reduce duplicated code.
- Patch 5: simplify userspace_pm_append_new_local_addr helper.
- Patch 6: drop unneeded inet6_sk().
- Patch 7: use ipv6_addr_equal() instead of !ipv6_addr_cmp()
- Patch 8: scheduler: split an interface in two.
- Patch 9: scheduler: save 64 bytes of currently unused data.
- Patch 10: small optimisation to exit early in case of retransmissions.
====================
Thanks to the previous commit ("mptcp: sched: split get_subflow
interface into two"), the mptcp_sched_data structure is now unused.
This structure had been added to allow future extensions that are not
ready yet. In the end, this structure will not be used at all once
mptcp_subflow bpf_iter is supported [1].
Here is a first step to save 64 bytes on the stack for each scheduling
operation. The structure is not removed yet, so as not to break the WIP
work on these extensions; that will be done when [1] is ready and
applied.
Geliang Tang [Fri, 21 Feb 2025 15:44:01 +0000 (16:44 +0100)]
mptcp: sched: split get_subflow interface into two
The get_retrans() interface of the burst packet scheduler invokes a
sleeping function, mptcp_pm_subflow_chk_stale(), which calls
__lock_sock_fast(). So the get_retrans() interface should be set with
the BPF_F_SLEEPABLE flag in BPF. But the get_send() interface of this
scheduler can't be set with the BPF_F_SLEEPABLE flag, since it's invoked
in ack_update_msk() under the mptcp data lock.
So this patch has to split the get_subflow() interface of the packet
scheduler into two interfaces: get_send() and get_retrans(). Then we can
set the get_retrans() interface alone with the BPF_F_SLEEPABLE flag.
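After the split, the scheduler ops conceptually look like this (a sketch;
the real struct carries more members, and at this point in the series the
hooks still take a mptcp_sched_data argument):

    struct mptcp_sched_ops {
            /* runs under the mptcp data lock: must not sleep */
            int (*get_send)(struct mptcp_sock *msk);
            /* may sleep (__lock_sock_fast): BPF_F_SLEEPABLE in BPF */
            int (*get_retrans)(struct mptcp_sock *msk);
            ...
    };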
Geliang Tang [Fri, 21 Feb 2025 15:43:58 +0000 (16:43 +0100)]
mptcp: pm: drop match in userspace_pm_append_new_local_addr
The variable 'match' in mptcp_userspace_pm_append_new_local_addr() is a
redundant one, and this patch drops it.
No need to define 'match' as 'struct mptcp_pm_addr_entry *' type. In this
function, it's only used to check whether it's NULL. It can be defined as
a Boolean one.
Also other variables 'addr_match' and 'id_match' make 'match' a redundant
one, which can be replaced by directly checking 'addr_match && id_match'.
Geliang Tang [Fri, 21 Feb 2025 15:43:57 +0000 (16:43 +0100)]
mptcp: pm: add mptcp_pm_genl_fill_addr helper
To save some redundant code in dump_addr() interfaces of both the
netlink PM and userspace PM, the code that calls netlink message
helpers (genlmsg_put/cancel/end) and mptcp_nl_fill_addr() is wrapped
into a new helper mptcp_pm_genl_fill_addr().
If an endpoint doesn't have the 'subflow' flag -- in fact, has no type,
so not 'subflow', 'signal', nor 'implicit' -- there are then no subflows
created from this local endpoint to at least the initial destination
address. In this case, no need to call mptcp_pm_nl_fullmesh() which is
there to recreate the subflows to reflect the new value of the fullmesh
attribute.
Similarly, there is then no need to iterate over all connections to do
nothing, if only the 'fullmesh' flag has been changed, and the endpoint
doesn't have the 'subflow' one. So stop early when dealing with this
specific case.
====================
net/mlx5e: Move IPSec policy check after decryption
This series by Jianbo adds an IPsec policy check after decryption.
In the current mlx5 driver, the policy check is done before decryption
for IPSec crypto and packet offload. This series changes that order to
make it consistent with the processing in kernel xfrm. Besides, RX
states with an UPSPEC selector are now supported correctly, with a new
steering table added after decryption and before the policy check.
====================
Jianbo Liu [Thu, 20 Feb 2025 21:39:58 +0000 (23:39 +0200)]
net/mlx5e: Support RX xfrm state selector's UPSPEC for packet offload
Previously, upper layer matches were added to the decryption rule
when the xfrm selector's UPSPEC was specified in the command. However,
this cannot work: packets are not yet decrypted at that point, and there
is no way to match on the upper protocol (TCP/UDP) with a specific
source/destination port. The result is that such packets are not
decrypted by hardware because of this mismatch. Instead, they are
forwarded to the kernel, and decryption is done in software.
To resolve this issue, this patch adds a new table (sa_sel) after the
status table and before the policy table. When the UPSPEC's proto is
specified in the xfrm state's selector, a rule is added in the status
table to forward the decrypted packets to the sa_sel table, where the
corresponding rule for the selector's UPSPEC is added and the packet's
upper headers are checked. If they match, packets are forwarded to the
policy table for the policy check. Otherwise, they are dropped
immediately.
Besides, add a global counter for this kind of packet drop.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Patrisious Haddad <phaddad@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250220213959.504304-9-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jianbo Liu [Thu, 20 Feb 2025 21:39:57 +0000 (23:39 +0200)]
net/mlx5e: Add pass flow group for IPSec RX status table
This flow group is added for the pass rules for both crypto offload
and packet offload. It is placed at the end of the table, and right
before the miss group. There are two entries, and the default pass
rules for both offloads are added in this group.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Patrisious Haddad <phaddad@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250220213959.504304-8-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jianbo Liu [Thu, 20 Feb 2025 21:39:56 +0000 (23:39 +0200)]
net/mlx5e: Add num_reserved_entries param for ipsec_ft_create()
Add a parameter to ipsec_ft_create() to pass the number of reserved
entries when creating an auto-grouped flow table. It's used to create
a table with pre-defined group(s) which may have more than one rule.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Patrisious Haddad <phaddad@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250220213959.504304-7-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jianbo Liu [Thu, 20 Feb 2025 21:39:55 +0000 (23:39 +0200)]
net/mlx5e: Skip IPSec RX policy check for crypto offload
For crypto offload, there is no xfrm policy rule offloaded to
hardware, so there is no need to continue with the policy check for it.
Previously, for crypto offload, the hardware metadata reg c4 was not
used and not changed, but it was set to ASO_OK(0) before decryption to
avoid garbage data. Then a default rule was added to check
ipsec_syndrome and this register: packets are forwarded to the policy
table on success, or dropped on failure.
According to the hardware documentation, this register value can be 0
or 1. So a special value (0xAA), which is not used by hardware, is
chosen as an indication of crypto offload. It is set in c4 before
decryption. Then a default rule is added which matches on 0xAA (and
ipsec_syndrome 0), meaning the packets were handled by crypto offload;
it sends them to the kernel directly, thus skipping the policy check.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Patrisious Haddad <phaddad@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250220213959.504304-6-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jianbo Liu [Thu, 20 Feb 2025 21:39:54 +0000 (23:39 +0200)]
net/mlx5e: Move IPSec policy check after decryption
Currently, the xfrm policy check is done before decryption in the mlx5
driver. If packets match any policy, they are forwarded to the xfrm
state table for decryption. But this is the exact opposite of what
software does: in the kernel implementation, xfrm decode is
unconditionally activated whenever an IPSec packet reaches the input
flow, if there's a matching state rule.
This patch changes the order and moves the policy check after
decryption. Besides, a miss flow table is added at the end for legacy
mode, to make it easier to update the default destination of the
steering rules.
So ESP packets are first forwarded to the SA table for decryption, then
the result is checked in the status table. If the decryption succeeds,
packets are forwarded to another table to check the xfrm policy rules.
When a policy with an allow action is matched, packets in legacy mode
are forwarded to the miss flow table, whose single rule forwards them to
the RoCE tables; in switchdev mode they are forwarded directly to the TC
root chain instead.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Patrisious Haddad <phaddad@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250220213959.504304-5-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jianbo Liu [Thu, 20 Feb 2025 21:39:53 +0000 (23:39 +0200)]
net/mlx5e: Add correct match to check IPSec syndromes for switchdev mode
In commit dddb49b63d86 ("net/mlx5e: Add IPsec and ASO syndromes check
in HW"), IPSec and ASO syndrome checks after decryption for the
specified ASO object were added. But they are correct only for an
eswitch in legacy mode. For switchdev mode, metadata register c1 is
used to save the mapped id (not the ASO object id). So the match needs
to be changed accordingly for the check rules in the status table.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Patrisious Haddad <phaddad@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250220213959.504304-4-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jianbo Liu [Thu, 20 Feb 2025 21:39:52 +0000 (23:39 +0200)]
net/mlx5e: Change the destination of IPSec RX SA miss rule
For an eswitch in legacy mode, packets decrypted in the RX SA table
will continue to be processed for RoCE. But this is not necessary for
the un-decrypted packets, which don't match any decryption rules and
hit the miss rule at the end of the table. So, change the destination
of the miss rule to the TTC default one and skip RoCE.
For an eswitch in switchdev mode, the destination is unchanged.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Patrisious Haddad <phaddad@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250220213959.504304-3-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jianbo Liu [Thu, 20 Feb 2025 21:39:51 +0000 (23:39 +0200)]
net/mlx5e: Add helper function to update IPSec default destination
The default destination of the IPSec steering rules for MPV mode is
updated when the master device is brought up or down. Move the common
code into a helper function; this makes it convenient to update
destinations in later patches.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Patrisious Haddad <phaddad@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250220213959.504304-2-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jiawen Wu [Fri, 21 Feb 2025 06:57:18 +0000 (14:57 +0800)]
net: wangxun: Replace the judgement of MAC type with flags
Since device MAC types are constantly being added, the checks on
wx->mac.type have become complex. Convert the type checks to flags
based on functions.
Jiawen Wu [Fri, 21 Feb 2025 06:57:17 +0000 (14:57 +0800)]
net: txgbe: Add basic support for new AML devices
There is a new 40/25/10 Gigabit Ethernet device.
To support basic functions, PHYLINK is temporarily skipped as it is
intended to implement these configurations in the firmware. And the
associated link IRQ is also skipped.
And Implement the new SW-FW interaction interface, which use 64 Byte
message buffer.
Breno Leitao [Fri, 21 Feb 2025 17:51:27 +0000 (09:51 -0800)]
net: Remove shadow variable in netdev_run_todo()
Fix a shadow variable warning in net/core/dev.c when compiled with
CONFIG_LOCKDEP enabled. The warning occurs because 'dev' is redeclared
inside the while loop, shadowing the outer scope declaration.
net/core/dev.c:11211:22: warning: declaration shadows a local variable [-Wshadow]
struct net_device *dev = list_first_entry(&unlink_list,
net/core/dev.c:11202:21: note: previous declaration is here
struct net_device *dev, *tmp;
Remove the redundant declaration since the variable is already defined
in the outer scope and will be overwritten in the subsequent
list_for_each_entry_safe() loop anyway.
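The fix simply reuses the outer declaration (a sketch reconstructed from
the warning above):

    struct net_device *dev, *tmp;
    ...
    while (!list_empty(&unlink_list)) {
            dev = list_first_entry(&unlink_list,
                                   struct net_device, unlink_list);
            ...
    }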
====================
net: stmmac: thead: clean up clock rate setting
This series cleans up the thead clock rate setting to use the
rgmii_clock() helper function added to phylib.
The first patch switches over to using the rgmii_clock() helper,
and the second patch cleans up the verification that the desired
clock rate is achievable, allowing the private clock rate
definitions to be removed.
====================
thead was checking that the stmmac_clk rate was a multiple of the
RGMII rates for 1G and 100M, but didn't check for 10M. Rather than
do this with hard-coded speeds, check that the calculated divisor
gives the required rate by multiplying the transmit clock rate back
up to the stmmac clock rate and checking that it agrees.
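In other words, roughly (a sketch of the verification, assuming the
usual clk API spelling):

    rate = clk_get_rate(priv->plat->stmmac_clk);
    div = rate / rgmii_clock(speed);
    /* reject stmmac clock rates the divisor can't reproduce exactly */
    if (div * rgmii_clock(speed) != rate)
            return -EINVAL;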
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Tested-by: Drew Fustini <drew@pdp7.com>
Link: https://patch.msgid.link/E1tlToD-004W3g-HB@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Russell King (Oracle) [Fri, 21 Feb 2025 14:15:12 +0000 (14:15 +0000)]
net: stmmac: thead: use rgmii_clock() for RGMII clock rate
Switch to using rgmii_clock() to get the RGMII TXC rate, and calculate
the divisor from the parent clock rate and the rate indicated by
rgmii_clock().
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Tested-by: Drew Fustini <drew@pdp7.com>
Link: https://patch.msgid.link/E1tlTo8-004W3a-CO@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Kunihiko Hayashi [Fri, 21 Feb 2025 05:18:18 +0000 (14:18 +0900)]
net: stmmac: Correct usage of maximum queue number macros
The maximum numbers of each Rx and Tx queues are defined by
MTL_MAX_RX_QUEUES and MTL_MAX_TX_QUEUES respectively.
There are some places where Rx and Tx are used in reverse. There is no
issue when the Tx and Rx macros have the same value, but the usage of
the maximum queue number macros should be corrected to keep consistency
and prevent unexpected mistakes.
Russell King (Oracle) [Fri, 21 Feb 2025 11:38:20 +0000 (11:38 +0000)]
net: stmmac: qcom-ethqos: use rgmii_clock() to set the link clock
The link clock operates at twice the RGMII clock rate. Therefore, we
can use the rgmii_clock() helper to set this clock rate.
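This reduces the rate selection to essentially one call (a sketch; the
link_clk field name is assumed):

    /* the link clock runs at twice the RGMII TXC rate */
    clk_set_rate(ethqos->link_clk, rgmii_clock(speed) * 2);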
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Vinod Koul <vkoul@kernel.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/E1tlRMK-004Vsx-Ss@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jason Wang [Tue, 18 Feb 2025 02:39:08 +0000 (10:39 +0800)]
virtio-net: tweak for better TX performance in NAPI mode
There are several issues in start_xmit():
- Transmitted packets need to be freed before sending a packet; this
introduces a delay and increases the average packet transmit
time. It also increases the time spent holding the TX lock.
- Notification is enabled after free_old_xmit_skbs(), which will
introduce unnecessary interrupts if the TX notification happens on the
same CPU that is doing the transmission (actually, the virtio-net
driver is optimized for this case).
So this patch tries to avoid those issues by not cleaning transmitted
packets in start_xmit() when TX NAPI is enabled, and by disabling
notifications even more aggressively: notifications are now disabled
from the very beginning of start_xmit(). But we can't enable delayed
notification after TX is stopped, as we would lose the notifications.
Instead, delayed notification is enabled after the virtqueue is kicked,
for best performance.
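The reworked ordering is roughly the following (a sketch of the flow
described above, not the actual patch):

    virtqueue_disable_cb(sq->vq);   /* from the start of start_xmit() */

    /* no free_old_xmit_skbs() here in TX NAPI mode: the NAPI handler
     * reclaims transmitted packets instead
     */
    xmit_skb(sq, skb);
    virtqueue_kick(sq->vq);

    /* re-arm delayed notification only after the kick, so that no
     * notification is lost
     */
    virtqueue_enable_cb_delayed(sq->vq);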
Performance numbers:
1) single queue 2 vcpus guest with pktgen_sample03_burst_single_flow.sh
(burst 256) + testpmd (rxonly) on the host:
- When pinning TX IRQ to pktgen VCPU: split virtqueue PPS were
increased 55% from 6.89 Mpps to 10.7 Mpps and 32% TX interrupts were
eliminated. Packed virtqueue PPS were increased 50% from 7.09 Mpps to
10.7 Mpps, 99% TX interrupts were eliminated.
- When pinning TX IRQ to VCPU other than pktgen: split virtqueue PPS
were increased 96% from 5.29 Mpps to 10.4 Mpps and 45% TX interrupts
were eliminated; Packed virtqueue PPS were increased 78% from 6.12
Mpps to 10.9 Mpps and 99% TX interrupts were eliminated.
2) single queue 1 vcpu guest + vhost-net/TAP on the host: single
session netperf from guest to host shows 82% improvement from
31Gb/s to 58Gb/s, %stddev were reduced from 34.5% to 1.9% and 88%
of TX interrupts were eliminated.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Patrisious Haddad [Wed, 19 Feb 2025 08:58:08 +0000 (10:58 +0200)]
net/mlx5: Change POOL_NEXT_SIZE define value and make it global
Change the POOL_NEXT_SIZE define value from 0 to BIT(30), since this
define is used to request the maximum available flow table size, and
zero doesn't make sense for it. Some places in the driver explicitly
use zero expecting the smallest possible table size, but due to this
define they unknowingly end up allocating the biggest table size
instead.
In addition, move the definition to "include/linux/mlx5/fs.h" to expose
it to the IB driver as well, while renaming it appropriately.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250219085808.349923-3-tariqt@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
====================
mctp: Add MCTP-over-USB hardware transport binding
Add an implementation of the DMTF standard DSP0283, providing an MCTP
channel over high-speed USB.
This is a fairly trivial first implementation, in that we only submit
one tx and one rx URB at a time. We do accept multi-packet transfers,
but do not yet generate them on transmit.
Of course, questions and comments are most welcome, particularly on the
USB interfaces.
Jeremy Kerr [Fri, 21 Feb 2025 00:56:58 +0000 (08:56 +0800)]
net: mctp: Add MCTP USB transport driver
Add an implementation for DMTF DSP0283, which defines a MCTP-over-USB
transport. As per that spec, we're restricted to full speed mode,
requiring 512-byte transfers.
Each MCTP-over-USB interface is a peer-to-peer link to a single MCTP
endpoint, so no physical addressing is required (of course, that MCTP
endpoint may then bridge to further MCTP endpoints). Consequently,
interfaces will report with no lladdr data:
# mctp link
dev lo index 1 address 00:00:00:00:00:00 net 1 mtu 65536 up
dev mctpusb0 index 6 address none net 1 mtu 68 up
This is a simple initial implementation, with single rx & tx urbs, and
no multi-packet tx transfers - although we do accept multi-packet rx
from the device.
Includes suggested fixes from Santosh Puranik <spuranik@nvidia.com>.
Jeremy Kerr [Fri, 21 Feb 2025 00:56:57 +0000 (08:56 +0800)]
usb: Add base USB MCTP definitions
Upcoming changes will add a USB host (and later gadget) driver for the
MCTP-over-USB protocol. Add a header that provides common definitions
for protocol support: the packet header format and a few framing
definitions. Add a define for the MCTP class code, as per
https://usb.org/defined-class-codes.
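The framing being defined is small; roughly (a sketch, field names
assumed):

    struct mctp_usb_hdr {
            __be16  id;     /* DMTF vendor ID marking MCTP framing */
            u8      rsvd;
            u8      len;    /* transfer length, header included */
    } __packed;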
Sean Anderson [Thu, 20 Feb 2025 16:42:57 +0000 (11:42 -0500)]
net: cadence: macb: Implement BQL
Implement byte queue limits to allow queuing disciplines to account for
packets enqueued in the ring buffer but not yet transmitted. There is a
separate set of transmit functions for AT91 that I haven't touched,
since I don't have hardware to test on.
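BQL wiring generally amounts to three calls (a sketch of the generic
pattern, not the macb-specific code):

    /* xmit path, after a frame is queued to the ring */
    netdev_tx_sent_queue(netdev_get_tx_queue(dev, queue), skb->len);

    /* TX completion path, once per reclaimed batch */
    netdev_tx_completed_queue(netdev_get_tx_queue(dev, queue),
                              pkts_compl, bytes_compl);

    /* ring reset, so the accounting starts clean */
    netdev_tx_reset_queue(netdev_get_tx_queue(dev, queue));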
Russell King (Oracle) [Thu, 20 Feb 2025 12:47:49 +0000 (12:47 +0000)]
net: stmmac: print stmmac_init_dma_engine() errors using netdev_err()
stmmac_init_dma_engine() uses dev_err(), which leads to errors being
reported as, e.g.:
dwc-eth-dwmac 2490000.ethernet: Failed to reset the dma
dwc-eth-dwmac 2490000.ethernet eth0: stmmac_hw_setup: DMA engine initialization failed
stmmac_init_dma_engine() is only called from stmmac_hw_setup() which
itself uses netdev_err(), and we will have a net_device setup. So,
change the dev_err() to netdev_err() to give consistent error
messages.
Hangbin Liu [Thu, 20 Feb 2025 08:53:26 +0000 (08:53 +0000)]
selftests: fib_nexthops: do not mark skipped tests as failed
The current test marks all unexpected return values as failed and sets ret
to 1. If a test is skipped, the entire test also returns 1, incorrectly
indicating failure.
To fix this, add a skipped variable and set ret to 4 if it was previously
0. Otherwise, keep ret set to 1.
====================
net: fib_rules: Add DSCP mask support
In some deployments users would like to encode path information into
certain bits of the IPv6 flow label, the UDP source port and the DSCP
field and use this information to route packets accordingly.
Redirecting traffic to a routing table based on specific bits in the
DSCP field is not currently possible. Only exact match is currently
supported by FIB rules.
This patchset extends FIB rules to match on the DSCP field with an
optional mask.
Patches #1-#5 gradually extend FIB rules to match on the DSCP field with
an optional mask.
Patch #6 adds test cases for the new functionality.
Ido Schimmel [Thu, 20 Feb 2025 08:05:25 +0000 (10:05 +0200)]
selftests: fib_rule_tests: Add DSCP mask match tests
Add tests for FIB rules that match on DSCP with a mask. Test both good
and bad flows and both the input and output paths.
# ./fib_rule_tests.sh
IPv6 FIB rule tests
[...]
TEST: rule6 check: dscp redirect to table [ OK ]
TEST: rule6 check: dscp no redirect to table [ OK ]
TEST: rule6 del by pref: dscp redirect to table [ OK ]
TEST: rule6 check: iif dscp redirect to table [ OK ]
TEST: rule6 check: iif dscp no redirect to table [ OK ]
TEST: rule6 del by pref: iif dscp redirect to table [ OK ]
TEST: rule6 check: dscp masked redirect to table [ OK ]
TEST: rule6 check: dscp masked no redirect to table [ OK ]
TEST: rule6 del by pref: dscp masked redirect to table [ OK ]
TEST: rule6 check: iif dscp masked redirect to table [ OK ]
TEST: rule6 check: iif dscp masked no redirect to table [ OK ]
TEST: rule6 del by pref: iif dscp masked redirect to table [ OK ]
[...]
Ido Schimmel [Thu, 20 Feb 2025 08:05:22 +0000 (10:05 +0200)]
ipv6: fib_rules: Add DSCP mask matching
Extend IPv6 FIB rules to match on DSCP using a mask. Unlike IPv4, also
initialize the DSCP mask when a non-zero 'tos' is specified as there is
no difference in matching between 'tos' and 'dscp'. As a side effect,
this makes it possible to match on 'dscp 0', like in IPv4.
Ido Schimmel [Thu, 20 Feb 2025 08:05:21 +0000 (10:05 +0200)]
ipv4: fib_rules: Add DSCP mask matching
Extend IPv4 FIB rules to match on DSCP using a mask. The mask is only
set in rules that match on DSCP (not TOS) and initialized to cover the
entire DSCP field if the mask attribute is not specified.
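The match itself is then a masked comparison (a sketch; the kernel uses
dscp_t typed values here):

    /* a rule matches when the masked packet DSCP equals its DSCP */
    if ((pkt_dscp & rule->dscp_mask) != rule->dscp)
            return 0;       /* no match */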
Ido Schimmel [Thu, 20 Feb 2025 08:05:20 +0000 (10:05 +0200)]
net: fib_rules: Add DSCP mask attribute
Add an attribute that allows matching on DSCP with a mask. Matching on
DSCP with a mask is needed in deployments where users encode path
information into certain bits of the DSCP field.
Temporarily set the type of the attribute to 'NLA_REJECT' while support
is being added.
We've added 19 non-merge commits during the last 8 day(s) which contain
a total of 35 files changed, 1126 insertions(+), 53 deletions(-).
The main changes are:
1) Add TCP_RTO_MAX_MS support to bpf_set/getsockopt, from Jason Xing
2) Add network TX timestamping support to BPF sock_ops, from Jason Xing
3) Add TX metadata Launch Time support, from Song Yoong Siang
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
igc: Add launch time support to XDP ZC
igc: Refactor empty frame insertion for launch time support
net: stmmac: Add launch time support to XDP ZC
selftests/bpf: Add launch time request to xdp_hw_metadata
xsk: Add launch time hardware offload support to XDP Tx metadata
selftests/bpf: Add simple bpf tests in the tx path for timestamping feature
bpf: Support selective sampling for bpf timestamping
bpf: Add BPF_SOCK_OPS_TSTAMP_SENDMSG_CB callback
bpf: Add BPF_SOCK_OPS_TSTAMP_ACK_CB callback
bpf: Add BPF_SOCK_OPS_TSTAMP_SND_HW_CB callback
bpf: Add BPF_SOCK_OPS_TSTAMP_SND_SW_CB callback
bpf: Add BPF_SOCK_OPS_TSTAMP_SCHED_CB callback
net-timestamp: Prepare for isolating two modes of SO_TIMESTAMPING
bpf: Disable unsafe helpers in TX timestamping callbacks
bpf: Prevent unsafe access to the sock fields in the BPF timestamping callback
bpf: Prepare the sock_ops ctx and call bpf prog for TX timestamping
bpf: Add networking timestamping support to bpf_get/setsockopt()
selftests/bpf: Add rto max for bpf_setsockopt test
bpf: Support TCP_RTO_MAX_MS for bpf_setsockopt
====================
Ziwei Xiao [Wed, 19 Feb 2025 20:04:51 +0000 (12:04 -0800)]
gve: Add RSS cache for non RSS device option scenario
Not all devices have the capability for the driver to query the
registered RSS configuration. The driver can discover this by checking
the relevant device option during setup. If it cannot query the device,
the driver needs to store an RSS config cache and directly return that
cache when queried via ethtool. The RSS config is initialized when the
driver probes, and the default RSS config is adjusted when the RX queue
count changes.
At this point, only keys of GVE_RSS_KEY_SIZE and indirection tables of
GVE_RSS_INDIR_SIZE are supported.
Signed-off-by: Ziwei Xiao <ziweixiao@google.com> Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Praveen Kaligineedi <pkaligineedi@google.com> Signed-off-by: Jeroen de Borst <jeroendb@google.com> Link: https://patch.msgid.link/20250219200451.3348166-1-jeroendb@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>