Nikita Zhandarovich [Wed, 15 Jan 2025 16:42:20 +0000 (08:42 -0800)]
net/rose: prevent integer overflows in rose_setsockopt()
In case of possible unpredictably large arguments passed to
rose_setsockopt() and multiplied by extra values on top of that,
integer overflows may occur.
Do the safest minimum and fix these issues by checking the
contents of 'opt' and returning -EINVAL if they are too large. Also,
switch to unsigned int and remove useless check for negative 'opt'
in ROSE_IDLE case.
Russell King (Oracle) [Mon, 20 Jan 2025 10:28:54 +0000 (10:28 +0000)]
net: phylink: fix regression when binding a PHY
Some PHYs don't support clause 45 access, and return -EOPNOTSUPP from
phy_modify_mmd(), which causes phylink_bringup_phy() to fail. Prevent
this failure by allowing -EOPNOTSUPP to also mean success.
Reported-by: Jiawen Wu <jiawenwu@trustnetic.com> Tested-by: Jiawen Wu <jiawenwu@trustnetic.com> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/E1tZp1a-001V62-DT@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In this series we fix an issue with missing cleanups during
error path of am65_cpsw_nuss_init_tx/rx_chns() when used anywhere
other than at probe().
Then we streamline RX and TX queue creation and cleanup. The queues
can now be created or destroyed by calling the appropriate
functions am65_cpsw_create_rxqs/txqs() and am65_cpsw_destroy_rxq/txqs().
We are missing netif_napi_del() and am65_cpsw_nuss_free_tx/rx_chns()
in error path when am65_cpsw_nuss_init_tx/rx_chns() is used anywhere
other than at probe(). i.e. am65_cpsw_nuss_update_tx_rx_chns and
am65_cpsw_nuss_resume()
As reported, in am65_cpsw_nuss_update_tx_rx_chns(),
if am65_cpsw_nuss_init_tx_chns() partially fails then
devm_add_action(dev, am65_cpsw_nuss_free_tx_chns,..) is added
but the cleanup via am65_cpsw_nuss_free_tx_chns() will not run.
Same issue exists for am65_cpsw_nuss_init_tx/rx_chns() failures
in am65_cpsw_nuss_resume() as well.
This would otherwise require more instances of devm_add/remove_action
and is clearly more of a distraction than any benefit.
So, drop devm_add/remove_action for am65_cpsw_nuss_free_tx/rx_chns()
and call am65_cpsw_nuss_free_tx/rx_chns() and netif_napi_del()
where required.
Reported-by: Siddharth Vadapalli <s-vadapalli@ti.com> Closes: https://lore.kernel.org/all/m4rhkzcr7dlylxr54udyt6lal5s2q4krrvmyay6gzgzhcu4q2c@r34snfumzqxy/ Signed-off-by: Roger Quadros <rogerq@kernel.org> Link: https://patch.msgid.link/20250117-am65-cpsw-streamline-v2-1-91a29c97e569@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Furong Xu [Fri, 17 Jan 2025 06:28:05 +0000 (14:28 +0800)]
net: stmmac: Drop redundant skb_mark_for_recycle() for SKB frags
After commit df542f669307 ("net: stmmac: Switch to zero-copy in
non-XDP RX path"), SKBs are always marked for recycle, it is redundant
to mark SKBs more than once when new frags are appended.
Xiangqian Zhang [Fri, 17 Jan 2025 09:46:03 +0000 (17:46 +0800)]
net: mii: Fix the Speed display when the network cable is not connected
Two different models of usb card, the drivers are r8152 and asix. If no
network cable is connected, Speed = 10Mb/s. This problem is repeated in
linux 3.10, 4.19, 5.4, 6.12. This problem also exists on the latest
kernel. Both drivers call mii_ethtool_get_link_ksettings,
but the value of cmd->base.speed in this
function can only be SPEED_1000 or SPEED_100 or SPEED_10.
When the network cable is not connected, set cmd->base.speed
=SPEED_UNKNOWN.
Jakub Kicinski [Mon, 20 Jan 2025 19:59:25 +0000 (11:59 -0800)]
Merge tag 'nf-next-25-01-19' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following batch contains Netfilter updates for net-next:
1) Unbreak set size settings for rbtree set backend, intervals in
rbtree are represented as two elements, this detailed is leaked
to userspace leading to bogus ENOSPC from control plane.
2) Remove dead code in br_netfilter's br_nf_pre_routing_finish()
due to never matching error when looking up for route,
from Antoine Tenart.
3) Simplify check for device already in use in flowtable,
from Phil Sutter.
4) Three patches to restore interface name field in struct nft_hook
and use it, this is to prepare for wildcard interface support.
From Phil Sutter.
5) Do not remove netdev basechain when last device is gone, this is
for consistency with the flowtable behaviour. This allows for netdev
basechains without devices. Another patch to simplify netdev event
notifier after this update. Also from Phil.
6) Two patches to add missing spinlock when flowtable updates TCP
state flags, from Florian Westphal.
7) Simplify __nf_ct_refresh_acct() by removing skbuff parameter,
also from Florian.
8) Flowtable gc now extends ct timeout for offloaded flow. This
is to address a possible race that leads to handing over flow
to classic path with long ct timeouts.
9) Tear down flow if cached rt_mtu is stale, before this patch,
packet is handed over to classic path but flow entry still remained
in place.
10) Revisit the flowtable teardown strategy, which was originally
designed to release flowtable hardware entries early. Add a new
CLOSING flag that still allows hardware to release entries when
fin/rst is seen, but keeps the flow entry in place when the
TCP connection is closed. Release flow after timeout or when a new
syn packet is seen for TCP reopen scenario.
* tag 'nf-next-25-01-19' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: flowtable: add CLOSING state
netfilter: flowtable: teardown flow if cached mtu is stale
netfilter: conntrack: rework offload nf_conn timeout extension logic
netfilter: conntrack: remove skb argument from nf_ct_refresh
netfilter: nft_flow_offload: update tcp state flags under lock
netfilter: nft_flow_offload: clear tcp MAXACK flag before moving to slowpath
netfilter: nf_tables: Simplify chain netdev notifier
netfilter: nf_tables: Tolerate chains with no remaining hooks
netfilter: nf_tables: Compare netdev hooks based on stored name
netfilter: nf_tables: Use stored ifname in netdev hook dumps
netfilter: nf_tables: Store user-defined hook ifname
netfilter: nf_tables: Flowtable hook's pf value never varies
netfilter: br_netfilter: remove unused conditional and dead code
netfilter: nf_tables: fix set size with rbtree backend
====================
====================
net: ethtool: fixes for HDS threshold
Quick follow up on the HDS threshold work, since the merge window
is upon us.
Fix the bnxt implementation to apply the settings right away,
because we update the parameters _after_ configuring HW user
needed to reconfig the device twice to get the settings to stick.
For this I took the liberty of moving the config to a separate
struct. This follows my original thinking for the queue API.
It should also fit more neatly into how many drivers which
support safe config update operate. Drivers can allocate
new objects using the "pending" struct.
netdevsim:
KTAP version 1
1..7
ok 1 hds.get_hds
ok 2 hds.get_hds_thresh
ok 3 hds.set_hds_disable
ok 4 hds.set_hds_enable
ok 5 hds.set_hds_thresh_zero
ok 6 hds.set_hds_thresh_max
ok 7 hds.set_hds_thresh_gt
# Totals: pass:7 fail:0 xfail:0 xpass:0 skip:0 error:0
bnxt:
KTAP version 1
1..7
ok 1 hds.get_hds
ok 2 hds.get_hds_thresh
ok 3 hds.set_hds_disable # SKIP disabling of HDS not supported by the device
ok 4 hds.set_hds_enable
ok 5 hds.set_hds_thresh_zero
ok 6 hds.set_hds_thresh_max
ok 7 hds.set_hds_thresh_gt
# Totals: pass:6 fail:0 xfail:0 xpass:0 skip:1 error:0
Jakub Kicinski [Sun, 19 Jan 2025 02:05:17 +0000 (18:05 -0800)]
eth: bnxt: update header sizing defaults
300-400B RPC requests are fairly common. With the current default
of 256B HDS threshold bnxt ends up splitting those, lowering PCIe
bandwidth efficiency and increasing the number of memory allocation.
Increase the HDS threshold to fit 4 buffers in a 4k page.
This works out to 640B as the threshold on a typical kernel confing.
This change increases the performance for a microbenchmark which
receives 400B RPCs and sends empty responses by 4.5%.
Admittedly this is just a single benchmark, but 256B works out to
just 6 (so 2 more) packets per head page, because shinfo size
dominates the headers.
Now that we use page pool for the header pages I was also tempted
to default rx_copybreak to 0, but in synthetic testing the copybreak
size doesn't seem to make much difference.
Jakub Kicinski [Sun, 19 Jan 2025 02:05:15 +0000 (18:05 -0800)]
net: ethtool: populate the default HDS params in the core
The core has the current HDS config, it can pre-populate the values
for the drivers. While at it, remove the zero-setting in netdevsim.
Zero are the default values since the config is zalloc'ed.
Jakub Kicinski [Sun, 19 Jan 2025 02:05:14 +0000 (18:05 -0800)]
eth: bnxt: apply hds_thrs settings correctly
Use the pending config for hds_thrs. Core will only update the "current"
one after we return success. Without this change 2 reconfigs would be
required for the setting to reach the device.
Jakub Kicinski [Sun, 19 Jan 2025 02:05:13 +0000 (18:05 -0800)]
net: provide pending ring configuration in net_device
Record the pending configuration in net_device struct.
ethtool core duplicates the current config and the specific
handlers (for now just ringparam) can modify it.
Jakub Kicinski [Sun, 19 Jan 2025 02:05:11 +0000 (18:05 -0800)]
net: move HDS config from ethtool state
Separate the HDS config from the ethtool state struct.
The HDS config contains just simple parameters, not state.
Having it as a separate struct will make it easier to clone / copy
and also long term potentially make it per-queue.
To be precise, unix_release_sock() is also called for a new child
socket in unix_stream_connect() when something fails, but the new
sk does not have skb in the recv queue then and no event is logged.
Note that only tcp_inbound_ao_hash() uses a similar drop reason,
SKB_DROP_REASON_TCP_CLOSE, and this can be generalised later.
Aleksander Jan Bajkowski [Fri, 17 Jan 2025 22:24:21 +0000 (23:24 +0100)]
net: phy: realtek: HWMON support for standalone versions of RTL8221B and RTL8251
HWMON support has been added for the RTL8221/8251 PHYs integrated together
with the MAC inside the RTL8125/8126 chips. This patch extends temperature
reading support for standalone variants of the mentioned PHYs.
I don't know whether the earlier revisions of the RTL8226 also have a
built-in temperature sensor, so they have been skipped for now.
Tested on RTL8221B-VB-CG.
Signed-off-by: Aleksander Jan Bajkowski <olek2@wp.pl> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
I noticed that HyStart incorrectly marks the start of rounds,
leading to inaccurate measurements of ACK train lengths and
resetting the `ca->sample_cnt` variable. This inaccuracy can impact
HyStart's functionality in terminating exponential cwnd growth during
Slow-Start, potentially degrading TCP performance.
The issue arises because the changes introduced in commit 4e1fddc98d25
("tcp_cubic: fix spurious Hystart ACK train detections for not-cwnd-limited flows")
moved the caller of the `bictcp_hystart_reset` function inside the `hystart_update` function.
This modification added an additional condition for triggering the caller,
requiring that (tcp_snd_cwnd(tp) >= hystart_low_window) must also
be satisfied before invoking `bictcp_hystart_reset`.
This fix ensures that `bictcp_hystart_reset` is correctly called
at the start of a new round, regardless of the congestion window size.
This is achieved by moving the condition
(tcp_snd_cwnd(tp) >= hystart_low_window)
from before calling `bictcp_hystart_reset` to after it.
I tested with a client and a server connected through two Linux software routers.
In this setup, the minimum RTT was 150 ms, the bottleneck bandwidth was 50 Mbps,
and the bottleneck buffer size was 1 BDP, calculated as (50M / 1514 / 8) * 0.150 = 619 packets.
I conducted the test twice, transferring data from the server to the client for 1.5 seconds.
Before the patch was applied, HYSTART-DELAY stopped the exponential growth of cwnd when
cwnd = 516, and the bottleneck link was not yet saturated (516 < 619).
After the patch was applied, HYSTART-ACK-TRAIN stopped the exponential growth of cwnd when
cwnd = 632, and the bottleneck link was saturated (632 > 619).
In this test, applying the patch resulted in 300 KB more data delivered.
Fixes: 4e1fddc98d25 ("tcp_cubic: fix spurious Hystart ACK train detections for not-cwnd-limited flows") Signed-off-by: Mahdi Arghavani <ma.arghavani@yahoo.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Haibo Zhang <haibo.zhang@otago.ac.nz> Cc: David Eyers <david.eyers@otago.ac.nz> Cc: Abbas Arghavani <abbas.arghavani@mdu.se> Reviewed-by: Neal Cardwell <ncardwell@google.com> Tested-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Ales Nezbeda [Fri, 17 Jan 2025 11:22:28 +0000 (12:22 +0100)]
net: macsec: Add endianness annotations in salt struct
This change resolves warning produced by sparse tool as currently
there is a mismatch between normal generic type in salt and endian
annotated type in macsec driver code. Endian annotated types should
be used here.
Sparse output:
warning: restricted ssci_t degrades to integer
warning: incorrect type in assignment (different base types)
expected restricted ssci_t [usertype] ssci
got unsigned int
warning: restricted __be64 degrades to integer
warning: incorrect type in assignment (different base types)
expected restricted __be64 [usertype] pn
got unsigned long long
Signed-off-by: Ales Nezbeda <anezbeda@redhat.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Dan Carpenter [Fri, 17 Jan 2025 09:36:14 +0000 (12:36 +0300)]
tipc: re-order conditions in tipc_crypto_key_rcv()
On a 32bit system the "keylen + sizeof(struct tipc_aead_key)" math could
have an integer wrapping issue. It doesn't matter because the "keylen"
is checked on the next line, but just to make life easier for static
analysis tools, let's re-order these conditions and avoid the integer
overflow.
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Since adding negotiation of in-band capabilities, it is no longer
sufficient to just look at the MLO_AN_xxx mode and PHY interface to
decide whether to do a major configuration, since the result now
depends on the capabilities of the attaching PHY.
Always trigger a major configuration in this case.
Reported-by: Eric Woudstra <ericwouds@gmail.com> Tested-by: Eric Woudstra <ericwouds@gmail.com> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
aarp_send_probe_phase1() used to work by calling ndo_do_ioctl of
appletalk drivers ltpc or cops, but these two drivers have been removed
since the following commits:
commit 03dcb90dbf62 ("net: appletalk: remove Apple/Farallon LocalTalk PC
support")
commit 00f3696f7555 ("net: appletalk: remove cops support")
Thus aarp_send_probe_phase1() no longer works, so drop it. (found by
code inspection)
Signed-off-by: č°¢č“é¦ (XIE Zhibang) <Yeking@Red54.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Roger Quadros [Thu, 16 Jan 2025 13:54:49 +0000 (15:54 +0200)]
net: ethernet: ti: am65-cpsw: fix freeing IRQ in am65_cpsw_nuss_remove_tx_chns()
When getting the IRQ we use k3_udma_glue_tx_get_irq() which returns
negative error value on error. So not NULL check is not sufficient
to deteremine if IRQ is valid. Check that IRQ is greater then zero
to ensure it is valid.
There is no issue at probe time but at runtime user can invoke
.set_channels which results in the following call chain.
am65_cpsw_set_channels()
am65_cpsw_nuss_update_tx_rx_chns()
am65_cpsw_nuss_remove_tx_chns()
am65_cpsw_nuss_init_tx_chns()
At this point if am65_cpsw_nuss_init_tx_chns() fails due to
k3_udma_glue_tx_get_irq() then tx_chn->irq will be set to a
negative value.
Then, at subsequent .set_channels with higher channel count we
will attempt to free an invalid IRQ in am65_cpsw_nuss_remove_tx_chns()
leading to a kernel warning.
The issue is present in the original commit that introduced this driver,
although there, am65_cpsw_nuss_update_tx_rx_chns() existed as
am65_cpsw_nuss_update_tx_chns().
Fixes: 93a76530316a ("net: ethernet: ti: introduce am65x/j721e gigabit eth subsystem driver") Signed-off-by: Roger Quadros <rogerq@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Siddharth Vadapalli <s-vadapalli@ti.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Krzysztof Kozlowski [Wed, 15 Jan 2025 19:47:03 +0000 (20:47 +0100)]
dsa: Use str_enable_disable-like helpers
Replace ternary (condition ? "enable" : "disable") syntax with helpers
from string_choices.h because:
1. Simple function call with one argument is easier to read. Ternary
operator has three arguments and with wrapping might lead to quite
long code.
2. Is slightly shorter thus also easier to read.
3. It brings uniformity in the text - same string.
4. Allows deduping by the linker, which results in a smaller binary
file.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Wed, 15 Jan 2025 14:27:54 +0000 (09:27 -0500)]
net: sched: refine software bypass handling in tc_run
This patch addresses issues with filter counting in block (tcf_block),
particularly for software bypass scenarios, by introducing a more
accurate mechanism using useswcnt.
Previously, filtercnt and skipswcnt were introduced by:
filtercnt tracked all tp (tcf_proto) objects added to a block, and
skipswcnt counted tp objects with the skipsw attribute set.
The problem is: a single tp can contain multiple filters, some with skipsw
and others without. The current implementation fails in the case:
When the first filter in a tp has skipsw, both skipswcnt and filtercnt
are incremented, then adding a second filter without skipsw to the same
tp does not modify these counters because tp->counted is already set.
This results in bypass software behavior based solely on skipswcnt
equaling filtercnt, even when the block includes filters without
skipsw. Consequently, filters without skipsw are inadvertently bypassed.
To address this, the patch introduces useswcnt in block to explicitly count
tp objects containing at least one filter without skipsw. Key changes
include:
Whenever a filter without skipsw is added, its tp is marked with usesw
and counted in useswcnt. tc_run() now uses useswcnt to determine software
bypass, eliminating reliance on filtercnt and skipswcnt.
This refined approach prevents software bypass for blocks containing
mixed filters, ensuring correct behavior in tc_run().
Additionally, as atomic operations on useswcnt ensure thread safety and
tp->lock guards access to tp->usesw and tp->counted, the broader lock
down_write(&block->cb_lock) is no longer required in tc_new_tfilter(),
and this resolves a performance regression caused by the filter counting
mechanism during parallel filter insertions.
The improvement can be demonstrated using the following script:
# cat insert_tc_rules.sh
tc qdisc add dev ens1f0np0 ingress
for i in $(seq 16); do
taskset -c $i tc -b rules_$i.txt &
done
wait
Each of rules_$i.txt files above includes 100000 tc filter rules to a
mlx5 driver NIC ens1f0np0.
Without this patch:
# time sh insert_tc_rules.sh
real 0m50.780s
user 0m23.556s
sys 4m13.032s
With this patch:
# time sh insert_tc_rules.sh
real 0m17.718s
user 0m7.807s
sys 3m45.050s
Fixes: 047f340b36fc ("net: sched: make skip_sw actually skip software") Reported-by: Shuang Li <shuali@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Reviewed-by: AsbjĆørn Sloth TĆønnesen <ast@fiberby.net> Tested-by: AsbjĆørn Sloth TĆønnesen <ast@fiberby.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso [Mon, 13 Jan 2025 23:50:38 +0000 (00:50 +0100)]
netfilter: flowtable: add CLOSING state
tcp rst/fin packet triggers an immediate teardown of the flow which
results in sending flows back to the classic forwarding path.
This behaviour was introduced by:
da5984e51063 ("netfilter: nf_flow_table: add support for sending flows back to the slow path") b6f27d322a0a ("netfilter: nf_flow_table: tear down TCP flows if RST or FIN was seen")
whose goal is to expedite removal of flow entries from the hardware
table. Before these patches, the flow was released after the flow entry
timed out.
However, this approach leads to packet races when restoring the
conntrack state as well as late flow re-offload situations when the TCP
connection is ending.
This patch adds a new CLOSING state that is is entered when tcp rst/fin
packet is seen. This allows for an early removal of the flow entry from
the hardware table. But the flow entry still remains in software, so tcp
packets to shut down the flow are not sent back to slow path.
If syn packet is seen from this new CLOSING state, then this flow enters
teardown state, ct state is set to TCP_CONNTRACK_CLOSE state and packet
is sent to slow path, so this TCP reopen scenario can be handled by
conntrack. TCP_CONNTRACK_CLOSE provides a small timeout that aims at
quickly releasing this stale entry from the conntrack table.
Moreover, skip hardware re-offload from flowtable software packet if the
flow is in CLOSING state.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso [Mon, 13 Jan 2025 23:50:37 +0000 (00:50 +0100)]
netfilter: flowtable: teardown flow if cached mtu is stale
Tear down the flow entry in the unlikely case that the interface mtu
changes, this gives the flow a chance to refresh the cached mtu,
otherwise such refresh does not occur until flow entry expires.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Offload nf_conn entries may not see traffic for a very long time.
To prevent incorrect 'ct is stale' checks during nf_conntrack table
lookup, the gc worker extends the timeout nf_conn entries marked for
offload to a large value.
The existing logic suffers from a few problems.
Garbage collection runs without locks, its unlikely but possible
that @ct is removed right after the 'offload' bit test.
In that case, the timeout of a new/reallocated nf_conn entry will
be increased.
Prevent this by obtaining a reference count on the ct object and
re-check of the confirmed and offload bits.
If those are not set, the ct is being removed, skip the timeout
extension in this case.
Parallel teardown is also problematic:
cpu1 cpu2
gc_worker
calls flow_offload_teardown()
tests OFFLOAD bit, set
clear OFFLOAD bit
ct->timeout is repaired (e.g. set to timeout[UDP_CT_REPLIED])
nf_ct_offload_timeout() called
expire value is fetched
<INTERRUPT>
-> NF_CT_DAY timeout for flow that isn't offloaded
(and might not see any further packets).
Use cmpxchg: if ct->timeout was repaired after the 2nd 'offload bit' test
passed, then ct->timeout will only be updated of ct->timeout was not
altered in between.
As we already have a gc worker for flowtable entries, ct->timeout repair
can be handled from the flowtable gc worker.
This avoids having flowtable specific logic in the conntrack core
and avoids checking entries that were never offloaded.
This allows to remove the nf_ct_offload_timeout helper.
Its safe to use in the add case, but not on teardown.
Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Florian Westphal [Mon, 13 Jan 2025 23:50:35 +0000 (00:50 +0100)]
netfilter: conntrack: remove skb argument from nf_ct_refresh
Its not used (and could be NULL), so remove it.
This allows to use nf_ct_refresh in places where we don't have
an skb without having to double-check that skb == NULL would be safe.
Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Florian Westphal [Mon, 13 Jan 2025 23:50:34 +0000 (00:50 +0100)]
netfilter: nft_flow_offload: update tcp state flags under lock
The conntrack entry is already public, there is a small chance that another
CPU is handling a packet in reply direction and racing with the tcp state
update.
Move this under ct spinlock.
This is done once, when ct is about to be offloaded, so this should
not result in a noticeable performance hit.
Fixes: 8437a6209f76 ("netfilter: nft_flow_offload: set liberal tracking mode for tcp") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Florian Westphal [Mon, 13 Jan 2025 23:50:33 +0000 (00:50 +0100)]
netfilter: nft_flow_offload: clear tcp MAXACK flag before moving to slowpath
This state reset is racy, no locks are held here.
Since commit 8437a6209f76 ("netfilter: nft_flow_offload: set liberal tracking mode for tcp"),
the window checks are disabled for normal data packets, but MAXACK flag
is checked when validating TCP resets.
Clear the flag so tcp reset validation checks are ignored.
Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
With conditional chain deletion gone, callback code simplifies: Instead
of filling an nft_ctx object, just pass basechain to the per-chain
function. Also plain list_for_each_entry() is safe now.
Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Phil Sutter [Thu, 9 Jan 2025 17:31:36 +0000 (18:31 +0100)]
netfilter: nf_tables: Tolerate chains with no remaining hooks
Do not drop a netdev-family chain if the last interface it is registered
for vanishes. Users dumping and storing the ruleset upon shutdown to
restore it upon next boot may otherwise lose the chain and all contained
rules. They will still lose the list of devices, a later patch will fix
that. For now, this aligns the event handler's behaviour with that for
flowtables.
The controversal situation at netns exit should be no problem here:
event handler will unregister the hooks, core nftables cleanup code will
drop the chain itself.
Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Phil Sutter [Thu, 9 Jan 2025 17:31:34 +0000 (18:31 +0100)]
netfilter: nf_tables: Use stored ifname in netdev hook dumps
The stored ifname and ops.dev->name may deviate after creation due to
interface name changes. Prefer the more deterministic stored name in
dumps which also helps avoiding inadvertent changes to stored ruleset
dumps.
Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Phil Sutter [Thu, 9 Jan 2025 17:31:33 +0000 (18:31 +0100)]
netfilter: nf_tables: Store user-defined hook ifname
Prepare for hooks with NULL ops.dev pointer (due to non-existent device)
and store the interface name and length as specified by the user upon
creation. No functional change intended.
Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Phil Sutter [Thu, 9 Jan 2025 17:31:32 +0000 (18:31 +0100)]
netfilter: nf_tables: Flowtable hook's pf value never varies
When checking for duplicate hooks in nft_register_flowtable_net_hooks(),
comparing ops.pf value is pointless as it is always NFPROTO_NETDEV with
flowtable hooks.
Dropping the check leaves the search identical to the one in
nft_hook_list_find() so call that function instead of open coding.
Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Antoine Tenart [Thu, 9 Jan 2025 09:37:09 +0000 (10:37 +0100)]
netfilter: br_netfilter: remove unused conditional and dead code
The SKB_DROP_REASON_IP_INADDRERRORS drop reason is never returned from
any function, as such it cannot be returned from the ip_route_input call
tree. The 'reason != SKB_DROP_REASON_IP_INADDRERRORS' conditional is
thus always true.
Looking back at history, commit 50038bf38e65 ("net: ip: make
ip_route_input() return drop reasons") changed the ip_route_input
returned value check in br_nf_pre_routing_finish from -EHOSTUNREACH to
SKB_DROP_REASON_IP_INADDRERRORS. It turns out -EHOSTUNREACH could not be
returned either from the ip_route_input call tree and this since commit 251da4130115 ("ipv4: Cache ip_error() routes even when not
forwarding.").
Not a fix as this won't change the behavior. While at it use
kfree_skb_reason.
Signed-off-by: Antoine Tenart <atenart@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso [Mon, 6 Jan 2025 22:40:50 +0000 (23:40 +0100)]
netfilter: nf_tables: fix set size with rbtree backend
The existing rbtree implementation uses singleton elements to represent
ranges, however, userspace provides a set size according to the number
of ranges in the set.
Adjust provided userspace set size to the number of singleton elements
in the kernel by multiplying the range by two.
Check if the no-match all-zero element is already in the set, in such
case release one slot in the set size.
Fixes: 0ed6389c483d ("netfilter: nf_tables: rename set implementations") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
- Map VID 0 to untagged TT VLAN, by Sven Eckelmann
- Update MAINTAINERS/mailmap e-mail addresses, by the respective authors
(4 patches)
- netlink: reduce duplicate code by returning interfaces,
by Linus Lüssing
* tag 'batadv-next-pullrequest-20250117' of git://git.open-mesh.org/linux-merge:
batman-adv: netlink: reduce duplicate code by returning interfaces
MAINTAINERS: mailmap: add entries for Antonio Quartulli
mailmap: add entries for Sven Eckelmann
mailmap: add entries for Simon Wunderlich
MAINTAINERS: update email address of Marek Linder
batman-adv: Map VID 0 to untagged TT VLAN
batman-adv: Don't keep redundant TT change events
batman-adv: Remove atomic usage for tt.local_changes
batman-adv: Reorder includes for distributed-arp-table.c
batman-adv: Start new development cycle
====================
Jakub Kicinski [Sun, 19 Jan 2025 01:51:57 +0000 (17:51 -0800)]
Merge tag 'for-net-next-2025-01-15' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Luiz Augusto von Dentz says:
====================
bluetooth-next pull request for net-next:
- btusb: Add new VID/PID 13d3/3610 for MT7922
- btusb: Add new VID/PID 13d3/3628 for MT7925
- btusb: Add MT7921e device 13d3:3576
- btusb: Add RTL8851BE device 13d3:3600
- btusb: Add ID 0x2c7c:0x0130 for Qualcomm WCN785x
- btusb: add sysfs attribute to control USB alt setting
- qca: Expand firmware-name property
- qca: Fix poor RF performance for WCN6855
- L2CAP: handle NULL sock pointer in l2cap_sock_alloc
- Allow reset via sysfs
- ISO: Allow BIG re-sync
- dt-bindings: Utilize PMU abstraction for WCN6750
- MGMT: Mark LL Privacy as stable
* tag 'for-net-next-2025-01-15' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next: (23 commits)
Bluetooth: MGMT: Fix slab-use-after-free Read in mgmt_remove_adv_monitor_sync
Bluetooth: qca: Fix poor RF performance for WCN6855
Bluetooth: Allow reset via sysfs
Bluetooth: Get rid of cmd_timeout and use the reset callback
Bluetooth: Remove the cmd timeout count in btusb
Bluetooth: Use str_enable_disable-like helpers
Bluetooth: btmtk: Remove resetting mt7921 before downloading the fw
Bluetooth: L2CAP: handle NULL sock pointer in l2cap_sock_alloc
Bluetooth: btusb: Add RTL8851BE device 13d3:3600
dt-bindings: bluetooth: Utilize PMU abstraction for WCN6750
Bluetooth: btusb: Add MT7921e device 13d3:3576
Bluetooth: btrtl: check for NULL in btrtl_setup_realtek()
Bluetooth: btbcm: Fix NULL deref in btbcm_get_board_name()
Bluetooth: qca: Expand firmware-name to load specific rampatch
Bluetooth: qca: Update firmware-name to support board specific nvm
dt-bindings: net: bluetooth: qca: Expand firmware-name property
Bluetooth: btusb: Add new VID/PID 13d3/3628 for MT7925
Bluetooth: btusb: Add new VID/PID 13d3/3610 for MT7922
Bluetooth: btusb: add sysfs attribute to control USB alt setting
Bluetooth: btusb: Add ID 0x2c7c:0x0130 for Qualcomm WCN785x
...
====================
Jakub Kicinski [Sun, 19 Jan 2025 01:46:53 +0000 (17:46 -0800)]
Merge tag 'wireless-next-2025-01-17' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next
Kalle Valo says:
====================
wireless-next patches for v6.14
Most likely the last "new features" pull request for v6.14 and this is
a bigger one. Multi-Link Operation (MLO) work continues both in stack
in drivers. Few new devices supported and usual fixes all over.
Major changes:
cfg80211
* Emergency Preparedness Communication Services (EPCS) station mode support
mac80211
* an option to filter a sta from being flushed
* some support for RX Operating Mode Indication (OMI) power saving
* support for adding and removing station links for MLO
iwlwifi
* new device ids
* rework firmware error handling and restart
rtw88
* RTL8812A: RFE type 2 support
* LED support
rtw89
* variant info to support RTL8922AE-VS
mt76
* mt7996: single wiphy multiband support (preparation for MLO)
* mt7996: support for more variants
* mt792x: P2P_DEVICE support
* mt7921u: TP-Link TXE50UH support
ath12k
* enable MLO for QCN9274 (although it seems to be broken with dual
band devices)
* MLO radar detection support
* debugfs: transmit buffer OFDMA, AST entry and puncture stats
* tag 'wireless-next-2025-01-17' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (322 commits)
wifi: brcmfmac: fix NULL pointer dereference in brcmf_txfinalize()
wifi: rtw88: add RTW88_LEDS depends on LEDS_CLASS to Kconfig
wifi: wilc1000: unregister wiphy only after netdev registration
wifi: cfg80211: adjust allocation of colocated AP data
wifi: mac80211: fix memory leak in ieee80211_mgd_assoc_ml_reconf()
wifi: ath12k: fix key cache handling
wifi: ath12k: Fix uninitialized variable access in ath12k_mac_allocate() function
wifi: ath12k: Remove ath12k_get_num_hw() helper function
wifi: ath12k: Refactor the ath12k_hw get helper function argument
wifi: ath12k: Refactor ath12k_hw set helper function argument
wifi: mt76: mt7996: add implicit beamforming support for mt7992
wifi: mt76: mt7996: fix beacon command during disabling
wifi: mt76: mt7996: fix ldpc setting
wifi: mt76: mt7996: fix definition of tx descriptor
wifi: mt76: connac: adjust phy capabilities based on band constraints
wifi: mt76: mt7996: fix incorrect indexing of MIB FW event
wifi: mt76: mt7996: fix HE Phy capability
wifi: mt76: mt7996: fix the capability of reception of EHT MU PPDU
wifi: mt76: mt7996: add max mpdu len capability
wifi: mt76: mt7921: avoid undesired changes of the preset regulatory domain
...
====================
Eric Dumazet [Fri, 17 Jan 2025 23:21:13 +0000 (23:21 +0000)]
net: introduce netdev_napi_exit()
After 1b23cdbd2bbc ("net: protect netdev->napi_list with netdev_lock()")
it makes sense to iterate through dev->napi_list while holding
the device lock.
Heiner Kallweit [Thu, 16 Jan 2025 21:38:40 +0000 (22:38 +0100)]
net: phy: remove leftovers from switch to linkmode bitmaps
We have some leftovers from the switch to linkmode bitmaps which
- have never been used
- are not used any longer
- have no user outside phy_device.c
So remove them.
Jamal Hadi Salim [Thu, 16 Jan 2025 01:37:13 +0000 (17:37 -0800)]
net: sched: Disallow replacing of child qdisc from one parent to another
Lion Ackermann was able to create a UAF which can be abused for privilege
escalation with the following script
Step 1. create root qdisc
tc qdisc add dev lo root handle 1:0 drr
step2. a class for packet aggregation do demonstrate uaf
tc class add dev lo classid 1:1 drr
step3. a class for nesting
tc class add dev lo classid 1:2 drr
step4. a class to graft qdisc to
tc class add dev lo classid 1:3 drr
step5.
tc qdisc add dev lo parent 1:1 handle 2:0 plug limit 1024
step6.
tc qdisc add dev lo parent 1:2 handle 3:0 drr
step7.
tc class add dev lo classid 3:1 drr
step 8.
tc qdisc add dev lo parent 3:1 handle 4:0 pfifo
step 9. Display the class/qdisc layout
tc class ls dev lo
class drr 1:1 root leaf 2: quantum 64Kb
class drr 1:2 root leaf 3: quantum 64Kb
class drr 3:1 root leaf 4: quantum 64Kb
tc qdisc ls
qdisc drr 1: dev lo root refcnt 2
qdisc plug 2: dev lo parent 1:1
qdisc pfifo 4: dev lo parent 3:1 limit 1000p
qdisc drr 3: dev lo parent 1:2
step10. trigger the bug <=== prevented by this patch
tc qdisc replace dev lo parent 1:3 handle 4:0
step 11. Redisplay again the qdiscs/classes
tc class ls dev lo
class drr 1:1 root leaf 2: quantum 64Kb
class drr 1:2 root leaf 3: quantum 64Kb
class drr 1:3 root leaf 4: quantum 64Kb
class drr 3:1 root leaf 4: quantum 64Kb
tc qdisc ls
qdisc drr 1: dev lo root refcnt 2
qdisc plug 2: dev lo parent 1:1
qdisc pfifo 4: dev lo parent 3:1 refcnt 2 limit 1000p
qdisc drr 3: dev lo parent 1:2
Observe that a) parent for 4:0 does not change despite the replace request.
There can only be one parent. b) refcount has gone up by two for 4:0 and
c) both class 1:3 and 3:1 are pointing to it.
Step 12. send one packet to plug
echo "" | socat -u STDIN UDP4-DATAGRAM:127.0.0.1:8888,priority=$((0x10001))
step13. send one packet to the grafted fifo
echo "" | socat -u STDIN UDP4-DATAGRAM:127.0.0.1:8888,priority=$((0x10003))
step14. lets trigger the uaf
tc class delete dev lo classid 1:3
tc class delete dev lo classid 1:1
The semantics of "replace" is for a del/add _on the same node_ and not
a delete from one node(3:1) and add to another node (1:3) as in step10.
While we could "fix" with a more complex approach there could be
consequences to expectations so the patch takes the preventive approach of
"disallow such config".
Joint work with Lion Ackermann <nnamrec@gmail.com> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250116013713.900000-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This is because unregister_netdevice_many_notify might run before the
rtnl lock section of ethnl operations, eg. set_channels in the above
example. In this example the rss lock would be destroyed by the device
unregistration path before being used again, but in general running
ethnl operations while dismantle has started is not a good idea.
Fix this by denying any operation on devices being unregistered. A check
was already there in ethnl_ops_begin, but not wide enough.
Note that the same issue cannot be seen on the ioctl version
(__dev_ethtool) because the device reference is retrieved from within
the rtnl lock section there. Once dismantle started, the net device is
unlisted and no reference will be found.
Fixes: dde91ccfa25f ("ethtool: do not perform operations on net devices being unregistered") Signed-off-by: Antoine Tenart <atenart@kernel.org> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Edward Cree <ecree.xilinx@gmail.com> Link: https://patch.msgid.link/20250116092159.50890-1-atenart@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Fri, 17 Jan 2025 18:37:26 +0000 (10:37 -0800)]
eth: bnxt: fix string truncation warning in FW version
W=1 builds with gcc 14.2.1 report:
drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c:4193:32: error: ā%sā directive output may be truncated writing up to 31 bytes into a region of size 27 [-Werror=format-truncation=]
4193 | "/pkg %s", buf);
It's upset that we let buf be full length but then we use 5
characters for "/pkg ".
The builds is also clear with clang version 19.1.5 now.
The number of SYN + MPC retransmissions before falling back to TCP was
fixed to 2. This is certainly a good default value, but having a fixed
number can be a problem in some environments.
The current behaviour means that if all packets are dropped, there will
be:
- The initial SYN + MPC
- 2 retransmissions with MPC
- The next ones will be without MPTCP.
So typically ~3 seconds before falling back to TCP. In some networks
where some temporally blackholes are unfortunately frequent, or when a
client tries to initiate connections while the network is not ready yet,
this can cause new connections not to have MPTCP connections.
In such environments, it is now possible to increase the number of SYN
retransmissions with MPTCP options to make sure MPTCP is used.
Interesting values are:
- 0: the first retransmission will be done without MPTCP options: quite
aggressive, but also a higher risk of detecting false-positive
MPTCP blackholes.
- >= 128: all SYN retransmissions will keep the MPTCP options: back to
the < 6.12 behaviour.
====================
Fix race conditions in ndo_get_stats64
Fix race conditions in ndo_get_stats64 by storing tx/rx stats
locally and not availing per queue resources which could be torn
down during interface stop. Also remove stats fetch from
firmware which is currently unnecessary
====================
Shinas Rasheed [Fri, 17 Jan 2025 09:46:53 +0000 (01:46 -0800)]
octeon_ep_vf: update tx/rx stats locally for persistence
Update tx/rx stats locally, so that ndo_get_stats64()
can use that and not rely on per queue resources to obtain statistics.
The latter used to cause race conditions when the device stopped.
Shinas Rasheed [Fri, 17 Jan 2025 09:46:51 +0000 (01:46 -0800)]
octeon_ep: update tx/rx stats locally for persistence
Update tx/rx stats locally, so that ndo_get_stats64()
can use that and not rely on per queue resources to obtain statistics.
The latter used to cause race conditions when the device stopped.
Sean Anderson [Thu, 16 Jan 2025 23:29:50 +0000 (18:29 -0500)]
net: xilinx: axienet: Report an error for bad coalesce settings
Instead of silently ignoring invalid/unsupported settings, report an
error. Additionally, relax the check for non-zero usecs to apply only
when it will be used (i.e. when frames != 1).
Andy Shevchenko [Thu, 16 Jan 2025 15:27:28 +0000 (17:27 +0200)]
nfc: st21nfca: Remove unused of_gpio.h
of_gpio.h is deprecated and subject to remove. The drivers in question
don't use it, simply remove the unused header.
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
ethtool get_ts_stats() for DSA and ocelot driver
After a recent patch set with fixes and general restructuring, Jakub asked
for the Felix DSA driver to start reporting standardized statistics
for hardware timestamping:
https://lore.kernel.org/netdev/20241207180640.12da60ed@kernel.org/
Testing follows the same procedure as in the aforementioned series, with PTP
packet loss induced through taprio:
Vladimir Oltean [Thu, 16 Jan 2025 10:46:27 +0000 (12:46 +0200)]
net: mscc: ocelot: add TX timestamping statistics
Add an u64 hardware timestamping statistics structure for each ocelot
port. Export a function from the common switch library for reporting
them to ethtool. This is called by the ocelot switchdev front-end for
now.
Note that for the switchdev driver, we report the one-step PTP packets
as unconfirmed, even though in principle, for some transmission
mechanisms like FDMA, we may be able to confirm transmission and bump
the "pkts" counter in ocelot_fdma_tx_cleanup() instead. I don't have
access to hardware which uses the switchdev front-end, and I've kept the
implementation simple.
Vladimir Oltean [Thu, 16 Jan 2025 10:46:25 +0000 (12:46 +0200)]
net: ethtool: ts: add separate counter for unconfirmed one-step TX timestamps
For packets with two-step timestamp requests, the hardware timestamp
comes back to the driver through a confirmation mechanism of sorts,
which allows the driver to confidently bump the successful "pkts"
counter.
For one-step PTP, the NIC is supposed to autonomously insert its
hardware TX timestamp in the packet headers while simultaneously
transmitting it. There may be a confirmation that this was done
successfully, or there may not.
None of the current drivers which implement ethtool_ops :: get_ts_stats()
also support HWTSTAMP_TX_ONESTEP_SYNC or HWTSTAMP_TX_ONESTEP_SYNC, so it
is a bit unclear which model to follow. But there are NICs, such as DSA,
where there is no transmit confirmation at all. Here, it would be wrong /
misleading to increment the successful "pkts" counter, because one-step
PTP packets can be dropped on TX just like any other packets.
So introduce a special counter which signifies "yes, an attempt was made,
but we don't know whether it also exited the port or not". I expect that
for one-step PTP packets where a confirmation is available, the "pkts"
counter would be bumped.
====================
mlxsw: Move Tx header handling to PCI driver
Amit Cohen writes:
Tx header should be added to all packets transmitted from the CPU to
Spectrum ASICs. Historically, handling this header was added as a driver
function, as Tx header is different between Spectrum and Switch-X.
From May 2021, there is no support for SwitchX-2 ASIC, and all the relevant
code was removed.
For now, there is no justification to handle Tx header as part of
spectrum.c, we can handle this as part of PCI, in skb_transmit().
This change will also be useful when XDP support will be added to mlxsw,
as for XDP_TX and XDP_REDIRECT actions, Tx header should be added before
transmitting the packet.
Patch set overview:
Patches #1-#2 add structure to store Tx header info and initialize it
Patch #3 moves definitions of Tx header fields to txheader.h
Patch #4 moves Tx header handling to PCI driver
Patch #5 removes unnecessary attribute
====================
Amit Cohen [Thu, 16 Jan 2025 16:38:18 +0000 (17:38 +0100)]
mlxsw: Do not store Tx header length as driver parameter
Tx header handling was moved to PCI code, as there is no several drivers
which configure Tx header differently. Tx header length is stored as driver
parameter, this is not really necessary as it always stores the same value.
Remove this field and use the macro MLXSW_TXHDR_LEN explicitly.
Amit Cohen [Thu, 16 Jan 2025 16:38:17 +0000 (17:38 +0100)]
mlxsw: Move Tx header handling to PCI driver
Tx header should be added to all packets transmitted from the CPU to
Spectrum ASICs. Historically, handling this header was added as a driver
function, as Tx header is different between Spectrum and Switch-X. See
SwitchX implementation in commit 31557f0f9755 ("mlxsw: Introduce
Mellanox SwitchX-2 ASIC support"). From May 2021, there is no support
for SwitchX-2 ASIC, and all the relevant code was removed.
For now, there is no justification to handle Tx header as part of
spectrum.c, we can handle this as part of PCI, in skb_transmit().
A future patch set will add support for XDP in mlxsw driver, to support
XDP_TX and XDP_REDIRECT actions, Tx header should be added before
transmitting the packet. As preparation for this, move Tx header handling
to PCI driver, so then XDP code will not have to call API from spectrum.c.
This also improves the code as now Tx header is pushed just before
transmitting, so it is not done from many flows which might miss something.
Note that for PTP, we should configure Tx header differently, use the
fields from mlxsw_txhdr_info to configure the packets correctly in PCI
driver. Handle VLAN tagging in switch driver, verify that packet which
should be transmitted as data is tagged, otherwise, tag it.
Remove the calls for thxdr_construct() functions, as now this is done as
part of skb_transmit().
Amit Cohen [Thu, 16 Jan 2025 16:38:16 +0000 (17:38 +0100)]
mlxsw: Define Tx header fields in txheader.h
The next patch will move Tx header constructing to pci.c. As preparation,
move the definitions of Tx header fields from spectrum.c to txheader.h,
so pci.c will include this header and can access the fields.
Amit Cohen [Thu, 16 Jan 2025 16:38:15 +0000 (17:38 +0100)]
mlxsw: Initialize txhdr_info according to PTP operations
A next patch will construct Tx header as part of pci.c. The switch driver
(mlxsw_spectrum.ko) should encapsulate all the differences between the
different ASICs and the bus driver (mlxsw_pci.ko) should remain unaware.
As preparation, add the relevant info as part of mlxsw_txhdr_info
structure, so later bus driver will merely construct the Tx header based on
information passed from the switch driver.
Most of the packets are transmitted as control packets, but PTP packets in
Spectrum-2 and Spectrum-3 should be handled differently. The driver
transmits them as data packets, and the default VLAN tag (4095) is added if
the packet is not already tagged.
Extend PTP operations to store a boolean which indicates whether packets
should be transmitted as data packets. Set it for Spectrum-2 and
Spectrum-3 only. Extend mlxsw_txhdr_info to store fields which will be used
later to construct Tx header. Initialize such fields according to the new
boolean which is stored in PTP operations.
Note that for now, mlxsw_txhdr_info structure is initialized, but not used,
a next patch will use it.
Amit Cohen [Thu, 16 Jan 2025 16:38:14 +0000 (17:38 +0100)]
mlxsw: Add mlxsw_txhdr_info structure
mlxsw_tx_info structure is used to store information that is needed to
process Tx completions when Tx time stamps are requested. A next patch
will move Tx header handling from spectrum.c to pci.c. For that, some
additional fields which are related to Tx should be passed to pci driver.
As preparation, create an extended structure, called mlxsw_txhdr_info,
and store mlxsw_tx_info inside. The new fields should not be added to
mlxsw_tx_info structure as it is stored in the SKB control block which is
of limited size.
The next patch will extend the new structure with some fields which are
needed in order to construct Tx header.
Colin Ian King [Thu, 16 Jan 2025 18:17:00 +0000 (18:17 +0000)]
net/mlx5: fix unintentional sign extension on shift of dest_attr->vport.vhca_id
Shifting dest_attr->vport.vhca_id << 16 results in a promotion from an
unsigned 16 bit integer to a 32 bit signed integer, this is then sign
extended to a 64 bit unsigned long on 64 bitarchitectures. If vhca_id is
greater than 0x7fff then this leads to a sign extended result where all
the upper 32 bits of idx are set to 1. Fix this by casting vhca_id
to the same type as idx before performing the shift.
Fixes: 8e2e08a6d1e0 ("net/mlx5: fs, add support for dest vport HWS action") Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Acked-by: Moshe Shemesh <moshe@nvidia.com> Link: https://patch.msgid.link/20250116181700.96437-1-colin.i.king@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
John Ousterhout [Thu, 16 Jan 2025 19:56:41 +0000 (11:56 -0800)]
net: tc: improve qdisc error messages
The existing error message ("Invalid qdisc name") is confusing
because it suggests that there is no qdisc with the given name. In
fact, the name does refer to a valid qdisc, but it doesn't match
the kind of an existing qdisc being modified or replaced. The
new error message provides more detail to eliminate confusion.