Abin Joseph [Wed, 9 Oct 2024 16:28:21 +0000 (21:58 +0530)]
dt-bindings: net: emaclite: Add clock support
Add s_axi_aclk AXI4 clock support. Traditionally this IP was used on
microblaze platforms which had fixed clocks enabled all the time. But
since its a PL IP, it can also be used on SoC platforms like Zynq
UltraScale+ MPSoC which combines processing system (PS) and user
programmable logic (PL) into the same device. On these platforms instead
of fixed enabled clocks it is mandatory to explicitly enable IP clocks
for proper functionality.
So make clock a required property and also define max supported clock
constraints.
Eric Dumazet [Thu, 10 Oct 2024 03:41:00 +0000 (03:41 +0000)]
tcp: move sysctl_tcp_l3mdev_accept to netns_ipv4_read_rx
sysctl_tcp_l3mdev_accept is read from TCP receive fast path from
tcp_v6_early_demux(),
__inet6_lookup_established,
inet_request_bound_dev_if().
Move it to netns_ipv4_read_rx.
Remove the '#ifdef CONFIG_NET_L3_MASTER_DEV' that was guarding
its definition.
Note this adds a hole of three bytes that could be filled later.
Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Cc: Wei Wang <weiwan@google.com> Cc: Coco Li <lixiaoyan@google.com> Link: https://patch.msgid.link/20241010034100.320832-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Aryan Srivastava [Thu, 10 Oct 2024 00:49:34 +0000 (13:49 +1300)]
net: phy: aquantia: poll status register
The system interface connection status register is not immediately
correct upon line side link up. This results in the status being read as
OFF and then transitioning to the correct host side link mode with a
short delay. This causes the phylink framework passing the OFF status
down to all MAC config drivers, resulting in the host side link being
misconfigured, which in turn can lead to link flapping or complete
packet loss in some cases.
Mitigate this by periodically polling the register until it not showing
the OFF state. This will be done every 1ms for 10ms, using the same
poll/timeout as the processor intensive operation reads.
If the phy is still expressing the OFF state after the timeout, then set
the link to false and pass the NA interface mode onto the phylink
framework.
Jakub Kicinski [Tue, 8 Oct 2024 15:48:24 +0000 (08:48 -0700)]
eth: remove the DLink/Sundance (ST201) driver
Konstantin reports the maintainer's address bounces.
There is no other maintainer and the driver is quite old.
There is a good chance nobody is using this driver any more.
Let's try to remove it completely, we can revert it back in
if someone complains.
Jakub Kicinski [Fri, 11 Oct 2024 01:40:34 +0000 (18:40 -0700)]
Merge branch 'tg3-link-irqs-napis-and-queues'
Joe Damato says:
====================
tg3: Link IRQs, NAPIs, and queues
This follows from a previous RFC (wherein I botched the subject lines of
all the messages) [1].
I've taken Michael Chan's suggestion on modifying patch 2 and I've
updated the commit messages of both patches to test and show the output
for the default 1 TX 4 RX queues and the 4 TX and 4 RX queues cases.
Reviewers: please check the commit messages carefully to ensure the
output is correct (or on your own systems to verify, if you like). I am
not a tg3 expert and it's possible that I got something wrong.
Examine /proc/interrupts once again, noting that tg3 will now rename the
IRQs to suggest that they are combined tx and rx without allocating
additional IRQs, so the total IRQ count in /proc/interrupts is
unchanged:
Heiner Kallweit [Wed, 9 Oct 2024 05:48:05 +0000 (07:48 +0200)]
r8169: remove original workaround for RTL8125 broken rx issue
Now that we have b9c7ac4fe22c ("r8169: disable ALDPS per default for
RTL8125"), the first attempt to fix the issue shouldn't be needed
any longer. So let's effectively revert 621735f59064 ("r8169: fix
rare issue with broken rx after link-down on RTL8125") and see
whether anybody complains.
Linus Torvalds [Thu, 10 Oct 2024 19:36:35 +0000 (12:36 -0700)]
Merge tag 'net-6.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from bluetooth and netfilter.
Current release - regressions:
- dsa: sja1105: fix reception from VLAN-unaware bridges
- Revert "net: stmmac: set PP_FLAG_DMA_SYNC_DEV only if XDP is
enabled"
- eth: fec: don't save PTP state if PTP is unsupported
Current release - new code bugs:
- smc: fix lack of icsk_syn_mss with IPPROTO_SMC, prevent null-deref
- eth: airoha: update Tx CPU DMA ring idx at the end of xmit loop
- phy: aquantia: AQR115c fix up PMA capabilities
Previous releases - regressions:
- tcp: 3 fixes for retrans_stamp and undo logic
Previous releases - always broken:
- net: do not delay dst_entries_add() in dst_release()
- netfilter: restrict xtables extensions to families that are safe,
syzbot found a way to combine ebtables with extensions that are
never used by userspace tools
- sctp: ensure sk_state is set to CLOSED if hashing fails in
sctp_listen_start
- mptcp: handle consistently DSS corruption, and prevent corruption
due to large pmtu xmit"
* tag 'net-6.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (87 commits)
MAINTAINERS: Add headers and mailing list to UDP section
MAINTAINERS: consistently exclude wireless files from NETWORKING [GENERAL]
slip: make slhc_remember() more robust against malicious packets
net/smc: fix lacks of icsk_syn_mss with IPPROTO_SMC
ppp: fix ppp_async_encode() illegal access
docs: netdev: document guidance on cleanup patches
phonet: Handle error of rtnl_register_module().
mpls: Handle error of rtnl_register_module().
mctp: Handle error of rtnl_register_module().
bridge: Handle error of rtnl_register_module().
vxlan: Handle error of rtnl_register_module().
rtnetlink: Add bulk registration helpers for rtnetlink message handlers.
net: do not delay dst_entries_add() in dst_release()
mptcp: pm: do not remove closing subflows
mptcp: fallback when MPTCP opts are dropped after 1st data
tcp: fix mptcp DSS corruption due to large pmtu xmit
mptcp: handle consistently DSS corruption
net: netconsole: fix wrong warning
net: dsa: refuse cross-chip mirroring operations
net: fec: don't save PTP state if PTP is unsupported
...
Linus Torvalds [Thu, 10 Oct 2024 19:25:32 +0000 (12:25 -0700)]
Merge tag 'trace-ringbuffer-v6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fix from Steven Rostedt:
"Ring-buffer fix: do not have boot-mapped buffers use CPU hotplug
callbacks
When a ring buffer is mapped to memory assigned at boot, it also
splits it up evenly between the possible CPUs. But the allocation code
still attached a CPU notifier callback to this ring buffer. When a CPU
is added, the callback will happen and another per-cpu buffer is
created for the ring buffer.
But for boot mapped buffers, there is no room to add another one (as
they were all created already). The result of calling the CPU hotplug
notifier on a boot mapped ring buffer is unpredictable and could lead
to a system crash.
If the ring buffer is boot mapped simply do not attach the CPU
notifier to it"
* tag 'trace-ringbuffer-v6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
ring-buffer: Do not have boot mapped buffers hook to CPU hotplug
Linus Torvalds [Thu, 10 Oct 2024 17:02:59 +0000 (10:02 -0700)]
Merge tag 'for-6.12-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- update fstrim loop and add more cancellation points, fix reported
delayed or blocked suspend if there's a huge chunk queued
- fix error handling in recent qgroup xarray conversion
- in zoned mode, fix warning printing device path without RCU
protection
- again fix invalid extent xarray state (6252690f7e1b), lost due to
refactoring
* tag 'for-6.12-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix clear_dirty and writeback ordering in submit_one_sector()
btrfs: zoned: fix missing RCU locking in error message when loading zone info
btrfs: fix missing error handling when adding delayed ref with qgroups enabled
btrfs: add cancellation points to trim loops
btrfs: split remaining space to discard in chunks
Linus Torvalds [Thu, 10 Oct 2024 16:52:49 +0000 (09:52 -0700)]
Merge tag 'nfsd-6.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fixes from Chuck Lever:
- Fix NFSD bring-up / shutdown
- Fix a UAF when releasing a stateid
* tag 'nfsd-6.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
nfsd: fix possible badness in FREE_STATEID
nfsd: nfsd_destroy_serv() must call svc_destroy() even if nfsd_startup_net() failed
NFSD: Mark filecache "down" if init fails
Linus Torvalds [Thu, 10 Oct 2024 16:45:45 +0000 (09:45 -0700)]
Merge tag 'xfs-6.12-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Carlos Maiolino:
- A few small typo fixes
- fstests xfs/538 DEBUG-only fix
- Performance fix on blockgc on COW'ed files, by skipping trims on
cowblock inodes currently opened for write
- Prevent cowblocks to be freed under dirty pagecache during unshare
- Update MAINTAINERS file to quote the new maintainer
* tag 'xfs-6.12-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: fix a typo
xfs: don't free cowblocks from under dirty pagecache on unshare
xfs: skip background cowblock trims on inodes open for write
xfs: support lowmode allocations in xfs_bmap_exact_minlen_extent_alloc
xfs: call xfs_bmap_exact_minlen_extent_alloc from xfs_bmap_btalloc
xfs: don't ifdef around the exact minlen allocations
xfs: fold xfs_bmap_alloc_userdata into xfs_bmapi_allocate
xfs: distinguish extra split from real ENOSPC from xfs_attr_node_try_addname
xfs: distinguish extra split from real ENOSPC from xfs_attr3_leaf_split
xfs: return bool from xfs_attr3_leaf_add
xfs: merge xfs_attr_leaf_try_add into xfs_attr_leaf_addname
xfs: Use try_cmpxchg() in xlog_cil_insert_pcp_aggregate()
xfs: scrub: convert comma to semicolon
xfs: Remove empty declartion in header file
MAINTAINERS: add Carlos Maiolino as XFS release manager
The aim of this proposal is to make the handling of some files,
related to Networking and Wireless, more consistently. It does so by:
1. Adding some more headers to the UDP section, making it consistent
with the TCP section.
2. Excluding some files relating to Wireless from NETWORKING [GENERAL],
making their handling consistent with other files related to
Wireless.
The aim of this is to make things more consistent. And for MAINTAINERS
to better reflect the situation on the ground. I am more than happy to
be told that the current state of affairs is fine. Or for other ideas to
be discussed.
Simon Horman [Wed, 9 Oct 2024 08:47:22 +0000 (09:47 +0100)]
MAINTAINERS: consistently exclude wireless files from NETWORKING [GENERAL]
We already exclude wireless drivers from the netdev@ traffic, to
delegate it to linux-wireless@, and avoid overwhelming netdev@.
Many of the following wireless-related sections MAINTAINERS
are already not included in the NETWORKING [GENERAL] section.
For consistency, exclude those that are.
This patch add a toy implementation that performs a simple return to
prevent such panic. This is because MSS can be set in sock_create_kern
or smc_setsockopt, similar to how it's done in AF_SMC. However, for
AF_SMC, there is currently no way to synchronize MSS within
__sys_connect_file. This toy implementation lays the groundwork for us
to support such feature for IPPROTO_SMC in the future.
Fixes: d25a92ccae6b ("net/smc: Introduce IPPROTO_SMC") Reported-by: Eric Dumazet <edumazet@google.com> Signed-off-by: D. Wythe <alibuda@linux.alibaba.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com> Link: https://patch.msgid.link/1728456916-67035-1-git-send-email-alibuda@linux.alibaba.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Simon Horman [Wed, 9 Oct 2024 09:12:19 +0000 (10:12 +0100)]
docs: netdev: document guidance on cleanup patches
The purpose of this section is to document what is the current practice
regarding clean-up patches which address checkpatch warnings and similar
problems. I feel there is a value in having this documented so others
can easily refer to it.
Clearly this topic is subjective. And to some extent the current
practice discourages a wider range of patches than is described here.
But I feel it is best to start somewhere, with the most well established
part of the current practice.
Jakub Kicinski [Thu, 10 Oct 2024 15:31:06 +0000 (08:31 -0700)]
Merge branch 'net-introduce-tx-h-w-shaping-api'
Paolo Abeni says:
====================
net: introduce TX H/W shaping API
We have a plurality of shaping-related drivers API, but none flexible
enough to meet existing demand from vendors[1].
This series introduces new device APIs to configure in a flexible way
TX H/W shaping. The new functionalities are exposed via a newly
defined generic netlink interface and include introspection
capabilities. Some self-tests are included, on top of a dummy
netdevsim implementation. Finally a basic implementation for the iavf
driver is provided.
In a group operation, when prior to the op itself, the leaves have
different parents, the user must specify the parent handle for the
group. I.e., starting from the previous config:
./tools/net/ynl/cli.py --spec $SPEC \
--do group --json '{"ifindex": '$IFINDEX',
"leaves": [
{"handle": {"scope": "queue", "id":'$QID1'},
"weight": '$W1'},
{"handle": {"scope": "queue", "id":'$QID3'},
"weight": '$W3'}],
"handle": {"scope": "node"},
"bw-max": 3000000 }'
Netlink error: Invalid argument
nl_len = 96 (80) nl_flags = 0x300 nl_type = 2
error: -22
extack: {'msg': 'All the leaves shapers must have the same old parent'}
Sudheer Mogilappagari [Wed, 9 Oct 2024 08:10:01 +0000 (10:10 +0200)]
iavf: add support to exchange qos capabilities
During driver initialization VF determines QOS capability is allowed
by PF and receives QOS parameters. After which quanta size for queues
is configured which is not configurable and is set to 1KB currently.
Sudheer Mogilappagari [Wed, 9 Oct 2024 08:10:00 +0000 (10:10 +0200)]
iavf: Add net_shaper_ops support
Implement net_shaper_ops support for IAVF. This enables configuration
of rate limiting on per queue basis. Customer intends to enforce
bandwidth limit on Tx traffic steered to the queue by configuring
rate limits on the queue.
To set rate limiting for a queue, update shaper object of given queues
in driver and send VIRTCHNL_OP_CONFIG_QUEUE_BW to PF to update HW
configuration.
Deleting shaper configured for queue is nothing but configuring shaper
with bw_max 0. The PF restores the default rate limiting config
when bw_max is zero.
Wenjun Wu [Wed, 9 Oct 2024 08:09:59 +0000 (10:09 +0200)]
ice: Support VF queue rate limit and quanta size configuration
Add support to configure VF queue rate limit and quanta size.
For quanta size configuration, the quanta profiles are divided evenly
by PF numbers. For each port, the first quanta profile is reserved for
default. When VF is asked to set queue quanta size, PF will search for
an available profile, change the fields and assigned this profile to the
queue.
Wenjun Wu [Wed, 9 Oct 2024 08:09:58 +0000 (10:09 +0200)]
virtchnl: support queue rate limit and quanta size configuration
This patch adds new virtchnl opcodes and structures for rate limit
and quanta size configuration, which include:
1. VIRTCHNL_OP_CONFIG_QUEUE_BW, to configure max bandwidth for each
VF per queue.
2. VIRTCHNL_OP_CONFIG_QUANTA, to configure quanta size per queue.
3. VIRTCHNL_OP_GET_QOS_CAPS, VF queries current QoS configuration, such
as enabled TCs, arbiter type, up2tc and bandwidth of VSI node. The
configuration is previously set by DCB and PF, and now is the potential
QoS capability of VF. VF can take it as reference to configure queue TC
mapping.
Paolo Abeni [Wed, 9 Oct 2024 08:09:56 +0000 (10:09 +0200)]
net-shapers: implement cap validation in the core
Use the device capabilities to reject invalid attribute values before
pushing them to the H/W.
Note that validating the metric explicitly avoids NL_SET_BAD_ATTR()
usage, to provide unambiguous error messages to the user.
Validating the nesting requires the knowledge of the new parent for
the given shaper; as such is a chicken-egg problem: to validate the
leaf nesting we need to know the node scope, to validate the node
nesting we need to know the leafs parent scope.
To break the circular dependency, place the leafs nesting validation
after the parsing.
Paolo Abeni [Wed, 9 Oct 2024 08:09:52 +0000 (10:09 +0200)]
net-shapers: implement delete support for NODE scope shaper
Leverage the previously introduced group operation to implement
the removal of NODE scope shaper, re-linking its leaves under the
the parent node before actually deleting the specified NODE scope
shaper.
Paolo Abeni [Wed, 9 Oct 2024 08:09:51 +0000 (10:09 +0200)]
net-shapers: implement NL group operation
Allow grouping multiple leaves shaper under the given root.
The node and the leaves shapers are created, if needed, otherwise
the existing shapers are re-linked as requested.
Try hard to pre-allocated the needed resources, to avoid non
trivial H/W configuration rollbacks in case of any failure.
Paolo Abeni [Wed, 9 Oct 2024 08:09:50 +0000 (10:09 +0200)]
net-shapers: implement NL set and delete operations
Both NL operations directly map on the homonymous device shaper
callbacks, update accordingly the shapers cache and are serialized
via a per device lock.
Implement the cache modification helpers to additionally deal with
NODE scope shaper. That will be needed by the group() operation
implemented in the next patch.
The delete implementation is partial: does not handle NODE scope
shaper yet. Such support will require infrastructure from
the next patch and will be implemented later in the series.
Paolo Abeni [Wed, 9 Oct 2024 08:09:49 +0000 (10:09 +0200)]
net-shapers: implement NL get operation
Introduce the basic infrastructure to implement the net-shaper
core functionality. Each network devices carries a net-shaper cache,
the NL get() operation fetches the data from such cache.
The cache is initially empty, will be fill by the set()/group()
operation implemented later and is destroyed at device cleanup time.
The net_shaper_fill_handle(), net_shaper_ctx_init(), and
net_shaper_generic_pre() implementations handle generic index type
attributes, despite the current caller always pass a constant value
to avoid more noise in later patches using them with different
attributes.
Paolo Abeni [Wed, 9 Oct 2024 08:09:48 +0000 (10:09 +0200)]
netlink: spec: add shaper YAML spec
Define the user-space visible interface to query, configure and delete
network shapers via yaml definition.
Add dummy implementations for the relevant NL callbacks.
set() and delete() operations touch a single shaper creating/updating or
deleting it.
The group() operation creates a shaper's group, nesting multiple input
shapers under the specified output shaper.
Paolo Abeni [Wed, 9 Oct 2024 08:09:47 +0000 (10:09 +0200)]
genetlink: extend info user-storage to match NL cb ctx
This allows a more uniform implementation of non-dump and dump
operations, and will be used later in the series to avoid some
per-operation allocation.
Additionally rename the NL_ASSERT_DUMP_CTX_FITS macro, to
fit a more extended usage.
Kuniyuki Iwashima [Tue, 8 Oct 2024 18:47:37 +0000 (11:47 -0700)]
phonet: Handle error of rtnl_register_module().
Before commit addf9b90de22 ("net: rtnetlink: use rcu to free rtnl
message handlers"), once the first rtnl_register_module() allocated
rtnl_msg_handlers[PF_PHONET], the following calls never failed.
However, after the commit, rtnl_register_module() could fail silently
to allocate rtnl_msg_handlers[PF_PHONET][msgtype] and requires error
handling for each call.
Handling the error allows users to view a module as an all-or-nothing
thing in terms of the rtnetlink functionality. This prevents syzkaller
from reporting spurious errors from its tests, where OOM often occurs
and module is automatically loaded.
Let's use rtnl_register_many() to handle the errors easily.
Fixes: addf9b90de22 ("net: rtnetlink: use rcu to free rtnl message handlers") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Rémi Denis-Courmont <courmisch@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Kuniyuki Iwashima [Tue, 8 Oct 2024 18:47:36 +0000 (11:47 -0700)]
mpls: Handle error of rtnl_register_module().
Since introduced, mpls_init() has been ignoring the returned
value of rtnl_register_module(), which could fail silently.
Handling the error allows users to view a module as an all-or-nothing
thing in terms of the rtnetlink functionality. This prevents syzkaller
from reporting spurious errors from its tests, where OOM often occurs
and module is automatically loaded.
Let's handle the errors by rtnl_register_many().
Fixes: 03c0566542f4 ("mpls: Netlink commands to add, remove, and dump routes") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Kuniyuki Iwashima [Tue, 8 Oct 2024 18:47:35 +0000 (11:47 -0700)]
mctp: Handle error of rtnl_register_module().
Since introduced, mctp has been ignoring the returned value of
rtnl_register_module(), which could fail silently.
Handling the error allows users to view a module as an all-or-nothing
thing in terms of the rtnetlink functionality. This prevents syzkaller
from reporting spurious errors from its tests, where OOM often occurs
and module is automatically loaded.
Kuniyuki Iwashima [Tue, 8 Oct 2024 18:47:34 +0000 (11:47 -0700)]
bridge: Handle error of rtnl_register_module().
Since introduced, br_vlan_rtnl_init() has been ignoring the returned
value of rtnl_register_module(), which could fail silently.
Handling the error allows users to view a module as an all-or-nothing
thing in terms of the rtnetlink functionality. This prevents syzkaller
from reporting spurious errors from its tests, where OOM often occurs
and module is automatically loaded.
Let's handle the errors by rtnl_register_many().
Fixes: 8dcea187088b ("net: bridge: vlan: add rtm definitions and dump support") Fixes: f26b296585dc ("net: bridge: vlan: add new rtm message support") Fixes: adb3ce9bcb0f ("net: bridge: vlan: add del rtm message support") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Kuniyuki Iwashima [Tue, 8 Oct 2024 18:47:33 +0000 (11:47 -0700)]
vxlan: Handle error of rtnl_register_module().
Since introduced, vxlan_vnifilter_init() has been ignoring the
returned value of rtnl_register_module(), which could fail silently.
Handling the error allows users to view a module as an all-or-nothing
thing in terms of the rtnetlink functionality. This prevents syzkaller
from reporting spurious errors from its tests, where OOM often occurs
and module is automatically loaded.
Let's handle the errors by rtnl_register_many().
Fixes: f9c4bb0b245c ("vxlan: vni filtering support on collect metadata device") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Kuniyuki Iwashima [Tue, 8 Oct 2024 18:47:32 +0000 (11:47 -0700)]
rtnetlink: Add bulk registration helpers for rtnetlink message handlers.
Before commit addf9b90de22 ("net: rtnetlink: use rcu to free rtnl message
handlers"), once rtnl_msg_handlers[protocol] was allocated, the following
rtnl_register_module() for the same protocol never failed.
However, after the commit, rtnl_msg_handler[protocol][msgtype] needs to
be allocated in each rtnl_register_module(), so each call could fail.
Many callers of rtnl_register_module() do not handle the returned error,
and we need to add many error handlings.
To handle that easily, let's add wrapper functions for bulk registration
of rtnetlink message handlers.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Christian Marangi [Tue, 8 Oct 2024 19:47:16 +0000 (21:47 +0200)]
net: phy: Validate PHY LED OPs presence before registering
Validate PHY LED OPs presence before registering and parsing them.
Defining LED nodes for a PHY driver that actually doesn't supports them
is redundant and useless.
It's also the case with Generic PHY driver used and a DT having LEDs
node for the specific PHY.
Skip it and report the error with debug print enabled.
Paolo Abeni [Thu, 10 Oct 2024 11:50:55 +0000 (13:50 +0200)]
Merge tag 'nf-24-10-09' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
1) Restrict xtables extensions to families that are safe, syzbot found
a way to combine ebtables with extensions that are never used by
userspace tools. From Florian Westphal.
2) Set l3mdev inconditionally whenever possible in nft_fib to fix lookup
mismatch, also from Florian.
netfilter pull request 24-10-09
* tag 'nf-24-10-09' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
selftests: netfilter: conntrack_vrf.sh: add fib test case
netfilter: fib: check correct rtable in vrf setups
netfilter: xtables: avoid NFPROTO_UNSPEC where needed
====================
====================
net/mlx5: qos: Refactor esw qos to support new features
This patch series by Cosmin and Carolina prepares the mlx5 qos infra for
the upcoming feature of cross E-Switch scheduling.
Noop cleanups:
net/mlx5: qos: Flesh out element_attributes in mlx5_ifc.h
net/mlx5: qos: Rename vport 'tsar' into 'sched_elem'.
net/mlx5: qos: Consistently name vport vars as 'vport'
net/mlx5: qos: Refactor and document bw_share calculation
net/mlx5: qos: Rename rate group 'list' as 'parent_entry'
Refactor the code with the goal of moving groups out of E-Switches:
net/mlx5: qos: Maintain rate group vport members in a list
net/mlx5: qos: Always create group0
net/mlx5: qos: Drop 'esw' param from vport qos functions
net/mlx5: qos: Store the eswitch in a mlx5_esw_rate_group
Move groups from an E-Switch into an mlx5_qos_domain:
net/mlx5: qos: Store rate groups in a qos domain
Refactor locking to use a new mutex in the qos domain:
net/mlx5: qos: Refactor locking to a qos domain mutex
In follow-up patchsets, we'll allow qos domains to be shared
between E-Switches of the same NIC.
The two top patches are simple enhancements.
====================
Carolina Jubran [Tue, 8 Oct 2024 18:32:21 +0000 (21:32 +0300)]
net/mlx5: Unify QoS element type checks across NIC and E-Switch
Refactor the QoS element type support check by introducing a new
function, mlx5_qos_element_type_supported(), which handles element type
validation for both NIC and E-Switch schedulers.
This change removes the redundant esw_qos_element_type_supported()
function and unifies the element type checks into a single
implementation.
Cosmin Ratiu [Tue, 8 Oct 2024 18:32:20 +0000 (21:32 +0300)]
net/mlx5: qos: Refactor locking to a qos domain mutex
E-Switch qos changes used the esw state_lock to serialize qos changes.
With the introduction of cross-esw scheduling, multiple E-Switches might
be involved in a qos operation, so prepare for that by switching locking
to use a qos domain mutex.
Add three helper functions:
- esw_qos_lock
- esw_qos_unlock
- esw_assert_qos_lock_held
Convert existing direct lock/unlock/lockdep calls to them. Also call
esw_assert_qos_lock_held in a couple more places.
mlx5_esw_qos_set_vport_rate expected to be called with the esw
state_lock already held.
Change it to instead acquire the qos lock directly.
mlx5_eswitch_get_vport_config also accessed qos properties with the esw
state lock. Introduce a new function mlx5_esw_qos_get_vport_rate to
access those with the correct lock and change get_vport_config to use
it.
Finally, mlx5_vport_disable is called from the cleanup path with the esw
state_lock held, so have it additionally acquire the qos lock to make
sure there are no races.
Cosmin Ratiu [Tue, 8 Oct 2024 18:32:19 +0000 (21:32 +0300)]
net/mlx5: qos: Store rate groups in a qos domain
Groups are currently maintained as a list in their corresponding
eswitch, protected by the esw state_lock.
The upcoming cross-eswitch scheduling feature cannot work with this
approach, as it would require acquiring multiple eswitch locks (in the
correct order) in order to maintain group membership.
This commit moves the rate groups into a new 'qos domain' struct and
adds explicit qos init/cleanup steps to the eswitch init/cleanup.
Upcoming patches will expand the qos domain struct and allow it to be
shared between eswitches. For now, qos domains are private to each esw
so there's only an extra indirection.
Cosmin Ratiu [Tue, 8 Oct 2024 18:32:18 +0000 (21:32 +0300)]
net/mlx5: qos: Rename rate group 'list' as 'parent_entry'
'list' is not very descriptive, I prefer list membership to clearly
specify which list the entry belongs to. This commit renames the list
entry into the esw groups list as 'parent_entry' to make the code more
readable. This is a no-op change.
Cosmin Ratiu [Tue, 8 Oct 2024 18:32:17 +0000 (21:32 +0300)]
net/mlx5: qos: Add an explicit 'dev' to vport trace calls
vport qos trace calls used vport->dev implicitly as the device to which
the command was sent (and thus the device logged in traces).
But that will no longer be the case for cross-esw scheduling, where the
commands have to be sent to the group esw device instead.
Cosmin Ratiu [Tue, 8 Oct 2024 18:32:16 +0000 (21:32 +0300)]
net/mlx5: qos: Store the eswitch in a mlx5_esw_rate_group
The rate groups are about to be moved out of eswitches, so store a
reference to the eswitch they belong to so things can still work
later.
This allows dropping the esw parameter from a couple of functions and
simplifying some of the code. Use this opportunity to make sure that
vport scheduling element commands are always sent to the group eswitch,
because that will be relevant for cross-esw scheduling. For now though,
the eswitches are not different.
Cosmin Ratiu [Tue, 8 Oct 2024 18:32:15 +0000 (21:32 +0300)]
net/mlx5: qos: Drop 'esw' param from vport qos functions
The vport has a pointer to its own eswitch in vport->dev->priv.eswitch,
so passing the same eswitch as a parameter to the various functions
manipulating vport qos is superfluous at best and prone to errors at
worst.
More importantly, with the upcoming cross-esw scheduling changes, the
eswitch that should receive the various scheduling element commands is
NOT the same as the vport's eswitch, so the current code's assumptions
will break.
To avoid confusion and bugs, this commit drops the 'esw' parameter from
all vport qos functions and uses the vport's own eswitch pointer
instead.
Cosmin Ratiu [Tue, 8 Oct 2024 18:32:14 +0000 (21:32 +0300)]
net/mlx5: qos: Always create group0
All vports not explicitly members of a group with QoS enabled are part
of the internal esw group0, except when the hw reports that groups
aren't supported (log_esw_max_sched_depth == 0). This creates corner
cases in the code, which has to make sure that this case is supported.
Additionally, the groups are about to be moved out of eswitches, and
group0 being NULL creates additional complications there.
This patch makes sure to always create group0, even if max sched depth
is 0. In that case, a software-only group0 is created referencing the
root TSAR. Vports can point to this group when their QoS is enabled and
they'll be attached to the root TSAR directly. This eliminates corner
cases in the code by offering the guarantee that if qos is enabled,
vport->qos.group is non-NULL.
Cosmin Ratiu [Tue, 8 Oct 2024 18:32:13 +0000 (21:32 +0300)]
net/mlx5: qos: Maintain rate group vport members in a list
Previously, finding group members was done by iterating over all vports
of an eswitch and comparing their group with the required one, but that
approach will break down when a group can contain vports from multiple
eswitches.
Solve that by maintaining a list of vport members.
Instead of iterating over esw vports, loop over the members list.
Use this opportunity to provide two new functions to allocate and free a
group, so that the number of state transitions is smaller. This will
also be used in a future patch.
Cosmin Ratiu [Tue, 8 Oct 2024 18:32:12 +0000 (21:32 +0300)]
net/mlx5: qos: Refactor and document bw_share calculation
The previous function (esw_qos_calculate_group_min_rate_divider) had two
completely different modes of execution, depending on the 'group_level'
parameter. Split it into two separate functions:
- esw_qos_calculate_min_rate_divider - computes min across groups.
- esw_qos_calculate_group_min_rate_divider - computes min in a group.
Fold the divider calculation into the corresponding normalize functions
to avoid having the caller compute the corresponding divider.
Also rename the normalize functions to better indicate what level
they're operating on.
Finally, document everything so that this topic can more easily be
understood by future maintainers.
Cosmin Ratiu [Tue, 8 Oct 2024 18:32:10 +0000 (21:32 +0300)]
net/mlx5: qos: Rename vport 'tsar' into 'sched_elem'.
Vports do not use TSARs (Transmit Scheduling ARbiters), which are used
for grouping multiple entities together. Use the correct name in
variables and functions for clarity.
Also move the scheduling context to a local variable in the
esw_qos_sched_elem_config function instead of an empty parameter that
needs to be provided by all callers.
There is no functional change here.
Cosmin Ratiu [Tue, 8 Oct 2024 18:32:09 +0000 (21:32 +0300)]
net/mlx5: qos: Flesh out element_attributes in mlx5_ifc.h
This is used for multiple purposes, depending on the scheduling element
created. There are a few helper struct defined a long time ago, but they
are not easy to find in the file and they are about to get new members.
This commit cleans up this area a bit by:
- moving the helper structs closer to where they are relevant.
- defining a helper union to include all of them to help
discoverability.
- making use of it everywhere element_attributes is used.
- using a consistent 'attr' name.
Paolo Abeni [Thu, 10 Oct 2024 10:52:29 +0000 (12:52 +0200)]
Merge branch 'eth-fbnic-add-timestamping-support'
Vadim Fedorenko says:
====================
eth: fbnic: add timestamping support
The series is to add timestamping support for Meta's NIC driver.
Changelog:
v3 -> v4:
- use adjust_by_scaled_ppm() instead of open coding it
- adjust cached value of high bits of timestamp to be sure it
is older then incoming timestamps
v2 -> v3:
- rebase on top of net-next
- add doc to describe retur value of fbnic_ts40_to_ns()
v1 -> v2:
- adjust comment about using u64 stats locking primitive
- fix typo in the first patch
- Cc Richard
====================
Vadim Fedorenko [Tue, 8 Oct 2024 18:14:35 +0000 (11:14 -0700)]
eth: fbnic: add TX packets timestamping support
Add TX configuration to ethtool interface. Add processing of TX
timestamp completions as well as configuration to request HW to create
TX timestamp completion.
Signed-off-by: Vadim Fedorenko <vadfed@meta.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Vadim Fedorenko [Tue, 8 Oct 2024 18:14:32 +0000 (11:14 -0700)]
eth: fbnic: add software TX timestamping support
Add software TX timestamping support. RX software timestamping is
implemented in the core and there is no need to provide special flag
in the driver anymore.
Signed-off-by: Vadim Fedorenko <vadfed@meta.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Breno Leitao [Tue, 8 Oct 2024 16:32:04 +0000 (09:32 -0700)]
net: Remove likely from l3mdev_master_ifindex_by_index
The likely() annotation in l3mdev_master_ifindex_by_index() has been
found to be incorrect 100% of the time in real-world workloads (e.g.,
web servers).
Annotated branches shows the following in these servers:
correct incorrect % Function File Line
0 169053813 100 l3mdev_master_ifindex_by_index l3mdev.h 81
This is happening because l3mdev_master_ifindex_by_index() is called
from __inet_check_established(), which calls
l3mdev_master_ifindex_by_index() passing the socked bounded interface.
Eric Dumazet [Tue, 8 Oct 2024 14:31:10 +0000 (14:31 +0000)]
net: do not delay dst_entries_add() in dst_release()
dst_entries_add() uses per-cpu data that might be freed at netns
dismantle from ip6_route_net_exit() calling dst_entries_destroy()
Before ip6_route_net_exit() can be called, we release all
the dsts associated with this netns, via calls to dst_release(),
which waits an rcu grace period before calling dst_destroy()
dst_entries_add() use in dst_destroy() is racy, because
dst_entries_destroy() could have been called already.
Decrementing the number of dsts must happen sooner.
Notes:
1) in CONFIG_XFRM case, dst_destroy() can call
dst_release_immediate(child), this might also cause UAF
if the child does not have DST_NOCOUNT set.
IPSEC maintainers might take a look and see how to address this.
2) There is also discussion about removing this count of dst,
which might happen in future kernels.
Fixes: f88649721268 ("ipv4: fix dst race in sk_dst_get()") Closes: https://lore.kernel.org/lkml/CANn89iLCCGsP7SFn9HKpvnKu96Td4KD08xf7aGtiYgZnkjaL=w@mail.gmail.com/T/ Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Tested-by: Linux Kernel Functional Testing <lkft@linaro.org> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Xin Long <lucien.xin@gmail.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Reviewed-by: Xin Long <lucien.xin@gmail.com> Link: https://patch.msgid.link/20241008143110.1064899-1-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This is a prep of per-net RTNL conversion for RTM_(NEW|DEL|SET)ADDR.
Currently, each IPv4 address is linked to the global hash table, and
this needs to be protected by another global lock or namespacified to
support per-net RTNL.
Adding a global lock will cause deadlock in the rtnetlink path and GC,
rtnetlink check_lifetime
|- rtnl_net_lock(net) |- acquire the global lock
|- acquire the global lock |- check ifa's netns
`- put ifa into hash table `- rtnl_net_lock(net)
so we need to namespacify the hash table.
The IPv6 one is already namespacified, let's follow that.
Namespacifying the GC is required for per-netns RTNL, but using the
per-netns hash table will shorten the time on the hash bucket iteration
under RTNL.
Let's add per-netns GC work and use the per-netns hash table.
Joe Damato adds support for netdev-genl support to query IRQ, NAPI,
and queue information.
For e1000:
Joe Damato adds support for netdev-genl support to query IRQ, NAPI,
and queue information.
* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
e1000: Link NAPI instances to queues and IRQs
e1000e: Link NAPI instances to queues and IRQs
e1000e: Remove duplicated writel() in e1000_configure_tx/rx()
igb: Cleanup unused declarations
iavf: Remove unused declarations
ice: Cleanup unused declarations
ice: Use common error handling code in two functions
ice: Make use of assign_bit() API
ice: store max_frame and rx_buf_len only in ice_rx_ring
ice: consistently use q_idx in ice_vc_cfg_qs_msg()
ice: add E830 HW VF mailbox message limit support
ice: Implement ethtool reset support
====================
This series contains updates to ice, i40e, igb, and e1000e drivers.
For ice:
Marcin allows driver to load, into safe mode, when DDP package is
missing or corrupted and adjusts the netif_is_ice() check to
account for when the device is in safe mode. He also fixes an
out-of-bounds issue when MSI-X are increased for VFs.
Wojciech clears FDB entries on reset to match the hardware state.
For i40e:
Aleksandr adds locking around MACVLAN filters to prevent memory leaks
due to concurrency issues.
For igb:
Mohamed Khalfella adds a check to not attempt to bring up an already
running interface on non-fatal PCIe errors.
For e1000e:
Vitaly changes board type for I219 to more closely match the hardware
and stop PHY issues.
* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
e1000e: change I219 (19) devices to ADP
igb: Do not bring the device up after non-fatal error
i40e: Fix macvlan leak by synchronizing access to mac_filter_hash
ice: Fix increasing MSI-X on VF
ice: Flush FDB entries before reset
ice: Fix netif_is_ice() in Safe Mode
ice: Fix entering Safe Mode
====================
Minda Chen [Tue, 8 Oct 2024 11:14:43 +0000 (19:14 +0800)]
net: stmmac: Add DW QoS Eth v4/v5 ip payload error statistics
Add DW QoS Eth v4/v5 ip payload error statistics, and rename descriptor
bit macro because v4/v5 descriptor IPCE bit claims ip checksum
error or TCP/UDP/ICMP segment length error.
Here is bit description from DW QoS Eth data book(Part 19.6.2.2)
bit7 IPCE: IP Payload Error
When this bit is programmed, it indicates either of the following:
1).The 16-bit IP payload checksum (that is, the TCP, UDP, or ICMP
checksum) calculated by the MAC does not match the corresponding
checksum field in the received segment.
2).The TCP, UDP, or ICMP segment length does not match the payload
length value in the IP Header field.
3).The TCP, UDP, or ICMP segment length is less than minimum allowed
segment length for TCP, UDP, or ICMP.
====================
mptcp: misc. fixes involving fallback to TCP
- Patch 1: better handle DSS corruptions from a bugged peer: reducing
warnings, doing a fallback or a reset depending on the subflow state.
For >= v5.7.
- Patch 2: fix DSS corruption due to large pmtu xmit, where MPTCP was
not taken into account. For >= v5.6.
- Patch 3: fallback when MPTCP opts are dropped after the first data
packet, instead of resetting the connection. For >= v5.6.
- Patch 4: restrict the removal of a subflow to other closing states, a
better fix, for a recent one. For >= v5.10.
====================
In a previous fix, the in-kernel path-manager has been modified not to
retrigger the removal of a subflow if it was already closed, e.g. when
the initial subflow is removed, but kept in the subflows list.
To be complete, this fix should also skip the subflows that are in any
closing state: mptcp_close_ssk() will initiate the closure, but the
switch to the TCP_CLOSE state depends on the other peer.
Fixes: 58e1b66b4e4b ("mptcp: pm: do not remove already closed subflows") Cc: stable@vger.kernel.org Suggested-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20241008-net-mptcp-fallback-fixes-v1-4-c6fb8e93e551@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
mptcp: fallback when MPTCP opts are dropped after 1st data
As reported by Christoph [1], before this patch, an MPTCP connection was
wrongly reset when a host received a first data packet with MPTCP
options after the 3wHS, but got the next ones without.
According to the MPTCP v1 specs [2], a fallback should happen in this
case, because the host didn't receive a DATA_ACK from the other peer,
nor receive data for more than the initial window which implies a
DATA_ACK being received by the other peer.
The patch here re-uses the same logic as the one used in other places:
by looking at allow_infinite_fallback, which is disabled at the creation
of an additional subflow. It's not looking at the first DATA_ACK (or
implying one received from the other side) as suggested by the RFC, but
it is in continuation with what was already done, which is safer, and it
fixes the reported issue. The next step, looking at this first DATA_ACK,
is tracked in [4].
This patch has been validated using the following Packetdrill script:
// Data from the client with valid MPTCP options (no DATA_ACK: normal)
+0.1 < P. 1:501(500) ack 1 win 2048 <mpcapable v1 flags[flag_h] key[skey, ckey] mpcdatalen 500, nop, nop>
// From here, the MPTCP options will be dropped by a middlebox
+0.0 > . 1:1(0) ack 501 <dss dack8=501 dll=0 nocs>
// The server replies with data, still thinking MPTCP is being used
+0.0 > P. 1:101(100) ack 501 <dss dack8=501 dsn8=1 ssn=1 dll=100 nocs, nop, nop>
// But the client already did a fallback to TCP, because the two previous packets have been received without MPTCP options
+0.1 < . 501:501(0) ack 101 win 2048
+0.0 < P. 501:601(100) ack 101 win 2048
// The server should fallback to TCP, not reset: it didn't get a DATA_ACK, nor data for more than the initial window
+0.0 > . 101:101(0) ack 601
Note that this script requires Packetdrill with MPTCP support, see [3].
Additionally syzkaller provided a nice reproducer. The repro enables
pmtu on the loopback device, leading to tcp_mtu_probe() generating
very large probe packets.
tcp_can_coalesce_send_queue_head() currently does not check for
mptcp-level invariants, and allowed the creation of cross-DSS probes,
leading to the mentioned corruption.
Address the issue teaching tcp_can_coalesce_send_queue_head() about
mptcp using the tcp_skb_can_collapse(), also reducing the code
duplication.
Paolo Abeni [Tue, 8 Oct 2024 11:04:52 +0000 (13:04 +0200)]
mptcp: handle consistently DSS corruption
Bugged peer implementation can send corrupted DSS options, consistently
hitting a few warning in the data path. Use DEBUG_NET assertions, to
avoid the splat on some builds and handle consistently the error, dumping
related MIBs and performing fallback and/or reset according to the
subflow type.
Fixes: 6771bfd9ee24 ("mptcp: update mptcp ack sequence from work queue") Cc: stable@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20241008-net-mptcp-fallback-fixes-v1-1-c6fb8e93e551@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Breno Leitao [Tue, 8 Oct 2024 09:43:24 +0000 (02:43 -0700)]
net: netconsole: fix wrong warning
A warning is triggered when there is insufficient space in the buffer
for userdata. However, this is not an issue since userdata will be sent
in the next iteration.
Current warning message:
------------[ cut here ]------------
WARNING: CPU: 13 PID: 3013042 at drivers/net/netconsole.c:1122 write_ext_msg+0x3b6/0x3d0
? write_ext_msg+0x3b6/0x3d0
console_flush_all+0x1e9/0x330
The code incorrectly issues a warning when this_chunk is zero, which is
a valid scenario. The warning should only be triggered when this_chunk
is negative.
Fixes: 1ec9daf95093 ("net: netconsole: append userdata to fragmented netconsole messages") Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20241008094325.896208-1-leitao@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean [Tue, 8 Oct 2024 09:43:20 +0000 (12:43 +0300)]
net: dsa: refuse cross-chip mirroring operations
In case of a tc mirred action from one switch to another, the behavior
is not correct. We simply tell the source switch driver to program a
mirroring entry towards mirror->to_local_port = to_dp->index, but it is
not even guaranteed that the to_dp belongs to the same switch as dp.
For proper cross-chip support, we would need to go through the
cross-chip notifier layer in switch.c, program the entry on cascade
ports, and introduce new, explicit API for cross-chip mirroring, given
that intermediary switches should have introspection into the DSA tags
passed through the cascade port (and not just program a port mirror on
the entire cascade port). None of that exists today.
Reject what is not implemented so that user space is not misled into
thinking it works.
Fixes: f50f212749e8 ("net: dsa: Add plumbing for port mirroring") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20241008094320.3340980-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Wei Fang [Tue, 8 Oct 2024 06:11:53 +0000 (14:11 +0800)]
net: fec: don't save PTP state if PTP is unsupported
Some platforms (such as i.MX25 and i.MX27) do not support PTP, so on
these platforms fec_ptp_init() is not called and the related members
in fep are not initialized. However, fec_ptp_save_state() is called
unconditionally, which causes the kernel to panic. Therefore, add a
condition so that fec_ptp_save_state() is not called if PTP is not
supported.
Fixes: a1477dc87dc4 ("net: fec: Restart PPS after link state change") Reported-by: Guenter Roeck <linux@roeck-us.net> Closes: https://lore.kernel.org/lkml/353e41fe-6bb4-4ee9-9980-2da2a9c1c508@roeck-us.net/ Signed-off-by: Wei Fang <wei.fang@nxp.com> Reviewed-by: Csókás, Bence <csokas.bence@prolan.hu> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Guenter Roeck <linux@roeck-us.net> Link: https://patch.msgid.link/20241008061153.1977930-1-wei.fang@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Dumazet [Tue, 8 Oct 2024 12:13:07 +0000 (12:13 +0000)]
ipv6: switch inet6_acaddr_hash() to less predictable hash
commit 2384d02520ff ("net/ipv6: Add anycast addresses to a global hashtable")
added inet6_acaddr_hash(), using ipv6_addr_hash() and net_hash_mix()
to get hash spreading for typical users.
However ipv6_addr_hash() is highly predictable and a malicious user
could abuse a specific hash bucket.
Switch to __ipv6_addr_jhash(). We could use a dedicated
secret, or reuse net_hash_mix() as I did in this patch.
Eric Dumazet [Tue, 8 Oct 2024 12:01:01 +0000 (12:01 +0000)]
ipv6: switch inet6_addr_hash() to less predictable hash
In commit 3f27fb23219e ("ipv6: addrconf: add per netns perturbation
in inet6_addr_hash()"), I added net_hash_mix() in inet6_addr_hash()
to get better hash dispersion, at a time all netns were sharing the
hash table.
Since then, commit 21a216a8fc63 ("ipv6/addrconf: allocate a per
netns hash table") made the hash table per netns.
We could remove the net_hash_mix() from inet6_addr_hash(), but
there is still an issue with ipv6_addr_hash().
It is highly predictable and a malicious user can easily create
thousands of IPv6 addresses all stored in the same hash bucket.
Switch to __ipv6_addr_jhash(). We could use a dedicated
secret, or reuse net_hash_mix() as I did in this patch.