Jason Xing [Mon, 26 Feb 2024 03:22:18 +0000 (11:22 +0800)]
tcp: add a dropreason definitions and prepare for cookie check
Add one drop reason to detect skbs dropped by the hook points in the
cookie check, and extend NO_SOCKET to cover two more cases, so that
both can be used later.
Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
"Implement a per-cpu cache of +1/-1 MB, to reduce number
of changes to sk->sk_prot->memory_allocated, which
would otherwise be cause of false sharing."
sk_prot->memory_allocated points to global atomic variable:
atomic_long_t tcp_memory_allocated ____cacheline_aligned_in_smp;
Increasing the per-cpu cache size from 1MB to e.g. 16MB further
reduces changes to sk->sk_prot->memory_allocated.
Performance may improve on systems with many cores.
Signed-off-by: Adam Li <adamli@os.amperecomputing.com> Reviewed-by: Christoph Lameter (Ampere) <cl@linux.com> Signed-off-by: David S. Miller <davem@davemloft.net>
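The batching idea described in this commit can be sketched in userspace. This is only a model under stated assumptions, not the kernel code: the names demo_pcpu_model, demo_charge() and demo_global_updates() are hypothetical, a single CPU is simulated, and a 4KB page size is assumed so that 1MB corresponds to 256 pages and 16MB to 4096 pages.

```c
/* Hypothetical userspace model of a per-cpu reserve; not kernel code. */
struct demo_pcpu_model {
	long global_allocated;        /* stands in for the shared atomic counter */
	long local_reserve;           /* amount cached on this (single) CPU, in pages */
	long limit;                   /* per-cpu cache size, in pages */
	unsigned long global_updates; /* how many times the shared counter was written */
};

/* Charge or uncharge pages locally; touch the shared counter only when
 * the local reserve reaches +limit or -limit. */
static void demo_charge(struct demo_pcpu_model *m, long pages)
{
	m->local_reserve += pages;
	if (m->local_reserve >= m->limit || m->local_reserve <= -m->limit) {
		m->global_allocated += m->local_reserve;
		m->local_reserve = 0;
		m->global_updates++;
	}
}

/* Charge one page ncharges times and report how often the shared
 * counter had to be written. */
static unsigned long demo_global_updates(long limit, int ncharges)
{
	struct demo_pcpu_model m = { 0, 0, limit, 0 };

	for (int i = 0; i < ncharges; i++)
		demo_charge(&m, 1);
	return m.global_updates;
}
```

Under these assumptions, charging 4096 single pages writes the shared counter 16 times with a 256-page limit and once with a 4096-page limit, which is the reduction in shared-cacheline traffic the commit message is after.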
David S. Miller [Wed, 28 Feb 2024 08:21:42 +0000 (08:21 +0000)]
Merge branch 'dsa-realtek-reset'
Luiz Angelo Daros de Luca says:
====================
net: dsa: realtek: support reset controller and update docs
The driver previously supported reset pins using GPIO, but it lacked
support for reset controllers. Although a reset method is generally not
required, the driver fails to detect the switch if the reset was kept
asserted by a previous driver.
This series adds support to reset a Realtek switch using a reset
controller. It also updates the binding documentation to remove the
requirement of a reset method and to add the new reset controller
property.
It was tested on a TL-WR1043ND v1 router (rtl8366rb via SMI).
Signed-off-by: Luiz Angelo Daros de Luca <luizluca@gmail.com>
---
Changes in v5:
- Fixed error checking logic when reset controller (de)assert fails
- Link to v4: https://lore.kernel.org/r/20240219-realtek-reset-v4-0-858b82a29503@gmail.com
Changes in v4:
- do not test for priv->reset,priv->reset_ctl
- updated commit message
- Link to v3: https://lore.kernel.org/r/20240213-realtek-reset-v3-0-37837e574713@gmail.com
Changes in v3:
- Rebased on the Realtek DSA driver refactoring (08f627164126)
- Dropped the reset controller example in bindings
- Used %pe in error printing
- Linked to v2: https://lore.kernel.org/r/20231027190910.27044-1-luizluca@gmail.com/
Changes in v2:
- Introduced a dedicated commit for removing the reset-gpios requirement
- Placed binding patches before code changes
- Removed the 'reset-names' property
- Moved the example from the commit message to realtek.yaml
- Split the reset function into _assert/_deassert variants
- Modified reset functions to return a warning instead of a value
- Utilized devm_reset_control_get_optional to prevent failure when the
reset control is missing
- Used 'true' and 'false' for boolean values
- Removed the CONFIG_RESET_CONTROLLER check as stub methods are
sufficient when undefined
- Linked to v1: https://lore.kernel.org/r/20231024205805.19314-1-luizluca@gmail.com/
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Luiz Angelo Daros de Luca [Sun, 25 Feb 2024 16:29:55 +0000 (13:29 -0300)]
net: dsa: realtek: support reset controller
Add support for resetting the device using a reset controller,
complementing the existing GPIO reset functionality (reset-gpios).
Although the reset is optional and the driver performs a soft reset
during setup, if the initial reset pin state was asserted, the driver
will not detect the device until the reset is deasserted.
Signed-off-by: Luiz Angelo Daros de Luca <luizluca@gmail.com> Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Reviewed-by: Alvin Šipraga <alsi@bang-olufsen.dk> Signed-off-by: David S. Miller <davem@davemloft.net>
Luiz Angelo Daros de Luca [Sun, 25 Feb 2024 16:29:53 +0000 (13:29 -0300)]
dt-bindings: net: dsa: realtek: reset-gpios is not required
The 'reset-gpios' property should not be mandatory, although it might be
required for some devices if the switch reset was left asserted by a
previous driver, such as the bootloader.
Signed-off-by: Luiz Angelo Daros de Luca <luizluca@gmail.com> Cc: devicetree@vger.kernel.org Acked-by: Arınç ÜNAL <arinc.unal@arinc9.com> Acked-by: Rob Herring <robh@kernel.org> Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Reviewed-by: Alvin Šipraga <alsi@bang-olufsen.dk> Signed-off-by: David S. Miller <davem@davemloft.net>
Jones Syue 薛懷宗 [Mon, 26 Feb 2024 02:24:52 +0000 (02:24 +0000)]
bonding: 802.3ad replace MAC_ADDRESS_EQUAL with __agg_has_partner
Replace the MAC_ADDRESS_EQUAL() macro used for null_mac_addr checking with
the inline function __agg_has_partner(). When MAC_ADDRESS_EQUAL() compares
an aggregator's partner MAC address with null_mac_addr, it is checking
whether the aggregator has a valid partner. Using __agg_has_partner()
makes that intent clearer.
In ad_port_selection_logic(), since aggregator->partner_system and
port->partner_oper.system have already been compared as a prerequisite, it
is safe to replace the subsequent MAC_ADDRESS_EQUAL() null_mac_addr check
with __agg_has_partner().
Delete null_mac_addr, which is not required anymore in bond_3ad.c, since
all references to it are gone.
the preferred way in the kernel is to use the struct_size() helper to
do the arithmetic instead of the argument "size + size * count" in the
devm_kzalloc() function.
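To illustrate why a helper is preferred over open-coded "size + size * count" arithmetic, here is a minimal userspace sketch of an overflow-checked size computation. It is not the kernel's struct_size() macro: demo_table and demo_struct_size() are hypothetical names, and the sketch only mimics the saturate-on-overflow behaviour so that an oversized request fails at allocation time instead of wrapping around to a small size.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical structure ending in a flexible array member. */
struct demo_table {
	size_t count;
	uint32_t entries[];
};

/* Compute base + elem * count, returning SIZE_MAX if the result would
 * overflow size_t instead of silently wrapping around. */
static size_t demo_struct_size(size_t base, size_t elem, size_t count)
{
	if (elem != 0 && count > (SIZE_MAX - base) / elem)
		return SIZE_MAX;
	return base + elem * count;
}
```

A caller would pass sizeof(struct demo_table) and sizeof(uint32_t) together with the element count; an allocator handed SIZE_MAX can then be expected to fail cleanly.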
Eric Dumazet [Sat, 24 Feb 2024 09:06:30 +0000 (09:06 +0000)]
netlink: use kvmalloc() in netlink_alloc_large_skb()
This is a followup of commit 234ec0b6034b ("netlink: fix potential
sleeping issue in mqueue_flush_file"), because vfree_atomic()
overhead is unfortunate for medium sized allocations.
1) If the allocation is smaller than PAGE_SIZE, do not bother
with vmalloc() at all. Some arches have 64KB PAGE_SIZE,
while NLMSG_GOODSIZE is smaller than 8KB.
2) Use kvmalloc(), which might allocate one high order page
instead of vmalloc if memory is not too fragmented.
Alexander Lobakin [Mon, 26 Feb 2024 14:49:11 +0000 (15:49 +0100)]
bnxt_en: fix accessing vnic_info before allocating it
bnxt_alloc_mem() dereferences ::vnic_info in the variable declaration
block, but allocates it much later. As a result, the following crash
happens on my setup:
Jakub Kicinski [Sat, 24 Feb 2024 05:06:58 +0000 (21:06 -0800)]
selftests: netdevsim: be less selective for FW for the devlink test
Commit 6151ff9c7521 ("selftests: netdevsim: use suitable existing dummy
file for flash test") introduced a nice trick to the devlink flashing
test. Instead of user having to create a file under /lib/firmware
we just pick the first one that already exists.
Sadly, in AWS Linux there are no files directly under /lib/firmware,
only in subdirectories. Don't limit the search to -maxdepth 1.
We can use the %P print format to get the correct path for files
inside subdirectories:
$ find /lib/firmware -type f -printf '%P\n' | head -1
intel-ucode/06-1a-05
The full path is /lib/firmware/intel-ucode/06-1a-05
This works in GNU find. Busybox find doesn't have -printf at all,
so we're not making it worse.
Jesper Nilsson [Fri, 23 Feb 2024 20:37:01 +0000 (21:37 +0100)]
net: stmmac: mmc_core: Drop interrupt registers from stats
The MMC IPC interrupt status and interrupt mask registers are
of little use as Ethernet statistics, but incrementing counters
based on the current interrupt and interrupt mask registers
makes them actively misleading.
For example, if the interrupt mask is set to 0x08420842,
the current code will increment by that amount each iteration,
leading to the following sequence of nonsense:
These registers have been included in the Ethernet statistics
since the first version of MMC back in 2011 (commit 1c901a46d57).
That commit also mentions the MMC interrupts as
"something to add later (if actually useful)".
If the registers are actually useful, they should probably
be part of the Ethernet register dump instead of statistics,
but for now, drop the counters for mmc_rx_ipc_intr and
mmc_rx_ipc_intr_mask completely.
Ciprian Regus [Fri, 23 Feb 2024 16:21:27 +0000 (18:21 +0200)]
net: ethernet: adi: adin1110: Reduce the MDIO_TRDONE poll interval
In order to do a clause 22 access to the PHY registers of the ADIN1110,
we have to write the MDIO frame to the ADIN1110_MDIOACC register, and
then poll the MDIO_TRDONE bit (for a 1) in the same register. The
device will set this bit to 1 once the internal MDIO transaction is
done. In practice, this bit takes ~50 - 60 us to be set.
The first attempt to poll the bit is right after the ADIN1110_MDIOACC
register is written, so it will always be read as 0. The next check will
only be done after 10 ms, which will result in the MDIO transactions
taking a long time to complete. Reduce this polling interval to 100 us.
Since this interval is short enough, switch the poll function to
readx_poll_timeout_atomic() instead.
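The effect of the polling interval can be worked through with a small timing model. This is a sketch, not the driver code or the readx_poll_timeout_atomic() macro: demo_poll_done_us() is a hypothetical function, it assumes the loop checks the bit first and then waits one full interval before checking again, and the timeout passed in is an arbitrary illustrative value.

```c
/* Return the simulated time (in microseconds) at which the poll loop
 * first observes the bit set, or -1 if the timeout expires first.
 * ready_at_us is when the hardware sets the bit. */
static long demo_poll_done_us(long ready_at_us, long interval_us, long timeout_us)
{
	long now = 0;

	for (;;) {
		if (now >= ready_at_us)
			return now;      /* bit observed as set */
		if (now >= timeout_us)
			return -1;       /* gave up */
		now += interval_us;  /* wait before the next check */
	}
}
```

With the bit becoming ready after 60 us, this model completes at 10000 us for a 10 ms interval and at 100 us for a 100 us interval, which matches the reasoning in the commit message.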
Paolo Abeni [Tue, 27 Feb 2024 10:24:06 +0000 (11:24 +0100)]
Merge branch 'net-ipa-don-t-abort-system-suspend'
Alex Elder says:
====================
net: ipa: don't abort system suspend
Currently the IPA code aborts an in-progress system suspend if an
IPA interrupt arrives before the suspend completes. There is no
need to do that though, because the IPA driver handles a forced
suspend correctly, quiescing any hardware activity before finally
turning off clocks and interconnects.
This series drops the call to pm_wakeup_dev_event() if an IPA
SUSPEND interrupt arrives during system suspend. Doing this
makes the two remaining IPA power flags unnecessary, and allows
some additional code to be cleaned up--and best of all, removed.
The result is much simpler (and I'm really glad not to be using
these flags any more).
The first patch implements the main change. The second and
third remove the flags that were used to determine whether to
call pm_wakeup_dev_event(). The next two remove a function that
becomes a trivial wrapper, and the last one just avoids writing
a register unnecessarily.
Note that the first two patches will have checkpatch warnings,
because checkpatch disagrees with my compiler on what to do when
a block contains only a semicolon. I went with what the compiler
recommends.
clang says: warning: suggest braces around empty body
checkpatch: WARNING: braces {} are not necessary for single statement blocks
Alex Elder [Fri, 23 Feb 2024 13:39:30 +0000 (07:39 -0600)]
net: ipa: don't bother zeroing an already zero register
In ipa_interrupt_suspend_clear_all(), if the SUSPEND_INFO register
read contains no set bits, there's no interrupt condition to clear.
Skip the write to the clear register in that case.
Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
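The change amounts to an early return when nothing is pending. A minimal sketch, assuming a fake register block in place of real MMIO (demo_ipa_regs and demo_suspend_clear_all() are hypothetical names, and the clear_writes counter exists only to make the skipped write observable):

```c
#include <stdint.h>

/* Fake register state standing in for the hardware. */
struct demo_ipa_regs {
	uint32_t suspend_info;  /* pending SUSPEND bits */
	uint32_t clear_writes;  /* number of writes to the clear register */
};

static void demo_suspend_clear_all(struct demo_ipa_regs *r)
{
	uint32_t val = r->suspend_info;  /* read SUSPEND_INFO */

	if (!val)
		return;                      /* nothing set: skip the write */

	r->suspend_info &= ~val;         /* model the effect of the clear write */
	r->clear_writes++;
}
```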
Alex Elder [Fri, 23 Feb 2024 13:39:29 +0000 (07:39 -0600)]
net: ipa: kill ipa_power_suspend_handler()
Now that ipa_power_suspend_handler() is a trivial wrapper around
ipa_interrupt_suspend_clear_all(), we can open-code it in the one
place it's used, and get rid of the function.
Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Alex Elder [Fri, 23 Feb 2024 13:39:28 +0000 (07:39 -0600)]
net: ipa: move ipa_interrupt_suspend_clear_all() up
The next patch makes ipa_interrupt_suspend_clear_all() static,
calling it only within "ipa_interrupt.c". Move its definition
higher in the file so no declaration is needed.
Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Alex Elder [Fri, 23 Feb 2024 13:39:27 +0000 (07:39 -0600)]
net: ipa: kill the IPA_POWER_FLAG_RESUMED flag
The IPA_POWER_FLAG_RESUMED was originally used to avoid calling
pm_wakeup_dev_event() more than once when handling a SUSPEND
interrupt. This call is no longer made, so there's no need for the
flag; get rid of it.
That leaves no more IPA power flags usefully defined, so just get
rid of the bitmap in the IPA power structure and the definition of
the ipa_power_flag enumerated type.
Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Alex Elder [Fri, 23 Feb 2024 13:39:26 +0000 (07:39 -0600)]
net: ipa: kill IPA_POWER_FLAG_SYSTEM
The SYSTEM IPA power flag is set, cleared, and tested. But nothing
happens based on its value when tested, so it serves no purpose.
Get rid of this flag.
Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Alex Elder [Fri, 23 Feb 2024 13:39:25 +0000 (07:39 -0600)]
net: ipa: don't bother aborting system resume
The IPA interrupt can fire if there is data to be delivered to a GSI
channel that is suspended. This condition occurs in three scenarios.
First, runtime power management automatically suspends the IPA
hardware after half a second of inactivity. This has nothing
to do with system suspend, so a SYSTEM IPA power flag is used to
avoid calling pm_wakeup_dev_event() when runtime suspended.
Second, if the system is suspended, the receipt of an IPA interrupt
should trigger a system resume. Configuring the IPA interrupt for
wakeup accomplishes this.
Finally, if system suspend is underway and the IPA interrupt fires,
we currently call pm_wakeup_dev_event() to abort the system suspend.
The IPA driver correctly handles quiescing the hardware before
suspending it, so there's really no need to abort a suspend in
progress in the third case. We can simply quiesce and suspend
things, and be done.
Incoming data can still wake the system after it's suspended.
The IPA interrupt has wakeup mode enabled, so if it fires *after*
we've suspended, it will trigger a wakeup (if not disabled via
sysfs).
Stop calling pm_wakeup_dev_event() to abort a system suspend in
progress in ipa_power_suspend_handler().
Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Heiner Kallweit [Tue, 20 Feb 2024 21:55:38 +0000 (22:55 +0100)]
net: phy: simplify genphy_c45_ethtool_set_eee
Simplify the function, no functional change intended.
- Remove the unneeded variable unsupp; I think the code is even more
readable now.
- Move setting phydev->eee_enabled out of the if clause
- Simplify return value handling
Jakub Kicinski [Tue, 27 Feb 2024 02:42:14 +0000 (18:42 -0800)]
Merge branch 'mptcp-various-small-improvements'
Matthieu Baerts says:
====================
mptcp: various small improvements
This series brings various small improvements to MPTCP and its
selftests:
Patch 1 prints an error if there are duplicated subtests names. It is
important to have unique (sub)tests names in TAP, because some CI
environments drop (sub)tests with duplicated names.
Patch 2 is a preparation for patches 3 and 4, which check the protocol
in tcp_sk() and mptcp_sk() with DEBUG_NET, only in code from net/mptcp/.
We recently had a case where an MPTCP socket was wrongly treated as a
TCP one, and fuzzers and static checkers never spotted the issue. These
checks should help prevent such issues in the future.
Patches 5 to 7 are some cleanup for the MPTCP selftests. These patches
are not supposed to change the behaviour.
Patch 8 sets the poll timeout in diag selftest to the same value as the
one used in the other selftests.
====================
Geliang Tang [Fri, 23 Feb 2024 20:18:00 +0000 (21:18 +0100)]
selftests: mptcp: diag: change timeout_poll to 30
Even though it has been set to 100ms from the beginning, with commit
df62f2ec3df6 ("selftests/mptcp: add diag interface tests"), there is
no reason not to set it to 30ms like all the other tests. "diag.sh" is
not supposed to be slower than the other ones.
To maintain consistency with other scripts, this patch changes it to 30.
Geliang Tang [Fri, 23 Feb 2024 20:17:58 +0000 (21:17 +0100)]
selftests: mptcp: simult flows: define missing vars
The variables 'large', 'small', 'sout', 'cout', 'capout' and 'size' are
used in multiple functions, so they should be clearly defined as global
variables at the top of the file.
This patch redefines them at the beginning of simult_flows.sh.
Geliang Tang [Fri, 23 Feb 2024 20:17:57 +0000 (21:17 +0100)]
selftests: mptcp: netlink: drop duplicate var ret
The variable 'ret' is defined twice in pm_netlink.sh. This patch drops
the duplicate, which has been there from the beginning, with
commit eedbc685321b ("selftests: add PM netlink functional tests").
mptcp: check the protocol in mptcp_sk() with DEBUG_NET
Fuzzers and static checkers might not detect when mptcp_sk() is used
with a non mptcp_sock structure.
This is similar to the parent commit, where it is easy to use mptcp_sk()
with a TCP sock, e.g. with a subflow sk.
So a new simple check is done when CONFIG_DEBUG_NET is enabled to tell
kernel devs when a non-MPTCP socket is being used as an MPTCP one.
'mptcp_sk()' macro is then defined differently: with an extra WARN to
complain when an unexpected socket is being used.
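The kind of check described here can be sketched in userspace as a cast helper that complains when the protocol does not match. This is an illustration under assumptions, not the kernel macro: the demo_* types, the DEMO_DEBUG_NET switch and the warning counter are all hypothetical, and the real helper emits a kernel WARN rather than printing to stderr. The protocol numbers are the standard IPPROTO_TCP (6) and IPPROTO_MPTCP (262) values.

```c
#include <stdio.h>

#define DEMO_DEBUG_NET 1  /* stands in for CONFIG_DEBUG_NET */

enum demo_proto { DEMO_PROTO_TCP = 6, DEMO_PROTO_MPTCP = 262 };

struct demo_sock {
	enum demo_proto protocol;
};

struct demo_mptcp_sock {
	struct demo_sock sk;  /* must be first so the cast below is valid */
	int token;
};

static int demo_warn_count;  /* lets callers observe that a warning fired */

/* Convert a generic socket pointer to an MPTCP socket pointer, warning
 * in debug builds when the socket is not actually an MPTCP one. */
static struct demo_mptcp_sock *demo_mptcp_sk(struct demo_sock *sk)
{
#if DEMO_DEBUG_NET
	if (sk->protocol != DEMO_PROTO_MPTCP) {
		demo_warn_count++;
		fprintf(stderr, "demo_mptcp_sk: unexpected protocol %d\n",
			(int)sk->protocol);
	}
#endif
	return (struct demo_mptcp_sock *)sk;
}
```

With the debug switch off, the helper compiles down to a plain cast, so the check costs nothing in production builds.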
mptcp: check the protocol in tcp_sk() with DEBUG_NET
Fuzzers and static checkers might not detect when tcp_sk() is used with
a non tcp_sock structure.
This kind of mistake already happened a few times with MPTCP: when
wrongly using TCP-specific helpers with mptcp_sock pointers. On the
other hand, there are many 'tcp_xxx()' helpers that are taking a 'struct
sock' pointer as arguments, and some of them are only looking at fields
from 'struct sock', and nothing from 'struct tcp_sock'. It is then
tempting to use them with a 'struct mptcp_sock'.
So a new simple check is done when CONFIG_DEBUG_NET is enabled to tell
kernel devs when a non-TCP socket is being used as a TCP one. 'tcp_sk()'
macro is then re-defined to add a WARN when an unexpected socket is
being used.
It is important to have a unique (sub)test name in TAP, because some CI
environments drop tests with duplicated names.
When adding a new subtest entry, an error message is printed in case of
duplicated entries. If there were duplicated entries and if all features
were expected to work, the script exits with an error at the end, after
having printed all subtests in the TAP format. Thanks to that, the MPTCP
CI will catch such issues early.
Eric Dumazet [Fri, 23 Feb 2024 20:10:54 +0000 (20:10 +0000)]
ipv6: anycast: complete RCU handling of struct ifacaddr6
struct ifacaddr6 objects are already freed after an RCU grace period.
Add the __rcu qualifier to the aca_next pointer and to idev->ac_list.
Add the relevant rcu_assign_pointer() and dereference accessors.
ipv6_chk_acast_dev() no longer needs to acquire idev->lock.
/proc/net/anycast6 is now purely RCU protected, it no
longer acquires idev->lock.
Similarly in6_dump_addrs() can use RCU protection to iterate
through anycast addresses. It was relying on a mixture of RCU
and RTNL but next patches will get rid of RTNL there.
Breno Leitao [Fri, 23 Feb 2024 11:58:37 +0000 (03:58 -0800)]
net/vsockmon: Leverage core stats allocator
With commit 34d21de99cea9 ("net: Move {l,t,d}stats allocation to core and
convert veth & vrf"), stats allocation can be done in the net core
instead of in this driver.
With this new approach, the driver doesn't have to bother with error
handling (allocation failure checking, making sure free happens in the
right spot, etc). This is the core's responsibility now.
Remove the allocation in the vsockmon driver and leverage the network
core allocation instead.
Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://lore.kernel.org/r/20240223115839.3572852-1-leitao@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David S. Miller [Mon, 26 Feb 2024 13:09:09 +0000 (13:09 +0000)]
Merge branch 'pcs-xpcs-cleanups'
Serge Semin says:
====================
net: pcs: xpcs: Cleanups before adding MMIO dev support
As stated in the subject this series is a short prequel before submitting
the main patches adding the memory-mapped DW XPCS support to the DW XPCS
and DW *MAC (STMMAC) drivers. Originally it was a part of the bigger
patchset (see the changelog v2 link below) but was detached to a
preparation set to shrink down the main series, thus simplifying its
review.
The patchset's content is straightforward: drop the redundant sentinel
entry and the header files; return EINVAL errno from the soft-reset method
and make sure that the interface validation method returns EINVAL straight
away if the requested interface isn't supported by the XPCS device
instance. All of these changes are required to simplify the changes being
introduced a bit later in the framework of the memory-mapped DW XPCS
support patches.
Serge Semin [Thu, 22 Feb 2024 17:58:23 +0000 (20:58 +0300)]
net: pcs: xpcs: Explicitly return error on caps validation
If an unsupported interface is passed to the PCS validation callback there
is no need for further link-mode calculations since the resultant array
will be initialized with zeros which will be perceived by the phylink
subsystem as error anyway (see phylink_validate_mac_and_pcs()). Instead
let's explicitly return the -EINVAL error to inform the caller about the
unsupported interface as it's done in the rest of the pcs_validate
callbacks.
Signed-off-by: Serge Semin <fancer.lancer@gmail.com> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Serge Semin [Thu, 22 Feb 2024 17:58:22 +0000 (20:58 +0300)]
net: pcs: xpcs: Return EINVAL in the internal methods
In particular the xpcs_soft_reset() and xpcs_do_config() functions
currently return -1 if an invalid auto-negotiation mode is specified. That
value might be then passed to the generic kernel subsystems which require
a standard kernel errno value. Even though the erroneous conditions are
very specific (memory corruption or buggy driver implementation) using a
hard-coded -1 literal doesn't seem correct anyway especially when it comes
to passing it higher to the network subsystem or printing to the system
log. Convert the hard-coded error values to -EINVAL then.
Signed-off-by: Serge Semin <fancer.lancer@gmail.com> Tested-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
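A short sketch of the convention this commit moves to, assuming a simplified mode enum (demo_an_mode and demo_soft_reset() are hypothetical and do not mirror the driver's real definitions):

```c
#include <errno.h>

/* Hypothetical auto-negotiation modes, for illustration only. */
enum demo_an_mode { DEMO_AN_C73, DEMO_AN_C37_SGMII };

static int demo_soft_reset(int an_mode)
{
	switch (an_mode) {
	case DEMO_AN_C73:
	case DEMO_AN_C37_SGMII:
		return 0;        /* a real driver would reset the hardware here */
	default:
		return -EINVAL;  /* a standard errno instead of a bare -1 */
	}
}
```

One practical reason to avoid the literal: on Linux errno value 1 is EPERM, so a bare -1 that reaches a caller or the system log would be decoded as "Operation not permitted" rather than as an invalid argument.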
Serge Semin [Thu, 22 Feb 2024 17:58:20 +0000 (20:58 +0300)]
net: pcs: xpcs: Drop sentinel entry from 2500basex ifaces list
There are currently only two methods (xpcs_find_compat() and
xpcs_get_interfaces()) defined in the driver which loop over the available
interfaces. All of them rely on the xpcs_compat::num_interfaces field
value to get the total number of supported interfaces. Thus the interface
arrays are supposed to be filled with actual interface IDs and there is no
need for the dummy terminating ID placed at the end of the arrays.
Based on the above drop the PHY_INTERFACE_MODE_MAX entry from the
xpcs_2500basex_interfaces array and the PHY_INTERFACE_MODE_MAX-based
conditional statement from the xpcs_get_interfaces() method as redundant.
Signed-off-by: Serge Semin <fancer.lancer@gmail.com> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
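The count-based iteration that makes the sentinel redundant can be sketched as follows. The demo_* names and the reduced interface enum are hypothetical; only the idea of bounding the loop by a num_interfaces field comes from the commit message.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical interface IDs, for illustration only. */
enum demo_iface { DEMO_IFACE_SGMII, DEMO_IFACE_2500BASEX, DEMO_IFACE_MAX };

struct demo_compat {
	const enum demo_iface *interfaces;
	size_t num_interfaces;  /* explicit count, so no sentinel is needed */
};

static const enum demo_iface demo_2500basex_ifaces[] = {
	DEMO_IFACE_2500BASEX,
	/* no trailing DEMO_IFACE_MAX terminator */
};

static const struct demo_compat demo_2500basex_compat = {
	.interfaces = demo_2500basex_ifaces,
	.num_interfaces = sizeof(demo_2500basex_ifaces) /
			  sizeof(demo_2500basex_ifaces[0]),
};

/* Walk exactly num_interfaces entries; the loop never looks for a
 * terminating value. */
static bool demo_supports(const struct demo_compat *c, enum demo_iface iface)
{
	for (size_t i = 0; i < c->num_interfaces; i++)
		if (c->interfaces[i] == iface)
			return true;
	return false;
}
```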
David S. Miller [Mon, 26 Feb 2024 11:46:13 +0000 (11:46 +0000)]
Merge branch 'rtnetlink-reduce-rtnl-pressure'
Eric Dumazet says:
====================
rtnetlink: reduce RTNL pressure for dumps
This series restarts the conversion of rtnl dump operations
to RCU protection, instead of requiring RTNL.
In this new attempt (prior one failed in 2011), I chose to
allow a gradual conversion of selected operations.
After this series, "ip -6 addr" and "ip -4 ro" no longer
need to acquire RTNL.
I refrained from changing inet_dump_ifaddr() and inet6_dump_addr()
to avoid merge conflicts because of two fixes in net tree.
I also started the work for "ip link" future conversion.
v2: rtnl_fill_link_ifmap() always emit IFLA_MAP (Jiri Pirko)
Added "nexthop: allow nexthop_mpath_fill_node()
to be called without RTNL" to avoid a lockdep splat (Ido Schimmel)
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 22 Feb 2024 10:50:21 +0000 (10:50 +0000)]
rtnetlink: provide RCU protection to rtnl_fill_prop_list()
We want to be able to run rtnl_fill_ifinfo() under RCU protection
instead of RTNL in the future.
dev->name_node items are already rcu protected.
Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 22 Feb 2024 10:50:19 +0000 (10:50 +0000)]
inet: switch inet_dump_fib() to RCU protection
No longer hold RTNL while calling inet_dump_fib().
Also change return value for a completed dump:
Returning 0 instead of skb->len allows NLMSG_DONE
to be appended to the skb. User space does not have
to call us again to get a standalone NLMSG_DONE marker.
Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 22 Feb 2024 10:50:18 +0000 (10:50 +0000)]
nexthop: allow nexthop_mpath_fill_node() to be called without RTNL
nexthop_mpath_fill_node() will be potentially called
from contexts holding rcu_lock instead of RTNL.
Suggested-by: Ido Schimmel <idosch@nvidia.com> Link: https://lore.kernel.org/all/ZdZDWVdjMaQkXBgW@shredder/ Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 22 Feb 2024 10:50:17 +0000 (10:50 +0000)]
inet: allow ip_valid_fib_dump_req() to be called with RTNL or RCU
Add a new field into struct fib_dump_filter, to let callers
tell if they use RTNL locking or RCU.
This is used in the following patch, when inet_dump_fib()
no longer holds RTNL.
Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 22 Feb 2024 10:50:16 +0000 (10:50 +0000)]
ipv6: switch inet6_dump_ifinfo() to RCU protection
No longer hold RTNL while calling inet6_dump_ifinfo()
Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 22 Feb 2024 10:50:15 +0000 (10:50 +0000)]
rtnetlink: add RTNL_FLAG_DUMP_UNLOCKED flag
Similarly to RTNL_FLAG_DOIT_UNLOCKED, this new flag
allows dump operations registered via rtnl_register()
or rtnl_register_module() to opt-out from RTNL protection.
Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 22 Feb 2024 10:50:14 +0000 (10:50 +0000)]
rtnetlink: change nlk->cb_mutex role
In commit af65bdfce98d ("[NETLINK]: Switch cb_lock spinlock
to mutex and allow to override it"), Patrick McHardy used
a common mutex to protect both nlk->cb and the dump() operations.
The override is used for rtnl dumps, registered with
rtnl_register() and rtnl_register_module().
We want to be able to opt out some dump() operations
so that they do not acquire RTNL, so we need to protect nlk->cb
with a per-socket mutex.
This patch renames nlk->cb_def_mutex to nlk->nl_cb_mutex.
The optional pointer to the mutex used to protect the dump()
call is stored in nlk->dump_cb_mutex.
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 22 Feb 2024 10:50:12 +0000 (10:50 +0000)]
netlink: fix netlink_diag_dump() return value
__netlink_diag_dump() returns 1 if the dump is not complete,
zero if no error occurred.
If err variable is zero, this means the dump is complete:
We should not return skb->len in this case, but 0.
This allows NLMSG_DONE to be appended to the skb.
User space does not have to call us again only to get NLMSG_DONE.
Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
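The saving from returning 0 at the end of a dump can be illustrated with a simplified model of the calling convention described above: a positive return value means "call me again", while 0 means the dump is complete and the end marker can go into the same buffer. Everything here is hypothetical (demo_dump, demo_dump_once(), demo_count_recvmsg()); entries are counted instead of bytes, and each delivered buffer is assumed to cost user space one recvmsg() call.

```c
/* Hypothetical dump state: entries left to emit and how many fit in
 * one buffer. legacy selects the old "return skb->len" behaviour. */
struct demo_dump {
	int remaining;
	int per_call;
	int legacy;
};

/* Fill one buffer. Returns >0 to ask for another call, 0 when done. */
static int demo_dump_once(struct demo_dump *d, int *skb_len)
{
	int n = d->remaining < d->per_call ? d->remaining : d->per_call;

	d->remaining -= n;
	*skb_len += n;

	if (d->remaining > 0)
		return *skb_len;               /* more entries to come */
	return d->legacy ? *skb_len : 0;   /* old code returned the length */
}

/* Count how many buffers (and so recvmsg() calls) a full dump takes. */
static int demo_count_recvmsg(int entries, int per_call, int legacy)
{
	struct demo_dump d = { entries, per_call, legacy };
	int calls = 0;

	for (;;) {
		int skb_len = 0;
		int ret = demo_dump_once(&d, &skb_len);

		calls++;           /* one buffer delivered per pass */
		if (ret == 0)
			return calls;  /* end marker rides in this buffer */
	}
}
```

In this model a 3-entry dump that fits in one buffer takes two calls with the old return value and one call with the new one; the extra call in the old case exists only to fetch the standalone end marker.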
Eric Dumazet [Thu, 22 Feb 2024 10:50:11 +0000 (10:50 +0000)]
ipv6: use xarray iterator to implement inet6_dump_ifinfo()
Prepare inet6_dump_ifinfo() to run with RCU protection
instead of RTNL and use for_each_netdev_dump() interface.
Also properly return 0 at the end of a dump, avoiding
an extra recvmsg() system call and RTNL acquisition.
Note that RTNL-less dumps need core changes, coming later
in the series.
Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 22 Feb 2024 10:50:09 +0000 (10:50 +0000)]
ipv6: prepare inet6_fill_ifla6_attrs() for RCU
We want to no longer hold RTNL while calling inet6_fill_ifla6_attrs()
in the future. Add needed READ_ONCE()/WRITE_ONCE() annotations.
Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 26 Feb 2024 11:38:45 +0000 (11:38 +0000)]
Merge branch 'dp83826'
Jérémie Dautheribes says:
====================
Add support for TI DP83826 configuration
This short patch series introduces the possibility of overriding
some parameters which are latched by default by hardware straps on the
TI DP83826 PHY.
The settings that can be overridden include:
- Configuring the PHY in either MII mode or RMII mode.
- When in RMII mode, configuring the PHY in RMII slave mode or RMII
master mode.
The RMII master/slave mode is TI-specific and determines whether the PHY
operates from a 25MHz reference clock (master mode) or from a 50MHz
reference clock (slave mode).
While these features should be supported by all the TI DP8382x family,
I have only been able to test them on TI DP83826 hardware. Therefore,
support has been added specifically for this PHY in this patch series.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jérémie Dautheribes [Thu, 22 Feb 2024 10:31:17 +0000 (11:31 +0100)]
net: phy: dp83826: support configuring RMII master/slave operation mode
The TI DP83826 PHY can operate in one of two RMII modes:
- master mode (PHY operates from a 25MHz clock reference)
- slave mode (PHY operates from a 50MHz clock reference)
By default, the operation mode is configured by hardware straps.
Add support to configure the operation mode from within the driver.
Signed-off-by: Jérémie Dautheribes <jeremie.dautheribes@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jérémie Dautheribes [Thu, 22 Feb 2024 10:31:16 +0000 (11:31 +0100)]
net: phy: dp83826: Add support for phy-mode configuration
The TI DP83826 PHY can operate in either MII mode or RMII mode.
By default, it is configured by straps.
It can also be configured by writing bit 5 of register 0x17, the RMII
and Status Register (RCSR).
When phydev->interface is rmii, RMII mode must be enabled; otherwise
MII mode must be set.
This guards against misconfigured hw straps.
Signed-off-by: Jérémie Dautheribes <jeremie.dautheribes@bootlin.com> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
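The register update described in this commit is a single-bit read-modify-write. The sketch below shows only the bit arithmetic on a 16-bit register value; the macro and function names are hypothetical, no MDIO access is modelled, and it assumes that a set bit 5 selects RMII and a cleared bit 5 selects MII.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_RCSR_REG        0x17        /* RMII and Status Register */
#define DEMO_RCSR_RMII_MODE  (1u << 5)   /* assumed: 1 = RMII, 0 = MII */

/* Return the new RCSR value for the requested mode, leaving all other
 * bits of the register untouched. */
static uint16_t demo_rcsr_set_mode(uint16_t rcsr, bool rmii)
{
	if (rmii)
		return (uint16_t)(rcsr | DEMO_RCSR_RMII_MODE);
	return (uint16_t)(rcsr & ~DEMO_RCSR_RMII_MODE);
}
```

Because the mode is always written explicitly from the requested interface, the result does not depend on what the straps latched into the bit at reset.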
Jérémie Dautheribes [Thu, 22 Feb 2024 10:31:15 +0000 (11:31 +0100)]
dt-bindings: net: dp83822: support configuring RMII master/slave mode
Add property ti,rmii-mode to support selecting the RMII operation mode
between:
- master mode (PHY operates from a 25MHz clock reference)
- slave mode (PHY operates from a 50MHz clock reference)
If not set, the operation mode is configured by hardware straps.
Signed-off-by: Jérémie Dautheribes <jeremie.dautheribes@bootlin.com> Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Oleksij Rempel [Thu, 22 Feb 2024 07:51:13 +0000 (08:51 +0100)]
net: dsa: microchip: Add support for bridge port isolation
Implement bridge port isolation for KSZ switches. This allows switch
ports to be isolated from each other while maintaining connectivity with
the CPU and other forwarding ports. For instance, to isolate swp1 and swp2
from each other, use the following commands:
- bridge link set dev swp1 isolated on
- bridge link set dev swp2 isolated on
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Acked-by: Arun Ramadoss <arun.ramadoss@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
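The forwarding rule implied by the two commands above can be modelled with a small C sketch. It assumes the usual bridge semantics, where traffic is blocked only when both the source and the destination port are isolated. The bitmask representation and the function name are hypothetical and say nothing about how the KSZ driver programs the hardware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Model: bit N of isolated_mask is set when port N has "isolated on".
 * Forwarding is denied only when both ends are isolated. */
static bool may_forward(uint32_t isolated_mask, unsigned int src,
			unsigned int dst)
{
	assert(src < 32 && dst < 32);

	bool src_isolated = (isolated_mask >> src) & 1u;
	bool dst_isolated = (isolated_mask >> dst) & 1u;

	return !(src_isolated && dst_isolated);
}
```

With ports 1 and 2 marked isolated, the model blocks 1 to 2 and 2 to 1 but still allows either of them to reach a non-isolated port such as the CPU port.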
Jakub Kicinski [Thu, 22 Feb 2024 23:48:31 +0000 (15:48 -0800)]
tools: ynl: fix header guards
devlink and ethtool have a trailing _ in the header guard. I must have
copy/pasted it into new guards, assuming it's a headers_install artifact.
This fixes build if system headers are old.
Fixes: 8f109e91b852 ("tools: ynl: include dpll and mptcp_pm in C codegen") Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20240222234831.179181-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
tools: ynl: couple of cmdline enhancements
This is part of the original "netlink: specs: devlink: add the rest of
missing attribute definitions" set which was rejected [1]. These three
patches enhance cmdline usability, allowing flag attributes to be passed
with bool values and enum names instead of scalars.
====================
net: Don't bother filling in ethtool driver version
The drivers included in this series set the ethtool driver version to the
same as the default, UTS_RELEASE, so don't bother doing this.
As noted by Masahiro in [0], with CONFIG_MODVERSIONS=y, some drivers could
be built as modules against a different kernel tree with differing
UTS_RELEASE. As such, these changes could lead to a change in behaviour.
However, defaulting to the core kernel UTS_RELEASE would be expected
behaviour.
Simon Horman [Wed, 21 Feb 2024 17:46:21 +0000 (17:46 +0000)]
ps3/gelic: minor Kernel Doc corrections
* Update the Kernel Doc for gelic_descr_set_tx_cmdstat()
and gelic_net_setup_netdev() so that documented name
and the actual name of the function match.
* Move define of GELIC_ALIGN() so that it is no longer
between gelic_alloc_card_net() and its Kernel Doc.
* Document netdev parameter of gelic_alloc_card_net()
in a way consistent with the documentation of other netdev parameters
in this file.
Addresses the following warnings flagged by ./scripts/kernel-doc -none:
.../ps3_gelic_net.c:711: warning: expecting prototype for gelic_net_set_txdescr_cmdstat(). Prototype was for gelic_descr_set_tx_cmdstat() instead
.../ps3_gelic_net.c:1474: warning: expecting prototype for gelic_ether_setup_netdev(). Prototype was for gelic_net_setup_netdev() instead
.../ps3_gelic_net.c:1528: warning: expecting prototype for gelic_alloc_card_net(). Prototype was for GELIC_ALIGN() instead
.../ps3_gelic_net.c:1531: warning: Function parameter or struct member 'netdev' not described in 'gelic_alloc_card_net'
Florian Westphal [Thu, 22 Feb 2024 14:03:10 +0000 (15:03 +0100)]
net: mpls: error out if inner headers are not set
mpls_gso_segment() assumes skb_inner_network_header() returns
a valid result:
mpls_hlen = skb_inner_network_header(skb) - skb_network_header(skb);
if (unlikely(!mpls_hlen || mpls_hlen % MPLS_HLEN))
goto out;
if (unlikely(!pskb_may_pull(skb, mpls_hlen)))
With syzbot reproducer, skb_inner_network_header() yields 0,
skb_network_header() returns 108, so this will call
"pskb_may_pull(skb, -108)", which triggers a newly added
DEBUG_NET_WARN_ON_ONCE() check.
First iteration of this patch made mpls_hlen signed and changed
test to error out to "mpls_hlen <= 0 || ..".
Eric Dumazet said:
> I was thinking about adding a debug check in skb_inner_network_header()
> if inner_network_header is zero (that would mean it is not 'set' yet),
> but this would trigger even after your patch.
So add new skb_inner_network_header_was_set() helper and use that.
The syzbot reproducer injects data via packet socket. The skb that gets
allocated and passed down the stack has ->protocol set to NSH (0x894f)
and gso_type set to SKB_GSO_UDP | SKB_GSO_DODGY.
This gets passed to skb_mac_gso_segment(), which sees NSH as ptype to
find a callback for. nsh_gso_segment() retrieves next type:
proto = tun_p_to_eth_p(nsh_hdr(skb)->np);
... which is MPLS (TUN_P_MPLS_UC). It updates skb->protocol and then
calls mpls_gso_segment(). Inner offsets are all 0, so mpls_gso_segment()
ends up with a negative header size.
In case more callers rely on silent handling of such large may_pull values
we could also 'legalize' this behaviour, either replacing the debug check
with (len > INT_MAX) test or removing it and instead adding a comment
before existing
if (unlikely(len > skb->len))
return SKB_DROP_REASON_PKT_TOO_SMALL;
test in pskb_may_pull_reason(), saying that this check also implicitly
takes care of callers that miscompute header sizes.
Cc: Simon Horman <horms@kernel.org> Fixes: 219eee9c0d16 ("net: skbuff: add overflow debug check to pull/push helpers") Reported-by: syzbot+99d15fcdb0132a1e1a82@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/00000000000043b1310611e388aa@google.com/raw Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/20240222140321.14080-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
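The arithmetic behind this report can be reproduced in a user-space C sketch that works on plain offsets instead of skb pointers. The values 0 and 108 and the quoted sanity test come from the commit message. The function names are hypothetical, and treating an offset of 0 as "not set" is an assumption of this model rather than a statement about how skb_inner_network_header_was_set() is implemented.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MPLS_HLEN 4u /* bytes per MPLS label stack entry */

/* Model of the original test: compute the header length with unsigned
 * arithmetic and accept it if it is non-zero and a multiple of MPLS_HLEN. */
static bool mpls_hlen_accepted(uint32_t inner_off, uint32_t outer_off,
			       uint32_t *hlen_out)
{
	assert(hlen_out);

	uint32_t mpls_hlen = inner_off - outer_off; /* wraps if inner < outer */

	*hlen_out = mpls_hlen;
	return !(mpls_hlen == 0 || mpls_hlen % MPLS_HLEN);
}

/* Model of the fix: bail out before the subtraction when the inner
 * offset was never set (assumed here to mean "offset is 0"). */
static bool mpls_hlen_accepted_checked(uint32_t inner_off, uint32_t outer_off,
				       uint32_t *hlen_out)
{
	assert(hlen_out);

	if (inner_off == 0) {
		*hlen_out = 0;
		return false;
	}
	return mpls_hlen_accepted(inner_off, outer_off, hlen_out);
}
```

With inner offset 0 and outer offset 108, the unsigned result wraps to a large value that is still a multiple of 4, so the original test lets it through. The checked variant rejects the packet before any length is computed.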
Jann Horn [Tue, 20 Feb 2024 19:42:44 +0000 (20:42 +0100)]
net: ethtool: avoid rebuilds on UTS_RELEASE change
Currently, when you switch between branches or something like that and
rebuild, net/ethtool/ioctl.c has to be built again because it depends
on UTS_RELEASE.
By instead referencing a string variable stored in another object file,
this can be avoided.
Sneh Shah [Tue, 20 Feb 2024 05:07:35 +0000 (10:37 +0530)]
net: stmmac: dwmac-qcom-ethqos: Add support for 2.5G SGMII
Serdes phy needs to operate at 2500 mode for 2.5G speed and 1000
mode for 1G/100M/10M speed.
Added changes to configure serdes phy and mac based on link speed.
Changing serdes phy speed involves multiple register writes for
serdes block. To avoid redundant write operations only update serdes
phy when new speed is different.
For 2500 speed, MAC PCS autoneg needs to be disabled. Added changes to
disable MAC PCS autoneg if ANE parameter is not set.
Signed-off-by: Sneh Shah <quic_snehshah@quicinc.com> Tested-by: Abhishek Chauhan <quic_abchauha@quicinc.com> # sa8775p-ride Reviewed-by: Abhishek Chauhan <quic_abchauha@quicinc.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Through the routine bond_mii_monitor(), bonding driver inspects and commits
the slave state changes. When a slave state change coincides with a
failure to acquire the rtnl lock, the routine
bond_mii_monitor() reschedules itself to come around after 1 msec to commit
the new state.
During this, it executes the routine bond_miimon_inspect() to re-inspect
the state change and prints the corresponding slave state on the console.
Hence we see a message every 1 msec until the rtnl lock is acquired
and the state change is committed.
This patch doesn't change how bonding functions. It simply limits this
kind of log flood.
Signed-off-by: Praveen Kumar Kannoju <praveen.kannoju@oracle.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Hangbin Liu <liuhangbin@gmail.com> Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com> Link: https://lore.kernel.org/r/20240221082752.4660-1-praveen.kannoju@oracle.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
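The commit text above does not say which mechanism is used to limit the messages. As a generic illustration only, a windowed rate limiter can be sketched in user-space C as follows. The structure, field names, and the choice of interval are all hypothetical, and timestamps are passed in explicitly so the behaviour is deterministic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical limiter: allow at most `burst` messages per window of
 * `interval_ms` milliseconds and count the ones that were dropped. */
struct log_ratelimit {
	uint64_t interval_ms;
	unsigned int burst;
	uint64_t window_start_ms;
	unsigned int printed;
	unsigned int suppressed;
};

/* Return true if a message may be printed at time now_ms. */
static bool log_allowed(struct log_ratelimit *rl, uint64_t now_ms)
{
	assert(rl);

	/* Start a fresh window once the current one has expired. */
	if (now_ms - rl->window_start_ms >= rl->interval_ms) {
		rl->window_start_ms = now_ms;
		rl->printed = 0;
	}

	if (rl->printed < rl->burst) {
		rl->printed++;
		return true;
	}

	rl->suppressed++;
	return false;
}
```

With a 5 second window and a burst of one, a retry loop that fires every millisecond would print a single message per window instead of one per retry.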
Jakub Kicinski [Fri, 23 Feb 2024 03:06:20 +0000 (19:06 -0800)]
Merge tag 'nf-next-24-02-21' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Florian Westphal says:
====================
netfilter updates for net-next
1. Prefer KMEM_CACHE() macro to create kmem caches, from Kunwu Chan.
Patches 2 and 3 consolidate nf_log NULL checks and introduce
extra boundary checks on family and type to make it clear that no out
of bounds access will happen. No in-tree user currently passes such
values, but that's not clear from looking at the function.
From Pablo Neira Ayuso.
Patch 4, also from Pablo, gets rid of unneeded conditional in
nft_osf init function.
Patch 5, from myself, fixes erroneous Kconfig dependencies that
came in an earlier net-next pull request. This should get rid
of the xtables related build failure reports.
Patches 6 to 10 are an update to nftables' concatenated-ranges
set type to speed up element insertions. This series also
compacts a few data structures and cleans up a few oddities such
as reliance on ZERO_SIZE_PTR when asking to allocate a set with
no elements. From myself.
Patch 11 moves the nf_reinject function from the netfilter core
(vmlinux) into the nfnetlink_queue backend, the only location where
this is called from. Also from myself.
Patch 12, from Kees Cook, switches xtables' compat layer to use
unsafe_memcpy because xt_entry_target cannot easily get converted
to a real flexible array (it's UAPI and used inside other structs).
* tag 'nf-next-24-02-21' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: x_tables: Use unsafe_memcpy() for 0-sized destination
netfilter: move nf_reinject into nfnetlink_queue modules
netfilter: nft_set_pipapo: use GFP_KERNEL for insertions
netfilter: nft_set_pipapo: speed up bulk element insertions
netfilter: nft_set_pipapo: shrink data structures
netfilter: nft_set_pipapo: do not rely on ZERO_SIZE_PTR
netfilter: nft_set_pipapo: constify lookup fn args where possible
netfilter: xtables: fix up kconfig dependencies
netfilter: nft_osf: simplify init path
netfilter: nf_log: validate nf_logger_find_get()
netfilter: nf_log: consolidate check for NULL logger in lookup function
netfilter: expect: Simplify the allocation of slab caches in nf_conntrack_expect_init
====================
Breno Leitao [Wed, 21 Feb 2024 16:17:32 +0000 (08:17 -0800)]
ipv6/sit: Do not allocate stats in the driver
With commit 34d21de99cea9 ("net: Move {l,t,d}stats allocation to core and
convert veth & vrf"), stats allocation could be done on net core
instead of this driver.
With this new approach, the driver doesn't have to bother with error
handling (allocation failure checking, making sure free happens in the
right spot, etc). This is core responsibility now.
Remove the allocation in the ipv6/sit driver and leverage the network
core allocation.
Jakub Kicinski [Thu, 22 Feb 2024 23:11:18 +0000 (15:11 -0800)]
Merge tag 'wireless-next-2024-02-22' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next
Kalle Valo says:
====================
wireless-next patches for v6.9
The third "new features" pull request for v6.9. This is a quick
followup to send commit 04edb5dc68f4 ("wifi: ath12k: Fix uninitialized
use of ret in ath12k_mac_allocate()") to fix the ath12k clang warning
introduced in the previous pull request.
We also have support for QCA2066 in ath11k, several new features in
ath12k and a few other changes in drivers. In stack it's mostly cleanup
and refactoring.
Major changes:
ath12k
* firmware-2.bin support
* support having multiple identical PCI devices (firmware needs to
have ATH12K_FW_FEATURE_MULTI_QRTR_ID)
* QCN9274: support split-PHY devices
* WCN7850: enable Power Save Mode in station mode
* WCN7850: P2P support
ath11k:
* QCA6390 & WCN6855: support 2 concurrent station interfaces
* QCA2066 support
iwlwifi
* mvm: support wider-bandwidth OFDMA
* bump firmware API to 90 for BZ/SC devices
brcmfmac
* DMI nvram filename quirk for ACEPC W5 Pro
* tag 'wireless-next-2024-02-22' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (75 commits)
wifi: wilc1000: revert reset line logic flip
wifi: brcmfmac: Add DMI nvram filename quirk for ACEPC W5 Pro
wifi: rtlwifi: set initial values for unexpected cases of USB endpoint priority
wifi: rtl8xxxu: check vif before using in rtl8xxxu_tx()
wifi: rtlwifi: rtl8192cu: Fix TX aggregation
wifi: wilc1000: remove AKM suite be32 conversion for external auth request
wifi: nl80211: refactor parsing CSA offsets
wifi: nl80211: force WLAN_AKM_SUITE_SAE in big endian in NL80211_CMD_EXTERNAL_AUTH
wifi: iwlwifi: load b0 version of ucode for HR1/HR2
wifi: iwlwifi: handle per-phy statistics from fw
wifi: iwlwifi: iwl-fh.h: fix kernel-doc issues
wifi: iwlwifi: api: fix kernel-doc reference
wifi: iwlwifi: mvm: unlock mvm if there is no primary link
wifi: iwlwifi: bump FW API to 90 for BZ/SC devices
wifi: iwlwifi: mvm: support PHY context version 6
wifi: iwlwifi: mvm: partially support PHY context version 6
wifi: iwlwifi: mvm: support wider-bandwidth OFDMA
wifi: cfg80211: use ML element parsing helpers
wifi: mac80211: align ieee80211_mle_get_bss_param_ch_cnt()
wifi: cfg80211: refactor RNR parsing
...
====================
- ipv6: sr: fix possible use-after-free and null-ptr-deref
- mptcp: fix several data races
- phonet: take correct lock to peek at the RX queue
Misc:
- handful of fixes and reliability improvements for selftests"
* tag 'net-6.8.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (72 commits)
l2tp: pass correct message length to ip6_append_data
net: phy: realtek: Fix rtl8211f_config_init() for RTL8211F(D)(I)-VD-CG PHY
selftests: ioam: refactoring to align with the fix
Fix write to cloned skb in ipv6_hop_ioam()
phonet/pep: fix racy skb_queue_empty() use
phonet: take correct lock to peek at the RX queue
net: sparx5: Add spinlock for frame transmission from CPU
net/sched: flower: Add lock protection when remove filter handle
devlink: fix port dump cmd type
net: stmmac: Fix EST offset for dwmac 5.10
tools: ynl: don't leak mcast_groups on init error
tools: ynl: make sure we always pass yarg to mnl_cb_run
net: mctp: put sock on tag allocation failure
netfilter: nf_tables: use kzalloc for hook allocation
netfilter: nf_tables: register hooks last when adding new chain/flowtable
netfilter: nft_flow_offload: release dst in case direct xmit path is used
netfilter: nft_flow_offload: reset dst in route object after setting up flow
netfilter: nf_tables: set dormant flag on hook register failure
selftests: tls: add test for peeking past a record of a different type
selftests: tls: add test for merging of same-type control messages
...
Linus Torvalds [Thu, 22 Feb 2024 17:23:22 +0000 (09:23 -0800)]
Merge tag 'trace-v6.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fix from Steven Rostedt:
- While working on the ring buffer I noticed that the counter used for
knowing where the end of the data is on a sub-buffer was not a full
"int" but just 20 bits. It was masked out to 0xfffff.
With the new code that allows the user to change the size of the
sub-buffer, it is theoretically possible to ask for a size bigger
than 2^20. If that happens, unexpected results may occur as there's
no code checking if the counter overflowed the 20 bits of the write
mask. There are other checks to make sure events fit in the
sub-buffer, but if the sub-buffer itself is too big, that is not
checked.
Add a check in the resize of the sub-buffer to make sure that it
never goes beyond the size of the counter that holds how much data is
on it.
* tag 'trace-v6.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
ring-buffer: Do not let subbuf be bigger than write mask
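The overflow risk described in this pull request can be illustrated with a short C sketch. The 20-bit mask 0xfffff comes from the text above. The function names are hypothetical, and the exact boundary used here (data capacity no larger than the mask) is an assumption of the sketch, not necessarily the condition the kernel applies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* 20-bit write counter mask, as described in the pull request text. */
#define WRITE_MASK 0xfffffUL

static_assert(WRITE_MASK == (1UL << 20) - 1, "mask covers exactly 20 bits");

/* What a masked counter reports for a true byte offset: anything above
 * the mask is silently lost. */
static size_t masked_offset(size_t offset)
{
	return offset & WRITE_MASK;
}

/* Hypothetical resize check: accept a sub-buffer data capacity only if
 * every offset up to and including it survives the mask. */
static bool subbuf_capacity_ok(size_t capacity)
{
	return capacity != 0 && capacity <= WRITE_MASK;
}
```

For example, a true offset of 2^20 + 5 would be reported as 5 by the masked counter, which is why the size has to be rejected at resize time rather than detected later.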
The current Ntuple filter implementation has a limitation on 5750X (P5)
and newer chips. The destination ring of the ntuple filter must be
a valid ring in the RSS indirection table. Ntuple filters may not work
if the RSS indirection table is modified by the user to only contain a
subset of the rings. If an ntuple filter is set to a ring destination
that is not in the RSS indirection table, the packet matching that
filter will be placed in a random ring instead of the specified
destination ring.
This series of patches will fix the problem by using a separate VNIC
for ntuple filters. The default VNIC will be dedicated for RSS and
so the indirection table can be setup in any way and will not affect
ntuple filters using the separate VNIC.
Quite a bit of refactoring is needed to do the VNIC and RSS
context accounting in the first few patches. This is technically a
bug fix, but I think the changes are too big for -net.
====================
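The limitation described in this cover letter can be modelled in a few lines of C. This is a simplified sketch of the constraint only: the table layout, the function names, and the `separate_vnic` flag are hypothetical and do not reflect the bnxt_en data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Return true if `ring` appears anywhere in the RSS indirection table. */
static bool ring_in_rss_table(const uint16_t *tbl, size_t n, uint16_t ring)
{
	assert(tbl != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
		if (tbl[i] == ring)
			return true;
	return false;
}

/* Model of when an ntuple filter destination is reliable:
 * - shared VNIC: the ring must also be present in the RSS table;
 * - separate VNIC: any existing ring is acceptable. */
static bool ntuple_dest_ok(bool separate_vnic, const uint16_t *tbl, size_t n,
			   uint16_t ring, uint16_t nr_rings)
{
	if (ring >= nr_rings)
		return false;
	if (separate_vnic)
		return true;
	return ring_in_rss_table(tbl, n, ring);
}
```

If the user restricts the indirection table to rings 0 and 1, a filter aimed at ring 3 fails the shared-VNIC check but passes once a separate VNIC is used.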
Pavan Chebbi [Tue, 20 Feb 2024 23:03:17 +0000 (15:03 -0800)]
bnxt_en: Use the new VNIC to create ntuple filters
The newly created vnic (BNXT_VNIC_NTUPLE) is ready to be used to create
ntuple filters when supported by firmware. All RX rings can be used
regardless of the RSS indirection setting on the default VNIC.
Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com> Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Pavan Chebbi [Tue, 20 Feb 2024 23:03:16 +0000 (15:03 -0800)]
bnxt_en: Create and setup the additional VNIC for adding ntuple filters
Allocate and setup the additional VNIC for ntuple filters if this
new method is supported by the firmware. Even though this VNIC is
only used for ntuple filters with direct ring destinations, we still
setup the RSS hash to be identical to the default VNIC so that each
RX packet will have the correct hash in the RX completion. This
VNIC is always at VNIC index BNXT_VNIC_NTUPLE.
Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com> Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Pavan Chebbi [Tue, 20 Feb 2024 23:03:15 +0000 (15:03 -0800)]
bnxt_en: Provision for an additional VNIC for ntuple filters
On newer chips that support the ring table index method for
ntuple filters, the current scheme of using the same VNIC for
both RSS and ntuple filters will not work in all cases. An
ntuple filter can only be directed to a destination ring if
that destination ring is also in the RSS indirection table.
To support ntuple filters with any arbitrary RSS indirection
table that may only include a subset of the rings, we need to
use a separate VNIC for ntuple filters.
This patch provisions the additional VNIC. The next patch will
allocate additional VNIC from firmware and set it up.
Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com> Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Pavan Chebbi [Tue, 20 Feb 2024 23:03:13 +0000 (15:03 -0800)]
bnxt_en: Refactor bnxt_set_features()
Refactor bnxt_set_features() function to have a common
function to re-init. We'll need this to reinitialize when
ntuple configuration changes.
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com> Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Venkat Duvvuru [Tue, 20 Feb 2024 23:03:12 +0000 (15:03 -0800)]
bnxt_en: Add bnxt_get_total_vnics() to calculate number of VNICs
Refactor the code by adding a new function to calculate the number of
required VNICs. This is used in multiple places when reserving or
checking resources.
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Signed-off-by: Venkat Duvvuru <venkatkumar.duvvuru@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Michael Chan [Tue, 20 Feb 2024 23:03:11 +0000 (15:03 -0800)]
bnxt_en: Check additional resources in bnxt_check_rings()
bnxt_check_rings() is called to check if we have enough resource
assets to satisfy the new number of ethtool channels. If the asset
test fails, the ethtool operation will fail gracefully. Otherwise
we will proceed and commit to use the new number of channels. If it
fails to allocate any resources, the chip will fail to come up.
For completeness, check all possible resources before committing to
the new settings. Add the missing ring group and RSS context asset
tests in bnxt_check_rings().
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Add RSS context fields to struct bnxt_hw_rings and struct bnxt_hw_resc.
With these, we can now specify the exact number of RSS contexts to
reserve and store the reserved value. The original code relies on
other resources to infer the number of RSS contexts to reserve and the
reserved value is not stored. This improved infrastructure will make
the RSS context accounting more complete and is needed by later
patches.
Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Michael Chan [Tue, 20 Feb 2024 23:03:09 +0000 (15:03 -0800)]
bnxt_en: Explicitly specify P5 completion rings to reserve
The current code assumes that every RX ring group and every TX ring
requires a completion ring on P5_PLUS chips. Now that we have the
bnxt_hw_rings structure, add the cp_p5 field so that it can
be explicitly specified. This makes the logic more clear.
Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com> Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Michael Chan [Tue, 20 Feb 2024 23:03:08 +0000 (15:03 -0800)]
bnxt_en: Refactor ring reservation functions
The current functions to reserve hardware rings pass in 6 different ring
or resource types as parameters. Add a structure bnxt_hw_rings to
consolidate all these parameters and pass the structure pointer instead
to these functions. Add 2 related helper functions also. This makes
the code cleaner and makes it easier to add new resources to be
reserved.
Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com> Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
====================
MCTP core protocol updates, minor fixes & tests
This series implements some protocol improvements for AF_MCTP,
particularly for systems with multiple MCTP networks defined. For those,
we need to add the network ID to the tag lookups, which then suggests an
updated version of the tag allocate / drop ioctl to allow the net ID to
be specified there too.
The ioctl change affects uabi, so might warrant some extra attention.
There are also a couple of new kunit tests for multiple-net
configurations.
We have a fix for populating the flow data when fragmenting, and a
testcase for that too.
Of course, any queries/comments/etc., please let me know!
====================
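The reason the network ID belongs in the tag lookup can be shown with a minimal C sketch. The key structure and function below are hypothetical and are not the AF_MCTP implementation. They only illustrate that the same peer address and tag value can legitimately appear on two different MCTP networks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical lookup key: without `net`, two networks that reuse the
 * same endpoint ID and tag would be indistinguishable. */
struct tag_key {
	unsigned int net;
	uint8_t peer_eid;
	uint8_t tag;
};

/* Return true if the key matches an incoming (net, peer, tag) triple. */
static bool tag_key_matches(const struct tag_key *key, unsigned int net,
			    uint8_t peer_eid, uint8_t tag)
{
	assert(key);

	return key->net == net &&
	       key->peer_eid == peer_eid &&
	       key->tag == tag;
}
```

A key allocated for network 1 does not match a packet carrying the same endpoint ID and tag on network 2, which is the ambiguity that a lookup without the network ID cannot resolve.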
Jeremy Kerr [Mon, 19 Feb 2024 09:51:55 +0000 (17:51 +0800)]
net: mctp: tests: Test that outgoing skbs have flow data populated
When CONFIG_MCTP_FLOWS is enabled, outgoing skbs should have their
SKB_EXT_MCTP extension set for drivers to consume.
Add two tests for local-to-output routing that check for the flow
extensions: one for the simple single-packet case, and one for
fragmentation.
We now make MCTP_TEST select MCTP_FLOWS, so we always get coverage of
these flow tests. The tests are skippable if MCTP_FLOWS is (otherwise)
disabled, but that would need manual config tweaking.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au> Signed-off-by: Paolo Abeni <pabeni@redhat.com>