David Howells [Wed, 4 Dec 2024 07:46:41 +0000 (07:46 +0000)]
rxrpc: Fix CPU time starvation in I/O thread
Starvation can happen in the rxrpc I/O thread because it goes back to the
top of the I/O loop after it does any one thing without trying to give any
other connection or call CPU time. Also, because it processes one call
packet at a time, it tries to do the retransmission loop after each ACK
without checking to see if there are other ACKs already in the queue that
can update the SACK state.
Fix this by:
(1) Add a received-packet queue on each call.
(2) Distribute packets from the master Rx queue to the individual call,
conn and error queues and 'poking' calls to add them to the attend
queue first thing in the I/O thread.
(3) Go through all the attention-seeking connections and calls before
going back to the top of the I/O thread. Each queue is extracted as a
whole and then gone through so that new additions to insert themselves
into the queue.
(4) Make the call event handler go through all the packets currently on
the call's rx_queue before transmitting and retransmitting DATA
packets.
(5) Drop the skb argument from the call event handler as this is now
replaced with the rx_queue. Instead, keep track of whether we
received a packet or an ACK for the tests that used to rely on that.
David Howells [Wed, 4 Dec 2024 07:46:40 +0000 (07:46 +0000)]
rxrpc: Add a tracepoint to show variables pertinent to jumbo packet size
Add a tracepoint to be called right before packets are transmitted for the
first time that shows variable values that are pertinent to how many
subpackets will be added to a jumbo DATA packet.
David Howells [Wed, 4 Dec 2024 07:46:39 +0000 (07:46 +0000)]
rxrpc: Prepare to be able to send jumbo DATA packets
Prepare to be able to send jumbo DATA packets if the we decide to, but
don't enable that yet. This will allow larger chunks of data to be sent
without reducing the retryability as the subpackets in a jumbo packet can
also be retransmitted individually.
David Howells [Wed, 4 Dec 2024 07:46:38 +0000 (07:46 +0000)]
rxrpc: Separate the packet length from the data length in rxrpc_txbuf
Separate the packet length from the data length (txb->len) stored in the
rxrpc_txbuf to make security calculations easier. Also store the
allocation size as that's an upper bound on the size of the security
wrapper and change a number of fields to unsigned short as the amount of
data can't exceed the capacity of a UDP packet.
Also, whilst we're at it, use kzalloc() for txbufs.
David Howells [Wed, 4 Dec 2024 07:46:37 +0000 (07:46 +0000)]
rxrpc: Implement path-MTU probing using padded PING ACKs (RFC8899)
Implement path-MTU probing (along the lines of RFC8899) by padding some of
the PING ACKs we send. PING ACKs get their own individual responses quite
apart from the acking of data (though, as ACKs, they fulfil that role
also).
The probing concentrates on packet sizes that correspond how many
subpackets can be stuffed inside a jumbo packet as jumbo DATA packets are
just aggregations of individual DATA packets and can be split easily for
retransmission purposes.
If we want to perform probing, we advertise this by setting the maximum
number of jumbo subpackets to 0 in the ack trailer when we send an ACK and
see if the peer is also advertising the service. This is interpreted by
non-supporting Rx stacks as an indication that jumbo packets aren't
supported.
The MTU sizes advertised in the ACK trailer AF_RXRPC transmits are pegged
at a maximum of 1444 unless pmtud is supported by both sides.
David Howells [Wed, 4 Dec 2024 07:46:35 +0000 (07:46 +0000)]
rxrpc: Request an ACK on impending Tx stall
Set the REQUEST-ACK flag on the DATA packet we're about to send if we're
about to stall transmission because the app layer isn't keeping up
supplying us with data to transmit.
David Howells [Wed, 4 Dec 2024 07:46:32 +0000 (07:46 +0000)]
rxrpc: Clean up Tx header flags generation handling
Clean up the generation of the header flags when building packet headers
for transmission:
(1) Assemble the flags in a local variable rather than in the txb->flags.
(2) Do the flags masking and JUMBO-PACKET setting in one bit of code for
both the main header and the jumbo headers.
(3) Generate the REQUEST-ACK flag afresh each time. There's a possibility
we might want to do jumbo retransmission packets in future.
(4) Pass the local flags variable to the rxrpc_tx_data tracepoint rather
than the combination of the txb flags and the wire header flags (the
latter belong only to the first subpacket).
David Howells [Wed, 4 Dec 2024 07:46:30 +0000 (07:46 +0000)]
rxrpc: Fix handling of received connection abort
Fix the handling of a connection abort that we've received. Though the
abort is at the connection level, it needs propagating to the calls on that
connection. Whilst the propagation bit is performed, the calls aren't then
woken up to go and process their termination, and as no further input is
forthcoming, they just hang.
Also add some tracing for the logging of connection aborts.
Fixes: 248f219cb8bc ("rxrpc: Rewrite the data and ack handling code") Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org Link: https://patch.msgid.link/20241204074710.990092-3-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David Howells [Wed, 4 Dec 2024 07:46:29 +0000 (07:46 +0000)]
ktime: Add us_to_ktime()
Add a us_to_ktime() helper to go with ms_to_ktime() and ns_to_ktime().
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Thomas Gleixner <tglx@linutronix.de>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org Link: https://patch.msgid.link/20241204074710.990092-2-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
cn10k-ipsec: Add outbound inline ipsec support
This patch series adds outbound inline ipsec support on Marvell
cn10k series of platform. One crypto hardware logical function
(cpt-lf) per netdev is required for inline ipsec outbound
functionality. Software prepare and submit crypto hardware
(CPT) instruction for outbound inline ipsec crypto mode offload.
The CPT instruction have details for encryption and authentication
Crypto hardware encrypt, authenticate and provide the ESP packet
to network hardware logic to transmit ipsec packet.
First patch makes dma memory writable for in-place encryption,
Second patch moves code to common file, Third patch disable
backpressure on crypto (CPT) and network (NIX) hardware.
Patch four onwards enables inline outbound ipsec.
v9->v10:
- Removed unlikely() in data-patch and used static_branch when at least
a SA is configured.
- Added missing READ_ONCE() as per comment on previous patch
- Removed "\n" from end of extack messages
- Poll for context write status check reduced to 100ms from 10s
v8->v9:
- Removed mutex lock to use hardware, now using hardware state
- Previous versions were supporting only 64 SAs and a bitmap was
used for same. That limitation is removed from this version.
- Replaced netdev_err with NL_SET_ERR_MSG_MOD in state add flow
as per comment in previous version
v7->v8:
- spell correction in patch 1/8 (s/sdk/skb)
v6->v7:
- skb data was mapped as device writeable but it was not ensured
that skb is writeable. This version calls skb_unshare() to make
skb data writeable (Thanks Jakub Kicinski for pointing out).
v4->v5:
- Fixed un-initialized warning and pointer check
(comment from Kalesh Anakkur Purayil)
v3->v4:
- Few error messages in data-path removed and some moved
under netif_msg_tx_err().
- Added check for crypto offload (XFRM_DEV_OFFLOAD_CRYPTO)
Thanks "Leon Romanovsky" for pointing out
- Fixed codespell error as per comment from Simon Horman
- Added some other cleanup comment from Kalesh Anakkur Purayil
v2->v3:
- Fix smatch and sparse errors (Comment from Simon Horman)
- Fix build error with W=1 (Comment from Simon Horman)
https://patchwork.kernel.org/project/netdevbpf/patch/20240513105446.297451-6-bbhushan2@marvell.com/
- Some other minor cleanup as per comment
https://www.spinics.net/lists/netdev/msg997197.html
v1->v2:
- Fix compilation error to build driver a module
- Use dma_wmb() instead of architecture specific barrier
- Fix couple of other compilation warnings
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Bharat Bhushan [Wed, 4 Dec 2024 05:56:57 +0000 (11:26 +0530)]
cn10k-ipsec: Process outbound ipsec crypto offload
Prepare and submit crypto hardware (CPT) instruction for
outbound ipsec crypto offload. The CPT instruction have
authentication offset, IV offset and encapsulation offset
in input packet. Also provide SA context pointer which have
details about algo, keys, salt etc. Crypto hardware encrypt,
authenticate and provide the ESP packet to networking hardware.
Signed-off-by: Bharat Bhushan <bbhushan2@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Bharat Bhushan [Wed, 4 Dec 2024 05:56:56 +0000 (11:26 +0530)]
cn10k-ipsec: Add SA add/del support for outb ipsec crypto offload
This patch adds support to add and delete Security Association
(SA) xfrm ops. Hardware maintains SA context in memory allocated
by software. Each SA context is 128 byte aligned and size of
each context is multiple of 128-byte. Add support for transport
and tunnel ipsec mode, ESP protocol, aead aes-gcm-icv16, key size
128/192/256-bits with 32bit salt.
Signed-off-by: Bharat Bhushan <bbhushan2@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Bharat Bhushan [Wed, 4 Dec 2024 05:56:55 +0000 (11:26 +0530)]
cn10k-ipsec: Init hardware for outbound ipsec crypto offload
One crypto hardware logical function (cpt-lf) per netdev is
required for outbound ipsec crypto offload. Allocate, attach
and initialize one crypto hardware function when enabling
outbound ipsec crypto offload. Crypto hardware function will
be detached and freed on disabling outbound ipsec crypto
offload.
Signed-off-by: Bharat Bhushan <bbhushan2@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Bharat Bhushan [Wed, 4 Dec 2024 05:56:54 +0000 (11:26 +0530)]
octeontx2-af: Disable backpressure between CPT and NIX
NIX can assert backpressure to CPT on the NIX<=>CPT link.
Keep the backpressure disabled for now. NIX block anyways
handles backpressure asserted by MAC due to PFC or flow
control pkts.
Signed-off-by: Bharat Bhushan <bbhushan2@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Bharat Bhushan [Wed, 4 Dec 2024 05:56:52 +0000 (11:26 +0530)]
octeontx2-pf: map skb data as device writeable
Crypto hardware need write permission for in-place encrypt
or decrypt operation on skb-data to support IPsec crypto
offload. That patch uses skb_unshare to make skb data writeable
for ipsec crypto offload and map skb fragment memory as
device read-write.
Signed-off-by: Bharat Bhushan <bbhushan2@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
====================
net: net: add negotiation of in-band capabilities (remainder)
Here are the last three patches which were not included in the non-RFC
posting, but were in the RFC posting. These add the .pcs_inband()
method to the Lynx, MTK Lynx and XPCS drivers.
====================
Stas Sergeev [Thu, 5 Dec 2024 07:36:14 +0000 (10:36 +0300)]
tun: fix group permission check
Currently tun checks the group permission even if the user have matched.
Besides going against the usual permission semantic, this has a
very interesting implication: if the tun group is not among the
supplementary groups of the tun user, then effectively no one can
access the tun device. CAP_SYS_ADMIN still can, but its the same as
not setting the tun ownership.
This patch relaxes the group checking so that either the user match
or the group match is enough. This avoids the situation when no one
can access the device even though the ownership is properly set.
Also I simplified the logic by removing the redundant inversions:
tun_not_capable() --> !tun_capable()
Signed-off-by: Stas Sergeev <stsp2@yandex.ru> Reviewed-by: Willem de Bruijn <willemb@google.com> Acked-by: Jason Wang <jasowang@redhat.com> Link: https://patch.msgid.link/20241205073614.294773-1-stsp2@yandex.ru Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Johannes Berg [Fri, 6 Dec 2024 10:30:57 +0000 (11:30 +0100)]
tools: ynl-gen-c: don't require -o argument
Without -o the tool currently crashes, but it's not marked
as required. The only thing we can't do without it is to
generate the correct #include for user source files, but
we can put a placeholder instead.
====================
net: Convert some UDP tunnel drivers to NETDEV_PCPU_STAT_DSTATS.
VXLAN, Geneve and Bareudp use various device counters for managing
RX and TX statistics:
* VXLAN uses the device core_stats for RX and TX drops, tstats for
regular RX/TX counters and DEV_STATS_INC() for various types of
RX/TX errors.
* Geneve uses tstats for regular RX/TX counters and DEV_STATS_INC()
for everything else, include RX/TX drops.
* Bareudp, was recently converted to follow VXLAN behaviour, that is,
device core_stats for RX and TX drops, tstats for regular RX/TX
counters and DEV_STATS_INC() for other counter types.
Let's consolidate statistics management around the dstats counters
instead. This avoids using core_stats in VXLAN and Bareudp, as
core_stats is supposed to be used by core networking code only (and not
in drivers). This also allows Geneve to avoid using atomic increments
when updating RX and TX drop counters, as dstats is per-cpu. Finally,
this also simplifies the code as all three modules now handle stats in
the same way and with only two different sets of counters (the per-cpu
dstats and the atomic DEV_STATS_INC()).
Patch 1 creates dstats helper functions that can be used outside of VRF
(until then, dstats was VRF-specific).
Then patches 2 to 4, convert VXLAN, Geneve and Bareudp, one by one.
====================
Guillaume Nault [Wed, 4 Dec 2024 12:11:32 +0000 (13:11 +0100)]
bareudp: Handle stats using NETDEV_PCPU_STAT_DSTATS.
Bareudp uses the TSTATS infrastructure (dev_sw_netstats_*()) for RX
packet counters. It was also recently converted to use the device core
stats (dev_core_stats_*()) for RX and TX drops (see commit 788d5d655bc9
("bareudp: Use pcpu stats to update rx_dropped counter.")).
Since core stats are to be avoided in drivers, and for consistency with
VXLAN and Geneve, let's convert packet stats handling to DSTATS, which
can handle RX/TX stats and packet drops. Statistics that don't fit
DSTATS are still updated atomically with DEV_STATS_INC().
Guillaume Nault [Wed, 4 Dec 2024 12:11:30 +0000 (13:11 +0100)]
geneve: Handle stats using NETDEV_PCPU_STAT_DSTATS.
Geneve uses the TSTATS infrastructure (dev_sw_netstats_*()) for RX
packet counters. All other counters are handled using atomic increments
with DEV_STATS_INC().
Let's convert packet stats handling to DSTATS, which has a per-cpu
counter for packet drops too, to avoid the cost of atomic increments
in these cases. Statistics that don't fit DSTATS are still updated
atomically with DEV_STATS_INC().
Guillaume Nault [Wed, 4 Dec 2024 12:11:27 +0000 (13:11 +0100)]
vxlan: Handle stats using NETDEV_PCPU_STAT_DSTATS.
VXLAN uses the TSTATS infrastructure (dev_sw_netstats_*()) for RX and
TX packet counters. It also uses the device core stats
(dev_core_stats_*()) for RX and TX drops.
Let's consolidate that using the DSTATS infrastructure, which can
handle both packet counters and packet drops. Statistics that don't
fit DSTATS are still updated atomically with DEV_STATS_INC().
While there, convert the "len" variable of vxlan_encap_bypass() to
unsigned int, to respect the types of skb->len and
dev_dstats_[rt]x_add().
Guillaume Nault [Wed, 4 Dec 2024 12:11:21 +0000 (13:11 +0100)]
vrf: Make pcpu_dstats update functions available to other modules.
Currently vrf is the only module that uses NETDEV_PCPU_STAT_DSTATS.
In order to make this kind of statistics available to other modules,
we need to define the update functions in netdevice.h.
Therefore, let's define dev_dstats_*() functions for RX and TX packet
updates (packets, bytes and drops). Use these new functions in vrf.c
instead of vrf_rx_stats() and the other manual counter updates.
While there, update the type of the "len" variables to "unsigned int",
so that there're aligned with both skb->len and the new dstats update
functions.
Jakub Kicinski [Sat, 7 Dec 2024 01:53:29 +0000 (17:53 -0800)]
Merge branch 'lan78xx-preparations-for-phylink'
Oleksij Rempel says:
====================
lan78xx: Preparations for PHYlink
This patch set is part of the preparatory work for migrating the lan78xx
USB Ethernet driver to the PHYlink framework. During extensive testing,
I observed that resetting the USB adapter can lead to various read/write
errors. While the errors themselves are acceptable, they generate
excessive log messages, resulting in significant log spam. This set
improves error handling to reduce logging noise by addressing errors
directly and returning early when necessary.
Key highlights of this series include:
- Enhanced error handling to reduce log spam while preserving the
original error values, avoiding unnecessary overwrites.
- Improved error reporting using the `%pe` specifier for better clarity
in log messages.
- Removal of redundant and problematic PHY fixups for LAN8835 and
KSZ9031, with detailed explanations in the respective patches.
- Cleanup of code structure, including unified `goto` labels for better
readability and maintainability, even in simple editors.
====================
Oleksij Rempel [Wed, 4 Dec 2024 08:41:42 +0000 (09:41 +0100)]
net: usb: lan78xx: Improve error handling in dataport and multicast writes
Update `lan78xx_dataport_write` and `lan78xx_deferred_multicast_write`
to:
- Handle errors during register read/write operations.
- Exit immediately on errors and log them using `%pe` for clarity.
- Avoid silent failures by propagating error codes properly.
Oleksij Rempel [Wed, 4 Dec 2024 08:41:41 +0000 (09:41 +0100)]
net: usb: lan78xx: Add error handling to lan78xx_irq_bus_sync_unlock
Update `lan78xx_irq_bus_sync_unlock` to handle errors in register
read/write operations. If an error occurs, log it and exit the function
appropriately. This ensures proper handling of failures during IRQ
synchronization.
Oleksij Rempel [Wed, 4 Dec 2024 08:41:39 +0000 (09:41 +0100)]
net: usb: lan78xx: Add error handling to lan78xx_init_ltm
Convert `lan78xx_init_ltm` to return error codes and handle errors
properly. Previously, errors during the LTM initialization process were
not propagated, potentially leading to undetected issues. This patch
ensures:
- Errors in `lan78xx_read_reg` and `lan78xx_write_reg` are checked and
handled.
- Errors are logged with detailed messages using `%pe` for clarity.
- The function exits immediately on error, returning the error code.
Oleksij Rempel [Wed, 4 Dec 2024 08:41:38 +0000 (09:41 +0100)]
net: usb: lan78xx: Improve error handling in EEPROM and OTP operations
Refine error handling in EEPROM and OTP read/write functions by:
- Return error values immediately upon detection.
- Avoid overwriting correct error codes with `-EIO`.
- Preserve initial error codes as they were appropriate for specific
failures.
- Use `-ETIMEDOUT` for timeout conditions instead of `-EIO`.
Oleksij Rempel [Wed, 4 Dec 2024 08:41:37 +0000 (09:41 +0100)]
net: usb: lan78xx: Fix error handling in MII read/write functions
Ensure proper error handling in `lan78xx_mdiobus_read` and
`lan78xx_mdiobus_write` by checking return values of register read/write
operations and returning errors to the caller.
Oleksij Rempel [Wed, 4 Dec 2024 08:41:36 +0000 (09:41 +0100)]
net: usb: lan78xx: Improve error reporting with %pe specifier
Replace integer error codes with the `%pe` format specifier in register
read and write error messages. This change provides human-readable error
strings, making logs more informative and debugging easier.
Oleksij Rempel [Wed, 4 Dec 2024 08:41:34 +0000 (09:41 +0100)]
net: usb: lan78xx: Remove KSZ9031 PHY fixup
Remove the KSZ9031RNX PHY fixup from the lan78xx driver. The fixup applied
specific RGMII pad skew configurations globally, but these settings violate the
RGMII specification and cause more harm than benefit.
Key issues with the fixup:
1. **Non-Compliant Timing**: The fixup's delay settings fall outside the RGMII
specification requirements of 1.5 ns to 2.0 ns:
- RX Path: Total delay of **2.16 ns** (PHY internal delay of 1.2 ns + 0.96
ns skew).
- TX Path: Total delay of **0.96 ns**, significantly below the RGMII minimum
of 1.5 ns.
2. **Redundant or Incorrect Configurations**:
- The RGMII skew registers written by the fixup do not meaningfully alter
the PHY's default behavior and fail to account for its internal delays.
- The TX_DATA pad skew was not configured, relying on power-on defaults
that are insufficient for RGMII compliance.
3. **Micrel Driver Support**: By setting `PHY_INTERFACE_MODE_RGMII_ID`, the
Micrel driver can calculate and assign appropriate skew values for the
KSZ9031 PHY. This ensures better timing configurations without relying on
external fixups.
4. **System Interference**: The fixup applied globally, reconfiguring all
KSZ9031 PHYs in the system, even those unrelated to the LAN78xx adapter.
This could lead to unintended and harmful behavior on unrelated interfaces.
While the fixup is removed, a better mechanism is still needed to dynamically
determine the optimal combination of PHY and MAC delays to fully meet RGMII
requirements without relying on Device Tree or global fixups. This would allow
for robust operation across different hardware configurations.
The Micrel driver is capable of using the interface mode value to calculate and
apply better skew values, providing a configuration much closer to the RGMII
specification than the fixup. Removing the fixup ensures better default
behavior and prevents harm to other system interfaces.
Oleksij Rempel [Wed, 4 Dec 2024 08:41:33 +0000 (09:41 +0100)]
net: usb: lan78xx: Remove LAN8835 PHY fixup
Remove the PHY fixup for the LAN8835 PHY in the lan78xx driver due to
the following reasons:
- There is no publicly available information about the LAN8835 PHY.
However, it appears to be the integrated PHY used in the LAN7800 and
LAN7850 USB Ethernet controllers. These PHYs use the GMII interface,
not RGMII as configured by the fixup.
- The correct driver for handling the LAN8835 PHY functionality is the
Microchip PHY driver (`drivers/net/phy/microchip.c`), which properly
supports these integrated PHYs.
- The PHY ID `0x0007C130` is actually used by the LAN8742A PHY, which
only supports RMII. This interface is incompatible with the LAN78xx
MAC, as the LAN7801 (the only LAN78xx version without an integrated
PHY) supports only RGMII.
- The mask applied for this fixup is overly broad, inadvertently
covering both Microchip LAN88xx PHYs and unrelated SMSC LAN8742A PHYs,
leading to potential conflicts with other devices.
- Testing has shown that removing this fixup for LAN7800 and LAN7850
does not result in any noticeable difference in functionality, as the
Microchip PHY driver (`drivers/net/phy/microchip.c`) handles all
necessary configurations for these integrated PHYs.
- Registering this fixup globally (not limited to USB devices) risks
conflicts by unintentionally modifying other interfaces whenever a
LAN7801 adapter is connected to the system.
Note that both LAN7800 and LAN7850 USB Ethernet controllers use an
integrated PHY with the ID `0x0007C132`. Additionally, the LAN7515, a
specialized part for Raspberry Pi, includes an integrated LAN7800 USB
Ethernet controller and USB hub in a multifunctional chip design, and it
also uses the same PHY ID (`0x0007C132`).
Jakub Kicinski [Sat, 7 Dec 2024 01:47:34 +0000 (17:47 -0800)]
Merge branch 'net-phylib-eee-cleanups'
Russell King says:
====================
net: phylib EEE cleanups
Clean up phylib's EEE support. Patches previously posted as RFC as part
of the phylink EEE series.
Patch 1 changes the Marvell driver to use the state we store in
struct phy_device, rather than manually calling
phydev->eee_cfg.eee_enabled.
Patch 2 avoids genphy_c45_ethtool_get_eee() setting ->eee_enabled, as
we copy that from phydev->eee_cfg.eee_enabled later, and after patch 3
mo one uses this after calling genphy_c45_ethtool_get_eee(). In fact,
the only caller of this function now is phy_ethtool_get_eee().
As all callers to genphy_c45_eee_is_active() now pass NULL as its
is_enabled flag, this is no longer useful. Remove the argument in
patch 3.
Patch 4 updates the phylib documentation to make it absolutely clear
that phy_ethtool_get_eee() now fills in all members of struct
ethtool_keee, which is why we now have so many buggy network drivers.
====================
All callers to genphy_c45_eee_is_active() now pass NULL as the
is_enabled argument, which means we never use the value computed
in this function. Remove the argument and clean up this function.
genphy_c45_ethtool_get_eee() is only called from phy_ethtool_get_eee(),
which then calls eeecfg_to_eee(). eeecfg_to_eee() will overwrite
keee.eee_enabled, so there's no point setting keee.eee_enabled in
genphy_c45_ethtool_get_eee(). Remove this assignment.
Joe Damato [Wed, 4 Dec 2024 16:32:39 +0000 (16:32 +0000)]
selftests: net: cleanup busy_poller.c
Fix various integer type conversions by using strtoull and a temporary
variable which is bounds checked before being casted into the
appropriate cfg_* variable for use by the test program.
While here:
- free the strdup'd cfg string for overall hygenie.
- initialize napi_id = 0 in setup_queue to avoid warnings on some
compilers.
Eric Dumazet [Wed, 4 Dec 2024 21:02:34 +0000 (21:02 +0000)]
net: tipc: remove one synchronize_net() from tipc_nametbl_stop()
tipc_exit_net() is very slow and is abused by syzbot.
tipc_nametbl_stop() is called for each netns being dismantled.
Calling synchronize_net() right before freeing tn->nametbl
is a big hammer.
Replace this with kfree_rcu().
Note that RCU is not properly used here, otherwise
tn->nametbl should be cleared before the synchronize_net()
or kfree_rcu(), or even before the cleanup loop.
We might need to fix this at some point.
Also note tipc uses other synchronize_rcu() calls,
more work is needed to make tipc_exit_net() much faster.
This is V3 of the phylink conversion for ucc_geth.
The main changes in this V3 are related to error handling in the patches
1 and 10 to report an error when the deprecated "interface" property is
found in DT. Doing so, I found and addressed some issues with the jump
labels in the error paths, impacting patches 1 and 10.
The rest of the changes are just a rebase on net-next.
Some of the V2 changes haven't been reviewed, so I stress out that I'm
still uncertain about the way WoL is handled is patches 4 and 10.
Thanks,
Maxime
Link to V1: https://lore.kernel.org/netdev/20241107170255.1058124-1-maxime.chevallier@bootlin.com/
Link to V2: https://lore.kernel.org/netdev/20241114153603.307872-1-maxime.chevallier@bootlin.com/
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Maxime Chevallier [Tue, 3 Dec 2024 12:43:20 +0000 (13:43 +0100)]
net: freescale: ucc_geth: Introduce a helper to check Reduced modes
A number of parallel MII interfaces also exist in a "Reduced" mode,
usually with higher clock rates and fewer data lines, to ease the
hardware design. This is what the 'R' stands for in RGMII, RMII, RTBI,
RXAUI, etc.
The UCC Geth controller has a special configuration bit that needs to be
set when the MII mode is one of the supported reduced modes.
Add a local helper for that.
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
Maxime Chevallier [Tue, 3 Dec 2024 12:43:19 +0000 (13:43 +0100)]
net: freescale: ucc_geth: Move the serdes configuration around
The uec_configure_serdes() function deals with serialized linkmodes
settings. It's used during the link bringup sequence. It is planned to
be used during the phylink conversion for mac configuration, but it
needs to me moved around in the process. To make the phylink port
clearer, this commit moves the function without any feature change.
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
Maxime Chevallier [Tue, 3 Dec 2024 12:43:18 +0000 (13:43 +0100)]
net: freescale: ucc_geth: Hardcode the preamble length to 7 bytes
The preamble length can be configured in ucc_geth, however it just
ends-up always being configured to 7 bytes, as nothing ever changes the
default value of 7.
Make that value the default value when the MACCFG2 register gets
initialized, and remove the code to configure that value altogether.
Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The frame length check is configured when the phy interface is setup.
However, it's configured according to an internal flag that is always
false. So, just make so that we disable the relevant bit in the MACCFG2
register upon accessing it for other MAC configuration operations.
Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Maxime Chevallier [Tue, 3 Dec 2024 12:43:16 +0000 (13:43 +0100)]
net: freescale: ucc_geth: Use the correct type to store WoL opts
The WoL opts are represented through a bitmask stored in a u32. As this
mask is copied as-is in the driver, make sure we use the exact same type
to store them internally.
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
Maxime Chevallier [Tue, 3 Dec 2024 12:43:15 +0000 (13:43 +0100)]
net: freescale: ucc_geth: Fix WOL configuration
The get/set_wol ethtool ops rely on querying the PHY for its WoL
capabilities, checking for the presence of a PHY and a PHY interrupts
isn't enough. Address that by cleaning up the WoL configuration
sequence.
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
Maxime Chevallier [Tue, 3 Dec 2024 12:43:14 +0000 (13:43 +0100)]
net: freescale: ucc_geth: Use netdev->phydev to access the PHY
As this driver pre-dates phylib, it uses a private pointer to get a
reference to the attached phy_device. Drop that pointer and use the
netdev's pointer instead.
Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Maxime Chevallier [Tue, 3 Dec 2024 12:43:13 +0000 (13:43 +0100)]
net: freescale: ucc_geth: split adjust_link for phylink conversion
Preparing the phylink conversion, split the adjust_link callbaclk, by
clearly separating the mac configuration, link_up and link_down phases.
Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Maxime Chevallier [Tue, 3 Dec 2024 12:43:12 +0000 (13:43 +0100)]
net: freescale: ucc_geth: Drop support for the "interface" DT property
In april 2007, ucc_geth was converted to phylib with :
commit 728de4c927a3 ("ucc_geth: migrate ucc_geth to phylib").
In that commit, the device-tree property "interface", that could be used to
retrieve the PHY interface mode was deprecated.
DTS files that still used that property were converted along the way, in
the following commit, also dating from april 2007 :
commit 0fd8c47cccb1 ("[POWERPC] Replace undocumented interface properties in dts files")
17 years later, there's no users of that property left and I hope it's
safe to say we can remove support from that in the ucc_geth driver,
making the probe() function a bit simpler.
Should there be any users that have a DT that was generated when 2.6.21 was
cutting-edge, print an error message with hints on how to convert the
devicetree if the 'interface' property is found.
With that property gone, we can greatly simplify the parsing of the
phy-interface-mode from the devicetree by using of_get_phy_mode(),
allowing the removal of the open-coded parsing in the driver.
Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
====================
xdp: a fistful of generic changes pt. I
XDP for idpf is currently 6 chapters:
* convert Rx to libeth;
* convert Tx and stats to libeth;
* generic XDP and XSk code changes (you are here);
* generic XDP and XSk code additions;
* actual XDP for idpf via new libeth_xdp;
* XSk for idpf (via ^).
Part III does the following:
* improve &xdp_buff_xsk cacheline placement;
* does some cleanups with marking read-only bpf_prog and xdp_buff
arguments const for some generic functions;
* allows attaching already registered XDP memory model to RxQ info;
* makes system percpu page_pools valid XDP memory models;
* starts using netmems in the XDP core code (1 function);
* allows mixing pages from several page_pools within one XDP frame;
* optimizes &xdp_frame layout and removes no-more-used field.
Bullets 4-6 are the most important ones. All of them are prereqs to
libeth_xdp.
====================
Alexander Lobakin [Tue, 3 Dec 2024 17:37:31 +0000 (18:37 +0100)]
page_pool: make page_pool_put_page_bulk() handle array of netmems
Currently, page_pool_put_page_bulk() indeed takes an array of pointers
to the data, not pages, despite the name. As one side effect, when
you're freeing frags from &skb_shared_info, xdp_return_frame_bulk()
converts page pointers to virtual addresses and then
page_pool_put_page_bulk() converts them back. Moreover, data pointers
assume every frag is placed in the host memory, making this function
non-universal.
Make page_pool_put_page_bulk() handle array of netmems. Pass frag
netmems directly and use virt_to_netmem() when freeing xdpf->data,
so that the PP core will then get the compound netmem and take care
of the rest.
They do the same as their non-underscored buddies, but assume the netmem
is always page-backed. When working with header &page_pools, you don't
need to check whether netmem belongs to the host memory and you can
never get NULL instead of &page. Checks for the LSB, clearing the LSB,
branches take cycles and increase object code size, sometimes
significantly. When you're sure your PP is always host, you can avoid
this by using the underscored counterparts.
Toke Høiland-Jørgensen [Tue, 3 Dec 2024 17:37:29 +0000 (18:37 +0100)]
xdp: register system page pool as an XDP memory model
To make the system page pool usable as a source for allocating XDP
frames, we need to register it with xdp_reg_mem_model(), so that page
return works correctly. This is done in preparation for using the system
page_pool to convert XDP_PASS XSk frames to skbs; for the same reason,
make the per-cpu variable non-static so we can access it from other
source files as well (but w/o exporting).
Alexander Lobakin [Tue, 3 Dec 2024 17:37:28 +0000 (18:37 +0100)]
xsk: allow attaching XSk pool via xdp_rxq_info_reg_mem_model()
When you register an XSk pool as XDP Rxq info memory model, you then
need to manually attach it after the registration.
Let the user combine both actions into one by just passing a pointer
to the pool directly to xdp_rxq_info_reg_mem_model(), which will take
care of calling xsk_pool_set_rxq_info(). This looks similar to how a
&page_pool gets registered and reduce repeating driver code.
Alexander Lobakin [Tue, 3 Dec 2024 17:37:27 +0000 (18:37 +0100)]
xdp: allow attaching already registered memory model to xdp_rxq_info
One may need to register memory model separately from xdp_rxq_info. One
simple example may be XDP test run code, but in general, it might be
useful when memory model registering is managed by one layer and then
XDP RxQ info by a different one.
Allow such scenarios by adding a simple helper which "attaches"
already registered memory model to the desired xdp_rxq_info. As this
is mostly needed for Page Pool, add a special function to do that for
a &page_pool pointer.
Alexander Lobakin [Tue, 3 Dec 2024 17:37:26 +0000 (18:37 +0100)]
xdp, xsk: constify read-only arguments of some static inline helpers
Lots of read-only helpers for &xdp_buff and &xdp_frame, such as getting
the frame length, skb_shared_info etc., don't have their arguments
marked with `const` for no reason. Add the missing annotations to leave
less place for mistakes and more for optimization.
Alexander Lobakin [Tue, 3 Dec 2024 17:37:25 +0000 (18:37 +0100)]
bpf, xdp: constify some bpf_prog * function arguments
In lots of places, bpf_prog pointer is used only for tracing or other
stuff that doesn't modify the structure itself. Same for net_device.
Address at least some of them and add `const` attributes there. The
object code didn't change, but that may prevent unwanted data
modifications and also allow more helpers to have const arguments.
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Alexander Lobakin [Tue, 3 Dec 2024 17:37:24 +0000 (18:37 +0100)]
xsk: align &xdp_buff_xsk harder
After the series "XSk buff on a diet" by Maciej, the greatest pow-2
which &xdp_buff_xsk can be divided got reduced from 16 to 8 on x86_64.
Also, sizeof(xdp_buff_xsk) now is 120 bytes, which, taking the previous
sentence into account, leads to that it leaves 8 bytes at the end of
cacheline, which means an array of buffs will have its elements
messed between the cachelines chaotically.
Use __aligned_largest for this struct. This alignment is usually 16
bytes, which makes it fill two full cachelines and align an array
nicely. ___cacheline_aligned may be excessive here, especially on
arches with 128-256 byte CLs, as well as 32-bit arches (76 -> 96
bytes on MIPS32R2), while not doing better than _largest.
====================
net_sched: sch_sfq: reject limit of 1
The implementation does not properly support limits of 1. Add an
in-kernel check, in addition to existing iproute2 check, since other
tools may be used for configuration.
This patch set also adds a selfcheck to test that a limit of 1 is
rejected.
An alternative (or in addition) we could fix the implementation by
setting q->tail to NULL in sfq_drop if this is the last slot we marked
empty, e.g.:
--- a/net/sched/sch_sfq.c
+++ b/net/sched/sch_sfq.c
@@ -317,8 +317,11 @@ static unsigned int sfq_drop(struct Qdisc *sch, struct sk_buff **to_free)
/* It is difficult to believe, but ALL THE SLOTS HAVE LENGTH 1. */
x = q->tail->next;
slot = &q->slots[x];
- q->tail->next = slot->next;
q->ht[slot->hash] = SFQ_EMPTY_SLOT;
+ if (x == slot->next)
+ q->tail = NULL; /* no more active slots */
+ else
+ q->tail->next = slot->next;
goto drop;
}
====================
Octavian Purdila [Wed, 4 Dec 2024 03:05:19 +0000 (19:05 -0800)]
net_sched: sch_sfq: don't allow 1 packet limit
The current implementation does not work correctly with a limit of
1. iproute2 actually checks for this and this patch adds the check in
kernel as well.
This fixes the following syzkaller reported crash:
The crash can be also be reproduced with the following (with a tc
recompiled to allow for sfq limits of 1):
tc qdisc add dev dummy0 handle 1: root tbf rate 1Kbit burst 100b lat 1s
../iproute2-6.9.0/tc/tc qdisc add dev dummy0 handle 2: parent 1:10 sfq limit 1
ifconfig dummy0 up
ping -I dummy0 -f -c2 -W0.1 8.8.8.8
sleep 1
Scenario that triggers the crash:
* the first packet is sent and queued in TBF and SFQ; qdisc qlen is 1
* TBF dequeues: it peeks from SFQ which moves the packet to the
gso_skb list and keeps qdisc qlen set to 1. TBF is out of tokens so
it schedules itself for later.
* the second packet is sent and TBF tries to queues it to SFQ. qdisc
qlen is now 2 and because the SFQ limit is 1 the packet is dropped
by SFQ. At this point qlen is 1, and all of the SFQ slots are empty,
however q->tail is not NULL.
At this point, assuming no more packets are queued, when sch_dequeue
runs again it will decrement the qlen for the current empty slot
causing an underflow and the subsequent out of bounds access.
Reported-by: syzbot <syzkaller@googlegroups.com> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Octavian Purdila <tavip@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20241204030520.2084663-2-tavip@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
ethtool: generate uapi header from the spec
We keep expanding ethtool netlink api surface and this leads to
constantly playing catchup on the ynl spec side. There are a couple
of things that prevent us from fully converting to generating
the header from the spec (stats and cable tests), but we can
generate 95% of the header which is still better than maintaining
c header and spec separately. The series adds a couple of missing
features on the ynl-gen-c side and separates the parts
that we can generate into new ethtool_netlink_generated.h.
====================
Stanislav Fomichev [Wed, 4 Dec 2024 15:55:47 +0000 (07:55 -0800)]
ethtool: separate definitions that are gonna be generated
Reshuffle definitions that are gonna be generated into
ethtool_netlink_generated.h and match ynl spec order.
This should make it easier to compare the output of the ynl-gen-c
to the existing uapi header. No functional changes.
Things that are still remaining to be manually defined:
- ETHTOOL_FLAG_ALL - probably no good way to add to spec?
- some of the cable test bits (not sure whether it's possible to move to
spec)
- some of the stats definitions (no way currently to move to spec)
Stanislav Fomichev [Wed, 4 Dec 2024 15:55:45 +0000 (07:55 -0800)]
ynl: add missing pieces to ethtool spec to better match uapi header
- __ETHTOOL_UDP_TUNNEL_TYPE_CNT and render max
- skip rendering stringset (empty enum)
- skip rendering c33-pse-ext-state (defined in ethtool.h)
- rename header flags to ethtool-flag-
- add attr-cnt-name to each attribute to use XXX_CNT instead of XXX_MAX
- add unspec 0 entry to each attribute
- carry some doc entries from the existing header
- tcp-header-split
Stanislav Fomichev [Wed, 4 Dec 2024 15:55:44 +0000 (07:55 -0800)]
ynl: support directional specs in ynl-gen-c.py
The intent is to generate ethtool uapi headers. For now, some of the
things are hard-coded:
- <FAMILY>_MSG_{USER,KERNEL}_MAX
- the split between USER and KERNEL messages
- ipv6:
- avoid possible NULL deref in modify_prefix_route()
- release expired exception dst cached in socket
- smc: fix LGR and link use-after-free issue
- hsr: avoid potential out-of-bound access in fill_frame_info()
- can: hi311x: fix potential use-after-free
- eth: ice: fix VLAN pruning in switchdev mode
Previous releases - always broken:
- netfilter:
- ipset: hold module reference while requesting a module
- nft_inner: incorrect percpu area handling under softirq
- can: j1939: fix skb reference counting
- eth:
- mlxsw: use correct key block on Spectrum-4
- mlx5: fix memory leak in mlx5hws_definer_calc_layout"
* tag 'net-6.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (76 commits)
net :mana :Request a V2 response version for MANA_QUERY_GF_STAT
net: avoid potential UAF in default_operstate()
vsock/test: verify socket options after setting them
vsock/test: fix parameter types in SO_VM_SOCKETS_* calls
vsock/test: fix failures due to wrong SO_RCVLOWAT parameter
net/mlx5e: Remove workaround to avoid syndrome for internal port
net/mlx5e: SD, Use correct mdev to build channel param
net/mlx5: E-Switch, Fix switching to switchdev mode in MPV
net/mlx5: E-Switch, Fix switching to switchdev mode with IB device disabled
net/mlx5: HWS: Properly set bwc queue locks lock classes
net/mlx5: HWS: Fix memory leak in mlx5hws_definer_calc_layout
bnxt_en: handle tpa_info in queue API implementation
bnxt_en: refactor bnxt_alloc_rx_rings() to call bnxt_alloc_rx_agg_bmap()
bnxt_en: refactor tpa_info alloc/free into helpers
geneve: do not assume mac header is set in geneve_xmit_skb()
mlxsw: spectrum_acl_flex_keys: Use correct key block on Spectrum-4
ethtool: Fix wrong mod state in case of verbose and no_mask bitset
ipmr: tune the ipmr_can_free_table() checks.
netfilter: nft_set_hash: skip duplicated elements pending gc run
netfilter: ipset: Hold module reference while requesting a module
...
Linus Torvalds [Thu, 5 Dec 2024 18:17:55 +0000 (10:17 -0800)]
Merge tag 'trace-v6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix trace histogram sort function cmp_entries_dup()
The sort function cmp_entries_dup() returns either 1 or 0, and not -1
if parameter "a" is less than "b" by memcmp().
- Fix archs that call trace_hardirqs_off() without RCU watching
Both x86 and arm64 no longer call any tracepoints with RCU not
watching. It was assumed that it was safe to get rid of
trace_*_rcuidle() version of the tracepoint calls. This was needed to
get rid of the SRCU protection and be able to implement features like
faultable traceponits and add rust tracepoints.
Unfortunately, there were a few architectures that still relied on
that logic. There's only one file that has tracepoints that are
called without RCU watching. Add macro logic around the tracepoints
for architectures that do not have CONFIG_ARCH_WANTS_NO_INSTR defined
will check if the code is in the idle path (the only place RCU isn't
watching), and enable RCU around calling the tracepoint, but only do
it if the tracepoint is enabled.
* tag 'trace-v6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Fix archs that still call tracepoints without RCU watching
tracing: Fix cmp_entries_dup() to respect sort() comparison rules
Linus Torvalds [Thu, 5 Dec 2024 18:06:47 +0000 (10:06 -0800)]
Merge tag 'hid-for-linus-2024120501' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid
Pull HID fixes from Benjamin Tissoires:
- regression fix in suspend/resume for i2c-hid (Kenny Levinsen)
- fix wacom driver assuming a name can not be null (WangYuli)
- a couple of constify changes/fixes (Thomas Weißschuh)
- a couple of selftests/hid fixes (Maximilian Heyne & Benjamin
Tissoires)
* tag 'hid-for-linus-2024120501' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid:
selftests/hid: fix kfunc inclusions with newer bpftool
HID: bpf: drop unneeded casts discarding const
HID: bpf: constify hid_ops
selftests: hid: fix typo and exit code
HID: wacom: fix when get product name maybe null pointer
HID: i2c-hid: Revert to using power commands to wake on resume
Linus Torvalds [Thu, 5 Dec 2024 18:03:43 +0000 (10:03 -0800)]
Merge tag 'linux-watchdog-6.13-rc1' of git://www.linux-watchdog.org/linux-watchdog
Pull watchdog updates from Wim Van Sebroeck:
- Add support for exynosautov920 SoC
- Add support for Airoha EN7851 watchdog
- Add support for MT6735 TOPRGU/WDT
- Delete the cpu5wdt driver
- Always print when registering watchdog fails
- Several other small fixes and improvements
* tag 'linux-watchdog-6.13-rc1' of git://www.linux-watchdog.org/linux-watchdog: (36 commits)
watchdog: rti: of: honor timeout-sec property
watchdog: s3c2410_wdt: add support for exynosautov920 SoC
dt-bindings: watchdog: Document ExynosAutoV920 watchdog bindings
watchdog: mediatek: Add support for MT6735 TOPRGU/WDT
watchdog: mediatek: Make sure system reset gets asserted in mtk_wdt_restart()
dt-bindings: watchdog: fsl-imx-wdt: Add missing 'big-endian' property
dt-bindings: watchdog: Document Qualcomm QCS8300
docs: ABI: Fix spelling mistake in pretimeout_avaialable_governors
Revert "watchdog: s3c2410_wdt: use exynos_get_pmu_regmap_by_phandle() for PMU regs"
watchdog: rzg2l_wdt: Power on the watchdog domain in the restart handler
watchdog: Switch back to struct platform_driver::remove()
watchdog: it87_wdt: add PWRGD enable quirk for Qotom QCML04
watchdog: da9063: Remove __maybe_unused notations
watchdog: da9063: Do not use a global variable
watchdog: Delete the cpu5wdt driver
watchdog: Add support for Airoha EN7851 watchdog
dt-bindings: watchdog: airoha: document watchdog for Airoha EN7581
watchdog: sl28cpld_wdt: don't print out if registering watchdog fails
watchdog: rza_wdt: don't print out if registering watchdog fails
watchdog: rti_wdt: don't print out if registering watchdog fails
...
Steven Rostedt [Wed, 4 Dec 2024 15:04:14 +0000 (10:04 -0500)]
tracing: Fix archs that still call tracepoints without RCU watching
Tracepoints require having RCU "watching" as it uses RCU to do updates to
the tracepoints. There are some cases that would call a tracepoint when
RCU was not "watching". This was usually in the idle path where RCU has
"shutdown". For the few locations that had tracepoints without RCU
watching, there was an trace_*_rcuidle() variant that could be used. This
used SRCU for protection.
There are tracepoints that trace when interrupts and preemption are
enabled and disabled. In some architectures, these tracepoints are called
in a path where RCU is not watching. When x86 and arm64 removed these
locations, it was incorrectly assumed that it would be safe to remove the
trace_*_rcuidle() variant and also remove the SRCU logic, as it made the
code more complex and harder to implement new tracepoint features (like
faultable tracepoints and tracepoints in rust).
Instead of bringing back the trace_*_rcuidle(), as it will not be trivial
to do as new code has already been added depending on its removal, add a
workaround to the one file that still requires it (trace_preemptirq.c). If
the architecture does not define CONFIG_ARCH_WANTS_NO_INSTR, then check if
the code is in the idle path, and if so, call ct_irq_enter/exit() which
will enable RCU around the tracepoint.
Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Mark Rutland <mark.rutland@arm.com> Link: https://lore.kernel.org/20241204100414.4d3e06d0@gandalf.local.home Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Fixes: 48bcda684823 ("tracing: Remove definition of trace_*_rcuidle()") Closes: https://lore.kernel.org/all/bddb02de-957a-4df5-8e77-829f55728ea2@roeck-us.net/ Acked-by: Paul E. McKenney <paulmck@kernel.org> Tested-by: Guenter Roeck <linux@roeck-us.net> Tested-by: Geert Uytterhoeven <geert+renesas@glider.be> Tested-by: Madhavan Srinivasan <maddy@linux.ibm.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Shradha Gupta [Wed, 4 Dec 2024 05:48:20 +0000 (21:48 -0800)]
net :mana :Request a V2 response version for MANA_QUERY_GF_STAT
The current requested response version(V1) for MANA_QUERY_GF_STAT query
results in STATISTICS_FLAGS_TX_ERRORS_GDMA_ERROR value being set to
0 always.
In order to get the correct value for this counter we request the response
version to be V2.
Cc: stable@vger.kernel.org Fixes: e1df5202e879 ("net :mana :Add remaining GDMA stats for MANA to ethtool") Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com> Link: https://patch.msgid.link/1733291300-12593-1-git-send-email-shradhagupta@linux.microsoft.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The buggy address belongs to the object at ffff888043eba000
which belongs to the cache kmalloc-2k of size 2048
The buggy address is located 432 bytes inside of
freed 2048-byte region [ffff888043eba000, ffff888043eba800)
Fixes: 8c55facecd7a ("net: linkwatch: only report IF_OPER_LOWERLAYERDOWN if iflink is actually down") Reported-by: syzbot+1939f24bdb783e9e43d9@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/674f3a18.050a0220.48a03.0041.GAE@google.com/T/#u Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://patch.msgid.link/20241203170933.2449307-1-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Paolo Abeni [Thu, 5 Dec 2024 10:49:14 +0000 (11:49 +0100)]
Merge tag 'nf-24-12-05' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
1) Fix esoteric undefined behaviour due to uninitialized stack access
in ip_vs_protocol_init(), from Jinghao Jia.
2) Fix iptables xt_LED slab-out-of-bounds due to incorrect sanitization
of the led string identifier, reported by syzbot. Patch from
Dmitry Antipov.
3) Remove WARN_ON_ONCE reachable from userspace to check for the maximum
cgroup level, nft_socket cgroup matching is restricted to 255 levels,
but cgroups allow for INT_MAX levels by default. Reported by syzbot.
4) Fix nft_inner incorrect use of percpu area to store tunnel parser
context with softirqs, resulting in inconsistent inner header
offsets that could lead to bogus rule mismatches, reported by syzbot.
5) Grab module reference on ipset core while requesting set type modules,
otherwise kernel crash is possible by removing ipset core module,
patch from Phil Sutter.
6) Fix possible double-free in nft_hash garbage collector due to unstable
walk interator that can provide twice the same element. Use a sequence
number to skip expired/dead elements that have been already scheduled
for removal. Based on patch from Laurent Fasnach
netfilter pull request 24-12-05
* tag 'nf-24-12-05' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nft_set_hash: skip duplicated elements pending gc run
netfilter: ipset: Hold module reference while requesting a module
netfilter: nft_inner: incorrect percpu area handling under softirq
netfilter: nft_socket: remove WARN_ON_ONCE on maximum cgroup level
netfilter: x_tables: fix LED ID check in led_tg_check()
ipvs: fix UB due to uninitialized stack access in ip_vs_protocol_init()
====================
Parameters were created using wrong C types, which caused them to be of
wrong size on some architectures, causing problems.
The problem with SO_RCVLOWAT was found on s390 (big endian), while x86-64
didn't show it. After the fix, all tests pass on s390.
Then Stefano Garzarella pointed out that SO_VM_SOCKETS_* calls might have
a similar problem, which turned out to be true, hence, the second patch.
Changes for v8:
- Fix whitespace warnings from "checkpatch.pl --strict"
- Add maintainers to Cc:
Changes for v7:
- Rebase on top of https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git
- Add the "net" tags to the subjects
Changes for v6:
- rework the patch #3 to avoid creating a new file for new functions,
and exclude vsock_perf from calling the new functions.
- add "Reviewed-by:" to the patch #2.
Changes for v5:
- in the patch #2 replace the introduced uint64_t with unsigned long long
to match documentation
- add a patch #3 that verifies every setsockopt() call.
Changes for v4:
- add "Reviewed-by:" to the first patch, and add a second patch fixing
SO_VM_SOCKETS_* calls, which depends on the first one (hence, it's now
a patch series.)
Changes for v3:
- fix the same problem in vsock_perf and update commit message
Changes for v2:
- add "Fixes:" lines to the commit message
====================
Konstantin Shkolnyy [Tue, 3 Dec 2024 15:06:56 +0000 (09:06 -0600)]
vsock/test: verify socket options after setting them
Replace setsockopt() calls with calls to functions that follow
setsockopt() with getsockopt() and check that the returned value and its
size are the same as have been set. (Except in vsock_perf.)
Signed-off-by: Konstantin Shkolnyy <kshk@linux.ibm.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Konstantin Shkolnyy [Tue, 3 Dec 2024 15:06:55 +0000 (09:06 -0600)]
vsock/test: fix parameter types in SO_VM_SOCKETS_* calls
Change parameters of SO_VM_SOCKETS_* to unsigned long long as documented
in the vm_sockets.h, because the corresponding kernel code requires them
to be at least 64-bit, no matter what architecture. Otherwise they are
too small on 32-bit machines.
Fixes: 5c338112e48a ("test/vsock: rework message bounds test") Fixes: 685a21c314a8 ("test/vsock: add big message test") Fixes: 542e893fbadc ("vsock/test: two tests to check credit update logic") Fixes: 8abbffd27ced ("test/vsock: vsock_perf utility") Signed-off-by: Konstantin Shkolnyy <kshk@linux.ibm.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Konstantin Shkolnyy [Tue, 3 Dec 2024 15:06:54 +0000 (09:06 -0600)]
vsock/test: fix failures due to wrong SO_RCVLOWAT parameter
This happens on 64-bit big-endian machines.
SO_RCVLOWAT requires an int parameter. However, instead of int, the test
uses unsigned long in one place and size_t in another. Both are 8 bytes
long on 64-bit machines. The kernel, having received the 8 bytes, doesn't
test for the exact size of the parameter, it only cares that it's >=
sizeof(int), and casts the 4 lower-addressed bytes to an int, which, on
a big-endian machine, contains 0. 0 doesn't trigger an error, SO_RCVLOWAT
returns with success and the socket stays with the default SO_RCVLOWAT = 1,
which results in vsock_test failures, while vsock_perf doesn't even notice
that it's failed to change it.
Fixes: b1346338fbae ("vsock_test: POLLIN + SO_RCVLOWAT test") Fixes: 542e893fbadc ("vsock/test: two tests to check credit update logic") Fixes: 8abbffd27ced ("test/vsock: vsock_perf utility") Signed-off-by: Konstantin Shkolnyy <kshk@linux.ibm.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>