Oleksij Rempel [Mon, 16 Dec 2024 12:09:38 +0000 (13:09 +0100)]
net: usb: lan78xx: Use action-specific label in lan78xx_mac_reset
Rename the generic `done` label to the action-specific `exit_unlock`
label in `lan78xx_mac_reset`. This improves clarity by indicating the
specific cleanup action (mutex unlock) and aligns with best practices
for error handling and cleanup labels.
Oleksij Rempel [Mon, 16 Dec 2024 12:09:37 +0000 (13:09 +0100)]
net: usb: lan78xx: Use ETIMEDOUT instead of ETIME in lan78xx_stop_hw
Update lan78xx_stop_hw to return -ETIMEDOUT instead of -ETIME when
a timeout occurs. While -ETIME indicates a general timer expiration,
-ETIMEDOUT is more commonly used for signaling operation timeouts and
provides better consistency with standard error handling in the driver.
The -ETIME checks in tx_complete() and rx_complete() are unrelated to
this error handling change. In these functions, the error values are derived
from urb->status, which reflects USB transfer errors. The error value from
lan78xx_stop_hw will be exposed in the following cases:
- usb_driver::suspend
- net_device_ops::ndo_stop (potentially, though currently the return value
is not used).
Oleksij Rempel [Mon, 16 Dec 2024 12:09:36 +0000 (13:09 +0100)]
net: usb: lan78xx: Add error handling to lan78xx_get_regs
Update `lan78xx_get_regs` to handle errors during register and PHY
reads. Log warnings for failed reads and exit the function early if an
error occurs. Drop all previously logged registers to signal
inconsistent readings to the user space. This ensures that invalid data
is not returned to users.
Ido Schimmel [Mon, 16 Dec 2024 13:18:44 +0000 (14:18 +0100)]
mlxsw: Switch to napi_gro_receive()
Benefit from the recent conversion of the driver to NAPI and enable GRO
support through the use of napi_gro_receive(). Pass the NAPI pointer
from the bus driver (mlxsw_pci) to the switch driver (mlxsw_spectrum)
through the skb control block where various packet metadata is already
encoded.
The main motivation is to improve forwarding performance through the use
of GRO fraglist [1]. In my testing, when the forwarding data path is
simple (routing between two ports) there is not much difference in
forwarding performance between GRO disabled and GRO enabled with
fraglist.
The improvement becomes more noticeable as the data path becomes more
complex since it is traversed less times with GRO enabled. For example,
with 10 ingress and 10 egress flower filters with different priorities
on the two ports between which routing is performed, there is an
improvement of about 140% in forwarded bandwidth.
====================
inetpeer: reduce false sharing and atomic operations
After commit 8c2bd38b95f7 ("icmp: change the order of rate limits"),
there is a risk that a host receiving packets from an unique
source targeting closed ports is using a common inet_peer structure
from many cpus.
All these cpus have to acquire/release a refcount and update
the inet_peer timestamp (p->dtime)
Switch to pure RCU to avoid changing the refcount, and update
p->dtime only once per jiffy.
Eric Dumazet [Sun, 15 Dec 2024 17:56:29 +0000 (17:56 +0000)]
inetpeer: do not get a refcount in inet_getpeer()
All inet_getpeer() callers except ip4_frag_init() don't need
to acquire a permanent refcount on the inetpeer.
They can switch to full RCU protection.
Move the refcount_inc_not_zero() into ip4_frag_init(),
so that all the other callers no longer have to
perform a pair of expensive atomic operations on
a possibly contended cache line.
inet_putpeer() no longer needs to be exported.
After this patch, my DUT can receive 8,400,000 UDP packets
per second targeting closed ports, using 50% less cpu cycles
than before.
Also change two calls to l3mdev_master_ifindex() by
l3mdev_master_ifindex_rcu() (Ido ideas)
The sysfs core now allows instances of 'struct bin_attribute' to be
moved into read-only memory. Make use of that to protect them against
accidental or malicious modifications.
====================
Thomas Weißschuh [Mon, 16 Dec 2024 11:30:11 +0000 (12:30 +0100)]
netxen_nic: constify 'struct bin_attribute'
The sysfs core now allows instances of 'struct bin_attribute' to be
moved into read-only memory. Make use of that to protect them against
accidental or malicious modifications.
Thomas Weißschuh [Mon, 16 Dec 2024 11:30:09 +0000 (12:30 +0100)]
net: phy: ks8995: constify 'struct bin_attribute'
The sysfs core now allows instances of 'struct bin_attribute' to be
moved into read-only memory. Make use of that to protect them against
accidental or malicious modifications.
Thomas Weißschuh [Mon, 16 Dec 2024 11:30:08 +0000 (12:30 +0100)]
net: bridge: constify 'struct bin_attribute'
The sysfs core now allows instances of 'struct bin_attribute' to be
moved into read-only memory. Make use of that to protect them against
accidental or malicious modifications.
Jakub Kicinski [Sun, 15 Dec 2024 21:29:38 +0000 (13:29 -0800)]
net: page_pool: rename page_pool_is_last_ref()
page_pool_is_last_ref() releases a reference while the name,
to me at least, suggests it just checks if the refcount is 1.
The semantics of the function are the same as those of
atomic_dec_and_test() and refcount_dec_and_test(), so just
use the _and_test() suffix.
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Link: https://patch.msgid.link/20241215212938.99210-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Rahul Rameshbabu [Sat, 14 Dec 2024 19:43:06 +0000 (19:43 +0000)]
rust: net::phy scope ThisModule usage in the module_phy_driver macro
Similar to the use of $crate::Module, ThisModule should be referred to as
$crate::ThisModule in the macro evaluation. The reason the macro previously
did not cause any errors is because all the users of the macro would use
kernel::prelude::*, bringing ThisModule into scope.
Toke Høiland-Jørgensen [Sat, 14 Dec 2024 16:50:59 +0000 (17:50 +0100)]
net/sched: Add drop reasons for AQM-based qdiscs
Now that we have generic QDISC_CONGESTED and QDISC_OVERLIMIT drop
reasons, let's have all the qdiscs that contain an AQM apply them
consistently when dropping packets.
Kuniyuki Iwashima [Fri, 13 Dec 2024 11:08:49 +0000 (20:08 +0900)]
af_unix: Clean up error paths in unix_dgram_sendmsg().
The error path is complicated in unix_dgram_sendmsg() because there
are two timings when other could be non-NULL: when it's fetched from
unix_peer_get() and when it's looked up by unix_find_other().
Let's move unix_peer_get() to the else branch for unix_find_other()
and clean up the error paths.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
====================
r8169: add support for RTL8125D rev.b
Add support for RTL8125D rev.b. Its XID is 0x689. It is basically
based on the one with XID 0x688, but with different firmware file.
To avoid a mess with the version numbering, adjust it first.
====================
ChunHao Lin [Fri, 13 Dec 2024 19:02:58 +0000 (20:02 +0100)]
r8169: add support for RTL8125D rev.b
Add support for RTL8125D rev.b. Its XID is 0x689. It is basically
based on the one with XID 0x688, but with different firmware file.
Signed-off-by: ChunHao Lin <hau@realtek.com>
[hkallweit1@gmail.com: rebased after adjusted version numbering] Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/75e5e9ec-d01f-43ac-b0f4-e7456baf18d1@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Tue, 17 Dec 2024 02:11:21 +0000 (18:11 -0800)]
Merge branch 'add-support-for-so_priority-cmsg'
Anna Emese Nyiri says:
====================
Add support for SO_PRIORITY cmsg
Introduce a new helper function, `sk_set_prio_allowed`,
to centralize the logic for validating priority settings.
Add support for the `SO_PRIORITY` control message,
enabling user-space applications to set socket priority
via control messages (cmsg).
====================
Anna Emese Nyiri [Fri, 13 Dec 2024 08:44:57 +0000 (09:44 +0100)]
sock: Introduce SO_RCVPRIORITY socket option
Add new socket option, SO_RCVPRIORITY, to include SO_PRIORITY in the
ancillary data returned by recvmsg().
This is analogous to the existing support for SO_RCVMARK,
as implemented in commit 6fd1d51cfa253 ("net: SO_RCVMARK socket option
for SO_MARK with recvmsg()").
Anna Emese Nyiri [Fri, 13 Dec 2024 08:44:56 +0000 (09:44 +0100)]
selftests: net: test SO_PRIORITY ancillary data with cmsg_sender
Extend cmsg_sender.c with a new option '-Q' to send SO_PRIORITY
ancillary data.
cmsg_so_priority.sh script added to validate SO_PRIORITY behavior
by creating VLAN device with egress QoS mapping and testing packet
priorities using flower filters. Verify that packets with different
priorities are correctly matched and counted by filters for multiple
protocols and IP versions.
Reviewed-by: Willem de Bruijn <willemb@google.com> Acked-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Tested-by: Ido Schimmel <idosch@nvidia.com> Suggested-by: Ido Schimmel <idosch@idosch.org> Signed-off-by: Anna Emese Nyiri <annaemesenyiri@gmail.com> Link: https://patch.msgid.link/20241213084457.45120-4-annaemesenyiri@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Anna Emese Nyiri [Fri, 13 Dec 2024 08:44:55 +0000 (09:44 +0100)]
sock: support SO_PRIORITY cmsg
The Linux socket API currently allows setting SO_PRIORITY at the
socket level, applying a uniform priority to all packets sent through
that socket. The exception to this is IP_TOS, when the priority value
is calculated during the handling of
ancillary data, as implemented in commit f02db315b8d8 ("ipv4: IP_TOS
and IP_TTL can be specified as ancillary data").
However, this is a computed
value, and there is currently no mechanism to set a custom priority
via control messages prior to this patch.
According to this patch, if SO_PRIORITY is specified as ancillary data,
the packet is sent with the priority value set through
sockc->priority, overriding the socket-level values
set via the traditional setsockopt() method. This is analogous to
the existing support for SO_MARK, as implemented in
commit c6af0c227a22 ("ip: support SO_MARK cmsg").
If both cmsg SO_PRIORITY and IP_TOS are passed, then the one that
takes precedence is the last one in the cmsg list.
This patch has the side effect that raw_send_hdrinc now interprets cmsg
IP_TOS.
Anna Emese Nyiri [Fri, 13 Dec 2024 08:44:54 +0000 (09:44 +0100)]
sock: Introduce sk_set_prio_allowed helper function
Simplify priority setting permissions with the 'sk_set_prio_allowed'
function, centralizing the validation logic. This change is made in
anticipation of a second caller in a following patch.
No functional changes.
Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Suggested-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Anna Emese Nyiri <annaemesenyiri@gmail.com> Link: https://patch.msgid.link/20241213084457.45120-2-annaemesenyiri@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Donald Hunter [Fri, 13 Dec 2024 11:25:50 +0000 (11:25 +0000)]
netlink: specs: add phys-binding attr to rt_link spec
Add the missing phys-binding attr to the mctp-attrs in the rt_link spec.
This fixes commit 580db513b4a9 ("net: mctp: Expose transport binding
identifier via IFLA attribute").
Note that enum mctp_phys_binding is not currently uapi, but perhaps it
should be?
David Howells [Thu, 12 Dec 2024 21:04:22 +0000 (21:04 +0000)]
rxrpc: Fix ability to add more data to a call once MSG_MORE deasserted
When userspace is adding data to an RPC call for transmission, it must pass
MSG_MORE to sendmsg() if it intends to add more data in future calls to
sendmsg(). Calling sendmsg() without MSG_MORE being asserted closes the
transmission phase of the call (assuming sendmsg() adds all the data
presented) and further attempts to add more data should be rejected.
However, this is no longer the case. The change of call state that was
previously the guard got bumped over to the I/O thread, which leaves a
window for a repeat sendmsg() to insert more data. This previously went
unnoticed, but the more recent patch that changed the structures behind the
Tx queue added a warning:
WARNING: CPU: 3 PID: 6639 at net/rxrpc/sendmsg.c:296 rxrpc_send_data+0x3f2/0x860
and rejected the additional data, returning error EPROTO.
Fix this by adding a guard flag to the call, setting the flag when we queue
the final packet and then rejecting further attempts to add data with
EPROTO.
Fixes: 2d689424b618 ("rxrpc: Move call state changes from sendmsg to I/O thread") Reported-by: syzbot+ff11be94dfcd7a5af8da@syzkaller.appspotmail.com Closes: https://lore.kernel.org/r/6757fb68.050a0220.2477f.005f.GAE@google.com/ Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: syzbot+ff11be94dfcd7a5af8da@syzkaller.appspotmail.com
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org Link: https://patch.msgid.link/2870480.1734037462@warthog.procyon.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David Howells [Thu, 12 Dec 2024 20:58:15 +0000 (20:58 +0000)]
rxrpc: Disable IRQ, not BH, to take the lock for ->attend_link
Use spin_lock_irq(), not spin_lock_bh() to take the lock when accessing the
->attend_link() to stop a delay in the I/O thread due to an interrupt being
taken in the app thread whilst that holds the lock and vice versa.
Fixes: a2ea9a907260 ("rxrpc: Use irq-disabling spinlocks between app and I/O thread") Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org Link: https://patch.msgid.link/2870146.1734037095@warthog.procyon.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Tue, 17 Dec 2024 00:47:42 +0000 (16:47 -0800)]
Merge branch 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
Tariq Toukan says:
====================
mlx5-next 2024-12-16
The following pull-request contains mlx5 IFC updates.
* 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
net/mlx5: Add device cap abs_native_port_num
net/mlx5: qos: Add ifc support for cross-esw scheduling
net/mlx5: Add support for new scheduling elements
net/mlx5: Add ConnectX-8 device to ifc
net/mlx5: ifc: Reorganize mlx5_ifc_flow_table_context_bits
====================
David S. Miller [Mon, 16 Dec 2024 12:51:41 +0000 (12:51 +0000)]
Merge branch 'net-timestamp-selectable'
Kory Maincent says:
====================
net: Make timestamping selectable
Up until now, there was no way to let the user select the hardware
PTP provider at which time stamping occurs. The stack assumed that PHY time
stamping is always preferred, but some MAC/PHY combinations were buggy.
This series updates the default MAC/PHY default timestamping and aims to
allow the user to select the desired hwtstamp provider administratively.
Changes in v21:
- NIT fixes.
- Link to v20: https://lore.kernel.org/r/20241204-feature_ptp_netnext-v20-0-9bd99dc8a867@bootlin.com
Changes in v20:
- Change hwtstamp provider design to avoid saving "user" (phy or net) in
the ptp clock structure.
- Link to v19: https://lore.kernel.org/r/20241030-feature_ptp_netnext-v19-0-94f8aadc9d5c@bootlin.com
Changes in v19:
- Rebase on net-next
- Link to v18: https://lore.kernel.org/r/20241023-feature_ptp_netnext-v18-0-ed948f3b6887@bootlin.com
Changes in v18:
- Few changes in the tsconfig-set ethtool command.
- Add tsconfig-set-reply ethtool netlink socket.
- Add missing netlink tsconfig documentation
- Link to v17: https://lore.kernel.org/r/20240709-feature_ptp_netnext-v17-0-b5317f50df2a@bootlin.com
Changes in v17:
- Fix a documentation nit.
- Add a missing kernel_ethtool_tsinfo update from a new MAC driver.
- Link to v16: https://lore.kernel.org/r/20240705-feature_ptp_netnext-v16-0-5d7153914052@bootlin.com
Changes in v16:
- Add a new patch to separate tsinfo into a new tsconfig command to get
and set the hwtstamp config.
- Used call_rcu() instead of synchronize_rcu() to free the hwtstamp_provider
- Moved net core changes of patch 12 directly to patch 8.
- Link to v15: https://lore.kernel.org/r/20240612-feature_ptp_netnext-v15-0-b2a086257b63@bootlin.com
Changes in v15:
- Fix uninitialized ethtool_ts_info structure.
- Link to v14: https://lore.kernel.org/r/20240604-feature_ptp_netnext-v14-0-77b6f6efea40@bootlin.com
Changes in v14:
- Add back an EXPORT_SYMBOL() missing.
- Link to v13: https://lore.kernel.org/r/20240529-feature_ptp_netnext-v13-0-6eda4d40fa4f@bootlin.com
Changes in v13:
- Add PTP builtin code to fix build errors when building PTP as a module.
- Fix error spotted by smatch and sparse.
- Link to v12: https://lore.kernel.org/r/20240430-feature_ptp_netnext-v12-0-2c5f24b6a914@bootlin.com
Changes in v12:
- Add missing return description in the kdoc.
- Fix few nit.
- Link to v11: https://lore.kernel.org/r/20240422-feature_ptp_netnext-v11-0-f14441f2a1d8@bootlin.com
Changes in v11:
- Add netlink examples.
- Remove a change of my out of tree marvell_ptp patch in the patch series.
- Remove useless extern.
- Link to v10: https://lore.kernel.org/r/20240409-feature_ptp_netnext-v10-0-0fa2ea5c89a9@bootlin.com
Changes in v10:
- Move declarations to net/core/dev.h instead of netdevice.h
- Add netlink documentation.
- Add ETHTOOL_A_TSINFO_GHWTSTAMP netlink attributes instead of a bit in
ETHTOOL_A_TSINFO_TIMESTAMPING bitset.
- Send "Move from simple ida to xarray" patch standalone.
- Add tsinfo ntf command.
- Add rcu_lock protection mechanism to avoid memory leak.
- Fixed doc and kdoc issue.
- Link to v9: https://lore.kernel.org/r/20240226-feature_ptp_netnext-v9-0-455611549f21@bootlin.com
Changes in v9:
- Remove the RFC prefix.
- Correct few NIT fixes.
- Link to v8: https://lore.kernel.org/r/20240216-feature_ptp_netnext-v8-0-510f42f444fb@bootlin.com
Changes in v8:
- Drop the 6 first patch as they are now merged.
- Change the full implementation to not be based on the hwtstamp layer
(MAC/PHY) but on the hwtstamp provider which mean a ptp clock and a
phc qualifier.
- Made some patch to prepare the new implementation.
- Expand netlink tsinfo instead of a new ts command for new hwtstamp
configuration uAPI and for dumping tsinfo of specific hwtstamp provider.
- Link to v7: https://lore.kernel.org/r/20231114-feature_ptp_netnext-v7-0-472e77951e40@bootlin.com
Changes in v7:
- Fix a temporary build error.
- Link to v6: https://lore.kernel.org/r/20231019-feature_ptp_netnext-v6-0-71affc27b0e5@bootlin.com
Changes in v6:
- Few fixes from the reviews.
- Replace the allowlist to default_timestamp flag to know which phy is
using old API behavior.
- Rename the timestamping layer enum values.
- Move to a simple enum instead of the mix between enum and bitfield.
- Update ts_info and ts-set in software timestamping case.
Changes in v5:
- Update to ndo_hwstamp_get/set. This bring several new patches.
- Add few patches to make the glue.
- Convert macb to ndo_hwstamp_get/set.
- Add netlink specs description of new ethtool commands.
- Removed netdev notifier.
- Split the patches that expose the timestamping to userspace to separate
the core and ethtool development.
- Add description of software timestamping.
- Convert PHYs hwtstamp callback to use kernel_hwtstamp_config.
Changes in v4:
- Move on to ethtool netlink instead of ioctl.
- Add a netdev notifier to allow packet trapping by the MAC in case of PHY
time stamping.
- Add a PHY whitelist to not break the old PHY default time-stamping
preference API.
Changes in v3:
- Expose the PTP choice to ethtool instead of sysfs.
You can test it with the ethtool source on branch feature_ptp of:
https://github.com/kmaincent/ethtool
- Added a devicetree binding to select the preferred timestamp.
Changes in v2:
- Move selected_timestamping_layer variable of the concerned patch.
- Use sysfs_streq instead of strmcmp.
- Use the PHY timestamp only if available.
====================
Signed-off-by: Kory Maincent <kory.maincent@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Kory Maincent [Thu, 12 Dec 2024 17:06:45 +0000 (18:06 +0100)]
net: ethtool: Add support for tsconfig command to get/set hwtstamp config
Introduce support for ETHTOOL_MSG_TSCONFIG_GET/SET ethtool netlink socket
to read and configure hwtstamp configuration of a PHC provider. Note that
simultaneous hwtstamp isn't supported; configuring a new one disables the
previous setting.
Signed-off-by: Kory Maincent <kory.maincent@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Kory Maincent [Thu, 12 Dec 2024 17:06:44 +0000 (18:06 +0100)]
net: ethtool: tsinfo: Enhance tsinfo to support several hwtstamp by net topology
Either the MAC or the PHY can provide hwtstamp, so we should be able to
read the tsinfo for any hwtstamp provider.
Enhance 'get' command to retrieve tsinfo of hwtstamp providers within a
network topology.
Add support for a specific dump command to retrieve all hwtstamp
providers within the network topology, with added functionality for
filtered dump to target a single interface.
Signed-off-by: Kory Maincent <kory.maincent@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Kory Maincent [Thu, 12 Dec 2024 17:06:43 +0000 (18:06 +0100)]
net: Add the possibility to support a selected hwtstamp in netdevice
Introduce the description of a hwtstamp provider, mainly defined with a
the hwtstamp source and the phydev pointer.
Add a hwtstamp provider description within the netdev structure to
allow saving the hwtstamp we want to use. This prepares for future
support of an ethtool netlink command to select the desired hwtstamp
provider. By default, the old API that does not support hwtstamp
selectability is used, meaning the hwtstamp provider pointer is unset.
Signed-off-by: Kory Maincent <kory.maincent@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Kory Maincent [Thu, 12 Dec 2024 17:06:42 +0000 (18:06 +0100)]
net: Make net_hwtstamp_validate accessible
Make the net_hwtstamp_validate function accessible in prevision to use
it from ethtool to validate the hwtstamp configuration before setting it.
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Kory Maincent <kory.maincent@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Kory Maincent [Thu, 12 Dec 2024 17:06:41 +0000 (18:06 +0100)]
net: Make dev_get_hwtstamp_phylib accessible
Make the dev_get_hwtstamp_phylib function accessible in prevision to use
it from ethtool to read the hwtstamp current configuration.
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Kory Maincent <kory.maincent@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 16 Dec 2024 12:47:30 +0000 (12:47 +0000)]
Merge branch 'tls1.3-key-updates'
Sabrina Dubroca says:
====================
tls: implement key updates for TLS1.3
This adds support for receiving KeyUpdate messages (RFC 8446, 4.6.3
[1]). A sender transmits a KeyUpdate message and then changes its TX
key. The receiver should react by updating its RX key before
processing the next message.
This patchset implements key updates by:
1. pausing decryption when a KeyUpdate message is received, to avoid
attempting to use the old key to decrypt a record encrypted with
the new key
2. returning -EKEYEXPIRED to syscalls that cannot receive the
KeyUpdate message, until the rekey has been performed by userspace
3. passing the KeyUpdate message to userspace as a control message
4. allowing updates of the crypto_info via the TLS_TX/TLS_RX
setsockopts
This API has been tested with gnutls to make sure that it allows
userspace libraries to implement key updates [2]. Thanks to Frantisek
Krenzelok <fkrenzel@redhat.com> for providing the implementation in
gnutls and testing the kernel patches.
=======================================================================
Discussions around v2 of this patchset focused on how HW offload would
interact with rekey.
RX
- The existing SW path will handle all records between the KeyUpdate
message signaling the change of key and the new key becoming known
to the kernel -- those will be queued encrypted, and decrypted in
SW as they are read by userspace (once the key is provided, ie same
as this patchset)
- Call ->tls_dev_del + ->tls_dev_add immediately during
setsockopt(TLS_RX)
TX
- After setsockopt(TLS_TX), switch to the existing SW path (not the
current device_fallback) until we're able to re-enable HW offload
- tls_device_sendmsg will call into tls_sw_sendmsg under lock_sock
to avoid changing socket ops during the rekey while another
thread might be waiting on the lock
- We only re-enable HW offload (call ->tls_dev_add to install the new
key in HW) once all records sent with the old key have been
ACKed. At this point, all unacked records are SW-encrypted with the
new key, and the old key is unused by both HW and retransmissions.
- If there are no unacked records when userspace does
setsockopt(TLS_TX), we can (try to) install the new key in HW
immediately.
- If yet another key has been provided via setsockopt(TLS_TX), we
don't install intermediate keys, only the latest.
- TCP notifies ktls of ACKs via the icsk_clean_acked callback. In
case of a rekey, tls_icsk_clean_acked will record when all data
sent with the most recent past key has been sent. The next call
to sendmsg will install the new key in HW.
- We close and push the current SW record before reenabling
offload.
If ->tls_dev_add fails to install the new key in HW, we stay in SW
mode. We can add a counter to keep track of this.
In addition:
Because we can't change socket ops during a rekey, we'll also have to
modify do_tls_setsockopt_conf to check ctx->tx_conf and only call
either tls_set_device_offload or tls_set_sw_offload. RX already uses
the same ops for both TLS_HW and TLS_SW, so we could switch between HW
and SW mode on rekey.
An alternative would be to have a common sendmsg which locks
the socket and then calls the correct implementation. We'll need that
anyway for the offload under rekey case, so that would only add a test
to the SW path's ops (compared to the current code). That should allow
us to simplify build_protos a bit, but might have a performance
impact - we'll need to check it if we want to go that route.
=======================================================================
Changes since v4:
- add counter for received KeyUpdate messages
- improve wording in the documentation
- improve handling of bogus messages when looking for KeyUpdate's
- some coding style clean ups
Changes since v3:
- rebase on top of net-next
- rework tls_check_pending_rekey according to Jakub's feedback
- add statistics for rekey: {RX,TX}REKEY{OK,ERROR}
- some coding style clean ups
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Sabrina Dubroca [Thu, 12 Dec 2024 15:36:09 +0000 (16:36 +0100)]
selftests: tls: add rekey tests
Test the kernel's ability to:
- update the key (but not the version or cipher), only for TLS1.3
- pause decryption after receiving a KeyUpdate message, until a new
RX key has been provided
- reflect the pause/non-readable socket in poll()
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Sabrina Dubroca [Thu, 12 Dec 2024 15:36:07 +0000 (16:36 +0100)]
docs: tls: document TLS1.3 key updates
Document the kernel's behavior and userspace expectations.
Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Sabrina Dubroca [Thu, 12 Dec 2024 15:36:06 +0000 (16:36 +0100)]
tls: add counters for rekey
This introduces 5 counters to keep track of key updates:
Tls{Rx,Tx}Rekey{Ok,Error} and TlsRxRekeyReceived.
Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Sabrina Dubroca [Thu, 12 Dec 2024 15:36:05 +0000 (16:36 +0100)]
tls: implement rekey for TLS1.3
This adds the possibility to change the key and IV when using
TLS1.3. Changing the cipher or TLS version is not supported.
Once we have updated the RX key, we can unblock the receive side. If
the rekey fails, the context is unmodified and userspace is free to
retry the update or close the socket.
This change only affects tls_sw, since 1.3 offload isn't supported.
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Sabrina Dubroca [Thu, 12 Dec 2024 15:36:04 +0000 (16:36 +0100)]
tls: block decryption when a rekey is pending
When a TLS handshake record carrying a KeyUpdate message is received,
all subsequent records will be encrypted with a new key. We need to
stop decrypting incoming records with the old key, and wait until
userspace provides a new key.
Make a note of this in the RX context just after decrypting that
record, and stop recvmsg/splice calls with EKEYEXPIRED until the new
key is available.
key_update_pending can't be combined with the existing bitfield,
because we will read it locklessly in ->poll.
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
These cleanups lead the way to the unification of the path-manager
interfaces, and allow future extensions. The following patches are not
linked to each others, but are all related to the userspace
path-manager.
- Patch 1: add a new helper to reduce duplicated code.
- Patch 2: add a macro to iterate over the address list, clearer.
- Patch 3: reduce duplicated code to get the corresponding MPTCP socket.
- Patch 4: move userspace PM specific code out of the in-kernel one.
- Patch 5: pass an entry instead of a list with always one entry.
- Patch 6: uniform struct type used for the local addresses.
- Patch 7: simplify error handling.
====================
Geliang Tang [Fri, 13 Dec 2024 19:52:58 +0000 (20:52 +0100)]
mptcp: drop useless "err = 0" in subflow_destroy
Upon successful return, mptcp_pm_parse_addr() returns 0. There is no need
to set "err = 0" after this. So after mptcp_nl_find_ssk() returns, just
need to set "err = -ESRCH", then release and free msk socket if it returns
NULL.
Also, no need to define the variable "subflow" in subflow_destroy(), use
mptcp_subflow_ctx(ssk) directly.
This patch doesn't change the behaviour of the code, just refactoring.
Geliang Tang [Fri, 13 Dec 2024 19:52:57 +0000 (20:52 +0100)]
mptcp: change local addr type of subflow_destroy
Generally, in the path manager interfaces, the local address is defined as
an mptcp_pm_addr_entry type address, while the remote address is defined as
an mptcp_addr_info type one:
But subflow_destroy() interface uses two mptcp_addr_info type parameters.
This patch changes the first one to mptcp_pm_addr_entry type and use helper
mptcp_pm_parse_entry() to parse it instead of using mptcp_pm_parse_addr().
This patch doesn't change the behaviour of the code, just refactoring.
Geliang Tang [Fri, 13 Dec 2024 19:52:56 +0000 (20:52 +0100)]
mptcp: drop free_list for deleting entries
mptcp_pm_remove_addrs() actually only deletes one address, which does
not match its name. This patch renames it to mptcp_pm_remove_addr_entry()
and changes the parameter "rm_list" to "entry".
With the help of mptcp_pm_remove_addr_entry(), it's no longer necessary to
move the entry to be deleted to free_list and then traverse the list to
delete the entry, which is not allowed in BPF. The entry can be directly
deleted through list_del_rcu() and sock_kfree_s() now.
Geliang Tang [Fri, 13 Dec 2024 19:52:55 +0000 (20:52 +0100)]
mptcp: move mptcp_pm_remove_addrs into pm_userspace
Since mptcp_pm_remove_addrs() is only called from the userspace PM, this
patch moves it into pm_userspace.c.
For this, lookup_subflow_by_saddr() and remove_anno_list_by_saddr()
helpers need to be exported in protocol.h. Also add "mptcp_" prefix for
these helpers.
Here, mptcp_pm_remove_addrs() is not changed to a static function because
it will be used in BPF Path Manager.
This patch doesn't change the behaviour of the code, just refactoring.
Geliang Tang [Fri, 13 Dec 2024 19:52:54 +0000 (20:52 +0100)]
mptcp: add mptcp_userspace_pm_get_sock helper
Each userspace pm netlink function uses nla_get_u32() to get the msk
token value, then pass it to mptcp_token_get_sock() to get the msk.
Finally check whether userspace PM is selected on this msk. It makes
sense to wrap them into a helper, named mptcp_userspace_pm_get_sock(),
to do this.
This patch doesn't change the behaviour of the code, just refactoring.
Geliang Tang [Fri, 13 Dec 2024 19:52:53 +0000 (20:52 +0100)]
mptcp: add mptcp_for_each_userspace_pm_addr macro
Similar to mptcp_for_each_subflow() macro, this patch adds a new macro
mptcp_for_each_userspace_pm_addr() for userspace PM to iterate over the
address entries on the local address list userspace_pm_local_addr_list
of the mptcp socket.
This patch doesn't change the behaviour of the code, just refactoring.
Geliang Tang [Fri, 13 Dec 2024 19:52:52 +0000 (20:52 +0100)]
mptcp: add mptcp_userspace_pm_lookup_addr helper
Like __lookup_addr() helper in pm_netlink.c, a new helper
mptcp_userspace_pm_lookup_addr() is also defined in pm_userspace.c.
It looks up the corresponding mptcp_pm_addr_entry address in
userspace_pm_local_addr_list through the passed "addr" parameter
and returns the found address entry.
This helper can be used in mptcp_userspace_pm_delete_local_addr(),
mptcp_userspace_pm_set_flags(), mptcp_userspace_pm_get_local_id()
and mptcp_userspace_pm_is_backup() to simplify the code.
Please note that with this change now list_for_each_entry() is used in
mptcp_userspace_pm_append_new_local_addr(), not list_for_each_entry_safe(),
but that's OK to do so because mptcp_userspace_pm_lookup_addr() only
returns an entry from the list, the list hasn't been modified here.
Easwar Hariharan [Thu, 12 Dec 2024 17:33:01 +0000 (17:33 +0000)]
gve: Convert timeouts to secs_to_jiffies()
Commit b35108a51cf7 ("jiffies: Define secs_to_jiffies()") introduced
secs_to_jiffies(). As the value here is a multiple of 1000, use
secs_to_jiffies() instead of msecs_to_jiffies to avoid the multiplication.
This is converted using scripts/coccinelle/misc/secs_to_jiffies.cocci with
the following Coccinelle rules:
Donald Hunter [Fri, 13 Dec 2024 11:08:27 +0000 (11:08 +0000)]
netlink: specs: add uint, sint to netlink-raw schema
Add uint, sint to the list of attr types in the netlink-raw schema. This
fixes the rt_link spec which had a uint attr added in commit f858cc9eed5b ("net: add IFLA_MAX_PACING_OFFLOAD_HORIZON device attribute")
Arnd Bergmann [Fri, 13 Dec 2024 08:32:18 +0000 (09:32 +0100)]
octeontx2-af: fix build regression without CONFIG_DCB
When DCB is disabled, the pfc_en struct member cannot be accessed:
drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c: In function 'otx2_is_pfc_enabled':
drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c:22:48: error: 'struct otx2_nic' has no member named 'pfc_en'
22 | return IS_ENABLED(CONFIG_DCB) && !!pfvf->pfc_en;
| ^~
drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c: In function 'otx2_nix_config_bp':
drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c:1755:33: error: 'IEEE_8021QAZ_MAX_TCS' undeclared (first use in this function)
1755 | req->chan_cnt = IEEE_8021QAZ_MAX_TCS;
| ^~~~~~~~~~~~~~~~~~~~
Move the member out of the #ifdef block to avoid putting back another
check in the source file and add the missing include file unconditionally.
Fixes: a7ef63dbd588 ("octeontx2-af: Disable backpressure between CPT and NIX") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20241213083228.2645757-1-arnd@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Geert Uytterhoeven [Thu, 12 Dec 2024 09:11:43 +0000 (10:11 +0100)]
ethernet: Make OA_TC6 config symbol invisible
Commit aa58bec064ab1622 ("net: ethernet: oa_tc6: implement register
write operation") introduced a library that implements the OPEN Alliance
TC6 10BASE-T1x MAC-PHY Serial Interface protocol for supporting
10BASE-T1x MAC-PHYs.
There is no need to ask the user about enabling this library, as all
drivers that use it select the OA_TC6 symbol. Hence make the symbol
invisible, unless when compile-testing.
Vladimir Oltean [Thu, 12 Dec 2024 14:08:34 +0000 (16:08 +0200)]
net: phylink: improve phylink_sfp_config_phy() error message with missing PHY driver
It seems that phylink does not support driving PHYs in SFP modules using
the Generic PHY or Generic Clause 45 PHY driver. I've come to this
conclusion after analyzing these facts:
- sfp_sm_probe_phy(), who is our caller here, first calls
phy_device_register() and then sfp_add_phy() -> ... ->
phylink_sfp_connect_phy().
- phydev->supported is populated by phy_probe()
- phy_probe() is usually called synchronously from phy_device_register()
via phy_bus_match(), if a precise device driver is found for the PHY.
In that case, phydev->supported has a good chance of being set to a
non-zero mask.
- There is an exceptional case for the PHYs for which phy_bus_match()
didn't find a driver. Those devices sit for a while without a driver,
then phy_attach_direct() force-binds the genphy_c45_driver or
genphy_driver to them. Again, this triggers phy_probe() and renders
a good chance of phydev->supported being populated, assuming
compatibility with genphy_read_abilities() or
genphy_c45_pma_read_abilities().
- phylink_sfp_config_phy() does not support the exceptional case of
retrieving phydev->supported from the Generic PHY driver, due to its
code flow. It expects the phydev->supported mask to already be
non-empty, because it first calls phylink_validate() on it, and only
calls phylink_attach_phy() if that succeeds. Thus, phylink_attach_phy()
-> phy_attach_direct() has no chance of running.
It is not my wish to change the state of affairs by altering the code
flow, but merely to document the limitation rather than have the current
unspecific error:
[ 61.800079] mv88e6085 d0032004.mdio-mii:12 sfp: validation with support 00,00000000,00000000,00000000 failed: -EINVAL
[ 61.820743] sfp sfp: sfp_add_phy failed: -EINVAL
On the premise that an empty phydev->supported is going to make
phylink_validate() fail anyway, and that this is caused by a missing PHY
driver, it would be more informative to single out that case, undercut
the entire phylink_sfp_config_phy() call, including phylink_validate(),
and print a more specific message for this common gotcha:
[ 37.076403] mv88e6085 d0032004.mdio-mii:12 sfp: PHY i2c:sfp:16 (id 0x01410cc2) has no driver loaded
[ 37.089157] mv88e6085 d0032004.mdio-mii:12 sfp: Drivers which handle known common cases: CONFIG_BCM84881_PHY, CONFIG_MARVELL_PHY
[ 37.108047] sfp sfp: sfp_add_phy failed: -EINVAL
Maximilian Güntner [Thu, 12 Dec 2024 16:19:11 +0000 (17:19 +0100)]
ipv4: output metric as unsigned int
adding a route metric greater than 0x7fff_ffff leads to an
unintended wrap when printing the underlying u32 as an
unsigned int (`%d`) thus incorrectly rendering the metric
as negative. Formatting using `%u` corrects the issue.
David S. Miller [Sun, 15 Dec 2024 21:12:37 +0000 (21:12 +0000)]
Merge branch 'dp83822-gpio2'
Dimitri Fedrau says:
====================
net: phy: dp83822: Add support for GPIO2 clock output
The DP83822 has several clock configuration options for pins GPIO1, GPIO2
and GPIO3. Clock options include:
- MAC IF clock
- XI clock
- Free-Running clock
- Recovered clock
This patch adds support for GPIO2, the support for GPIO1 and GPIO3 can be
easily added if needed. Code and device tree bindings are derived from
dp83867 which has a similar feature.
Signed-off-by: Dimitri Fedrau <dimitri.fedrau@liebherr.com>
---
Changes in v3:
- Dropped <dt-bindings/net/ti-dp83822.h>
- Moved defines from <dt-bindings/net/ti-dp83822.h> to dp83822.c
- Switched to enum of type string for property ti,gpio2-clk-out and added
explanation for values, added example.
- Link to v2: https://lore.kernel.org/r/20241211-dp83822-gpio2-clk-out-v2-0-614a54f6acab@liebherr.com
Changes in v2:
- Move MII_DP83822_IOCTRL2 before MII_DP83822_GENCFG
- List case statements together, and have one break at the end.
- Move dp83822->set_gpio2_clk_out = true at the end of the validation
- Link to v1: https://lore.kernel.org/r/20241209-dp83822-gpio2-clk-out-v1-0-fd3c8af59ff5@liebherr.com
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Dimitri Fedrau [Thu, 12 Dec 2024 08:44:07 +0000 (09:44 +0100)]
net: phy: dp83822: Add support for GPIO2 clock output
The GPIO2 pin on the DP83822 can be configured as clock output. Add support
for configuration via DT.
Signed-off-by: Dimitri Fedrau <dimitri.fedrau@liebherr.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
Dimitri Fedrau [Thu, 12 Dec 2024 08:44:06 +0000 (09:44 +0100)]
dt-bindings: net: dp83822: Add support for GPIO2 clock output
The GPIO2 pin on the DP83822 can be configured as clock output. Add
binding to support this feature.
Signed-off-by: Dimitri Fedrau <dimitri.fedrau@liebherr.com> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
Yuyang Huang [Wed, 11 Dec 2024 08:22:41 +0000 (17:22 +0900)]
netlink: add IGMP/MLD join/leave notifications
This change introduces netlink notifications for multicast address
changes. The following features are included:
* Addition and deletion of multicast addresses are reported using
RTM_NEWMULTICAST and RTM_DELMULTICAST messages with AF_INET and
AF_INET6.
* Two new notification groups: RTNLGRP_IPV4_MCADDR and
RTNLGRP_IPV6_MCADDR are introduced for receiving these events.
This change allows user space applications (e.g., ip monitor) to
efficiently track multicast group memberships by listening for netlink
events. Previously, applications relied on inefficient polling of
procfs, introducing delays. With netlink notifications, applications
receive realtime updates on multicast group membership changes,
enabling more precise metrics collection and system monitoring.
This change also unlocks the potential for implementing a wide range
of sophisticated multicast related features in user space by allowing
applications to combine kernel provided multicast address information
with user space data and communicate decisions back to the kernel for
more fine grained control. This mechanism can be used for various
purposes, including multicast filtering, IGMP/MLD offload, and
IGMP/MLD snooping.
Cc: Maciej Żenczykowski <maze@google.com> Cc: Lorenzo Colitti <lorenzo@google.com> Co-developed-by: Patrick Ruddy <pruddy@vyatta.att-mail.com> Signed-off-by: Patrick Ruddy <pruddy@vyatta.att-mail.com> Link: https://lore.kernel.org/r/20180906091056.21109-1-pruddy@vyatta.att-mail.com Signed-off-by: Yuyang Huang <yuyanghuang@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Mina Almasry [Wed, 11 Dec 2024 21:20:31 +0000 (21:20 +0000)]
page_pool: disable sync for cpu for dmabuf memory provider
dmabuf dma-addresses should not be dma_sync'd for CPU/device. Typically
its the driver responsibility to dma_sync for CPU, but the driver should
not dma_sync for CPU if the netmem is actually coming from a dmabuf
memory provider.
The page_pool already exposes a helper for dma_sync_for_cpu:
page_pool_dma_sync_for_cpu. Upgrade this existing helper to handle
netmem, and have it skip dma_sync if the memory is from a dmabuf memory
provider. Drivers should migrate to using this helper when adding
support for netmem.
Also minimize the impact on the dma syncing performance for pages. Special
case the dma-sync path for pages to not go through the overhead checks
for dma-syncing and conversion to netmem.
Samiullah Khawaja [Wed, 11 Dec 2024 21:20:30 +0000 (21:20 +0000)]
page_pool: Set `dma_sync` to false for devmem memory provider
Move the `dma_map` and `dma_sync` checks to `page_pool_init` to make
them generic. Set dma_sync to false for devmem memory provider because
the dma_sync APIs should not be used for dma_buf backed devmem memory
provider.
====================
xdp: a fistful of generic changes pt. II (part)
XDP for idpf is currently 5.5 chapters:
* convert Rx to libeth;
* convert Tx and stats to libeth;
* generic XDP and XSk code changes;
* generic XDP and XSk code additions (you are here);
* actual XDP for idpf via new libeth_xdp;
* XSk for idpf (via ^).
Part III.2.1 does the following:
* allows mixing pages from several Page Pools within one XDP frame;
* optimizes &xdp_frame structure and removes no-more-used field;
Everything is prereq for libeth_xdp, but will be useful standalone
as well: faster xdp_return_frame_bulk() and xdp_frame fields access.
====================
Alexander Lobakin [Wed, 11 Dec 2024 17:26:47 +0000 (18:26 +0100)]
skbuff: allow 2-4-argument skb_frag_dma_map()
skb_frag_dma_map(dev, frag, 0, skb_frag_size(frag), DMA_TO_DEVICE)
is repeated across dozens of drivers and really wants a shorthand.
Add a macro which will count args and handle all possible number
from 2 to 5. Semantics:
Alexander Lobakin [Wed, 11 Dec 2024 17:26:40 +0000 (18:26 +0100)]
xdp: make __xdp_return() MP-agnostic
Currently, __xdp_return() takes pointer to the virtual memory to free
a buffer. Apart from that this sometimes provokes redundant
data <--> page conversions, taking data pointer effectively prevents
lots of XDP code to support non-page-backed buffers, as there's no
mapping for the non-host memory (data is always NULL).
Just convert it to always take netmem reference. For
xdp_return_{buff,frame*}(), this chops off one page_address() per each
frag and adds one virt_to_netmem() (same as virt_to_page()) per header
buffer. For __xdp_return() itself, it removes one virt_to_page() for
MEM_TYPE_PAGE_POOL and another one for MEM_TYPE_PAGE_ORDER0, adding
one page_address() for [not really common nowadays]
MEM_TYPE_PAGE_SHARED, but the main effect is that the abovementioned
functions won't die or memleak anymore if the frame has non-host memory
attached and will correctly free those.
Alexander Lobakin [Wed, 11 Dec 2024 17:26:39 +0000 (18:26 +0100)]
xdp: get rid of xdp_frame::mem.id
Initially, xdp_frame::mem.id was used to search for the corresponding
&page_pool to return the page correctly.
However, after that struct page was extended to have a direct pointer
to its PP (netmem has it as well), further keeping of this field makes
no sense. xdp_return_frame_bulk() still used it to do a lookup, and
this leftover is now removed.
Remove xdp_frame::mem and replace it with ::mem_type, as only memory
type still matters and we need to know it to be able to free the frame
correctly.
As a cute side effect, we can now make every scalar field in &xdp_frame
of 4 byte width, speeding up accesses to them.
Alexander Lobakin [Wed, 11 Dec 2024 17:26:38 +0000 (18:26 +0100)]
page_pool: allow mixing PPs within one bulk
The main reason for this change was to allow mixing pages from different
&page_pools within one &xdp_buff/&xdp_frame. Why not? With stuff like
devmem and io_uring zerocopy Rx, it's required to have separate PPs for
header buffers and payload buffers.
Adjust xdp_return_frame_bulk() and page_pool_put_netmem_bulk(), so that
they won't be tied to a particular pool. Let the latter create a
separate bulk of pages which's PP is different from the first netmem of
the bulk and process it after the main loop.
This greatly optimizes xdp_return_frame_bulk(): no more hashtable
lookups and forced flushes on PP mismatch. Also make
xdp_flush_frame_bulk() inline, as it's just one if + function call + one
u32 read, not worth extending the call ladder.
Co-developed-by: Toke Høiland-Jørgensen <toke@redhat.com> # iterative Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Suggested-by: Jakub Kicinski <kuba@kernel.org> # while (count) Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://patch.msgid.link/20241211172649.761483-2-aleksander.lobakin@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
- wifi: mac80211:
- fix a queue stall in certain cases of channel switch
- wake the queues in case of failure in resume
- splice: do not checksum AF_UNIX sockets
- virtio_net: fix BUG()s in BQL support due to incorrect accounting
of purged packets during interface stop
- eth:
- stmmac: fix TSO DMA API mis-usage causing oops
- bnxt_en: fixes for HW GRO: GSO type on 5750X chips and oops
due to incorrect aggregation ID mask on 5760X chips
Previous releases - always broken:
- Bluetooth: improve setsockopt() handling of malformed user input
- eth: ocelot: fix PTP timestamping in presence of packet loss
- ptp: kvm: x86: avoid "fail to initialize ptp_kvm" when simply not
supported"
* tag 'net-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (81 commits)
net: dsa: tag_ocelot_8021q: fix broken reception
net: dsa: microchip: KSZ9896 register regmap alignment to 32 bit boundaries
net: renesas: rswitch: fix initial MPIC register setting
Bluetooth: btmtk: avoid UAF in btmtk_process_coredump
Bluetooth: iso: Fix circular lock in iso_conn_big_sync
Bluetooth: iso: Fix circular lock in iso_listen_bis
Bluetooth: SCO: Add support for 16 bits transparent voice setting
Bluetooth: iso: Fix recursive locking warning
Bluetooth: iso: Always release hdev at the end of iso_listen_bis
Bluetooth: hci_event: Fix using rcu_read_(un)lock while iterating
Bluetooth: hci_core: Fix sleeping function called from invalid context
team: Fix feature propagation of NETIF_F_GSO_ENCAP_ALL
team: Fix initial vlan_feature set in __team_compute_features
bonding: Fix feature propagation of NETIF_F_GSO_ENCAP_ALL
bonding: Fix initial {vlan,mpls}_feature set in bond_compute_features
net, team, bonding: Add netdev_base_features helper
net/sched: netem: account for backlog updates from child qdisc
net: dsa: felix: fix stuck CPU-injected packets with short taprio windows
splice: do not checksum AF_UNIX sockets
net: usb: qmi_wwan: add Telit FE910C04 compositions
...
Jakub Kicinski [Thu, 12 Dec 2024 15:10:39 +0000 (07:10 -0800)]
Merge tag 'for-net-2024-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
Luiz Augusto von Dentz says:
====================
bluetooth pull request for net:
- SCO: Fix transparent voice setting
- ISO: Locking fixes
- hci_core: Fix sleeping function called from invalid context
- hci_event: Fix using rcu_read_(un)lock while iterating
- btmtk: avoid UAF in btmtk_process_coredump
* tag 'for-net-2024-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
Bluetooth: btmtk: avoid UAF in btmtk_process_coredump
Bluetooth: iso: Fix circular lock in iso_conn_big_sync
Bluetooth: iso: Fix circular lock in iso_listen_bis
Bluetooth: SCO: Add support for 16 bits transparent voice setting
Bluetooth: iso: Fix recursive locking warning
Bluetooth: iso: Always release hdev at the end of iso_listen_bis
Bluetooth: hci_event: Fix using rcu_read_(un)lock while iterating
Bluetooth: hci_core: Fix sleeping function called from invalid context
Bluetooth: Improve setsockopt() handling of malformed user input
====================
Robert Hodaszi [Wed, 11 Dec 2024 14:47:41 +0000 (15:47 +0100)]
net: dsa: tag_ocelot_8021q: fix broken reception
The blamed commit changed the dsa_8021q_rcv() calling convention to
accept pre-populated source_port and switch_id arguments. If those are
not available, as in the case of tag_ocelot_8021q, the arguments must be
pre-initialized with -1.
Due to the bug of passing uninitialized arguments in tag_ocelot_8021q,
dsa_8021q_rcv() does not detect that it needs to populate the
source_port and switch_id, and this makes dsa_conduit_find_user() fail,
which leads to packet loss on reception.
Fixes: dcfe7673787b ("net: dsa: tag_sja1105: absorb logic for not overwriting precise info into dsa_8021q_rcv()") Signed-off-by: Robert Hodaszi <robert.hodaszi@digi.com> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://patch.msgid.link/20241211144741.1415758-1-robert.hodaszi@digi.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jesse Van Gavere [Wed, 11 Dec 2024 09:29:32 +0000 (10:29 +0100)]
net: dsa: microchip: KSZ9896 register regmap alignment to 32 bit boundaries
Commit 8d7ae22ae9f8 ("net: dsa: microchip: KSZ9477 register regmap
alignment to 32 bit boundaries") fixed an issue whereby regmap_reg_range
did not allow writes as 32 bit words to KSZ9477 PHY registers, this fix
for KSZ9896 is adapted from there as the same errata is present in
KSZ9896C as "Module 5: Certain PHY registers must be written as pairs
instead of singly" the explanation below is likewise taken from this
commit.
The commit provided code
to apply "Module 6: Certain PHY registers must be written as pairs instead
of singly" errata for KSZ9477 as this chip for certain PHY registers
(0xN120 to 0xN13F, N=1,2,3,4,5) must be accessed as 32 bit words instead
of 16 or 8 bit access.
Otherwise, adjacent registers (no matter if reserved or not) are
overwritten with 0x0.
Without this patch some registers (e.g. 0x113c or 0x1134) required for 32
bit access are out of valid regmap ranges.
As a result, following error is observed and KSZ9896 is not properly
configured:
ksz-switch spi1.0: can't rmw 32bit reg 0x113c: -EIO
ksz-switch spi1.0: can't rmw 32bit reg 0x1134: -EIO
ksz-switch spi1.0 lan1 (uninitialized): failed to connect to PHY: -EIO
ksz-switch spi1.0 lan1 (uninitialized): error -5 setting up PHY for tree 0, switch 0, port 0
The solution is to modify regmap_reg_range to allow accesses with 4 bytes
boundaries.
Thadeu Lima de Souza Cascardo [Tue, 10 Dec 2024 19:36:10 +0000 (16:36 -0300)]
Bluetooth: btmtk: avoid UAF in btmtk_process_coredump
hci_devcd_append may lead to the release of the skb, so it cannot be
accessed once it is called.
==================================================================
BUG: KASAN: slab-use-after-free in btmtk_process_coredump+0x2a7/0x2d0 [btmtk]
Read of size 4 at addr ffff888033cfabb0 by task kworker/0:3/82
The buggy address belongs to the object at ffff888033cfab40
which belongs to the cache skbuff_head_cache of size 232
The buggy address is located 112 bytes inside of
freed 232-byte region [ffff888033cfab40, ffff888033cfac28)
Memory state around the buggy address: ffff888033cfaa80: fb fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc ffff888033cfab00: fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb fb
>ffff888033cfab80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^ ffff888033cfac00: fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc ffff888033cfac80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================
Check if we need to call hci_devcd_complete before calling
hci_devcd_append. That requires that we check data->cd_info.cnt >=
MTK_COREDUMP_NUM instead of data->cd_info.cnt > MTK_COREDUMP_NUM, as we
increment data->cd_info.cnt only once the call to hci_devcd_append
succeeds.
Fixes: 0b7015132878 ("Bluetooth: btusb: mediatek: add MediaTek devcoredump support") Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Iulia Tanasescu [Mon, 9 Dec 2024 09:42:18 +0000 (11:42 +0200)]
Bluetooth: iso: Fix circular lock in iso_conn_big_sync
This fixes the circular locking dependency warning below, by reworking
iso_sock_recvmsg, to ensure that the socket lock is always released
before calling a function that locks hdev.
[ 561.670344] ======================================================
[ 561.670346] WARNING: possible circular locking dependency detected
[ 561.670349] 6.12.0-rc6+ #26 Not tainted
[ 561.670351] ------------------------------------------------------
[ 561.670353] iso-tester/3289 is trying to acquire lock:
[ 561.670355] ffff88811f600078 (&hdev->lock){+.+.}-{3:3},
at: iso_conn_big_sync+0x73/0x260 [bluetooth]
[ 561.670405]
but task is already holding lock:
[ 561.670407] ffff88815af58258 (sk_lock-AF_BLUETOOTH){+.+.}-{0:0},
at: iso_sock_recvmsg+0xbf/0x500 [bluetooth]
[ 561.670450]
which lock already depends on the new lock.
Fixes: 07a9342b94a9 ("Bluetooth: ISO: Send BIG Create Sync via hci_sync") Signed-off-by: Iulia Tanasescu <iulia.tanasescu@nxp.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Iulia Tanasescu [Mon, 9 Dec 2024 09:42:17 +0000 (11:42 +0200)]
Bluetooth: iso: Fix circular lock in iso_listen_bis
This fixes the circular locking dependency warning below, by
releasing the socket lock before enterning iso_listen_bis, to
avoid any potential deadlock with hdev lock.
[ 75.307983] ======================================================
[ 75.307984] WARNING: possible circular locking dependency detected
[ 75.307985] 6.12.0-rc6+ #22 Not tainted
[ 75.307987] ------------------------------------------------------
[ 75.307987] kworker/u81:2/2623 is trying to acquire lock:
[ 75.307988] ffff8fde1769da58 (sk_lock-AF_BLUETOOTH-BTPROTO_ISO)
at: iso_connect_cfm+0x253/0x840 [bluetooth]
[ 75.308021]
but task is already holding lock:
[ 75.308022] ffff8fdd61a10078 (&hdev->lock)
at: hci_le_per_adv_report_evt+0x47/0x2f0 [bluetooth]
[ 75.308053]
which lock already depends on the new lock.
Fixes: 02171da6e86a ("Bluetooth: ISO: Add hcon for listening bis sk") Signed-off-by: Iulia Tanasescu <iulia.tanasescu@nxp.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Frédéric Danis [Thu, 5 Dec 2024 15:51:59 +0000 (16:51 +0100)]
Bluetooth: SCO: Add support for 16 bits transparent voice setting
The voice setting is used by sco_connect() or sco_conn_defer_accept()
after being set by sco_sock_setsockopt().
The PCM part of the voice setting is used for offload mode through PCM
chipset port.
This commits add support for mSBC 16 bits offloading, i.e. audio data
not transported over HCI.
The BCM4349B1 supports 16 bits transparent data on its I2S port.
If BT_VOICE_TRANSPARENT is used when accepting a SCO connection, this
gives only garbage audio while using BT_VOICE_TRANSPARENT_16BIT gives
correct audio.
This has been tested with connection to iPhone 14 and Samsung S24.
Fixes: ad10b1a48754 ("Bluetooth: Add Bluetooth socket voice option") Signed-off-by: Frédéric Danis <frederic.danis@collabora.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Iulia Tanasescu [Wed, 4 Dec 2024 12:28:49 +0000 (14:28 +0200)]
Bluetooth: iso: Fix recursive locking warning
This updates iso_sock_accept to use nested locking for the parent
socket, to avoid lockdep warnings caused because the parent and
child sockets are locked by the same thread:
[ 41.585683] ============================================
[ 41.585688] WARNING: possible recursive locking detected
[ 41.585694] 6.12.0-rc6+ #22 Not tainted
[ 41.585701] --------------------------------------------
[ 41.585705] iso-tester/3139 is trying to acquire lock:
[ 41.585711] ffff988b29530a58 (sk_lock-AF_BLUETOOTH)
at: bt_accept_dequeue+0xe3/0x280 [bluetooth]
[ 41.585905]
but task is already holding lock:
[ 41.585909] ffff988b29533a58 (sk_lock-AF_BLUETOOTH)
at: iso_sock_accept+0x61/0x2d0 [bluetooth]
[ 41.586064]
other info that might help us debug this:
[ 41.586069] Possible unsafe locking scenario: