Eric Dumazet [Wed, 29 Jan 2025 14:27:26 +0000 (14:27 +0000)]
net: revert RTNL changes in unregister_netdevice_many_notify()
This patch reverts the following changes:
83419b61d187 net: reduce RTNL hold duration in unregister_netdevice_many_notify() (part 2)
ae646f1a0bb9 net: reduce RTNL hold duration in unregister_netdevice_many_notify() (part 1)
cfa579f66656 net: no longer hold RTNL while calling flush_all_backlogs()
This caused issues in layers holding a private mutex:
====================
mptcp: blackhole only if 1st SYN retrans w/o MPC is accepted
Here are two small fixes for issues introduced in v6.12.
- Patch 1: reset the mpc_drop mark for other SYN retransmits, to only
consider an MPTCP blackhole when the first SYN retransmitted without
the MPTCP options is accepted, as initially intended.
- Patch 2: also mention in the doc that the blackhole_timeout sysctl
knob is per-netns, like all the others.
doc: mptcp: sysctl: blackhole_timeout is per-netns
All other sysctl entries mention it, and it is a per-namespace sysctl.
So mention it as well.
Fixes: 27069e7cb3d1 ("mptcp: disable active MPTCP in case of blackhole")
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
mptcp: blackhole only if 1st SYN retrans w/o MPC is accepted
The Fixes commit mentioned this:
> An MPTCP firewall blackhole can be detected if the following SYN
> retransmission after a fallback to "plain" TCP is accepted.
But in fact, this blackhole was detected if any of the following SYN
retransmissions after a fallback to TCP was accepted.
That's because 'mptcp_subflow_early_fallback()' will set 'request_mptcp'
to 0, and 'mpc_drop' will never be reset to 0 after.
This is an issue, because some not so unusual situations might cause the
kernel to detect a false-positive blackhole, e.g. a client trying to
connect to a server while the network is not ready yet, causing a few
SYN retransmissions, before reaching the end server.
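The intended behaviour can be modelled in a few lines of userspace C. This is only a sketch, not the kernel code: the struct and the function names (syn_state, on_syn_retransmit, blackhole_suspected) are hypothetical, and only the 'request_mptcp' and 'mpc_drop' flag names are taken from the message above.

```c
#include <stdbool.h>

/* Hypothetical model of the per-connection flags named in the commit. */
struct syn_state {
    bool request_mptcp; /* SYN still carries the MPTCP options */
    bool mpc_drop;      /* last SYN was the first one sent without MPC */
};

/* Called for each SYN retransmission; fallback_now means this
 * retransmission is the one that drops the MPTCP options. */
static void on_syn_retransmit(struct syn_state *s, bool fallback_now)
{
    if (s->request_mptcp && fallback_now) {
        /* First SYN retransmitted without the MPTCP options. */
        s->request_mptcp = false;
        s->mpc_drop = true;
    } else {
        /* Any other retransmission: reset the mark. */
        s->mpc_drop = false;
    }
}

/* A blackhole is suspected only if the mark is still set on accept. */
static bool blackhole_suspected(const struct syn_state *s)
{
    return s->mpc_drop;
}
```

In this model, a connection accepted right after the first plain-TCP SYN is flagged, while one accepted after further retransmissions is not.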
Fixes: 27069e7cb3d1 ("mptcp: disable active MPTCP in case of blackhole")
Cc: stable@vger.kernel.org
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
====================
Fix missing rtnl lock in suspend path
Fix the suspend path by ensuring the rtnl lock is held where required.
Calls to open, close and WOL operations must be performed under the
rtnl lock to prevent conflicts with ongoing ndo operations.
Discussion about this issue can be found here:
https://lore.kernel.org/netdev/20250120141926.1290763-1-kory.maincent@bootlin.com/
While working on the ravb fix, it was discovered that the sh_eth driver
has the same issue. This patch series addresses both drivers.
I do not have access to hardware for either of these MACs, so it would
be great if maintainers or others with the relevant boards could test
these fixes.
Kory Maincent [Wed, 29 Jan 2025 09:50:47 +0000 (10:50 +0100)]
net: sh_eth: Fix missing rtnl lock in suspend/resume path
Fix the suspend/resume path by ensuring the rtnl lock is held where
required. Calls to sh_eth_close, sh_eth_open and wol operations must be
performed under the rtnl lock to prevent conflicts with ongoing ndo
operations.
Kory Maincent [Wed, 29 Jan 2025 09:50:46 +0000 (10:50 +0100)]
net: ravb: Fix missing rtnl lock in suspend/resume path
Fix the suspend/resume path by ensuring the rtnl lock is held where
required. Calls to ravb_open, ravb_close and wol operations must be
performed under the rtnl lock to prevent conflicts with ongoing ndo
operations.
Paolo Abeni [Thu, 30 Jan 2025 10:00:31 +0000 (11:00 +0100)]
Merge tag 'for-net-2025-01-29' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
Luiz Augusto von Dentz says:
====================
bluetooth pull request for net:
- btusb: mediatek: Add locks for usb_driver_claim_interface()
- L2CAP: accept zero as a special value for MTU auto-selection
- btusb: Fix possible infinite recursion of btusb_reset
- Add ABI doc for sysfs reset
- btnxpuart: Fix glitches seen in dual A2DP streaming
* tag 'for-net-2025-01-29' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
Bluetooth: L2CAP: accept zero as a special value for MTU auto-selection
Bluetooth: btnxpuart: Fix glitches seen in dual A2DP streaming
Bluetooth: Add ABI doc for sysfs reset
Bluetooth: Fix possible infinite recursion of btusb_reset
Bluetooth: btusb: mediatek: Add locks for usb_driver_claim_interface()
====================
Toke Høiland-Jørgensen [Mon, 27 Jan 2025 13:13:42 +0000 (14:13 +0100)]
net: xdp: Disallow attaching device-bound programs in generic mode
Device-bound programs are used to support RX metadata kfuncs. These
kfuncs are driver-specific and rely on the driver context to read the
metadata. This means they can't work in generic XDP mode. However, there
is no check to disallow such programs from being attached in generic
mode, in which case the metadata kfuncs will be called in an invalid
context, leading to crashes.
Fix this by adding a check to disallow attaching device-bound programs
in generic mode.
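A check of this kind might look like the following userspace sketch. The types, the mode enum and the -EINVAL return value are assumptions made for illustration; they are not the kernel's actual definitions.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the XDP attach mode and a BPF program. */
enum attach_mode { MODE_NATIVE, MODE_GENERIC };

struct prog_model {
    bool dev_bound; /* program was loaded bound to a specific device */
};

/* Refuse device-bound programs in generic mode, where the driver
 * context needed by the metadata kfuncs does not exist. */
static int check_attach(const struct prog_model *p, enum attach_mode mode)
{
    if (mode == MODE_GENERIC && p->dev_bound)
        return -EINVAL;
    return 0;
}
```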
Fixes: 2b3486bc2d23 ("bpf: Introduce device-bound XDP programs")
Reported-by: Marcus Wichelmann <marcus.wichelmann@hetzner-cloud.de>
Closes: https://lore.kernel.org/r/dae862ec-43b5-41a0-8edf-46c59071cdda@hetzner-cloud.de
Tested-by: Marcus Wichelmann <marcus.wichelmann@hetzner-cloud.de>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20250127131344.238147-1-toke@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jon Maloy [Mon, 27 Jan 2025 23:13:04 +0000 (18:13 -0500)]
tcp: correct handling of extreme memory squeeze
Testing with iperf3 using the "pasta" protocol splicer has revealed
a problem in the way tcp handles window advertising in extreme memory
squeeze situations.
Under memory pressure, a socket endpoint may temporarily advertise
a zero-sized window, but this is not stored as part of the socket data.
The reasoning behind this is that it is considered a temporary setting
which shouldn't influence any further calculations.
However, if we happen to stall at an unfortunate value of the current
window size, the algorithm selecting a new value will consistently fail
to advertise a non-zero window once we have freed up enough memory.
This means that this side's notion of the current window size is
different from the one last advertised to the peer, causing the latter
to not send any data to resolve the situation.
The problem occurs on the iperf3 server side, and the socket in question
is a completely regular socket with the default settings for the
fedora40 kernel. We do not use SO_PEEK or SO_RCVBUF on the socket.
The following excerpt of a logging session, with our own comments added,
shows in more detail what is happening:
// Receive queue is at 85 buffers and we are out of memory.
// We drop the incoming buffer, although it is in sequence, and decide
// to send an advertisement with a window of zero.
// We don't update tp->rcv_wnd and tp->rcv_wup accordingly, which means
// we unconditionally shrink the window.
// After each read, the algorithm for calculating the new receive
// window in __tcp_cleanup_rbuf() finds it is too small to advertise
// or to update tp->rcv_wnd.
// Meanwhile, the peer thinks the window is zero, and will not send
// any more data to trigger an update from the interrupt mode side.
// The receive queue is empty, but no new advertisement has been sent.
// The peer still thinks the receive window is zero, and sends nothing.
// We have ended up in a deadlock situation.
Note that well-behaved endpoints will send win0 probes, so the problem
will not occur.
Furthermore, we have observed that in these situations this side may
send out an updated 'th->ack_seq' which is not stored in tp->rcv_wup
as it should be. Backing ack_seq seems to be harmless, but is of
course still wrong from a protocol viewpoint.
We fix this by updating the socket state correctly when a packet has
been dropped because of memory exhaustion and we have to advertise
a zero window.
Further testing shows that the connection recovers neatly from the
squeeze situation, and traffic can continue indefinitely.
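The idea of keeping the socket state in step with what was advertised can be illustrated with a simplified model. The field names rcv_wnd and rcv_wup come from the message above; rcv_nxt, the struct and both helpers are assumptions, and the real window selection logic is considerably more involved.

```c
#include <stdint.h>

/* Simplified receive-side state; not the kernel's struct tcp_sock. */
struct rx_state {
    uint32_t rcv_nxt; /* next sequence number expected (assumed field) */
    uint32_t rcv_wup; /* rcv_nxt at the time of the last window update */
    uint32_t rcv_wnd; /* window last advertised to the peer */
};

/* On a drop caused by memory exhaustion, record the zero window that
 * is about to be sent, so local state matches the peer's view. */
static uint32_t advertise_zero_window(struct rx_state *s)
{
    s->rcv_wnd = 0;
    s->rcv_wup = s->rcv_nxt;
    return 0;
}

/* Hypothetical check: an update is worth sending only if the new
 * window differs from what we believe the peer has been told. */
static int window_update_needed(const struct rx_state *s, uint32_t new_wnd)
{
    return new_wnd != s->rcv_wnd;
}
```

With a stale rcv_wnd the model sees nothing to announce once memory is freed; after recording the zero window, any non-zero value counts as a change.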
Fixes: e2142825c120 ("net: tcp: send zero-window ACK when no memory")
Cc: Menglong Dong <menglong8.dong@gmail.com>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://patch.msgid.link/20250127231304.1465565-1-jmaloy@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Rafał Miłecki [Mon, 27 Jan 2025 17:51:59 +0000 (09:51 -0800)]
bgmac: reduce max frame size to support just MTU 1500
bgmac allocates new replacement buffer before handling each received
frame. Allocating & DMA-preparing 9724 B each time consumes a lot of CPU
time. Ideally bgmac should just respect currently set MTU but it isn't
the case right now. For now, just revert to the old limited frame
size.
This change bumps NAT masquerade speed by ~95%.
Since commit 8218f62c9c9b ("mm: page_frag: use initial zero offset for
page_frag_alloc_align()"), the bgmac driver fails to open its network
interface successfully and runs out of memory in the following call
stack:
So in that case we do indeed have offset + fragsz (40192) > size (32768)
and so we would eventually return NULL. Reverting to the older 1500
bytes MTU allows the network driver to be usable again.
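The failing condition quoted above is plain arithmetic and can be checked with a small helper. The function name is hypothetical, and treating the whole 40192 bytes as the fragment size with a zero offset is an assumption about how the quoted sum is split.

```c
#include <stdbool.h>
#include <stddef.h>

/* True if a fragment of fragsz bytes starting at offset still fits
 * inside a backing area of size bytes. */
static bool frag_fits(size_t offset, size_t fragsz, size_t size)
{
    return offset + fragsz <= size;
}
```

For example, frag_fits(0, 40192, 32768) is false, which corresponds to the NULL return described in the message, while a buffer sized for a 1500-byte MTU fits comfortably.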
Fixes: 8c7da63978f1 ("bgmac: configure MTU and add support for frames beyond 8192 byte size")
Signed-off-by: Rafał Miłecki <rafal@milecki.pl>
[florian: expand commit message about recent commits]
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20250127175159.1788246-1-florian.fainelli@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
vsock: Transport reassignment and error handling issues
Series deals with two issues:
- socket reference count imbalance due to an unforgiving transport release
(triggered by transport reassignment);
- unintentional API feature, a failing connect() making the socket
impossible to use for any subsequent connect() attempts.
Michal Luczaj [Tue, 28 Jan 2025 13:15:32 +0000 (14:15 +0100)]
vsock/test: Add test for connect() retries
Deliberately fail a connect() attempt; expect error. Then verify that
subsequent attempt (using the same socket) can still succeed, rather than
fail outright.
Michal Luczaj [Tue, 28 Jan 2025 13:15:28 +0000 (14:15 +0100)]
vsock: Allow retrying on connect() failure
sk_err is set when a (connectible) connect() fails. Effectively, this makes
an otherwise still healthy SS_UNCONNECTED socket impossible to use for any
subsequent connection attempts.
Clear sk_err upon trying to establish a connection.
Michal Luczaj [Tue, 28 Jan 2025 13:15:27 +0000 (14:15 +0100)]
vsock: Keep the binding until socket destruction
Preserve socket bindings; this includes both those resulting from an explicit
bind() and those implicitly bound through autobind during connect().
Prevents socket unbinding during a transport reassignment, which fixes a
use-after-free:
1. vsock_create() (refcnt=1) calls vsock_insert_unbound() (refcnt=2)
2. transport->release() calls vsock_remove_bound() without checking if
sk was bound and moved to bound list (refcnt=1)
3. vsock_bind() assumes sk is in unbound list and before
__vsock_insert_bound(vsock_bound_sockets()) calls
__vsock_remove_bound() which does:
list_del_init(&vsk->bound_table); // nop
sock_put(&vsk->sk); // refcnt=0
BUG: KASAN: slab-use-after-free in __vsock_bind+0x62e/0x730
Read of size 4 at addr ffff88816b46a74c by task a.out/2057
dump_stack_lvl+0x68/0x90
print_report+0x174/0x4f6
kasan_report+0xb9/0x190
__vsock_bind+0x62e/0x730
vsock_bind+0x97/0xe0
__sys_bind+0x154/0x1f0
__x64_sys_bind+0x6e/0xb0
do_syscall_64+0x93/0x1b0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Fedor Pchelkin [Tue, 28 Jan 2025 21:08:14 +0000 (00:08 +0300)]
Bluetooth: L2CAP: accept zero as a special value for MTU auto-selection
One of the possible ways to enable the input MTU auto-selection for L2CAP
connections is supposed to be through passing a special "0" value for it
as a socket option. Commit [1] added one of those into avdtp. However, it
simply wouldn't work because the kernel still treats the specified value
as invalid and denies the setting attempt. Recorded BlueZ logs include the
following:
Found by Linux Verification Center (linuxtesting.org).
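The relaxed validation might be sketched like this. The function name is hypothetical, and the minimum of 48 is an assumption based on the default L2CAP minimum MTU; the actual kernel check may be structured differently.

```c
#include <errno.h>
#include <stdint.h>

#define MIN_MTU_ASSUMED 48 /* assumed L2CAP minimum MTU */

/* Accept 0 as "let the kernel pick the input MTU"; otherwise the
 * requested value must not be below the minimum. */
static int validate_imtu(uint16_t mtu)
{
    if (mtu != 0 && mtu < MIN_MTU_ASSUMED)
        return -EINVAL;
    return 0;
}
```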
Fixes: 4b6e228e297b ("Bluetooth: Auto tune if input MTU is set to 0")
Cc: stable@vger.kernel.org
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Neeraj Sanjay Kale [Mon, 20 Jan 2025 14:19:46 +0000 (19:49 +0530)]
Bluetooth: btnxpuart: Fix glitches seen in dual A2DP streaming
This fixes a regression caused by a previous commit for fixing truncated
ACL data, which is causing some intermittent glitches when running two
A2DP streams.
serdev_device_write_buf() is the root cause of the glitch, which is
reverted, and the TX work will continue to write until the queue is empty.
This change fixes both issues. No A2DP streaming glitches or truncated
ACL data issue observed.
Fixes: 8023dd220425 ("Bluetooth: btnxpuart: Fix driver sending truncated data")
Fixes: 689ca16e5232 ("Bluetooth: NXP: Add protocol support for NXP Bluetooth chipsets")
Signed-off-by: Neeraj Sanjay Kale <neeraj.sanjaykale@nxp.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Hsin-chen Chuang [Mon, 20 Jan 2025 10:39:39 +0000 (18:39 +0800)]
Bluetooth: Fix possible infinite recursion of btusb_reset
The function enters infinite recursion if the HCI device doesn't support
GPIO reset: btusb_reset -> hdev->reset -> vendor_reset -> btusb_reset...
btusb_reset shouldn't call hdev->reset after commit f07d478090b0
("Bluetooth: Get rid of cmd_timeout and use the reset callback")
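The shape of the fix can be shown with a small model in which the driver-level reset never goes back through the device's reset callback. All names below are hypothetical, and the port-reset fallback is an assumption about what happens when no GPIO reset is available.

```c
/* Hypothetical model of an HCI device with a reset callback. */
struct hdev_model {
    void (*reset)(struct hdev_model *h);
    int has_gpio_reset;
    int resets_done;
};

static void gpio_reset(struct hdev_model *h) { h->resets_done++; }
static void port_reset(struct hdev_model *h) { h->resets_done++; }

/* Driver-level reset: picks a mechanism itself and never calls
 * h->reset, so it cannot re-enter the vendor hook. */
static void driver_reset(struct hdev_model *h)
{
    if (h->has_gpio_reset)
        gpio_reset(h);
    else
        port_reset(h);
}

/* Vendor hook installed as h->reset; it delegates to the driver. */
static void vendor_reset(struct hdev_model *h)
{
    driver_reset(h);
}
```

Calling h->reset on a device without GPIO reset now performs exactly one reset and returns, instead of bouncing between the two functions.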
Fixes: f07d478090b0 ("Bluetooth: Get rid of cmd_timeout and use the reset callback")
Signed-off-by: Hsin-chen Chuang <chharry@chromium.org>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Douglas Anderson [Thu, 16 Jan 2025 03:36:36 +0000 (19:36 -0800)]
Bluetooth: btusb: mediatek: Add locks for usb_driver_claim_interface()
The documentation for usb_driver_claim_interface() says that "the
device lock" is needed when the function is called from places other
than probe(). This appears to be the lock for the USB interface
device. The Mediatek btusb code gets called via this path:
With the above call trace the device lock hasn't been claimed. Claim
it.
Without this fix, we'd sometimes see the error "Failed to claim iso
interface". Sometimes we'd even see worse errors, like a NULL pointer
dereference (where `intf->dev.driver` was NULL) with a trace like:
Both errors appear to be fixed with the proper locking.
Fixes: ceac1cb0259d ("Bluetooth: btusb: mediatek: add ISO data transmission functions")
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Cosmin Ratiu [Mon, 27 Jan 2025 10:41:47 +0000 (12:41 +0200)]
bonding: Correctly support GSO ESP offload
The referenced fix is incomplete. It correctly computes
bond_dev->gso_partial_features across slaves, but unfortunately
netdev_fix_features discards gso_partial_features from the feature set
if NETIF_F_GSO_PARTIAL isn't set in bond_dev->features.
This is visible with ethtool -k bond0 | grep esp:
tx-esp-segmentation: off [requested on]
esp-hw-offload: on
esp-tx-csum-hw-offload: on
This patch reworks the bonding GSO offload support by:
- making aggregating gso_partial_features across slaves similar to the
other feature sets (this part is a no-op).
- advertising the default partial gso features on empty bond devs, same
as with other feature sets (also a no-op).
- adding NETIF_F_GSO_PARTIAL to hw_enc_features filtered across slaves.
- adding NETIF_F_GSO_PARTIAL to features in bond_setup()
With all of these, 'ethtool -k bond0 | grep esp' now reports:
tx-esp-segmentation: on
esp-hw-offload: on
esp-tx-csum-hw-offload: on
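The interaction with netdev_fix_features described above can be modelled with two bit flags. The flag values and the function are stand-ins chosen for this sketch; they are not the kernel's NETIF_F_* definitions.

```c
#include <stdint.h>

#define F_GSO_PARTIAL (1u << 0) /* stand-in for NETIF_F_GSO_PARTIAL */
#define F_GSO_ESP     (1u << 1) /* stand-in for ESP segmentation */

/* Partially supported GSO features survive only if the device also
 * advertises GSO partial itself. */
static uint32_t fix_features(uint32_t features, uint32_t gso_partial_features)
{
    if (!(features & F_GSO_PARTIAL))
        features &= ~gso_partial_features;
    return features;
}
```

With F_GSO_PARTIAL missing from the device features, the ESP bit is dropped ("off [requested on]"); once it is present, the bit is kept.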
Fixes: 4861333b4217 ("bonding: add ESP offload features when slaves support")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Acked-by: Jay Vosburgh <jv@jvosburgh.net>
Link: https://patch.msgid.link/20250127104147.759658-1-cratiu@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
====================
Limit devicetree parameters to hardware capability
This series includes patches that check the devicetree properties,
the number of MTL queues and FIFO size values, and if these specified
values exceed the value contained in hardware capabilities, limit to
the values from the capabilities. Do nothing if the capabilities don't
have any specified values.
And this sets hardware capability values if FIFO sizes are not specified
and removes redundant lines.
====================
Kunihiko Hayashi [Mon, 27 Jan 2025 01:38:20 +0000 (10:38 +0900)]
net: stmmac: Specify hardware capability value when FIFO size isn't specified
When Tx/Rx FIFO size is not specified in advance, the driver checks if
the value is zero and sets the hardware capability value in functions
where that value is used.
Consolidate the check and settings into function stmmac_hw_init() and
remove redundant other statements.
If FIFO size is zero and the hardware capability also doesn't have upper
limit values, return with an error message.
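The consolidated defaulting logic could look roughly like this. The helper name and the -ENODEV error code are assumptions made for the sketch, not a copy of stmmac_hw_init().

```c
#include <errno.h>
#include <stdint.h>

/* Resolve a FIFO size: keep a configured value, fall back to the
 * hardware capability, and fail if neither is available. */
static int resolve_fifo_size(uint32_t *fifo_size, uint32_t hw_cap)
{
    if (*fifo_size == 0) {
        if (hw_cap == 0)
            return -ENODEV; /* assumed error code */
        *fifo_size = hw_cap;
    }
    return 0;
}
```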
Signed-off-by: Kunihiko Hayashi <hayashi.kunihiko@socionext.com>
Reviewed-by: Yanteng Si <si.yanteng@linux.dev>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Kunihiko Hayashi [Mon, 27 Jan 2025 01:38:19 +0000 (10:38 +0900)]
net: stmmac: Limit FIFO size by hardware capability
Tx/Rx FIFO size is specified by the parameter "{tx,rx}-fifo-depth" from
stmmac_platform layer.
However, these values are constrained by upper limits determined by the
capabilities of each hardware feature. There is a risk that the upper
bits will be truncated due to the calculation, so it's appropriate to
limit them to the upper limit values and display a warning message.
This only works if the hardware capability has the upper limit values.
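The limiting rule amounts to a conditional clamp, sketched below with a hypothetical helper. A capability of zero is taken to mean "no upper limit known", matching the last sentence above.

```c
#include <stdint.h>

/* Limit a devicetree-provided value to the hardware capability.
 * A capability of 0 means no limit is known, so nothing changes. */
static uint32_t clamp_to_capability(uint32_t requested, uint32_t hw_cap)
{
    if (hw_cap != 0 && requested > hw_cap)
        return hw_cap; /* a driver would also print a warning here */
    return requested;
}
```

The same rule would apply to the number of MTL queues as to the FIFO sizes.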
Fixes: e7877f52fd4a ("stmmac: Read tx-fifo-depth and rx-fifo-depth from the devicetree")
Signed-off-by: Kunihiko Hayashi <hayashi.kunihiko@socionext.com>
Reviewed-by: Yanteng Si <si.yanteng@linux.dev>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Kunihiko Hayashi [Mon, 27 Jan 2025 01:38:18 +0000 (10:38 +0900)]
net: stmmac: Limit the number of MTL queues to hardware capability
The number of MTL queues to use is specified by the parameter
"snps,{tx,rx}-queues-to-use" from stmmac_platform layer.
However, the maximum numbers of queues are constrained by upper limits
determined by the capability of each hardware feature. It's appropriate
to limit the values not to exceed the upper limit values and display
a warning message.
This only works if the hardware capability has the upper limit values.
Fixes: d976a525c371 ("net: stmmac: multiple queues dt configuration")
Signed-off-by: Kunihiko Hayashi <hayashi.kunihiko@socionext.com>
Reviewed-by: Yanteng Si <si.yanteng@linux.dev>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Gal Pressman [Sun, 26 Jan 2025 19:18:45 +0000 (21:18 +0200)]
ethtool: Fix set RXNFC command with symmetric RSS hash
The sanity check that both source and destination are set when symmetric
RSS hash is requested is only relevant for ETHTOOL_SRXFH (rx-flow-hash),
it should not be performed on any other commands (e.g.
ETHTOOL_SRXCLSRLINS/ETHTOOL_SRXCLSRLDEL).
This resolves accessing uninitialized 'info.data' field, and fixes false
errors in rule insertion:
# ethtool --config-ntuple eth2 flow-type ip4 dst-ip 255.255.255.255 action -1 loc 0
rmgr: Cannot insert RX class rule: Invalid argument
Cannot insert classification rule
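Scoping the sanity check to a single command can be sketched as follows. The enum, the bit flags and the exact symmetry rule (reject a request that names only one of source and destination) are assumptions for illustration rather than the ethtool core's real code.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for the ethtool commands named in the commit. */
enum rxnfc_cmd { CMD_SRXFH, CMD_SRXCLSRLINS, CMD_SRXCLSRLDEL };

#define HASH_SRC (1u << 0) /* hash on the source field */
#define HASH_DST (1u << 1) /* hash on the destination field */

/* The hash-field word is only meaningful for CMD_SRXFH, so the
 * symmetry check must not look at it for any other command. */
static int check_symmetric(enum rxnfc_cmd cmd, bool symmetric, uint32_t fields)
{
    bool src = (fields & HASH_SRC) != 0;
    bool dst = (fields & HASH_DST) != 0;

    if (cmd != CMD_SRXFH)
        return 0;
    if (symmetric && src != dst)
        return -EINVAL;
    return 0;
}
```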
Fixes: 13e59344fb9d ("net: ethtool: add support for symmetric-xor RSS hash")
Cc: Ahmed Zaki <ahmed.zaki@intel.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Reviewed-by: Edward Cree <ecree.xilinx@gmail.com>
Reviewed-by: Ahmed Zaki <ahmed.zaki@intel.com>
Link: https://patch.msgid.link/20250126191845.316589-1-gal@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
====================
usbnet: ipheth: prevent OoB reads of NDP16
iOS devices support two types of tethering over USB: regular, where the
internet connection is shared from the phone to the attached computer,
and reverse, where the internet connection is shared from the attached
computer to the phone.
The `ipheth` driver is responsible for regular tethering only. With this
tethering type, iOS devices support two encapsulation modes on RX:
legacy and NCM.
In "NCM mode", the iOS device encapsulates RX (phone->computer) traffic
in NCM Transfer Blocks (similarly to CDC NCM). However, unlike reverse
tethering, regular tethering is not compliant with the CDC NCM spec:
* Does not have the required CDC NCM descriptors
* TX (computer->phone) is not NCM-encapsulated at all
Thus `ipheth` implements a very limited subset of the spec with the sole
purpose of parsing RX URBs. This driver does not aim to be
a CDC NCM-compliant implementation and, in fact, can't be one because of
the points above.
For a complete spec-compliant CDC NCM implementation, there is already
the `cdc_ncm` driver. This driver is used for reverse tethering on iOS
devices. This patch series does not in any way change `cdc_ncm`.
In the first iteration of the NCM mode implementation in `ipheth`,
there were a few potential out of bounds reads when processing malformed
URBs received from a connected device:
* Only the start of NDP16 (wNdpIndex) was checked to fit in the URB
buffer.
* Datagram length check as part of DPEs could overflow.
* DPEs could be read past the end of NDP16 and even end of URB buffer
if a trailer DPE wasn't encountered.
The above is not expected to happen in normal device operation.
To address the above issues for iOS devices in NCM mode, rely on
and check for a specific fixed format of incoming URBs expected from
an iOS device:
* 12-byte NTH16
* 96-byte NDP16, allowing up to 22 DPEs (up to 21 datagrams + trailer)
On iOS, NDP16 directly follows NTH16, and its length is constant
regardless of the DPE count.
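The numbers above are consistent with each other, as the following arithmetic sketch shows. The 12-byte and 96-byte sizes come from the text; the 8-byte NDP16 fixed part and the 4-byte DPE size are assumptions based on the usual NCM 16-bit layout, and the macro names are made up for this example.

```c
#include <stddef.h>

#define NTH16_SIZE     12 /* from the cover letter */
#define NDP16_SIZE     96 /* from the cover letter */
#define NDP16_HDR_SIZE  8 /* assumed: signature, length, next-NDP index */
#define DPE16_SIZE      4 /* assumed: 16-bit index plus 16-bit length */

/* Number of DPEs that fit into the fixed-size NDP16. */
static size_t max_dpes(void)
{
    return (NDP16_SIZE - NDP16_HDR_SIZE) / DPE16_SIZE;
}

/* Offset of the first byte after the whole NCM header. */
static size_t ncm_header_size(void)
{
    return NTH16_SIZE + NDP16_SIZE;
}
```

Under these assumptions max_dpes() gives 22, matching the "up to 21 datagrams + trailer" figure, and the payload starts 108 bytes into the URB.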
As the regular tethering implementation of iOS devices isn't compliant
with CDC NCM, it's not possible to use the `cdc_ncm` driver to handle
this functionality. Furthermore, while the logic required to properly
parse URBs with NCM-encapsulated frames is already part of said driver,
I haven't found a nice way to reuse the existing code without messing
with the `cdc_ncm` driver itself.
I didn't want to reimplement more of the spec than I absolutely had to,
because that work had already been done in `cdc_ncm`. Instead, to limit
the scope, I chose to rely on the specific URB format of iOS devices
that hasn't changed since the NCM mode was introduced there.
I tested each individual patch in the v5 series with iPhone 15 Pro Max,
iOS 18.2.1: compiled cleanly, ran iperf3 between phone and computer,
observed no errors in either kernel log or interface statistics.
v4 was Reviewed-by Jakub Kicinski <kuba@kernel.org>. Compared to v4,
v5 has no code changes. The two differences are:
* Patch "usbnet: ipheth: break up NCM header size computation"
moved later in the series, closer to a subsequent commit that makes
use of the change.
* In patch "usbnet: ipheth: refactor NCM datagram loop", removed
a stray paragraph in commit msg.
Above items are also noted in the changelogs of respective patches.
====================
Foster Snowhill [Sat, 25 Jan 2025 23:54:09 +0000 (00:54 +0100)]
usbnet: ipheth: document scope of NCM implementation
Clarify that the "NCM" implementation in `ipheth` is very limited, as
iOS devices aren't compatible with the CDC NCM specification in regular
tethering mode.
For a standards-compliant implementation, one shall turn to
the `cdc_ncm` module.
Cc: stable@vger.kernel.org # 6.5.x
Signed-off-by: Foster Snowhill <forst@pen.gy>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Foster Snowhill [Sat, 25 Jan 2025 23:54:07 +0000 (00:54 +0100)]
usbnet: ipheth: break up NCM header size computation
Originally, the total NCM header size was computed as the sum of two
vaguely labelled constants. While accurate, it wasn't particularly clear
where they were coming from.
Use sizes of existing NCM structs where available. Define the total
NDP16 size based on the maximum amount of DPEs that can fit into the
iOS-specific fixed-size header.
This change does not fix any particular issue. Rather, it introduces
intermediate constants that will simplify subsequent commits.
It should also make it clearer for the reader where the constant values
come from.
Cc: stable@vger.kernel.org # 6.5.x
Signed-off-by: Foster Snowhill <forst@pen.gy>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Foster Snowhill [Sat, 25 Jan 2025 23:54:06 +0000 (00:54 +0100)]
usbnet: ipheth: refactor NCM datagram loop
Introduce an rx_error label to reduce repetitions in the header
signature checks.
Store wDatagramIndex and wDatagramLength after endianness conversion to
avoid repeated le16_to_cpu() calls.
Rewrite the loop to return on a null trailing DPE, which is required
by the CDC NCM spec. In case it is missing, fall through to rx_error.
This change does not fix any particular issue. Its purpose is to
simplify a subsequent commit that fixes a potential OoB read by limiting
the maximum amount of processed DPEs.
Cc: stable@vger.kernel.org # 6.5.x
Signed-off-by: Foster Snowhill <forst@pen.gy>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Foster Snowhill [Sat, 25 Jan 2025 23:54:05 +0000 (00:54 +0100)]
usbnet: ipheth: use static NDP16 location in URB
Original code allowed for the start of NDP16 to be anywhere within the
URB based on the `wNdpIndex` value in NTH16. Only the start position of
NDP16 was checked, so it was possible for even the fixed-length part
of NDP16 to extend past the end of URB, leading to an out-of-bounds
read.
On iOS devices, the NDP16 header always directly follows NTH16. Rely on
and check for this specific format.
This, along with NCM-specific minimal URB length check that already
exists, will ensure that the fixed-length part of NDP16 plus a set
amount of DPEs fit within the URB.
Note that this commit alone does not fully address the OoB read.
The limit on the amount of DPEs needs to be enforced separately.
Foster Snowhill [Sat, 25 Jan 2025 23:54:04 +0000 (00:54 +0100)]
usbnet: ipheth: check that DPE points past NCM header
By definition, a DPE points at the start of a network frame/datagram.
Thus it makes no sense for it to point at anything that's part of the
NCM header. It is not a security issue, but merely an indication of
a malformed DPE.
Enforce that all DPEs point at the data portion of the URB, past the
NCM header.
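Taken together, the checks described in this series (bounded DPE count, null trailer, datagrams located past the header and inside the URB) can be sketched as one loop. This is a userspace model with hypothetical names; it assumes already byte-swapped entries and treats an all-zero entry as the trailer, which may differ in detail from the driver.

```c
#include <stddef.h>
#include <stdint.h>

/* One datagram pointer entry, already converted to host endianness. */
struct dpe_model {
    uint16_t index;  /* offset of the datagram within the URB */
    uint16_t length; /* length of the datagram */
};

/* Walk at most max_dpes entries. Returns the number of datagrams found
 * before the null trailer, or -1 if the input is malformed. */
static int walk_dpes(const struct dpe_model *dpes, size_t max_dpes,
                     size_t hdr_size, size_t urb_len)
{
    for (size_t i = 0; i < max_dpes; i++) {
        size_t idx = dpes[i].index;
        size_t len = dpes[i].length;

        if (idx == 0 && len == 0)
            return (int)i; /* null trailer: done */
        /* Datagram must start past the NCM header and end inside
         * the URB; written so that the subtraction cannot wrap. */
        if (idx < hdr_size || idx > urb_len || len > urb_len - idx)
            return -1;
    }
    return -1; /* no trailer within the allowed number of DPEs */
}
```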
Thomas Weißschuh [Sat, 25 Jan 2025 09:28:38 +0000 (10:28 +0100)]
ptp: Properly handle compat ioctls
Pointer arguments passed to ioctls need to pass through compat_ptr() to
work correctly on s390, as explained in Documentation/driver-api/ioctl.rst.
Detect compat mode at runtime and call compat_ptr() for those commands
which do take pointer arguments.
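Why s390 needs this can be shown with a userspace model. It assumes, as is commonly documented for 31-bit s390 compat tasks, that the top bit of a 32-bit user pointer is not part of the address; the function names are hypothetical and stand in for compat_ptr() and the ioctl argument handling.

```c
#include <stdint.h>

/* Model of compat_ptr() on s390: drop the top bit of a 32-bit
 * user pointer before using it as an address (assumption). */
static uint64_t compat_ptr_model(uint32_t uptr)
{
    return (uint64_t)(uptr & 0x7fffffffu);
}

/* Convert the ioctl argument only when the caller is a compat task
 * and the command actually takes a pointer. */
static uint64_t ioctl_arg(uint64_t arg, int in_compat, int takes_pointer)
{
    if (in_compat && takes_pointer)
        return compat_ptr_model((uint32_t)arg);
    return arg;
}
```

Commands that carry a plain integer are passed through unchanged, so the conversion has to be selective per command.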
Nikita Zhandarovich [Fri, 24 Jan 2025 09:30:20 +0000 (01:30 -0800)]
net: usb: rtl8150: enable basic endpoint checking
Syzkaller reports [1] encountering a common issue of utilizing a wrong
usb endpoint type during the URB submission stage. This, in turn, triggers
a warning shown below.
For now, enable simple endpoint checking (specifically, bulk and
interrupt eps, testing control one is not essential) to mitigate
the issue with a view to doing other related cosmetic changes later,
if they are necessary.
Jakub Kicinski [Tue, 28 Jan 2025 00:16:31 +0000 (16:16 -0800)]
Merge branch '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue
Tony Nguyen says:
====================
Intel Wired LAN Driver Updates 2025-01-24 (idpf, ice, iavf)
For idpf:
Emil adds memory barrier when accessing control queue descriptors and
restores call to idpf_vc_xn_shutdown() when resetting.
Manoj Vishwanathan expands transaction lock to properly protect xn->salt
value and adds additional debugging information.
Marco Leogrande converts workqueues to be unbound.
For ice:
Przemek fixes incorrect size use for array.
Mateusz removes reporting of invalid parameter and value.
For iavf:
Michal adjusts some VLAN changes to occur without a PF call to avoid
timing issues with the calls.
* '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
iavf: allow changing VLAN state without calling PF
ice: remove invalid parameter of equalizer
ice: fix ice_parser_rt::bst_key array size
idpf: add more info during virtchnl transaction timeout/salt mismatch
idpf: convert workqueues to unbound
idpf: Acquire the lock before accessing the xn->salt
idpf: fix transaction timeouts on reset
idpf: add read memory barrier when checking descriptor done bit
====================
1) Fix incrementing the upper 32 bit sequence numbers for GSO skbs.
From Jianbo Liu.
2) Fix an out-of-bounds read on xfrm state lookup.
From Florian Westphal.
3) Fix secpath handling on packet offload mode.
From Alexandre Cassen.
4) Fix the usage of skb->sk in the xfrm layer.
5) Don't disable preemption while looking up cache state
to fix PREEMPT_RT.
From Sebastian Sewior.
* tag 'ipsec-2025-01-27' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec:
xfrm: Don't disable preemption while looking up cache state.
xfrm: Fix the usage of skb->sk
xfrm: delete intermediate secpath entry in packet offload mode
xfrm: state: fix out-of-bounds read during lookup
xfrm: replay: Fix the update of replay_esn->oseq_hi for GSO
====================
The root cause is the bad handling of disconnect() generated internally
by the MPTCP protocol in case of connect FASTOPEN errors.
Address the issue by incrementing the socket disconnect counter even in such
a case, to allow other threads waiting on the same socket lock to
properly error out.
With the in-kernel path-manager, it is possible to change the 'fullmesh'
flag. The code in mptcp_pm_nl_fullmesh() expects to change it only on
'subflow' endpoints, to recreate more or fewer subflows using the linked
address.
Unfortunately, the set_flags() hook was a bit more permissive, and
allowed 'implicit' endpoints to get the 'fullmesh' flag, which was not
allowed before.
That's what syzbot found, triggering the following warning:
Here, syzbot managed to set the 'fullmesh' flag on an 'implicit' and
used -- according to 'id_avail_bitmap' -- endpoint, causing the PM to
try to decrement the local_addr_used counter, which is only incremented for
the 'subflow' endpoint.
Note that 'no type' endpoints -- not 'subflow', 'signal', 'implicit' --
are fine, because their ID will not be marked as used in the 'id_avail'
bitmap, and setting 'fullmesh' can help force the creation of a subflow
when receiving an ADD_ADDR.
Fixes: 73c762c1f07d ("mptcp: set fullmesh flag in pm_netlink")
Cc: stable@vger.kernel.org
Reported-by: syzbot+cd16e79c1e45f3fe0377@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/6786ac51.050a0220.216c54.00a6.GAE@google.com
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/540
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250123-net-mptcp-syzbot-issues-v1-2-af73258a726f@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Paolo Abeni [Thu, 23 Jan 2025 18:05:54 +0000 (19:05 +0100)]
mptcp: consolidate suboption status
MPTCP maintains the received sub-options status in the bitmask carrying
the received suboptions and in several bitfields carrying per suboption
additional info.
Zeroing the bitmask before parsing is not enough to ensure a consistent
status, and the MPTCP code has to additionally clear some bitfields
depending on the actually parsed suboption.
The above schema is fragile, and syzbot managed to trigger a path where
a relevant bitfield is not cleared/initialized:
Local variable mp_opt created at:
mptcp_incoming_options+0x119/0x3d30 net/mptcp/options.c:1127
tcp_data_queue+0xb4/0x7be0 net/ipv4/tcp_input.c:5233
The current schema is too fragile; address the issue by grouping all the
state-related data together and clearing the whole group instead of
just the bitmask. This also cleans up the code a bit, as there is no
need to individually clear "random" bitfields in a couple of places
any more.
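The idea of clearing one group instead of individual members can be illustrated with a small model. The class and field names below are hypothetical stand-ins, not the real layout of the MPTCP options structure.

```python
from dataclasses import dataclass, fields


@dataclass
class SuboptionStatus:
    """Hypothetical group holding everything that describes parsed suboptions."""
    suboptions: int = 0      # bitmask of suboptions seen
    use_ack: bool = False    # per-suboption detail
    dsn64: bool = False      # per-suboption detail

    def clear(self) -> None:
        # Reset every member of the group, not only the bitmask, so no
        # stale per-suboption detail survives into the next parse.
        for f in fields(self):
            setattr(self, f.name, f.default)
```

Any field added to the group later is covered by clear() automatically, which is the property the commit message is after.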
Chenyuan Yang [Thu, 23 Jan 2025 21:42:13 +0000 (15:42 -0600)]
net: davicom: fix UAF in dm9000_drv_remove
dm is netdev private data and it cannot be
used after the free_netdev() call. Using dm after free_netdev()
can cause a UAF bug. Fix it by moving free_netdev() to the end of the
function.
This is similar to the issue fixed in commit ad297cd2db89 ("net: qcom/emac: fix UAF in emac_remove").
This bug is detected by our static analysis tool.
Fixes: cf9e60aa69ae ("net: davicom: Fix regulator not turned off on driver removal") Signed-off-by: Chenyuan Yang <chenyuan0y@gmail.com> CC: Uwe Kleine-König <u.kleine-koenig@baylibre.com> Link: https://patch.msgid.link/20250123214213.623518-1-chenyuan0y@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Milos Reljin [Fri, 24 Jan 2025 10:41:02 +0000 (10:41 +0000)]
net: phy: c45-tjaxx: add delay between MDIO write and read in soft_reset
In the application note (AN13663) for TJA1120, on page 30, there's a figure
with average PHY startup timing values following software reset.
The time it takes for SMI to become operational after software reset
ranges roughly from 500 us to 1500 us.
This commit adds a 2000 us delay after the MDIO write which triggers the
software reset. Without this delay, the soft_reset function returns an
error and prevents successful PHY init.
Shigeru Yoshida [Thu, 23 Jan 2025 14:57:46 +0000 (23:57 +0900)]
vxlan: Fix uninit-value in vxlan_vnifilter_dump()
KMSAN reported an uninit-value access in vxlan_vnifilter_dump() [1].
If the length of the netlink message payload is less than
sizeof(struct tunnel_msg), vxlan_vnifilter_dump() accesses bytes
beyond the message. This can lead to uninit-value access. Fix this by
returning an error in such situations.
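A user-space sketch of that length check follows. The 8-byte layout assumed for struct tunnel_msg (u8 family, u8 flags, u16 reserved, u32 ifindex), the helper name and the -EINVAL return value are assumptions for illustration, not taken from the patch.

```python
import struct

# Assumed layout of struct tunnel_msg: u8 family, u8 flags, u16 reserved, u32 ifindex.
TUNNEL_MSG = struct.Struct("=BBHI")
EINVAL = 22


def parse_tunnel_msg(payload: bytes):
    """Return (family, flags, ifindex), or -EINVAL for a short payload."""
    if len(payload) < TUNNEL_MSG.size:
        # Reading the header here would touch bytes beyond the message.
        return -EINVAL
    family, flags, _reserved, ifindex = TUNNEL_MSG.unpack_from(payload)
    return family, flags, ifindex
```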
David Howells [Thu, 23 Jan 2025 08:59:12 +0000 (08:59 +0000)]
rxrpc, afs: Fix peer hash locking vs RCU callback
In its address list, afs now retains pointers to and refs on one or more
rxrpc_peer objects. The address list is freed under RCU and at this time,
it puts the refs on those peers.
Now, when an rxrpc_peer object runs out of refs, it gets removed from the
peer hash table and, for that, rxrpc has to take a spinlock. However, it
is now being called from afs's RCU cleanup, which takes place in BH
context - but it is just taking an ordinary spinlock.
The put may also be called from non-BH context, and so there exists the
possibility of deadlock if the BH-based RCU cleanup happens whilst the hash
spinlock is held. This led to the attached lockdep complaint.
Fix this by changing spinlocks of rxnet->peer_hash_lock back to
BH-disabling locks.
================================
WARNING: inconsistent lock state
6.13.0-rc5-build2+ #1223 Tainted: G E
--------------------------------
inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
swapper/1/0 [HC0[0]:SC1[1]:HE1:SE0] takes: ffff88810babe228 (&rxnet->peer_hash_lock){+.?.}-{3:3}, at: rxrpc_put_peer+0xcb/0x180
{SOFTIRQ-ON-W} state was registered at:
mark_usage+0x164/0x180
__lock_acquire+0x544/0x990
lock_acquire.part.0+0x103/0x280
_raw_spin_lock+0x2f/0x40
rxrpc_peer_keepalive_worker+0x144/0x440
process_one_work+0x486/0x7c0
process_scheduled_works+0x73/0x90
worker_thread+0x1c8/0x2a0
kthread+0x19b/0x1b0
ret_from_fork+0x24/0x40
ret_from_fork_asm+0x1a/0x30
irq event stamp: 972402
hardirqs last enabled at (972402): [<ffffffff8244360e>] _raw_spin_unlock_irqrestore+0x2e/0x50
hardirqs last disabled at (972401): [<ffffffff82443328>] _raw_spin_lock_irqsave+0x18/0x60
softirqs last enabled at (972300): [<ffffffff810ffbbe>] handle_softirqs+0x3ee/0x430
softirqs last disabled at (972313): [<ffffffff810ffc54>] __irq_exit_rcu+0x44/0x110
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&rxnet->peer_hash_lock);
<Interrupt>
lock(&rxnet->peer_hash_lock);
*** DEADLOCK ***
1 lock held by swapper/1/0:
#0: ffffffff83576be0 (rcu_callback){....}-{0:0}, at: rcu_lock_acquire+0x7/0x30
Jan Stancek [Thu, 23 Jan 2025 12:38:51 +0000 (13:38 +0100)]
selftests: net/{lib,openvswitch}: extend CFLAGS to keep options from environment
Package build environments like Fedora rpmbuild introduced hardening
options (e.g. -pie -Wl,-z,now) by passing a -spec option to CFLAGS
and LDFLAGS.
Some Makefiles currently override CFLAGS but not LDFLAGS, which leads
to a mismatch and build failure, for example:
/usr/bin/ld: /tmp/ccd2apay.o: relocation R_X86_64_32 against
`.rodata.str1.1' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: failed to set dynamic section sizes: bad value
collect2: error: ld returned 1 exit status
make[1]: *** [../../lib.mk:222: tools/testing/selftests/net/lib/csum] Error 1
openvswitch/Makefile CFLAGS currently do not appear to be used, but
fix it anyway for the case when new tests are introduced in future.
Jan Stancek [Thu, 23 Jan 2025 08:35:42 +0000 (09:35 +0100)]
selftests: mptcp: extend CFLAGS to keep options from environment
Package build environments like Fedora rpmbuild introduced hardening
options (e.g. -pie -Wl,-z,now) by passing a -spec option to CFLAGS
and LDFLAGS.
mptcp Makefile currently overrides CFLAGS but not LDFLAGS, which leads
to a mismatch and build failure, for example:
make[1]: *** [../../lib.mk:222: tools/testing/selftests/net/mptcp/mptcp_sockopt] Error 1
/usr/bin/ld: /tmp/ccqyMVdb.o: relocation R_X86_64_32 against `.rodata.str1.8' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: failed to set dynamic section sizes: bad value
collect2: error: ld returned 1 exit status
Jakub Kicinski [Thu, 23 Jan 2025 23:16:20 +0000 (15:16 -0800)]
net: page_pool: don't try to stash the napi id
Page pool tried to cache the NAPI ID in page pool info to avoid
having a dependency on the life cycle of the NAPI instance.
Since the commit under Fixes, the NAPI ID is not populated until
napi_enable(), and there's a good chance that the page pool is
created before NAPI gets enabled.
Protect the NAPI pointer with the existing page pool mutex;
the reading path already holds it. napi_id itself we need
to READ_ONCE(); it's protected by netdev_lock(), which we are
not holding in page pool.
This is the SET path, where we call GET to either check user request
against max values, or check if any of the settings will change.
The logic in netdevsim is trying to report the default (ENABLED)
if user has not requested any specific setting. The user setting
is recorded in dev->cfg, don't depend on kernel_ringparam being
pre-populated with it.
Fixes: 928459bbda19 ("net: ethtool: populate the default HDS params in the core") Reported-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot+b3bcd80232d00091e061@syzkaller.appspotmail.com Tested-by: syzbot+b3bcd80232d00091e061@syzkaller.appspotmail.com Link: https://patch.msgid.link/20250123221410.1067678-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 23 Jan 2025 15:55:40 +0000 (07:55 -0800)]
MAINTAINERS: add Paul Fertser as a NC-SI reviewer
Paul has been providing very solid reviews for NC-SI changes
lately, so much so I started CCing him on all NC-SI patches.
Make the designation official.
====================
eth: fix calling napi_enable() in atomic context
Dan has reported that I missed a lot of drivers which call napi_enable()
in atomic with the naive coccinelle search for spin locks:
https://lore.kernel.org/dcfd56bc-de32-4b11-9e19-d8bd1543745d@stanley.mountain
Fix them. Most of the fixes involve taking the netdev_lock()
before the spin lock. mt76 is special because we can just
move napi_enable() from the BH section.
local_bh_disable() is not a real lock, it's most likely taken
because napi_schedule() requires that we invoke softirqs at
some point. napi_enable() needs to take a mutex, so move it
from under the BH protection.
Jakub Kicinski [Fri, 24 Jan 2025 03:18:36 +0000 (19:18 -0800)]
eth: forcedeth: remove local wrappers for napi enable/disable
The local helpers for calling napi_enable() and napi_disable()
don't serve much purpose and they will complicate the fix in
the subsequent patch. Remove them, call the core functions
directly.
Jakub Kicinski [Fri, 24 Jan 2025 03:18:35 +0000 (19:18 -0800)]
eth: tg3: fix calling napi_enable() in atomic context
tg3 has a spin lock protecting most of the config,
switch to taking netdev_lock() explicitly on enable/start
paths. Disable/stop paths seem to not be under the spin
lock (since napi_disable() already needs to sleep),
so leave that side as is.
tg3_restart_hw() releases and re-takes the spin lock,
we need to do the same because dev_close() needs to
take netdev_lock().
Jakub Kicinski [Fri, 24 Jan 2025 01:21:30 +0000 (17:21 -0800)]
tools: ynl: c: correct reverse decode of empty attrs
netlink reports which attribute was incorrect by sending back
an attribute offset. Offset points to the address of struct nlattr,
but to interpret the type we also need the nesting path.
Attribute IDs have different meaning in different nests
of the same message.
Correct the condition for "is the offset within current attribute".
ynl_attr_data_len() does not include the attribute header,
so the end offset was off by 4 bytes.
This means that we'd always skip over flags and empty nests.
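The off-by-4 can be reproduced with a small model of the attribute walk. This is a sketch under assumptions: attributes are represented as (start_offset, payload_len) pairs, padding is ignored, and find_attr() is a hypothetical helper rather than the YNL function.

```python
NLATTR_HDR_LEN = 4  # struct nlattr: u16 length + u16 type


def find_attr(attrs, target_off, include_header=True):
    """Return the index of the first attribute that ends after target_off.

    attrs is a list of (start_offset, payload_len) pairs, a stand-in
    for walking real netlink attributes.
    """
    for i, (start, payload_len) in enumerate(attrs):
        end = start + payload_len
        if include_header:
            end += NLATTR_HDR_LEN
        if end <= target_off:
            continue  # attribute ends before the reported offset
        return i
    return None
```

With attrs = [(0, 4), (8, 0), (12, 4)] and an error offset of 8, which points at the empty nest, the header-aware walk returns index 1, while the payload-only end offset (include_header=False) skips to index 2.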
The devmem tests, for example, issue an invalid request with
empty queue nests, resulting in the following error:
Thomas Weißschuh [Thu, 23 Jan 2025 07:22:40 +0000 (08:22 +0100)]
ptp: Ensure info->enable callback is always set
The ioctl and sysfs handlers unconditionally call the ->enable callback.
Not all drivers implement that callback, leading to NULL dereferences.
Example of affected drivers: ptp_s390.c, ptp_vclock.c and ptp_mock.c.
Instead, use a dummy callback if no better one was specified by the driver.
Fixes: d94ba80ebbea ("ptp: Added a brand new class driver for ptp clocks.") Cc: stable@vger.kernel.org Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Acked-by: Richard Cochran <richardcochran@gmail.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Link: https://patch.msgid.link/20250123-ptp-enable-v1-1-b015834d3a47@weissschuh.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
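The fallback-callback pattern can be sketched as below. ClockInfo, register_clock() and the -EOPNOTSUPP return value are hypothetical names and choices for illustration; the commit message only states that a dummy callback is used.

```python
EOPNOTSUPP = 95


def dummy_enable(request, on):
    """Fallback used when a driver supplies no enable callback."""
    return -EOPNOTSUPP


class ClockInfo:
    def __init__(self, enable=None):
        self.enable = enable


def register_clock(info: ClockInfo) -> ClockInfo:
    # Install the fallback once at registration so callers never see None.
    if info.enable is None:
        info.enable = dummy_enable
    return info
```

Filling in the default at registration keeps every call site unconditional, instead of adding a None check to each handler.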
Stanislav Fomichev [Thu, 23 Jan 2025 00:04:07 +0000 (16:04 -0800)]
net/mlx5e: add missing cpu_to_node to kvzalloc_node in mlx5e_open_xdpredirect_sq
kvzalloc_node is not doing a runtime check on the node argument
(__alloc_pages_node_noprof does have a VM_BUG_ON, but it expands to
nothing on !CONFIG_DEBUG_VM builds), so doing any ethtool/netlink
operation that calls mlx5e_open on a CPU whose ID is larger than
MAX_NUMNODES triggers OOB access and panic (see the trace below).
Add missing cpu_to_node call to convert cpu id to node id.
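The difference between indexing by CPU id and by node id can be shown with a toy topology. Everything here (the 8-CPU, 2-node layout, the pool names, the helper names) is hypothetical and only models the conversion step.

```python
MAX_NUMNODES = 2
CPU_TO_NODE = [0, 0, 0, 0, 1, 1, 1, 1]  # hypothetical 8-CPU, 2-node machine
node_pools = [f"pool-node{n}" for n in range(MAX_NUMNODES)]


def cpu_to_node(cpu: int) -> int:
    return CPU_TO_NODE[cpu]


def alloc_on_node(node: int) -> str:
    # Like the allocator described above, this helper trusts its argument;
    # Python raises IndexError where C would read out of bounds.
    return node_pools[node]


def alloc_for_cpu(cpu: int) -> str:
    # Convert the CPU id to a node id before using it as a node index.
    return alloc_on_node(cpu_to_node(cpu))
```

Calling alloc_on_node(5) directly models the bug: CPU 5 is a valid CPU, but 5 is not a valid node index on this machine.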
Change ndo_set_mac_address to dev_set_mac_address because
dev_set_mac_address provides a way to notify the network layer about the
MAC change. Otherwise, services may not be aware of the MAC change and
keep using the old one, which was set by the network adapter driver.
For example, the DHCP client from systemd does not update the MAC address
without notification from the net subsystem, which leads to problems
acquiring the right address from the DHCP server.
The way of selecting the first suitable MAC address from the list is
changed, instead of having the driver check it this patch just assumes
any valid MAC should be good.
Fixes: b8291cf3d118 ("net/ncsi: Add NC-SI 1.2 Get MC MAC Address command") Signed-off-by: Paul Fertser <fercerpav@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Michal Swiatkowski [Thu, 5 Sep 2024 09:14:10 +0000 (11:14 +0200)]
iavf: allow changing VLAN state without calling PF
First case:
> ip l a l $VF name vlanx type vlan id 100
> ip l d vlanx
> ip l a l $VF name vlanx type vlan id 100
As the workqueue can be executed after some time, there is a window to
have a call trace like that:
- iavf_del_vlan
- iavf_add_vlan
- iavf_del_vlans (wq)
It means that our VLAN 100 will change the state from IAVF_VLAN_ACTIVE
to IAVF_VLAN_REMOVE (iavf_del_vlan). After that in iavf_add_vlan state
won't be changed because VLAN 100 is on the filter list. The final
result is that the VLAN 100 filter isn't added in hardware (no
iavf_add_vlans call).
To fix that, change the state directly to active if the filter wasn't
removed yet. It is safe, as IAVF_VLAN_REMOVE means that the virtchnl
message wasn't sent yet.
Second case:
> ip l a l $VF name vlanx type vlan id 100
Any type of VF reset, e.g. changing trust:
> ip l s $PF vf $VF_NUM trust on
> ip l d vlanx
> ip l a l $VF name vlanx type vlan id 100
In case of reset, the iavf driver is responsible for re-adding all filters
that are being used. To do that, the state of all VLAN filters is changed
to IAVF_VLAN_ADD. Here there is an even longer window for changing the
VLAN state from the kernel side, as the workqueue isn't called
immediately. We can have a call trace like that:
- changing to IAVF_VLAN_ADD (after reset)
- iavf_del_vlan (called from kernel ops)
- iavf_del_vlans (wq)
Non-existing VLAN filters will be removed from hardware. It isn't a
bug, the ice driver will handle it fine. However, we can have a call
trace like that:
- changing to IAVF_VLAN_ADD (after reset)
- iavf_del_vlan (called from kernel ops)
- iavf_add_vlan (called from kernel ops)
- iavf_del_vlans (wq)
With fix for previous case we end up with no VLAN filters in hardware.
We have to remove VLAN filters if the state is IAVF_VLAN_ADD and delete
VLAN was called. It is safe, as IAVF_VLAN_ADD means that the virtchnl
message wasn't sent yet.
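The two state transitions described above can be modelled as a small state machine. This is a sketch only: the filter list is a plain dict keyed by VLAN id, and the deferred workqueue that talks to the PF is left out.

```python
from enum import Enum


class VlanState(Enum):
    ADD = "add"          # queued, virtchnl add not sent yet
    ACTIVE = "active"    # present in hardware
    REMOVE = "remove"    # queued, virtchnl delete not sent yet


def add_vlan(filters: dict, vid: int) -> None:
    state = filters.get(vid)
    if state is None:
        filters[vid] = VlanState.ADD
    elif state is VlanState.REMOVE:
        # Delete was never sent, so the hardware filter still exists.
        filters[vid] = VlanState.ACTIVE


def del_vlan(filters: dict, vid: int) -> None:
    state = filters.get(vid)
    if state is None:
        return
    if state is VlanState.ADD:
        # Add was never sent, so there is nothing to remove in hardware.
        del filters[vid]
    else:
        filters[vid] = VlanState.REMOVE
```

In the first case, delete followed by add on an active filter ends in ACTIVE without any message to the PF. In the second case, delete followed by add on a filter in state ADD ends with a fresh ADD entry that the workqueue will still program.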
Mateusz Polchlopek [Tue, 31 Dec 2024 09:50:44 +0000 (10:50 +0100)]
ice: remove invalid parameter of equalizer
It turned out that commit 70838938e89c ("ice: Implement driver
functionality to dump serdes equalizer values") added an invalid DRATE
parameter for reading. The output of the command:
$ ethtool -d <ethX>
returns the garbage value in the place where DRATE value should be
stored.
Remove mentioned parameter to prevent return of corrupted data to
userspace.
Fixes: 70838938e89c ("ice: Implement driver functionality to dump serdes equalizer values") Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Przemek Kitszel [Thu, 19 Dec 2024 11:55:16 +0000 (12:55 +0100)]
ice: fix ice_parser_rt::bst_key array size
Fix &ice_parser_rt::bst_key size. It was wrongly set to 10 instead of 20
in the initial impl commit (see Fixes tag). All usage code assumed it was
of size 20. That was also the initial size present up to v2 of the intro
series [2], but halved by v3 [3] refactor described as "Replace magic
hardcoded values with macros." The introducing series was so big that
some ugliness was unnoticed, same for bugs :/
ICE_BST_KEY_TCAM_SIZE and ICE_BST_TCAM_KEY_SIZE were differing by one.
There was tmp variable @j in the scope of edited function, but was not
used in all places. This ugliness is now gone.
I'm moving ice_parser_rt::pg_prio a few positions up, to fill up one of
the holes in order to compensate for the added 10 bytes to the ::bst_key,
resulting in the same size of the whole as prior to the fix, and minimal
changes in the offsets of the fields.
Extend also the debug dump print of the key to cover all bytes. To not
have string with 20 "%02x" and 20 params, switch to
ice_debug_array_w_prefix().
Manoj Vishwanathan [Mon, 16 Dec 2024 16:27:35 +0000 (16:27 +0000)]
idpf: add more info during virtchnl transaction timeout/salt mismatch
Add more information related to the transaction like cookie, vc_op,
salt when transaction times out and include similar information
when transaction salt does not match.
Info output for transaction timeout:
-------------------
(op:5015 cookie:45fe vc_op:5015 salt:45 timeout:60000ms)
-------------------
Marco Leogrande [Mon, 16 Dec 2024 16:27:34 +0000 (16:27 +0000)]
idpf: convert workqueues to unbound
When a workqueue is created with `WQ_UNBOUND`, its work items are
served by special worker-pools, whose host workers are not bound to
any specific CPU. In the default configuration (i.e. when
`queue_delayed_work` and friends do not specify which CPU to run the
work item on), `WQ_UNBOUND` allows the work item to be executed on any
CPU in the same node of the CPU it was enqueued on. While this
solution potentially sacrifices locality, it avoids contention with
other processes that might dominate the CPU time of the processor the
work item was scheduled on.
This is not just a theoretical problem: in a particular scenario a
misconfigured process was hogging most of the time from CPU0, leaving
less than 0.5% of its CPU time to the kworker. The IDPF workqueues
that were using the kworker on CPU0 suffered large completion delays
as a result, causing performance degradation, timeouts and eventual
system crash.
Tested:
* I have also run a manual test to gauge the performance
improvement. The test consists of an antagonist process
(`./stress --cpu 2`) consuming as much of CPU 0 as possible. This
process is run under `taskset 01` to bind it to CPU0, and its
priority is changed with `chrt -pQ 9900 10000 ${pid}` and
`renice -n -20 ${pid}` after start.
Then, the IDPF driver is forced to prefer CPU0 by editing all calls
to `queue_delayed_work`, `mod_delayed_work`, etc... to use CPU 0.
Finally, `ktraces` for the workqueue events are collected.
Without the current patch, the antagonist process can force
arbitrary delays between `workqueue_queue_work` and
`workqueue_execute_start`, that in my tests were as high as
`30ms`. With the current patch applied, the workqueue can be
migrated to another unloaded CPU in the same node, and, keeping
everything else equal, the maximum delay I could see was `6us`.
Fixes: 0fe45467a104 ("idpf: add create vport and netdev configuration") Signed-off-by: Marco Leogrande <leogrande@google.com> Signed-off-by: Manoj Vishwanathan <manojvishy@google.com> Signed-off-by: Brian Vazquez <brianvv@google.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Pavan Kumar Linga <pavan.kumar.linga@intel.com> Tested-by: Krishneil Singh <krishneil.k.singh@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Emil Tantilov [Fri, 20 Dec 2024 02:09:32 +0000 (18:09 -0800)]
idpf: fix transaction timeouts on reset
Restore the call to idpf_vc_xn_shutdown() at the beginning of
idpf_vc_core_deinit() provided the function is not called on remove.
In the reset path the mailbox is destroyed, leading to all transactions
timing out.
Fixes: 09d0fb5cb30e ("idpf: deinit virtchnl transaction manager after vport and vectors") Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com> Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Krishneil Singh <krishneil.k.singh@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Emil Tantilov [Fri, 22 Nov 2024 04:40:59 +0000 (20:40 -0800)]
idpf: add read memory barrier when checking descriptor done bit
Add read memory barrier to ensure the order of operations when accessing
control queue descriptors. Specifically, we want to avoid cases where loads
can be reordered:
1. Load #1 is dispatched to read descriptor flags.
2. Load #2 is dispatched to read some other field from the descriptor.
3. Load #2 completes, accessing memory/cache at a point in time when the DD
flag is zero.
4. NIC DMA overwrites the descriptor, now the DD flag is one.
5. Any fields loaded before step 4 are now inconsistent with the actual
descriptor state.
Add read memory barrier between steps 1 and 2, so that load #2 is not
executed until load #1 has completed.
Sebastian Sewior [Thu, 23 Jan 2025 16:20:45 +0000 (17:20 +0100)]
xfrm: Don't disable preemption while looking up cache state.
For the state cache lookup xfrm_input_state_lookup() first disables
preemption, to remain on the CPU and then retrieves a per-CPU pointer.
Within the preempt-disable section it also acquires
netns_xfrm::xfrm_state_lock, a spinlock_t. This lock must not be
acquired with explicit disabled preemption (such as by get_cpu())
because this lock becomes a sleeping lock on PREEMPT_RT.
To remain on the same CPU is just an optimisation for the CPU local
lookup. The actual modification of the per-CPU variable happens with
netns_xfrm::xfrm_state_lock acquired.
Remove get_cpu() and use the state_cache_input on the current CPU.
Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Closes: https://lore.kernel.org/all/CAADnVQKkCLaj=roayH=Mjiiqz_svdf1tsC3OE4EC0E=mAD+L1A@mail.gmail.com/ Fixes: 81a331a0e72dd ("xfrm: Add an inbound percpu state cache.") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Eric Dumazet [Tue, 21 Jan 2025 18:12:41 +0000 (18:12 +0000)]
ipmr: do not call mr_mfc_uses_dev() for unres entries
syzbot found that calling mr_mfc_uses_dev() for unres entries
would crash [1], because c->mfc_un.res.minvif / c->mfc_un.res.maxvif
alias to "struct sk_buff_head unresolved", which contains two pointers.
This code never worked, let's remove it.
[1]
Unable to handle kernel paging request at virtual address ffff5fff2d536613
KASAN: maybe wild-memory-access in range [0xfffefff96a9b3098-0xfffefff96a9b309f]
Modules linked in:
CPU: 1 UID: 0 PID: 7321 Comm: syz.0.16 Not tainted 6.13.0-rc7-syzkaller-g1950a0af2d55 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : mr_mfc_uses_dev net/ipv4/ipmr_base.c:290 [inline]
pc : mr_table_dump+0x5a4/0x8b0 net/ipv4/ipmr_base.c:334
lr : mr_mfc_uses_dev net/ipv4/ipmr_base.c:289 [inline]
lr : mr_table_dump+0x694/0x8b0 net/ipv4/ipmr_base.c:334
Call trace:
mr_mfc_uses_dev net/ipv4/ipmr_base.c:290 [inline] (P)
mr_table_dump+0x5a4/0x8b0 net/ipv4/ipmr_base.c:334 (P)
mr_rtm_dumproute+0x254/0x454 net/ipv4/ipmr_base.c:382
ipmr_rtm_dumproute+0x248/0x4b4 net/ipv4/ipmr.c:2648
rtnl_dump_all+0x2e4/0x4e8 net/core/rtnetlink.c:4327
rtnl_dumpit+0x98/0x1d0 net/core/rtnetlink.c:6791
netlink_dump+0x4f0/0xbc0 net/netlink/af_netlink.c:2317
netlink_recvmsg+0x56c/0xe64 net/netlink/af_netlink.c:1973
sock_recvmsg_nosec net/socket.c:1033 [inline]
sock_recvmsg net/socket.c:1055 [inline]
sock_read_iter+0x2d8/0x40c net/socket.c:1125
new_sync_read fs/read_write.c:484 [inline]
vfs_read+0x740/0x970 fs/read_write.c:565
ksys_read+0x15c/0x26c fs/read_write.c:708
Fixes: cb167893f41e ("net: Plumb support for filtering ipv4 and ipv6 multicast route dumps") Reported-by: syzbot+5cfae50c0e5f2c500013@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/678fe2d1.050a0220.15cac.00b3.GAE@google.com/T/#u Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250121181241.841212-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Implement cleanup of descriptors in the TSO error path of
fec_enet_txq_submit_tso(). The cleanup
- Unmaps DMA buffers for data descriptors skipping TSO header
- Clears all buffer descriptors
- Handles extended descriptors by clearing cbd_esc when enabled
Dimitri Fedrau [Sat, 18 Jan 2025 18:43:43 +0000 (19:43 +0100)]
net: phy: marvell-88q2xxx: Fix temperature measurement with reset-gpios
When using temperature measurement on Marvell 88Q2XXX devices and the
reset-gpios property is set in DT, the device does a hardware reset when
the interface is brought down and up again. That means that the content
of the register MDIO_MMD_PCS_MV_TEMP_SENSOR2 is reset to default, and
that leads to permanent deactivation of the temperature measurement,
because activation is done in mv88q2xxx_probe. To fix this, move
activation of temperature measurement to mv88q222x_config_init.
Fixes: a557a92e6881 ("net: phy: marvell-88q2xxx: add support for temperature sensor") Reviewed-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se> Signed-off-by: Dimitri Fedrau <dima.fedrau@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20250118-marvell-88q2xxx-fix-hwmon-v2-1-402e62ba2dcb@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jian Shen [Sat, 18 Jan 2025 09:47:41 +0000 (17:47 +0800)]
net: hns3: fix oops when unload drivers paralleling
When unloading the hclge driver, it tries to disable sriov first for each
ae_dev node from hnae3_ae_dev_list. If the user unloads the hns3 driver
at the same time, it removes all the ae_dev nodes, which may cause an
oops.
But we can't simply use hnae3_common_lock for this, because in the
process flow of pci_disable_sriov(), it will trigger the remove flow
of VF, which will also take hnae3_common_lock.
To fix it, introduce a new mutex to protect the unload process.
Yijie Yang [Mon, 20 Jan 2025 07:08:28 +0000 (15:08 +0800)]
dt-bindings: net: qcom,ethqos: Correct fallback compatible for qcom,qcs615-ethqos
The qcs615-ride utilizes the same EMAC as the qcs404, rather than the
sm8150. The current incorrect fallback could result in packet loss.
The Ethernet on qcs615-ride is currently not utilized by anyone. Therefore,
there is no need to worry about any ABI impact.
Fixes: 32535b9410b8 ("dt-bindings: net: qcom,ethqos: add description for qcs615") Signed-off-by: Yijie Yang <quic_yijiyang@quicinc.com> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Link: https://patch.msgid.link/20250120-schema_qcs615-v4-1-d9d122f89e64@quicinc.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Paul Fertser [Thu, 16 Jan 2025 15:29:00 +0000 (18:29 +0300)]
net/ncsi: wait for the last response to Deselect Package before configuring channel
The NCSI state machine as it's currently implemented assumes that
transition to the next logical state is performed either explicitly by
calling `schedule_work(&ndp->work)` to re-queue itself or implicitly
after processing the predefined (ndp->pending_req_num) number of
replies. Thus to avoid the configuration FSM from advancing prematurely
and getting out of sync with the process it's essential to not skip
waiting for a reply.
This patch makes the code wait for reception of the Deselect Package
response for the last package probed before proceeding to channel
configuration.
Thanks go to Potin Lai and Cosmo Chou for the initial investigation and
testing.
Fixes: 8e13f70be05e ("net/ncsi: Probe single packages to avoid conflict") Cc: stable@vger.kernel.org Signed-off-by: Paul Fertser <fercerpav@gmail.com> Link: https://patch.msgid.link/20250116152900.8656-1-fercerpav@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Christian Marangi [Mon, 20 Jan 2025 15:41:40 +0000 (16:41 +0100)]
net: airoha: Fix wrong GDM4 register definition
Fix wrong GDM4 register definition, in Airoha SDK GDM4 is defined at
offset 0x2400 but this doesn't make sense as it does conflict with the
CDM4 that is in the same location.
Following the pattern where each GDM base is at the FWD_CFG, currently
GDM4 base offset is set to 0x2500. This is correct but REG_GDM4_FWD_CFG
and REG_GDM4_SRC_PORT_SET are still using the SDK reference with the
0x2400 offset. Fix these two defines by subtracting 0x100 from each
register to reflect the real address location.
Fixes: 23020f049327 ("net: airoha: Introduce ethernet support for EN7581 SoC") Signed-off-by: Christian Marangi <ansuelsmth@gmail.com> Acked-by: Lorenzo Bianconi <lorenzo@kernel.org> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20250120154148.13424-1-ansuelsmth@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Dan Carpenter [Fri, 17 Jan 2025 09:38:41 +0000 (12:38 +0300)]
NFC: nci: Add bounds checking in nci_hci_create_pipe()
The "pipe" variable is a u8 which comes from the network. If it's more
than 127, then it results in memory corruption in the caller,
nci_hci_connect_gate().
Cc: stable@vger.kernel.org Fixes: a1b0b9415817 ("NFC: nci: Create pipe on specific gate in nci_hci_connect_gate") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Link: https://patch.msgid.link/bcf5453b-7204-4297-9c20-4d8c7dacf586@stanley.mountain Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jamal Hadi Salim [Sat, 11 Jan 2025 14:57:39 +0000 (09:57 -0500)]
net: sched: fix ets qdisc OOB Indexing
Haowei Yan <g1042620637@gmail.com> found that ets_class_from_arg() can
index an out-of-bounds class when passed a clid of 0. The overflow may
cause local privilege escalation.
Fixes: dcc68b4d8084 ("net: sch_ets: Add a new Qdisc") Reported-by: Haowei Yan <g1042620637@gmail.com> Suggested-by: Haowei Yan <g1042620637@gmail.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Link: https://patch.msgid.link/20250111145740.74755-1-jhs@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
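A range check for a 1-based class id can be sketched as follows. The function signature, and the choice to return None for an invalid id, are assumptions for illustration and not the patch itself.

```python
def class_from_arg(classes: list, nbands: int, clid: int):
    """Return the class for a 1-based clid, or None if it is out of range."""
    if clid == 0 or clid > nbands:
        return None
    # clid is 1-based, the backing array is 0-based.
    return classes[clid - 1]
```

Without the check, a clid of 0 turns into index -1. Python would quietly return the last element; C computes an address before the start of the array, which is the out-of-bounds access reported here.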
Linus Torvalds [Wed, 22 Jan 2025 16:28:57 +0000 (08:28 -0800)]
Merge tag 'net-next-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Paolo Abeni:
"This is slightly smaller than usual, with the most interesting work
being still around RTNL scope reduction.
Core:
- More core refactoring to reduce the RTNL lock contention, including
preparatory work for the per-network namespace RTNL lock, replacing
RTNL lock with a per device-one to protect NAPI-related net device
data and moving synchronize_net() calls outside such lock.
- Extend drop reasons usage, adding net scheduler, AF_UNIX, bridge
and more specific TCP coverage.
- Reduce network namespace tear-down time by removing per-subsystems
synchronize_net() in tipc and sched.
- Add flow label selector support for fib rules, allowing traffic
redirection based on such header field.
Netfilter:
- Do not remove netdev basechain when last device is gone, allowing
netdev basechains without devices.
- Revisit the flowtable teardown strategy, dealing better with fin,
reset and re-open events.
- Scale-up IP-vs connection dumping by avoiding linear search on each
restart.
Protocols:
- A significant XDP socket refactor, consolidating and optimizing
several helpers into the core
- Better scaling of ICMP rate-limiting, by removing false-sharing in
inet peers handling.
- Introduces netlink notifications for multicast IPv4 and IPv6
address changes.
- Add ipsec support for IP-TFS/AggFrag encapsulation, allowing
aggregation and fragmentation of the inner IP.
- Add sysctl to configure TIME-WAIT reuse delay for TCP sockets, to
avoid local port exhaustion issues when the average connection
lifetime is very short.
- Support updating keys (re-keying) for connections using kernel TLS
(for TLS 1.3 only).
- Support ipv4-mapped ipv6 address clients in smc-r v2.
- Add support for jumbo data packet transmission in RxRPC sockets,
gluing multiple data packets in a single UDP packet.
- Support RxRPC RACK-TLP to manage packet loss and retransmission in
conjunction with the congestion control algorithm.
Driver API:
- Introduce a unified and structured interface for reporting PHY
statistics, exposing consistent data across different H/W via
ethtool.
- Make timestamping selectable, allow the user to select the desired
hwtstamp provider (PHY or MAC) administratively.
- Add support for configuring a header-data-split threshold (HDS)
value via ethtool, to deal with partial or buggy H/W
implementation.
- Consolidate DSA drivers Energy Efficient Ethernet support.
- Add EEE management to phylink, making use of the phylib
implementation.
- Add phylib support for in-band capabilities negotiation.
- Simplify how phylib-enabled MAC drivers expose the supported
interfaces.
Tests and tooling:
- Make the YNL tool package-friendly to make it easier to deploy it
separately from the kernel.
- Increase TCP selftest coverage importing several packetdrill
test-cases.
- Regenerate the ethtool uapi header from the YNL spec, to ease
maintenance and future development.
- Add YNL support for decoding the link types used in net self-tests,
allowing a single build to run both net and drivers/net.
Drivers:
- Ethernet high-speed NICs:
- nVidia/Mellanox (mlx5):
- add cross E-Switch QoS support
- add SW Steering support for ConnectX-8
- implement support for HW-Managed Flow Steering, improving the
rule deletion/insertion rate
- support for multi-host LAG
- Intel (ixgbe, ice, igb):
- ice: add support for devlink health events
- ixgbe: add initial support for E610 chipset variant
- igb: add support for AF_XDP zero-copy
- Meta:
- add support for basic RSS config
- allow changing the number of channels
- add hardware monitoring support
- Broadcom (bnxt):
- implement TCP data split and HDS threshold ethtool support,
enabling Device Memory TCP.
- Marvell Octeon:
- implement egress ipsec offload support for the cn10k family
- Hisilicon (HIBMC):
- implement unicast MAC filtering
- Ethernet NICs embedded and virtual:
- Convert UDP tunnel drivers to NETDEV_PCPU_STAT_DSTATS, avoiding
contended atomic operations for drop counters
- Freescale:
- quicc: phylink conversion
- enetc: support Tx and Rx checksum offload and improve TSO
performance
- MediaTek:
- airoha: introduce support for ETS and HTB Qdisc offload
- Microchip:
- lan78XX USB: preparation work for phylink conversion
- Synopsys (stmmac):
- support DWMAC IP on NXP Automotive SoCs S32G2xx/S32G3xx/S32R45
- refactor EEE support to leverage the new driver API
- optimize DMA and cache access to increase raw RX performance
by 40%
- TI:
- icssg-prueth: add multicast filtering support for VLAN
interface
- netkit:
- add ability to configure head/tailroom
- VXLAN:
- accept packets with user-defined reserved bits
- Ethernet switches:
- Microchip:
- lan969x: add RGMII support
- lan969x: improve TX and RX performance using the FDMA engine
- nVidia/Mellanox:
- move Tx header handling to PCI driver, to ease XDP support
- Ethernet PHYs:
- Texas Instruments DP83822:
- add support for GPIO2 clock output
- Realtek:
- 8169: add support for RTL8125D rev.b
- rtl822x: add hwmon support for the temperature sensor
- Microchip:
- add support for RDS PTP hardware
- consolidate periodic output signal generation
- CAN:
- several DT-bindings to DT schema conversions
- tcan4x5x:
- add HW standby support
- support nWKRQ voltage selection
- kvaser:
- allow Bus Error Reporting runtime configuration
- WiFi:
- the on-going Multi-Link Operation (MLO) effort continues,
affecting both the stack and drivers
- mac80211/cfg80211:
- Emergency Preparedness Communication Services (EPCS) station
mode support
- support for adding and removing station links for MLO
- add support for WiFi 7/EHT mesh over 320 MHz channels
- report Tx power info for each link
- RealTek (rtw88):
- enable USB Rx aggregation and USB 3 to improve performance
- LED support
- RealTek (rtw89):
- refactor power save to support Multi-Link Operations
- add support for RTL8922AE-VS variant
- MediaTek (mt76):
- single wiphy multiband support (preparation for MLO)
- p2p device support
- add TP-Link TXE50UH USB adapter support
- Qualcomm (ath10k):
- support for the QCA6698AQ IP core
- Qualcomm (ath12k):
- enable MLO for QCN9274
- Bluetooth:
- Allow sysfs to trigger hdev reset, to allow recovering
unresponsive devices from user-space
- MediaTek: add support for MT7922, MT7925, MT7921e devices
- Realtek: add support for RTL8851BE devices
- Qualcomm: add support for WCN785x devices
- ISO: allow BIG re-sync"
* tag 'net-next-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1386 commits)
net/rose: prevent integer overflows in rose_setsockopt()
net: phylink: fix regression when binding a PHY
net: ethernet: ti: am65-cpsw: streamline TX queue creation and cleanup
net: ethernet: ti: am65-cpsw: streamline RX queue creation and cleanup
net: ethernet: ti: am65-cpsw: ensure proper channel cleanup in error path
ipv6: Convert inet6_rtm_deladdr() to per-netns RTNL.
ipv6: Convert inet6_rtm_newaddr() to per-netns RTNL.
ipv6: Move lifetime validation to inet6_rtm_newaddr().
ipv6: Set cfg.ifa_flags before device lookup in inet6_rtm_newaddr().
ipv6: Pass dev to inet6_addr_add().
ipv6: Convert inet6_ioctl() to per-netns RTNL.
ipv6: Hold rtnl_net_lock() in addrconf_init() and addrconf_cleanup().
ipv6: Hold rtnl_net_lock() in addrconf_dad_work().
ipv6: Hold rtnl_net_lock() in addrconf_verify_work().
ipv6: Convert net.ipv6.conf.${DEV}.XXX sysctl to per-netns RTNL.
ipv6: Add __in6_dev_get_rtnl_net().
net: stmmac: Drop redundant skb_mark_for_recycle() for SKB frags
net: mii: Fix the Speed display when the network cable is not connected
sysctl net: Remove macro checks for CONFIG_SYSCTL
eth: bnxt: update header sizing defaults
...
When the 'cachestat()' system call was added in commit cf264e1329fb
("cachestat: implement cachestat syscall"), it was meant to be a much
more convenient (and performant) version of mincore() that didn't need
mapping things into the user virtual address space in order to work.
But it ended up missing the "check for writability or ownership" fix for
mincore(), done in commit 134fca9063ad ("mm/mincore.c: make mincore()
more conservative").
This just adds equivalent logic to 'cachestat()', modified for the file
context (rather than vma).
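The "check for writability or ownership" rule described above can be sketched as a small userspace model. This is only an illustration under stated assumptions: 'struct file_model' and 'may_query_cache_state()' are hypothetical names invented here, and the three boolean fields stand in for whatever checks the kernel actually performs on the open file.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of an open file; NOT a kernel structure. */
struct file_model {
	bool opened_for_write;  /* the descriptor was opened writable */
	bool caller_owns_file;  /* caller owns the file (or is privileged) */
	bool write_permitted;   /* caller could open the file for writing */
};

/*
 * Hypothetical helper: allow a page-cache residency query only when the
 * caller could already modify the file or owns it, so that residency
 * information is not exposed for files the caller can merely read.
 */
static bool may_query_cache_state(const struct file_model *f)
{
	if (f->opened_for_write)
		return true;
	if (f->caller_owns_file)
		return true;
	return f->write_permitted;
}
```

With this model, a file that is only readable by a non-owner is refused, while any one of the three conditions is enough to allow the query.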
Linus Torvalds [Wed, 22 Jan 2025 04:12:24 +0000 (20:12 -0800)]
Merge tag 'audit-pr-20250121' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit
Pull audit update from Paul Moore:
"A single audit patch that fixes a problem when collecting pathnames
for audit PATH records that was caused by some faulty pathname
matching logic"
* tag 'audit-pr-20250121' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
audit: fix suffixed '/' filename matching
Linus Torvalds [Wed, 22 Jan 2025 04:09:14 +0000 (20:09 -0800)]
Merge tag 'selinux-pr-20250121' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux
Pull selinux updates from Paul Moore:
- Extended permissions supported in conditional policy
The SELinux extended permissions, aka "xperms", allow security admins
to target individual ioctls, and recently netlink messages, with
their SELinux policy. Adding support for conditional policies allows
admins to toggle the granular xperms using SELinux booleans, helping
pave the way for greater use of xperms in general purpose SELinux
policies. This change bumps the maximum SELinux policy version to 34.
- Fix a SCTP/SELinux error return code inconsistency
Depending on the loaded SELinux policy, specifically its
EXTSOCKCLASS support, the bind(2) LSM/SELinux hook could return
different error codes due to the SELinux code checking the socket's
SELinux object class (which can vary depending on EXTSOCKCLASS) and
not the socket's sk_protocol field. We fix this by doing the obvious,
and looking at the sock->sk_protocol field instead of the object
class.
- Makefile fixes to properly clean up av_permissions.h
Add av_permissions.h to "targets" so that it is properly cleaned up
using the kbuild infrastructure.
- A number of smaller improvements by Christian Göttsche
A variety of straightforward changes to reduce code duplication,
reduce pointer lookups, migrate void pointers to defined types,
simplify code, constify function parameters, and correct iterator
types.
* tag 'selinux-pr-20250121' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
selinux: make more use of str_read() when loading the policy
selinux: avoid unnecessary indirection in struct level_datum
selinux: use known type instead of void pointer
selinux: rename comparison functions for clarity
selinux: rework match_ipv6_addrmask()
selinux: constify and reconcile function parameter names
selinux: avoid using types indicating user space interaction
selinux: supply missing field initializers
selinux: add netlink nlmsg_type audit message
selinux: add support for xperms in conditional policies
selinux: Fix SCTP error inconsistency in selinux_socket_bind()
selinux: use native iterator types
selinux: add generated av_permissions.h to targets
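The idea behind the SCTP/SELinux bind(2) fix in the pull message above can be sketched in userspace C. This is a hedged illustration, not the kernel code: 'bind_family_error()' is a hypothetical helper, and the specific error values (-EINVAL for SCTP, -EAFNOSUPPORT otherwise) are an assumption, since the pull message does not name them. The point is only that the decision keys off the socket's protocol number, which does not change with the loaded policy.

```c
#include <assert.h>
#include <errno.h>
#include <netinet/in.h> /* IPPROTO_SCTP, IPPROTO_TCP */

/*
 * Hypothetical helper: choose the error reported for an unsupported
 * address family from the socket's protocol rather than from a
 * policy-dependent object class. The error codes are assumed.
 */
static int bind_family_error(int sk_protocol)
{
	if (sk_protocol == IPPROTO_SCTP)
		return -EINVAL;
	return -EAFNOSUPPORT;
}
```

Because the protocol number is fixed when the socket is created, the same socket yields the same error code regardless of whether EXTSOCKCLASS is enabled in the policy.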
Linus Torvalds [Wed, 22 Jan 2025 04:03:04 +0000 (20:03 -0800)]
Merge tag 'lsm-pr-20250121' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm
Pull lsm updates from Paul Moore:
- Improved handling of LSM "secctx" strings through lsm_context struct
The LSM secctx string interface is from an older time when only one
LSM was supported, migrate over to the lsm_context struct to better
support the different LSMs we now have and make it easier to support
new LSMs in the future.
These changes explain the Rust, VFS, and networking changes in the
diffstat.
- Only build lsm_audit.c if CONFIG_SECURITY and CONFIG_AUDIT are
enabled
Small tweak to be a bit smarter about when we build the LSM's common
audit helpers.
- Check for absurdly large policies from userspace in SafeSetID
SafeSetID policy rules are fairly small, basically just "UID:UID", so
it is easy to impose a limit of KMALLOC_MAX_SIZE on policy writes, which
helps quiet a number of syzbot-related issues. While work is being
done to address the syzbot issues through other mechanisms, this is a
trivial and relatively safe fix that we can do now.
- Various minor improvements and cleanups
A collection of improvements to the kernel selftests, constification
of some function parameters, removing redundant assignments, and
local variable renames to improve readability.
* tag 'lsm-pr-20250121' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm:
lockdown: initialize local array before use to quiet static analysis
safesetid: check size of policy writes
net: corrections for security_secid_to_secctx returns
lsm: rename variable to avoid shadowing
lsm: constify function parameters
security: remove redundant assignment to return variable
lsm: Only build lsm_audit.c if CONFIG_SECURITY and CONFIG_AUDIT are set
selftests: refactor the lsm `flags_overset_lsm_set_self_attr` test
binder: initialize lsm_context structure
rust: replace lsm context+len with lsm_context
lsm: secctx provider check on release
lsm: lsm_context in security_dentry_init_security
lsm: use lsm_context in security_inode_getsecctx
lsm: replace context+len with lsm_context
lsm: ensure the correct LSM context releaser
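The SafeSetID size limit mentioned in the pull message above amounts to rejecting oversized writes before any allocation is attempted. A minimal sketch, under assumptions: 'POLICY_WRITE_MAX' is a made-up stand-in (the kernel's KMALLOC_MAX_SIZE depends on configuration), 'check_policy_write_len()' is a hypothetical name, and both the -EINVAL return value and the inclusive comparison are choices made for this illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Assumed stand-in for KMALLOC_MAX_SIZE; the real value varies. */
#define POLICY_WRITE_MAX ((size_t)4 * 1024 * 1024)

/*
 * Hypothetical helper: refuse policy writes at or above the limit so an
 * absurdly large buffer from userspace never reaches the allocator.
 */
static int check_policy_write_len(size_t len)
{
	if (len >= POLICY_WRITE_MAX)
		return -EINVAL;
	return 0;
}
```

A rule such as "1000:2000" is only a few bytes long, so legitimate policies stay far below any plausible limit.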
Linus Torvalds [Wed, 22 Jan 2025 03:54:32 +0000 (19:54 -0800)]
Merge tag 'integrity-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity
Pull integrity updates from Mimi Zohar:
"There's just a couple of changes: two kernel messages addressed, a
measurement policy collision addressed, and one policy cleanup.
Please note that the contents of the IMA measurement list are
potentially affected. The builtin tmpfs IMA policy rule change might
introduce additional measurements, while detecting a reboot might
eliminate some measurements"
* tag 'integrity-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity:
ima: ignore suffixed policy rule comments
ima: limit the builtin 'tcb' dont_measure tmpfs policy rule
ima: kexec: silence RCU list traversal warning
ima: Suspend PCR extends and log appends when rebooting