Without this, the 'i' variable declared before could be overridden by
accident, e.g.
for i in "${@}"; do
__ksft_status_merge "${i}" ## 'i' has been modified
foo "${i}" ## using 'i' with an unexpected value
done
After a quick look, it looks like 'i' is currently not used after having
been modified in __ksft_status_merge(), but still, better be safe than
sorry. I saw this while modifying the same file, not because I suspected
an issue somewhere.
selftests: net: lib: avoid error removing empty netns name
If there is an error to create the first netns with 'setup_ns()',
'cleanup_ns()' will be called with an empty string as first parameter.
The consequences is that 'cleanup_ns()' will try to delete an invalid
netns, and wait 20 seconds if the netns list is empty.
Instead of just checking if the name is not empty, convert the string
separated by spaces to an array. Manipulating the array is cleaner, and
calling 'cleanup_ns()' with an empty array will be a no-op.
Fixes: 25ae948b4478 ("selftests/net: add lib.sh") Cc: stable@vger.kernel.org Acked-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Hangbin Liu <liuhangbin@gmail.com> Link: https://lore.kernel.org/r/20240605-upstream-net-20240605-selftests-net-lib-fixes-v1-2-b3afadd368c9@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Note that with this series there is skb_queue_empty() left in
unix_dgram_disconnected() that needs to be changed to lockless
version, and unix_peer(other) access there should be protected
by unix_state_lock().
This will require some refactoring, so another series will follow.
Kuniyuki Iwashima [Tue, 4 Jun 2024 16:52:39 +0000 (09:52 -0700)]
af_unix: Use skb_queue_empty_lockless() in unix_release_sock().
If the socket type is SOCK_STREAM or SOCK_SEQPACKET, unix_release_sock()
checks the length of the peer socket's recvq under unix_state_lock().
However, unix_stream_read_generic() calls skb_unlink() after releasing
the lock. Also, for SOCK_SEQPACKET, __skb_try_recv_datagram() unlinks
skb without unix_state_lock().
Thues, unix_state_lock() does not protect qlen.
Let's use skb_queue_empty_lockless() in unix_release_sock().
Kuniyuki Iwashima [Tue, 4 Jun 2024 16:52:38 +0000 (09:52 -0700)]
af_unix: Use unix_recvq_full_lockless() in unix_stream_connect().
Once sk->sk_state is changed to TCP_LISTEN, it never changes.
unix_accept() takes advantage of this characteristics; it does not
hold the listener's unix_state_lock() and only acquires recvq lock
to pop one skb.
It means unix_state_lock() does not prevent the queue length from
changing in unix_stream_connect().
Thus, we need to use unix_recvq_full_lockless() to avoid data-race.
Now we remove unix_recvq_full() as no one uses it.
Note that we can remove READ_ONCE() for sk->sk_max_ack_backlog in
unix_recvq_full_lockless() because of the following reasons:
(1) For SOCK_DGRAM, it is a written-once field in unix_create1()
(2) For SOCK_STREAM and SOCK_SEQPACKET, it is changed under the
listener's unix_state_lock() in unix_listen(), and we hold
the lock in unix_stream_connect()
Kuniyuki Iwashima [Tue, 4 Jun 2024 16:52:30 +0000 (09:52 -0700)]
af_unix: Annotate data-races around sk->sk_state in unix_write_space() and poll().
unix_poll() and unix_dgram_poll() read sk->sk_state locklessly and
calls unix_writable() which also reads sk->sk_state without holding
unix_state_lock().
Let's use READ_ONCE() in unix_poll() and unix_dgram_poll() and pass
it to unix_writable().
While at it, we remove TCP_SYN_SENT check in unix_dgram_poll() as
that state does not exist for AF_UNIX socket since the code was added.
Fixes: 1586a5877db9 ("af_unix: do not report POLLOUT on listeners") Fixes: 3c73419c09a5 ("af_unix: fix 'poll for write'/ connected DGRAM sockets") Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Kuniyuki Iwashima [Tue, 4 Jun 2024 16:52:27 +0000 (09:52 -0700)]
af_unix: Set sk->sk_state under unix_state_lock() for truly disconencted peer.
When a SOCK_DGRAM socket connect()s to another socket, the both sockets'
sk->sk_state are changed to TCP_ESTABLISHED so that we can register them
to BPF SOCKMAP.
When the socket disconnects from the peer by connect(AF_UNSPEC), the state
is set back to TCP_CLOSE.
Then, the peer's state is also set to TCP_CLOSE, but the update is done
locklessly and unconditionally.
Let's say socket A connect()ed to B, B connect()ed to C, and A disconnects
from B.
After the first two connect()s, all three sockets' sk->sk_state are
TCP_ESTABLISHED:
So, when a socket disconnects from the peer, we should not set TCP_CLOSE to
the peer if the peer is connected to yet another socket, and this must be
done under unix_state_lock().
Note that we use WRITE_ONCE() for sk->sk_state as there are many lockless
readers. These data-races will be fixed in the following patches.
Fixes: 83301b5367a9 ("af_unix: Set TCP_ESTABLISHED for datagram sockets too") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Aleksandr Mishin [Tue, 4 Jun 2024 08:25:00 +0000 (11:25 +0300)]
net: wwan: iosm: Fix tainted pointer delete is case of region creation fail
In case of region creation fail in ipc_devlink_create_region(), previously
created regions delete process starts from tainted pointer which actually
holds error code value.
Fix this bug by decreasing region index before delete.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Fixes: 4dcd183fbd67 ("net: wwan: iosm: devlink registration") Signed-off-by: Aleksandr Mishin <amishin@t-argos.ru> Acked-by: Sergey Ryazanov <ryazanov.s.a@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240604082500.20769-1-amishin@t-argos.ru Signed-off-by: Paolo Abeni <pabeni@redhat.com>
====================
Intel Wired LAN Driver Updates 2024-05-29 (ice, igc)
This series includes fixes for the ice driver as well as a fix for the igc
driver.
Jacob fixes two issues in the ice driver with reading the NVM for providing
firmware data via devlink info. First, fix an off-by-one error when reading
the Preserved Fields Area, resolving an infinite loop triggered on some
NVMs which lack certain data in the NVM. Second, fix the reading of the NVM
Shadow RAM on newer E830 and E825-C devices which have a variable sized CSS
header rather than assuming this header is always the same fixed size as in
the E810 devices.
Larysa fixes three issues with the ice driver XDP logic that could occur if
the number of queues is changed after enabling an XDP program. First, the
af_xdp_zc_qps bitmap is removed and replaced by simpler logic to track
whether queues are in zero-copy mode. Second, the reset and .ndo_bpf flows
are distinguished to avoid potential races with a PF reset occuring
simultaneously to .ndo_bpf callback from userspace. Third, the logic for
mapping XDP queues to vectors is fixed so that XDP state is restored for
XDP queues after a reconfiguration.
Sasha fixes reporting of Energy Efficient Ethernet support via ethtool in
the igc driver.
Sasha Neftin [Mon, 3 Jun 2024 21:42:35 +0000 (14:42 -0700)]
igc: Fix Energy Efficient Ethernet support declaration
The commit 01cf893bf0f4 ("net: intel: i40e/igc: Remove setting Autoneg in
EEE capabilities") removed SUPPORTED_Autoneg field but left inappropriate
ethtool_keee structure initialization. When "ethtool --show <device>"
(get_eee) invoke, the 'ethtool_keee' structure was accidentally overridden.
Remove the 'ethtool_keee' overriding and add EEE declaration as per IEEE
specification that allows reporting Energy Efficient Ethernet capabilities.
Examples:
Before fix:
ethtool --show-eee enp174s0
EEE settings for enp174s0:
EEE status: not supported
After fix:
EEE settings for enp174s0:
EEE status: disabled
Tx LPI: disabled
Supported EEE link modes: 100baseT/Full
1000baseT/Full
2500baseT/Full
Fixes: 01cf893bf0f4 ("net: intel: i40e/igc: Remove setting Autoneg in EEE capabilities") Suggested-by: Dima Ruinskiy <dima.ruinskiy@intel.com> Signed-off-by: Sasha Neftin <sasha.neftin@intel.com> Tested-by: Naama Meir <naamax.meir@linux.intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://lore.kernel.org/r/20240603-net-2024-05-30-intel-net-fixes-v2-6-e3563aa89b0c@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Larysa Zaremba [Mon, 3 Jun 2024 21:42:34 +0000 (14:42 -0700)]
ice: map XDP queues to vectors in ice_vsi_map_rings_to_vectors()
ice_pf_dcb_recfg() re-maps queues to vectors with
ice_vsi_map_rings_to_vectors(), which does not restore the previous
state for XDP queues. This leads to no AF_XDP traffic after rebuild.
Map XDP queues to vectors in ice_vsi_map_rings_to_vectors().
Also, move the code around, so XDP queues are mapped independently only
through .ndo_bpf().
Fixes: 6624e780a577 ("ice: split ice_vsi_setup into smaller functions") Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Chandan Kumar Rout <chandanx.rout@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://lore.kernel.org/r/20240603-net-2024-05-30-intel-net-fixes-v2-5-e3563aa89b0c@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Larysa Zaremba [Mon, 3 Jun 2024 21:42:33 +0000 (14:42 -0700)]
ice: add flag to distinguish reset from .ndo_bpf in XDP rings config
Commit 6624e780a577 ("ice: split ice_vsi_setup into smaller functions")
has placed ice_vsi_free_q_vectors() after ice_destroy_xdp_rings() in
the rebuild process. The behaviour of the XDP rings config functions is
context-dependent, so the change of order has led to
ice_destroy_xdp_rings() doing additional work and removing XDP prog, when
it was supposed to be preserved.
Also, dependency on the PF state reset flags creates an additional,
fortunately less common problem:
* PFR is requested e.g. by tx_timeout handler
* .ndo_bpf() is asked to delete the program, calls ice_destroy_xdp_rings(),
but reset flag is set, so rings are destroyed without deleting the
program
* ice_vsi_rebuild tries to delete non-existent XDP rings, because the
program is still on the VSI
* system crashes
With a similar race, when requested to attach a program,
ice_prepare_xdp_rings() can actually skip setting the program in the VSI
and nevertheless report success.
Instead of reverting to the old order of function calls, add an enum
argument to both ice_prepare_xdp_rings() and ice_destroy_xdp_rings() in
order to distinguish between calls from rebuild and .ndo_bpf().
Fixes: efc2214b6047 ("ice: Add support for XDP") Reviewed-by: Igor Bagnucki <igor.bagnucki@intel.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Chandan Kumar Rout <chandanx.rout@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://lore.kernel.org/r/20240603-net-2024-05-30-intel-net-fixes-v2-4-e3563aa89b0c@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Larysa Zaremba [Mon, 3 Jun 2024 21:42:32 +0000 (14:42 -0700)]
ice: remove af_xdp_zc_qps bitmap
Referenced commit has introduced a bitmap to distinguish between ZC and
copy-mode AF_XDP queues, because xsk_get_pool_from_qid() does not do this
for us.
The bitmap would be especially useful when restoring previous state after
rebuild, if only it was not reallocated in the process. This leads to e.g.
xdpsock dying after changing number of queues.
Instead of preserving the bitmap during the rebuild, remove it completely
and distinguish between ZC and copy-mode queues based on the presence of
a device associated with the pool.
Fixes: e102db780e1c ("ice: track AF_XDP ZC enabled queues in bitmap") Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Chandan Kumar Rout <chandanx.rout@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://lore.kernel.org/r/20240603-net-2024-05-30-intel-net-fixes-v2-3-e3563aa89b0c@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jacob Keller [Mon, 3 Jun 2024 21:42:31 +0000 (14:42 -0700)]
ice: fix reads from NVM Shadow RAM on E830 and E825-C devices
The ice driver reads data from the Shadow RAM portion of the NVM during
initialization, including data used to identify the NVM image and device,
such as the ETRACK ID used to populate devlink dev info fw.bundle.
Currently it is using a fixed offset defined by ICE_CSS_HEADER_LENGTH to
compute the appropriate offset. This worked fine for E810 and E822 devices
which both have CSS header length of 330 words.
Other devices, including both E825-C and E830 devices have different sizes
for their CSS header. The use of a hard coded value results in the driver
reading from the wrong block in the NVM when attempting to access the
Shadow RAM copy. This results in the driver reporting the fw.bundle as 0x0
in both the devlink dev info and ethtool -i output.
The first E830 support was introduced by commit ba20ecb1d1bb ("ice: Hook up
4 E830 devices by adding their IDs") and the first E825-C support was
introducted by commit f64e18944233 ("ice: introduce new E825C devices
family")
The NVM actually contains the CSS header length embedded in it. Remove the
hard coded value and replace it with logic to read the length from the NVM
directly. This is more resilient against all existing and future hardware,
vs looking up the expected values from a table. It ensures the driver will
read from the appropriate place when determining the ETRACK ID value used
for populating the fw.bundle_id and for reporting in ethtool -i.
The CSS header length for both the active and inactive flash bank is stored
in the ice_bank_info structure to avoid unnecessary duplicate work when
accessing multiple words of the Shadow RAM. Both banks are read in the
unlikely event that the header length is different for the NVM in the
inactive bank, rather than being different only by the overall device
family.
Fixes: ba20ecb1d1bb ("ice: Hook up 4 E830 devices by adding their IDs") Co-developed-by: Paul Greenwalt <paul.greenwalt@intel.com> Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://lore.kernel.org/r/20240603-net-2024-05-30-intel-net-fixes-v2-2-e3563aa89b0c@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jacob Keller [Mon, 3 Jun 2024 21:42:30 +0000 (14:42 -0700)]
ice: fix iteration of TLVs in Preserved Fields Area
The ice_get_pfa_module_tlv() function iterates over the Type-Length-Value
structures in the Preserved Fields Area (PFA) of the NVM. This is used by
the driver to access data such as the Part Board Assembly identifier.
The function uses simple logic to iterate over the PFA. First, the pointer
to the PFA in the NVM is read. Then the total length of the PFA is read
from the first word.
A pointer to the first TLV is initialized, and a simple loop iterates over
each TLV. The pointer is moved forward through the NVM until it exceeds the
PFA area.
The logic seems sound, but it is missing a key detail. The Preserved
Fields Area length includes one additional final word. This is documented
in the device data sheet as a dummy word which contains 0xFFFF. All NVMs
have this extra word.
If the driver tries to scan for a TLV that is not in the PFA, it will read
past the size of the PFA. It reads and interprets the last dummy word of
the PFA as a TLV with type 0xFFFF. It then reads the word following the PFA
as a length.
The PFA resides within the Shadow RAM portion of the NVM, which is
relatively small. All of its offsets are within a 16-bit size. The PFA
pointer and TLV pointer are stored by the driver as 16-bit values.
In almost all cases, the word following the PFA will be such that
interpreting it as a length will result in 16-bit arithmetic overflow. Once
overflowed, the new next_tlv value is now below the maximum offset of the
PFA. Thus, the driver will continue to iterate the data as TLVs. In the
worst case, the driver hits on a sequence of reads which loop back to
reading the same offsets in an endless loop.
To fix this, we need to correct the loop iteration check to account for
this extra word at the end of the PFA. This alone is sufficient to resolve
the known cases of this issue in the field. However, it is plausible that
an NVM could be misconfigured or have corrupt data which results in the
same kind of overflow. Protect against this by using check_add_overflow
when calculating both the maximum offset of the TLVs, and when calculating
the next_tlv offset at the end of each loop iteration. This ensures that
the driver will not get stuck in an infinite loop when scanning the PFA.
Fixes: e961b679fb0b ("ice: add board identifier info to devlink .info_get") Co-developed-by: Paul Greenwalt <paul.greenwalt@intel.com> Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://lore.kernel.org/r/20240603-net-2024-05-30-intel-net-fixes-v2-1-e3563aa89b0c@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 6 Jun 2024 02:03:07 +0000 (19:03 -0700)]
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:
====================
pull-request: bpf 2024-06-05
We've added 8 non-merge commits during the last 6 day(s) which contain
a total of 9 files changed, 34 insertions(+), 35 deletions(-).
The main changes are:
1) Fix a potential use-after-free in bpf_link_free when the link uses
dealloc_deferred to free the link object but later still tests for
presence of link->ops->dealloc, from Cong Wang.
2) Fix BPF test infra to set the run context for rawtp test_run callback
where syzbot reported a crash, from Jiri Olsa.
3) Fix bpf_session_cookie BTF_ID in the special_kfunc_set list to exclude
it for the case of !CONFIG_FPROBE, also from Jiri Olsa.
4) Fix a Coverity static analysis report to not close() a link_fd of -1
in the multi-uprobe feature detector, from Andrii Nakryiko.
5) Revert support for redirect to any xsk socket bound to the same umem
as it can result in corrupted ring state which can lead to a crash when
flushing rings. A different approach will be pursued for bpf-next to
address it safely, from Magnus Karlsson.
6) Fix inet_csk_accept prototype in test_sk_storage_tracing.c which caused
BPF CI failure after the last tree fast forwarding, from Andrii Nakryiko.
7) Fix a coccicheck warning in BPF devmap that iterator variable cannot
be NULL, from Thorsten Blum.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
Revert "xsk: Document ability to redirect to any socket bound to the same umem"
Revert "xsk: Support redirect to any socket bound to the same umem"
bpf: Set run context for rawtp test_run callback
bpf: Fix a potential use-after-free in bpf_link_free()
bpf, devmap: Remove unnecessary if check in for loop
libbpf: don't close(-1) in multi-uprobe feature detector
bpf: Fix bpf_session_cookie BTF_ID in special_kfunc_set list
selftests/bpf: fix inet_csk_accept prototype in test_sk_storage_tracing.c
====================
If one TCA_TAPRIO_ATTR_PRIOMAP attribute has been provided,
taprio_parse_mqprio_opt() must validate it, or userspace
can inject arbitrary data to the kernel, the second time
taprio_change() is called.
First call (with valid attributes) sets dev->num_tc
to a non zero value.
Second call (with arbitrary mqprio attributes)
returns early from taprio_parse_mqprio_opt()
and bad things can happen.
Fixes: a3d43c0d56f1 ("taprio: Add support adding an admin schedule") Reported-by: Noam Rathaus <noamr@ssd-disclosure.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://lore.kernel.org/r/20240604181511.769870-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Aleksandr Mishin [Tue, 4 Jun 2024 10:05:52 +0000 (13:05 +0300)]
net/mlx5: Fix tainted pointer delete is case of flow rules creation fail
In case of flow rule creation fail in mlx5_lag_create_port_sel_table(),
instead of previously created rules, the tainted pointer is deleted
deveral times.
Fix this bug by using correct flow rules pointers.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Fixes: 352899f384d4 ("net/mlx5: Lag, use buckets in hash mode") Signed-off-by: Aleksandr Mishin <amishin@t-argos.ru> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240604100552.25201-1-amishin@t-argos.ru Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Shay Drory [Mon, 3 Jun 2024 21:04:43 +0000 (00:04 +0300)]
net/mlx5: Always stop health timer during driver removal
Currently, if teardown_hca fails to execute during driver removal, mlx5
does not stop the health timer. Afterwards, mlx5 continue with driver
teardown. This may lead to a UAF bug, which results in page fault
Oops[1], since the health timer invokes after resources were freed.
Hence, stop the health monitor even if teardown_hca fails.
Fixes: 9b98d395b85d ("net/mlx5: Start health poll at earlier stage of driver load") Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Moshe Shemesh [Mon, 3 Jun 2024 21:04:42 +0000 (00:04 +0300)]
net/mlx5: Stop waiting for PCI if pci channel is offline
In case pci channel becomes offline the driver should not wait for PCI
reads during health dump and recovery flow. The driver has timeout for
each of these loops trying to read PCI, so it would fail anyway.
However, in case of recovery waiting till timeout may cause the pci
error_detected() callback fail to meet pci_dpc_recovered() wait timeout.
Fixes: b3bd076f7501 ("net/mlx5: Report devlink health on FW fatal issues") Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Shay Drori <shayd@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Frank Wunderlich [Mon, 3 Jun 2024 19:25:05 +0000 (21:25 +0200)]
net: ethernet: mtk_eth_soc: handle dma buffer size soc specific
The mainline MTK ethernet driver suffers long time from rarly but
annoying tx queue timeouts. We think that this is caused by fixed
dma sizes hardcoded for all SoCs.
We suspect this problem arises from a low level of free TX DMADs,
the TX Ring alomost full.
The transmit timeout is caused by the Tx queue not waking up. The
Tx queue stops when the free counter is less than ring->thres, and
it will wake up once the free counter is greater than ring->thres.
If the CPU is too late to wake up the Tx queues, it may cause a
transmit timeout.
Therefore, we increased the TX and RX DMADs to improve this error
situation.
Use the dma-size implementation from SDK in a per SoC manner. In
difference to SDK we have no RSS feature yet, so all RX/TX sizes
should be raised from 512 to 2048 byte except fqdma on mt7988 to
avoid the tx timeout issue.
Fixes: 656e705243fd ("net-next: mediatek: add support for MT7623 ethernet") Suggested-by: Daniel Golle <daniel@makrotopia.org> Signed-off-by: Frank Wunderlich <frank-w@public-files.de> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Mon, 3 Jun 2024 18:48:26 +0000 (11:48 -0700)]
rtnetlink: make the "split" NLM_DONE handling generic
Jaroslav reports Dell's OMSA Systems Management Data Engine
expects NLM_DONE in a separate recvmsg(), both for rtnl_dump_ifinfo()
and inet_dump_ifaddr(). We already added a similar fix previously in
commit 460b0d33cf10 ("inet: bring NLM_DONE out to a separate recv() again")
Instead of modifying all the dump handlers, and making them look
different than modern for_each_netdev_dump()-based dump handlers -
put the workaround in rtnetlink code. This will also help us move
the custom rtnl-locking from af_netlink in the future (in net-next).
Note that this change is not touching rtnl_dump_all(). rtnl_dump_all()
is different kettle of fish and a potential problem. We now mix families
in a single recvmsg(), but NLM_DONE is not coalesced.
Fixes: 3e41af90767d ("rtnetlink: use xarray iterator to implement rtnl_dump_ifinfo()") Fixes: cdb2f80f1c10 ("inet: use xa_array iterator to implement inet_dump_ifaddr()") Reported-by: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com> Link: https://lore.kernel.org/all/CAK8fFZ7MKoFSEzMBDAOjoUt+vTZRRQgLDNXEOfdCCXSoXXKE0g@mail.gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jason Xing [Mon, 3 Jun 2024 17:02:17 +0000 (01:02 +0800)]
mptcp: count CLOSE-WAIT sockets for MPTCP_MIB_CURRESTAB
Like previous patch does in TCP, we need to adhere to RFC 1213:
"tcpCurrEstab OBJECT-TYPE
...
The number of TCP connections for which the current state
is either ESTABLISHED or CLOSE- WAIT."
So let's consider CLOSE-WAIT sockets.
The logic of counting
When we increment the counter?
a) Only if we change the state to ESTABLISHED.
When we decrement the counter?
a) if the socket leaves ESTABLISHED and will never go into CLOSE-WAIT,
say, on the client side, changing from ESTABLISHED to FIN-WAIT-1.
b) if the socket leaves CLOSE-WAIT, say, on the server side, changing
from CLOSE-WAIT to LAST-ACK.
Fixes: d9cd27b8cd19 ("mptcp: add CurrEstab MIB counter support") Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Jason Xing [Mon, 3 Jun 2024 17:02:16 +0000 (01:02 +0800)]
tcp: count CLOSE-WAIT sockets for TCP_MIB_CURRESTAB
According to RFC 1213, we should also take CLOSE-WAIT sockets into
consideration:
"tcpCurrEstab OBJECT-TYPE
...
The number of TCP connections for which the current state
is either ESTABLISHED or CLOSE- WAIT."
After this, CurrEstab counter will display the total number of
ESTABLISHED and CLOSE-WAIT sockets.
The logic of counting
When we increment the counter?
a) if we change the state to ESTABLISHED.
b) if we change the state from SYN-RECEIVED to CLOSE-WAIT.
When we decrement the counter?
a) if the socket leaves ESTABLISHED and will never go into CLOSE-WAIT,
say, on the client side, changing from ESTABLISHED to FIN-WAIT-1.
b) if the socket leaves CLOSE-WAIT, say, on the server side, changing
from CLOSE-WAIT to LAST-ACK.
Please note: there are two chances that old state of socket can be changed
to CLOSE-WAIT in tcp_fin(). One is SYN-RECV, the other is ESTABLISHED.
So we have to take care of the former case.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Hangbin Liu [Mon, 3 Jun 2024 09:30:19 +0000 (17:30 +0800)]
selftests: hsr: add missing config for CONFIG_BRIDGE
hsr_redbox.sh test need to create bridge for testing. Add the missing
config CONFIG_BRIDGE in config file.
Fixes: eafbf0574e05 ("test: hsr: Extend the hsr_redbox.sh to have more SAN devices connected") Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Tested-by: Simon Horman <horms@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Mon, 3 Jun 2024 08:59:26 +0000 (10:59 +0200)]
vxlan: Fix regression when dropping packets due to invalid src addresses
Commit f58f45c1e5b9 ("vxlan: drop packets from invalid src-address")
has recently been added to vxlan mainly in the context of source
address snooping/learning so that when it is enabled, an entry in the
FDB is not being created for an invalid address for the corresponding
tunnel endpoint.
Before commit f58f45c1e5b9 vxlan was similarly behaving as geneve in
that it passed through whichever macs were set in the L2 header. It
turns out that this change in behavior breaks setups, for example,
Cilium with netkit in L3 mode for Pods as well as tunnel mode has been
passing before the change in f58f45c1e5b9 for both vxlan and geneve.
After mentioned change it is only passing for geneve as in case of
vxlan packets are dropped due to vxlan_set_mac() returning false as
source and destination macs are zero which for E/W traffic via tunnel
is totally fine.
Fix it by only opting into the is_valid_ether_addr() check in
vxlan_set_mac() when in fact source address snooping/learning is
actually enabled in vxlan. This is done by moving the check into
vxlan_snoop(). With this change, the Cilium connectivity test suite
passes again for both tunnel flavors.
Fixes: f58f45c1e5b9 ("vxlan: drop packets from invalid src-address") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: David Bauer <mail@david-bauer.net> Cc: Ido Schimmel <idosch@nvidia.com> Cc: Nikolay Aleksandrov <razor@blackwall.org> Cc: Martin KaFai Lau <martin.lau@kernel.org> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: David Bauer <mail@david-bauer.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Hangyu Hua [Mon, 3 Jun 2024 07:13:03 +0000 (15:13 +0800)]
net: sched: sch_multiq: fix possible OOB write in multiq_tune()
q->bands will be assigned to qopt->bands to execute subsequent code logic
after kmalloc. So the old q->bands should not be used in kmalloc.
Otherwise, an out-of-bounds write will occur.
Fixes: c2999f7fb05b ("net: sched: multiq: don't call qdisc_put() while holding tree lock") Signed-off-by: Hangyu Hua <hbh25y@gmail.com> Acked-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Taehee Yoo [Mon, 3 Jun 2024 04:57:55 +0000 (04:57 +0000)]
ionic: fix kernel panic in XDP_TX action
In the XDP_TX path, ionic driver sends a packet to the TX path with rx
page and corresponding dma address.
After tx is done, ionic_tx_clean() frees that page.
But RX ring buffer isn't reset to NULL.
So, it uses a freed page, which causes kernel panic.
Tristram Ha [Fri, 31 May 2024 01:38:01 +0000 (18:38 -0700)]
net: phy: Micrel KSZ8061: fix errata solution not taking effect problem
KSZ8061 needs to write to a MMD register at driver initialization to fix
an errata. This worked in 5.0 kernel but not in newer kernels. The
issue is the main phylib code no longer resets PHY at the very beginning.
Calling phy resuming code later will reset the chip if it is already
powered down at the beginning. This wipes out the MMD register write.
Solution is to implement a phy resume function for KSZ8061 to take care
of this problem.
Fixes: 232ba3a51cc2 ("net: phy: Micrel KSZ8061: link failure after cable connect") Signed-off-by: Tristram Ha <tristram.ha@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Wen Gu [Fri, 31 May 2024 08:54:17 +0000 (16:54 +0800)]
net/smc: avoid overwriting when adjusting sock bufsizes
When copying smc settings to clcsock, avoid setting clcsock's sk_sndbuf
to sysctl_tcp_wmem[1], since this may overwrite the value set by
tcp_sndbuf_expand() in TCP connection establishment.
And the other setting sk_{snd|rcv}buf to sysctl value in
smc_adjust_sock_bufsizes() can also be omitted since the initialization
of smc sock and clcsock has set sk_{snd|rcv}buf to smc.sysctl_{w|r}mem
or ipv4_sysctl_tcp_{w|r}mem[1].
Fixes: 30c3c4a4497c ("net/smc: Use correct buffer sizes when switching between TCP and SMC") Link: https://lore.kernel.org/r/5eaf3858-e7fd-4db8-83e8-3d7a3e0e9ae2@linux.alibaba.com Signed-off-by: Wen Gu <guwen@linux.alibaba.com> Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com> Reviewed-by: Gerd Bayer <gbayer@linux.ibm.com>, too. Signed-off-by: David S. Miller <davem@davemloft.net>
Subbaraya Sundeep [Wed, 29 May 2024 15:29:44 +0000 (20:59 +0530)]
octeontx2-af: Always allocate PF entries from low prioriy zone
PF mcam entries has to be at low priority always so that VF
can install longest prefix match rules at higher priority.
This was taken care currently but when priority allocation
wrt reference entry is requested then entries are allocated
from mid-zone instead of low priority zone. Fix this and
always allocate entries from low priority zone for PFs.
Fixes: 7df5b4b260dd ("octeontx2-af: Allocate low priority entries for PF") Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
This patch introduced a potential kernel crash when multiple napi instances
redirect to the same AF_XDP socket. By removing the queue_index check, it is
possible for multiple napi instances to access the Rx ring at the same time,
which will result in a corrupted ring state which can lead to a crash when
flushing the rings in __xsk_flush(). This can happen when the linked list of
sockets to flush gets corrupted by concurrent accesses. A quick and small fix
is not possible, so let us revert this for now.
Jiri Olsa [Tue, 4 Jun 2024 15:00:24 +0000 (17:00 +0200)]
bpf: Set run context for rawtp test_run callback
syzbot reported crash when rawtp program executed through the
test_run interface calls bpf_get_attach_cookie helper or any
other helper that touches task->bpf_ctx pointer.
Setting the run context (task->bpf_ctx pointer) for test_run
callback.
Fixes: 7adfc6c9b315 ("bpf: Add bpf_get_attach_cookie() BPF helper to access bpf_cookie value") Reported-by: syzbot+3ab78ff125b7979e45f9@syzkaller.appspotmail.com Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Closes: https://syzkaller.appspot.com/bug?extid=3ab78ff125b7979e45f9 Link: https://lore.kernel.org/bpf/20240604150024.359247-1-jolsa@kernel.org
Jakub Kicinski [Thu, 30 May 2024 23:26:07 +0000 (16:26 -0700)]
net: tls: fix marking packets as decrypted
For TLS offload we mark packets with skb->decrypted to make sure
they don't escape the host without getting encrypted first.
The crypto state lives in the socket, so it may get detached
by a call to skb_orphan(). As a safety check - the egress path
drops all packets with skb->decrypted and no "crypto-safe" socket.
The skb marking was added to sendpage only (and not sendmsg),
because tls_device injected data into the TCP stack using sendpage.
This special case was missed when sendpage got folded into sendmsg.
Fixes: c5c37af6ecad ("tcp: Convert do_tcp_sendpages() to use MSG_SPLICE_PAGES") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240530232607.82686-1-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jakub Kicinski [Tue, 4 Jun 2024 01:52:24 +0000 (18:52 -0700)]
Merge tag 'wireless-2024-06-03' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless
Kalle Valo says:
====================
wireless fixes for v6.10-rc3
The first fixes for v6.10. And we have a big one, I suspect the
biggest wireless pull request we ever had. There are fixes all over,
both in stack and drivers. Likely the most important here are mt76 not
working on mt7615 devices, ath11k not being able to connect to 6 GHz
networks and rtlwifi suffering from packet loss. But of course there's
much more.
* tag 'wireless-2024-06-03' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless: (37 commits)
wifi: rtlwifi: Ignore IEEE80211_CONF_CHANGE_RETRY_LIMITS
wifi: mt76: mt7615: add missing chanctx ops
wifi: wilc1000: document SRCU usage instead of SRCU
Revert "wifi: wilc1000: set atomic flag on kmemdup in srcu critical section"
Revert "wifi: wilc1000: convert list management to RCU"
wifi: mac80211: fix UBSAN noise in ieee80211_prep_hw_scan()
wifi: mac80211: correctly parse Spatial Reuse Parameter Set element
wifi: mac80211: fix Spatial Reuse element size check
wifi: iwlwifi: mvm: don't read past the mfuart notifcation
wifi: iwlwifi: mvm: Fix scan abort handling with HW rfkill
wifi: iwlwifi: mvm: check n_ssids before accessing the ssids
wifi: iwlwifi: mvm: properly set 6 GHz channel direct probe option
wifi: iwlwifi: mvm: handle BA session teardown in RF-kill
wifi: iwlwifi: mvm: Handle BIGTK cipher in kek_kck cmd
wifi: iwlwifi: mvm: remove stale STA link data during restart
wifi: iwlwifi: dbg_ini: move iwl_dbg_tlv_free outside of debugfs ifdef
wifi: iwlwifi: mvm: set properly mac header
wifi: iwlwifi: mvm: revert gen2 TX A-MPDU size to 64
wifi: iwlwifi: mvm: d3: fix WoWLAN command version lookup
wifi: iwlwifi: mvm: fix a crash on 7265
...
====================
Eric Dumazet [Fri, 31 May 2024 13:26:34 +0000 (13:26 +0000)]
ipv6: sr: block BH in seg6_output_core() and seg6_input_core()
As explained in commit 1378817486d6 ("tipc: block BH
before using dst_cache"), net/core/dst_cache.c
helpers need to be called with BH disabled.
Disabling preemption in seg6_output_core() is not good enough,
because seg6_output_core() is called from process context,
lwtunnel_output() only uses rcu_read_lock().
We might be interrupted by a softirq, re-enter seg6_output_core()
and corrupt dst_cache data structures.
Fix the race by using local_bh_disable() instead of
preempt_disable().
Apply a similar change in seg6_input_core().
Fixes: fa79581ea66c ("ipv6: sr: fix several BUGs when preemption is enabled") Fixes: 6c8702c60b88 ("ipv6: sr: add support for SRH encapsulation and injection with lwtunnels") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: David Lebrun <dlebrun@google.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/20240531132636.2637995-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Dumazet [Fri, 31 May 2024 13:26:33 +0000 (13:26 +0000)]
net: ipv6: rpl_iptunnel: block BH in rpl_output() and rpl_input()
As explained in commit 1378817486d6 ("tipc: block BH
before using dst_cache"), net/core/dst_cache.c
helpers need to be called with BH disabled.
Disabling preemption in rpl_output() is not good enough,
because rpl_output() is called from process context,
lwtunnel_output() only uses rcu_read_lock().
We might be interrupted by a softirq, re-enter rpl_output()
and corrupt dst_cache data structures.
Fix the race by using local_bh_disable() instead of
preempt_disable().
Eric Dumazet [Fri, 31 May 2024 13:26:32 +0000 (13:26 +0000)]
ipv6: ioam: block BH from ioam6_output()
As explained in commit 1378817486d6 ("tipc: block BH
before using dst_cache"), net/core/dst_cache.c
helpers need to be called with BH disabled.
Disabling preemption in ioam6_output() is not good enough,
because ioam6_output() is called from process context,
lwtunnel_output() only uses rcu_read_lock().
We might be interrupted by a softirq, re-enter ioam6_output()
and corrupt dst_cache data structures.
Fix the race by using local_bh_disable() instead of
preempt_disable().
Fixes: 8cb3bf8bff3c ("ipv6: ioam: Add support for the ip6ip6 encapsulation") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Justin Iurman <justin.iurman@uliege.be> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/20240531132636.2637995-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Matthias Stocker [Fri, 31 May 2024 10:37:11 +0000 (12:37 +0200)]
vmxnet3: disable rx data ring on dma allocation failure
When vmxnet3_rq_create() fails to allocate memory for rq->data_ring.base,
the subsequent call to vmxnet3_rq_destroy_all_rxdataring does not reset
rq->data_ring.desc_size for the data ring that failed, which presumably
causes the hypervisor to reference it on packet reception.
To fix this bug, rq->data_ring.desc_size needs to be set to 0 to tell
the hypervisor to disable this feature.
Cong Wang [Sun, 2 Jun 2024 18:27:03 +0000 (11:27 -0700)]
bpf: Fix a potential use-after-free in bpf_link_free()
After commit 1a80dbcb2dba, bpf_link can be freed by
link->ops->dealloc_deferred, but the code still tests and uses
link->ops->dealloc afterward, which leads to a use-after-free as
reported by syzbot. Actually, one of them should be sufficient, so
just call one of them instead of both. Also add a WARN_ON() in case
of any problematic implementation.
Fixes: 1a80dbcb2dba ("bpf: support deferring bpf_link dealloc to after RCU grace period") Reported-by: syzbot+1989ee16d94720836244@syzkaller.appspotmail.com Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20240602182703.207276-1-xiyou.wangcong@gmail.com
Thorsten Blum [Wed, 29 May 2024 10:19:01 +0000 (12:19 +0200)]
bpf, devmap: Remove unnecessary if check in for loop
The iterator variable dst cannot be NULL and the if check can be removed.
Remove it and fix the following Coccinelle/coccicheck warning reported
by itnull.cocci:
ERROR: iterator variable bound on line 762 cannot be NULL
Tristram Ha [Wed, 29 May 2024 02:20:23 +0000 (19:20 -0700)]
net: phy: micrel: fix KSZ9477 PHY issues after suspend/resume
When the PHY is powered up after powered down most of the registers are
reset, so the PHY setup code needs to be done again. In addition the
interrupt register will need to be setup again so that link status
indication works again.
Fixes: 26dd2974c5b5 ("net: phy: micrel: Move KSZ9477 errata fixes to PHY driver") Signed-off-by: Tristram Ha <tristram.ha@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Dmitry Safonov [Wed, 29 May 2024 17:29:32 +0000 (18:29 +0100)]
net/tcp: Don't consider TCP_CLOSE in TCP_AO_ESTABLISHED
TCP_CLOSE may or may not have current/rnext keys and should not be
considered "established". The fast-path for TCP_CLOSE is
SKB_DROP_REASON_TCP_CLOSE. This is what tcp_rcv_state_process() does
anyways. Add an early drop path to not spend any time verifying
segment signatures for sockets in TCP_CLOSE state.
DelphineCCChiu [Wed, 29 May 2024 06:58:55 +0000 (14:58 +0800)]
net/ncsi: Fix the multi thread manner of NCSI driver
Currently NCSI driver will send several NCSI commands back to back without
waiting the response of previous NCSI command or timeout in some state
when NIC have multi channel. This operation against the single thread
manner defined by NCSI SPEC(section 6.3.2.3 in DSP0222_1.1.1)
According to NCSI SPEC(section 6.2.13.1 in DSP0222_1.1.1), we should probe
one channel at a time by sending NCSI commands (Clear initial state, Get
version ID, Get capabilities...), than repeat this steps until the max
number of channels which we got from NCSI command (Get capabilities) has
been probed.
Jason Xing [Thu, 30 May 2024 03:27:17 +0000 (11:27 +0800)]
net: rps: fix error when CONFIG_RFS_ACCEL is off
John Sperbeck reported that if we turn off CONFIG_RFS_ACCEL, the 'head'
is not defined, which will trigger compile error. So I move the 'head'
out of the CONFIG_RFS_ACCEL scope.
Fixes: 84b6823cd96b ("net: rps: protect last_qtail with rps_input_queue_tail_save() helper") Reported-by: John Sperbeck <jsperbeck@google.com> Closes: https://lore.kernel.org/all/20240529203421.2432481-1-jsperbeck@google.com/ Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240530032717.57787-1-kerneljasonxing@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Lars Kellogg-Stedman [Wed, 29 May 2024 21:02:43 +0000 (17:02 -0400)]
ax25: Fix refcount imbalance on inbound connections
When releasing a socket in ax25_release(), we call netdev_put() to
decrease the refcount on the associated ax.25 device. However, the
execution path for accepting an incoming connection never calls
netdev_hold(). This imbalance leads to refcount errors, and ultimately
to kernel crashes.
A typical call trace for the above situation will start with one of the
following errors:
refcount_t: decrement hit 0; leaking memory.
refcount_t: underflow; use-after-free.
On reboot (or any attempt to remove the interface), the kernel gets
stuck in an infinite loop:
unregister_netdevice: waiting for ax0 to become free. Usage count = 0
This patch corrects these issues by ensuring that we call netdev_hold()
and ax25_dev_hold() for new connections in ax25_accept(). This makes the
logic leading to ax25_accept() match the logic for ax25_bind(): in both
cases we increment the refcount, which is ultimately decremented in
ax25_release().
Fixes: 9fd75b66b8f6 ("ax25: Fix refcount leaks caused by ax25_cb_del()") Signed-off-by: Lars Kellogg-Stedman <lars@oddbit.com> Tested-by: Duoming Zhou <duoming@zju.edu.cn> Tested-by: Dan Cross <crossd@gmail.com> Tested-by: Chris Maness <christopher.maness@gmail.com> Reviewed-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://lore.kernel.org/r/20240529210242.3346844-2-lars@oddbit.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Heng Qi [Tue, 28 May 2024 13:41:16 +0000 (21:41 +0800)]
virtio_net: fix a spurious deadlock issue
When the following snippet is run, lockdep will report a deadlock[1].
/* Acquire all queues dim_locks */
for (i = 0; i < vi->max_queue_pairs; i++)
mutex_lock(&vi->rq[i].dim_lock);
There's no deadlock here because the vq locks are always taken
in the same order, but lockdep can not figure it out. So refactoring
the code to alleviate the problem.
[1]
========================================================
WARNING: possible recursive locking detected
6.9.0-rc7+ #319 Not tainted
--------------------------------------------
ethtool/962 is trying to acquire lock:
but task is already holding lock:
other info that might help us debug this:
Possible unsafe locking scenario:
Fixes: 4d4ac2ececd3 ("virtio_net: Add a lock for per queue RX coalesce") Signed-off-by: Heng Qi <hengqi@linux.alibaba.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Acked-by: Jason Wang <jasowang@redhat.com> Link: https://lore.kernel.org/r/20240528134116.117426-3-hengqi@linux.alibaba.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Heng Qi [Tue, 28 May 2024 13:41:15 +0000 (21:41 +0800)]
virtio_net: fix possible dim status unrecoverable
When the dim worker is scheduled, if it no longer needs to issue
commands, dim may not be able to return to the working state later.
For example, the following single queue scenario:
1. The dim worker of rxq0 is scheduled, and the dim status is
changed to DIM_APPLY_NEW_PROFILE;
2. dim is disabled or parameters have not been modified;
3. virtnet_rx_dim_work exits directly;
Then, even if net_dim is invoked again, it cannot work because the
state is not restored to DIM_START_MEASURE.
Fixes: 6208799553a8 ("virtio-net: support rx netdim") Signed-off-by: Heng Qi <hengqi@linux.alibaba.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Link: https://lore.kernel.org/r/20240528134116.117426-2-hengqi@linux.alibaba.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vadim Fedorenko [Thu, 30 May 2024 04:08:14 +0000 (21:08 -0700)]
ethtool: init tsinfo stats if requested
Statistic values should be set to ETHTOOL_STAT_NOT_SET even if the
device doesn't support statistics. Otherwise zeros will be returned as
if they are proper values:
Peter Geis [Wed, 29 May 2024 18:56:35 +0000 (14:56 -0400)]
MAINTAINERS: remove Peter Geis
The Motorcomm PHY driver is now maintained by the OEM. The driver has
expanded far beyond my original purpose, and I do not have the hardware
to test against the new portions of it. Therefore I am removing myself as
a maintainer of the driver.
Heng Qi [Thu, 30 May 2024 03:41:43 +0000 (11:41 +0800)]
virtio_net: fix missing lock protection on control_buf access
Refactored the handling of control_buf to be within the cvq_lock
critical section, mitigating race conditions between reading device
responses and new command submissions.
Fixes: 6f45ab3e0409 ("virtio_net: Add a lock for the command VQ.") Signed-off-by: Heng Qi <hengqi@linux.alibaba.com> Reviewed-by: Hariprasad Kelam <hkelam@marvell.com> Acked-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Link: https://lore.kernel.org/r/20240530034143.19579-1-hengqi@linux.alibaba.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since commit 0a44dfc07074 ("wifi: mac80211: simplify non-chanctx
drivers") ieee80211_hw_config() is no longer called with changed = ~0.
rtlwifi relied on ~0 in order to ignore the default retry limits of
4/7, preferring 48/48 in station mode and 7/7 in AP/IBSS.
RTL8192DU has a lot of packet loss with the default limits from
mac80211. Fix it by ignoring IEEE80211_CONF_CHANGE_RETRY_LIMITS
completely, because it's the simplest solution.
Alexis Lothoré [Tue, 28 May 2024 14:20:30 +0000 (16:20 +0200)]
wifi: wilc1000: document SRCU usage instead of SRCU
Commit f236464f1db7 ("wifi: wilc1000: convert list management to RCU")
attempted to convert SRCU to RCU usage, assuming it was not really needed.
The runtime issues that arose after merging it showed that there are code
paths involving sleeping functions, and removing those would need some
heavier driver rework.
Add some documentation about SRCU need to make sure that any future
developer do not miss some use cases if tempted to convert back again to
RCU.
Commit 35aee01ff43d ("wifi: wilc1000: set atomic flag on kmemdup in srcu
critical section") was preparatory to the SRCU to RCU conversion done by
commit f236464f1db7 ("wifi: wilc1000: convert list management to RCU").
This conversion brought issues and so has been reverted, so the atomic flag
is not needed anymore.
Commit f236464f1db7 ("wifi: wilc1000: convert list management to RCU")
replaced SRCU with RCU, aiming to simplify RCU usage in the driver. No
documentation or commit history hinted about why SRCU has been preferred
in original design, so it has been assumed to be safe to do this
conversion.
Unfortunately, some static analyzers raised warnings, confirmed by runtime
checker, not long after the merge. At least three different issues arose
when switching to RCU:
- wilc_wlan_txq_filter_dup_tcp_ack is executed in a RCU read critical
section yet calls wait_for_completion_timeout
- wilc_wfi_init_mon_interface calls kmalloc and register_netdevice while
manipulating a vif retrieved from vif list
- set_channel sends command to chip (and so, also waits for a completion)
while holding a vif retrieved from vif list (so, in RCU read critical
section)
Some of those issues are not trivial to fix and would need bigger driver
rework. Fix those issues by reverting the SRCU to RCU conversion commit
Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/linux-wireless/3b46ec7c-baee-49fd-b760-3bc12fb12eaf@moroto.mountain/ Fixes: f236464f1db7 ("wifi: wilc1000: convert list management to RCU") Signed-off-by: Alexis Lothoré <alexis.lothore@bootlin.com> Signed-off-by: Kalle Valo <kvalo@kernel.org> Link: https://msgid.link/20240528-wilc_revert_srcu_to_rcu-v1-1-bce096e0798c@bootlin.com
Jiri Olsa [Fri, 31 May 2024 19:45:00 +0000 (21:45 +0200)]
bpf: Fix bpf_session_cookie BTF_ID in special_kfunc_set list
The bpf_session_cookie is unavailable for !CONFIG_FPROBE as reported
by Sebastian [1].
To fix that we remove CONFIG_FPROBE ifdef for session kfuncs, which
is fine, because there's filter for session programs.
Then based on bpf_trace.o dependency:
obj-$(CONFIG_BPF_EVENTS) += bpf_trace.o
we add bpf_session_cookie BTF_ID in special_kfunc_set list dependency
on CONFIG_BPF_EVENTS.
[1] https://lore.kernel.org/bpf/20240531071557.MvfIqkn7@linutronix.de/T/#m71c6d5ec71db2967288cb79acedc15cc5dbfeec5 Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Suggested-by: Alexei Starovoitov <ast@kernel.org> Fixes: 5c919acef8514 ("bpf: Add support for kprobe session cookie") Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20240531194500.2967187-1-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
- netfilter: tproxy: bail out if IP has been disabled on the device
- af_unix: annotate data-race around unix_sk(sk)->addr
- eth: mlx5e: fix UDP GSO for encapsulated packets
- eth: idpf: don't enable NAPI and interrupts prior to allocating Rx
buffers
- eth: i40e: fully suspend and resume IO operations in EEH case
- eth: octeontx2-pf: free send queue buffers incase of leaf to inner
- eth: ipvlan: dont Use skb->sk in ipvlan_process_v{4,6}_outbound"
* tag 'net-6.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (69 commits)
netdev: add qstat for csum complete
ipvlan: Dont Use skb->sk in ipvlan_process_v{4,6}_outbound
net: ena: Fix redundant device NUMA node override
ice: check for unregistering correct number of devlink params
ice: fix 200G PHY types to link speed mapping
i40e: Fully suspend and resume IO operations in EEH case
i40e: factoring out i40e_suspend/i40e_resume
e1000e: move force SMBUS near the end of enable_ulp function
net: dsa: microchip: fix RGMII error in KSZ DSA driver
ipv4: correctly iterate over the target netns in inet_dump_ifaddr()
net: fix __dst_negative_advice() race
nfc/nci: Add the inconsistency check between the input data length and count
MAINTAINERS: dwmac: starfive: update Maintainer
net/sched: taprio: extend minimum interval restriction to entire cycle too
net/sched: taprio: make q->picos_per_byte available to fill_sched_entry()
netfilter: nft_fib: allow from forward/input without iif selector
netfilter: tproxy: bail out if IP has been disabled on the device
netfilter: nft_payload: skbuff vlan metadata mangle support
net: ti: icssg-prueth: Fix start counter for ft1 filter
sock_map: avoid race between sock_map_close and sk_psock_put
...
Jakub Kicinski [Wed, 29 May 2024 16:35:47 +0000 (09:35 -0700)]
netdev: add qstat for csum complete
Recent commit 0cfe71f45f42 ("netdev: add queue stats") added
a lot of useful stats, but only those immediately needed by virtio.
Presumably virtio does not support CHECKSUM_COMPLETE,
so statistic for that form of checksumming wasn't included.
Other drivers will definitely need it, in fact we expect it
to be needed in net-next soon (mlx5). So let's add the definition
of the counter for CHECKSUM_COMPLETE to uAPI in net already,
so that the counters are in a more natural order (all subsequent
counters have not been present in any released kernel, yet).
The warning triggers as this:
packet_sendmsg
packet_snd //skb->sk is packet sk
__dev_queue_xmit
__dev_xmit_skb //q->enqueue is not NULL
__qdisc_run
sch_direct_xmit
dev_hard_start_xmit
ipvlan_start_xmit
ipvlan_xmit_mode_l3 //l3 mode
ipvlan_process_outbound //vepa flag
ipvlan_process_v6_outbound
ip6_local_out
__ip6_finish_output
ip6_finish_output2 //multicast packet
sk_mc_loop //sk->sk_family is AF_PACKET
Call ip{6}_local_out() with NULL sk in ipvlan as other tunnels to fix this.
Fixes: 2ad7bf363841 ("ipvlan: Initial check-in of the IPVLAN driver.") Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240529095633.613103-1-yuehaibing@huawei.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Paolo Abeni [Thu, 30 May 2024 08:14:56 +0000 (10:14 +0200)]
Merge tag 'nf-24-05-29' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
Patch #1 syzbot reports that nf_reinject() could be called without
rcu_read_lock() when flushing pending packets at nfnetlink
queue removal, from Eric Dumazet.
Patch #2 flushes ipset list:set when canceling garbage collection to
reference to other lists to fix a race, from Jozsef Kadlecsik.
Patch #3 restores q-in-q matching with nft_payload by reverting f6ae9f120dad ("netfilter: nft_payload: add C-VLAN support").
Patch #4 fixes vlan mangling in skbuff when vlan offload is present
in skbuff, without this patch nft_payload corrupts packets
in this case.
Patch #5 fixes possible nul-deref in tproxy no IP address is found in
netdevice, reported by syzbot and patch from Florian Westphal.
Patch #6 removes a superfluous restriction which prevents loose fib
lookups from input and forward hooks, from Eric Garver.
My assessment is that patches #1, #2 and #5 address possible kernel
crash, anything else in this batch fixes broken features.
netfilter pull request 24-05-29
* tag 'nf-24-05-29' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nft_fib: allow from forward/input without iif selector
netfilter: tproxy: bail out if IP has been disabled on the device
netfilter: nft_payload: skbuff vlan metadata mangle support
netfilter: nft_payload: restore vlan q-in-q match support
netfilter: ipset: Add list flush to cancel_gc
netfilter: nfnetlink_queue: acquire rcu_read_lock() in instance_destroy_rcu()
====================
Shay Agroskin [Tue, 28 May 2024 17:09:12 +0000 (20:09 +0300)]
net: ena: Fix redundant device NUMA node override
The driver overrides the NUMA node id of the device regardless of
whether it knows its correct value (often setting it to -1 even though
the node id is advertised in 'struct device'). This can lead to
suboptimal configurations.
This patch fixes this behavior and makes the shared memory allocation
functions use the NUMA node id advertised by the underlying device.
This series includes a variety of fixes that have been accumulating on the
Intel Wired LAN dev-queue.
Hui Wang provides a fix for suspend/resume on e1000e due to failure
to correctly setup the SMBUS in enable_ulp().
Thinh Tran provides a fix for EEH I/O suspend/resume on i40e to
ensure that I/O operations can continue after a resume. To avoid duplicate
code, the common logic is factored out of i40e_suspend and i40e_resume.
Paul Greenwalt provides a fix to correctly map the 200G PHY types to link
speeds in the ice driver.
Dave Ertman provides a fix correcting devlink parameter unregistration in
the event that the driver loads in safe mode and some of the parameters
were not registered.
====================
Dave Ertman [Tue, 28 May 2024 22:06:11 +0000 (15:06 -0700)]
ice: check for unregistering correct number of devlink params
On module load, the ice driver checks for the lack of a specific PF
capability to determine if it should reduce the number of devlink params
to register. One situation when this test returns true is when the
driver loads in safe mode. The same check is not present on the unload
path when devlink params are unregistered. This results in the driver
triggering a WARN_ON in the kernel devlink code.
The current check and code path uses a reduction in the number of elements
reported in the list of params. This is fragile and not good for future
maintaining.
Change the parameters to be held in two lists, one always registered and
one dependent on the check.
Add a symmetrical check in the unload path so that the correct parameters
are unregistered as well.
Fixes: 109eb2917284 ("ice: Add tx_scheduling_layers devlink param") CC: Lukasz Czapnik <lukasz.czapnik@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Dave Ertman <david.m.ertman@intel.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://lore.kernel.org/r/20240528-net-2024-05-28-intel-net-fixes-v1-8-dc8593d2bbc6@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Paul Greenwalt [Tue, 28 May 2024 22:06:08 +0000 (15:06 -0700)]
ice: fix 200G PHY types to link speed mapping
Commit 24407a01e57c ("ice: Add 200G speed/phy type use") added support
for 200G PHY speeds, but did not include the mapping of 200G PHY types
to link speed. As a result the driver is returning UNKNOWN link speed
when setting 200G ethtool advertised link modes.
To fix this add 200G PHY types to link speed mapping to
ice_get_link_speed_based_on_phy_type().
Fixes: 24407a01e57c ("ice: Add 200G speed/phy type use") Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://lore.kernel.org/r/20240528-net-2024-05-28-intel-net-fixes-v1-5-dc8593d2bbc6@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Thinh Tran [Tue, 28 May 2024 22:06:06 +0000 (15:06 -0700)]
i40e: Fully suspend and resume IO operations in EEH case
When EEH events occurs, the callback functions in the i40e, which are
managed by the EEH driver, will completely suspend and resume all IO
operations.
- In the PCI error detected callback, replaced i40e_prep_for_reset()
with i40e_io_suspend(). The change is to fully suspend all I/O
operations
- In the PCI error slot reset callback, replaced pci_enable_device_mem()
with pci_enable_device(). This change enables both I/O and memory of
the device.
- In the PCI error resume callback, replaced i40e_handle_reset_warning()
with i40e_io_resume(). This change allows the system to resume I/O
operations
Fixes: a5f3d2c17b07 ("powerpc/pseries/pci: Add MSI domains") Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Robert Thomas <rob.thomas@ibm.com> Signed-off-by: Thinh Tran <thinhtr@linux.ibm.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://lore.kernel.org/r/20240528-net-2024-05-28-intel-net-fixes-v1-3-dc8593d2bbc6@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Thinh Tran [Tue, 28 May 2024 22:06:05 +0000 (15:06 -0700)]
i40e: factoring out i40e_suspend/i40e_resume
Two new functions, i40e_io_suspend() and i40e_io_resume(), have been
introduced. These functions were factored out from the existing
i40e_suspend() and i40e_resume() respectively. This factoring was
done due to concerns about the logic of the I40E_SUSPENSED state, which
caused the device to be unable to recover. The functions are now used
in the EEH handling for device suspend/resume callbacks.
The function i40e_enable_mc_magic_wake() has been moved ahead of
i40e_io_suspend() to ensure it is declared before being used.
Tested-by: Robert Thomas <rob.thomas@ibm.com> Signed-off-by: Thinh Tran <thinhtr@linux.ibm.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://lore.kernel.org/r/20240528-net-2024-05-28-intel-net-fixes-v1-2-dc8593d2bbc6@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Hui Wang [Tue, 28 May 2024 22:06:04 +0000 (15:06 -0700)]
e1000e: move force SMBUS near the end of enable_ulp function
The commit 861e8086029e ("e1000e: move force SMBUS from enable ulp
function to avoid PHY loss issue") introduces a regression on
PCH_MTP_I219_LM18 (PCIID: 0x8086550A). Without the referred commit, the
ethernet works well after suspend and resume, but after applying the
commit, the ethernet couldn't work anymore after the resume and the
dmesg shows that the NIC link changes to 10Mbps (1000Mbps originally):
[ 43.305084] e1000e 0000:00:1f.6 enp0s31f6: NIC Link is Up 10 Mbps Full Duplex, Flow Control: Rx/Tx
Without the commit, the force SMBUS code will not be executed if
"return 0" or "goto out" is executed in the enable_ulp(), and in my
case, the "goto out" is executed since FWSM_FW_VALID is set. But after
applying the commit, the force SMBUS code will be ran unconditionally.
Here move the force SMBUS code back to enable_ulp() and put it
immediately ahead of hw->phy.ops.release(hw), this could allow the
longest settling time as possible for interface in this function and
doesn't change the original code logic.
The issue was found on a Lenovo laptop with the ethernet hw as below:
00:1f.6 Ethernet controller [0200]: Intel Corporation Device [8086:550a]
(rev 20).
And this patch is verified (cable plug and unplug, system suspend
and resume) on Lenovo laptops with ethernet hw: [8086:550a],
[8086:550b], [8086:15bb], [8086:15be], [8086:1a1f], [8086:1a1c] and
[8086:0dc7].
Fixes: 861e8086029e ("e1000e: move force SMBUS from enable ulp function to avoid PHY loss issue") Signed-off-by: Hui Wang <hui.wang@canonical.com> Acked-by: Vitaly Lifshits <vitaly.lifshits@intel.com> Tested-by: Naama Meir <naamax.meir@linux.intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Tested-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://lore.kernel.org/r/20240528-net-2024-05-28-intel-net-fixes-v1-1-dc8593d2bbc6@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Tristram Ha [Tue, 28 May 2024 21:34:26 +0000 (14:34 -0700)]
net: dsa: microchip: fix RGMII error in KSZ DSA driver
The driver should return RMII interface when XMII is running in RMII mode.
Fixes: 0ab7f6bf1675 ("net: dsa: microchip: ksz9477: use common xmii function") Signed-off-by: Tristram Ha <tristram.ha@microchip.com> Acked-by: Arun Ramadoss <arun.ramadoss@microchip.com> Acked-by: Jerry Ray <jerry.ray@microchip.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/1716932066-3342-1-git-send-email-Tristram.Ha@microchip.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Alexander Mikhalitsyn [Tue, 28 May 2024 20:30:30 +0000 (22:30 +0200)]
ipv4: correctly iterate over the target netns in inet_dump_ifaddr()
A recent change to inet_dump_ifaddr had the function incorrectly iterate
over net rather than tgt_net, resulting in the data coming for the
incorrect network namespace.
Fixes: cdb2f80f1c10 ("inet: use xa_array iterator to implement inet_dump_ifaddr()") Reported-by: Stéphane Graber <stgraber@stgraber.org> Closes: https://github.com/lxc/incus/issues/892 Bisected-by: Stéphane Graber <stgraber@stgraber.org> Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Tested-by: Stéphane Graber <stgraber@stgraber.org> Acked-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240528203030.10839-1-aleksandr.mikhalitsyn@canonical.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Dumazet [Tue, 28 May 2024 11:43:53 +0000 (11:43 +0000)]
net: fix __dst_negative_advice() race
__dst_negative_advice() does not enforce proper RCU rules when
sk->dst_cache must be cleared, leading to possible UAF.
RCU rules are that we must first clear sk->sk_dst_cache,
then call dst_release(old_dst).
Note that sk_dst_reset(sk) is implementing this protocol correctly,
while __dst_negative_advice() uses the wrong order.
Given that ip6_negative_advice() has special logic
against RTF_CACHE, this means each of the three ->negative_advice()
existing methods must perform the sk_dst_reset() themselves.
Note the check against NULL dst is centralized in
__dst_negative_advice(), there is no need to duplicate
it in various callbacks.
Many thanks to Clement Lecigne for tracking this issue.
This old bug became visible after the blamed commit, using UDP sockets.
Fixes: a87cb3e48ee8 ("net: Facility to report route quality of connected sockets") Reported-by: Clement Lecigne <clecigne@google.com> Diagnosed-by: Clement Lecigne <clecigne@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tom Herbert <tom@herbertland.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240528114353.1794151-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We gave it a try, but it turns out the kernel test robot did in fact
find performance regressions for it, so we'll have to look at the more
involved alternative fixes for Yafang Shao's Elasticsearch load issue.
There were several alternatives discussed, they just weren't as simple
as this first attempt.
The report is of a -7.4% regression of filebench.sum_operations/s, which
appears significant enough to trigger my "this patch may get reverted if
somebody finds a performance regression on some other load" rule.
So it's still the case that we should end up deleting dentries more
aggressively - or just be better at pruning them later - but it needs a
bit more finesse than this simple thing.
Linus Torvalds [Wed, 29 May 2024 16:25:15 +0000 (09:25 -0700)]
Merge tag '9p-for-6.10-rc2' of https://github.com/martinetd/linux
Pull 9p fixes from Dominique Martinet:
"Two fixes headed to stable trees:
- a trace event was dumping uninitialized values
- a missing lock that was thought to have exclusive access, and it
turned out not to"
* tag '9p-for-6.10-rc2' of https://github.com/martinetd/linux:
9p: add missing locking around taking dentry fid list
net/9p: fix uninit-value in p9_client_rpc()