net: ngbe: fix memory leak in ngbe_probe() error path
When ngbe_sw_init() is called, memory is allocated for wx->rss_key
in wx_init_rss_key(). However, in ngbe_probe() function, the subsequent
error paths after ngbe_sw_init() don't free the rss_key. Fix that by
freeing it in error path along with wx->mac_table.
Also change the label to which execution jumps when ngbe_sw_init()
fails, because otherwise, it could lead to a double free for rss_key,
when the mac_table allocation fails in wx_sw_init().
Fixes: 02338c484ab6 ("net: ngbe: Initialize sw info and register netdev") Signed-off-by: Abdun Nihaal <abdun.nihaal@gmail.com> Reviewed-by: Kory Maincent <kory.maincent@bootlin.com> Reviewed-by: Jiawen Wu <jiawenwu@trustnetic.com> Link: https://patch.msgid.link/20250412154927.25908-1-abdun.nihaal@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jakub Kicinski [Mon, 14 Apr 2025 22:58:30 +0000 (15:58 -0700)]
Merge branch '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue
Tony Nguyen says:
====================
igc: Fix PTM timeout
Christopher S M Hall says:
There have been sporadic reports of PTM timeouts using i225/i226 devices
These timeouts have been root caused to:
1) Manipulating the PTM status register while PTM is enabled
and triggered
2) The hardware retrying too quickly when an inappropriate response
is received from the upstream device
The issue can be reproduced with the following:
$ sudo phc2sys -R 1000 -O 0 -i tsn0 -m
Note: 1000 Hz (-R 1000) is unrealistically large, but provides a way to
quickly reproduce the issue.
PHC2SYS exits with:
"ioctl PTP_OFFSET_PRECISE: Connection timed out" when the PTM transaction
fails
The first patch in this series also resolves an issue reported by Corinna
Vinschen relating to kdump:
This patch also fixes a hang in igc_probe() when loading the igc
driver in the kdump kernel on systems supporting PTM.
The igc driver running in the base kernel enables PTM trigger in
igc_probe(). Therefore the driver is always in PTM trigger mode,
except in brief periods when manually triggering a PTM cycle.
When a crash occurs, the NIC is reset while PTM trigger is enabled.
Due to a hardware problem, the NIC is subsequently in a bad busmaster
state and doesn't handle register reads/writes. When running
igc_probe() in the kdump kernel, the first register access to a NIC
register hangs driver probing and ultimately breaks kdump.
With this patch, igc has PTM trigger disabled most of the time,
and the trigger is only enabled for very brief (10 - 100 us) periods
when manually triggering a PTM cycle. Chances that a crash occurs
during a PTM trigger are not zero, but extremly reduced.
* '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
igc: add lock preventing multiple simultaneous PTM transactions
igc: cleanup PTP module if probe fails
igc: handle the IGC_PTP_ENABLED flag correctly
igc: move ktime snapshot into PTM retry loop
igc: increase wait time before retrying PTM
igc: fix PTM cycle trigger logic
====================
Jakub Kicinski [Wed, 9 Apr 2025 14:55:41 +0000 (07:55 -0700)]
netlink: specs: ovs_vport: align with C codegen capabilities
We started generating C code for OvS a while back, but actually
C codegen only supports fixed headers specified at the family
level right now (schema also allows specifying them per op).
ovs_flow and ovs_datapath already specify the fixed header
at the family level but ovs_vport does it per op.
Move the property, all ops use the same header.
The first dependency is the problem. wiphy mutex should be outside
the instance locks. The problem happens in notifiers (as always)
for CLOSE. We only hold the instance lock for ops locked devices
during CLOSE, and WiFi netdevs are not ops locked. Unfortunately,
when we dev_close_many() during netns dismantle we may be holding
the instance lock of _another_ netdev when issuing a CLOSE for
a WiFi device.
Lockdep's "Possible unsafe locking scenario" only prints 3 locks
and we have 4, plus I think we'd need 3 CPUs, like this:
Tho, I don't think that's possible as CPU1 and CPU2 would
be under rtnl_lock. Even if we have per-netns rtnl_lock and
wiphy can span network namespaces - CPU0 and CPU1 must be
in the same netns to see dev_instance_lock, so CPU0 can't
be installing a socket as CPU1 is tearing the netns down.
Regardless, our expected lock ordering is that wiphy lock
is taken before instance locks, so let's fix this.
Go over the ops locked and non-locked devices separately.
Note that calling dev_close_many() on an empty list is perfectly
fine. All processing (including RCU syncs) are conditional
on the list not being empty, already.
Fixes: 7e4d784f5810 ("net: hold netdev instance lock during rtnetlink operations") Reported-by: syzbot+6f588c78bf765b62b450@syzkaller.appspotmail.com Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250412233011.309762-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
After detecting the np_link_fail exception,
the driver attempts to fix the exception by
using phy_stop() and phy_start() in the scheduled task.
However, hbg_fix_np_link_fail() and .ndo_stop()
may be concurrently executed. As a result,
phy_stop() is executed twice, and the following Calltrace occurs:
This patch adds the rtnl_lock to hbg_fix_np_link_fail()
to ensure that other operations are not performed concurrently.
In addition, np_link_fail exception can be fixed
only when the PHY is link.
Fixes: e0306637e85d ("net: hibmcge: Add support for mac link exception handling feature") Signed-off-by: Jijie Shao <shaojijie@huawei.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250410021327.590362-8-shaojijie@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: hibmcge: fix not restore rx pause mac addr after reset issue
The MAC hardware supports receiving two types of
pause frames from link partner.
One is a pause frame with a destination address
of 01:80:C2:00:00:01.
The other is a pause frame whose destination address
is the address of the hibmcge driver.
01:80:C2:00:00:01 is supported by default.
In .ndo_set_mac_address(), the hibmcge driver calls
.hbg_hw_set_rx_pause_mac_addr() to set its mac address as the
destination address of the rx puase frame.
Therefore, pause frames with two types of MAC addresses can be received.
Currently, the rx pause addr does not restored after reset.
As a result, pause frames whose destination address is
the hibmcge driver address cannot be correctly received.
This patch restores the configuration by calling
.hbg_hw_set_rx_pause_mac_addr() after reset is complete.
Fixes: 3f5a61f6d504 ("net: hibmcge: Add reset supported in this module") Signed-off-by: Jijie Shao <shaojijie@huawei.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250410021327.590362-7-shaojijie@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: hibmcge: fix the incorrect np_link fail state issue.
In the debugfs file, the driver displays the np_link fail state
based on the HBG_NIC_STATE_NP_LINK_FAIL.
However, HBG_NIC_STATE_NP_LINK_FAIL is cleared in hbg_service_task()
So, this value of np_link fail is always false.
This patch directly reads the related register to display the real state.
Fixes: e0306637e85d ("net: hibmcge: Add support for mac link exception handling feature") Signed-off-by: Jijie Shao <shaojijie@huawei.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250410021327.590362-6-shaojijie@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
A dbg log is generated when the driver modifies the MTU,
which is expected to trace the change of the MTU.
However, the log is recorded after WRITE_ONCE().
At this time, netdev->mtu has been changed to the new value.
As a result, netdev->mtu is the same as new_mtu.
This patch modifies the log location and records logs before WRITE_ONCE().
Fixes: ff4edac6e9bd ("net: hibmcge: Implement some .ndo functions") Signed-off-by: Jijie Shao <shaojijie@huawei.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250410021327.590362-5-shaojijie@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: hibmcge: fix the share of irq statistics among different network ports issue
hbg_irqs is a global array which contains irq statistics.
However, the irq statistics of different network ports
point to the same global array. As a result, the statistics are incorrect.
This patch allocates a statistics array for each network port
to prevent the statistics of different network ports
from affecting each other.
irq statistics are removed from hbg_irq_info. Therefore,
all data in hbg_irq_info remains unchanged. Therefore,
the input parameter of some functions is changed to const.
Fixes: 4d089035fa19 ("net: hibmcge: Add interrupt supported in this module") Signed-off-by: Jijie Shao <shaojijie@huawei.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250410021327.590362-4-shaojijie@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The driver supports pause frames,
but does not pass pause frames based on rx pause enable configuration,
resulting in incorrect pause frame statistics.
ethtool --include-statistics -a enp132s0f2
Pause parameters for enp132s0f2:
Autonegotiate: on
RX: on
TX: on
RX negotiated: on
TX negotiated: on
Statistics:
tx_pause_frames: 0
rx_pause_frames: 100
Fixes: 3a03763f3876 ("net: hibmcge: Add pauseparam supported in this module") Signed-off-by: Jijie Shao <shaojijie@huawei.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250410021327.590362-2-shaojijie@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit 04efcee6ef8d ("net: hold instance lock during NETDEV_CHANGE")
changed to lockless __linkwatch_sync_dev in ethtool_op_get_link.
All paths except bonding are coming via locked ioctl. Add necessary
locking to bonding.
Reviewed-by: Hangbin Liu <liuhangbin@gmail.com> Reported-by: syzbot+48c14f61594bdfadb086@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=48c14f61594bdfadb086 Fixes: 04efcee6ef8d ("net: hold instance lock during NETDEV_CHANGE") Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250410161117.3519250-1-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ethtool: cmis_cdb: use correct rpl size in ethtool_cmis_module_poll()
rpl is passed as a pointer to ethtool_cmis_module_poll(), so the correct
size of rpl is sizeof(*rpl) which should be just 1 byte. Using the
pointer size instead can cause stack corruption:
pds_core: fix memory leak in pdsc_debugfs_add_qcq()
The memory allocated for intr_ctrl_regset, which is passed to
debugfs_create_regset32() may not be cleaned up when the driver is
removed. Fix that by using device managed allocation for it.
Fixes: 45d76f492938 ("pds_core: set up device and adminq") Signed-off-by: Abdun Nihaal <abdun.nihaal@gmail.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Reviewed-by: Shannon Nelson <shannon.nelson@amd.com> Link: https://patch.msgid.link/20250409054450.48606-1-abdun.nihaal@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Fri, 11 Apr 2025 23:38:03 +0000 (16:38 -0700)]
Merge tag 'wireless-2025-04-11' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless
Johannes Berg says:
====================
Just a handful of fixes, notably
- iwlwifi: various build warning fixes (e.g. PM_SLEEP)
- iwlwifi: fix operation when FW reset handshake times out
- mac80211: drop pending frames on interface down
* tag 'wireless-2025-04-11' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
Revert "wifi: mac80211: Update skb's control block key in ieee80211_tx_dequeue()"
wifi: iwlwifi: mld: Restart firmware on iwl_mld_no_wowlan_resume() error
wifi: iwlwifi: pcie: set state to no-FW before reset handshake
wifi: wl1251: fix memory leak in wl1251_tx_work
wifi: brcmfmac: fix memory leak in brcmf_get_module_param
wifi: iwlwifi: mld: silence uninitialized variable warning
wifi: mac80211: Purge vif txq in ieee80211_do_stop()
wifi: mac80211: Update skb's control block key in ieee80211_tx_dequeue()
wifi: at76c50x: fix use after free access in at76_disconnect
wifi: add wireless list to MAINTAINERS
iwlwifi: mld: fix building with CONFIG_PM_SLEEP disabled
wifi: iwlwifi: mld: fix PM_SLEEP -Wundef warning
wifi: iwlwifi: mld: reduce scope for uninitialized variable
====================
SMC consists of two sockets: smc_sock and kernel TCP socket.
Currently, there are two ways of creating the sockets, and syzbot reported
a lockdep splat [0] for the newer way introduced by commit d25a92ccae6b
("net/smc: Introduce IPPROTO_SMC").
socket(AF_SMC , SOCK_STREAM, SMCPROTO_SMC or SMCPROTO_SMC6)
socket(AF_INET or AF_INET6, SOCK_STREAM, IPPROTO_SMC)
When a socket is allocated, sock_lock_init() sets a lockdep lock class to
sk->sk_lock.slock based on its protocol family. In the IPPROTO_SMC case,
AF_INET or AF_INET6 lock class is assigned to smc_sock.
The repro sets IPV6_JOIN_ANYCAST for IPv6 UDP and SMC socket and exercises
smc_switch_to_fallback() for IPPROTO_SMC.
1. smc_switch_to_fallback() is called under lock_sock() and holds
smc->clcsock_release_lock.
2. Setting IPV6_JOIN_ANYCAST to SMC holds smc->clcsock_release_lock
and calls setsockopt() for the kernel TCP socket, which holds RTNL
and the kernel socket's lock_sock().
Add a mutex around the PTM transaction to prevent multiple transactors
Multiple processes try to initiate a PTM transaction, one or all may
fail. This can be reproduced by running two instances of the
following:
$ sudo phc2sys -O 0 -i tsn0 -m
PHC2SYS exits with:
"ioctl PTP_OFFSET_PRECISE: Connection timed out" when the PTM transaction
fails
Note: Normally two instance of PHC2SYS will not run, but one process
should not break another.
Fixes: a90ec8483732 ("igc: Add support for PTP getcrosststamp()") Signed-off-by: Christopher S M Hall <christopher.s.hall@intel.com> Reviewed-by: Corinna Vinschen <vinschen@redhat.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Mor Bar-Gabay <morx.bar.gabay@intel.com> Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Christopher S M Hall [Tue, 1 Apr 2025 23:35:33 +0000 (16:35 -0700)]
igc: cleanup PTP module if probe fails
Make sure that the PTP module is cleaned up if the igc_probe() fails by
calling igc_ptp_stop() on exit.
Fixes: d89f88419f99 ("igc: Add skeletal frame for Intel(R) 2.5G Ethernet Controller support") Signed-off-by: Christopher S M Hall <christopher.s.hall@intel.com> Reviewed-by: Corinna Vinschen <vinschen@redhat.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Mor Bar-Gabay <morx.bar.gabay@intel.com> Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Christopher S M Hall [Tue, 1 Apr 2025 23:35:32 +0000 (16:35 -0700)]
igc: handle the IGC_PTP_ENABLED flag correctly
All functions in igc_ptp.c called from igc_main.c should check the
IGC_PTP_ENABLED flag. Adding check for this flag to stop and reset
functions.
Fixes: 5f2958052c58 ("igc: Add basic skeleton for PTP") Signed-off-by: Christopher S M Hall <christopher.s.hall@intel.com> Reviewed-by: Corinna Vinschen <vinschen@redhat.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Mor Bar-Gabay <morx.bar.gabay@intel.com> Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Christopher S M Hall [Tue, 1 Apr 2025 23:35:31 +0000 (16:35 -0700)]
igc: move ktime snapshot into PTM retry loop
Move ktime_get_snapshot() into the loop. If a retry does occur, a more
recent snapshot will result in a more accurate cross-timestamp.
Fixes: a90ec8483732 ("igc: Add support for PTP getcrosststamp()") Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Tested-by: Mor Bar-Gabay <morx.bar.gabay@intel.com> Tested-by: Avigail Dahan <avigailx.dahan@intel.com> Signed-off-by: Christopher S M Hall <christopher.s.hall@intel.com> Reviewed-by: Corinna Vinschen <vinschen@redhat.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Christopher S M Hall [Tue, 1 Apr 2025 23:35:30 +0000 (16:35 -0700)]
igc: increase wait time before retrying PTM
The i225/i226 hardware retries if it receives an inappropriate response
from the upstream device. If the device retries too quickly, the root
port does not respond.
The wait between attempts was reduced from 10us to 1us in commit 6b8aa753a9f9 ("igc: Decrease PTM short interval from 10 us to 1 us"), which
said:
With the 10us interval, we were seeing PTM transactions take around
12us. Hardware team suggested this interval could be lowered to 1us
which was confirmed with PCIe sniffer. With the 1us interval, PTM
dialogs took around 2us.
While a 1us short cycle time was thought to be theoretically sufficient, it
turns out in practice it is not quite long enough. It is unclear if the
problem is in the root port or an issue in i225/i226.
Increase the wait from 1us to 4us. Increasing to 2us appeared to work in
practice on the setups we have available. A value of 4us was chosen due to
the limited hardware available for testing, with a goal of ensuring we wait
long enough without overly penalizing the response time when unnecessary.
The issue can be reproduced with the following:
$ sudo phc2sys -R 1000 -O 0 -i tsn0 -m
Note: 1000 Hz (-R 1000) is unrealistically large, but provides a way to
quickly reproduce the issue.
PHC2SYS exits with:
"ioctl PTP_OFFSET_PRECISE: Connection timed out" when the PTM transaction
fails
Fixes: 6b8aa753a9f9 ("igc: Decrease PTM short interval from 10 us to 1 us") Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Tested-by: Mor Bar-Gabay <morx.bar.gabay@intel.com> Tested-by: Avigail Dahan <avigailx.dahan@intel.com> Signed-off-by: Christopher S M Hall <christopher.s.hall@intel.com> Reviewed-by: Corinna Vinschen <vinschen@redhat.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Christopher S M Hall [Tue, 1 Apr 2025 23:35:29 +0000 (16:35 -0700)]
igc: fix PTM cycle trigger logic
Writing to clear the PTM status 'valid' bit while the PTM cycle is
triggered results in unreliable PTM operation. To fix this, clear the
PTM 'trigger' and status after each PTM transaction.
The issue can be reproduced with the following:
$ sudo phc2sys -R 1000 -O 0 -i tsn0 -m
Note: 1000 Hz (-R 1000) is unrealistically large, but provides a way to
quickly reproduce the issue.
PHC2SYS exits with:
"ioctl PTP_OFFSET_PRECISE: Connection timed out" when the PTM transaction
fails
This patch also fixes a hang in igc_probe() when loading the igc
driver in the kdump kernel on systems supporting PTM.
The igc driver running in the base kernel enables PTM trigger in
igc_probe(). Therefore the driver is always in PTM trigger mode,
except in brief periods when manually triggering a PTM cycle.
When a crash occurs, the NIC is reset while PTM trigger is enabled.
Due to a hardware problem, the NIC is subsequently in a bad busmaster
state and doesn't handle register reads/writes. When running
igc_probe() in the kdump kernel, the first register access to a NIC
register hangs driver probing and ultimately breaks kdump.
With this patch, igc has PTM trigger disabled most of the time,
and the trigger is only enabled for very brief (10 - 100 us) periods
when manually triggering a PTM cycle. Chances that a crash occurs
during a PTM trigger are not 0, but extremely reduced.
Fixes: a90ec8483732 ("igc: Add support for PTP getcrosststamp()") Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Tested-by: Mor Bar-Gabay <morx.bar.gabay@intel.com> Tested-by: Avigail Dahan <avigailx.dahan@intel.com> Signed-off-by: Christopher S M Hall <christopher.s.hall@intel.com> Reviewed-by: Corinna Vinschen <vinschen@redhat.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Corinna Vinschen <vinschen@redhat.com> Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Since the original bug seems to have been around for years,
but a new issue was report with the fix, revert the fix for
now. We have a couple of weeks to figure it out for this
release, if needed.
Reported-by: Bert Karwatzki <spasswolf@web.de> Closes: https://lore.kernel.org/linux-wireless/20250410215527.3001-1-spasswolf@web.de Fixes: a104042e2bf6 ("wifi: mac80211: Update skb's control block key in ieee80211_tx_dequeue()") Signed-off-by: Johannes Berg <johannes.berg@intel.com>
wifi: iwlwifi: mld: Restart firmware on iwl_mld_no_wowlan_resume() error
Commit 44605365f935 ("iwlwifi: mld: fix building with CONFIG_PM_SLEEP
disabled") sought to fix build breakage, but inadvertently introduced
a new issue:
iwl_mld_mac80211_start() no longer calls iwl_mld_start_fw() after having
called iwl_mld_stop_fw() in the error path of iwl_mld_no_wowlan_resume().
Johannes Berg [Fri, 11 Apr 2025 08:40:54 +0000 (10:40 +0200)]
wifi: iwlwifi: pcie: set state to no-FW before reset handshake
The reset handshake attempts to kill the firmware, and it'll go
into a pretty much dead state once we do that. However, if it
times out, then we'll attempt to dump the firmware to be able
to see why it didn't respond. During this dump, we cannot treat
it as if it was still running, since we just tried to kill it,
otherwise dumping will attempt to send a DBGC stop command. As
this command will time out, we'll go into a reset loop.
For now, fix this by setting the trans->state to say firmware
isn't running before doing the reset handshake. In the longer
term, we should clean up the way this state is handled.
It's not entirely clear but it seems likely that this issue was
introduced by my rework of the error handling, prior to that it
would've been synchronous at that point and (I think) not have
attempted to reset since it was already doing down.
Xin Long [Tue, 8 Apr 2025 17:46:17 +0000 (13:46 -0400)]
ipv6: add exception routes to GC list in rt6_insert_exception
Commit 5eb902b8e719 ("net/ipv6: Remove expired routes with a separated list
of routes.") introduced a separated list for managing route expiration via
the GC timer.
However, it missed adding exception routes (created by ip6_rt_update_pmtu()
and rt6_do_redirect()) to this GC list. As a result, these exceptions were
never considered for expiration and removal, leading to stale entries
persisting in the routing table.
This patch fixes the issue by calling fib6_add_gc_list() in
rt6_insert_exception(), ensuring that exception routes are properly tracked
and garbage collected when expired.
David Wei [Wed, 9 Apr 2025 16:31:53 +0000 (09:31 -0700)]
io_uring/zcrx: enable tcp-data-split in selftest
For bnxt when the agg ring is used then tcp-data-split is automatically
reported to be enabled, but __net_mp_open_rxq() requires tcp-data-split
to be explicitly enabled by the user.
Enable tcp-data-split explicitly in io_uring zc rx selftest.
Fixes: 288c06973daa ("Bluetooth: Enforce key size of 16 bytes on FIPS level") Signed-off-by: Frédéric Danis <frederic.danis@collabora.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Bluetooth: btnxpuart: Add an error message if FW dump trigger fails
This prints an error message if the FW Dump trigger command fails. This
scenario is mainly observed in legacy chipsets 8987 and 8997 and also
IW416, where this feature is unavailable due to memory constraints.
Fixes: 998e447f443f ("Bluetooth: btnxpuart: Add support for HCI coredump feature") Signed-off-by: Neeraj Sanjay Kale <neeraj.sanjaykale@nxp.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Bluetooth: btnxpuart: Revert baudrate change in nxp_shutdown
This reverts the change baudrate logic in nxp_shutdown.
Earlier, when the driver was removed, it restored the controller
baudrate to fw_init_baudrate, so that on re-loading the driver, things
work fine.
However, if the driver was removed while hci0 interface is down, the
change baudrate vendor command could not be sent by the driver. When the
driver was re-loaded, host and controller baudrate would be mismatched
and hci initialization would fail. The only way to recover would be to
reboot the system.
This issue was fixed by moving the restore baudrate logic from
nxp_serdev_remove() to nxp_shutdown().
This fix however caused another issue with the command "hciconfig hci0
reset", which makes hci0 DOWN and UP immediately.
Running "bluetoothctl power off" and "bluetoothctl power on" in a tight
loop works fine.
To maintain support for "hciconfig reset" command, the above mentioned fix
is reverted.
Fixes: 6fca6781d19d ("Bluetooth: btnxpuart: Move vendor specific initialization to .post_init") Signed-off-by: Neeraj Sanjay Kale <neeraj.sanjaykale@nxp.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Pauli Virtanen [Thu, 3 Apr 2025 16:57:59 +0000 (19:57 +0300)]
Bluetooth: increment TX timestamping tskey always for stream sockets
Documentation/networking/timestamping.rst implies TX timestamping OPT_ID
tskey increments for each sendmsg. In practice: TCP socket increments
it for all sendmsg, timestamping on or off, but UDP only when
timestamping is on. The user-visible counter resets when OPT_ID is
turned on, so difference can be seen only if timestamping is enabled for
some packets only (eg. via SO_TIMESTAMPING CMSG).
Fix BT sockets to work in the same way: stream sockets increment tskey
for all sendmsg (seqpacket already increment for timestamped only).
Fixes: 134f4b39df7b ("Bluetooth: add support for skb TX SND/COMPLETION timestamping") Signed-off-by: Pauli Virtanen <pav@iki.fi> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
The btrtl_initialize() function checks that rtl_load_file() either
had an error or it loaded a zero length file. However, if it loaded
a zero length file then the error code is not set correctly. It
results in an error pointer vs NULL bug, followed by a NULL pointer
dereference. This was detected by Smatch:
drivers/bluetooth/btrtl.c:592 btrtl_initialize() warn: passing zero to 'ERR_PTR'
Fixes: 26503ad25de8 ("Bluetooth: btrtl: split the device initialization into smaller parts") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Luiz Augusto von Dentz [Tue, 1 Apr 2025 17:02:08 +0000 (13:02 -0400)]
Bluetooth: hci_event: Fix sending MGMT_EV_DEVICE_FOUND for invalid address
This fixes sending MGMT_EV_DEVICE_FOUND for invalid address
(00:00:00:00:00:00) which is a regression introduced by a2ec905d1e16 ("Bluetooth: fix kernel oops in store_pending_adv_report")
since in the attempt to skip storing data for extended advertisement it
actually made the code to skip the entire if statement supposed to send
MGMT_EV_DEVICE_FOUND without attempting to use the last_addr_adv which
is garanteed to be invalid for extended advertisement since we never
store anything on it.
* tag 'net-6.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (38 commits)
ethtool: cmis_cdb: Fix incorrect read / write length extension
selftests: netfilter: add test case for recent mismatch bug
nft_set_pipapo: fix incorrect avx2 match of 5th field octet
net: ppp: Add bound checking for skb data on ppp_sync_txmung
net: Fix null-ptr-deref by sock_lock_init_class_and_name() and rmmod.
ipv6: Align behavior across nexthops during path selection
net: phy: allow MDIO bus PM ops to start/stop state machine for phylink-controlled PHY
net: phy: move phy_link_change() prior to mdio_bus_phy_may_suspend()
selftests/tc-testing: sfq: check that a derived limit of 1 is rejected
net_sched: sch_sfq: move the limit validation
net_sched: sch_sfq: use a temporary work area for validating configuration
net: libwx: handle page_pool_dev_alloc_pages error
selftests: mptcp: validate MPJoin HMacFailure counters
mptcp: only inc MPJoinAckHMacFailure for HMAC failures
rtnetlink: Fix bad unlock balance in do_setlink().
net: ethtool: Don't call .cleanup_data when prepare_data fails
tc: Ensure we have enough buffer space when sending filter netlink notifications
net: libwx: Fix the wrong Rx descriptor field
octeontx2-pf: qos: fix VF root node parent queue index
selftests: tls: check that disconnect does nothing
...
Merge tag 'for-linus-6.15a-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
- A simple fix adding the module description of the Xenbus frontend
module
- A fix correcting the xen-acpi-processor Kconfig dependency for PVH
Dom0 support
- A fix for the Xen balloon driver when running as Xen Dom0 in PVH mode
- A fix for PVH Dom0 in order to avoid problems with CPU idle and
frequency drivers conflicting with Xen
* tag 'for-linus-6.15a-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
x86/xen: disable CPU idle and frequency drivers for PVH dom0
x86/xen: fix balloon target initialization for PVH dom0
xen: Change xen-acpi-processor dom0 dependency
xenbus: add module description
Merge tag 'block-6.15-20250410' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
- Add a missing ublk selftest script, from test additions added last
week
- Two fixes for ublk error recovery and reissue
- Cleanup of ublk argument passing
* tag 'block-6.15-20250410' of git://git.kernel.dk/linux:
ublk: pass ublksrv_ctrl_cmd * instead of io_uring_cmd *
ublk: don't fail request for recovery & reissue in case of ubq->canceling
ublk: fix handling recovery & reissue in ublk_abort_queue()
selftests: ublk: fix test_stripe_04
Merge tag 'io_uring-6.15-20250410' of git://git.kernel.dk/linux
Pull io_uring fixes from Jens Axboe:
- Reject zero sized legacy provided buffers upfront. No ill side
effects from this one, only really done to shut up a silly syzbot
test case.
- Fix for a regression in tag posting for registered files or buffers,
where the tag would be posted even when the registration failed.
- two minor zcrx cleanups for code added this merge window.
* tag 'io_uring-6.15-20250410' of git://git.kernel.dk/linux:
io_uring/kbuf: reject zero sized provided buffers
io_uring/zcrx: separate niov number from pages
io_uring/zcrx: put refill data into separate cache line
io_uring: don't post tag CQEs on file/buffer registration failure
Merge tag 'gpio-fixes-for-v6.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux
Pull gpio fixes from Bartosz Golaszewski:
- fix resource handling in gpio-tegra186
- fix wakeup source leaks in gpio-mpc8xxx and gpio-zynq
- fix minor issues with some GPIO OF quirks
- deprecate GPIOD_FLAGS_BIT_NONEXCLUSIVE and devm_gpiod_unhinge()
symbols and add a TODO task to track replacing them with a better
solution
* tag 'gpio-fixes-for-v6.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
gpiolib: of: Move Atmel HSMCI quirk up out of the regulator comment
gpiolib: of: Fix the choice for Ingenic NAND quirk
gpio: zynq: Fix wakeup source leaks on device unbind
gpio: mpc8xxx: Fix wakeup source leaks on device unbind
gpio: TODO: track the removal of regulator-related workarounds
MAINTAINERS: add more keywords for the GPIO subsystem entry
gpio: deprecate devm_gpiod_unhinge()
gpio: deprecate the GPIOD_FLAGS_BIT_NONEXCLUSIVE flag
gpio: tegra186: fix resource handling in ACPI probe path
Merge tag 'mtd/fixes-for-6.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux
Pull mtd fixes from Miquel Raynal:
"Two important fixes: the build of the SPI NAND layer with old GCC
versions as well as the fix of the Qpic Makefile which was wrong in
the first place.
There are also two smaller fixes about a missing error and status
check"
* tag 'mtd/fixes-for-6.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux:
mtd: spinand: Fix build with gcc < 7.5
mtd: rawnand: Add status chack in r852_ready()
mtd: inftlcore: Add error check for inftl_read_oob()
mtd: nand: Drop explicit test for built-in CONFIG_SPI_QPIC_SNAND
The 'read_write_len_ext' field in 'struct ethtool_cmis_cdb_cmd_args'
stores the maximum number of bytes that can be read from or written to
the Local Payload (LPL) page in a single multi-byte access.
Cited commit started overwriting this field with the maximum number of
bytes that can be read from or written to the Extended Payload (LPL)
pages in a single multi-byte access. Transceiver modules that support
auto paging can advertise a number larger than 255 which is problematic
as 'read_write_len_ext' is a 'u8', resulting in the number getting
truncated and firmware flashing failing [1].
Fix by ignoring the maximum EPL access size as the kernel does not
currently support auto paging (even if the transceiver module does) and
will not try to read / write more than 128 bytes at once.
[1]
Transceiver module firmware flashing started for device enp177s0np0
Transceiver module firmware flashing in progress for device enp177s0np0
Progress: 0%
Transceiver module firmware flashing encountered an error for device enp177s0np0
Status message: Write FW block EPL command failed, LPL length is longer
than CDB read write length extension allows.
Fixes: 9a3b0d078bd8 ("net: ethtool: Add support for writing firmware blocks using EPL payload") Reported-by: Damodharam Ammepalli <damodharam.ammepalli@broadcom.com> Closes: https://lore.kernel.org/netdev/20250402183123.321036-3-michael.chan@broadcom.com/ Tested-by: Damodharam Ammepalli <damodharam.ammepalli@broadcom.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Damodharam Ammepalli <damodharam.ammepalli@broadcom.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Link: https://patch.msgid.link/20250409112440.365672-1-idosch@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Paolo Abeni [Thu, 10 Apr 2025 11:13:35 +0000 (13:13 +0200)]
Merge tag 'nf-25-04-10' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following batch contains a Netfilter fix and improved test coverage:
1) Fix AVX2 matching in nft_pipapo, from Florian Westphal.
2) Extend existing test to improve coverage for the aforementioned bug,
also from Florian.
netfilter pull request 25-04-10
* tag 'nf-25-04-10' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
selftests: netfilter: add test case for recent mismatch bug
nft_set_pipapo: fix incorrect avx2 match of 5th field octet
====================
selftests: netfilter: add test case for recent mismatch bug
Without 'nft_set_pipapo: fix incorrect avx2 match of 5th field octet"
this fails:
TEST: reported issues
Add two elements, flush, re-add 1s [ OK ]
net,mac with reload 0s [ OK ]
net,port,proto 3s [ OK ]
avx2 false match 0s [FAIL]
False match for fe80:dead:01fe:0a02:0b03:6007:8009:a001
Other tests do not detect the kernel bug as they only alter parts in
the /64 netmask.
nft_set_pipapo: fix incorrect avx2 match of 5th field octet
Given a set element like:
icmpv6 . dead:beef:00ff::1
The value of 'ff' is irrelevant, any address will be matched
as long as the other octets are the same.
This is because of too-early register clobbering:
ymm7 is reloaded with new packet data (pkt[9]) but it still holds data
of an earlier load that wasn't processed yet.
The existing tests in nft_concat_range.sh selftests do exercise this code
path, but do not trigger incorrect matching due to the network prefix
limitation.
net: ppp: Add bound checking for skb data on ppp_sync_txmung
Ensure we have enough data in linear buffer from skb before accessing
initial bytes. This prevents potential out-of-bounds accesses
when processing short packets.
When ppp_sync_txmung receives an incoming package with an empty
payload:
(remote) gef➤ p *(struct pppoe_hdr *) (skb->head + skb->network_header)
$18 = {
type = 0x1,
ver = 0x1,
code = 0x0,
sid = 0x2,
length = 0x0,
tag = 0xffff8880371cdb96
}
from the skb struct (trimmed)
tail = 0x16,
end = 0x140,
head = 0xffff88803346f400 "4",
data = 0xffff88803346f416 ":\377",
truesize = 0x380,
len = 0x0,
data_len = 0x0,
mac_len = 0xe,
hdr_len = 0x0,
net: Fix null-ptr-deref by sock_lock_init_class_and_name() and rmmod.
When I ran the repro [0] and waited a few seconds, I observed two
LOCKDEP splats: a warning immediately followed by a null-ptr-deref. [1]
Reproduction Steps:
1) Mount CIFS
2) Add an iptables rule to drop incoming FIN packets for CIFS
3) Unmount CIFS
4) Unload the CIFS module
5) Remove the iptables rule
At step 3), the CIFS module calls sock_release() for the underlying
TCP socket, and it returns quickly. However, the socket remains in
FIN_WAIT_1 because incoming FIN packets are dropped.
At this point, the module's refcnt is 0 while the socket is still
alive, so the following rmmod command succeeds.
# ss -tan
State Recv-Q Send-Q Local Address:Port Peer Address:Port
FIN-WAIT-1 0 477 10.0.2.15:51062 10.0.0.137:445
This highlights a discrepancy between the lifetime of the CIFS module
and the underlying TCP socket. Even after CIFS calls sock_release()
and it returns, the TCP socket does not die immediately in order to
close the connection gracefully.
While this is generally fine, it causes an issue with LOCKDEP because
CIFS assigns a different lock class to the TCP socket's sk->sk_lock
using sock_lock_init_class_and_name().
Once an incoming packet is processed for the socket or a timer fires,
sk->sk_lock is acquired.
Then, LOCKDEP checks the lock context in check_wait_context(), where
hlock_class() is called to retrieve the lock class. However, since
the module has already been unloaded, hlock_class() logs a warning
and returns NULL, triggering the null-ptr-deref.
If LOCKDEP is enabled, we must ensure that a module calling
sock_lock_init_class_and_name() (CIFS, NFS, etc) cannot be unloaded
while such a socket is still alive to prevent this issue.
Let's hold the module reference in sock_lock_init_class_and_name()
and release it when the socket is freed in sk_prot_free().
Note that sock_lock_init() clears sk->sk_owner for svc_create_socket()
that calls sock_lock_init_class_and_name() for a listening socket,
which clones a socket by sk_clone_lock() without GFP_ZERO.
ipv6: Align behavior across nexthops during path selection
A nexthop is only chosen when the calculated multipath hash falls in the
nexthop's hash region (i.e., the hash is smaller than the nexthop's hash
threshold) and when the nexthop is assigned a non-negative score by
rt6_score_route().
Commit 4d0ab3a6885e ("ipv6: Start path selection from the first
nexthop") introduced an unintentional difference between the first
nexthop and the rest when the score is negative.
When the first nexthop matches, but has a negative score, the code will
currently evaluate subsequent nexthops until one is found with a
non-negative score. On the other hand, when a different nexthop matches,
but has a negative score, the code will fallback to the nexthop with
which the selection started ('match').
Align the behavior across all nexthops and fallback to 'match' when the
first nexthop matches, but has a negative score.
Fixes: 3d709f69a3e7 ("ipv6: Use hash-threshold instead of modulo-N") Fixes: 4d0ab3a6885e ("ipv6: Start path selection from the first nexthop") Reported-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com> Closes: https://lore.kernel.org/netdev/67efef607bc41_1ddca82948c@willemb.c.googlers.com.notmuch/ Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250408084316.243559-1-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean [Mon, 7 Apr 2025 09:40:42 +0000 (12:40 +0300)]
net: phy: allow MDIO bus PM ops to start/stop state machine for phylink-controlled PHY
DSA has 2 kinds of drivers:
1. Those who call dsa_switch_suspend() and dsa_switch_resume() from
their device PM ops: qca8k-8xxx, bcm_sf2, microchip ksz
2. Those who don't: all others. The above methods should be optional.
For type 1, dsa_switch_suspend() calls dsa_user_suspend() -> phylink_stop(),
and dsa_switch_resume() calls dsa_user_resume() -> phylink_start().
These seem good candidates for setting mac_managed_pm = true because
that is essentially its definition [1], but that does not seem to be the
biggest problem for now, and is not what this change focuses on.
Talking strictly about the 2nd category of DSA drivers here (which
do not have MAC managed PM, meaning that for their attached PHYs,
mdio_bus_phy_suspend() and mdio_bus_phy_resume() should run in full),
I have noticed that the following warning from mdio_bus_phy_resume() is
triggered:
It's running as a result of a previous dsa_user_open() -> ... ->
phylink_start() -> phy_start() having been initiated by the user.
The previous mdio_bus_phy_suspend() was supposed to have called
phy_stop_machine(), but it didn't. So this is why the PHY is in state
PHY_NOLINK by the time mdio_bus_phy_resume() runs.
mdio_bus_phy_suspend() did not call phy_stop_machine() because for
phylink, the phydev->adjust_link function pointer is NULL. This seems a
technicality introduced by commit fddd91016d16 ("phylib: fix PAL state
machine restart on resume"). That commit was written before phylink
existed, and was intended to avoid crashing with consumer drivers which
don't use the PHY state machine - phylink always does, when using a PHY.
But phylink itself has historically not been developed with
suspend/resume in mind, and apparently not tested too much in that
scenario, allowing this bug to exist unnoticed for so long. Plus, prior
to the WARN_ON(), it would have likely been invisible.
This issue is not in fact restricted to type 2 DSA drivers (according to
the above ad-hoc classification), but can be extrapolated to any MAC
driver with phylink and MDIO-bus-managed PHY PM ops. DSA is just where
the issue was reported. Assuming mac_managed_pm is set correctly, a
quick search indicates the following other drivers might be affected:
Make the existing conditions dependent on the PHY device having a
phydev->phy_link_change() implementation equal to the default
phy_link_change() provided by phylib. Otherwise, we implicitly know that
the phydev has the phylink-provided phylink_phy_change() callback, and
when phylink is used, the PHY state machine always needs to be stopped/
started on the suspend/resume path. The code is structured as such that
if phydev->phy_link_change() is absent, it is a matter of time until the
kernel will crash - no need to further complicate the test.
Thus, for the situation where the PM is not managed by the MAC, we will
make the MDIO bus PM ops treat identically the phylink-controlled PHYs
with the phylib-controlled PHYs where an adjust_link() callback is
supplied. In both cases, the MDIO bus PM ops should stop and restart the
PHY state machine.
Vladimir Oltean [Mon, 7 Apr 2025 09:38:59 +0000 (12:38 +0300)]
net: phy: move phy_link_change() prior to mdio_bus_phy_may_suspend()
In an upcoming change, mdio_bus_phy_may_suspend() will need to
distinguish a phylib-based PHY client from a phylink PHY client.
For that, it will need to compare the phydev->phy_link_change() function
pointer with the eponymous phy_link_change() provided by phylib.
To avoid forward function declarations, the default PHY link state
change method should be moved upwards. There is no functional change
associated with this patch, it is only to reduce the noise from a real
bug fix.
Merge tag 'linux_kselftest-fixes-6.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull kselftest fixes from Shuah Khan:
- Fixes tpm2, futex, and mincore tests
- Create a dedicated .gitignore for tpm2 tests
* tag 'linux_kselftest-fixes-6.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
selftests/mincore: Allow read-ahead pages to reach the end of the file
selftests/futex: futex_waitv wouldblock test should fail
selftests: tpm2: test_smoke: use POSIX-conformant expression operator
selftests: tpm2: create a dedicated .gitignore
ublk: pass ublksrv_ctrl_cmd * instead of io_uring_cmd *
The ublk_ctrl_*() handlers all take struct io_uring_cmd *cmd but only
use it to get struct ublksrv_ctrl_cmd *header from the io_uring SQE.
Since the caller ublk_ctrl_uring_cmd() has already computed header, pass
it instead of cmd.
Ming Lei [Wed, 9 Apr 2025 01:14:42 +0000 (09:14 +0800)]
ublk: don't fail request for recovery & reissue in case of ubq->canceling
ubq->canceling is set with request queue quiesced when io_uring context is
exiting. USER_RECOVERY or !RECOVERY_FAIL_IO requires request to be re-queued
and re-dispatch after device is recovered.
However commit d796cea7b9f3 ("ublk: implement ->queue_rqs()") still may fail
any request in case of ubq->canceling, this way breaks USER_RECOVERY or
!RECOVERY_FAIL_IO.
Fix it by calling __ublk_abort_rq() in case of ubq->canceling.
Ming Lei [Wed, 9 Apr 2025 01:14:41 +0000 (09:14 +0800)]
ublk: fix handling recovery & reissue in ublk_abort_queue()
Commit 8284066946e6 ("ublk: grab request reference when the request is handled
by userspace") doesn't grab request reference in case of recovery reissue.
Then the request can be requeued & re-dispatch & failed when canceling
uring command.
If it is one zc request, the request can be freed before io_uring
returns the zc buffer back, then cause kernel panic:
Fixes it by always grabbing request reference for aborting the request.
Reported-by: Caleb Sander Mateos <csander@purestorage.com> Closes: https://lore.kernel.org/linux-block/CADUfDZodKfOGUeWrnAxcZiLT+puaZX8jDHoj_sfHZCOZwhzz6A@mail.gmail.com/ Fixes: 8284066946e6 ("ublk: grab request reference when the request is handled by userspace") Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250409011444.2142010-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
As a limit of 1 is invalid, this patch series moves the limit
validation to after all configuration changes have been done. To do
so, the configuration is done in a temporary work area then applied to
the internal state.
The patch series also adds new test cases.
v3:
- remove a couple of unnecessary comments
- rearrange local variables to use reverse Christmas tree style
declaration order
v2: https://lore.kernel.org/all/20250402162750.1671155-1-tavip@google.com/
- remove tmp struct and directly use local variables
selftests/tc-testing: sfq: check that a derived limit of 1 is rejected
Because the limit is updated indirectly when other parameters are
updated, there are cases where even though the user requests a limit
of 2 it can actually be set to 1.
Add the following test cases to check that the kernel rejects them:
- limit 2 depth 1 flows 1
- limit 2 depth 1 divisor 1
Signed-off-by: Octavian Purdila <tavip@google.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
It is not sufficient to directly validate the limit on the data that
the user passes as it can be updated based on how the other parameters
are changed.
Move the check at the end of the configuration update process to also
catch scenarios where the limit is indirectly updated, for example
with the following configurations:
net_sched: sch_sfq: use a temporary work area for validating configuration
Many configuration parameters have influence on others (e.g. divisor
-> flows -> limit, depth -> limit) and so it is difficult to correctly
do all of the validation before applying the configuration. And if a
validation error is detected late it is difficult to roll back a
partially applied configuration.
To avoid these issues use a temporary work area to update and validate
the configuration and only then apply the configuration to the
internal state.
Signed-off-by: Octavian Purdila <tavip@google.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Merge tag 'linux_kselftest-kunit-6.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull kunit fixes from Shuah Khan:
- Fix the tool to report test count in case of a late test plan when
tests are specified before the test plan
- Fix spelling error
* tag 'linux_kselftest-kunit-6.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
kunit: Spelling s/slowm/slow/
kunit: tool: fix count of tests if late test plan
page_pool_dev_alloc_pages could return NULL. There was a WARN_ON(!page)
but it would still proceed to use the NULL pointer and then crash.
This is similar to commit 001ba0902046
("net: fec: handle page_pool_dev_alloc_pages error").
This is found by our static analysis tool KNighter.
Signed-off-by: Chenyuan Yang <chenyuan0y@gmail.com> Fixes: 3c47e8ae113a ("net: libwx: Support to receive packets in NAPI") Reviewed-by: Joe Damato <jdamato@fastly.com> Link: https://patch.msgid.link/20250407184952.2111299-1-chenyuan0y@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The parent commit fixes an issue around these counters where one of them
-- MPJoinAckHMacFailure -- was wrongly incremented in some cases.
This makes sure the counter is always 0. It should be incremented only
in case of corruption, or a wrong implementation, which should not be
the case in these selftests.
mptcp: only inc MPJoinAckHMacFailure for HMAC failures
Recently, during a debugging session using local MPTCP connections, I
noticed MPJoinAckHMacFailure was not zero on the server side. The
counter was in fact incremented when the PM rejected new subflows,
because the 'subflow' limit was reached.
The fix is easy, simply dissociating the two cases: only the HMAC
validation check should increase MPTCP_MIB_JOINACKMAC counter.
Fixes: 4cf8b7e48a09 ("subflow: introduce and use mptcp_can_accept_new_subflow()") Cc: stable@vger.kernel.org Reviewed-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250407-net-mptcp-hmac-failure-mib-v1-1-3c9ecd0a3a50@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Qiuxu Zhuo [Tue, 11 Mar 2025 08:09:40 +0000 (16:09 +0800)]
selftests/mincore: Allow read-ahead pages to reach the end of the file
When running the mincore_selftest on a system with an XFS file system, it
failed the "check_file_mmap" test case due to the read-ahead pages reaching
the end of the file. The failure log is as below:
RUN global.check_file_mmap ...
mincore_selftest.c:264:check_file_mmap:Expected i (1024) < vec_size (1024)
mincore_selftest.c:265:check_file_mmap:Read-ahead pages reached the end of the file
check_file_mmap: Test failed
FAIL global.check_file_mmap
This is because the read-ahead window size of the XFS file system on this
machine is 4 MB, which is larger than the size from the #PF address to the
end of the file. As a result, all the pages for this file are populated.
This issue can be fixed by extending the current FILE_SIZE 4MB to a larger
number, but it will still fail if the read-ahead window size of the file
system is larger enough. Additionally, in the real world, read-ahead pages
reaching the end of the file can happen and is an expected behavior.
Therefore, allowing read-ahead pages to reach the end of the file is a
better choice for the "check_file_mmap" test case.
Ahmed Salem [Tue, 11 Feb 2025 23:16:17 +0000 (01:16 +0200)]
selftests: tpm2: test_smoke: use POSIX-conformant expression operator
Use POSIX-conformant expression operator symbol '='.
The use of the non POSIX-conformant symbol '==' would work
in bash, but not in sh where the unexpected operator error
would result in test_smoke.sh being skipped.
Instead of changing the shebang to use bash, which may not be
available on all systems, use the POSIX-conformant expression
symbol '=' to test for equality.
Without this patch:
===================
# make -j8 TARGETS=tpm2 kselftest
# selftests: tpm2: test_smoke.sh
# ./test_smoke.sh: 9: [: 2: unexpected operator
ok 1 selftests: tpm2: test_smoke.sh # SKIP
With this patch:
================
# make -j8 TARGETS=tpm2 kselftest
# selftests: tpm2: test_smoke.sh
# Ran 9 tests in 9.236s
ok 1 selftests: tpm2: test_smoke.sh
Khaled Elnaggar [Sun, 26 Jan 2025 19:51:33 +0000 (21:51 +0200)]
selftests: tpm2: create a dedicated .gitignore
The tpm2 selftests produce two logs: SpaceTest.log and
AsyncTest.log. Only SpaceTest.log was listed in selftests/.gitignore,
while AsyncTest.log remained untracked.
This change creates a dedicated .gitignore in the tpm2/ directory to
manage these entries, keeping tpm2-specific patterns isolated from
parent .gitignore.
Fixed white-space errors during commit
Shuah Khan <skhan@linuxfoundation.org>
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"ARM:
- Rework heuristics for resolving the fault IPA (HPFAR_EL2 v. re-walk
stage-1 page tables) to align with the architecture. This avoids
possibly taking an SEA at EL2 on the page table walk or using an
architecturally UNKNOWN fault IPA
- Use acquire/release semantics in the KVM FF-A proxy to avoid
reading a stale value for the FF-A version
- Fix KVM guest driver to match PV CPUID hypercall ABI
- Use Inner Shareable Normal Write-Back mappings at stage-1 in KVM
selftests, which is the only memory type for which atomic
instructions are architecturally guaranteed to work
s390:
- Don't use %pK for debug printing and tracepoints
x86:
- Use a separate subclass when acquiring KVM's per-CPU posted
interrupts wakeup lock in the scheduled out path, i.e. when adding
a vCPU on the list of vCPUs to wake, to workaround a false positive
deadlock. The schedule out code runs with a scheduler lock that the
wakeup handler takes in the opposite order; but it does so with
IRQs disabled and cannot run concurrently with a wakeup
- Allow building irqbypass.ko as as module when kvm.ko is a module
- Wrap relatively expensive sanity check with KVM_PROVE_MMU
- Acquire SRCU in KVM_GET_MP_STATE to protect guest memory accesses
selftests:
- Add more scenarios to the MONITOR/MWAIT test
- Add option to rseq test to override /dev/cpu_dma_latency
- Bring list of exit reasons up to date
- Cleanup Makefile to list once tests that are valid on all
architectures
Other:
- Documentation fixes"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (26 commits)
KVM: arm64: Use acquire/release to communicate FF-A version negotiation
KVM: arm64: selftests: Explicitly set the page attrs to Inner-Shareable
KVM: arm64: selftests: Introduce and use hardware-definition macros
KVM: VMX: Use separate subclasses for PI wakeup lock to squash false positive
KVM: VMX: Assert that IRQs are disabled when putting vCPU on PI wakeup list
KVM: x86: Explicitly zero-initialize on-stack CPUID unions
KVM: Allow building irqbypass.ko as as module when kvm.ko is a module
KVM: x86/mmu: Wrap sanity check on number of TDP MMU pages with KVM_PROVE_MMU
KVM: selftests: Add option to rseq test to override /dev/cpu_dma_latency
KVM: x86: Acquire SRCU in KVM_GET_MP_STATE to protect guest memory accesses
Documentation: kvm: remove KVM_CAP_MIPS_TE
Documentation: kvm: organize capabilities in the right section
Documentation: kvm: fix some definition lists
Documentation: kvm: drop "Capability" heading from capabilities
Documentation: kvm: give correct name for KVM_CAP_SPAPR_MULTITCE
Documentation: KVM: KVM_GET_SUPPORTED_CPUID now exposes TSC_DEADLINE
selftests: kvm: list once tests that are valid on all architectures
selftests: kvm: bring list of exit reasons up to date
selftests: kvm: revamp MONITOR/MWAIT tests
KVM: arm64: Don't translate FAR if invalid/unsafe
...
Merge tag 'probes-fixes-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull probes fixes from Masami Hiramatsu:
- fprobe: remove fprobe_hlist_node when module unloading
When a fprobe target module is removed, the fprobe_hlist_node should
be removed from the fprobe's hash table to prevent reusing
accidentally if another module is loaded at the same address.
- fprobe: lock module while registering fprobe
The module containing the function to be probeed is locked using a
reference counter until the fprobe registration is complete, which
prevents use after free.
- fprobe-events: fix possible UAF on modules
Basically as same as above, but in the fprobe-events layer we also
need to get module reference counter when we find the tracepoint in
the module.
* tag 'probes-fixes-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: fprobe: Cleanup fprobe hash when module unloading
tracing: fprobe events: Fix possible UAF on modules
tracing: fprobe: Fix to lock module while registering fprobe
rtnetlink: Fix bad unlock balance in do_setlink().
When validate_linkmsg() fails in do_setlink(), we jump to the errout
label and calls netdev_unlock_ops() even though we have not called
netdev_lock_ops() as reported by syzbot. [0]
syz-executor814/5834 is trying to release lock (&dev_instance_lock_key) at:
[<ffffffff89f41f56>] netdev_unlock include/linux/netdevice.h:2756 [inline]
[<ffffffff89f41f56>] netdev_unlock_ops include/net/netdev_lock.h:48 [inline]
[<ffffffff89f41f56>] do_setlink+0xc26/0x43a0 net/core/rtnetlink.c:3406
but there are no more locks to release!
other info that might help us debug this:
1 lock held by syz-executor814/5834:
#0: ffffffff900fc408 (rtnl_mutex){+.+.}-{4:4}, at: rtnl_lock net/core/rtnetlink.c:80 [inline]
#0: ffffffff900fc408 (rtnl_mutex){+.+.}-{4:4}, at: rtnl_nets_lock net/core/rtnetlink.c:341 [inline]
#0: ffffffff900fc408 (rtnl_mutex){+.+.}-{4:4}, at: rtnl_newlink+0xd68/0x1fe0 net/core/rtnetlink.c:4064
Merge tag 'cgroup-for-6.15-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
- A number of cpuset remote partition related fixes and cleanups along
with selftest updates.
- A change from this merge window made cgroup_rstat_updated_list()
called outside cgroup_rstat_lock leading to list corruptions. Fix it
by relocating the call inside the lock.
* tag 'cgroup-for-6.15-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup/cpuset: Fix race between newly created partition and dying one
cgroup: rstat: call cgroup_rstat_updated_list with cgroup_rstat_lock
selftest/cgroup: Add a remote partition transition test to test_cpuset_prs.sh
selftest/cgroup: Clean up and restructure test_cpuset_prs.sh
selftest/cgroup: Update test_cpuset_prs.sh to use | as effective CPUs and state separator
cgroup/cpuset: Remove unneeded goto in sched_partition_write() and rename it
cgroup/cpuset: Code cleanup and comment update
cgroup/cpuset: Don't allow creation of local partition over a remote one
cgroup/cpuset: Remove remote_partition_check() & make update_cpumasks_hier() handle remote partition
cgroup/cpuset: Fix error handling in remote_partition_disable()
cgroup/cpuset: Fix incorrect isolated_cpus update in update_parent_effective_cpumask()
Merge tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux
Pull CRC cleanups from Eric Biggers:
"Finish cleaning up the CRC kconfig options by removing the remaining
unnecessary prompts and an unnecessary 'default y', removing
CONFIG_LIBCRC32C, and documenting all the CRC library options"
* tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux:
lib/crc: remove CONFIG_LIBCRC32C
lib/crc: document all the CRC library kconfig options
lib/crc: remove unnecessary prompt for CONFIG_CRC_ITU_T
lib/crc: remove unnecessary prompt for CONFIG_CRC_T10DIF
lib/crc: remove unnecessary prompt for CONFIG_CRC16
lib/crc: remove unnecessary prompt for CONFIG_CRC_CCITT
lib/crc: remove unnecessary prompt for CONFIG_CRC32 and drop 'default y'
A recent optimization change in LLVM [1] aims to transform certain loop
idioms into calls to strlen() or wcslen(). This change transforms the
first while loop in UniStrcat() into a call to wcslen(), breaking the
build when UniStrcat() gets inlined into alloc_path_with_tree_prefix():
ld.lld: error: undefined symbol: wcslen
>>> referenced by nls_ucs2_utils.h:54 (fs/smb/client/../../nls/nls_ucs2_utils.h:54)
>>> vmlinux.o:(alloc_path_with_tree_prefix)
>>> referenced by nls_ucs2_utils.h:54 (fs/smb/client/../../nls/nls_ucs2_utils.h:54)
>>> vmlinux.o:(alloc_path_with_tree_prefix)
Disable this optimization with '-fno-builtin-wcslen', which prevents the
compiler from assuming that wcslen() is available in the kernel's C
library.
[ More to the point - it's not that we couldn't implement wcslen(), it's
that this isn't an optimization at all in the context of the kernel.
Replacing a simple inlined loop with a function call to the same loop
is just stupid and pointless if you don't have long strings and fancy
libraries with vectorization support etc.
For the regular 'strlen()' cases, we want the compiler to do this in
order to handle the trivial case of constant strings. And we do have
optimized versions of 'strlen()' on some architectures. But for
wcslen? Just no. - Linus ]
Maxime Chevallier [Mon, 7 Apr 2025 13:05:10 +0000 (15:05 +0200)]
net: ethtool: Don't call .cleanup_data when prepare_data fails
There's a consistent pattern where the .cleanup_data() callback is
called when .prepare_data() fails, when it should really be called to
clean after a successful .prepare_data() as per the documentation.
Rewrite the error-handling paths to make sure we don't cleanup
un-prepared data.
Fixes: c781ff12a2f3 ("ethtool: Allow network drivers to dump arbitrary EEPROM data") Reviewed-by: Kory Maincent <kory.maincent@bootlin.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20250407130511.75621-1-maxime.chevallier@bootlin.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
tc: Ensure we have enough buffer space when sending filter netlink notifications
The tfilter_notify() and tfilter_del_notify() functions assume that
NLMSG_GOODSIZE is always enough to dump the filter chain. This is not
always the case, which can lead to silent notify failures (because the
return code of tfilter_notify() is not always checked). In particular,
this can lead to NLM_F_ECHO not being honoured even though an action
succeeds, which forces userspace to create workarounds[0].
Fix this by increasing the message size if dumping the filter chain into
the allocated skb fails. Use the size of the incoming skb as a size hint
if set, so we can start at a larger value when appropriate.
To trigger this, run the following commands:
# ip link add type veth
# tc qdisc replace dev veth0 root handle 1: fq_codel
# tc -echo filter add dev veth0 parent 1: u32 match u32 0 0 $(for i in $(seq 32); do echo action pedit munge ip dport set 22; done)
Before this fix, tc just returns:
Not a filter(cmd 2)
After the fix, we get the correct echo:
added filter dev veth0 parent 1: protocol all pref 49152 u32 chain 0 fh 800::800 order 2048 key ht 800 bkt 0 terminal flowid not_in_hw
match 00000000/00000000 at 0
action order 1: pedit action pass keys 1
index 1 ref 1 bind 1
key #0 at 20: val 00000016 mask ffff0000
[repeated 32 times]
WX_RXD_IPV6EX was incorrectly defined in Rx ring descriptor. In fact, this
field stores the 802.1ad ID from which the packet was received. The wrong
definition caused the statistics rx_csum_offload_errors to fail to grow
when receiving the 802.1ad packet with incorrect checksum.
x86/xen: disable CPU idle and frequency drivers for PVH dom0
When running as a PVH dom0 the ACPI tables exposed to Linux are (mostly)
the native ones, thus exposing the C and P states, that can lead to
attachment of CPU idle and frequency drivers. However the entity in
control of the CPU C and P states is Xen, as dom0 doesn't have a full view
of the system load, neither has all CPUs assigned and identity pinned.
Like it's done for classic PV guests, prevent Linux from using idle or
frequency state drivers when running as a PVH dom0.
On an AMD EPYC 7543P system without this fix a Linux PVH dom0 will keep the
host CPUs spinning at 100% even when dom0 is completely idle, as it's
attempting to use the acpi_idle driver.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jason Andryuk <jason.andryuk@amd.com> Signed-off-by: Juergen Gross <jgross@suse.com>
Message-ID: <20250407101842.67228-1-roger.pau@citrix.com>
Paolo Bonzini [Tue, 8 Apr 2025 09:49:31 +0000 (05:49 -0400)]
Merge tag 'kvmarm-fixes-6.15-1' of https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64: First batch of fixes for 6.15
- Rework heuristics for resolving the fault IPA (HPFAR_EL2 v. re-walk
stage-1 page tables) to align with the architecture. This avoids
possibly taking an SEA at EL2 on the page table walk or using an
architecturally UNKNOWN fault IPA.
- Use acquire/release semantics in the KVM FF-A proxy to avoid reading
a stale value for the FF-A version.
- Fix KVM guest driver to match PV CPUID hypercall ABI.
- Use Inner Shareable Normal Write-Back mappings at stage-1 in KVM
selftests, which is the only memory type for which atomic
instructions are architecturally guaranteed to work.
Jakub Kicinski [Fri, 4 Apr 2025 18:03:34 +0000 (11:03 -0700)]
selftests: tls: check that disconnect does nothing
"Inspired" by syzbot test, pre-queue some data, disconnect()
and try to receive(). This used to trigger a warning in TLS's strp.
Now we expect the disconnect() to have almost no effect.
Jakub Kicinski [Fri, 4 Apr 2025 18:03:33 +0000 (11:03 -0700)]
net: tls: explicitly disallow disconnect
syzbot discovered that it can disconnect a TLS socket and then
run into all sort of unexpected corner cases. I have a vague
recollection of Eric pointing this out to us a long time ago.
Supporting disconnect is really hard, for one thing if offload
is enabled we'd need to wait for all packets to be _acked_.
Disconnect is not commonly used, disallow it.
The immediate problem syzbot run into is the warning in the strp,
but that's just the easiest bug to trigger:
sctp: detect and prevent references to a freed transport in sendmsg
sctp_sendmsg() re-uses associations and transports when possible by
doing a lookup based on the socket endpoint and the message destination
address, and then sctp_sendmsg_to_asoc() sets the selected transport in
all the message chunks to be sent.
There's a possible race condition if another thread triggers the removal
of that selected transport, for instance, by explicitly unbinding an
address with setsockopt(SCTP_SOCKOPT_BINDX_REM), after the chunks have
been set up and before the message is sent. This can happen if the send
buffer is full, during the period when the sender thread temporarily
releases the socket lock in sctp_wait_for_sndbuf().
This causes the access to the transport data in
sctp_outq_select_transport(), when the association outqueue is flushed,
to result in a use-after-free read.
This change avoids this scenario by having sctp_transport_free() signal
the freeing of the transport, tagging it as "dead". In order to do this,
the patch restores the "dead" bit in struct sctp_transport, which was
removed in
commit 47faa1e4c50e ("sctp: remove the dead field of sctp_transport").
Then, in the scenario where the sender thread has released the socket
lock in sctp_wait_for_sndbuf(), the bit is checked again after
re-acquiring the socket lock to detect the deletion. This is done while
holding a reference to the transport to prevent it from being freed in
the process.
If the transport was deleted while the socket lock was relinquished,
sctp_sendmsg_to_asoc() will return -EAGAIN to let userspace retry the
send.
The bug was found by a private syzbot instance (see the error report [1]
and the C reproducer that triggers it [2]).
Andy Shevchenko [Wed, 2 Apr 2025 12:20:01 +0000 (15:20 +0300)]
gpiolib: of: Move Atmel HSMCI quirk up out of the regulator comment
The regulator comment in of_gpio_set_polarity_by_property()
made on top of a couple of the cases, while Atmel HSMCI quirk
is not related to that. Make it clear by moving Atmel HSMCI
quirk up out of the scope of the regulator comment.
Andy Shevchenko [Wed, 2 Apr 2025 12:20:00 +0000 (15:20 +0300)]
gpiolib: of: Fix the choice for Ingenic NAND quirk
The Ingenic NAND quirk has been added under CONFIG_LCD_HX8357 ifdeffery
which sounds quite wrong. Fix the choice for Ingenic NAND quirk
by wrapping it into own ifdeffery related to the respective driver.
====================
net_sched: make ->qlen_notify() idempotent
Gerrard reported a vulnerability exists in fq_codel where manipulating
the MTU can cause codel_dequeue() to drop all packets. The parent qdisc's
sch->q.qlen is only updated via ->qlen_notify() if the fq_codel queue
remains non-empty after the drops. This discrepancy in qlen between
fq_codel and its parent can lead to a use-after-free condition.
Let's fix this by making all existing ->qlen_notify() idempotent so that
the sch->q.qlen check will be no longer necessary.
Patch 1~5 make all existing ->qlen_notify() idempotent to prepare for
patch 6 which removes the sch->q.qlen check. They are followed by 5
selftests for each type of Qdisc's we touch here.
All existing and new Qdisc selftests pass after this patchset.
Cong Wang [Thu, 3 Apr 2025 21:16:36 +0000 (14:16 -0700)]
selftests/tc-testing: Add a test case for FQ_CODEL with ETS parent
Add a test case for FQ_CODEL with ETS parent to verify packet drop
behavior when the queue becomes empty. This helps ensure proper
notification mechanisms between qdiscs.
Note this is best-effort, it is hard to play with those parameters
perfectly to always trigger ->qlen_notify().
Cc: Pedro Tammela <pctammela@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Reviewed-by: Victor Nogueira <victor@mojatatu.com> Link: https://patch.msgid.link/20250403211636.166257-6-xiyou.wangcong@gmail.com Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Cong Wang [Thu, 3 Apr 2025 21:16:35 +0000 (14:16 -0700)]
selftests/tc-testing: Add a test case for FQ_CODEL with DRR parent
Add a test case for FQ_CODEL with DRR parent to verify packet drop
behavior when the queue becomes empty. This helps ensure proper
notification mechanisms between qdiscs.
Note this is best-effort, it is hard to play with those parameters
perfectly to always trigger ->qlen_notify().
Cc: Pedro Tammela <pctammela@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Reviewed-by: Victor Nogueira <victor@mojatatu.com> Link: https://patch.msgid.link/20250403211636.166257-5-xiyou.wangcong@gmail.com Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Cong Wang [Thu, 3 Apr 2025 21:16:34 +0000 (14:16 -0700)]
selftests/tc-testing: Add a test case for FQ_CODEL with HFSC parent
Add a test case for FQ_CODEL with HFSC parent to verify packet drop
behavior when the queue becomes empty. This helps ensure proper
notification mechanisms between qdiscs.
Note this is best-effort, it is hard to play with those parameters
perfectly to always trigger ->qlen_notify().
Cc: Pedro Tammela <pctammela@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Reviewed-by: Victor Nogueira <victor@mojatatu.com> Link: https://patch.msgid.link/20250403211636.166257-4-xiyou.wangcong@gmail.com Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Cong Wang [Thu, 3 Apr 2025 21:16:33 +0000 (14:16 -0700)]
selftests/tc-testing: Add a test case for FQ_CODEL with QFQ parent
Add a test case for FQ_CODEL with QFQ parent to verify packet drop
behavior when the queue becomes empty. This helps ensure proper
notification mechanisms between qdiscs.
Note this is best-effort, it is hard to play with those parameters
perfectly to always trigger ->qlen_notify().
Cc: Pedro Tammela <pctammela@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Reviewed-by: Victor Nogueira <victor@mojatatu.com> Link: https://patch.msgid.link/20250403211636.166257-3-xiyou.wangcong@gmail.com Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Cong Wang [Thu, 3 Apr 2025 21:16:32 +0000 (14:16 -0700)]
selftests/tc-testing: Add a test case for FQ_CODEL with HTB parent
Add a test case for FQ_CODEL with HTB parent to verify packet drop
behavior when the queue becomes empty. This helps ensure proper
notification mechanisms between qdiscs.
Note this is best-effort, it is hard to play with those parameters
perfectly to always trigger ->qlen_notify().
Cc: Pedro Tammela <pctammela@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Reviewed-by: Victor Nogueira <victor@mojatatu.com> Link: https://patch.msgid.link/20250403211636.166257-2-xiyou.wangcong@gmail.com Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Cong Wang [Thu, 3 Apr 2025 21:16:31 +0000 (14:16 -0700)]
codel: remove sch->q.qlen check before qdisc_tree_reduce_backlog()
After making all ->qlen_notify() callbacks idempotent, now it is safe to
remove the check of qlen!=0 from both fq_codel_dequeue() and
codel_qdisc_dequeue().
Reported-by: Gerrard Tai <gerrard.tai@starlabs.sg> Fixes: 4b549a2ef4be ("fq_codel: Fair Queue Codel AQM") Fixes: 76e3cc126bb2 ("codel: Controlled Delay AQM") Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250403211636.166257-1-xiyou.wangcong@gmail.com Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Cong Wang [Thu, 3 Apr 2025 21:10:27 +0000 (14:10 -0700)]
sch_ets: make est_qlen_notify() idempotent
est_qlen_notify() deletes its class from its active list with
list_del() when qlen is 0, therefore, it is not idempotent and
not friendly to its callers, like fq_codel_dequeue().
Let's make it idempotent to ease qdisc_tree_reduce_backlog() callers'
life. Also change other list_del()'s to list_del_init() just to be
extra safe.
Cong Wang [Thu, 3 Apr 2025 21:10:26 +0000 (14:10 -0700)]
sch_qfq: make qfq_qlen_notify() idempotent
qfq_qlen_notify() always deletes its class from its active list
with list_del_init() _and_ calls qfq_deactivate_agg() when the whole list
becomes empty.
To make it idempotent, just skip everything when it is not in the active
list.
Also change other list_del()'s to list_del_init() just to be extra safe.
Reported-by: Gerrard Tai <gerrard.tai@starlabs.sg> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250403211033.166059-5-xiyou.wangcong@gmail.com Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Cong Wang [Thu, 3 Apr 2025 21:10:25 +0000 (14:10 -0700)]
sch_hfsc: make hfsc_qlen_notify() idempotent
hfsc_qlen_notify() is not idempotent either and not friendly
to its callers, like fq_codel_dequeue(). Let's make it idempotent
to ease qdisc_tree_reduce_backlog() callers' life:
1. update_vf() decreases cl->cl_nactive, so we can check whether it is
non-zero before calling it.
2. eltree_remove() always removes RB node cl->el_node, but we can use
RB_EMPTY_NODE() + RB_CLEAR_NODE() to make it safe.
Reported-by: Gerrard Tai <gerrard.tai@starlabs.sg> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250403211033.166059-4-xiyou.wangcong@gmail.com Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>