Amir Tzin [Tue, 4 Mar 2025 16:06:18 +0000 (18:06 +0200)]
net/mlx5: Lag, Enable Multiport E-Switch offloads on 8 ports LAG
Patch [1] added mlx5 driver support for 8 ports HCAs which are available
since ConnectX-8. Now that Multiport E-Switch is tested, we can enable
it by removing flag MLX5_LAG_MPESW_OFFLOADS_SUPPORTED_PORTS.
Shahar Shitrit [Tue, 4 Mar 2025 16:06:17 +0000 (18:06 +0200)]
net/mlx5e: Enable lanes configuration when auto-negotiation is off
Currently, when auto-negotiation is disabled, the driver retrieves
the speed and converts it into all link modes that correspond to
that speed. With this patch, we add the ability to set the number
of lanes, so that the combination of speed and lanes corresponds to
exactly one specific link mode for the extended bit map.
For the legacy bit map the driver sets all link modes correspond to
speed and lanes.
This change provides users with the option to set a specific link
mode, rather than enabling all link modes associated with a given
speed when auto-negotiation is off.
Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com> Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20250304160620.417580-4-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Shahar Shitrit [Tue, 4 Mar 2025 16:06:16 +0000 (18:06 +0200)]
net/mlx5: Refactor link speed handling with mlx5_link_info struct
Introduce struct mlx5_link_info with a speed field and change the
types of mlx5e_link_speed and mlx5e_ext_link_speed from arrays of
u32 to arrays of struct mlx5_link_info. These arrays are renamed
to mlx5e_link_info and mlx5e_ext_link_info, respectively.
This change prepares for a future patch that will introduce a lanes
field in struct mlx5_link_info and add lanes mapping alongside the
speed for each link mode in the two arrays.
Additionally, rename function mlx5_port_speed2linkmodes() to
mlx5_port_info2linkmodes() and function mlx5_port_ptys2speed()
to mlx5_port_ptys2info() and update the speed parameter/return
type to struct mlx5_link_info, in preparation for the upcoming
patch where these functions will also utilize the lanes field.
Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com> Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20250304160620.417580-3-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Shahar Shitrit [Tue, 4 Mar 2025 16:06:15 +0000 (18:06 +0200)]
net/mlx5: Relocate function declarations from port.h to mlx5_core.h
The port header is a general file under include, yet it
contains declarations for functions that are either not
exported or exported but not used outside the mlx5_core
driver.
To enhance code organization, we move these declarations
to mlx5_core.h, where they are more appropriately scoped.
This refactor removes unnecessary exported symbols and
prevents unexported functions from being inadvertently
referenced outside of the mlx5_core driver.
Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com> Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20250304160620.417580-2-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
Add perout configuration support in IEP driver
IEP driver supported both perout and pps signal generation
but perout feature is faulty with half-cooked support
due to some missing configuration. Hence perout feature is
removed as a bug fix. This patch series adds back this feature
which configures perout signal based on the arguments passed
by the perout request.
This patch series is continuation to the bug fix:
https://lore.kernel.org/20250227092441.1848419-1-m-malladi@ti.com
as suggested by Jakub Kicinski and Jacob Keller:
https://lore.kernel.org/20250220172410.025b96d6@kernel.org
Meghana Malladi [Tue, 4 Mar 2025 10:57:53 +0000 (16:27 +0530)]
net: ti: icss-iep: Add phase offset configuration for perout signal
icss_iep_perout_enable_hw() is a common function for generating
both pps and perout signals. When enabling pps, the application needs
to only pass enable/disable argument, whereas for perout it supports
different flags to configure the signal.
In case the app passes a valid phase offset value, the signal should
start toggling after that phase offset, else start immediately or
as soon as possible. ICSS_IEP_SYNC_START_REG register take number of
clock cycles to wait before starting the signal after activation time.
Set appropriate value to this register to support phase offset.
Signed-off-by: Meghana Malladi <m-malladi@ti.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Kory Maincent <kory.maincent@bootlin.com> Link: https://patch.msgid.link/20250304105753.1552159-3-m-malladi@ti.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Meghana Malladi [Tue, 4 Mar 2025 10:57:52 +0000 (16:27 +0530)]
net: ti: icss-iep: Add pwidth configuration for perout signal
icss_iep_perout_enable_hw() is a common function for generating
both pps and perout signals. When enabling pps, the application needs
to only pass enable/disable argument, whereas for perout it supports
different flags to configure the signal.
But icss_iep_perout_enable_hw() function is missing to hook the
configuration params passed by the app, causing perout to behave
same a pps (except being able to configure the period). As duty cycle
is also one feature which can configured for perout, incorporate this
in the function to get the expected signal.
Signed-off-by: Meghana Malladi <m-malladi@ti.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Kory Maincent <kory.maincent@bootlin.com> Link: https://patch.msgid.link/20250304105753.1552159-2-m-malladi@ti.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Tue, 4 Mar 2025 18:06:15 +0000 (10:06 -0800)]
selftests: openvswitch: don't hardcode the drop reason subsys
WiFi removed one of their subsys entries from drop reasons, in
commit 286e69677065 ("wifi: mac80211: Drop cooked monitor support")
SKB_DROP_REASON_SUBSYS_OPENVSWITCH is now 2 not 3.
The drop reasons are not uAPI, read the correct value
from debug info.
We need to enable vmlinux BTF, otherwise pahole needs
a few GB of memory to decode the enum name.
====================
Increase maximum MTU to 9k for Airoha EN7581 SoC
EN7581 SoC supports 9k maximum MTU.
Enable the reception of Scatter-Gather (SG) frames for Airoha EN7581.
Introduce airoha_dev_change_mtu callback.
====================
Russell King (Oracle) [Tue, 4 Mar 2025 11:21:27 +0000 (11:21 +0000)]
net: stmmac: simplify phylink_suspend() and phylink_resume() calls
Currently, the calls to phylink's suspend and resume functions are
inside overly complex tests, and boil down to:
if (device_may_wakeup(priv->device) && priv->plat->pmt) {
call phylink
} else {
call phylink and
if (device_may_wakeup(priv->device))
do something else
}
This results in phylink always being called, possibly with differing
arguments for phylink_suspend().
Simplify this code, noting that each site is slightly different due to
the order in which phylink is called and the "something else".
Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/E1tpQL1-005St4-Hn@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Russell King (Oracle) [Wed, 5 Mar 2025 17:54:16 +0000 (17:54 +0000)]
net: stmmac: avoid shadowing global buf_sz
stmmac_rx() declares a local variable named "buf_sz" but there is also
a global variable for a module parameter which is called the same. To
avoid confusion, rename the local variable.
Jakub Kicinski [Tue, 4 Mar 2025 23:32:03 +0000 (15:32 -0800)]
selftests: net: fix error message in bpf_offload
We hit a following exception on timeout, nmaps is never set:
Test bpftool bound info reporting (own ns)...
Traceback (most recent call last):
File "/home/virtme/testing-1/tools/testing/selftests/net/./bpf_offload.py", line 1128, in <module>
check_dev_info(False, "")
File "/home/virtme/testing-1/tools/testing/selftests/net/./bpf_offload.py", line 583, in check_dev_info
maps = bpftool_map_list_wait(expected=2, ns=ns)
File "/home/virtme/testing-1/tools/testing/selftests/net/./bpf_offload.py", line 215, in bpftool_map_list_wait
raise Exception("Time out waiting for map counts to stabilize want %d, have %d" % (expected, nmaps))
NameError: name 'nmaps' is not defined
A recent cleanup changed the behaviour of tcp_set_window_clamp(). This
looks unintentional, and affects MPTCP selftests, e.g. some tests
re-establishing a connection after a disconnect are now unstable.
The cleanup used the 'clamp' macro which takes 3 arguments -- value,
lowest, and highest -- and returns a value between the lowest and the
highest allowable values. This then assumes ...
lowest (rcv_ssthresh) <= highest (rcv_wnd)
... which doesn't seem to be always the case here according to the MPTCP
selftests, even when running them without MPTCP, but only TCP.
For example, when we have ...
rcv_wnd < rcv_ssthresh < new_rcv_ssthresh
... before the cleanup, the rcv_ssthresh was not changed, while after
the cleanup, it is lowered down to rcv_wnd (highest).
During a simple test with TCP, here are the values I observed:
113664 (out) 65483 < 65536
117760 (out) 110968 < 110976
129024 (out) 116527 > 109696 => lo > hi
Here, we can see that it is not that rare to have rcv_ssthresh (lo)
higher than rcv_wnd (hi), so having a different behaviour when the
clamp() macro is used, even without MPTCP.
Note: new_window_clamp is always out of range (rcv_ssthresh < rcv_wnd)
here, which seems to be generally the case in my tests with small
connections.
I then suggests reverting this part, not to change the behaviour.
Fixes: 863a952eb79a ("tcp: tcp_set_window_clamp() cleanup") Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/551 Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Tested-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250305-net-next-fix-tcp-win-clamp-v1-1-12afb705d34e@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Russell King (Oracle) [Wed, 5 Mar 2025 17:54:21 +0000 (17:54 +0000)]
net: stmmac: mostly remove "buf_sz"
The "buf_sz" parameter is not used in the stmmac driver - there is one
place where the value of buf_sz is validated, and two places where it
is written. It is otherwise unused.
Remove these accesses. However, leave the module parameter in place as
removing it could cause module load to fail, breaking user setups.
====================
net: stmmac: dwc-qos: Add FSD EQoS support
FSD platform has two instances of EQoS IP, one is in FSYS0 block and
another one is in PERIC block. This patch series add required DT binding
and platform driver specific changes for the same.
====================
Swathi K S [Wed, 5 Mar 2025 09:12:46 +0000 (14:42 +0530)]
net: stmmac: dwc-qos: Add FSD EQoS support
The FSD SoC contains two instance of the Synopsys DWC ethernet QOS IP core.
The binding that it uses is slightly different from existing ones because
of the integration (clocks, resets).
Swathi K S [Wed, 5 Mar 2025 09:12:45 +0000 (14:42 +0530)]
dt-bindings: net: Add FSD EQoS device tree bindings
Add FSD Ethernet compatible in Synopsys dt-bindings document. Add FSD
Ethernet YAML schema to enable the DT validation.
Signed-off-by: Pankaj Dubey <pankaj.dubey@samsung.com> Signed-off-by: Ravi Patel <ravi.patel@samsung.com> Signed-off-by: Swathi K S <swathi.ks@samsung.com> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Link: https://patch.msgid.link/20250305091246.106626-2-swathi.ks@samsung.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
tcp: even faster connect() under stress
This is a followup on the prior series, "tcp: scale connect() under pressure"
Now spinlocks are no longer in the picture, we see a very high cost
of the inet6_ehashfn() function.
In this series (of 2), I change how lport contributes to inet6_ehashfn()
to ensure better cache locality and call inet6_ehashfn()
only once per connect() system call.
This brings an additional 229 % increase of performance
for "neper/tcp_crr -6 -T 200 -F 30000" stress test,
while greatly improving latency metrics.
Eric Dumazet [Wed, 5 Mar 2025 03:45:49 +0000 (03:45 +0000)]
inet: change lport contribution to inet_ehashfn() and inet6_ehashfn()
In order to speedup __inet_hash_connect(), we want to ensure hash values
for <source address, port X, destination address, destination port>
are not randomly spread, but monotonically increasing.
Goal is to allow __inet_hash_connect() to derive the hash value
of a candidate 4-tuple with a single addition in the following
patch in the series.
Then (hash_sport == hash_0 + sport) for all sport values.
As far as I know, there is no security implication with this change.
After this patch, when __inet_hash_connect() has to try XXXX candidates,
the hash table buckets are contiguous and packed, allowing
a better use of cpu caches and hardware prefetchers.
Cross-merge networking fixes after downstream PR (net-6.14-rc6).
Conflicts:
net/ethtool/cabletest.c 2bcf4772e45a ("net: ethtool: try to protect all callback with netdev instance lock") 637399bf7e77 ("net: ethtool: netlink: Allow NULL nlattrs when getting a phy_device")
====================
net: Hold netdev instance lock during ndo operations
As the gradual purging of rtnl continues, start grabbing netdev
instance lock in more places so we can get to the state where
most paths are working without rtnl. Start with requiring the
drivers that use shaper api (and later queue mgmt api) to work
with both rtnl and netdev instance lock. Eventually we might
attempt to drop rtnl. This mostly affects iavf, gve, bnxt and
netdev sim (as the drivers that implement shaper/queue mgmt)
so those drivers are converted in the process.
call_netdevice_notifiers locking is very inconsistent and might need
a separate follow up. Some notified events are covered by the
instance lock, some are not, which might complicate the driver
expectations.
Reviewed-by: Eric Dumazet <edumazet@google.com>
====================
Stanislav Fomichev [Wed, 5 Mar 2025 16:37:32 +0000 (08:37 -0800)]
eth: bnxt: remove most dependencies on RTNL
Only devlink and sriov paths are grabbing rtnl explicitly. The rest is
covered by netdev instance lock which the core now grabs, so there is
no need to manage rtnl in most places anymore.
On the core side we can now try to drop rtnl in some places
(do_setlink for example) for the drivers that signal non-rtnl
mode (TBD).
Boot-tested and with `ethtool -L eth1 combined 24` to trigger reset.
Cc: Saeed Mahameed <saeed@kernel.org> Reviewed-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250305163732.2766420-15-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Stanislav Fomichev [Wed, 5 Mar 2025 16:37:30 +0000 (08:37 -0800)]
net: add option to request netdev instance lock
Currently only the drivers that implement shaper or queue APIs
are grabbing instance lock. Add an explicit opt-in for the
drivers that want to grab the lock without implementing the above
APIs.
Stanislav Fomichev [Wed, 5 Mar 2025 16:37:29 +0000 (08:37 -0800)]
net: replace dev_addr_sem with netdev instance lock
Lockdep reports possible circular dependency in [0]. Instead of
fixing the ordering, replace global dev_addr_sem with netdev
instance lock. Most of the paths that set/get mac are RTNL
protected. Two places where it's not, convert to explicit
locking:
- sysfs address_show
- dev_get_mac_address via dev_ioctl
Jakub Kicinski [Wed, 5 Mar 2025 16:37:28 +0000 (08:37 -0800)]
net: ethtool: try to protect all callback with netdev instance lock
Protect all ethtool callbacks and PHY related state with the netdev
instance lock, for drivers which want / need to have their ops
instance-locked. Basically take the lock everywhere we take rtnl_lock.
It was tempting to take the lock in ethnl_ops_begin(), but turns
out we actually nest those calls (when generating notifications).
Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Cc: Saeed Mahameed <saeed@kernel.org> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250305163732.2766420-11-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Stanislav Fomichev [Wed, 5 Mar 2025 16:37:24 +0000 (08:37 -0800)]
net: hold netdev instance lock during rtnetlink operations
To preserve the atomicity, hold the lock while applying multiple
attributes. The major issue with a full conversion to the instance
lock are software nesting devices (bonding/team/vrf/etc). Those
devices call into the core stack for their lower (potentially
real hw) devices. To avoid explicitly wrapping all those places
into instance lock/unlock, introduce new API boundaries:
- (some) existing dev_xxx calls are now considered "external"
(to drivers) APIs and they transparently grab the instance
lock if needed (dev_api.c)
- new netif_xxx calls are internal core stack API (naming is
sketchy, I've tried netdev_xxx_locked per Jakub's suggestion,
but it feels a bit verbose; but happy to get back to this
naming scheme if this is the preference)
This avoids touching most of the existing ioctl/sysfs/drivers paths.
Note the special handling of ndo_xxx_slave operations: I exploit
the fact that none of the drivers that call these functions
need/use instance lock. At the same time, they use dev_xxx
APIs, so the lower device has to be unlocked.
Changes in unregister_netdevice_many_notify (to protect dev->state
with instance lock) trigger lockdep - the loop over close_list
(mostly from cleanup_net) introduces spurious ordering issues.
netdev_lock_cmp_fn has a justification on why it's ok to suppress
for now.
Stanislav Fomichev [Wed, 5 Mar 2025 16:37:23 +0000 (08:37 -0800)]
net: hold netdev instance lock during queue operations
For the drivers that use queue management API, switch to the mode where
core stack holds the netdev instance lock. This affects the following
drivers:
- bnxt
- gve
- netdevsim
Originally I locked only start/stop, but switched to holding the
lock over all iterations to make them look atomic to the device
(feels like it should be easier to reason about).
Stanislav Fomichev [Wed, 5 Mar 2025 16:37:22 +0000 (08:37 -0800)]
net: hold netdev instance lock during qdisc ndo_setup_tc
Qdisc operations that can lead to ndo_setup_tc might need
to have an instance lock. Add netdev_lock_ops/netdev_unlock_ops
invocations for all psched_rtnl_msg_handlers operations.
Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Jiri Pirko <jiri@resnulli.us> Cc: Saeed Mahameed <saeed@kernel.org> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250305163732.2766420-5-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Stanislav Fomichev [Wed, 5 Mar 2025 16:37:21 +0000 (08:37 -0800)]
net: sched: wrap doit/dumpit methods
In preparation for grabbing netdev instance lock around qdisc
operations, introduce tc_xxx wrappers that lookup netdev
and call respective __tc_xxx helper to do the actual work.
No functional changes.
Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Jiri Pirko <jiri@resnulli.us> Cc: Saeed Mahameed <saeed@kernel.org> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250305163732.2766420-4-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Stanislav Fomichev [Wed, 5 Mar 2025 16:37:19 +0000 (08:37 -0800)]
net: hold netdev instance lock during ndo_open/ndo_stop
For the drivers that use shaper API, switch to the mode where
core stack holds the netdev lock. This affects two drivers:
* iavf - already grabs netdev lock in ndo_open/ndo_stop, so mostly
remove these
* netdevsim - switch to _locked APIs to avoid deadlock
iavf_close diff is a bit confusing, the existing call looks like this:
iavf_close() {
netdev_lock()
..
netdev_unlock()
wait_event_timeout(down_waitqueue)
}
I change it to the following:
netdev_lock()
iavf_close() {
..
netdev_unlock()
wait_event_timeout(down_waitqueue)
netdev_lock() // reusing this lock call
}
netdev_unlock()
Since I'm reusing existing netdev_lock call, so it looks like I only
add netdev_unlock.
- wifi: iwlwifi:
- fix A-MSDU TSO preparation
- free pages allocated when failing to build A-MSDU
- ipv6: fix dst ref loop in ila lwtunnel
- mptcp: fix 'scheduling while atomic' in
mptcp_pm_nl_append_new_local_addr
- bluetooth: add check for mgmt_alloc_skb() in
mgmt_device_connected()
- ethtool: allow NULL nlattrs when getting a phy_device
- eth: be2net: fix sleeping while atomic bugs in
be_ndo_bridge_getlink
Previous releases - always broken:
- core: support TCP GSO case for a few missing flags
- wifi: mac80211:
- fix vendor-specific inheritance
- cleanup sta TXQs on flush
- llc: do not use skb_get() before dev_queue_xmit()
- eth: ipa: nable checksum for IPA_ENDPOINT_AP_MODEM_{RX,TX}
for v4.7"
* tag 'net-6.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (41 commits)
net: ipv6: fix missing dst ref drop in ila lwtunnel
net: ipv6: fix dst ref loop in ila lwtunnel
mctp i3c: handle NULL header address
net: dsa: mt7530: Fix traffic flooding for MMIO devices
net-timestamp: support TCP GSO case for a few missing flags
vlan: enforce underlying device type
mptcp: fix 'scheduling while atomic' in mptcp_pm_nl_append_new_local_addr
net: ethtool: netlink: Allow NULL nlattrs when getting a phy_device
ppp: Fix KMSAN uninit-value warning with bpf
net: ipa: Enable checksum for IPA_ENDPOINT_AP_MODEM_{RX,TX} for v4.7
net: ipa: Fix QSB data for v4.7
net: ipa: Fix v4.7 resource group names
net: hns3: make sure ptp clock is unregister and freed if hclge_ptp_get_cycle returns an error
wifi: nl80211: disable multi-link reconfiguration
net: dsa: rtl8366rb: don't prompt users for LED control
be2net: fix sleeping while atomic bugs in be_ndo_bridge_getlink
llc: do not use skb_get() before dev_queue_xmit()
wifi: cfg80211: regulatory: improve invalid hints checking
caif_virtio: fix wrong pointer check in cfv_probe()
net: gso: fix ownership in __udp_gso_segment
...
Linus Torvalds [Thu, 6 Mar 2025 19:19:15 +0000 (09:19 -1000)]
Merge tag 'v6.14-rc5-smb3-fixes' of git://git.samba.org/ksmbd
Pull smb fixes from Steve French:
"Five SMB server fixes, two related client fixes, and minor MAINTAINERS
update:
- Two SMB3 lock fixes fixes (including use after free and bug on fix)
- Fix to race condition that can happen in processing IPC responses
- Four ACL related fixes: one related to endianness of num_aces, and
two related fixes to the checks for num_aces (for both client and
server), and one fixing missing check for num_subauths which can
cause memory corruption
- And minor update to email addresses in MAINTAINERS file"
* tag 'v6.14-rc5-smb3-fixes' of git://git.samba.org/ksmbd:
cifs: fix incorrect validation for num_aces field of smb_acl
ksmbd: fix incorrect validation for num_aces field of smb_acl
smb: common: change the data type of num_aces to le16
ksmbd: fix bug on trap in smb2_lock
ksmbd: fix use-after-free in smb2_lock
ksmbd: fix type confusion via race condition when using ipc_msg_send_request
ksmbd: fix out-of-bounds in parse_sec_desc()
MAINTAINERS: update email address in cifs and ksmbd entry
Linus Torvalds [Thu, 6 Mar 2025 18:18:48 +0000 (08:18 -1000)]
Merge tag 'exfat-for-6.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat
Pull exfat fixes from Namjae Jeon:
- Optimize new cluster allocation by correctly find empty entry slot
- Add a check to prevent excessive bitmap clearing due to invalid
data size of file/dir entry
- Fix incorrect error return for zero-byte writes
* tag 'exfat-for-6.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat:
exfat: add a check for invalid data size
exfat: short-circuit zero-byte writes in exfat_file_write_iter
exfat: fix soft lockup in exfat_clear_bitmap
exfat: fix just enough dentries but allocate a new cluster to dir
Linus Torvalds [Thu, 6 Mar 2025 18:04:49 +0000 (08:04 -1000)]
Merge tag 'vfs-6.14-rc6.fixes' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
- Fix spelling mistakes in idmappings.rst
- Fix RCU warnings in override_creds()/revert_creds()
- Create new pid namespaces with default limit now that pid_max is
namespaced
* tag 'vfs-6.14-rc6.fixes' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs:
pid: Do not set pid_max in new pid namespaces
doc: correcting two prefix errors in idmappings.rst
cred: Fix RCU warnings in override/revert_creds
Linus Torvalds [Thu, 6 Mar 2025 17:53:25 +0000 (07:53 -1000)]
fs/pipe: fix pipe buffer index use in FUSE
This was another case that Rasmus pointed out where the direct access to
the pipe head and tail pointers broke on 32-bit configurations due to
the type changes.
As with the pipe FIONREAD case, fix it by using the appropriate helper
functions that deal with the right pipe index sizing.
Linus Torvalds [Thu, 6 Mar 2025 17:33:58 +0000 (07:33 -1000)]
fs/pipe: do not open-code pipe head/tail logic in FIONREAD
Rasmus points out that we do indeed have other cases of breakage from
the type changes that were introduced on 32-bit targets in order to read
the pipe head and tail values atomically (commit 3d252160b818: "fs/pipe:
Read pipe->{head,tail} atomically outside pipe->mutex").
Fix it up by using the proper helper functions that now deal with the
pipe buffer index types properly. This makes the code simpler and more
obvious.
The compiler does the CSE and loop hoisting of the pipe ring size
masking that we used to do manually, so open-coding this was never a
good idea.
Linus Torvalds [Thu, 6 Mar 2025 17:30:42 +0000 (07:30 -1000)]
fs/pipe: express 'pipe_empty()' in terms of 'pipe_occupancy()'
That's what 'pipe_full()' does, so it's more consistent. But more
importantly it gets the type limits right when the pipe head and tail
are no longer necessarily 'unsigned int'.
Justin Iurman [Tue, 4 Mar 2025 18:10:39 +0000 (19:10 +0100)]
net: ipv6: fix dst ref loop in ila lwtunnel
This patch follows commit 92191dd10730 ("net: ipv6: fix dst ref loops in
rpl, seg6 and ioam6 lwtunnels") and, on a second thought, the same patch
is also needed for ila (even though the config that triggered the issue
was pathological, but still, we don't want that to happen).
Michal Koutný [Wed, 5 Mar 2025 14:58:49 +0000 (15:58 +0100)]
pid: Do not set pid_max in new pid namespaces
It is already difficult for users to troubleshoot which of multiple pid
limits restricts their workload. The per-(hierarchical-)NS pid_max would
contribute to the confusion.
Also, the implementation copies the limit upon creation from
parent, this pattern showed cumbersome with some attributes in legacy
cgroup controllers -- it's subject to race condition between parent's
limit modification and children creation and once copied it must be
changed in the descendant.
Let's do what other places do (ucounts or cgroup limits) -- create new
pid namespaces without any limit at all. The global limit (actually any
ancestor's limit) is still effectively in place, we avoid the
set/unshare race and bumps of global (ancestral) limit have the desired
effect on pid namespace that do not care.
After blamed commit rtm_to_fib_config() now calls
lwtunnel_valid_encap_type{_attr}() without RTNL held,
triggering an unlock balance in __rtnl_unlock,
as reported by syzbot [1]
IPv6 and rtm_to_nh_config() are not yet converted.
Add a temporary @rtnl_is_held parameter to lwtunnel_valid_encap_type()
and lwtunnel_valid_encap_type_attr().
While we are at it replace the two rcu_dereference()
in lwtunnel_valid_encap_type() with more appropriate
rcu_access_pointer().
[1]
syz-executor245/5836 is trying to release lock (rtnl_mutex) at:
[<ffffffff89d0e38c>] __rtnl_unlock+0x6c/0xf0 net/core/rtnetlink.c:142
but there are no more locks to release!
other info that might help us debug this:
no locks held by syz-executor245/5836.
Lorenzo Bianconi [Tue, 4 Mar 2025 08:50:23 +0000 (09:50 +0100)]
net: dsa: mt7530: Fix traffic flooding for MMIO devices
On MMIO devices (e.g. MT7988 or EN7581) unicast traffic received on lanX
port is flooded on all other user ports if the DSA switch is configured
without VLAN support since PORT_MATRIX in PCR regs contains all user
ports. Similar to MDIO devices (e.g. MT7530 and MT7531) fix the issue
defining default VLAN-ID 0 for MT7530 MMIO devices.
Fixes: 110c18bfed414 ("net: dsa: mt7530: introduce driver for MT7988 built-in switch") Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Reviewed-by: Chester A. Unal <chester.a.unal@arinc9.com> Link: https://patch.msgid.link/20250304-mt7988-flooding-fix-v1-1-905523ae83e9@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Heiner Kallweit [Mon, 3 Mar 2025 20:19:25 +0000 (21:19 +0100)]
net: phy: remove remaining PHY package related definitions from phy.h
Move definition of struct phy_package_shared to phy_package.c, and
move remaining PHY package related declarations from phy.h to
phylib.h, thus making them accessible for PHY drivers only.
Heiner Kallweit [Mon, 3 Mar 2025 20:18:46 +0000 (21:18 +0100)]
net: phy: move PHY package related code from phy.h to phy_package.c
Move PHY package related inline functions from phy.h to phy_package.c.
While doing so remove locked versions phy_package_read() and
phy_package_write() which have no user.
Heiner Kallweit [Mon, 3 Mar 2025 20:15:09 +0000 (21:15 +0100)]
net: phy: add getters for public members in struct phy_package_shared
Add getters for public members, this prepares for making struct
phy_package_shared private to phylib. Declare the getters in a new header
file phylib.h, which will be used by PHY drivers only.
====================
Enable SGMII and 2500BASEX interface mode switching for Intel platforms
During the interface mode change, the 'phylink_major_config' function will
be triggered in phylink. The modification of the following functions will
support the switching between SGMII and 2500BASE-X interface modes for
the Intel platform:
- xpcs_switch_interface_mode: Re-initiates clause 37 auto-negotiation for
the SGMII interface mode to perform auto-negotiation.
- mac_finish: Configures the SerDes according to the interface mode.
With the above changes, the code will work as follows during the interface
mode change. The PCS will reconfigure according to the pcs_neg_mode and the
selected interface mode. Then, the MAC driver will perform SerDes
configuration in 'mac_finish' based on the selected interface mode. During
the SerDes configuration, the selected interface mode will identify TSN
lane registers from FIA and then send IPC commands to the Power Management
Controller (PMC) through the PMC driver/API. The PMC will act as a proxy to
program the PLL registers.
====================
net: stmmac: configure SerDes according to the interface mode
Intel platform will configure the SerDes through PMC API based on the
provided interface mode.
This patch adds several new functions below:-
- intel_tsn_lane_is_available(): This new function reads FIA lane
ownership registers and common lane registers through IPC commands
to know which lane the mGbE port is assigned to.
- intel_mac_finish(): To configure the SerDes based on the assigned
lane and latest interface mode, it sends IPC command to the PMC through
PMC driver/API. The PMC acts as a proxy for R/W on behalf of the driver.
- intel_set_reg_access(): Set the register access to the available TSN
interface.
David E. Box [Thu, 27 Feb 2025 12:15:19 +0000 (20:15 +0800)]
arch: x86: add IPC mailbox accessor function and add SoC register access
- Exports intel_pmc_ipc() for host access to the PMC IPC mailbox
- Enables the host to access specific SoC registers through the PMC
firmware using IPC commands. This access method is necessary for
registers that are not available through direct Memory-Mapped I/O (MMIO),
which is used for other accessible parts of the PMC.
Signed-off-by: David E. Box <david.e.box@linux.intel.com> Signed-off-by: Chao Qin <chao.qin@intel.com> Signed-off-by: Choong Yong Liang <yong.liang.choong@linux.intel.com> Acked-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Link: https://patch.msgid.link/20250227121522.1802832-4-yong.liang.choong@linux.intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The xpcs_switch_interface_mode function was introduced to handle
interface switching.
According to the XPCS datasheet, a soft reset is required to initiate
Clause 37 auto-negotiation when the XPCS switches interface modes.
When the interface mode switches from 2500BASE-X to SGMII,
re-initiating Clause 37 auto-negotiation is required for the SGMII
interface mode to function properly.
net: phylink: use pl->link_interface in phylink_expects_phy()
The phylink_expects_phy() function allows MAC drivers to check if they are
expecting a PHY to attach. The checking condition in phylink_expects_phy()
aims to achieve the same result as the checking condition in
phylink_attach_phy().
However, the checking condition in phylink_expects_phy() uses
pl->link_config.interface, while phylink_attach_phy() uses
pl->link_interface.
Initially, both pl->link_interface and pl->link_config.interface are set
to SGMII, and pl->cfg_link_an_mode is set to MLO_AN_INBAND.
When the interface switches from SGMII to 2500BASE-X,
pl->link_config.interface is updated by phylink_major_config().
At this point, pl->cfg_link_an_mode remains MLO_AN_INBAND, and
pl->link_config.interface is set to 2500BASE-X.
Subsequently, when the STMMAC interface is taken down
administratively and brought back up, it is blocked by
phylink_expects_phy().
Since phylink_expects_phy() and phylink_attach_phy() aim to achieve the
same result, phylink_expects_phy() should check pl->link_interface,
which never changes, instead of pl->link_config.interface, which is
updated by phylink_major_config().
Linus Torvalds [Wed, 5 Mar 2025 17:35:40 +0000 (07:35 -1000)]
fs/pipe: remove buggy and unused 'helper' function
While looking for incorrect users of the pipe head/tail fields (see
commit c27c66afc449: "fs/pipe: Fix pipe_occupancy() with 16-bit
indexes"), I found a bug in pipe_discard_from() that looked entirely
broken.
However, the fix is trivial: this buggy function isn't actually called
by anything, so let's just remove it ASAP.
Linus Torvalds [Wed, 5 Mar 2025 17:08:09 +0000 (07:08 -1000)]
fs/pipe: Fix pipe_occupancy() with 16-bit indexes
The pipe_occupancy() logic implicitly relied on the natural unsigned
modulo arithmetic in C, but that doesn't work for the new 'pipe_index_t'
case, since any arithmetic will be done in 'int' (and here we had also
made it 'unsigned int' due to the function call boundary).
So make the modulo arithmetic explicit by casting the result to the
proper type.
Jason Xing [Tue, 4 Mar 2025 00:44:29 +0000 (08:44 +0800)]
net-timestamp: support TCP GSO case for a few missing flags
When I read through the TSO codes, I found out that we probably
miss initializing the tx_flags of last seg when TSO is turned
off, which means at the following points no more timestamp
(for this last one) will be generated. There are three flags
to be handled in this patch:
1. SKBTX_HW_TSTAMP
2. SKBTX_BPF
3. SKBTX_SCHED_TSTAMP
Note that SKBTX_BPF[1] was added in 6.14.0-rc2 by commit 6b98ec7e882af ("bpf: Add BPF_SOCK_OPS_TSTAMP_SCHED_CB callback")
and only belongs to net-next branch material for now. The common
issue of the above three flags can be fixed by this single patch.
This patch initializes the tx_flags to SKBTX_ANY_TSTAMP like what
the UDP GSO does to make the newly segmented last skb inherit the
tx_flags so that requested timestamp will be generated in each
certain layer, or else that last one has zero value of tx_flags
which leads to no timestamp at all.
Fixes: 4ed2d765dfacc ("net-timestamp: TCP timestamping") Signed-off-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Sandeen [Tue, 11 Feb 2025 20:14:21 +0000 (14:14 -0600)]
exfat: short-circuit zero-byte writes in exfat_file_write_iter
When generic_write_checks() returns zero, it means that
iov_iter_count() is zero, and there is no work to do.
Simply return success like all other filesystems do, rather than
proceeding down the write path, which today yields an -EFAULT in
generic_perform_write() via the
(fault_in_iov_iter_readable(i, bytes) == bytes) check when bytes
== 0.
Fixes: 11a347fb6cef ("exfat: change to get file size from DataLength") Reported-by: Noah <kernel-org-10@maxgrass.eu> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Yuezhang Mo <Yuezhang.Mo@sony.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Namjae Jeon [Fri, 31 Jan 2025 03:55:55 +0000 (12:55 +0900)]
exfat: fix soft lockup in exfat_clear_bitmap
bitmap clear loop will take long time in __exfat_free_cluster()
if data size of file/dir enty is invalid.
If cluster bit in bitmap is already clear, stop clearing bitmap go to
out of loop.
Fixes: 31023864e67a ("exfat: add fat entry operations") Reported-by: Kun Hu <huk23@m.fudan.edu.cn>, Jiaji Qin <jjtan24@m.fudan.edu.cn> Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Yuezhang Mo [Fri, 22 Nov 2024 02:50:55 +0000 (10:50 +0800)]
exfat: fix just enough dentries but allocate a new cluster to dir
This commit fixes the condition for allocating cluster to parent
directory to avoid allocating new cluster to parent directory when
there are just enough empty directory entries at the end of the
parent directory.
Fixes: af02c72d0b62 ("exfat: convert exfat_find_empty_entry() to use dentry cache") Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
====================
Permission checks for dynamic POSIX clocks
Dynamic clocks - such as PTP clocks - extend beyond the standard POSIX
clock API by using ioctl calls. While file permissions are enforced for
standard POSIX operations, they are not implemented for ioctl calls,
since the POSIX layer cannot differentiate between calls which modify
the clock's state (like enabling PPS output generation) and those that
don't (such as retrieving the clock's PPS capabilities).
On the other hand, drivers implementing the dynamic clocks lack the
necessary information context to enforce permission checks themselves.
Additionally, POSIX clock layer requires the WRITE permission even for
readonly adjtime() operations before invoking the callback.
Add a struct file pointer to the POSIX clock context and use it to
implement the appropriate permission checks on PTP chardevs. Permit
readonly adjtime() for dynamic clocks. Add a readonly option to testptp.
Changes in v4:
- Allow readonly adjtime() for dynamic clocks, as suggested by Thomas
Changes in v3:
- Reword the log message for commit against posix-clock and fix
documentation of struct posix_clock_context, as suggested by Thomas
Changes in v2:
- Store file pointer in POSIX clock context rather than fmode in the PTP
clock's private data, as suggested by Richard.
- Move testptp.c changes into separate patch.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Wojtek Wasko [Mon, 3 Mar 2025 16:13:45 +0000 (18:13 +0200)]
testptp: Add option to open PHC in readonly mode
PTP Hardware Clocks no longer require WRITE permission to perform
readonly operations, such as listing device capabilities or listening to
EXTTS events once they have been enabled by a process with WRITE
permissions.
Add '-r' option to testptp to open the PHC in readonly mode instead of
the default read-write mode. Skip enabling EXTTS if readonly mode is
requested.
Acked-by: Richard Cochran <richardcochran@gmail.com> Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Signed-off-by: Wojtek Wasko <wwasko@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Many devices implement highly accurate clocks, which the kernel manages
as PTP Hardware Clocks (PHCs). Userspace applications rely on these
clocks to timestamp events, trace workload execution, correlate
timescales across devices, and keep various clocks in sync.
The kernel’s current implementation of PTP clocks does not enforce file
permissions checks for most device operations except for POSIX clock
operations, where file mode is verified in the POSIX layer before
forwarding the call to the PTP subsystem. Consequently, it is common
practice to not give unprivileged userspace applications any access to
PTP clocks whatsoever by giving the PTP chardevs 600 permissions. An
example of users running into this limitation is documented in [1].
Additionally, POSIX layer requires WRITE permission even for readonly
adjtime() calls which are used in PTP layer to return current frequency
offset applied to the PHC.
Add permission checks for functions that modify the state of a PTP
device. Continue enforcing permission checks for POSIX clock operations
(settime, adjtime) in the POSIX layer. Only require WRITE access for
dynamic clocks adjtime() if any flags are set in the modes field.
Changes in v4:
- Require FMODE_WRITE in ajtime() only for calls modifying the clock in
any way.
Acked-by: Richard Cochran <richardcochran@gmail.com> Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Signed-off-by: Wojtek Wasko <wwasko@nvidia.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
Wojtek Wasko [Mon, 3 Mar 2025 16:13:43 +0000 (18:13 +0200)]
posix-clock: Store file pointer in struct posix_clock_context
File descriptor based pc_clock_*() operations of dynamic posix clocks
have access to the file pointer and implement permission checks in the
generic code before invoking the relevant dynamic clock callback.
Character device operations (open, read, poll, ioctl) do not implement a
generic permission control and the dynamic clock callbacks have no
access to the file pointer to implement them.
Extend struct posix_clock_context with a struct file pointer and
initialize it in posix_clock_open(), so that all dynamic clock callbacks
can access it.
Acked-by: Richard Cochran <richardcochran@gmail.com> Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Wojtek Wasko <wwasko@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Wed, 5 Mar 2025 05:05:53 +0000 (19:05 -1000)]
Merge tag 'x86_microcode_for_v6.14_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull AMD microcode loading fixes from Borislav Petkov:
- Load only sha256-signed microcode patch blobs
- Other good cleanups
* tag 'x86_microcode_for_v6.14_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/microcode/AMD: Load only SHA256-checksummed patches
x86/microcode/AMD: Add get_patch_level()
x86/microcode/AMD: Get rid of the _load_microcode_amd() forward declaration
x86/microcode/AMD: Merge early_apply_microcode() into its single callsite
x86/microcode/AMD: Remove unused save_microcode_in_initrd_amd() declarations
x86/microcode/AMD: Remove ugly linebreak in __verify_patch_section() signature
Dan Carpenter [Mon, 3 Mar 2025 12:02:12 +0000 (15:02 +0300)]
net: Prevent use after free in netif_napi_set_irq_locked()
The cpu_rmap_put() will call kfree() when the last reference is dropped
so it could result in a use after free when we dereference the same
pointer the next line. Move the cpu_rmap_put() after the dereference.
Sean Anderson [Mon, 3 Mar 2025 23:18:32 +0000 (18:18 -0500)]
net: cadence: macb: Synchronize standard stats
The new stats calculations add several additional calls to
macb/gem_update_stats() and accesses to bp->hw_stats. These are
protected by a spinlock since commit fa52f15c745c ("net: cadence: macb:
Synchronize stats calculations"), which was applied in parallel. Add
some locking now that the net has been merged into net-next.
Eric Dumazet [Sun, 2 Mar 2025 12:42:37 +0000 (12:42 +0000)]
tcp: use RCU lookup in __inet_hash_connect()
When __inet_hash_connect() has to try many 4-tuples before
finding an available one, we see a high spinlock cost from
the many spin_lock_bh(&head->lock) performed in its loop.
This patch adds an RCU lookup to avoid the spinlock cost.
check_established() gets a new @rcu_lookup argument.
First reason is to not make any changes while head->lock
is not held.
Second reason is to not make this RCU lookup a second time
after the spinlock has been acquired.
Eric Dumazet [Sun, 2 Mar 2025 12:42:34 +0000 (12:42 +0000)]
tcp: use RCU in __inet{6}_check_established()
When __inet_hash_connect() has to try many 4-tuples before
finding an available one, we see a high spinlock cost from
__inet_check_established() and/or __inet6_check_established().
This patch adds an RCU lookup to avoid the spinlock
acquisition when the 4-tuple is found in the hash table.
Note that there are still spin_lock_bh() calls in
__inet_hash_connect() to protect inet_bind_hashbucket,
this will be fixed later in this series.
Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Tested-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250302124237.3913746-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Krister Johansen [Mon, 3 Mar 2025 17:10:13 +0000 (18:10 +0100)]
mptcp: fix 'scheduling while atomic' in mptcp_pm_nl_append_new_local_addr
If multiple connection requests attempt to create an implicit mptcp
endpoint in parallel, more than one caller may end up in
mptcp_pm_nl_append_new_local_addr because none found the address in
local_addr_list during their call to mptcp_pm_nl_get_local_id. In this
case, the concurrent new_local_addr calls may delete the address entry
created by the previous caller. These deletes use synchronize_rcu, but
this is not permitted in some of the contexts where this function may be
called. During packet recv, the caller may be in a rcu read critical
section and have preemption disabled.
An example stack:
BUG: scheduling while atomic: swapper/2/0/0x00000302
This problem seems particularly prevalent if the user advertises an
endpoint that has a different external vs internal address. In the case
where the external address is advertised and multiple connections
already exist, multiple subflow SYNs arrive in parallel which tends to
trigger the race during creation of the first local_addr_list entries
which have the internal address instead.
Fix by skipping the replacement of an existing implicit local address if
called via mptcp_pm_nl_get_local_id.