Cong Wang [Thu, 3 Apr 2025 21:10:23 +0000 (14:10 -0700)]
sch_htb: make htb_qlen_notify() idempotent
htb_qlen_notify() always deactivates the HTB class and in fact could
trigger a warning if it is already deactivated. Therefore, it is not
idempotent and not friendly to its callers, like fq_codel_dequeue().
Let's make it idempotent to ease qdisc_tree_reduce_backlog() callers'
life.
Reported-by: Gerrard Tai <gerrard.tai@starlabs.sg> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250403211033.166059-2-xiyou.wangcong@gmail.com Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
In case the backlog transmit queue for system-importance messages is overloaded,
tipc_link_xmit() returns -ENOBUFS but the skb list is not purged. This leads to
memory leak and failure when a skb is allocated.
This commit fixes this issue by purging the skb list before tipc_link_xmit()
returns.
Jakub Kicinski [Mon, 7 Apr 2025 18:00:04 +0000 (11:00 -0700)]
Merge branch 'fix-wrong-hds-thresh-value-setting'
Taehee Yoo says:
====================
fix wrong hds-thresh value setting
A hds-thresh value is not set correctly if input value is 0.
The cause is that ethtool_ringparam_get_cfg(), which is a internal
function that returns ringparameters from both ->get_ringparam() and
dev->cfg can't return a correct hds-thresh value.
The first patch fixes ethtool_ringparam_get_cfg() to set hds-thresh
value correcltly.
The second patch adds random test for hds-thresh value.
So that we can test 0 value for a hds-thresh properly.
====================
selftests: drv-net: test random value for hds-thresh
hds.py has been testing 0(set_hds_thresh_zero()),
MAX(set_hds_thresh_max()), GT(set_hds_thresh_gt()) values for hds-thresh.
However if a hds-thresh value was already 0, set_hds_thresh_zero()
can't test properly.
So, it tests random value first and then tests 0, MAX, GT values.
Testing bnxt:
TAP version 13
1..13
ok 1 hds.get_hds
ok 2 hds.get_hds_thresh
ok 3 hds.set_hds_disable # SKIP disabling of HDS not supported by
the device
ok 4 hds.set_hds_enable
ok 5 hds.set_hds_thresh_random
ok 6 hds.set_hds_thresh_zero
ok 7 hds.set_hds_thresh_max
ok 8 hds.set_hds_thresh_gt
ok 9 hds.set_xdp
ok 10 hds.enabled_set_xdp
ok 11 hds.ioctl
ok 12 hds.ioctl_set_xdp
ok 13 hds.ioctl_enabled_set_xdp
# Totals: pass:12 fail:0 xfail:0 xpass:0 skip:1 error:0
Testing lo:
TAP version 13
1..13
ok 1 hds.get_hds # SKIP tcp-data-split not supported by device
ok 2 hds.get_hds_thresh # SKIP hds-thresh not supported by device
ok 3 hds.set_hds_disable # SKIP ring-set not supported by the device
ok 4 hds.set_hds_enable # SKIP ring-set not supported by the device
ok 5 hds.set_hds_thresh_random # SKIP hds-thresh not supported by
device
ok 6 hds.set_hds_thresh_zero # SKIP ring-set not supported by the
device
ok 7 hds.set_hds_thresh_max # SKIP hds-thresh not supported by
device
ok 8 hds.set_hds_thresh_gt # SKIP hds-thresh not supported by device
ok 9 hds.set_xdp # SKIP tcp-data-split not supported by device
ok 10 hds.enabled_set_xdp # SKIP tcp-data-split not supported by
device
ok 11 hds.ioctl # SKIP tcp-data-split not supported by device
ok 12 hds.ioctl_set_xdp # SKIP tcp-data-split not supported by
device
ok 13 hds.ioctl_enabled_set_xdp # SKIP tcp-data-split not supported
by device
# Totals: pass:0 fail:0 xfail:0 xpass:0 skip:13 error:0
net: ethtool: fix ethtool_ringparam_get_cfg() returns a hds_thresh value always as 0.
When hds-thresh is configured, ethnl_set_rings() is called, and it calls
ethtool_ringparam_get_cfg() to get ringparameters from .get_ringparam()
callback and dev->cfg.
Both hds_config and hds_thresh values should be set from dev->cfg, not
from .get_ringparam().
But ethtool_ringparam_get_cfg() sets only hds_config from dev->cfg.
So, ethtool_ringparam_get_cfg() returns always a hds_thresh as 0.
If an input value of hds-thresh is 0, a hds_thresh value from
ethtool_ringparam_get_cfg() are same. So ethnl_set_rings() does
nothing and returns immediately.
It causes a bug that setting a hds-thresh value to 0 is not working.
Reproducer:
modprobe netdevsim
echo 1 > /sys/bus/netdevsim/new_device
ethtool -G eth0 hds-thresh 100
ethtool -G eth0 hds-thresh 0
ethtool -g eth0
#hds-thresh value should be 0, but it shows 100.
The tools/testing/selftests/drivers/net/hds.py can test it too with
applying a following patch for hds.py.
Merge tag 'net-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from netfilter.
Current release - regressions:
- four fixes for the netdev per-instance locking
Current release - new code bugs:
- consolidate more code between existing Rx zero-copy and uring so
that the latter doesn't miss / have to duplicate the safety checks
Previous releases - regressions:
- ipv6: fix omitted Netlink attributes when using SKIP_STATS
Previous releases - always broken:
- net: fix geneve_opt length integer overflow
- udp: fix multiple wrap arounds of sk->sk_rmem_alloc when it
approaches INT_MAX
- dsa: mvpp2: add a lock to avoid corruption of the shared TCAM
- dsa: airoha: fix issues with traffic QoS configuration / offload,
and flow table offload
Misc:
- touch up the Netlink YAML specs of old families to make them usable
for user space C codegen"
* tag 'net-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (56 commits)
selftests: net: amt: indicate progress in the stress test
netlink: specs: rt_route: pull the ifa- prefix out of the names
netlink: specs: rt_addr: pull the ifa- prefix out of the names
netlink: specs: rt_addr: fix get multi command name
netlink: specs: rt_addr: fix the spec format / schema failures
net: avoid false positive warnings in __net_mp_close_rxq()
net: move mp dev config validation to __net_mp_open_rxq()
net: ibmveth: make veth_pool_store stop hanging
arcnet: Add NULL check in com20020pci_probe()
ipv6: Do not consider link down nexthops in path selection
ipv6: Start path selection from the first nexthop
usbnet:fix NPE during rx_complete
net: octeontx2: Handle XDP_ABORTED and XDP invalid as XDP_DROP
net: fix geneve_opt length integer overflow
io_uring/zcrx: fix selftests w/ updated netdev Python helpers
selftests: net: use netdevsim in netns test
docs: net: document netdev notifier expectations
net: dummy: request ops lock
netdevsim: add dummy device notifiers
net: rename rtnl_net_debug to lock_debug
...
Merge tag 'spi-fix-v6.15-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi
Pull spi fixes from Mark Brown:
"A small collection of fixes that came in during the merge window,
everything is driver specific with nothing standing out particularly"
* tag 'spi-fix-v6.15-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
spi: bcm2835: Restore native CS probing when pinctrl-bcm2835 is absent
spi: bcm2835: Do not call gpiod_put() on invalid descriptor
spi: cadence-qspi: revert "Improve spi memory performance"
spi: cadence: Fix out-of-bounds array access in cdns_mrvl_xspi_setup_clock()
spi: fsl-qspi: use devm function instead of driver remove
spi: SPI_QPIC_SNAND should be tristate and depend on MTD
spi-rockchip: Fix register out of bounds access
Merge tag 'soc-drivers-6.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull more SoC driver updates from Arnd Bergmann:
"This is the promised follow-up to the soc drivers branch, adding minor
updates to omap and freescale drivers.
Most notably, Ioana Ciornei takes over maintenance of the DPAA bus
driver used in some NXP (originally Freescale) chips"
* tag 'soc-drivers-6.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc:
bus: fsl-mc: Remove deadcode
MAINTAINERS: add the linuppc-dev list to the fsl-mc bus entry
MAINTAINERS: fix nonexistent dtbinding file name
MAINTAINERS: add myself as maintainer for the fsl-mc bus
irqdomain: soc: Switch to irq_find_mapping()
Input: tsc2007 - accept standard properties
Merge tag 'platform-drivers-x86-v6.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86
Pull x86 platform driver fixes from Ilpo Järvinen:
- thinkpad_acpi:
- Fix NULL pointer dereferences while probing
- Disable ACPI fan access for T495* and E560
- ISST: Correct command storage data length
* tag 'platform-drivers-x86-v6.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86:
MAINTAINERS: consistently use my dedicated email address
platform/x86: ISST: Correct command storage data length
platform/x86: thinkpad_acpi: disable ACPI fan access for T495* and E560
platform/x86: thinkpad_acpi: Fix NULL pointer dereferences while probing
Jakub Kicinski [Thu, 3 Apr 2025 14:56:36 +0000 (07:56 -0700)]
selftests: net: amt: indicate progress in the stress test
Our CI expects output from the test at least once every 10 minutes.
The AMT test when running on debug kernel is just on the edge
of that time for the stress test. Improve the output:
- print the name of the test first, before starting it,
- output a dot every 10% of the way.
Output after:
TEST: amt discovery [ OK ]
TEST: IPv4 amt multicast forwarding [ OK ]
TEST: IPv6 amt multicast forwarding [ OK ]
TEST: IPv4 amt traffic forwarding torture .......... [ OK ]
TEST: IPv6 amt traffic forwarding torture .......... [ OK ]
Jakub Kicinski [Thu, 3 Apr 2025 01:37:06 +0000 (18:37 -0700)]
netlink: specs: rt_route: pull the ifa- prefix out of the names
YAML specs don't normally include the C prefix name in the name
of the YAML attr. Remove the ifa- prefix from all attributes
in route-attrs and metrics and specify name-prefix instead.
This is a bit risky, hopefully there aren't many users out there.
Jakub Kicinski [Thu, 3 Apr 2025 01:37:05 +0000 (18:37 -0700)]
netlink: specs: rt_addr: pull the ifa- prefix out of the names
YAML specs don't normally include the C prefix name in the name
of the YAML attr. Remove the ifa- prefix from all attributes
in addr-attrs and specify name-prefix instead.
This is a bit risky, hopefully there aren't many users out there.
====================
net: make memory provider install / close paths more common
We seem to be fixing bugs in config path for devmem which also exist
in the io_uring ZC path. Let's try to make the two paths more common,
otherwise this is bound to keep happening.
Jakub Kicinski [Thu, 3 Apr 2025 01:34:05 +0000 (18:34 -0700)]
net: avoid false positive warnings in __net_mp_close_rxq()
Commit under Fixes solved the problem of spurious warnings when we
uninstall an MP from a device while its down. The __net_mp_close_rxq()
which is used by io_uring was not fixed. Move the fix over and reuse
__net_mp_close_rxq() in the devmem path.
Acked-by: Stanislav Fomichev <sdf@fomichev.me> Fixes: a70f891e0fa0 ("net: devmem: do not WARN conditionally after netdev_rx_queue_restart()") Reviewed-by: Mina Almasry <almasrymina@google.com> Link: https://patch.msgid.link/20250403013405.2827250-3-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 3 Apr 2025 01:34:04 +0000 (18:34 -0700)]
net: move mp dev config validation to __net_mp_open_rxq()
devmem code performs a number of safety checks to avoid having
to reimplement all of them in the drivers. Move those to
__net_mp_open_rxq() and reuse that function for binding to make
sure that io_uring ZC also benefits from them.
While at it rename the queue ID variable to rxq_idx in
__net_mp_open_rxq(), we touch most of the relevant lines.
The XArray insertion is reordered after the netdev_rx_queue_restart()
call, otherwise we'd need to duplicate the queue index check
or risk inserting an invalid pointer. The XArray allocation
failures should be extremely rare.
Reviewed-by: Mina Almasry <almasrymina@google.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Fixes: 6e18ed929d3b ("net: add helpers for setting a memory provider on an rx queue") Link: https://patch.msgid.link/20250403013405.2827250-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Dave Marquardt [Wed, 2 Apr 2025 15:44:03 +0000 (10:44 -0500)]
net: ibmveth: make veth_pool_store stop hanging
v2:
- Created a single error handling unlock and exit in veth_pool_store
- Greatly expanded commit message with previous explanatory-only text
Summary: Use rtnl_mutex to synchronize veth_pool_store with itself,
ibmveth_close and ibmveth_open, preventing multiple calls in a row to
napi_disable.
Background: Two (or more) threads could call veth_pool_store through
writing to /sys/devices/vio/30000002/pool*/*. You can do this easily
with a little shell script. This causes a hang.
I configured LOCKDEP, compiled ibmveth.c with DEBUG, and built a new
kernel. I ran this test again and saw:
The control APIs are not idempotent. Control API calls are safe
against concurrent use of datapath APIs but an incorrect sequence of
control API calls may result in crashes, deadlocks, or race
conditions. For example, calling napi_disable() multiple times in a
row will deadlock.
In the normal open and close paths, rtnl_mutex is acquired to prevent
other callers. This is missing from veth_pool_store. Use rtnl_mutex in
veth_pool_store fixes these hangs.
Signed-off-by: Dave Marquardt <davemarq@linux.ibm.com> Fixes: 860f242eb534 ("[PATCH] ibmveth change buffer pools dynamically") Reviewed-by: Nick Child <nnac123@linux.ibm.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250402154403.386744-1-davemarq@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Henry Martin [Wed, 2 Apr 2025 13:50:36 +0000 (21:50 +0800)]
arcnet: Add NULL check in com20020pci_probe()
devm_kasprintf() returns NULL when memory allocation fails. Currently,
com20020pci_probe() does not check for this case, which results in a
NULL pointer dereference.
Add NULL check after devm_kasprintf() to prevent this issue and ensure
no resources are left allocated.
ipv6: Do not consider link down nexthops in path selection
Nexthops whose link is down are not supposed to be considered during
path selection when the "ignore_routes_with_linkdown" sysctl is set.
This is done by assigning them a negative region boundary.
However, when comparing the computed hash (unsigned) with the region
boundary (signed), the negative region boundary is treated as unsigned,
resulting in incorrect nexthop selection.
Fix by treating the computed hash as signed. Note that the computed hash
is always in range of [0, 2^31 - 1].
Fixes: 3d709f69a3e7 ("ipv6: Use hash-threshold instead of modulo-N") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250402114224.293392-3-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Cited commit transitioned IPv6 path selection to use hash-threshold
instead of modulo-N. With hash-threshold, each nexthop is assigned a
region boundary in the multipath hash function's output space and a
nexthop is chosen if the calculated hash is smaller than the nexthop's
region boundary.
Hash-threshold does not work correctly if path selection does not start
with the first nexthop. For example, if fib6_select_path() is always
passed the last nexthop in the group, then it will always be chosen
because its region boundary covers the entire hash function's output
space.
Fix this by starting the selection process from the first nexthop and do
not consider nexthops for which rt6_score_route() provided a negative
score.
Fixes: 3d709f69a3e7 ("ipv6: Use hash-threshold instead of modulo-N") Reported-by: Stanislav Fomichev <stfomichev@gmail.com> Closes: https://lore.kernel.org/netdev/Z9RIyKZDNoka53EO@mini-arch/ Signed-off-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20250402114224.293392-2-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ying Lu [Wed, 2 Apr 2025 08:58:59 +0000 (16:58 +0800)]
usbnet:fix NPE during rx_complete
Missing usbnet_going_away Check in Critical Path.
The usb_submit_urb function lacks a usbnet_going_away
validation, whereas __usbnet_queue_skb includes this check.
This inconsistency creates a race condition where:
A URB request may succeed, but the corresponding SKB data
fails to be queued.
Subsequent processes:
(e.g., rx_complete → defer_bh → __skb_unlink(skb, list))
attempt to access skb->next, triggering a NULL pointer
dereference (Kernel Panic).
Lorenzo Bianconi [Tue, 1 Apr 2025 09:02:12 +0000 (11:02 +0200)]
net: octeontx2: Handle XDP_ABORTED and XDP invalid as XDP_DROP
In the current implementation octeontx2 manages XDP_ABORTED and XDP
invalid as XDP_PASS forwarding the skb to the networking stack.
Align the behaviour to other XDP drivers handling XDP_ABORTED and XDP
invalid as XDP_DROP.
Please note this patch has just compile tested.
Merge tag 'x86-urgent-2025-04-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
- Fix a performance regression on AMD iGPU and dGPU drivers, related to
the unintended activation of DMA bounce buffers that regressed game
performance if KASLR disturbed things just enough
- Fix a copy_user_generic() performance regression on certain older
non-FSRM/ERMS CPUs
- Fix a Clang build warning due to a semantic merge conflict the Kunit
tree generated with the x86 tree
- Fix FRED related system hang during S4 resume
- Remove an unused API
* tag 'x86-urgent-2025-04-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/fred: Fix system hang during S4 resume with FRED enabled
x86/platform/iosf_mbi: Remove unused iosf_mbi_unregister_pmic_bus_access_notifier()
x86/mm/init: Handle the special case of device private pages in add_pages(), to not increase max_pfn and trigger dma_addressing_limited() bounce buffers
x86/tools: Drop duplicate unlikely() definition in insn_decoder_test.c
x86/uaccess: Improve performance by aligning writes to 8 bytes in copy_user_generic(), on non-FSRM/ERMS CPUs
Merge tag 'sound-fix-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"A collection of device-specific fixes that have been gathered since
the previous pull:
- A few more HD-audio quirks and fixups
- A series of Qualcomm AudioReach fixes
- Various small fixes for ASoC rt5665, WSA, SOF and Cirrus"
* tag 'sound-fix-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: hda/realtek: Fix built-in mic on another ASUS VivoBook model
ALSA: hda/realtek - Support mute led function for HP platform
ASoC: imx-card: Add NULL check in imx_card_probe()
ASoC: codecs: rt5665: Fix some error handling paths in rt5665_probe()
ASoC: q6apm-dai: make use of q6apm_get_hw_pointer
ASoC: qdsp6: q6apm-dai: fix capture pipeline overruns.
ASoC: qdsp6: q6apm-dai: set 10 ms period and buffer alignment.
ASoC: q6apm: add q6apm_get_hw_pointer helper
ASoC: q6apm-dai: schedule all available frames to avoid dsp under-runs
ASoC: SOF: hda/ptl: Move mic privacy change notification sending to a work
ALSA/hda: intel-sdw-acpi: Remove (explicitly) unused header
ALSA: hda/realtek: Enable Mute LED on HP OMEN 16 Laptop xd000xx
ALSA: hda/tas2781: Upgrade calibratd-data writing code to support Alpha and Beta dsp firmware
ASoC: qdsp6: q6asm-dai: fix q6asm_dai_compr_set_params error path
ALSA: hda/realtek: Fix built-in mic breakage on ASUS VivoBook X515JA
ASoC: sma1307: Fix error handling in sma1307_setting_loaded()
ASoC: codecs: wsa884x: Correct VI sense channel mask
ASoC: codecs: wsa883x: Correct VI sense channel mask
firmware: cs_dsp: Ensure cs_dsp_load[_coeff]() returns 0 on success
Merge tag 'omap-for-v6.14/drivers-signed' of https://git.kernel.org/pub/scm/linux/kernel/git/khilman/linux-omap into soc/drivers-2
arm/omap: drivers: updates for v6.14
* tag 'omap-for-v6.14/drivers-signed' of https://git.kernel.org/pub/scm/linux/kernel/git/khilman/linux-omap:
Input: tsc2007 - accept standard properties
Merge tag 'soc_fsl-6.15-1' of https://github.com/chleroy/linux into soc/drivers-2
FSL SOC Changes for 6.15:
- irqdomain cleanups from Jiry
- Add Ioana as Maintainer of fsl-mc bus and remove Laurentiu and Stuart
- Remove deadcode from fsl-mc bus
* tag 'soc_fsl-6.15-1' of https://github.com/chleroy/linux:
bus: fsl-mc: Remove deadcode
MAINTAINERS: add the linuppc-dev list to the fsl-mc bus entry
MAINTAINERS: fix nonexistent dtbinding file name
MAINTAINERS: add myself as maintainer for the fsl-mc bus
irqdomain: soc: Switch to irq_find_mapping()
Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull dcache fixes from Al Viro:
"Fixes for bugs caught as part of tree-in-dcache work.
Mostly dentry refcount mishandling"
* tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
hypfs_create_cpu_files(): add missing check for hypfs_mkdir() failure
qibfs: fix _another_ leak
spufs: fix a leak in spufs_create_context()
spufs: fix gang directory lifetimes
spufs: fix a leak on spufs_new_file() failure
Jakub Kicinski [Thu, 3 Apr 2025 23:23:00 +0000 (16:23 -0700)]
Merge tag 'nf-25-04-03' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following batch contains Netfilter fixes for net:
1) conncount incorrectly removes element for non-dynamic sets,
these elements represent a static control plane configuration,
leave them in place.
2) syzbot found a way to unregister a basechain that has been never
registered from the chain update path, fix from Florian Westphal.
3) Fix incorrect pointer arithmetics in geneve support for tunnel,
from Lin Ma.
* tag 'nf-25-04-03' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nft_tunnel: fix geneve_opt type confusion addition
netfilter: nf_tables: don't unregister hook when table is dormant
netfilter: nft_set_hash: GC reaps elements with conncount for dynamic sets only
====================
Merge tag 'v6.15rc-part2-ksmbd-server-fixes' of git://git.samba.org/ksmbd
Pull smb server fixes from Steve French:
"Four ksmbd SMB3 server fixes, all also for stable"
* tag 'v6.15rc-part2-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: fix null pointer dereference in alloc_preauth_hash()
ksmbd: validate zero num_subauth before sub_auth is accessed
ksmbd: fix overflow in dacloffset bounds check
ksmbd: fix session use-after-free in multichannel connection
Merge tag 'trace-ringbuffer-v6.15-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull ring-buffer updates from Steven Rostedt:
"Persistent buffer cleanups and simplifications.
It was mistaken that the physical memory returned from "reserve_mem"
had to be vmap()'d to get to it from a virtual address. But
reserve_mem already maps the memory to the virtual address of the
kernel so a simple phys_to_virt() can be used to get to the virtual
address from the physical memory returned by "reserve_mem". With this
new found knowledge, the code can be cleaned up and simplified.
- Enforce that the persistent memory is page aligned
As the buffers using the persistent memory are all going to be
mapped via pages, make sure that the memory given to the tracing
infrastructure is page aligned. If it is not, it will print a
warning and fail to map the buffer.
- Use phys_to_virt() to get the virtual address from reserve_mem
Instead of calling vmap() on the physical memory returned from
"reserve_mem", use phys_to_virt() instead.
As the memory returned by "memmap" or any other means where a
physical address is given to the tracing infrastructure, it still
needs to be vmap(). Since this memory can never be returned back to
the buddy allocator nor should it ever be memmory mapped to user
space, flag this buffer and up the ref count. The ref count will
keep it from ever being freed, and the flag will prevent it from
ever being memory mapped to user space.
- Use vmap_page_range() for memmap virtual address mapping
For the memmap buffer, instead of allocating an array of struct
pages, assigning them to the contiguous phsycial memory and then
passing that to vmap(), use vmap_page_range() instead
- Replace flush_dcache_folio() with flush_kernel_vmap_range()
Instead of calling virt_to_folio() and passing that to
flush_dcache_folio(), just call flush_kernel_vmap_range() directly.
This also fixes a bug where if a subbuffer was bigger than
PAGE_SIZE only the PAGE_SIZE portion would be flushed"
* tag 'trace-ringbuffer-v6.15-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
ring-buffer: Use flush_kernel_vmap_range() over flush_dcache_folio()
tracing: Use vmap_page_range() to map memmap ring buffer
tracing: Have reserve_mem use phys_to_virt() and separate from memmap buffer
tracing: Enforce the persistent ring buffer to be page aligned
Merge tag 'block-6.15-20250403' of git://git.kernel.dk/linux
Pull more block updates from Jens Axboe:
- NVMe pull request via Keith:
- PCI endpoint target cleanup (Damien)
- Early import for uring_cmd fixed buffer (Caleb)
- Multipath documentation and notification improvements (John)
- Invalid pci sq doorbell write fix (Maurizio)
- Queue init locking fix
- Remove dead nsegs parameter from blk_mq_get_new_requests()
* tag 'block-6.15-20250403' of git://git.kernel.dk/linux:
block: don't grab elevator lock during queue initialization
nvme-pci: skip nvme_write_sq_db on empty rqlist
nvme-multipath: change the NVME_MULTIPATH config option
nvme: update the multipath warning in nvme_init_ns_head
nvme/ioctl: move fixed buffer lookup to nvme_uring_cmd_io()
nvme/ioctl: move blk_mq_free_request() out of nvme_map_user_request()
nvme/ioctl: don't warn on vectorized uring_cmd with fixed buffer
nvmet: pci-epf: Keep completion queues mapped
block: remove unused nseg parameter
For igc:
Joe Damato removes unmapping of XSK queues from NAPI instance.
Zdenek Bouska swaps condition checks/call to prevent AF_XDP Tx drops
with low budget value.
For e1000e:
Vitaly adjusts Kumeran interface configuration to prevent MDI errors.
For ixgbe:
Piotr clears PHY high values on media type detection to ensure stale
values are not used.
For idpf:
Emil adjusts shutdown calls to prevent NULL pointer dereference.
* '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
idpf: fix adapter NULL pointer dereference on reboot
ixgbe: fix media type detection for E610 device
e1000e: change k1 configuration on MTP and later platforms
igc: Fix TX drops in XDP ZC
igc: Fix XSK queue NAPI ID mapping
====================
Merge tag 'io_uring-6.15-20250403' of git://git.kernel.dk/linux
Pull more io_uring updates from Jens Axboe:
"Set of fixes/updates for io_uring that should go into this release.
The ublk bits could've gone via either tree - usually I put them in
block, but they got a bit mixed this series with the zero-copy
supported that ended up dipping into both trees.
This contains:
- Fix for sendmsg zc, include in pinned pages accounting like we do
for the other zc types
- Series for ublk fixing request aborting, doing various little
cleanups, fixing some zc issues, and adding queue_rqs support
- Another ublk series doing some code cleanups
- Series cleaning up the io_uring send path, mostly in preparation
for registered buffers
- Series doing little MSG_RING cleanups
- Fix for the newly added zc rx, fixing len being 0 for the last
invocation of the callback
- Add vectored registered buffer support for ublk. With that, then
ublk also supports this feature in the kernel revision where it
could generically introduced for rw/net
- A bunch of selftest additions for ublk. This is the majority of the
diffstat
- Silence a KCSAN data race warning for io-wq
- Various little cleanups and fixes"
* tag 'io_uring-6.15-20250403' of git://git.kernel.dk/linux: (44 commits)
io_uring: always do atomic put from iowq
selftests: ublk: enable zero copy for stripe target
io_uring: support vectored kernel fixed buffer
block: add for_each_mp_bvec()
io_uring: add validate_fixed_range() for validate fixed buffer
selftests: ublk: kublk: fix an error log line
selftests: ublk: kublk: use ioctl-encoded opcodes
io_uring/zcrx: return early from io_zcrx_recv_skb if readlen is 0
io_uring/net: avoid import_ubuf for regvec send
io_uring/rsrc: check size when importing reg buffer
io_uring: cleanup {g,s]etsockopt sqe reading
io_uring: hide caches sqes from drivers
io_uring: make zcrx depend on CONFIG_IO_URING
io_uring: add req flag invariant build assertion
Documentation: ublk: remove dead footnote
selftests: ublk: specify io_cmd_buf pointer type
ublk: specify io_cmd_buf pointer type
io_uring: don't pass ctx to tw add remote helper
io_uring/msg: initialise msg request opcode
io_uring/msg: rename io_double_lock_ctx()
...
Lin Ma [Wed, 2 Apr 2025 16:56:32 +0000 (00:56 +0800)]
net: fix geneve_opt length integer overflow
struct geneve_opt uses 5 bit length for each single option, which
means every vary size option should be smaller than 128 bytes.
However, all current related Netlink policies cannot promise this
length condition and the attacker can exploit a exact 128-byte size
option to *fake* a zero length option and confuse the parsing logic,
further achieve heap out-of-bounds read.
Fix these issues by enforing correct length condition in related
policies.
Fixes: 925d844696d9 ("netfilter: nft_tunnel: add support for geneve opts") Fixes: 4ece47787077 ("lwtunnel: add options setting and dumping for geneve") Fixes: 0ed5269f9e41 ("net/sched: add tunnel option support to act_tunnel_key") Fixes: 0a6e77784f49 ("net/sched: allow flower to match tunnel options") Signed-off-by: Lin Ma <linma@zju.edu.cn> Reviewed-by: Xin Long <lucien.xin@gmail.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Link: https://patch.msgid.link/20250402165632.6958-1-linma@zju.edu.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Merge tag '9p-for-6.15-rc1' of https://github.com/martinetd/linux
Pull 9p updates from Dominique Martinet:
- fix handling of bogus (negative/too long) replies
- fix crash on mkdir with ACLs (... looks like nobody is using ACLs
with semi-recent kernels...)
- ipv6 support for trans=tcp
- minor concurrency fix to make syzbot happy
- minor cleanup
* tag '9p-for-6.15-rc1' of https://github.com/martinetd/linux:
docs: fs/9p: Add missing "not" in cache documentation
9p: Use hashtable.h for hash_errmap
Documentation/fs/9p: fix broken link
9p/trans_fd: mark concurrent read and writes to p9_conn->err
9p/net: return error on bogus (longer than requested) replies
9p/net: fix improper handling of bogus negative read/write replies
fs/9p: fix NULL pointer dereference on mkdir
net/9p/fd: support ipv6 for trans=tcp
====================
net: hold instance lock during NETDEV_UP/REGISTER
Solving the issue reported by Cosmin in [0] requires consistent
lock during NETDEV_UP/REGISTER notifiers. This series
addresses that (along with some other fixes in net/ipv4/devinet.c
and net/ipv6/addrconf.c) and appends the patches from Jakub
that were conditional on consistent locking in NETDEV_UNREGISTER.
Stanislav Fomichev [Tue, 1 Apr 2025 16:34:47 +0000 (09:34 -0700)]
net: dummy: request ops lock
Even though dummy device doesn't really need an instance lock,
a lot of selftests use dummy so it's useful to have extra
expose to the instance lock on NIPA. Request the instance/ops
locking.
Stanislav Fomichev [Tue, 1 Apr 2025 16:34:46 +0000 (09:34 -0700)]
netdevsim: add dummy device notifiers
In order to exercise and verify notifiers' locking assumptions,
register dummy notifiers (via register_netdevice_notifier_dev_net).
Share notifier event handler that enforces the assumptions with
lock_debug.c (rename and export rtnl_net_debug_event as
netdev_debug_event). Add ops lock asserts to netdev_debug_event.
Stanislav Fomichev [Tue, 1 Apr 2025 16:34:44 +0000 (09:34 -0700)]
net: use netif_disable_lro in ipv6_add_dev
ipv6_add_dev might call dev_disable_lro which unconditionally grabs
instance lock, so it will deadlock during NETDEV_REGISTER. Switch
to netif_disable_lro.
Make sure all callers hold the instance lock as well.
Cc: Cosmin Ratiu <cratiu@nvidia.com> Fixes: ad7c7b2172c3 ("net: hold netdev instance lock during sysfs operations") Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250401163452.622454-4-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Stanislav Fomichev [Tue, 1 Apr 2025 16:34:43 +0000 (09:34 -0700)]
net: hold instance lock during NETDEV_REGISTER/UP
Callers of inetdev_init can come from several places with inconsistent
expectation about netdev instance lock. Grab instance lock during
REGISTER (plus UP). Also solve the inconsistency with UNREGISTER
where it was locked only during move netns path.
Switch to netif_disable_lro which assumes the caller holds the instance
lock. inetdev_init is called for blackhole device (which sw device and
doesn't grab instance lock) and from REGISTER/UNREGISTER notifiers.
We already hold the instance lock for REGISTER notifier during
netns change and we'll soon hold the lock during other paths.
Reviewed-by: Jakub Kicinski <kuba@kernel.org> Reported-by: Cosmin Ratiu <cratiu@nvidia.com> Fixes: ad7c7b2172c3 ("net: hold netdev instance lock during sysfs operations") Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250401163452.622454-2-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Merge tag 'rtc-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux
Pull RTC updates from Alexandre Belloni:
"We see a net reduction of the number of lines of code thanks to the
removal of a now unused driver and a testing tool that is not used
anymore. Apart from this, the max31335 driver gets support for a new
part number and pm8xxx gets UEFI support.
Core:
- setdate is removed as it has better replacements
- skip alarms with a second resolution when we know the RTC doesn't
support those.
Subsystem:
- remove unnecessary private struct members
- use devm_pm_set_wake_irq were relevant
Drivers:
- ds1307: stop disabling alarms on probe for DS1337, DS1339, DS1341
and DS3231
- max31335: add max31331 support
- pcf50633 is removed as support for the related SoC has been removed
- pcf85063: properly handle POR failures"
* tag 'rtc-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux: (50 commits)
rtc: remove 'setdate' test program
selftest: rtc: skip some tests if the alarm only supports minutes
rtc: mt6397: drop unused defines
rtc: pcf85063: replace dev_err+return with return dev_err_probe
rtc: pcf85063: do a SW reset if POR failed
rtc: max31335: Add driver support for max31331
dt-bindings: rtc: max31335: Add max31331 support
rtc: cros-ec: Avoid a couple of -Wflex-array-member-not-at-end warnings
dt-bindings: rtc: pcf2127: Reference spi-peripheral-props.yaml
rtc: rzn1: implement one-second accuracy for alarms
rtc: pcf50633: Remove
rtc: pm8xxx: implement qcom,no-alarm flag for non-HLOS owned alarm
rtc: pm8xxx: mitigate flash wear
rtc: pm8xxx: add support for uefi offset
dt-bindings: rtc: qcom-pm8xxx: document qcom,no-alarm flag
rtc: rv3032: drop WADA
rtc: rv3032: fix EERD location
rtc: pm8xxx: switch to devm_device_init_wakeup
rtc: pm8xxx: fix possible race condition
rtc: mpfs: switch to devm_device_init_wakeup
...
Lorenzo Bianconi [Tue, 1 Apr 2025 09:42:30 +0000 (11:42 +0200)]
net: airoha: Validate egress gdm port in airoha_ppe_foe_entry_prepare()
Dev pointer in airoha_ppe_foe_entry_prepare routine is not strictly
a device allocated by airoha_eth driver since it is an egress device
and the flowtable can contain even wlan, pppoe or vlan devices. E.g:
In this case airoha_get_dsa_port() will just return the original device
pointer and we can't assume netdev priv pointer points to an
airoha_gdm_port struct.
Fix the issue validating egress gdm port in airoha_ppe_foe_entry_prepare
routine before accessing net_device priv pointer.
David Oberhollenzer [Tue, 1 Apr 2025 13:56:37 +0000 (15:56 +0200)]
net: dsa: mv88e6xxx: propperly shutdown PPU re-enable timer on destroy
The mv88e6xxx has an internal PPU that polls PHY state. If we want to
access the internal PHYs, we need to disable the PPU first. Because
that is a slow operation, a 10ms timer is used to re-enable it,
canceled with every access, so bulk operations effectively only
disable it once and re-enable it some 10ms after the last access.
If a PHY is accessed and then the mv88e6xxx module is removed before
the 10ms are up, the PPU re-enable ends up accessing a dangling pointer.
This especially affects probing during bootup. The MDIO bus and PHY
registration may succeed, but registration with the DSA framework
may fail later on (e.g. because the CPU port depends on another,
very slow device that isn't done probing yet, returning -EPROBE_DEFER).
In this case, probe() fails, but the MDIO subsystem may already have
accessed the MIDO bus or PHYs, arming the timer.
This is fixed as follows:
- If probe fails after mv88e6xxx_phy_init(), make sure we also call
mv88e6xxx_phy_destroy() before returning
- In mv88e6xxx_remove(), make sure we do the teardown in the correct
order, calling mv88e6xxx_phy_destroy() after unregistering the
switch device.
- In mv88e6xxx_phy_destroy(), destroy both the timer and the work item
that the timer might schedule, synchronously waiting in case one of
the callbacks already fired and destroying the timer first, before
waiting for the work item.
- Access to the PPU is guarded by a mutex, the worker acquires it
with a mutex_trylock(), not proceeding with the expensive shutdown
if that fails. We grab the mutex in mv88e6xxx_phy_destroy() to make
sure the slow PPU shutdown is already done or won't even enter, when
we wait for the work item.
Fixes: 2e5f032095ff ("dsa: add support for the Marvell 88E6131 switch chip") Signed-off-by: David Oberhollenzer <david.oberhollenzer@sigma-star.at> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Link: https://patch.msgid.link/20250401135705.92760-1-david.oberhollenzer@sigma-star.at Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ipv6: fix omitted netlink attributes when using RTEXT_FILTER_SKIP_STATS
Using RTEXT_FILTER_SKIP_STATS is incorrectly skipping non-stats IPv6
netlink attributes on link dump. This causes issues on userspace tools,
e.g iproute2 is not rendering address generation mode as it should due
to missing netlink attribute.
Move the filling of IFLA_INET6_STATS and IFLA_INET6_ICMP6STATS to a
helper function guarded by a flag check to avoid hitting the same
situation in the future.
Fixes: d5566fd72ec1 ("rtnetlink: RTEXT_FILTER_SKIP_STATS support to avoid dumping inet/inet6 stats") Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250402121751.3108-1-ffmancera@riseup.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When queue is being reset, callbacks of mgmt_ops are called by
netdev_nl_bind_rx_doit().
The netdev_nl_bind_rx_doit() first acquires netdev_lock() and then calls
callbacks.
So, mgmt_ops callbacks should not acquire netdev_lock() internaly.
The bnxt_queue_{start | stop}() calls napi_{enable | disable}() but they
internally acquire netdev_lock().
So, deadlock occurs.
To avoid deadlock, napi_{enable | disable}_locked() should be used
instead.
Signed-off-by: Taehee Yoo <ap420073@gmail.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Michael Chan <michael.chan@broadcom.com> Fixes: cae03e5bdd9e ("net: hold netdev instance lock during queue operations") Link: https://patch.msgid.link/20250402133123.840173-1-ap420073@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net/selftests: Add loopback link local route for self-connect
self-connect-ipv6 got slightly flaky on netdev:
> # timeout set to 120
> # selftests: net/tcp_ao: self-connect_ipv6
> # 1..5
> # # 708[lib/setup.c:250] rand seed 1742872572
> # TAP version 13
> # # 708[lib/proc.c:213] Snmp6 Ip6OutNoRoutes: 0 => 1
> # not ok 1 # error 708[self-connect.c:70] failed to connect()
> # ok 2 No unexpected trace events during the test run
> # # Planned tests != run tests (5 != 2)
> # # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:1
> ok 1 selftests: net/tcp_ao: self-connect_ipv6
I can not reproduce it on my machines, but judging by "Ip6OutNoRoutes"
there is no route to the local_addr (::1).
Looking at the kernel code, I see that kernel does add link-local
address automatically in init_loopback(), but that is called from
ipv6 notifier block. So, in turn the userspace that brought up
the loopback interface may see rtnetlink ACK earlier than
addrconf_notify() does it's job (at least, on a slow VM such as netdev).
Probably, for ipv4 it's the same, judging by inetdev_event().
The fix is quite simple: set the link-local route straight after
bringing the loopback interface. That will make it synchronous.
Edward Cree [Tue, 1 Apr 2025 22:54:39 +0000 (23:54 +0100)]
sfc: fix NULL dereferences in ef100_process_design_param()
Since cited commit, ef100_probe_main() and hence also
ef100_check_design_params() run before efx->net_dev is created;
consequently, we cannot netif_set_tso_max_size() or _segs() at this
point.
Move those netif calls to ef100_probe_netdev(), and also replace
netif_err within the design params code with pci_err.
Reported-by: Kyungwook Boo <bookyungwook@gmail.com> Fixes: 98ff4c7c8ac7 ("sfc: Separate netdev probe/remove from PCI probe/remove") Signed-off-by: Edward Cree <ecree.xilinx@gmail.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Link: https://patch.msgid.link/20250401225439.2401047-1-edward.cree@amd.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Joshua Washington [Wed, 2 Apr 2025 00:10:37 +0000 (00:10 +0000)]
gve: handle overflow when reporting TX consumed descriptors
When the tx tail is less than the head (in cases of wraparound), the TX
consumed descriptor statistic in DQ will be reported as
UINT32_MAX - head + tail, which is incorrect. Mask the difference of
head and tail according to the ring size when reporting the statistic.
Cc: stable@vger.kernel.org Fixes: 2c9198356d56 ("gve: Add consumed counts to ethtool stats") Signed-off-by: Joshua Washington <joshwash@google.com> Signed-off-by: Harshitha Ramamurthy <hramamurthy@google.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250402001037.2717315-1-hramamurthy@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rmk/linux
Pull ARM and clkdev updates from Russell King:
- Simplify ARM_MMU_KEEP usage
- Add Rust support for ARM architecture version 7
- Align IPIs reported in /proc/interrupts
- require linker to support KEEP within OVERLAY
- add KEEP() for ARM vectors
- add __printf() attribute for clkdev functions
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rmk/linux:
ARM: 9445/1: clkdev: Mark some functions with __printf() attribute
ARM: 9444/1: add KEEP() keyword to ARM_VECTORS
ARM: 9443/1: Require linker to support KEEP within OVERLAY for DCE
ARM: 9442/1: smp: Fix IPI alignment in /proc/interrupts
ARM: 9441/1: rust: Enable Rust support for ARMv7
ARM: 9439/1: arm32: simplify ARM_MMU_KEEP usage
Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Catalin Marinas:
- Fix max_pfn calculation when hotplugging memory so that it never
decreases
- Fix dereference of unused source register in the MOPS SET operation
fault handling
- Fix NULL calling in do_compat_alignment_fixup() when the 32-bit user
space does an unaligned LDREX/STREX
- Add the HiSilicon HIP09 processor to the Spectre-BHB affected CPUs
- Drop unused code pud accessors (special/mkspecial)
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: Don't call NULL in do_compat_alignment_fixup()
arm64: Add support for HIP09 Spectre-BHB mitigation
arm64: mm: Drop dead code for pud special bit handling
arm64: mops: Do not dereference src reg for a set operation
arm64: mm: Correct the update of max_pfn
Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Pull bpf fixes from Alexei Starovoitov:
- Fix BPF selftests expectations of assembler output and struct layout
(Song Liu and Yonghong Song)
- Fix XSK error code when queue is full (Wang Liang)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
selftests/bpf: Fix verifier_private_stack test failure
selftests/bpf: Fix verifier_bpf_fastcall test
selftests/bpf: Fix tests after fields reorder in struct file
xsk: Fix __xsk_generic_xmit() error code when cq is full
Merge tag 'mm-nonmm-stable-2025-04-02-22-12' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull more non-MM updates from Andrew Morton:
"One bugfix and a couple of small late-arriving updates"
* tag 'mm-nonmm-stable-2025-04-02-22-12' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
lib: scatterlist: fix sg_split_phys to preserve original scatterlist offsets
lib/sort.c: add _nonatomic() variants with cond_resched()
mailmap: add an entry for Nicolas Schier
Merge tag 'mm-stable-2025-04-02-22-07' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull more MM updates from Andrew Morton:
- The series "mm: fixes for fallouts from mem_init() cleanup" from Mike
Rapoport fixes a couple of issues with the just-merged "arch, mm:
reduce code duplication in mem_init()" series
- The series "MAINTAINERS: add my isub-entries to MM part." from Mike
Rapoport does some maintenance on MAINTAINERS
- The series "remove tlb_remove_page_ptdesc()" from Qi Zheng does some
cleanup work to the page mapping code
- The series "mseal system mappings" from Jeff Xu permits sealing of
"system mappings", such as vdso, vvar, vvar_vclock, vectors (arm
compat-mode), sigpage (arm compat-mode)
- Plus the usual shower of singleton patches
* tag 'mm-stable-2025-04-02-22-07' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (31 commits)
mseal sysmap: add arch-support txt
mseal sysmap: enable s390
selftest: test system mappings are sealed
mseal sysmap: update mseal.rst
mseal sysmap: uprobe mapping
mseal sysmap: enable arm64
mseal sysmap: enable x86-64
mseal sysmap: generic vdso vvar mapping
selftests: x86: test_mremap_vdso: skip if vdso is msealed
mseal sysmap: kernel config and header change
mm: pgtable: remove tlb_remove_page_ptdesc()
x86: pgtable: convert to use tlb_remove_ptdesc()
riscv: pgtable: unconditionally use tlb_remove_ptdesc()
mm: pgtable: convert some architectures to use tlb_remove_ptdesc()
mm: pgtable: change pt parameter of tlb_remove_ptdesc() to struct ptdesc*
mm: pgtable: make generic tlb_remove_table() use struct ptdesc
microblaze/mm: put mm_cmdline_setup() in .init.text section
mm/memory_hotplug: fix call folio_test_large with tail page in do_migrate_range
MAINTAINERS: mm: add entry for secretmem
MAINTAINERS: mm: add entry for numa memblocks and numa emulation
...
Merge tag 'mm-hotfixes-stable-2025-04-02-21-57' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM hotfixes from Andrew Morton:
"Five hotfixes. Three are cc:stable and the remainder address post-6.14
issues or aren't considered necessary for -stable kernels.
All patches are for MM"
* tag 'mm-hotfixes-stable-2025-04-02-21-57' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm: zswap: fix crypto_free_acomp() deadlock in zswap_cpu_comp_dead()
mm/hugetlb: move hugetlb_sysctl_init() to the __init section
mm: page_isolation: avoid calling folio_hstate() without hugetlb_lock
mm/hugetlb_vmemmap: fix memory loads ordering
mm/userfaultfd: fix release hang over concurrent GUP
Merge tag 'sched_ext-for-6.15-rc0-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:
- Calling scx_bpf_create_dsq() with the same ID would succeed creating
duplicate DSQs. Fix it to return -EEXIST.
- scx_select_cpu_dfl() fixes and cleanups.
- Synchronize tool/sched_ext with external scheduler repo. While this
isn't a fix. There's no risk to the kernel and it's better if they
stay synced closer.
* tag 'sched_ext-for-6.15-rc0-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
tools/sched_ext: Sync with scx repo
sched_ext: initialize built-in idle state before ops.init()
sched_ext: create_dsq: Return -EEXIST on duplicate request
sched_ext: Remove a meaningless conditional goto in scx_select_cpu_dfl()
sched_ext: idle: Fix return code of scx_select_cpu_dfl()
Merge tag 'trace-v6.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix build error when CONFIG_PROBE_EVENTS_BTF_ARGS is not enabled
The tracing of arguments in the function tracer depends on some
functions that are only defined when PROBE_EVENTS_BTF_ARGS is
enabled. In fact, PROBE_EVENTS_BTF_ARGS also depends on all the same
configs as the function argument tracing requires. Just have the
function argument tracing depend on PROBE_EVENTS_BTF_ARGS.
- Free module_delta for persistent ring buffer instance
When an instance holds the persistent ring buffer, it allocates a
helper array to hold the deltas between where modules are loaded on
the last boot and the current boot. This array needs to be freed when
the instance is freed.
- Add cond_resched() to loop in ftrace_graph_set_hash()
The hash functions in ftrace loop over every function that can be
enabled by ftrace. This can be 50,000 functions or more. This loop is
known to trigger soft lockup warnings and requires a cond_resched().
The loop in ftrace_graph_set_hash() was missing it.
- Fix the event format verifier to include "%*p.." arguments
To prevent events from dereferencing stale pointers that can happen
if a trace event uses a dereferece pointer to something that was not
copied into the ring buffer and can be freed by the time the trace is
read, a verifier is called. At boot or module load, the verifier
scans the print format string for pointers that can be dereferenced
and it checks the arguments to make sure they do not contain
something that can be freed. The "%*p" was not handled, which would
add another argument and cause the verifier to not only not verify
this pointer, but it will look at the wrong argument for every
pointer after that.
- Fix mcount sorttable building for different endian type target
When modifying the ELF file to sort the mcount_loc table in the
sorttable.c code, the endianess of the file and the host is used to
determine if the bytes need to be swapped when calculations are done.
A change was made to the sorting of the mcount_loc that read the
values from the ELF file into an array and the swap happened on the
filling of the array. But one of the calculations of the array still
did the swap when it did not need to. This caused building on a
little endian machine for a big endian target to not find the mcount
function in the 'nm' table and it zeroed it out, causing there to be
no functions available to trace.
- Add goto out_unlock jump to rv_register_monitor() on error path
One of the error paths in rv_register_monitor() just returned the
error when it should have jumped to the out_unlock label to release
the mutex.
* tag 'trace-v6.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
rv: Fix missing unlock on double nested monitors return path
scripts/sorttable: Fix endianness handling in build-time mcount sort
tracing: Verify event formats that have "%*p.."
ftrace: Add cond_resched() to ftrace_graph_set_hash()
tracing: Free module_delta on freeing of persistent ring buffer
ftrace: Have tracing function args depend on PROBE_EVENTS_BTF_ARGS
Kent Overstreet [Wed, 2 Apr 2025 16:23:25 +0000 (12:23 -0400)]
bcachefs: Fix "journal stuck" during recovery
If we crash when the journal pin fifo is completely full - i.e. we're
at the maximum number of dirty journal entries - that may put us in a
sticky situation in recovery, as journal replay will need to be able to
open new journal entries in order to get going.
bch2_fs_journal_start() already had provisions for resizing the journal
pin fifo if needed, but it needs a fudge factor to ensure there's room
for journal replay.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Codepaths that create entries in the snapshots btree currently call
bch2_mark_snapshot(), which updates the in-memory snapshot table, before
transaction commit.
This is because bch2_mark_snapshot() is an atomic trigger, run with
btree write locks held, and isn't allowed to fail - but it might need to
reallocate the table, hence we call it early when we're still allowed to
fail.
This is generally harmless - if we fail, we'll have left an entry in the
snapshots table around, but nothing will reference it and it'll get
overwritten if reused by another transaction.
But check_snapshot_exists(), which reconstructs snapshots when the
snapshots btree has been corrupted or lost, was erronously rechecking if
the snapshot exists inside the transaction commit loop - so on
transaction restart (in this case mem_realloced), the second iteration
would return without repairing.
This code needs some cleanup: splitting out a "maybe realloc snapshots
table" helper would have avoided this, that will be in the next patch.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bcachefs: use nonblocking variant of print_string_as_lines in error path
The inconsistency error path calls print_string_as_lines, which calls
console_lock, which is a potentially-sleeping function and so can't be
called in an atomic context.
Replace calls to it with the nonblocking variant which is safe to call.
Signed-off-by: Bharadwaj Raju <bharadwaj.raju777@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Tue, 1 Apr 2025 17:01:29 +0000 (13:01 -0400)]
bcachefs: Fix scheduling while atomic from logging changes
Two fixes from the recent logging changes:
bch2_inconsistent(), bch2_fs_inconsistent() be called from interrupt
context, or with rcu_read_lock() held.
The one syzbot found is in
bch2_bkey_pick_read_device
bch2_dev_rcu
bch2_fs_inconsistent
We're starting to switch to lift the printbufs up to higher levels so we
can emit better log messages and print them all in one go (avoid
garbling), so that conversion will help with spotting these in the
future; when we declare a printbuf it must be flagged if we're in an
atomic context.
Ming Lei [Thu, 3 Apr 2025 10:54:02 +0000 (18:54 +0800)]
block: don't grab elevator lock during queue initialization
->elevator_lock depends on queue freeze lock, see block/blk-sysfs.c.
queue freeze lock depends on fs_reclaim.
So don't grab elevator lock during queue initialization which needs to
call kmalloc(GFP_KERNEL), and we can cut the dependency between
->elevator_lock and fs_reclaim, then the lockdep warning can be killed.
This way is safe because elevator setting isn't ready to run during
queue initialization.
There isn't such issue in __blk_mq_update_nr_hw_queues() because
memalloc_noio_save() is called before acquiring elevator lock.
Pavel Begunkov [Thu, 3 Apr 2025 11:29:30 +0000 (12:29 +0100)]
io_uring: always do atomic put from iowq
io_uring always switches requests to atomic refcounting for iowq
execution before there is any parallilism by setting REQ_F_REFCOUNT,
and the flag is not cleared until the request completes. That should be
fine as long as the compiler doesn't make up a non existing value for
the flags, however KCSAN still complains when the request owner changes
oter flag bits:
BUG: KCSAN: data-race in io_req_task_cancel / io_wq_free_work
...
read to 0xffff888117207448 of 8 bytes by task 3871 on cpu 0:
req_ref_put_and_test io_uring/refs.h:22 [inline]
Skip REQ_F_REFCOUNT checks for iowq, we know it's set.
Lin Ma [Wed, 2 Apr 2025 17:00:26 +0000 (01:00 +0800)]
netfilter: nft_tunnel: fix geneve_opt type confusion addition
When handling multiple NFTA_TUNNEL_KEY_OPTS_GENEVE attributes, the
parsing logic should place every geneve_opt structure one by one
compactly. Hence, when deciding the next geneve_opt position, the
pointer addition should be in units of char *.
However, the current implementation erroneously does type conversion
before the addition, which will lead to heap out-of-bounds write.
Fix this bug with correct pointer addition and conversion in parse
and dump code.
Fixes: 925d844696d9 ("netfilter: nft_tunnel: add support for geneve opts") Signed-off-by: Lin Ma <linma@zju.edu.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Antoine Tenart [Wed, 26 Mar 2025 17:36:32 +0000 (18:36 +0100)]
net: decrease cached dst counters in dst_release
Upstream fix ac888d58869b ("net: do not delay dst_entries_add() in
dst_release()") moved decrementing the dst count from dst_destroy to
dst_release to avoid accessing already freed data in case of netns
dismantle. However in case CONFIG_DST_CACHE is enabled and OvS+tunnels
are used, this fix is incomplete as the same issue will be seen for
cached dsts:
Merge tag 'firewire-updates-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394
Pull firewire update from Takashi Sakamoto:
"A single commit to use the common helper function for on-stack
trailing array to enqueue any isochronous packet by the requests
from userspace applications"
* tag 'firewire-updates-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394:
firewire: core: avoid -Wflex-array-member-not-at-end warning
As a result, verifier_bpf_fastcall tests should now expect a negative
value for pcpu_hot, IOW, the disassemble should show "r=" instead of
"w=".
Fix this in the test.
Note that, a later change created a new variable "cpu_number" for
bpf_get_smp_processor_id() [2]. The inlining logic is updated properly
as part of this change, so there is no need to fix anything on the
kernel side.
[1] commit 9d7de2aa8b41 ("x86/percpu/64: Use relative percpu offsets")
[2] commit 01c7bc5198e9 ("x86/smp: Move cpu number to percpu hot section") Reported-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20250328193124.808784-1-song@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Song Liu [Thu, 27 Mar 2025 18:55:28 +0000 (11:55 -0700)]
selftests/bpf: Fix tests after fields reorder in struct file
The change in struct file [1] moved f_ref to the 3rd cache line.
It made *(u64 *)file dereference invalid from the verifier point of view,
because btf_struct_walk() walks into f_lock field, which is 4-byte long.
Fix the selftests to deference the file pointer as a 4-byte access.
[1] commit e249056c91a2 ("fs: place f_ref to 3rd cache line in struct file to resolve false sharing") Reported-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20250327185528.1740787-1-song@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Wang Liang [Thu, 27 Feb 2025 08:10:52 +0000 (16:10 +0800)]
xsk: Fix __xsk_generic_xmit() error code when cq is full
When the cq reservation is failed, the error code is not set which is
initialized to zero in __xsk_generic_xmit(). That means the packet is not
send successfully but sendto() return ok.
Considering the impact on uapi, return -EAGAIN is a good idea. The cq is
full usually because it is not released in time, try to send msg again is
appropriate.
The bug was at the very early implementation of xsk, so the Fixes tag
targets the commit that introduced the changes in
xsk_cq_reserve_addr_locked where this fix depends on.
Fixes: e6c4047f5122 ("xsk: Use xsk_buff_pool directly for cq functions") Suggested-by: Magnus Karlsson <magnus.karlsson@gmail.com> Signed-off-by: Wang Liang <wangliang74@huawei.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250227081052.4096337-1-wangliang74@huawei.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
- dm-integrity: set ti->error on memory allocation failure
- dm-bufio: remove unused return value
- dm-verity: do forward error correction on metadata I/O errors
- dm: fix unconditional IO throttle caused by REQ_PREFLUSH
- dm cache: prevent BUG_ON by blocking retries on failed device resumes
- dm cache: support shrinking the origin device
- dm: restrict dm device size to 2^63-512 bytes
- dm-delay: support zoned devices
- dm-verity: support block number limits for different ioprio classes
- dm-integrity: fix non-constant-time tag verification (security bug)
- dm-verity, dm-ebs: fix prefetch-vs-suspend race
* tag 'for-6.15/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (27 commits)
dm-ebs: fix prefetch-vs-suspend race
dm-verity: fix prefetch-vs-suspend race
dm-integrity: fix non-constant-time tag verification
dm-verity: support block number limits for different ioprio classes
dm-delay: support zoned devices
dm: restrict dm device size to 2^63-512 bytes
dm cache: support shrinking the origin device
dm cache: prevent BUG_ON by blocking retries on failed device resumes
dm vdo indexer: reorder uds_request to reduce padding
dm: fix unconditional IO throttle caused by REQ_PREFLUSH
dm vdo: rework processing of loaded refcount byte arrays
dm vdo: remove remaining ring references
dm-verity: do forward error correction on metadata I/O errors
dm-bufio: remove unused return value
dm-integrity: set ti->error on memory allocation failure
dm: Enable inline crypto passthrough for striped target
dm vdo slab-depot: read refcount blocks in large chunks at load time
dm vdo vio-pool: allow variable-sized metadata vios
dm vdo vio-pool: support pools with multiple data blocks per vio
dm vdo vio-pool: add a pool pointer to pooled_vio
...
Merge tag 'cxl-for-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl
Pull Compute Express Link (CXL) updates from Dave Jiang:
- Add support for Global Persistent Flush (GPF)
- Cleanup of DPA partition metadata handling:
- Remove the CXL_DECODER_MIXED enum that's not needed anymore
- Introduce helpers to access resource and perf meta data
- Introduce 'struct cxl_dpa_partition' and 'struct cxl_range_info'
- Make cxl_dpa_alloc() DPA partition number agnostic
- Remove cxl_decoder_mode
- Cleanup partition size and perf helpers
- Remove unused CXL partition values
- Add logging support for CXL CPER endpoint and port protocol errors:
- Prefix protocol error struct and function names with cxl_
- Move protocol error definitions and structures to a common location
- Remove drivers/firmware/efi/cper_cxl.h to include/linux/cper.h
- Add support in GHES to process CXL CPER protocol errors
- Process CXL CPER protocol errors
- Add trace logging for CXL PCIe port RAS errors
- Remove redundant gp_port init
- Add validation of cxl device serial number
- CXL ABI documentation updates/fixups
- A series that uses guard() to clean up open coded mutex lockings and
remove gotos for error handling.
- Some followup patches to support dirty shutdown accounting:
- Add helper to retrieve DVSEC offset for dirty shutdown registers
- Rename cxl_get_dirty_shutdown() to cxl_arm_dirty_shutdown()
- Add support for dirty shutdown count via sysfs
- cxl_test support for dirty shutdown
- A series to support CXL mailbox Features commands.
Mostly in preparation for CXL EDAC code to utilize the Features
commands. It's also in preparation for CXL fwctl support to utilize
the CXL Features. The commands include "Get Supported Features", "Get
Feature", and "Set Feature".
- A series to support extended linear cache support described by the
ACPI HMAT table.
The addition helps enumerate the cache and also provides additional
RAS reporting support for configuration with extended linear cache.
(and related fixes for the series).
- An update to cxl_test to support a 3-way capable CFMWS
- A documentation fix to remove unused "mixed mode"
* tag 'cxl-for-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: (39 commits)
cxl/region: Fix the first aliased address miscalculation
cxl/region: Quiet some dev_warn()s in extended linear cache setup
cxl/Documentation: Remove 'mixed' from sysfs mode doc
cxl: Fix warning from emitting resource_size_t as long long int on 32bit systems
cxl/test: Define a CFMWS capable of a 3 way HB interleave
cxl/mem: Do not return error if CONFIG_CXL_MCE unset
tools/testing/cxl: Set Shutdown State support
cxl/pmem: Export dirty shutdown count via sysfs
cxl/pmem: Rename cxl_dirty_shutdown_state()
cxl/pci: Introduce cxl_gpf_get_dvsec()
cxl/pci: Support Global Persistent Flush (GPF)
cxl: Document missing sysfs files
cxl: Plug typos in ABI doc
cxl/pmem: debug invalid serial number data
cxl/cdat: Remove redundant gp_port initialization
cxl/memdev: Remove unused partition values
cxl/region: Drop goto pattern of construct_region()
cxl/region: Drop goto pattern in cxl_dax_region_alloc()
cxl/core: Use guard() to drop goto pattern of cxl_dpa_alloc()
cxl/core: Use guard() to drop the goto pattern of cxl_dpa_free()
...
Merge tag 'usb-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB / Thunderbolt updates from Greg KH:
"Here is the big set of USB and Thunderbolt driver updates for
6.15-rc1. Included in here are:
- Thunderbolt driver and core api updates for new hardware and
features
- usb-storage const array cleanups
- typec driver updates
- dwc3 driver updates
- xhci driver updates and bugfixes
- small USB documentation updates
- usb cdns3 driver updates
- usb gadget driver updates
- other small driver updates and fixes
All of these have been in linux-next for a while with no reported
issues"
* tag 'usb-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (92 commits)
thunderbolt: Do not add non-active NVM if NVM upgrade is disabled for retimer
thunderbolt: Scan retimers after device router has been enumerated
usb: host: cdns3: forward lost power information to xhci
usb: host: xhci-plat: allow upper layers to signal power loss
usb: xhci: change xhci_resume() parameters to explicit the desired info
usb: cdns3-ti: run HW init at resume() if HW was reset
usb: cdns3-ti: move reg writes to separate function
usb: cdns3: call cdns_power_is_lost() only once in cdns_resume()
usb: cdns3: rename hibernated argument of role->resume() to lost_power
usb: xhci: tegra: rename `runtime` boolean to `is_auto_runtime`
usb: host: xhci-plat: mvebu: use ->quirks instead of ->init_quirk() func
usb: dwc3: Don't use %pK through printk
usb: core: Don't use %pK through printk
usb: gadget: aspeed: Add NULL pointer check in ast_vhub_init_dev()
dt-bindings: usb: qcom,dwc3: Synchronize minItems for interrupts and -names
usb: common: usb-conn-gpio: switch psy_cfg from of_node to fwnode
usb: xhci: Avoid Stop Endpoint retry loop if the endpoint seems Running
usb: xhci: Don't change the status of stalled TDs on failed Stop EP
xhci: Avoid queuing redundant Stop Endpoint command for stalled endpoint
xhci: Handle spurious events on Etron host isoc enpoints
...
Merge tag 'staging-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
Pull staging driver updates from Greg KH:
"Here is the big set of staging driver cleanups and updates for
6.15-rc1.
As expected, with the introduction of the gpib drivers, loads of
cleanups and fixes showed up, with the huge majority of changes being
for that chunk of drivers. This is good and shows that the community
can fix up things in public when asked to. Also included in here are:
- small sm750fb cleanups
- tiny rtl8723bs cleanups
- more vchiq_arm cleanups and changes, hopefully this will get out of
staging soon
All of these have been in linux-next for almost 2 weeks now with no
reported issues"
* tag 'staging-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: (76 commits)
staging: rtl8723bs: fixed a unnecessary parentheses coding style issue
staging: vchiq_arm: Improve initial VCHIQ connect
staging: vchiq_arm: Create keep-alive thread during probe
staging: vchiq_arm: Stop kthreads if vchiq cdev register fails
staging: vchiq_arm: Fix possible NPR of keep-alive thread
staging: vchiq_arm: Register debugfs after cdev
staging: vchiq_arm: Don't use %pK through printk
staging: rtl8723bs: select CONFIG_CRYPTO_LIB_AES
staging: rtl8723bs: Remove some unused functions, macros, and structs
staging: gpib: change return type of t1_delay function to report errors
staging: gpib: remove commented-out lines
staging: gpib: fix kernel-doc section for usb_gpib_line_status() function
staging: gpib: fix kernel-doc section for function usb_gpib_interface_clear()
staging: gpib: fix kernel-doc section for write_loop() function
staging: gpib: Removing typedef for gpib_board
staging: gpib: struct typing for gpib_gboard_t
staging: gpib: tnt4882: struct gpib_board
staging: gpib: tms9914: struct gpib_board
staging: gpib: pc2: struct gpib_board
staging: gpib: ni_usb_gpib: struct gpib_board
...
Merge tag 'char-misc-6.15-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc fixes from Greg KH:
"Here are two counter driver fixes that I realized I never sent to you
for 6.14-final.
They have been in my for weeks, as well as linux-next, my fault for
not sending them earlier. They are:
- bugfix for stm32-lptimer-cnt counter driver
- bugfix for microchip-tcb-capture counter driver
Again, these have been in linux-next for weeks with no reported
issues"
* tag 'char-misc-6.15-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
counter: microchip-tcb-capture: Fix undefined counter channel state on probe
counter: stm32-lptimer-cnt: fix error handling when enabling
Guillaume Nault [Sat, 29 Mar 2025 00:33:44 +0000 (01:33 +0100)]
tunnels: Accept PACKET_HOST in skb_tunnel_check_pmtu().
Because skb_tunnel_check_pmtu() doesn't handle PACKET_HOST packets,
commit 30a92c9e3d6b ("openvswitch: Set the skbuff pkt_type for proper
pmtud support.") forced skb->pkt_type to PACKET_OUTGOING for
openvswitch packets that are sent using the OVS_ACTION_ATTR_OUTPUT
action. This allowed such packets to invoke the
iptunnel_pmtud_check_icmp() or iptunnel_pmtud_check_icmpv6() helpers
and thus trigger PMTU update on the input device.
However, this also broke other parts of PMTU discovery. Since these
packets don't have the PACKET_HOST type anymore, they won't trigger the
sending of ICMP Fragmentation Needed or Packet Too Big messages to
remote hosts when oversized (see the skb_in->pkt_type condition in
__icmp_send() for example).
These two skb->pkt_type checks are therefore incompatible as one
requires skb->pkt_type to be PACKET_HOST, while the other requires it
to be anything but PACKET_HOST.
It makes sense to not trigger ICMP messages for non-PACKET_HOST packets
as these messages should be generated only for incoming l2-unicast
packets. However there doesn't seem to be any reason for
skb_tunnel_check_pmtu() to ignore PACKET_HOST packets.
Allow both cases to work by allowing skb_tunnel_check_pmtu() to work on
PACKET_HOST packets and not overriding skb->pkt_type in openvswitch
anymore.
Fixes: 30a92c9e3d6b ("openvswitch: Set the skbuff pkt_type for proper pmtud support.") Fixes: 4cb47a8644cc ("tunnels: PMTU discovery support for directly bridged IP packets") Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: Aaron Conole <aconole@redhat.com> Tested-by: Aaron Conole <aconole@redhat.com> Link: https://patch.msgid.link/eac941652b86fddf8909df9b3bf0d97bc9444793.1743208264.git.gnault@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Stefano Garzarella [Fri, 28 Mar 2025 14:15:28 +0000 (15:15 +0100)]
vsock: avoid timeout during connect() if the socket is closing
When a peer attempts to establish a connection, vsock_connect() contains
a loop that waits for the state to be TCP_ESTABLISHED. However, the
other peer can be fast enough to accept the connection and close it
immediately, thus moving the state to TCP_CLOSING.
When this happens, the peer in the vsock_connect() is properly woken up,
but since the state is not TCP_ESTABLISHED, it goes back to sleep
until the timeout expires, returning -ETIMEDOUT.
If the socket state is TCP_CLOSING, waiting for the timeout is pointless.
vsock_connect() can return immediately without errors or delay since the
connection actually happened. The socket will be in a closing state,
but this is not an issue, and subsequent calls will fail as expected.
We discovered this issue while developing a test that accepts and
immediately closes connections to stress the transport switch between
two connect() calls, where the first one was interrupted by a signal
(see Closes link).
Reported-by: Luigi Leonardi <leonardi@redhat.com> Closes: https://lore.kernel.org/virtualization/bq6hxrolno2vmtqwcvb5bljfpb7mvwb3kohrvaed6auz5vxrfv@ijmd2f3grobn/ Fixes: d021c344051a ("VSOCK: Introduce VM Sockets") Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Tested-by: Luigi Leonardi <leonardi@redhat.com> Reviewed-by: Luigi Leonardi <leonardi@redhat.com> Link: https://patch.msgid.link/20250328141528.420719-1-sgarzare@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Matt Dowling reported a weird UDP memory usage issue.
Under normal operation, the UDP memory usage reported in /proc/net/sockstat
remains close to zero. However, it occasionally spiked to 524,288 pages
and never dropped. Moreover, the value doubled when the application was
terminated. Finally, it caused intermittent packet drops.
We can reproduce the issue with the script below [0]:
The application set INT_MAX to SO_RCVBUF, which triggered an integer
overflow in udp_rmem_release().
When a socket is close()d, udp_destruct_common() purges its receive
queue and sums up skb->truesize in the queue. This total is calculated
and stored in a local unsigned integer variable.
The total size is then passed to udp_rmem_release() to adjust memory
accounting. However, because the function takes a signed integer
argument, the total size can wrap around, causing an overflow.
Then, the released amount is calculated as follows:
1) Add size to sk->sk_forward_alloc.
2) Round down sk->sk_forward_alloc to the nearest lower multiple of
PAGE_SIZE and assign it to amount.
3) Subtract amount from sk->sk_forward_alloc.
4) Pass amount >> PAGE_SHIFT to __sk_mem_reduce_allocated().
When the issue occurred, the total in udp_destruct_common() was 2147484480
(INT_MAX + 833), which was cast to -2147482816 in udp_rmem_release().
At 1) sk->sk_forward_alloc is changed from 3264 to -2147479552, and
2) sets -2147479552 to amount. 3) reverts the wraparound, so we don't
see a warning in inet_sock_destruct(). However, udp_memory_allocated
ends up doubling at 4).
Since commit 3cd3399dd7a8 ("net: implement per-cpu reserves for
memory_allocated"), memory usage no longer doubles immediately after
a socket is close()d because __sk_mem_reduce_allocated() caches the
amount in udp_memory_per_cpu_fw_alloc. However, the next time a UDP
socket receives a packet, the subtraction takes effect, causing UDP
memory usage to double.
This issue makes further memory allocation fail once the socket's
sk->sk_rmem_alloc exceeds net.ipv4.udp_rmem_min, resulting in packet
drops.
To prevent this issue, let's use unsigned int for the calculation and
call sk_forward_alloc_add() only once for the small delta.
Note that first_packet_length() also potentially has the same problem.
[0]:
from socket import *
SO_RCVBUFFORCE = 33
INT_MAX = (2 ** 31) - 1
s = socket(AF_INET, SOCK_DGRAM)
s.bind(('', 0))
s.setsockopt(SOL_SOCKET, SO_RCVBUFFORCE, INT_MAX)
c = socket(AF_INET, SOCK_DGRAM)
c.connect(s.getsockname())
data = b'a' * 100
while True:
c.send(data)
Fixes: f970bd9e3a06 ("udp: implement memory accounting helpers") Reported-by: Matt Dowling <madowlin@amazon.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250401184501.67377-3-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
udp: Fix multiple wraparounds of sk->sk_rmem_alloc.
__udp_enqueue_schedule_skb() has the following condition:
if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf)
goto drop;
sk->sk_rcvbuf is initialised by net.core.rmem_default and later can
be configured by SO_RCVBUF, which is limited by net.core.rmem_max,
or SO_RCVBUFFORCE.
If we set INT_MAX to sk->sk_rcvbuf, the condition is always false
as sk->sk_rmem_alloc is also signed int.
Then, the size of the incoming skb is added to sk->sk_rmem_alloc
unconditionally.
This results in integer overflow (possibly multiple times) on
sk->sk_rmem_alloc and allows a single socket to have skb up to
net.core.udp_mem[1].
For example, if we set a large value to udp_mem[1] and INT_MAX to
sk->sk_rcvbuf and flood packets to the socket, we can see multiple
overflows:
# cat /proc/net/sockstat | grep UDP:
UDP: inuse 3 mem 7956736 <-- (7956736 << 12) bytes > INT_MAX * 15
^- PAGE_SHIFT
# ss -uam
State Recv-Q ...
UNCONN -1757018048 ... <-- flipping the sign repeatedly
skmem:(r2537949248,rb2147483646,t0,tb212992,f1984,w0,o0,bl0,d0)
Previously, we had a boundary check for INT_MAX, which was removed by
commit 6a1f12dd85a8 ("udp: relax atomic operation on sk->sk_rmem_alloc").
A complete fix would be to revert it and cap the right operand by
INT_MAX:
# ss -uam
State Recv-Q ...
UNCONN -2147482816 ... <-- INT_MAX + 831 bytes
skmem:(r2147484480,rb2147483646,t0,tb212992,f3264,w0,o0,bl0,d14468947)
So, let's define rmem and rcvbuf as unsigned int and check skb->truesize
only when rcvbuf is large enough to lower the overflow possibility.
Note that we still have a small chance to see overflow if multiple skbs
to the same socket are processed on different core at the same time and
each size does not exceed the limit but the total size does.
Note also that we must ignore skb->truesize for a small buffer as
explained in commit 363dc73acacb ("udp: be less conservative with
sock rmem accounting").
Fixes: 6a1f12dd85a8 ("udp: relax atomic operation on sk->sk_rmem_alloc") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250401184501.67377-2-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
rtnetlink: Use register_pernet_subsys() in rtnl_net_debug_init().
rtnl_net_debug_init() registers rtnl_net_debug_net_ops by
register_pernet_device() but calls unregister_pernet_subsys()
in case register_netdevice_notifier() fails.
It corrupts pernet_list because first_device is updated in
register_pernet_device() but not unregister_pernet_subsys().
Let's fix it by calling register_pernet_subsys() instead.
The _subsys() one fits better for the use case because it keeps
the notifier alive until default_device_exit_net(), giving it
more chance to test NETDEV_UNREGISTER.
Fixes: 03fa53485659 ("rtnetlink: Add ASSERT_RTNL_NET() placeholder for netdev notifier.") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250401190716.70437-1-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Merge tag 'nfs-for-6.15-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Bugfixes:
- Three fixes for looping in the NFSv4 state manager delegation code
- Fix for the NFSv4 state XDR code (Neil Brown)
- Fix a leaked reference in nfs_lock_and_join_requests()
- Fix a use-after-free in the delegation return code
Features:
- Implement the NFSv4.2 copy offload OFFLOAD_STATUS operation to
allow monitoring of an in-progress copy
- Add a mount option to force NFSv3/NFSv4 to use READDIRPLUS in a
getdents() call
- SUNRPC now allows some basic management of an existing RPC client's
connections using sysfs
- Improvements to the automated teardown of a NFS client when the
container it was initiated from gets killed
- Improvements to prevent tasks from getting stuck in a killable wait
state after calling exit_signals()"
* tag 'nfs-for-6.15-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (29 commits)
nfs: Add missing release on error in nfs_lock_and_join_requests()
NFSv4: Check for delegation validity in nfs_start_delegation_return_locked()
NFS: Don't allow waiting for exiting tasks
SUNRPC: Don't allow waiting for exiting tasks
NFSv4: Treat ENETUNREACH errors as fatal for state recovery
NFSv4: clp->cl_cons_state < 0 signifies an invalid nfs_client
NFSv4: Further cleanups to shutdown loops
NFS: Shut down the nfs_client only after all the superblocks
SUNRPC: rpc_clnt_set_transport() must not change the autobind setting
SUNRPC: rpcbind should never reset the port to the value '0'
pNFS/flexfiles: Report ENETDOWN as a connection error
pNFS/flexfiles: Treat ENETUNREACH errors as fatal in containers
NFS: Treat ENETUNREACH errors as fatal in containers
NFS: Add a mount option to make ENETUNREACH errors fatal
sunrpc: Add a sysfs file for one-step xprt deletion
sunrpc: Add a sysfs file for adding a new xprt
sunrpc: Add a sysfs files for rpc_clnt information
sunrpc: Add a sysfs attr for xprtsec
NFS: Add implid to sysfs
NFS: Extend rdirplus mount option with "force|none"
...