]> www.infradead.org Git - users/hch/misc.git/log
users/hch/misc.git
3 weeks agotc: Ensure we have enough buffer space when sending filter netlink notifications
Toke Høiland-Jørgensen [Mon, 7 Apr 2025 10:55:34 +0000 (12:55 +0200)]
tc: Ensure we have enough buffer space when sending filter netlink notifications

The tfilter_notify() and tfilter_del_notify() functions assume that
NLMSG_GOODSIZE is always enough to dump the filter chain. This is not
always the case, which can lead to silent notify failures (because the
return code of tfilter_notify() is not always checked). In particular,
this can lead to NLM_F_ECHO not being honoured even though an action
succeeds, which forces userspace to create workarounds[0].

Fix this by increasing the message size if dumping the filter chain into
the allocated skb fails. Use the size of the incoming skb as a size hint
if set, so we can start at a larger value when appropriate.

To trigger this, run the following commands:

 # ip link add type veth
 # tc qdisc replace dev veth0 root handle 1: fq_codel
 # tc -echo filter add dev veth0 parent 1: u32 match u32 0 0 $(for i in $(seq 32); do echo action pedit munge ip dport set 22; done)

Before this fix, tc just returns:

Not a filter(cmd 2)

After the fix, we get the correct echo:

added filter dev veth0 parent 1: protocol all pref 49152 u32 chain 0 fh 800::800 order 2048 key ht 800 bkt 0 terminal flowid not_in_hw
  match 00000000/00000000 at 0
action order 1:  pedit action pass keys 1
  index 1 ref 1 bind 1
key #0  at 20: val 00000016 mask ffff0000
[repeated 32 times]

[0] https://github.com/openvswitch/ovs/commit/106ef21860c935e5e0017a88bf42b94025c4e511

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: Frode Nordahl <frode.nordahl@canonical.com>
Closes: https://bugs.launchpad.net/ubuntu/+source/openvswitch/+bug/2018500
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20250407105542.16601-1-toke@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 weeks agonet: libwx: Fix the wrong Rx descriptor field
Jiawen Wu [Mon, 7 Apr 2025 10:33:22 +0000 (18:33 +0800)]
net: libwx: Fix the wrong Rx descriptor field

WX_RXD_IPV6EX was incorrectly defined in Rx ring descriptor. In fact, this
field stores the 802.1ad ID from which the packet was received. The wrong
definition caused the statistics rx_csum_offload_errors to fail to grow
when receiving the 802.1ad packet with incorrect checksum.

Fixes: ef4f3c19f912 ("net: wangxun: libwx add rx offload functions")
Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Link: https://patch.msgid.link/20250407103322.273241-1-jiawenwu@trustnetic.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 weeks agoocteontx2-pf: qos: fix VF root node parent queue index
Hariprasad Kelam [Mon, 7 Apr 2025 07:03:41 +0000 (12:33 +0530)]
octeontx2-pf: qos: fix VF root node parent queue index

The current code configures the Physical Function (PF) root node at TL1
and the Virtual Function (VF) root node at TL2.

This ensure at any given point of time PF traffic gets more priority.

                    PF root node
                      TL1
                     /  \
                    TL2  TL2 VF root node
                    /     \
                   TL3    TL3
                   /       \
                  TL4      TL4
                  /         \
                 SMQ        SMQ

Due to a bug in the current code, the TL2 parent queue index on the
VF interface is not being configured, leading to 'SMQ Flush' errors

Fixes: 5e6808b4c68d ("octeontx2-pf: Add support for HTB offload")
Signed-off-by: Hariprasad Kelam <hkelam@marvell.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250407070341.2765426-1-hkelam@marvell.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 weeks agoselftests: tls: check that disconnect does nothing
Jakub Kicinski [Fri, 4 Apr 2025 18:03:34 +0000 (11:03 -0700)]
selftests: tls: check that disconnect does nothing

"Inspired" by syzbot test, pre-queue some data, disconnect()
and try to receive(). This used to trigger a warning in TLS's strp.
Now we expect the disconnect() to have almost no effect.

Link: https://lore.kernel.org/67e6be74.050a0220.2f068f.007e.GAE@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20250404180334.3224206-2-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 weeks agonet: tls: explicitly disallow disconnect
Jakub Kicinski [Fri, 4 Apr 2025 18:03:33 +0000 (11:03 -0700)]
net: tls: explicitly disallow disconnect

syzbot discovered that it can disconnect a TLS socket and then
run into all sort of unexpected corner cases. I have a vague
recollection of Eric pointing this out to us a long time ago.
Supporting disconnect is really hard, for one thing if offload
is enabled we'd need to wait for all packets to be _acked_.
Disconnect is not commonly used, disallow it.

The immediate problem syzbot run into is the warning in the strp,
but that's just the easiest bug to trigger:

  WARNING: CPU: 0 PID: 5834 at net/tls/tls_strp.c:486 tls_strp_msg_load+0x72e/0xa80 net/tls/tls_strp.c:486
  RIP: 0010:tls_strp_msg_load+0x72e/0xa80 net/tls/tls_strp.c:486
  Call Trace:
   <TASK>
   tls_rx_rec_wait+0x280/0xa60 net/tls/tls_sw.c:1363
   tls_sw_recvmsg+0x85c/0x1c30 net/tls/tls_sw.c:2043
   inet6_recvmsg+0x2c9/0x730 net/ipv6/af_inet6.c:678
   sock_recvmsg_nosec net/socket.c:1023 [inline]
   sock_recvmsg+0x109/0x280 net/socket.c:1045
   __sys_recvfrom+0x202/0x380 net/socket.c:2237

Fixes: 3c4d7559159b ("tls: kernel TLS support")
Reported-by: syzbot+b4cd76826045a1eb93c1@syzkaller.appspotmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20250404180334.3224206-1-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 weeks agosctp: detect and prevent references to a freed transport in sendmsg
Ricardo Cañuelo Navarro [Fri, 4 Apr 2025 14:53:21 +0000 (16:53 +0200)]
sctp: detect and prevent references to a freed transport in sendmsg

sctp_sendmsg() re-uses associations and transports when possible by
doing a lookup based on the socket endpoint and the message destination
address, and then sctp_sendmsg_to_asoc() sets the selected transport in
all the message chunks to be sent.

There's a possible race condition if another thread triggers the removal
of that selected transport, for instance, by explicitly unbinding an
address with setsockopt(SCTP_SOCKOPT_BINDX_REM), after the chunks have
been set up and before the message is sent. This can happen if the send
buffer is full, during the period when the sender thread temporarily
releases the socket lock in sctp_wait_for_sndbuf().

This causes the access to the transport data in
sctp_outq_select_transport(), when the association outqueue is flushed,
to result in a use-after-free read.

This change avoids this scenario by having sctp_transport_free() signal
the freeing of the transport, tagging it as "dead". In order to do this,
the patch restores the "dead" bit in struct sctp_transport, which was
removed in
commit 47faa1e4c50e ("sctp: remove the dead field of sctp_transport").

Then, in the scenario where the sender thread has released the socket
lock in sctp_wait_for_sndbuf(), the bit is checked again after
re-acquiring the socket lock to detect the deletion. This is done while
holding a reference to the transport to prevent it from being freed in
the process.

If the transport was deleted while the socket lock was relinquished,
sctp_sendmsg_to_asoc() will return -EAGAIN to let userspace retry the
send.

The bug was found by a private syzbot instance (see the error report [1]
and the C reproducer that triggers it [2]).

Link: https://people.igalia.com/rcn/kernel_logs/20250402__KASAN_slab-use-after-free_Read_in_sctp_outq_select_transport.txt
Link: https://people.igalia.com/rcn/kernel_logs/20250402__KASAN_slab-use-after-free_Read_in_sctp_outq_select_transport__repro.c
Cc: stable@vger.kernel.org
Fixes: df132eff4638 ("sctp: clear the transport of some out_chunk_list chunks in sctp_assoc_rm_peer")
Suggested-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Ricardo Cañuelo Navarro <rcn@igalia.com>
Acked-by: Xin Long <lucien.xin@gmail.com>
Link: https://patch.msgid.link/20250404-kasan_slab-use-after-free_read_in_sctp_outq_select_transport__20250404-v1-1-5ce4a0b78ef2@igalia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 weeks agoMerge branch 'net_sched-make-qlen_notify-idempotent'
Paolo Abeni [Tue, 8 Apr 2025 08:58:16 +0000 (10:58 +0200)]
Merge branch 'net_sched-make-qlen_notify-idempotent'

Cong Wang says:

====================
net_sched: make ->qlen_notify() idempotent

Gerrard reported a vulnerability exists in fq_codel where manipulating
the MTU can cause codel_dequeue() to drop all packets. The parent qdisc's
sch->q.qlen is only updated via ->qlen_notify() if the fq_codel queue
remains non-empty after the drops. This discrepancy in qlen between
fq_codel and its parent can lead to a use-after-free condition.

Let's fix this by making all existing ->qlen_notify() idempotent so that
the sch->q.qlen check will be no longer necessary.

Patch 1~5 make all existing ->qlen_notify() idempotent to prepare for
patch 6 which removes the sch->q.qlen check. They are followed by 5
selftests for each type of Qdisc's we touch here.

All existing and new Qdisc selftests pass after this patchset.

Fixes: 4b549a2ef4be ("fq_codel: Fair Queue Codel AQM")
Fixes: 76e3cc126bb2 ("codel: Controlled Delay AQM")
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
====================

Link: https://patch.msgid.link/20250403211033.166059-1-xiyou.wangcong@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 weeks agoselftests/tc-testing: Add a test case for FQ_CODEL with ETS parent
Cong Wang [Thu, 3 Apr 2025 21:16:36 +0000 (14:16 -0700)]
selftests/tc-testing: Add a test case for FQ_CODEL with ETS parent

Add a test case for FQ_CODEL with ETS parent to verify packet drop
behavior when the queue becomes empty. This helps ensure proper
notification mechanisms between qdiscs.

Note this is best-effort, it is hard to play with those parameters
perfectly to always trigger ->qlen_notify().

Cc: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Link: https://patch.msgid.link/20250403211636.166257-6-xiyou.wangcong@gmail.com
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 weeks agoselftests/tc-testing: Add a test case for FQ_CODEL with DRR parent
Cong Wang [Thu, 3 Apr 2025 21:16:35 +0000 (14:16 -0700)]
selftests/tc-testing: Add a test case for FQ_CODEL with DRR parent

Add a test case for FQ_CODEL with DRR parent to verify packet drop
behavior when the queue becomes empty. This helps ensure proper
notification mechanisms between qdiscs.

Note this is best-effort, it is hard to play with those parameters
perfectly to always trigger ->qlen_notify().

Cc: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Link: https://patch.msgid.link/20250403211636.166257-5-xiyou.wangcong@gmail.com
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 weeks agoselftests/tc-testing: Add a test case for FQ_CODEL with HFSC parent
Cong Wang [Thu, 3 Apr 2025 21:16:34 +0000 (14:16 -0700)]
selftests/tc-testing: Add a test case for FQ_CODEL with HFSC parent

Add a test case for FQ_CODEL with HFSC parent to verify packet drop
behavior when the queue becomes empty. This helps ensure proper
notification mechanisms between qdiscs.

Note this is best-effort, it is hard to play with those parameters
perfectly to always trigger ->qlen_notify().

Cc: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Link: https://patch.msgid.link/20250403211636.166257-4-xiyou.wangcong@gmail.com
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 weeks agoselftests/tc-testing: Add a test case for FQ_CODEL with QFQ parent
Cong Wang [Thu, 3 Apr 2025 21:16:33 +0000 (14:16 -0700)]
selftests/tc-testing: Add a test case for FQ_CODEL with QFQ parent

Add a test case for FQ_CODEL with QFQ parent to verify packet drop
behavior when the queue becomes empty. This helps ensure proper
notification mechanisms between qdiscs.

Note this is best-effort, it is hard to play with those parameters
perfectly to always trigger ->qlen_notify().

Cc: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Link: https://patch.msgid.link/20250403211636.166257-3-xiyou.wangcong@gmail.com
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 weeks agoselftests/tc-testing: Add a test case for FQ_CODEL with HTB parent
Cong Wang [Thu, 3 Apr 2025 21:16:32 +0000 (14:16 -0700)]
selftests/tc-testing: Add a test case for FQ_CODEL with HTB parent

Add a test case for FQ_CODEL with HTB parent to verify packet drop
behavior when the queue becomes empty. This helps ensure proper
notification mechanisms between qdiscs.

Note this is best-effort, it is hard to play with those parameters
perfectly to always trigger ->qlen_notify().

Cc: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Link: https://patch.msgid.link/20250403211636.166257-2-xiyou.wangcong@gmail.com
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 weeks agocodel: remove sch->q.qlen check before qdisc_tree_reduce_backlog()
Cong Wang [Thu, 3 Apr 2025 21:16:31 +0000 (14:16 -0700)]
codel: remove sch->q.qlen check before qdisc_tree_reduce_backlog()

After making all ->qlen_notify() callbacks idempotent, now it is safe to
remove the check of qlen!=0 from both fq_codel_dequeue() and
codel_qdisc_dequeue().

Reported-by: Gerrard Tai <gerrard.tai@starlabs.sg>
Fixes: 4b549a2ef4be ("fq_codel: Fair Queue Codel AQM")
Fixes: 76e3cc126bb2 ("codel: Controlled Delay AQM")
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250403211636.166257-1-xiyou.wangcong@gmail.com
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 weeks agosch_ets: make est_qlen_notify() idempotent
Cong Wang [Thu, 3 Apr 2025 21:10:27 +0000 (14:10 -0700)]
sch_ets: make est_qlen_notify() idempotent

est_qlen_notify() deletes its class from its active list with
list_del() when qlen is 0, therefore, it is not idempotent and
not friendly to its callers, like fq_codel_dequeue().

Let's make it idempotent to ease qdisc_tree_reduce_backlog() callers'
life. Also change other list_del()'s to list_del_init() just to be
extra safe.

Reported-by: Gerrard Tai <gerrard.tai@starlabs.sg>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Link: https://patch.msgid.link/20250403211033.166059-6-xiyou.wangcong@gmail.com
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 weeks agosch_qfq: make qfq_qlen_notify() idempotent
Cong Wang [Thu, 3 Apr 2025 21:10:26 +0000 (14:10 -0700)]
sch_qfq: make qfq_qlen_notify() idempotent

qfq_qlen_notify() always deletes its class from its active list
with list_del_init() _and_ calls qfq_deactivate_agg() when the whole list
becomes empty.

To make it idempotent, just skip everything when it is not in the active
list.

Also change other list_del()'s to list_del_init() just to be extra safe.

Reported-by: Gerrard Tai <gerrard.tai@starlabs.sg>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250403211033.166059-5-xiyou.wangcong@gmail.com
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 weeks agosch_hfsc: make hfsc_qlen_notify() idempotent
Cong Wang [Thu, 3 Apr 2025 21:10:25 +0000 (14:10 -0700)]
sch_hfsc: make hfsc_qlen_notify() idempotent

hfsc_qlen_notify() is not idempotent either and not friendly
to its callers, like fq_codel_dequeue(). Let's make it idempotent
to ease qdisc_tree_reduce_backlog() callers' life:

1. update_vf() decreases cl->cl_nactive, so we can check whether it is
non-zero before calling it.

2. eltree_remove() always removes RB node cl->el_node, but we can use
   RB_EMPTY_NODE() + RB_CLEAR_NODE() to make it safe.

Reported-by: Gerrard Tai <gerrard.tai@starlabs.sg>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250403211033.166059-4-xiyou.wangcong@gmail.com
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 weeks agosch_drr: make drr_qlen_notify() idempotent
Cong Wang [Thu, 3 Apr 2025 21:10:24 +0000 (14:10 -0700)]
sch_drr: make drr_qlen_notify() idempotent

drr_qlen_notify() always deletes the DRR class from its active list
with list_del(), therefore, it is not idempotent and not friendly
to its callers, like fq_codel_dequeue().

Let's make it idempotent to ease qdisc_tree_reduce_backlog() callers'
life. Also change other list_del()'s to list_del_init() just to be
extra safe.

Reported-by: Gerrard Tai <gerrard.tai@starlabs.sg>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250403211033.166059-3-xiyou.wangcong@gmail.com
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 weeks agosch_htb: make htb_qlen_notify() idempotent
Cong Wang [Thu, 3 Apr 2025 21:10:23 +0000 (14:10 -0700)]
sch_htb: make htb_qlen_notify() idempotent

htb_qlen_notify() always deactivates the HTB class and in fact could
trigger a warning if it is already deactivated. Therefore, it is not
idempotent and not friendly to its callers, like fq_codel_dequeue().

Let's make it idempotent to ease qdisc_tree_reduce_backlog() callers'
life.

Reported-by: Gerrard Tai <gerrard.tai@starlabs.sg>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250403211033.166059-2-xiyou.wangcong@gmail.com
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 weeks agotipc: fix memory leak in tipc_link_xmit
Tung Nguyen [Thu, 3 Apr 2025 09:24:31 +0000 (09:24 +0000)]
tipc: fix memory leak in tipc_link_xmit

In case the backlog transmit queue for system-importance messages is overloaded,
tipc_link_xmit() returns -ENOBUFS but the skb list is not purged. This leads to
memory leak and failure when a skb is allocated.

This commit fixes this issue by purging the skb list before tipc_link_xmit()
returns.

Fixes: 365ad353c256 ("tipc: reduce risk of user starvation during link congestion")
Signed-off-by: Tung Nguyen <tung.quang.nguyen@est.tech>
Link: https://patch.msgid.link/20250403092431.514063-1-tung.quang.nguyen@est.tech
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 weeks agonet: hold instance lock during NETDEV_CHANGE
Stanislav Fomichev [Fri, 4 Apr 2025 16:11:22 +0000 (09:11 -0700)]
net: hold instance lock during NETDEV_CHANGE

Cosmin reports an issue with ipv6_add_dev being called from
NETDEV_CHANGE notifier:

[ 3455.008776]  ? ipv6_add_dev+0x370/0x620
[ 3455.010097]  ipv6_find_idev+0x96/0xe0
[ 3455.010725]  addrconf_add_dev+0x1e/0xa0
[ 3455.011382]  addrconf_init_auto_addrs+0xb0/0x720
[ 3455.013537]  addrconf_notify+0x35f/0x8d0
[ 3455.014214]  notifier_call_chain+0x38/0xf0
[ 3455.014903]  netdev_state_change+0x65/0x90
[ 3455.015586]  linkwatch_do_dev+0x5a/0x70
[ 3455.016238]  rtnl_getlink+0x241/0x3e0
[ 3455.019046]  rtnetlink_rcv_msg+0x177/0x5e0

Similarly, linkwatch might get to ipv6_add_dev without ops lock:
[ 3456.656261]  ? ipv6_add_dev+0x370/0x620
[ 3456.660039]  ipv6_find_idev+0x96/0xe0
[ 3456.660445]  addrconf_add_dev+0x1e/0xa0
[ 3456.660861]  addrconf_init_auto_addrs+0xb0/0x720
[ 3456.661803]  addrconf_notify+0x35f/0x8d0
[ 3456.662236]  notifier_call_chain+0x38/0xf0
[ 3456.662676]  netdev_state_change+0x65/0x90
[ 3456.663112]  linkwatch_do_dev+0x5a/0x70
[ 3456.663529]  __linkwatch_run_queue+0xeb/0x200
[ 3456.663990]  linkwatch_event+0x21/0x30
[ 3456.664399]  process_one_work+0x211/0x610
[ 3456.664828]  worker_thread+0x1cc/0x380
[ 3456.665691]  kthread+0xf4/0x210

Reclassify NETDEV_CHANGE as a notifier that consistently runs under the
instance lock.

Link: https://lore.kernel.org/netdev/aac073de8beec3e531c86c101b274d434741c28e.camel@nvidia.com/
Reported-by: Cosmin Ratiu <cratiu@nvidia.com>
Tested-by: Cosmin Ratiu <cratiu@nvidia.com>
Fixes: ad7c7b2172c3 ("net: hold netdev instance lock during sysfs operations")
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250404161122.3907628-1-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoipv6: Fix null-ptr-deref in addrconf_add_ifaddr().
Kuniyuki Iwashima [Sun, 6 Apr 2025 03:57:51 +0000 (20:57 -0700)]
ipv6: Fix null-ptr-deref in addrconf_add_ifaddr().

The cited commit placed netdev_lock_ops() just after __dev_get_by_index()
in addrconf_add_ifaddr(), where dev could be NULL as reported. [0]

Let's call netdev_lock_ops() only when dev is not NULL.

[0]:
Oops: general protection fault, probably for non-canonical address 0xdffffc0000000198: 0000 [#1] SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0000000000000cc0-0x0000000000000cc7]
CPU: 3 UID: 0 PID: 12032 Comm: syz.0.15 Not tainted 6.14.0-13408-g9f867ba24d36 #1 PREEMPT(full)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
RIP: 0010:addrconf_add_ifaddr (./include/net/netdev_lock.h:30 ./include/net/netdev_lock.h:41 net/ipv6/addrconf.c:3157)
Code: 8b b4 24 94 00 00 00 4c 89 ef e8 7e 4c 2f ff 4c 8d b0 c5 0c 00 00 48 89 c3 48 b8 00 00 00 00 00 fc ff df 4c 89 f2 48 c1 ea 03 <0f> b6 04 02 4c 89 f2 83 e2 07 38 d0 7f 08 80
RSP: 0018:ffffc90015b0faa0 EFLAGS: 00010213
RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000198 RSI: ffffffff893162f2 RDI: ffff888078cb0338
RBP: ffffc90015b0fbb0 R08: 0000000000000000 R09: fffffbfff20cbbe2
R10: ffffc90015b0faa0 R11: 0000000000000000 R12: 1ffff92002b61f54
R13: ffff888078cb0000 R14: 0000000000000cc5 R15: ffff888078cb0000
FS: 00007f92559ed640(0000) GS:ffff8882a8659000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f92559ecfc8 CR3: 000000001c39e000 CR4: 00000000000006f0
Call Trace:
 <TASK>
 inet6_ioctl (net/ipv6/af_inet6.c:580)
 sock_do_ioctl (net/socket.c:1196)
 sock_ioctl (net/socket.c:1314)
 __x64_sys_ioctl (fs/ioctl.c:52 fs/ioctl.c:906 fs/ioctl.c:892 fs/ioctl.c:892)
 do_syscall_64 (arch/x86/entry/syscall_64.c:63 arch/x86/entry/syscall_64.c:94)
 entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130
RIP: 0033:0x7f9254b9c62d
Code: 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff f8
RSP: 002b:00007f92559ecf98 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007f9254d65f80 RCX: 00007f9254b9c62d
RDX: 0000000020000040 RSI: 0000000000008916 RDI: 0000000000000003
RBP: 00007f9254c264d3 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000000 R14: 00007f9254d65f80 R15: 00007f92559cd000
 </TASK>
Modules linked in:

Fixes: 8965c160b8f7 ("net: use netif_disable_lro in ipv6_add_dev")
Reported-by: syzkaller <syzkaller@googlegroups.com>
Reported-by: Hui Guo <guohui.study@gmail.com>
Closes: https://lore.kernel.org/netdev/CAHOo4gK+tdU1B14Kh6tg-tNPqnQ1qGLfinONFVC43vmgEPnXXw@mail.gmail.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250406035755.69238-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoMerge branch 'fix-wrong-hds-thresh-value-setting'
Jakub Kicinski [Mon, 7 Apr 2025 18:00:04 +0000 (11:00 -0700)]
Merge branch 'fix-wrong-hds-thresh-value-setting'

Taehee Yoo says:

====================
fix wrong hds-thresh value setting

A hds-thresh value is not set correctly if input value is 0.
The cause is that ethtool_ringparam_get_cfg(), which is a internal
function that returns ringparameters from both ->get_ringparam() and
dev->cfg can't return a correct hds-thresh value.

The first patch fixes ethtool_ringparam_get_cfg() to set hds-thresh
value correcltly.

The second patch adds random test for hds-thresh value.
So that we can test 0 value for a hds-thresh properly.
====================

Link: https://patch.msgid.link/20250404122126.1555648-1-ap420073@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoselftests: drv-net: test random value for hds-thresh
Taehee Yoo [Fri, 4 Apr 2025 12:21:26 +0000 (12:21 +0000)]
selftests: drv-net: test random value for hds-thresh

hds.py has been testing 0(set_hds_thresh_zero()),
MAX(set_hds_thresh_max()), GT(set_hds_thresh_gt()) values for hds-thresh.
However if a hds-thresh value was already 0, set_hds_thresh_zero()
can't test properly.
So, it tests random value first and then tests 0, MAX, GT values.

Testing bnxt:
    TAP version 13
    1..13
    ok 1 hds.get_hds
    ok 2 hds.get_hds_thresh
    ok 3 hds.set_hds_disable # SKIP disabling of HDS not supported by
    the device
    ok 4 hds.set_hds_enable
    ok 5 hds.set_hds_thresh_random
    ok 6 hds.set_hds_thresh_zero
    ok 7 hds.set_hds_thresh_max
    ok 8 hds.set_hds_thresh_gt
    ok 9 hds.set_xdp
    ok 10 hds.enabled_set_xdp
    ok 11 hds.ioctl
    ok 12 hds.ioctl_set_xdp
    ok 13 hds.ioctl_enabled_set_xdp
    # Totals: pass:12 fail:0 xfail:0 xpass:0 skip:1 error:0

Testing lo:
    TAP version 13
    1..13
    ok 1 hds.get_hds # SKIP tcp-data-split not supported by device
    ok 2 hds.get_hds_thresh # SKIP hds-thresh not supported by device
    ok 3 hds.set_hds_disable # SKIP ring-set not supported by the device
    ok 4 hds.set_hds_enable # SKIP ring-set not supported by the device
    ok 5 hds.set_hds_thresh_random # SKIP hds-thresh not supported by
    device
    ok 6 hds.set_hds_thresh_zero # SKIP ring-set not supported by the
    device
    ok 7 hds.set_hds_thresh_max # SKIP hds-thresh not supported by
    device
    ok 8 hds.set_hds_thresh_gt # SKIP hds-thresh not supported by device
    ok 9 hds.set_xdp # SKIP tcp-data-split not supported by device
    ok 10 hds.enabled_set_xdp # SKIP tcp-data-split not supported by
    device
    ok 11 hds.ioctl # SKIP tcp-data-split not supported by device
    ok 12 hds.ioctl_set_xdp # SKIP tcp-data-split not supported by
    device
    ok 13 hds.ioctl_enabled_set_xdp # SKIP tcp-data-split not supported
    by device
    # Totals: pass:0 fail:0 xfail:0 xpass:0 skip:13 error:0

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Link: https://patch.msgid.link/20250404122126.1555648-3-ap420073@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agonet: ethtool: fix ethtool_ringparam_get_cfg() returns a hds_thresh value always as 0.
Taehee Yoo [Fri, 4 Apr 2025 12:21:25 +0000 (12:21 +0000)]
net: ethtool: fix ethtool_ringparam_get_cfg() returns a hds_thresh value always as 0.

When hds-thresh is configured, ethnl_set_rings() is called, and it calls
ethtool_ringparam_get_cfg() to get ringparameters from .get_ringparam()
callback and dev->cfg.
Both hds_config and hds_thresh values should be set from dev->cfg, not
from .get_ringparam().
But ethtool_ringparam_get_cfg() sets only hds_config from dev->cfg.
So, ethtool_ringparam_get_cfg() returns always a hds_thresh as 0.

If an input value of hds-thresh is 0, a hds_thresh value from
ethtool_ringparam_get_cfg() are same. So ethnl_set_rings() does
nothing and returns immediately.
It causes a bug that setting a hds-thresh value to 0 is not working.

Reproducer:
    modprobe netdevsim
    echo 1 > /sys/bus/netdevsim/new_device
    ethtool -G eth0 hds-thresh 100
    ethtool -G eth0 hds-thresh 0
    ethtool -g eth0
    #hds-thresh value should be 0, but it shows 100.

The tools/testing/selftests/drivers/net/hds.py can test it too with
applying a following patch for hds.py.

Fixes: 928459bbda19 ("net: ethtool: populate the default HDS params in the core")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Link: https://patch.msgid.link/20250404122126.1555648-2-ap420073@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoMerge tag 'net-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Linus Torvalds [Fri, 4 Apr 2025 16:15:35 +0000 (09:15 -0700)]
Merge tag 'net-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from netfilter.

  Current release - regressions:

   - four fixes for the netdev per-instance locking

  Current release - new code bugs:

   - consolidate more code between existing Rx zero-copy and uring so
     that the latter doesn't miss / have to duplicate the safety checks

  Previous releases - regressions:

   - ipv6: fix omitted Netlink attributes when using SKIP_STATS

  Previous releases - always broken:

   - net: fix geneve_opt length integer overflow

   - udp: fix multiple wrap arounds of sk->sk_rmem_alloc when it
     approaches INT_MAX

   - dsa: mvpp2: add a lock to avoid corruption of the shared TCAM

   - dsa: airoha: fix issues with traffic QoS configuration / offload,
     and flow table offload

  Misc:

   - touch up the Netlink YAML specs of old families to make them usable
     for user space C codegen"

* tag 'net-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (56 commits)
  selftests: net: amt: indicate progress in the stress test
  netlink: specs: rt_route: pull the ifa- prefix out of the names
  netlink: specs: rt_addr: pull the ifa- prefix out of the names
  netlink: specs: rt_addr: fix get multi command name
  netlink: specs: rt_addr: fix the spec format / schema failures
  net: avoid false positive warnings in __net_mp_close_rxq()
  net: move mp dev config validation to __net_mp_open_rxq()
  net: ibmveth: make veth_pool_store stop hanging
  arcnet: Add NULL check in com20020pci_probe()
  ipv6: Do not consider link down nexthops in path selection
  ipv6: Start path selection from the first nexthop
  usbnet:fix NPE during rx_complete
  net: octeontx2: Handle XDP_ABORTED and XDP invalid as XDP_DROP
  net: fix geneve_opt length integer overflow
  io_uring/zcrx: fix selftests w/ updated netdev Python helpers
  selftests: net: use netdevsim in netns test
  docs: net: document netdev notifier expectations
  net: dummy: request ops lock
  netdevsim: add dummy device notifiers
  net: rename rtnl_net_debug to lock_debug
  ...

3 weeks agoMerge tag 'spi-fix-v6.15-merge-window' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Fri, 4 Apr 2025 16:09:34 +0000 (09:09 -0700)]
Merge tag 'spi-fix-v6.15-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi

Pull spi fixes from Mark Brown:
 "A small collection of fixes that came in during the merge window,
  everything is driver specific with nothing standing out particularly"

* tag 'spi-fix-v6.15-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
  spi: bcm2835: Restore native CS probing when pinctrl-bcm2835 is absent
  spi: bcm2835: Do not call gpiod_put() on invalid descriptor
  spi: cadence-qspi: revert "Improve spi memory performance"
  spi: cadence: Fix out-of-bounds array access in cdns_mrvl_xspi_setup_clock()
  spi: fsl-qspi: use devm function instead of driver remove
  spi: SPI_QPIC_SNAND should be tristate and depend on MTD
  spi-rockchip: Fix register out of bounds access

3 weeks agoMerge tag 'soc-drivers-6.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Linus Torvalds [Fri, 4 Apr 2025 16:06:32 +0000 (09:06 -0700)]
Merge tag 'soc-drivers-6.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc

Pull more SoC driver updates from Arnd Bergmann:
 "This is the promised follow-up to the soc drivers branch, adding minor
  updates to omap and freescale drivers.

  Most notably, Ioana Ciornei takes over maintenance of the DPAA bus
  driver used in some NXP (originally Freescale) chips"

* tag 'soc-drivers-6.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc:
  bus: fsl-mc: Remove deadcode
  MAINTAINERS: add the linuppc-dev list to the fsl-mc bus entry
  MAINTAINERS: fix nonexistent dtbinding file name
  MAINTAINERS: add myself as maintainer for the fsl-mc bus
  irqdomain: soc: Switch to irq_find_mapping()
  Input: tsc2007 - accept standard properties

3 weeks agoMerge tag 'platform-drivers-x86-v6.15-2' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Fri, 4 Apr 2025 16:00:49 +0000 (09:00 -0700)]
Merge tag 'platform-drivers-x86-v6.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86

Pull x86 platform driver fixes from Ilpo Järvinen:

 - thinkpad_acpi:
     - Fix NULL pointer dereferences while probing
     - Disable ACPI fan access for T495* and E560

 - ISST: Correct command storage data length

* tag 'platform-drivers-x86-v6.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86:
  MAINTAINERS: consistently use my dedicated email address
  platform/x86: ISST: Correct command storage data length
  platform/x86: thinkpad_acpi: disable ACPI fan access for T495* and E560
  platform/x86: thinkpad_acpi: Fix NULL pointer dereferences while probing

3 weeks agoselftests: net: amt: indicate progress in the stress test
Jakub Kicinski [Thu, 3 Apr 2025 14:56:36 +0000 (07:56 -0700)]
selftests: net: amt: indicate progress in the stress test

Our CI expects output from the test at least once every 10 minutes.
The AMT test when running on debug kernel is just on the edge
of that time for the stress test. Improve the output:
 - print the name of the test first, before starting it,
 - output a dot every 10% of the way.

Output after:

  TEST: amt discovery                                                 [ OK ]
  TEST: IPv4 amt multicast forwarding                                 [ OK ]
  TEST: IPv6 amt multicast forwarding                                 [ OK ]
  TEST: IPv4 amt traffic forwarding torture               ..........  [ OK ]
  TEST: IPv6 amt traffic forwarding torture               ..........  [ OK ]

Reviewed-by: Taehee Yoo <ap420073@gmail.com>
Link: https://patch.msgid.link/20250403145636.2891166-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoMerge branch 'netlink-specs-rt_addr-fix-problems-revealed-by-c-codegen'
Jakub Kicinski [Fri, 4 Apr 2025 14:36:11 +0000 (07:36 -0700)]
Merge branch 'netlink-specs-rt_addr-fix-problems-revealed-by-c-codegen'

Jakub Kicinski says:

====================
netlink: specs: rt_addr: fix problems revealed by C codegen

I put together basic YNL C support for classic netlink. This revealed
a few problems in the rt_addr spec.

v1: https://lore.kernel.org/20250401012939.2116915-1-kuba@kernel.org
====================

Link: https://patch.msgid.link/20250403013706.2828322-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agonetlink: specs: rt_route: pull the ifa- prefix out of the names
Jakub Kicinski [Thu, 3 Apr 2025 01:37:06 +0000 (18:37 -0700)]
netlink: specs: rt_route: pull the ifa- prefix out of the names

YAML specs don't normally include the C prefix name in the name
of the YAML attr. Remove the ifa- prefix from all attributes
in route-attrs and metrics and specify name-prefix instead.

This is a bit risky, hopefully there aren't many users out there.

Fixes: 023289b4f582 ("doc/netlink: Add spec for rt route messages")
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250403013706.2828322-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agonetlink: specs: rt_addr: pull the ifa- prefix out of the names
Jakub Kicinski [Thu, 3 Apr 2025 01:37:05 +0000 (18:37 -0700)]
netlink: specs: rt_addr: pull the ifa- prefix out of the names

YAML specs don't normally include the C prefix name in the name
of the YAML attr. Remove the ifa- prefix from all attributes
in addr-attrs and specify name-prefix instead.

This is a bit risky, hopefully there aren't many users out there.

Fixes: dfb0f7d9d979 ("doc/netlink: Add spec for rt addr messages")
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250403013706.2828322-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agonetlink: specs: rt_addr: fix get multi command name
Jakub Kicinski [Thu, 3 Apr 2025 01:37:04 +0000 (18:37 -0700)]
netlink: specs: rt_addr: fix get multi command name

Command names should match C defines, codegens may depend on it.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Fixes: 4f280376e531 ("selftests/net: Add selftest for IPv4 RTM_GETMULTICAST support")
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250403013706.2828322-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agonetlink: specs: rt_addr: fix the spec format / schema failures
Jakub Kicinski [Thu, 3 Apr 2025 01:37:03 +0000 (18:37 -0700)]
netlink: specs: rt_addr: fix the spec format / schema failures

The spec is mis-formatted, schema validation says:

  Failed validating 'type' in schema['properties']['operations']['properties']['list']['items']['properties']['dump']['properties']['request']['properties']['value']:
    {'minimum': 0, 'type': 'integer'}

  On instance['operations']['list'][3]['dump']['request']['value']:
    '58 - ifa-family'

The ifa-family clearly wants to be part of an attribute list.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Reviewed-by: Yuyang Huang <yuyanghuang@google.com>
Fixes: 4f280376e531 ("selftests/net: Add selftest for IPv4 RTM_GETMULTICAST support")
Link: https://patch.msgid.link/20250403013706.2828322-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoMerge branch 'net-make-memory-provider-install-close-paths-more-common'
Jakub Kicinski [Fri, 4 Apr 2025 14:35:42 +0000 (07:35 -0700)]
Merge branch 'net-make-memory-provider-install-close-paths-more-common'

Jakub Kicinski says:

====================
net: make memory provider install / close paths more common

We seem to be fixing bugs in config path for devmem which also exist
in the io_uring ZC path. Let's try to make the two paths more common,
otherwise this is bound to keep happening.

Found by code inspection and compile tested only.

v1: https://lore.kernel.org/20250331194201.2026422-1-kuba@kernel.org
====================

Link: https://patch.msgid.link/20250403013405.2827250-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agonet: avoid false positive warnings in __net_mp_close_rxq()
Jakub Kicinski [Thu, 3 Apr 2025 01:34:05 +0000 (18:34 -0700)]
net: avoid false positive warnings in __net_mp_close_rxq()

Commit under Fixes solved the problem of spurious warnings when we
uninstall an MP from a device while its down. The __net_mp_close_rxq()
which is used by io_uring was not fixed. Move the fix over and reuse
__net_mp_close_rxq() in the devmem path.

Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Fixes: a70f891e0fa0 ("net: devmem: do not WARN conditionally after netdev_rx_queue_restart()")
Reviewed-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20250403013405.2827250-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agonet: move mp dev config validation to __net_mp_open_rxq()
Jakub Kicinski [Thu, 3 Apr 2025 01:34:04 +0000 (18:34 -0700)]
net: move mp dev config validation to __net_mp_open_rxq()

devmem code performs a number of safety checks to avoid having
to reimplement all of them in the drivers. Move those to
__net_mp_open_rxq() and reuse that function for binding to make
sure that io_uring ZC also benefits from them.

While at it rename the queue ID variable to rxq_idx in
__net_mp_open_rxq(), we touch most of the relevant lines.

The XArray insertion is reordered after the netdev_rx_queue_restart()
call, otherwise we'd need to duplicate the queue index check
or risk inserting an invalid pointer. The XArray allocation
failures should be extremely rare.

Reviewed-by: Mina Almasry <almasrymina@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Fixes: 6e18ed929d3b ("net: add helpers for setting a memory provider on an rx queue")
Link: https://patch.msgid.link/20250403013405.2827250-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agonet: ibmveth: make veth_pool_store stop hanging
Dave Marquardt [Wed, 2 Apr 2025 15:44:03 +0000 (10:44 -0500)]
net: ibmveth: make veth_pool_store stop hanging

v2:
- Created a single error handling unlock and exit in veth_pool_store
- Greatly expanded commit message with previous explanatory-only text

Summary: Use rtnl_mutex to synchronize veth_pool_store with itself,
ibmveth_close and ibmveth_open, preventing multiple calls in a row to
napi_disable.

Background: Two (or more) threads could call veth_pool_store through
writing to /sys/devices/vio/30000002/pool*/*. You can do this easily
with a little shell script. This causes a hang.

I configured LOCKDEP, compiled ibmveth.c with DEBUG, and built a new
kernel. I ran this test again and saw:

    Setting pool0/active to 0
    Setting pool1/active to 1
    [   73.911067][ T4365] ibmveth 30000002 eth0: close starting
    Setting pool1/active to 1
    Setting pool1/active to 0
    [   73.911367][ T4366] ibmveth 30000002 eth0: close starting
    [   73.916056][ T4365] ibmveth 30000002 eth0: close complete
    [   73.916064][ T4365] ibmveth 30000002 eth0: open starting
    [  110.808564][  T712] systemd-journald[712]: Sent WATCHDOG=1 notification.
    [  230.808495][  T712] systemd-journald[712]: Sent WATCHDOG=1 notification.
    [  243.683786][  T123] INFO: task stress.sh:4365 blocked for more than 122 seconds.
    [  243.683827][  T123]       Not tainted 6.14.0-01103-g2df0c02dab82-dirty #8
    [  243.683833][  T123] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [  243.683838][  T123] task:stress.sh       state:D stack:28096 pid:4365  tgid:4365  ppid:4364   task_flags:0x400040 flags:0x00042000
    [  243.683852][  T123] Call Trace:
    [  243.683857][  T123] [c00000000c38f690] [0000000000000001] 0x1 (unreliable)
    [  243.683868][  T123] [c00000000c38f840] [c00000000001f908] __switch_to+0x318/0x4e0
    [  243.683878][  T123] [c00000000c38f8a0] [c000000001549a70] __schedule+0x500/0x12a0
    [  243.683888][  T123] [c00000000c38f9a0] [c00000000154a878] schedule+0x68/0x210
    [  243.683896][  T123] [c00000000c38f9d0] [c00000000154ac80] schedule_preempt_disabled+0x30/0x50
    [  243.683904][  T123] [c00000000c38fa00] [c00000000154dbb0] __mutex_lock+0x730/0x10f0
    [  243.683913][  T123] [c00000000c38fb10] [c000000001154d40] napi_enable+0x30/0x60
    [  243.683921][  T123] [c00000000c38fb40] [c000000000f4ae94] ibmveth_open+0x68/0x5dc
    [  243.683928][  T123] [c00000000c38fbe0] [c000000000f4aa20] veth_pool_store+0x220/0x270
    [  243.683936][  T123] [c00000000c38fc70] [c000000000826278] sysfs_kf_write+0x68/0xb0
    [  243.683944][  T123] [c00000000c38fcb0] [c0000000008240b8] kernfs_fop_write_iter+0x198/0x2d0
    [  243.683951][  T123] [c00000000c38fd00] [c00000000071b9ac] vfs_write+0x34c/0x650
    [  243.683958][  T123] [c00000000c38fdc0] [c00000000071bea8] ksys_write+0x88/0x150
    [  243.683966][  T123] [c00000000c38fe10] [c0000000000317f4] system_call_exception+0x124/0x340
    [  243.683973][  T123] [c00000000c38fe50] [c00000000000d05c] system_call_vectored_common+0x15c/0x2ec
    ...
    [  243.684087][  T123] Showing all locks held in the system:
    [  243.684095][  T123] 1 lock held by khungtaskd/123:
    [  243.684099][  T123]  #0: c00000000278e370 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x50/0x248
    [  243.684114][  T123] 4 locks held by stress.sh/4365:
    [  243.684119][  T123]  #0: c00000003a4cd3f8 (sb_writers#3){.+.+}-{0:0}, at: ksys_write+0x88/0x150
    [  243.684132][  T123]  #1: c000000041aea888 (&of->mutex#2){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x154/0x2d0
    [  243.684143][  T123]  #2: c0000000366fb9a8 (kn->active#64){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x160/0x2d0
    [  243.684155][  T123]  #3: c000000035ff4cb8 (&dev->lock){+.+.}-{3:3}, at: napi_enable+0x30/0x60
    [  243.684166][  T123] 5 locks held by stress.sh/4366:
    [  243.684170][  T123]  #0: c00000003a4cd3f8 (sb_writers#3){.+.+}-{0:0}, at: ksys_write+0x88/0x150
    [  243.684183][  T123]  #1: c00000000aee2288 (&of->mutex#2){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x154/0x2d0
    [  243.684194][  T123]  #2: c0000000366f4ba8 (kn->active#64){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x160/0x2d0
    [  243.684205][  T123]  #3: c000000035ff4cb8 (&dev->lock){+.+.}-{3:3}, at: napi_disable+0x30/0x60
    [  243.684216][  T123]  #4: c0000003ff9bbf18 (&rq->__lock){-.-.}-{2:2}, at: __schedule+0x138/0x12a0

From the ibmveth debug, two threads are calling veth_pool_store, which
calls ibmveth_close and ibmveth_open. Here's the sequence:

  T4365             T4366
  ----------------- ----------------- ---------
  veth_pool_store   veth_pool_store
                    ibmveth_close
  ibmveth_close
  napi_disable
                    napi_disable
  ibmveth_open
  napi_enable                         <- HANG

ibmveth_close calls napi_disable at the top and ibmveth_open calls
napi_enable at the top.

https://docs.kernel.org/networking/napi.html]] says

  The control APIs are not idempotent. Control API calls are safe
  against concurrent use of datapath APIs but an incorrect sequence of
  control API calls may result in crashes, deadlocks, or race
  conditions. For example, calling napi_disable() multiple times in a
  row will deadlock.

In the normal open and close paths, rtnl_mutex is acquired to prevent
other callers. This is missing from veth_pool_store. Use rtnl_mutex in
veth_pool_store fixes these hangs.

Signed-off-by: Dave Marquardt <davemarq@linux.ibm.com>
Fixes: 860f242eb534 ("[PATCH] ibmveth change buffer pools dynamically")
Reviewed-by: Nick Child <nnac123@linux.ibm.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250402154403.386744-1-davemarq@linux.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoarcnet: Add NULL check in com20020pci_probe()
Henry Martin [Wed, 2 Apr 2025 13:50:36 +0000 (21:50 +0800)]
arcnet: Add NULL check in com20020pci_probe()

devm_kasprintf() returns NULL when memory allocation fails. Currently,
com20020pci_probe() does not check for this case, which results in a
NULL pointer dereference.

Add NULL check after devm_kasprintf() to prevent this issue and ensure
no resources are left allocated.

Fixes: 6b17a597fc2f ("arcnet: restoring support for multiple Sohard Arcnet cards")
Signed-off-by: Henry Martin <bsdhenrymartin@gmail.com>
Link: https://patch.msgid.link/20250402135036.44697-1-bsdhenrymartin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoMerge branch 'ipv6-multipath-routing-fixes'
Jakub Kicinski [Fri, 4 Apr 2025 14:30:09 +0000 (07:30 -0700)]
Merge branch 'ipv6-multipath-routing-fixes'

Ido Schimmel says:

====================
ipv6: Multipath routing fixes

This patchset contains two fixes for IPv6 multipath routing. See the
commit messages for more details.
====================

Link: https://patch.msgid.link/20250402114224.293392-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoipv6: Do not consider link down nexthops in path selection
Ido Schimmel [Wed, 2 Apr 2025 11:42:24 +0000 (14:42 +0300)]
ipv6: Do not consider link down nexthops in path selection

Nexthops whose link is down are not supposed to be considered during
path selection when the "ignore_routes_with_linkdown" sysctl is set.
This is done by assigning them a negative region boundary.

However, when comparing the computed hash (unsigned) with the region
boundary (signed), the negative region boundary is treated as unsigned,
resulting in incorrect nexthop selection.

Fix by treating the computed hash as signed. Note that the computed hash
is always in range of [0, 2^31 - 1].

Fixes: 3d709f69a3e7 ("ipv6: Use hash-threshold instead of modulo-N")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250402114224.293392-3-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoipv6: Start path selection from the first nexthop
Ido Schimmel [Wed, 2 Apr 2025 11:42:23 +0000 (14:42 +0300)]
ipv6: Start path selection from the first nexthop

Cited commit transitioned IPv6 path selection to use hash-threshold
instead of modulo-N. With hash-threshold, each nexthop is assigned a
region boundary in the multipath hash function's output space and a
nexthop is chosen if the calculated hash is smaller than the nexthop's
region boundary.

Hash-threshold does not work correctly if path selection does not start
with the first nexthop. For example, if fib6_select_path() is always
passed the last nexthop in the group, then it will always be chosen
because its region boundary covers the entire hash function's output
space.

Fix this by starting the selection process from the first nexthop and do
not consider nexthops for which rt6_score_route() provided a negative
score.

Fixes: 3d709f69a3e7 ("ipv6: Use hash-threshold instead of modulo-N")
Reported-by: Stanislav Fomichev <stfomichev@gmail.com>
Closes: https://lore.kernel.org/netdev/Z9RIyKZDNoka53EO@mini-arch/
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250402114224.293392-2-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agousbnet:fix NPE during rx_complete
Ying Lu [Wed, 2 Apr 2025 08:58:59 +0000 (16:58 +0800)]
usbnet:fix NPE during rx_complete

Missing usbnet_going_away Check in Critical Path.
The usb_submit_urb function lacks a usbnet_going_away
validation, whereas __usbnet_queue_skb includes this check.

This inconsistency creates a race condition where:
A URB request may succeed, but the corresponding SKB data
fails to be queued.

Subsequent processes:
(e.g., rx_complete → defer_bh → __skb_unlink(skb, list))
attempt to access skb->next, triggering a NULL pointer
dereference (Kernel Panic).

Fixes: 04e906839a05 ("usbnet: fix cyclical race on disconnect with work queue")
Cc: stable@vger.kernel.org
Signed-off-by: Ying Lu <luying1@xiaomi.com>
Link: https://patch.msgid.link/4c9ef2efaa07eb7f9a5042b74348a67e5a3a7aea.1743584159.git.luying1@xiaomi.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agonet: octeontx2: Handle XDP_ABORTED and XDP invalid as XDP_DROP
Lorenzo Bianconi [Tue, 1 Apr 2025 09:02:12 +0000 (11:02 +0200)]
net: octeontx2: Handle XDP_ABORTED and XDP invalid as XDP_DROP

In the current implementation octeontx2 manages XDP_ABORTED and XDP
invalid as XDP_PASS forwarding the skb to the networking stack.
Align the behaviour to other XDP drivers handling XDP_ABORTED and XDP
invalid as XDP_DROP.
Please note this patch has just compile tested.

Fixes: 06059a1a9a4a5 ("octeontx2-pf: Add XDP support to netdev PF")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/20250401-octeontx2-xdp-abort-fix-v1-1-f0587c35a0b9@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoMerge tag 'x86-urgent-2025-04-04' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Fri, 4 Apr 2025 14:12:26 +0000 (07:12 -0700)]
Merge tag 'x86-urgent-2025-04-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Ingo Molnar:

 - Fix a performance regression on AMD iGPU and dGPU drivers, related to
   the unintended activation of DMA bounce buffers that regressed game
   performance if KASLR disturbed things just enough

 - Fix a copy_user_generic() performance regression on certain older
   non-FSRM/ERMS CPUs

 - Fix a Clang build warning due to a semantic merge conflict the Kunit
   tree generated with the x86 tree

 - Fix FRED related system hang during S4 resume

 - Remove an unused API

* tag 'x86-urgent-2025-04-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/fred: Fix system hang during S4 resume with FRED enabled
  x86/platform/iosf_mbi: Remove unused iosf_mbi_unregister_pmic_bus_access_notifier()
  x86/mm/init: Handle the special case of device private pages in add_pages(), to not increase max_pfn and trigger dma_addressing_limited() bounce buffers
  x86/tools: Drop duplicate unlikely() definition in insn_decoder_test.c
  x86/uaccess: Improve performance by aligning writes to 8 bytes in copy_user_generic(), on non-FSRM/ERMS CPUs

3 weeks agoMerge tag 'sound-fix-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai...
Linus Torvalds [Fri, 4 Apr 2025 14:05:33 +0000 (07:05 -0700)]
Merge tag 'sound-fix-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound

Pull sound fixes from Takashi Iwai:
 "A collection of device-specific fixes that have been gathered since
  the previous pull:

   - A few more HD-audio quirks and fixups

   - A series of Qualcomm AudioReach fixes

   - Various small fixes for ASoC rt5665, WSA, SOF and Cirrus"

* tag 'sound-fix-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
  ALSA: hda/realtek: Fix built-in mic on another ASUS VivoBook model
  ALSA: hda/realtek - Support mute led function for HP platform
  ASoC: imx-card: Add NULL check in imx_card_probe()
  ASoC: codecs: rt5665: Fix some error handling paths in rt5665_probe()
  ASoC: q6apm-dai: make use of q6apm_get_hw_pointer
  ASoC: qdsp6: q6apm-dai: fix capture pipeline overruns.
  ASoC: qdsp6: q6apm-dai: set 10 ms period and buffer alignment.
  ASoC: q6apm: add q6apm_get_hw_pointer helper
  ASoC: q6apm-dai: schedule all available frames to avoid dsp under-runs
  ASoC: SOF: hda/ptl: Move mic privacy change notification sending to a work
  ALSA/hda: intel-sdw-acpi: Remove (explicitly) unused header
  ALSA: hda/realtek: Enable Mute LED on HP OMEN 16 Laptop xd000xx
  ALSA: hda/tas2781: Upgrade calibratd-data writing code to support Alpha and Beta dsp firmware
  ASoC: qdsp6: q6asm-dai: fix q6asm_dai_compr_set_params error path
  ALSA: hda/realtek: Fix built-in mic breakage on ASUS VivoBook X515JA
  ASoC: sma1307: Fix error handling in sma1307_setting_loaded()
  ASoC: codecs: wsa884x: Correct VI sense channel mask
  ASoC: codecs: wsa883x: Correct VI sense channel mask
  firmware: cs_dsp: Ensure cs_dsp_load[_coeff]() returns 0 on success

3 weeks agoMerge tag 'omap-for-v6.14/drivers-signed' of https://git.kernel.org/pub/scm/linux...
Arnd Bergmann [Fri, 4 Apr 2025 12:37:41 +0000 (14:37 +0200)]
Merge tag 'omap-for-v6.14/drivers-signed' of https://git.kernel.org/pub/scm/linux/kernel/git/khilman/linux-omap into soc/drivers-2

arm/omap: drivers: updates for v6.14

* tag 'omap-for-v6.14/drivers-signed' of https://git.kernel.org/pub/scm/linux/kernel/git/khilman/linux-omap:
  Input: tsc2007 - accept standard properties

3 weeks agoMerge tag 'soc_fsl-6.15-1' of https://github.com/chleroy/linux into soc/drivers-2
Arnd Bergmann [Fri, 4 Apr 2025 12:37:06 +0000 (14:37 +0200)]
Merge tag 'soc_fsl-6.15-1' of https://github.com/chleroy/linux into soc/drivers-2

FSL SOC Changes for 6.15:

- irqdomain cleanups from Jiry

- Add Ioana as Maintainer of fsl-mc bus and remove Laurentiu and Stuart

- Remove deadcode from fsl-mc bus

* tag 'soc_fsl-6.15-1' of https://github.com/chleroy/linux:
  bus: fsl-mc: Remove deadcode
  MAINTAINERS: add the linuppc-dev list to the fsl-mc bus entry
  MAINTAINERS: fix nonexistent dtbinding file name
  MAINTAINERS: add myself as maintainer for the fsl-mc bus
  irqdomain: soc: Switch to irq_find_mapping()

3 weeks agoMerge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Linus Torvalds [Fri, 4 Apr 2025 04:12:48 +0000 (21:12 -0700)]
Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull dcache fixes from Al Viro:
 "Fixes for bugs caught as part of tree-in-dcache work.

  Mostly dentry refcount mishandling"

* tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  hypfs_create_cpu_files(): add missing check for hypfs_mkdir() failure
  qibfs: fix _another_ leak
  spufs: fix a leak in spufs_create_context()
  spufs: fix gang directory lifetimes
  spufs: fix a leak on spufs_new_file() failure

3 weeks agoMerge tag 'nf-25-04-03' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Jakub Kicinski [Thu, 3 Apr 2025 23:23:00 +0000 (16:23 -0700)]
Merge tag 'nf-25-04-03' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following batch contains Netfilter fixes for net:

1) conncount incorrectly removes element for non-dynamic sets,
   these elements represent a static control plane configuration,
   leave them in place.

2) syzbot found a way to unregister a basechain that has been never
   registered from the chain update path, fix from Florian Westphal.

3) Fix incorrect pointer arithmetics in geneve support for tunnel,
   from Lin Ma.

* tag 'nf-25-04-03' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: nft_tunnel: fix geneve_opt type confusion addition
  netfilter: nf_tables: don't unregister hook when table is dormant
  netfilter: nft_set_hash: GC reaps elements with conncount for dynamic sets only
====================

Link: https://patch.msgid.link/20250403115752.19608-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoMerge tag 'v6.15rc-part2-ksmbd-server-fixes' of git://git.samba.org/ksmbd
Linus Torvalds [Thu, 3 Apr 2025 23:18:06 +0000 (16:18 -0700)]
Merge tag 'v6.15rc-part2-ksmbd-server-fixes' of git://git.samba.org/ksmbd

Pull smb server fixes from Steve French:
 "Four ksmbd SMB3 server fixes, all also for stable"

* tag 'v6.15rc-part2-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
  ksmbd: fix null pointer dereference in alloc_preauth_hash()
  ksmbd: validate zero num_subauth before sub_auth is accessed
  ksmbd: fix overflow in dacloffset bounds check
  ksmbd: fix session use-after-free in multichannel connection

3 weeks agoMerge tag 'trace-ringbuffer-v6.15-3' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Thu, 3 Apr 2025 23:09:29 +0000 (16:09 -0700)]
Merge tag 'trace-ringbuffer-v6.15-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull ring-buffer updates from Steven Rostedt:
 "Persistent buffer cleanups and simplifications.

  It was mistaken that the physical memory returned from "reserve_mem"
  had to be vmap()'d to get to it from a virtual address. But
  reserve_mem already maps the memory to the virtual address of the
  kernel so a simple phys_to_virt() can be used to get to the virtual
  address from the physical memory returned by "reserve_mem". With this
  new found knowledge, the code can be cleaned up and simplified.

   - Enforce that the persistent memory is page aligned

     As the buffers using the persistent memory are all going to be
     mapped via pages, make sure that the memory given to the tracing
     infrastructure is page aligned. If it is not, it will print a
     warning and fail to map the buffer.

   - Use phys_to_virt() to get the virtual address from reserve_mem

     Instead of calling vmap() on the physical memory returned from
     "reserve_mem", use phys_to_virt() instead.

     As the memory returned by "memmap" or any other means where a
     physical address is given to the tracing infrastructure, it still
     needs to be vmap(). Since this memory can never be returned back to
     the buddy allocator nor should it ever be memmory mapped to user
     space, flag this buffer and up the ref count. The ref count will
     keep it from ever being freed, and the flag will prevent it from
     ever being memory mapped to user space.

   - Use vmap_page_range() for memmap virtual address mapping

     For the memmap buffer, instead of allocating an array of struct
     pages, assigning them to the contiguous phsycial memory and then
     passing that to vmap(), use vmap_page_range() instead

   - Replace flush_dcache_folio() with flush_kernel_vmap_range()

     Instead of calling virt_to_folio() and passing that to
     flush_dcache_folio(), just call flush_kernel_vmap_range() directly.
     This also fixes a bug where if a subbuffer was bigger than
     PAGE_SIZE only the PAGE_SIZE portion would be flushed"

* tag 'trace-ringbuffer-v6.15-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  ring-buffer: Use flush_kernel_vmap_range() over flush_dcache_folio()
  tracing: Use vmap_page_range() to map memmap ring buffer
  tracing: Have reserve_mem use phys_to_virt() and separate from memmap buffer
  tracing: Enforce the persistent ring buffer to be page aligned

3 weeks agoMerge tag 'block-6.15-20250403' of git://git.kernel.dk/linux
Linus Torvalds [Thu, 3 Apr 2025 23:04:38 +0000 (16:04 -0700)]
Merge tag 'block-6.15-20250403' of git://git.kernel.dk/linux

Pull more block updates from Jens Axboe:

 - NVMe pull request via Keith:
      - PCI endpoint target cleanup (Damien)
      - Early import for uring_cmd fixed buffer (Caleb)
      - Multipath documentation and notification improvements (John)
      - Invalid pci sq doorbell write fix (Maurizio)

 - Queue init locking fix

 - Remove dead nsegs parameter from blk_mq_get_new_requests()

* tag 'block-6.15-20250403' of git://git.kernel.dk/linux:
  block: don't grab elevator lock during queue initialization
  nvme-pci: skip nvme_write_sq_db on empty rqlist
  nvme-multipath: change the NVME_MULTIPATH config option
  nvme: update the multipath warning in nvme_init_ns_head
  nvme/ioctl: move fixed buffer lookup to nvme_uring_cmd_io()
  nvme/ioctl: move blk_mq_free_request() out of nvme_map_user_request()
  nvme/ioctl: don't warn on vectorized uring_cmd with fixed buffer
  nvmet: pci-epf: Keep completion queues mapped
  block: remove unused nseg parameter

3 weeks agoMerge branch '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue
Jakub Kicinski [Thu, 3 Apr 2025 22:56:49 +0000 (15:56 -0700)]
Merge branch '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue

Tony Nguyen says:

====================
Intel Wired LAN Driver Updates 2025-04-02 (igc, e1000e, ixgbe, idpf)

For igc:
Joe Damato removes unmapping of XSK queues from NAPI instance.

Zdenek Bouska swaps condition checks/call to prevent AF_XDP Tx drops
with low budget value.

For e1000e:
Vitaly adjusts Kumeran interface configuration to prevent MDI errors.

For ixgbe:
Piotr clears PHY high values on media type detection to ensure stale
values are not used.

For idpf:
Emil adjusts shutdown calls to prevent NULL pointer dereference.

* '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
  idpf: fix adapter NULL pointer dereference on reboot
  ixgbe: fix media type detection for E610 device
  e1000e: change k1 configuration on MTP and later platforms
  igc: Fix TX drops in XDP ZC
  igc: Fix XSK queue NAPI ID mapping
====================

Link: https://patch.msgid.link/20250402173900.1957261-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoMerge tag 'io_uring-6.15-20250403' of git://git.kernel.dk/linux
Linus Torvalds [Thu, 3 Apr 2025 22:48:58 +0000 (15:48 -0700)]
Merge tag 'io_uring-6.15-20250403' of git://git.kernel.dk/linux

Pull more io_uring updates from Jens Axboe:
 "Set of fixes/updates for io_uring that should go into this release.

  The ublk bits could've gone via either tree - usually I put them in
  block, but they got a bit mixed this series with the zero-copy
  supported that ended up dipping into both trees.

  This contains:

   - Fix for sendmsg zc, include in pinned pages accounting like we do
     for the other zc types

   - Series for ublk fixing request aborting, doing various little
     cleanups, fixing some zc issues, and adding queue_rqs support

   - Another ublk series doing some code cleanups

   - Series cleaning up the io_uring send path, mostly in preparation
     for registered buffers

   - Series doing little MSG_RING cleanups

   - Fix for the newly added zc rx, fixing len being 0 for the last
     invocation of the callback

   - Add vectored registered buffer support for ublk. With that, then
     ublk also supports this feature in the kernel revision where it
     could generically introduced for rw/net

   - A bunch of selftest additions for ublk. This is the majority of the
     diffstat

   - Silence a KCSAN data race warning for io-wq

   - Various little cleanups and fixes"

* tag 'io_uring-6.15-20250403' of git://git.kernel.dk/linux: (44 commits)
  io_uring: always do atomic put from iowq
  selftests: ublk: enable zero copy for stripe target
  io_uring: support vectored kernel fixed buffer
  block: add for_each_mp_bvec()
  io_uring: add validate_fixed_range() for validate fixed buffer
  selftests: ublk: kublk: fix an error log line
  selftests: ublk: kublk: use ioctl-encoded opcodes
  io_uring/zcrx: return early from io_zcrx_recv_skb if readlen is 0
  io_uring/net: avoid import_ubuf for regvec send
  io_uring/rsrc: check size when importing reg buffer
  io_uring: cleanup {g,s]etsockopt sqe reading
  io_uring: hide caches sqes from drivers
  io_uring: make zcrx depend on CONFIG_IO_URING
  io_uring: add req flag invariant build assertion
  Documentation: ublk: remove dead footnote
  selftests: ublk: specify io_cmd_buf pointer type
  ublk: specify io_cmd_buf pointer type
  io_uring: don't pass ctx to tw add remote helper
  io_uring/msg: initialise msg request opcode
  io_uring/msg: rename io_double_lock_ctx()
  ...

3 weeks agonet: fix geneve_opt length integer overflow
Lin Ma [Wed, 2 Apr 2025 16:56:32 +0000 (00:56 +0800)]
net: fix geneve_opt length integer overflow

struct geneve_opt uses 5 bit length for each single option, which
means every vary size option should be smaller than 128 bytes.

However, all current related Netlink policies cannot promise this
length condition and the attacker can exploit a exact 128-byte size
option to *fake* a zero length option and confuse the parsing logic,
further achieve heap out-of-bounds read.

One example crash log is like below:

[    3.905425] ==================================================================
[    3.905925] BUG: KASAN: slab-out-of-bounds in nla_put+0xa9/0xe0
[    3.906255] Read of size 124 at addr ffff888005f291cc by task poc/177
[    3.906646]
[    3.906775] CPU: 0 PID: 177 Comm: poc-oob-read Not tainted 6.1.132 #1
[    3.907131] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[    3.907784] Call Trace:
[    3.907925]  <TASK>
[    3.908048]  dump_stack_lvl+0x44/0x5c
[    3.908258]  print_report+0x184/0x4be
[    3.909151]  kasan_report+0xc5/0x100
[    3.909539]  kasan_check_range+0xf3/0x1a0
[    3.909794]  memcpy+0x1f/0x60
[    3.909968]  nla_put+0xa9/0xe0
[    3.910147]  tunnel_key_dump+0x945/0xba0
[    3.911536]  tcf_action_dump_1+0x1c1/0x340
[    3.912436]  tcf_action_dump+0x101/0x180
[    3.912689]  tcf_exts_dump+0x164/0x1e0
[    3.912905]  fw_dump+0x18b/0x2d0
[    3.913483]  tcf_fill_node+0x2ee/0x460
[    3.914778]  tfilter_notify+0xf4/0x180
[    3.915208]  tc_new_tfilter+0xd51/0x10d0
[    3.918615]  rtnetlink_rcv_msg+0x4a2/0x560
[    3.919118]  netlink_rcv_skb+0xcd/0x200
[    3.919787]  netlink_unicast+0x395/0x530
[    3.921032]  netlink_sendmsg+0x3d0/0x6d0
[    3.921987]  __sock_sendmsg+0x99/0xa0
[    3.922220]  __sys_sendto+0x1b7/0x240
[    3.922682]  __x64_sys_sendto+0x72/0x90
[    3.922906]  do_syscall_64+0x5e/0x90
[    3.923814]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[    3.924122] RIP: 0033:0x7e83eab84407
[    3.924331] Code: 48 89 fa 4c 89 df e8 38 aa 00 00 8b 93 08 03 00 00 59 5e 48 83 f8 fc 74 1a 5b c3 0f 1f 84 00 00 00 00 00 48 8b 44 24 10 0f 05 <5b> c3 0f 1f 80 00 00 00 00 83 e2 39 83 faf
[    3.925330] RSP: 002b:00007ffff505e370 EFLAGS: 00000202 ORIG_RAX: 000000000000002c
[    3.925752] RAX: ffffffffffffffda RBX: 00007e83eaafa740 RCX: 00007e83eab84407
[    3.926173] RDX: 00000000000001a8 RSI: 00007ffff505e3c0 RDI: 0000000000000003
[    3.926587] RBP: 00007ffff505f460 R08: 00007e83eace1000 R09: 000000000000000c
[    3.926977] R10: 0000000000000000 R11: 0000000000000202 R12: 00007ffff505f3c0
[    3.927367] R13: 00007ffff505f5c8 R14: 00007e83ead1b000 R15: 00005d4fbbe6dcb8

Fix these issues by enforing correct length condition in related
policies.

Fixes: 925d844696d9 ("netfilter: nft_tunnel: add support for geneve opts")
Fixes: 4ece47787077 ("lwtunnel: add options setting and dumping for geneve")
Fixes: 0ed5269f9e41 ("net/sched: add tunnel option support to act_tunnel_key")
Fixes: 0a6e77784f49 ("net/sched: allow flower to match tunnel options")
Signed-off-by: Lin Ma <linma@zju.edu.cn>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Link: https://patch.msgid.link/20250402165632.6958-1-linma@zju.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agofs: actually hold the namespace semaphore
Christian Brauner [Thu, 3 Apr 2025 14:43:50 +0000 (16:43 +0200)]
fs: actually hold the namespace semaphore

Don't use a scoped guard that only protects the next statement.

Use a regular guard to make sure that the namespace semaphore is held
across the whole function.

Signed-off-by: Christian Brauner <brauner@kernel.org>
Reported-by: Leon Romanovsky <leon@kernel.org>
Link: https://lore.kernel.org/all/20250401170715.GA112019@unreal/
Fixes: db04662e2f4f ("fs: allow detached mounts in clone_private_mount()")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 weeks agoio_uring/zcrx: fix selftests w/ updated netdev Python helpers
David Wei [Wed, 2 Apr 2025 17:24:14 +0000 (10:24 -0700)]
io_uring/zcrx: fix selftests w/ updated netdev Python helpers

Fix io_uring zero copy rx selftest with updated netdev Python helpers.

Signed-off-by: David Wei <dw@davidwei.uk>
Link: https://patch.msgid.link/20250402172414.895276-1-dw@davidwei.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoMerge tag 'bcachefs-2025-04-03' of git://evilpiepirate.org/bcachefs
Linus Torvalds [Thu, 3 Apr 2025 22:39:47 +0000 (15:39 -0700)]
Merge tag 'bcachefs-2025-04-03' of git://evilpiepirate.org/bcachefs

Pull more bcachefs updates from Kent Overstreet:
 "More notable fixes:

   - Fix for striping behaviour on tiering filesystems where replicas
     exceeds durability on destination target

   - Fix a race in device removal where deleting alloc info races with
     the discard worker

   - Some small stack usage improvements: this is just enough for KMSAN
     builds to not blow the stack, more is queued up for 6.16"

* tag 'bcachefs-2025-04-03' of git://evilpiepirate.org/bcachefs:
  bcachefs: Fix "journal stuck" during recovery
  bcachefs: backpointer_get_key: check for null from peek_slot()
  bcachefs: Fix null ptr deref in invalidate_one_bucket()
  bcachefs: Fix check_snapshot_exists() restart handling
  bcachefs: use nonblocking variant of print_string_as_lines in error path
  bcachefs: Fix scheduling while atomic from logging changes
  bcachefs: Add error handling for zlib_deflateInit2()
  bcachefs: add missing selection of XARRAY_MULTI
  bcachefs: bch_dev_usage_full
  bcachefs: Kill btree_iter.trans
  bcachefs: do_trace_key_cache_fill()
  bcachefs: Split up bch_dev.io_ref
  bcachefs: fix ref leak in btree_node_read_all_replicas
  bcachefs: Fix null ptr deref in bch2_write_endio()
  bcachefs: Fix field spanning write warning
  bcachefs: Fix striping behaviour

3 weeks agoMerge tag '9p-for-6.15-rc1' of https://github.com/martinetd/linux
Linus Torvalds [Thu, 3 Apr 2025 22:35:46 +0000 (15:35 -0700)]
Merge tag '9p-for-6.15-rc1' of https://github.com/martinetd/linux

Pull 9p updates from Dominique Martinet:

 - fix handling of bogus (negative/too long) replies

 - fix crash on mkdir with ACLs (... looks like nobody is using ACLs
   with semi-recent kernels...)

 - ipv6 support for trans=tcp

 - minor concurrency fix to make syzbot happy

 - minor cleanup

* tag '9p-for-6.15-rc1' of https://github.com/martinetd/linux:
  docs: fs/9p: Add missing "not" in cache documentation
  9p: Use hashtable.h for hash_errmap
  Documentation/fs/9p: fix broken link
  9p/trans_fd: mark concurrent read and writes to p9_conn->err
  9p/net: return error on bogus (longer than requested) replies
  9p/net: fix improper handling of bogus negative read/write replies
  fs/9p: fix NULL pointer dereference on mkdir
  net/9p/fd: support ipv6 for trans=tcp

3 weeks agoMerge branch 'net-hold-instance-lock-during-netdev_up-register'
Jakub Kicinski [Thu, 3 Apr 2025 22:32:20 +0000 (15:32 -0700)]
Merge branch 'net-hold-instance-lock-during-netdev_up-register'

Stanislav Fomichev says:

====================
net: hold instance lock during NETDEV_UP/REGISTER

Solving the issue reported by Cosmin in [0] requires consistent
lock during NETDEV_UP/REGISTER notifiers. This series
addresses that (along with some other fixes in net/ipv4/devinet.c
and net/ipv6/addrconf.c) and appends the patches from Jakub
that were conditional on consistent locking in NETDEV_UNREGISTER.

0: https://lore.kernel.org/700fa36b94cbd57cfea2622029b087643c80cbc9.camel@nvidia.com
====================

Link: https://patch.msgid.link/20250401163452.622454-1-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoselftests: net: use netdevsim in netns test
Stanislav Fomichev [Tue, 1 Apr 2025 16:34:49 +0000 (09:34 -0700)]
selftests: net: use netdevsim in netns test

Netdevsim has extra register_netdevice_notifier_dev_net notifiers,
use netdevim instead of dummy device to test them out.

Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250401163452.622454-9-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agodocs: net: document netdev notifier expectations
Stanislav Fomichev [Tue, 1 Apr 2025 16:34:48 +0000 (09:34 -0700)]
docs: net: document netdev notifier expectations

We don't have a consistent state yet, but document where we think
we are and where we wanna be.

Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250401163452.622454-8-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agonet: dummy: request ops lock
Stanislav Fomichev [Tue, 1 Apr 2025 16:34:47 +0000 (09:34 -0700)]
net: dummy: request ops lock

Even though dummy device doesn't really need an instance lock,
a lot of selftests use dummy so it's useful to have extra
expose to the instance lock on NIPA. Request the instance/ops
locking.

Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250401163452.622454-7-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agonetdevsim: add dummy device notifiers
Stanislav Fomichev [Tue, 1 Apr 2025 16:34:46 +0000 (09:34 -0700)]
netdevsim: add dummy device notifiers

In order to exercise and verify notifiers' locking assumptions,
register dummy notifiers (via register_netdevice_notifier_dev_net).
Share notifier event handler that enforces the assumptions with
lock_debug.c (rename and export rtnl_net_debug_event as
netdev_debug_event). Add ops lock asserts to netdev_debug_event.

Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250401163452.622454-6-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agonet: rename rtnl_net_debug to lock_debug
Stanislav Fomichev [Tue, 1 Apr 2025 16:34:45 +0000 (09:34 -0700)]
net: rename rtnl_net_debug to lock_debug

And make it selected by CONFIG_DEBUG_NET. Don't rename any of
the structs/functions. Next patch will use rtnl_net_debug_event in
netdevsim.

Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250401163452.622454-5-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agonet: use netif_disable_lro in ipv6_add_dev
Stanislav Fomichev [Tue, 1 Apr 2025 16:34:44 +0000 (09:34 -0700)]
net: use netif_disable_lro in ipv6_add_dev

ipv6_add_dev might call dev_disable_lro which unconditionally grabs
instance lock, so it will deadlock during NETDEV_REGISTER. Switch
to netif_disable_lro.

Make sure all callers hold the instance lock as well.

Cc: Cosmin Ratiu <cratiu@nvidia.com>
Fixes: ad7c7b2172c3 ("net: hold netdev instance lock during sysfs operations")
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250401163452.622454-4-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agonet: hold instance lock during NETDEV_REGISTER/UP
Stanislav Fomichev [Tue, 1 Apr 2025 16:34:43 +0000 (09:34 -0700)]
net: hold instance lock during NETDEV_REGISTER/UP

Callers of inetdev_init can come from several places with inconsistent
expectation about netdev instance lock. Grab instance lock during
REGISTER (plus UP). Also solve the inconsistency with UNREGISTER
where it was locked only during move netns path.

WARNING: CPU: 10 PID: 1479 at ./include/net/netdev_lock.h:54
__netdev_update_features+0x65f/0xca0
__warn+0x81/0x180
__netdev_update_features+0x65f/0xca0
report_bug+0x156/0x180
handle_bug+0x4f/0x90
exc_invalid_op+0x13/0x60
asm_exc_invalid_op+0x16/0x20
__netdev_update_features+0x65f/0xca0
netif_disable_lro+0x30/0x1d0
inetdev_init+0x12f/0x1f0
inetdev_event+0x48b/0x870
notifier_call_chain+0x38/0xf0
register_netdevice+0x741/0x8b0
register_netdev+0x1f/0x40
mlx5e_probe+0x4e3/0x8e0 [mlx5_core]
auxiliary_bus_probe+0x3f/0x90
really_probe+0xc3/0x3a0
__driver_probe_device+0x80/0x150
driver_probe_device+0x1f/0x90
__device_attach_driver+0x7d/0x100
bus_for_each_drv+0x80/0xd0
__device_attach+0xb4/0x1c0
bus_probe_device+0x91/0xa0
device_add+0x657/0x870

Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Reported-by: Cosmin Ratiu <cratiu@nvidia.com>
Fixes: ad7c7b2172c3 ("net: hold netdev instance lock during sysfs operations")
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250401163452.622454-3-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agonet: switch to netif_disable_lro in inetdev_init
Stanislav Fomichev [Tue, 1 Apr 2025 16:34:42 +0000 (09:34 -0700)]
net: switch to netif_disable_lro in inetdev_init

Cosmin reports the following deadlock:
dump_stack_lvl+0x62/0x90
print_deadlock_bug+0x274/0x3b0
__lock_acquire+0x1229/0x2470
lock_acquire+0xb7/0x2b0
__mutex_lock+0xa6/0xd20
dev_disable_lro+0x20/0x80
inetdev_init+0x12f/0x1f0
inetdev_event+0x48b/0x870
notifier_call_chain+0x38/0xf0
netif_change_net_namespace+0x72e/0x9f0
do_setlink.isra.0+0xd5/0x1220
rtnl_newlink+0x7ea/0xb50
rtnetlink_rcv_msg+0x459/0x5e0
netlink_rcv_skb+0x54/0x100
netlink_unicast+0x193/0x270
netlink_sendmsg+0x204/0x450

Switch to netif_disable_lro which assumes the caller holds the instance
lock. inetdev_init is called for blackhole device (which sw device and
doesn't grab instance lock) and from REGISTER/UNREGISTER notifiers.
We already hold the instance lock for REGISTER notifier during
netns change and we'll soon hold the lock during other paths.

Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Reported-by: Cosmin Ratiu <cratiu@nvidia.com>
Fixes: ad7c7b2172c3 ("net: hold netdev instance lock during sysfs operations")
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250401163452.622454-2-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoMerge tag 'rtc-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux
Linus Torvalds [Thu, 3 Apr 2025 22:31:14 +0000 (15:31 -0700)]
Merge tag 'rtc-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux

Pull RTC updates from Alexandre Belloni:
 "We see a net reduction of the number of lines of code thanks to the
  removal of a now unused driver and a testing tool that is not used
  anymore. Apart from this, the max31335 driver gets support for a new
  part number and pm8xxx gets UEFI support.

  Core:

   - setdate is removed as it has better replacements

   - skip alarms with a second resolution when we know the RTC doesn't
     support those.

  Subsystem:

   - remove unnecessary private struct members

   - use devm_pm_set_wake_irq were relevant

  Drivers:

   - ds1307: stop disabling alarms on probe for DS1337, DS1339, DS1341
     and DS3231

   - max31335: add max31331 support

   - pcf50633 is removed as support for the related SoC has been removed

   - pcf85063: properly handle POR failures"

* tag 'rtc-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux: (50 commits)
  rtc: remove 'setdate' test program
  selftest: rtc: skip some tests if the alarm only supports minutes
  rtc: mt6397: drop unused defines
  rtc: pcf85063: replace dev_err+return with return dev_err_probe
  rtc: pcf85063: do a SW reset if POR failed
  rtc: max31335: Add driver support for max31331
  dt-bindings: rtc: max31335: Add max31331 support
  rtc: cros-ec: Avoid a couple of -Wflex-array-member-not-at-end warnings
  dt-bindings: rtc: pcf2127: Reference spi-peripheral-props.yaml
  rtc: rzn1: implement one-second accuracy for alarms
  rtc: pcf50633: Remove
  rtc: pm8xxx: implement qcom,no-alarm flag for non-HLOS owned alarm
  rtc: pm8xxx: mitigate flash wear
  rtc: pm8xxx: add support for uefi offset
  dt-bindings: rtc: qcom-pm8xxx: document qcom,no-alarm flag
  rtc: rv3032: drop WADA
  rtc: rv3032: fix EERD location
  rtc: pm8xxx: switch to devm_device_init_wakeup
  rtc: pm8xxx: fix possible race condition
  rtc: mpfs: switch to devm_device_init_wakeup
  ...

3 weeks agonet: airoha: Validate egress gdm port in airoha_ppe_foe_entry_prepare()
Lorenzo Bianconi [Tue, 1 Apr 2025 09:42:30 +0000 (11:42 +0200)]
net: airoha: Validate egress gdm port in airoha_ppe_foe_entry_prepare()

Dev pointer in airoha_ppe_foe_entry_prepare routine is not strictly
a device allocated by airoha_eth driver since it is an egress device
and the flowtable can contain even wlan, pppoe or vlan devices. E.g:

flowtable ft {
        hook ingress priority filter
        devices = { eth1, lan1, lan2, lan3, lan4, wlan0 }
        flags offload                               ^
                                                    |
                     "not allocated by airoha_eth" --
}

In this case airoha_get_dsa_port() will just return the original device
pointer and we can't assume netdev priv pointer points to an
airoha_gdm_port struct.
Fix the issue validating egress gdm port in airoha_ppe_foe_entry_prepare
routine before accessing net_device priv pointer.

Fixes: 00a7678310fe ("net: airoha: Introduce flowtable offload support")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250401-airoha-validate-egress-gdm-port-v4-1-c7315d33ce10@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agonet: dsa: mv88e6xxx: propperly shutdown PPU re-enable timer on destroy
David Oberhollenzer [Tue, 1 Apr 2025 13:56:37 +0000 (15:56 +0200)]
net: dsa: mv88e6xxx: propperly shutdown PPU re-enable timer on destroy

The mv88e6xxx has an internal PPU that polls PHY state. If we want to
access the internal PHYs, we need to disable the PPU first. Because
that is a slow operation, a 10ms timer is used to re-enable it,
canceled with every access, so bulk operations effectively only
disable it once and re-enable it some 10ms after the last access.

If a PHY is accessed and then the mv88e6xxx module is removed before
the 10ms are up, the PPU re-enable ends up accessing a dangling pointer.

This especially affects probing during bootup. The MDIO bus and PHY
registration may succeed, but registration with the DSA framework
may fail later on (e.g. because the CPU port depends on another,
very slow device that isn't done probing yet, returning -EPROBE_DEFER).
In this case, probe() fails, but the MDIO subsystem may already have
accessed the MIDO bus or PHYs, arming the timer.

This is fixed as follows:
 - If probe fails after mv88e6xxx_phy_init(), make sure we also call
   mv88e6xxx_phy_destroy() before returning
 - In mv88e6xxx_remove(), make sure we do the teardown in the correct
   order, calling mv88e6xxx_phy_destroy() after unregistering the
   switch device.
 - In mv88e6xxx_phy_destroy(), destroy both the timer and the work item
   that the timer might schedule, synchronously waiting in case one of
   the callbacks already fired and destroying the timer first, before
   waiting for the work item.
 - Access to the PPU is guarded by a mutex, the worker acquires it
   with a mutex_trylock(), not proceeding with the expensive shutdown
   if that fails. We grab the mutex in mv88e6xxx_phy_destroy() to make
   sure the slow PPU shutdown is already done or won't even enter, when
   we wait for the work item.

Fixes: 2e5f032095ff ("dsa: add support for the Marvell 88E6131 switch chip")
Signed-off-by: David Oberhollenzer <david.oberhollenzer@sigma-star.at>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Link: https://patch.msgid.link/20250401135705.92760-1-david.oberhollenzer@sigma-star.at
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoMAINTAINERS: Update Loic Poulain's email address
Loic Poulain [Tue, 1 Apr 2025 14:53:44 +0000 (16:53 +0200)]
MAINTAINERS: Update Loic Poulain's email address

Update Loic Poulain's email address to @oss.qualcomm.com.

Signed-off-by: Loic Poulain <loic.poulain@oss.qualcomm.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250401145344.10669-1-loic.poulain@oss.qualcomm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoipv6: fix omitted netlink attributes when using RTEXT_FILTER_SKIP_STATS
Fernando Fernandez Mancera [Wed, 2 Apr 2025 12:17:51 +0000 (14:17 +0200)]
ipv6: fix omitted netlink attributes when using RTEXT_FILTER_SKIP_STATS

Using RTEXT_FILTER_SKIP_STATS is incorrectly skipping non-stats IPv6
netlink attributes on link dump. This causes issues on userspace tools,
e.g iproute2 is not rendering address generation mode as it should due
to missing netlink attribute.

Move the filling of IFLA_INET6_STATS and IFLA_INET6_ICMP6STATS to a
helper function guarded by a flag check to avoid hitting the same
situation in the future.

Fixes: d5566fd72ec1 ("rtnetlink: RTEXT_FILTER_SKIP_STATS support to avoid dumping inet/inet6 stats")
Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250402121751.3108-1-ffmancera@riseup.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoeth: bnxt: fix deadlock in the mgmt_ops
Taehee Yoo [Wed, 2 Apr 2025 13:31:23 +0000 (13:31 +0000)]
eth: bnxt: fix deadlock in the mgmt_ops

When queue is being reset, callbacks of mgmt_ops are called by
netdev_nl_bind_rx_doit().
The netdev_nl_bind_rx_doit() first acquires netdev_lock() and then calls
callbacks.
So, mgmt_ops callbacks should not acquire netdev_lock() internaly.

The bnxt_queue_{start | stop}() calls napi_{enable | disable}() but they
internally acquire netdev_lock().
So, deadlock occurs.

To avoid deadlock, napi_{enable | disable}_locked() should be used
instead.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Fixes: cae03e5bdd9e ("net: hold netdev instance lock during queue operations")
Link: https://patch.msgid.link/20250402133123.840173-1-ap420073@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agonet/selftests: Add loopback link local route for self-connect
Dmitry Safonov [Wed, 2 Apr 2025 00:59:31 +0000 (01:59 +0100)]
net/selftests: Add loopback link local route for self-connect

self-connect-ipv6 got slightly flaky on netdev:
> # timeout set to 120
> # selftests: net/tcp_ao: self-connect_ipv6
> # 1..5
> # # 708[lib/setup.c:250] rand seed 1742872572
> # TAP version 13
> # # 708[lib/proc.c:213]    Snmp6            Ip6OutNoRoutes: 0 => 1
> # not ok 1 # error 708[self-connect.c:70] failed to connect()
> # ok 2 No unexpected trace events during the test run
> # # Planned tests != run tests (5 != 2)
> # # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:1
> ok 1 selftests: net/tcp_ao: self-connect_ipv6

I can not reproduce it on my machines, but judging by "Ip6OutNoRoutes"
there is no route to the local_addr (::1).

Looking at the kernel code, I see that kernel does add link-local
address automatically in init_loopback(), but that is called from
ipv6 notifier block. So, in turn the userspace that brought up
the loopback interface may see rtnetlink ACK earlier than
addrconf_notify() does it's job (at least, on a slow VM such as netdev).
Probably, for ipv4 it's the same, judging by inetdev_event().

The fix is quite simple: set the link-local route straight after
bringing the loopback interface. That will make it synchronous.

Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com>
Link: https://patch.msgid.link/20250402-tcp-ao-selfconnect-flake-v1-1-8388d629ef3d@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agosfc: fix NULL dereferences in ef100_process_design_param()
Edward Cree [Tue, 1 Apr 2025 22:54:39 +0000 (23:54 +0100)]
sfc: fix NULL dereferences in ef100_process_design_param()

Since cited commit, ef100_probe_main() and hence also
 ef100_check_design_params() run before efx->net_dev is created;
 consequently, we cannot netif_set_tso_max_size() or _segs() at this
 point.
Move those netif calls to ef100_probe_netdev(), and also replace
 netif_err within the design params code with pci_err.

Reported-by: Kyungwook Boo <bookyungwook@gmail.com>
Fixes: 98ff4c7c8ac7 ("sfc: Separate netdev probe/remove from PCI probe/remove")
Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Link: https://patch.msgid.link/20250401225439.2401047-1-edward.cree@amd.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agogve: handle overflow when reporting TX consumed descriptors
Joshua Washington [Wed, 2 Apr 2025 00:10:37 +0000 (00:10 +0000)]
gve: handle overflow when reporting TX consumed descriptors

When the tx tail is less than the head (in cases of wraparound), the TX
consumed descriptor statistic in DQ will be reported as
UINT32_MAX - head + tail, which is incorrect. Mask the difference of
head and tail according to the ring size when reporting the statistic.

Cc: stable@vger.kernel.org
Fixes: 2c9198356d56 ("gve: Add consumed counts to ethtool stats")
Signed-off-by: Joshua Washington <joshwash@google.com>
Signed-off-by: Harshitha Ramamurthy <hramamurthy@google.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250402001037.2717315-1-hramamurthy@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoMerge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rmk/linux
Linus Torvalds [Thu, 3 Apr 2025 19:21:44 +0000 (12:21 -0700)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rmk/linux

Pull ARM and clkdev updates from Russell King:

 - Simplify ARM_MMU_KEEP usage

 - Add Rust support for ARM architecture version 7

 - Align IPIs reported in /proc/interrupts

 - require linker to support KEEP within OVERLAY

 - add KEEP() for ARM vectors

 - add __printf() attribute for clkdev functions

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rmk/linux:
  ARM: 9445/1: clkdev: Mark some functions with __printf() attribute
  ARM: 9444/1: add KEEP() keyword to ARM_VECTORS
  ARM: 9443/1: Require linker to support KEEP within OVERLAY for DCE
  ARM: 9442/1: smp: Fix IPI alignment in /proc/interrupts
  ARM: 9441/1: rust: Enable Rust support for ARMv7
  ARM: 9439/1: arm32: simplify ARM_MMU_KEEP usage

3 weeks agoMerge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Linus Torvalds [Thu, 3 Apr 2025 19:07:01 +0000 (12:07 -0700)]
Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 fixes from Catalin Marinas:

 - Fix max_pfn calculation when hotplugging memory so that it never
   decreases

 - Fix dereference of unused source register in the MOPS SET operation
   fault handling

 - Fix NULL calling in do_compat_alignment_fixup() when the 32-bit user
   space does an unaligned LDREX/STREX

 - Add the HiSilicon HIP09 processor to the Spectre-BHB affected CPUs

 - Drop unused code pud accessors (special/mkspecial)

* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  arm64: Don't call NULL in do_compat_alignment_fixup()
  arm64: Add support for HIP09 Spectre-BHB mitigation
  arm64: mm: Drop dead code for pud special bit handling
  arm64: mops: Do not dereference src reg for a set operation
  arm64: mm: Correct the update of max_pfn

3 weeks agoMerge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Linus Torvalds [Thu, 3 Apr 2025 18:55:41 +0000 (11:55 -0700)]
Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Pull bpf fixes from Alexei Starovoitov:

 - Fix BPF selftests expectations of assembler output and struct layout
   (Song Liu and Yonghong Song)

 - Fix XSK error code when queue is full (Wang Liang)

* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  selftests/bpf: Fix verifier_private_stack test failure
  selftests/bpf: Fix verifier_bpf_fastcall test
  selftests/bpf: Fix tests after fields reorder in struct file
  xsk: Fix __xsk_generic_xmit() error code when cq is full

3 weeks agoMerge tag 'mm-nonmm-stable-2025-04-02-22-12' of git://git.kernel.org/pub/scm/linux...
Linus Torvalds [Thu, 3 Apr 2025 18:16:57 +0000 (11:16 -0700)]
Merge tag 'mm-nonmm-stable-2025-04-02-22-12' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull more non-MM updates from Andrew Morton:
 "One bugfix and a couple of small late-arriving updates"

* tag 'mm-nonmm-stable-2025-04-02-22-12' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  lib: scatterlist: fix sg_split_phys to preserve original scatterlist offsets
  lib/sort.c: add _nonatomic() variants with cond_resched()
  mailmap: add an entry for Nicolas Schier

3 weeks agoMerge tag 'mm-stable-2025-04-02-22-07' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Thu, 3 Apr 2025 18:10:00 +0000 (11:10 -0700)]
Merge tag 'mm-stable-2025-04-02-22-07' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull more MM updates from Andrew Morton:

 - The series "mm: fixes for fallouts from mem_init() cleanup" from Mike
   Rapoport fixes a couple of issues with the just-merged "arch, mm:
   reduce code duplication in mem_init()" series

 - The series "MAINTAINERS: add my isub-entries to MM part." from Mike
   Rapoport does some maintenance on MAINTAINERS

 - The series "remove tlb_remove_page_ptdesc()" from Qi Zheng does some
   cleanup work to the page mapping code

 - The series "mseal system mappings" from Jeff Xu permits sealing of
   "system mappings", such as vdso, vvar, vvar_vclock, vectors (arm
   compat-mode), sigpage (arm compat-mode)

 - Plus the usual shower of singleton patches

* tag 'mm-stable-2025-04-02-22-07' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (31 commits)
  mseal sysmap: add arch-support txt
  mseal sysmap: enable s390
  selftest: test system mappings are sealed
  mseal sysmap: update mseal.rst
  mseal sysmap: uprobe mapping
  mseal sysmap: enable arm64
  mseal sysmap: enable x86-64
  mseal sysmap: generic vdso vvar mapping
  selftests: x86: test_mremap_vdso: skip if vdso is msealed
  mseal sysmap: kernel config and header change
  mm: pgtable: remove tlb_remove_page_ptdesc()
  x86: pgtable: convert to use tlb_remove_ptdesc()
  riscv: pgtable: unconditionally use tlb_remove_ptdesc()
  mm: pgtable: convert some architectures to use tlb_remove_ptdesc()
  mm: pgtable: change pt parameter of tlb_remove_ptdesc() to struct ptdesc*
  mm: pgtable: make generic tlb_remove_table() use struct ptdesc
  microblaze/mm: put mm_cmdline_setup() in .init.text section
  mm/memory_hotplug: fix call folio_test_large with tail page in do_migrate_range
  MAINTAINERS: mm: add entry for secretmem
  MAINTAINERS: mm: add entry for numa memblocks and numa emulation
  ...

3 weeks agoMerge tag 'mm-hotfixes-stable-2025-04-02-21-57' of git://git.kernel.org/pub/scm/linux...
Linus Torvalds [Thu, 3 Apr 2025 17:47:47 +0000 (10:47 -0700)]
Merge tag 'mm-hotfixes-stable-2025-04-02-21-57' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM hotfixes from Andrew Morton:
 "Five hotfixes. Three are cc:stable and the remainder address post-6.14
  issues or aren't considered necessary for -stable kernels.

  All patches are for MM"

* tag 'mm-hotfixes-stable-2025-04-02-21-57' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  mm: zswap: fix crypto_free_acomp() deadlock in zswap_cpu_comp_dead()
  mm/hugetlb: move hugetlb_sysctl_init() to the __init section
  mm: page_isolation: avoid calling folio_hstate() without hugetlb_lock
  mm/hugetlb_vmemmap: fix memory loads ordering
  mm/userfaultfd: fix release hang over concurrent GUP

3 weeks agoMerge tag 'sched_ext-for-6.15-rc0-fixes' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Thu, 3 Apr 2025 17:03:38 +0000 (10:03 -0700)]
Merge tag 'sched_ext-for-6.15-rc0-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext fixes from Tejun Heo:

 - Calling scx_bpf_create_dsq() with the same ID would succeed creating
   duplicate DSQs. Fix it to return -EEXIST.

 - scx_select_cpu_dfl() fixes and cleanups.

 - Synchronize tool/sched_ext with external scheduler repo. While this
   isn't a fix. There's no risk to the kernel and it's better if they
   stay synced closer.

* tag 'sched_ext-for-6.15-rc0-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  tools/sched_ext: Sync with scx repo
  sched_ext: initialize built-in idle state before ops.init()
  sched_ext: create_dsq: Return -EEXIST on duplicate request
  sched_ext: Remove a meaningless conditional goto in scx_select_cpu_dfl()
  sched_ext: idle: Fix return code of scx_select_cpu_dfl()

3 weeks agoMerge tag 'trace-v6.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace...
Linus Torvalds [Thu, 3 Apr 2025 16:52:44 +0000 (09:52 -0700)]
Merge tag 'trace-v6.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Fix build error when CONFIG_PROBE_EVENTS_BTF_ARGS is not enabled

   The tracing of arguments in the function tracer depends on some
   functions that are only defined when PROBE_EVENTS_BTF_ARGS is
   enabled. In fact, PROBE_EVENTS_BTF_ARGS also depends on all the same
   configs as the function argument tracing requires. Just have the
   function argument tracing depend on PROBE_EVENTS_BTF_ARGS.

 - Free module_delta for persistent ring buffer instance

   When an instance holds the persistent ring buffer, it allocates a
   helper array to hold the deltas between where modules are loaded on
   the last boot and the current boot. This array needs to be freed when
   the instance is freed.

 - Add cond_resched() to loop in ftrace_graph_set_hash()

   The hash functions in ftrace loop over every function that can be
   enabled by ftrace. This can be 50,000 functions or more. This loop is
   known to trigger soft lockup warnings and requires a cond_resched().
   The loop in ftrace_graph_set_hash() was missing it.

 - Fix the event format verifier to include "%*p.." arguments

   To prevent events from dereferencing stale pointers that can happen
   if a trace event uses a dereferece pointer to something that was not
   copied into the ring buffer and can be freed by the time the trace is
   read, a verifier is called. At boot or module load, the verifier
   scans the print format string for pointers that can be dereferenced
   and it checks the arguments to make sure they do not contain
   something that can be freed. The "%*p" was not handled, which would
   add another argument and cause the verifier to not only not verify
   this pointer, but it will look at the wrong argument for every
   pointer after that.

 - Fix mcount sorttable building for different endian type target

   When modifying the ELF file to sort the mcount_loc table in the
   sorttable.c code, the endianess of the file and the host is used to
   determine if the bytes need to be swapped when calculations are done.
   A change was made to the sorting of the mcount_loc that read the
   values from the ELF file into an array and the swap happened on the
   filling of the array. But one of the calculations of the array still
   did the swap when it did not need to. This caused building on a
   little endian machine for a big endian target to not find the mcount
   function in the 'nm' table and it zeroed it out, causing there to be
   no functions available to trace.

 - Add goto out_unlock jump to rv_register_monitor() on error path

   One of the error paths in rv_register_monitor() just returned the
   error when it should have jumped to the out_unlock label to release
   the mutex.

* tag 'trace-v6.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  rv: Fix missing unlock on double nested monitors return path
  scripts/sorttable: Fix endianness handling in build-time mcount sort
  tracing: Verify event formats that have "%*p.."
  ftrace: Add cond_resched() to ftrace_graph_set_hash()
  tracing: Free module_delta on freeing of persistent ring buffer
  ftrace: Have tracing function args depend on PROBE_EVENTS_BTF_ARGS

3 weeks agobcachefs: Fix "journal stuck" during recovery
Kent Overstreet [Wed, 2 Apr 2025 16:23:25 +0000 (12:23 -0400)]
bcachefs: Fix "journal stuck" during recovery

If we crash when the journal pin fifo is completely full - i.e. we're
at the maximum number of dirty journal entries - that may put us in a
sticky situation in recovery, as journal replay will need to be able to
open new journal entries in order to get going.

bch2_fs_journal_start() already had provisions for resizing the journal
pin fifo if needed, but it needs a fudge factor to ensure there's room
for journal replay.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 weeks agobcachefs: backpointer_get_key: check for null from peek_slot()
Kent Overstreet [Wed, 2 Apr 2025 16:54:25 +0000 (12:54 -0400)]
bcachefs: backpointer_get_key: check for null from peek_slot()

peek_slot() doesn't normally return bkey_s_c_null - except when we ask
for a key at a btree level that doesn't exist, which can happen here.

We might want to revisit this, but we'll have to look over all the
places where we use peek_slot() on interior nodes.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 weeks agobcachefs: Fix null ptr deref in invalidate_one_bucket()
Kent Overstreet [Wed, 2 Apr 2025 16:48:23 +0000 (12:48 -0400)]
bcachefs: Fix null ptr deref in invalidate_one_bucket()

bch2_backpointer_get_key() returns bkey_s_c_null when the target isn't
found.

backpointer_get_key() flags the error, so there's nothing else to do
here - just skip it and move on.

Link: https://github.com/koverstreet/bcachefs/issues/847
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 weeks agobcachefs: Fix check_snapshot_exists() restart handling
Kent Overstreet [Wed, 2 Apr 2025 18:31:12 +0000 (14:31 -0400)]
bcachefs: Fix check_snapshot_exists() restart handling

Codepaths that create entries in the snapshots btree currently call
bch2_mark_snapshot(), which updates the in-memory snapshot table, before
transaction commit.

This is because bch2_mark_snapshot() is an atomic trigger, run with
btree write locks held, and isn't allowed to fail - but it might need to
reallocate the table, hence we call it early when we're still allowed to
fail.

This is generally harmless - if we fail, we'll have left an entry in the
snapshots table around, but nothing will reference it and it'll get
overwritten if reused by another transaction.

But check_snapshot_exists(), which reconstructs snapshots when the
snapshots btree has been corrupted or lost, was erronously rechecking if
the snapshot exists inside the transaction commit loop - so on
transaction restart (in this case mem_realloced), the second iteration
would return without repairing.

This code needs some cleanup: splitting out a "maybe realloc snapshots
table" helper would have avoided this, that will be in the next patch.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 weeks agobcachefs: use nonblocking variant of print_string_as_lines in error path
Bharadwaj Raju [Wed, 2 Apr 2025 18:15:53 +0000 (23:45 +0530)]
bcachefs: use nonblocking variant of print_string_as_lines in error path

The inconsistency error path calls print_string_as_lines, which calls
console_lock, which is a potentially-sleeping function and so can't be
called in an atomic context.

Replace calls to it with the nonblocking variant which is safe to call.

Signed-off-by: Bharadwaj Raju <bharadwaj.raju777@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 weeks agobcachefs: Fix scheduling while atomic from logging changes
Kent Overstreet [Tue, 1 Apr 2025 17:01:29 +0000 (13:01 -0400)]
bcachefs: Fix scheduling while atomic from logging changes

Two fixes from the recent logging changes:

bch2_inconsistent(), bch2_fs_inconsistent() be called from interrupt
context, or with rcu_read_lock() held.

The one syzbot found is in
  bch2_bkey_pick_read_device
  bch2_dev_rcu
  bch2_fs_inconsistent

We're starting to switch to lift the printbufs up to higher levels so we
can emit better log messages and print them all in one go (avoid
garbling), so that conversion will help with spotting these in the
future; when we declare a printbuf it must be flagged if we're in an
atomic context.

Secondly, in btree_node_write_endio:

00085 BUG: sleeping function called from invalid context at include/linux/sched/mm.h:321
00085 in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 618, name: bch-reclaim/fa6
00085 preempt_count: 10001, expected: 0
00085 RCU nest depth: 0, expected: 0
00085 4 locks held by bch-reclaim/fa6/618:
00085  #0: ffffff80d7ccad68 (&j->reclaim_lock){+.+.}-{4:4}, at: bch2_journal_reclaim_thread+0x84/0x198
00085  #1: ffffff80d7c84218 (&c->btree_trans_barrier){.+.+}-{0:0}, at: __bch2_trans_get+0x1c0/0x440
00085  #2: ffffff80cd3f8140 (bcachefs_btree){+.+.}-{0:0}, at: __bch2_trans_get+0x22c/0x440
00085  #3: ffffff80c3823c20 (&vblk->vqs[i].lock){-.-.}-{3:3}, at: virtblk_done+0x58/0x130
00085 irq event stamp: 328
00085 hardirqs last  enabled at (327): [<ffffffc080073a14>] finish_task_switch.isra.0+0xbc/0x2a0
00085 hardirqs last disabled at (328): [<ffffffc080971a10>] el1_interrupt+0x20/0x60
00085 softirqs last  enabled at (0): [<ffffffc08002f920>] copy_process+0x7c8/0x2118
00085 softirqs last disabled at (0): [<0000000000000000>] 0x0
00085 Preemption disabled at:
00085 [<ffffffc08003ada0>] irq_enter_rcu+0x18/0x90
00085 CPU: 8 UID: 0 PID: 618 Comm: bch-reclaim/fa6 Not tainted 6.14.0-rc6-ktest-g04630bde23e8 #18798
00085 Hardware name: linux,dummy-virt (DT)
00085 Call trace:
00085  show_stack+0x1c/0x30 (C)
00085  dump_stack_lvl+0x84/0xc0
00085  dump_stack+0x14/0x20
00085  __might_resched+0x180/0x288
00085  __might_sleep+0x4c/0x88
00085  __kmalloc_node_track_caller_noprof+0x34c/0x3e0
00085  krealloc_noprof+0x1a0/0x2d8
00085  bch2_printbuf_make_room+0x9c/0x120
00085  bch2_prt_printf+0x60/0x1b8
00085  btree_node_write_endio+0x1b0/0x2d8
00085  bio_endio+0x138/0x1f0
00085  btree_node_write_endio+0xe8/0x2d8
00085  bio_endio+0x138/0x1f0
00085  blk_update_request+0x220/0x4c0
00085  blk_mq_end_request+0x28/0x148
00085  virtblk_request_done+0x64/0xe8
00085  blk_mq_complete_request+0x34/0x40
00085  virtblk_done+0x78/0x130
00085  vring_interrupt+0x6c/0xb0
00085  __handle_irq_event_percpu+0x8c/0x2e0
00085  handle_irq_event+0x50/0xb0
00085  handle_fasteoi_irq+0xc4/0x250
00085  handle_irq_desc+0x44/0x60
00085  generic_handle_domain_irq+0x20/0x30
00085  gic_handle_irq+0x54/0xc8
00085  call_on_irq_stack+0x24/0x40

Reported-by: syzbot+c82cd2906e2f192410bb@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 weeks agobcachefs: Add error handling for zlib_deflateInit2()
Wentao Liang [Wed, 2 Apr 2025 13:45:44 +0000 (21:45 +0800)]
bcachefs: Add error handling for zlib_deflateInit2()

In attempt_compress(), the return value of zlib_deflateInit2() needs to be
checked. A proper implementation can be found in  pstore_compress().

Add an error check and return 0 immediately if the initialzation fails.

Fixes: 986e9842fb68 ("bcachefs: Compression levels")
Signed-off-by: Wentao Liang <vulab@iscas.ac.cn>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 weeks agoblock: don't grab elevator lock during queue initialization
Ming Lei [Thu, 3 Apr 2025 10:54:02 +0000 (18:54 +0800)]
block: don't grab elevator lock during queue initialization

->elevator_lock depends on queue freeze lock, see block/blk-sysfs.c.

queue freeze lock depends on fs_reclaim.

So don't grab elevator lock during queue initialization which needs to
call kmalloc(GFP_KERNEL), and we can cut the dependency between
->elevator_lock and fs_reclaim, then the lockdep warning can be killed.

This way is safe because elevator setting isn't ready to run during
queue initialization.

There isn't such issue in __blk_mq_update_nr_hw_queues() because
memalloc_noio_save() is called before acquiring elevator lock.

Fixes the following lockdep warning:

https://lore.kernel.org/linux-block/67e6b425.050a0220.2f068f.007b.GAE@google.com/

Reported-by: syzbot+4c7e0f9b94ad65811efb@syzkaller.appspotmail.com
Cc: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250403105402.1334206-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 weeks agoio_uring: always do atomic put from iowq
Pavel Begunkov [Thu, 3 Apr 2025 11:29:30 +0000 (12:29 +0100)]
io_uring: always do atomic put from iowq

io_uring always switches requests to atomic refcounting for iowq
execution before there is any parallilism by setting REQ_F_REFCOUNT,
and the flag is not cleared until the request completes. That should be
fine as long as the compiler doesn't make up a non existing value for
the flags, however KCSAN still complains when the request owner changes
oter flag bits:

BUG: KCSAN: data-race in io_req_task_cancel / io_wq_free_work
...
read to 0xffff888117207448 of 8 bytes by task 3871 on cpu 0:
 req_ref_put_and_test io_uring/refs.h:22 [inline]

Skip REQ_F_REFCOUNT checks for iowq, we know it's set.

Reported-by: syzbot+903a2ad71fb3f1e47cf5@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d880bc27fb8c3209b54641be4ff6ac02b0e5789a.1743679736.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 weeks agonetfilter: nft_tunnel: fix geneve_opt type confusion addition
Lin Ma [Wed, 2 Apr 2025 17:00:26 +0000 (01:00 +0800)]
netfilter: nft_tunnel: fix geneve_opt type confusion addition

When handling multiple NFTA_TUNNEL_KEY_OPTS_GENEVE attributes, the
parsing logic should place every geneve_opt structure one by one
compactly. Hence, when deciding the next geneve_opt position, the
pointer addition should be in units of char *.

However, the current implementation erroneously does type conversion
before the addition, which will lead to heap out-of-bounds write.

[    6.989857] ==================================================================
[    6.990293] BUG: KASAN: slab-out-of-bounds in nft_tunnel_obj_init+0x977/0xa70
[    6.990725] Write of size 124 at addr ffff888005f18974 by task poc/178
[    6.991162]
[    6.991259] CPU: 0 PID: 178 Comm: poc-oob-write Not tainted 6.1.132 #1
[    6.991655] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[    6.992281] Call Trace:
[    6.992423]  <TASK>
[    6.992586]  dump_stack_lvl+0x44/0x5c
[    6.992801]  print_report+0x184/0x4be
[    6.993790]  kasan_report+0xc5/0x100
[    6.994252]  kasan_check_range+0xf3/0x1a0
[    6.994486]  memcpy+0x38/0x60
[    6.994692]  nft_tunnel_obj_init+0x977/0xa70
[    6.995677]  nft_obj_init+0x10c/0x1b0
[    6.995891]  nf_tables_newobj+0x585/0x950
[    6.996922]  nfnetlink_rcv_batch+0xdf9/0x1020
[    6.998997]  nfnetlink_rcv+0x1df/0x220
[    6.999537]  netlink_unicast+0x395/0x530
[    7.000771]  netlink_sendmsg+0x3d0/0x6d0
[    7.001462]  __sock_sendmsg+0x99/0xa0
[    7.001707]  ____sys_sendmsg+0x409/0x450
[    7.002391]  ___sys_sendmsg+0xfd/0x170
[    7.003145]  __sys_sendmsg+0xea/0x170
[    7.004359]  do_syscall_64+0x5e/0x90
[    7.005817]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[    7.006127] RIP: 0033:0x7ec756d4e407
[    7.006339] Code: 48 89 fa 4c 89 df e8 38 aa 00 00 8b 93 08 03 00 00 59 5e 48 83 f8 fc 74 1a 5b c3 0f 1f 84 00 00 00 00 00 48 8b 44 24 10 0f 05 <5b> c3 0f 1f 80 00 00 00 00 83 e2 39 83 faf
[    7.007364] RSP: 002b:00007ffed5d46760 EFLAGS: 00000202 ORIG_RAX: 000000000000002e
[    7.007827] RAX: ffffffffffffffda RBX: 00007ec756cc4740 RCX: 00007ec756d4e407
[    7.008223] RDX: 0000000000000000 RSI: 00007ffed5d467f0 RDI: 0000000000000003
[    7.008620] RBP: 00007ffed5d468a0 R08: 0000000000000000 R09: 0000000000000000
[    7.009039] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
[    7.009429] R13: 00007ffed5d478b0 R14: 00007ec756ee5000 R15: 00005cbd4e655cb8

Fix this bug with correct pointer addition and conversion in parse
and dump code.

Fixes: 925d844696d9 ("netfilter: nft_tunnel: add support for geneve opts")
Signed-off-by: Lin Ma <linma@zju.edu.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
3 weeks agonet: decrease cached dst counters in dst_release
Antoine Tenart [Wed, 26 Mar 2025 17:36:32 +0000 (18:36 +0100)]
net: decrease cached dst counters in dst_release

Upstream fix ac888d58869b ("net: do not delay dst_entries_add() in
dst_release()") moved decrementing the dst count from dst_destroy to
dst_release to avoid accessing already freed data in case of netns
dismantle. However in case CONFIG_DST_CACHE is enabled and OvS+tunnels
are used, this fix is incomplete as the same issue will be seen for
cached dsts:

  Unable to handle kernel paging request at virtual address ffff5aabf6b5c000
  Call trace:
   percpu_counter_add_batch+0x3c/0x160 (P)
   dst_release+0xec/0x108
   dst_cache_destroy+0x68/0xd8
   dst_destroy+0x13c/0x168
   dst_destroy_rcu+0x1c/0xb0
   rcu_do_batch+0x18c/0x7d0
   rcu_core+0x174/0x378
   rcu_core_si+0x18/0x30

Fix this by invalidating the cache, and thus decrementing cached dst
counters, in dst_release too.

Fixes: d71785ffc7e7 ("net: add dst_cache to ovs vxlan lwtunnel")
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Link: https://patch.msgid.link/20250326173634.31096-1-atenart@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 weeks agoMerge tag 'firewire-updates-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Thu, 3 Apr 2025 05:41:04 +0000 (22:41 -0700)]
Merge tag 'firewire-updates-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394

Pull firewire update from Takashi Sakamoto:
 "A single commit to use the common helper function for on-stack
  trailing array to enqueue any isochronous packet by the requests
  from userspace applications"

* tag 'firewire-updates-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394:
  firewire: core: avoid -Wflex-array-member-not-at-end warning

3 weeks agoselftests/bpf: Fix verifier_private_stack test failure
Yonghong Song [Mon, 31 Mar 2025 03:38:28 +0000 (20:38 -0700)]
selftests/bpf: Fix verifier_private_stack test failure

Several verifier_private_stack tests failed with latest bpf-next.
For example, for 'Private stack, single prog' subtest, the
jitted code:
  func #0:
  0:      f3 0f 1e fa                             endbr64
  4:      0f 1f 44 00 00                          nopl    (%rax,%rax)
  9:      0f 1f 00                                nopl    (%rax)
  c:      55                                      pushq   %rbp
  d:      48 89 e5                                movq    %rsp, %rbp
  10:     f3 0f 1e fa                             endbr64
  14:     49 b9 58 74 8a 8f 7d 60 00 00           movabsq $0x607d8f8a7458, %r9
  1e:     65 4c 03 0c 25 28 c0 48 87              addq    %gs:-0x78b73fd8, %r9
  27:     bf 2a 00 00 00                          movl    $0x2a, %edi
  2c:     49 89 b9 00 ff ff ff                    movq    %rdi, -0x100(%r9)
  33:     31 c0                                   xorl    %eax, %eax
  35:     c9                                      leave
  36:     e9 20 5d 0f e1                          jmp     0xffffffffe10f5d5b

The insn 'addq %gs:-0x78b73fd8, %r9' does not match the expected
regex 'addq %gs:0x{{.*}}, %r9' and this caused test failure.

Fix it by changing '%gs:0x{{.*}}' to '%gs:{{.*}}' to accommodate the
possible negative offset. A few other subtests are fixed in a similar way.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20250331033828.365077-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
3 weeks agoselftests/bpf: Fix verifier_bpf_fastcall test
Song Liu [Fri, 28 Mar 2025 19:31:24 +0000 (12:31 -0700)]
selftests/bpf: Fix verifier_bpf_fastcall test

Commit [1] moves percpu data on x86 from address 0x000... to address
0xfff...

Before [1]:

159020: 0000000000030700     0 OBJECT  GLOBAL DEFAULT   23 pcpu_hot

After [1]:

152602: ffffffff83a3e034     4 OBJECT  GLOBAL DEFAULT   35 pcpu_hot

As a result, verifier_bpf_fastcall tests should now expect a negative
value for pcpu_hot, IOW, the disassemble should show "r=" instead of
"w=".

Fix this in the test.

Note that, a later change created a new variable "cpu_number" for
bpf_get_smp_processor_id() [2]. The inlining logic is updated properly
as part of this change, so there is no need to fix anything on the
kernel side.

[1] commit 9d7de2aa8b41 ("x86/percpu/64: Use relative percpu offsets")
[2] commit 01c7bc5198e9 ("x86/smp: Move cpu number to percpu hot section")
Reported-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20250328193124.808784-1-song@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>