]> www.infradead.org Git - users/jedix/linux-maple.git/log
users/jedix/linux-maple.git
6 months agonetmem: add a couple of page helper wrappers
Alexander Lobakin [Tue, 3 Dec 2024 17:37:30 +0000 (18:37 +0100)]
netmem: add a couple of page helper wrappers

Add the following netmem counterparts:

* virt_to_netmem() -- simple page_to_netmem(virt_to_page()) wrapper;
* netmem_is_pfmemalloc() -- page_is_pfmemalloc() for page-backed
    netmems, false otherwise;

and the following "unsafe" versions:

* __netmem_to_page()
* __netmem_get_pp()
* __netmem_address()

They do the same as their non-underscored buddies, but assume the netmem
is always page-backed. When working with header &page_pools, you don't
need to check whether netmem belongs to the host memory and you can
never get NULL instead of &page. Checks for the LSB, clearing the LSB,
branches take cycles and increase object code size, sometimes
significantly. When you're sure your PP is always host, you can avoid
this by using the underscored counterparts.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/20241203173733.3181246-8-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agoxdp: register system page pool as an XDP memory model
Toke Høiland-Jørgensen [Tue, 3 Dec 2024 17:37:29 +0000 (18:37 +0100)]
xdp: register system page pool as an XDP memory model

To make the system page pool usable as a source for allocating XDP
frames, we need to register it with xdp_reg_mem_model(), so that page
return works correctly. This is done in preparation for using the system
page_pool to convert XDP_PASS XSk frames to skbs; for the same reason,
make the per-cpu variable non-static so we can access it from other
source files as well (but w/o exporting).

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20241203173733.3181246-7-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agoxsk: allow attaching XSk pool via xdp_rxq_info_reg_mem_model()
Alexander Lobakin [Tue, 3 Dec 2024 17:37:28 +0000 (18:37 +0100)]
xsk: allow attaching XSk pool via xdp_rxq_info_reg_mem_model()

When you register an XSk pool as XDP Rxq info memory model, you then
need to manually attach it after the registration.
Let the user combine both actions into one by just passing a pointer
to the pool directly to xdp_rxq_info_reg_mem_model(), which will take
care of calling xsk_pool_set_rxq_info(). This looks similar to how a
&page_pool gets registered and reduce repeating driver code.

Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20241203173733.3181246-6-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agoxdp: allow attaching already registered memory model to xdp_rxq_info
Alexander Lobakin [Tue, 3 Dec 2024 17:37:27 +0000 (18:37 +0100)]
xdp: allow attaching already registered memory model to xdp_rxq_info

One may need to register memory model separately from xdp_rxq_info. One
simple example may be XDP test run code, but in general, it might be
useful when memory model registering is managed by one layer and then
XDP RxQ info by a different one.
Allow such scenarios by adding a simple helper which "attaches"
already registered memory model to the desired xdp_rxq_info. As this
is mostly needed for Page Pool, add a special function to do that for
a &page_pool pointer.

Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20241203173733.3181246-5-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agoxdp, xsk: constify read-only arguments of some static inline helpers
Alexander Lobakin [Tue, 3 Dec 2024 17:37:26 +0000 (18:37 +0100)]
xdp, xsk: constify read-only arguments of some static inline helpers

Lots of read-only helpers for &xdp_buff and &xdp_frame, such as getting
the frame length, skb_shared_info etc., don't have their arguments
marked with `const` for no reason. Add the missing annotations to leave
less place for mistakes and more for optimization.

Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20241203173733.3181246-4-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agobpf, xdp: constify some bpf_prog * function arguments
Alexander Lobakin [Tue, 3 Dec 2024 17:37:25 +0000 (18:37 +0100)]
bpf, xdp: constify some bpf_prog * function arguments

In lots of places, bpf_prog pointer is used only for tracing or other
stuff that doesn't modify the structure itself. Same for net_device.
Address at least some of them and add `const` attributes there. The
object code didn't change, but that may prevent unwanted data
modifications and also allow more helpers to have const arguments.

Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agoxsk: align &xdp_buff_xsk harder
Alexander Lobakin [Tue, 3 Dec 2024 17:37:24 +0000 (18:37 +0100)]
xsk: align &xdp_buff_xsk harder

After the series "XSk buff on a diet" by Maciej, the greatest pow-2
which &xdp_buff_xsk can be divided got reduced from 16 to 8 on x86_64.
Also, sizeof(xdp_buff_xsk) now is 120 bytes, which, taking the previous
sentence into account, leads to that it leaves 8 bytes at the end of
cacheline, which means an array of buffs will have its elements
messed between the cachelines chaotically.
Use __aligned_largest for this struct. This alignment is usually 16
bytes, which makes it fill two full cachelines and align an array
nicely. ___cacheline_aligned may be excessive here, especially on
arches with 128-256 byte CLs, as well as 32-bit arches (76 -> 96
bytes on MIPS32R2), while not doing better than _largest.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/20241203173733.3181246-2-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agoMerge branch 'net_sched-sch_sfq-reject-limit-of-1'
Jakub Kicinski [Fri, 6 Dec 2024 02:02:15 +0000 (18:02 -0800)]
Merge branch 'net_sched-sch_sfq-reject-limit-of-1'

Octavian Purdila says:

====================
net_sched: sch_sfq: reject limit of 1

The implementation does not properly support limits of 1. Add an
in-kernel check, in addition to existing iproute2 check, since other
tools may be used for configuration.

This patch set also adds a selfcheck to test that a limit of 1 is
rejected.

An alternative (or in addition) we could fix the implementation by
setting q->tail to NULL in sfq_drop if this is the last slot we marked
empty, e.g.:

  --- a/net/sched/sch_sfq.c
  +++ b/net/sched/sch_sfq.c
  @@ -317,8 +317,11 @@ static unsigned int sfq_drop(struct Qdisc *sch, struct sk_buff **to_free)
                  /* It is difficult to believe, but ALL THE SLOTS HAVE LENGTH 1. */
                  x = q->tail->next;
                  slot = &q->slots[x];
  -               q->tail->next = slot->next;
                  q->ht[slot->hash] = SFQ_EMPTY_SLOT;
  +               if (x == slot->next)
  +                       q->tail = NULL; /* no more active slots */
  +               else
  +                       q->tail->next = slot->next;
                  goto drop;
          }
====================

Link: https://patch.msgid.link/20241204030520.2084663-1-tavip@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agoselftests/tc-testing: sfq: test that kernel rejects limit of 1
Octavian Purdila [Wed, 4 Dec 2024 03:05:20 +0000 (19:05 -0800)]
selftests/tc-testing: sfq: test that kernel rejects limit of 1

Add test to check that the kernel rejects a configuration with the
limit set to 1.

Signed-off-by: Octavian Purdila <tavip@google.com>
Link: https://patch.msgid.link/20241204030520.2084663-3-tavip@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agonet_sched: sch_sfq: don't allow 1 packet limit
Octavian Purdila [Wed, 4 Dec 2024 03:05:19 +0000 (19:05 -0800)]
net_sched: sch_sfq: don't allow 1 packet limit

The current implementation does not work correctly with a limit of
1. iproute2 actually checks for this and this patch adds the check in
kernel as well.

This fixes the following syzkaller reported crash:

UBSAN: array-index-out-of-bounds in net/sched/sch_sfq.c:210:6
index 65535 is out of range for type 'struct sfq_head[128]'
CPU: 0 PID: 2569 Comm: syz-executor101 Not tainted 5.10.0-smp-DEV #1
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
Call Trace:
  __dump_stack lib/dump_stack.c:79 [inline]
  dump_stack+0x125/0x19f lib/dump_stack.c:120
  ubsan_epilogue lib/ubsan.c:148 [inline]
  __ubsan_handle_out_of_bounds+0xed/0x120 lib/ubsan.c:347
  sfq_link net/sched/sch_sfq.c:210 [inline]
  sfq_dec+0x528/0x600 net/sched/sch_sfq.c:238
  sfq_dequeue+0x39b/0x9d0 net/sched/sch_sfq.c:500
  sfq_reset+0x13/0x50 net/sched/sch_sfq.c:525
  qdisc_reset+0xfe/0x510 net/sched/sch_generic.c:1026
  tbf_reset+0x3d/0x100 net/sched/sch_tbf.c:319
  qdisc_reset+0xfe/0x510 net/sched/sch_generic.c:1026
  dev_reset_queue+0x8c/0x140 net/sched/sch_generic.c:1296
  netdev_for_each_tx_queue include/linux/netdevice.h:2350 [inline]
  dev_deactivate_many+0x6dc/0xc20 net/sched/sch_generic.c:1362
  __dev_close_many+0x214/0x350 net/core/dev.c:1468
  dev_close_many+0x207/0x510 net/core/dev.c:1506
  unregister_netdevice_many+0x40f/0x16b0 net/core/dev.c:10738
  unregister_netdevice_queue+0x2be/0x310 net/core/dev.c:10695
  unregister_netdevice include/linux/netdevice.h:2893 [inline]
  __tun_detach+0x6b6/0x1600 drivers/net/tun.c:689
  tun_detach drivers/net/tun.c:705 [inline]
  tun_chr_close+0x104/0x1b0 drivers/net/tun.c:3640
  __fput+0x203/0x840 fs/file_table.c:280
  task_work_run+0x129/0x1b0 kernel/task_work.c:185
  exit_task_work include/linux/task_work.h:33 [inline]
  do_exit+0x5ce/0x2200 kernel/exit.c:931
  do_group_exit+0x144/0x310 kernel/exit.c:1046
  __do_sys_exit_group kernel/exit.c:1057 [inline]
  __se_sys_exit_group kernel/exit.c:1055 [inline]
  __x64_sys_exit_group+0x3b/0x40 kernel/exit.c:1055
 do_syscall_64+0x6c/0xd0
 entry_SYSCALL_64_after_hwframe+0x61/0xcb
RIP: 0033:0x7fe5e7b52479
Code: Unable to access opcode bytes at RIP 0x7fe5e7b5244f.
RSP: 002b:00007ffd3c800398 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fe5e7b52479
RDX: 000000000000003c RSI: 00000000000000e7 RDI: 0000000000000000
RBP: 00007fe5e7bcd2d0 R08: ffffffffffffffb8 R09: 0000000000000014
R10: 0000000000000000 R11: 0000000000000246 R12: 00007fe5e7bcd2d0
R13: 0000000000000000 R14: 00007fe5e7bcdd20 R15: 00007fe5e7b24270

The crash can be also be reproduced with the following (with a tc
recompiled to allow for sfq limits of 1):

tc qdisc add dev dummy0 handle 1: root tbf rate 1Kbit burst 100b lat 1s
../iproute2-6.9.0/tc/tc qdisc add dev dummy0 handle 2: parent 1:10 sfq limit 1
ifconfig dummy0 up
ping -I dummy0 -f -c2 -W0.1 8.8.8.8
sleep 1

Scenario that triggers the crash:

* the first packet is sent and queued in TBF and SFQ; qdisc qlen is 1

* TBF dequeues: it peeks from SFQ which moves the packet to the
  gso_skb list and keeps qdisc qlen set to 1. TBF is out of tokens so
  it schedules itself for later.

* the second packet is sent and TBF tries to queues it to SFQ. qdisc
  qlen is now 2 and because the SFQ limit is 1 the packet is dropped
  by SFQ. At this point qlen is 1, and all of the SFQ slots are empty,
  however q->tail is not NULL.

At this point, assuming no more packets are queued, when sch_dequeue
runs again it will decrement the qlen for the current empty slot
causing an underflow and the subsequent out of bounds access.

Reported-by: syzbot <syzkaller@googlegroups.com>
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Octavian Purdila <tavip@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241204030520.2084663-2-tavip@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agonet_sched: sch_fq: add three drop_reason
Eric Dumazet [Wed, 4 Dec 2024 17:19:50 +0000 (17:19 +0000)]
net_sched: sch_fq: add three drop_reason

Add three new drop_reason, more precise than generic QDISC_DROP:

"tc -s qd" show aggregate counters, it might be more useful
to use drop_reason infrastructure for bug hunting.

1) SKB_DROP_REASON_FQ_BAND_LIMIT
   Whenever a packet is added while its band limit is hit.
   Corresponding value in "tc -s qd" is bandX_drops XXXX

2) SKB_DROP_REASON_FQ_HORIZON_LIMIT
   Whenever a packet has a timestamp too far in the future.
   Corresponding value in "tc -s qd" is horizon_drops XXXX

3) SKB_DROP_REASON_FQ_FLOW_LIMIT
   Whenever a flow has reached its limit.
   Corresponding value in "tc -s qd" is flows_plimit XXXX

Tested:
tc qd replace dev eth1 root fq flow_limit 10 limit 100000
perf record -a -e skb:kfree_skb sleep 1; perf script

      udp_stream   12329 [004]   216.929492: skb:kfree_skb: skbaddr=0xffff888eabe17e00 rx_sk=(nil) protocol=34525 location=__dev_queue_xmit+0x9d9 reason: FQ_FLOW_LIMIT
      udp_stream   12385 [006]   216.929593: skb:kfree_skb: skbaddr=0xffff888ef8827f00 rx_sk=(nil) protocol=34525 location=__dev_queue_xmit+0x9d9 reason: FQ_FLOW_LIMIT
      udp_stream   12389 [005]   216.929871: skb:kfree_skb: skbaddr=0xffff888ecb9ba500 rx_sk=(nil) protocol=34525 location=__dev_queue_xmit+0x9d9 reason: FQ_FLOW_LIMIT
      udp_stream   12316 [009]   216.930398: skb:kfree_skb: skbaddr=0xffff888eca286b00 rx_sk=(nil) protocol=34525 location=__dev_queue_xmit+0x9d9 reason: FQ_FLOW_LIMIT
      udp_stream   12400 [008]   216.930490: skb:kfree_skb: skbaddr=0xffff888eabf93d00 rx_sk=(nil) protocol=34525 location=__dev_queue_xmit+0x9d9 reason: FQ_FLOW_LIMIT

tc qd replace dev eth1 root fq flow_limit 100 limit 10000
perf record -a -e skb:kfree_skb sleep 1; perf script

      udp_stream   18074 [001]  1058.318040: skb:kfree_skb: skbaddr=0xffffa23c881fc000 rx_sk=(nil) protocol=34525 location=__dev_queue_xmit+0x9d9 reason: FQ_BAND_LIMIT
      udp_stream   18126 [005]  1058.320651: skb:kfree_skb: skbaddr=0xffffa23c6aad4000 rx_sk=(nil) protocol=34525 location=__dev_queue_xmit+0x9d9 reason: FQ_BAND_LIMIT
      udp_stream   18118 [006]  1058.321065: skb:kfree_skb: skbaddr=0xffffa23df0d48a00 rx_sk=(nil) protocol=34525 location=__dev_queue_xmit+0x9d9 reason: FQ_BAND_LIMIT
      udp_stream   18074 [001]  1058.321126: skb:kfree_skb: skbaddr=0xffffa23c881ffa00 rx_sk=(nil) protocol=34525 location=__dev_queue_xmit+0x9d9 reason: FQ_BAND_LIMIT
      udp_stream   15815 [003]  1058.321224: skb:kfree_skb: skbaddr=0xffffa23c9835db00 rx_sk=(nil) protocol=34525 location=__dev_queue_xmit+0x9d9 reason: FQ_BAND_LIMIT

tc -s -d qd sh dev eth1
qdisc fq 8023: root refcnt 257 limit 10000p flow_limit 100p buckets 1024 orphan_mask 1023
 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 weights 589824 196608 65536 quantum 18Kb
 initial_quantum 92120b low_rate_threshold 550Kbit refill_delay 40ms
 timer_slack 10us horizon 10s horizon_drop
 Sent 492439603330 bytes 336953991 pkt (dropped 61724094, overlimits 0 requeues 4463)
 backlog 14611228b 9995p requeues 4463
  flows 2965 (inactive 1151 throttled 0) band0_pkts 0 band1_pkts 9993 band2_pkts 0
  gc 6347 highprio 0 fastpath 30 throttled 5 latency 2.32us flows_plimit 7403693
 band1_drops 54320401

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20241204171950.89829-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agoMerge branch 'ethtool-generate-uapi-header-from-the-spec'
Jakub Kicinski [Thu, 5 Dec 2024 20:03:09 +0000 (12:03 -0800)]
Merge branch 'ethtool-generate-uapi-header-from-the-spec'

Stanislav Fomichev says:

====================
ethtool: generate uapi header from the spec

We keep expanding ethtool netlink api surface and this leads to
constantly playing catchup on the ynl spec side. There are a couple
of things that prevent us from fully converting to generating
the header from the spec (stats and cable tests), but we can
generate 95% of the header which is still better than maintaining
c header and spec separately. The series adds a couple of missing
features on the ynl-gen-c side and separates the parts
that we can generate into new ethtool_netlink_generated.h.
====================

Link: https://patch.msgid.link/20241204155549.641348-1-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agoethtool: regenerate uapi header from the spec
Stanislav Fomichev [Wed, 4 Dec 2024 15:55:49 +0000 (07:55 -0800)]
ethtool: regenerate uapi header from the spec

No functional changes. Mostly the following formatting:
- extra docs
- extra enums
- XXX_MAX = __XXX_CNT - 1 -> XXX_MAX = (__XXX_CNT - 1)
- newlines

Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20241204155549.641348-9-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agoethtool: remove the comments that are not gonna be generated
Stanislav Fomichev [Wed, 4 Dec 2024 15:55:48 +0000 (07:55 -0800)]
ethtool: remove the comments that are not gonna be generated

Cleanup the header manually to make it easier to review the changes that ynl
generator brings in. No functional changes.

Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20241204155549.641348-8-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agoethtool: separate definitions that are gonna be generated
Stanislav Fomichev [Wed, 4 Dec 2024 15:55:47 +0000 (07:55 -0800)]
ethtool: separate definitions that are gonna be generated

Reshuffle definitions that are gonna be generated into
ethtool_netlink_generated.h and match ynl spec order.
This should make it easier to compare the output of the ynl-gen-c
to the existing uapi header. No functional changes.

Things that are still remaining to be manually defined:
- ETHTOOL_FLAG_ALL - probably no good way to add to spec?
- some of the cable test bits (not sure whether it's possible to move to
  spec)
- some of the stats definitions (no way currently to move to spec)

Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20241204155549.641348-7-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agoynl: include uapi header after all dependencies
Stanislav Fomichev [Wed, 4 Dec 2024 15:55:46 +0000 (07:55 -0800)]
ynl: include uapi header after all dependencies

Essentially reverse the order of headers for userspace generated files.

Before (make -C tools/net/ynl/; cat tools/net/ynl/ethtool-user.h):
  #include <linux/ethtool_netlink_generated.h>
  #include <linux/ethtool.h>
  #include <linux/ethtool.h>
  #include <linux/ethtool.h>

After:
  #include <linux/ethtool.h>
  #include <linux/ethtool_netlink_generated.h>

While at it, make sure we track which headers we've already included
and include the headers only once.

Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20241204155549.641348-6-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agoynl: add missing pieces to ethtool spec to better match uapi header
Stanislav Fomichev [Wed, 4 Dec 2024 15:55:45 +0000 (07:55 -0800)]
ynl: add missing pieces to ethtool spec to better match uapi header

- __ETHTOOL_UDP_TUNNEL_TYPE_CNT and render max
- skip rendering stringset (empty enum)
- skip rendering c33-pse-ext-state (defined in ethtool.h)
- rename header flags to ethtool-flag-
- add attr-cnt-name to each attribute to use XXX_CNT instead of XXX_MAX
- add unspec 0 entry to each attribute
- carry some doc entries from the existing header
- tcp-header-split

Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20241204155549.641348-5-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agoynl: support directional specs in ynl-gen-c.py
Stanislav Fomichev [Wed, 4 Dec 2024 15:55:44 +0000 (07:55 -0800)]
ynl: support directional specs in ynl-gen-c.py

The intent is to generate ethtool uapi headers. For now, some of the
things are hard-coded:
- <FAMILY>_MSG_{USER,KERNEL}_MAX
- the split between USER and KERNEL messages

Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20241204155549.641348-4-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agoynl: skip rendering attributes with header property in uapi mode
Stanislav Fomichev [Wed, 4 Dec 2024 15:55:43 +0000 (07:55 -0800)]
ynl: skip rendering attributes with header property in uapi mode

To allow omitting some of the attributes in the final generated file.

Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20241204155549.641348-3-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agoynl: support enum-cnt-name attribute in legacy definitions
Stanislav Fomichev [Wed, 4 Dec 2024 15:55:42 +0000 (07:55 -0800)]
ynl: support enum-cnt-name attribute in legacy definitions

This is similar to existing attr-cnt-name in the attributes
to allow changing the name of the 'count' enum entry.

Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20241204155549.641348-2-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Jakub Kicinski [Thu, 5 Dec 2024 19:48:58 +0000 (11:48 -0800)]
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Cross-merge networking fixes after downstream PR (net-6.13-rc2).

No conflicts or adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agoMerge tag 'net-6.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Linus Torvalds [Thu, 5 Dec 2024 18:25:06 +0000 (10:25 -0800)]
Merge tag 'net-6.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
 "Including fixes from can and netfilter.

  Current release - regressions:

   - rtnetlink: fix double call of rtnl_link_get_net_ifla()

   - tcp: populate XPS related fields of timewait sockets

   - ethtool: fix access to uninitialized fields in set RXNFC command

   - selinux: use sk_to_full_sk() in selinux_ip_output()

  Current release - new code bugs:

   - net: make napi_hash_lock irq safe

   - eth:
      - bnxt_en: support header page pool in queue API
      - ice: fix NULL pointer dereference in switchdev

  Previous releases - regressions:

   - core: fix icmp host relookup triggering ip_rt_bug

   - ipv6:
      - avoid possible NULL deref in modify_prefix_route()
      - release expired exception dst cached in socket

   - smc: fix LGR and link use-after-free issue

   - hsr: avoid potential out-of-bound access in fill_frame_info()

   - can: hi311x: fix potential use-after-free

   - eth: ice: fix VLAN pruning in switchdev mode

  Previous releases - always broken:

   - netfilter:
      - ipset: hold module reference while requesting a module
      - nft_inner: incorrect percpu area handling under softirq

   - can: j1939: fix skb reference counting

   - eth:
      - mlxsw: use correct key block on Spectrum-4
      - mlx5: fix memory leak in mlx5hws_definer_calc_layout"

* tag 'net-6.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (76 commits)
  net :mana :Request a V2 response version for MANA_QUERY_GF_STAT
  net: avoid potential UAF in default_operstate()
  vsock/test: verify socket options after setting them
  vsock/test: fix parameter types in SO_VM_SOCKETS_* calls
  vsock/test: fix failures due to wrong SO_RCVLOWAT parameter
  net/mlx5e: Remove workaround to avoid syndrome for internal port
  net/mlx5e: SD, Use correct mdev to build channel param
  net/mlx5: E-Switch, Fix switching to switchdev mode in MPV
  net/mlx5: E-Switch, Fix switching to switchdev mode with IB device disabled
  net/mlx5: HWS: Properly set bwc queue locks lock classes
  net/mlx5: HWS: Fix memory leak in mlx5hws_definer_calc_layout
  bnxt_en: handle tpa_info in queue API implementation
  bnxt_en: refactor bnxt_alloc_rx_rings() to call bnxt_alloc_rx_agg_bmap()
  bnxt_en: refactor tpa_info alloc/free into helpers
  geneve: do not assume mac header is set in geneve_xmit_skb()
  mlxsw: spectrum_acl_flex_keys: Use correct key block on Spectrum-4
  ethtool: Fix wrong mod state in case of verbose and no_mask bitset
  ipmr: tune the ipmr_can_free_table() checks.
  netfilter: nft_set_hash: skip duplicated elements pending gc run
  netfilter: ipset: Hold module reference while requesting a module
  ...

6 months agoMerge tag 'trace-v6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace...
Linus Torvalds [Thu, 5 Dec 2024 18:17:55 +0000 (10:17 -0800)]
Merge tag 'trace-v6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Fix trace histogram sort function cmp_entries_dup()

   The sort function cmp_entries_dup() returns either 1 or 0, and not -1
   if parameter "a" is less than "b" by memcmp().

 - Fix archs that call trace_hardirqs_off() without RCU watching

   Both x86 and arm64 no longer call any tracepoints with RCU not
   watching. It was assumed that it was safe to get rid of
   trace_*_rcuidle() version of the tracepoint calls. This was needed to
   get rid of the SRCU protection and be able to implement features like
   faultable traceponits and add rust tracepoints.

   Unfortunately, there were a few architectures that still relied on
   that logic. There's only one file that has tracepoints that are
   called without RCU watching. Add macro logic around the tracepoints
   for architectures that do not have CONFIG_ARCH_WANTS_NO_INSTR defined
   will check if the code is in the idle path (the only place RCU isn't
   watching), and enable RCU around calling the tracepoint, but only do
   it if the tracepoint is enabled.

* tag 'trace-v6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Fix archs that still call tracepoints without RCU watching
  tracing: Fix cmp_entries_dup() to respect sort() comparison rules

6 months agoMerge tag 'hid-for-linus-2024120501' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Thu, 5 Dec 2024 18:06:47 +0000 (10:06 -0800)]
Merge tag 'hid-for-linus-2024120501' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid

Pull HID fixes from Benjamin Tissoires:

 - regression fix in suspend/resume for i2c-hid (Kenny Levinsen)

 - fix wacom driver assuming a name can not be null (WangYuli)

 - a couple of constify changes/fixes (Thomas Weißschuh)

 - a couple of selftests/hid fixes (Maximilian Heyne & Benjamin
   Tissoires)

* tag 'hid-for-linus-2024120501' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid:
  selftests/hid: fix kfunc inclusions with newer bpftool
  HID: bpf: drop unneeded casts discarding const
  HID: bpf: constify hid_ops
  selftests: hid: fix typo and exit code
  HID: wacom: fix when get product name maybe null pointer
  HID: i2c-hid: Revert to using power commands to wake on resume

6 months agoMerge tag 'linux-watchdog-6.13-rc1' of git://www.linux-watchdog.org/linux-watchdog
Linus Torvalds [Thu, 5 Dec 2024 18:03:43 +0000 (10:03 -0800)]
Merge tag 'linux-watchdog-6.13-rc1' of git://www.linux-watchdog.org/linux-watchdog

Pull watchdog updates from Wim Van Sebroeck:

 - Add support for exynosautov920 SoC

 - Add support for Airoha EN7851 watchdog

 - Add support for MT6735 TOPRGU/WDT

 - Delete the cpu5wdt driver

 - Always print when registering watchdog fails

 - Several other small fixes and improvements

* tag 'linux-watchdog-6.13-rc1' of git://www.linux-watchdog.org/linux-watchdog: (36 commits)
  watchdog: rti: of: honor timeout-sec property
  watchdog: s3c2410_wdt: add support for exynosautov920 SoC
  dt-bindings: watchdog: Document ExynosAutoV920 watchdog bindings
  watchdog: mediatek: Add support for MT6735 TOPRGU/WDT
  watchdog: mediatek: Make sure system reset gets asserted in mtk_wdt_restart()
  dt-bindings: watchdog: fsl-imx-wdt: Add missing 'big-endian' property
  dt-bindings: watchdog: Document Qualcomm QCS8300
  docs: ABI: Fix spelling mistake in pretimeout_avaialable_governors
  Revert "watchdog: s3c2410_wdt: use exynos_get_pmu_regmap_by_phandle() for PMU regs"
  watchdog: rzg2l_wdt: Power on the watchdog domain in the restart handler
  watchdog: Switch back to struct platform_driver::remove()
  watchdog: it87_wdt: add PWRGD enable quirk for Qotom QCML04
  watchdog: da9063: Remove __maybe_unused notations
  watchdog: da9063: Do not use a global variable
  watchdog: Delete the cpu5wdt driver
  watchdog: Add support for Airoha EN7851 watchdog
  dt-bindings: watchdog: airoha: document watchdog for Airoha EN7581
  watchdog: sl28cpld_wdt: don't print out if registering watchdog fails
  watchdog: rza_wdt: don't print out if registering watchdog fails
  watchdog: rti_wdt: don't print out if registering watchdog fails
  ...

6 months agotracing: Fix archs that still call tracepoints without RCU watching
Steven Rostedt [Wed, 4 Dec 2024 15:04:14 +0000 (10:04 -0500)]
tracing: Fix archs that still call tracepoints without RCU watching

Tracepoints require having RCU "watching" as it uses RCU to do updates to
the tracepoints. There are some cases that would call a tracepoint when
RCU was not "watching". This was usually in the idle path where RCU has
"shutdown". For the few locations that had tracepoints without RCU
watching, there was an trace_*_rcuidle() variant that could be used. This
used SRCU for protection.

There are tracepoints that trace when interrupts and preemption are
enabled and disabled. In some architectures, these tracepoints are called
in a path where RCU is not watching. When x86 and arm64 removed these
locations, it was incorrectly assumed that it would be safe to remove the
trace_*_rcuidle() variant and also remove the SRCU logic, as it made the
code more complex and harder to implement new tracepoint features (like
faultable tracepoints and tracepoints in rust).

Instead of bringing back the trace_*_rcuidle(), as it will not be trivial
to do as new code has already been added depending on its removal, add a
workaround to the one file that still requires it (trace_preemptirq.c). If
the architecture does not define CONFIG_ARCH_WANTS_NO_INSTR, then check if
the code is in the idle path, and if so, call ct_irq_enter/exit() which
will enable RCU around the tracepoint.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/20241204100414.4d3e06d0@gandalf.local.home
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Fixes: 48bcda684823 ("tracing: Remove definition of trace_*_rcuidle()")
Closes: https://lore.kernel.org/all/bddb02de-957a-4df5-8e77-829f55728ea2@roeck-us.net/
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Tested-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
6 months agonet :mana :Request a V2 response version for MANA_QUERY_GF_STAT
Shradha Gupta [Wed, 4 Dec 2024 05:48:20 +0000 (21:48 -0800)]
net :mana :Request a V2 response version for MANA_QUERY_GF_STAT

The current requested response version(V1) for MANA_QUERY_GF_STAT query
results in STATISTICS_FLAGS_TX_ERRORS_GDMA_ERROR value being set to
0 always.
In order to get the correct value for this counter we request the response
version to be V2.

Cc: stable@vger.kernel.org
Fixes: e1df5202e879 ("net :mana :Add remaining GDMA stats for MANA to ethtool")
Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Link: https://patch.msgid.link/1733291300-12593-1-git-send-email-shradhagupta@linux.microsoft.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 months agonet: avoid potential UAF in default_operstate()
Eric Dumazet [Tue, 3 Dec 2024 17:09:33 +0000 (17:09 +0000)]
net: avoid potential UAF in default_operstate()

syzbot reported an UAF in default_operstate() [1]

Issue is a race between device and netns dismantles.

After calling __rtnl_unlock() from netdev_run_todo(),
we can not assume the netns of each device is still alive.

Make sure the device is not in NETREG_UNREGISTERED state,
and add an ASSERT_RTNL() before the call to
__dev_get_by_index().

We might move this ASSERT_RTNL() in __dev_get_by_index()
in the future.

[1]

BUG: KASAN: slab-use-after-free in __dev_get_by_index+0x5d/0x110 net/core/dev.c:852
Read of size 8 at addr ffff888043eba1b0 by task syz.0.0/5339

CPU: 0 UID: 0 PID: 5339 Comm: syz.0.0 Not tainted 6.12.0-syzkaller-10296-gaaf20f870da0 #0
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
Call Trace:
 <TASK>
  __dump_stack lib/dump_stack.c:94 [inline]
  dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
  print_address_description mm/kasan/report.c:378 [inline]
  print_report+0x169/0x550 mm/kasan/report.c:489
  kasan_report+0x143/0x180 mm/kasan/report.c:602
  __dev_get_by_index+0x5d/0x110 net/core/dev.c:852
  default_operstate net/core/link_watch.c:51 [inline]
  rfc2863_policy+0x224/0x300 net/core/link_watch.c:67
  linkwatch_do_dev+0x3e/0x170 net/core/link_watch.c:170
  netdev_run_todo+0x461/0x1000 net/core/dev.c:10894
  rtnl_unlock net/core/rtnetlink.c:152 [inline]
  rtnl_net_unlock include/linux/rtnetlink.h:133 [inline]
  rtnl_dellink+0x760/0x8d0 net/core/rtnetlink.c:3520
  rtnetlink_rcv_msg+0x791/0xcf0 net/core/rtnetlink.c:6911
  netlink_rcv_skb+0x1e3/0x430 net/netlink/af_netlink.c:2541
  netlink_unicast_kernel net/netlink/af_netlink.c:1321 [inline]
  netlink_unicast+0x7f6/0x990 net/netlink/af_netlink.c:1347
  netlink_sendmsg+0x8e4/0xcb0 net/netlink/af_netlink.c:1891
  sock_sendmsg_nosec net/socket.c:711 [inline]
  __sock_sendmsg+0x221/0x270 net/socket.c:726
  ____sys_sendmsg+0x52a/0x7e0 net/socket.c:2583
  ___sys_sendmsg net/socket.c:2637 [inline]
  __sys_sendmsg+0x269/0x350 net/socket.c:2669
  do_syscall_x64 arch/x86/entry/common.c:52 [inline]
  do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f2a3cb80809
Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f2a3d9cd058 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00007f2a3cd45fa0 RCX: 00007f2a3cb80809
RDX: 0000000000000000 RSI: 0000000020000000 RDI: 0000000000000008
RBP: 00007f2a3cbf393e R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000000 R14: 00007f2a3cd45fa0 R15: 00007ffd03bc65c8
 </TASK>

Allocated by task 5339:
  kasan_save_stack mm/kasan/common.c:47 [inline]
  kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
  poison_kmalloc_redzone mm/kasan/common.c:377 [inline]
  __kasan_kmalloc+0x98/0xb0 mm/kasan/common.c:394
  kasan_kmalloc include/linux/kasan.h:260 [inline]
  __kmalloc_cache_noprof+0x243/0x390 mm/slub.c:4314
  kmalloc_noprof include/linux/slab.h:901 [inline]
  kmalloc_array_noprof include/linux/slab.h:945 [inline]
  netdev_create_hash net/core/dev.c:11870 [inline]
  netdev_init+0x10c/0x250 net/core/dev.c:11890
  ops_init+0x31e/0x590 net/core/net_namespace.c:138
  setup_net+0x287/0x9e0 net/core/net_namespace.c:362
  copy_net_ns+0x33f/0x570 net/core/net_namespace.c:500
  create_new_namespaces+0x425/0x7b0 kernel/nsproxy.c:110
  unshare_nsproxy_namespaces+0x124/0x180 kernel/nsproxy.c:228
  ksys_unshare+0x57d/0xa70 kernel/fork.c:3314
  __do_sys_unshare kernel/fork.c:3385 [inline]
  __se_sys_unshare kernel/fork.c:3383 [inline]
  __x64_sys_unshare+0x38/0x40 kernel/fork.c:3383
  do_syscall_x64 arch/x86/entry/common.c:52 [inline]
  do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

Freed by task 12:
  kasan_save_stack mm/kasan/common.c:47 [inline]
  kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
  kasan_save_free_info+0x40/0x50 mm/kasan/generic.c:582
  poison_slab_object mm/kasan/common.c:247 [inline]
  __kasan_slab_free+0x59/0x70 mm/kasan/common.c:264
  kasan_slab_free include/linux/kasan.h:233 [inline]
  slab_free_hook mm/slub.c:2338 [inline]
  slab_free mm/slub.c:4598 [inline]
  kfree+0x196/0x420 mm/slub.c:4746
  netdev_exit+0x65/0xd0 net/core/dev.c:11992
  ops_exit_list net/core/net_namespace.c:172 [inline]
  cleanup_net+0x802/0xcc0 net/core/net_namespace.c:632
  process_one_work kernel/workqueue.c:3229 [inline]
  process_scheduled_works+0xa63/0x1850 kernel/workqueue.c:3310
  worker_thread+0x870/0xd30 kernel/workqueue.c:3391
  kthread+0x2f0/0x390 kernel/kthread.c:389
  ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
  ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244

The buggy address belongs to the object at ffff888043eba000
 which belongs to the cache kmalloc-2k of size 2048
The buggy address is located 432 bytes inside of
 freed 2048-byte region [ffff888043eba000ffff888043eba800)

The buggy address belongs to the physical page:
page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x43eb8
head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
flags: 0x4fff00000000040(head|node=1|zone=1|lastcpupid=0x7ff)
page_type: f5(slab)
raw: 04fff00000000040 ffff88801ac42000 dead000000000122 0000000000000000
raw: 0000000000000000 0000000000080008 00000001f5000000 0000000000000000
head: 04fff00000000040 ffff88801ac42000 dead000000000122 0000000000000000
head: 0000000000000000 0000000000080008 00000001f5000000 0000000000000000
head: 04fff00000000003 ffffea00010fae01 ffffffffffffffff 0000000000000000
head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 3, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 5339, tgid 5338 (syz.0.0), ts 69674195892, free_ts 69663220888
  set_page_owner include/linux/page_owner.h:32 [inline]
  post_alloc_hook+0x1f3/0x230 mm/page_alloc.c:1556
  prep_new_page mm/page_alloc.c:1564 [inline]
  get_page_from_freelist+0x3649/0x3790 mm/page_alloc.c:3474
  __alloc_pages_noprof+0x292/0x710 mm/page_alloc.c:4751
  alloc_pages_mpol_noprof+0x3e8/0x680 mm/mempolicy.c:2265
  alloc_slab_page+0x6a/0x140 mm/slub.c:2408
  allocate_slab+0x5a/0x2f0 mm/slub.c:2574
  new_slab mm/slub.c:2627 [inline]
  ___slab_alloc+0xcd1/0x14b0 mm/slub.c:3815
  __slab_alloc+0x58/0xa0 mm/slub.c:3905
  __slab_alloc_node mm/slub.c:3980 [inline]
  slab_alloc_node mm/slub.c:4141 [inline]
  __do_kmalloc_node mm/slub.c:4282 [inline]
  __kmalloc_noprof+0x2e6/0x4c0 mm/slub.c:4295
  kmalloc_noprof include/linux/slab.h:905 [inline]
  sk_prot_alloc+0xe0/0x210 net/core/sock.c:2165
  sk_alloc+0x38/0x370 net/core/sock.c:2218
  __netlink_create+0x65/0x260 net/netlink/af_netlink.c:629
  __netlink_kernel_create+0x174/0x6f0 net/netlink/af_netlink.c:2015
  netlink_kernel_create include/linux/netlink.h:62 [inline]
  uevent_net_init+0xed/0x2d0 lib/kobject_uevent.c:783
  ops_init+0x31e/0x590 net/core/net_namespace.c:138
  setup_net+0x287/0x9e0 net/core/net_namespace.c:362
page last free pid 1032 tgid 1032 stack trace:
  reset_page_owner include/linux/page_owner.h:25 [inline]
  free_pages_prepare mm/page_alloc.c:1127 [inline]
  free_unref_page+0xdf9/0x1140 mm/page_alloc.c:2657
  __slab_free+0x31b/0x3d0 mm/slub.c:4509
  qlink_free mm/kasan/quarantine.c:163 [inline]
  qlist_free_all+0x9a/0x140 mm/kasan/quarantine.c:179
  kasan_quarantine_reduce+0x14f/0x170 mm/kasan/quarantine.c:286
  __kasan_slab_alloc+0x23/0x80 mm/kasan/common.c:329
  kasan_slab_alloc include/linux/kasan.h:250 [inline]
  slab_post_alloc_hook mm/slub.c:4104 [inline]
  slab_alloc_node mm/slub.c:4153 [inline]
  kmem_cache_alloc_node_noprof+0x1d9/0x380 mm/slub.c:4205
  __alloc_skb+0x1c3/0x440 net/core/skbuff.c:668
  alloc_skb include/linux/skbuff.h:1323 [inline]
  alloc_skb_with_frags+0xc3/0x820 net/core/skbuff.c:6612
  sock_alloc_send_pskb+0x91a/0xa60 net/core/sock.c:2881
  sock_alloc_send_skb include/net/sock.h:1797 [inline]
  mld_newpack+0x1c3/0xaf0 net/ipv6/mcast.c:1747
  add_grhead net/ipv6/mcast.c:1850 [inline]
  add_grec+0x1492/0x19a0 net/ipv6/mcast.c:1988
  mld_send_initial_cr+0x228/0x4b0 net/ipv6/mcast.c:2234
  ipv6_mc_dad_complete+0x88/0x490 net/ipv6/mcast.c:2245
  addrconf_dad_completed+0x712/0xcd0 net/ipv6/addrconf.c:4342
 addrconf_dad_work+0xdc2/0x16f0
  process_one_work kernel/workqueue.c:3229 [inline]
  process_scheduled_works+0xa63/0x1850 kernel/workqueue.c:3310

Memory state around the buggy address:
 ffff888043eba080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff888043eba100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff888043eba180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                     ^
 ffff888043eba200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff888043eba280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb

Fixes: 8c55facecd7a ("net: linkwatch: only report IF_OPER_LOWERLAYERDOWN if iflink is actually down")
Reported-by: syzbot+1939f24bdb783e9e43d9@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/674f3a18.050a0220.48a03.0041.GAE@google.com/T/#u
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20241203170933.2449307-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 months agoMerge tag 'nf-24-12-05' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Paolo Abeni [Thu, 5 Dec 2024 10:49:14 +0000 (11:49 +0100)]
Merge tag 'nf-24-12-05' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Fix esoteric undefined behaviour due to uninitialized stack access
   in ip_vs_protocol_init(), from Jinghao Jia.

2) Fix iptables xt_LED slab-out-of-bounds due to incorrect sanitization
   of the led string identifier, reported by syzbot. Patch from
   Dmitry Antipov.

3) Remove WARN_ON_ONCE reachable from userspace to check for the maximum
   cgroup level, nft_socket cgroup matching is restricted to 255 levels,
   but cgroups allow for INT_MAX levels by default. Reported by syzbot.

4) Fix nft_inner incorrect use of percpu area to store tunnel parser
   context with softirqs, resulting in inconsistent inner header
   offsets that could lead to bogus rule mismatches, reported by syzbot.

5) Grab module reference on ipset core while requesting set type modules,
   otherwise kernel crash is possible by removing ipset core module,
   patch from Phil Sutter.

6) Fix possible double-free in nft_hash garbage collector due to unstable
   walk interator that can provide twice the same element. Use a sequence
   number to skip expired/dead elements that have been already scheduled
   for removal. Based on patch from Laurent Fasnach

netfilter pull request 24-12-05

* tag 'nf-24-12-05' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: nft_set_hash: skip duplicated elements pending gc run
  netfilter: ipset: Hold module reference while requesting a module
  netfilter: nft_inner: incorrect percpu area handling under softirq
  netfilter: nft_socket: remove WARN_ON_ONCE on maximum cgroup level
  netfilter: x_tables: fix LED ID check in led_tg_check()
  ipvs: fix UB due to uninitialized stack access in ip_vs_protocol_init()
====================

Link: https://patch.msgid.link/20241205002854.162490-1-pablo@netfilter.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 months agoMerge branch 'vsock-test-fix-wrong-setsockopt-parameters'
Paolo Abeni [Thu, 5 Dec 2024 10:39:36 +0000 (11:39 +0100)]
Merge branch 'vsock-test-fix-wrong-setsockopt-parameters'

Konstantin Shkolnyy says:

====================
vsock/test: fix wrong setsockopt() parameters

Parameters were created using wrong C types, which caused them to be of
wrong size on some architectures, causing problems.

The problem with SO_RCVLOWAT was found on s390 (big endian), while x86-64
didn't show it. After the fix, all tests pass on s390.
Then Stefano Garzarella pointed out that SO_VM_SOCKETS_* calls might have
a similar problem, which turned out to be true, hence, the second patch.

Changes for v8:
- Fix whitespace warnings from "checkpatch.pl --strict"
- Add maintainers to Cc:
Changes for v7:
- Rebase on top of https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git
- Add the "net" tags to the subjects
Changes for v6:
- rework the patch #3 to avoid creating a new file for new functions,
and exclude vsock_perf from calling the new functions.
- add "Reviewed-by:" to the patch #2.
Changes for v5:
- in the patch #2 replace the introduced uint64_t with unsigned long long
to match documentation
- add a patch #3 that verifies every setsockopt() call.
Changes for v4:
- add "Reviewed-by:" to the first patch, and add a second patch fixing
SO_VM_SOCKETS_* calls, which depends on the first one (hence, it's now
a patch series.)
Changes for v3:
- fix the same problem in vsock_perf and update commit message
Changes for v2:
- add "Fixes:" lines to the commit message
====================

Link: https://patch.msgid.link/20241203150656.287028-1-kshk@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 months agovsock/test: verify socket options after setting them
Konstantin Shkolnyy [Tue, 3 Dec 2024 15:06:56 +0000 (09:06 -0600)]
vsock/test: verify socket options after setting them

Replace setsockopt() calls with calls to functions that follow
setsockopt() with getsockopt() and check that the returned value and its
size are the same as have been set. (Except in vsock_perf.)

Signed-off-by: Konstantin Shkolnyy <kshk@linux.ibm.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 months agovsock/test: fix parameter types in SO_VM_SOCKETS_* calls
Konstantin Shkolnyy [Tue, 3 Dec 2024 15:06:55 +0000 (09:06 -0600)]
vsock/test: fix parameter types in SO_VM_SOCKETS_* calls

Change parameters of SO_VM_SOCKETS_* to unsigned long long as documented
in the vm_sockets.h, because the corresponding kernel code requires them
to be at least 64-bit, no matter what architecture. Otherwise they are
too small on 32-bit machines.

Fixes: 5c338112e48a ("test/vsock: rework message bounds test")
Fixes: 685a21c314a8 ("test/vsock: add big message test")
Fixes: 542e893fbadc ("vsock/test: two tests to check credit update logic")
Fixes: 8abbffd27ced ("test/vsock: vsock_perf utility")
Signed-off-by: Konstantin Shkolnyy <kshk@linux.ibm.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 months agovsock/test: fix failures due to wrong SO_RCVLOWAT parameter
Konstantin Shkolnyy [Tue, 3 Dec 2024 15:06:54 +0000 (09:06 -0600)]
vsock/test: fix failures due to wrong SO_RCVLOWAT parameter

This happens on 64-bit big-endian machines.
SO_RCVLOWAT requires an int parameter. However, instead of int, the test
uses unsigned long in one place and size_t in another. Both are 8 bytes
long on 64-bit machines. The kernel, having received the 8 bytes, doesn't
test for the exact size of the parameter, it only cares that it's >=
sizeof(int), and casts the 4 lower-addressed bytes to an int, which, on
a big-endian machine, contains 0. 0 doesn't trigger an error, SO_RCVLOWAT
returns with success and the socket stays with the default SO_RCVLOWAT = 1,
which results in vsock_test failures, while vsock_perf doesn't even notice
that it's failed to change it.

Fixes: b1346338fbae ("vsock_test: POLLIN + SO_RCVLOWAT test")
Fixes: 542e893fbadc ("vsock/test: two tests to check credit update logic")
Fixes: 8abbffd27ced ("test/vsock: vsock_perf utility")
Signed-off-by: Konstantin Shkolnyy <kshk@linux.ibm.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 months agoMerge branch 'mitigate-the-two-reallocations-issue-for-iptunnels'
Paolo Abeni [Thu, 5 Dec 2024 10:15:59 +0000 (11:15 +0100)]
Merge branch 'mitigate-the-two-reallocations-issue-for-iptunnels'

Justin Iurman says:

====================
Mitigate the two-reallocations issue for iptunnels

RESEND v5:
- v5 was sent just when net-next closed
v5:
- address Paolo's comments
- s/int dst_dev_overhead()/unsigned int dst_dev_overhead()/
v4:
- move static inline function to include/net/dst.h
v3:
- fix compilation error in seg6_iptunnel
v2:
- add missing "static" keywords in seg6_iptunnel
- use a static-inline function to return the dev overhead (as suggested
  by Olek, thanks)

The same pattern is found in ioam6, rpl6, and seg6. Basically, it first
makes sure there is enough room for inserting a new header:

(1) err = skb_cow_head(skb, len + skb->mac_len);

Then, when the insertion (encap or inline) is performed, the input and
output handlers respectively make sure there is enough room for layer 2:

(2) err = skb_cow_head(skb, LL_RESERVED_SPACE(dst->dev));

skb_cow_head() does nothing when there is enough room. Otherwise, it
reallocates more room, which depends on the architecture. Briefly,
skb_cow_head() calls __skb_cow() which then calls pskb_expand_head() as
follows:

pskb_expand_head(skb, ALIGN(delta, NET_SKB_PAD), 0, GFP_ATOMIC);

"delta" represents the number of bytes to be added. This value is
aligned with NET_SKB_PAD, which is defined as follows:

NET_SKB_PAD = max(32, L1_CACHE_BYTES)

... where L1_CACHE_BYTES also depends on the architecture. In our case
(x86), it is defined as follows:

L1_CACHE_BYTES = (1 << CONFIG_X86_L1_CACHE_SHIFT)

... where (again, in our case) CONFIG_X86_L1_CACHE_SHIFT equals 6
(=X86_GENERIC).

All this to say, skb_cow_head() would reallocate to the next multiple of
NET_SKB_PAD (in our case a 64-byte multiple) when there is not enough
room.

Back to the main issue with the pattern: in some cases, two
reallocations are triggered, resulting in a performance drop (i.e.,
lines (1) and (2) would both trigger an implicit reallocation). How's
that possible? Well, this is kind of bad luck as we hit an exact
NET_SKB_PAD boundary and when skb->mac_len (=14) is smaller than
LL_RESERVED_SPACE(dst->dev) (=16 in our case). For an x86 arch, it
happens in the following cases (with the default needed_headroom):

- ioam6:
 - (inline mode) pre-allocated data trace of 236 or 240 bytes
 - (encap mode) pre-allocated data trace of 196 or 200 bytes
- seg6:
 - (encap mode) for 13, 17, 21, 25, 29, 33, ...(+4)... prefixes

Let's illustrate the problem, i.e., when we fall on the exact
NET_SKB_PAD boundary. In the case of ioam6, for the above problematic
values, the total overhead is 256 bytes for both modes. Based on line
(1), skb->mac_len (=14) is added, therefore passing 270 bytes to
skb_cow_head(). At that moment, the headroom has 206 bytes available (in
our case). Since 270 > 206, skb_cow_head() performs a reallocation and
the new headroom is now 206 + 64 (NET_SKB_PAD) = 270. Which is exactly
the room we needed. After the insertion, the headroom has 0 byte
available. But, there's line (2) where 16 bytes are still needed. Which,
again, triggers another reallocation.

The same logic is applied to seg6 (although it does not happen with the
inline mode, i.e., -40 bytes). It happens with other L1 cache shifts too
(the larger the cache shift, the less often it happens). For example,
with a +32 cache shift (instead of +64), the following number of
segments would trigger two reallocations: 11, 15, 19, ... With a +128
cache shift, the following number of segments would trigger two
reallocations: 17, 25, 33, ... And so on and so forth. Note that it is
the same for both the "encap" and "l2encap" modes. For the "encap.red"
and "l2encap.red" modes, it is the same logic but with "segs+1" (e.g.,
14, 18, 22, 26, etc for a +64 cache shift). Note also that it may happen
with rpl6 (based on some calculations), although it did not in our case.

This series provides a solution to mitigate the aforementioned issue for
ioam6, seg6, and rpl6. It provides the dst_entry (in the cache) to
skb_cow_head() **before** the insertion (line (1)). As a result, the
very first iteration would still trigger two reallocations (i.e., empty
cache), while next iterations would only trigger a single reallocation.
====================

Link: https://patch.msgid.link/20241203124945.22508-1-justin.iurman@uliege.be
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 months agonet: ipv6: rpl_iptunnel: mitigate 2-realloc issue
Justin Iurman [Tue, 3 Dec 2024 12:49:45 +0000 (13:49 +0100)]
net: ipv6: rpl_iptunnel: mitigate 2-realloc issue

This patch mitigates the two-reallocations issue with rpl_iptunnel by
providing the dst_entry (in the cache) to the first call to
skb_cow_head(). As a result, the very first iteration would still
trigger two reallocations (i.e., empty cache), while next iterations
would only trigger a single reallocation.

Performance tests before/after applying this patch, which clearly shows
there is no impact (it even shows improvement):
- before: https://ibb.co/nQJhqwc
- after: https://ibb.co/4ZvW6wV

Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
Cc: Alexander Aring <aahringo@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 months agonet: ipv6: seg6_iptunnel: mitigate 2-realloc issue
Justin Iurman [Tue, 3 Dec 2024 12:49:44 +0000 (13:49 +0100)]
net: ipv6: seg6_iptunnel: mitigate 2-realloc issue

This patch mitigates the two-reallocations issue with seg6_iptunnel by
providing the dst_entry (in the cache) to the first call to
skb_cow_head(). As a result, the very first iteration would still
trigger two reallocations (i.e., empty cache), while next iterations
would only trigger a single reallocation.

Performance tests before/after applying this patch, which clearly shows
the improvement:
- before: https://ibb.co/3Cg4sNH
- after: https://ibb.co/8rQ350r

Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
Cc: David Lebrun <dlebrun@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 months agonet: ipv6: ioam6_iptunnel: mitigate 2-realloc issue
Justin Iurman [Tue, 3 Dec 2024 12:49:43 +0000 (13:49 +0100)]
net: ipv6: ioam6_iptunnel: mitigate 2-realloc issue

This patch mitigates the two-reallocations issue with ioam6_iptunnel by
providing the dst_entry (in the cache) to the first call to
skb_cow_head(). As a result, the very first iteration may still trigger
two reallocations (i.e., empty cache), while next iterations would only
trigger a single reallocation.

Performance tests before/after applying this patch, which clearly shows
the improvement:
- inline mode:
  - before: https://ibb.co/LhQ8V63
  - after: https://ibb.co/x5YT2bS
- encap mode:
  - before: https://ibb.co/3Cjm5m0
  - after: https://ibb.co/TwpsxTC
- encap mode with tunsrc:
  - before: https://ibb.co/Gpy9QPg
  - after: https://ibb.co/PW1bZFT

This patch also fixes an incorrect behavior: after the insertion, the
second call to skb_cow_head() makes sure that the dev has enough
headroom in the skb for layer 2 and stuff. In that case, the "old"
dst_entry was used, which is now fixed. After discussing with Paolo, it
appears that both patches can be merged into a single one -this one-
(for the sake of readability) and target net-next.

Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 months agoinclude: net: add static inline dst_dev_overhead() to dst.h
Justin Iurman [Tue, 3 Dec 2024 12:49:42 +0000 (13:49 +0100)]
include: net: add static inline dst_dev_overhead() to dst.h

Add static inline dst_dev_overhead() function to include/net/dst.h. This
helper function is used by ioam6_iptunnel, rpl_iptunnel and
seg6_iptunnel to get the dev's overhead based on a cache entry
(dst_entry). If the cache is empty, the default and generic value
skb->mac_len is returned. Otherwise, LL_RESERVED_SPACE() over dst's dev
is returned.

Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
Cc: Alexander Lobakin <aleksander.lobakin@intel.com>
Cc: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 months agoMerge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net...
Jakub Kicinski [Thu, 5 Dec 2024 03:46:49 +0000 (19:46 -0800)]
Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue

Tony Nguyen says:

====================
Intel Wired LAN Driver Updates 2024-12-03 (ice, idpf, ixgbe, ixgbevf, igb)

This series contains updates to ice, idpf, ixgbe, ixgbevf, and igb
drivers.

For ice:
Arkadiusz corrects search for determining whether PHY clock recovery is
supported on the device.

Przemyslaw corrects mask used for PHY timestamps on ETH56G devices.

Wojciech adds missing virtchnl ops which caused NULL pointer
dereference.

Marcin fixes VLAN filter settings for uplink VSI in switchdev mode.

For idpf:
Josh restores setting of completion tag for empty buffers.

For ixgbevf:
Jake removes incorrect initialization/support of IPSEC for mailbox
version 1.5.

For ixgbe:
Jake rewords and downgrades misleading message when negotiation
of VF mailbox version is not supported.

Tore Amundsen corrects value for BASE-BX10 capability.

For igb:
Yuan Can adds proper teardown on failed pci_register_driver() call.

* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
  igb: Fix potential invalid memory access in igb_init_module()
  ixgbe: Correct BASE-BX10 compliance code
  ixgbe: downgrade logging of unsupported VF API version to debug
  ixgbevf: stop attempting IPSEC offload on Mailbox API 1.5
  idpf: set completion tag for "empty" bufs associated with a packet
  ice: Fix VLAN pruning in switchdev mode
  ice: Fix NULL pointer dereference in switchdev
  ice: fix PHY timestamp extraction for ETH56G
  ice: fix PHY Clock Recovery availability check
====================

Link: https://patch.msgid.link/20241203215521.1646668-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agor8169: simplify setting hwmon attribute visibility
Heiner Kallweit [Tue, 3 Dec 2024 21:33:22 +0000 (22:33 +0100)]
r8169: simplify setting hwmon attribute visibility

Use new member visible to simplify setting the static visibility.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Link: https://patch.msgid.link/dba77e76-be45-4a30-96c7-45e284072ad2@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agoMerge branch 'mlx5-misc-fixes-2024-12-03'
Jakub Kicinski [Thu, 5 Dec 2024 03:43:48 +0000 (19:43 -0800)]
Merge branch 'mlx5-misc-fixes-2024-12-03'

Tariq Toukan says:

====================
mlx5 misc fixes 2024-12-03

This patchset provides misc bug fixes from the team to the mlx5 core and
Eth drivers.
====================

Link: https://patch.msgid.link/20241203204920.232744-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agonet/mlx5e: Remove workaround to avoid syndrome for internal port
Jianbo Liu [Tue, 3 Dec 2024 20:49:20 +0000 (22:49 +0200)]
net/mlx5e: Remove workaround to avoid syndrome for internal port

Previously a workaround was added to avoid syndrome 0xcdb051. It is
triggered when offload a rule with tunnel encapsulation, and
forwarding to another table, but not matching on the internal port in
firmware steering mode. The original workaround skips internal tunnel
port logic, which is not correct as not all cases are considered. As
an example, if vlan is configured on the uplink port, traffic can't
pass because vlan header is not added with this workaround. Besides,
there is no such issue for software steering. So, this patch removes
that, and returns error directly if trying to offload such rule for
firmware steering.

Fixes: 06b4eac9c4be ("net/mlx5e: Don't offload internal port if filter device is out device")
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Tested-by: Frode Nordahl <frode.nordahl@canonical.com>
Reviewed-by: Chris Mi <cmi@nvidia.com>
Reviewed-by: Ariel Levkovich <lariel@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20241203204920.232744-7-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agonet/mlx5e: SD, Use correct mdev to build channel param
Tariq Toukan [Tue, 3 Dec 2024 20:49:19 +0000 (22:49 +0200)]
net/mlx5e: SD, Use correct mdev to build channel param

In a multi-PF netdev, each traffic channel creates its own resources
against a specific PF.
In the cited commit, where this support was added, the channel_param
logic was mistakenly kept unchanged, so it always used the primary PF
which is found at priv->mdev.
In this patch we fix this by moving the logic to be per-channel, and
passing the correct mdev instance.

This bug happened to be usually harmless, as the resulting cparam
structures would be the same for all channels, due to identical FW logic
and decisions.
However, in some use cases, like fwreset, this gets broken.

This could lead to different symptoms. Example:
Error cqe on cqn 0x428, ci 0x0, qn 0x10a9, opcode 0xe, syndrome 0x4,
vendor syndrome 0x32

Fixes: e4f9686bdee7 ("net/mlx5e: Let channels be SD-aware")
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Lama Kayal <lkayal@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Link: https://patch.msgid.link/20241203204920.232744-6-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agonet/mlx5: E-Switch, Fix switching to switchdev mode in MPV
Patrisious Haddad [Tue, 3 Dec 2024 20:49:18 +0000 (22:49 +0200)]
net/mlx5: E-Switch, Fix switching to switchdev mode in MPV

Fix the mentioned commit change for MPV mode, since in MPV mode the IB
device is shared between different core devices, so under this change
when moving both devices simultaneously to switchdev mode the IB device
removal and re-addition can race with itself causing unexpected behavior.

In such case do rescan_drivers() only once in order to add the ethernet
representor auxiliary device, and skip adding and removing IB devices.

Fixes: ab85ebf43723 ("net/mlx5: E-switch, refactor eswitch mode change")
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20241203204920.232744-5-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agonet/mlx5: E-Switch, Fix switching to switchdev mode with IB device disabled
Patrisious Haddad [Tue, 3 Dec 2024 20:49:17 +0000 (22:49 +0200)]
net/mlx5: E-Switch, Fix switching to switchdev mode with IB device disabled

In case that IB device is already disabled when moving to switchdev mode,
which can happen when working with LAG, need to do rescan_drivers()
before leaving in order to add ethernet representor auxiliary device.

Fixes: ab85ebf43723 ("net/mlx5: E-switch, refactor eswitch mode change")
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20241203204920.232744-4-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agonet/mlx5: HWS: Properly set bwc queue locks lock classes
Cosmin Ratiu [Tue, 3 Dec 2024 20:49:16 +0000 (22:49 +0200)]
net/mlx5: HWS: Properly set bwc queue locks lock classes

The mentioned "Fixes" patch forgot to do that.

Fixes: 9addffa34359 ("net/mlx5: HWS, use lock classes for bwc locks")
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20241203204920.232744-3-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agonet/mlx5: HWS: Fix memory leak in mlx5hws_definer_calc_layout
Cosmin Ratiu [Tue, 3 Dec 2024 20:49:15 +0000 (22:49 +0200)]
net/mlx5: HWS: Fix memory leak in mlx5hws_definer_calc_layout

It allocates a match template, which creates a compressed definer fc
struct, but that is not deallocated.

This commit fixes that.

Fixes: 74a778b4a63f ("net/mlx5: HWS, added definers handling")
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20241203204920.232744-2-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agoMerge branch 'bnxt_en-support-header-page-pool-in-queue-api'
Jakub Kicinski [Thu, 5 Dec 2024 03:23:36 +0000 (19:23 -0800)]
Merge branch 'bnxt_en-support-header-page-pool-in-queue-api'

David Wei says:

====================
bnxt_en: support header page pool in queue API

Commit 7ed816be35ab ("eth: bnxt: use page pool for head frags") added a
separate page pool for header frags. Now, frags are allocated from this
header page pool e.g. rxr->tpa_info.data.

The queue API did not properly handle rxr->tpa_info and so using the
queue API to i.e. reset any queues will result in pages being returned
to the incorrect page pool, causing inflight != 0 warnings.

Fix this bug by properly allocating/freeing tpa_info and copying/freeing
head_pool in the queue API implementation.

The 1st patch is a prep patch that refactors helpers out to be used by
the implementation patch later.

The 2nd patch is a drive-by refactor. Happy to take it out and re-send
to net-next if there are any objections.

The 3rd patch is the implementation patch that will properly alloc/free
rxr->tpa_info.
====================

Link: https://patch.msgid.link/20241204041022.56512-1-dw@davidwei.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agobnxt_en: handle tpa_info in queue API implementation
David Wei [Wed, 4 Dec 2024 04:10:22 +0000 (20:10 -0800)]
bnxt_en: handle tpa_info in queue API implementation

Commit 7ed816be35ab ("eth: bnxt: use page pool for head frags") added a
page pool for header frags, which may be distinct from the existing pool
for the aggregation ring. Prior to this change, frags used in the TPA
ring rx_tpa were allocated from system memory e.g. napi_alloc_frag()
meaning their lifetimes were not associated with a page pool. They can
be returned at any time and so the queue API did not alloc or free
rx_tpa.

But now frags come from a separate head_pool which may be different to
page_pool. Without allocating and freeing rx_tpa, frags allocated from
the old head_pool may be returned to a different new head_pool which
causes a mismatch between the pp hold/release count.

Fix this problem by properly freeing and allocating rx_tpa in the queue
API implementation.

Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20241204041022.56512-4-dw@davidwei.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agobnxt_en: refactor bnxt_alloc_rx_rings() to call bnxt_alloc_rx_agg_bmap()
David Wei [Wed, 4 Dec 2024 04:10:21 +0000 (20:10 -0800)]
bnxt_en: refactor bnxt_alloc_rx_rings() to call bnxt_alloc_rx_agg_bmap()

Refactor bnxt_alloc_rx_rings() to call bnxt_alloc_rx_agg_bmap() for
allocating rx_agg_bmap.

Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20241204041022.56512-3-dw@davidwei.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agobnxt_en: refactor tpa_info alloc/free into helpers
David Wei [Wed, 4 Dec 2024 04:10:20 +0000 (20:10 -0800)]
bnxt_en: refactor tpa_info alloc/free into helpers

Refactor bnxt_rx_ring_info->tpa_info operations into helpers that work
on a single tpa_info in prep for queue API using them.

There are 2 pairs of operations:

* bnxt_alloc_one_tpa_info()
* bnxt_free_one_tpa_info()

These alloc/free the tpa_info array itself.

* bnxt_alloc_one_tpa_info_data()
* bnxt_free_one_tpa_info_data()

These alloc/free the frags stored in tpa_info array.

Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20241204041022.56512-2-dw@davidwei.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agoselftests/net: call sendmmsg via udpgso_bench.sh
Kenjiro Nakayama [Tue, 3 Dec 2024 22:28:44 +0000 (07:28 +0900)]
selftests/net: call sendmmsg via udpgso_bench.sh

Currently, sendmmsg is implemented in udpgso_bench_tx.c,
but it is not called by any test script.

This patch adds a test for sendmmsg in udpgso_bench.sh.
This allows for basic API testing and benchmarking
comparisons with GSO.

Signed-off-by: Kenjiro Nakayama <nakayamakenjiro@gmail.com>
Reviewed-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20241203222843.26983-1-nakayamakenjiro@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agogeneve: do not assume mac header is set in geneve_xmit_skb()
Eric Dumazet [Tue, 3 Dec 2024 18:21:21 +0000 (18:21 +0000)]
geneve: do not assume mac header is set in geneve_xmit_skb()

We should not assume mac header is set in output path.

Use skb_eth_hdr() instead of eth_hdr() to fix the issue.

sysbot reported the following :

 WARNING: CPU: 0 PID: 11635 at include/linux/skbuff.h:3052 skb_mac_header include/linux/skbuff.h:3052 [inline]
 WARNING: CPU: 0 PID: 11635 at include/linux/skbuff.h:3052 eth_hdr include/linux/if_ether.h:24 [inline]
 WARNING: CPU: 0 PID: 11635 at include/linux/skbuff.h:3052 geneve_xmit_skb drivers/net/geneve.c:898 [inline]
 WARNING: CPU: 0 PID: 11635 at include/linux/skbuff.h:3052 geneve_xmit+0x4c38/0x5730 drivers/net/geneve.c:1039
Modules linked in:
CPU: 0 UID: 0 PID: 11635 Comm: syz.4.1423 Not tainted 6.12.0-syzkaller-10296-gaaf20f870da0 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
 RIP: 0010:skb_mac_header include/linux/skbuff.h:3052 [inline]
 RIP: 0010:eth_hdr include/linux/if_ether.h:24 [inline]
 RIP: 0010:geneve_xmit_skb drivers/net/geneve.c:898 [inline]
 RIP: 0010:geneve_xmit+0x4c38/0x5730 drivers/net/geneve.c:1039
Code: 21 c6 02 e9 35 d4 ff ff e8 a5 48 4c fb 90 0f 0b 90 e9 fd f5 ff ff e8 97 48 4c fb 90 0f 0b 90 e9 d8 f5 ff ff e8 89 48 4c fb 90 <0f> 0b 90 e9 41 e4 ff ff e8 7b 48 4c fb 90 0f 0b 90 e9 cd e7 ff ff
RSP: 0018:ffffc90003b2f870 EFLAGS: 00010283
RAX: 000000000000037a RBX: 000000000000ffff RCX: ffffc9000dc3d000
RDX: 0000000000080000 RSI: ffffffff86428417 RDI: 0000000000000003
RBP: ffffc90003b2f9f0 R08: 0000000000000003 R09: 000000000000ffff
R10: 000000000000ffff R11: 0000000000000002 R12: ffff88806603c000
R13: 0000000000000000 R14: ffff8880685b2780 R15: 0000000000000e23
FS:  00007fdc2deed6c0(0000) GS:ffff8880b8600000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000001b30a1dff8 CR3: 0000000056b8c000 CR4: 00000000003526f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
  __netdev_start_xmit include/linux/netdevice.h:5002 [inline]
  netdev_start_xmit include/linux/netdevice.h:5011 [inline]
  __dev_direct_xmit+0x58a/0x720 net/core/dev.c:4490
  dev_direct_xmit include/linux/netdevice.h:3181 [inline]
  packet_xmit+0x1e4/0x360 net/packet/af_packet.c:285
  packet_snd net/packet/af_packet.c:3146 [inline]
  packet_sendmsg+0x2700/0x5660 net/packet/af_packet.c:3178
  sock_sendmsg_nosec net/socket.c:711 [inline]
  __sock_sendmsg net/socket.c:726 [inline]
  __sys_sendto+0x488/0x4f0 net/socket.c:2197
  __do_sys_sendto net/socket.c:2204 [inline]
  __se_sys_sendto net/socket.c:2200 [inline]
  __x64_sys_sendto+0xe0/0x1c0 net/socket.c:2200
  do_syscall_x64 arch/x86/entry/common.c:52 [inline]
  do_syscall_64+0xcd/0x250 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

Fixes: a025fb5f49ad ("geneve: Allow configuration of DF behaviour")
Reported-by: syzbot+3ec5271486d7cb2d242a@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/674f4b72.050a0220.17bd51.004a.GAE@google.com/T/#u
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Link: https://patch.msgid.link/20241203182122.2725517-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agoinet: add indirect call wrapper for getfrag() calls
Eric Dumazet [Tue, 3 Dec 2024 17:36:17 +0000 (17:36 +0000)]
inet: add indirect call wrapper for getfrag() calls

UDP send path suffers from one indirect call to ip_generic_getfrag()

We can use INDIRECT_CALL_1() to avoid it.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Brian Vazquez <brianvv@google.com>
Link: https://patch.msgid.link/20241203173617.2595451-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agoMerge branch 'net-add-negotiation-of-in-band-capabilities'
Jakub Kicinski [Thu, 5 Dec 2024 03:19:09 +0000 (19:19 -0800)]
Merge branch 'net-add-negotiation-of-in-band-capabilities'

Russell King says:

====================
net: add negotiation of in-band capabilities

This is a repost without RFC for this series, shrunk down to 13 patches
by removing the non-Marvell PCS.

Phylink's handling of in-band has been deficient for a long time, and
people keep hitting problems with it. Notably, situations with the way-
to-late standardized 2500Base-X and whether that should or should not
have in-band enabled. We have also been carrying a hack in the form of
phylink_phy_no_inband() for a PHY that has been used on a SFP module,
but has no in-band capabilities, not even for SGMII.

When phylink is trying to operate in in-band mode, this series will look
at the capabilities of the MAC-side PCS and PHY, and work out whether
in-band can or should be used, programming the PHY as appropriate. This
includes in-band bypass mode at the PHY.

We don't... yet... support bypass on the MAC side PCS, because that
requires yet more complexity.

Patch 1 passes struct phylink and struct phylink_pcs into
phylink_pcs_neg_mode() so we can look at more state in this function in
a future patch.

Patch 2 splits "cur_link_an_mode" (the MLO_AN_* mode) into two separate
purposes - a requested and an active mode. The active mode is the one
we will be using for the MAC, which becomes dependent on the result of
in-band negotiation.

Patch 3 adds debug to phylink_major_config() so we can see what is going
on with the requested and active AN modes.

Patch 4 adds to phylib a method to get the in-band capabilities of the
PHY from phylib. Patches 5 and 6 add implementations for BCM84881 and
some Marvell PHYs found on SFPs.

Patch 7 adds to phylib a method to configure the PHY in-band signalling,
and patch 8 implements it for those Marvell PHYs that support the method
in patch 4.

Patch 9 does the same as patch 4 but for the MAC-side PCS, with patches
10 and 11 adding support to Marvell NETA and PP2.

Patch 12 adds the code to phylink_pcs_neg_mode() which looks at the
capabilities, and works out whether to use in-band or out-band mode for
driving the link between the MAC PCS and PHY.

Patch 13 removes the phylink_phy_no_inband() hack now that we are
publishing the in-band capabilities from the BCM84881 PHY driver.

Three more PCS, omitted from this series due to the limit of 15 patches,
will be sent once this has been merged.
====================

Link: https://patch.msgid.link/Z08kCwxdkU4n2V6x@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agonet: phylink: remove phylink_phy_no_inband()
Russell King (Oracle) [Tue, 3 Dec 2024 15:31:48 +0000 (15:31 +0000)]
net: phylink: remove phylink_phy_no_inband()

Remove phylink_phy_no_inband() now that we are handling the lack of
inband negotiation by querying the capabilities of the PHY and PCS,
and the BCM84881 PHY driver provides us the information necessary to
make the decision.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1tIUsO-006IUt-KN@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agonet: phylink: add negotiation of in-band capabilities
Russell King (Oracle) [Tue, 3 Dec 2024 15:31:43 +0000 (15:31 +0000)]
net: phylink: add negotiation of in-band capabilities

Support for in-band signalling with Serdes links is uncertain. Some
PHYs do not support in-band for e.g. SGMII. Some PCS do not support
in-band for 2500Base-X. Some PCS require in-band for Base-X protocols.

Simply using what is in DT is insufficient when we have hot-pluggable
PHYs e.g. in the form of SFP modules, which may not provide the
in-band signalling.

In order to address this, we have introduced phy_inband_caps() and
pcs_inband_caps() functions to allow phylink to retrieve the
capabilities from each end of the PCS/PHY link. This commit adds code
to resolve whether in-band will be used in the various scenarios that
we have: In-band not being used, PHY present using SGMII or Base-X,
PHY not present. We also deal with no capabilties provided.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1tIUsJ-006IUn-H3@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agonet: mvpp2: implement pcs_inband_caps() method
Russell King (Oracle) [Tue, 3 Dec 2024 15:31:38 +0000 (15:31 +0000)]
net: mvpp2: implement pcs_inband_caps() method

Report the PCS in-band capabilities to phylink for Marvell PP2
interfaces.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1tIUsE-006IUh-E7@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agonet: mvneta: implement pcs_inband_caps() method
Russell King (Oracle) [Tue, 3 Dec 2024 15:31:33 +0000 (15:31 +0000)]
net: mvneta: implement pcs_inband_caps() method

Report the PCS in-band capabilities to phylink for Marvell NETA
interfaces.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1tIUs9-006IUb-Au@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agonet: phylink: add pcs_inband_caps() method
Russell King (Oracle) [Tue, 3 Dec 2024 15:31:28 +0000 (15:31 +0000)]
net: phylink: add pcs_inband_caps() method

Add a pcs_inband_caps() method to query the PCS for its inband link
capabilities, and use this to determine whether link modes used with
optical SFPs can be supported.

When a PCS does not provide a method, we allow inband negotiation to
be either on or off, making this a no-op until the pcs_inband_caps()
method is implemented by a PCS driver.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1tIUs4-006IUU-7K@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agonet: phy: marvell: implement config_inband() method
Russell King (Oracle) [Tue, 3 Dec 2024 15:31:23 +0000 (15:31 +0000)]
net: phy: marvell: implement config_inband() method

Implement the config_inband() method for Marvell 88E111288E1111,
and Finisar's 88E1111 variant.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1tIUrz-006IUO-3r@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agonet: phy: add phy_config_inband()
Russell King (Oracle) [Tue, 3 Dec 2024 15:31:18 +0000 (15:31 +0000)]
net: phy: add phy_config_inband()

Add a method to configure the PHY's in-band mode.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1tIUru-006IUI-08@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agonet: phy: marvell: implement phy_inband_caps() method
Russell King (Oracle) [Tue, 3 Dec 2024 15:31:12 +0000 (15:31 +0000)]
net: phy: marvell: implement phy_inband_caps() method

Provide an implementation for phy_inband_caps() for Marvell PHYs used
on SFP modules, so that phylink knows the PHYs capabilities.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1tIUro-006IUC-Rq@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agonet: phy: bcm84881: implement phy_inband_caps() method
Russell King (Oracle) [Tue, 3 Dec 2024 15:31:07 +0000 (15:31 +0000)]
net: phy: bcm84881: implement phy_inband_caps() method

BCM84881 has no support for inband signalling, so this is a trivial
implementation that returns no support for inband.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Acked-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/E1tIUrj-006IU6-ON@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agonet: phy: add phy_inband_caps()
Russell King (Oracle) [Tue, 3 Dec 2024 15:31:02 +0000 (15:31 +0000)]
net: phy: add phy_inband_caps()

Add a method to query the PHY's in-band capabilities for a PHY
interface mode.

Where the interface mode does not have in-band capability, or the PHY
driver has not been updated to return this information, then
phy_inband_caps() should return zero. Otherwise, PHY drivers will
return a value consisting of the following flags:

LINK_INBAND_DISABLE indicates that the hardware does not support
in-band signalling, or can have in-band signalling configured via
software to be disabled.

LINK_INBAND_ENABLE indicates that the hardware will use in-band
signalling, or can have in-band signalling configured via software
to be enabled.

LINK_INBAND_BYPASS indicates that the hardware has the ability to
bypass in-band signalling when enabled after a timeout if the link
partner does not respond to its in-band signalling.

This reports the PHY capabilities for the particular interface mode,
not the current configuration.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1tIUre-006ITz-KF@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agonet: phylink: add debug for phylink_major_config()
Russell King (Oracle) [Tue, 3 Dec 2024 15:30:57 +0000 (15:30 +0000)]
net: phylink: add debug for phylink_major_config()

Now that we have a more complexity in phylink_major_config(), augment
the debugging so we can see what's going on there.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1tIUrZ-006ITt-Fa@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agonet: phylink: split cur_link_an_mode into requested and active
Russell King (Oracle) [Tue, 3 Dec 2024 15:30:52 +0000 (15:30 +0000)]
net: phylink: split cur_link_an_mode into requested and active

There is an interdependence between the current link_an_mode and
pcs_neg_mode that some drivers rely upon to know whether inband or PHY
mode will be used.

In order to support detection of PCS and PHY inband capabilities
resulting in automatic selection of inband or PHY mode, we need to
cater for this, and support changing the MAC link_an_mode. However, we
end up with an inter-dependency between the current link_an_mode and
pcs_neg_mode.

To solve this, split the current link_an_mode into the requested
link_an_mode and active link_an_mode. The requested link_an_mode will
always be passed to phylink_pcs_neg_mode(), and the active link_an_mode
will be used for everything else, and only updated during
phylink_major_config(). This will ensure that phylink_pcs_neg_mode()'s
link_an_mode will not depend on the active link_an_mode that will,
in a future patch, depend on pcs_neg_mode.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1tIUrU-006ITn-Ai@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agonet: phylink: pass phylink and pcs into phylink_pcs_neg_mode()
Russell King (Oracle) [Tue, 3 Dec 2024 15:30:47 +0000 (15:30 +0000)]
net: phylink: pass phylink and pcs into phylink_pcs_neg_mode()

Move the call to phylink_pcs_neg_mode() in phylink_major_config() after
we have selected the appropriate PCS to allow the PCS to be passed in.

Add struct phylink and struct phylink_pcs pointers to
phylink_pcs_neg_mode() and pass in the appropriate structures. Set
pl->pcs_neg_mode before returning, and remove the return value.

This will allow the capabilities of the PCS and any PHY to be used when
deciding which pcs_neg_mode should be used.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1tIUrP-006ITh-6u@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agomlxsw: spectrum_acl_flex_keys: Use correct key block on Spectrum-4
Ido Schimmel [Tue, 3 Dec 2024 15:16:05 +0000 (16:16 +0100)]
mlxsw: spectrum_acl_flex_keys: Use correct key block on Spectrum-4

The driver is currently using an ACL key block that is not supported by
Spectrum-4. This works because the driver is only using a single field
from this key block which is located in the same offset in the
equivalent Spectrum-4 key block.

The issue was discovered when the firmware started rejecting the use of
the unsupported key block. The change has been reverted to avoid
breaking users that only update their firmware.

Nonetheless, fix the issue by using the correct key block.

Fixes: 07ff135958dd ("mlxsw: Introduce flex key elements for Spectrum-4")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://patch.msgid.link/35e72c97bdd3bc414fb8e4d747e5fb5d26c29658.1733237440.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agortase: Add support for RTL907XD-VA PCIe port
Justin Lai [Tue, 3 Dec 2024 10:31:46 +0000 (18:31 +0800)]
rtase: Add support for RTL907XD-VA PCIe port

1. Add RTL907XD-VA hardware version id.
2. Add the reported speed for RTL907XD-VA.

Signed-off-by: Justin Lai <justinlai0215@realtek.com>
Link: https://patch.msgid.link/20241203103146.734516-1-justinlai0215@realtek.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agoMerge branch 'netcons-add-udp-send-fail-statistics-to-netconsole'
Jakub Kicinski [Thu, 5 Dec 2024 03:15:39 +0000 (19:15 -0800)]
Merge branch 'netcons-add-udp-send-fail-statistics-to-netconsole'

Maksym Kutsevol says:

====================
netcons: Add udp send fail statistics to netconsole

Enhance observability of netconsole. Packet sends can fail.
Start tracking at least two failure possibilities: ENOMEM and
NET_XMIT_DROP for every target. Stats are exposed via an additional
attribute in CONFIGFS.

The exposed statistics allows easier debugging of cases when netconsole
messages were not seen by receivers, eliminating the guesswork if the
sender thinks that messages in question were sent out.

Stats are not reset on enable/disable/change remote ip/etc, they
belong to the netcons target itself.
====================

Link: https://patch.msgid.link/20241202-netcons-add-udp-send-fail-statistics-to-netconsole-v5-0-70e82239f922@kutsevol.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agonetcons: Add udp send fail statistics to netconsole
Maksym Kutsevol [Mon, 2 Dec 2024 19:55:08 +0000 (11:55 -0800)]
netcons: Add udp send fail statistics to netconsole

Enhance observability of netconsole. Packet sends can fail.
Start tracking at least two failure possibilities: ENOMEM and
NET_XMIT_DROP for every target. Stats are exposed via an additional
attribute in CONFIGFS.

The exposed statistics allows easier debugging of cases when netconsole
messages were not seen by receivers, eliminating the guesswork if the
sender thinks that messages in question were sent out.

Stats are not reset on enable/disable/change remote ip/etc, they
belong to the netcons target itself.

Reported-by: Breno Leitao <leitao@debian.org>
Closes: https://lore.kernel.org/all/ZsWoUzyK5du9Ffl+@gmail.com/
Signed-off-by: Maksym Kutsevol <max@kutsevol.com>
Link: https://patch.msgid.link/20241202-netcons-add-udp-send-fail-statistics-to-netconsole-v5-2-70e82239f922@kutsevol.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agonetpoll: Make netpoll_send_udp return status instead of void
Maksym Kutsevol [Mon, 2 Dec 2024 19:55:07 +0000 (11:55 -0800)]
netpoll: Make netpoll_send_udp return status instead of void

netpoll_send_udp can return if send was successful.
It will allow client code to be aware of the send status.

Possible return values are the result of __netpoll_send_skb (cast to int)
and -ENOMEM. This doesn't cover the case when TX was not successful
instantaneously and was scheduled for later, __netpoll__send_skb returns
success in that case.

Signed-off-by: Maksym Kutsevol <max@kutsevol.com>
Link: https://patch.msgid.link/20241202-netcons-add-udp-send-fail-statistics-to-netconsole-v5-1-70e82239f922@kutsevol.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agoethtool: Fix wrong mod state in case of verbose and no_mask bitset
Kory Maincent [Mon, 2 Dec 2024 15:33:57 +0000 (16:33 +0100)]
ethtool: Fix wrong mod state in case of verbose and no_mask bitset

A bitset without mask in a _SET request means we want exactly the bits in
the bitset to be set. This works correctly for compact format but when
verbose format is parsed, ethnl_update_bitset32_verbose() only sets the
bits present in the request bitset but does not clear the rest. The commit
6699170376ab ("ethtool: fix application of verbose no_mask bitset") fixes
this issue by clearing the whole target bitmap before we start iterating.
The solution proposed brought an issue with the behavior of the mod
variable. As the bitset is always cleared the old value will always
differ to the new value.

Fix it by adding a new function to compare bitmaps and a temporary variable
which save the state of the old bitmap.

Fixes: 6699170376ab ("ethtool: fix application of verbose no_mask bitset")
Signed-off-by: Kory Maincent <kory.maincent@bootlin.com>
Link: https://patch.msgid.link/20241202153358.1142095-1-kory.maincent@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agoipmr: tune the ipmr_can_free_table() checks.
Paolo Abeni [Tue, 3 Dec 2024 09:48:15 +0000 (10:48 +0100)]
ipmr: tune the ipmr_can_free_table() checks.

Eric reported a syzkaller-triggered splat caused by recent ipmr changes:

WARNING: CPU: 2 PID: 6041 at net/ipv6/ip6mr.c:419
ip6mr_free_table+0xbd/0x120 net/ipv6/ip6mr.c:419
Modules linked in:
CPU: 2 UID: 0 PID: 6041 Comm: syz-executor183 Not tainted
6.12.0-syzkaller-10681-g65ae975e97d5 #0
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
RIP: 0010:ip6mr_free_table+0xbd/0x120 net/ipv6/ip6mr.c:419
Code: 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 80 3c
02 00 75 58 49 83 bc 24 c0 0e 00 00 00 74 09 e8 44 ef a9 f7 90 <0f> 0b
90 e8 3b ef a9 f7 48 8d 7b 38 e8 12 a3 96 f7 48 89 df be 0f
RSP: 0018:ffffc90004267bd8 EFLAGS: 00010293
RAX: 0000000000000000 RBX: ffff88803c710000 RCX: ffffffff89e4d844
RDX: ffff88803c52c880 RSI: ffffffff89e4d87c RDI: ffff88803c578ec0
RBP: 0000000000000001 R08: 0000000000000005 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000001 R12: ffff88803c578000
R13: ffff88803c710000 R14: ffff88803c710008 R15: dead000000000100
FS: 00007f7a855ee6c0(0000) GS:ffff88806a800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f7a85689938 CR3: 000000003c492000 CR4: 0000000000352ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
ip6mr_rules_exit+0x176/0x2d0 net/ipv6/ip6mr.c:283
ip6mr_net_exit_batch+0x53/0xa0 net/ipv6/ip6mr.c:1388
ops_exit_list+0x128/0x180 net/core/net_namespace.c:177
setup_net+0x4fe/0x860 net/core/net_namespace.c:394
copy_net_ns+0x2b4/0x6b0 net/core/net_namespace.c:500
create_new_namespaces+0x3ea/0xad0 kernel/nsproxy.c:110
unshare_nsproxy_namespaces+0xc0/0x1f0 kernel/nsproxy.c:228
ksys_unshare+0x45d/0xa40 kernel/fork.c:3334
__do_sys_unshare kernel/fork.c:3405 [inline]
__se_sys_unshare kernel/fork.c:3403 [inline]
__x64_sys_unshare+0x31/0x40 kernel/fork.c:3403
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xcd/0x250 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f7a856332d9
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 51 18 00 00 90 48 89 f8 48
89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d
01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f7a855ee238 EFLAGS: 00000246 ORIG_RAX: 0000000000000110
RAX: ffffffffffffffda RBX: 00007f7a856bd308 RCX: 00007f7a856332d9
RDX: 00007f7a8560f8c6 RSI: 0000000000000000 RDI: 0000000062040200
RBP: 00007f7a856bd300 R08: 00007fff932160a7 R09: 00007f7a855ee6c0
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f7a856bd30c
R13: 0000000000000000 R14: 00007fff93215fc0 R15: 00007fff932160a8
</TASK>

The root cause is a network namespace creation failing after successful
initialization of the ipmr subsystem. Such a case is not currently
matched by the ipmr_can_free_table() helper.

New namespaces are zeroed on allocation and inserted into net ns list
only after successful creation; when deleting an ipmr table, the list
next pointer can be NULL only on netns initialization failure.

Update the ipmr_can_free_table() checks leveraging such condition.

Reported-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot+6e8cb445d4b43d006e0c@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=6e8cb445d4b43d006e0c
Fixes: 11b6e701bce9 ("ipmr: add debug check for mr table cleanup")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/8bde975e21bbca9d9c27e36209b2dd4f1d7a3f00.1733212078.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agonetpoll: Use rtnl_dereference() for npinfo pointer access
Breno Leitao [Tue, 3 Dec 2024 00:22:05 +0000 (16:22 -0800)]
netpoll: Use rtnl_dereference() for npinfo pointer access

In the __netpoll_setup() function, when accessing the device's npinfo
pointer, replace rcu_access_pointer() with rtnl_dereference(). This
change is more appropriate, as suggested by Herbert Xu[1].

The function is called with the RTNL mutex held, and the pointer is
being dereferenced later, so, dereference earlier and just reuse the
pointer for the if/else.

The replacement ensures correct pointer access while maintaining
the existing locking and RCU semantics of the netpoll subsystem.

Link: https://lore.kernel.org/lkml/Zz1cKZYt1e7elibV@gondor.apana.org.au/
Suggested-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Link: https://patch.msgid.link/20241202-netpoll_rcu_herbet_fix-v2-1-2b9d58edc76a@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agonetfilter: nft_set_hash: skip duplicated elements pending gc run
Pablo Neira Ayuso [Sun, 1 Dec 2024 23:04:49 +0000 (00:04 +0100)]
netfilter: nft_set_hash: skip duplicated elements pending gc run

rhashtable does not provide stable walk, duplicated elements are
possible in case of resizing. I considered that checking for errors when
calling rhashtable_walk_next() was sufficient to detect the resizing.
However, rhashtable_walk_next() returns -EAGAIN only at the end of the
iteration, which is too late, because a gc work containing duplicated
elements could have been already scheduled for removal to the worker.

Add a u32 gc worker sequence number per set, bump it on every workqueue
run. Annotate gc worker sequence number on the expired element. Use it
to skip those already seen in this gc workqueue run.

Note that this new field is never reset in case gc transaction fails, so
next gc worker run on the expired element overrides it. Wraparound of gc
worker sequence number should not be an issue with stale gc worker
sequence number in the element, that would just postpone the element
removal in one gc run.

Note that it is not possible to use flags to annotate that element is
pending gc run to detect duplicates, given that gc transaction can be
invalidated in case of update from the control plane, therefore, not
allowing to clear such flag.

On x86_64, pahole reports no changes in the size of nft_rhash_elem.

Fixes: f6c383b8c31a ("netfilter: nf_tables: adapt set backend to use GC transaction API")
Reported-by: Laurent Fasnacht <laurent.fasnacht@proton.ch>
Tested-by: Laurent Fasnacht <laurent.fasnacht@proton.ch>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
6 months agoMerge tag 'loongarch-fixes-6.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Wed, 4 Dec 2024 18:31:37 +0000 (10:31 -0800)]
Merge tag 'loongarch-fixes-6.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson

Pull LoongArch fixes from Huacai Chen:
 "Fix bugs about EFI screen info, hugetlb pte clear and Lockdep-RCU
  splat in KVM, plus some trival cleanups"

* tag 'loongarch-fixes-6.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson:
  LoongArch: KVM: Protect kvm_io_bus_{read,write}() with SRCU
  LoongArch: KVM: Protect kvm_check_requests() with SRCU
  LoongArch: BPF: Adjust the parameter of emit_jirl()
  LoongArch: Add architecture specific huge_pte_clear()
  LoongArch/irq: Use seq_put_decimal_ull_width() for decimal values
  LoongArch: Fix reserving screen info memory for above-4G firmware

6 months agoMerge tag 'platform-drivers-x86-v6.13-2' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Wed, 4 Dec 2024 18:28:30 +0000 (10:28 -0800)]
Merge tag 'platform-drivers-x86-v6.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86

Pull x86 platform driver fixes from Ilpo Järvinen:

 - asus-nb-wmi: Silence unknown event warning when charger is plugged in

 - asus-wmi: Handle return code variations during thermal policy writing
   graciously

 - samsung-laptop: Correct module description

* tag 'platform-drivers-x86-v6.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86:
  platform/x86: asus-nb-wmi: Ignore unknown event 0xCF
  platform/x86: asus-wmi: Ignore return value when writing thermal policy
  platform/x86: samsung-laptop: Match MODULE_DESCRIPTION() to functionality

6 months agotracing: Fix cmp_entries_dup() to respect sort() comparison rules
Kuan-Wei Chiu [Tue, 3 Dec 2024 20:22:28 +0000 (04:22 +0800)]
tracing: Fix cmp_entries_dup() to respect sort() comparison rules

The cmp_entries_dup() function used as the comparator for sort()
violated the symmetry and transitivity properties required by the
sorting algorithm. Specifically, it returned 1 whenever memcmp() was
non-zero, which broke the following expectations:

* Symmetry: If x < y, then y > x.
* Transitivity: If x < y and y < z, then x < z.

These violations could lead to incorrect sorting and failure to
correctly identify duplicate elements.

Fix the issue by directly returning the result of memcmp(), which
adheres to the required comparison properties.

Cc: stable@vger.kernel.org
Fixes: 08d43a5fa063 ("tracing: Add lock-free tracing_map")
Link: https://lore.kernel.org/20241203202228.1274403-1-visitorckw@gmail.com
Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
6 months agonetfilter: ipset: Hold module reference while requesting a module
Phil Sutter [Fri, 29 Nov 2024 15:30:38 +0000 (16:30 +0100)]
netfilter: ipset: Hold module reference while requesting a module

User space may unload ip_set.ko while it is itself requesting a set type
backend module, leading to a kernel crash. The race condition may be
provoked by inserting an mdelay() right after the nfnl_unlock() call.

Fixes: a7b4f989a629 ("netfilter: ipset: IP set core support")
Signed-off-by: Phil Sutter <phil@nwl.cc>
Acked-by: Jozsef Kadlecsik <kadlec@netfilter.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
6 months agonet: sched: fix ordering of qlen adjustment
Lion Ackermann [Mon, 2 Dec 2024 16:22:57 +0000 (17:22 +0100)]
net: sched: fix ordering of qlen adjustment

Changes to sch->q.qlen around qdisc_tree_reduce_backlog() need to happen
_before_ a call to said function because otherwise it may fail to notify
parent qdiscs when the child is about to become empty.

Signed-off-by: Lion Ackermann <nnamrec@gmail.com>
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agonet: sched: fix erspan_opt settings in cls_flower
Xin Long [Mon, 2 Dec 2024 15:21:38 +0000 (10:21 -0500)]
net: sched: fix erspan_opt settings in cls_flower

When matching erspan_opt in cls_flower, only the (version, dir, hwid)
fields are relevant. However, in fl_set_erspan_opt() it initializes
all bits of erspan_opt and its mask to 1. This inadvertently requires
packets to match not only the (version, dir, hwid) fields but also the
other fields that are unexpectedly set to 1.

This patch resolves the issue by ensuring that only the (version, dir,
hwid) fields are configured in fl_set_erspan_opt(), leaving the other
fields to 0 in erspan_opt.

Fixes: 79b1011cb33d ("net: sched: allow flower to match erspan options")
Reported-by: Shuang Li <shuali@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agor8169: remove support for chip version 11
Heiner Kallweit [Mon, 2 Dec 2024 20:20:02 +0000 (21:20 +0100)]
r8169: remove support for chip version 11

This is a follow-up to 982300c115d2 ("r8169: remove detection of chip
version 11 (early RTL8168b)"). Nobody complained yet, so remove
support for this chip version.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/b689ab6d-20b5-4b64-bd7e-531a0a972ba3@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agor8169: remove unused flag RTL_FLAG_TASK_RESET_NO_QUEUE_WAKE
Heiner Kallweit [Mon, 2 Dec 2024 20:14:35 +0000 (21:14 +0100)]
r8169: remove unused flag RTL_FLAG_TASK_RESET_NO_QUEUE_WAKE

After 854d71c555dfc3 ("r8169: remove original workaround for RTL8125
broken rx issue") this flag isn't used any longer. So remove it.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Link: https://patch.msgid.link/d9dd214b-3027-4f60-b0e8-6f34a0c76582@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agoethtool: Fix access to uninitialized fields in set RXNFC command
Gal Pressman [Mon, 2 Dec 2024 16:48:05 +0000 (18:48 +0200)]
ethtool: Fix access to uninitialized fields in set RXNFC command

The check for non-zero ring with RSS is only relevant for
ETHTOOL_SRXCLSRLINS command, in other cases the check tries to access
memory which was not initialized by the userspace tool. Only perform the
check in case of ETHTOOL_SRXCLSRLINS.

Without this patch, filter deletion (for example) could statistically
result in a false error:
  # ethtool --config-ntuple eth3 delete 484
  rmgr: Cannot delete RX class rule: Invalid argument
  Cannot delete classification rule

Fixes: 9e43ad7a1ede ("net: ethtool: only allow set_rxnfc with rss + ring_cookie if driver opts in")
Link: https://lore.kernel.org/netdev/871a9ecf-1e14-40dd-bbd7-e90c92f89d47@nvidia.com/
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Edward Cree <ecree.xilinx@gmail.com>
Link: https://patch.msgid.link/20241202164805.1637093-1-gal@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agoRevert "udp: avoid calling sock_def_readable() if possible"
Fernando Fernandez Mancera [Mon, 2 Dec 2024 15:56:08 +0000 (15:56 +0000)]
Revert "udp: avoid calling sock_def_readable() if possible"

This reverts commit 612b1c0dec5bc7367f90fc508448b8d0d7c05414. On a
scenario with multiple threads blocking on a recvfrom(), we need to call
sock_def_readable() on every __udp_enqueue_schedule_skb() otherwise the
threads won't be woken up as __skb_wait_for_more_packets() is using
prepare_to_wait_exclusive().

Link: https://bugzilla.redhat.com/2308477
Fixes: 612b1c0dec5b ("udp: avoid calling sock_def_readable() if possible")
Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241202155620.1719-1-ffmancera@riseup.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agonet: Make napi_hash_lock irq safe
Joe Damato [Mon, 2 Dec 2024 18:21:02 +0000 (18:21 +0000)]
net: Make napi_hash_lock irq safe

Make napi_hash_lock IRQ safe. It is used during the control path, and is
taken and released in napi_hash_add and napi_hash_del, which will
typically be called by calls to napi_enable and napi_disable.

This change avoids a deadlock in pcnet32 (and other any other drivers
which follow the same pattern):

 CPU 0:
 pcnet32_open
    spin_lock_irqsave(&lp->lock, ...)
      napi_enable
        napi_hash_add <- before this executes, CPU 1 proceeds
          spin_lock(napi_hash_lock)
       [...]
    spin_unlock_irqrestore(&lp->lock, flags);

 CPU 1:
   pcnet32_close
     napi_disable
       napi_hash_del
         spin_lock(napi_hash_lock)
          < INTERRUPT >
            pcnet32_interrupt
              spin_lock(lp->lock) <- DEADLOCK

Changing the napi_hash_lock to be IRQ safe prevents the IRQ from firing
on CPU 1 until napi_hash_lock is released, preventing the deadlock.

Cc: stable@vger.kernel.org
Fixes: 86e25f40aa1e ("net: napi: Add napi_config")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Closes: https://lore.kernel.org/netdev/85dd4590-ea6b-427d-876a-1d8559c7ad82@roeck-us.net/
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Joe Damato <jdamato@fastly.com>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241202182103.363038-1-jdamato@fastly.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agoRevert "ptp: Switch back to struct platform_driver::remove()"
Jakub Kicinski [Wed, 4 Dec 2024 01:16:43 +0000 (17:16 -0800)]
Revert "ptp: Switch back to struct platform_driver::remove()"

This reverts commit b32913a5609a36c230e9b091da26d38f8e80a056.

Linus applied directly commit e70140ba0d2b ("Get rid of 'remove_new'
relic from platform driver struct"), drop our local change to avoid
conflicts.

Link: https://lore.kernel.org/CAMuHMdV3J=o2x9G=1t_y97iv9eLsPfiej108vU6JHnn=AR-Nvw@mail.gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agonetfilter: nft_inner: incorrect percpu area handling under softirq
Pablo Neira Ayuso [Wed, 27 Nov 2024 11:46:54 +0000 (12:46 +0100)]
netfilter: nft_inner: incorrect percpu area handling under softirq

Softirq can interrupt ongoing packet from process context that is
walking over the percpu area that contains inner header offsets.

Disable bh and perform three checks before restoring the percpu inner
header offsets to validate that the percpu area is valid for this
skbuff:

1) If the NFT_PKTINFO_INNER_FULL flag is set on, then this skbuff
   has already been parsed before for inner header fetching to
   register.

2) Validate that the percpu area refers to this skbuff using the
   skbuff pointer as a cookie. If there is a cookie mismatch, then
   this skbuff needs to be parsed again.

3) Finally, validate if the percpu area refers to this tunnel type.

Only after these three checks the percpu area is restored to a on-stack
copy and bh is enabled again.

After inner header fetching, the on-stack copy is stored back to the
percpu area.

Fixes: 3a07327d10a0 ("netfilter: nft_inner: support for inner tunnel header matching")
Reported-by: syzbot+84d0441b9860f0d63285@syzkaller.appspotmail.com
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
6 months agoMerge tag 'for-6.13-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave...
Linus Torvalds [Tue, 3 Dec 2024 19:02:17 +0000 (11:02 -0800)]
Merge tag 'for-6.13-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - add lockdep annotations for io_uring/encoded read integration, inode
   lock is held when returning to userspace

 - properly reflect experimental config option to sysfs

 - handle NULL root in case the rescue mode accepts invalid/damaged tree
   roots (rescue=ibadroot)

 - regression fix of a deadlock between transaction and extent locks

 - fix pending bio accounting bug in encoded read ioctl

 - fix NOWAIT mode when checking references for NOCOW files

 - fix use-after-free in a rb-tree cleanup in ref-verify debugging tool

* tag 'for-6.13-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix lockdep warnings on io_uring encoded reads
  btrfs: ref-verify: fix use-after-free after invalid ref action
  btrfs: add a sanity check for btrfs root in btrfs_search_slot()
  btrfs: don't loop for nowait writes when checking for cross references
  btrfs: sysfs: advertise experimental features only if CONFIG_BTRFS_EXPERIMENTAL=y
  btrfs: fix deadlock between transaction commits and extent locks
  btrfs: fix use-after-free in btrfs_encoded_read_endio()

6 months agoMerge tag 'fs_for_v6.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack...
Linus Torvalds [Tue, 3 Dec 2024 18:50:22 +0000 (10:50 -0800)]
Merge tag 'fs_for_v6.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull quota and udf fixes from Jan Kara:
 "Two small UDF fixes for better handling of corrupted filesystem and a
  quota fix to fix handling of filesystem freezing"

* tag 'fs_for_v6.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  udf: Verify inode link counts before performing rename
  udf: Skip parent dir link count update if corrupted
  quota: flush quota_release_work upon quota writeback

6 months agoMerge tag 'xfs-fixes-6.13-rc2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Linus Torvalds [Tue, 3 Dec 2024 18:46:49 +0000 (10:46 -0800)]
Merge tag 'xfs-fixes-6.13-rc2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Carlos Maiolino:

 - Use xchg() in xlog_cil_insert_pcp_aggregate()

 - Fix ABBA deadlock on a race between mount and log shutdown

 - Fix quota softlimit incoherency on delalloc

 - Fix sparse inode limits on runt AG

 - remove unknown compat feature checks in SB write valdation

 - Eliminate a lockdep false positive

* tag 'xfs-fixes-6.13-rc2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: don't call xfs_bmap_same_rtgroup in xfs_bmap_add_extent_hole_delay
  xfs: Use xchg() in xlog_cil_insert_pcp_aggregate()
  xfs: prevent mount and log shutdown race
  xfs: delalloc and quota softlimit timers are incoherent
  xfs: fix sparse inode limits on runt AG
  xfs: remove unknown compat feature check in superblock write validation
  xfs: eliminate lockdep false positives in xfs_attr_shortform_list

6 months agoigb: Fix potential invalid memory access in igb_init_module()
Yuan Can [Wed, 23 Oct 2024 12:10:48 +0000 (20:10 +0800)]
igb: Fix potential invalid memory access in igb_init_module()

The pci_register_driver() can fail and when this happened, the dca_notifier
needs to be unregistered, otherwise the dca_notifier can be called when
igb fails to install, resulting to invalid memory access.

Fixes: bbd98fe48a43 ("igb: Fix DCA errors and do not use context index for 82576")
Signed-off-by: Yuan Can <yuancan@huawei.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
6 months agoixgbe: Correct BASE-BX10 compliance code
Tore Amundsen [Fri, 15 Nov 2024 14:17:36 +0000 (14:17 +0000)]
ixgbe: Correct BASE-BX10 compliance code

SFF-8472 (section 5.4 Transceiver Compliance Codes) defines bit 6 as
BASE-BX10. Bit 6 means a value of 0x40 (decimal 64).

The current value in the source code is 0x64, which appears to be a
mix-up of hex and decimal values. A value of 0x64 (binary 01100100)
incorrectly sets bit 2 (1000BASE-CX) and bit 5 (100BASE-FX) as well.

Fixes: 1b43e0d20f2d ("ixgbe: Add 1000BASE-BX support")
Signed-off-by: Tore Amundsen <tore@amundsen.org>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Acked-by: Ernesto Castellotti <ernesto@castellotti.net>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
6 months agoixgbe: downgrade logging of unsupported VF API version to debug
Jacob Keller [Fri, 1 Nov 2024 23:05:43 +0000 (16:05 -0700)]
ixgbe: downgrade logging of unsupported VF API version to debug

The ixgbe PF driver logs an info message when a VF attempts to negotiate an
API version which it does not support:

  VF 0 requested invalid api version 6

The ixgbevf driver attempts to load with mailbox API v1.5, which is
required for best compatibility with other hosts such as the ESX VMWare PF.

The Linux PF only supports API v1.4, and does not currently have support
for the v1.5 API.

The logged message can confuse users, as the v1.5 API is valid, but just
happens to not currently be supported by the Linux PF.

Downgrade the info message to a debug message, and fix the language to
use 'unsupported' instead of 'invalid' to improve message clarity.

Long term, we should investigate whether the improvements in the v1.5 API
make sense for the Linux PF, and if so implement them properly. This may
require yet another API version to resolve issues with negotiating IPSEC
offload support.

Fixes: 339f28964147 ("ixgbevf: Add support for new mailbox communication between PF and VF")
Reported-by: Yifei Liu <yifei.l.liu@oracle.com>
Link: https://lore.kernel.org/intel-wired-lan/20240301235837.3741422-1-yifei.l.liu@oracle.com/
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
6 months agoixgbevf: stop attempting IPSEC offload on Mailbox API 1.5
Jacob Keller [Fri, 1 Nov 2024 23:05:42 +0000 (16:05 -0700)]
ixgbevf: stop attempting IPSEC offload on Mailbox API 1.5

Commit 339f28964147 ("ixgbevf: Add support for new mailbox communication
between PF and VF") added support for v1.5 of the PF to VF mailbox
communication API. This commit mistakenly enabled IPSEC offload for API
v1.5.

No implementation of the v1.5 API has support for IPSEC offload. This
offload is only supported by the Linux PF as mailbox API v1.4. In fact, the
v1.5 API is not implemented in any Linux PF.

Attempting to enable IPSEC offload on a PF which supports v1.5 API will not
work. Only the Linux upstream ixgbe and ixgbevf support IPSEC offload, and
only as part of the v1.4 API.

Fix the ixgbevf Linux driver to stop attempting IPSEC offload when
the mailbox API does not support it.

The existing API design choice makes it difficult to support future API
versions, as other non-Linux hosts do not implement IPSEC offload. If we
add support for v1.5 to the Linux PF, then we lose support for IPSEC
offload.

A full solution likely requires a new mailbox API with a proper negotiation
to check that IPSEC is actually supported by the host.

Fixes: 339f28964147 ("ixgbevf: Add support for new mailbox communication between PF and VF")
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
6 months agoidpf: set completion tag for "empty" bufs associated with a packet
Joshua Hay [Mon, 7 Oct 2024 20:24:35 +0000 (13:24 -0700)]
idpf: set completion tag for "empty" bufs associated with a packet

Commit d9028db618a6 ("idpf: convert to libeth Tx buffer completion")
inadvertently removed code that was necessary for the tx buffer cleaning
routine to iterate over all buffers associated with a packet.

When a frag is too large for a single data descriptor, it will be split
across multiple data descriptors. This means the frag will span multiple
buffers in the buffer ring in order to keep the descriptor and buffer
ring indexes aligned. The buffer entries in the ring are technically
empty and no cleaning actions need to be performed. These empty buffers
can precede other frags associated with the same packet. I.e. a single
packet on the buffer ring can look like:

buf[0]=skb0.frag0
buf[1]=skb0.frag1
buf[2]=empty
buf[3]=skb0.frag2

The cleaning routine iterates through these buffers based on a matching
completion tag. If the completion tag is not set for buf2, the loop will
end prematurely. Frag2 will be left uncleaned and next_to_clean will be
left pointing to the end of packet, which will break the cleaning logic
for subsequent cleans. This consequently leads to tx timeouts.

Assign the empty bufs the same completion tag for the packet to ensure
the cleaning routine iterates over all of the buffers associated with
the packet.

Fixes: d9028db618a6 ("idpf: convert to libeth Tx buffer completion")
Signed-off-by: Joshua Hay <joshua.a.hay@intel.com>
Acked-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Madhu chittim <madhu.chittim@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Krishneil Singh <krishneil.k.singh@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
6 months agoice: Fix VLAN pruning in switchdev mode
Marcin Szycik [Mon, 4 Nov 2024 18:49:09 +0000 (19:49 +0100)]
ice: Fix VLAN pruning in switchdev mode

In switchdev mode the uplink VSI should receive all unmatched packets,
including VLANs. Therefore, VLAN pruning should be disabled if uplink is
in switchdev mode. It is already being done in ice_eswitch_setup_env(),
however the addition of ice_up() in commit 44ba608db509 ("ice: do
switchdev slow-path Rx using PF VSI") caused VLAN pruning to be
re-enabled after disabling it.

Add a check to ice_set_vlan_filtering_features() to ensure VLAN
filtering will not be enabled if uplink is in switchdev mode. Note that
ice_is_eswitch_mode_switchdev() is being used instead of
ice_is_switchdev_running(), as the latter would only return true after
the whole switchdev setup completes.

Fixes: 44ba608db509 ("ice: do switchdev slow-path Rx using PF VSI")
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Marcin Szycik <marcin.szycik@linux.intel.com>
Tested-by: Priya Singh <priyax.singh@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
6 months agoice: Fix NULL pointer dereference in switchdev
Wojciech Drewek [Tue, 29 Oct 2024 09:42:59 +0000 (10:42 +0100)]
ice: Fix NULL pointer dereference in switchdev

Commit 608a5c05c39b ("virtchnl: support queue rate limit and quanta
size configuration") introduced new virtchnl ops:
- get_qos_caps
- cfg_q_bw
- cfg_q_quanta

New ops were added to ice_virtchnl_dflt_ops, in
commit 015307754a19 ("ice: Support VF queue rate limit and quanta
size configuration"), but not to the ice_virtchnl_repr_ops. Because
of that, if we get one of those messages in switchdev mode we end up
with NULL pointer dereference:

[ 1199.794701] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 1199.794804] Workqueue: ice ice_service_task [ice]
[ 1199.794878] RIP: 0010:0x0
[ 1199.795027] Call Trace:
[ 1199.795033]  <TASK>
[ 1199.795039]  ? __die+0x20/0x70
[ 1199.795051]  ? page_fault_oops+0x140/0x520
[ 1199.795064]  ? exc_page_fault+0x7e/0x270
[ 1199.795074]  ? asm_exc_page_fault+0x22/0x30
[ 1199.795086]  ice_vc_process_vf_msg+0x6e5/0xd30 [ice]
[ 1199.795165]  __ice_clean_ctrlq+0x734/0x9d0 [ice]
[ 1199.795207]  ice_service_task+0xccf/0x12b0 [ice]
[ 1199.795248]  process_one_work+0x21a/0x620
[ 1199.795260]  worker_thread+0x18d/0x330
[ 1199.795269]  ? __pfx_worker_thread+0x10/0x10
[ 1199.795279]  kthread+0xec/0x120
[ 1199.795288]  ? __pfx_kthread+0x10/0x10
[ 1199.795296]  ret_from_fork+0x2d/0x50
[ 1199.795305]  ? __pfx_kthread+0x10/0x10
[ 1199.795312]  ret_from_fork_asm+0x1a/0x30
[ 1199.795323]  </TASK>

Fixes: 015307754a19 ("ice: Support VF queue rate limit and quanta size configuration")
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Wojciech Drewek <wojciech.drewek@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>