www.infradead.org Git - users/hch/misc.git/log

Merge patch series "can: flexcan: add transceiver capabilities"

Dimitri Fedrau <dimitri.fedrau@liebherr.com> says:

Currently the flexcan driver does only support adding PHYs by using
the "old" regulator bindings. Add support for CAN transceivers as a
PHY. Add the capability to ensure that the PHY is in operational state
when the link is set to an "up" state.

Changes in v4:
- Dropped "if: required: phys" in bindings
- Link to v3: https://lore.kernel.org/r/20250221-flexcan-add-transceiver-caps-v3-0-a947bde55a62@liebherr.com

Changes in v3:
- Have xceiver-supply or phys properties in bindings
- Switch do dev_err_probe in flexcan_probe when checking error of call
  devm_phy_optional_get
- Link to v2: https://lore.kernel.org/r/20250220-flexcan-add-transceiver-caps-v2-0-a81970f11846@liebherr.com

Changes in v2:
- Rename variable xceiver to transceiver in struct flexcan_priv and in
  flexcan_probe
- Set priv->can.bitrate_max if transceiver is found
- Fix commit messages which claim that transceivers are not supported
- Do not print error on EPROBE_DEFER after calling devm_phy_optional_get in
  flexcan_probe
- Link to v1: https://lore.kernel.org/r/20250211-flexcan-add-transceiver-caps-v1-0-c6abb7817b0f@liebherr.com

Link: https://patch.msgid.link/20250312-flexcan-add-transceiver-caps-v4-0-29e89ae0225a@liebherr.com
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: flexcan: add transceiver capabilities

Currently the flexcan driver does only support adding PHYs by using the
"old" regulator bindings. Add support for CAN transceivers as a PHY. Add
the capability to ensure that the PHY is in operational state when the link
is set to an "up" state.

Signed-off-by: Dimitri Fedrau <dimitri.fedrau@liebherr.com>
Link: https://patch.msgid.link/20250312-flexcan-add-transceiver-caps-v4-2-29e89ae0225a@liebherr.com
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

dt-bindings: can: fsl,flexcan: add transceiver capabilities

Currently the flexcan driver does only support adding PHYs by using the
"old" regulator bindings. Add support for CAN transceivers as a PHY.

Acked-by: Conor Dooley <conor.dooley@microchip.com>
Signed-off-by: Dimitri Fedrau <dimitri.fedrau@liebherr.com>
Link: https://patch.msgid.link/20250312-flexcan-add-transceiver-caps-v4-1-29e89ae0225a@liebherr.com
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Cross-merge networking fixes after downstream PR (net-6.14-rc6).

Conflicts:

tools/testing/selftests/drivers/net/ping.py
  75cc19c8ff89 ("selftests: drv-net: add xdp cases for ping.py")
  de94e8697405 ("selftests: drv-net: store addresses in dict indexed by ipver")
https://lore.kernel.org/netdev/20250311115758.17a1d414@canb.auug.org.au/

net/core/devmem.c
  a70f891e0fa0 ("net: devmem: do not WARN conditionally after netdev_rx_queue_restart()")
  1d22d3060b9b ("net: drop rtnl_lock for queue_mgmt operations")
https://lore.kernel.org/netdev/20250313114929.43744df1@canb.auug.org.au/

Adjacent changes:

tools/testing/selftests/net/Makefile
  6f50175ccad4 ("selftests: Add IPv6 link-local address generation tests for GRE devices.")
  2e5584e0f913 ("selftests/net: expand cmsg_ipv6.sh with ipv4")

drivers/net/ethernet/broadcom/bnxt/bnxt.c
  661958552eda ("eth: bnxt: do not use BNXT_VNIC_NTUPLE unconditionally in queue restart logic")
  fe96d717d38e ("bnxt_en: Extend queue stop/start for TX rings")

Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge tag 'net-6.14-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
"Including fixes from netfilter, bluetooth and wireless.

  No known regressions outstanding.

  Current release - regressions:

   - wifi: nl80211: fix assoc link handling

   - eth: lan78xx: sanitize return values of register read/write
     functions

  Current release - new code bugs:

   - ethtool: tsinfo: fix dump command

   - bluetooth: btusb: configure altsetting for HCI_USER_CHANNEL

   - eth: mlx5: DR, use the right action structs for STEv3

  Previous releases - regressions:

   - netfilter: nf_tables: make destruction work queue pernet

   - gre: fix IPv6 link-local address generation.

   - wifi: iwlwifi: fix TSO preparation

   - bluetooth: revert "bluetooth: hci_core: fix sleeping function
     called from invalid context"

   - ovs: revert "openvswitch: switch to per-action label counting in
     conntrack"

   - eth:
       - ice: fix switchdev slow-path in LAG
       - bonding: fix incorrect MAC address setting to receive NS
         messages

  Previous releases - always broken:

   - core: prevent TX of unreadable skbs

   - sched: prevent creation of classes with TC_H_ROOT

   - netfilter: nft_exthdr: fix offset with ipv4_find_option()

   - wifi: cfg80211: cancel wiphy_work before freeing wiphy

   - mctp: copy headers if cloned

   - phy: nxp-c45-tja11xx: add errata for TJA112XA/B

   - eth:
       - bnxt: fix kernel panic in the bnxt_get_queue_stats{rx | tx}
       - mlx5: bridge, fix the crash caused by LAG state check"

* tag 'net-6.14-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (65 commits)
  net: mana: cleanup mana struct after debugfs_remove()
  net/mlx5e: Prevent bridge link show failure for non-eswitch-allowed devices
  net/mlx5: Bridge, fix the crash caused by LAG state check
  net/mlx5: Lag, Check shared fdb before creating MultiPort E-Switch
  net/mlx5: Fix incorrect IRQ pool usage when releasing IRQs
  net/mlx5: HWS, Rightsize bwc matcher priority
  net/mlx5: DR, use the right action structs for STEv3
  Revert "openvswitch: switch to per-action label counting in conntrack"
  net: openvswitch: remove misbehaving actions length check
  selftests: Add IPv6 link-local address generation tests for GRE devices.
  gre: Fix IPv6 link-local address generation.
  netfilter: nft_exthdr: fix offset with ipv4_find_option()
  selftests/tc-testing: Add a test case for DRR class with TC_H_ROOT
  net_sched: Prevent creation of classes with TC_H_ROOT
  ipvs: prevent integer overflow in do_ip_vs_get_ctl()
  selftests: netfilter: skip br_netfilter queue tests if kernel is tainted
  netfilter: nf_conncount: Fully initialize struct nf_conncount_tuple in insert_tree()
  wifi: mac80211: fix MPDU length parsing for EHT 5/6 GHz
  qlcnic: fix memory leak issues in qlcnic_sriov_common.c
  rtase: Fix improper release of ring list entries in rtase_sw_reset
  ...

Merge tag 'vfs-6.14-rc7.fixes' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs

Pull vfs fixes from Christian Brauner:

- Bring in an RCU pathwalk fix for afs. This is brought in as a merge
   from the vfs-6.15.shared.afs branch that needs this commit and other
   trees already depend on it.

- Fix vboxfs unterminated string handling.

* tag 'vfs-6.14-rc7.fixes' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs:
  vboxsf: Add __nonstring annotations for unterminated strings
  afs: Fix afs_atcell_get_link() to handle RCU pathwalk

Merge tag 'nf-25-03-13' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf

Pablo Neira Ayuso says:

====================
Netfilter/IPVS fixes for net

The following patchset contains Netfilter/IPVS fixes for net:

1) Missing initialization of cpu and jiffies32 fields in conncount,
   from Kohei Enju.

2) Skip several tests in case kernel is tainted, otherwise tests bogusly
   report failure too as they also check for tainted kernel,
   from Florian Westphal.

3) Fix a hyphothetical integer overflow in do_ip_vs_get_ctl() leading
   to bogus error logs, from Dan Carpenter.

4) Fix incorrect offset in ipv4 option match in nft_exthdr, from
   Alexey Kashavkin.

netfilter pull request 25-03-13

* tag 'nf-25-03-13' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: nft_exthdr: fix offset with ipv4_find_option()
  ipvs: prevent integer overflow in do_ip_vs_get_ctl()
  selftests: netfilter: skip br_netfilter queue tests if kernel is tainted
  netfilter: nf_conncount: Fully initialize struct nf_conncount_tuple in insert_tree()
====================

Link: https://patch.msgid.link/20250313095636.2186-1-pablo@netfilter.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: mana: cleanup mana struct after debugfs_remove()

When on a MANA VM hibernation is triggered, as part of hibernate_snapshot(),
mana_gd_suspend() and mana_gd_resume() are called. If during this
mana_gd_resume(), a failure occurs with HWC creation, mana_port_debugfs
pointer does not get reinitialized and ends up pointing to older,
cleaned-up dentry.
Further in the hibernation path, as part of power_down(), mana_gd_shutdown()
is triggered. This call, unaware of the failures in resume, tries to cleanup
the already cleaned up  mana_port_debugfs value and hits the following bug:

[  191.359296] mana 7870:00:00.0: Shutdown was called
[  191.359918] BUG: kernel NULL pointer dereference, address: 0000000000000098
[  191.360584] #PF: supervisor write access in kernel mode
[  191.361125] #PF: error_code(0x0002) - not-present page
[  191.361727] PGD 1080ea067 P4D 0
[  191.362172] Oops: Oops: 0002 [#1] SMP NOPTI
[  191.362606] CPU: 11 UID: 0 PID: 1674 Comm: bash Not tainted 6.14.0-rc5+ #2
[  191.363292] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 11/21/2024
[  191.364124] RIP: 0010:down_write+0x19/0x50
[  191.364537] Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 e5 53 48 89 fb e8 de cd ff ff 31 c0 ba 01 00 00 00 <f0> 48 0f b1 13 75 16 65 48 8b 05 88 24 4c 6a 48 89 43 08 48 8b 5d
[  191.365867] RSP: 0000:ff45fbe0c1c037b8 EFLAGS: 00010246
[  191.366350] RAX: 0000000000000000 RBX: 0000000000000098 RCX: ffffff8100000000
[  191.366951] RDX: 0000000000000001 RSI: 0000000000000064 RDI: 0000000000000098
[  191.367600] RBP: ff45fbe0c1c037c0 R08: 0000000000000000 R09: 0000000000000001
[  191.368225] R10: ff45fbe0d2b01000 R11: 0000000000000008 R12: 0000000000000000
[  191.368874] R13: 000000000000000b R14: ff43dc27509d67c0 R15: 0000000000000020
[  191.369549] FS:  00007dbc5001e740(0000) GS:ff43dc663f380000(0000) knlGS:0000000000000000
[  191.370213] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  191.370830] CR2: 0000000000000098 CR3: 0000000168e8e002 CR4: 0000000000b73ef0
[  191.371557] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  191.372192] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[  191.372906] Call Trace:
[  191.373262]  <TASK>
[  191.373621]  ? show_regs+0x64/0x70
[  191.374040]  ? __die+0x24/0x70
[  191.374468]  ? page_fault_oops+0x290/0x5b0
[  191.374875]  ? do_user_addr_fault+0x448/0x800
[  191.375357]  ? exc_page_fault+0x7a/0x160
[  191.375971]  ? asm_exc_page_fault+0x27/0x30
[  191.376416]  ? down_write+0x19/0x50
[  191.376832]  ? down_write+0x12/0x50
[  191.377232]  simple_recursive_removal+0x4a/0x2a0
[  191.377679]  ? __pfx_remove_one+0x10/0x10
[  191.378088]  debugfs_remove+0x44/0x70
[  191.378530]  mana_detach+0x17c/0x4f0
[  191.378950]  ? __flush_work+0x1e2/0x3b0
[  191.379362]  ? __cond_resched+0x1a/0x50
[  191.379787]  mana_remove+0xf2/0x1a0
[  191.380193]  mana_gd_shutdown+0x3b/0x70
[  191.380642]  pci_device_shutdown+0x3a/0x80
[  191.381063]  device_shutdown+0x13e/0x230
[  191.381480]  kernel_power_off+0x35/0x80
[  191.381890]  hibernate+0x3c6/0x470
[  191.382312]  state_store+0xcb/0xd0
[  191.382734]  kobj_attr_store+0x12/0x30
[  191.383211]  sysfs_kf_write+0x3e/0x50
[  191.383640]  kernfs_fop_write_iter+0x140/0x1d0
[  191.384106]  vfs_write+0x271/0x440
[  191.384521]  ksys_write+0x72/0xf0
[  191.384924]  __x64_sys_write+0x19/0x20
[  191.385313]  x64_sys_call+0x2b0/0x20b0
[  191.385736]  do_syscall_64+0x79/0x150
[  191.386146]  ? __mod_memcg_lruvec_state+0xe7/0x240
[  191.386676]  ? __lruvec_stat_mod_folio+0x79/0xb0
[  191.387124]  ? __pfx_lru_add+0x10/0x10
[  191.387515]  ? queued_spin_unlock+0x9/0x10
[  191.387937]  ? do_anonymous_page+0x33c/0xa00
[  191.388374]  ? __handle_mm_fault+0xcf3/0x1210
[  191.388805]  ? __count_memcg_events+0xbe/0x180
[  191.389235]  ? handle_mm_fault+0xae/0x300
[  191.389588]  ? do_user_addr_fault+0x559/0x800
[  191.390027]  ? irqentry_exit_to_user_mode+0x43/0x230
[  191.390525]  ? irqentry_exit+0x1d/0x30
[  191.390879]  ? exc_page_fault+0x86/0x160
[  191.391235]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  191.391745] RIP: 0033:0x7dbc4ff1c574
[  191.392111] Code: c7 00 16 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 80 3d d5 ea 0e 00 00 74 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 55 48 89 e5 48 83 ec 20 48 89
[  191.393412] RSP: 002b:00007ffd95a23ab8 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
[  191.393990] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007dbc4ff1c574
[  191.394594] RDX: 0000000000000005 RSI: 00005a6eeadb0ce0 RDI: 0000000000000001
[  191.395215] RBP: 00007ffd95a23ae0 R08: 00007dbc50003b20 R09: 0000000000000000
[  191.395805] R10: 0000000000000001 R11: 0000000000000202 R12: 0000000000000005
[  191.396404] R13: 00005a6eeadb0ce0 R14: 00007dbc500045c0 R15: 00007dbc50001ee0
[  191.396987]  </TASK>

To fix this, we explicitly set such mana debugfs variables to NULL after
debugfs_remove() is called.

Fixes: 6607c17c6c5e ("net: mana: Enable debugfs files for MANA device")
Cc: stable@vger.kernel.org
Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Link: https://patch.msgid.link/1741688260-28922-1-git-send-email-shradhagupta@linux.microsoft.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'mlx5-misc-fixes-2025-03-10'

Tariq Toukan says:

====================
mlx5 misc fixes 2025-03-10

This patchset provides misc bug fixes from the team to the mlx5 core and
Eth drivers.
====================

Link: https://patch.msgid.link/1741644104-97767-1-git-send-email-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/mlx5e: Prevent bridge link show failure for non-eswitch-allowed devices

mlx5_eswitch_get_vepa returns -EPERM if the device lacks
eswitch_manager capability, blocking mlx5e_bridge_getlink from
retrieving VEPA mode. Since mlx5e_bridge_getlink implements
ndo_bridge_getlink, returning -EPERM causes bridge link show to fail
instead of skipping devices without this capability.

To avoid this, return -EOPNOTSUPP from mlx5e_bridge_getlink when
mlx5_eswitch_get_vepa fails, ensuring the command continues processing
other devices while ignoring those without the necessary capability.

Fixes: 4b89251de024 ("net/mlx5: Support ndo bridge_setlink and getlink")
Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Jianbo Liu <jianbol@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Link: https://patch.msgid.link/1741644104-97767-7-git-send-email-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/mlx5: Bridge, fix the crash caused by LAG state check

When removing LAG device from bridge, NETDEV_CHANGEUPPER event is
triggered. Driver finds the lower devices (PFs) to flush all the
offloaded entries. And mlx5_lag_is_shared_fdb is checked, it returns
false if one of PF is unloaded. In such case,
mlx5_esw_bridge_lag_rep_get() and its caller return NULL, instead of
the alive PF, and the flush is skipped.

Besides, the bridge fdb entry's lastuse is updated in mlx5 bridge
event handler. But this SWITCHDEV_FDB_ADD_TO_BRIDGE event can be
ignored in this case because the upper interface for bond is deleted,
and the entry will never be aged because lastuse is never updated.

To make things worse, as the entry is alive, mlx5 bridge workqueue
keeps sending that event, which is then handled by kernel bridge
notifier. It causes the following crash when accessing the passed bond
netdev which is already destroyed.

To fix this issue, remove such checks. LAG state is already checked in
commit 15f8f168952f ("net/mlx5: Bridge, verify LAG state when adding
bond to bridge"), driver still need to skip offload if LAG becomes
invalid state after initialization.

Oops: stack segment: 0000 [#1] SMP
CPU: 3 UID: 0 PID: 23695 Comm: kworker/u40:3 Tainted: G           OE      6.11.0_mlnx #1
Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Workqueue: mlx5_bridge_wq mlx5_esw_bridge_update_work [mlx5_core]
RIP: 0010:br_switchdev_event+0x2c/0x110 [bridge]
Code: 44 00 00 48 8b 02 48 f7 00 00 02 00 00 74 69 41 54 55 53 48 83 ec 08 48 8b a8 08 01 00 00 48 85 ed 74 4a 48 83 fe 02 48 89 d3 <4c> 8b 65 00 74 23 76 49 48 83 fe 05 74 7e 48 83 fe 06 75 2f 0f b7
RSP: 0018:ffffc900092cfda0 EFLAGS: 00010297
RAX: ffff888123bfe000 RBX: ffffc900092cfe08 RCX: 00000000ffffffff
RDX: ffffc900092cfe08 RSI: 0000000000000001 RDI: ffffffffa0c585f0
RBP: 6669746f6e690a30 R08: 0000000000000000 R09: ffff888123ae92c8
R10: 0000000000000000 R11: fefefefefefefeff R12: ffff888123ae9c60
R13: 0000000000000001 R14: ffffc900092cfe08 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff88852c980000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f15914c8734 CR3: 0000000002830005 CR4: 0000000000770ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
  <TASK>
  ? __die_body+0x1a/0x60
  ? die+0x38/0x60
  ? do_trap+0x10b/0x120
  ? do_error_trap+0x64/0xa0
  ? exc_stack_segment+0x33/0x50
  ? asm_exc_stack_segment+0x22/0x30
  ? br_switchdev_event+0x2c/0x110 [bridge]
  ? sched_balance_newidle.isra.149+0x248/0x390
  notifier_call_chain+0x4b/0xa0
  atomic_notifier_call_chain+0x16/0x20
  mlx5_esw_bridge_update+0xec/0x170 [mlx5_core]
  mlx5_esw_bridge_update_work+0x19/0x40 [mlx5_core]
  process_scheduled_works+0x81/0x390
  worker_thread+0x106/0x250
  ? bh_worker+0x110/0x110
  kthread+0xb7/0xe0
  ? kthread_park+0x80/0x80
  ret_from_fork+0x2d/0x50
  ? kthread_park+0x80/0x80
  ret_from_fork_asm+0x11/0x20
  </TASK>

Fixes: ff9b7521468b ("net/mlx5: Bridge, support LAG")
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Link: https://patch.msgid.link/1741644104-97767-6-git-send-email-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/mlx5: Lag, Check shared fdb before creating MultiPort E-Switch

Currently, MultiPort E-Switch is requesting to create a LAG with shared
FDB without checking the LAG is supporting shared FDB.
Add the check.

Fixes: a32327a3a02c ("net/mlx5: Lag, Control MultiPort E-Switch single FDB mode")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Link: https://patch.msgid.link/1741644104-97767-5-git-send-email-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/mlx5: Fix incorrect IRQ pool usage when releasing IRQs

mlx5_irq_pool_get() is a getter for completion IRQ pool only.
However, after the cited commit, mlx5_irq_pool_get() is called during
ctrl IRQ release flow to retrieve the pool, resulting in the use of an
incorrect IRQ pool.

Hence, use the newly introduced mlx5_irq_get_pool() getter to retrieve
the correct IRQ pool based on the IRQ itself. While at it, rename
mlx5_irq_pool_get() to mlx5_irq_table_get_comp_irq_pool() which
accurately reflects its purpose and improves code readability.

Fixes: 0477d5168bbb ("net/mlx5: Expose SFs IRQs")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Maher Sanalla <msanalla@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Link: https://patch.msgid.link/1741644104-97767-4-git-send-email-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/mlx5: HWS, Rightsize bwc matcher priority

The bwc layer was clamping the matcher priority from 32 bits to 16 bits.
This didn't show up until a matcher was resized, since the initial
native matcher was created using the correct 32 bit value.

The fix also reorders fields to avoid some padding.

Fixes: 2111bb970c78 ("net/mlx5: HWS, added backward-compatible API handling")
Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com>
Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1741644104-97767-3-git-send-email-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/mlx5: DR, use the right action structs for STEv3

Some actions in ConnectX-8 (STEv3) have different structure,
and they are handled separately in ste_ctx_v3.
This separate handling was missing two actions: INSERT_HDR
and REMOVE_HDR, which broke SWS for Linux Bridge.
This patch resolves the issue by introducing dedicated
callbacks for the insert and remove header functions,
with version-specific implementations for each STE variant.

Fixes: 4d617b57574f ("net/mlx5: DR, add support for ConnectX-8 steering")
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Itamar Gozlan <igozlan@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1741644104-97767-2-git-send-email-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux

Tariq Toukan says:

====================
mlx5-next updates 2025-03-10

The following pull-request contains common mlx5 updates for your *net-next* tree.
Please pull and let me know of any problem.

* 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
  net/mlx5: Add IFC bits for PPCNT recovery counters group
  net/mlx5: fs, add RDMA TRANSPORT steering domain support
  net/mlx5: Query ADV_RDMA capabilities
  net/mlx5: Limit non-privileged commands
  net/mlx5: Allow the throttle mechanism to be more dynamic
  net/mlx5: Add RDMA_CTRL HW capabilities
====================

Link: https://patch.msgid.link/1741608293-41436-1-git-send-email-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

dt-bindings: net: Define interrupt constraints for DWMAC vendor bindings

The `snps,dwmac.yaml` binding currently sets `maxItems: 3` for the
`interrupts` and `interrupt-names` properties, but vendor bindings
selecting `snps,dwmac.yaml` do not impose these limits.

Define constraints for `interrupts` and `interrupt-names` properties in
various DWMAC vendor bindings to ensure proper validation and consistency.

Signed-off-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Acked-by: Nobuhiro Iwamatsu <nobuhiro1.iwamatsu@toshiba.co.jp>
Link: https://patch.msgid.link/20250309003301.1152228-1-prabhakar.mahadev-lad.rj@bp.renesas.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'net-stmmac-dwmac-rk-validate-grf-and-peripheral-grf-during-probe'

Jonas Karlman says:

====================
net: stmmac: dwmac-rk: Validate GRF and peripheral GRF during probe

All Rockchip GMAC variants typically write to GRF regs to control e.g.
interface mode, speed and MAC rx/tx delay. Newer SoCs such as RK3576 and
RK3588 use a mix of GRF and peripheral GRF regs. These syscon regmaps is
located with help of a rockchip,grf and rockchip,php-grf phandle.

However, validating the rockchip,grf and rockchip,php-grf syscon regmap
is deferred until e.g. interface mode or speed is configured.

This series change to validate the GRF and peripheral GRF syscon regmap
at probe time to help simplify the SoC specific operations.

This should not introduce any backward compatibility issues as all
GMAC nodes have been added together with a rockchip,grf phandle (and
rockchip,php-grf where required) in their initial commit.
====================

Link: https://patch.msgid.link/20250308213720.2517944-1-jonas@kwiboo.se
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: stmmac: dwmac-rk: Remove unneeded GRF and peripheral GRF checks

Now that GRF, and peripheral GRF where needed, is validated at probe
time there is no longer any need to check and log an error in each SoC
specific operation.

Remove unneeded IS_ERR() checks and early bail out from each SoC
specific operation.

Signed-off-by: Jonas Karlman <jonas@kwiboo.se>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250308213720.2517944-4-jonas@kwiboo.se
Reviewed-by: Sebastian Reichel <sebastian.reichel@collabora.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: stmmac: dwmac-rk: Validate GRF and peripheral GRF during probe

All Rockchip GMAC variants typically write to GRF regs to control e.g.
interface mode, speed and MAC rx/tx delay. Newer SoCs such as RK3576 and
RK3588 use a mix of GRF and peripheral GRF regs. These syscon regmaps is
located with help of a rockchip,grf and rockchip,php-grf phandle.

However, validating the rockchip,grf and rockchip,php-grf syscon regmap
is deferred until e.g. interface mode or speed is configured, inside the
individual SoC specific operations.

Change to validate the rockchip,grf and rockchip,php-grf syscon regmap
at probe time to simplify all SoC specific operations.

This should not introduce any backward compatibility issues as all
GMAC nodes have been added together with a rockchip,grf phandle (and
rockchip,php-grf where required) in their initial commit.

Signed-off-by: Jonas Karlman <jonas@kwiboo.se>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250308213720.2517944-3-jonas@kwiboo.se
Reviewed-by: Sebastian Reichel <sebastian.reichel@collabora.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

dt-bindings: net: rockchip-dwmac: Require rockchip,grf and rockchip,php-grf

All Rockchip GMAC variants typically write to GRF regs to control e.g.
interface mode, speed and MAC rx/tx delay. Newer SoCs such as RK3562,
RK3576 and RK3588 use a mix of GRF and peripheral GRF regs.

Prior to the commit b331b8ef86f0 ("dt-bindings: net: convert
rockchip-dwmac to json-schema") the property rockchip,grf was listed
under "Required properties". During the conversion this was lost and
rockchip,grf has since then incorrectly been treated as optional and
not as required.

Similarly, when rockchip,php-grf was added to the schema in the
commit a2b77831427c ("dt-bindings: net: rockchip-dwmac: add rk3588 gmac
compatible") it also incorrectly has been treated as optional for all
GMAC variants, when it should have been required for RK3588, and later
also for RK3576.

Update this binding to require rockchip,grf and rockchip,php-grf to
properly reflect that GRF (and peripheral GRF for RK3576/RK3588) is
required to control part of GMAC.

This should not introduce any breakage as all Rockchip GMAC nodes have
been added together with a rockchip,grf phandle (and rockchip,php-grf
where required) in their initial commit.

Signed-off-by: Jonas Karlman <jonas@kwiboo.se>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://patch.msgid.link/20250308213720.2517944-2-jonas@kwiboo.se
Reviewed-by: Sebastian Reichel <sebastian.reichel@collabora.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Revert "openvswitch: switch to per-action label counting in conntrack"

Currently, ovs_ct_set_labels() is only called for confirmed conntrack
entries (ct) within ovs_ct_commit(). However, if the conntrack entry
does not have the labels_ext extension, attempting to allocate it in
ovs_ct_get_conn_labels() for a confirmed entry triggers a warning in
nf_ct_ext_add():

WARN_ON(nf_ct_is_confirmed(ct));

This happens when the conntrack entry is created externally before OVS
increments net->ct.labels_used. The issue has become more likely since
commit fcb1aa5163b1 ("openvswitch: switch to per-action label counting
in conntrack"), which changed to use per-action label counting and
increment net->ct.labels_used when a flow with ct action is added.

Since there’s no straightforward way to fully resolve this issue at the
moment, this reverts the commit to avoid breaking existing use cases.

Fixes: fcb1aa5163b1 ("openvswitch: switch to per-action label counting in conntrack")
Reported-by: Jianbo Liu <jianbol@nvidia.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Link: https://patch.msgid.link/1bdeb2f3a812bca016a225d3de714427b2cd4772.1741457143.git.lucien.xin@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: openvswitch: remove misbehaving actions length check

The actions length check is unreliable and produces different results
depending on the initial length of the provided netlink attribute and
the composition of the actual actions inside of it.  For example, a
user can add 4088 empty clone() actions without triggering -EMSGSIZE,
on attempt to add 4089 such actions the operation will fail with the
-EMSGSIZE verdict.  However, if another 16 KB of other actions will
be *appended* to the previous 4089 clone() actions, the check passes
and the flow is successfully installed into the openvswitch datapath.

The reason for a such a weird behavior is the way memory is allocated.
When ovs_flow_cmd_new() is invoked, it calls ovs_nla_copy_actions(),
that in turn calls nla_alloc_flow_actions() with either the actual
length of the user-provided actions or the MAX_ACTIONS_BUFSIZE.  The
function adds the size of the sw_flow_actions structure and then the
actually allocated memory is rounded up to the closest power of two.

So, if the user-provided actions are larger than MAX_ACTIONS_BUFSIZE,
then MAX_ACTIONS_BUFSIZE + sizeof(*sfa) rounded up is 32K + 24 -> 64K.
Later, while copying individual actions, we look at ksize(), which is
64K, so this way the MAX_ACTIONS_BUFSIZE check is not actually
triggered and the user can easily allocate almost 64 KB of actions.

However, when the initial size is less than MAX_ACTIONS_BUFSIZE, but
the actions contain ones that require size increase while copying
(such as clone() or sample()), then the limit check will be performed
during the reserve_sfa_size() and the user will not be allowed to
create actions that yield more than 32 KB internally.

This is one part of the problem.  The other part is that it's not
actually possible for the userspace application to know beforehand
if the particular set of actions will be rejected or not.

Certain actions require more space in the internal representation,
e.g. an empty clone() takes 4 bytes in the action list passed in by
the user, but it takes 12 bytes in the internal representation due
to an extra nested attribute, and some actions require less space in
the internal representations, e.g. set(tunnel(..)) normally takes
64+ bytes in the action list provided by the user, but only needs to
store a single pointer in the internal implementation, since all the
data is stored in the tunnel_info structure instead.

And the action size limit is applied to the internal representation,
not to the action list passed by the user.  So, it's not possible for
the userpsace application to predict if the certain combination of
actions will be rejected or not, because it is not possible for it to
calculate how much space these actions will take in the internal
representation without knowing kernel internals.

All that is causing random failures in ovs-vswitchd in userspace and
inability to handle certain traffic patterns as a result.  For example,
it is reported that adding a bit more than a 1100 VMs in an OpenStack
setup breaks the network due to OVS not being able to handle ARP
traffic anymore in some cases (it tries to install a proper datapath
flow, but the kernel rejects it with -EMSGSIZE, even though the action
list isn't actually that large.)

Kernel behavior must be consistent and predictable in order for the
userspace application to use it in a reasonable way.  ovs-vswitchd has
a mechanism to re-direct parts of the traffic and partially handle it
in userspace if the required action list is oversized, but that doesn't
work properly if we can't actually tell if the action list is oversized
or not.

Solution for this is to check the size of the user-provided actions
instead of the internal representation.  This commit just removes the
check from the internal part because there is already an implicit size
check imposed by the netlink protocol.  The attribute can't be larger
than 64 KB.  Realistically, we could reduce the limit to 32 KB, but
we'll be risking to break some existing setups that rely on the fact
that it's possible to create nearly 64 KB action lists today.

Vast majority of flows in real setups are below 100-ish bytes.  So
removal of the limit will not change real memory consumption on the
system.  The absolutely worst case scenario is if someone adds a flow
with 64 KB of empty clone() actions.  That will yield a 192 KB in the
internal representation consuming 256 KB block of memory.  However,
that list of actions is not meaningful and also a no-op.  Real world
very large action lists (that can occur for a rare cases of BUM
traffic handling) are unlikely to contain a large number of clones and
will likely have a lot of tunnel attributes making the internal
representation comparable in size to the original action list.
So, it should be fine to just remove the limit.

Commit in the 'Fixes' tag is the first one that introduced the
difference between internal representation and the user-provided action
lists, but there were many more afterwards that lead to the situation
we have today.

Fixes: 7d5437c709de ("openvswitch: Add tunneling interface.")
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Reviewed-by: Aaron Conole <aconole@redhat.com>
Link: https://patch.msgid.link/20250308004609.2881861-1-i.maximets@ovn.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'gre-fix-regressions-in-ipv6-link-local-address-generation'

Guillaume Nault says:

====================
gre: Fix regressions in IPv6 link-local address generation.

IPv6 link-local address generation has some special cases for GRE
devices. This has led to several regressions in the past, and some of
them are still not fixed. This series fixes the remaining problems,
like the ipv6.conf.<dev>.addr_gen_mode sysctl being ignored and the
router discovery process not being started (see details in patch 1).

To avoid any further regressions, patch 2 adds selftests covering
IPv4 and IPv6 gre/gretap devices with all combinations of currently
supported addr_gen_mode values.
====================

Link: https://patch.msgid.link/cover.1741375285.git.gnault@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

selftests: Add IPv6 link-local address generation tests for GRE devices.

GRE devices have their special code for IPv6 link-local address
generation that has been the source of several regressions in the past.

Add selftest to check that all gre, ip6gre, gretap and ip6gretap get an
IPv6 link-link local address in accordance with the
net.ipv6.conf.<dev>.addr_gen_mode sysctl.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Link: https://patch.msgid.link/2d6772af8e1da9016b2180ec3f8d9ee99f470c77.1741375285.git.gnault@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

gre: Fix IPv6 link-local address generation.

Use addrconf_addr_gen() to generate IPv6 link-local addresses on GRE
devices in most cases and fall back to using add_v4_addrs() only in
case the GRE configuration is incompatible with addrconf_addr_gen().

GRE used to use addrconf_addr_gen() until commit e5dd729460ca
("ip/ip6_gre: use the same logic as SIT interfaces when computing v6LL
address") restricted this use to gretap and ip6gretap devices, and
created add_v4_addrs() (borrowed from SIT) for non-Ethernet GRE ones.

The original problem came when commit 9af28511be10 ("addrconf: refuse
isatap eui64 for INADDR_ANY") made __ipv6_isatap_ifid() fail when its
addr parameter was 0. The commit says that this would create an invalid
address, however, I couldn't find any RFC saying that the generated
interface identifier would be wrong. Anyway, since gre over IPv4
devices pass their local tunnel address to __ipv6_isatap_ifid(), that
commit broke their IPv6 link-local address generation when the local
address was unspecified.

Then commit e5dd729460ca ("ip/ip6_gre: use the same logic as SIT
interfaces when computing v6LL address") tried to fix that case by
defining add_v4_addrs() and calling it to generate the IPv6 link-local
address instead of using addrconf_addr_gen() (apart for gretap and
ip6gretap devices, which would still use the regular
addrconf_addr_gen(), since they have a MAC address).

That broke several use cases because add_v4_addrs() isn't properly
integrated into the rest of IPv6 Neighbor Discovery code. Several of
these shortcomings have been fixed over time, but add_v4_addrs()
remains broken on several aspects. In particular, it doesn't send any
Router Sollicitations, so the SLAAC process doesn't start until the
interface receives a Router Advertisement. Also, add_v4_addrs() mostly
ignores the address generation mode of the interface
(/proc/sys/net/ipv6/conf/*/addr_gen_mode), thus breaking the
IN6_ADDR_GEN_MODE_RANDOM and IN6_ADDR_GEN_MODE_STABLE_PRIVACY cases.

Fix the situation by using add_v4_addrs() only in the specific scenario
where the normal method would fail. That is, for interfaces that have
all of the following characteristics:

  * run over IPv4,
  * transport IP packets directly, not Ethernet (that is, not gretap
    interfaces),
  * tunnel endpoint is INADDR_ANY (that is, 0),
  * device address generation mode is EUI64.

In all other cases, revert back to the regular addrconf_addr_gen().

Also, remove the special case for ip6gre interfaces in add_v4_addrs(),
since ip6gre devices now always use addrconf_addr_gen() instead.

Fixes: e5dd729460ca ("ip/ip6_gre: use the same logic as SIT interfaces when computing v6LL address")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/559c32ce5c9976b269e6337ac9abb6a96abe5096.1741375285.git.gnault@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: hsr: Add KUnit test for PRP

Add unit tests for the PRP duplicate detection

Signed-off-by: Jaakko Karrenpalo <jkarrenpalo@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250307161700.1045-2-jkarrenpalo@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: hsr: Fix PRP duplicate detection

Add PRP specific function for handling duplicate
packets. This is needed because of potential
L2 802.1p prioritization done by network switches.

The L2 prioritization can re-order the PRP packets
from a node causing the existing implementation to
discard the frame(s) that have been received 'late'
because the sequence number is before the previous
received packet. This can happen if the node is
sending multiple frames back-to-back with different
priority.

Signed-off-by: Jaakko Karrenpalo <jkarrenpalo@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250307161700.1045-1-jkarrenpalo@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

netfilter: nft_exthdr: fix offset with ipv4_find_option()

There is an incorrect calculation in the offset variable which causes
the nft_skb_copy_to_reg() function to always return -EFAULT. Adding the
start variable is redundant. In the __ip_options_compile() function the
correct offset is specified when finding the function. There is no need
to add the size of the iphdr structure to the offset.

Fixes: dbb5281a1f84 ("netfilter: nf_tables: add support for matching IPv4 options")
Signed-off-by: Alexey Kashavkin <akashavkin@gmail.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

net: cn23xx: fix typos

This patch fixes a few typos, spelling mistakes, and a bit of grammar,
increasing the comments readability.

Signed-off-by: Janik Haag <janik@aq0.de>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250307145648.1679912-2-janik@aq0.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: hns3: use string choices helper

Use string choices helper for better readability.

Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250307113733.819448-1-shaojijie@huawei.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge tag 'sched_ext-for-6.14-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext fix from Tejun Heo:
"BPF schedulers could trigger a crash by passing in an invalid CPU to
  the scx_bpf_select_cpu_dfl() helper.

  Fix it by verifying input validity"

* tag 'sched_ext-for-6.14-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: Validate prev_cpu in scx_bpf_select_cpu_dfl()

Merge tag 'spi-fix-v6.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi

Pull spi fixes from Mark Brown:
"A couple of driver specific fixes, an error handling fix for the Atmel
  QuadSPI driver and a fix for a nasty synchronisation issue in the data
  path for the Microchip driver which affects larger transfers.

  There's also a MAINTAINERS update for the Samsung driver"

* tag 'spi-fix-v6.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
  spi: microchip-core: prevent RX overflows when transmit size > FIFO size
  MAINTAINERS: add tambarus as R for Samsung SPI
  spi: atmel-quadspi: remove references to runtime PM on error path

netdevsim: 'support' multi-buf XDP

Don't error out on large MTU if XDP is multi-buf.
The ping test now tests ping with XDP and high MTU.
netdevsim doesn't actually run the prog (yet?) so
it doesn't matter if the prog was multi-buf..

Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Link: https://patch.msgid.link/20250311092820.542148-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-remove-rtnl_lock-from-the-callers-of-queue-apis'

Stanislav Fomichev says:

====================
net: remove rtnl_lock from the callers of queue APIs

All drivers that use queue management APIs already depend on the netdev
lock. Ultimately, we want to have most of the paths that work with
specific netdev to be rtnl_lock-free (ethtool mostly in particular).
Queue API currently has a much smaller API surface, so start with
rtnl_lock from it:

- add mutex to each dmabuf binding (to replace rtnl_lock)
- move netdev lock management to the callers of netdev_rx_queue_restart
and drop rtnl_lock
====================

Link: https://patch.msgid.link/20250311144026.4154277-1-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: drop rtnl_lock for queue_mgmt operations

All drivers that use queue API are already converted to use
netdev instance lock. Move netdev instance lock management to
the netlink layer and drop rtnl_lock.

Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Mina Almasry. <almasrymina@google.com>
Link: https://patch.msgid.link/20250311144026.4154277-4-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: add granular lock for the netdev netlink socket

As we move away from rtnl_lock for queue ops, introduce
per-netdev_nl_sock lock.

Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20250311144026.4154277-3-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: create netdev_nl_sock to wrap bindings list

No functional changes. Next patches will add more granular locking
to netdev_nl_sock.

Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20250311144026.4154277-2-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: Avoid unnecessary use of comma operator

Although it does not seem to have any untoward side-effects,
the use of ';' to separate to assignments seems more appropriate than ','.

Flagged by clang-19 -Wcomma

No functional change intended.
Compile tested only.

Signed-off-by: Simon Horman <horms@kernel.org>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250307-mlx5-comma-v1-1-934deb6927bb@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: net: bump GRO timeout for gro/setup_veth

Commit 51bef03e1a71 ("selftests/net: deflake GRO tests") recently
switched to NAPI suspension, and lowered the timeout from 1ms to 100us.
This started causing flakes in netdev-run CI. Let's bump it to 200us.
In a quick test of a debug kernel I see failures with 100us, with 200us
in 5 runs I see 2 completely clean runs and 3 with a single retry
(GRO test will retry up to 5 times).

Reviewed-by: Kevin Krakauer <krakauer@google.com>
Link: https://patch.msgid.link/20250310110821.385621-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

eth: bnxt: add missing netdev lock management to bnxt_dl_reload_up

bnxt_dl_reload_up is completely missing instance lock management
which can result in `devlink dev reload` leaving with instance
lock held. Add the missing calls.

Also add netdev_assert_locked to make it clear that the up() method
is running with the instance lock grabbed.

v2:
- add net/netdev_lock.h include to bnxt_devlink.c for netdev_assert_locked

Fixes: 004b5008016a ("eth: bnxt: remove most dependencies on RTNL")
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250309215851.2003708-3-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

eth: bnxt: request unconditional ops lock

netdev_lock_ops conditionally grabs instance lock when queue_mgmt_ops
is defined. However queue_mgmt_ops support is signaled via FW
so we can sometimes boot without queue_mgmt_ops being set.
This will result in bnxt running without instance lock which
the driver now heavily depends on. Set request_ops_lock to true
unconditionally to always request netdev instance lock.

Fixes: 004b5008016a ("eth: bnxt: remove most dependencies on RTNL")
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250309215851.2003708-2-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

eth: bnxt: switch to netif_close

All (error) paths that call dev_close are already holding instance lock,
so switch to netif_close to avoid the deadlock.

v2:
- add missing EXPORT_MODULE for netif_close

Fixes: 004b5008016a ("eth: bnxt: remove most dependencies on RTNL")
Reported-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250309215851.2003708-1-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: revert to lockless TC_SETUP_BLOCK and TC_SETUP_FT

There is a couple of places from which we can arrive to ndo_setup_tc
with TC_SETUP_BLOCK/TC_SETUP_FT:
- netlink
- netlink notifier
- netdev notifier

Locking netdev too deep in this call chain seems to be problematic
(especially assuming some/all of the call_netdevice_notifiers
NETDEV_UNREGISTER) might soon be running with the instance lock).
Revert to lockless ndo_setup_tc for TC_SETUP_BLOCK/TC_SETUP_FT. NFT
framework already takes care of most of the locking. Document
the assumptions.

ndo_setup_tc TC_SETUP_BLOCK
  nft_block_offload_cmd
    nft_chain_offload_cmd
      nft_flow_block_chain
        nft_flow_offload_chain
  nft_flow_rule_offload_abort
    nft_flow_rule_offload_commit
  nft_flow_rule_offload_commit
    nf_tables_commit
      nfnetlink_rcv_batch
        nfnetlink_rcv_skb_batch
  nfnetlink_rcv
nft_offload_netdev_event
  NETDEV_UNREGISTER notifier

ndo_setup_tc TC_SETUP_FT
  nf_flow_table_offload_cmd
    nf_flow_table_offload_setup
      nft_unregister_flowtable_hook
        nft_register_flowtable_net_hooks
  nft_flowtable_update
  nf_tables_newflowtable
    nfnetlink_rcv_batch (.call NFNL_CB_BATCH)
nft_flowtable_update
  nf_tables_newflowtable
nft_flowtable_event
  nf_tables_flowtable_event
    NETDEV_UNREGISTER notifier
      __nft_unregister_flowtable_net_hooks
        nft_unregister_flowtable_net_hooks
  nf_tables_commit
    nfnetlink_rcv_batch (.call NFNL_CB_BATCH)
  __nf_tables_abort
    nf_tables_abort
      nfnetlink_rcv_batch
__nft_release_hook
  __nft_release_hooks
    nf_tables_pre_exit_net -> module unload
  nft_rcv_nl_event
    netlink_register_notifier (oh boy)
      nft_register_flowtable_net_hooks
       nft_flowtable_update
  nf_tables_newflowtable
        nf_tables_newflowtable

Fixes: c4f0f30b424e ("net: hold netdev instance lock during nft ndo_setup_tc")
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Reported-by: syzbot+0afb4bcf91e5a1afdcad@syzkaller.appspotmail.com
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250308044726.1193222-1-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net_sched-prevent-creation-of-classes-with-tc_h_root'

Cong Wang says:

====================
net_sched: Prevent creation of classes with TC_H_ROOT

This patchset contains a bug fix and its TDC test case.
====================

Link: https://patch.msgid.link/20250306232355.93864-1-xiyou.wangcong@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests/tc-testing: Add a test case for DRR class with TC_H_ROOT

Integrate the reproduer from Mingi to TDC.

All test results:

1..4
ok 1 0385 - Create DRR with default setting
ok 2 2375 - Delete DRR with handle
ok 3 3092 - Show DRR class
ok 4 4009 - Reject creation of DRR class with classid TC_H_ROOT

Cc: Mingi Cho <mincho@theori.io>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250306232355.93864-3-xiyou.wangcong@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net_sched: Prevent creation of classes with TC_H_ROOT

The function qdisc_tree_reduce_backlog() uses TC_H_ROOT as a termination
condition when traversing up the qdisc tree to update parent backlog
counters. However, if a class is created with classid TC_H_ROOT, the
traversal terminates prematurely at this class instead of reaching the
actual root qdisc, causing parent statistics to be incorrectly maintained.
In case of DRR, this could lead to a crash as reported by Mingi Cho.

Prevent the creation of any Qdisc class with classid TC_H_ROOT
(0xFFFFFFFF) across all qdisc types, as suggested by Jamal.

Reported-by: Mingi Cho <mincho@theori.io>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Fixes: 066a3b5b2346 ("[NET_SCHED] sch_api: fix qdisc_tree_decrease_qlen() loop")
Link: https://patch.msgid.link/20250306232355.93864-2-xiyou.wangcong@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipvs: prevent integer overflow in do_ip_vs_get_ctl()

The get->num_services variable is an unsigned int which is controlled by
the user.  The struct_size() function ensures that the size calculation
does not overflow an unsigned long, however, we are saving the result to
an int so the calculation can overflow.

Both "len" and "get->num_services" come from the user.  This check is
just a sanity check to help the user and ensure they are using the API
correctly.  An integer overflow here is not a big deal.  This has no
security impact.

Save the result from struct_size() type size_t to fix this integer
overflow bug.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

selftests: netfilter: skip br_netfilter queue tests if kernel is tainted

These scripts fail if the kernel is tainted which leads to wrong test
failure reports in CI environments when an unrelated test triggers some
splat.

Check taint state at start of script and SKIP if its already dodgy.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_conncount: Fully initialize struct nf_conncount_tuple in insert_tree()

Since commit b36e4523d4d5 ("netfilter: nf_conncount: fix garbage
collection confirm race"), `cpu` and `jiffies32` were introduced to
the struct nf_conncount_tuple.

The commit made nf_conncount_add() initialize `conn->cpu` and
`conn->jiffies32` when allocating the struct.
In contrast, count_tree() was not changed to initialize them.

By commit 34848d5c896e ("netfilter: nf_conncount: Split insert and
traversal"), count_tree() was split and the relevant allocation
code now resides in insert_tree().
Initialize `conn->cpu` and `conn->jiffies32` in insert_tree().

BUG: KMSAN: uninit-value in find_or_evict net/netfilter/nf_conncount.c:117 [inline]
BUG: KMSAN: uninit-value in __nf_conncount_add+0xd9c/0x2850 net/netfilter/nf_conncount.c:143
find_or_evict net/netfilter/nf_conncount.c:117 [inline]
__nf_conncount_add+0xd9c/0x2850 net/netfilter/nf_conncount.c:143
count_tree net/netfilter/nf_conncount.c:438 [inline]
nf_conncount_count+0x82f/0x1e80 net/netfilter/nf_conncount.c:521
connlimit_mt+0x7f6/0xbd0 net/netfilter/xt_connlimit.c:72
__nft_match_eval net/netfilter/nft_compat.c:403 [inline]
nft_match_eval+0x1a5/0x300 net/netfilter/nft_compat.c:433
expr_call_ops_eval net/netfilter/nf_tables_core.c:240 [inline]
nft_do_chain+0x426/0x2290 net/netfilter/nf_tables_core.c:288
nft_do_chain_ipv4+0x1a5/0x230 net/netfilter/nft_chain_filter.c:23
nf_hook_entry_hookfn include/linux/netfilter.h:154 [inline]
nf_hook_slow+0xf4/0x400 net/netfilter/core.c:626
nf_hook_slow_list+0x24d/0x860 net/netfilter/core.c:663
NF_HOOK_LIST include/linux/netfilter.h:350 [inline]
ip_sublist_rcv+0x17b7/0x17f0 net/ipv4/ip_input.c:633
ip_list_rcv+0x9ef/0xa40 net/ipv4/ip_input.c:669
__netif_receive_skb_list_ptype net/core/dev.c:5936 [inline]
__netif_receive_skb_list_core+0x15c5/0x1670 net/core/dev.c:5983
__netif_receive_skb_list net/core/dev.c:6035 [inline]
netif_receive_skb_list_internal+0x1085/0x1700 net/core/dev.c:6126
netif_receive_skb_list+0x5a/0x460 net/core/dev.c:6178
xdp_recv_frames net/bpf/test_run.c:280 [inline]
xdp_test_run_batch net/bpf/test_run.c:361 [inline]
bpf_test_run_xdp_live+0x2e86/0x3480 net/bpf/test_run.c:390
bpf_prog_test_run_xdp+0xf1d/0x1ae0 net/bpf/test_run.c:1316
bpf_prog_test_run+0x5e5/0xa30 kernel/bpf/syscall.c:4407
__sys_bpf+0x6aa/0xd90 kernel/bpf/syscall.c:5813
__do_sys_bpf kernel/bpf/syscall.c:5902 [inline]
__se_sys_bpf kernel/bpf/syscall.c:5900 [inline]
__ia32_sys_bpf+0xa0/0xe0 kernel/bpf/syscall.c:5900
ia32_sys_call+0x394d/0x4180 arch/x86/include/generated/asm/syscalls_32.h:358
do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline]
__do_fast_syscall_32+0xb0/0x110 arch/x86/entry/common.c:387
do_fast_syscall_32+0x38/0x80 arch/x86/entry/common.c:412
do_SYSENTER_32+0x1f/0x30 arch/x86/entry/common.c:450
entry_SYSENTER_compat_after_hwframe+0x84/0x8e

Uninit was created at:
slab_post_alloc_hook mm/slub.c:4121 [inline]
slab_alloc_node mm/slub.c:4164 [inline]
kmem_cache_alloc_noprof+0x915/0xe10 mm/slub.c:4171
insert_tree net/netfilter/nf_conncount.c:372 [inline]
count_tree net/netfilter/nf_conncount.c:450 [inline]
nf_conncount_count+0x1415/0x1e80 net/netfilter/nf_conncount.c:521
connlimit_mt+0x7f6/0xbd0 net/netfilter/xt_connlimit.c:72
__nft_match_eval net/netfilter/nft_compat.c:403 [inline]
nft_match_eval+0x1a5/0x300 net/netfilter/nft_compat.c:433
expr_call_ops_eval net/netfilter/nf_tables_core.c:240 [inline]
nft_do_chain+0x426/0x2290 net/netfilter/nf_tables_core.c:288
nft_do_chain_ipv4+0x1a5/0x230 net/netfilter/nft_chain_filter.c:23
nf_hook_entry_hookfn include/linux/netfilter.h:154 [inline]
nf_hook_slow+0xf4/0x400 net/netfilter/core.c:626
nf_hook_slow_list+0x24d/0x860 net/netfilter/core.c:663
NF_HOOK_LIST include/linux/netfilter.h:350 [inline]
ip_sublist_rcv+0x17b7/0x17f0 net/ipv4/ip_input.c:633
ip_list_rcv+0x9ef/0xa40 net/ipv4/ip_input.c:669
__netif_receive_skb_list_ptype net/core/dev.c:5936 [inline]
__netif_receive_skb_list_core+0x15c5/0x1670 net/core/dev.c:5983
__netif_receive_skb_list net/core/dev.c:6035 [inline]
netif_receive_skb_list_internal+0x1085/0x1700 net/core/dev.c:6126
netif_receive_skb_list+0x5a/0x460 net/core/dev.c:6178
xdp_recv_frames net/bpf/test_run.c:280 [inline]
xdp_test_run_batch net/bpf/test_run.c:361 [inline]
bpf_test_run_xdp_live+0x2e86/0x3480 net/bpf/test_run.c:390
bpf_prog_test_run_xdp+0xf1d/0x1ae0 net/bpf/test_run.c:1316
bpf_prog_test_run+0x5e5/0xa30 kernel/bpf/syscall.c:4407
__sys_bpf+0x6aa/0xd90 kernel/bpf/syscall.c:5813
__do_sys_bpf kernel/bpf/syscall.c:5902 [inline]
__se_sys_bpf kernel/bpf/syscall.c:5900 [inline]
__ia32_sys_bpf+0xa0/0xe0 kernel/bpf/syscall.c:5900
ia32_sys_call+0x394d/0x4180 arch/x86/include/generated/asm/syscalls_32.h:358
do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline]
__do_fast_syscall_32+0xb0/0x110 arch/x86/entry/common.c:387
do_fast_syscall_32+0x38/0x80 arch/x86/entry/common.c:412
do_SYSENTER_32+0x1f/0x30 arch/x86/entry/common.c:450
entry_SYSENTER_compat_after_hwframe+0x84/0x8e

Reported-by: syzbot+83fed965338b573115f7@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=83fed965338b573115f7
Fixes: b36e4523d4d5 ("netfilter: nf_conncount: fix garbage collection confirm race")
Signed-off-by: Kohei Enju <enjuk@amazon.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

Merge tag 'wireless-2025-03-12' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless

Johannes berg says:

====================
Few more fixes:
- cfg80211/mac80211
   - stop possible runaway wiphy worker
   - EHT should not use reserved MPDU size bits
   - don't run worker for stopped interfaces
   - fix SA Query processing with MLO
   - fix lookup of assoc link BSS entries
   - correct station flush on unauthorize
- iwlwifi:
   - TSO fixes
   - fix non-MSI-X platforms
   - stop possible runaway restart worker
- rejigger maintainers so I'm not CC'ed on
   everything
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

wifi: mac80211: fix MPDU length parsing for EHT 5/6 GHz

The MPDU length is only configured using the EHT capabilities element on
2.4 GHz. On 5/6 GHz it is configured using the VHT or HE capabilities
respectively.

Fixes: cf0079279727 ("wifi: mac80211: parse A-MSDU len from EHT capabilities")
Reviewed-by: Miriam Rachel Korenblit <miriam.rachel.korenblit@intel.com>
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Link: https://patch.msgid.link/20250311121704.0634d31f0883.I28063e4d3ef7d296b7e8a1c303460346a30bf09c@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

Merge tag 'hyperv-fixes-signed-20250311' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux

Pull hyperv fixes from Wei Liu:

- Patches to fix Hyper-v framebuffer code (Michael Kelley and Saurabh
   Sengar)

- Fix for Hyper-V output argument to hypercall that changes page
   visibility (Michael Kelley)

- Fix for Hyper-V VTL mode (Naman Jain)

* tag 'hyperv-fixes-signed-20250311' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
  Drivers: hv: vmbus: Don't release fb_mmio resource in vmbus_free_mmio()
  x86/hyperv: Fix output argument to hypercall that changes page visibility
  fbdev: hyperv_fb: Allow graceful removal of framebuffer
  fbdev: hyperv_fb: Simplify hvfb_putmem
  fbdev: hyperv_fb: Fix hang in kdump kernel when on Hyper-V Gen 2 VMs
  drm/hyperv: Fix address space leak when Hyper-V DRM device is removed
  fbdev: hyperv_fb: iounmap() the correct memory when removing a device
  x86/hyperv/vtl: Stop kernel from probing VTL0 low memory

Merge tag 'pinctrl-v6.14-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl

Pull pin control fixes from Linus Walleij:

- Fix the regmap settings for bcm281xx, this was missing the stride

- NULL check for the Nuvoton npcm8xx devm_kasprintf()

- Enable the Spacemit pin controller by default in the SoC config. The
   SoC will not boot without it so this one is pretty much required

* tag 'pinctrl-v6.14-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
  pinctrl: spacemit: enable config option
  pinctrl: nuvoton: npcm8xx: Add NULL check in npcm8xx_gpio_fw
  pinctrl: bcm281xx: Fix incorrect regmap max_registers value

qlcnic: fix memory leak issues in qlcnic_sriov_common.c

Add qlcnic_sriov_free_vlans() in qlcnic_sriov_alloc_vlans() if
any sriov_vlans fails to be allocated.
Add qlcnic_sriov_free_vlans() to free the memory allocated by
qlcnic_sriov_alloc_vlans() if "sriov->allowed_vlans" fails to
be allocated.

Fixes: 91b7282b613d ("qlcnic: Support VLAN id config.")
Cc: stable@vger.kernel.org
Signed-off-by: Haoxiang Li <haoxiang_li2024@163.com>
Link: https://patch.msgid.link/20250307094952.14874-1-haoxiang_li2024@163.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

docs: netdev: add a note on selftest posting

We haven't had much discussion on the list about this, but
a handful of people have been confused about rules on
posting selftests for fixes, lately. I tend to post fixes
with their respective selftests in the same series.
There are tradeoffs around size of the net tree and conflicts
but so far it hasn't been a major issue.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Link: https://patch.msgid.link/20250306180533.1864075-1-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

rtase: Fix improper release of ring list entries in rtase_sw_reset

Since rtase_init_ring, which is called within rtase_sw_reset, adds ring
entries already present in the ring list back into the list, it causes
the ring list to form a cycle. This results in list_for_each_entry_safe
failing to find an endpoint during traversal, leading to an error.
Therefore, it is necessary to remove the previously added ring_list nodes
before calling rtase_init_ring.

Fixes: 079600489960 ("rtase: Implement net_device_ops")
Signed-off-by: Justin Lai <justinlai0215@realtek.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250306070510.18129-1-justinlai0215@realtek.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'bonding-fix-incorrect-mac-address-setting'

Hangbin Liu says:

====================
bonding: fix incorrect mac address setting

The mac address on backup slave should be convert from Solicited-Node
Multicast address, not from bonding unicast target address.
====================

Link: https://patch.msgid.link/20250306023923.38777-1-liuhangbin@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

selftests: bonding: fix incorrect mac address

The correct mac address for NS target 2001:db8::254 is 33:33:ff:00:02:54,
not 33:33:00:00:02:54. The same with client maddress.

Fixes: 86fb6173d11e ("selftests: bonding: add ns multicast group testing")
Acked-by: Jay Vosburgh <jv@jvosburgh.net>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250306023923.38777-3-liuhangbin@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

bonding: fix incorrect MAC address setting to receive NS messages

When validation on the backup slave is enabled, we need to validate the
Neighbor Solicitation (NS) messages received on the backup slave. To
receive these messages, the correct destination MAC address must be added
to the slave. However, the target in bonding is a unicast address, which
we cannot use directly. Instead, we should first convert it to a
Solicited-Node Multicast Address and then derive the corresponding MAC
address.

Fix the incorrect MAC address setting on both slave_set_ns_maddr() and
slave_set_ns_maddrs(). Since the two function names are similar. Add
some description for the functions. Also only use one mac_addr variable
in slave_set_ns_maddr() to save some code and logic.

Fixes: 8eb36164d1a6 ("bonding: add ns target multicast address to slave device")
Acked-by: Jay Vosburgh <jv@jvosburgh.net>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250306023923.38777-2-liuhangbin@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: mctp: unshare packets when reassembling

Ensure that the frag_list used for reassembly isn't shared with other
packets. This avoids incorrect reassembly when packets are cloned, and
prevents a memory leak due to circular references between fragments and
their skb_shared_info.

The upcoming MCTP-over-USB driver uses skb_clone which can trigger the
problem - other MCTP drivers don't share SKBs.

A kunit test is added to reproduce the issue.

Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
Fixes: 4a992bbd3650 ("mctp: Implement message fragmentation & reassembly")
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250306-matt-mctp-usb-v1-1-085502b3dd28@codeconstruct.com.au
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

vboxsf: Add __nonstring annotations for unterminated strings

When a character array without a terminating NUL character has a static
initializer, GCC 15's -Wunterminated-string-initialization will only
warn if the array lacks the "nonstring" attribute[1]. Mark the arrays
with __nonstring to and correctly identify the char array as "not a C
string" and thereby eliminate the warning.

This effectively reverts the change in 4e7487245abc ("vboxsf: fix building
with GCC 15"), to add the annotation that has other uses (i.e. warning
if the string is ever used with C string APIs).

Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117178
Cc: Hans de Goede <hdegoede@redhat.com>
Cc: Brahmajit Das <brahmajit.xyz@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Kees Cook <kees@kernel.org>
Link: https://lore.kernel.org/r/20250310222530.work.374-kees@kernel.org
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>

net: switchdev: Convert blocking notification chain to a raw one

A blocking notification chain uses a read-write semaphore to protect the
integrity of the chain. The semaphore is acquired for writing when
adding / removing notifiers to / from the chain and acquired for reading
when traversing the chain and informing notifiers about an event.

In case of the blocking switchdev notification chain, recursive
notifications are possible which leads to the semaphore being acquired
twice for reading and to lockdep warnings being generated [1].

Specifically, this can happen when the bridge driver processes a
SWITCHDEV_BRPORT_UNOFFLOADED event which causes it to emit notifications
about deferred events when calling switchdev_deferred_process().

Fix this by converting the notification chain to a raw notification
chain in a similar fashion to the netdev notification chain. Protect
the chain using the RTNL mutex by acquiring it when modifying the chain.
Events are always informed under the RTNL mutex, but add an assertion in
call_switchdev_blocking_notifiers() to make sure this is not violated in
the future.

Maintain the "blocking" prefix as events are always emitted from process
context and listeners are allowed to block.

[1]:
WARNING: possible recursive locking detected
6.14.0-rc4-custom-g079270089484 #1 Not tainted
--------------------------------------------
ip/52731 is trying to acquire lock:
ffffffff850918d8 ((switchdev_blocking_notif_chain).rwsem){++++}-{4:4}, at: blocking_notifier_call_chain+0x58/0xa0

but task is already holding lock:
ffffffff850918d8 ((switchdev_blocking_notif_chain).rwsem){++++}-{4:4}, at: blocking_notifier_call_chain+0x58/0xa0

other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock((switchdev_blocking_notif_chain).rwsem);
lock((switchdev_blocking_notif_chain).rwsem);

*** DEADLOCK ***
May be due to missing lock nesting notation
3 locks held by ip/52731:
#0: ffffffff84f795b0 (rtnl_mutex){+.+.}-{4:4}, at: rtnl_newlink+0x727/0x1dc0
#1: ffffffff8731f628 (&net->rtnl_mutex){+.+.}-{4:4}, at: rtnl_newlink+0x790/0x1dc0
#2: ffffffff850918d8 ((switchdev_blocking_notif_chain).rwsem){++++}-{4:4}, at: blocking_notifier_call_chain+0x58/0xa0

stack backtrace:
...
? __pfx_down_read+0x10/0x10
? __pfx_mark_lock+0x10/0x10
? __pfx_switchdev_port_attr_set_deferred+0x10/0x10
blocking_notifier_call_chain+0x58/0xa0
switchdev_port_attr_notify.constprop.0+0xb3/0x1b0
? __pfx_switchdev_port_attr_notify.constprop.0+0x10/0x10
? mark_held_locks+0x94/0xe0
? switchdev_deferred_process+0x11a/0x340
switchdev_port_attr_set_deferred+0x27/0xd0
switchdev_deferred_process+0x164/0x340
br_switchdev_port_unoffload+0xc8/0x100 [bridge]
br_switchdev_blocking_event+0x29f/0x580 [bridge]
notifier_call_chain+0xa2/0x440
blocking_notifier_call_chain+0x6e/0xa0
switchdev_bridge_port_unoffload+0xde/0x1a0
...

Fixes: f7a70d650b0b6 ("net: bridge: switchdev: Ensure deferred event delivery on unoffload")
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Tested-by: Vladimir Oltean <olteanv@gmail.com>
Link: https://patch.msgid.link/20250305121509.631207-1-amcohen@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'net-ti-icssg-prueth-add-native-mode-xdp-support'

Meghana Malladi says:

====================
net: ti: icssg-prueth: Add native mode XDP support

This series adds native XDP support using page_pool.
XDP zero copy support is not included in this patch series.

Patch 1/3: Replaces skb with page pool for Rx buffer allocation
Patch 2/3: Adds prueth_swdata struct for SWDATA for all swdata cases
Patch 3/3: Introduces native mode XDP support

v3: https://lore.kernel.org/all/20250224110102.1528552-1-m-malladi@ti.com/
====================

Link: https://patch.msgid.link/20250305101422.1908370-1-m-malladi@ti.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: ti: icssg-prueth: Add XDP support

Add native XDP support. We do not support zero copy yet.

Signed-off-by: Roger Quadros <rogerq@kernel.org>
Signed-off-by: MD Danish Anwar <danishanwar@ti.com>
Signed-off-by: Meghana Malladi <m-malladi@ti.com>
Link: https://patch.msgid.link/20250305101422.1908370-4-m-malladi@ti.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: ti: icssg-prueth: introduce and use prueth_swdata struct for SWDATA

We have different cases for SWDATA (skb, page, cmd, etc)
so it is better to have a dedicated data structure for that.
We can embed the type field inside the struct and use it
to interpret the data in completion handlers.

Signed-off-by: Roger Quadros <rogerq@kernel.org>
Signed-off-by: MD Danish Anwar <danishanwar@ti.com>
Signed-off-by: Meghana Malladi <m-malladi@ti.com>
Link: https://patch.msgid.link/20250305101422.1908370-3-m-malladi@ti.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: ti: icssg-prueth: Use page_pool API for RX buffer allocation

This is to prepare for native XDP support.

The page pool API is more faster in allocating pages than
__alloc_skb(). Drawback is that it works at PAGE_SIZE granularity
so we are not efficient in memory usage.
i.e. we are using PAGE_SIZE (4KB) memory for 1.5KB max packet size.

Signed-off-by: Roger Quadros <rogerq@kernel.org>
Signed-off-by: MD Danish Anwar <danishanwar@ti.com>
Signed-off-by: Meghana Malladi <m-malladi@ti.com>
Link: https://patch.msgid.link/20250305101422.1908370-2-m-malladi@ti.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'enic-enable-32-64-byte-cqes-and-get-max-rx-tx-ring-size-from-hw'

Satish Kharat via says:

====================
enic: enable 32, 64 byte cqes and get max rx/tx ring size from hw

This series enables using the max rx and tx ring sizes read from hw.
For newer hw that can be up to 16k entries. This requires bigger
completion entries for rx queues. This series enables the use of the
32 and 64 byte completion queues entries for enic rx queues on
supported hw versions. This is in addition to the exiting (default)
16 byte rx cqes.

Signed-off-by: Satish Kharat <satishkh@cisco.com>
====================

Link: https://patch.msgid.link/20250304-enic_cleanup_and_ext_cq-v2-0-85804263dad8@cisco.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

enic: get max rq & wq entries supported by hw, 16K queues

Enables reading the max rq and wq entries supported from the hw.
Enables 16k rq and wq entries on hw that supports.

Co-developed-by: Nelson Escobar <neescoba@cisco.com>
Signed-off-by: Nelson Escobar <neescoba@cisco.com>
Co-developed-by: John Daley <johndale@cisco.com>
Signed-off-by: John Daley <johndale@cisco.com>
Signed-off-by: Satish Kharat <satishkh@cisco.com>
Link: https://patch.msgid.link/20250304-enic_cleanup_and_ext_cq-v2-8-85804263dad8@cisco.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

enic: cleanup of enic wq request completion path

Cleans up the enic wq request completion path needed for 16k wq size
support.

Co-developed-by: Nelson Escobar <neescoba@cisco.com>
Signed-off-by: Nelson Escobar <neescoba@cisco.com>
Co-developed-by: John Daley <johndale@cisco.com>
Signed-off-by: John Daley <johndale@cisco.com>
Signed-off-by: Satish Kharat <satishkh@cisco.com>
Link: https://patch.msgid.link/20250304-enic_cleanup_and_ext_cq-v2-7-85804263dad8@cisco.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

enic: added enic_wq.c and enic_wq.h

Moves wq related function to enic_wq.c. Prepares for
a cleaup of enic wq code path.

Co-developed-by: Nelson Escobar <neescoba@cisco.com>
Signed-off-by: Nelson Escobar <neescoba@cisco.com>
Co-developed-by: John Daley <johndale@cisco.com>
Signed-off-by: John Daley <johndale@cisco.com>
Signed-off-by: Satish Kharat <satishkh@cisco.com>
Link: https://patch.msgid.link/20250304-enic_cleanup_and_ext_cq-v2-6-85804263dad8@cisco.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

enic: remove unused function cq_enet_wq_desc_dec

Removes cq_enet_wq_desc_dec, not needed anymore.

Co-developed-by: Nelson Escobar <neescoba@cisco.com>
Signed-off-by: Nelson Escobar <neescoba@cisco.com>
Co-developed-by: John Daley <johndale@cisco.com>
Signed-off-by: John Daley <johndale@cisco.com>
Signed-off-by: Satish Kharat <satishkh@cisco.com>
Link: https://patch.msgid.link/20250304-enic_cleanup_and_ext_cq-v2-5-85804263dad8@cisco.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

enic: enable rq extended cq support

Enables getting from hw all the supported rq cq sizes and
uses the highest supported cq size.

Co-developed-by: Nelson Escobar <neescoba@cisco.com>
Signed-off-by: Nelson Escobar <neescoba@cisco.com>
Co-developed-by: John Daley <johndale@cisco.com>
Signed-off-by: John Daley <johndale@cisco.com>
Signed-off-by: Satish Kharat <satishkh@cisco.com>
Link: https://patch.msgid.link/20250304-enic_cleanup_and_ext_cq-v2-4-85804263dad8@cisco.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

enic: enic rq extended cq defines

Adds the defines for 32 and 64 byte receive queue completion queue
descriptors.
Adds devcmd define to get rq cq descriptor size/s supported by hw.

Co-developed-by: Nelson Escobar <neescoba@cisco.com>
Signed-off-by: Nelson Escobar <neescoba@cisco.com>
Co-developed-by: John Daley <johndale@cisco.com>
Signed-off-by: John Daley <johndale@cisco.com>
Signed-off-by: Satish Kharat <satishkh@cisco.com>
Link: https://patch.msgid.link/20250304-enic_cleanup_and_ext_cq-v2-3-85804263dad8@cisco.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

enic: enic rq code reorg

Separates enic rx path from generic vnic api. Removes some
complexity of doign enic callbacks through vnic api in rx.
This is in preparation for enabling enic extended cq which
applies only to enic rx path.

Co-developed-by: Nelson Escobar <neescoba@cisco.com>
Signed-off-by: Nelson Escobar <neescoba@cisco.com>
Co-developed-by: John Daley <johndale@cisco.com>
Signed-off-by: John Daley <johndale@cisco.com>
Signed-off-by: Satish Kharat <satishkh@cisco.com>
Link: https://patch.msgid.link/20250304-enic_cleanup_and_ext_cq-v2-2-85804263dad8@cisco.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

enic: Move function from header file to c file

Moves cq_enet_rq_desc_dec from cq_enet_desc.h to enic_rq.c.
This is in preparation for enic extended completion queue
enabling.

Co-developed-by: Nelson Escobar <neescoba@cisco.com>
Signed-off-by: Nelson Escobar <neescoba@cisco.com>
Co-developed-by: John Daley <johndale@cisco.com>
Signed-off-by: John Daley <johndale@cisco.com>
Signed-off-by: Satish Kharat <satishkh@cisco.com>
Link: https://patch.msgid.link/20250304-enic_cleanup_and_ext_cq-v2-1-85804263dad8@cisco.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'mptcp-pm-code-reorganisation'

Matthieu Baerts says:

====================
mptcp: pm: code reorganisation

Before this series, the PM code was dispersed in different places:

- pm.c had common code for all PMs.

- pm_netlink.c was initially only about the in-kernel PM, but ended up
  also getting exported common helpers, callbacks used by the different
  PMs, NL events for PM userspace daemon, etc. quite confusing.

- pm_userspace.c had userspace PM only code, but it was using "specific"
  in-kernel PM helpers according to their names.

To clarify the code, a reorganisation is suggested here, only by moving
code around, and small helper renaming to avoid confusions:

- pm_netlink.c now only contains common PM generic Netlink code:
  - PM events: this code was already there
  - shared helpers around Netlink code that were already there as well
  - shared Netlink commands code from pm.c

- pm_kernel.c now contains only code that is specific to the in-kernel
  PM. Now all functions are either called from:
  - pm.c: events coming from the core, when this PM is being used
  - pm_netlink.c: for shared Netlink commands
  - mptcp_pm_gen.c: for Netlink commands specific to the in-kernel PM
  - sockopt.c: for the exported counters per netns

- pm.c got many code from pm_netlink.c:
  - helpers used from both PMs and not linked to Netlink
  - callbacks used by different PMs, e.g. ADD_ADDR management
  - some helpers have been renamed to remove the '_nl' prefix, and some
    have been marked as 'static'.

- protocol.h has been updated accordingly:
  - some helpers no longer need to be exported
  - new ones needed to be exported: they have been prefixed if needed.

The code around the PM is now less confusing, which should help for the
maintenance in the long term, and the introduction of a PM Ops.

This will certainly impact future backports, but because other cleanups
have already done recently, and more are coming to ease the addition of
a new path-manager controlled with BPF (struct_ops), doing that now
seems to be a good time. Also, many issues around the PM have been fixed
a few months ago while increasing the code coverage in the selftests, so
such big reorganisation can be done with more confidence now.

Note that checkpatch, when used with --max-line-length=80, will complain
about lines being over the 80 limits, but these warnings were already
there before moving the code around.

Also, patch 1 is not directly related to the code reorganisation, but it
was a remaining cleanup that we didn't upstream before, because it was
conflicting with another patch that has been sent for inclusion to the
net tree.
====================

Link: https://patch.msgid.link/20250307-net-next-mptcp-pm-reorg-v1-0-abef20ada03b@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: pm: move Netlink PM helpers to pm_netlink.c

Before this patch, the PM code was dispersed in different places:

- pm.c had common code for all PMs, but also Netlink specific code that
  will not be needed with the future BPF path-managers.

- pm_netlink.c had common Netlink code.

To clarify the code, a reorganisation is suggested here, only by moving
code around, and small helper renaming to avoid confusions:

- pm_netlink.c now only contains common PM Netlink code:
  - PM events: this code was already there
  - shared helpers around Netlink code that were already there as well
  - shared Netlink commands code from pm.c

- pm.c now no longer contain Netlink specific code.

- protocol.h has been updated accordingly:
  - mptcp_nl_fill_addr() no longer need to be exported.

The code around the PM is now less confusing, which should help for the
maintenance in the long term.

This will certainly impact future backports, but because other cleanups
have already done recently, and more are coming to ease the addition of
a new path-manager controlled with BPF (struct_ops), doing that now
seems to be a good time. Also, many issues around the PM have been fixed
a few months ago while increasing the code coverage in the selftests, so
such big reorganisation can be done with more confidence now.

No behavioural changes intended.

Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250307-net-next-mptcp-pm-reorg-v1-15-abef20ada03b@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: pm: split in-kernel PM specific code

Before this patch, the PM code was dispersed in different places:

- pm.c had common code for all PMs

- pm_netlink.c was supposed to be about the in-kernel PM, but also had
  exported common Netlink helpers, NL events for PM userspace daemons,
  etc. quite confusing.

To clarify the code, a reorganisation is suggested here, only by moving
code around to avoid confusions:

- pm_netlink.c now only contains common PM Netlink code:
  - PM events: this code was already there
  - shared helpers around Netlink code that were already there as well
  - more shared Netlink commands code from pm.c will come after

- pm_kernel.c now contains only code that is specific to the in-kernel
  PM. Now all functions are either called from:
  - pm.c: events coming from the core, when this PM is being used
  - pm_netlink.c: for shared Netlink commands
  - mptcp_pm_gen.c: for Netlink commands specific to the in-kernel PM
  - sockopt.c: for the exported counters per netns
  - (while at it, a useless 'return;' spot by checkpatch at the end of
     mptcp_pm_nl_set_flags_all, has been removed)

The code around the PM is now less confusing, which should help for the
maintenance in the long term.

This will certainly impact future backports, but because other cleanups
have already done recently, and more are coming to ease the addition of
a new path-manager controlled with BPF (struct_ops), doing that now
seems to be a good time. Also, many issues around the PM have been fixed
a few months ago while increasing the code coverage in the selftests, so
such big reorganisation can be done with more confidence now.

No behavioural changes intended.

Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250307-net-next-mptcp-pm-reorg-v1-14-abef20ada03b@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: pm: move generic PM helpers to pm.c

Before this patch, the PM code was dispersed in different places:

- pm.c had common code for all PMs

- pm_netlink.c was supposed to be about the in-kernel PM, but also had
  exported common helpers, callbacks used by the different PMs, NL
  events for PM userspace daemon, etc. quite confusing.

- pm_userspace.c had userspace PM only code, but using specific
  in-kernel PM helpers

To clarify the code, a reorganisation is suggested here, only by moving
code around, and (un)exporting functions:

- helpers used from both PMs and not linked to Netlink
- callbacks used by different PMs, e.g. ADD_ADDR management
- some helpers have been marked as 'static'
- protocol.h has been updated accordingly
- (while at it, a needless if before a kfree(), spot by checkpatch in
   mptcp_remove_anno_list_by_saddr(), has been removed)

The code around the PM is now less confusing, which should help for the
maintenance in the long term.

This will certainly impact future backports, but because other cleanups
have already done recently, and more are coming to ease the addition of
a new path-manager controlled with BPF (struct_ops), doing that now
seems to be a good time. Also, many issues around the PM have been fixed
a few months ago while increasing the code coverage in the selftests, so
such big reorganisation can be done with more confidence now.

No behavioural changes intended.

Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250307-net-next-mptcp-pm-reorg-v1-13-abef20ada03b@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: pm: move generic helper at the top

In prevision to another change importing all generic PM helpers from
pm_netlink.c to there.

No behavioural changes intended.

Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250307-net-next-mptcp-pm-reorg-v1-12-abef20ada03b@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: pm: export mptcp_remote_address

In a following commit, the 'remote_address' helper will need to be used
from different files.

It is then exported, and prefixed with 'mptcp_', similar to
'mptcp_local_address'.

No behavioural changes intended.

Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250307-net-next-mptcp-pm-reorg-v1-11-abef20ada03b@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: pm: worker: split in-kernel and common tasks

To make it clear what actions are in-kernel PM specific and which ones
are not and done for all PMs, e.g. sending ADD_ADDR and close associated
subflows when a RM_ADDR is received.

The behavioural is changed a bit: MPTCP_PM_ADD_ADDR_RECEIVED is now
treated after MPTCP_PM_ADD_ADDR_SEND_ACK and MPTCP_PM_RM_ADDR_RECEIVED,
but that should not change anything in practice.

Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250307-net-next-mptcp-pm-reorg-v1-10-abef20ada03b@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: pm: avoid calling PM specific code from core

When destroying an MPTCP socket, some userspace PM specific code was
called from mptcp_destroy_common() in protocol.c. That feels wrong, and
it is the only case.

Instead, the core now calls mptcp_pm_destroy() from pm.c which is now in
charge of cleaning the announced addresses list, and ask the different
PMs to do extra cleaning if needed, e.g. the userspace PM, if used, will
clean the local addresses list.

While at it, the userspace PM specific helper has been prefixed with
'mptcp_userspace_pm_' like the other ones.

No behavioural changes intended.

Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250307-net-next-mptcp-pm-reorg-v1-9-abef20ada03b@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: pm: kernel: add '_pm' to mptcp_nl_set_flags

Currently, in-kernel PM specific helpers are prefixed with
'mptcp_pm_nl_'. Here, '_pm' was missing from 'mptcp_nl_set_flags'.

Add '_pm' to be similar to others, and add '_all' to avoid confusions
witih the global 'mptcp_pm_nl_set_flags'.

No behavioural changes intended.

Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250307-net-next-mptcp-pm-reorg-v1-8-abef20ada03b@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: pm: remove '_nl' from mptcp_pm_nl_is_init_remote_addr

Currently, in-kernel PM specific helpers are prefixed with
'mptcp_pm_nl_'. But here 'mptcp_pm_nl_is_init_remote_addr' is not
specific to this PM: it is called from pm.c for both the in-kernel and
userspace PMs.

To avoid confusions, the '_nl' bit has been removed from the name.

No behavioural changes intended.

Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250307-net-next-mptcp-pm-reorg-v1-7-abef20ada03b@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: pm: remove '_nl' from mptcp_pm_nl_subflow_chk_stale()

Currently, in-kernel PM specific helpers are prefixed with
'mptcp_pm_nl_'. But here 'mptcp_pm_nl_subflow_chk_stale' is not specific
to this PM: it is called from pm.c for both the in-kernel and userspace
PMs.

To avoid confusions, the '_nl' bit has been removed from the name.

No behavioural changes intended.

Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250307-net-next-mptcp-pm-reorg-v1-6-abef20ada03b@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: pm: remove '_nl' from mptcp_pm_nl_rm_addr_received

Currently, in-kernel PM specific helpers are prefixed with
'mptcp_pm_nl_'. But here 'mptcp_pm_nl_rm_addr_received' is not specific
to this PM: it is called from the PM worker, and used by both the
in-kernel and userspace PMs. The helper has been renamed to
'mptcp_pm_rm_addr_recv' instead of '_received' to avoid confusions with
the one from pm.c.

mptcp_pm_nl_rm_addr_or_subflow', and 'mptcp_pm_nl_rm_subflow_received'
have been updated too for the same reason.

To avoid confusions, the '_nl' bit has been removed from the name.

While at it, the in-kernel PM specific code has been move from
mptcp_pm_rm_addr_or_subflow to a new dedicated helper, clearer.

No behavioural changes intended.

Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250307-net-next-mptcp-pm-reorg-v1-5-abef20ada03b@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: pm: remove '_nl' from mptcp_pm_nl_work

Currently, in-kernel PM specific helpers are prefixed with
'mptcp_pm_nl_'. But here 'mptcp_pm_nl_work' is not specific to this PM:
it is called from the core to call helpers, some of them needed by both
the in-kernel and userspace PMs.

To avoid confusions, the '_nl' bit has been removed from the name.

Also used 'worker' instead of 'work', similar to protocol.c's worker.

No behavioural changes intended.

Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250307-net-next-mptcp-pm-reorg-v1-4-abef20ada03b@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: pm: remove '_nl' from mptcp_pm_nl_mp_prio_send_ack

Currently, in-kernel PM specific helpers are prefixed with
'mptcp_pm_nl_'. But here 'mptcp_pm_nl_mp_prio_send_ack()' is not
specific to this PM: it is used by both the in-kernel and userspace PMs.

To avoid confusions, the '_nl' bit has been removed from the name.

No behavioural changes intended.

Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250307-net-next-mptcp-pm-reorg-v1-3-abef20ada03b@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: pm: remove '_nl' from mptcp_pm_nl_addr_send_ack

Currently, in-kernel PM specific helpers are prefixed with
'mptcp_pm_nl_'. But here 'mptcp_pm_nl_addr_send_ack()' is not specific
to this PM: it is used by both the in-kernel and userspace PMs.

To avoid confusions, the '_nl' bit has been removed from the name.

No behavioural changes intended.

Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250307-net-next-mptcp-pm-reorg-v1-2-abef20ada03b@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: pm: use addr entry for get_local_id

The following code in mptcp_userspace_pm_get_local_id() that assigns "skc"
to "new_entry" is not allowed in BPF if we use the same code to implement
the get_local_id() interface of a BFP path manager:

memset(&new_entry, 0, sizeof(struct mptcp_pm_addr_entry));
new_entry.addr = *skc;
new_entry.addr.id = 0;
new_entry.flags = MPTCP_PM_ADDR_FLAG_IMPLICIT;

To solve the issue, this patch moves this assignment to "new_entry" forward
to mptcp_pm_get_local_id(), and then passing "new_entry" as a parameter to
both mptcp_pm_nl_get_local_id() and mptcp_userspace_pm_get_local_id().

No behavioural changes intended.

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250307-net-next-mptcp-pm-reorg-v1-1-abef20ada03b@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

MAINTAINERS: sfc: remove Martin Habets

Martin has left AMD and no longer works on the sfc driver.

Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Link: https://patch.msgid.link/20250307154731.211368-1-edward.cree@amd.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'eth-bnxt-fix-several-bugs-in-the-bnxt-module'

Taehee Yoo says:

====================
eth: bnxt: fix several bugs in the bnxt module

The first fixes setting incorrect skb->truesize.
When xdp-mb prog returns XDP_PASS, skb is allocated and initialized.
Currently, The truesize is calculated as BNXT_RX_PAGE_SIZE *
sinfo->nr_frags, but sinfo->nr_frags is flushed by napi_build_skb().
So, it stores sinfo before calling napi_build_skb() and then use it
for calculate truesize.

The second fixes kernel panic in the bnxt_queue_mem_alloc().
The bnxt_queue_mem_alloc() accesses rx ring descriptor.
rx ring descriptors are allocated when the interface is up and it's
freed when the interface is down.
So, if bnxt_queue_mem_alloc() is called when the interface is down,
kernel panic occurs.
This patch makes the bnxt_queue_mem_alloc() return -ENETDOWN if rx ring
descriptors are not allocated.

The third patch fixes kernel panic in the bnxt_queue_{start | stop}().
When a queue is restarted bnxt_queue_{start | stop}() are called.
These functions set MRU to 0 to stop packet flow and then to set up the
remaining things.
MRU variable is a member of vnic_info[] the first vnic_info is for
default and the second is for ntuple.
The first vnic_info is always allocated when interface is up, but the
second is allocated only when ntuple is enabled.
(ethtool -K eth0 ntuple <on | off>).
Currently, the bnxt_queue_{start | stop}() access
vnic_info[BNXT_VNIC_NTUPLE] regardless of whether ntuple is enabled or
not.
So kernel panic occurs.
This patch make the bnxt_queue_{start | stop}() use bp->nr_vnics instead
of BNXT_VNIC_NTUPLE.

The fourth patch fixes a warning due to checksum state.
The bnxt_rx_pkt() checks whether skb->ip_summed is not CHECKSUM_NONE
before updating ip_summed. if ip_summed is not CHECKSUM_NONE, it WARNS
about it. However, the bnxt_xdp_build_skb() is called in XDP-MB-PASS
path and it updates ip_summed earlier than bnxt_rx_pkt().
So, in the XDP-MB-PASS path, the bnxt_rx_pkt() always warns about
checksum.
Updating ip_summed at the bnxt_xdp_build_skb() is unnecessary and
duplicate, so it is removed.

The fifth patch fixes a kernel panic in the
bnxt_get_queue_stats{rx | tx}().
The bnxt_get_queue_stats{rx | tx}() callback functions are called when
a queue is resetting.
These internally access rx and tx rings without null check, but rings
are allocated and initialized when the interface is up.
So, these functions are called when the interface is down, it
occurs a kernel panic.

The sixth patch fixes memory leak in queue reset logic.
When a queue is resetting, tpa_info is allocated for the new queue and
tpa_info for an old queue is not used anymore.
So it should be freed, but not.

The seventh patch makes net_devmem_unbind_dmabuf() ignore -ENETDOWN.
When devmem socket is closed, net_devmem_unbind_dmabuf() is called to
unbind/release resources.
If interface is down, the driver returns -ENETDOWN.
The -ENETDOWN return value is not an actual error, because the interface
will release resources when the interface is down.
So, net_devmem_unbind_dmabuf() needs to ignore -ENETDOWN.

The last patch adds XDP testcases to
tools/testing/selftests/drivers/net/ping.py.
====================

Link: https://patch.msgid.link/20250309134219.91670-1-ap420073@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: drv-net: add xdp cases for ping.py

ping.py has 3 cases, test_v4, test_v6 and test_tcp.
But these cases are not executed on the XDP environment.
So, it adds XDP environment, existing tests(test_v4, test_v6, and
test_tcp) are executed too on the below XDP environment.
So, it adds XDP cases.
1. xdp-generic + single-buffer
2. xdp-generic + multi-buffer
3. xdp-native + single-buffer
4. xdp-native + multi-buffer
5. xdp-offload

It also makes test_{v4 | v6 | tcp} sending large size packets. this may
help to check whether multi-buffer is working or not.

Note that the physical interface may be down and then up when xdp is
attached or detached.
This takes some period to activate traffic. So sleep(10) is
added if the test interface is the physical interface.
netdevsim and veth type interfaces skip sleep.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Link: https://patch.msgid.link/20250309134219.91670-9-ap420073@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: devmem: do not WARN conditionally after netdev_rx_queue_restart()

When devmem socket is closed, netdev_rx_queue_restart() is called to
reset queue by the net_devmem_unbind_dmabuf(). But callback may return
-ENETDOWN if the interface is down because queues are already freed
when the interface is down so queue reset is not needed.
So, it should not warn if the return value is -ENETDOWN.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20250309134219.91670-8-ap420073@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

eth: bnxt: fix memory leak in queue reset

When the queue is reset, the bnxt_alloc_one_tpa_info() is called to
allocate tpa_info for the new queue.
And then the old queue's tpa_info should be removed by the
bnxt_free_one_tpa_info(), but it is not called.
So memory leak occurs.
It adds the bnxt_free_one_tpa_info() in the bnxt_queue_mem_free().

unreferenced object 0xffff888293cc0000 (size 16384):
  comm "ncdevmem", pid 2076, jiffies 4296604081
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 40 75 78 93 82 88 ff ff  ........@ux.....
    40 75 78 93 02 00 00 00 00 00 00 00 00 00 00 00  @ux.............
  backtrace (crc 5d7d4798):
    ___kmalloc_large_node+0x10d/0x1b0
    __kmalloc_large_node_noprof+0x17/0x60
    __kmalloc_noprof+0x3f6/0x520
    bnxt_alloc_one_tpa_info+0x5f/0x300 [bnxt_en]
    bnxt_queue_mem_alloc+0x8e8/0x14f0 [bnxt_en]
    netdev_rx_queue_restart+0x233/0x620
    net_devmem_bind_dmabuf_to_queue+0x2a3/0x600
    netdev_nl_bind_rx_doit+0xc00/0x10a0
    genl_family_rcv_msg_doit+0x1d4/0x2b0
    genl_rcv_msg+0x3fb/0x6c0
    netlink_rcv_skb+0x12c/0x360
    genl_rcv+0x24/0x40
    netlink_unicast+0x447/0x710
    netlink_sendmsg+0x712/0xbc0
    __sys_sendto+0x3fd/0x4d0
    __x64_sys_sendto+0xdc/0x1b0

Fixes: 2d694c27d32e ("bnxt_en: implement netdev_queue_mgmt_ops")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Link: https://patch.msgid.link/20250309134219.91670-7-ap420073@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

eth: bnxt: fix kernel panic in the bnxt_get_queue_stats{rx | tx}

When qstats-get operation is executed, callbacks of netdev_stats_ops
are called. The bnxt_get_queue_stats{rx | tx} collect per-queue stats
from sw_stats in the rings.
But {rx | tx | cp}_ring are allocated when the interface is up.
So, these rings are not allocated when the interface is down.

The qstats-get is allowed even if the interface is down. However,
the bnxt_get_queue_stats{rx | tx}() accesses cp_ring and tx_ring
without null check.
So, it needs to avoid accessing rings if the interface is down.

Reproducer:
ip link set $interface down
./cli.py --spec netdev.yaml --dump qstats-get
OR
ip link set $interface down
python ./stats.py

Splat looks like:
BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 1680fa067 P4D 1680fa067 PUD 16be3b067 PMD 0
Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 0 UID: 0 PID: 1495 Comm: python3 Not tainted 6.14.0-rc4+ #32 5cd0f999d5a15c574ac72b3e4b907341
Hardware name: ASUS System Product Name/PRIME Z690-P D4, BIOS 0603 11/01/2021
RIP: 0010:bnxt_get_queue_stats_rx+0xf/0x70 [bnxt_en]
Code: c6 87 b5 18 00 00 02 eb a2 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 01
RSP: 0018:ffffabef43cdb7e0 EFLAGS: 00010282
RAX: 0000000000000000 RBX: ffffffffc04c8710 RCX: 0000000000000000
RDX: ffffabef43cdb858 RSI: 0000000000000000 RDI: ffff8d504e850000
RBP: ffff8d506c9f9c00 R08: 0000000000000004 R09: ffff8d506bcd901c
R10: 0000000000000015 R11: ffff8d506bcd9000 R12: 0000000000000000
R13: ffffabef43cdb8c0 R14: ffff8d504e850000 R15: 0000000000000000
FS:  00007f2c5462b080(0000) GS:ffff8d575f600000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000167fd0000 CR4: 00000000007506f0
PKRU: 55555554
Call Trace:
  <TASK>
  ? __die+0x20/0x70
  ? page_fault_oops+0x15a/0x460
  ? sched_balance_find_src_group+0x58d/0xd10
  ? exc_page_fault+0x6e/0x180
  ? asm_exc_page_fault+0x22/0x30
  ? bnxt_get_queue_stats_rx+0xf/0x70 [bnxt_en cdd546fd48563c280cfd30e9647efa420db07bf1]
  netdev_nl_stats_by_netdev+0x2b1/0x4e0
  ? xas_load+0x9/0xb0
  ? xas_find+0x183/0x1d0
  ? xa_find+0x8b/0xe0
  netdev_nl_qstats_get_dumpit+0xbf/0x1e0
  genl_dumpit+0x31/0x90
  netlink_dump+0x1a8/0x360

Fixes: af7b3b4adda5 ("eth: bnxt: support per-queue statistics")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
Link: https://patch.msgid.link/20250309134219.91670-6-ap420073@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

eth: bnxt: do not update checksum in bnxt_xdp_build_skb()

The bnxt_rx_pkt() updates ip_summed value at the end if checksum offload
is enabled.
When the XDP-MB program is attached and it returns XDP_PASS, the
bnxt_xdp_build_skb() is called to update skb_shared_info.
The main purpose of bnxt_xdp_build_skb() is to update skb_shared_info,
but it updates ip_summed value too if checksum offload is enabled.
This is actually duplicate work.

When the bnxt_rx_pkt() updates ip_summed value, it checks if ip_summed
is CHECKSUM_NONE or not.
It means that ip_summed should be CHECKSUM_NONE at this moment.
But ip_summed may already be updated to CHECKSUM_UNNECESSARY in the
XDP-MB-PASS path.
So the by skb_checksum_none_assert() WARNS about it.

This is duplicate work and updating ip_summed in the
bnxt_xdp_build_skb() is not needed.

Splat looks like:
WARNING: CPU: 3 PID: 5782 at ./include/linux/skbuff.h:5155 bnxt_rx_pkt+0x479b/0x7610 [bnxt_en]
Modules linked in: bnxt_re bnxt_en rdma_ucm rdma_cm iw_cm ib_cm ib_uverbs veth xt_nat xt_tcpudp xt_conntrack nft_chain_nat xt_MASQUERADE nf_]
CPU: 3 UID: 0 PID: 5782 Comm: socat Tainted: G        W          6.14.0-rc4+ #27
Tainted: [W]=WARN
Hardware name: ASUS System Product Name/PRIME Z690-P D4, BIOS 0603 11/01/2021
RIP: 0010:bnxt_rx_pkt+0x479b/0x7610 [bnxt_en]
Code: 54 24 0c 4c 89 f1 4c 89 ff c1 ea 1f ff d3 0f 1f 00 49 89 c6 48 85 c0 0f 84 4c e5 ff ff 48 89 c7 e8 ca 3d a0 c8 e9 8f f4 ff ff <0f> 0b f
RSP: 0018:ffff88881ba09928 EFLAGS: 00010202
RAX: 0000000000000000 RBX: 00000000c7590303 RCX: 0000000000000000
RDX: 1ffff1104e7d1610 RSI: 0000000000000001 RDI: ffff8881c91300b8
RBP: ffff88881ba09b28 R08: ffff888273e8b0d0 R09: ffff888273e8b070
R10: ffff888273e8b010 R11: ffff888278b0f000 R12: ffff888273e8b080
R13: ffff8881c9130e00 R14: ffff8881505d3800 R15: ffff888273e8b000
FS:  00007f5a2e7be080(0000) GS:ffff88881ba00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fff2e708ff8 CR3: 000000013e3b0000 CR4: 00000000007506f0
PKRU: 55555554
Call Trace:
<IRQ>
? __warn+0xcd/0x2f0
? bnxt_rx_pkt+0x479b/0x7610
? report_bug+0x326/0x3c0
? handle_bug+0x53/0xa0
? exc_invalid_op+0x14/0x50
? asm_exc_invalid_op+0x16/0x20
? bnxt_rx_pkt+0x479b/0x7610
? bnxt_rx_pkt+0x3e41/0x7610
? __pfx_bnxt_rx_pkt+0x10/0x10
? napi_complete_done+0x2cf/0x7d0
__bnxt_poll_work+0x4e8/0x1220
? __pfx___bnxt_poll_work+0x10/0x10
? __pfx_mark_lock.part.0+0x10/0x10
bnxt_poll_p5+0x36a/0xfa0
? __pfx_bnxt_poll_p5+0x10/0x10
__napi_poll.constprop.0+0xa0/0x440
net_rx_action+0x899/0xd00
...

Following ping.py patch adds xdp-mb-pass case. so ping.py is going
to be able to reproduce this issue.

Fixes: 1dc4c557bfed ("bnxt: adding bnxt_xdp_build_skb to build skb from multibuffer xdp_buff")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
Link: https://patch.msgid.link/20250309134219.91670-5-ap420073@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

eth: bnxt: do not use BNXT_VNIC_NTUPLE unconditionally in queue restart logic

When a queue is restarted, it sets MRU to 0 for stopping packet flow.
MRU variable is a member of vnic_info[], the first vnic_info is default
and the second is ntuple.
Only when ntuple is enabled(ethtool -K eth0 ntuple on), vnic_info for
ntuple is allocated in init logic.
The bp->nr_vnics indicates how many vnic_info are allocated.
However bnxt_queue_{start | stop}() accesses vnic_info[BNXT_VNIC_NTUPLE]
regardless of ntuple state.

Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
Fixes: b9d2956e869c ("bnxt_en: stop packet flow during bnxt_queue_stop/start")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Link: https://patch.msgid.link/20250309134219.91670-4-ap420073@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>