]> www.infradead.org Git - nvme.git/log
nvme.git
12 months agonet: skb: add pskb_network_may_pull_reason() helper
Menglong Dong [Wed, 9 Oct 2024 02:28:19 +0000 (10:28 +0800)]
net: skb: add pskb_network_may_pull_reason() helper

Introduce the function pskb_network_may_pull_reason() and make
pskb_network_may_pull() a simple inline call to it. The drop reasons of
it just come from pskb_may_pull_reason.

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 months agonet: bcmasp: enable SW timestamping
Justin Chen [Thu, 10 Oct 2024 22:15:06 +0000 (15:15 -0700)]
net: bcmasp: enable SW timestamping

Add skb_tx_timestamp() call and enable support for SW
timestamping.

Signed-off-by: Justin Chen <justin.chen@broadcom.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20241010221506.802730-1-justin.chen@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 months agonet: broadcom: remove select MII from brcmstb Ethernet drivers
Justin Chen [Thu, 10 Oct 2024 19:13:32 +0000 (12:13 -0700)]
net: broadcom: remove select MII from brcmstb Ethernet drivers

The MII driver isn't used by brcmstb Ethernet drivers. Remove it
from the BCMASP, GENET, and SYSTEMPORT drivers.

Signed-off-by: Justin Chen <justin.chen@broadcom.com>
Acked-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20241010191332.1074642-1-justin.chen@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 months agoMerge branch 'microchip_t1s-update-on-microchip-10base-t1s-phy-driver'
Jakub Kicinski [Fri, 11 Oct 2024 22:54:42 +0000 (15:54 -0700)]
Merge branch 'microchip_t1s-update-on-microchip-10base-t1s-phy-driver'

Parthiban Veerasooran says:

====================
microchip_t1s: Update on Microchip 10BASE-T1S PHY driver

This patch series contain the below updates:
- Restructured lan865x_write_cfg_params() and lan865x_read_cfg_params()
  functions arguments to more generic.
- Updated new/improved initial settings of LAN865X Rev.B0 from latest
  AN1760.
- Added support for LAN865X Rev.B1 from latest AN1760.
- Moved LAN867X reset handling to a new function for flexibility.
- Added support for LAN867X Rev.C1/C2 from latest AN1699.
- Disabled/enabled collision detection based on PLCA setting.
====================

Link: https://patch.msgid.link/20241010082205.221493-1-parthiban.veerasooran@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 months agonet: phy: microchip_t1s: configure collision detection based on PLCA mode
Parthiban Veerasooran [Thu, 10 Oct 2024 08:22:05 +0000 (13:52 +0530)]
net: phy: microchip_t1s: configure collision detection based on PLCA mode

As per LAN8650/1 Rev.B0/B1 AN1760 (Revision F (DS60001760G - June 2024))
and LAN8670/1/2 Rev.C1/C2 AN1699 (Revision E (DS60001699F - June 2024)),
under normal operation, the device should be operated in PLCA mode.
Disabling collision detection is recommended to allow the device to
operate in noisy environments or when reflections and other inherent
transmission line distortion cause poor signal quality. Collision
detection must be re-enabled if the device is configured to operate in
CSMA/CD mode.

Signed-off-by: Parthiban Veerasooran <parthiban.veerasooran@microchip.com>
Link: https://patch.msgid.link/20241010082205.221493-8-parthiban.veerasooran@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 months agonet: phy: microchip_t1s: add support for Microchip's LAN867X Rev.C2
Parthiban Veerasooran [Thu, 10 Oct 2024 08:22:04 +0000 (13:52 +0530)]
net: phy: microchip_t1s: add support for Microchip's LAN867X Rev.C2

Add support for LAN8670/1/2 Rev.C2 as per the latest configuration note
AN1699 released (Revision E (DS60001699F - June 2024)) for Rev.C1 is also
applicable for Rev.C2. Refer hardware revisions list in the latest AN1699
Revision E (DS60001699F - June 2024).
https://www.microchip.com/en-us/application-notes/an1699

Signed-off-by: Parthiban Veerasooran <parthiban.veerasooran@microchip.com>
Link: https://patch.msgid.link/20241010082205.221493-7-parthiban.veerasooran@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 months agonet: phy: microchip_t1s: add support for Microchip's LAN867X Rev.C1
Parthiban Veerasooran [Thu, 10 Oct 2024 08:22:03 +0000 (13:52 +0530)]
net: phy: microchip_t1s: add support for Microchip's LAN867X Rev.C1

Add support for LAN8670/1/2 Rev.C1 as per the latest configuration note
AN1699 released (Revision E (DS60001699F - June 2024)).
https://www.microchip.com/en-us/application-notes/an1699

Signed-off-by: Parthiban Veerasooran <parthiban.veerasooran@microchip.com>
Link: https://patch.msgid.link/20241010082205.221493-6-parthiban.veerasooran@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 months agonet: phy: microchip_t1s: move LAN867X reset handling to a new function
Parthiban Veerasooran [Thu, 10 Oct 2024 08:22:02 +0000 (13:52 +0530)]
net: phy: microchip_t1s: move LAN867X reset handling to a new function

Move LAN867X reset handling code to a new function called
lan867x_check_reset_complete() which will be useful for the next patch
which also uses the same code to handle the reset functionality.

Signed-off-by: Parthiban Veerasooran <parthiban.veerasooran@microchip.com>
Link: https://patch.msgid.link/20241010082205.221493-5-parthiban.veerasooran@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 months agonet: phy: microchip_t1s: add support for Microchip's LAN865X Rev.B1
Parthiban Veerasooran [Thu, 10 Oct 2024 08:22:01 +0000 (13:52 +0530)]
net: phy: microchip_t1s: add support for Microchip's LAN865X Rev.B1

Add support for LAN8650/1 Rev.B1. As per the latest configuration note
AN1760 released (Revision F (DS60001760G - June 2024)) for Rev.B0 is also
applicable for Rev.B1. Refer hardware revisions list in the latest AN1760
Revision F (DS60001760G - June 2024).
https://www.microchip.com/en-us/application-notes/an1760

Signed-off-by: Parthiban Veerasooran <parthiban.veerasooran@microchip.com>
Link: https://patch.msgid.link/20241010082205.221493-4-parthiban.veerasooran@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 months agonet: phy: microchip_t1s: update new initial settings for LAN865X Rev.B0
Parthiban Veerasooran [Thu, 10 Oct 2024 08:22:00 +0000 (13:52 +0530)]
net: phy: microchip_t1s: update new initial settings for LAN865X Rev.B0

Update the new/improved initial settings from the latest configuration
application note AN1760 released for LAN8650/1 Rev.B0 Revision F
(DS60001760G - June 2024).
https://www.microchip.com/en-us/application-notes/an1760

Signed-off-by: Parthiban Veerasooran <parthiban.veerasooran@microchip.com>
Link: https://patch.msgid.link/20241010082205.221493-3-parthiban.veerasooran@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 months agonet: phy: microchip_t1s: restructure cfg read/write functions arguments
Parthiban Veerasooran [Thu, 10 Oct 2024 08:21:59 +0000 (13:51 +0530)]
net: phy: microchip_t1s: restructure cfg read/write functions arguments

Restructure lan865x_write_cfg_params() and lan865x_read_cfg_params()
functions arguments to more generic which will be useful for the next
patch which updates the improved initial configuration for LAN8650/1
Rev.B0 published in the Configuration Note.

Signed-off-by: Parthiban Veerasooran <parthiban.veerasooran@microchip.com>
Link: https://patch.msgid.link/20241010082205.221493-2-parthiban.veerasooran@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 months agoselftests: drv-net: add missing trailing backslash
Jakub Kicinski [Thu, 10 Oct 2024 21:18:57 +0000 (14:18 -0700)]
selftests: drv-net: add missing trailing backslash

Commit b3ea416419c8 ("testing: net-drv: add basic shaper test")
removed the trailing backslash from the last entry. We have
a terminating comment here to avoid having to modify the last
line when adding at the end.

Reviewed-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20241010211857.2193076-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 months agoMerge branch 'netdevsim-better-ipsec-output-format'
Jakub Kicinski [Fri, 11 Oct 2024 22:44:29 +0000 (15:44 -0700)]
Merge branch 'netdevsim-better-ipsec-output-format'

Hangbin Liu says:

====================
netdevsim: better ipsec output format

The first 2 patches improve the netdevsim ipsec debug output with better
format. The 3rd patch update the selftests.

v2: update rtnetlink selftest with new output format (Stanislav Fomichev, Jakub Kicinski)
====================

Link: https://patch.msgid.link/20241010040027.21440-1-liuhangbin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 months agoselftests: rtnetlink: update netdevsim ipsec output format
Hangbin Liu [Thu, 10 Oct 2024 04:00:27 +0000 (04:00 +0000)]
selftests: rtnetlink: update netdevsim ipsec output format

After the netdevsim update to use human-readable IP address formats for
IPsec, we can now use the source and destination IPs directly in testing.
Here is the result:
  # ./rtnetlink.sh -t kci_test_ipsec_offload
  PASS: ipsec_offload

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20241010040027.21440-4-liuhangbin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 months agonetdevsim: copy addresses for both in and out paths
Hangbin Liu [Thu, 10 Oct 2024 04:00:26 +0000 (04:00 +0000)]
netdevsim: copy addresses for both in and out paths

The current code only copies the address for the in path, leaving the out
path address set to 0. This patch corrects the issue by copying the addresses
for both the in and out paths. Before this patch:

  # cat /sys/kernel/debug/netdevsim/netdevsim0/ports/0/ipsec
  SA count=2 tx=20
  sa[0] tx ipaddr=0.0.0.0
  sa[0]    spi=0x00000100 proto=0x32 salt=0x0adecc3a crypt=1
  sa[0]    key=0x3167608a ca4f1397 43565909 941fa627
  sa[1] rx ipaddr=192.168.0.1
  sa[1]    spi=0x00000101 proto=0x32 salt=0x0adecc3a crypt=1
  sa[1]    key=0x3167608a ca4f1397 43565909 941fa627

After this patch:

  = cat /sys/kernel/debug/netdevsim/netdevsim0/ports/0/ipsec
  SA count=2 tx=20
  sa[0] tx ipaddr=192.168.0.2
  sa[0]    spi=0x00000100 proto=0x32 salt=0x0adecc3a crypt=1
  sa[0]    key=0x3167608a ca4f1397 43565909 941fa627
  sa[1] rx ipaddr=192.168.0.1
  sa[1]    spi=0x00000101 proto=0x32 salt=0x0adecc3a crypt=1
  sa[1]    key=0x3167608a ca4f1397 43565909 941fa627

Fixes: 7699353da875 ("netdevsim: add ipsec offload testing")
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://patch.msgid.link/20241010040027.21440-3-liuhangbin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 months agonetdevsim: print human readable IP address
Hangbin Liu [Thu, 10 Oct 2024 04:00:25 +0000 (04:00 +0000)]
netdevsim: print human readable IP address

Currently, IPSec addresses are printed in hexadecimal format, which is
not user-friendly. e.g.

  # cat /sys/kernel/debug/netdevsim/netdevsim0/ports/0/ipsec
  SA count=2 tx=20
  sa[0] rx ipaddr=0x00000000 00000000 00000000 0100a8c0
  sa[0]    spi=0x00000101 proto=0x32 salt=0x0adecc3a crypt=1
  sa[0]    key=0x3167608a ca4f1397 43565909 941fa627
  sa[1] tx ipaddr=0x00000000 00000000 00000000 00000000
  sa[1]    spi=0x00000100 proto=0x32 salt=0x0adecc3a crypt=1
  sa[1]    key=0x3167608a ca4f1397 43565909 941fa627

This patch updates the code to print the IPSec address in a human-readable
format for easier debug. e.g.

 # cat /sys/kernel/debug/netdevsim/netdevsim0/ports/0/ipsec
 SA count=4 tx=40
 sa[0] tx ipaddr=0.0.0.0
 sa[0]    spi=0x00000100 proto=0x32 salt=0x0adecc3a crypt=1
 sa[0]    key=0x3167608a ca4f1397 43565909 941fa627
 sa[1] rx ipaddr=192.168.0.1
 sa[1]    spi=0x00000101 proto=0x32 salt=0x0adecc3a crypt=1
 sa[1]    key=0x3167608a ca4f1397 43565909 941fa627
 sa[2] tx ipaddr=::
 sa[2]    spi=0x00000100 proto=0x32 salt=0x0adecc3a crypt=1
 sa[2]    key=0x3167608a ca4f1397 43565909 941fa627
 sa[3] rx ipaddr=2000::1
 sa[3]    spi=0x00000101 proto=0x32 salt=0x0adecc3a crypt=1
 sa[3]    key=0x3167608a ca4f1397 43565909 941fa627

Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://patch.msgid.link/20241010040027.21440-2-liuhangbin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 months agonet: dsa: mv88e6xxx: Fix uninitialised err value
Aryan Srivastava [Wed, 9 Oct 2024 21:23:19 +0000 (10:23 +1300)]
net: dsa: mv88e6xxx: Fix uninitialised err value

The err value in mv88e6xxx_region_atu_snapshot is now potentially
uninitialised on return. Initialise err as 0.

Fixes: ada5c3229b32 ("net: dsa: mv88e6xxx: Add FID map cache")
Signed-off-by: Aryan Srivastava <aryan.srivastava@alliedtelesis.co.nz>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20241009212319.1045176-1-aryan.srivastava@alliedtelesis.co.nz
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 months agoMerge branch 'net-xilinx-emaclite-adopt-clock-support'
Jakub Kicinski [Fri, 11 Oct 2024 22:41:36 +0000 (15:41 -0700)]
Merge branch 'net-xilinx-emaclite-adopt-clock-support'

Radhey Shyam Pandey says:

====================
net: xilinx: emaclite: Adopt clock support

This patchset adds emaclite clock support. AXI Ethernet Lite IP can also
be used on SoC platforms like Zynq UltraScale+ MPSoC which combines
powerful processing system (PS) and user-programmable logic (PL) into
the same device. On these platforms it is mandatory to explicitly enable
IP clocks for proper functionality.
====================

Link: https://patch.msgid.link/1728491303-1456171-1-git-send-email-radhey.shyam.pandey@amd.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 months agonet: emaclite: Adopt clock support
Abin Joseph [Wed, 9 Oct 2024 16:28:23 +0000 (21:58 +0530)]
net: emaclite: Adopt clock support

Adapt to use the clock framework. Add s_axi_aclk clock from the processor
bus clock domain and make clk optional to keep DTB backward compatibility.

Signed-off-by: Abin Joseph <abin.joseph@amd.com>
Signed-off-by: Radhey Shyam Pandey <radhey.shyam.pandey@amd.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/1728491303-1456171-4-git-send-email-radhey.shyam.pandey@amd.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 months agonet: emaclite: Replace alloc_etherdev() with devm_alloc_etherdev()
Abin Joseph [Wed, 9 Oct 2024 16:28:22 +0000 (21:58 +0530)]
net: emaclite: Replace alloc_etherdev() with devm_alloc_etherdev()

Use device managed ethernet device allocation to simplify the error
handling logic. No functional change.

Signed-off-by: Abin Joseph <abin.joseph@amd.com>
Signed-off-by: Radhey Shyam Pandey <radhey.shyam.pandey@amd.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/1728491303-1456171-3-git-send-email-radhey.shyam.pandey@amd.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 months agodt-bindings: net: emaclite: Add clock support
Abin Joseph [Wed, 9 Oct 2024 16:28:21 +0000 (21:58 +0530)]
dt-bindings: net: emaclite: Add clock support

Add s_axi_aclk AXI4 clock support. Traditionally this IP was used on
microblaze platforms which had fixed clocks enabled all the time. But
since its a PL IP, it can also be used on SoC platforms like Zynq
UltraScale+ MPSoC which combines processing system (PS) and user
programmable logic (PL) into the same device. On these platforms instead
of fixed enabled clocks it is mandatory to explicitly enable IP clocks
for proper functionality.

So make clock a required property and also define max supported clock
constraints.

Signed-off-by: Abin Joseph <abin.joseph@amd.com>
Signed-off-by: Radhey Shyam Pandey <radhey.shyam.pandey@amd.com>
Acked-by: Conor Dooley <conor.dooley@microchip.com>
Link: https://patch.msgid.link/1728491303-1456171-2-git-send-email-radhey.shyam.pandey@amd.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 months agoMerge branch 'net-remove-rtnl-from-fib_seq_sum'
Jakub Kicinski [Fri, 11 Oct 2024 22:35:07 +0000 (15:35 -0700)]
Merge branch 'net-remove-rtnl-from-fib_seq_sum'

Eric Dumazet says:

====================
net: remove RTNL from fib_seq_sum()

This series is inspired by a syzbot report showing
rtnl contention and one thread blocked in:

7 locks held by syz-executor/10835:
  #0: ffff888033390420 (sb_writers#8){.+.+}-{0:0}, at: file_start_write include/linux/fs.h:2931 [inline]
  #0: ffff888033390420 (sb_writers#8){.+.+}-{0:0}, at: vfs_write+0x224/0xc90 fs/read_write.c:679
  #1: ffff88806df6bc88 (&of->mutex){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x1ea/0x500 fs/kernfs/file.c:325
  #2: ffff888026fcf3c8 (kn->active#50){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x20e/0x500 fs/kernfs/file.c:326
  #3: ffffffff8f56f848 (nsim_bus_dev_list_lock){+.+.}-{3:3}, at: new_device_store+0x1b4/0x890 drivers/net/netdevsim/bus.c:166
  #4: ffff88805e0140e8 (&dev->mutex){....}-{3:3}, at: device_lock include/linux/device.h:1014 [inline]
  #4: ffff88805e0140e8 (&dev->mutex){....}-{3:3}, at: __device_attach+0x8e/0x520 drivers/base/dd.c:1005
  #5: ffff88805c5fb250 (&devlink->lock_key#55){+.+.}-{3:3}, at: nsim_drv_probe+0xcb/0xb80 drivers/net/netdevsim/dev.c:1534
  #6: ffffffff8fcd1748 (rtnl_mutex){+.+.}-{3:3}, at: fib_seq_sum+0x31/0x290 net/core/fib_notifier.c:46
====================

Link: https://patch.msgid.link/20241009184405.3752829-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 months agonet: do not acquire rtnl in fib_seq_sum()
Eric Dumazet [Wed, 9 Oct 2024 18:44:05 +0000 (18:44 +0000)]
net: do not acquire rtnl in fib_seq_sum()

After we made sure no fib_seq_read() handlers needs RTNL anymore,
we can remove RTNL from fib_seq_sum().

Note that after RTNL was dropped, fib_seq_sum() result was possibly
outdated anyway.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20241009184405.3752829-6-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 months agoipmr: use READ_ONCE() to read net->ipv[46].ipmr_seq
Eric Dumazet [Wed, 9 Oct 2024 18:44:04 +0000 (18:44 +0000)]
ipmr: use READ_ONCE() to read net->ipv[46].ipmr_seq

mr_call_vif_notifiers() and mr_call_mfc_notifiers() already
uses WRITE_ONCE() on the write side.

Using RTNL to protect the reads seems a big hammer.

Constify 'struct net' argument of ip6mr_rules_seq_read()
and ipmr_rules_seq_read().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20241009184405.3752829-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 months agoipv6: use READ_ONCE()/WRITE_ONCE() on fib6_table->fib_seq
Eric Dumazet [Wed, 9 Oct 2024 18:44:03 +0000 (18:44 +0000)]
ipv6: use READ_ONCE()/WRITE_ONCE() on fib6_table->fib_seq

Using RTNL to protect ops->fib_rules_seq reads seems a big hammer.

Writes are protected by RTNL.
We can use READ_ONCE() when reading it.

Constify 'struct net' argument of fib6_tables_seq_read() and
fib6_rules_seq_read().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20241009184405.3752829-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 months agoipv4: use READ_ONCE()/WRITE_ONCE() on net->ipv4.fib_seq
Eric Dumazet [Wed, 9 Oct 2024 18:44:02 +0000 (18:44 +0000)]
ipv4: use READ_ONCE()/WRITE_ONCE() on net->ipv4.fib_seq

Using RTNL to protect ops->fib_rules_seq reads seems a big hammer.

Writes are protected by RTNL.
We can use READ_ONCE() when reading it.

Constify 'struct net' argument of fib4_rules_seq_read()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20241009184405.3752829-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 months agofib: rules: use READ_ONCE()/WRITE_ONCE() on ops->fib_rules_seq
Eric Dumazet [Wed, 9 Oct 2024 18:44:01 +0000 (18:44 +0000)]
fib: rules: use READ_ONCE()/WRITE_ONCE() on ops->fib_rules_seq

Using RTNL to protect ops->fib_rules_seq reads seems a big hammer.

Writes are protected by RTNL.
We can use READ_ONCE() on readers.

Constify 'struct net' argument of fib_rules_seq_read()
and lookup_rules_ops().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20241009184405.3752829-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 months agotcp: move sysctl_tcp_l3mdev_accept to netns_ipv4_read_rx
Eric Dumazet [Thu, 10 Oct 2024 03:41:00 +0000 (03:41 +0000)]
tcp: move sysctl_tcp_l3mdev_accept to netns_ipv4_read_rx

sysctl_tcp_l3mdev_accept is read from TCP receive fast path from
tcp_v6_early_demux(),
 __inet6_lookup_established,
  inet_request_bound_dev_if().

Move it to netns_ipv4_read_rx.

Remove the '#ifdef CONFIG_NET_L3_MASTER_DEV' that was guarding
its definition.

Note this adds a hole of three bytes that could be filled later.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Cc: Wei Wang <weiwan@google.com>
Cc: Coco Li <lixiaoyan@google.com>
Link: https://patch.msgid.link/20241010034100.320832-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 months agonet: phy: aquantia: poll status register
Aryan Srivastava [Thu, 10 Oct 2024 00:49:34 +0000 (13:49 +1300)]
net: phy: aquantia: poll status register

The system interface connection status register is not immediately
correct upon line side link up. This results in the status being read as
OFF and then transitioning to the correct host side link mode with a
short delay. This causes the phylink framework passing the OFF status
down to all MAC config drivers, resulting in the host side link being
misconfigured, which in turn can lead to link flapping or complete
packet loss in some cases.

Mitigate this by periodically polling the register until it not showing
the OFF state. This will be done every 1ms for 10ms, using the same
poll/timeout as the processor intensive operation reads.

If the phy is still expressing the OFF state after the timeout, then set
the link to false and pass the NA interface mode onto the phylink
framework.

Signed-off-by: Aryan Srivastava <aryan.srivastava@alliedtelesis.co.nz>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20241010004935.1774601-1-aryan.srivastava@alliedtelesis.co.nz
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 months agoeth: remove the DLink/Sundance (ST201) driver
Jakub Kicinski [Tue, 8 Oct 2024 15:48:24 +0000 (08:48 -0700)]
eth: remove the DLink/Sundance (ST201) driver

Konstantin reports the maintainer's address bounces.
There is no other maintainer and the driver is quite old.
There is a good chance nobody is using this driver any more.
Let's try to remove it completely, we can revert it back in
if someone complains.

Link: https://lore.kernel.org/20240925-bizarre-earwig-from-pluto-1484aa@lemu/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Acked-by: Denis Kirjanov <dkirjanov@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 months agoMerge branch 'tg3-link-irqs-napis-and-queues'
Jakub Kicinski [Fri, 11 Oct 2024 01:40:34 +0000 (18:40 -0700)]
Merge branch 'tg3-link-irqs-napis-and-queues'

Joe Damato says:

====================
tg3: Link IRQs, NAPIs, and queues

This follows from a previous RFC (wherein I botched the subject lines of
all the messages) [1].

I've taken Michael Chan's suggestion on modifying patch 2 and I've
updated the commit messages of both patches to test and show the output
for the default 1 TX 4 RX queues and the 4 TX and 4 RX queues cases.

Reviewers: please check the commit messages carefully to ensure the
output is correct (or on your own systems to verify, if you like). I am
not a tg3 expert and it's possible that I got something wrong.

[1]: https://lore.kernel.org/all/20241005145717.302575-3-jdamato@fastly.com/T/
====================

Link: https://patch.msgid.link/20241009175509.31753-1-jdamato@fastly.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 months agotg3: Link queues to NAPIs
Joe Damato [Wed, 9 Oct 2024 17:55:09 +0000 (17:55 +0000)]
tg3: Link queues to NAPIs

Link queues to NAPIs using the netdev-genl API so this information is
queryable.

First, test with the default setting on my tg3 NIC at boot with 1 TX
queue:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
                         --dump queue-get --json='{"ifindex": 2}'

[{'id': 0, 'ifindex': 2, 'napi-id': 8194, 'type': 'rx'},
 {'id': 1, 'ifindex': 2, 'napi-id': 8195, 'type': 'rx'},
 {'id': 2, 'ifindex': 2, 'napi-id': 8196, 'type': 'rx'},
 {'id': 3, 'ifindex': 2, 'napi-id': 8197, 'type': 'rx'},
 {'id': 0, 'ifindex': 2, 'napi-id': 8193, 'type': 'tx'}]

Now, adjust the number of TX queues to be 4 via ethtool:

$ sudo ethtool -L eth0 tx 4
$ sudo ethtool -l eth0 | tail -5
Current hardware settings:
RX: 4
TX: 4
Other: n/a
Combined: n/a

Despite "Combined: n/a" in the ethtool output, /proc/interrupts shows
the tg3 has renamed the IRQs to be combined:

343: [...] eth0-0
344: [...] eth0-txrx-1
345: [...] eth0-txrx-2
346: [...] eth0-txrx-3
347: [...] eth0-txrx-4

Now query this via netlink to ensure the queues are linked properly to
their NAPIs:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
                         --dump queue-get --json='{"ifindex": 2}'
[{'id': 0, 'ifindex': 2, 'napi-id': 8960, 'type': 'rx'},
 {'id': 1, 'ifindex': 2, 'napi-id': 8961, 'type': 'rx'},
 {'id': 2, 'ifindex': 2, 'napi-id': 8962, 'type': 'rx'},
 {'id': 3, 'ifindex': 2, 'napi-id': 8963, 'type': 'rx'},
 {'id': 0, 'ifindex': 2, 'napi-id': 8960, 'type': 'tx'},
 {'id': 1, 'ifindex': 2, 'napi-id': 8961, 'type': 'tx'},
 {'id': 2, 'ifindex': 2, 'napi-id': 8962, 'type': 'tx'},
 {'id': 3, 'ifindex': 2, 'napi-id': 8963, 'type': 'tx'}]

As you can see above, id 0 for both TX and RX share a NAPI, NAPI ID
8960, and so on for each queue index up to 3.

Signed-off-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20241009175509.31753-3-jdamato@fastly.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 months agotg3: Link IRQs to NAPI instances
Joe Damato [Wed, 9 Oct 2024 17:55:08 +0000 (17:55 +0000)]
tg3: Link IRQs to NAPI instances

Link IRQs to NAPI instances with netif_napi_set_irq. This information
can be queried with the netdev-genl API.

Begin by testing my tg3 device in its default state: 1 TX queue and 4 RX
queues.

Compare the output of /proc/interrupts for my tg3 device with the output of
netdev-genl after applying this patch:

$ cat /proc/interrupts | grep eth0
343: [...] eth0-tx-0
344: [...] eth0-rx-1
345: [...] eth0-rx-2
346: [...] eth0-rx-3
347: [...] eth0-rx-4

As you can see above, tg3 has named the IRQs such that there is a
dedicated tx IRQ and 4 dedicated rx IRQs, for a total of 5 IRQs.

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
 --dump napi-get --json='{"ifindex": 2}'

[{'id': 8197, 'ifindex': 2, 'irq': 347},
 {'id': 8196, 'ifindex': 2, 'irq': 346},
 {'id': 8195, 'ifindex': 2, 'irq': 345},
 {'id': 8194, 'ifindex': 2, 'irq': 344},
 {'id': 8193, 'ifindex': 2, 'irq': 343}]

Netlink displays the same IRQs as above, noting that each is mapped to a
unique NAPI instance.

Now, reconfigure the NIC to have 4 TX queues and 4 RX queues:

$ sudo ethtool -L eth0 rx 4 tx 4
$ sudo ethtool -l eth0 | tail -5
Current hardware settings:
RX: 4
TX: 4
Other: n/a
Combined: n/a

Examine /proc/interrupts once again, noting that tg3 will now rename the
IRQs to suggest that they are combined tx and rx without allocating
additional IRQs, so the total IRQ count in /proc/interrupts is
unchanged:

343: [...] eth0-0
344: [...] eth0-txrx-1
345: [...] eth0-txrx-2
346: [...] eth0-txrx-3
347: [...] eth0-txrx-4

Check the output from netlink again:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
                         --dump napi-get --json='{"ifindex": 2}'
[{'id': 8973, 'ifindex': 2, 'irq': 347},
 {'id': 8972, 'ifindex': 2, 'irq': 346},
 {'id': 8971, 'ifindex': 2, 'irq': 345},
 {'id': 8970, 'ifindex': 2, 'irq': 344},
 {'id': 8969, 'ifindex': 2, 'irq': 343}]

Signed-off-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20241009175509.31753-2-jdamato@fastly.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 months agor8169: remove original workaround for RTL8125 broken rx issue
Heiner Kallweit [Wed, 9 Oct 2024 05:48:05 +0000 (07:48 +0200)]
r8169: remove original workaround for RTL8125 broken rx issue

Now that we have b9c7ac4fe22c ("r8169: disable ALDPS per default for
RTL8125"), the first attempt to fix the issue shouldn't be needed
any longer. So let's effectively revert 621735f59064 ("r8169: fix
rare issue with broken rx after link-down on RTL8125") and see
whether anybody complains.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/382d8c88-cbce-400f-ad62-fda0181c7e38@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 months agor8169: don't apply UDP padding quirk on RTL8126A
Heiner Kallweit [Wed, 9 Oct 2024 05:44:23 +0000 (07:44 +0200)]
r8169: don't apply UDP padding quirk on RTL8126A

Vendor drivers r8125/r8126 indicate that this quirk isn't needed
any longer for RTL8126A. Mimic this in r8169.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/d1317187-aa81-4a69-b831-678436e4de62@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 months agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Jakub Kicinski [Thu, 3 Oct 2024 17:05:55 +0000 (10:05 -0700)]
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Cross-merge networking fixes after downstream PR (net-6.12-rc3).

No conflicts and no adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 months agoMerge tag 'net-6.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Linus Torvalds [Thu, 10 Oct 2024 19:36:35 +0000 (12:36 -0700)]
Merge tag 'net-6.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from bluetooth and netfilter.

  Current release - regressions:

   - dsa: sja1105: fix reception from VLAN-unaware bridges

   - Revert "net: stmmac: set PP_FLAG_DMA_SYNC_DEV only if XDP is
     enabled"

   - eth: fec: don't save PTP state if PTP is unsupported

  Current release - new code bugs:

   - smc: fix lack of icsk_syn_mss with IPPROTO_SMC, prevent null-deref

   - eth: airoha: update Tx CPU DMA ring idx at the end of xmit loop

   - phy: aquantia: AQR115c fix up PMA capabilities

  Previous releases - regressions:

   - tcp: 3 fixes for retrans_stamp and undo logic

  Previous releases - always broken:

   - net: do not delay dst_entries_add() in dst_release()

   - netfilter: restrict xtables extensions to families that are safe,
     syzbot found a way to combine ebtables with extensions that are
     never used by userspace tools

   - sctp: ensure sk_state is set to CLOSED if hashing fails in
     sctp_listen_start

   - mptcp: handle consistently DSS corruption, and prevent corruption
     due to large pmtu xmit"

* tag 'net-6.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (87 commits)
  MAINTAINERS: Add headers and mailing list to UDP section
  MAINTAINERS: consistently exclude wireless files from NETWORKING [GENERAL]
  slip: make slhc_remember() more robust against malicious packets
  net/smc: fix lacks of icsk_syn_mss with IPPROTO_SMC
  ppp: fix ppp_async_encode() illegal access
  docs: netdev: document guidance on cleanup patches
  phonet: Handle error of rtnl_register_module().
  mpls: Handle error of rtnl_register_module().
  mctp: Handle error of rtnl_register_module().
  bridge: Handle error of rtnl_register_module().
  vxlan: Handle error of rtnl_register_module().
  rtnetlink: Add bulk registration helpers for rtnetlink message handlers.
  net: do not delay dst_entries_add() in dst_release()
  mptcp: pm: do not remove closing subflows
  mptcp: fallback when MPTCP opts are dropped after 1st data
  tcp: fix mptcp DSS corruption due to large pmtu xmit
  mptcp: handle consistently DSS corruption
  net: netconsole: fix wrong warning
  net: dsa: refuse cross-chip mirroring operations
  net: fec: don't save PTP state if PTP is unsupported
  ...

12 months agoMerge tag 'trace-ringbuffer-v6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Thu, 10 Oct 2024 19:25:32 +0000 (12:25 -0700)]
Merge tag 'trace-ringbuffer-v6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fix from Steven Rostedt:
 "Ring-buffer fix: do not have boot-mapped buffers use CPU hotplug
  callbacks

  When a ring buffer is mapped to memory assigned at boot, it also
  splits it up evenly between the possible CPUs. But the allocation code
  still attached a CPU notifier callback to this ring buffer. When a CPU
  is added, the callback will happen and another per-cpu buffer is
  created for the ring buffer.

  But for boot mapped buffers, there is no room to add another one (as
  they were all created already). The result of calling the CPU hotplug
  notifier on a boot mapped ring buffer is unpredictable and could lead
  to a system crash.

  If the ring buffer is boot mapped simply do not attach the CPU
  notifier to it"

* tag 'trace-ringbuffer-v6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  ring-buffer: Do not have boot mapped buffers hook to CPU hotplug

12 months agoMerge tag 'for-6.12-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave...
Linus Torvalds [Thu, 10 Oct 2024 17:02:59 +0000 (10:02 -0700)]
Merge tag 'for-6.12-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - update fstrim loop and add more cancellation points, fix reported
   delayed or blocked suspend if there's a huge chunk queued

 - fix error handling in recent qgroup xarray conversion

 - in zoned mode, fix warning printing device path without RCU
   protection

 - again fix invalid extent xarray state (6252690f7e1b), lost due to
   refactoring

* tag 'for-6.12-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix clear_dirty and writeback ordering in submit_one_sector()
  btrfs: zoned: fix missing RCU locking in error message when loading zone info
  btrfs: fix missing error handling when adding delayed ref with qgroups enabled
  btrfs: add cancellation points to trim loops
  btrfs: split remaining space to discard in chunks

12 months agoMerge tag 'nfsd-6.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Linus Torvalds [Thu, 10 Oct 2024 16:52:49 +0000 (09:52 -0700)]
Merge tag 'nfsd-6.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:

 - Fix NFSD bring-up / shutdown

 - Fix a UAF when releasing a stateid

* tag 'nfsd-6.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  nfsd: fix possible badness in FREE_STATEID
  nfsd: nfsd_destroy_serv() must call svc_destroy() even if nfsd_startup_net() failed
  NFSD: Mark filecache "down" if init fails

12 months agoMerge tag 'xfs-6.12-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Linus Torvalds [Thu, 10 Oct 2024 16:45:45 +0000 (09:45 -0700)]
Merge tag 'xfs-6.12-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Carlos Maiolino:

 - A few small typo fixes

 - fstests xfs/538 DEBUG-only fix

 - Performance fix on blockgc on COW'ed files, by skipping trims on
   cowblock inodes currently opened for write

 - Prevent cowblocks to be freed under dirty pagecache during unshare

 - Update MAINTAINERS file to quote the new maintainer

* tag 'xfs-6.12-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: fix a typo
  xfs: don't free cowblocks from under dirty pagecache on unshare
  xfs: skip background cowblock trims on inodes open for write
  xfs: support lowmode allocations in xfs_bmap_exact_minlen_extent_alloc
  xfs: call xfs_bmap_exact_minlen_extent_alloc from xfs_bmap_btalloc
  xfs: don't ifdef around the exact minlen allocations
  xfs: fold xfs_bmap_alloc_userdata into xfs_bmapi_allocate
  xfs: distinguish extra split from real ENOSPC from xfs_attr_node_try_addname
  xfs: distinguish extra split from real ENOSPC from xfs_attr3_leaf_split
  xfs: return bool from xfs_attr3_leaf_add
  xfs: merge xfs_attr_leaf_try_add into xfs_attr_leaf_addname
  xfs: Use try_cmpxchg() in xlog_cil_insert_pcp_aggregate()
  xfs: scrub: convert comma to semicolon
  xfs: Remove empty declartion in header file
  MAINTAINERS: add Carlos Maiolino as XFS release manager

12 months agoMerge branch 'maintainers-networking-file-coverage-updates'
Jakub Kicinski [Thu, 10 Oct 2024 16:35:50 +0000 (09:35 -0700)]
Merge branch 'maintainers-networking-file-coverage-updates'

Simon Horman says:

====================
MAINTAINERS: Networking file coverage updates

The aim of this proposal is to make the handling of some files,
related to Networking and Wireless, more consistently. It does so by:

1. Adding some more headers to the UDP section, making it consistent
   with the TCP section.

2. Excluding some files relating to Wireless from NETWORKING [GENERAL],
   making their handling consistent with other files related to
   Wireless.

The aim of this is to make things more consistent.  And for MAINTAINERS
to better reflect the situation on the ground.  I am more than happy to
be told that the current state of affairs is fine. Or for other ideas to
be discussed.

v1: https://lore.kernel.org/20241004-maint-net-hdrs-v1-0-41fd555aacc5@kernel.org
====================

Link: https://patch.msgid.link/20241009-maint-net-hdrs-v2-0-f2c86e7309c8@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 months agoMAINTAINERS: Add headers and mailing list to UDP section
Simon Horman [Wed, 9 Oct 2024 08:47:23 +0000 (09:47 +0100)]
MAINTAINERS: Add headers and mailing list to UDP section

Add netdev mailing list and some more udp.h headers to the UDP section.
This is now more consistent with the TCP section.

Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20241009-maint-net-hdrs-v2-2-f2c86e7309c8@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 months agoMAINTAINERS: consistently exclude wireless files from NETWORKING [GENERAL]
Simon Horman [Wed, 9 Oct 2024 08:47:22 +0000 (09:47 +0100)]
MAINTAINERS: consistently exclude wireless files from NETWORKING [GENERAL]

We already exclude wireless drivers from the netdev@ traffic, to
delegate it to linux-wireless@, and avoid overwhelming netdev@.

Many of the following wireless-related sections MAINTAINERS
are already not included in the NETWORKING [GENERAL] section.
For consistency, exclude those that are.

* 802.11 (including CFG80211/NL80211)
* MAC80211
* RFKILL

Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20241009-maint-net-hdrs-v2-1-f2c86e7309c8@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 months agoslip: make slhc_remember() more robust against malicious packets
Eric Dumazet [Wed, 9 Oct 2024 09:11:32 +0000 (09:11 +0000)]
slip: make slhc_remember() more robust against malicious packets

syzbot found that slhc_remember() was missing checks against
malicious packets [1].

slhc_remember() only checked the size of the packet was at least 20,
which is not good enough.

We need to make sure the packet includes the IPv4 and TCP header
that are supposed to be carried.

Add iph and th pointers to make the code more readable.

[1]

BUG: KMSAN: uninit-value in slhc_remember+0x2e8/0x7b0 drivers/net/slip/slhc.c:666
  slhc_remember+0x2e8/0x7b0 drivers/net/slip/slhc.c:666
  ppp_receive_nonmp_frame+0xe45/0x35e0 drivers/net/ppp/ppp_generic.c:2455
  ppp_receive_frame drivers/net/ppp/ppp_generic.c:2372 [inline]
  ppp_do_recv+0x65f/0x40d0 drivers/net/ppp/ppp_generic.c:2212
  ppp_input+0x7dc/0xe60 drivers/net/ppp/ppp_generic.c:2327
  pppoe_rcv_core+0x1d3/0x720 drivers/net/ppp/pppoe.c:379
  sk_backlog_rcv+0x13b/0x420 include/net/sock.h:1113
  __release_sock+0x1da/0x330 net/core/sock.c:3072
  release_sock+0x6b/0x250 net/core/sock.c:3626
  pppoe_sendmsg+0x2b8/0xb90 drivers/net/ppp/pppoe.c:903
  sock_sendmsg_nosec net/socket.c:729 [inline]
  __sock_sendmsg+0x30f/0x380 net/socket.c:744
  ____sys_sendmsg+0x903/0xb60 net/socket.c:2602
  ___sys_sendmsg+0x28d/0x3c0 net/socket.c:2656
  __sys_sendmmsg+0x3c1/0x960 net/socket.c:2742
  __do_sys_sendmmsg net/socket.c:2771 [inline]
  __se_sys_sendmmsg net/socket.c:2768 [inline]
  __x64_sys_sendmmsg+0xbc/0x120 net/socket.c:2768
  x64_sys_call+0xb6e/0x3ba0 arch/x86/include/generated/asm/syscalls_64.h:308
  do_syscall_x64 arch/x86/entry/common.c:52 [inline]
  do_syscall_64+0xcd/0x1e0 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

Uninit was created at:
  slab_post_alloc_hook mm/slub.c:4091 [inline]
  slab_alloc_node mm/slub.c:4134 [inline]
  kmem_cache_alloc_node_noprof+0x6bf/0xb80 mm/slub.c:4186
  kmalloc_reserve+0x13d/0x4a0 net/core/skbuff.c:587
  __alloc_skb+0x363/0x7b0 net/core/skbuff.c:678
  alloc_skb include/linux/skbuff.h:1322 [inline]
  sock_wmalloc+0xfe/0x1a0 net/core/sock.c:2732
  pppoe_sendmsg+0x3a7/0xb90 drivers/net/ppp/pppoe.c:867
  sock_sendmsg_nosec net/socket.c:729 [inline]
  __sock_sendmsg+0x30f/0x380 net/socket.c:744
  ____sys_sendmsg+0x903/0xb60 net/socket.c:2602
  ___sys_sendmsg+0x28d/0x3c0 net/socket.c:2656
  __sys_sendmmsg+0x3c1/0x960 net/socket.c:2742
  __do_sys_sendmmsg net/socket.c:2771 [inline]
  __se_sys_sendmmsg net/socket.c:2768 [inline]
  __x64_sys_sendmmsg+0xbc/0x120 net/socket.c:2768
  x64_sys_call+0xb6e/0x3ba0 arch/x86/include/generated/asm/syscalls_64.h:308
  do_syscall_x64 arch/x86/entry/common.c:52 [inline]
  do_syscall_64+0xcd/0x1e0 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

CPU: 0 UID: 0 PID: 5460 Comm: syz.2.33 Not tainted 6.12.0-rc2-syzkaller-00006-g87d6aab2389e #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024

Fixes: b5451d783ade ("slip: Move the SLIP drivers")
Reported-by: syzbot+2ada1bc857496353be5a@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/670646db.050a0220.3f80e.0027.GAE@google.com/T/#u
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241009091132.2136321-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 months agonet/smc: Address spelling errors
Simon Horman [Wed, 9 Oct 2024 10:05:21 +0000 (11:05 +0100)]
net/smc: Address spelling errors

Address spelling errors flagged by codespell.

This patch is intended to cover all files under drivers/smc

Signed-off-by: Simon Horman <horms@kernel.org>
Reviewed-by: D. Wythe <alibuda@linux.alibaba.com>
Reviewed-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Link: https://patch.msgid.link/20241009-smc-starspell-v1-1-b8b395bbaf82@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 months agonet/smc: fix lacks of icsk_syn_mss with IPPROTO_SMC
D. Wythe [Wed, 9 Oct 2024 06:55:16 +0000 (14:55 +0800)]
net/smc: fix lacks of icsk_syn_mss with IPPROTO_SMC

Eric report a panic on IPPROTO_SMC, and give the facts
that when INET_PROTOSW_ICSK was set, icsk->icsk_sync_mss must be set too.

Bug: Unable to handle kernel NULL pointer dereference at virtual address
0000000000000000
Mem abort info:
ESR = 0x0000000086000005
EC = 0x21: IABT (current EL), IL = 32 bits
SET = 0, FnV = 0
EA = 0, S1PTW = 0
FSC = 0x05: level 1 translation fault
user pgtable: 4k pages, 48-bit VAs, pgdp=00000001195d1000
[0000000000000000] pgd=0800000109c46003, p4d=0800000109c46003,
pud=0000000000000000
Internal error: Oops: 0000000086000005 [#1] PREEMPT SMP
Modules linked in:
CPU: 1 UID: 0 PID: 8037 Comm: syz.3.265 Not tainted
6.11.0-rc7-syzkaller-g5f5673607153 #0
Hardware name: Google Google Compute Engine/Google Compute Engine,
BIOS Google 08/06/2024
pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : 0x0
lr : cipso_v4_sock_setattr+0x2a8/0x3c0 net/ipv4/cipso_ipv4.c:1910
sp : ffff80009b887a90
x29: ffff80009b887aa0 x28: ffff80008db94050 x27: 0000000000000000
x26: 1fffe0001aa6f5b3 x25: dfff800000000000 x24: ffff0000db75da00
x23: 0000000000000000 x22: ffff0000d8b78518 x21: 0000000000000000
x20: ffff0000d537ad80 x19: ffff0000d8b78000 x18: 1fffe000366d79ee
x17: ffff8000800614a8 x16: ffff800080569b84 x15: 0000000000000001
x14: 000000008b336894 x13: 00000000cd96feaa x12: 0000000000000003
x11: 0000000000040000 x10: 00000000000020a3 x9 : 1fffe0001b16f0f1
x8 : 0000000000000000 x7 : 0000000000000000 x6 : 000000000000003f
x5 : 0000000000000040 x4 : 0000000000000001 x3 : 0000000000000000
x2 : 0000000000000002 x1 : 0000000000000000 x0 : ffff0000d8b78000
Call trace:
0x0
netlbl_sock_setattr+0x2e4/0x338 net/netlabel/netlabel_kapi.c:1000
smack_netlbl_add+0xa4/0x154 security/smack/smack_lsm.c:2593
smack_socket_post_create+0xa8/0x14c security/smack/smack_lsm.c:2973
security_socket_post_create+0x94/0xd4 security/security.c:4425
__sock_create+0x4c8/0x884 net/socket.c:1587
sock_create net/socket.c:1622 [inline]
__sys_socket_create net/socket.c:1659 [inline]
__sys_socket+0x134/0x340 net/socket.c:1706
__do_sys_socket net/socket.c:1720 [inline]
__se_sys_socket net/socket.c:1718 [inline]
__arm64_sys_socket+0x7c/0x94 net/socket.c:1718
__invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
invoke_syscall+0x98/0x2b8 arch/arm64/kernel/syscall.c:49
el0_svc_common+0x130/0x23c arch/arm64/kernel/syscall.c:132
do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:151
el0_svc+0x54/0x168 arch/arm64/kernel/entry-common.c:712
el0t_64_sync_handler+0x84/0xfc arch/arm64/kernel/entry-common.c:730
el0t_64_sync+0x190/0x194 arch/arm64/kernel/entry.S:598
Code: ???????? ???????? ???????? ???????? (????????)
---[ end trace 0000000000000000 ]---

This patch add a toy implementation that performs a simple return to
prevent such panic. This is because MSS can be set in sock_create_kern
or smc_setsockopt, similar to how it's done in AF_SMC. However, for
AF_SMC, there is currently no way to synchronize MSS within
__sys_connect_file. This toy implementation lays the groundwork for us
to support such feature for IPPROTO_SMC in the future.

Fixes: d25a92ccae6b ("net/smc: Introduce IPPROTO_SMC")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Link: https://patch.msgid.link/1728456916-67035-1-git-send-email-alibuda@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 months agoppp: fix ppp_async_encode() illegal access
Eric Dumazet [Wed, 9 Oct 2024 18:58:02 +0000 (18:58 +0000)]
ppp: fix ppp_async_encode() illegal access

syzbot reported an issue in ppp_async_encode() [1]

In this case, pppoe_sendmsg() is called with a zero size.
Then ppp_async_encode() is called with an empty skb.

BUG: KMSAN: uninit-value in ppp_async_encode drivers/net/ppp/ppp_async.c:545 [inline]
 BUG: KMSAN: uninit-value in ppp_async_push+0xb4f/0x2660 drivers/net/ppp/ppp_async.c:675
  ppp_async_encode drivers/net/ppp/ppp_async.c:545 [inline]
  ppp_async_push+0xb4f/0x2660 drivers/net/ppp/ppp_async.c:675
  ppp_async_send+0x130/0x1b0 drivers/net/ppp/ppp_async.c:634
  ppp_channel_bridge_input drivers/net/ppp/ppp_generic.c:2280 [inline]
  ppp_input+0x1f1/0xe60 drivers/net/ppp/ppp_generic.c:2304
  pppoe_rcv_core+0x1d3/0x720 drivers/net/ppp/pppoe.c:379
  sk_backlog_rcv+0x13b/0x420 include/net/sock.h:1113
  __release_sock+0x1da/0x330 net/core/sock.c:3072
  release_sock+0x6b/0x250 net/core/sock.c:3626
  pppoe_sendmsg+0x2b8/0xb90 drivers/net/ppp/pppoe.c:903
  sock_sendmsg_nosec net/socket.c:729 [inline]
  __sock_sendmsg+0x30f/0x380 net/socket.c:744
  ____sys_sendmsg+0x903/0xb60 net/socket.c:2602
  ___sys_sendmsg+0x28d/0x3c0 net/socket.c:2656
  __sys_sendmmsg+0x3c1/0x960 net/socket.c:2742
  __do_sys_sendmmsg net/socket.c:2771 [inline]
  __se_sys_sendmmsg net/socket.c:2768 [inline]
  __x64_sys_sendmmsg+0xbc/0x120 net/socket.c:2768
  x64_sys_call+0xb6e/0x3ba0 arch/x86/include/generated/asm/syscalls_64.h:308
  do_syscall_x64 arch/x86/entry/common.c:52 [inline]
  do_syscall_64+0xcd/0x1e0 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

Uninit was created at:
  slab_post_alloc_hook mm/slub.c:4092 [inline]
  slab_alloc_node mm/slub.c:4135 [inline]
  kmem_cache_alloc_node_noprof+0x6bf/0xb80 mm/slub.c:4187
  kmalloc_reserve+0x13d/0x4a0 net/core/skbuff.c:587
  __alloc_skb+0x363/0x7b0 net/core/skbuff.c:678
  alloc_skb include/linux/skbuff.h:1322 [inline]
  sock_wmalloc+0xfe/0x1a0 net/core/sock.c:2732
  pppoe_sendmsg+0x3a7/0xb90 drivers/net/ppp/pppoe.c:867
  sock_sendmsg_nosec net/socket.c:729 [inline]
  __sock_sendmsg+0x30f/0x380 net/socket.c:744
  ____sys_sendmsg+0x903/0xb60 net/socket.c:2602
  ___sys_sendmsg+0x28d/0x3c0 net/socket.c:2656
  __sys_sendmmsg+0x3c1/0x960 net/socket.c:2742
  __do_sys_sendmmsg net/socket.c:2771 [inline]
  __se_sys_sendmmsg net/socket.c:2768 [inline]
  __x64_sys_sendmmsg+0xbc/0x120 net/socket.c:2768
  x64_sys_call+0xb6e/0x3ba0 arch/x86/include/generated/asm/syscalls_64.h:308
  do_syscall_x64 arch/x86/entry/common.c:52 [inline]
  do_syscall_64+0xcd/0x1e0 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

CPU: 1 UID: 0 PID: 5411 Comm: syz.1.14 Not tainted 6.12.0-rc1-syzkaller-00165-g360c1f1f24c6 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: syzbot+1d121645899e7692f92a@syzkaller.appspotmail.com
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20241009185802.3763282-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 months agodocs: netdev: document guidance on cleanup patches
Simon Horman [Wed, 9 Oct 2024 09:12:19 +0000 (10:12 +0100)]
docs: netdev: document guidance on cleanup patches

The purpose of this section is to document what is the current practice
regarding clean-up patches which address checkpatch warnings and similar
problems. I feel there is a value in having this documented so others
can easily refer to it.

Clearly this topic is subjective. And to some extent the current
practice discourages a wider range of patches than is described here.
But I feel it is best to start somewhere, with the most well established
part of the current practice.

Signed-off-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20241009-doc-mc-clean-v2-1-e637b665fa81@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 months agoMerge branch 'net-introduce-tx-h-w-shaping-api'
Jakub Kicinski [Thu, 10 Oct 2024 15:31:06 +0000 (08:31 -0700)]
Merge branch 'net-introduce-tx-h-w-shaping-api'

Paolo Abeni says:

====================
net: introduce TX H/W shaping API

We have a plurality of shaping-related drivers API, but none flexible
enough to meet existing demand from vendors[1].

This series introduces new device APIs to configure in a flexible way
TX H/W shaping. The new functionalities are exposed via a newly
defined generic netlink interface and include introspection
capabilities. Some self-tests are included, on top of a dummy
netdevsim implementation. Finally a basic implementation for the iavf
driver is provided.

Some usage examples:

* Configure shaping on a given queue:

./tools/net/ynl/cli.py --spec Documentation/netlink/specs/shaper.yaml \
--do set --json '{"ifindex": '$IFINDEX',
  "shaper": {"handle":
     {"scope": "queue", "id":'$QUEUEID'},
  "bw-max": 2000000}}'

* Container B/W sharing

The orchestration infrastructure wants to group the
container-related queues under a RR scheduling and limit the aggregate
bandwidth:

./tools/net/ynl/cli.py --spec Documentation/netlink/specs/shaper.yaml \
--do group --json '{"ifindex": '$IFINDEX',
"leaves": [
  {"handle": {"scope": "queue", "id":'$QID1'},
   "weight": '$W1'},
  {"handle": {"scope": "queue", "id":'$QID2'},
   "weight": '$W2'}],
  {"handle": {"scope": "queue", "id":'$QID3'},
   "weight": '$W3'}],
"handle": {"scope":"node"},
"bw-max": 10000000}'
{'ifindex': $IFINDEX, 'handle': {'scope': 'node', 'id': 0}}

Q1 \
    \
Q2 -- node 0 -------  netdev
    / (bw-max: 10M)
Q3 /

* Delegation

A containers wants to limit the aggregate B/W bandwidth of 2 of the 3
queues it owns - the starting configuration is the one from the
previous point:

SPEC=Documentation/netlink/specs/net_shaper.yaml
./tools/net/ynl/cli.py --spec $SPEC \
--do group --json '{"ifindex": '$IFINDEX',
"leaves": [
  {"handle": {"scope": "queue", "id":'$QID1'},
   "weight": '$W1'},
  {"handle": {"scope": "queue", "id":'$QID2'},
   "weight": '$W2'}],
"handle": {"scope": "node"},
"bw-max": 5000000 }'
{'ifindex': $IFINDEX, 'handle': {'scope': 'node', 'id': 1}}

Q1 -- node 1 --------\
    / (bw-max: 5M)    \
Q2 /                   node 0 -------  netdev
                      /(bw-max: 10M)
Q3 ------------------/

In a group operation, when prior to the op itself, the leaves have
different parents, the user must specify the parent handle for the
group. I.e., starting from the previous config:

./tools/net/ynl/cli.py --spec $SPEC \
--do group --json '{"ifindex": '$IFINDEX',
"leaves": [
  {"handle": {"scope": "queue", "id":'$QID1'},
   "weight": '$W1'},
  {"handle": {"scope": "queue", "id":'$QID3'},
   "weight": '$W3'}],
"handle": {"scope": "node"},
"bw-max": 3000000 }'
Netlink error: Invalid argument
nl_len = 96 (80) nl_flags = 0x300 nl_type = 2
error: -22
extack: {'msg': 'All the leaves shapers must have the same old parent'}

./tools/net/ynl/cli.py --spec $SPEC \
--do group --json '{"ifindex": '$IFINDEX',
"leaves": [
  {"handle": {"scope": "queue", "id":'$QID1'},
   "weight": '$W1'},
  {"handle": {"scope": "queue", "id":'$QID3'},
   "weight": '$W3'}],
"handle": {"scope": "node"},
"parent": {"scope": "node", "id": 1},
"bw-max": 3000000 }
{'ifindex': $IFINDEX, 'handle': {'scope': 'node', 'id': 2}}

Q1 -- node 2 ---
    /(bw-max:3M)\
Q3 /             \
         ---- node 1 \
        / (bw-max: 5M)\
      Q2              node 0 -------  netdev
                      (bw-max: 10M)

* Cleanup:

Still starting from config 1To delete a single queue shaper

./tools/net/ynl/cli.py --spec $SPEC --do delete --json \
'{"ifindex": '$IFINDEX',
  "handle": {"scope": "queue", "id":'$QID3'}}'

Q1 -- node 2 ---
     (bw-max:3M)\
                 \
         ---- node 1 \
        / (bw-max: 5M)\
      Q2              node 0 -------  netdev
                      (bw-max: 10M)

Deleting a node shaper relinks all its leaves to the node's parent:

./tools/net/ynl/cli.py --spec $SPEC --do delete --json \
'{"ifindex": '$IFINDEX',
  "handle": {"scope": "node", "id":2}}'

Q1 ---\
       \
        node 1----- \
       / (bw-max: 5M)\
Q2----/              node 0 -------  netdev
                     (bw-max: 10M)

Deleting the last shaper under a node shaper deletes the node, too:

./tools/net/ynl/cli.py --spec $SPEC --do delete --json \
'{"ifindex": '$IFINDEX',
  "handle": {"scope": "queue", "id":'$QID1'}}'
./tools/net/ynl/cli.py --spec $SPEC --do delete --json \
'{"ifindex": '$IFINDEX',
  "handle": {"scope": "queue", "id":'$QID2'}}'
./tools/net/ynl/cli.py --spec $SPEC --do get --json \
'{"ifindex": '$IFINDEX',
  "handle": {"scope": "node", "id": 1}}'
Netlink error: No such file or directory
nl_len = 44 (28) nl_flags = 0x300 nl_type = 2
error: -2
extack: {'bad-attr': '.handle'}

Such delete recurses on parents that are left over with no leaves:

./tools/net/ynl/cli.py --spec $SPEC --do get --json \
'{"ifindex": '$IFINDEX',
  "handle": {"scope": "node", "id": 0}}'
Netlink error: No such file or directory
nl_len = 44 (28) nl_flags = 0x300 nl_type = 2
error: -2
extack: {'bad-attr': '.handle'}

v8: https://lore.kernel.org/cover.1727704215.git.pabeni@redhat.com
v7: https://lore.kernel.org/cover.1725919039.git.pabeni@redhat.com
v6: https://lore.kernel.org/cover.1725457317.git.pabeni@redhat.com
v5: https://lore.kernel.org/cover.1724944116.git.pabeni@redhat.com
v4: https://lore.kernel.org/cover.1724165948.git.pabeni@redhat.com
v3: https://lore.kernel.org/cover.1722357745.git.pabeni@redhat.com
RFC v2: https://lore.kernel.org/cover.1721851988.git.pabeni@redhat.com
RFC v1: https://lore.kernel.org/cover.1719518113.git.pabeni@redhat.com
====================

Link: https://patch.msgid.link/cover.1728460186.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 months agoiavf: add support to exchange qos capabilities
Sudheer Mogilappagari [Wed, 9 Oct 2024 08:10:01 +0000 (10:10 +0200)]
iavf: add support to exchange qos capabilities

During driver initialization VF determines QOS capability is allowed
by PF and receives QOS parameters. After which quanta size for queues
is configured which is not configurable and is set to 1KB currently.

Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Sudheer Mogilappagari <sudheer.mogilappagari@intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/72cbeb9c88d40e557053c57d7531c96bed490576.1728460186.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 months agoiavf: Add net_shaper_ops support
Sudheer Mogilappagari [Wed, 9 Oct 2024 08:10:00 +0000 (10:10 +0200)]
iavf: Add net_shaper_ops support

Implement net_shaper_ops support for IAVF. This enables configuration
of rate limiting on per queue basis. Customer intends to enforce
bandwidth limit on Tx traffic steered to the queue by configuring
rate limits on the queue.

To set rate limiting for a queue, update shaper object of given queues
in driver and send VIRTCHNL_OP_CONFIG_QUEUE_BW to PF to update HW
configuration.

Deleting shaper configured for queue is nothing but configuring shaper
with bw_max 0. The PF restores the default rate limiting config
when bw_max is zero.

Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Sudheer Mogilappagari <sudheer.mogilappagari@intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/5a882cb51998c4c2c3d21fed521498eba1c8f079.1728460186.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 months agoice: Support VF queue rate limit and quanta size configuration
Wenjun Wu [Wed, 9 Oct 2024 08:09:59 +0000 (10:09 +0200)]
ice: Support VF queue rate limit and quanta size configuration

Add support to configure VF queue rate limit and quanta size.

For quanta size configuration, the quanta profiles are divided evenly
by PF numbers. For each port, the first quanta profile is reserved for
default. When VF is asked to set queue quanta size, PF will search for
an available profile, change the fields and assigned this profile to the
queue.

Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Wenjun Wu <wenjun1.wu@intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/fddefc2c1ec3ab32b241ce444af401da19e834dd.1728460186.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 months agovirtchnl: support queue rate limit and quanta size configuration
Wenjun Wu [Wed, 9 Oct 2024 08:09:58 +0000 (10:09 +0200)]
virtchnl: support queue rate limit and quanta size configuration

This patch adds new virtchnl opcodes and structures for rate limit
and quanta size configuration, which include:
1. VIRTCHNL_OP_CONFIG_QUEUE_BW, to configure max bandwidth for each
VF per queue.
2. VIRTCHNL_OP_CONFIG_QUANTA, to configure quanta size per queue.
3. VIRTCHNL_OP_GET_QOS_CAPS, VF queries current QoS configuration, such
as enabled TCs, arbiter type, up2tc and bandwidth of VSI node. The
configuration is previously set by DCB and PF, and now is the potential
QoS capability of VF. VF can take it as reference to configure queue TC
mapping.

Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Wenjun Wu <wenjun1.wu@intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/839002f7bd6f63b985a060a51b079f6e6dbbe237.1728460186.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 months agotesting: net-drv: add basic shaper test
Paolo Abeni [Wed, 9 Oct 2024 08:09:57 +0000 (10:09 +0200)]
testing: net-drv: add basic shaper test

Leverage a basic/dummy netdevsim implementation to do functional
coverage for NL interface.

Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/43092afbf38365c796088bf8fc155e523ab434ae.1728460186.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 months agonet-shapers: implement cap validation in the core
Paolo Abeni [Wed, 9 Oct 2024 08:09:56 +0000 (10:09 +0200)]
net-shapers: implement cap validation in the core

Use the device capabilities to reject invalid attribute values before
pushing them to the H/W.

Note that validating the metric explicitly avoids NL_SET_BAD_ATTR()
usage, to provide unambiguous error messages to the user.

Validating the nesting requires the knowledge of the new parent for
the given shaper; as such is a chicken-egg problem: to validate the
leaf nesting we need to know the node scope, to validate the node
nesting we need to know the leafs parent scope.

To break the circular dependency, place the leafs nesting validation
after the parsing.

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/54667601813e4c0348f39bf8ad2446ffc9fcd383.1728460186.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 months agonet: shaper: implement introspection support
Paolo Abeni [Wed, 9 Oct 2024 08:09:55 +0000 (10:09 +0200)]
net: shaper: implement introspection support

The netlink op is a simple wrapper around the device callback.

Extend the existing fetch_dev() helper adding an attribute argument
for the requested device. Reuse such helper in the newly implemented
operation.

Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/66eb62f22b3a5ba06ca91d01ae77515e5f447e15.1728460186.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 months agonetlink: spec: add shaper introspection support
Paolo Abeni [Wed, 9 Oct 2024 08:09:54 +0000 (10:09 +0200)]
netlink: spec: add shaper introspection support

Allow the user-space to fine-grain query the shaping features
supported by the NIC on each domain.

Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/3ddd10e450e3fe7d4b944c0d0b886d4483529ee6.1728460186.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 months agonet-shapers: implement shaper cleanup on queue deletion
Paolo Abeni [Wed, 9 Oct 2024 08:09:53 +0000 (10:09 +0200)]
net-shapers: implement shaper cleanup on queue deletion

hook into netif_set_real_num_tx_queues() to cleanup any shaper
configured on top of the to-be-destroyed TX queues.

Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/6da4ee03cae2b2a757d7b59e88baf09cc94c5ef1.1728460186.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 months agonet-shapers: implement delete support for NODE scope shaper
Paolo Abeni [Wed, 9 Oct 2024 08:09:52 +0000 (10:09 +0200)]
net-shapers: implement delete support for NODE scope shaper

Leverage the previously introduced group operation to implement
the removal of NODE scope shaper, re-linking its leaves under the
the parent node before actually deleting the specified NODE scope
shaper.

Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/763d484b5b69e365acccfd8031b183c647a367a4.1728460186.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 months agonet-shapers: implement NL group operation
Paolo Abeni [Wed, 9 Oct 2024 08:09:51 +0000 (10:09 +0200)]
net-shapers: implement NL group operation

Allow grouping multiple leaves shaper under the given root.
The node and the leaves shapers are created, if needed, otherwise
the existing shapers are re-linked as requested.

Try hard to pre-allocated the needed resources, to avoid non
trivial H/W configuration rollbacks in case of any failure.

Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/8a721274fde18b872d1e3a61aaa916bb7b7996d3.1728460186.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 months agonet-shapers: implement NL set and delete operations
Paolo Abeni [Wed, 9 Oct 2024 08:09:50 +0000 (10:09 +0200)]
net-shapers: implement NL set and delete operations

Both NL operations directly map on the homonymous device shaper
callbacks, update accordingly the shapers cache and are serialized
via a per device lock.
Implement the cache modification helpers to additionally deal with
NODE scope shaper. That will be needed by the group() operation
implemented in the next patch.
The delete implementation is partial: does not handle NODE scope
shaper yet. Such support will require infrastructure from
the next patch and will be implemented later in the series.

Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/1e6a34a4095b35d773d2b9c476164671bbcf8397.1728460186.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 months agonet-shapers: implement NL get operation
Paolo Abeni [Wed, 9 Oct 2024 08:09:49 +0000 (10:09 +0200)]
net-shapers: implement NL get operation

Introduce the basic infrastructure to implement the net-shaper
core functionality. Each network devices carries a net-shaper cache,
the NL get() operation fetches the data from such cache.

The cache is initially empty, will be fill by the set()/group()
operation implemented later and is destroyed at device cleanup time.

The net_shaper_fill_handle(), net_shaper_ctx_init(), and
net_shaper_generic_pre() implementations handle generic index type
attributes, despite the current caller always pass a constant value
to avoid more noise in later patches using them with different
attributes.

Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/ddd10fd645a9367803ad02fca4a5664ea5ace170.1728460186.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 months agonetlink: spec: add shaper YAML spec
Paolo Abeni [Wed, 9 Oct 2024 08:09:48 +0000 (10:09 +0200)]
netlink: spec: add shaper YAML spec

Define the user-space visible interface to query, configure and delete
network shapers via yaml definition.

Add dummy implementations for the relevant NL callbacks.

set() and delete() operations touch a single shaper creating/updating or
deleting it.
The group() operation creates a shaper's group, nesting multiple input
shapers under the specified output shaper.

Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/7a33a1ff370bdbcd0cd3f909575c912cd56f41da.1728460186.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 months agogenetlink: extend info user-storage to match NL cb ctx
Paolo Abeni [Wed, 9 Oct 2024 08:09:47 +0000 (10:09 +0200)]
genetlink: extend info user-storage to match NL cb ctx

This allows a more uniform implementation of non-dump and dump
operations, and will be used later in the series to avoid some
per-operation allocation.

Additionally rename the NL_ASSERT_DUMP_CTX_FITS macro, to
fit a more extended usage.

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/1130cc2896626b84587a2a5f96a5c6829638f4da.1728460186.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 months agoMerge branch 'rtnetlink-handle-error-of-rtnl_register_module'
Paolo Abeni [Thu, 10 Oct 2024 13:39:37 +0000 (15:39 +0200)]
Merge branch 'rtnetlink-handle-error-of-rtnl_register_module'

Kuniyuki Iwashima says:

====================
rtnetlink: Handle error of rtnl_register_module().

While converting phonet to per-netns RTNL, I found a weird comment

  /* Further rtnl_register_module() cannot fail */

that was true but no longer true after commit addf9b90de22 ("net:
rtnetlink: use rcu to free rtnl message handlers").

Many callers of rtnl_register_module() just ignore the returned
value but should handle them properly.

This series introduces two helpers, rtnl_register_many() and
rtnl_unregister_many(), to do that easily and fix such callers.

All rtnl_register() and rtnl_register_module() will be converted
to _many() variant and some rtnl_lock() will be saved in _many()
later in net-next.

Changes:
  v4:
    * Add more context in changelog of each patch

  v3: https://lore.kernel.org/all/20241007124459.5727-1-kuniyu@amazon.com/
    * Move module *owner to struct rtnl_msg_handler
    * Make struct rtnl_msg_handler args/vars const
    * Update mctp goto labels

  v2: https://lore.kernel.org/netdev/20241004222358.79129-1-kuniyu@amazon.com/
    * Remove __exit from mctp_neigh_exit().

  v1: https://lore.kernel.org/netdev/20241003205725.5612-1-kuniyu@amazon.com/
====================

Link: https://patch.msgid.link/20241008184737.9619-1-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
12 months agophonet: Handle error of rtnl_register_module().
Kuniyuki Iwashima [Tue, 8 Oct 2024 18:47:37 +0000 (11:47 -0700)]
phonet: Handle error of rtnl_register_module().

Before commit addf9b90de22 ("net: rtnetlink: use rcu to free rtnl
message handlers"), once the first rtnl_register_module() allocated
rtnl_msg_handlers[PF_PHONET], the following calls never failed.

However, after the commit, rtnl_register_module() could fail silently
to allocate rtnl_msg_handlers[PF_PHONET][msgtype] and requires error
handling for each call.

Handling the error allows users to view a module as an all-or-nothing
thing in terms of the rtnetlink functionality.  This prevents syzkaller
from reporting spurious errors from its tests, where OOM often occurs
and module is automatically loaded.

Let's use rtnl_register_many() to handle the errors easily.

Fixes: addf9b90de22 ("net: rtnetlink: use rcu to free rtnl message handlers")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Rémi Denis-Courmont <courmisch@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
12 months agompls: Handle error of rtnl_register_module().
Kuniyuki Iwashima [Tue, 8 Oct 2024 18:47:36 +0000 (11:47 -0700)]
mpls: Handle error of rtnl_register_module().

Since introduced, mpls_init() has been ignoring the returned
value of rtnl_register_module(), which could fail silently.

Handling the error allows users to view a module as an all-or-nothing
thing in terms of the rtnetlink functionality.  This prevents syzkaller
from reporting spurious errors from its tests, where OOM often occurs
and module is automatically loaded.

Let's handle the errors by rtnl_register_many().

Fixes: 03c0566542f4 ("mpls: Netlink commands to add, remove, and dump routes")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
12 months agomctp: Handle error of rtnl_register_module().
Kuniyuki Iwashima [Tue, 8 Oct 2024 18:47:35 +0000 (11:47 -0700)]
mctp: Handle error of rtnl_register_module().

Since introduced, mctp has been ignoring the returned value of
rtnl_register_module(), which could fail silently.

Handling the error allows users to view a module as an all-or-nothing
thing in terms of the rtnetlink functionality.  This prevents syzkaller
from reporting spurious errors from its tests, where OOM often occurs
and module is automatically loaded.

Let's handle the errors by rtnl_register_many().

Fixes: 583be982d934 ("mctp: Add device handling and netlink interface")
Fixes: 831119f88781 ("mctp: Add neighbour netlink interface")
Fixes: 06d2f4c583a7 ("mctp: Add netlink route management")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
12 months agobridge: Handle error of rtnl_register_module().
Kuniyuki Iwashima [Tue, 8 Oct 2024 18:47:34 +0000 (11:47 -0700)]
bridge: Handle error of rtnl_register_module().

Since introduced, br_vlan_rtnl_init() has been ignoring the returned
value of rtnl_register_module(), which could fail silently.

Handling the error allows users to view a module as an all-or-nothing
thing in terms of the rtnetlink functionality.  This prevents syzkaller
from reporting spurious errors from its tests, where OOM often occurs
and module is automatically loaded.

Let's handle the errors by rtnl_register_many().

Fixes: 8dcea187088b ("net: bridge: vlan: add rtm definitions and dump support")
Fixes: f26b296585dc ("net: bridge: vlan: add new rtm message support")
Fixes: adb3ce9bcb0f ("net: bridge: vlan: add del rtm message support")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
12 months agovxlan: Handle error of rtnl_register_module().
Kuniyuki Iwashima [Tue, 8 Oct 2024 18:47:33 +0000 (11:47 -0700)]
vxlan: Handle error of rtnl_register_module().

Since introduced, vxlan_vnifilter_init() has been ignoring the
returned value of rtnl_register_module(), which could fail silently.

Handling the error allows users to view a module as an all-or-nothing
thing in terms of the rtnetlink functionality.  This prevents syzkaller
from reporting spurious errors from its tests, where OOM often occurs
and module is automatically loaded.

Let's handle the errors by rtnl_register_many().

Fixes: f9c4bb0b245c ("vxlan: vni filtering support on collect metadata device")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
12 months agortnetlink: Add bulk registration helpers for rtnetlink message handlers.
Kuniyuki Iwashima [Tue, 8 Oct 2024 18:47:32 +0000 (11:47 -0700)]
rtnetlink: Add bulk registration helpers for rtnetlink message handlers.

Before commit addf9b90de22 ("net: rtnetlink: use rcu to free rtnl message
handlers"), once rtnl_msg_handlers[protocol] was allocated, the following
rtnl_register_module() for the same protocol never failed.

However, after the commit, rtnl_msg_handler[protocol][msgtype] needs to
be allocated in each rtnl_register_module(), so each call could fail.

Many callers of rtnl_register_module() do not handle the returned error,
and we need to add many error handlings.

To handle that easily, let's add wrapper functions for bulk registration
of rtnetlink message handlers.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
12 months agonet: phy: Validate PHY LED OPs presence before registering
Christian Marangi [Tue, 8 Oct 2024 19:47:16 +0000 (21:47 +0200)]
net: phy: Validate PHY LED OPs presence before registering

Validate PHY LED OPs presence before registering and parsing them.
Defining LED nodes for a PHY driver that actually doesn't supports them
is redundant and useless.

It's also the case with Generic PHY driver used and a DT having LEDs
node for the specific PHY.

Skip it and report the error with debug print enabled.

Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20241008194718.9682-1-ansuelsmth@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
12 months agoMerge tag 'nf-24-10-09' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Paolo Abeni [Thu, 10 Oct 2024 11:50:55 +0000 (13:50 +0200)]
Merge tag 'nf-24-10-09' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Restrict xtables extensions to families that are safe, syzbot found
   a way to combine ebtables with extensions that are never used by
   userspace tools. From Florian Westphal.

2) Set l3mdev inconditionally whenever possible in nft_fib to fix lookup
   mismatch, also from Florian.

netfilter pull request 24-10-09

* tag 'nf-24-10-09' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  selftests: netfilter: conntrack_vrf.sh: add fib test case
  netfilter: fib: check correct rtable in vrf setups
  netfilter: xtables: avoid NFPROTO_UNSPEC where needed
====================

Link: https://patch.msgid.link/20241009213858.3565808-1-pablo@netfilter.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
12 months agoMerge branch 'net-mlx5-qos-refactor-esw-qos-to-support-new-features'
Paolo Abeni [Thu, 10 Oct 2024 11:12:02 +0000 (13:12 +0200)]
Merge branch 'net-mlx5-qos-refactor-esw-qos-to-support-new-features'

Tariq Toukan says:

====================
net/mlx5: qos: Refactor esw qos to support new features

This patch series by Cosmin and Carolina prepares the mlx5 qos infra for
the upcoming feature of cross E-Switch scheduling.

Noop cleanups:
net/mlx5: qos: Flesh out element_attributes in mlx5_ifc.h
net/mlx5: qos: Rename vport 'tsar' into 'sched_elem'.
net/mlx5: qos: Consistently name vport vars as 'vport'
net/mlx5: qos: Refactor and document bw_share calculation
net/mlx5: qos: Rename rate group 'list' as 'parent_entry'

Refactor the code with the goal of moving groups out of E-Switches:
net/mlx5: qos: Maintain rate group vport members in a list
net/mlx5: qos: Always create group0
net/mlx5: qos: Drop 'esw' param from vport qos functions
net/mlx5: qos: Store the eswitch in a mlx5_esw_rate_group

Move groups from an E-Switch into an mlx5_qos_domain:
net/mlx5: qos: Store rate groups in a qos domain

Refactor locking to use a new mutex in the qos domain:
net/mlx5: qos: Refactor locking to a qos domain mutex

In follow-up patchsets, we'll allow qos domains to be shared
between E-Switches of the same NIC.

The two top patches are simple enhancements.
====================

Link: https://patch.msgid.link/20241008183222.137702-1-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
12 months agonet/mlx5: Add support check for TSAR types in QoS scheduling
Carolina Jubran [Tue, 8 Oct 2024 18:32:22 +0000 (21:32 +0300)]
net/mlx5: Add support check for TSAR types in QoS scheduling

Introduce a new function, mlx5_qos_tsar_type_supported(), to handle the
validation of TSAR types within QoS scheduling contexts.

Refactor the existing code to use this new function, replacing direct
checks for TSAR type support in the NIC scheduling hierarchy.

Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
12 months agonet/mlx5: Unify QoS element type checks across NIC and E-Switch
Carolina Jubran [Tue, 8 Oct 2024 18:32:21 +0000 (21:32 +0300)]
net/mlx5: Unify QoS element type checks across NIC and E-Switch

Refactor the QoS element type support check by introducing a new
function, mlx5_qos_element_type_supported(), which handles element type
validation for both NIC and E-Switch schedulers.

This change removes the redundant esw_qos_element_type_supported()
function and unifies the element type checks into a single
implementation.

Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
12 months agonet/mlx5: qos: Refactor locking to a qos domain mutex
Cosmin Ratiu [Tue, 8 Oct 2024 18:32:20 +0000 (21:32 +0300)]
net/mlx5: qos: Refactor locking to a qos domain mutex

E-Switch qos changes used the esw state_lock to serialize qos changes.
With the introduction of cross-esw scheduling, multiple E-Switches might
be involved in a qos operation, so prepare for that by switching locking
to use a qos domain mutex.
Add three helper functions:
- esw_qos_lock
- esw_qos_unlock
- esw_assert_qos_lock_held

Convert existing direct lock/unlock/lockdep calls to them. Also call
esw_assert_qos_lock_held in a couple more places.

mlx5_esw_qos_set_vport_rate expected to be called with the esw
state_lock already held.
Change it to instead acquire the qos lock directly.

mlx5_eswitch_get_vport_config also accessed qos properties with the esw
state lock. Introduce a new function mlx5_esw_qos_get_vport_rate to
access those with the correct lock and change get_vport_config to use
it.

Finally, mlx5_vport_disable is called from the cleanup path with the esw
state_lock held, so have it additionally acquire the qos lock to make
sure there are no races.

Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
12 months agonet/mlx5: qos: Store rate groups in a qos domain
Cosmin Ratiu [Tue, 8 Oct 2024 18:32:19 +0000 (21:32 +0300)]
net/mlx5: qos: Store rate groups in a qos domain

Groups are currently maintained as a list in their corresponding
eswitch, protected by the esw state_lock.
The upcoming cross-eswitch scheduling feature cannot work with this
approach, as it would require acquiring multiple eswitch locks (in the
correct order) in order to maintain group membership.

This commit moves the rate groups into a new 'qos domain' struct and
adds explicit qos init/cleanup steps to the eswitch init/cleanup.
Upcoming patches will expand the qos domain struct and allow it to be
shared between eswitches. For now, qos domains are private to each esw
so there's only an extra indirection.

Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
12 months agonet/mlx5: qos: Rename rate group 'list' as 'parent_entry'
Cosmin Ratiu [Tue, 8 Oct 2024 18:32:18 +0000 (21:32 +0300)]
net/mlx5: qos: Rename rate group 'list' as 'parent_entry'

'list' is not very descriptive, I prefer list membership to clearly
specify which list the entry belongs to. This commit renames the list
entry into the esw groups list as 'parent_entry' to make the code more
readable. This is a no-op change.

Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
12 months agonet/mlx5: qos: Add an explicit 'dev' to vport trace calls
Cosmin Ratiu [Tue, 8 Oct 2024 18:32:17 +0000 (21:32 +0300)]
net/mlx5: qos: Add an explicit 'dev' to vport trace calls

vport qos trace calls used vport->dev implicitly as the device to which
the command was sent (and thus the device logged in traces).
But that will no longer be the case for cross-esw scheduling, where the
commands have to be sent to the group esw device instead.

This commit corrects that.

Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
12 months agonet/mlx5: qos: Store the eswitch in a mlx5_esw_rate_group
Cosmin Ratiu [Tue, 8 Oct 2024 18:32:16 +0000 (21:32 +0300)]
net/mlx5: qos: Store the eswitch in a mlx5_esw_rate_group

The rate groups are about to be moved out of eswitches, so store a
reference to the eswitch they belong to so things can still work
later.

This allows dropping the esw parameter from a couple of functions and
simplifying some of the code. Use this opportunity to make sure that
vport scheduling element commands are always sent to the group eswitch,
because that will be relevant for cross-esw scheduling. For now though,
the eswitches are not different.

There is no functionality change here.

Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
12 months agonet/mlx5: qos: Drop 'esw' param from vport qos functions
Cosmin Ratiu [Tue, 8 Oct 2024 18:32:15 +0000 (21:32 +0300)]
net/mlx5: qos: Drop 'esw' param from vport qos functions

The vport has a pointer to its own eswitch in vport->dev->priv.eswitch,
so passing the same eswitch as a parameter to the various functions
manipulating vport qos is superfluous at best and prone to errors at
worst.

More importantly, with the upcoming cross-esw scheduling changes, the
eswitch that should receive the various scheduling element commands is
NOT the same as the vport's eswitch, so the current code's assumptions
will break.

To avoid confusion and bugs, this commit drops the 'esw' parameter from
all vport qos functions and uses the vport's own eswitch pointer
instead.

Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
12 months agonet/mlx5: qos: Always create group0
Cosmin Ratiu [Tue, 8 Oct 2024 18:32:14 +0000 (21:32 +0300)]
net/mlx5: qos: Always create group0

All vports not explicitly members of a group with QoS enabled are part
of the internal esw group0, except when the hw reports that groups
aren't supported (log_esw_max_sched_depth == 0). This creates corner
cases in the code, which has to make sure that this case is supported.
Additionally, the groups are about to be moved out of eswitches, and
group0 being NULL creates additional complications there.

This patch makes sure to always create group0, even if max sched depth
is 0. In that case, a software-only group0 is created referencing the
root TSAR. Vports can point to this group when their QoS is enabled and
they'll be attached to the root TSAR directly. This eliminates corner
cases in the code by offering the guarantee that if qos is enabled,
vport->qos.group is non-NULL.

Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
12 months agonet/mlx5: qos: Maintain rate group vport members in a list
Cosmin Ratiu [Tue, 8 Oct 2024 18:32:13 +0000 (21:32 +0300)]
net/mlx5: qos: Maintain rate group vport members in a list

Previously, finding group members was done by iterating over all vports
of an eswitch and comparing their group with the required one, but that
approach will break down when a group can contain vports from multiple
eswitches.

Solve that by maintaining a list of vport members.
Instead of iterating over esw vports, loop over the members list.
Use this opportunity to provide two new functions to allocate and free a
group, so that the number of state transitions is smaller. This will
also be used in a future patch.

Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
12 months agonet/mlx5: qos: Refactor and document bw_share calculation
Cosmin Ratiu [Tue, 8 Oct 2024 18:32:12 +0000 (21:32 +0300)]
net/mlx5: qos: Refactor and document bw_share calculation

The previous function (esw_qos_calculate_group_min_rate_divider) had two
completely different modes of execution, depending on the 'group_level'
parameter. Split it into two separate functions:
- esw_qos_calculate_min_rate_divider - computes min across groups.
- esw_qos_calculate_group_min_rate_divider - computes min in a group.

Fold the divider calculation into the corresponding normalize functions
to avoid having the caller compute the corresponding divider.
Also rename the normalize functions to better indicate what level
they're operating on.
Finally, document everything so that this topic can more easily be
understood by future maintainers.

Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
12 months agonet/mlx5: qos: Consistently name vport vars as 'vport'
Cosmin Ratiu [Tue, 8 Oct 2024 18:32:11 +0000 (21:32 +0300)]
net/mlx5: qos: Consistently name vport vars as 'vport'

The current mixture of 'vport' and 'evport' can be improved.
There is no functional change.

Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
12 months agonet/mlx5: qos: Rename vport 'tsar' into 'sched_elem'.
Cosmin Ratiu [Tue, 8 Oct 2024 18:32:10 +0000 (21:32 +0300)]
net/mlx5: qos: Rename vport 'tsar' into 'sched_elem'.

Vports do not use TSARs (Transmit Scheduling ARbiters), which are used
for grouping multiple entities together. Use the correct name in
variables and functions for clarity.
Also move the scheduling context to a local variable in the
esw_qos_sched_elem_config function instead of an empty parameter that
needs to be provided by all callers.
There is no functional change here.

Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
12 months agonet/mlx5: qos: Flesh out element_attributes in mlx5_ifc.h
Cosmin Ratiu [Tue, 8 Oct 2024 18:32:09 +0000 (21:32 +0300)]
net/mlx5: qos: Flesh out element_attributes in mlx5_ifc.h

This is used for multiple purposes, depending on the scheduling element
created. There are a few helper struct defined a long time ago, but they
are not easy to find in the file and they are about to get new members.
This commit cleans up this area a bit by:
- moving the helper structs closer to where they are relevant.
- defining a helper union to include all of them to help
  discoverability.
- making use of it everywhere element_attributes is used.
- using a consistent 'attr' name.

Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
12 months agoMerge branch 'eth-fbnic-add-timestamping-support'
Paolo Abeni [Thu, 10 Oct 2024 10:52:29 +0000 (12:52 +0200)]
Merge branch 'eth-fbnic-add-timestamping-support'

Vadim Fedorenko says:

====================
eth: fbnic: add timestamping support

The series is to add timestamping support for Meta's NIC driver.

Changelog:
v3 -> v4:
- use adjust_by_scaled_ppm() instead of open coding it
- adjust cached value of high bits of timestamp to be sure it
  is older then incoming timestamps
v2 -> v3:
- rebase on top of net-next
- add doc to describe retur value of fbnic_ts40_to_ns()
v1 -> v2:
- adjust comment about using u64 stats locking primitive
- fix typo in the first patch
- Cc Richard
====================

Link: https://patch.msgid.link/20241008181436.4120604-1-vadfed@meta.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
12 months agoeth: fbnic: add ethtool timestamping statistics
Vadim Fedorenko [Tue, 8 Oct 2024 18:14:36 +0000 (11:14 -0700)]
eth: fbnic: add ethtool timestamping statistics

Add counters of packets with HW timestamps requests and lost timestamps
with no associated skbs. Use ethtool interface to report these counters.

Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
12 months agoeth: fbnic: add TX packets timestamping support
Vadim Fedorenko [Tue, 8 Oct 2024 18:14:35 +0000 (11:14 -0700)]
eth: fbnic: add TX packets timestamping support

Add TX configuration to ethtool interface. Add processing of TX
timestamp completions as well as configuration to request HW to create
TX timestamp completion.

Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
12 months agoeth: fbnic: add RX packets timestamping support
Vadim Fedorenko [Tue, 8 Oct 2024 18:14:34 +0000 (11:14 -0700)]
eth: fbnic: add RX packets timestamping support

Add callbacks to support timestamping configuration via ethtool.
Add processing of RX timestamps.

Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
12 months agoeth: fbnic: add initial PHC support
Vadim Fedorenko [Tue, 8 Oct 2024 18:14:33 +0000 (11:14 -0700)]
eth: fbnic: add initial PHC support

Create PHC device and provide callbacks needed for ptp_clock device.

Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
12 months agoeth: fbnic: add software TX timestamping support
Vadim Fedorenko [Tue, 8 Oct 2024 18:14:32 +0000 (11:14 -0700)]
eth: fbnic: add software TX timestamping support

Add software TX timestamping support. RX software timestamping is
implemented in the core and there is no need to provide special flag
in the driver anymore.

Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
12 months agonet: Remove likely from l3mdev_master_ifindex_by_index
Breno Leitao [Tue, 8 Oct 2024 16:32:04 +0000 (09:32 -0700)]
net: Remove likely from l3mdev_master_ifindex_by_index

The likely() annotation in l3mdev_master_ifindex_by_index() has been
found to be incorrect 100% of the time in real-world workloads (e.g.,
web servers).

Annotated branches shows the following in these servers:

correct incorrect  %        Function                  File              Line
      0 169053813 100 l3mdev_master_ifindex_by_index l3mdev.h             81

This is happening because l3mdev_master_ifindex_by_index() is called
from __inet_check_established(), which calls
l3mdev_master_ifindex_by_index() passing the socked bounded interface.

l3mdev_master_ifindex_by_index(net, sk->sk_bound_dev_if);

Since most sockets are not going to be bound to a network device,
the likely() is giving the wrong assumption.

Remove the likely() annotation to ensure more accurate branch
prediction.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241008163205.3939629-1-leitao@debian.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
12 months agonet: do not delay dst_entries_add() in dst_release()
Eric Dumazet [Tue, 8 Oct 2024 14:31:10 +0000 (14:31 +0000)]
net: do not delay dst_entries_add() in dst_release()

dst_entries_add() uses per-cpu data that might be freed at netns
dismantle from ip6_route_net_exit() calling dst_entries_destroy()

Before ip6_route_net_exit() can be called, we release all
the dsts associated with this netns, via calls to dst_release(),
which waits an rcu grace period before calling dst_destroy()

dst_entries_add() use in dst_destroy() is racy, because
dst_entries_destroy() could have been called already.

Decrementing the number of dsts must happen sooner.

Notes:

1) in CONFIG_XFRM case, dst_destroy() can call
   dst_release_immediate(child), this might also cause UAF
   if the child does not have DST_NOCOUNT set.
   IPSEC maintainers might take a look and see how to address this.

2) There is also discussion about removing this count of dst,
   which might happen in future kernels.

Fixes: f88649721268 ("ipv4: fix dst race in sk_dst_get()")
Closes: https://lore.kernel.org/lkml/CANn89iLCCGsP7SFn9HKpvnKu96Td4KD08xf7aGtiYgZnkjaL=w@mail.gmail.com/T/
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Tested-by: Linux Kernel Functional Testing <lkft@linaro.org>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Xin Long <lucien.xin@gmail.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Link: https://patch.msgid.link/20241008143110.1064899-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
12 months agoMerge branch 'ipv4-namespacify-ipv4-address-hash-table'
Jakub Kicinski [Thu, 10 Oct 2024 03:08:09 +0000 (20:08 -0700)]
Merge branch 'ipv4-namespacify-ipv4-address-hash-table'

Kuniyuki Iwashima says:

====================
ipv4: Namespacify IPv4 address hash table.

This is a prep of per-net RTNL conversion for RTM_(NEW|DEL|SET)ADDR.

Currently, each IPv4 address is linked to the global hash table, and
this needs to be protected by another global lock or namespacified to
support per-net RTNL.

Adding a global lock will cause deadlock in the rtnetlink path and GC,

  rtnetlink                      check_lifetime
  |- rtnl_net_lock(net)          |- acquire the global lock
  |- acquire the global lock     |- check ifa's netns
  `- put ifa into hash table     `- rtnl_net_lock(net)

so we need to namespacify the hash table.

The IPv6 one is already namespacified, let's follow that.

v2: https://lore.kernel.org/netdev/20241004195958.64396-1-kuniyu@amazon.com/
v1: https://lore.kernel.org/netdev/20241001024837.96425-1-kuniyu@amazon.com/
====================

Link: https://patch.msgid.link/20241008172906.1326-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 months agoipv4: Retire global IPv4 hash table inet_addr_lst.
Kuniyuki Iwashima [Tue, 8 Oct 2024 17:29:06 +0000 (10:29 -0700)]
ipv4: Retire global IPv4 hash table inet_addr_lst.

No one uses inet_addr_lst anymore, so let's remove it.

Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20241008172906.1326-5-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 months agoipv4: Namespacify IPv4 address GC.
Kuniyuki Iwashima [Tue, 8 Oct 2024 17:29:05 +0000 (10:29 -0700)]
ipv4: Namespacify IPv4 address GC.

Each IPv4 address could have a lifetime, which is useful for DHCP,
and GC is periodically executed as check_lifetime_work.

check_lifetime() does the actual GC under RTNL.

  1. Acquire RTNL
  2. Iterate inet_addr_lst
  3. Remove IPv4 address if expired
  4. Release RTNL

Namespacifying the GC is required for per-netns RTNL, but using the
per-netns hash table will shorten the time on the hash bucket iteration
under RTNL.

Let's add per-netns GC work and use the per-netns hash table.

Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20241008172906.1326-4-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>