]> www.infradead.org Git - users/hch/misc.git/log
users/hch/misc.git
2 months agoselftests/bpf: Move open_tuntap to network helpers
Marcus Wichelmann [Wed, 5 Mar 2025 21:34:35 +0000 (21:34 +0000)]
selftests/bpf: Move open_tuntap to network helpers

To test the XDP metadata functionality of the tun driver, it's necessary
to create a new tap device first. A helper function for this already
exists in lwt_helpers.h. Move it to the common network helpers header,
so it can be reused in other tests.

Signed-off-by: Marcus Wichelmann <marcus.wichelmann@hetzner-cloud.de>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20250305213438.3863922-4-marcus.wichelmann@hetzner-cloud.de
2 months agonet: tun: Enable transfer of XDP metadata to skb
Marcus Wichelmann [Wed, 5 Mar 2025 21:34:34 +0000 (21:34 +0000)]
net: tun: Enable transfer of XDP metadata to skb

When the XDP metadata area was used, it is expected that the same
metadata can also be accessed from TC, as can be read in the description
of the bpf_xdp_adjust_meta helper function. In the tun driver, this was
not yet implemented.

To make this work, the skb that is being built on XDP_PASS should know
of the current size of the metadata area. This is ensured by adding
calls to skb_metadata_set. For the tun_xdp_one code path, an additional
check is necessary to handle the case where the externally initialized
xdp_buff has no metadata support (xdp->data_meta == xdp->data + 1).

More information about this feature can be found in the commit message
of commit de8f3a83b0a0 ("bpf: add meta pointer for direct access").

Signed-off-by: Marcus Wichelmann <marcus.wichelmann@hetzner-cloud.de>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://patch.msgid.link/20250305213438.3863922-3-marcus.wichelmann@hetzner-cloud.de
2 months agonet: tun: Enable XDP metadata support
Marcus Wichelmann [Wed, 5 Mar 2025 21:34:33 +0000 (21:34 +0000)]
net: tun: Enable XDP metadata support

Enable the support for the bpf_xdp_adjust_meta helper function for XDP
buffers initialized by the tun driver. This allows to reserve a metadata
area that is useful to pass any information from one XDP program to
another one, for example when using tail-calls.

Whether this helper function can be used in an XDP program depends on
how the xdp_buff was initialized. Most net drivers initialize the
xdp_buff in a way, that allows bpf_xdp_adjust_meta to be used. In case
of the tun driver, this is currently not the case.

There are two code paths in the tun driver that lead to a
bpf_prog_run_xdp and where metadata support should be enabled:

1. tun_build_skb, which is called by tun_get_user and is used when
   writing packets from userspace into the device. In this case, the
   xdp_buff created in tun_build_skb has no support for
   bpf_xdp_adjust_meta and calls of that helper function result in
   ENOTSUPP.

   For this code path, it's sufficient to set the meta_valid argument of
   the xdp_prepare_buff call. The reserved headroom is large enough
   already.

2. tun_xdp_one, which is called by tun_sendmsg which again is called by
   other drivers (e.g. vhost_net). When the TUN_MSG_PTR mode is used,
   another driver may pass a batch of xdp_buffs to the tun driver. In
   this case, that other driver is the one initializing the xdp_buff.

   See commit 043d222f93ab ("tuntap: accept an array of XDP buffs
   through sendmsg()") for details.

   For now, the vhost_net driver is the only one using TUN_MSG_PTR and
   it already initializes the xdp_buffs with metadata support and
   sufficient headroom. But the tun driver disables it again, so the
   xdp_set_data_meta_invalid call has to be removed.

Signed-off-by: Marcus Wichelmann <marcus.wichelmann@hetzner-cloud.de>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://patch.msgid.link/20250305213438.3863922-2-marcus.wichelmann@hetzner-cloud.de
2 months agoMerge branch 'mctp-add-mctp-over-usb-hardware-transport-binding'
Jakub Kicinski [Sat, 22 Feb 2025 00:45:26 +0000 (16:45 -0800)]
Merge branch 'mctp-add-mctp-over-usb-hardware-transport-binding'

Jeremy Kerr says:

====================
mctp: Add MCTP-over-USB hardware transport binding

Add an implementation of the DMTF standard DSP0283, providing an MCTP
channel over high-speed USB.

This is a fairly trivial first implementation, in that we only submit
one tx and one rx URB at a time. We do accept multi-packet transfers,
but do not yet generate them on transmit.

Of course, questions and comments are most welcome, particularly on the
USB interfaces.

v2: https://lore.kernel.org/20250212-dev-mctp-usb-v2-0-76e67025d764@codeconstruct.com.au
v1: https://lore.kernel.org/20250206-dev-mctp-usb-v1-0-81453fe26a61@codeconstruct.com.au
====================

Link: https://patch.msgid.link/20250221-dev-mctp-usb-v3-0-3353030fe9cc@codeconstruct.com.au
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: mctp: Add MCTP USB transport driver
Jeremy Kerr [Fri, 21 Feb 2025 00:56:58 +0000 (08:56 +0800)]
net: mctp: Add MCTP USB transport driver

Add an implementation for DMTF DSP0283, which defines a MCTP-over-USB
transport. As per that spec, we're restricted to full speed mode,
requiring 512-byte transfers.

Each MCTP-over-USB interface is a peer-to-peer link to a single MCTP
endpoint, so no physical addressing is required (of course, that MCTP
endpoint may then bridge to further MCTP endpoints). Consequently,
interfaces will report with no lladdr data:

    # mctp link
    dev lo index 1 address 00:00:00:00:00:00 net 1 mtu 65536 up
    dev mctpusb0 index 6 address none net 1 mtu 68 up

This is a simple initial implementation, with single rx & tx urbs, and
no multi-packet tx transfers - although we do accept multi-packet rx
from the device.

Includes suggested fixes from Santosh Puranik <spuranik@nvidia.com>.

Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Cc: Santosh Puranik <spuranik@nvidia.com>
Link: https://patch.msgid.link/20250221-dev-mctp-usb-v3-2-3353030fe9cc@codeconstruct.com.au
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agousb: Add base USB MCTP definitions
Jeremy Kerr [Fri, 21 Feb 2025 00:56:57 +0000 (08:56 +0800)]
usb: Add base USB MCTP definitions

Upcoming changes will add a USB host (and later gadget) driver for the
MCTP-over-USB protocol. Add a header that provides common definitions
for protocol support: the packet header format and a few framing
definitions. Add a define for the MCTP class code, as per
https://usb.org/defined-class-codes.

Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://patch.msgid.link/20250221-dev-mctp-usb-v3-1-3353030fe9cc@codeconstruct.com.au
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: cadence: macb: Implement BQL
Sean Anderson [Thu, 20 Feb 2025 16:42:57 +0000 (11:42 -0500)]
net: cadence: macb: Implement BQL

Implement byte queue limits to allow queuing disciplines to account for
packets enqueued in the ring buffer but not yet transmitted. There are a
separate set of transmit functions for AT91 that I haven't touched since
I don't have hardware to test on.

Signed-off-by: Sean Anderson <sean.anderson@linux.dev>
Link: https://patch.msgid.link/20250220164257.96859-1-sean.anderson@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: stmmac: print stmmac_init_dma_engine() errors using netdev_err()
Russell King (Oracle) [Thu, 20 Feb 2025 12:47:49 +0000 (12:47 +0000)]
net: stmmac: print stmmac_init_dma_engine() errors using netdev_err()

stmmac_init_dma_engine() uses dev_err() which leads to errors being
reported as e.g:

dwc-eth-dwmac 2490000.ethernet: Failed to reset the dma
dwc-eth-dwmac 2490000.ethernet eth0: stmmac_hw_setup: DMA engine initialization failed

stmmac_init_dma_engine() is only called from stmmac_hw_setup() which
itself uses netdev_err(), and we will have a net_device setup. So,
change the dev_err() to netdev_err() to give consistent error
messages.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1tl5y1-004UgG-8X@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agoselftests: fib_nexthops: do not mark skipped tests as failed
Hangbin Liu [Thu, 20 Feb 2025 08:53:26 +0000 (08:53 +0000)]
selftests: fib_nexthops: do not mark skipped tests as failed

The current test marks all unexpected return values as failed and sets ret
to 1. If a test is skipped, the entire test also returns 1, incorrectly
indicating failure.

To fix this, add a skipped variable and set ret to 4 if it was previously
0. Otherwise, keep ret set to 1.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://patch.msgid.link/20250220085326.1512814-1-liuhangbin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agoMerge branch 'net-fib_rules-add-dscp-mask-support'
Jakub Kicinski [Sat, 22 Feb 2025 00:08:54 +0000 (16:08 -0800)]
Merge branch 'net-fib_rules-add-dscp-mask-support'

Ido Schimmel says:

====================
net: fib_rules: Add DSCP mask support

In some deployments users would like to encode path information into
certain bits of the IPv6 flow label, the UDP source port and the DSCP
field and use this information to route packets accordingly.

Redirecting traffic to a routing table based on specific bits in the
DSCP field is not currently possible. Only exact match is currently
supported by FIB rules.

This patchset extends FIB rules to match on the DSCP field with an
optional mask.

Patches #1-#5 gradually extend FIB rules to match on the DSCP field with
an optional mask.

Patch #6 adds test cases for the new functionality.

iproute2 support can be found here [1].

[1] https://github.com/idosch/iproute2/tree/submit/fib_rule_mask_v1
====================

Link: https://patch.msgid.link/20250220080525.831924-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agoselftests: fib_rule_tests: Add DSCP mask match tests
Ido Schimmel [Thu, 20 Feb 2025 08:05:25 +0000 (10:05 +0200)]
selftests: fib_rule_tests: Add DSCP mask match tests

Add tests for FIB rules that match on DSCP with a mask. Test both good
and bad flows and both the input and output paths.

 # ./fib_rule_tests.sh
 IPv6 FIB rule tests
 [...]
    TEST: rule6 check: dscp redirect to table                           [ OK ]
    TEST: rule6 check: dscp no redirect to table                        [ OK ]
    TEST: rule6 del by pref: dscp redirect to table                     [ OK ]
    TEST: rule6 check: iif dscp redirect to table                       [ OK ]
    TEST: rule6 check: iif dscp no redirect to table                    [ OK ]
    TEST: rule6 del by pref: iif dscp redirect to table                 [ OK ]
    TEST: rule6 check: dscp masked redirect to table                    [ OK ]
    TEST: rule6 check: dscp masked no redirect to table                 [ OK ]
    TEST: rule6 del by pref: dscp masked redirect to table              [ OK ]
    TEST: rule6 check: iif dscp masked redirect to table                [ OK ]
    TEST: rule6 check: iif dscp masked no redirect to table             [ OK ]
    TEST: rule6 del by pref: iif dscp masked redirect to table          [ OK ]
 [...]

 Tests passed: 316
 Tests failed:   0

Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/20250220080525.831924-7-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonetlink: specs: Add FIB rule DSCP mask attribute
Ido Schimmel [Thu, 20 Feb 2025 08:05:24 +0000 (10:05 +0200)]
netlink: specs: Add FIB rule DSCP mask attribute

Add new DSCP mask attribute to the spec. Example:

 # ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/rt_rule.yaml \
 --do newrule \
 --json '{"family": 2, "dscp": 10, "dscp-mask": 63, "action": 1, "table": 1}'
 None
 $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/rt_rule.yaml \
 --dump getrule --json '{"family": 2}' --output-json | jq '.[]'
 [...]
 {
   "table": 1,
   "suppress-prefixlen": "0xffffffff",
   "protocol": 0,
   "priority": 32765,
   "dscp": 10,
   "dscp-mask": "0x3f",
   "family": 2,
   "dst-len": 0,
   "src-len": 0,
   "tos": 0,
   "action": "to-tbl",
   "flags": 0
 }
 [...]

Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/20250220080525.831924-6-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: fib_rules: Enable DSCP mask usage
Ido Schimmel [Thu, 20 Feb 2025 08:05:23 +0000 (10:05 +0200)]
net: fib_rules: Enable DSCP mask usage

Allow user space to configure FIB rules that match on DSCP with a mask,
now that support has been added to the IPv4 and IPv6 address families.

Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/20250220080525.831924-5-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agoipv6: fib_rules: Add DSCP mask matching
Ido Schimmel [Thu, 20 Feb 2025 08:05:22 +0000 (10:05 +0200)]
ipv6: fib_rules: Add DSCP mask matching

Extend IPv6 FIB rules to match on DSCP using a mask. Unlike IPv4, also
initialize the DSCP mask when a non-zero 'tos' is specified as there is
no difference in matching between 'tos' and 'dscp'. As a side effect,
this makes it possible to match on 'dscp 0', like in IPv4.

Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/20250220080525.831924-4-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agoipv4: fib_rules: Add DSCP mask matching
Ido Schimmel [Thu, 20 Feb 2025 08:05:21 +0000 (10:05 +0200)]
ipv4: fib_rules: Add DSCP mask matching

Extend IPv4 FIB rules to match on DSCP using a mask. The mask is only
set in rules that match on DSCP (not TOS) and initialized to cover the
entire DSCP field if the mask attribute is not specified.

Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/20250220080525.831924-3-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: fib_rules: Add DSCP mask attribute
Ido Schimmel [Thu, 20 Feb 2025 08:05:20 +0000 (10:05 +0200)]
net: fib_rules: Add DSCP mask attribute

Add an attribute that allows matching on DSCP with a mask. Matching on
DSCP with a mask is needed in deployments where users encode path
information into certain bits of the DSCP field.

Temporarily set the type of the attribute to 'NLA_REJECT' while support
is being added.

Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/20250220080525.831924-2-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agoMerge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf...
Jakub Kicinski [Fri, 21 Feb 2025 23:59:47 +0000 (15:59 -0800)]
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Martin KaFai Lau says:

====================
pull-request: bpf-next 2025-02-20

We've added 19 non-merge commits during the last 8 day(s) which contain
a total of 35 files changed, 1126 insertions(+), 53 deletions(-).

The main changes are:

1) Add TCP_RTO_MAX_MS support to bpf_set/getsockopt, from Jason Xing

2) Add network TX timestamping support to BPF sock_ops, from Jason Xing

3) Add TX metadata Launch Time support, from Song Yoong Siang

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
  igc: Add launch time support to XDP ZC
  igc: Refactor empty frame insertion for launch time support
  net: stmmac: Add launch time support to XDP ZC
  selftests/bpf: Add launch time request to xdp_hw_metadata
  xsk: Add launch time hardware offload support to XDP Tx metadata
  selftests/bpf: Add simple bpf tests in the tx path for timestamping feature
  bpf: Support selective sampling for bpf timestamping
  bpf: Add BPF_SOCK_OPS_TSTAMP_SENDMSG_CB callback
  bpf: Add BPF_SOCK_OPS_TSTAMP_ACK_CB callback
  bpf: Add BPF_SOCK_OPS_TSTAMP_SND_HW_CB callback
  bpf: Add BPF_SOCK_OPS_TSTAMP_SND_SW_CB callback
  bpf: Add BPF_SOCK_OPS_TSTAMP_SCHED_CB callback
  net-timestamp: Prepare for isolating two modes of SO_TIMESTAMPING
  bpf: Disable unsafe helpers in TX timestamping callbacks
  bpf: Prevent unsafe access to the sock fields in the BPF timestamping callback
  bpf: Prepare the sock_ops ctx and call bpf prog for TX timestamping
  bpf: Add networking timestamping support to bpf_get/setsockopt()
  selftests/bpf: Add rto max for bpf_setsockopt test
  bpf: Support TCP_RTO_MAX_MS for bpf_setsockopt
====================

Link: https://patch.msgid.link/20250221022104.386462-1-martin.lau@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agogve: Add RSS cache for non RSS device option scenario
Ziwei Xiao [Wed, 19 Feb 2025 20:04:51 +0000 (12:04 -0800)]
gve: Add RSS cache for non RSS device option scenario

Not all the devices have the capability for the driver to query for the
registered RSS configuration. The driver can discover this by checking
the relevant device option during setup. If it cannot, the driver needs
to store the RSS config cache and directly return such cache when
queried by the ethtool. RSS config is inited when driver probes. Also the
default RSS config will be adjusted when there is RX queue count change.

At this point, only keys of GVE_RSS_KEY_SIZE and indirection tables of
GVE_RSS_INDIR_SIZE are supported.

Signed-off-by: Ziwei Xiao <ziweixiao@google.com>
Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Praveen Kaligineedi <pkaligineedi@google.com>
Signed-off-by: Jeroen de Borst <jeroendb@google.com>
Link: https://patch.msgid.link/20250219200451.3348166-1-jeroendb@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet/rds: Replace deprecated strncpy() with strscpy_pad()
Thorsten Blum [Wed, 19 Feb 2025 22:47:31 +0000 (23:47 +0100)]
net/rds: Replace deprecated strncpy() with strscpy_pad()

strncpy() is deprecated for NUL-terminated destination buffers. Use
strscpy_pad() instead and remove the manual NUL-termination.

Compile-tested only.

Link: https://github.com/KSPP/linux/issues/90
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Reviewed-by: Kees Cook <kees@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Tested-by: Allison Henderson <allison.henderson@oracle.com>
Link: https://patch.msgid.link/20250219224730.73093-2-thorsten.blum@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agoMerge branch 'net-improve-netns-handling-in-rtnetlink'
Jakub Kicinski [Fri, 21 Feb 2025 23:28:07 +0000 (15:28 -0800)]
Merge branch 'net-improve-netns-handling-in-rtnetlink'

Xiao Liang says:

====================
net: Improve netns handling in rtnetlink

This patch series includes some netns-related improvements and fixes for
rtnetlink, to make link creation more intuitive:

 1) Creating link in another net namespace doesn't conflict with link
    names in current one.
 2) Refector rtnetlink link creation. Create link in target namespace
    directly.

So that

  # ip link add netns ns1 link-netns ns2 tun0 type gre ...

will create tun0 in ns1, rather than create it in ns2 and move to ns1.
And don't conflict with another interface named "tun0" in current netns.

Patch 01 avoids link name conflict in different netns.

To achieve 2), there're mainly 3 steps:

 - Patch 02 packs newlink() parameters into a struct, including
   the original "src_net" along with more netns context. No semantic
   changes are introduced.
 - Patch 03 ~ 09 converts device drivers to use the explicit netns
   extracted from params.
 - Patch 10 ~ 11 removes the old netns parameter, and converts
   rtnetlink to create device in target netns directly.

Patch 12 ~ 13 adds some tests for link name and link netns.

---

Please note there're some issues found in current code:

- In amt_newlink() drivers/net/amt.c:

    amt->net = net;
    ...
    amt->stream_dev = dev_get_by_index(net, ...

  Uses net, but amt_lookup_upper_dev() only searches in dev_net.
  So the AMT device may not be properly deleted if it's in a different
  netns from lower dev.

- In lowpan_newlink() in net/ieee802154/6lowpan/core.c:

    wdev = dev_get_by_index(dev_net(ldev), nla_get_u32(tb[IFLA_LINK]));

  Looks for IFLA_LINK in dev_net, but in theory the ifindex is defined
  in link netns.

And thanks to Kuniyuki for fixing related issues in gtp and pfcp:
https://lore.kernel.org/netdev/20250110014754.33847-1-kuniyu@amazon.com/

v9: https://lore.kernel.org/20250210133002.883422-1-shaw.leon@gmail.com
v8: https://lore.kernel.org/20250113143719.7948-1-shaw.leon@gmail.com
v7: https://lore.kernel.org/20250104125732.17335-1-shaw.leon@gmail.com
v6: https://lore.kernel.org/20241218130909.2173-1-shaw.leon@gmail.com
v5: https://lore.kernel.org/20241209140151.231257-1-shaw.leon@gmail.com
v4: https://lore.kernel.org/20241118143244.1773-1-shaw.leon@gmail.com
v3: https://lore.kernel.org/20241113125715.150201-1-shaw.leon@gmail.com
v2: https://lore.kernel.org/20241107133004.7469-1-shaw.leon@gmail.com
v1: https://lore.kernel.org/20241023023146.372653-1-shaw.leon@gmail.com
====================

Link: https://patch.msgid.link/20250219125039.18024-1-shaw.leon@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agoselftests: net: Add test cases for link and peer netns
Xiao Liang [Wed, 19 Feb 2025 12:50:39 +0000 (20:50 +0800)]
selftests: net: Add test cases for link and peer netns

- Add test for creating link in another netns when a link of the same
   name and ifindex exists in current netns.
 - Add test to verify that link is created in target netns directly -
   no link new/del events should be generated in link netns or current
   netns.
 - Add test cases to verify that link-netns is set as expected for
   various drivers and combination of namespace-related parameters.

Signed-off-by: Xiao Liang <shaw.leon@gmail.com>
Link: https://patch.msgid.link/20250219125039.18024-14-shaw.leon@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agoselftests: net: Add python context manager for netns entering
Xiao Liang [Wed, 19 Feb 2025 12:50:38 +0000 (20:50 +0800)]
selftests: net: Add python context manager for netns entering

Change netns of current thread and switch back on context exit.
For example:

    with NetNSEnter("ns1"):
        ip("link add dummy0 type dummy")

The command be executed in netns "ns1".

Signed-off-by: Xiao Liang <shaw.leon@gmail.com>
Link: https://patch.msgid.link/20250219125039.18024-13-shaw.leon@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agortnetlink: Create link directly in target net namespace
Xiao Liang [Wed, 19 Feb 2025 12:50:37 +0000 (20:50 +0800)]
rtnetlink: Create link directly in target net namespace

Make rtnl_newlink_create() create device in target namespace directly.
Avoid extra netns change when link netns is provided.

Device drivers has been converted to be aware of link netns, that is not
assuming device netns is and link netns is the same when ops->newlink()
is called.

Signed-off-by: Xiao Liang <shaw.leon@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250219125039.18024-12-shaw.leon@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agortnetlink: Remove "net" from newlink params
Xiao Liang [Wed, 19 Feb 2025 12:50:36 +0000 (20:50 +0800)]
rtnetlink: Remove "net" from newlink params

Now that devices have been converted to use the specific netns instead
of ambiguous "net", let's remove it from newlink parameters.

Signed-off-by: Xiao Liang <shaw.leon@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250219125039.18024-11-shaw.leon@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: xfrm: Use link netns in newlink() of rtnl_link_ops
Xiao Liang [Wed, 19 Feb 2025 12:50:35 +0000 (20:50 +0800)]
net: xfrm: Use link netns in newlink() of rtnl_link_ops

When link_net is set, use it as link netns instead of dev_net(). This
prepares for rtnetlink core to create device in target netns directly,
in which case the two namespaces may be different.

Signed-off-by: Xiao Liang <shaw.leon@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250219125039.18024-10-shaw.leon@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: ipv6: Use link netns in newlink() of rtnl_link_ops
Xiao Liang [Wed, 19 Feb 2025 12:50:34 +0000 (20:50 +0800)]
net: ipv6: Use link netns in newlink() of rtnl_link_ops

When link_net is set, use it as link netns instead of dev_net(). This
prepares for rtnetlink core to create device in target netns directly,
in which case the two namespaces may be different.

Signed-off-by: Xiao Liang <shaw.leon@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250219125039.18024-9-shaw.leon@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: ipv6: Init tunnel link-netns before registering dev
Xiao Liang [Wed, 19 Feb 2025 12:50:33 +0000 (20:50 +0800)]
net: ipv6: Init tunnel link-netns before registering dev

Currently some IPv6 tunnel drivers set tnl->net to dev_net(dev) in
ndo_init(), which is called in register_netdevice(). However, it lacks
the context of link-netns when we enable cross-net tunnels at device
registration time.

Let's move the init of tunnel link-netns before register_netdevice().

ip6_gre has already initialized netns, so just remove the redundant
assignment.

Signed-off-by: Xiao Liang <shaw.leon@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250219125039.18024-8-shaw.leon@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: ip_tunnel: Use link netns in newlink() of rtnl_link_ops
Xiao Liang [Wed, 19 Feb 2025 12:50:32 +0000 (20:50 +0800)]
net: ip_tunnel: Use link netns in newlink() of rtnl_link_ops

When link_net is set, use it as link netns instead of dev_net(). This
prepares for rtnetlink core to create device in target netns directly,
in which case the two namespaces may be different.

Convert common ip_tunnel_newlink() to accept an extra link netns
argument.

Signed-off-by: Xiao Liang <shaw.leon@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250219125039.18024-7-shaw.leon@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: ip_tunnel: Don't set tunnel->net in ip_tunnel_init()
Xiao Liang [Wed, 19 Feb 2025 12:50:31 +0000 (20:50 +0800)]
net: ip_tunnel: Don't set tunnel->net in ip_tunnel_init()

ip_tunnel_init() is called from register_netdevice(). In all code paths
reaching here, tunnel->net should already have been set (either in
ip_tunnel_newlink() or __ip_tunnel_create()). So don't set it again.

Signed-off-by: Xiao Liang <shaw.leon@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250219125039.18024-6-shaw.leon@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agoieee802154: 6lowpan: Validate link netns in newlink() of rtnl_link_ops
Xiao Liang [Wed, 19 Feb 2025 12:50:30 +0000 (20:50 +0800)]
ieee802154: 6lowpan: Validate link netns in newlink() of rtnl_link_ops

Device denoted by IFLA_LINK is in link_net (IFLA_LINK_NETNSID) or
source netns by design, but 6lowpan uses dev_net.

Note dev->netns_local is set to true and currently link_net is
implemented via a netns change. These together effectively reject
IFLA_LINK_NETNSID.

This patch adds a validation to ensure link_net is either NULL or
identical to dev_net. Thus it would be fine to continue using dev_net
when rtnetlink core begins to create devices directly in target netns.

Signed-off-by: Xiao Liang <shaw.leon@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250219125039.18024-5-shaw.leon@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: Use link/peer netns in newlink() of rtnl_link_ops
Xiao Liang [Wed, 19 Feb 2025 12:50:29 +0000 (20:50 +0800)]
net: Use link/peer netns in newlink() of rtnl_link_ops

Add two helper functions - rtnl_newlink_link_net() and
rtnl_newlink_peer_net() for netns fallback logic. Peer netns falls back
to link netns, and link netns falls back to source netns.

Convert the use of params->net in netdevice drivers to one of the helper
functions for clarity.

Signed-off-by: Xiao Liang <shaw.leon@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250219125039.18024-4-shaw.leon@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agortnetlink: Pack newlink() params into struct
Xiao Liang [Wed, 19 Feb 2025 12:50:28 +0000 (20:50 +0800)]
rtnetlink: Pack newlink() params into struct

There are 4 net namespaces involved when creating links:

 - source netns - where the netlink socket resides,
 - target netns - where to put the device being created,
 - link netns - netns associated with the device (backend),
 - peer netns - netns of peer device.

Currently, two nets are passed to newlink() callback - "src_net"
parameter and "dev_net" (implicitly in net_device). They are set as
follows, depending on netlink attributes in the request.

 +------------+-------------------+---------+---------+
 | peer netns | IFLA_LINK_NETNSID | src_net | dev_net |
 +------------+-------------------+---------+---------+
 |            | absent            | source  | target  |
 | absent     +-------------------+---------+---------+
 |            | present           | link    | link    |
 +------------+-------------------+---------+---------+
 |            | absent            | peer    | target  |
 | present    +-------------------+---------+---------+
 |            | present           | peer    | link    |
 +------------+-------------------+---------+---------+

When IFLA_LINK_NETNSID is present, the device is created in link netns
first and then moved to target netns. This has some side effects,
including extra ifindex allocation, ifname validation and link events.
These could be avoided if we create it in target netns from
the beginning.

On the other hand, the meaning of src_net parameter is ambiguous. It
varies depending on how parameters are passed. It is the effective
link (or peer netns) by design, but some drivers ignore it and use
dev_net instead.

To provide more netns context for drivers, this patch packs existing
newlink() parameters, along with the source netns, link netns and peer
netns, into a struct. The old "src_net" is renamed to "net" to avoid
confusion with real source netns, and will be deprecated later. The use
of src_net are converted to params->net trivially.

Signed-off-by: Xiao Liang <shaw.leon@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250219125039.18024-3-shaw.leon@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agortnetlink: Lookup device in target netns when creating link
Xiao Liang [Wed, 19 Feb 2025 12:50:27 +0000 (20:50 +0800)]
rtnetlink: Lookup device in target netns when creating link

When creating link, lookup for existing device in target net namespace
instead of current one.
For example, two links created by:

  # ip link add dummy1 type dummy
  # ip link add netns ns1 dummy1 type dummy

should have no conflict since they are in different namespaces.

Signed-off-by: Xiao Liang <shaw.leon@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250219125039.18024-2-shaw.leon@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agoMerge branch 'dt-bindings-net-realtek-rtl9301-switch'
Jakub Kicinski [Fri, 21 Feb 2025 23:08:05 +0000 (15:08 -0800)]
Merge branch 'dt-bindings-net-realtek-rtl9301-switch'

Chris Packham says:

====================
dt-bindings: net: realtek,rtl9301-switch (schema part)

This is my attempt at trying to sort out the mess I've created with the RTL9300
switch dt-bindings. Some context is available on [1] and [2].

The first patch just moves the binding from mfd/ to net/ (with an adjustment of
the internal path name). The next two patches are successors to patches already
sent as part of the series [3].

[1] - https://lore.kernel.org/lkml/20250204-eccentric-deer-of-felicity-02b7ee@krzk-bin/
[2] - https://lore.kernel.org/lkml/4e3c5d83-d215-4eff-bf02-6d420592df8f@alliedtelesis.co.nz/
[3] - https://lore.kernel.org/lkml/20250204030249.1965444-1-chris.packham@alliedtelesis.co.nz/
====================

Link: https://patch.msgid.link/20250218195216.1034220-1-chris.packham@alliedtelesis.co.nz
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agodt-bindings: net: Add Realtek MDIO controller
Chris Packham [Tue, 18 Feb 2025 19:52:14 +0000 (08:52 +1300)]
dt-bindings: net: Add Realtek MDIO controller

Add dtschema for the MDIO controller found in the RTL9300 Ethernet
switch. The controller is slightly unusual in that direct MDIO
communication is not possible. We model the MDIO controller with the
MDIO buses as child nodes and the PHYs as children of the buses. The
mapping of switch port number to MDIO bus/addr requires the
ethernet-ports sibling to provide the mapping via the phy-handle
property.

Signed-off-by: Chris Packham <chris.packham@alliedtelesis.co.nz>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://patch.msgid.link/20250218195216.1034220-4-chris.packham@alliedtelesis.co.nz
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agodt-bindings: net: Add switch ports and interrupts to RTL9300
Chris Packham [Tue, 18 Feb 2025 19:52:13 +0000 (08:52 +1300)]
dt-bindings: net: Add switch ports and interrupts to RTL9300

Add bindings for the ethernet-switch and interrupt properties for the
RTL9300.

Signed-off-by: Chris Packham <chris.packham@alliedtelesis.co.nz>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://patch.msgid.link/20250218195216.1034220-3-chris.packham@alliedtelesis.co.nz
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agodt-bindings: net: Move realtek,rtl9301-switch to net
Chris Packham [Tue, 18 Feb 2025 19:52:12 +0000 (08:52 +1300)]
dt-bindings: net: Move realtek,rtl9301-switch to net

Initially realtek,rtl9301-switch was placed under mfd/ because it had
some non-switch related blocks (specifically i2c and reset) but with a
bit more review it has become apparent that this was wrong and the
binding should live under net/.

Signed-off-by: Chris Packham <chris.packham@alliedtelesis.co.nz>
Acked-by: Lee Jones <lee@kernel.org>
Acked-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://patch.msgid.link/20250218195216.1034220-2-chris.packham@alliedtelesis.co.nz
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: sfp: add quirk for 2.5G OEM BX SFP
Birger Koblitz [Tue, 18 Feb 2025 17:59:40 +0000 (18:59 +0100)]
net: sfp: add quirk for 2.5G OEM BX SFP

The OEM SFP-2.5G-BX10-D/U SFP module pair is meant to operate with
2500Base-X. However, in their EEPROM they incorrectly specify:
Transceiver codes   : 0x00 0x12 0x00 0x00 0x12 0x00 0x01 0x05 0x00
BR, Nominal         : 2500MBd

Use sfp_quirk_2500basex for this module to allow 2500Base-X mode anyway.
Tested on BananaPi R3.

Signed-off-by: Birger Koblitz <mail@birger-koblitz.de>
Reviewed-by: Daniel Golle <daniel@makrotopia.org>
Link: https://patch.msgid.link/20250218-b4-lkmsub-v1-1-1e51dcabed90@birger-koblitz.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: phy: remove unused feature array declarations
Heiner Kallweit [Wed, 19 Feb 2025 20:15:05 +0000 (21:15 +0100)]
net: phy: remove unused feature array declarations

After 12d5151be010 ("net: phy: remove leftovers from switch to linkmode
bitmaps") the following declarations are unused and can be removed too.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Reviewed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/b2883c75-4108-48f2-ab73-e81647262bc2@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agoMerge branch 'selftests-drv-net-improve-the-queue-test-for-xsk'
Jakub Kicinski [Fri, 21 Feb 2025 01:57:11 +0000 (17:57 -0800)]
Merge branch 'selftests-drv-net-improve-the-queue-test-for-xsk'

Jakub Kicinski says:

====================
selftests: drv-net: improve the queue test for XSK

We see some flakes in the the XSK test:

   Exception| Traceback (most recent call last):
   Exception|   File "/home/virtme/testing-18/tools/testing/selftests/net/lib/py/ksft.py", line 218, in ksft_run
   Exception|     case(*args)
   Exception|   File "/home/virtme/testing-18/tools/testing/selftests/drivers/net/./queues.py", line 53, in check_xdp
   Exception|     ksft_eq(q['xsk'], {})
   Exception| KeyError: 'xsk'

I think it's because the method of running the helper in the background
is racy. Add more solid infra for waiting for a background helper to be
initialized.

v1: https://lore.kernel.org/20250218195048.74692-1-kuba@kernel.org
====================

Link: https://patch.msgid.link/20250219234956.520599-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agoselftests: drv-net: rename queues check_xdp to check_xsk
Jakub Kicinski [Wed, 19 Feb 2025 23:49:56 +0000 (15:49 -0800)]
selftests: drv-net: rename queues check_xdp to check_xsk

The test is for AF_XDP, we refer to AF_XDP as XSK.

Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Tested-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20250219234956.520599-8-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agoselftests: drv-net: improve the use of ksft helpers in XSK queue test
Jakub Kicinski [Wed, 19 Feb 2025 23:49:55 +0000 (15:49 -0800)]
selftests: drv-net: improve the use of ksft helpers in XSK queue test

Avoid exceptions when xsk attr is not present, and add a proper ksft
helper for "not in" condition.

Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Tested-by: Kurt Kanzenbach <kurt@linutronix.de>
Tested-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20250219234956.520599-7-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agoselftests: drv-net: add a way to wait for a local process
Jakub Kicinski [Wed, 19 Feb 2025 23:49:54 +0000 (15:49 -0800)]
selftests: drv-net: add a way to wait for a local process

We use wait_port_listen() extensively to wait for a process
we spawned to be ready. Not all processes will open listening
sockets. Add a method of explicitly waiting for a child to
be ready. Pass a FD to the spawned process and wait for it
to write a message to us. FD number is passed via KSFT_READY_FD
env variable.

Similarly use KSFT_WAIT_FD to let the child process for a sign
that we are done and child should exit. Sending a signal to
a child with shell=True can get tricky.

Make use of this method in the queues test to make it less flaky.

Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Acked-by: Joe Damato <jdamato@fastly.com>
Tested-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20250219234956.520599-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agoselftests: drv-net: probe for AF_XDP sockets more explicitly
Jakub Kicinski [Wed, 19 Feb 2025 23:49:53 +0000 (15:49 -0800)]
selftests: drv-net: probe for AF_XDP sockets more explicitly

Separate the support check from socket binding for easier refactoring.
Use: ./helper - - just to probe if we can open the socket.

Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Tested-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20250219234956.520599-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agoselftests: drv-net: add missing new line in xdp_helper
Jakub Kicinski [Wed, 19 Feb 2025 23:49:52 +0000 (15:49 -0800)]
selftests: drv-net: add missing new line in xdp_helper

Kurt and Joe report missing new line at the end of Usage.

Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Tested-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20250219234956.520599-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agoselftests: drv-net: use cfg.rpath() in netlink xsk attr test
Jakub Kicinski [Wed, 19 Feb 2025 23:49:51 +0000 (15:49 -0800)]
selftests: drv-net: use cfg.rpath() in netlink xsk attr test

The cfg.rpath() helper was been recently added to make formatting
paths for helper binaries easier.

Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Tested-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20250219234956.520599-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agoselftests: drv-net: add a warning for bkg + shell + terminate
Jakub Kicinski [Wed, 19 Feb 2025 23:49:50 +0000 (15:49 -0800)]
selftests: drv-net: add a warning for bkg + shell + terminate

Joe Damato reports that some shells will fork before running
the command when python does "sh -c $cmd", while bash on my
machine does an exec of $cmd directly.

This will have implications for our ability to terminate
the child process on various configurations of bash and
other shells. Warn about using

bkg(... shell=True, termininate=True)

most background commands can hopefully exit cleanly (exit_wait).

Link: https://lore.kernel.org/Z7Yld21sv_Ip3gQx@LQ3V64L9R2
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Acked-by: Joe Damato <jdamato@fastly.com>
Tested-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20250219234956.520599-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agoocteontx2: hide unused label
Arnd Bergmann [Wed, 19 Feb 2025 16:21:14 +0000 (17:21 +0100)]
octeontx2: hide unused label

A previous patch introduces a build-time warning when CONFIG_DCB
is disabled:

drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c: In function 'otx2_probe':
drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c:3217:1: error: label 'err_free_zc_bmap' defined but not used [-Werror=unused-label]
drivers/net/ethernet/marvell/octeontx2/nic/otx2_vf.c: In function 'otx2vf_probe':
drivers/net/ethernet/marvell/octeontx2/nic/otx2_vf.c:740:1: error: label 'err_free_zc_bmap' defined but not used [-Werror=unused-label]

Add the same #ifdef check around it.

Fixes: efabce290151 ("octeontx2-pf: AF_XDP zero copy receive support")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Suman Ghosh <sumang@marvell.com>
Link: https://patch.msgid.link/20250219162239.1376865-1-arnd@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: phy: qt2025: Fix hardware revision check comment
Charalampos Mitrodimas [Wed, 19 Feb 2025 12:41:55 +0000 (12:41 +0000)]
net: phy: qt2025: Fix hardware revision check comment

Correct the hardware revision check comment in the QT2025 driver. The
revision value was documented as 0x3b instead of the correct 0xb3,
which matches the actual comparison logic in the code.

Reviewed-by: FUJITA Tomonori <fujita.tomonori@gmail.com>
Signed-off-by: Charalampos Mitrodimas <charmitro@posteo.net>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Trevor Gross <tmgross@umich.edu>
Link: https://patch.msgid.link/20250219-qt2025-comment-fix-v2-1-029f67696516@posteo.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agoMerge branch 'mlx5-misc-enhancements-2025-02-19'
Jakub Kicinski [Fri, 21 Feb 2025 01:36:11 +0000 (17:36 -0800)]
Merge branch 'mlx5-misc-enhancements-2025-02-19'

Tariq Toukan says:

====================
mlx5 misc enhancements 2025-02-19

This small series enhances the mlx5 ethtool link speed code (no
functional change), in addition to a Kconfig description enhancement.
====================

Link: https://patch.msgid.link/20250219114112.403808-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet/mlx5e: Separate extended link modes request from link modes type selection
Shahar Shitrit [Wed, 19 Feb 2025 11:41:12 +0000 (13:41 +0200)]
net/mlx5e: Separate extended link modes request from link modes type selection

The function ext_requested() serves two distinct purposes: it checks
if extended link modes were requested, and it selects whether to use
extended or legacy link modes.

This change separates these two purposes. Now, ext_link_mode_requested()
is used directly for checking if extended link modes are requested,
while the selection of extended modes is handled independently based on
the autonegotiation status.

By making this distinction, the logic for determining whether to select
extended or legacy link modes is clearer.

Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250219114112.403808-6-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet/mlx5e: Change eth_proto parameter naming
Shahar Shitrit [Wed, 19 Feb 2025 11:41:11 +0000 (13:41 +0200)]
net/mlx5e: Change eth_proto parameter naming

eth_proto_cap parameter represents the supported link modes,
while eth_proto_admin refers to the configured ones.

The function get_advertising() retrieves the configured link
modes, thus we update its parameter name to eth_proto_admin.

Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250219114112.403808-5-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet/mlx5e: Introduce ptys2ethtool_process_link()
Shahar Shitrit [Wed, 19 Feb 2025 11:41:10 +0000 (13:41 +0200)]
net/mlx5e: Introduce ptys2ethtool_process_link()

The functions ptys2ethtool_supported_link(), ptys2ethtool_adver_link()
share the same code, thus, in order to remove code duplication we
introduce a new function ptys2ethtool_process_link() to handle the
processing of both supported and advertised link modes.

Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250219114112.403808-4-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet/mlx5e: Refactor ptys2ethtool_adver_link()
Shahar Shitrit [Wed, 19 Feb 2025 11:41:09 +0000 (13:41 +0200)]
net/mlx5e: Refactor ptys2ethtool_adver_link()

The function ptys2ethtool_adver_link() contains duplicated code that
is found in mlx5e_ethtool_get_speed_arr().

To eliminate this redundancy, we update mlx5e_ethtool_get_speed_arr()
to select the appropriate table based on the ext argument passed by
the caller, rather than querying the supported mode locally.

This allows us to replace the current logic in ptys2ethtool_adver_link()
with a call to mlx5e_ethtool_get_speed_arr().

This adjustment aligns with the ptys2ethtool_supported_link() function
and prepares for an upcoming patch that reduces code duplication.

Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250219114112.403808-3-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet/mlx5: Bridge, correct config option description
Cosmin Ratiu [Wed, 19 Feb 2025 11:41:08 +0000 (13:41 +0200)]
net/mlx5: Bridge, correct config option description

The implication of the previous help text was that without this option
enabled, representor devices couldn't be added to a bridge device, while
in fact that was possible, just that rules didn't get offloaded to hw.

This commit clarifies the help text.

Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250219114112.403808-2-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agoneighbour: Replace kvzalloc() with kzalloc() when GFP_ATOMIC is specified
Kohei Enju [Wed, 19 Feb 2025 10:22:27 +0000 (19:22 +0900)]
neighbour: Replace kvzalloc() with kzalloc() when GFP_ATOMIC is specified

kzalloc() uses page allocator when size is larger than
KMALLOC_MAX_CACHE_SIZE, so the intention of commit ab101c553bc1
("neighbour: use kvzalloc()/kvfree()") can be achieved by using kzalloc().

When using GFP_ATOMIC, kvzalloc() only tries the kmalloc path,
since the vmalloc path does not support the flag.
In this case, kvzalloc() is equivalent to kzalloc() in that neither try
the vmalloc path, so this replacement brings no functional change.
This is primarily a cleanup change, as the original code functions
correctly.

This patch replaces kvzalloc() introduced by commit 41b3caa7c076
("neighbour: Add hlist_node to struct neighbour"), which is called in
the same context and with the same gfp flag as the aforementioned commit
ab101c553bc1 ("neighbour: use kvzalloc()/kvfree()").

Signed-off-by: Kohei Enju <enjuk@amazon.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Link: https://patch.msgid.link/20250219102227.72488-1-enjuk@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agoMerge branch 'some-pktgen-fixes-improvments-part-i'
Jakub Kicinski [Fri, 21 Feb 2025 01:24:58 +0000 (17:24 -0800)]
Merge branch 'some-pktgen-fixes-improvments-part-i'

Peter Seiderer says:

====================
Some pktgen fixes/improvments (part I)

While taking a look at '[PATCH net] pktgen: Avoid out-of-range in
get_imix_entries' ([1]) and '[PATCH net v2] pktgen: Avoid out-of-bounds
access in get_imix_entries' ([2], [3]) and doing some tests and code review
I detected that the /proc/net/pktgen/... parsing logic does not honour the
user given buffer bounds (resulting in out-of-bounds access).

This can be observed e.g. by the following simple test (sometimes the
old/'longer' previous value is re-read from the buffer):

        $ echo add_device lo@0 > /proc/net/pktgen/kpktgend_0

        $ echo "min_pkt_size 12345" > /proc/net/pktgen/lo\@0 && grep min_pkt_size /proc/net/pktgen/lo\@0
Params: count 1000  min_pkt_size: 12345  max_pkt_size: 0
Result: OK: min_pkt_size=12345

        $ echo -n "min_pkt_size 123" > /proc/net/pktgen/lo\@0 && grep min_pkt_size /proc/net/pktgen/lo\@0
Params: count 1000  min_pkt_size: 12345  max_pkt_size: 0
Result: OK: min_pkt_size=12345

        $ echo "min_pkt_size 123" > /proc/net/pktgen/lo\@0 && grep min_pkt_size /proc/net/pktgen/lo\@0
Params: count 1000  min_pkt_size: 123  max_pkt_size: 0
Result: OK: min_pkt_size=123

So fix the out-of-bounds access (and some minor findings) and add a simple
proc_net_pktgen selftest...

[1] https://lore.kernel.org/netdev/20241006221221.3744995-1-artem.chernyshev@red-soft.ru/
[2] https://lore.kernel.org/netdev/20250109083039.14004-1-pchelkin@ispras.ru/
[3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=76201b5979768500bca362871db66d77cb4c225e
====================

Link: https://patch.msgid.link/20250219084527.20488-1-ps.report@gmx.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: pktgen: fix access outside of user given buffer in pktgen_thread_write()
Peter Seiderer [Wed, 19 Feb 2025 08:45:27 +0000 (09:45 +0100)]
net: pktgen: fix access outside of user given buffer in pktgen_thread_write()

Honour the user given buffer size for the strn_len() calls (otherwise
strn_len() will access memory outside of the user given buffer).

Signed-off-by: Peter Seiderer <ps.report@gmx.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250219084527.20488-8-ps.report@gmx.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: pktgen: fix ctrl interface command parsing
Peter Seiderer [Wed, 19 Feb 2025 08:45:26 +0000 (09:45 +0100)]
net: pktgen: fix ctrl interface command parsing

Enable command writing without trailing '\n':

- the good case

$ echo "reset" > /proc/net/pktgen/pgctrl

- the bad case (before the patch)

$ echo -n "reset" > /proc/net/pktgen/pgctrl
-bash: echo: write error: Invalid argument

- with patch applied

$ echo -n "reset" > /proc/net/pktgen/pgctrl

Signed-off-by: Peter Seiderer <ps.report@gmx.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250219084527.20488-7-ps.report@gmx.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: pktgen: fix 'ratep 0' error handling (return -EINVAL)
Peter Seiderer [Wed, 19 Feb 2025 08:45:25 +0000 (09:45 +0100)]
net: pktgen: fix 'ratep 0' error handling (return -EINVAL)

Given an invalid 'ratep' command e.g. 'ratep 0' the return value is '1',
leading to the following misleading output:

- the good case

$ echo "ratep 100" > /proc/net/pktgen/lo\@0
$ grep "Result:" /proc/net/pktgen/lo\@0
Result: OK: ratep=100

- the bad case (before the patch)

$ echo "ratep 0" > /proc/net/pktgen/lo\@0"
-bash: echo: write error: Invalid argument
$ grep "Result:" /proc/net/pktgen/lo\@0
Result: No such parameter "atep"

- with patch applied

$ echo "ratep 0" > /proc/net/pktgen/lo\@0
-bash: echo: write error: Invalid argument
$ grep "Result:" /proc/net/pktgen/lo\@0
Result: Idle

Signed-off-by: Peter Seiderer <ps.report@gmx.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250219084527.20488-6-ps.report@gmx.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: pktgen: fix 'rate 0' error handling (return -EINVAL)
Peter Seiderer [Wed, 19 Feb 2025 08:45:24 +0000 (09:45 +0100)]
net: pktgen: fix 'rate 0' error handling (return -EINVAL)

Given an invalid 'rate' command e.g. 'rate 0' the return value is '1',
leading to the following misleading output:

- the good case

$ echo "rate 100" > /proc/net/pktgen/lo\@0
$ grep "Result:" /proc/net/pktgen/lo\@0
Result: OK: rate=100

- the bad case (before the patch)

$ echo "rate 0" > /proc/net/pktgen/lo\@0"
-bash: echo: write error: Invalid argument
$ grep "Result:" /proc/net/pktgen/lo\@0
Result: No such parameter "ate"

- with patch applied

$ echo "rate 0" > /proc/net/pktgen/lo\@0
-bash: echo: write error: Invalid argument
$ grep "Result:" /proc/net/pktgen/lo\@0
Result: Idle

Signed-off-by: Peter Seiderer <ps.report@gmx.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250219084527.20488-5-ps.report@gmx.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: pktgen: fix hex32_arg parsing for short reads
Peter Seiderer [Wed, 19 Feb 2025 08:45:23 +0000 (09:45 +0100)]
net: pktgen: fix hex32_arg parsing for short reads

Fix hex32_arg parsing for short reads (here 7 hex digits instead of the
expected 8), shift result only on successful input parsing.

- before the patch

$ echo "mpls 0000123" > /proc/net/pktgen/lo\@0
$ grep mpls /proc/net/pktgen/lo\@0
     mpls: 00001230
Result: OK: mpls=00001230

- with patch applied

$ echo "mpls 0000123" > /proc/net/pktgen/lo\@0
$ grep mpls /proc/net/pktgen/lo\@0
     mpls: 00000123
Result: OK: mpls=00000123

Signed-off-by: Peter Seiderer <ps.report@gmx.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250219084527.20488-4-ps.report@gmx.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: pktgen: enable 'param=value' parsing
Peter Seiderer [Wed, 19 Feb 2025 08:45:22 +0000 (09:45 +0100)]
net: pktgen: enable 'param=value' parsing

Enable more flexible parameters syntax, allowing 'param=value' in
addition to the already supported 'param value' pattern (additional
this gives the skipping '=' in count_trail_chars() a purpose).

Tested with:

$ echo "min_pkt_size 999" > /proc/net/pktgen/lo\@0
$ echo "min_pkt_size=999" > /proc/net/pktgen/lo\@0
$ echo "min_pkt_size =999" > /proc/net/pktgen/lo\@0
$ echo "min_pkt_size= 999" > /proc/net/pktgen/lo\@0
$ echo "min_pkt_size = 999" > /proc/net/pktgen/lo\@0

Signed-off-by: Peter Seiderer <ps.report@gmx.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250219084527.20488-3-ps.report@gmx.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: pktgen: replace ENOTSUPP with EOPNOTSUPP
Peter Seiderer [Wed, 19 Feb 2025 08:45:21 +0000 (09:45 +0100)]
net: pktgen: replace ENOTSUPP with EOPNOTSUPP

Replace ENOTSUPP with EOPNOTSUPP, fixes checkpatch hint

  WARNING: ENOTSUPP is not a SUSV4 error code, prefer EOPNOTSUPP

and e.g.

  $ echo "clone_skb 1" > /proc/net/pktgen/lo\@0
  -bash: echo: write error: Unknown error 524

Signed-off-by: Peter Seiderer <ps.report@gmx.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250219084527.20488-2-ps.report@gmx.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agoaf_unix: Fix undefined 'other' error
Purva Yeshi [Tue, 18 Feb 2025 14:10:45 +0000 (19:40 +0530)]
af_unix: Fix undefined 'other' error

Fix an issue with the sparse static analysis tool where an
"undefined 'other'" error occurs due to `__releases(&unix_sk(other)->lock)`
being placed before 'other' is in scope.

Remove the `__releases()` annotation from the `unix_wait_for_peer()`
function to eliminate the sparse error. The annotation references `other`
before it is declared, leading to a false positive error during static
analysis.

Since AF_UNIX does not use sparse annotations, this annotation is
unnecessary and does not impact functionality.

Reviewed-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Purva Yeshi <purvayeshi550@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250218141045.38947-1-purvayeshi550@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agoMerge branch 'xsk-tx-metadata-launch-time-support'
Martin KaFai Lau [Thu, 20 Feb 2025 23:13:46 +0000 (15:13 -0800)]
Merge branch 'xsk-tx-metadata-launch-time-support'

Song Yoong Siang says:

====================
xsk: TX metadata Launch Time support

This series expands the XDP TX metadata framework to allow user
applications to pass per packet 64-bit launch time directly to the kernel
driver, requesting launch time hardware offload support. The XDP TX
metadata framework will not perform any clock conversion or packet
reordering.

Please note that the role of Tx metadata is just to pass the launch time,
not to enable the offload feature. Users will need to enable the launch
time hardware offload feature of the device by using the respective
command, such as the tc-etf command.

Although some devices use the tc-etf command to enable their launch time
hardware offload feature, xsk packets will not go through the etf qdisc.
Therefore, in my opinion, the launch time should always be based on the PTP
Hardware Clock (PHC). Thus, i did not include a clock ID to indicate the
clock source.

To simplify the test steps, I modified the xdp_hw_metadata bpf self-test
tool in such a way that it will set the launch time based on the offset
provided by the user and the value of the Receive Hardware Timestamp, which
is against the PHC. This will eliminate the need to discipline System Clock
with the PHC and then use clock_gettime() to get the time.

Please note that AF_XDP lacks a feedback mechanism to inform the
application if the requested launch time is invalid. So, users are expected
to familiar with the horizon of the launch time of the device they use and
not request a launch time that is beyond the horizon. Otherwise, the driver
might interpret the launch time incorrectly and react wrongly. For stmmac
and igc, where modulo computation is used, a launch time larger than the
horizon will cause the device to transmit the packet earlier that the
requested launch time.

Although there is no feedback mechanism for the launch time request
for now, user still can check whether the requested launch time is
working or not, by requesting the Transmit Completion Hardware Timestamp.

v12:
  - Fix the comment in include/uapi/linux/if_xdp.h to allign with what is
    generated by ./tools/net/ynl/ynl-regen.sh to avoid dirty tree error in
    the netdev/ynl checks.

v11: https://lore.kernel.org/netdev/20250216074302.956937-1-yoong.siang.song@intel.com/
  - regenerate netdev_xsk_flags based on latest netdev.yaml (Jakub)

v10: https://lore.kernel.org/netdev/20250207021943.814768-1-yoong.siang.song@intel.com/
  - use net_err_ratelimited(), instead of net_ratelimit() (Maciej)
  - accumulate the amount of used descs in local variable and update the
    igc_metadata_request::used_desc once (Maciej)
  - Ensure reverse christmas tree rule (Maciej)

V9: https://lore.kernel.org/netdev/20250206060408.808325-1-yoong.siang.song@intel.com/
  - Remove the igc_desc_unused() checking (Maciej)
  - Ensure that skb allocation and DMA mapping work before proceeding to
    fill in igc_tx_buffer info, context desc, and data desc (Maciej)
  - Rate limit the error messages (Maciej)
  - Update the comment to indicate that the 2 descriptors needed by the
    empty frame are already taken into consideration (Maciej)
  - Handle the case where the insertion of an empty frame fails and
    explain the reason behind (Maciej)
  - put self SOB tag as last tag (Maciej)

V8: https://lore.kernel.org/netdev/20250205024116.798862-1-yoong.siang.song@intel.com/
  - check the number of used descriptor in xsk_tx_metadata_request()
    by using used_desc of struct igc_metadata_request, and then decreases
    the budget with it (Maciej)
  - submit another bug fix patch to set the buffer type for empty frame (Maciej):
    https://lore.kernel.org/netdev/20250205023603.798819-1-yoong.siang.song@intel.com/

V7: https://lore.kernel.org/netdev/20250204004907.789330-1-yoong.siang.song@intel.com/
  - split the refactoring code of igc empty packet insertion into a separate
    commit (Faizal)
  - add explanation on why the value "4" is used as igc transmit budget
    (Faizal)
  - perform a stress test by sending 1000 packets with 10ms interval and
    launch time set to 500us in the future (Faizal & Yong Liang)

V6: https://lore.kernel.org/netdev/20250116155350.555374-1-yoong.siang.song@intel.com/
  - fix selftest build errors by using asprintf() and realloc(), instead of
    managing the buffer sizes manually (Daniel, Stanislav)

V5: https://lore.kernel.org/netdev/20250114152718.120588-1-yoong.siang.song@intel.com/
  - change netdev feature name from tx-launch-time to tx-launch-time-fifo
    to explicitly state the FIFO behaviour (Stanislav)
  - improve the looping of xdp_hw_metadata app to wait for packet tx
    completion to be more readable by using clock_gettime() (Stanislav)
  - add launch time setup steps into xdp_hw_metadata app (Stanislav)

V4: https://lore.kernel.org/netdev/20250106135506.9687-1-yoong.siang.song@intel.com/
  - added XDP launch time support to the igc driver (Jesper & Florian)
  - added per-driver launch time limitation on xsk-tx-metadata.rst (Jesper)
  - added explanation on FIFO behavior on xsk-tx-metadata.rst (Jakub)
  - added step to enable launch time in the commit message (Jesper & Willem)
  - explicitly documented the type of launch_time and which clock source
    it is against (Willem)

V3: https://lore.kernel.org/netdev/20231203165129.1740512-1-yoong.siang.song@intel.com/
  - renamed to use launch time (Jesper & Willem)
  - changed the default launch time in xdp_hw_metadata apps from 1s to 0.1s
    because some NICs do not support such a large future time.

V2: https://lore.kernel.org/netdev/20231201062421.1074768-1-yoong.siang.song@intel.com/
  - renamed to use Earliest TxTime First (Willem)
  - renamed to use txtime (Willem)

V1: https://lore.kernel.org/netdev/20231130162028.852006-1-yoong.siang.song@intel.com/
====================

Link: https://patch.msgid.link/20250216093430.957880-1-yoong.siang.song@intel.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2 months agoigc: Add launch time support to XDP ZC
Song Yoong Siang [Sun, 16 Feb 2025 09:34:30 +0000 (17:34 +0800)]
igc: Add launch time support to XDP ZC

Enable Launch Time Control (LTC) support for XDP zero copy via XDP Tx
metadata framework.

This patch has been tested with tools/testing/selftests/bpf/xdp_hw_metadata
on Intel I225-LM Ethernet controller. Below are the test steps and result.

Test 1: Send a single packet with the launch time set to 1 s in the future.

Test steps:
1. On the DUT, start the xdp_hw_metadata selftest application:
   $ sudo ./xdp_hw_metadata enp2s0 -l 1000000000 -L 1

2. On the Link Partner, send a UDP packet with VLAN priority 1 to port 9091
   of the DUT.

Result:
When the launch time is set to 1 s in the future, the delta between the
launch time and the transmit hardware timestamp is 0.016 us, as shown in
printout of the xdp_hw_metadata application below.
  0x562ff5dc8880: rx_desc[4]->addr=84110 addr=84110 comp_addr=84110 EoP
  rx_hash: 0xE343384 with RSS type:0x1
  HW RX-time:   1734578015467548904 (sec:1734578015.4675)
                delta to User RX-time sec:0.0002 (183.103 usec)
  XDP RX-time:   1734578015467651698 (sec:1734578015.4677)
                 delta to User RX-time sec:0.0001 (80.309 usec)
  No rx_vlan_tci or rx_vlan_proto, err=-95
  0x562ff5dc8880: ping-pong with csum=561c (want c7dd)
                  csum_start=34 csum_offset=6
  HW RX-time:   1734578015467548904 (sec:1734578015.4675)
                delta to HW Launch-time sec:1.0000 (1000000.000 usec)
  0x562ff5dc8880: complete tx idx=4 addr=4018
  HW Launch-time:   1734578016467548904 (sec:1734578016.4675)
                    delta to HW TX-complete-time sec:0.0000 (0.016 usec)
  HW TX-complete-time:   1734578016467548920 (sec:1734578016.4675)
                         delta to User TX-complete-time sec:0.0000
                         (32.546 usec)
  XDP RX-time:   1734578015467651698 (sec:1734578015.4677)
                 delta to User TX-complete-time sec:0.9999
                 (999929.768 usec)
  HW RX-time:   1734578015467548904 (sec:1734578015.4675)
                delta to HW TX-complete-time sec:1.0000 (1000000.016 usec)
  0x562ff5dc8880: complete rx idx=132 addr=84110

Test 2: Send 1000 packets with a 10 ms interval and the launch time set to
        500 us in the future.

Test steps:
1. On the DUT, start the xdp_hw_metadata selftest application:
   $ sudo chrt -f 99 ./xdp_hw_metadata enp2s0 -l 500000 -L 1 > \
     /dev/shm/result.log

2. On the Link Partner, send 1000 UDP packets with a 10 ms interval and
   VLAN priority 1 to port 9091 of the DUT.

Result:
When the launch time is set to 500 us in the future, the average delta
between the launch time and the transmit hardware timestamp is 0.016 us,
as shown in the analysis of /dev/shm/result.log below. The XDP launch time
works correctly in sending 1000 packets continuously.
  Min delta: 0.005 us
  Avr delta: 0.016 us
  Max delta: 0.031 us
  Total packets forwarded: 1000

Signed-off-by: Song Yoong Siang <yoong.siang.song@intel.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Faizal Rahim <faizal.abdul.rahim@linux.intel.com>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://patch.msgid.link/20250216093430.957880-6-yoong.siang.song@intel.com
2 months agoigc: Refactor empty frame insertion for launch time support
Song Yoong Siang [Sun, 16 Feb 2025 09:34:29 +0000 (17:34 +0800)]
igc: Refactor empty frame insertion for launch time support

Refactor the code for inserting an empty frame into a new function
igc_insert_empty_frame(). This change extracts the logic for inserting
an empty packet from igc_xmit_frame_ring() into a separate function,
allowing it to be reused in future implementations, such as the XDP
zero copy transmit function.

Remove the igc_desc_unused() checking in igc_init_tx_empty_descriptor()
because the number of descriptors needed is guaranteed.

Ensure that skb allocation and DMA mapping work for the empty frame,
before proceeding to fill in igc_tx_buffer info, context descriptor,
and data descriptor.

Rate limit the error messages for skb allocation and DMA mapping failures.

Update the comment to indicate that the 2 descriptors needed by the empty
frame are already taken into consideration in igc_xmit_frame_ring().

Handle the case where the insertion of an empty frame fails and explain
the reason behind this handling.

Signed-off-by: Song Yoong Siang <yoong.siang.song@intel.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Faizal Rahim <faizal.abdul.rahim@linux.intel.com>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://patch.msgid.link/20250216093430.957880-5-yoong.siang.song@intel.com
2 months agonet: stmmac: Add launch time support to XDP ZC
Song Yoong Siang [Sun, 16 Feb 2025 09:34:28 +0000 (17:34 +0800)]
net: stmmac: Add launch time support to XDP ZC

Enable launch time (Time-Based Scheduling) support for XDP zero copy via
the XDP Tx metadata framework.

This patch has been tested with tools/testing/selftests/bpf/xdp_hw_metadata
on Intel Tiger Lake platform. Below are the test steps and result.

Test 1: Send a single packet with the launch time set to 1 s in the future.

Test steps:
1. On the DUT, start the xdp_hw_metadata selftest application:
   $ sudo ./xdp_hw_metadata enp0s30f4 -l 1000000000 -L 1

2. On the Link Partner, send a UDP packet with VLAN priority 1 to port 9091
   of the DUT.

Result:
When the launch time is set to 1 s in the future, the delta between the
launch time and the transmit hardware timestamp is 16.963 us, as shown in
printout of the xdp_hw_metadata application below.
  0x55b5864717a8: rx_desc[4]->addr=88100 addr=88100 comp_addr=88100 EoP
  No rx_hash, err=-95
  HW RX-time:   1734579065767717328 (sec:1734579065.7677)
                delta to User RX-time sec:0.0004 (375.624 usec)
  XDP RX-time:   1734579065768004454 (sec:1734579065.7680)
                 delta to User RX-time sec:0.0001 (88.498 usec)
  No rx_vlan_tci or rx_vlan_proto, err=-95
  0x55b5864717a8: ping-pong with csum=5619 (want 0000)
                  csum_start=34 csum_offset=6
  HW RX-time:   1734579065767717328 (sec:1734579065.7677)
                delta to HW Launch-time sec:1.0000 (1000000.000 usec)
  0x55b5864717a8: complete tx idx=4 addr=4018
  HW Launch-time:   1734579066767717328 (sec:1734579066.7677)
                    delta to HW TX-complete-time sec:0.0000 (16.963 usec)
  HW TX-complete-time:   1734579066767734291 (sec:1734579066.7677)
                         delta to User TX-complete-time sec:0.0001
                         (130.408 usec)
  XDP RX-time:   1734579065768004454 (sec:1734579065.7680)
                 delta to User TX-complete-time sec:0.9999
                (999860.245 usec)
  HW RX-time:   1734579065767717328 (sec:1734579065.7677)
                delta to HW TX-complete-time sec:1.0000 (1000016.963 usec)
  0x55b5864717a8: complete rx idx=132 addr=88100

Test 2: Send 1000 packets with a 10 ms interval and the launch time set to
        500 us in the future.

Test steps:
1. On the DUT, start the xdp_hw_metadata selftest application:
   $ sudo chrt -f 99 ./xdp_hw_metadata enp0s30f4 -l 500000 -L 1 > \
     /dev/shm/result.log

2. On the Link Partner, send 1000 UDP packets with a 10 ms interval and
   VLAN priority 1 to port 9091 of the DUT.

Result:
When the launch time is set to 500 us in the future, the average delta
between the launch time and the transmit hardware timestamp is 13.854 us,
as shown in the analysis of /dev/shm/result.log below. The XDP launch time
works correctly in sending 1000 packets continuously.
  Min delta: 08.410 us
  Avr delta: 13.854 us
  Max delta: 17.076 us
  Total packets forwarded: 1000

Signed-off-by: Song Yoong Siang <yoong.siang.song@intel.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Choong Yong Liang <yong.liang.choong@linux.intel.com>
Link: https://patch.msgid.link/20250216093430.957880-4-yoong.siang.song@intel.com
2 months agoselftests/bpf: Add launch time request to xdp_hw_metadata
Song Yoong Siang [Sun, 16 Feb 2025 09:34:27 +0000 (17:34 +0800)]
selftests/bpf: Add launch time request to xdp_hw_metadata

Add launch time hardware offload request to xdp_hw_metadata. Users can
configure the delta of launch time relative to HW RX-time using the "-l"
argument. By default, the delta is set to 0 ns, which means the launch time
is disabled. By setting the delta to a non-zero value, the launch time
hardware offload feature will be enabled and requested. Additionally, users
can configure the Tx Queue to be enabled with the launch time hardware
offload using the "-L" argument. By default, Tx Queue 0 will be used.

Signed-off-by: Song Yoong Siang <yoong.siang.song@intel.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250216093430.957880-3-yoong.siang.song@intel.com
2 months agoxsk: Add launch time hardware offload support to XDP Tx metadata
Song Yoong Siang [Sun, 16 Feb 2025 09:34:26 +0000 (17:34 +0800)]
xsk: Add launch time hardware offload support to XDP Tx metadata

Extend the XDP Tx metadata framework so that user can requests launch time
hardware offload, where the Ethernet device will schedule the packet for
transmission at a pre-determined time called launch time. The value of
launch time is communicated from user space to Ethernet driver via
launch_time field of struct xsk_tx_metadata.

Suggested-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Song Yoong Siang <yoong.siang.song@intel.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20250216093430.957880-2-yoong.siang.song@intel.com
2 months agoeth: fbnic: Add ethtool support for IRQ coalescing
Mohsin Bashir [Tue, 18 Feb 2025 02:35:20 +0000 (18:35 -0800)]
eth: fbnic: Add ethtool support for IRQ coalescing

Add ethtool support to configure the IRQ coalescing behavior. Support
separate timers for Rx and Tx for time based coalescing. For frame based
configuration, currently we only support the Rx side.

The hardware allows configuration of descriptor count instead of frame
count requiring conversion between the two. We assume 2 descriptors
per frame, one for the metadata and one for the data segment.

When rx-frames are not configured, we set the RX descriptor count to
half the ring size as a fail safe.

Default configuration:
ethtool -c eth0 | grep -E "rx-usecs:|tx-usecs:|rx-frames:"
rx-usecs:       30
rx-frames:      0
tx-usecs:       35

IRQ rate test:
With single iperf flow we monitor IRQ rate while changing the tx-usesc and
rx-usecs to high and low values.

ethtool -C eth0 rx-frames 8192 rx-usecs 150 tx-usecs 150
irq/sec   13k
irq/sec   14k
irq/sec   14k

ethtool -C eth0 rx-frames 8192 rx-usecs 10 tx-usecs 10
irq/sec  27k
irq/sec  28k
irq/sec  28k

Validating the use of extack:
ethtool -C eth0 rx-frames 16384
netlink error: fbnic: rx_frames is above device max
netlink error: Invalid argument

Signed-off-by: Mohsin Bashir <mohsin.bashr@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Brett Creeley <brett.creeley@amd.com>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Link: https://patch.msgid.link/20250218023520.2038010-1-mohsin.bashr@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agoMerge branch 'support-ptp-clock-for-wangxun-nics'
Jakub Kicinski [Thu, 20 Feb 2025 22:59:39 +0000 (14:59 -0800)]
Merge branch 'support-ptp-clock-for-wangxun-nics'

Jiawen Wu says:

====================
Support PTP clock for Wangxun NICs

Implement support for PTP clock on Wangxun NICs.

v7: https://lore.kernel.org/20250213083041.78917-1-jiawenwu@trustnetic.com
v6: https://lore.kernel.org/20250208031348.4368-1-jiawenwu@trustnetic.com
v5: https://lore.kernel.org/20250117062051.2257073-1-jiawenwu@trustnetic.com
v4: https://lore.kernel.org/20250114084425.2203428-1-jiawenwu@trustnetic.com
v3: https://lore.kernel.org/20250110031716.2120642-1-jiawenwu@trustnetic.com
v2: https://lore.kernel.org/20250106084506.2042912-1-jiawenwu@trustnetic.com
v1: https://lore.kernel.org/20250102103026.1982137-1-jiawenwu@trustnetic.com
====================

Link: https://patch.msgid.link/20250218023432.146536-1-jiawenwu@trustnetic.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: ngbe: Add support for 1PPS and TOD
Jiawen Wu [Tue, 18 Feb 2025 02:34:32 +0000 (10:34 +0800)]
net: ngbe: Add support for 1PPS and TOD

Implement support for generating a 1pps output signal on SDP0.
And support custom firmware to output TOD.

Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
Link: https://patch.msgid.link/20250218023432.146536-5-jiawenwu@trustnetic.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: wangxun: Add periodic checks for overflow and errors
Jiawen Wu [Tue, 18 Feb 2025 02:34:31 +0000 (10:34 +0800)]
net: wangxun: Add periodic checks for overflow and errors

Implement watchdog task to detect SYSTIME overflow and error cases of
Rx/Tx timestamp.

Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://patch.msgid.link/20250218023432.146536-4-jiawenwu@trustnetic.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: wangxun: Support to get ts info
Jiawen Wu [Tue, 18 Feb 2025 02:34:30 +0000 (10:34 +0800)]
net: wangxun: Support to get ts info

Implement the function get_ts_info and get_ts_stats in ethtool_ops to
get the HW capabilities and statistics for timestamping.

Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://patch.msgid.link/20250218023432.146536-3-jiawenwu@trustnetic.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: wangxun: Add support for PTP clock
Jiawen Wu [Tue, 18 Feb 2025 02:34:29 +0000 (10:34 +0800)]
net: wangxun: Add support for PTP clock

Implement support for PTP clock on Wangxun NICs.

Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://patch.msgid.link/20250218023432.146536-2-jiawenwu@trustnetic.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agoMerge branch 'net-timestamp-bpf-extension-to-equip-applications-transparently'
Martin KaFai Lau [Thu, 20 Feb 2025 21:07:22 +0000 (13:07 -0800)]
Merge branch 'net-timestamp-bpf-extension-to-equip-applications-transparently'

Jason Xing says:

====================
net-timestamp: bpf extension to equip applications transparently

"Timestamping is key to debugging network stack latency. With
SO_TIMESTAMPING, bugs that are otherwise incorrectly assumed to be
network issues can be attributed to the kernel." This is extracted
from the talk "SO_TIMESTAMPING: Powering Fleetwide RPC Monitoring"
addressed by Willem de Bruijn at netdevconf 0x17).

There are a few areas that need optimization with the consideration of
easier use and less performance impact, which I highlighted and mainly
discussed at netconf 2024 with Willem de Bruijn and John Fastabend:
uAPI compatibility, extra system call overhead, and the need for
application modification. I initially managed to solve these issues
by writing a kernel module that hooks various key functions. However,
this approach is not suitable for the next kernel release. Therefore,
a BPF extension was proposed. During recent period, Martin KaFai Lau
provides invaluable suggestions about BPF along the way. Many thanks
here!

This series adds the BPF networking timestamping infrastructure through
reusing most of the tx timestamping callback that is currently enabled
by the SO_TIMESTAMPING.. This series also adds TX timestamping support
for TCP. The RX timestamping and UDP support will be added in the future.
====================

Link: https://patch.msgid.link/20250220072940.99994-1-kerneljasonxing@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2 months agoselftests/bpf: Add simple bpf tests in the tx path for timestamping feature
Jason Xing [Thu, 20 Feb 2025 07:29:40 +0000 (15:29 +0800)]
selftests/bpf: Add simple bpf tests in the tx path for timestamping feature

BPF program calculates a couple of latency deltas between each tx
timestamping callbacks. It can be used in the real world to diagnose
the kernel behaviour in the tx path.

Check the safety issues by accessing a few bpf calls in
bpf_test_access_bpf_calls() which are implemented in the patch 3 and 4.

Check if the bpf timestamping can co-exist with socket timestamping.

There remains a few realistic things[1][2] to highlight:
1. in general a packet may pass through multiple qdiscs. For instance
with bonding or tunnel virtual devices in the egress path.
2. packets may be resent, in which case an ACK might precede a repeat
SCHED and SND.
3. erroneous or malicious peers may also just never send an ACK.

[1]: https://lore.kernel.org/all/67a389af981b0_14e0832949d@willemb.c.googlers.com.notmuch/
[2]: https://lore.kernel.org/all/c329a0c1-239b-4ca1-91f2-cb30b8dd2f6a@linux.dev/

Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250220072940.99994-13-kerneljasonxing@gmail.com
2 months agobpf: Support selective sampling for bpf timestamping
Jason Xing [Thu, 20 Feb 2025 07:29:39 +0000 (15:29 +0800)]
bpf: Support selective sampling for bpf timestamping

Add the bpf_sock_ops_enable_tx_tstamp kfunc to allow BPF programs to
selectively enable TX timestamping on a skb during tcp_sendmsg().

For example, BPF program will limit tracking X numbers of packets
and then will stop there instead of tracing all the sendmsgs of
matched flow all along. It would be helpful for users who cannot
afford to calculate latencies from every sendmsg call probably
due to the performance or storage space consideration.

Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250220072940.99994-12-kerneljasonxing@gmail.com
2 months agobpf: Add BPF_SOCK_OPS_TSTAMP_SENDMSG_CB callback
Jason Xing [Thu, 20 Feb 2025 07:29:38 +0000 (15:29 +0800)]
bpf: Add BPF_SOCK_OPS_TSTAMP_SENDMSG_CB callback

This patch introduces a new callback in tcp_tx_timestamp() to correlate
tcp_sendmsg timestamp with timestamps from other tx timestamping
callbacks (e.g., SND/SW/ACK).

Without this patch, BPF program wouldn't know which timestamps belong
to which flow because of no socket lock protection. This new callback
is inserted in tcp_tx_timestamp() to address this issue because
tcp_tx_timestamp() still owns the same socket lock with
tcp_sendmsg_locked() in the meanwhile tcp_tx_timestamp() initializes
the timestamping related fields for the skb, especially tskey. The
tskey is the bridge to do the correlation.

For TCP, BPF program hooks the beginning of tcp_sendmsg_locked() and
then stores the sendmsg timestamp at the bpf_sk_storage, correlating
this timestamp with its tskey that are later used in other sending
timestamping callbacks.

Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250220072940.99994-11-kerneljasonxing@gmail.com
2 months agobpf: Add BPF_SOCK_OPS_TSTAMP_ACK_CB callback
Jason Xing [Thu, 20 Feb 2025 07:29:37 +0000 (15:29 +0800)]
bpf: Add BPF_SOCK_OPS_TSTAMP_ACK_CB callback

Support the ACK case for bpf timestamping.

Add a new sock_ops callback, BPF_SOCK_OPS_TSTAMP_ACK_CB. This
callback will occur at the same timestamping point as the user
space's SCM_TSTAMP_ACK. The BPF program can use it to get the
same SCM_TSTAMP_ACK timestamp without modifying the user-space
application.

This patch extends txstamp_ack to two bits: 1 stands for
SO_TIMESTAMPING mode, 2 bpf extension.

Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250220072940.99994-10-kerneljasonxing@gmail.com
2 months agobpf: Add BPF_SOCK_OPS_TSTAMP_SND_HW_CB callback
Jason Xing [Thu, 20 Feb 2025 07:29:36 +0000 (15:29 +0800)]
bpf: Add BPF_SOCK_OPS_TSTAMP_SND_HW_CB callback

Support hw SCM_TSTAMP_SND case for bpf timestamping.

Add a new sock_ops callback, BPF_SOCK_OPS_TSTAMP_SND_HW_CB. This
callback will occur at the same timestamping point as the user
space's hardware SCM_TSTAMP_SND. The BPF program can use it to
get the same SCM_TSTAMP_SND timestamp without modifying the
user-space application.

To avoid increasing the code complexity, replace SKBTX_HW_TSTAMP
with SKBTX_HW_TSTAMP_NOBPF instead of changing numerous callers
from driver side using SKBTX_HW_TSTAMP. The new definition of
SKBTX_HW_TSTAMP means the combination tests of socket timestamping
and bpf timestamping. After this patch, drivers can work under the
bpf timestamping.

Considering some drivers don't assign the skb with hardware
timestamp, this patch does the assignment and then BPF program
can acquire the hwstamp from skb directly.

Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250220072940.99994-9-kerneljasonxing@gmail.com
2 months agobpf: Add BPF_SOCK_OPS_TSTAMP_SND_SW_CB callback
Jason Xing [Thu, 20 Feb 2025 07:29:35 +0000 (15:29 +0800)]
bpf: Add BPF_SOCK_OPS_TSTAMP_SND_SW_CB callback

Support sw SCM_TSTAMP_SND case for bpf timestamping.

Add a new sock_ops callback, BPF_SOCK_OPS_TSTAMP_SND_SW_CB. This
callback will occur at the same timestamping point as the user
space's software SCM_TSTAMP_SND. The BPF program can use it to
get the same SCM_TSTAMP_SND timestamp without modifying the
user-space application.

Based on this patch, BPF program will get the software
timestamp when the driver is ready to send the skb. In the
sebsequent patch, the hardware timestamp will be supported.

Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250220072940.99994-8-kerneljasonxing@gmail.com
2 months agobpf: Add BPF_SOCK_OPS_TSTAMP_SCHED_CB callback
Jason Xing [Thu, 20 Feb 2025 07:29:34 +0000 (15:29 +0800)]
bpf: Add BPF_SOCK_OPS_TSTAMP_SCHED_CB callback

Support SCM_TSTAMP_SCHED case for bpf timestamping.

Add a new sock_ops callback, BPF_SOCK_OPS_TSTAMP_SCHED_CB. This
callback will occur at the same timestamping point as the user
space's SCM_TSTAMP_SCHED. The BPF program can use it to get the
same SCM_TSTAMP_SCHED timestamp without modifying the user-space
application.

A new SKBTX_BPF flag is added to mark skb_shinfo(skb)->tx_flags,
ensuring that the new BPF timestamping and the current user
space's SO_TIMESTAMPING do not interfere with each other.

Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250220072940.99994-7-kerneljasonxing@gmail.com
2 months agonet-timestamp: Prepare for isolating two modes of SO_TIMESTAMPING
Jason Xing [Thu, 20 Feb 2025 07:29:33 +0000 (15:29 +0800)]
net-timestamp: Prepare for isolating two modes of SO_TIMESTAMPING

No functional changes here. Only add test to see if the orig_skb
matches the usage of application SO_TIMESTAMPING.

In this series, bpf timestamping and previous socket timestamping
are implemented in the same function __skb_tstamp_tx(). To test
the socket enables socket timestamping feature, this function
skb_tstamp_tx_report_so_timestamping() is added.

In the next patch, another check for bpf timestamping feature
will be introduced just like the above report function, namely,
skb_tstamp_tx_report_bpf_timestamping(). Then users will be able
to know the socket enables either or both of features.

Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250220072940.99994-6-kerneljasonxing@gmail.com
2 months agobpf: Disable unsafe helpers in TX timestamping callbacks
Jason Xing [Thu, 20 Feb 2025 07:29:32 +0000 (15:29 +0800)]
bpf: Disable unsafe helpers in TX timestamping callbacks

New TX timestamping sock_ops callbacks will be added in the
subsequent patch. Some of the existing BPF helpers will not
be safe to be used in the TX timestamping callbacks.

The bpf_sock_ops_setsockopt, bpf_sock_ops_getsockopt, and
bpf_sock_ops_cb_flags_set require owning the sock lock. TX
timestamping callbacks will not own the lock.

The bpf_sock_ops_load_hdr_opt needs the skb->data pointing
to the TCP header. This will not be true in the TX timestamping
callbacks.

At the beginning of these helpers, this patch checks the
bpf_sock->op to ensure these helpers are used by the existing
sock_ops callbacks only.

Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20250220072940.99994-5-kerneljasonxing@gmail.com
2 months agobpf: Prevent unsafe access to the sock fields in the BPF timestamping callback
Jason Xing [Thu, 20 Feb 2025 07:29:31 +0000 (15:29 +0800)]
bpf: Prevent unsafe access to the sock fields in the BPF timestamping callback

The subsequent patch will implement BPF TX timestamping. It will
call the sockops BPF program without holding the sock lock.

This breaks the current assumption that all sock ops programs will
hold the sock lock. The sock's fields of the uapi's bpf_sock_ops
requires this assumption.

To address this, a new "u8 is_locked_tcp_sock;" field is added. This
patch sets it in the current sock_ops callbacks. The "is_fullsock"
test is then replaced by the "is_locked_tcp_sock" test during
sock_ops_convert_ctx_access().

The new TX timestamping callbacks added in the subsequent patch will
not have this set. This will prevent unsafe access from the new
timestamping callbacks.

Potentially, we could allow read-only access. However, this would
require identifying which callback is read-safe-only and also requires
additional BPF instruction rewrites in the covert_ctx. Since the BPF
program can always read everything from a socket (e.g., by using
bpf_core_cast), this patch keeps it simple and disables all read
and write access to any socket fields through the bpf_sock_ops
UAPI from the new TX timestamping callback.

Moreover, note that some of the fields in bpf_sock_ops are specific
to tcp_sock, and sock_ops currently only supports tcp_sock. In
the future, UDP timestamping will be added, which will also break
this assumption. The same idea used in this patch will be reused.
Considering that the current sock_ops only supports tcp_sock, the
variable is named is_locked_"tcp"_sock.

Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20250220072940.99994-4-kerneljasonxing@gmail.com
2 months agobpf: Prepare the sock_ops ctx and call bpf prog for TX timestamping
Jason Xing [Thu, 20 Feb 2025 07:29:30 +0000 (15:29 +0800)]
bpf: Prepare the sock_ops ctx and call bpf prog for TX timestamping

This patch introduces a new bpf_skops_tx_timestamping() function
that prepares the "struct bpf_sock_ops" ctx and then executes the
sockops BPF program.

The subsequent patch will utilize bpf_skops_tx_timestamping() at
the existing TX timestamping kernel callbacks (__sk_tstamp_tx
specifically) to call the sockops BPF program. Later, four callback
points to report information to user space based on this patch will
be introduced.

Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20250220072940.99994-3-kerneljasonxing@gmail.com
2 months agobpf: Add networking timestamping support to bpf_get/setsockopt()
Jason Xing [Thu, 20 Feb 2025 07:29:29 +0000 (15:29 +0800)]
bpf: Add networking timestamping support to bpf_get/setsockopt()

The new SK_BPF_CB_FLAGS and new SK_BPF_CB_TX_TIMESTAMPING are
added to bpf_get/setsockopt. The later patches will implement the
BPF networking timestamping. The BPF program will use
bpf_setsockopt(SK_BPF_CB_FLAGS, SK_BPF_CB_TX_TIMESTAMPING) to
enable the BPF networking timestamping on a socket.

Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250220072940.99994-2-kerneljasonxing@gmail.com
2 months agotun: Pad virtio headers
Akihiko Odaki [Sat, 15 Feb 2025 06:04:50 +0000 (15:04 +0900)]
tun: Pad virtio headers

tun simply advances iov_iter when it needs to pad virtio header,
which leaves the garbage in the buffer as is. This will become
especially problematic when tun starts to allow enabling the hash
reporting feature; even if the feature is enabled, the packet may lack a
hash value and may contain a hole in the virtio header because the
packet arrived before the feature gets enabled or does not contain the
header fields to be hashed. If the hole is not filled with zero, it is
impossible to tell if the packet lacks a hash value.

In theory, a user of tun can fill the buffer with zero before calling
read() to avoid such a problem, but leaving the garbage in the buffer is
awkward anyway so replace advancing the iterator with writing zeros.

A user might have initialized the buffer to some non-zero value,
expecting tun to skip writing it. As this was never a documented
feature, this seems unlikely.

The overhead of filling the hole in the header is negligible when the
header size is specified according to the specification as doing so will
not make another cache line dirty under a reasonable assumption. Below
is a proof of this statement:

The first 10 bytes of the header is always written and tun also writes
the packet itself immediately after the packet unless the packet is
empty. This makes a hole between these writes whose size is: sz - 10
where sz is the specified header size.

Therefore, we will never make another cache line dirty when:
sz < L1_CACHE_BYTES + 10
where L1_CACHE_BYTES is the cache line size. Assuming
L1_CACHE_BYTES >= 16, this inequation holds when: sz < 26.

sz <= 20 according to the current specification so we even have a
margin of 5 bytes in case that the header size grows in a future version
of the specification.

Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Link: https://patch.msgid.link/20250215-buffers-v2-1-1fbc6aaf8ad6@daynix.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonetdevsim: call napi_schedule from a timer context
Breno Leitao [Wed, 19 Feb 2025 16:41:20 +0000 (08:41 -0800)]
netdevsim: call napi_schedule from a timer context

The netdevsim driver was experiencing NOHZ tick-stop errors during packet
transmission due to pending softirq work when calling napi_schedule().
This issue was observed when running the netconsole selftest, which
triggered the following error message:

  NOHZ tick-stop error: local softirq work is pending, handler #08!!!

To fix this issue, introduce a timer that schedules napi_schedule()
from a timer context instead of calling it directly from the TX path.

Create an hrtimer for each queue and kick it from the TX path,
which then schedules napi_schedule() from the timer context.

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20250219-netdevsim-v3-1-811e2b8abc4c@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agoMerge branch 'flexible-array-for-ip-tunnel-options'
Jakub Kicinski [Thu, 20 Feb 2025 21:17:21 +0000 (13:17 -0800)]
Merge branch 'flexible-array-for-ip-tunnel-options'

Gal Pressman says:

====================
Flexible array for ip tunnel options

Remove the hidden assumption that options are allocated at the end of
the struct, and teach the compiler about them using a flexible array.

First patch is converting hard-coded 'info + 1' to use ip_tunnel_info()
helper.
Second patch adds the 'options' flexible array and changes the helper to
use it.

v4: https://lore.kernel.org/20250217202503.265318-1-gal@nvidia.com
v3: https://lore.kernel.org/20250212140953.107533-1-gal@nvidia.com
v2: https://lore.kernel.org/20250209101853.15828-1-gal@nvidia.com
====================

Link: https://patch.msgid.link/20250219143256.370277-1-gal@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: Add options as a flexible array to struct ip_tunnel_info
Gal Pressman [Wed, 19 Feb 2025 14:32:56 +0000 (16:32 +0200)]
net: Add options as a flexible array to struct ip_tunnel_info

Remove the hidden assumption that options are allocated at the end of
the struct, and teach the compiler about them using a flexible array.

With this, we can revert the unsafe_memcpy() call we have in
tun_dst_unclone() [1], and resolve the false field-spanning write
warning caused by the memcpy() in ip_tunnel_info_opts_set().

The layout of struct ip_tunnel_info remains the same with this patch.
Before this patch, there was an implicit padding at the end of the
struct, options would be written at 'info + 1' which is after the
padding.
This will remain the same as this patch explicitly aligns 'options'.
The alignment is needed as the options are later casted to different
structs, and might result in unaligned memory access.

Pahole output before this patch:
struct ip_tunnel_info {
    struct ip_tunnel_key       key;                  /*     0    64 */

    /* XXX last struct has 1 byte of padding */

    /* --- cacheline 1 boundary (64 bytes) --- */
    struct ip_tunnel_encap     encap;                /*    64     8 */
    struct dst_cache           dst_cache;            /*    72    16 */
    u8                         options_len;          /*    88     1 */
    u8                         mode;                 /*    89     1 */

    /* size: 96, cachelines: 2, members: 5 */
    /* padding: 6 */
    /* paddings: 1, sum paddings: 1 */
    /* last cacheline: 32 bytes */
};

Pahole output after this patch:
struct ip_tunnel_info {
    struct ip_tunnel_key       key;                  /*     0    64 */

    /* XXX last struct has 1 byte of padding */

    /* --- cacheline 1 boundary (64 bytes) --- */
    struct ip_tunnel_encap     encap;                /*    64     8 */
    struct dst_cache           dst_cache;            /*    72    16 */
    u8                         options_len;          /*    88     1 */
    u8                         mode;                 /*    89     1 */

    /* XXX 6 bytes hole, try to pack */

    u8                         options[] __attribute__((__aligned__(16))); /*    96     0 */

    /* size: 96, cachelines: 2, members: 6 */
    /* sum members: 90, holes: 1, sum holes: 6 */
    /* paddings: 1, sum paddings: 1 */
    /* forced alignments: 1, forced holes: 1, sum forced holes: 6 */
    /* last cacheline: 32 bytes */
} __attribute__((__aligned__(16)));

[1] Commit 13cfd6a6d7ac ("net: Silence false field-spanning write warning in metadata_dst memcpy")

Link: https://lore.kernel.org/all/53D1D353-B8F6-4ADC-8F29-8C48A7C9C6F1@kernel.org/
Suggested-by: Kees Cook <kees@kernel.org>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20250219143256.370277-3-gal@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agoip_tunnel: Use ip_tunnel_info() helper instead of 'info + 1'
Gal Pressman [Wed, 19 Feb 2025 14:32:55 +0000 (16:32 +0200)]
ip_tunnel: Use ip_tunnel_info() helper instead of 'info + 1'

Tunnel options should not be accessed directly, use the ip_tunnel_info()
accessor instead.

Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20250219143256.370277-2-gal@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Jakub Kicinski [Thu, 20 Feb 2025 18:36:34 +0000 (10:36 -0800)]
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Cross-merge networking fixes after downstream PR (net-6.14-rc4).

No conflicts or adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agoMerge tag 'net-6.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Linus Torvalds [Thu, 20 Feb 2025 18:19:54 +0000 (10:19 -0800)]
Merge tag 'net-6.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
 "Smaller than usual with no fixes from any subtree.

  Current release - regressions:

   - core: fix race of rtnl_net_lock(dev_net(dev))

  Previous releases - regressions:

   - core: remove the single page frag cache for good

   - flow_dissector: fix handling of mixed port and port-range keys

   - sched: cls_api: fix error handling causing NULL dereference

   - tcp:
       - adjust rcvq_space after updating scaling ratio
       - drop secpath at the same time as we currently drop dst

   - eth: gtp: suppress list corruption splat in gtp_net_exit_batch_rtnl().

  Previous releases - always broken:

   - vsock:
       - fix variables initialization during resuming
       - for connectible sockets allow only connected

   - eth:
       - geneve: fix use-after-free in geneve_find_dev()
       - ibmvnic: don't reference skb after sending to VIOS"

* tag 'net-6.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (34 commits)
  Revert "net: skb: introduce and use a single page frag cache"
  net: allow small head cache usage with large MAX_SKB_FRAGS values
  nfp: bpf: Add check for nfp_app_ctrl_msg_alloc()
  tcp: drop secpath at the same time as we currently drop dst
  net: axienet: Set mac_managed_pm
  arp: switch to dev_getbyhwaddr() in arp_req_set_public()
  net: Add non-RCU dev_getbyhwaddr() helper
  sctp: Fix undefined behavior in left shift operation
  selftests/bpf: Add a specific dst port matching
  flow_dissector: Fix port range key handling in BPF conversion
  selftests/net/forwarding: Add a test case for tc-flower of mixed port and port-range
  flow_dissector: Fix handling of mixed port and port-range keys
  geneve: Suppress list corruption splat in geneve_destroy_tunnels().
  gtp: Suppress list corruption splat in gtp_net_exit_batch_rtnl().
  dev: Use rtnl_net_dev_lock() in unregister_netdev().
  net: Fix dev_net(dev) race in unregister_netdevice_notifier_dev_net().
  net: Add net_passive_inc() and net_passive_dec().
  net: pse-pd: pd692x0: Fix power limit retrieval
  MAINTAINERS: trim the GVE entry
  gve: set xdp redirect target only when it is available
  ...

2 months agoMerge tag 'v6.14-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Linus Torvalds [Thu, 20 Feb 2025 16:59:00 +0000 (08:59 -0800)]
Merge tag 'v6.14-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull smb client fixes from Steve French:

 - Fix for chmod regression

 - Two reparse point related fixes

 - One minor cleanup (for GCC 14 compiles)

 - Fix for SMB3.1.1 POSIX Extensions reporting incorrect file type

* tag 'v6.14-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: Treat unhandled directory name surrogate reparse points as mount directory nodes
  cifs: Throw -EOPNOTSUPP error on unsupported reparse point type from parse_reparse_point()
  smb311: failure to open files of length 1040 when mounting with SMB3.1.1 POSIX extensions
  smb: client, common: Avoid multiple -Wflex-array-member-not-at-end warnings
  smb: client: fix chmod(2) regression with ATTR_READONLY

2 months agoMerge tag 'bcachefs-2025-02-20' of git://evilpiepirate.org/bcachefs
Linus Torvalds [Thu, 20 Feb 2025 16:51:57 +0000 (08:51 -0800)]
Merge tag 'bcachefs-2025-02-20' of git://evilpiepirate.org/bcachefs

Pull bcachefs fixes from Kent Overstreet:
 "Small stuff:

   - The fsck code for Hongbo's directory i_size patch was wrong, caught
     by transaction restart injection: we now have the CI running
     another test variant with restart injection enabled

   - Another fixup for reflink pointers to missing indirect extents:
     previous fix was for fsck code, this fixes the normal runtime paths

   - Another small srcu lock hold time fix, reported by jpsollie"

* tag 'bcachefs-2025-02-20' of git://evilpiepirate.org/bcachefs:
  bcachefs: Fix srcu lock warning in btree_update_nodes_written()
  bcachefs: Fix bch2_indirect_extent_missing_error()
  bcachefs: Fix fsck directory i_size checking

2 months agoMerge tag 'xfs-fixes-6.14-rc4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Linus Torvalds [Thu, 20 Feb 2025 16:48:55 +0000 (08:48 -0800)]
Merge tag 'xfs-fixes-6.14-rc4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Carlos Maiolino:
 "Just a collection of bug fixes, nothing really stands out"

* tag 'xfs-fixes-6.14-rc4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: flush inodegc before swapon
  xfs: rename xfs_iomap_swapfile_activate to xfs_vm_swap_activate
  xfs: Do not allow norecovery mount with quotacheck
  xfs: do not check NEEDSREPAIR if ro,norecovery mount.
  xfs: fix data fork format filtering during inode repair
  xfs: fix online repair probing when CONFIG_XFS_ONLINE_REPAIR=n