]> www.infradead.org Git - users/hch/uuid.git/log
users/hch/uuid.git
22 months agonet: Namespace-ify sysctl_optmem_max
Eric Dumazet [Thu, 14 Dec 2023 10:49:00 +0000 (10:49 +0000)]
net: Namespace-ify sysctl_optmem_max

optmem_max being used in tx zerocopy,
we want to be able to control it on a netns basis.

Following patch changes two tests.

Tested:

oqq130:~# cat /proc/sys/net/core/optmem_max
131072
oqq130:~# echo 1000000 >/proc/sys/net/core/optmem_max
oqq130:~# cat /proc/sys/net/core/optmem_max
1000000
oqq130:~# unshare -n
oqq130:~# cat /proc/sys/net/core/optmem_max
131072
oqq130:~# exit
logout
oqq130:~# cat /proc/sys/net/core/optmem_max
1000000

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
22 months agonet: increase optmem_max default value
Eric Dumazet [Thu, 14 Dec 2023 10:48:59 +0000 (10:48 +0000)]
net: increase optmem_max default value

For many years, /proc/sys/net/core/optmem_max default value
on a 64bit kernel has been 20 KB.

Regular usage of TCP tx zerocopy needs a bit more.

Google has used 128KB as the default value for 7 years without
any problem.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
22 months agoMerge branch 'mlxsw-CFF-flood-mode'
David S. Miller [Fri, 15 Dec 2023 10:58:00 +0000 (10:58 +0000)]
Merge branch 'mlxsw-CFF-flood-mode'

Petr Machata says:

====================
mlxsw: CFF flood mode: NVE underlay configuration

Recently, support for CFF flood mode (for Compressed FID Flooding) was
added to the mlxsw driver. The most recent patchset has a detailed coverage
of what CFF is and what has changed and how:

    https://lore.kernel.org/netdev/cover.1701183891.git.petrm@nvidia.com/

In CFF flood mode, each FID allocates a handful (in our implementation two
or three) consecutive PGT entries. One entry holds the flood vector for
unknown-UC traffic, one for MC, one for BC.

To determine how to look up flood vectors, the CFF flood mode uses a
concept of flood profiles, which are IDs that reference mappings from
traffic types to offsets. In the case of CFF flood mode, the offset in
question is applied to the PGT address configured at a FID. The same
mechanism is used by NVE underlay for flooding. Again the profile ID and
the traffic type determine the offset to apply, this time to KVD address
used to look up flooding entries. Since mlxsw configures NVE underlay flood
the same regardless of traffic type, only one offset was ever needed: the
zero, which is the default, and thus no explicit configuration was needed.

Now that CFF uses profiles as well, it would be better to configure the
profile used by NVE explicitly, to make the configuration visible in the
source code.

In this patchset, add the register support (in patch #1), add a new traffic
type to refer to "any traffic at all" (in patch #2) and finally configure
the NVE profile explicitly for FIDs (in patch #3).

So far, the implicitly configured flood profile was the ID 0. With this
patchset, it changes to 3, leaving the 0 free to allow us to spot missed
configuration.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
22 months agomlxsw: spectrum_fid: Set NVE flood profile as part of FID configuration
Petr Machata [Thu, 14 Dec 2023 13:19:07 +0000 (14:19 +0100)]
mlxsw: spectrum_fid: Set NVE flood profile as part of FID configuration

The NVE flood profile is used for determining of offset applied to KVD
address for NVE flood. We currently do not set it, leaving it at the
default value of 0. That is not an issue: all the traffic-type-to-offset
mappings (as configured by SFFP) default to offset of 0. This is what we
need anyway, as mlxsw only allocates a single KVD entry for NVE underlay.

The field is only relevant on Spectrum-2 and above. So to be fully
consistent, we should split the existing controlled ops to Spectrum-1 and
Spectrum>1 variants, with only the latter setting the field. But that seems
like a lot of overhead for a single field whose meaning is "everything is
the default". So instead pretend that the NVE flood profile does not exist
in the controlled flood mode, like we have so far, and only set it when
flood mode is CFF.

Setting this at all serves dual purpose. First, it is now clear which
profile belongs to NVE, because in the CFF mode, we have multiple users.
This should prevent bugs in flood profile management. Second, using
specifically non-zero value means there will be no valid uses of the
profile 0, which we can therefore use as a sentinel.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
22 months agomlxsw: spectrum_fid: Add an "any" packet type
Petr Machata [Thu, 14 Dec 2023 13:19:06 +0000 (14:19 +0100)]
mlxsw: spectrum_fid: Add an "any" packet type

Flood profiles have been used prior to CFF support for NVE underlay. Like
is the case with FID flooding, an NVE profile describes at which offset a
datum is located given traffic type. mlxsw currently only ever uses one KVD
entry for NVE lookup, i.e. regardless of traffic type, the offset is always
zero. To be able to describe this, add a traffic type enumerator describing
"any traffic type".

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
22 months agomlxsw: reg: Add nve_flood_prf_id field to SFMR
Petr Machata [Thu, 14 Dec 2023 13:19:05 +0000 (14:19 +0100)]
mlxsw: reg: Add nve_flood_prf_id field to SFMR

The field is used for setting a flood profile for lookup of KVD entry for
NVE underlay. As the other uses of flood profile, this references a traffic
type-to-offset mapping, except here it is not applied to PGT offsets, but
KVD offsets.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
22 months agoMerge branch 'net-at803x-cleanups'
David S. Miller [Fri, 15 Dec 2023 10:45:57 +0000 (10:45 +0000)]
Merge branch 'net-at803x-cleanups'

Christian Marangi says:

====================
net: phy: at803x: additional cleanup for qca808x

This small series is a preparation for the big code split. While the
qca808x code is waiting to be reviwed and merged, we can further cleanup
and generalize shared functions between at803x and qca808x.

With these last 2 patch everything is ready to move the driver to a
dedicated directory and split the code by creating a library module
for the few shared functions between the 2 driver.

Eventually at803x can be further cleaned and generalized but everything
will be already self contained and related only to at803x family of PHYs.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
22 months agonet: phy: at803x: make read specific status function more generic
Christian Marangi [Thu, 14 Dec 2023 00:44:32 +0000 (01:44 +0100)]
net: phy: at803x: make read specific status function more generic

Rework read specific status function to be more generic. The function
apply different speed mask based on the PHY ID. Make it more generic by
adding an additional arg to pass the specific speed (ss) mask and use
the provided mask to parse the speed value.

This is needed to permit an easier deatch of qca808x code from the
at803x driver.

Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
22 months agonet: phy: at803x: move specific qca808x config_aneg to dedicated function
Christian Marangi [Thu, 14 Dec 2023 00:44:31 +0000 (01:44 +0100)]
net: phy: at803x: move specific qca808x config_aneg to dedicated function

Move specific qca808x config_aneg to dedicated function to permit easier
split of qca808x portion from at803x driver.

Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
22 months agoMerge branch 'vsock-credit-update'
David S. Miller [Fri, 15 Dec 2023 10:37:36 +0000 (10:37 +0000)]
Merge branch 'vsock-credit-update'

Arseniy Krasnov says:

====================
send credit update during setting SO_RCVLOWAT

                               DESCRIPTION

This patchset fixes old problem with hungup of both rx/tx sides and adds
test for it. This happens due to non-default SO_RCVLOWAT value and
deferred credit update in virtio/vsock. Link to previous old patchset:
https://lore.kernel.org/netdev/39b2e9fd-601b-189d-39a9-914e5574524c@sberdevices.ru/

Here is what happens step by step:

                                  TEST

                            INITIAL CONDITIONS

1) Vsock buffer size is 128KB.
2) Maximum packet size is also 64KB as defined in header (yes it is
   hardcoded, just to remind about that value).
3) SO_RCVLOWAT is default, e.g. 1 byte.

                                 STEPS

            SENDER                              RECEIVER
1) sends 128KB + 1 byte in a
   single buffer. 128KB will
   be sent, but for 1 byte
   sender will wait for free
   space at peer. Sender goes
   to sleep.

2)                                     reads 64KB, credit update not sent
3)                                     sets SO_RCVLOWAT to 64KB + 1
4)                                     poll() -> wait forever, there is
                                       only 64KB available to read.

So in step 4) receiver also goes to sleep, waiting for enough data or
connection shutdown message from the sender. Idea to fix it is that rx
kicks tx side to continue transmission (and may be close connection)
when rx changes number of bytes to be woken up (e.g. SO_RCVLOWAT) and
this value is bigger than number of available bytes to read.

I've added small test for this, but not sure as it uses hardcoded value
for maximum packet length, this value is defined in kernel header and
used to control deferred credit update. And as this is not available to
userspace, I can't control test parameters correctly (if one day this
define will be changed - test may become useless).

Head for this patchset is:
https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=9bab51bd662be4c3ebb18a28879981d69f3ef15a

Link to v1:
https://lore.kernel.org/netdev/20231108072004.1045669-1-avkrasnov@salutedevices.com/
Link to v2:
https://lore.kernel.org/netdev/20231119204922.2251912-1-avkrasnov@salutedevices.com/
Link to v3:
https://lore.kernel.org/netdev/20231122180510.2297075-1-avkrasnov@salutedevices.com/
Link to v4:
https://lore.kernel.org/netdev/20231129212519.2938875-1-avkrasnov@salutedevices.com/
Link to v5:
https://lore.kernel.org/netdev/20231130130840.253733-1-avkrasnov@salutedevices.com/
Link to v6:
https://lore.kernel.org/netdev/20231205064806.2851305-1-avkrasnov@salutedevices.com/
Link to v7:
https://lore.kernel.org/netdev/20231206211849.2707151-1-avkrasnov@salutedevices.com/
Link to v8:
https://lore.kernel.org/netdev/20231211211658.2904268-1-avkrasnov@salutedevices.com/
Link to v9:
https://lore.kernel.org/netdev/20231214091947.395892-1-avkrasnov@salutedevices.com/

Changelog:
v1 -> v2:
 * Patchset rebased and tested on new HEAD of net-next (see hash above).
 * New patch is added as 0001 - it removes return from SO_RCVLOWAT set
   callback in 'af_vsock.c' when transport callback is set - with that
   we can set 'sk_rcvlowat' only once in 'af_vsock.c' and in future do
   not copy-paste it to every transport. It was discussed in v1.
 * See per-patch changelog after ---.
v2 -> v3:
 * See changelog after --- in 0003 only (0001 and 0002 still same).
v3 -> v4:
 * Patchset rebased and tested on new HEAD of net-next (see hash above).
 * See per-patch changelog after ---.
v4 -> v5:
 * Change patchset tag 'RFC' -> 'net-next'.
 * See per-patch changelog after ---.
v5 -> v6:
 * New patch 0003 which sends credit update during reading bytes from
   socket.
 * See per-patch changelog after ---.
v6 -> v7:
 * Patchset rebased and tested on new HEAD of net-next (see hash above).
 * See per-patch changelog after ---.
v7 -> v8:
 * See per-patch changelog after ---.
v8 -> v9:
 * Patchset rebased and tested on new HEAD of net-next (see hash above).
 * Add 'Fixes' tag for the current 0002.
 * Reorder patches by moving two fixes first.
v9 -> v10:
 * Squash 0002 and 0003 and update commit message in result.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
22 months agovsock/test: two tests to check credit update logic
Arseniy Krasnov [Thu, 14 Dec 2023 12:52:30 +0000 (15:52 +0300)]
vsock/test: two tests to check credit update logic

Both tests are almost same, only differs in two 'if' conditions, so
implemented in a single function. Tests check, that credit update
message is sent:

1) During setting SO_RCVLOWAT value of the socket.
2) When number of 'rx_bytes' become smaller than SO_RCVLOWAT value.

Signed-off-by: Arseniy Krasnov <avkrasnov@salutedevices.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
22 months agovirtio/vsock: send credit update during setting SO_RCVLOWAT
Arseniy Krasnov [Thu, 14 Dec 2023 12:52:29 +0000 (15:52 +0300)]
virtio/vsock: send credit update during setting SO_RCVLOWAT

Send credit update message when SO_RCVLOWAT is updated and it is bigger
than number of bytes in rx queue. It is needed, because 'poll()' will
wait until number of bytes in rx queue will be not smaller than
O_RCVLOWAT, so kick sender to send more data. Otherwise mutual hungup
for tx/rx is possible: sender waits for free space and receiver is
waiting data in 'poll()'.

Rename 'set_rcvlowat' callback to 'notify_set_rcvlowat' and set
'sk->sk_rcvlowat' only in one place (i.e. 'vsock_set_rcvlowat'), so the
transport doesn't need to do it.

Fixes: b89d882dc9fc ("vsock/virtio: reduce credit update messages")
Signed-off-by: Arseniy Krasnov <avkrasnov@salutedevices.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
22 months agovirtio/vsock: fix logic which reduces credit update messages
Arseniy Krasnov [Thu, 14 Dec 2023 12:52:28 +0000 (15:52 +0300)]
virtio/vsock: fix logic which reduces credit update messages

Add one more condition for sending credit update during dequeue from
stream socket: when number of bytes in the rx queue is smaller than
SO_RCVLOWAT value of the socket. This is actual for non-default value
of SO_RCVLOWAT (e.g. not 1) - idea is to "kick" peer to continue data
transmission, because we need at least SO_RCVLOWAT bytes in our rx
queue to wake up user for reading data (in corner case it is also
possible to stuck both tx and rx sides, this is why 'Fixes' is used).

Fixes: b89d882dc9fc ("vsock/virtio: reduce credit update messages")
Signed-off-by: Arseniy Krasnov <avkrasnov@salutedevices.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
22 months agoipmr: support IP_PKTINFO on cache report IGMP msg
Leone Fernando [Wed, 13 Dec 2023 16:19:35 +0000 (17:19 +0100)]
ipmr: support IP_PKTINFO on cache report IGMP msg

In order to support IP_PKTINFO on those packets, we need to call
ipv4_pktinfo_prepare.

When sending mrouted/pimd daemons a cache report IGMP msg, it is
unnecessary to set dst on the newly created skb.
It used to be necessary on older versions until
commit d826eb14ecef ("ipv4: PKTINFO doesnt need dst reference") which
changed the way IP_PKTINFO struct is been retrieved.

Changes from v1:
1. Undo changes in ipv4_pktinfo_prepare function. use it directly
   and copy the control block.

Fixes: d826eb14ecef ("ipv4: PKTINFO doesnt need dst reference")
Signed-off-by: Leone Fernando <leone4fernando@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
22 months agonet: mana: add msix index sharing between EQs
Konstantin Taranov [Wed, 13 Dec 2023 10:01:47 +0000 (02:01 -0800)]
net: mana: add msix index sharing between EQs

This patch allows to assign and poll more than one EQ on the same
msix index.
It is achieved by introducing a list of attached EQs in each IRQ context.
It also removes the existing msix_index map that tried to ensure that there
is only one EQ at each msix_index.
This patch exports symbols for creating EQs from other MANA kernel modules.

Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
22 months agoocteontx2-af: Fix multicast/mirror group lock/unlock issue
Suman Ghosh [Wed, 13 Dec 2023 09:53:49 +0000 (15:23 +0530)]
octeontx2-af: Fix multicast/mirror group lock/unlock issue

As per the existing implementation, there exists a race between finding
a multicast/mirror group entry and deleting that entry. The group lock
was taken and released independently by rvu_nix_mcast_find_grp_elem()
function. Which is incorrect and group lock should be taken during the
entire operation of group updation/deletion. This patch fixes the same.

Fixes: 51b2804c19cd ("octeontx2-af: Add new mbox to support multicast/mirror offload")
Signed-off-by: Suman Ghosh <sumang@marvell.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
22 months agoMerge tag 'mlx5-updates-2023-12-13' of git://git.kernel.org/pub/scm/linux/kernel...
David S. Miller [Fri, 15 Dec 2023 10:00:02 +0000 (10:00 +0000)]
Merge tag 'mlx5-updates-2023-12-13' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5-updates-2023-12-13

Preparation for mlx5e socket direct feature.

Socket direct will allow multiple PF devices attached to different
NUMA nodes but sharing the same physical port.

The following series is a small refactoring series in preparation
to support socket direct in the following submission.

Highlights:
 - Define required device registers and bits related to socket direct
 - Flow steering re-arrangements
 - Generalize TX objects (TISs) and store them in a common object, will
   be useful in the next series for per function object management.
 - Decouple raw CQ objects from their parent netdev priv
 - Prepare devcom for Socket Direct device group discovery.

Please see the individual patches for more information.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
22 months agoMerge branch 'net-phy-rust'
David S. Miller [Fri, 15 Dec 2023 09:35:50 +0000 (09:35 +0000)]
Merge branch 'net-phy-rust'

FUJITA Tomonori says:

====================
Rust abstractions for network PHY drivers

No functional change since v10; only comment and commit log updates.

This patchset adds Rust abstractions for phylib. It doesn't fully
cover the C APIs yet but I think that it's already useful. I implement
two PHY drivers (Asix AX88772A PHYs and Realtek Generic FE-GE). Seems
they work well with real hardware.

The first patch introduces Rust bindings for phylib.

The second patch adds a macro to declare a kernel module for PHYs
drivers.

The third adds the Rust ETHERNET PHY LIBRARY entry to MAINTAINERS
file; adds the binding file and me as a maintainer (as Andrew Lunn
suggested) with Trevor Gross as a reviewer.

The last patch introduces the Rust version of Asix PHY driver,
drivers/net/phy/ax88796b.c. The features are equivalent to the C
version. You can choose C (by default) or Rust version on kernel
configuration.

v11:
  - adds Andrew, Alice, and Trevor's Reviewed-by
  - comment update
v10: https://lore.kernel.org/netdev/20231210234924.1453917-1-fujita.tomonori@gmail.com/T/
  - adds Trevor's SoB to the third patch
  - adds Benno's Reviewed-by to the second patch
v9: https://lore.kernel.org/netdev/20231205.124531.842372711631366729.fujita.tomonori@gmail.com/T/
  - adds a workaround to access to a bit field in phy_device
  - fixes a comment typo
v8: https://lore.kernel.org/netdev/20231123050412.1012252-1-fujita.tomonori@gmail.com/
  - updates the safety comments on Device and its related code
  - uses _phy_start_aneg instead of phy_start_aneg
  - drops the patch for enum synchronization
  - moves Sync From Registration to DriverVTable
  - fixes doctest errors
  - minor cleanups
v7: https://lore.kernel.org/netdev/20231026001050.1720612-1-fujita.tomonori@gmail.com/T/
  - renames get_link() to is_link_up()
  - improves the macro format
  - improves the commit log in the third patch
  - improves comments
v6: https://lore.kernel.org/netdev/20231025.090243.1437967503809186729.fujita.tomonori@gmail.com/T/
  - improves comments
  - makes the requirement of phy_drivers_register clear
  - fixes Makefile of the third patch
v5: https://lore.kernel.org/all/20231019.094147.1808345526469629486.fujita.tomonori@gmail.com/T/
  - drops the rustified-enum option, writes match by hand; no *risk* of UB
  - adds Miguel's patch for enum checking
  - moves CONFIG_RUST_PHYLIB_ABSTRACTIONS to drivers/net/phy/Kconfig
  - adds a new entry for this abstractions in MAINTAINERS
  - changes some of Device's methods to take &mut self
  - comment improvment
v4: https://lore.kernel.org/netdev/20231012125349.2702474-1-fujita.tomonori@gmail.com/T/
  - split the core patch
  - making Device::from_raw() private
  - comment improvement with code update
  - commit message improvement
  - avoiding using bindings::phy_driver in public functions
  - using an anonymous constant in module_phy_driver macro
v3: https://lore.kernel.org/netdev/20231011.231607.1747074555988728415.fujita.tomonori@gmail.com/T/
  - changes the base tree to net-next from rust-next
  - makes this feature optional; only enabled with CONFIG_RUST_PHYLIB_BINDINGS=y
  - cosmetic code and comment improvement
  - adds copyright
v2: https://lore.kernel.org/netdev/20231006094911.3305152-2-fujita.tomonori@gmail.com/T/
  - build failure fix
  - function renaming
v1: https://lore.kernel.org/netdev/20231002085302.2274260-3-fujita.tomonori@gmail.com/T/
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
22 months agonet: phy: add Rust Asix PHY driver
FUJITA Tomonori [Wed, 13 Dec 2023 00:42:11 +0000 (09:42 +0900)]
net: phy: add Rust Asix PHY driver

This is the Rust implementation of drivers/net/phy/ax88796b.c. The
features are equivalent. You can choose C or Rust version kernel
configuration.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@gmail.com>
Reviewed-by: Trevor Gross <tmgross@umich.edu>
Reviewed-by: Benno Lossin <benno.lossin@proton.me>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
22 months agoMAINTAINERS: add Rust PHY abstractions for ETHERNET PHY LIBRARY
FUJITA Tomonori [Wed, 13 Dec 2023 00:42:10 +0000 (09:42 +0900)]
MAINTAINERS: add Rust PHY abstractions for ETHERNET PHY LIBRARY

Adds me as a maintainer and Trevor as a reviewer.

The files are placed at rust/kernel/ directory for now but the files
are likely to be moved to net/ directory once a new Rust build system
is implemented.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@gmail.com>
Signed-off-by: Trevor Gross <tmgross@umich.edu>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
22 months agorust: net::phy add module_phy_driver macro
FUJITA Tomonori [Wed, 13 Dec 2023 00:42:09 +0000 (09:42 +0900)]
rust: net::phy add module_phy_driver macro

This macro creates an array of kernel's `struct phy_driver` and
registers it. This also corresponds to the kernel's
`MODULE_DEVICE_TABLE` macro, which embeds the information for module
loading into the module binary file.

A PHY driver should use this macro.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@gmail.com>
Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Reviewed-by: Benno Lossin <benno.lossin@proton.me>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Trevor Gross <tmgross@umich.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
22 months agorust: core abstractions for network PHY drivers
FUJITA Tomonori [Wed, 13 Dec 2023 00:42:08 +0000 (09:42 +0900)]
rust: core abstractions for network PHY drivers

This patch adds abstractions to implement network PHY drivers; the
driver registration and bindings for some of callback functions in
struct phy_driver and many genphy_ functions.

This feature is enabled with CONFIG_RUST_PHYLIB_ABSTRACTIONS=y.

This patch enables unstable const_maybe_uninit_zeroed feature for
kernel crate to enable unsafe code to handle a constant value with
uninitialized data. With the feature, the abstractions can initialize
a phy_driver structure with zero easily; instead of initializing all
the members by hand. It's supposed to be stable in the not so distant
future.

Link: https://github.com/rust-lang/rust/pull/116218
Signed-off-by: FUJITA Tomonori <fujita.tomonori@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
22 months agonet: stmmac: don't create a MDIO bus if unnecessary
Andrew Halaney [Tue, 12 Dec 2023 22:07:36 +0000 (16:07 -0600)]
net: stmmac: don't create a MDIO bus if unnecessary

Currently a MDIO bus is created if the devicetree description is either:

    1. Not fixed-link
    2. fixed-link but contains a MDIO bus as well

The "1" case above isn't always accurate. If there's a phy-handle,
it could be referencing a phy on another MDIO controller's bus[1]. In
this case, where the MDIO bus is not described at all, currently
stmmac will make a MDIO bus and scan its address space to discover
phys (of which there are none). This process takes time scanning a bus
that is known to be empty, delaying time to complete probe.

There are also a lot of upstream devicetrees[2] that expect a MDIO bus
to be created, scanned for phys, and the first one found connected
to the MAC. This case can be inferred from the platform description by
not having a phy-handle && not being fixed-link. This hits case "1" in
the current driver's logic, and must be handled in any logic change here
since it is a valid legacy dt-binding.

Let's improve the logic to create a MDIO bus if either:

    - Devicetree contains a MDIO bus
    - !fixed-link && !phy-handle (legacy handling)

This way the case where no MDIO bus should be made is handled, as well
as retaining backwards compatibility with the valid cases.

Below devicetree snippets can be found that explain some of
the cases above more concretely.

Here's[0] a devicetree example where the MAC is both fixed-link and
driving a switch on MDIO (case "2" above). This needs a MDIO bus to
be created:

    &fec1 {
            phy-mode = "rmii";

            fixed-link {
                    speed = <100>;
                    full-duplex;
            };

            mdio1: mdio {
                    switch0: switch0@0 {
                            compatible = "marvell,mv88e6190";
                            pinctrl-0 = <&pinctrl_gpio_switch0>;
                    };
            };
    };

Here's[1] an example where there is no MDIO bus or fixed-link for
the ethernet1 MAC, so no MDIO bus should be created since ethernet0
is the MDIO master for ethernet1's phy:

    &ethernet0 {
            phy-mode = "sgmii";
            phy-handle = <&sgmii_phy0>;

            mdio {
                    compatible = "snps,dwmac-mdio";
                    sgmii_phy0: phy@8 {
                            compatible = "ethernet-phy-id0141.0dd4";
                            reg = <0x8>;
                            device_type = "ethernet-phy";
                    };

                    sgmii_phy1: phy@a {
                            compatible = "ethernet-phy-id0141.0dd4";
                            reg = <0xa>;
                            device_type = "ethernet-phy";
                    };
            };
    };

    &ethernet1 {
            phy-mode = "sgmii";
            phy-handle = <&sgmii_phy1>;
    };

Finally there's descriptions like this[2] which don't describe the
MDIO bus but expect it to be created and the whole address space
scanned for a phy since there's no phy-handle or fixed-link described:

    &gmac {
            phy-supply = <&vcc_lan>;
            phy-mode = "rmii";
            snps,reset-gpio = <&gpio3 RK_PB4 GPIO_ACTIVE_HIGH>;
            snps,reset-active-low;
            snps,reset-delays-us = <0 10000 1000000>;
    };

[0] https://elixir.bootlin.com/linux/v6.5-rc5/source/arch/arm/boot/dts/nxp/vf/vf610-zii-ssmb-dtu.dts
[1] https://elixir.bootlin.com/linux/v6.6-rc5/source/arch/arm64/boot/dts/qcom/sa8775p-ride.dts
[2] https://elixir.bootlin.com/linux/v6.6-rc5/source/arch/arm64/boot/dts/rockchip/rk3368-r88.dts#L164

Reviewed-by: Serge Semin <fancer.lancer@gmail.com>
Co-developed-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Signed-off-by: Andrew Halaney <ahalaney@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
22 months agoi40e: remove fake support of rx-frames-irq
Jason Xing [Wed, 13 Dec 2023 18:44:05 +0000 (10:44 -0800)]
i40e: remove fake support of rx-frames-irq

Since we never support this feature for I40E driver, we don't have to
display the value when using 'ethtool -c eth0'.

Before this patch applied, the rx-frames-irq is 256 which is consistent
with tx-frames-irq. Apparently it could mislead users.

Signed-off-by: Jason Xing <kernelxing@tencent.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://lore.kernel.org/r/20231213184406.1306602-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
22 months agoMerge branch 'mdio-mux-cleanup'
Jakub Kicinski [Fri, 15 Dec 2023 02:55:40 +0000 (18:55 -0800)]
Merge branch 'mdio-mux-cleanup'

Vladimir Oltean says:

====================
MDIO mux cleanup

This small patch set resolves some technical debt in the MDIO mux driver
which was discovered during the investigation for commit 1f9f2143f24e
("net: mdio-mux: fix C45 access returning -EIO after API change").

The patches have been sitting for 2 months in the NXP SDK kernel and
haven't caused issues.
====================

Link: https://lore.kernel.org/r/20231213152712.320842-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
22 months agonet: mdio-mux: be compatible with parent buses which only support C45
Vladimir Oltean [Wed, 13 Dec 2023 15:27:12 +0000 (17:27 +0200)]
net: mdio-mux: be compatible with parent buses which only support C45

After the mii_bus API conversion to a split read() / read_c45(), there
might be MDIO parent buses which only populate the read_c45() and
write_c45() function pointers but not the C22 variants.

We haven't seen these in the wild paired with MDIO multiplexers, but
Andrew points out we should treat the corner case.

Link: https://lore.kernel.org/netdev/4ccd7dc9-b611-48aa-865f-68d3a1327ce8@lunn.ch/
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20231213152712.320842-3-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
22 months agonet: mdio-mux: show errors on probe failure
Vladimir Oltean [Wed, 13 Dec 2023 15:27:11 +0000 (17:27 +0200)]
net: mdio-mux: show errors on probe failure

Showing the precise error symbols can help debugging probe issues, such
as the recent -EIO error in of_mdiobus_register() caused by the lack of
bus->read_c45() and bus->write_c45() methods.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20231213152712.320842-2-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
22 months agonet: atlantic: eliminate double free in error handling logic
Igor Russkikh [Wed, 13 Dec 2023 09:50:44 +0000 (10:50 +0100)]
net: atlantic: eliminate double free in error handling logic

Driver has a logic leak in ring data allocation/free,
where aq_ring_free could be called multiple times on same ring,
if system is under stress and got memory allocation error.

Ring pointer was used as an indicator of failure, but this is
not correct since only ring data is allocated/deallocated.
Ring itself is an array member.

Changing ring allocation functions to return error code directly.
This simplifies error handling and eliminates aq_ring_free
on higher layer.

Signed-off-by: Igor Russkikh <irusskikh@marvell.com>
Link: https://lore.kernel.org/r/20231213095044.23146-1-irusskikh@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
22 months agoMerge branch 'convert-net-selftests-to-run-in-unique-namespace-part-3'
Jakub Kicinski [Fri, 15 Dec 2023 02:38:37 +0000 (18:38 -0800)]
Merge branch 'convert-net-selftests-to-run-in-unique-namespace-part-3'

Hangbin Liu says:

====================
Convert net selftests to run in unique namespace (Part 3)

Here is the 3rd part of converting net selftests to run in unique namespace.
This part converts all srv6 and fib tests.

Note that patch 06 is a fix for testing fib_nexthop_multiprefix.

Here is the part 1 link:
https://lore.kernel.org/netdev/20231202020110.362433-1-liuhangbin@gmail.com
And part 2 link:
https://lore.kernel.org/netdev/20231206070801.1691247-1-liuhangbin@gmail.com
====================

Link: https://lore.kernel.org/r/20231213060856.4030084-1-liuhangbin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
22 months agoselftests/net: convert fdb_flush.sh to run it in unique namespace
Hangbin Liu [Wed, 13 Dec 2023 06:08:56 +0000 (14:08 +0800)]
selftests/net: convert fdb_flush.sh to run it in unique namespace

Here is the test result after conversion.
 # ./fdb_flush.sh
 TEST: vx10: Expected 5 FDB entries, got 5                           [ OK ]
 TEST: vx20: Expected 5 FDB entries, got 5                           [ OK ]
 ...
 TEST: vx10: Expected 5 FDB entries, got 5                           [ OK ]
 TEST: Test entries with dst 192.0.2.1                               [ OK ]

Acked-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/20231213060856.4030084-14-liuhangbin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
22 months agoselftests/net: convert fib_tests.sh to run it in unique namespace
Hangbin Liu [Wed, 13 Dec 2023 06:08:55 +0000 (14:08 +0800)]
selftests/net: convert fib_tests.sh to run it in unique namespace

Here is the test result after conversion.

 # ./fib_tests.sh

 Single path route test
     Start point
     TEST: IPv4 fibmatch                                                 [ OK ]

 ...

 Fib6 garbage collection test
     TEST: ipv6 route garbage collection                                 [ OK ]

 IPv4 multipath list receive tests
     TEST: Multipath route hit ratio (1.00)                              [ OK ]

 IPv6 multipath list receive tests
     TEST: Multipath route hit ratio (1.00)                              [ OK ]

 Tests passed: 225
 Tests failed:   0

Acked-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/20231213060856.4030084-13-liuhangbin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
22 months agoselftests/net: convert fib_rule_tests.sh to run it in unique namespace
Hangbin Liu [Wed, 13 Dec 2023 06:08:54 +0000 (14:08 +0800)]
selftests/net: convert fib_rule_tests.sh to run it in unique namespace

Here is the test result after conversion.

 ]# ./fib_rule_tests.sh

     TEST: rule6 check: oif redirect to table                  [ OK ]

     ...

     TEST: rule4 dsfield tcp connect (dsfield 0x07)            [ OK ]

 Tests passed:  66
 Tests failed:   0

Acked-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://lore.kernel.org/r/20231213060856.4030084-12-liuhangbin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
22 months agoselftests/net: convert fib-onlink-tests.sh to run it in unique namespace
Hangbin Liu [Wed, 13 Dec 2023 06:08:53 +0000 (14:08 +0800)]
selftests/net: convert fib-onlink-tests.sh to run it in unique namespace

Remove PEER_CMD, which is not used in this test

Here is the test result after conversion.

 ]# ./fib-onlink-tests.sh
 Error: ipv4: FIB table does not exist.
 Flush terminated
 Error: ipv6: FIB table does not exist.
 Flush terminated

 ########################################
 Configuring interfaces

   ...

     TEST: Gateway resolves to wrong nexthop device - VRF      [ OK ]

 Tests passed:  38
 Tests failed:   0

Acked-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://lore.kernel.org/r/20231213060856.4030084-11-liuhangbin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
22 months agoselftests/net: convert fib_nexthops.sh to run it in unique namespace
Hangbin Liu [Wed, 13 Dec 2023 06:08:52 +0000 (14:08 +0800)]
selftests/net: convert fib_nexthops.sh to run it in unique namespace

Here is the test result after conversion.

 ]# ./fib_nexthops.sh

 Basic functional tests
 ----------------------
 TEST: List with nothing defined                                     [ OK ]
 TEST: Nexthop get on non-existent id                                [ OK ]

 ...

 TEST: IPv6 resilient nexthop group torture test                     [ OK ]

 Tests passed: 234
 Tests failed:   0

Acked-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/20231213060856.4030084-10-liuhangbin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
22 months agoselftests/net: convert fib_nexthop_nongw.sh to run it in unique namespace
Hangbin Liu [Wed, 13 Dec 2023 06:08:51 +0000 (14:08 +0800)]
selftests/net: convert fib_nexthop_nongw.sh to run it in unique namespace

Here is the test result after conversion.

 ]# ./fib_nexthop_nongw.sh
 TEST: nexthop: get route with nexthop without gw                    [ OK ]
 TEST: nexthop: ping through nexthop without gw                      [ OK ]

Acked-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://lore.kernel.org/r/20231213060856.4030084-9-liuhangbin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
22 months agoselftests/net: convert fib_nexthop_multiprefix to run it in unique namespace
Hangbin Liu [Wed, 13 Dec 2023 06:08:50 +0000 (14:08 +0800)]
selftests/net: convert fib_nexthop_multiprefix to run it in unique namespace

Here is the test result after conversion.

 ]# ./fib_nexthop_multiprefix.sh
 TEST: IPv4: host 0 to host 1, mtu 1300                              [ OK ]
 TEST: IPv6: host 0 to host 1, mtu 1300                              [ OK ]

 TEST: IPv4: host 0 to host 2, mtu 1350                              [ OK ]
 TEST: IPv6: host 0 to host 2, mtu 1350                              [ OK ]

 TEST: IPv4: host 0 to host 3, mtu 1400                              [ OK ]
 TEST: IPv6: host 0 to host 3, mtu 1400                              [ OK ]

 TEST: IPv4: host 0 to host 1, mtu 1300                              [ OK ]
 TEST: IPv6: host 0 to host 1, mtu 1300                              [ OK ]

 TEST: IPv4: host 0 to host 2, mtu 1350                              [ OK ]
 TEST: IPv6: host 0 to host 2, mtu 1350                              [ OK ]

 TEST: IPv4: host 0 to host 3, mtu 1400                              [ OK ]
 TEST: IPv6: host 0 to host 3, mtu 1400                              [ OK ]

Acked-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://lore.kernel.org/r/20231213060856.4030084-8-liuhangbin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
22 months agoselftests/net: fix grep checking for fib_nexthop_multiprefix
Hangbin Liu [Wed, 13 Dec 2023 06:08:49 +0000 (14:08 +0800)]
selftests/net: fix grep checking for fib_nexthop_multiprefix

When running fib_nexthop_multiprefix test I saw all IPv6 test failed.
e.g.

 ]# ./fib_nexthop_multiprefix.sh
 TEST: IPv4: host 0 to host 1, mtu 1300                              [ OK ]
 TEST: IPv6: host 0 to host 1, mtu 1300                              [FAIL]

 With -v it shows

 COMMAND: ip netns exec h0 /usr/sbin/ping6 -s 1350 -c5 -w5 2001:db8:101::1
 PING 2001:db8:101::1(2001:db8:101::1) 1350 data bytes
 From 2001:db8:100::64 icmp_seq=1 Packet too big: mtu=1300

 --- 2001:db8:101::1 ping statistics ---
 1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms

 Route get
 2001:db8:101::1 via 2001:db8:100::64 dev eth0 src 2001:db8:100::1 metric 1024 expires 599sec mtu 1300 pref medium
 Searching for:
     2001:db8:101::1 from :: via 2001:db8:100::64 dev eth0 src 2001:db8:100::1 .* mtu 1300

The reason is when CONFIG_IPV6_SUBTREES is not enabled, rt6_fill_node() will
not put RTA_SRC info. After fix:

]# ./fib_nexthop_multiprefix.sh
TEST: IPv4: host 0 to host 1, mtu 1300                              [ OK ]
TEST: IPv6: host 0 to host 1, mtu 1300                              [ OK ]

Fixes: 735ab2f65dce ("selftests: Add test with multiple prefixes using single nexthop")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://lore.kernel.org/r/20231213060856.4030084-7-liuhangbin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
22 months agoselftests/net: convert fcnal-test.sh to run it in unique namespace
Hangbin Liu [Wed, 13 Dec 2023 06:08:48 +0000 (14:08 +0800)]
selftests/net: convert fcnal-test.sh to run it in unique namespace

Here is the test result after conversion. There are some failures, but it
also exists on my system without this patch. So it's not affectec by
this patch and I will check the reason later.

  ]# time ./fcnal-test.sh
  /usr/bin/which: no nettest in (/root/.local/bin:/root/bin:/usr/share/Modules/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin)

  ###########################################################################
  IPv4 ping
  ###########################################################################

  #################################################################
  No VRF

  SYSCTL: net.ipv4.raw_l3mdev_accept=0

  TEST: ping out - ns-B IP                                                      [ OK ]
  TEST: ping out, device bind - ns-B IP                                         [ OK ]
  TEST: ping out, address bind - ns-B IP                                        [ OK ]
  ...

  #################################################################
  SNAT on VRF

  TEST: IPv4 TCP connection over VRF with SNAT                                  [ OK ]
  TEST: IPv6 TCP connection over VRF with SNAT                                  [ OK ]

  Tests passed: 893
  Tests failed:  21

  real    52m48.178s
  user    0m34.158s
  sys     1m42.976s

BTW, this test needs a really long time. So expand the timeout to 1h.

Acked-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://lore.kernel.org/r/20231213060856.4030084-6-liuhangbin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
22 months agoselftests/net: convert srv6_end_dt6_l3vpn_test.sh to run it in unique namespace
Hangbin Liu [Wed, 13 Dec 2023 06:08:47 +0000 (14:08 +0800)]
selftests/net: convert srv6_end_dt6_l3vpn_test.sh to run it in unique namespace

As the name \${rt-${rt}} may make reader confuse, convert the variable
hs/rt in setup_rt/hs to hid, rid. Here is the test result after conversion.

 ]# ./srv6_end_dt6_l3vpn_test.sh

 ################################################################################
 TEST SECTION: IPv6 routers connectivity test
 ################################################################################

     TEST: Routers connectivity: rt-1 -> rt-2                            [ OK ]

     TEST: Routers connectivity: rt-2 -> rt-1                            [ OK ]
 ...

     TEST: Hosts isolation: hs-t200-4 -X-> hs-t100-2                     [ OK ]

 Tests passed:  18
 Tests failed:   0

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://lore.kernel.org/r/20231213060856.4030084-5-liuhangbin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
22 months agoselftests/net: convert srv6_end_dt4_l3vpn_test.sh to run it in unique namespace
Hangbin Liu [Wed, 13 Dec 2023 06:08:46 +0000 (14:08 +0800)]
selftests/net: convert srv6_end_dt4_l3vpn_test.sh to run it in unique namespace

As the name \${rt-${rt}} may make reader confuse, convert the variable
hs/rt in setup_rt/hs to hid, rid. Here is the test result after conversion.

 ]# ./srv6_end_dt4_l3vpn_test.sh

 ################################################################################
 TEST SECTION: IPv6 routers connectivity test
 ################################################################################

     TEST: Routers connectivity: rt-1 -> rt-2                            [ OK ]

     TEST: Routers connectivity: rt-2 -> rt-1                            [ OK ]
 ...
     TEST: Hosts isolation: hs-t200-4 -X-> hs-t100-2                     [ OK ]

 Tests passed:  18
 Tests failed:   0

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://lore.kernel.org/r/20231213060856.4030084-4-liuhangbin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
22 months agoselftests/net: convert srv6_end_dt46_l3vpn_test.sh to run it in unique namespace
Hangbin Liu [Wed, 13 Dec 2023 06:08:45 +0000 (14:08 +0800)]
selftests/net: convert srv6_end_dt46_l3vpn_test.sh to run it in unique namespace

As the name \${rt-${rt}} may make reader confuse, convert the variable
hs/rt in setup_rt/hs to hid, rid. Here is the test result after conversion.

 ]# ./srv6_end_dt46_l3vpn_test.sh

 ################################################################################
 TEST SECTION: IPv6 routers connectivity test
 ################################################################################

     TEST: Routers connectivity: rt-1 -> rt-2                            [ OK ]

     TEST: Routers connectivity: rt-2 -> rt-1                            [ OK ]

 ...

     TEST: IPv4 Hosts isolation: hs-t200-4 -X-> hs-t100-2                [ OK ]

 Tests passed:  34
 Tests failed:   0

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://lore.kernel.org/r/20231213060856.4030084-3-liuhangbin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
22 months agoselftests/net: add variable NS_LIST for lib.sh
Hangbin Liu [Wed, 13 Dec 2023 06:08:44 +0000 (14:08 +0800)]
selftests/net: add variable NS_LIST for lib.sh

Add a global variable NS_LIST to store all the namespaces that setup_ns
created, so the caller could call cleanup_all_ns() instead of remember
all the netns names when using cleanup_ns().

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://lore.kernel.org/r/20231213060856.4030084-2-liuhangbin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
22 months agopage_pool: fix typos and punctuation
Randy Dunlap [Wed, 13 Dec 2023 04:36:50 +0000 (20:36 -0800)]
page_pool: fix typos and punctuation

Correct spelling (s/and/any) and a run-on sentence.
Spell out "multi".

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Link: https://lore.kernel.org/r/20231213043650.12672-1-rdunlap@infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
22 months agonet: skbuff: fix spelling errors
Randy Dunlap [Wed, 13 Dec 2023 04:35:11 +0000 (20:35 -0800)]
net: skbuff: fix spelling errors

Correct spelling as reported by codespell.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20231213043511.10357-1-rdunlap@infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
22 months agoMerge branch 'net-mdio-mdio-bcm-unimac-optimizations-and-clean-up'
Jakub Kicinski [Fri, 15 Dec 2023 01:59:00 +0000 (17:59 -0800)]
Merge branch 'net-mdio-mdio-bcm-unimac-optimizations-and-clean-up'

Justin Chen says:

====================
net: mdio: mdio-bcm-unimac: optimizations and clean up

Clean up mdio poll to use read_poll_timeout() and reduce the potential
poll time.
====================

Link: https://lore.kernel.org/r/20231213222744.2891184-1-justin.chen@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
22 months agonet: mdio: mdio-bcm-unimac: Use read_poll_timeout
Justin Chen [Wed, 13 Dec 2023 22:27:44 +0000 (14:27 -0800)]
net: mdio: mdio-bcm-unimac: Use read_poll_timeout

Simplify the code by using read_poll_timeout().

Signed-off-by: Justin Chen <justin.chen@broadcom.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20231213222744.2891184-3-justin.chen@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
22 months agonet: mdio: mdio-bcm-unimac: Delay before first poll
Justin Chen [Wed, 13 Dec 2023 22:27:43 +0000 (14:27 -0800)]
net: mdio: mdio-bcm-unimac: Delay before first poll

With a clock interval of 400 nsec and a 64 bit transactions (32 bit
preamble & 16 bit control & 16 bit data), it is reasonable to assume
the mdio transaction will take 25.6 usec. Add a 30 usec delay before
the first poll to reduce the chance of a 1000-2000 usec sleep.

Reduce the timeout from 1000ms to 100ms as it is unlikely for the bus
to take this long.

Signed-off-by: Justin Chen <justin.chen@broadcom.com>
Acked-by: Florian Fainelli <florian.fainelli@broadcom.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20231213222744.2891184-2-justin.chen@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
22 months agoMerge branch 'tools-ynl-gen-fill-in-the-gaps-in-support-of-legacy-families'
Jakub Kicinski [Fri, 15 Dec 2023 01:51:23 +0000 (17:51 -0800)]
Merge branch 'tools-ynl-gen-fill-in-the-gaps-in-support-of-legacy-families'

Jakub Kicinski says:

====================
tools: ynl-gen: fill in the gaps in support of legacy families

Fill in the gaps in YNL C code gen so that we can generate user
space code for all genetlink families for which we have specs.

The two major changes we need are support for fixed headers and
support for recursive nests.

For fixed header support - place the struct for the fixed header
directly in the request struct (and don't bother generating access
helpers). The member of a fixed header can't be too complex, and
also are by definition not optional so the user has to fill them in.
The YNL core needs a bit of a tweak to understand that the attrs
may now start at a fixed offset, which is not necessarily equal
to sizeof(struct genlmsghdr).

Dealing with nested sets is much harder. Previously we'd gen
the nested structs as:

 struct outer {
   struct inner inner;
 };

If structs are recursive (e.g. inner contains outer again)
we must break this chain and allocate one of the structs
dynamically (store a pointer rather than full struct).
====================

Link: https://lore.kernel.org/r/20231213231432.2944749-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
22 months agotools: ynl-gen: print prototypes for recursive stuff
Jakub Kicinski [Wed, 13 Dec 2023 23:14:32 +0000 (15:14 -0800)]
tools: ynl-gen: print prototypes for recursive stuff

We avoid printing forward declarations and prototypes for most
types by sorting things topologically. But if structs nest we
do need the forward declarations, there's no other way.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://lore.kernel.org/r/20231213231432.2944749-9-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
22 months agotools: ynl-gen: store recursive nests by a pointer
Jakub Kicinski [Wed, 13 Dec 2023 23:14:31 +0000 (15:14 -0800)]
tools: ynl-gen: store recursive nests by a pointer

To avoid infinite nesting store recursive structs by pointer.
If recursive struct is placed in the op directly - the first
instance can be stored by value. That makes the code much
less of a pain for majority of practical uses.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://lore.kernel.org/r/20231213231432.2944749-8-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
22 months agotools: ynl-gen: re-sort ignoring recursive nests
Jakub Kicinski [Wed, 13 Dec 2023 23:14:30 +0000 (15:14 -0800)]
tools: ynl-gen: re-sort ignoring recursive nests

We try to keep the structures and helpers "topologically sorted",
to avoid forward declarations. When recursive nests are at play
we need to sort twice, because structs which end up being marked
as recursive will get a full set of forward declarations, so we
should ignore them for the purpose of sorting.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://lore.kernel.org/r/20231213231432.2944749-7-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
22 months agotools: ynl-gen: record information about recursive nests
Jakub Kicinski [Wed, 13 Dec 2023 23:14:29 +0000 (15:14 -0800)]
tools: ynl-gen: record information about recursive nests

Track which nests are recursive. Non-recursive nesting gets
rendered in C as directly nested structs. For recursive
ones we need to put a pointer in, rather than full struct.

Track this information, no change to generated code, yet.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://lore.kernel.org/r/20231213231432.2944749-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
22 months agotools: ynl-gen: fill in implementations for TypeUnused
Jakub Kicinski [Wed, 13 Dec 2023 23:14:28 +0000 (15:14 -0800)]
tools: ynl-gen: fill in implementations for TypeUnused

Fill in more empty handlers for TypeUnused. When 'unused'
attr gets specified in a nested set we have to cleanly
skip it during code generation.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://lore.kernel.org/r/20231213231432.2944749-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
22 months agotools: ynl-gen: support fixed headers in genetlink
Jakub Kicinski [Wed, 13 Dec 2023 23:14:27 +0000 (15:14 -0800)]
tools: ynl-gen: support fixed headers in genetlink

Support genetlink families using simple fixed headers.
Assume fixed header is identical for all ops of the family for now.

Fixed headers are added to the request and reply structs as a _hdr
member, and copied to/from netlink messages appropriately.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://lore.kernel.org/r/20231213231432.2944749-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
22 months agotools: ynl-gen: use enum user type for members and args
Jakub Kicinski [Wed, 13 Dec 2023 23:14:26 +0000 (15:14 -0800)]
tools: ynl-gen: use enum user type for members and args

Commit 30c902001534 ("tools: ynl-gen: use enum name from the spec")
added pre-cooked user type for enums. Use it to fix ignoring
enum-name provided in the spec.

This changes a type in struct ethtool_tunnel_udp_entry but is
generally inconsequential for current families.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://lore.kernel.org/r/20231213231432.2944749-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
22 months agotools: ynl-gen: add missing request free helpers for dumps
Jakub Kicinski [Wed, 13 Dec 2023 23:14:25 +0000 (15:14 -0800)]
tools: ynl-gen: add missing request free helpers for dumps

The code gen generates a prototype for dump request free
in the header, but no implementation in the source.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://lore.kernel.org/r/20231213231432.2944749-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
22 months agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Jakub Kicinski [Fri, 15 Dec 2023 01:13:35 +0000 (17:13 -0800)]
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Cross-merge networking fixes after downstream PR.

Conflicts:

drivers/net/ethernet/intel/iavf/iavf_ethtool.c
  3a0b5a2929fd ("iavf: Introduce new state machines for flow director")
  95260816b489 ("iavf: use iavf_schedule_aq_request() helper")
https://lore.kernel.org/all/84e12519-04dc-bd80-bc34-8cf50d7898ce@intel.com/

drivers/net/ethernet/broadcom/bnxt/bnxt.c
  c13e268c0768 ("bnxt_en: Fix HWTSTAMP_FILTER_ALL packet timestamp logic")
  c2f8063309da ("bnxt_en: Refactor RX VLAN acceleration logic.")
  a7445d69809f ("bnxt_en: Add support for new RX and TPA_START completion types for P7")
  1c7fd6ee2fe4 ("bnxt_en: Rename some macros for the P5 chips")
https://lore.kernel.org/all/20231211110022.27926ad9@canb.auug.org.au/

drivers/net/ethernet/broadcom/bnxt/bnxt_ptp.c
  bd6781c18cb5 ("bnxt_en: Fix wrong return value check in bnxt_close_nic()")
  84793a499578 ("bnxt_en: Skip nic close/open when configuring tstamp filters")
https://lore.kernel.org/all/20231214113041.3a0c003c@canb.auug.org.au/

drivers/net/ethernet/mellanox/mlx5/core/fw_reset.c
  3d7a3f2612d7 ("net/mlx5: Nack sync reset request when HotPlug is enabled")
  cecf44ea1a1f ("net/mlx5: Allow sync reset flow when BF MGT interface device is present")
https://lore.kernel.org/all/20231211110328.76c925af@canb.auug.org.au/

No adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
22 months agoMerge tag 'net-6.7-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Linus Torvalds [Thu, 14 Dec 2023 21:11:49 +0000 (13:11 -0800)]
Merge tag 'net-6.7-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
"Current release - regressions:

   - tcp: fix tcp_disordered_ack() vs usec TS resolution

  Current release - new code bugs:

   - dpll: sanitize possible null pointer dereference in
     dpll_pin_parent_pin_set()

   - eth: octeon_ep: initialise control mbox tasks before using APIs

  Previous releases - regressions:

   - io_uring/af_unix: disable sending io_uring over sockets

   - eth: mlx5e:
       - TC, don't offload post action rule if not supported
       - fix possible deadlock on mlx5e_tx_timeout_work

   - eth: iavf: fix iavf_shutdown to call iavf_remove instead iavf_close

   - eth: bnxt_en: fix skb recycling logic in bnxt_deliver_skb()

   - eth: ena: fix DMA syncing in XDP path when SWIOTLB is on

   - eth: team: fix use-after-free when an option instance allocation
     fails

  Previous releases - always broken:

   - neighbour: don't let neigh_forced_gc() disable preemption for long

   - net: prevent mss overflow in skb_segment()

   - ipv6: support reporting otherwise unknown prefix flags in
     RTM_NEWPREFIX

   - tcp: remove acked SYN flag from packet in the transmit queue
     correctly

   - eth: octeontx2-af:
       - fix a use-after-free in rvu_nix_register_reporters
       - fix promisc mcam entry action

   - eth: dwmac-loongson: make sure MDIO is initialized before use

   - eth: atlantic: fix double free in ring reinit logic"

* tag 'net-6.7-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (62 commits)
  net: atlantic: fix double free in ring reinit logic
  appletalk: Fix Use-After-Free in atalk_ioctl
  net: stmmac: Handle disabled MDIO busses from devicetree
  net: stmmac: dwmac-qcom-ethqos: Fix drops in 10M SGMII RX
  dpaa2-switch: do not ask for MDB, VLAN and FDB replay
  dpaa2-switch: fix size of the dma_unmap
  net: prevent mss overflow in skb_segment()
  vsock/virtio: Fix unsigned integer wrap around in virtio_transport_has_space()
  Revert "tcp: disable tcp_autocorking for socket when TCP_NODELAY flag is set"
  MIPS: dts: loongson: drop incorrect dwmac fallback compatible
  stmmac: dwmac-loongson: drop useless check for compatible fallback
  stmmac: dwmac-loongson: Make sure MDIO is initialized before use
  tcp: disable tcp_autocorking for socket when TCP_NODELAY flag is set
  dpll: sanitize possible null pointer dereference in dpll_pin_parent_pin_set()
  net: ena: Fix XDP redirection error
  net: ena: Fix DMA syncing in XDP path when SWIOTLB is on
  net: ena: Fix xdp drops handling due to multibuf packets
  net: ena: Destroy correct number of xdp queues upon failure
  net: Remove acked SYN flag from packet in the transmit queue correctly
  qed: Fix a potential use-after-free in qed_cxt_tables_alloc
  ...

22 months agoMerge tag 'for-6.7-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave...
Linus Torvalds [Thu, 14 Dec 2023 19:53:00 +0000 (11:53 -0800)]
Merge tag 'for-6.7-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
  "Some fixes to quota accounting code, mostly around error handling and
   correctness:

   - free reserves on various error paths, after IO errors or
     transaction abort

   - don't clear reserved range at the folio release time, it'll be
     properly cleared after final write

   - fix integer overflow due to int used when passing around size of
     freed reservations

   - fix a regression in squota accounting that missed some cases with
     delayed refs"

* tag 'for-6.7-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: ensure releasing squota reserve on head refs
  btrfs: don't clear qgroup reserved bit in release_folio
  btrfs: free qgroup pertrans reserve on transaction abort
  btrfs: fix qgroup_free_reserved_data int overflow
  btrfs: free qgroup reserve when ORDERED_IOERR is set

22 months agonet: mvpp2: add support for mii
Stefan Eichenberger [Tue, 12 Dec 2023 14:12:00 +0000 (15:12 +0100)]
net: mvpp2: add support for mii

Currently, mvpp2 only supports RGMII. This commit adds support for MII.
The description in Marvell's functional specification seems to be wrong.
To enable MII, we need to set GENCONF_CTRL0_PORT3_RGMII, while for RGMII
we need to clear it. This is also how U-Boot handles it.

Signed-off-by: Stefan Eichenberger <eichest@gmail.com>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Marcin Wojtas <mw@semihalf.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://lore.kernel.org/r/20231212141200.62579-1-eichest@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
22 months agonet: atlantic: fix double free in ring reinit logic
Igor Russkikh [Wed, 13 Dec 2023 09:40:44 +0000 (10:40 +0100)]
net: atlantic: fix double free in ring reinit logic

Driver has a logic leak in ring data allocation/free,
where double free may happen in aq_ring_free if system is under
stress and driver init/deinit is happening.

The probability is higher to get this during suspend/resume cycle.

Verification was done simulating same conditions with

    stress -m 2000 --vm-bytes 20M --vm-hang 10 --backoff 1000
    while true; do sudo ifconfig enp1s0 down; sudo ifconfig enp1s0 up; done

Fixed by explicitly clearing pointers to NULL on deallocation

Fixes: 018423e90bee ("net: ethernet: aquantia: Add ring support code")
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Closes: https://lore.kernel.org/netdev/CAHk-=wiZZi7FcvqVSUirHBjx0bBUZ4dFrMDVLc3+3HCrtq0rBA@mail.gmail.com/
Signed-off-by: Igor Russkikh <irusskikh@marvell.com>
Link: https://lore.kernel.org/r/20231213094044.22988-1-irusskikh@marvell.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
22 months agoappletalk: Fix Use-After-Free in atalk_ioctl
Hyunwoo Kim [Wed, 13 Dec 2023 04:10:56 +0000 (23:10 -0500)]
appletalk: Fix Use-After-Free in atalk_ioctl

Because atalk_ioctl() accesses sk->sk_receive_queue
without holding a sk->sk_receive_queue.lock, it can
cause a race with atalk_recvmsg().
A use-after-free for skb occurs with the following flow.
```
atalk_ioctl() -> skb_peek()
atalk_recvmsg() -> skb_recv_datagram() -> skb_free_datagram()
```
Add sk->sk_receive_queue.lock to atalk_ioctl() to fix this issue.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Hyunwoo Kim <v4bel@theori.io>
Link: https://lore.kernel.org/r/20231213041056.GA519680@v4bel-B760M-AORUS-ELITE-AX
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
22 months agonet: stmmac: Handle disabled MDIO busses from devicetree
Andrew Halaney [Tue, 12 Dec 2023 22:18:33 +0000 (16:18 -0600)]
net: stmmac: Handle disabled MDIO busses from devicetree

Many hardware configurations have the MDIO bus disabled, and are instead
using some other MDIO bus to talk to the MAC's phy.

of_mdiobus_register() returns -ENODEV in this case. Let's handle it
gracefully instead of failing to probe the MAC.

Fixes: 47dd7a540b8a ("net: add support for STMicroelectronics Ethernet controllers.")
Signed-off-by: Andrew Halaney <ahalaney@redhat.com>
Reviewed-by: Serge Semin <fancer.lancer@gmail.com>
Link: https://lore.kernel.org/r/20231212-b4-stmmac-handle-mdio-enodev-v2-1-600171acf79f@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
22 months agonet: stmmac: dwmac-qcom-ethqos: Fix drops in 10M SGMII RX
Sneh Shah [Tue, 12 Dec 2023 09:22:08 +0000 (14:52 +0530)]
net: stmmac: dwmac-qcom-ethqos: Fix drops in 10M SGMII RX

In 10M SGMII mode all the packets are being dropped due to wrong Rx clock.
SGMII 10MBPS mode needs RX clock divider programmed to avoid drops in Rx.
Update configure SGMII function with Rx clk divider programming.

Fixes: 463120c31c58 ("net: stmmac: dwmac-qcom-ethqos: add support for SGMII")
Tested-by: Andrew Halaney <ahalaney@redhat.com>
Signed-off-by: Sneh Shah <quic_snehshah@quicinc.com>
Reviewed-by: Bjorn Andersson <quic_bjorande@quicinc.com>
Link: https://lore.kernel.org/r/20231212092208.22393-1-quic_snehshah@quicinc.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
22 months agoMerge branch '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
Jakub Kicinski [Thu, 14 Dec 2023 06:08:58 +0000 (22:08 -0800)]
Merge branch '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue

Tony Nguyen says:

====================
Intel Wired LAN Driver Updates 2023-12-12 (igb, e1000e)

This series contains updates to igb and e1000e drivers.

Ilpo Järvinen does some cleanups to both drivers: utilizing FIELD_GET()
helpers and using standard kernel defines over driver created ones.

* '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
  e1000e: Use pcie_capability_read_word() for reading LNKSTA
  e1000e: Use PCI_EXP_LNKSTA_NLW & FIELD_GET() instead of custom defines/code
  igb: Use FIELD_GET() to extract Link Width
====================

Link: https://lore.kernel.org/r/20231212204947.513563-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
22 months agoMerge branch 'support-symmetric-xor-rss-hash'
Jakub Kicinski [Thu, 14 Dec 2023 06:07:19 +0000 (22:07 -0800)]
Merge branch 'support-symmetric-xor-rss-hash'

Ahmed Zaki says:

====================
Support symmetric-xor RSS hash

Patches 1 and 2 modify the get/set_rxh ethtool API to take a pointer to
struct of parameters instead of individual params. This will allow future
changes to the uAPI-shared struct ethtool_rxfh without changing the
drivers' API.

Patch 3 adds the support at the Kernel level, allowing the user to set a
symmetric-xor RSS hash for a netdevice via:

    # ethtool -X eth0 hfunc toeplitz symmetric-xor

and clears the flag via:

    # ethtool -X eth0 hfunc toeplitz

The "symmetric-xor" is set in a new "input_xfrm" field in struct
ethtool_rxfh. Support for the new "symmetric-xor" flag will be later sent
to the "ethtool" user-space tool.

Patch 4 fixes a long standing bug with the ice hash function register
values. The bug has been benign for now since only (asymmetric) Toeplitz
hash (Zero) has been used.

Patches 5 and 6 lay some groundwork refactoring. While the first is
mainly cosmetic, the second is needed since there is no more room in the
previous 64-bit RSS profile ID for the symmetric attribute introduced in
the next patch.

Finally, patches 7 and 8 add the symmetric-xor support for the ice
(E800 PFs) and the iAVF drivers.
====================

Link: https://lore.kernel.org/r/20231213003321.605376-1-ahmed.zaki@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
22 months agoiavf: enable symmetric-xor RSS for Toeplitz hash function
Ahmed Zaki [Wed, 13 Dec 2023 00:33:21 +0000 (17:33 -0700)]
iavf: enable symmetric-xor RSS for Toeplitz hash function

Allow the user to set the symmetric Toeplitz hash function via:

    # ethtool -X eth0 hfunc toeplitz symmetric-xor

The driver will reject any new RSS configuration if a field other than
(IP src/dst and L4 src/dst ports) is requested for hashing.

The symmetric RSS will not be supported on PFs not advertising the ADV RSS
Offload flag (ADV_RSS_SUPPORT()), for example the E700 series (i40e).

Reviewed-by: Madhu Chittim <madhu.chittim@intel.com>
Signed-off-by: Ahmed Zaki <ahmed.zaki@intel.com>
Link: https://lore.kernel.org/r/20231213003321.605376-9-ahmed.zaki@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
22 months agoice: enable symmetric-xor RSS for Toeplitz hash function
Jeff Guo [Wed, 13 Dec 2023 00:33:20 +0000 (17:33 -0700)]
ice: enable symmetric-xor RSS for Toeplitz hash function

Allow the user to set the symmetric Toeplitz hash function via:

    # ethtool -X eth0 hfunc toeplitz symmetric-xor

All existing RSS configurations will be converted to symmetric unless they
have a non-symmetric field (other than IP src/dst and L4 src/dst ports)
used for hashing. The driver will reject a new RSS configuration if such
a field is requested.

The hash function in the E800 NICs is set per-VSI and a specific AQ
command is needed to modify the hash function. Use the AQ command to
enable setting the symmetric Toeplitz RSS hash function for any VSI
in the new ice_set_rss_hfunc().

When the Symmetric Toeplitz hash function is used, the hardware sets the
input set of the RSS (Toeplitz) algorithm to be the XOR of the fields
index by HSYMM and the fields index by the INSET registers. We use this
to create a symmetric hash by setting the HSYMM registers to point to
their counterparts in the INSET registers:

 HSYMM [src_fv] = dst_fv;
 HSYMM [dst_fv] = src_fv;

where src_fv and dst_fv are the indexes of the protocol's src and dst
fields.

Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Signed-off-by: Jeff Guo <jia.guo@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Co-developed-by: Ahmed Zaki <ahmed.zaki@intel.com>
Signed-off-by: Ahmed Zaki <ahmed.zaki@intel.com>
Link: https://lore.kernel.org/r/20231213003321.605376-8-ahmed.zaki@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
22 months agoice: refactor the FD and RSS flow ID generation
Ahmed Zaki [Wed, 13 Dec 2023 00:33:19 +0000 (17:33 -0700)]
ice: refactor the FD and RSS flow ID generation

The flow director and RSS blocks use separate methods to generate a
unique 64 bit ID for the flow. This is not extendable, especially for
the RSS that already uses all 64 bit space.

Refactor the flow generation API so that the ID is generated within
ice_flow_add_prof(). The FD and RSS blocks caches the generated ID for
later use.

Suggested-by: Dan Nowlin <dan.nowlin@intel.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Signed-off-by: Ahmed Zaki <ahmed.zaki@intel.com>
Link: https://lore.kernel.org/r/20231213003321.605376-7-ahmed.zaki@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
22 months agoice: refactor RSS configuration
Qi Zhang [Wed, 13 Dec 2023 00:33:18 +0000 (17:33 -0700)]
ice: refactor RSS configuration

Refactor the driver to use a communication data structure for RSS
config. To do so we introduce the new ice_rss_hash_cfg struct, and then
pass it as an argument to several functions.

Also introduce enum ice_rss_cfg_hdr_type to specify a more granular and
flexible RSS configuration:

ICE_RSS_OUTER_HEADERS - take outer layer as RSS input set
ICE_RSS_INNER_HEADERS - take inner layer as RSS input set
ICE_RSS_INNER_HEADERS_W_OUTER_IPV4 - take inner layer as RSS input set for
                                     packet with outer IPV4
ICE_RSS_INNER_HEADERS_W_OUTER_IPV6 - take inner layer as RSS input set for
                                     packet with outer IPV6
ICE_RSS_ANY_HEADERS - try with outer first then inner (same as the
      behaviour without this change)

Finally, move the virtchnl_rss_algorithm enum to be with the other RSS
related structures in the virtchnl.h file.

There should be no functional change due to this patch.

Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Signed-off-by: Qi Zhang <qi.z.zhang@intel.com>
Co-developed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Co-developed-by: Ahmed Zaki <ahmed.zaki@intel.com>
Signed-off-by: Ahmed Zaki <ahmed.zaki@intel.com>
Link: https://lore.kernel.org/r/20231213003321.605376-6-ahmed.zaki@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
22 months agoice: fix ICE_AQ_VSI_Q_OPT_RSS_* register values
Ahmed Zaki [Wed, 13 Dec 2023 00:33:17 +0000 (17:33 -0700)]
ice: fix ICE_AQ_VSI_Q_OPT_RSS_* register values

Fix the values of the ICE_AQ_VSI_Q_OPT_RSS_* registers. Shifting is
already done when the values are used, no need to double shift. Bug was
not discovered earlier since only ICE_AQ_VSI_Q_OPT_RSS_TPLZ (Zero) is
currently used.

Also, rename ICE_AQ_VSI_Q_OPT_RSS_XXX to ICE_AQ_VSI_Q_OPT_RSS_HASH_XXX
for consistency.

Co-developed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Signed-off-by: Ahmed Zaki <ahmed.zaki@intel.com>
Link: https://lore.kernel.org/r/20231213003321.605376-5-ahmed.zaki@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
22 months agonet: ethtool: add support for symmetric-xor RSS hash
Ahmed Zaki [Wed, 13 Dec 2023 00:33:16 +0000 (17:33 -0700)]
net: ethtool: add support for symmetric-xor RSS hash

Symmetric RSS hash functions are beneficial in applications that monitor
both Tx and Rx packets of the same flow (IDS, software firewalls, ..etc).
Getting all traffic of the same flow on the same RX queue results in
higher CPU cache efficiency.

A NIC that supports "symmetric-xor" can achieve this RSS hash symmetry
by XORing the source and destination fields and pass the values to the
RSS hash algorithm.

The user may request RSS hash symmetry for a specific algorithm, via:

    # ethtool -X eth0 hfunc <hash_alg> symmetric-xor

or turn symmetry off (asymmetric) by:

    # ethtool -X eth0 hfunc <hash_alg>

The specific fields for each flow type should then be specified as usual
via:
    # ethtool -N|-U eth0 rx-flow-hash <flow_type> s|d|f|n

Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Signed-off-by: Ahmed Zaki <ahmed.zaki@intel.com>
Link: https://lore.kernel.org/r/20231213003321.605376-4-ahmed.zaki@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
22 months agonet: ethtool: get rid of get/set_rxfh_context functions
Ahmed Zaki [Wed, 13 Dec 2023 00:33:15 +0000 (17:33 -0700)]
net: ethtool: get rid of get/set_rxfh_context functions

Add the RSS context parameters to struct ethtool_rxfh_param and use the
get/set_rxfh to handle the RSS contexts as well.

This is part 2/2 of the fix suggested in [1]:

 - Add a rss_context member to the argument struct and a capability
   like cap_link_lanes_supported to indicate whether driver supports
   rss contexts, then you can remove *et_rxfh_context functions,
   and instead call *et_rxfh() with a non-zero rss_context.

Link: https://lore.kernel.org/netdev/20231121152906.2dd5f487@kernel.org/
CC: Jesse Brandeburg <jesse.brandeburg@intel.com>
CC: Tony Nguyen <anthony.l.nguyen@intel.com>
CC: Marcin Wojtas <mw@semihalf.com>
CC: Russell King <linux@armlinux.org.uk>
CC: Sunil Goutham <sgoutham@marvell.com>
CC: Geetha sowjanya <gakula@marvell.com>
CC: Subbaraya Sundeep <sbhatta@marvell.com>
CC: hariprasad <hkelam@marvell.com>
CC: Saeed Mahameed <saeedm@nvidia.com>
CC: Leon Romanovsky <leon@kernel.org>
CC: Edward Cree <ecree.xilinx@gmail.com>
CC: Martin Habets <habetsm.xilinx@gmail.com>
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Ahmed Zaki <ahmed.zaki@intel.com>
Link: https://lore.kernel.org/r/20231213003321.605376-3-ahmed.zaki@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
22 months agonet: ethtool: pass a pointer to parameters to get/set_rxfh ethtool ops
Ahmed Zaki [Wed, 13 Dec 2023 00:33:14 +0000 (17:33 -0700)]
net: ethtool: pass a pointer to parameters to get/set_rxfh ethtool ops

The get/set_rxfh ethtool ops currently takes the rxfh (RSS) parameters
as direct function arguments. This will force us to change the API (and
all drivers' functions) every time some new parameters are added.

This is part 1/2 of the fix, as suggested in [1]:

- First simplify the code by always providing a pointer to all params
   (indir, key and func); the fact that some of them may be NULL seems
   like a weird historic thing or a premature optimization.
   It will simplify the drivers if all pointers are always present.

 - Then make the functions take a dev pointer, and a pointer to a
   single struct wrapping all arguments. The set_* should also take
   an extack.

Link: https://lore.kernel.org/netdev/20231121152906.2dd5f487@kernel.org/
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Suggested-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Ahmed Zaki <ahmed.zaki@intel.com>
Link: https://lore.kernel.org/r/20231213003321.605376-2-ahmed.zaki@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
22 months agoMerge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue
Jakub Kicinski [Thu, 14 Dec 2023 06:03:01 +0000 (22:03 -0800)]
Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue

Tony Nguyen says:

====================
Intel Wired LAN Driver Updates 2023-12-12 (iavf)

This series contains updates to iavf driver only.

Piotr reworks Flow Director states to deal with issues in restoring
filters.

Slawomir fixes shutdown processing as it was missing needed calls.

* '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
  iavf: Fix iavf_shutdown to call iavf_remove instead iavf_close
  iavf: Handle ntuple on/off based on new state machines for flow director
  iavf: Introduce new state machines for flow director
====================

Link: https://lore.kernel.org/r/20231212203613.513423-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
22 months agonet: page_pool: factor out releasing DMA from releasing the page
Jakub Kicinski [Fri, 8 Dec 2023 00:52:32 +0000 (16:52 -0800)]
net: page_pool: factor out releasing DMA from releasing the page

Releasing the DMA mapping will be useful for other types
of pages, so factor it out. Make sure compiler inlines it,
to avoid any regressions.

Signed-off-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
22 months agoMerge branch 'dpaa2-switch-various-fixes'
Jakub Kicinski [Thu, 14 Dec 2023 02:38:56 +0000 (18:38 -0800)]
Merge branch 'dpaa2-switch-various-fixes'

Ioana Ciornei says:

====================
dpaa2-switch: various fixes

The first patch fixes the size passed to two dma_unmap_single() calls
which was wrongly put as the size of the pointer.

The second patch is new to this series and reverts the behavior of the
dpaa2-switch driver to not ask for object replay upon offloading so that
we avoid the errors encountered when a VLAN is installed multiple times
on the same port.
====================

Link: https://lore.kernel.org/r/20231212164326.2753457-1-ioana.ciornei@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
22 months agodpaa2-switch: do not ask for MDB, VLAN and FDB replay
Ioana Ciornei [Tue, 12 Dec 2023 16:43:26 +0000 (18:43 +0200)]
dpaa2-switch: do not ask for MDB, VLAN and FDB replay

Starting with commit 4e51bf44a03a ("net: bridge: move the switchdev
object replay helpers to "push" mode") the switchdev_bridge_port_offload()
helper was extended with the intention to provide switchdev drivers easy
access to object addition and deletion replays. This works by calling
the replay helpers with non-NULL notifier blocks.

In the same commit, the dpaa2-switch driver was updated so that it
passes valid notifier blocks to the helper. At that moment, no
regression was identified through testing.

In the meantime, the blamed commit changed the behavior in terms of
which ports get hit by the replay. Before this commit, only the initial
port which identified itself as offloaded through
switchdev_bridge_port_offload() got a replay of all port objects and
FDBs. After this, the newly joining port will trigger a replay of
objects on all bridge ports and on the bridge itself.

This behavior leads to errors in dpaa2_switch_port_vlans_add() when a
VLAN gets installed on the same interface multiple times.

The intended mechanism to address this is to pass a non-NULL ctx to the
switchdev_bridge_port_offload() helper and then check it against the
port's private structure. But since the driver does not have any use for
the replayed port objects and FDBs until it gains support for LAG
offload, it's better to fix the issue by reverting the dpaa2-switch
driver to not ask for replay. The pointers will be added back when we
are prepared to ignore replays on unrelated ports.

Fixes: b28d580e2939 ("net: bridge: switchdev: replay all VLAN groups")
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Link: https://lore.kernel.org/r/20231212164326.2753457-3-ioana.ciornei@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
22 months agodpaa2-switch: fix size of the dma_unmap
Ioana Ciornei [Tue, 12 Dec 2023 16:43:25 +0000 (18:43 +0200)]
dpaa2-switch: fix size of the dma_unmap

The size of the DMA unmap was wrongly put as a sizeof of a pointer.
Change the value of the DMA unmap to be the actual macro used for the
allocation and the DMA map.

Fixes: 1110318d83e8 ("dpaa2-switch: add tc flower hardware offload on ingress traffic")
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Link: https://lore.kernel.org/r/20231212164326.2753457-2-ioana.ciornei@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
22 months agopage_pool: transition to reference count management after page draining
Liang Chen [Tue, 12 Dec 2023 04:46:11 +0000 (12:46 +0800)]
page_pool: transition to reference count management after page draining

To support multiple users referencing the same fragment,
'pp_frag_count' is renamed to 'pp_ref_count', transitioning pp pages
from fragment management to reference count management after draining
based on the suggestion from [1].

The idea is that the concept of fragmenting exists before the page is
drained, and all related functions retain their current names.
However, once the page is drained, its management shifts to being
governed by 'pp_ref_count'. Therefore, all functions associated with
that lifecycle stage of a pp page are renamed.

[1]
http://lore.kernel.org/netdev/f71d9448-70c8-8793-dc9a-0eb48a570300@huawei.com

Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
Reviewed-by: Yunsheng Lin <linyunsheng@huawei.com>
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Link: https://lore.kernel.org/r/20231212044614.42733-2-liangchen.linux@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
22 months agocxgb3: Avoid potential string truncation in desc
Kees Cook [Tue, 12 Dec 2023 22:09:57 +0000 (14:09 -0800)]
cxgb3: Avoid potential string truncation in desc

Builds with W=1 were warning about potential string truncations:

drivers/net/ethernet/chelsio/cxgb3/cxgb3_main.c: In function 'cxgb_up':
drivers/net/ethernet/chelsio/cxgb3/cxgb3_main.c:394:38: warning: '%d' directive output may be truncated writing between 1 and 11 bytes into a region of size between 5 and 20 [-Wformat-truncation=]
  394 |                                  "%s-%d", d->name, pi->first_qset + i);
      |                                      ^~
In function 'name_msix_vecs',
    inlined from 'cxgb_up' at drivers/net/ethernet/chelsio/cxgb3/cxgb3_main.c:1264:3: drivers/net/ethernet/chelsio/cxgb3/cxgb3_main.c:394:34: note: directive argument in the range [-2147483641, 509]
  394 |                                  "%s-%d", d->name, pi->first_qset + i);
      |                                  ^~~~~~~
drivers/net/ethernet/chelsio/cxgb3/cxgb3_main.c:393:25: note: 'snprintf' output between 3 and 28 bytes into a destination of size 21
  393 |                         snprintf(adap->msix_info[msi_idx].desc, n,
      |                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  394 |                                  "%s-%d", d->name, pi->first_qset + i);
      |                                  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Avoid open-coded %NUL-termination (this code was assuming snprintf
wasn't %NUL terminating when it does -- likely thinking of strncpy),
and grow the size of the string to handle a maximal value.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202312100937.ZPZCARhB-lkp@intel.com/
Cc: Raju Rangoju <rajur@chelsio.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20231212220954.work.219-kees@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
22 months agoamd-xgbe: Avoid potential string truncation in name
Kees Cook [Tue, 12 Dec 2023 22:13:12 +0000 (14:13 -0800)]
amd-xgbe: Avoid potential string truncation in name

Build with W=1 were warning about a potential string truncation:

drivers/net/ethernet/amd/xgbe/xgbe-drv.c: In function 'xgbe_alloc_channels':
drivers/net/ethernet/amd/xgbe/xgbe-drv.c:211:73: warning: '%u' directive output may be truncated writing between 1 and 10 bytes into a region of size 8 [-Wformat-truncation=]
  211 |                 snprintf(channel->name, sizeof(channel->name), "channel-%u", i);
      |                                                                         ^~
drivers/net/ethernet/amd/xgbe/xgbe-drv.c:211:64: note: directive argument in the range [0, 4294967294]
  211 |                 snprintf(channel->name, sizeof(channel->name), "channel-%u", i);
      |                                                                ^~~~~~~~~~~~
drivers/net/ethernet/amd/xgbe/xgbe-drv.c:211:17: note: 'snprintf' output between 10 and 19 bytes into a destination of size 16
  211 |                 snprintf(channel->name, sizeof(channel->name), "channel-%u", i);
      |                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Increase the size of the "name" buffer to handle the full format range.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202312100937.ZPZCARhB-lkp@intel.com/
Cc: Shyam Sundar S K <Shyam-sundar.S-k@amd.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20231212221312.work.830-kees@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
22 months agodpll: allocate pin ids in cycle
Jiri Pirko [Tue, 12 Dec 2023 15:06:05 +0000 (16:06 +0100)]
dpll: allocate pin ids in cycle

Pin ID is just a number. Nobody should rely on a certain value, instead,
user should use either pin-id-get op or RTNetlink to get it.

Unify the pin ID allocation behavior with what there is already
implemented for dpll devices.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20231212150605.1141261-1-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
22 months agosctp: support MSG_ERRQUEUE flag in recvmsg()
Eric Dumazet [Tue, 12 Dec 2023 14:55:50 +0000 (14:55 +0000)]
sctp: support MSG_ERRQUEUE flag in recvmsg()

For some reason sctp_poll() generates EPOLLERR if sk->sk_error_queue
is not empty but recvmsg() can not drain the error queue yet.

This is needed to better support timestamping.

I had to export inet_recv_error(), since sctp
can be compiled as a module.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Cc: Willem de Bruijn <willemb@google.com>
Acked-by: Xin Long <lucien.xin@gmail.com>
Link: https://lore.kernel.org/r/20231212145550.3872051-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
22 months agonet: prevent mss overflow in skb_segment()
Eric Dumazet [Tue, 12 Dec 2023 16:46:21 +0000 (16:46 +0000)]
net: prevent mss overflow in skb_segment()

Once again syzbot is able to crash the kernel in skb_segment() [1]

GSO_BY_FRAGS is a forbidden value, but unfortunately the following
computation in skb_segment() can reach it quite easily :

mss = mss * partial_segs;

65535 = 3 * 5 * 17 * 257, so many initial values of mss can lead to
a bad final result.

Make sure to limit segmentation so that the new mss value is smaller
than GSO_BY_FRAGS.

[1]

general protection fault, probably for non-canonical address 0xdffffc000000000e: 0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x0000000000000070-0x0000000000000077]
CPU: 1 PID: 5079 Comm: syz-executor993 Not tainted 6.7.0-rc4-syzkaller-00141-g1ae4cd3cbdd0 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/10/2023
RIP: 0010:skb_segment+0x181d/0x3f30 net/core/skbuff.c:4551
Code: 83 e3 02 e9 fb ed ff ff e8 90 68 1c f9 48 8b 84 24 f8 00 00 00 48 8d 78 70 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84 c0 74 08 3c 03 0f 8e 8a 21 00 00 48 8b 84 24 f8 00
RSP: 0018:ffffc900043473d0 EFLAGS: 00010202
RAX: dffffc0000000000 RBX: 0000000000010046 RCX: ffffffff886b1597
RDX: 000000000000000e RSI: ffffffff886b2520 RDI: 0000000000000070
RBP: ffffc90004347578 R08: 0000000000000005 R09: 000000000000ffff
R10: 000000000000ffff R11: 0000000000000002 R12: ffff888063202ac0
R13: 0000000000010000 R14: 000000000000ffff R15: 0000000000000046
FS: 0000555556e7e380(0000) GS:ffff8880b9900000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020010000 CR3: 0000000027ee2000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
udp6_ufo_fragment+0xa0e/0xd00 net/ipv6/udp_offload.c:109
ipv6_gso_segment+0x534/0x17e0 net/ipv6/ip6_offload.c:120
skb_mac_gso_segment+0x290/0x610 net/core/gso.c:53
__skb_gso_segment+0x339/0x710 net/core/gso.c:124
skb_gso_segment include/net/gso.h:83 [inline]
validate_xmit_skb+0x36c/0xeb0 net/core/dev.c:3626
__dev_queue_xmit+0x6f3/0x3d60 net/core/dev.c:4338
dev_queue_xmit include/linux/netdevice.h:3134 [inline]
packet_xmit+0x257/0x380 net/packet/af_packet.c:276
packet_snd net/packet/af_packet.c:3087 [inline]
packet_sendmsg+0x24c6/0x5220 net/packet/af_packet.c:3119
sock_sendmsg_nosec net/socket.c:730 [inline]
__sock_sendmsg+0xd5/0x180 net/socket.c:745
__sys_sendto+0x255/0x340 net/socket.c:2190
__do_sys_sendto net/socket.c:2202 [inline]
__se_sys_sendto net/socket.c:2198 [inline]
__x64_sys_sendto+0xe0/0x1b0 net/socket.c:2198
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0x40/0x110 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x63/0x6b
RIP: 0033:0x7f8692032aa9
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 d1 19 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fff8d685418 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f8692032aa9
RDX: 0000000000010048 RSI: 00000000200000c0 RDI: 0000000000000003
RBP: 00000000000f4240 R08: 0000000020000540 R09: 0000000000000014
R10: 0000000000000000 R11: 0000000000000246 R12: 00007fff8d685480
R13: 0000000000000001 R14: 00007fff8d685480 R15: 0000000000000003
</TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
RIP: 0010:skb_segment+0x181d/0x3f30 net/core/skbuff.c:4551
Code: 83 e3 02 e9 fb ed ff ff e8 90 68 1c f9 48 8b 84 24 f8 00 00 00 48 8d 78 70 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84 c0 74 08 3c 03 0f 8e 8a 21 00 00 48 8b 84 24 f8 00
RSP: 0018:ffffc900043473d0 EFLAGS: 00010202
RAX: dffffc0000000000 RBX: 0000000000010046 RCX: ffffffff886b1597
RDX: 000000000000000e RSI: ffffffff886b2520 RDI: 0000000000000070
RBP: ffffc90004347578 R08: 0000000000000005 R09: 000000000000ffff
R10: 000000000000ffff R11: 0000000000000002 R12: ffff888063202ac0
R13: 0000000000010000 R14: 000000000000ffff R15: 0000000000000046
FS: 0000555556e7e380(0000) GS:ffff8880b9900000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020010000 CR3: 0000000027ee2000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

Fixes: 3953c46c3ac7 ("sk_buff: allow segmenting based on frag sizes")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20231212164621.4131800-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
22 months agoMerge branch 'idpf-add-get-set-for-ethtool-s-header-split-ringparam'
Jakub Kicinski [Thu, 14 Dec 2023 02:22:10 +0000 (18:22 -0800)]
Merge branch 'idpf-add-get-set-for-ethtool-s-header-split-ringparam'

Alexander Lobakin says:

====================
idpf: add get/set for Ethtool's header split ringparam

Currently, the header split feature (putting headers in one smaller
buffer and then the data in a separate bigger one) is always enabled
in idpf when supported.
One may want to not have fragmented frames per each packet, for example,
to avoid XDP frags. To better optimize setups for particular workloads,
add ability to switch the header split state on and off via Ethtool's
ringparams, as well as to query the current status.
There's currently only GET in the Ethtool Netlink interface for now,
so add SET first. I suspect idpf is not the only one supporting this.
====================

Link: https://lore.kernel.org/r/20231212142752.935000-1-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
22 months agoidpf: add get/set for Ethtool's header split ringparam
Michal Kubiak [Tue, 12 Dec 2023 14:27:52 +0000 (15:27 +0100)]
idpf: add get/set for Ethtool's header split ringparam

idpf supports the header split feature and that feature is always
enabled by default.
However, for flexibility reasons and to simplify some scenarios, it
would be useful to have the support for switching the header split
off (and on) from the userspace.

Address that need by adding the user config parameter, the functions
for disabling (or enabling) the header split feature, and calls to
them from the Ethtool ringparam callbacks.
It still is enabled by default if supported by the hardware.

Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Michal Kubiak <michal.kubiak@intel.com>
Co-developed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://lore.kernel.org/r/20231212142752.935000-3-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
22 months agoethtool: add SET for TCP_DATA_SPLIT ringparam
Alexander Lobakin [Tue, 12 Dec 2023 14:27:51 +0000 (15:27 +0100)]
ethtool: add SET for TCP_DATA_SPLIT ringparam

Follow up commit 9690ae604290 ("ethtool: add header/data split
indication") and add the set part of Ethtool's header split, i.e.
ability to enable/disable header split via the Ethtool Netlink
interface. This might be helpful to optimize the setup for particular
workloads, for example, to avoid XDP frags, and so on.
A driver should advertise ``ETHTOOL_RING_USE_TCP_DATA_SPLIT`` in its
ops->supported_ring_params to allow doing that. "Unknown" passed from
the userspace when the header split is supported means the driver is
free to choose the preferred state.

Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://lore.kernel.org/r/20231212142752.935000-2-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
22 months agodocs: networking: timestamping: mention MSG_EOR flag
Eric Dumazet [Tue, 12 Dec 2023 11:06:08 +0000 (11:06 +0000)]
docs: networking: timestamping: mention MSG_EOR flag

TCP got MSG_EOR support in linux-4.7.

This is a canonical way of making sure no coalescing
will be performed on the skb, even if it could not be
immediately sent.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Martin KaFai Lau <kafai@fb.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20231212110608.3673677-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
22 months agonet/mlx5: DR, Use swap() instead of open coding it
Jiapeng Chong [Fri, 17 Nov 2023 07:19:47 +0000 (15:19 +0800)]
net/mlx5: DR, Use swap() instead of open coding it

Swap is a function interface that provides exchange function. To avoid
code duplication, we can use swap function.

./drivers/net/ethernet/mellanox/mlx5/core/steering/dr_action.c:1254:50-51: WARNING opportunity for swap().

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=7580
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
22 months agonet/mlx5: devcom, Add component size getter
Tariq Toukan [Mon, 21 Aug 2023 11:31:35 +0000 (14:31 +0300)]
net/mlx5: devcom, Add component size getter

Add a getter for the number of participants in a devcom
component (those who share the same component id and key).

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
22 months agonet/mlx5e: Decouple CQ from priv
Tariq Toukan [Fri, 4 Aug 2023 18:46:18 +0000 (21:46 +0300)]
net/mlx5e: Decouple CQ from priv

Make CQ struct and methods independent of "priv", use more basic
arguments instead.
This will ease the transition to netdev with multiple mdevs.

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
22 months agonet/mlx5e: Add wrapping for auxiliary_driver ops and remove unused args
Tariq Toukan [Thu, 17 Aug 2023 08:49:04 +0000 (11:49 +0300)]
net/mlx5e: Add wrapping for auxiliary_driver ops and remove unused args

Turn some of the struct auxiliary_driver ops into wrappers to stop
having dummy local vars passed as unused arguments.

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
22 months agonet/mlx5e: Statify function mlx5e_monitor_counter_arm
Tariq Toukan [Tue, 22 Aug 2023 10:49:22 +0000 (13:49 +0300)]
net/mlx5e: Statify function mlx5e_monitor_counter_arm

Function usage is limited to the monitor_stats.c file, do not expose it.

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
22 months agonet/mlx5: Move TISes from priv to mdev HW resources
Tariq Toukan [Sun, 6 Aug 2023 11:01:10 +0000 (14:01 +0300)]
net/mlx5: Move TISes from priv to mdev HW resources

The transport interface send (TIS) object is responsible for performing
all transport related operations of the transmit side. Messages from
Send Queues get segmented and transmitted by the TIS including all
transport required implications, e.g. in the case of large send offload,
the TIS is responsible for the segmentation.

These are stateless objects and can be used by multiple netdevs (e.g.
representors) who share the same core device.

Providing the TISes as a service from the core layer to the netdev layer
reduces the number of replecated TIS objects (in case of multiple
netdevs), and will ease the transition to netdev with multiple mdevs.

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
22 months agonet/mlx5e: Remove TLS-specific logic in generic create TIS API
Tariq Toukan [Thu, 10 Aug 2023 14:42:36 +0000 (17:42 +0300)]
net/mlx5e: Remove TLS-specific logic in generic create TIS API

TLS TISes are created using their own dedicated functions,
don't honor their specific logic here.

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
22 months agonet/mlx5: fs, Command to control TX flow table root
Tariq Toukan [Tue, 5 Dec 2023 10:07:29 +0000 (12:07 +0200)]
net/mlx5: fs, Command to control TX flow table root

Introduce an API to set/unset the TX flow table root for a device.

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
22 months agonet/mlx5: fs, Command to control L2TABLE entry silent mode
Tariq Toukan [Mon, 7 Aug 2023 06:09:04 +0000 (09:09 +0300)]
net/mlx5: fs, Command to control L2TABLE entry silent mode

Introduce an API to set/unset the L2TABLE entry silent mode for a
device. If silent, no north/south traffic is allowed, the device won't
be able to communicate with the port directly to send/receive traffic by
its own.

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
22 months agonet/mlx5: Expose Management PCIe Index Register (MPIR)
Tariq Toukan [Mon, 7 Aug 2023 06:05:34 +0000 (09:05 +0300)]
net/mlx5: Expose Management PCIe Index Register (MPIR)

MPIR register allows to query the PCIe indexes
and Socket-Direct related parameters.

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
22 months agonet/mlx5: Add mlx5_ifc bits used for supporting single netdev Socket-Direct
Tariq Toukan [Sun, 7 May 2023 13:47:42 +0000 (16:47 +0300)]
net/mlx5: Add mlx5_ifc bits used for supporting single netdev Socket-Direct

Multiple device caps and features are required to support
single netdev Socket-Direct.
Add them here in preparation for the feature implementation.

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>