www.infradead.org Git - users/hch/misc.git/log

eth: mlx4: create a page pool for Rx

Create a pool per rx queue. Subsequent patches will make use of it.

Move fcs_del to a hole to make space for the pointer.

Per common "wisdom" base the page pool size on the ring size.
Note that the page pool cache size is in full pages, so just
round up the effective buffer size to pages.

Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250213010635.1354034-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: marvell-88q2xxx: Init PHY private structure for mv88q211x

When adding LED support for mv88q222x devices the PHY private data
structure was added to the mv88q211x code path, the data structure is
however only allocated during mv88q222x probe. This results in a nullptr
deference for mv88q2110 devices.

Unable to handle kernel NULL pointer dereference at virtual address 0000000000000001
Mem abort info:
  ESR = 0x0000000096000004
  EC = 0x25: DABT (current EL), IL = 32 bits
  SET = 0, FnV = 0
  EA = 0, S1PTW = 0
  FSC = 0x04: level 0 translation fault
Data abort info:
  ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
  CM = 0, WnR = 0, TnD = 0, TagAccess = 0
  GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[0000000000000001] user address but active_mm is swapper
Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
CPU: 3 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.14.0-rc1-arm64-renesas-00342-ga3783dbf2574 #7
Hardware name: Renesas White Hawk Single board based on r8a779g2 (DT)
pstate: 20400005 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : mv88q2xxx_config_init+0x28/0x84
lr : mv88q2110_config_init+0x98/0xb0
sp : ffff8000823eb9d0
x29: ffff8000823eb9d0 x28: ffff000440942000 x27: ffff80008144e400
x26: 0000000000001002 x25: 0000000000000000 x24: 0000000000000000
x23: 0000000000000009 x22: ffff8000810534f0 x21: ffff800081053550
x20: 0000000000000000 x19: ffff0004437d6800 x18: 0000000000000018
x17: 00000000000961c8 x16: ffff0006bef75ec0 x15: 0000000000000001
x14: 0000000000000001 x13: ffff000440218080 x12: 071c71c71c71c71c
x11: ffff000440218080 x10: 0000000000001420 x9 : ffff8000823eb770
x8 : ffff8000823eb650 x7 : ffff8000823eb750 x6 : ffff8000823eb710
x5 : 0000000000000000 x4 : 0000000000000800 x3 : 0000000000000001
x2 : 0000000000000000 x1 : 00000000ffffffff x0 : ffff0004437d6800
Call trace:
mv88q2xxx_config_init+0x28/0x84 (P)
mv88q2110_config_init+0x98/0xb0
phy_init_hw+0x64/0x9c
phy_attach_direct+0x118/0x320
phy_connect_direct+0x24/0x80
of_phy_connect+0x5c/0xa0
rtsn_open+0x5bc/0x78c
__dev_open+0xf8/0x1fc
__dev_change_flags+0x198/0x220
dev_change_flags+0x20/0x64
ip_auto_config+0x270/0xefc
do_one_initcall+0xe4/0x22c
kernel_init_freeable+0x2a8/0x308
kernel_init+0x20/0x130
ret_from_fork+0x10/0x20
Code: b907e404 f9432814 3100083f 540000e3 (39400680)
---[ end trace 0000000000000000 ]---
Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
SMP: stopping secondary CPUs
Kernel Offset: disabled
CPU features: 0x000,00000070,00801250,8200700b
Memory Limit: none
---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]---

Fix this by using a generic probe function for both mv88q211x and
mv88q222x devices that allocates the PHY private data structure, while
only the mv88q222x probes for LED support.

Fixes: a3783dbf2574 ("net: phy: marvell-88q2xxx: Add support for PHY LEDs on 88q2xxx")
Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Link: https://patch.msgid.link/20250214174650.2056949-1-niklas.soderlund+renesas@ragnatech.se
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

trace: tcp: Add tracepoint for tcp_cwnd_reduction()

Add a lightweight tracepoint to monitor TCP congestion window
adjustments via tcp_cwnd_reduction(). This tracepoint enables tracking
of:
- TCP window size fluctuations
- Active socket behavior
- Congestion window reduction events

Meta has been using BPF programs to monitor this function for years.
Adding a proper tracepoint provides a stable API for all users who need
to monitor TCP congestion window behavior.

Use DECLARE_TRACE instead of TRACE_EVENT to avoid creating trace event
infrastructure and exporting to tracefs, keeping the implementation
minimal. (Thanks Steven Rostedt)

Given that this patch creates a rawtracepoint, you could hook into it
using regular tooling, like bpftrace, using regular rawtracepoint
infrastructure, such as:

rawtracepoint:tcp_cwnd_reduction_tp {
....
}

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250214-cwnd_tracepoint-v2-1-ef8d15162d95@debian.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'net-phy-marvell-88q2xxx-cleanup'

Dimitri Fedrau says:

====================
net: phy: marvell-88q2xxx: cleanup

- align defines
- order includes alphabetically
- enable temperature sensor in mv88q2xxx_config_init

Signed-off-by: Dimitri Fedrau <dima.fedrau@gmail.com>
====================

Link: https://patch.msgid.link/20250214-marvell-88q2xxx-cleanup-v1-0-71d67c20f308@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: phy: marvell-88q2xxx: enable temperature sensor in mv88q2xxx_config_init

Temperature sensor gets enabled for 88Q222X devices in
mv88q222x_config_init. Move enabling to mv88q2xxx_config_init because
all 88Q2XXX devices support the temperature sensor.

Signed-off-by: Dimitri Fedrau <dima.fedrau@gmail.com>
Tested-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: phy: marvell-88q2xxx: order includes alphabetically

Order includes alphabetically.

Signed-off-by: Dimitri Fedrau <dima.fedrau@gmail.com>
Reviewed-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: phy: marvell-88q2xxx: align defines

Align some defines.

Signed-off-by: Dimitri Fedrau <dima.fedrau@gmail.com>
Reviewed-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'vxlan-join-leave-mc-group-when-reconfigured'

Petr Machata says:

====================
vxlan: Join / leave MC group when reconfigured

When a vxlan netdevice is brought up, if its default remote is a multicast
address, the device joins the indicated group.

Therefore when the multicast remote address changes, the device should
leave the current group and subscribe to the new one. Similarly when the
interface used for endpoint communication is changed in a situation when
multicast remote is configured. This is currently not done.

Both vxlan_igmp_join() and vxlan_igmp_leave() can however fail. So it is
possible that with such fix, the netdevice will end up in an inconsistent
situation where the old group is not joined anymore, but joining the
new group fails. Should we join the new group first, and leave the old one
second, we might end up in the opposite situation, where both groups are
joined. Undoing any of this during rollback is going to be similarly
problematic.

One solution would be to just forbid the change when the netdevice is up.
However in vnifilter mode, changing the group address is allowed, and these
problems are simply ignored (see vxlan_vni_update_group()):

# ip link add name br up type bridge vlan_filtering 1
# ip link add vx1 up master br type vxlan external vnifilter local 192.0.2.1 dev lo dstport 4789
# bridge vni add dev vx1 vni 200 group 224.0.0.1
# tcpdump -i lo &
# bridge vni add dev vx1 vni 200 group 224.0.0.2
18:55:46.523438 IP 0.0.0.0 > 224.0.0.22: igmp v3 report, 1 group record(s)
18:55:46.943447 IP 0.0.0.0 > 224.0.0.22: igmp v3 report, 1 group record(s)
# bridge vni
dev vni group/remote
vx1 200 224.0.0.2

Having two different modes of operation for conceptually the same interface
is silly, so in this patchset, just do what the vnifilter code does and
deal with the errors by crossing fingers real hard.
====================

Link: https://patch.msgid.link/cover.1739548836.git.petrm@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

selftests: test_vxlan_fdb_changelink: Add a test for MC remote change

Changes to MC remote need to be reflected in actual group memberships.
Add a test to verify that it is the case.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

selftests: test_vxlan_fdb_changelink: Convert to lib.sh

Instead of inlining equivalents, use lib.sh-provided primitives.
Use defer to manage vx lifetime.

This will make it easier to extend the test in the next patch.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

selftests: forwarding: lib: Move require_command to net, generalize

This helper could be useful to more than just forwarding tests.
Move it upstairs and port over to log_test_skip().

Split the function into two parts: the bit that actually checks and
reports skip, which is in a new function check_command(). And a bit
that exits the test script if the check fails. This allows users
consistent checking behavior while giving an option to bail out from
a single test without bailing out of the whole script.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

vxlan: Join / leave MC group after remote changes

When a vxlan netdevice is brought up, if its default remote is a multicast
address, the device joins the indicated group.

Therefore when the multicast remote address changes, the device should
leave the current group and subscribe to the new one. Similarly when the
interface used for endpoint communication is changed in a situation when
multicast remote is configured. This is currently not done.

Both vxlan_igmp_join() and vxlan_igmp_leave() can however fail. So it is
possible that with such fix, the netdevice will end up in an inconsistent
situation where the old group is not joined anymore, but joining the new
group fails. Should we join the new group first, and leave the old one
second, we might end up in the opposite situation, where both groups are
joined. Undoing any of this during rollback is going to be similarly
problematic.

One solution would be to just forbid the change when the netdevice is up.
However in vnifilter mode, changing the group address is allowed, and these
problems are simply ignored (see vxlan_vni_update_group()):

# ip link add name br up type bridge vlan_filtering 1
# ip link add vx1 up master br type vxlan external vnifilter local 192.0.2.1 dev lo dstport 4789
# bridge vni add dev vx1 vni 200 group 224.0.0.1
# tcpdump -i lo &
# bridge vni add dev vx1 vni 200 group 224.0.0.2
18:55:46.523438 IP 0.0.0.0 > 224.0.0.22: igmp v3 report, 1 group record(s)
18:55:46.943447 IP 0.0.0.0 > 224.0.0.22: igmp v3 report, 1 group record(s)
# bridge vni
dev vni group/remote
vx1 200 224.0.0.2

Having two different modes of operation for conceptually the same interface
is silly, so in this patch, just do what the vnifilter code does and deal
with the errors by crossing fingers real hard.

The vnifilter code leaves old before joining new, and in case of join /
leave failures does not roll back the configuration changes that have
already been applied, but bails out of joining if it could not leave. Do
the same here: leave before join, apply changes unconditionally and do not
attempt to join if we couldn't leave.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

vxlan: Drop 'changelink' parameter from vxlan_dev_configure()

vxlan_dev_configure() only has a single caller that passes false
for the changelink parameter. Drop the parameter and inline the
sole value.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

page_pool: avoid infinite loop to schedule delayed worker

We noticed the kworker in page_pool_release_retry() was waken
up repeatedly and infinitely in production because of the
buggy driver causing the inflight less than 0 and warning
us in page_pool_inflight()[1].

Since the inflight value goes negative, it means we should
not expect the whole page_pool to get back to work normally.

This patch mitigates the adverse effect by not rescheduling
the kworker when detecting the inflight negative in
page_pool_release_retry().

[1]
[Mon Feb 10 20:36:11 2025] ------------[ cut here ]------------
[Mon Feb 10 20:36:11 2025] Negative(-51446) inflight packet-pages
...
[Mon Feb 10 20:36:11 2025] Call Trace:
[Mon Feb 10 20:36:11 2025]  page_pool_release_retry+0x23/0x70
[Mon Feb 10 20:36:11 2025]  process_one_work+0x1b1/0x370
[Mon Feb 10 20:36:11 2025]  worker_thread+0x37/0x3a0
[Mon Feb 10 20:36:11 2025]  kthread+0x11a/0x140
[Mon Feb 10 20:36:11 2025]  ? process_one_work+0x370/0x370
[Mon Feb 10 20:36:11 2025]  ? __kthread_cancel_work+0x40/0x40
[Mon Feb 10 20:36:11 2025]  ret_from_fork+0x35/0x40
[Mon Feb 10 20:36:11 2025] ---[ end trace ebffe800f33e7e34 ]---
Note: before this patch, the above calltrace would flood the
dmesg due to repeated reschedule of release_dw kworker.

Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20250214064250.85987-1-kerneljasonxing@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'add-af_xdp-support-for-cn10k'

Suman Ghosh says:

====================
Add af_xdp support for cn10k

This patchset includes changes to support AF_XDP for cn10k chipsets. Both
non-zero copy and zero copy will be supported after these changes. Also,
the RSS will be reconfigured once a particular receive queue is
added/removed to/from AF_XDP support.

Patch #1: octeontx2-pf: use xdp_return_frame() to free xdp buffers

Patch #2: octeontx2-pf: Add AF_XDP non-zero copy support

Patch #3: octeontx2-pf: AF_XDP zero copy receive support

Patch #4: octeontx2-pf: Reconfigure RSS table after enabling AF_XDP
zerocopy on rx queue

Patch #5: octeontx2-pf: Prepare for AF_XDP transmit

Patch #6: octeontx2-pf: AF_XDP zero copy transmit support
====================

Link: https://patch.msgid.link/20250213053141.2833254-1-sumang@marvell.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

octeontx2-pf: AF_XDP zero copy transmit support

This patch implements below changes,

1. To avoid concurrency with normal traffic uses
   XDP queues.
2. Since there are chances that XDP and AF_XDP can
   fall under same queue uses separate flags to handle
   dma buffers.

Signed-off-by: Hariprasad Kelam <hkelam@marvell.com>
Signed-off-by: Suman Ghosh <sumang@marvell.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

octeontx2-pf: Prepare for AF_XDP

Implement necessary APIs required for AF_XDP transmit.

Signed-off-by: Hariprasad Kelam <hkelam@marvell.com>
Signed-off-by: Suman Ghosh <sumang@marvell.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

octeontx2-pf: Reconfigure RSS table after enabling AF_XDP zerocopy on rx queue

RSS table needs to be reconfigured once a rx queue is enabled or
disabled for AF_XDP zerocopy support. After enabling UMEM on a rx queue,
that queue should not be part of RSS queue selection algorithm.
Similarly the queue should be considered again after UMEM is disabled.

Signed-off-by: Suman Ghosh <sumang@marvell.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

octeontx2-pf: AF_XDP zero copy receive support

This patch adds support to AF_XDP zero copy for CN10K.
This patch specifically adds receive side support. In this approach once
a xdp program with zero copy support on a specific rx queue is enabled,
then that receive quse is disabled/detached from the existing kernel
queue and re-assigned to the umem memory.

Signed-off-by: Suman Ghosh <sumang@marvell.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

octeontx2-pf: Add AF_XDP non-zero copy support

Set xdp rx ring memory type as MEM_TYPE_PAGE_POOL for
af-xdp to work. This is needed since xdp_return_frame
internally will use page pools.

Fixes: 06059a1a9a4a ("octeontx2-pf: Add XDP support to netdev PF")
Signed-off-by: Suman Ghosh <sumang@marvell.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

octeontx2-pf: use xdp_return_frame() to free xdp buffers

xdp_return_frames() will help to free the xdp frames and their
associated pages back to page pool.

Signed-off-by: Geetha sowjanya <gakula@marvell.com>
Signed-off-by: Suman Ghosh <sumang@marvell.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue

Tony Nguyen says:

====================
ice, iavf: Add support for Rx timestamping

Mateusz Polchlopek says:

Initially, during VF creation it registers the PTP clock in
the system and negotiates with PF it's capabilities. In the
meantime the PF enables the Flexible Descriptor for VF.
Only this type of descriptor allows to receive Rx timestamps.

Enabling virtual clock would be possible, though it would probably
perform poorly due to the lack of direct time access.

Enable timestamping should be done using userspace tools, e.g.
hwstamp_ctl -i $VF -r 14

In order to report the timestamps to userspace, the VF extends
timestamp to 40b.

To support this feature the flexible descriptors and PTP part
in iavf driver have been introduced.

* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
  iavf: add support for Rx timestamps to hotpath
  iavf: handle set and get timestamps ops
  iavf: Implement checking DD desc field
  iavf: refactor iavf_clean_rx_irq to support legacy and flex descriptors
  iavf: define Rx descriptors as qwords
  libeth: move idpf_rx_csum_decoded and idpf_rx_extracted
  iavf: periodically cache PHC time
  iavf: add support for indirect access to PHC time
  iavf: add initial framework for registering PTP clock
  iavf: negotiate PTP capabilities
  iavf: add support for negotiating flexible RXDID format
  virtchnl: add enumeration for the rxdid format
  ice: support Rx timestamp on flex descriptor
  virtchnl: add support for enabling PTP on iAVF
====================

Link: https://patch.msgid.link/20250214192739.1175740-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

eth: fbnic: support TCP segmentation offload

Add TSO support to the driver. Device can handle unencapsulated or
IPv6-in-IPv6 packets. Any other tunnel stacks are handled with
GSO partial.

Validate that the packet can be offloaded in ndo_features_check.
Main thing we need to check for is that the header geometry can
be expressed in the decriptor fields (offsets aren't too large).

Report number of TSO super-packets via the qstat API.

Link: https://patch.msgid.link/20250216174109.2808351-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netdev: clarify GSO vs csum in qstats

Could be just me, but I had to pause and double check that the Tx csum
counter in qstat should include GSO'd packets. GSO pretty much implies
csum so one could possibly interpret the csum counter as pure csum offload.

But the counters are based on virtio:

  [tx_needs_csum]
      The number of packets which require checksum calculation by the device.

  [rx_needs_csum]
      The number of packets with VIRTIO_NET_HDR_F_NEEDS_CSUM.

and VIRTIO_NET_HDR_F_NEEDS_CSUM gets set on GSO packets virtio sends.

Clarify this in the spec to avoid any confusion.

Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250214224601.2271201-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: move stale comment about ntuple validation

Gal points out that the comment now belongs further down, since
the original if condition was split into two in
commit de7f7582dff2 ("net: ethtool: prevent flow steering to RSS contexts which don't exist")

Link: https://lore.kernel.org/de4a2a8a-1eb9-4fa8-af87-7526e58218e9@nvidia.com
Reviewed-by: Gal Pressman <gal@nvidia.com>
Link: https://patch.msgid.link/20250214224340.2268691-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'netdev-genl-add-an-xsk-attribute-to-queues'

Joe Damato says:

====================
netdev-genl: Add an xsk attribute to queues

This is an attempt to followup on something Jakub asked me about [1],
adding an xsk attribute to queues and more clearly documenting which
queues are linked to NAPIs...

After the RFC [2], Jakub suggested creating an empty nest for queues
which have a pool, so I've adjusted this version to work that way.

The nest can be extended in the future to express attributes about XSK
as needed. Queues which are not used for AF_XDP do not have the xsk
attribute present.

I've run the included test on:
- my mlx5 machine (via NETIF=)
- without setting NETIF

And the test seems to pass in both cases.

[1]: https://lore.kernel.org/netdev/20250113143109.60afa59a@kernel.org/
[2]: https://lore.kernel.org/netdev/20250129172431.65773-1-jdamato@fastly.com/
====================

Link: https://patch.msgid.link/20250214211255.14194-1-jdamato@fastly.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: drv-net: Test queue xsk attribute

Test that queues which are used for AF_XDP have the xsk nest attribute.
The attribute is currently empty, but its existence means the AF_XDP is
being used for the queue. Enable CONFIG_XDP_SOCKETS for
selftests/drivers/net tests, as well.

Signed-off-by: Joe Damato <jdamato@fastly.com>
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20250214211255.14194-4-jdamato@fastly.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netdev-genl: Add an XSK attribute to queues

Expose a new per-queue nest attribute, xsk, which will be present for
queues that are being used for AF_XDP. If the queue is not being used for
AF_XDP, the nest will not be present.

In the future, this attribute can be extended to include more data about
XSK as it is needed.

Signed-off-by: Joe Damato <jdamato@fastly.com>
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20250214211255.14194-3-jdamato@fastly.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netlink: Add nla_put_empty_nest helper

Creating empty nests is helpful when the exact attributes to be exposed
in the future are not known. Encapsulate the logic in a helper.

Signed-off-by: Joe Damato <jdamato@fastly.com>
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20250214211255.14194-2-jdamato@fastly.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: net: add support for testing SO_RCVMARK and SO_RCVPRIORITY

Introduce tests to verify the correct functionality of the SO_RCVMARK and
SO_RCVPRIORITY socket options.

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Suggested-by: Ferenc Fejes <fejes@inf.elte.hu>
Signed-off-by: Anna Emese Nyiri <annaemesenyiri@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250214205828.48503-1-annaemesenyiri@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: use napi_id_valid helper

In commit 6597e8d35851 ("netdev-genl: Elide napi_id when not present"),
napi_id_valid function was added. Use the helper to refactor open-coded
checks in the source.

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Stefano Jordhani <sjordhani@gmail.com>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk> # for iouring
Link: https://patch.msgid.link/20250214181801.931-1-sjordhani@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-phy-dp83822-add-support-for-changing-the-transmit-amplitude-voltage'

Dimitri Fedrau via says:

====================
net: phy: dp83822: Add support for changing the transmit amplitude voltage

Add support for changing the transmit amplitude voltage in 100BASE-TX mode.
Add support for configuration via DT.

v4: https://lore.kernel.org/20250211-dp83822-tx-swing-v4-0-1e8ebd71ad54@liebherr.com
v3: https://lore.kernel.org/20250204-dp83822-tx-swing-v3-0-9798e96500d9@liebherr.com
v2: https://lore.kernel.org/20250120-dp83822-tx-swing-v2-0-07c99dc42627@liebherr.com
v1: https://lore.kernel.org/20250113-dp83822-tx-swing-v1-0-7ed5a9d80010@liebherr.com
====================

Link: https://patch.msgid.link/20250214-dp83822-tx-swing-v5-0-02ca72620599@liebherr.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: dp83822: Add support for changing the transmit amplitude voltage

Add support for changing the transmit amplitude voltage in 100BASE-TX mode.
Modifying it can be necessary to compensate losses on the PCB and
connector, so the voltages measured on the RJ45 pins are conforming.

Signed-off-by: Dimitri Fedrau <dimitri.fedrau@liebherr.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20250214-dp83822-tx-swing-v5-3-02ca72620599@liebherr.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: Add helper for getting tx amplitude gain

Add helper which returns the tx amplitude gain defined in device tree.
Modifying it can be necessary to compensate losses on the PCB and
connector, so the voltages measured on the RJ45 pins are conforming.

Signed-off-by: Dimitri Fedrau <dimitri.fedrau@liebherr.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20250214-dp83822-tx-swing-v5-2-02ca72620599@liebherr.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dt-bindings: net: ethernet-phy: add property tx-amplitude-100base-tx-percent

Add property tx-amplitude-100base-tx-percent in the device tree bindings
for configuring the tx amplitude of 100BASE-TX PHYs. Modifying it can be
necessary to compensate losses on the PCB and connector, so the voltages
measured on the RJ45 pins are conforming.

Acked-by: Conor Dooley <conor.dooley@microchip.com>
Signed-off-by: Dimitri Fedrau <dimitri.fedrau@liebherr.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20250214-dp83822-tx-swing-v5-1-02ca72620599@liebherr.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: net: fix grammar in reuseaddr_ports_exhausted.c log message

This patch fixes a grammatical error in a test log message
in reuseaddr_ports_exhausted.c for better clarity.

Signed-off-by: Pranav Tyagi <pranav.tyagi03@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250213152612.4434-1-pranav.tyagi03@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'mlx5-add-sensor-name-in-temperature-message'

Tariq Toukan says:

====================
mlx5: Add sensor name in temperature message

This small series from Shahar adds the sensors names to the temperature
event messages, in addition to the existing bitmap indicators.
This improves human readability.

Series starts with simple refactoring and modifications. The top patch
adds the sensors names.
====================

Link: https://patch.msgid.link/20250213094641.226501-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: Add sensor name to temperature event message

Previously, a temperature event message included a bitmap indicating
which sensors detect high temperatures.

To enhance clarity, we modify the message format to explicitly list
the names of the overheating sensors, alongside the sensors bitmap.
If HWMON is not configured, the event message remains unchanged.

Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250213094641.226501-5-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: Modify LSB bitmask in temperature event to include only the first bit

In the sensor_count field of the MTEWE register, bits 1-62 are
supported only for unmanaged switches, not for NICs, and bit 63
is reserved for internal use.

To prevent confusing output that may include set bits that are
not relevant to NIC sensors, we update the bitmask to retain only
the first bit, which corresponds to the sensor ASIC.

Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Link: https://patch.msgid.link/20250213094641.226501-4-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: Prefix temperature event bitmap with '0x' for clarity

Prepend '0x' to the sensor bitmap in the warning message to clearly
indicate that the bitmap is in hexadecimal format.

Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Link: https://patch.msgid.link/20250213094641.226501-3-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: Apply rate-limiting to high temperature warning

Wrap the high temperature warning in a temperature event with
a call to net_ratelimit() to prevent flooding the kernel log
with repeated warning messages when temperature exceeds the
threshold multiple times within a short duration.

Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Link: https://patch.msgid.link/20250213094641.226501-2-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-phy-mediatek-add-token-ring-helper-functions'

Sky Huang says:

====================
net: phy: mediatek: Add token-ring helper functions

This patchset add token-ring helper functions and moves some macros from
mtk-ge.c into mtk-phy-lib.c.

v2: https://lore.kernel.org/20250116012159.3816135-2-SkyLake.Huang@mediatek.com
====================

Link: https://patch.msgid.link/20250213080553.921434-1-SkyLake.Huang@mediatek.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: mediatek: Move some macros to phy-lib for later use

Move some macros to phy-lib because MediaTek's 2.5G built-in
ethernet PHY will also use them.

Signed-off-by: Sky Huang <skylake.huang@mediatek.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20250213080553.921434-6-SkyLake.Huang@mediatek.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: mediatek: Add token ring clear bit operation support

Similar to __mtk_tr_set_bits() support. Previously in mtk-ge-soc.c,
we clear some register bits via token ring, which were also implemented
in three __phy_write(). Now we can do the same thing via
__mtk_tr_clr_bits() helper.

Signed-off-by: Sky Huang <skylake.huang@mediatek.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20250213080553.921434-5-SkyLake.Huang@mediatek.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: mediatek: Add token ring set bit operation support

Previously in mtk-ge-soc.c, we set some register bits via token
ring, which were implemented in three __phy_write().
Now we can do the same thing via __mtk_tr_set_bits() helper.

Signed-off-by: Sky Huang <skylake.huang@mediatek.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20250213080553.921434-4-SkyLake.Huang@mediatek.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: mediatek: Add token ring access helper functions in mtk-phy-lib

This patch adds TR(token ring) manipulations and adds correct
macro names for those magic numbers. TR is a way to access
proprietary registers on page 52b5. Use these helper functions
so we can see which fields we're going to modify/set/clear.

TR functions with __* prefix mean that the operations inside
aren't wrapped by page select/restore functions.

This patch doesn't really change registers' settings but just
enhances readability and maintainability.

Signed-off-by: Sky Huang <skylake.huang@mediatek.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20250213080553.921434-3-SkyLake.Huang@mediatek.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: mediatek: Change to more meaningful macros

Replace magic number with more meaningful macros in mtk-ge.c.
Also, move some common macros into mtk-phy-lib.c.

Signed-off-by: Sky Huang <skylake.huang@mediatek.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20250213080553.921434-2-SkyLake.Huang@mediatek.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: xpcs: rearrange register definitions

Place register number definitions immediately above their field
definitions and order by register number.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/E1tjblS-00448F-8v@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ndisc: ndisc_send_redirect() cleanup

ndisc_send_redirect() is always called under rcu_read_lock().

It can use dev_net_rcu() and avoid one redundant
rcu_read_lock()/rcu_read_unlock() pair.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250214140705.2105890-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'bnxt_en-add-npar-1-2-and-tph-support'

Michael Chan says:

====================
bnxt_en: Add NPAR 1.2 and TPH support

The first patch adds NPAR 1.2 support.  Patches 2 to 11 add TPH
(TLP Processing Hints) support.  These TPH driver patches are new
revisions originally posted as part of the TPH PCI patch series.
Additional driver refactoring has been done so that we can free
and allocate RX completion ring and the TX rings if the channel is
a combined channel.  We also add napi_disable() and napi_enable()
during queue_stop() and queue_start() respectively, and reset for
error handling in queue_start().

v4: https://lore.kernel.org/20250208202916.1391614-1-michael.chan@broadcom.com
v3: https://lore.kernel.org/20250204004609.1107078-1-michael.chan@broadcom.com
v2: https://lore.kernel.org/20250116192343.34535-1-michael.chan@broadcom.com
v1: https://lore.kernel.org/20250113063927.4017173-1-michael.chan@broadcom.com

Discussion about adding napi_disable()/napi_enable():

https://lore.kernel.org/netdev/5336d624-8d8b-40a6-b732-b020e4a119a2@davidwei.uk/#t

Previous driver series fixing rtnl_lock and empty release function:

https://lore.kernel.org/netdev/20241115200412.1340286-1-wei.huang2@amd.com/

v5 of the PCI series using netdev_rx_queue_restart():

https://lore.kernel.org/netdev/20240916205103.3882081-5-wei.huang2@amd.com/

v1 of the PCI series using open/close:

https://lore.kernel.org/netdev/20240509162741.1937586-9-wei.huang2@amd.com/
====================

Link: https://patch.msgid.link/20250213011240.1640031-1-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bnxt_en: Add TPH support in BNXT driver

Add TPH support to the Broadcom BNXT device driver. This allows the
driver to utilize TPH functions for retrieving and configuring Steering
Tags when changing interrupt affinity. With compatible NIC firmware,
network traffic will be tagged correctly with Steering Tags, resulting
in significant memory bandwidth savings and other advantages as
demonstrated by real network benchmarks on TPH-capable platforms.

Co-developed-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
Co-developed-by: Wei Huang <wei.huang2@amd.com>
Signed-off-by: Wei Huang <wei.huang2@amd.com>
Signed-off-by: Manoj Panicker <manoj.panicker2@amd.com>
Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20250213011240.1640031-12-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bnxt_en: Extend queue stop/start for TX rings

In order to use queue_stop/queue_start to support the new Steering
Tags, we need to free the TX ring and TX completion ring if it is a
combined channel with TX/RX sharing the same NAPI.  Otherwise
TX completions will not have the updated Steering Tag.  If TPH is
not enabled, we just stop the TX ring without freeing the TX/TX cmpl
rings.  With that we can now add napi_disable() and napi_enable()
during queue_stop()/ queue_start().  This will guarantee that NAPI
will stop processing the completion entries in case there are
additional pending entries in the completion rings after queue_stop().

There could be some NQEs sitting unprocessed while NAPI is disabled
thereby leaving the NQ unarmed.  Explicitly re-arm the NQ after
napi_enable() in queue start so that NAPI will resume properly.

Error handling in bnxt_queue_start() requires a reset.  If a TX
ring cannot be allocated or initialized properly, it will cause
TX timeout.  The reset will also free any partially allocated
rings.  We don't expect to hit this error path because re-allocating
previously reserved and allocated rings with the same parameters
should never fail.

Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20250213011240.1640031-11-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bnxt_en: Refactor TX ring free logic

Add a new bnxt_hwrm_tx_ring_free() function to handle freeing a HW
transmit ring. The new function will also be used in the next patch
to free the TX ring in queue_stop.

Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20250213011240.1640031-10-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bnxt_en: Reallocate RX completion ring for TPH support

In order to program the correct Steering Tag during an IRQ affinity
change, we need to free/re-allocate the RX completion ring during
queue_restart.  If TPH is enabled, call FW to free the Rx completion
ring and clear the ring entries in queue_stop().  Re-allocate it in
queue_start() if TPH is enabled.  Note that TPH mode is not enabled
in this patch and will be enabled later in the patch series.

While modifying bnxt_queue_start(), remove the unnecessary zeroing of
rxr->rx_next_cons.  It gets overwritten by the clone in
bnxt_queue_start().  Remove the rx_reset counter increment since
restart is not reset.  Add comment to clarify that the ring
allocations in queue_start should never fail.

Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20250213011240.1640031-9-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bnxt_en: Pass NQ ID to the FW when allocating RX/RX AGG rings

Newer firmware can use the NQ ring ID associated with each RX/RX AGG
ring to enable PCIe Steering Tags on P5_PLUS chips.  When allocating
RX/RX AGG rings, pass along NQ ring ID for the firmware to use.  This
information helps optimize DMA writes by directing them to the cache
closer to the CPU consuming the data, potentially improving the
processing speed.  This change is backward-compatible with older
firmware, which will simply disregard the information.

Reviewed-by: Hongguang Gao <hongguang.gao@broadcom.com>
Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20250213011240.1640031-8-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bnxt_en: Refactor RX/RX AGG ring parameters setup for P5_PLUS

There is some common code for setting up RX and RX AGG ring allocation
parameters for P5_PLUS chips. Refactor the logic into a new function.

Reviewed-by: Hongguang Gao <hongguang.gao@broadcom.com>
Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20250213011240.1640031-7-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bnxt_en: Refactor bnxt_free_tx_rings() to free per TX ring

Modify bnxt_free_tx_rings() to free the skbs per TX ring.
This will be useful later in the series.

Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20250213011240.1640031-6-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bnxt_en: Refactor completion ring free routine

Add a wrapper routine to free L2 completion rings. This will be
useful later in the series.

Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20250213011240.1640031-5-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bnxt_en: Refactor TX ring allocation logic

Add a new bnxt_hwrm_tx_ring_alloc() function to handle allocating
a transmit ring. This will be useful later in the series.

Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20250213011240.1640031-4-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bnxt_en: Refactor completion ring allocation logic for P5_PLUS chips

Add a new bnxt_hwrm_cp_ring_alloc_p5() function to handle allocating
one completion ring on P5_PLUS chips. This simplifies the existing code
and will be useful later in the series.

Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20250213011240.1640031-3-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bnxt_en: Set NPAR 1.2 support when registering with firmware

NPAR (Network interface card partitioning)[1] 1.2 adds a transparent
VLAN tag for all packets between the NIC and the switch.  Because of
that, RX VLAN acceleration cannot be supported for any additional
host configured VLANs.  The driver has to acknowledge that it can
support no RX VLAN acceleration and set the NPAR 1.2 supported flag
when registering with the FW.  Otherwise, the FW call will fail and
the driver will abort on these NPAR 1.2 NICs with this error:

bnxt_en 0000:26:00.0 (unnamed net_device) (uninitialized): hwrm req_type 0x1d seq id 0xb error 0x2

[1] https://techdocs.broadcom.com/us/en/storage-and-ethernet-connectivity/ethernet-nic-controllers/bcm957xxx/adapters/introduction/features/network-partitioning-npar.html

Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20250213011240.1640031-2-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx4_core: Avoid impossible mlx4_db_alloc() order value

GCC can see that the value range for "order" is capped, but this leads
it to consider that it might be negative, leading to a false positive
warning (with GCC 15 with -Warray-bounds -fdiagnostics-details):

../drivers/net/ethernet/mellanox/mlx4/alloc.c:691:47: error: array subscript -1 is below array bounds of 'long unsigned int *[2]' [-Werror=array-bounds=]
  691 |                 i = find_first_bit(pgdir->bits[o], MLX4_DB_PER_PAGE >> o);
      |                                    ~~~~~~~~~~~^~~
  'mlx4_alloc_db_from_pgdir': events 1-2
  691 |                 i = find_first_bit(pgdir->bits[o], MLX4_DB_PER_PAGE >> o);                        |                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                     |                         |                                                   |                     |                         (2) out of array bounds here
      |                     (1) when the condition is evaluated to true                             In file included from ../drivers/net/ethernet/mellanox/mlx4/mlx4.h:53,
                 from ../drivers/net/ethernet/mellanox/mlx4/alloc.c:42:
../include/linux/mlx4/device.h:664:33: note: while referencing 'bits'
  664 |         unsigned long          *bits[2];
      |                                 ^~~~

Switch the argument to unsigned int, which removes the compiler needing
to consider negative values.

Signed-off-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20250210174504.work.075-kees@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: c45: improve handling of disabled EEE modes in generic ethtool functions

Currently disabled EEE modes are shown as supported in ethtool.
Change this by filtering them out when populating data->supported
in genphy_c45_ethtool_get_eee.
Disabled EEE modes are silently filtered out by genphy_c45_write_eee_adv.
This is planned to be removed, therefore ensure in
genphy_c45_ethtool_set_eee that disabled EEE modes are removed from the
user space provided EEE advertisement. For now keep the current behavior
to do this silently.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://patch.msgid.link/5187c86d-9a5a-482c-974f-cc103ce9738c@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ice: Fix signedness bug in ice_init_interrupt_scheme()

If pci_alloc_irq_vectors() can't allocate the minimum number of vectors
then it returns -ENOSPC so there is no need to check for that in the
caller. In fact, because pf->msix.min is an unsigned int, it means that
any negative error codes are type promoted to high positive values and
treated as success. So here, the "return -ENOMEM;" is unreachable code.
Check for negatives instead.

Now that we're only dealing with error codes, it's easier to propagate
the error code from pci_alloc_irq_vectors() instead of hardcoding
-ENOMEM.

Fixes: 79d97b8cf9a8 ("ice: remove splitting MSI-X between features")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Link: https://patch.msgid.link/b16e4f01-4c85-46e2-b602-fce529293559@stanley.mountain
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: remove phylink_pcs .neg_mode boolean

As all PCS are using the neg_mode parameter rather than the legacy
an_mode, remove the ability to use the legacy an_mode. We remove the
tests in the phylink code, unconditionally passing the PCS neg_mode
parameter to PCS methods, and remove setting the flag from drivers.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1tidPn-0040hd-2R@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

documentation: networking: Add NAPI config

Document the existence of persistent per-NAPI configuration space and
the API that drivers can opt into.

Update stale documentation which suggested that NAPI IDs cannot be
queried from userspace.

Signed-off-by: Joe Damato <jdamato@fastly.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
Link: https://patch.msgid.link/20250213191535.38792-1-jdamato@fastly.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-phy-clean-up-phy-h'

Heiner Kallweit says:

====================
net: phy: clean up phy.h

This series is a starting point to clean up phy.h and remove
definitions which are phylib-internal.
====================

Link: https://patch.msgid.link/d14f8a69-dc21-4ff7-8401-574ffe2f4bc5@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: remove helper phy_is_internal

Helper phy_is_internal() is just used in two places phylib-internally.
So let's remove it from the API.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/f3f35265-80a9-4ed7-ad78-ae22c21e288b@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: stop exporting phy_queue_state_machine

phy_queue_state_machine() isn't used outside phy.c,
so stop exporting it.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/16986d3d-7baf-4b02-a641-e2916d491264@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: stop exporting feature arrays which aren't used outside phylib

Stop exporting feature arrays which aren't used outside phylib.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/01886672-4880-4ca8-b7b0-94d40f6e0ec5@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: remove fixup-related definitions from phy.h which are not used outside phylib

Certain fixup-related definitions aren't used outside phy_device.c.
So make them private and remove them from phy.h.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/ea6fde13-9183-4c7c-8434-6c0eb64fc72c@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-phy-realtek-improve-mmd-register-access-for-internal-phy-s'

Heiner Kallweit says:

====================
net: phy: realtek: improve MMD register access for internal PHY's

The integrated PHYs on chip versions from RTL8168g allow to address
MDIO_MMD_VEND2 registers. All c22 standard registers are mapped to
MDIO_MMD_VEND2 registers. So far the paging mechanism is used to
address PHY registers. Add support for c45 ops to address MDIO_MMD_VEND2
registers directly, w/o the paging.
====================

Link: https://patch.msgid.link/c6a969ef-fd7f-48d6-8c48-4bc548831a8d@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: realtek: switch from paged to MMD ops in rtl822x functions

The MDIO bus provided by r8169 for the internal PHY's now supports
c45 ops for the MDIO_MMD_VEND2 device. So we can switch to standard
MMD ops here.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/81416f95-0fac-4225-87b4-828e3738b8ed@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: realtek: improve mmd register access for internal PHY's

r8169 provides the MDIO bus for the internal PHY's. It has been extended
with c45 access functions for addressing MDIO_MMD_VEND2 registers.
So we can switch from paged access to directly addressing the
MDIO_MMD_VEND2 registers.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/a5f2333c-dda9-48ad-9801-77049766e632@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

r8169: add PHY c45 ops for MDIO_MMD_VENDOR2 registers

The integrated PHYs on chip versions from RTL8168g allow to address
MDIO_MMD_VEND2 registers. All c22 standard registers are mapped to
MDIO_MMD_VEND2 registers. So far the paging mechanism is used to
address PHY registers. Add support for c45 ops to address MDIO_MMD_VEND2
registers directly, w/o the paging.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/d6f97eaa-0f13-468f-89cb-75a41087bc4a@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-phylink-xpcs-stmmac-support-pcs-eee-configuration'

Russell King says:

====================
net: phylink,xpcs,stmmac: support PCS EEE configuration

This series adds support for phylink managed EEE at the PCS level,
allowing xpcs_config_eee() to be removed. Sadly, we still end up with
a XPCS specific function to configure the clock multiplier.
====================

Link: https://patch.msgid.link/Z6naiPpxfxGr1Ic6@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: xpcs: group EEE code together

Move xpcs_config_eee() with the other EEE-related functions.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1thRQd-003w7a-MM@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: xpcs: clean up xpcs_config_eee()

There is now no need to pass the mult_fact into xpcs_config_eee(), so
let's remove that argument and use xpcs->eee_mult_fact directly. While
changing the function signature, as we pass true/false for enable, use
"bool" instead of "int" for this.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1thRQY-003w7U-IG@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: xpcs: remove xpcs_config_eee() from global scope

Make xpcs_config_eee() private to the XPCS driver, called only from
the phylink pcs_disable_eee() and pcs_enable_eee() methods.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1thRQT-003w7O-Ec@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: remove calls to xpcs_config_eee()

Remove the explicit calls to xpcs_config_eee() from the stmmac driver,
preferring instead for phylink to manage the EEE configuration at the
PCS.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1thRQO-003w7I-Ap@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: xpcs: convert to phylink managed EEE

Convert XPCS to use the new pcs_disable_eee() and pcs_enable_eee()
methods. Since stmmac is the only user of xpcs_config_eee(), we can
make this a no-op along with this change.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1thRQJ-003w7C-6v@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: call xpcs_config_eee_mult_fact()

Arrange for stmmac to call the new xpcs_config_eee_mult_fact() function
to configure the EEE clock multiplying factor. This will allow the
removal of the xpcs_config_eee() calls in the next patch.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1thRQE-003w76-3C@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: xpcs: add function to configure EEE clock multiplying factor

Add a function to separate out the EEE clock multiplying factor. This
will be called by the stmmac driver to configure this value.

It would have been better had the driver used the CLK API to retrieve
this clock, get its rate and calculate the appropriate multiplier, but
that door has closed.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1thRQ8-003w70-VT@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phylink: add support for notifying PCS about EEE

There are hooks in the stmmac driver into XPCS to control the EEE
settings when LPI is configured at the MAC. This bypasses the layering.
To allow this to be removed from the stmmac driver, add two new
methods for PCS to inform them when the LPI/EEE enablement state
changes at the MAC.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1thRQ3-003w6u-RH@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'inet-better-inet_sock_set_state-for-passive-flows'

Eric Dumazet says:

====================
inet: better inet_sock_set_state() for passive flows

Small series to make inet_sock_set_state() more interesting for
LISTEN -> TCP_SYN_RECV changes : The 4-tuple parts are now correct.

First patch is a cleanup.
====================

Link: https://patch.msgid.link/20250212131328.1514243-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

inet: consolidate inet_csk_clone_lock()

Current inet_sock_set_state trace from inet_csk_clone_lock() is missing
many details :

... sock:inet_sock_set_state: family=AF_INET6 protocol=IPPROTO_TCP \
    sport=4901 dport=0 \
    saddr=127.0.0.6 daddr=0.0.0.0 \
    saddrv6=:: daddrv6=:: \
    oldstate=TCP_LISTEN newstate=TCP_SYN_RECV

Only the sport gives the listener port, no other parts of the n-tuple are correct.

In this patch, I initialize relevant fields of the new socket before
calling inet_sk_set_state(newsk, TCP_SYN_RECV).

We now have a trace including all the source/destination bits.

... sock:inet_sock_set_state: family=AF_INET6 protocol=IPPROTO_TCP \
    sport=4901 dport=47648 \
    saddr=127.0.0.6 daddr=127.0.0.6 \
    saddrv6=2002:a05:6830:1f85:: daddrv6=2001:4860:f803:65::3 \
    oldstate=TCP_LISTEN newstate=TCP_SYN_RECV

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250212131328.1514243-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

inet: reduce inet_csk_clone_lock() indent level

Return early from inet_csk_clone_lock() if the socket
allocation failed, to reduce the indentation level.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250212131328.1514243-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: refactor clock management in EQoS driver

Refactor clock management in EQoS driver for code reuse and to avoid
redundancy. This way, only minimal changes are required when a new platform
is added.

Suggested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Swathi K S <swathi.ks@samsung.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20250213041559.106111-1-swathi.ks@samsung.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: airoha: Fix TSO support for header cloned skbs

For GSO packets, skb_cow_head() will reallocate the skb for TSO header
cloned skbs in airoha_dev_xmit(). For this reason, sinfo pointer can be
no more valid. Fix the issue relying on skb_shinfo() macro directly in
airoha_dev_xmit().

The problem exists since
commit 23020f049327 ("net: airoha: Introduce ethernet support for EN7581 SoC")
but it is not a user visible, since we can't currently enable TSO
for DSA user ports since we are missing to initialize net_device
vlan_features field.

Reviewed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/20250213-airoha-en7581-flowtable-offload-v4-1-b69ca16d74db@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-add-export_ipv6_mod'

Eric Dumazet says:

====================
net: add EXPORT_IPV6_MOD()

In this series I am adding EXPORT_IPV6_MOD and EXPORT_IPV6_MOD_GPL()
so that we can replace some EXPORT_SYMBOL() when IPV6 is
not modular.

This is making all the selected symbols internal to core
linux networking.

v1: https://lore.kernel.org/20250210082805.465241-2-edumazet@google.com
====================

Link: https://patch.msgid.link/20250212132418.1524422-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

udp: use EXPORT_IPV6_MOD[_GPL]()

Use EXPORT_IPV6_MOD[_GPL]() for symbols that don't need
to be exported unless CONFIG_IPV6=m

udp_table is no longer used from any modules, and does not
need to be exported anyway.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Link: https://patch.msgid.link/20250212132418.1524422-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tcp: use EXPORT_IPV6_MOD[_GPL]()

Use EXPORT_IPV6_MOD[_GPL]() for symbols that don't need
to be exported unless CONFIG_IPV6=m

tcp_hashinfo and tcp_openreq_init_rwin() are no longer
used from any module anyway.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Link: https://patch.msgid.link/20250212132418.1524422-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

inetpeer: use EXPORT_IPV6_MOD[_GPL]()

Use EXPORT_IPV6_MOD[_GPL]() for symbols that do not need to
to be exported unless CONFIG_IPV6=m

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Link: https://patch.msgid.link/20250212132418.1524422-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: introduce EXPORT_IPV6_MOD() and EXPORT_IPV6_MOD_GPL()

We have many EXPORT_SYMBOL(x) in networking tree because IPv6
can be built as a module.

CONFIG_IPV6=y is becoming the norm.

Define a EXPORT_IPV6_MOD(x) which only exports x
for modular IPv6.

Same principle applies to EXPORT_IPV6_MOD_GPL()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Link: https://patch.msgid.link/20250212132418.1524422-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

iavf: add support for Rx timestamps to hotpath

Add support for receive timestamps to the Rx hotpath. This support only
works when using the flexible descriptor format, so make sure that we
request this format by default if we have receive timestamp support
available in the PTP capabilities.

In order to report the timestamps to userspace, we need to perform
timestamp extension. The Rx descriptor does actually contain the "40
bit" timestamp. However, upper 32 bits which contain nanoseconds are
conveniently stored separately in the descriptor. We could extract the
32bits and lower 8 bits, then perform a bitwise OR to calculate the
40bit value. This makes no sense, because the timestamp extension
algorithm would simply discard the lower 8 bits anyways.

Thus, implement timestamp extension as iavf_ptp_extend_32b_timestamp(),
and extract and forward only the 32bits of nominal nanoseconds.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Reviewed-by: Sunil Goutham <sgoutham@marvell.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

iavf: handle set and get timestamps ops

Add handlers for the .ndo_hwtstamp_get and .ndo_hwtstamp_set ops which
allow userspace to request timestamp enablement for the device. This
support allows standard Linux applications to request the timestamping
desired.

As with other devices that support timestamping all packets, the driver
will upgrade any request for timestamping of a specific type of packet
to HWTSTAMP_FILTER_ALL.

The current configuration is stored, so that it can be retrieved by
calling .ndo_hwtstamp_get

The Tx timestamps are not implemented yet so calling set ops for
Tx path will end with EOPNOTSUPP error code.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Co-developed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

iavf: Implement checking DD desc field

Rx timestamping introduced in PF driver caused the need of refactoring
the VF driver mechanism to check packet fields.

The function to check errors in descriptor has been removed and from
now only previously set struct fields are being checked. The field DD
(descriptor done) needs to be checked at the very beginning, before
extracting other fields.

Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

iavf: refactor iavf_clean_rx_irq to support legacy and flex descriptors

Using VIRTCHNL_VF_OFFLOAD_FLEX_DESC, the iAVF driver is capable of
negotiating to enable the advanced flexible descriptor layout. Add the
flexible NIC layout (RXDID=2) as a member of the Rx descriptor union.

Also add bit position definitions for the status and error indications
that are needed.

The iavf_clean_rx_irq function needs to extract a few fields from the Rx
descriptor, including the size, rx_ptype, and vlan_tag.
Move the extraction to a separate function that decodes the fields into
a structure. This will reduce the burden for handling multiple
descriptor types by keeping the relevant extraction logic in one place.

To support handling an additional descriptor format with minimal code
duplication, refactor Rx checksum handling so that the general logic
is separated from the bit calculations. Introduce an iavf_rx_desc_decoded
structure which holds the relevant bits decoded from the Rx descriptor.
This will enable implementing flexible descriptor handling without
duplicating the general logic twice.

Introduce an iavf_extract_flex_rx_fields, iavf_flex_rx_hash, and
iavf_flex_rx_csum functions which operate on the flexible NIC descriptor
format instead of the legacy 32 byte format. Based on the negotiated
RXDID, select the correct function for processing the Rx descriptors.

With this change, the Rx hot path should be functional when using either
the default legacy 32byte format or when we switch to the flexible NIC
layout.

Modify the Rx hot path to add support for the flexible descriptor
format and add request enabling Rx timestamps for all queues.

As in ice, make sure we bump the checksum level if the hardware detected
a packet type which could have an outer checksum. This is important
because hardware only verifies the inner checksum.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Co-developed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

iavf: define Rx descriptors as qwords

The union iavf_32byte_rx_desc consists of two unnamed structs defined
inside. One of them represents legacy 32 byte descriptor and second the
16 byte descriptor (extended to 32 byte). Each of them consists of
bunch of unions, structs and __le fields that represent specific fields
in descriptor.

This commit changes the representation of iavf_32byte_rx_desc union
to store four __le64 fields (qw0, qw1, qw2, qw3) that represent
quad-words. Those quad-words will be then accessed by calling
leXY_get_bits macros in upcoming commits.

Suggested-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

libeth: move idpf_rx_csum_decoded and idpf_rx_extracted

Structs idpf_rx_csum_decoded and idpf_rx_extracted are used both in
idpf and iavf Intel drivers. Change the prefix from idpf_* to libeth_*
and move mentioned structs to libeth's rx.h header file.

Adjust usage in idpf driver.

Suggested-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>