]> www.infradead.org Git - users/hch/block.git/log
users/hch/block.git
19 months agonet: ipa: introduce ipa_interrupt_init()
Alex Elder [Fri, 1 Mar 2024 17:02:37 +0000 (11:02 -0600)]
net: ipa: introduce ipa_interrupt_init()

Create a new function ipa_interrupt_init() that is called at probe
time to allocate and initialize the IPA interrupt data structure.
Create ipa_interrupt_exit() as its inverse.

This follows the normal IPA driver pattern of *_init() functions
doing things that can be done before access to hardware is required.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agonet: ipa: change ipa_interrupt_config() prototype
Alex Elder [Fri, 1 Mar 2024 17:02:36 +0000 (11:02 -0600)]
net: ipa: change ipa_interrupt_config() prototype

Change the return type of ipa_interrupt_config() to be an error
code rather than an IPA interrupt structure pointer, and assign the
the pointer within that function.

Change ipa_interrupt_deconfig() to take the IPA pointer as argument
and have it invalidate the ipa->interrupt pointer.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoMerge branch 'mptcp-lowat-sockopt'
David S. Miller [Mon, 4 Mar 2024 10:50:28 +0000 (10:50 +0000)]
Merge branch 'mptcp-lowat-sockopt'

Matthieu Baerts says:

====================
mptcp: add TCP_NOTSENT_LOWAT sockopt support

Patch 3 does the magic of adding TCP_NOTSENT_LOWAT support, all the
other ones are minor cleanup seen along when working on the new
feature.

Note that this feature relies on the existing accounting for snd_nxt.
Such accounting is not 110% accurate as it tracks the most recent
sequence number queued to any subflow, and not the actual sequence
number sent on the wire. Paolo experimented a lot, trying to implement
the latter, and in the end it proved to be both "too complex" and "not
necessary".

The complexity raises from the need for additional lock and a lot of
refactoring to introduce such protections without adding significant
overhead. Additionally, snd_nxt is currently used and exposed with the
current semantic by the internal packet scheduling. Introducing a
different tracking will still require us to keep the old one.

More interestingly, a more accurate tracking could be not strictly
necessary: as the MPTCP socket enqueues data to the subflows only up
to the available send window, any enqueue data is sent on the wire
instantly, without any blocking operation short or a drop in the tx
path at the nft or TC layer.
====================

Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
19 months agomptcp: cleanup SOL_TCP handling
Paolo Abeni [Fri, 1 Mar 2024 17:43:47 +0000 (18:43 +0100)]
mptcp: cleanup SOL_TCP handling

Most TCP-level socket options get an integer from user space, and
set the corresponding field under the msk-level socket lock.

Reduce the code duplication moving such operations in the common code.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agomptcp: implement TCP_NOTSENT_LOWAT support
Paolo Abeni [Fri, 1 Mar 2024 17:43:46 +0000 (18:43 +0100)]
mptcp: implement TCP_NOTSENT_LOWAT support

Add support for such socket option storing the user-space provided
value in a new msk field, and using such data to implement the
_mptcp_stream_memory_free() helper, similar to the TCP one.

To avoid adding more indirect calls in the fast path, open-code
a variant of sk_stream_memory_free() in mptcp_sendmsg() and add
direct calls to the mptcp stream memory free helper where possible.

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/464
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agomptcp: avoid some duplicate code in socket option handling
Paolo Abeni [Fri, 1 Mar 2024 17:43:45 +0000 (18:43 +0100)]
mptcp: avoid some duplicate code in socket option handling

The mptcp_get_int_option() helper is needless open-coded in a
couple of places, replace the duplicate code with the helper
call.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agomptcp: cleanup writer wake-up
Paolo Abeni [Fri, 1 Mar 2024 17:43:44 +0000 (18:43 +0100)]
mptcp: cleanup writer wake-up

After commit 5cf92bbadc58 ("mptcp: re-enable sndbuf autotune"), the
MPTCP_NOSPACE bit is redundant: it is always set and cleared together with
SOCK_NOSPACE.

Let's drop the first and always relay on the latter, dropping a bunch
of useless code.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agonet: nlmon: Simplify nlmon_get_stats64
Breno Leitao [Fri, 1 Mar 2024 13:42:14 +0000 (05:42 -0800)]
net: nlmon: Simplify nlmon_get_stats64

Do not set rtnl_link_stats64 fields to zero, since they are zeroed
before ops->ndo_get_stats64 is called in core dev_get_stats() function.

Also, simplify the data collection by removing the temporary variable.

Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agonet: nlmon: Remove init and uninit functions
Breno Leitao [Fri, 1 Mar 2024 13:42:13 +0000 (05:42 -0800)]
net: nlmon: Remove init and uninit functions

With commit 34d21de99cea9 ("net: Move {l,t,d}stats allocation to core and
convert veth & vrf"), stats allocation could be done on net core
instead of this driver.

With this new approach, the driver doesn't have to bother with error
handling (allocation failure checking, making sure free happens in the
right spot, etc). This is core responsibility now.

Remove the allocation in the nlmon driver and leverage the network
core allocation.

Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoOcteontx2-af: Fix an issue in firmware shared data reserved space
Hariprasad Kelam [Fri, 1 Mar 2024 11:03:14 +0000 (16:33 +0530)]
Octeontx2-af: Fix an issue in firmware shared data reserved space

The last patch which added support to extend the firmware shared
data to add channel data information has introduced a bug due to
the reserved space not adjusted accordingly.

This patch fixes the issue and also adds BUILD_BUG to avoid this
regression error.

Fixes: 997814491cee ("Octeontx2-af: Fetch MAC channel info from firmware")
Signed-off-by: Hariprasad Kelam <hkelam@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoeth: igc: remove unused embedded struct net_device
Jakub Kicinski [Fri, 1 Mar 2024 07:02:55 +0000 (23:02 -0800)]
eth: igc: remove unused embedded struct net_device

struct net_device poll_dev in struct igc_q_vector was added
in one of the initial commits, but never used.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoMerge branch 'net-gve-header-split-support'
David S. Miller [Mon, 4 Mar 2024 10:03:32 +0000 (10:03 +0000)]
Merge branch 'net-gve-header-split-support'

Ziwei Xiao says:

====================
gve: Add header split support

Currently, the ethtool's ringparam has added a new field
tcp-data-split for enabling and disabling header split. These three
patches will utilize that ethtool flag to support header split in GVE
driver.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agogve: Add header split ethtool stats
Jeroen de Borst [Thu, 29 Feb 2024 21:22:36 +0000 (13:22 -0800)]
gve: Add header split ethtool stats

To record the stats of header split packets, three stats are added in
the driver's ethtool stats.

- rx_hsplit_pkt is the split packets count with header split
- rx_hsplit_bytes is the received header bytes count with header split
- rx_hsplit_unsplit_pkt is the unsplit packet count due to header buffer
  overflow or zero header length when header split is enabled

Currently, it's entering the stats_update critical section more than
once per packet. We have plans to avoid that in the future change to let
all the stats_update happen in one place at the end of
`gve_rx_poll_dqo`.

Co-developed-by: Ziwei Xiao <ziweixiao@google.com>
Signed-off-by: Ziwei Xiao <ziweixiao@google.com>
Signed-off-by: Jeroen de Borst <jeroendb@google.com>
Reviewed-by: Praveen Kaligineedi <pkaligineedi@google.com>
Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agogve: Add header split data path
Jeroen de Borst [Thu, 29 Feb 2024 21:22:35 +0000 (13:22 -0800)]
gve: Add header split data path

Add header buffers and ethtool support to enable header split via the
tcp-data-split flag in ethtool's ringparam config. A coherent dma memory
is allocated for the header buffers. There is one header buffer per ring
entry by calculating the offset to the header-buffers starting address.
The header buffer is always copied directly into the skb and payload is
always added as frags. When there is a header buffer overflow or the
header length is 0, the driver places the whole unsplit packet in frags.

When toggling header split, the driver will call gve_adjust_config to
set its queues appropriately. If header split is enabled by the user and
the max packet buffer size is no less than 4KB, driver will set the
packet buffer size as 4KB to support TCP_ZEROCOPY_RECEIVE. Otherwise the
driver will use the default 2KB as the packet buffer size.

`ethtool -G <dev> tcp-data-split on/off` is the command to toggle header
split.
`ethtool -g <dev>` will show the status of header split with the field
of `tcp-data-split`.

Co-developed-by: Ziwei Xiao <ziweixiao@google.com>
Signed-off-by: Ziwei Xiao <ziweixiao@google.com>
Signed-off-by: Jeroen de Borst <jeroendb@google.com>
Reviewed-by: Praveen Kaligineedi <pkaligineedi@google.com>
Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agogve: Add header split device option
Jeroen de Borst [Thu, 29 Feb 2024 21:22:34 +0000 (13:22 -0800)]
gve: Add header split device option

To enable header split via ethtool, we first need to query the device to
get the max rx buffer size and header buffer size. Add a device option
to get these values and store them in the driver. If the header buffer
size received from the device is non-zero, it means header split is
supported in the device.

Currently the max rx buffer size will only be used when header split is
enabled which will set the data_buffer_size_dqo to be the max rx buffer
size. Also change the data_buffer_size_dqo from int to u16 since we are
modifying it and making it to be consistent with max_rx_buffer_size.

Co-developed-by: Ziwei Xiao <ziweixiao@google.com>
Signed-off-by: Ziwei Xiao <ziweixiao@google.com>
Signed-off-by: Jeroen de Borst <jeroendb@google.com>
Reviewed-by: Praveen Kaligineedi <pkaligineedi@google.com>
Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agonet: ip6_tunnel: Leverage core stats allocator
Breno Leitao [Wed, 28 Feb 2024 18:03:17 +0000 (10:03 -0800)]
net: ip6_tunnel: Leverage core stats allocator

With commit 34d21de99cea9 ("net: Move {l,t,d}stats allocation to core and
convert veth & vrf"), stats allocation could be done on net core
instead of in this driver.

With this new approach, the driver doesn't have to bother with error
handling (allocation failure checking, making sure free happens in the
right spot, etc). This is core responsibility now.

Remove the allocation in the ip6_tunnel driver and leverage the network
core allocation instead.

Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoMerge branch 'ionic-cleanups-and-perf-tuning'
David S. Miller [Mon, 4 Mar 2024 09:38:14 +0000 (09:38 +0000)]
Merge branch 'ionic-cleanups-and-perf-tuning'

Shannon Nelson says:

====================
ionic: code cleanup and performance tuning

Brett has been performance testing and code tweaking and has
come up with several improvements for our fast path operations.

In a simple single thread / single queue iperf case on a 1500 MTU
connection we see an improvement from 74.2 to 86.7 Gbits/sec.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoionic: change MODULE_AUTHOR to person name
Shannon Nelson [Thu, 29 Feb 2024 19:39:35 +0000 (11:39 -0800)]
ionic: change MODULE_AUTHOR to person name

The MODULE_AUTHOR macro is supposed to be a person
not a company.

Reviewed-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoionic: Clean RCT ordering issues
Brett Creeley [Thu, 29 Feb 2024 19:39:34 +0000 (11:39 -0800)]
ionic: Clean RCT ordering issues

Clean up complaints from an xmastree.py scan.

Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoionic: Use CQE profile for dim
Brett Creeley [Thu, 29 Feb 2024 19:39:33 +0000 (11:39 -0800)]
ionic: Use CQE profile for dim

Use the kernel's CQE dim table to align better with the
driver's use of completion queues, and use the tx moderation
when using Tx interrupts.

Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoionic: change the hwstamp likely check
Brett Creeley [Thu, 29 Feb 2024 19:39:32 +0000 (11:39 -0800)]
ionic: change the hwstamp likely check

An earlier change moved the hwstamp queue check into a helper
function with an unlikely(). However, it makes more sense for
the caller to decide if it's likely() or unlikely(), so make
the change to support that.

Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoionic: reduce the use of netdev
Shannon Nelson [Thu, 29 Feb 2024 19:39:31 +0000 (11:39 -0800)]
ionic: reduce the use of netdev

To help make sure we're only accessing things we really need
to access we can cut down on the q->lif->netdev references by
using q->dev which is already in cache.

Reviewed-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoionic: Pass local netdev instead of referencing struct
Brett Creeley [Thu, 29 Feb 2024 19:39:30 +0000 (11:39 -0800)]
ionic: Pass local netdev instead of referencing struct

Instead of using q->lif->netdev, just pass the netdev when it's
locally defined.

Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoionic: Check stop no restart
Brett Creeley [Thu, 29 Feb 2024 19:39:29 +0000 (11:39 -0800)]
ionic: Check stop no restart

If there is a lot of transmit traffic the driver can get into a
situation that the device is starved due to the doorbell never
being rung. This can happen if xmit_more is set constantly
and __netdev_tx_sent_queue() keeps returning false. Fix this
by checking if the queue needs to be stopped right before
calling __netdev_tx_sent_queue(). Use MAX_SKB_FRAGS + 1 as the
stop condition because that's the maximum number of frags
supported for non-TSO transmit.

Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoionic: Clean up BQL logic
Brett Creeley [Thu, 29 Feb 2024 19:39:28 +0000 (11:39 -0800)]
ionic: Clean up BQL logic

The driver currently calls netdev_tx_completed_queue() for every
Tx completion. However, this API is only meant to be called once
per NAPI if any Tx work is done.  Make the necessary changes to
support calling netdev_tx_completed_queue() only once per NAPI.

Also, use the __netdev_tx_sent_queue() API, which supports the
xmit_more functionality.

Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoionic: Make use napi_consume_skb
Brett Creeley [Thu, 29 Feb 2024 19:39:27 +0000 (11:39 -0800)]
ionic: Make use napi_consume_skb

Make use of napi_consume_skb so that skb recycling
can happen by way of the napi_skb_cache.

Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoionic: Shorten a Tx hotpath
Brett Creeley [Thu, 29 Feb 2024 19:39:26 +0000 (11:39 -0800)]
ionic: Shorten a Tx hotpath

Perf was showing some hot spots in ionic_tx_descs_needed()
for TSO traffic.  Rework the function to return sooner where
possible.

Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoionic: Change default number of descriptors for Tx and Rx
Brett Creeley [Thu, 29 Feb 2024 19:39:25 +0000 (11:39 -0800)]
ionic: Change default number of descriptors for Tx and Rx

Cut down the number of default Tx and Rx descriptors to save
initial memory requirements.

Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoionic: Rework Tx start/stop flow
Brett Creeley [Thu, 29 Feb 2024 19:39:24 +0000 (11:39 -0800)]
ionic: Rework Tx start/stop flow

Currently the driver attempts to wake the Tx queue
for every descriptor processed. However, this is
overkill and can cause thrashing since Tx xmit can be
running concurrently on a different CPU than Tx clean.
Fix this by refactoring Tx cq servicing into its own
function so the Tx wake code can run after processing
all Tx descriptors.

The driver isn't using the expected memory barriers
to make sure the stop/start bits are coherent. Fix
this by  making sure to use the correct memory barriers.

Also, the driver is using the wake API during Tx
xmit even though it's already scheduled. Fix this by
using the start API during Tx xmit.

Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agodt-bindings: leds: pwm-multicolour: re-allow active-low
Conor Dooley [Thu, 29 Feb 2024 18:24:00 +0000 (18:24 +0000)]
dt-bindings: leds: pwm-multicolour: re-allow active-low

active-low was lifted to the common schema for leds, but it went
unnoticed that the leds-multicolour binding had "additionalProperties:
false" where the other users had "unevaluatedProperties: false", thereby
disallowing active-low for multicolour leds. Explicitly permit it again.

Fixes: c94d1783136e ("dt-bindings: net: phy: Make LED active-low property common")
Acked-by: Rob Herring <robh@kernel.org>
Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
Acked-by: Lee Jones <lee@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agonet: bareudp: Remove generic .ndo_get_stats64
Breno Leitao [Thu, 29 Feb 2024 17:04:24 +0000 (09:04 -0800)]
net: bareudp: Remove generic .ndo_get_stats64

Commit 3e2f544dd8a33 ("net: get stats64 if device if driver is
configured") moved the callback to dev_get_tstats64() to net core, so,
unless the driver is doing some custom stats collection, it does not
need to set .ndo_get_stats64.

Since this driver is now relying in NETDEV_PCPU_STAT_TSTATS, then, it
doesn't need to set the dev_get_tstats64() generic .ndo_get_stats64
function pointer.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agonet: bareudp: Do not allocate stats in the driver
Breno Leitao [Thu, 29 Feb 2024 17:04:23 +0000 (09:04 -0800)]
net: bareudp: Do not allocate stats in the driver

With commit 34d21de99cea9 ("net: Move {l,t,d}stats allocation to core and
convert veth & vrf"), stats allocation could be done on net core
instead of this driver.

With this new approach, the driver doesn't have to bother with error
handling (allocation failure checking, making sure free happens in the
right spot, etc). This is core responsibility now.

Remove the allocation in the bareudp driver and leverage the network
core allocation.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoMerge branch 'skb-helpers'
David S. Miller [Mon, 4 Mar 2024 08:47:07 +0000 (08:47 +0000)]
Merge branch 'skb-helpers'

Eric Dumazet says:

====================
net: better use of skb helpers

First patch is a pure cleanup.

Second patch adds a DEBUG_NET_WARN_ON_ONCE() in skb_network_header_len(),
this could help to discover old bugs.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agonet: adopt skb_network_header_len() more broadly
Eric Dumazet [Thu, 29 Feb 2024 09:39:08 +0000 (09:39 +0000)]
net: adopt skb_network_header_len() more broadly

(skb_transport_header(skb) - skb_network_header(skb))
can be replaced by skb_network_header_len(skb)

Add a DEBUG_NET_WARN_ON_ONCE() in skb_network_header_len()
to catch cases were the transport_header was not set.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agonet: adopt skb_network_offset() and similar helpers
Eric Dumazet [Thu, 29 Feb 2024 09:39:07 +0000 (09:39 +0000)]
net: adopt skb_network_offset() and similar helpers

This is a cleanup patch, making code a bit more concise.

1) Use skb_network_offset(skb) in place of
       (skb_network_header(skb) - skb->data)

2) Use -skb_network_offset(skb) in place of
       (skb->data - skb_network_header(skb))

3) Use skb_transport_offset(skb) in place of
       (skb_transport_header(skb) - skb->data)

4) Use skb_inner_transport_offset(skb) in place of
       (skb_inner_transport_header(skb) - skb->data)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Edward Cree <ecree.xilinx@gmail.com> # for sfc
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoMerge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf...
Jakub Kicinski [Sun, 3 Mar 2024 04:50:59 +0000 (20:50 -0800)]
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2024-02-29

We've added 119 non-merge commits during the last 32 day(s) which contain
a total of 150 files changed, 3589 insertions(+), 995 deletions(-).

The main changes are:

1) Extend the BPF verifier to enable static subprog calls in spin lock
   critical sections, from Kumar Kartikeya Dwivedi.

2) Fix confusing and incorrect inference of PTR_TO_CTX argument type
   in BPF global subprogs, from Andrii Nakryiko.

3) Larger batch of riscv BPF JIT improvements and enabling inlining
   of the bpf_kptr_xchg() for RV64, from Pu Lehui.

4) Allow skeleton users to change the values of the fields in struct_ops
   maps at runtime, from Kui-Feng Lee.

5) Extend the verifier's capabilities of tracking scalars when they
   are spilled to stack, especially when the spill or fill is narrowing,
   from Maxim Mikityanskiy & Eduard Zingerman.

6) Various BPF selftest improvements to fix errors under gcc BPF backend,
   from Jose E. Marchesi.

7) Avoid module loading failure when the module trying to register
   a struct_ops has its BTF section stripped, from Geliang Tang.

8) Annotate all kfuncs in .BTF_ids section which eventually allows
   for automatic kfunc prototype generation from bpftool, from Daniel Xu.

9) Several updates to the instruction-set.rst IETF standardization
   document, from Dave Thaler.

10) Shrink the size of struct bpf_map resp. bpf_array,
    from Alexei Starovoitov.

11) Initial small subset of BPF verifier prepwork for sleepable bpf_timer,
    from Benjamin Tissoires.

12) Fix bpftool to be more portable to musl libc by using POSIX's
    basename(), from Arnaldo Carvalho de Melo.

13) Add libbpf support to gcc in CORE macro definitions,
    from Cupertino Miranda.

14) Remove a duplicate type check in perf_event_bpf_event,
    from Florian Lehner.

15) Fix bpf_spin_{un,}lock BPF helpers to actually annotate them
    with notrace correctly, from Yonghong Song.

16) Replace the deprecated bpf_lpm_trie_key 0-length array with flexible
    array to fix build warnings, from Kees Cook.

17) Fix resolve_btfids cross-compilation to non host-native endianness,
    from Viktor Malik.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (119 commits)
  selftests/bpf: Test if shadow types work correctly.
  bpftool: Add an example for struct_ops map and shadow type.
  bpftool: Generated shadow variables for struct_ops maps.
  libbpf: Convert st_ops->data to shadow type.
  libbpf: Set btf_value_type_id of struct bpf_map for struct_ops.
  bpf: Replace bpf_lpm_trie_key 0-length array with flexible array
  bpf, arm64: use bpf_prog_pack for memory management
  arm64: patching: implement text_poke API
  bpf, arm64: support exceptions
  arm64: stacktrace: Implement arch_bpf_stack_walk() for the BPF JIT
  bpf: add is_async_callback_calling_insn() helper
  bpf: introduce in_sleepable() helper
  bpf: allow more maps in sleepable bpf programs
  selftests/bpf: Test case for lacking CFI stub functions.
  bpf: Check cfi_stubs before registering a struct_ops type.
  bpf: Clarify batch lookup/lookup_and_delete semantics
  bpf, docs: specify which BPF_ABS and BPF_IND fields were zero
  bpf, docs: Fix typos in instruction-set.rst
  selftests/bpf: update tcp_custom_syncookie to use scalar packet offset
  bpf: Shrink size of struct bpf_map/bpf_array.
  ...
====================

Link: https://lore.kernel.org/r/20240301001625.8800-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
19 months agoMerge branch 'inet_dump_ifaddr-no-rtnl'
David S. Miller [Fri, 1 Mar 2024 11:09:40 +0000 (11:09 +0000)]
Merge branch 'inet_dump_ifaddr-no-rtnl'

Eric Dumazet says:

====================
inet: no longer use RTNL to protect inet_dump_ifaddr()

This series convert inet so that a dump of addresses (ip -4 addr)
no longer requires RTNL.
====================

Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoinet: use xa_array iterator to implement inet_dump_ifaddr()
Eric Dumazet [Thu, 29 Feb 2024 11:40:16 +0000 (11:40 +0000)]
inet: use xa_array iterator to implement inet_dump_ifaddr()

1) inet_dump_ifaddr() can can run under RCU protection
   instead of RTNL.

2) properly return 0 at the end of a dump, avoiding an
   an extra recvmsg() system call.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoinet: prepare inet_base_seq() to run without RTNL
Eric Dumazet [Thu, 29 Feb 2024 11:40:15 +0000 (11:40 +0000)]
inet: prepare inet_base_seq() to run without RTNL

In the following patch, inet_base_seq() will no longer be called
with RTNL held.

Add READ_ONCE()/WRITE_ONCE() annotations in dev_base_seq_inc()
and inet_base_seq().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoinet: annotate data-races around ifa->ifa_flags
Eric Dumazet [Thu, 29 Feb 2024 11:40:14 +0000 (11:40 +0000)]
inet: annotate data-races around ifa->ifa_flags

ifa->ifa_flags can be read locklessly.

Add appropriate READ_ONCE()/WRITE_ONCE() annotations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoinet: annotate data-races around ifa->ifa_preferred_lft
Eric Dumazet [Thu, 29 Feb 2024 11:40:13 +0000 (11:40 +0000)]
inet: annotate data-races around ifa->ifa_preferred_lft

ifa->ifa_preferred_lft can be read locklessly.

Add appropriate READ_ONCE()/WRITE_ONCE() annotations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoinet: annotate data-races around ifa->ifa_valid_lft
Eric Dumazet [Thu, 29 Feb 2024 11:40:12 +0000 (11:40 +0000)]
inet: annotate data-races around ifa->ifa_valid_lft

ifa->ifa_valid_lft can be read locklessly.

Add appropriate READ_ONCE()/WRITE_ONCE() annotations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoinet: annotate data-races around ifa->ifa_tstamp and ifa->ifa_cstamp
Eric Dumazet [Thu, 29 Feb 2024 11:40:11 +0000 (11:40 +0000)]
inet: annotate data-races around ifa->ifa_tstamp and ifa->ifa_cstamp

ifa->ifa_tstamp can be read locklessly.

Add appropriate READ_ONCE()/WRITE_ONCE() annotations.

Do the same for ifa->ifa_cstamp to prepare upcoming changes.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoMerge branch 'netdevsim-link'
David S. Miller [Fri, 1 Mar 2024 10:43:11 +0000 (10:43 +0000)]
Merge branch 'netdevsim-link'

David Wei says:

====================
netdevsim: link and forward skbs between ports

This patchset adds the ability to link two netdevsim ports together and
forward skbs between them, similar to veth. The goal is to use netdevsim
for testing features e.g. zero copy Rx using io_uring.

This feature was tested locally on QEMU, and a selftest is included.

I ran netdev selftests CI style and all tests but the following passed:
- gro.sh
- l2tp.sh
- ip_local_port_range.sh

gro.sh fails because virtme-ng mounts as read-only and it tries to write
to log.txt. This issue was reported to virtme-ng upstream.

l2tp.sh and ip_local_port_range.sh both fail for me on net-next/main as
well.

---
v13->v14:
- implement ndo_get_iflink()
- fix returning 0 if peer is already linked during linking or not linked
  during unlinking
- bump dropped counter if nsim_ipsec_tx() fails and generally reorder
  nsim_start_xmit()
- fix overflowing lines and indentations

v12->v13:
- wait for socat listening port to be ready before sending data in
  selftest

v11->v12:
- fix leaked netns refs
- fix rtnetlink.sh kci_test_ipsec_offload() selftest

v10->v11:
- add udevadm settle after creating netdevsims in selftest

v9->v10:
- fix not freeing skb when not there is no peer
- prevent possible id clashes in selftest
- cleanup selftest on error paths

v8->v9:
- switch to getting netns using fd rather than id
- prevent linking a netdevsim to itself
- update tests

v7->v8:
- fix not dereferencing RCU ptr using rcu_dereference()
- remove unused variables in selftest

v6->v7:
- change link syntax to netnsid:ifidx
- replace dev_get_by_index() with __dev_get_by_index()
- check for NULL peer when linking
- add a sysfs attribute for unlinking
- only update Tx stats if not dropped
- update selftest

v5->v6:
- reworked to link two netdevsims using sysfs attribute on the bus
  device instead of debugfs due to deadlock possibility if a netdevsim
  is removed during linking
- removed unnecessary patch maintaining a list of probed nsim_devs
- updated selftest

v4->v5:
- reduce nsim_dev_list_lock critical section
- fixed missing mutex unlock during unwind ladder
- rework nsim_dev_peer_write synchronization to take devlink lock as
  well as rtnl_lock
- return err msgs to user during linking if port doesn't exist or
  linking to self
- update tx stats outside of RCU lock

v3->v4:
- maintain a mutex protected list of probed nsim_devs instead of using
  nsim_bus_dev
- fixed synchronization issues by taking rtnl_lock
- track tx_dropped skbs

v2->v3:
- take lock when traversing nsim_bus_dev_list
- take device ref when getting a nsim_bus_dev
- return 0 if nsim_dev_peer_read cannot find the port
- address code formatting
- do not hard code values in selftests
- add Makefile for selftests

v1->v2:
- renamed debugfs file from "link" to "peer"
- replaced strstep() with sscanf() for consistency
- increased char[] buf sz to 22 for copying id + port from user
- added err msg w/ expected fmt when linking as a hint to user
- prevent linking port to itself
- protect peer ptr using RCU

====================

Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agonetdevsim: fix rtnetlink.sh selftest
David Wei [Wed, 28 Feb 2024 23:22:53 +0000 (15:22 -0800)]
netdevsim: fix rtnetlink.sh selftest

I cleared IFF_NOARP flag from netdevsim dev->flags in order to support
skb forwarding. This breaks the rtnetlink.sh selftest
kci_test_ipsec_offload() test because ipsec does not connect to peers it
cannot transmit to.

Fix the issue by adding a neigh entry manually. ipsec_offload test now
successfully pass.

Signed-off-by: David Wei <dw@davidwei.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agonetdevsim: add selftest for forwarding skb between connected ports
David Wei [Wed, 28 Feb 2024 23:22:52 +0000 (15:22 -0800)]
netdevsim: add selftest for forwarding skb between connected ports

Connect two netdevsim ports in different namespaces together, then send
packets between them using socat.

Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Maciek Machnikowski <maciek@machnikowski.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agonetdevsim: add ndo_get_iflink() implementation
David Wei [Wed, 28 Feb 2024 23:22:51 +0000 (15:22 -0800)]
netdevsim: add ndo_get_iflink() implementation

Add an implementation for ndo_get_iflink() in netdevsim that shows the
ifindex of the linked peer, if any.

Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Maciek Machnikowski <maciek@machnikowski.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agonetdevsim: forward skbs from one connected port to another
David Wei [Wed, 28 Feb 2024 23:22:50 +0000 (15:22 -0800)]
netdevsim: forward skbs from one connected port to another

Forward skbs sent from one netdevsim port to its connected netdevsim
port using dev_forward_skb, in a spirit similar to veth.

Add a tx_dropped variable to struct netdevsim, tracking the number of
skbs that could not be forwarded using dev_forward_skb().

The xmit() function accessing the peer ptr is protected by an RCU read
critical section. The rcu_read_lock() is functionally redundant as since
v5.0 all softirqs are implicitly RCU read critical sections; but it is
useful for human readers.

If another CPU is concurrently in nsim_destroy(), then it will first set
the peer ptr to NULL. This does not affect any existing readers that
dereferenced a non-NULL peer. Then, in unregister_netdevice(), there is
a synchronize_rcu() before the netdev is actually unregistered and
freed. This ensures that any readers i.e. xmit() that got a non-NULL
peer will complete before the netdev is freed.

Any readers after the RCU_INIT_POINTER() but before synchronize_rcu()
will dereference NULL, making it safe.

The codepath to nsim_destroy() and nsim_create() takes both the newly
added nsim_dev_list_lock and rtnl_lock. This makes it safe with
concurrent calls to linking two netdevsims together.

Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Maciek Machnikowski <maciek@machnikowski.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agonetdevsim: allow two netdevsim ports to be connected
David Wei [Wed, 28 Feb 2024 23:22:49 +0000 (15:22 -0800)]
netdevsim: allow two netdevsim ports to be connected

Add two netdevsim bus attribute to sysfs:
/sys/bus/netdevsim/link_device
/sys/bus/netdevsim/unlink_device

Writing "A M B N" to link_device will link netdevsim M in netnsid A with
netdevsim N in netnsid B.

Writing "A M" to unlink_device will unlink netdevsim M in netnsid A from
its peer, if any.

rtnl_lock is taken to ensure nothing changes during the linking.

Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Maciek Machnikowski <maciek@machnikowski.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoMerge branch 'selftests-xfail'
David S. Miller [Fri, 1 Mar 2024 10:30:30 +0000 (10:30 +0000)]
Merge branch 'selftests-xfail'

Jakub Kicinski says:

====================
selftests: kselftest_harness: support using xfail

When running selftests for our subsystem in our CI we'd like all
tests to pass. Currently some tests use SKIP for cases they
expect to fail, because the kselftest_harness limits the return
codes to pass/fail/skip. XFAIL which would be a great match
here cannot be used.

Remove the no_print handling and use vfork() to run the test in
a different process than the setup. This way we don't need to
pass "failing step" via the exit code. Further clean up the exit
codes so that we can use all KSFT_* values. Rewrite the result
printing to make handling XFAIL/XPASS easier. Support tests
declaring combinations of fixture + variant they expect to fail.

Merge plan is to put it on top of -rc6 and merge into net-next.
That way others should be able to pull the patches without
any networking changes.

v4:
 - rebase on top of Mickael's vfork() changes
v3: https://lore.kernel.org/all/20240220192235.2953484-1-kuba@kernel.org/
 - combine multiple series
 - change to "list of expected failures" rather than SKIP()-like handling
v2: https://lore.kernel.org/all/20240216002619.1999225-1-kuba@kernel.org/
 - fix alignment
follow up RFC: https://lore.kernel.org/all/20240216004122.2004689-1-kuba@kernel.org/
v1: https://lore.kernel.org/all/20240213154416.422739-1-kuba@kernel.org/
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoselftests: ip_local_port_range: use XFAIL instead of SKIP
Jakub Kicinski [Thu, 29 Feb 2024 00:59:19 +0000 (16:59 -0800)]
selftests: ip_local_port_range: use XFAIL instead of SKIP

SCTP does not support IP_LOCAL_PORT_RANGE and we know it,
so use XFAIL instead of SKIP.

Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoselftests: kselftest_harness: support using xfail
Jakub Kicinski [Thu, 29 Feb 2024 00:59:18 +0000 (16:59 -0800)]
selftests: kselftest_harness: support using xfail

Currently some tests report skip for things they expect to fail
e.g. when given combination of parameters is known to be unsupported.
This is confusing because in an ideal test environment and fully
featured kernel no tests should be skipped.

Selftest summary line already includes xfail and xpass counters,
e.g.:

  Totals: pass:725 fail:0 xfail:0 xpass:0 skip:0 error:0

but there's no way to use it from within the harness.

Add a new per-fixture+variant combination list of test cases
we expect to fail.

Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoselftests: kselftest_harness: let PASS / FAIL provide diagnostic
Jakub Kicinski [Thu, 29 Feb 2024 00:59:17 +0000 (16:59 -0800)]
selftests: kselftest_harness: let PASS / FAIL provide diagnostic

Switch to printing KTAP line for PASS / FAIL with ksft_test_result_code(),
this gives us the ability to report diagnostic messages.

Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoselftests: kselftest_harness: separate diagnostic message with # in ksft_test_result_...
Jakub Kicinski [Thu, 29 Feb 2024 00:59:16 +0000 (16:59 -0800)]
selftests: kselftest_harness: separate diagnostic message with # in ksft_test_result_code()

According to the spec we should always print a # if we add
a diagnostic message. Having the caller pass in the new line
as part of diagnostic message makes handling this a bit
counter-intuitive, so append the new line in the helper.

Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoselftests: kselftest_harness: print test name for SKIP
Jakub Kicinski [Thu, 29 Feb 2024 00:59:15 +0000 (16:59 -0800)]
selftests: kselftest_harness: print test name for SKIP

Jakub points out that for parsers it's rather useful to always
have the test name on the result line. Currently if we SKIP
(or soon XFAIL or XPASS), we will print:

ok 17 # SKIP SCTP doesn't support IP_BIND_ADDRESS_NO_PORT

     ^
     no test name

Always print the test name.
KTAP format seems to allow or even call for it, per:
https://docs.kernel.org/dev-tools/ktap.html

Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/all/87jzn6lnou.fsf@cloudflare.com/
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoselftests: kselftest: add ksft_test_result_code(), handling all exit codes
Jakub Kicinski [Thu, 29 Feb 2024 00:59:14 +0000 (16:59 -0800)]
selftests: kselftest: add ksft_test_result_code(), handling all exit codes

For generic test harness code it's more useful to deal with exit
codes directly, rather than having to switch on them and call
the right ksft_test_result_*() helper. Add such function to kselftest.h.

Note that "directive" and "diagnostic" are what ktap docs call
those parts of the message.

Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoselftests: kselftest_harness: use exit code to store skip
Jakub Kicinski [Thu, 29 Feb 2024 00:59:13 +0000 (16:59 -0800)]
selftests: kselftest_harness: use exit code to store skip

We always use skip in combination with exit_code being 0
(KSFT_PASS). This are basic KSFT / KTAP semantics.
Store the right KSFT_* code in exit_code directly.

This makes it easier to support tests reporting other
extended KSFT_* codes like XFAIL / XPASS.

Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoselftests: kselftest_harness: save full exit code in metadata
Jakub Kicinski [Thu, 29 Feb 2024 00:59:12 +0000 (16:59 -0800)]
selftests: kselftest_harness: save full exit code in metadata

Instead of tracking passed = 0/1 rename the field to exit_code
and invert the values so that they match the KSFT_* exit codes.
This will allow us to fold SKIP / XFAIL into the same value.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoselftests: kselftest_harness: generate test name once
Jakub Kicinski [Thu, 29 Feb 2024 00:59:11 +0000 (16:59 -0800)]
selftests: kselftest_harness: generate test name once

Since we added variant support generating full test case
name takes 4 string arguments. We're about to need it
in another two places. Stop the duplication and print
once into a temporary buffer.

Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoselftests: kselftest_harness: use KSFT_* exit codes
Jakub Kicinski [Thu, 29 Feb 2024 00:59:10 +0000 (16:59 -0800)]
selftests: kselftest_harness: use KSFT_* exit codes

Now that we no longer need low exit codes to communicate
assertion steps - use normal KSFT exit codes.

Acked-by: Kees Cook <keescook@chromium.org>
Tested-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoselftests/harness: Merge TEST_F_FORK() into TEST_F()
Mickaël Salaün [Thu, 29 Feb 2024 00:59:09 +0000 (16:59 -0800)]
selftests/harness: Merge TEST_F_FORK() into TEST_F()

Replace Landlock-specific TEST_F_FORK() with an improved TEST_F() which
brings four related changes:

Run TEST_F()'s tests in a grandchild process to make it possible to
drop privileges and delegate teardown to the parent.

Compared to TEST_F_FORK(), simplify handling of the test grandchild
process thanks to vfork(2), and makes it generic (e.g. no explicit
conversion between exit code and _metadata).

Compared to TEST_F_FORK(), run teardown even when tests failed with an
assert thanks to commit 63e6b2a42342 ("selftests/harness: Run TEARDOWN
for ASSERT failures").

Simplify the test harness code by removing the no_print and step fields
which are not used.  I added this feature just after I made
kselftest_harness.h more broadly available but this step counter
remained even though it wasn't needed after all. See commit 369130b63178
("selftests: Enhance kselftest_harness.h to print which assert failed").

Replace spaces with tabs in one line of __TEST_F_IMPL().

Cc: Günther Noack <gnoack@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Will Drewry <wad@chromium.org>
Signed-off-by: Mickaël Salaün <mic@digikod.net>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoselftests/landlock: Redefine TEST_F() as TEST_F_FORK()
Mickaël Salaün [Thu, 29 Feb 2024 00:59:08 +0000 (16:59 -0800)]
selftests/landlock: Redefine TEST_F() as TEST_F_FORK()

This has the effect of creating a new test process for either TEST_F()
or TEST_F_FORK(), which doesn't change tests but will ease potential
backports.  See next commit for the TEST_F_FORK() merge into TEST_F().

Cc: Günther Noack <gnoack@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Will Drewry <wad@chromium.org>
Signed-off-by: Mickaël Salaün <mic@digikod.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoMerge branch 'net-asp22-optimizations'
David S. Miller [Fri, 1 Mar 2024 09:22:50 +0000 (09:22 +0000)]
Merge branch 'net-asp22-optimizations'

Justin Chen says:

====================
Support for ASP 2.2 and optimizations

ASP 2.2 adds some power savings during low power modes.

Also make various improvements when entering low power modes and
reduce MDIO traffic by hooking up interrupts.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agonet: bcmasp: Add support for PHY interrupts
Justin Chen [Wed, 28 Feb 2024 22:54:00 +0000 (14:54 -0800)]
net: bcmasp: Add support for PHY interrupts

Hook up the phy interrupts for internal phys to reduce mdio traffic
and improve responsiveness of link changes.

Signed-off-by: Justin Chen <justin.chen@broadcom.com>
Acked-by: Florian Fainelli <florian.fainelli@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agonet: bcmasp: Keep buffers through power management
Justin Chen [Wed, 28 Feb 2024 22:53:59 +0000 (14:53 -0800)]
net: bcmasp: Keep buffers through power management

There is no advantage of freeing and re-allocating buffers through
suspend and resume. This waste cycles and makes suspend/resume time
longer. We also open ourselves to failed allocations in systems with
heavy memory fragmentation.

Signed-off-by: Justin Chen <justin.chen@broadcom.com>
Acked-by: Florian Fainelli <florian.fainelli@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agonet: phy: mdio-bcm-unimac: Add asp v2.2 support
Justin Chen [Wed, 28 Feb 2024 22:53:58 +0000 (14:53 -0800)]
net: phy: mdio-bcm-unimac: Add asp v2.2 support

Add mdio compat string for ASP 2.0 ethernet driver.

Signed-off-by: Justin Chen <justin.chen@broadcom.com>
Acked-by: Florian Fainelli <florian.fainelli@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agonet: bcmasp: Add support for ASP 2.2
Justin Chen [Wed, 28 Feb 2024 22:53:57 +0000 (14:53 -0800)]
net: bcmasp: Add support for ASP 2.2

ASP 2.2 improves power savings during low power modes.

A new register was added to toggle to a slower clock during low
power modes.

EEE was broken for ASP 2.0/2.1. A HW workaround was added for
ASP 2.2 that requires toggling a chicken bit.

Signed-off-by: Justin Chen <justin.chen@broadcom.com>
Acked-by: Florian Fainelli <florian.fainelli@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agodt-bindings: net: brcm,asp-v2.0: Add asp-v2.2
Justin Chen [Wed, 28 Feb 2024 22:53:56 +0000 (14:53 -0800)]
dt-bindings: net: brcm,asp-v2.0: Add asp-v2.2

Add support for ASP 2.2.

Signed-off-by: Justin Chen <justin.chen@broadcom.com>
Acked-by: Florian Fainelli <florian.fainelli@broadcom.com>
Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agodt-bindings: net: brcm,unimac-mdio: Add asp-v2.2
Justin Chen [Wed, 28 Feb 2024 22:53:55 +0000 (14:53 -0800)]
dt-bindings: net: brcm,unimac-mdio: Add asp-v2.2

The ASP 2.2 Ethernet controller uses a brcm unimac.

Signed-off-by: Justin Chen <justin.chen@broadcom.com>
Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Acked-by: Florian Fainelli <florian.fainelli@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoMerge branch 'qcom-phy-possible'
David S. Miller [Fri, 1 Mar 2024 08:56:39 +0000 (08:56 +0000)]
Merge branch 'qcom-phy-possible'

Robert Marko says:

====================
net: phy: qcom: qca808x: fill in possible_interfaces

QCA808x does not currently fill in the possible_interfaces.

This leads to Phylink not being aware that it supports 2500Base-X as well
so in cases where it is connected to a DSA switch like MV88E6393 it will
limit that port to phy-mode set in the DTS.

That means that if SGMII is used you are limited to 1G only while if
2500Base-X was set you are limited to 2.5G only.

Populating the possible_interfaces fixes this.

Changes in v2:
* Get rid of the if/else by Russels suggestion in the helper
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agonet: phy: qcom: qca808x: fill in possible_interfaces
Robert Marko [Wed, 28 Feb 2024 17:24:10 +0000 (18:24 +0100)]
net: phy: qcom: qca808x: fill in possible_interfaces

Currently QCA808x driver does not fill the possible_interfaces.
2.5G QCA808x support SGMII and 2500Base-X while 1G model only supports
SGMII, so fill the possible_interfaces accordingly.

Signed-off-by: Robert Marko <robimarko@gmail.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agonet: phy: qcom: qca808x: add helper for checking for 1G only model
Robert Marko [Wed, 28 Feb 2024 17:24:09 +0000 (18:24 +0100)]
net: phy: qcom: qca808x: add helper for checking for 1G only model

There are 2 versions of QCA808x, one 2.5G capable and one 1G capable.
Currently, this matter only in the .get_features call however, it will
be required for filling supported interface modes so lets add a helper
that can be reused.

Signed-off-by: Robert Marko <robimarko@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoSimplify net_dbg_ratelimited() dummy
Geert Uytterhoeven [Wed, 28 Feb 2024 14:05:29 +0000 (15:05 +0100)]
Simplify net_dbg_ratelimited() dummy

There is no need to wrap calls to the no_printk() helper inside an
always-false check, as no_printk() already does that internally.

Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoMerge branch 'ipv6-devconf-lockless'
David S. Miller [Fri, 1 Mar 2024 08:42:33 +0000 (08:42 +0000)]
Merge branch 'ipv6-devconf-lockless'

Eric Dumazet says:

====================
ipv6: lockless accesses to devconf

- First patch puts in a cacheline_group the fields used in fast paths.

- Annotate all data races around idev->cnf fields.

- Last patch in this series removes RTNL use for RTM_GETNETCONF dumps.

v3: addressed Jakub Kicinski feedback in addrconf_disable_ipv6()
    Added tags from Jiri and Florian.

v2: addressed Jiri Pirko feedback
 - Added "ipv6: addrconf_disable_ipv6() optimizations"
   and "ipv6: addrconf_disable_policy() optimization"
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoipv6: use xa_array iterator to implement inet6_netconf_dump_devconf()
Eric Dumazet [Wed, 28 Feb 2024 13:54:39 +0000 (13:54 +0000)]
ipv6: use xa_array iterator to implement inet6_netconf_dump_devconf()

1) inet6_netconf_dump_devconf() can run under RCU protection
   instead of RTNL.

2) properly return 0 at the end of a dump, avoiding an
   an extra recvmsg() system call.

3) Do not use inet6_base_seq() anymore, for_each_netdev_dump()
   has nice properties. Restarting a GETDEVCONF dump if a device has
   been added/removed or if net->ipv6.dev_addr_genid has changed is moot.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoipv6/addrconf: annotate data-races around devconf fields (II)
Eric Dumazet [Wed, 28 Feb 2024 13:54:38 +0000 (13:54 +0000)]
ipv6/addrconf: annotate data-races around devconf fields (II)

Final (?) round of this series.

Annotate lockless reads on following devconf fields,
because they be changed concurrently from /proc/net/ipv6/conf.

- accept_dad
- optimistic_dad
- use_optimistic
- use_oif_addrs_only
- ra_honor_pio_life
- keep_addr_on_down
- ndisc_notify
- ndisc_evict_nocarrier
- suppress_frag_ndisc
- addr_gen_mode
- seg6_enabled
- ioam6_enabled
- ioam6_id
- ioam6_id_wide
- drop_unicast_in_l2_multicast
- mldv[12]_unsolicited_report_interval
- force_mld_version
- force_tllao
- accept_untracked_na
- drop_unsolicited_na
- accept_source_route

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoipv6/addrconf: annotate data-races around devconf fields (I)
Eric Dumazet [Wed, 28 Feb 2024 13:54:37 +0000 (13:54 +0000)]
ipv6/addrconf: annotate data-races around devconf fields (I)

Annotate lockless reads and writes on following devconf fields:

- regen_min_advance
- regen_max_retry
- dad_transmits
- use_tempaddr
- max_addresses
- max_desync_factor
- temp_valid_lft
- rtr_solicits
- rtr_solicit_max_interval
- rtr_solicit_interval
- rtr_solicit_delay
- enhanced_dad
- accept_redirects

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoipv6: addrconf_disable_policy() optimization
Eric Dumazet [Wed, 28 Feb 2024 13:54:36 +0000 (13:54 +0000)]
ipv6: addrconf_disable_policy() optimization

Writing over /proc/sys/net/ipv6/conf/default/disable_policy
does not need to hold RTNL.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoipv6: annotate data-races around devconf->disable_policy
Eric Dumazet [Wed, 28 Feb 2024 13:54:35 +0000 (13:54 +0000)]
ipv6: annotate data-races around devconf->disable_policy

idev->cnf.disable_policy and net->ipv6.devconf_all->disable_policy
can be read locklessly. Add appropriate annotations on reads
and writes.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoipv6: annotate data-races around devconf->proxy_ndp
Eric Dumazet [Wed, 28 Feb 2024 13:54:34 +0000 (13:54 +0000)]
ipv6: annotate data-races around devconf->proxy_ndp

devconf->proxy_ndp can be read and written locklessly,
add appropriate annotations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoipv6: annotate data-races in rt6_probe()
Eric Dumazet [Wed, 28 Feb 2024 13:54:33 +0000 (13:54 +0000)]
ipv6: annotate data-races in rt6_probe()

Use READ_ONCE() while reading idev->cnf.rtr_probe_interval
while its value could be changed.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoipv6: annotate data-races around idev->cnf.ignore_routes_with_linkdown
Eric Dumazet [Wed, 28 Feb 2024 13:54:32 +0000 (13:54 +0000)]
ipv6: annotate data-races around idev->cnf.ignore_routes_with_linkdown

idev->cnf.ignore_routes_with_linkdown can be used without any locks,
add appropriate annotations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoipv6: annotate data-races in ndisc_router_discovery()
Eric Dumazet [Wed, 28 Feb 2024 13:54:31 +0000 (13:54 +0000)]
ipv6: annotate data-races in ndisc_router_discovery()

Annotate reads from in6_dev->cnf.XXX fields, as they could
change concurrently.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoipv6: annotate data-races around cnf.forwarding
Eric Dumazet [Wed, 28 Feb 2024 13:54:30 +0000 (13:54 +0000)]
ipv6: annotate data-races around cnf.forwarding

idev->cnf.forwarding and net->ipv6.devconf_all->forwarding
might be read locklessly, add appropriate READ_ONCE()
and WRITE_ONCE() annotations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoipv6: annotate data-races around cnf.hop_limit
Eric Dumazet [Wed, 28 Feb 2024 13:54:29 +0000 (13:54 +0000)]
ipv6: annotate data-races around cnf.hop_limit

idev->cnf.hop_limit and net->ipv6.devconf_all->hop_limit
might be read locklessly, add appropriate READ_ONCE()
and WRITE_ONCE() annotations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Florian Westphal <fw@strlen.de> # for netfilter parts
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoipv6: annotate data-races around cnf.mtu6
Eric Dumazet [Wed, 28 Feb 2024 13:54:28 +0000 (13:54 +0000)]
ipv6: annotate data-races around cnf.mtu6

idev->cnf.mtu6 might be read locklessly, add appropriate READ_ONCE()
and WRITE_ONCE() annotations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoipv6: addrconf_disable_ipv6() optimization
Eric Dumazet [Wed, 28 Feb 2024 13:54:27 +0000 (13:54 +0000)]
ipv6: addrconf_disable_ipv6() optimization

Writing over /proc/sys/net/ipv6/conf/default/disable_ipv6
does not need to hold RTNL.

v3: remove a wrong change (Jakub Kicinski feedback)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoipv6: annotate data-races around cnf.disable_ipv6
Eric Dumazet [Wed, 28 Feb 2024 13:54:26 +0000 (13:54 +0000)]
ipv6: annotate data-races around cnf.disable_ipv6

disable_ipv6 is read locklessly, add appropriate READ_ONCE()
and WRITE_ONCE() annotations.

v2: do not preload net before rtnl_trylock() in
    addrconf_disable_ipv6() (Jiri)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoipv6: add ipv6_devconf_read_txrx cacheline_group
Eric Dumazet [Wed, 28 Feb 2024 13:54:25 +0000 (13:54 +0000)]
ipv6: add ipv6_devconf_read_txrx cacheline_group

IPv6 TX and RX fast path use the following fields:

- disable_ipv6
- hop_limit
- mtu6
- forwarding
- disable_policy
- proxy_ndp

Place them in a group to increase data locality.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Jakub Kicinski [Thu, 29 Feb 2024 22:17:54 +0000 (14:17 -0800)]
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Cross-merge networking fixes after downstream PR.

Conflicts:

net/mptcp/protocol.c
  adf1bb78dab5 ("mptcp: fix snd_wnd initialization for passive socket")
  9426ce476a70 ("mptcp: annotate lockless access for RX path fields")
https://lore.kernel.org/all/20240228103048.19255709@canb.auug.org.au/

Adjacent changes:

drivers/dpll/dpll_core.c
  0d60d8df6f49 ("dpll: rely on rcu for netdev_dpll_pin()")
  e7f8df0e81bf ("dpll: move xa_erase() call in to match dpll_pin_alloc() error path order")

drivers/net/veth.c
  1ce7d306ea63 ("veth: try harder when allocating queue memory")
  0bef512012b1 ("net: add netdev_lockdep_set_classes() to virtual drivers")

drivers/net/wireless/intel/iwlwifi/mvm/d3.c
  8c9bef26e98b ("wifi: iwlwifi: mvm: d3: implement suspend with MLO")
  78f65fbf421a ("wifi: iwlwifi: mvm: ensure offloading TID queue exists")

net/wireless/nl80211.c
  f78c1375339a ("wifi: nl80211: reject iftype change with mesh ID change")
  414532d8aa89 ("wifi: cfg80211: use IEEE80211_MAX_MESH_ID_LEN appropriately")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
19 months agoMerge branch 'create-shadow-types-for-struct_ops-maps-in-skeletons'
Andrii Nakryiko [Thu, 29 Feb 2024 21:54:06 +0000 (13:54 -0800)]
Merge branch 'create-shadow-types-for-struct_ops-maps-in-skeletons'

Kui-Feng Lee says:

====================
Create shadow types for struct_ops maps in skeletons

This patchset allows skeleton users to change the values of the fields
in struct_ops maps at runtime. It will create a shadow type pointer in
a skeleton for each struct_ops map, allowing users to access the
values of fields through these pointers. For instance, if there is an
integer field named "FOO" in a struct_ops map called "testmap", you
can access the value of "FOO" in this way.

    skel->struct_ops.testmap->FOO = 13;

With this feature, the users can pass flags or other data along with
the map from the user space to the kernel without creating separate
struct_ops map for different values in BPF.

== Shadow Type ==

The shadow type of a struct_ops map is a variant of the original
struct type of the map. The code generator translates each field in
the original struct type to a field in the shadow type. The type of a
field in the shadow type may not be the same as the corresponding
field in the original struct type. For example, modifiers like
volatile, const, etc., are removed from the fields in a shadow
type. Function pointers are translated to pointers of struct
bpf_program.

Currently, only scalar types and function pointers are
supported. Fields belonging to structs, unions, non-function pointers,
arrays, or other types are not supported. For those unsupported
fields, they are converted to arrays of characters to preserve their
space within the original struct type.

The padding between consecutive fields is handled by padding fields
(__padding_*). This helps to maintain the memory layout consistent
with the original struct_type.

Here is an example of shadow types.
The origin struct type of a struct_ops map is

    struct bpf_testmod_ops {
     int (*test_1)(void);
     void (*test_2)(int a, int b);
     /* Used to test nullable arguments. */
     int (*test_maybe_null)(int dummy, struct task_struct *task);

     /* The following fields are used to test shadow copies. */
     char onebyte;
     struct {
     int a;
     int b;
     } unsupported;
     int data;
    };

The struct_ops map, named testmod_1, of this type will be translated
to a pointer in the shadow type.

    struct {
     struct my_skel__testmod_1__bpf_testmod_ops {
     const struct bpf_program *test_1;
     const struct bpf_program *test_2;
     const struct bpf_program *test_maybe_null;
     char onebyte;
     char __padding_4[3];
     char __unsupported_4[8];
     int data;
     } *testmod_1;
    } struct_ops;

== Convert st_ops->data to Shadow Type ==

libbpf converts st_ops->data to the format of the shadow type for each
struct_ops map. This means that the bytes where function pointers are
located are converted to the values of the pointers of struct
bpf_program. The fields of other types are kept as they were.

Libbpf will synchronize the pointers of struct bpf_program with
st_ops->progs[] so that users can change function pointers
(bpf_program) before loading the map.
---
Changes from v5:

 - Generate names for shadow types.

 - Check btf and the number of struct_ops maps in gen_st_ops_shadow()
   and gen_st_ops_shadow_init() instead of do_skeleton() and
   do_subskeleton().

 - Name unsupported fields in the pattern __unsupported_*.

 - Have a padding field for a unsupported fields as well if necessary.

 - Implement resolve_func_ptr() in gen.c instead of reusing the one in
   libbpf. (Remove the part 1 in v4.)

 - Fix stylistic issues.

Changes from v4:

 - Convert function pointers to the pointers to struct bpf_program in
   bpf_object__collect_st_ops_relos().

Changes from v3:

 - Add comment to avoid people from removing resolve_func_ptr() from
   libbpf_internal.h

 - Fix commit logs and comments.

 - Add an example about using the pointers of shadow types
   for struct_ops maps to bpftool-gen.8.

v5: https://lore.kernel.org/all/20240227010432.714127-1-thinker.li@gmail.com/
v4: https://lore.kernel.org/all/20240222222624.1163754-1-thinker.li@gmail.com/
v3: https://lore.kernel.org/all/20240221012329.1387275-1-thinker.li@gmail.com/
v2: https://lore.kernel.org/all/20240214020836.1845354-1-thinker.li@gmail.com/
v1: https://lore.kernel.org/all/20240124224130.859921-1-thinker.li@gmail.com/

Kui-Feng Lee (5):
  libbpf: set btf_value_type_id of struct bpf_map for struct_ops.
  libbpf: Convert st_ops->data to shadow type.
  bpftool: generated shadow variables for struct_ops maps.
  bpftool: Add an example for struct_ops map and shadow type.
  selftests/bpf: Test if shadow types work correctly.

 .../bpf/bpftool/Documentation/bpftool-gen.rst |  58 ++++-
 tools/bpf/bpftool/gen.c                       | 237 +++++++++++++++++-
 tools/lib/bpf/libbpf.c                        |  50 +++-
 .../selftests/bpf/bpf_testmod/bpf_testmod.c   |  11 +-
 .../selftests/bpf/bpf_testmod/bpf_testmod.h   |   8 +
 .../bpf/prog_tests/test_struct_ops_module.c   |  19 +-
 .../selftests/bpf/progs/struct_ops_module.c   |   8 +
 7 files changed, 377 insertions(+), 14 deletions(-)
====================

Link: https://lore.kernel.org/r/20240229064523.2091270-1-thinker.li@gmail.com
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
19 months agoselftests/bpf: Test if shadow types work correctly.
Kui-Feng Lee [Thu, 29 Feb 2024 06:45:23 +0000 (22:45 -0800)]
selftests/bpf: Test if shadow types work correctly.

Change the values of fields, including scalar types and function pointers,
and check if the struct_ops map works as expected.

The test changes the field "test_2" of "testmod_1" from the pointer to
test_2() to pointer to test_3() and the field "data" to 13. The function
test_2() and test_3() both compute a new value for "test_2_result", but in
different way. By checking the value of "test_2_result", it ensures the
struct_ops map works as expected with changes through shadow types.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240229064523.2091270-6-thinker.li@gmail.com
19 months agobpftool: Add an example for struct_ops map and shadow type.
Kui-Feng Lee [Thu, 29 Feb 2024 06:45:22 +0000 (22:45 -0800)]
bpftool: Add an example for struct_ops map and shadow type.

The example in bpftool-gen.8 explains how to use the pointer of the shadow
type to change the value of a field of a struct_ops map.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Quentin Monnet <quentin@isovalent.com>
Link: https://lore.kernel.org/bpf/20240229064523.2091270-5-thinker.li@gmail.com
19 months agobpftool: Generated shadow variables for struct_ops maps.
Kui-Feng Lee [Thu, 29 Feb 2024 06:45:21 +0000 (22:45 -0800)]
bpftool: Generated shadow variables for struct_ops maps.

Declares and defines a pointer of the shadow type for each struct_ops map.

The code generator will create an anonymous struct type as the shadow type
for each struct_ops map. The shadow type is translated from the original
struct type of the map. The user of the skeleton use pointers of them to
access the values of struct_ops maps.

However, shadow types only supports certain types of fields, including
scalar types and function pointers. Any fields of unsupported types are
translated into an array of characters to occupy the space of the original
field. Function pointers are translated into pointers of the struct
bpf_program. Additionally, padding fields are generated to occupy the space
between two consecutive fields.

The pointers of shadow types of struct_osp maps are initialized when
*__open_opts() in skeletons are called. For a map called FOO, the user can
access it through the pointer at skel->struct_ops.FOO.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Quentin Monnet <quentin@isovalent.com>
Link: https://lore.kernel.org/bpf/20240229064523.2091270-4-thinker.li@gmail.com
19 months agolibbpf: Convert st_ops->data to shadow type.
Kui-Feng Lee [Thu, 29 Feb 2024 06:45:20 +0000 (22:45 -0800)]
libbpf: Convert st_ops->data to shadow type.

Convert st_ops->data to the shadow type of the struct_ops map. The shadow
type of a struct_ops type is a variant of the original struct type
providing a way to access/change the values in the maps of the struct_ops
type.

bpf_map__initial_value() will return st_ops->data for struct_ops types. The
skeleton is going to use it as the pointer to the shadow type of the
original struct type.

One of the main differences between the original struct type and the shadow
type is that all function pointers of the shadow type are converted to
pointers of struct bpf_program. Users can replace these bpf_program
pointers with other BPF programs. The st_ops->progs[] will be updated
before updating the value of a map to reflect the changes made by users.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240229064523.2091270-3-thinker.li@gmail.com
19 months agolibbpf: Set btf_value_type_id of struct bpf_map for struct_ops.
Kui-Feng Lee [Thu, 29 Feb 2024 06:45:19 +0000 (22:45 -0800)]
libbpf: Set btf_value_type_id of struct bpf_map for struct_ops.

For a struct_ops map, btf_value_type_id is the type ID of it's struct
type. This value is required by bpftool to generate skeleton including
pointers of shadow types. The code generator gets the type ID from
bpf_map__btf_value_type_id() in order to get the type information of the
struct type of a map.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240229064523.2091270-2-thinker.li@gmail.com
19 months agobpf: Replace bpf_lpm_trie_key 0-length array with flexible array
Kees Cook [Thu, 22 Feb 2024 15:56:15 +0000 (07:56 -0800)]
bpf: Replace bpf_lpm_trie_key 0-length array with flexible array

Replace deprecated 0-length array in struct bpf_lpm_trie_key with
flexible array. Found with GCC 13:

../kernel/bpf/lpm_trie.c:207:51: warning: array subscript i is outside array bounds of 'const __u8[0]' {aka 'const unsigned char[]'} [-Warray-bounds=]
  207 |                                        *(__be16 *)&key->data[i]);
      |                                                   ^~~~~~~~~~~~~
../include/uapi/linux/swab.h:102:54: note: in definition of macro '__swab16'
  102 | #define __swab16(x) (__u16)__builtin_bswap16((__u16)(x))
      |                                                      ^
../include/linux/byteorder/generic.h:97:21: note: in expansion of macro '__be16_to_cpu'
   97 | #define be16_to_cpu __be16_to_cpu
      |                     ^~~~~~~~~~~~~
../kernel/bpf/lpm_trie.c:206:28: note: in expansion of macro 'be16_to_cpu'
  206 |                 u16 diff = be16_to_cpu(*(__be16 *)&node->data[i]
^
      |                            ^~~~~~~~~~~
In file included from ../include/linux/bpf.h:7:
../include/uapi/linux/bpf.h:82:17: note: while referencing 'data'
   82 |         __u8    data[0];        /* Arbitrary size */
      |                 ^~~~

And found at run-time under CONFIG_FORTIFY_SOURCE:

  UBSAN: array-index-out-of-bounds in kernel/bpf/lpm_trie.c:218:49
  index 0 is out of range for type '__u8 [*]'

Changing struct bpf_lpm_trie_key is difficult since has been used by
userspace. For example, in Cilium:

struct egress_gw_policy_key {
        struct bpf_lpm_trie_key lpm_key;
        __u32 saddr;
        __u32 daddr;
};

While direct references to the "data" member haven't been found, there
are static initializers what include the final member. For example,
the "{}" here:

        struct egress_gw_policy_key in_key = {
                .lpm_key = { 32 + 24, {} },
                .saddr   = CLIENT_IP,
                .daddr   = EXTERNAL_SVC_IP & 0Xffffff,
        };

To avoid the build time and run time warnings seen with a 0-sized
trailing array for struct bpf_lpm_trie_key, introduce a new struct
that correctly uses a flexible array for the trailing bytes,
struct bpf_lpm_trie_key_u8. As part of this, include the "header"
portion (which is just the "prefixlen" member), so it can be used
by anything building a bpf_lpr_trie_key that has trailing members that
aren't a u8 flexible array (like the self-test[1]), which is named
struct bpf_lpm_trie_key_hdr.

Unfortunately, C++ refuses to parse the __struct_group() helper, so
it is not possible to define struct bpf_lpm_trie_key_hdr directly in
struct bpf_lpm_trie_key_u8, so we must open-code the union directly.

Adjust the kernel code to use struct bpf_lpm_trie_key_u8 through-out,
and for the selftest to use struct bpf_lpm_trie_key_hdr. Add a comment
to the UAPI header directing folks to the two new options.

Reported-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Closes: https://paste.debian.net/hidden/ca500597/
Link: https://lore.kernel.org/all/202206281009.4332AA33@keescook/
Link: https://lore.kernel.org/bpf/20240222155612.it.533-kees@kernel.org
19 months agoMerge tag 'net-6.8-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Linus Torvalds [Thu, 29 Feb 2024 20:40:20 +0000 (12:40 -0800)]
Merge tag 'net-6.8-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from bluetooth, WiFi and netfilter.

  We have one outstanding issue with the stmmac driver, which may be a
  LOCKDEP false positive, not a blocker.

  Current release - regressions:

   - netfilter: nf_tables: re-allow NFPROTO_INET in
     nft_(match/target)_validate()

   - eth: ionic: fix error handling in PCI reset code

  Current release - new code bugs:

   - eth: stmmac: complete meta data only when enabled, fix null-deref

   - kunit: fix again checksum tests on big endian CPUs

  Previous releases - regressions:

   - veth: try harder when allocating queue memory

   - Bluetooth:
      - hci_bcm4377: do not mark valid bd_addr as invalid
      - hci_event: fix handling of HCI_EV_IO_CAPA_REQUEST

  Previous releases - always broken:

   - info leak in __skb_datagram_iter() on netlink socket

   - mptcp:
      - map v4 address to v6 when destroying subflow
      - fix potential wake-up event loss due to sndbuf auto-tuning
      - fix double-free on socket dismantle

   - wifi: nl80211: reject iftype change with mesh ID change

   - fix small out-of-bound read when validating netlink be16/32 types

   - rtnetlink: fix error logic of IFLA_BRIDGE_FLAGS writing back

   - ipv6: fix potential "struct net" ref-leak in inet6_rtm_getaddr()

   - ip_tunnel: prevent perpetual headroom growth with huge number of
     tunnels on top of each other

   - mctp: fix skb leaks on error paths of mctp_local_output()

   - eth: ice: fixes for DPLL state reporting

   - dpll: rely on rcu for netdev_dpll_pin() to prevent UaF

   - eth: dpaa: accept phy-interface-type = '10gbase-r' in the device
     tree"

* tag 'net-6.8-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (73 commits)
  dpll: fix build failure due to rcu_dereference_check() on unknown type
  kunit: Fix again checksum tests on big endian CPUs
  tls: fix use-after-free on failed backlog decryption
  tls: separate no-async decryption request handling from async
  tls: fix peeking with sync+async decryption
  tls: decrement decrypt_pending if no async completion will be called
  gtp: fix use-after-free and null-ptr-deref in gtp_newlink()
  net: hsr: Use correct offset for HSR TLV values in supervisory HSR frames
  igb: extend PTP timestamp adjustments to i211
  rtnetlink: fix error logic of IFLA_BRIDGE_FLAGS writing back
  tools: ynl: fix handling of multiple mcast groups
  selftests: netfilter: add bridge conntrack + multicast test case
  netfilter: bridge: confirm multicast packets before passing them up the stack
  netfilter: nf_tables: allow NFPROTO_INET in nft_(match/target)_validate()
  Bluetooth: qca: Fix triggering coredump implementation
  Bluetooth: hci_qca: Set BDA quirk bit if fwnode exists in DT
  Bluetooth: qca: Fix wrong event type for patch config command
  Bluetooth: Enforce validation on max value of connection interval
  Bluetooth: hci_event: Fix handling of HCI_EV_IO_CAPA_REQUEST
  Bluetooth: mgmt: Fix limited discoverable off timeout
  ...

19 months agoMerge tag 'landlock-6.8-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mic...
Linus Torvalds [Thu, 29 Feb 2024 20:29:23 +0000 (12:29 -0800)]
Merge tag 'landlock-6.8-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mic/linux

Pull Landlock fix from Mickaël Salaün:
 "Fix a potential issue when handling inodes with inconsistent
  properties"

* tag 'landlock-6.8-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mic/linux:
  landlock: Fix asymmetric private inodes referring

19 months agodpll: fix build failure due to rcu_dereference_check() on unknown type
Eric Dumazet [Thu, 29 Feb 2024 19:05:15 +0000 (11:05 -0800)]
dpll: fix build failure due to rcu_dereference_check() on unknown type

Tasmiya reports that their compiler complains that we deref
a pointer to unknown type with rcu_dereference_rtnl():

include/linux/rcupdate.h:439:9: error: dereferencing pointer to incomplete type ‘struct dpll_pin’

Unclear what compiler it is, at the moment, and we can't report
but since DPLL can't be a module - move the code from the header
into the source file.

Fixes: 0d60d8df6f49 ("dpll: rely on rcu for netdev_dpll_pin()")
Reported-by: Tasmiya Nalatwad <tasmiya@linux.vnet.ibm.com>
Link: https://lore.kernel.org/all/3fcf3a2c-1c1b-42c1-bacb-78fdcd700389@linux.vnet.ibm.com/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240229190515.2740221-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>