]> www.infradead.org Git - users/hch/uuid.git/log
users/hch/uuid.git
17 months agoipv6: introduce dst_rt6_info() helper
Eric Dumazet [Fri, 26 Apr 2024 15:19:52 +0000 (15:19 +0000)]
ipv6: introduce dst_rt6_info() helper

Instead of (struct rt6_info *)dst casts, we can use :

 #define dst_rt6_info(_ptr) \
         container_of_const(_ptr, struct rt6_info, dst)

Some places needed missing const qualifiers :

ip6_confirm_neigh(), ipv6_anycast_destination(),
ipv6_unicast_destination(), has_gateway()

v2: added missing parts (David Ahern)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months agoMerge branch 'mlxsw-events-processing-performance'
David S. Miller [Mon, 29 Apr 2024 09:47:05 +0000 (10:47 +0100)]
Merge branch 'mlxsw-events-processing-performance'

Petr Machata says:

====================
mlxsw: Improve events processing performance

Amit Cohen writes:

Spectrum ASICs only support a single interrupt, it means that all the
events are handled by one IRQ (interrupt request) handler.

Currently, we schedule a tasklet to handle events in EQ, then we also use
tasklet for CQ, SDQ and RDQ. Tasklet runs in softIRQ (software IRQ)
context, and will be run on the same CPU which scheduled it. It means that
today we have one CPU which handles all the packets (both network packets
and EMADs) from hardware.

The existing implementation is not efficient and can be improved.

Measuring latency of EMADs in the driver (without the time in FW) shows
that latency is increased by factor of 28 (x28) when network traffic is
handled by the driver.

Measuring throughput in CPU shows that CPU can handle ~35% less packets
of specific flow when corrupted packets are also handled by the driver.
There are cases that these values even worse, we measure decrease of ~44%
packet rate.

This can be improved if network packet and EMADs will be handled in
parallel by several CPUs, and more than that, if different types of traffic
will be handled in parallel. We can achieve this using NAPI.

This set converts the driver to process completions from hardware via NAPI.
The idea is to add NAPI instance per CQ (which is mapped 1:1 to SDQ/RDQ),
which means that each DQ can be handled separately. we have DQ for EMADs
and DQs for each trap group (like LLDP, BGP, L3 drops, etc..). See more
details in commit messages.

An additional improvement which is done as part of this set is related to
doorbells' ring. The idea is to handle small chunks of Rx packets (which
is also recommended using NAPI) and ring doorbells once per chunk. This
reduces the access to hardware which is expensive (time wise) and might
take time because of memory barriers.

With this set we can see better performance.
To summerize:

EMADs latency:
+------------------------------------------------------------------------+
|                  | Before this set           | Now                     |
|------------------|---------------------------|-------------------------|
| Increased factor | x28                       | x1.5                    |
+------------------------------------------------------------------------+
Note that we can see even measurements that show better latency when
traffic is handled by the driver.

Throughput:
+------------------------------------------------------------------------+
|             | Before this set            | Now                         |
|-------------|----------------------------|-----------------------------|
| Reduced     | 35%                        | 6%                          |
| packet rate |                            |                             |
+------------------------------------------------------------------------+

Additional improvements are planned - use page pool for buffer allocations
and avoid cache miss of each SKB using napi_build_skb().

Patch set overview:
Patches #1-#2 improve access to hardware by reducing dorbells' rings
Patch #3-#4 are preaparations for NAPI usage
Patch #5 converts the driver to use NAPI
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
17 months agomlxsw: pci: Use NAPI for event processing
Amit Cohen [Fri, 26 Apr 2024 12:42:26 +0000 (14:42 +0200)]
mlxsw: pci: Use NAPI for event processing

Spectrum ASICs only support a single interrupt, that means that all the
events are handled by one IRQ (interrupt request) handler. Once an
interrupt is received, we schedule tasklet to handle events from EQ and
then schedule tasklets to handle completions from CQs. Tasklet runs in
softIRQ (software IRQ) context, and will be run on the same CPU which
scheduled it. That means that today we use only one CPU to handle all the
packets (both network packets and EMADs) from hardware.

This can be improved using NAPI. The idea is to use NAPI instance per
CQ, which is mapped 1:1 to DQ (RDQ or SDQ). NAPI poll method can be run
in kernel thread, so then the driver will be able to handle WQEs in several
CPUs. Convert the existing code to use NAPI APIs.

Add NAPI instance as part of 'struct mlxsw_pci_queue' and initialize it
as part of CQs initialization. Set the appropriate poll method and dummy
net device, according to queue number, similar to tasklet setup. For CQs
which are used for completions of RDQ, use Rx poll method and
'napi_dev_rx', which is set as 'threaded'. It means that Rx poll method
will run in kernel context, so several RDQs will be handled in parallel.
For CQs which are used for completions of SDQ, use Tx poll method and
'napi_dev_tx', this method will run in softIRQ context, as it is
recommended in NAPI documentation, as Tx packets' processing is short task.

Convert mlxsw_pci_cq_{rx,tx}_tasklet() to poll methods. Handle 'budget'
argument - ignore it in Tx poll method, as it is recommended to not limit
Tx processing. For Rx processing, handle up to 'budget' completions.
Return 'work_done' which is the amount of completions that were handled.

Handle the following cases:
1. After processing 'budget' completions, the driver still has work to do:
   Return work-done = budget. In that case, the NAPI instance will be
   polled again (without the need to be rescheduled). Do not re-arm the
   queue, as NAPI will handle the reschedule, so we do not have to involve
   hardware to send an additional interrupt for the completions that should
   be processed.

2. Event processing has been completed:
   Call napi_complete_done() to mark NAPI processing as completed, which
   means that the poll method will not be rescheduled. Re-arm the queue,
   as all completions were handled.

   In case that poll method handled exactly 'budget' completions, return
   work-done = budget -1, to distinguish from the case that driver still
   has completions to handle. Otherwise, return the amount of completions
   that were handled.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months agomlxsw: pci: Reorganize 'mlxsw_pci_queue' structure
Amit Cohen [Fri, 26 Apr 2024 12:42:25 +0000 (14:42 +0200)]
mlxsw: pci: Reorganize 'mlxsw_pci_queue' structure

The next patch will set the driver to use NAPI for event processing. Then
tasklet mechanism will be used only for EQ. Reorganize 'mlxsw_pci_queue'
to hold EQ and CQ attributes in a union. For now, add tasklet for both EQ
and CQ. This will be changed in the next patch, as 'tasklet_struct' will be
replaced with NAPI instance.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months agomlxsw: pci: Initialize dummy net devices for NAPI
Amit Cohen [Fri, 26 Apr 2024 12:42:24 +0000 (14:42 +0200)]
mlxsw: pci: Initialize dummy net devices for NAPI

mlxsw will use NAPI for event processing in a next patch. As preparation,
add two dummy net devices and initialize them.

NAPI instance should be attached to net device. Usually each queue is used
by a single net device in network drivers, so the mapping between net
device to NAPI instance is intuitive. In our case, Rx queues are not per
port, they are per trap-group. Tx queues are mapped to net devices, but we
do not have a separate queue for each local port, several ports share the
same queue.

Use init_dummy_netdev() to initialize dummy net devices for NAPI.

To run NAPI poll method in a kernel thread, the net device which NAPI
instance is attached to should be marked as 'threaded'. It is
recommended to handle Tx packets in softIRQ context, as usually this is
a short task - just free the Tx packet which has been transmitted.
Rx packets handling is more complicated task, so drivers can use a
dedicated kernel thread to process them. It allows processing packets from
different Rx queues in parallel. We would like to handle only Rx packets in
kernel threads, which means that we will use two dummy net devices
(one for Rx and one for Tx). Set only one of them with 'threaded' as it
will be used for Rx processing. Do not fail in case that setting 'threaded'
fails, as it is better to use regular softIRQ NAPI rather than preventing
the driver from loading.

Note that the net devices are initialized with init_dummy_netdev(), so
they are not registered, which means that they will not be visible to user.
It will not be possible to change 'threaded' configuration from user
space, but it is reasonable in our case, as there is no another
configuration which makes sense, considering that user has no influence
on the usage of each queue.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months agomlxsw: pci: Ring RDQ and CQ doorbells once per several completions
Amit Cohen [Fri, 26 Apr 2024 12:42:23 +0000 (14:42 +0200)]
mlxsw: pci: Ring RDQ and CQ doorbells once per several completions

Currently, for each CQE in CQ, we ring CQ doorbell, then handle RDQ and
ring RDQ doorbell. Finally we ring CQ arm doorbell - once per CQ tasklet.

The idea of ringing CQ doorbell before RDQ doorbell, is to be sure that
when we post new WQE (after RDQ is handled), there is an available CQE.
This was done because of a hardware bug as part of
commit c9ebea04cb1b ("mlxsw: pci: Ring CQ's doorbell before RDQ's").

There is no real reason to ring RDQ and CQ doorbells for each completion,
it is better to handle several completions and reduce number of ringings,
as access to hardware is expensive (time wise) and might take time because
of memory barriers.

A previous patch changed CQ tasklet to handle up to 64 Rx packets. With
this limitation, we can ring CQ and RDQ doorbells once per CQ tasklet.
The counters of the doorbells are increased by the amount of packets
that we handled, then the device will know for which completion to send
an additional event.

To avoid reordering CQ and RDQ doorbells' ring, let the tasklet to ring
also RDQ doorbell, mlxsw_pci_cqe_rdq_handle() handles the counter but
does not ring the doorbell.

Note that with this change there is no need to copy the CQE, as we ring CQ
doorbell only after Rx packet processing (which uses the CQE) is done.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months agomlxsw: pci: Handle up to 64 Rx completions in tasklet
Amit Cohen [Fri, 26 Apr 2024 12:42:22 +0000 (14:42 +0200)]
mlxsw: pci: Handle up to 64 Rx completions in tasklet

We can get many completions in one interrupt. Currently, the CQ tasklet
handles up to half queue size completions, and then arms the hardware to
generate additional events, which means that in case that there were
additional completions that we did not handle, we will get immediately an
additional interrupt to handle the rest.

The decision to handle up to half of the queue size is arbitrary and was
determined in 2015, when mlxsw driver was added to the kernel. One
additional fact that should be taken into account is that while WQEs
from RDQ are handled, the CPU that handles the tasklet is dedicated for
this task, which means that we might hold the CPU for a long time.

Handle WQEs in smaller chucks, then arm CQ doorbell to notify the hardware
to send additional notifications. Set the chunk size to 64 as this number
is recommended using NAPI and the driver will use NAPI in a next patch.
Note that for now we use ARM doorbell to retrigger CQ tasklet, but with
NAPI it will be more efficient as software will reschedule the poll
method and we will not involve hardware for that.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months agoipv6: use call_rcu_hurry() in fib6_info_release()
Eric Dumazet [Fri, 26 Apr 2024 10:47:22 +0000 (10:47 +0000)]
ipv6: use call_rcu_hurry() in fib6_info_release()

This is a followup of commit c4e86b4363ac ("net: add two more
call_rcu_hurry()")

fib6_info_destroy_rcu() is calling nexthop_put() or fib6_nh_release()

We must not delay it too much or risk unregister_netdevice/ref_tracker
traces because references to netdev are not released in time.

This should speedup device/netns dismantles when CONFIG_RCU_LAZY=y

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months agoinet: use call_rcu_hurry() in inet_free_ifa()
Eric Dumazet [Fri, 26 Apr 2024 07:02:02 +0000 (07:02 +0000)]
inet: use call_rcu_hurry() in inet_free_ifa()

This is a followup of commit c4e86b4363ac ("net: add two more
call_rcu_hurry()")

Our reference to ifa->ifa_dev must be freed ASAP
to release the reference to the netdev the same way.

inet_rcu_free_ifa()

in_dev_put()
 -> in_dev_finish_destroy()
   -> netdev_put()

This should speedup device/netns dismantles when CONFIG_RCU_LAZY=y

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months agonet: give more chances to rcu in netdev_wait_allrefs_any()
Eric Dumazet [Fri, 26 Apr 2024 06:42:22 +0000 (06:42 +0000)]
net: give more chances to rcu in netdev_wait_allrefs_any()

This came while reviewing commit c4e86b4363ac ("net: add two more
call_rcu_hurry()").

Paolo asked if adding one synchronize_rcu() would help.

While synchronize_rcu() does not help, making sure to call
rcu_barrier() before msleep(wait) is definitely helping
to make sure lazy call_rcu() are completed.

Instead of waiting ~100 seconds in my tests, the ref_tracker
splats occurs one time only, and netdev_wait_allrefs_any()
latency is reduced to the strict minimum.

Ideally we should audit our call_rcu() users to make sure
no refcount (or cascading call_rcu()) is held too long,
because rcu_barrier() is quite expensive.

Fixes: 0e4be9e57e8c ("net: use exponential backoff in netdev_wait_allrefs")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/all/28bbf698-befb-42f6-b561-851c67f464aa@kernel.org/T/#m76d73ed6b03cd930778ac4d20a777f22a08d6824
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months agonet: ethernet: ti: am65-cpsw-qos: Add support to taprio for past base_time
Tanmay Patil [Thu, 25 Apr 2024 10:31:42 +0000 (16:01 +0530)]
net: ethernet: ti: am65-cpsw-qos: Add support to taprio for past base_time

If the base-time for taprio is in the past, start the schedule at the time
of the form "base_time + N*cycle_time" where N is the smallest possible
integer such that the above time is in the future.

Signed-off-by: Tanmay Patil <t-patil@ti.com>
Signed-off-by: Chintan Vankar <c-vankar@ti.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months agotools: ynl: don't append doc of missing type directly to the type
Jakub Kicinski [Fri, 26 Apr 2024 00:31:11 +0000 (17:31 -0700)]
tools: ynl: don't append doc of missing type directly to the type

When using YNL in tests appending the doc string to the type
name makes it harder to check that we got the correct error.
Put the doc under a separate key.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://lore.kernel.org/r/20240426003111.359285-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
17 months agoMerge branch 'selftests-drv-net-round-some-sharp-edges'
Jakub Kicinski [Fri, 26 Apr 2024 23:10:27 +0000 (16:10 -0700)]
Merge branch 'selftests-drv-net-round-some-sharp-edges'

Jakub Kicinski says:

====================
selftests: drv-net: round some sharp edges

I had to explain how to run the driver tests twice already.
Improve the README so we can just point to it.
Improve the config validation.

v1: https://lore.kernel.org/r/20240424221444.4194069-1-kuba@kernel.org/
====================

Link: https://lore.kernel.org/r/20240425222341.309778-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
17 months agoselftests: drv-net: validate the environment
Jakub Kicinski [Thu, 25 Apr 2024 22:23:41 +0000 (15:23 -0700)]
selftests: drv-net: validate the environment

Throw a slightly more helpful exception when env variables
are partially populated. Prior to this change we'd get
a dictionary key exception somewhere later on.

Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20240425222341.309778-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
17 months agoselftests: drv-net: reimplement the config parser
Jakub Kicinski [Thu, 25 Apr 2024 22:23:40 +0000 (15:23 -0700)]
selftests: drv-net: reimplement the config parser

The shell lexer is not helping much, do very basic parsing
manually.

Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20240425222341.309778-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
17 months agoselftests: drv-net: extend the README with more info and example
Jakub Kicinski [Thu, 25 Apr 2024 22:23:39 +0000 (15:23 -0700)]
selftests: drv-net: extend the README with more info and example

Add more info to the README. It's also now copied to GitHub for
increased visibility:

 https://github.com/linux-netdev/nipa/wiki/Running-driver-tests

Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20240425222341.309778-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
17 months agotcp: fix tcp_grow_skb() vs tstamps
Eric Dumazet [Thu, 25 Apr 2024 19:34:50 +0000 (19:34 +0000)]
tcp: fix tcp_grow_skb() vs tstamps

I forgot to call tcp_skb_collapse_tstamp() in the
case we consume the second skb in write queue.

Neal suggested to create a common helper used by tcp_mtu_probe()
and tcp_grow_skb().

Fixes: 8ee602c63520 ("tcp: try to send bigger TSO packets")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Link: https://lore.kernel.org/r/20240425193450.411640-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
17 months agonet: dsa: lan9303: use ethtool_puts() for lan9303_get_strings()
Justin Stitt [Thu, 25 Apr 2024 01:19:13 +0000 (01:19 +0000)]
net: dsa: lan9303: use ethtool_puts() for lan9303_get_strings()

This pattern of strncpy with some pointer arithmetic setting fixed-sized
intervals with string literal data is a bit weird so let's use
ethtool_puts() as this has more obvious behavior and is less-error
prone.

Nicely, we also get to drop a usage of the now deprecated strncpy() [1].

Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings
Link: https://github.com/KSPP/linux/issues/90
Suggested-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Justin Stitt <justinstitt@google.com>
Link: https://lore.kernel.org/r/20240425-strncpy-drivers-net-dsa-lan9303-core-c-v4-1-9fafd419d7bb@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
17 months agoMerge branch 'implement-reset-reason-mechanism-to-detect'
Paolo Abeni [Fri, 26 Apr 2024 13:34:04 +0000 (15:34 +0200)]
Merge branch 'implement-reset-reason-mechanism-to-detect'

Jason Xing says:

====================
Implement reset reason mechanism to detect

From: Jason Xing <kernelxing@tencent.com>

In production, there are so many cases about why the RST skb is sent but
we don't have a very convenient/fast method to detect the exact underlying
reasons.

RST is implemented in two kinds: passive kind (like tcp_v4_send_reset())
and active kind (like tcp_send_active_reset()). The former can be traced
carefully 1) in TCP, with the help of drop reasons, which is based on
Eric's idea[1], 2) in MPTCP, with the help of reset options defined in
RFC 8684. The latter is relatively independent, which should be
implemented on our own, such as active reset reasons which can not be
replace by skb drop reason or something like this.

In this series, I focus on the fundamental implement mostly about how
the rstreason mechanism works and give the detailed passive part as an
example, not including the active reset part. In future, we can go
further and refine those NOT_SPECIFIED reasons.

Here are some examples when tracing:
<idle>-0       [002] ..s1.  1830.262425: tcp_send_reset: skbaddr=x
        skaddr=x src=x dest=x state=x reason=NOT_SPECIFIED
<idle>-0       [002] ..s1.  1830.262425: tcp_send_reset: skbaddr=x
        skaddr=x src=x dest=x state=x reason=NO_SOCKET

[1]
Link: https://lore.kernel.org/all/CANn89iJw8x-LqgsWOeJQQvgVg6DnL5aBRLi10QN2WBdr+X4k=w@mail.gmail.com/
====================

Link: https://lore.kernel.org/r/20240425031340.46946-1-kerneljasonxing@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
17 months agorstreason: make it work in trace world
Jason Xing [Thu, 25 Apr 2024 03:13:40 +0000 (11:13 +0800)]
rstreason: make it work in trace world

At last, we should let it work by introducing this reset reason in
trace world.

One of the possible expected outputs is:
... tcp_send_reset: skbaddr=xxx skaddr=xxx src=xxx dest=xxx
state=TCP_ESTABLISHED reason=NOT_SPECIFIED

Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
17 months agomptcp: introducing a helper into active reset logic
Jason Xing [Thu, 25 Apr 2024 03:13:39 +0000 (11:13 +0800)]
mptcp: introducing a helper into active reset logic

Since we have mapped every mptcp reset reason definition in enum
sk_rst_reason, introducing a new helper can cover some missing places
where we have already set the subflow->reset_reason.

Note: using SK_RST_REASON_NOT_SPECIFIED is the same as
SK_RST_REASON_MPTCP_RST_EUNSPEC. They are both unknown. So we can convert
it directly.

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
17 months agomptcp: support rstreason for passive reset
Jason Xing [Thu, 25 Apr 2024 03:13:38 +0000 (11:13 +0800)]
mptcp: support rstreason for passive reset

It relies on what reset options in the skb are as rfc8684 says. Reusing
this logic can save us much energy. This patch replaces most of the prior
NOT_SPECIFIED reasons.

Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
17 months agotcp: support rstreason for passive reset
Jason Xing [Thu, 25 Apr 2024 03:13:37 +0000 (11:13 +0800)]
tcp: support rstreason for passive reset

Reuse the dropreason logic to show the exact reason of tcp reset,
so we can finally display the corresponding item in enum sk_reset_reason
instead of reinventing new reset reasons. This patch replaces all
the prior NOT_SPECIFIED reasons.

Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
17 months agorstreason: prepare for active reset
Jason Xing [Thu, 25 Apr 2024 03:13:36 +0000 (11:13 +0800)]
rstreason: prepare for active reset

Like what we did to passive reset:
only passing possible reset reason in each active reset path.

No functional changes.

Signed-off-by: Jason Xing <kernelxing@tencent.com>
Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
17 months agorstreason: prepare for passive reset
Jason Xing [Thu, 25 Apr 2024 03:13:35 +0000 (11:13 +0800)]
rstreason: prepare for passive reset

Adjust the parameter and support passing reason of reset which
is for now NOT_SPECIFIED. No functional changes.

Signed-off-by: Jason Xing <kernelxing@tencent.com>
Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
17 months agonet: introduce rstreason to detect why the RST is sent
Jason Xing [Thu, 25 Apr 2024 03:13:34 +0000 (11:13 +0800)]
net: introduce rstreason to detect why the RST is sent

Add a new standalone file for the easy future extension to support
both active reset and passive reset in the TCP/DCCP/MPTCP protocols.

This patch only does the preparations for reset reason mechanism,
nothing else changes.

The reset reasons are divided into three parts:
1) reuse drop reasons for passive reset in TCP
2) our own independent reasons which aren't relying on other reasons at all
3) reuse MP_TCPRST option for MPTCP

The benefits of a standalone reset reason are listed here:
1) it can cover more than one case, such as reset reasons in MPTCP,
active reset reasons.
2) people can easily/fastly understand and maintain this mechanism.
3) we get unified format of output with prefix stripped.
4) more new reset reasons are on the way
...

I will implement the basic codes of active/passive reset reason in
those three protocols, which are not complete for this moment. For
passive reset part in TCP, I only introduce the NO_SOCKET common case
which could be set as an example.

After this series applied, it will have the ability to open a new
gate to let other people contribute more reasons into it :)

Signed-off-by: Jason Xing <kernelxing@tencent.com>
Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
17 months agoigc: Add Tx hardware timestamp request for AF_XDP zero-copy packet
Song Yoong Siang [Wed, 24 Apr 2024 21:02:54 +0000 (14:02 -0700)]
igc: Add Tx hardware timestamp request for AF_XDP zero-copy packet

This patch adds support to per-packet Tx hardware timestamp request to
AF_XDP zero-copy packet via XDP Tx metadata framework. Please note that
user needs to enable Tx HW timestamp capability via igc_ioctl() with
SIOCSHWTSTAMP cmd before sending xsk Tx hardware timestamp request.

Same as implementation in RX timestamp XDP hints kfunc metadata, Timer 0
(adjustable clock) is used in xsk Tx hardware timestamp. i225/i226 have
four sets of timestamping registers. *skb and *xsk_tx_buffer pointers
are used to indicate whether the timestamping register is already occupied.

Furthermore, a boolean variable named xsk_pending_ts is used to hold the
transmit completion until the tx hardware timestamp is ready. This is
because, for i225/i226, the timestamp notification event comes some time
after the transmit completion event. The driver will retrigger hardware irq
to clean the packet after retrieve the tx hardware timestamp.

Besides, xsk_meta is added into struct igc_tx_timestamp_request as a hook
to the metadata location of the transmit packet. When the Tx timestamp
interrupt is fired, the interrupt handler will copy the value of Tx hwts
into metadata location via xsk_tx_metadata_complete().

This patch is tested with tools/testing/selftests/bpf/xdp_hw_metadata
on Intel ADL-S platform. Below are the test steps and results.

Test Step 1: Run xdp_hw_metadata app
 ./xdp_hw_metadata <iface> > /dev/shm/result.log

Test Step 2: Enable Tx hardware timestamp
 hwstamp_ctl -i <iface> -t 1 -r 1

Test Step 3: Run ptp4l and phc2sys for time synchronization

Test Step 4: Generate UDP packets with 1ms interval for 10s
 trafgen --dev <iface> '{eth(da=<addr>), udp(dp=9091)}' -t 1ms -n 10000

Test Step 5: Rerun Step 1-3 with 10s iperf3 as background traffic

Test Step 6: Rerun Step 1-4 with 10s iperf3 as background traffic

Based on iperf3 results below, the impact of holding tx completion to
throughput is not observable.

Result of last UDP packet (no. 10000) in Step 4:
poll: 1 (0) skip=99 fail=0 redir=10000
xsk_ring_cons__peek: 1
0x5640a37972d0: rx_desc[9999]->addr=f2110 addr=f2110 comp_addr=f2110 EoP
rx_hash: 0x2049BE1D with RSS type:0x1
HW RX-time:   1679819246792971268 (sec:1679819246.7930) delta to User RX-time sec:0.0000 (14.990 usec)
XDP RX-time:   1679819246792981987 (sec:1679819246.7930) delta to User RX-time sec:0.0000 (4.271 usec)
No rx_vlan_tci or rx_vlan_proto, err=-95
0x5640a37972d0: ping-pong with csum=ab19 (want 315b) csum_start=34 csum_offset=6
0x5640a37972d0: complete tx idx=9999 addr=f010
HW TX-complete-time:   1679819246793036971 (sec:1679819246.7930) delta to User TX-complete-time sec:0.0001 (77.656 usec)
XDP RX-time:   1679819246792981987 (sec:1679819246.7930) delta to User TX-complete-time sec:0.0001 (132.640 usec)
HW RX-time:   1679819246792971268 (sec:1679819246.7930) delta to HW TX-complete-time sec:0.0001 (65.703 usec)
0x5640a37972d0: complete rx idx=10127 addr=f2110

Result of iperf3 without tx hwts request in step 5:
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  2.74 GBytes  2.36 Gbits/sec    0             sender
[  5]   0.00-10.05  sec  2.74 GBytes  2.34 Gbits/sec                  receiver

Result of iperf3 running parallel with trafgen command in step 6:
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  2.74 GBytes  2.36 Gbits/sec    0             sender
[  5]   0.00-10.04  sec  2.74 GBytes  2.34 Gbits/sec                  receiver

Co-developed-by: Lai Peter Jun Ann <jun.ann.lai@intel.com>
Signed-off-by: Lai Peter Jun Ann <jun.ann.lai@intel.com>
Signed-off-by: Song Yoong Siang <yoong.siang.song@intel.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Tested-by: Naama Meir <naamax.meir@linux.intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://lore.kernel.org/r/20240424210256.3440903-1-anthony.l.nguyen@intel.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
18 months agoMerge branch 'selftests-virtio_net-introduce-initial-testing-infrastructure'
Paolo Abeni [Fri, 26 Apr 2024 11:26:56 +0000 (13:26 +0200)]
Merge branch 'selftests-virtio_net-introduce-initial-testing-infrastructure'

Jiri Pirko says:

====================
selftests: virtio_net: introduce initial testing infrastructure

This patchset aims at introducing very basic initial infrastructure
for virtio_net testing, namely it focuses on virtio feature testing.

The first patch adds support for debugfs for virtio devices, allowing
user to filter features to pretend to be driver that is not capable
of the filtered feature.

Example:
$ cat /sys/bus/virtio/devices/virtio0/features
1110010111111111111101010000110010000000100000000000000000000000
$ echo "5" >/sys/kernel/debug/virtio/virtio0/filter_feature_add
$ cat /sys/kernel/debug/virtio/virtio0/filter_features
5
$ echo "virtio0" > /sys/bus/virtio/drivers/virtio_net/unbind
$ echo "virtio0" > /sys/bus/virtio/drivers/virtio_net/bind
$ cat /sys/bus/virtio/devices/virtio0/features
1110000111111111111101010000110010000000100000000000000000000000

Leverage that in the last patch that lays ground for virtio_net
selftests testing, including very basic F_MAC feature test.

To run this, do:
$ make -C tools/testing/selftests/ TARGETS=drivers/net/virtio_net/ run_tests

It is assumed, as with lot of other selftests in the net group,
that there are netdevices connected back-to-back. In this case,
two virtio_net devices connected back to back. If you use "tap" qemu
netdevice type, to configure this loop on a hypervisor, one may use
this script:

DEV1="$1"
DEV2="$2"

sudo tc qdisc add dev $DEV1 clsact
sudo tc qdisc add dev $DEV2 clsact
sudo tc filter add dev $DEV1 ingress protocol all pref 1 matchall action mirred egress redirect dev $DEV2
sudo tc filter add dev $DEV2 ingress protocol all pref 1 matchall action mirred egress redirect dev $DEV1
sudo ip link set $DEV1 up
sudo ip link set $DEV2 up

Another possibility is to use virtme-ng like this:
$ vng --network=loop
or directly:
$ vng --network=loop -- make -C tools/testing/selftests/ TARGETS=drivers/net/virtio_net/ run_tests

"loop" network type will take care of creating two "hubport" qemu netdevs
putting them into a single hub.

To do it manually with qemu, pass following command line options:
-nic hubport,hubid=1,id=nd0,model=virtio-net-pci
-nic hubport,hubid=1,id=nd1,model=virtio-net-pci
====================

Link: https://lore.kernel.org/r/20240424104049.3935572-1-jiri@resnulli.us
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
18 months agoselftests: virtio_net: add initial tests
Jiri Pirko [Wed, 24 Apr 2024 10:40:49 +0000 (12:40 +0200)]
selftests: virtio_net: add initial tests

Introduce initial tests for virtio_net driver. Focus on feature testing
leveraging previously introduced debugfs feature filtering
infrastructure. Add very basic ping and F_MAC feature tests.

To run this, do:
$ make -C tools/testing/selftests/ TARGETS=drivers/net/virtio_net/ run_tests

Run it on a system with 2 virtio_net devices connected back-to-back
on the hypervisor.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Tested-by: Benjamin Poirier <bpoirier@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
18 months agoselftests: forwarding: add wait_for_dev() helper
Jiri Pirko [Wed, 24 Apr 2024 10:40:48 +0000 (12:40 +0200)]
selftests: forwarding: add wait_for_dev() helper

The existing setup_wait*() helper family check the status of the
interface to be up. Introduce wait_for_dev() to wait for the netdevice
to appear, for example after test script does manual device bind.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Benjamin Poirier <bpoirier@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
18 months agoselftests: forwarding: add check_driver() helper
Jiri Pirko [Wed, 24 Apr 2024 10:40:47 +0000 (12:40 +0200)]
selftests: forwarding: add check_driver() helper

Add a helper to be used to check if the netdevice is backed by specified
driver.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Benjamin Poirier <bpoirier@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
18 months agoselftests: forwarding: add ability to assemble NETIFS array by driver name
Jiri Pirko [Wed, 24 Apr 2024 10:40:46 +0000 (12:40 +0200)]
selftests: forwarding: add ability to assemble NETIFS array by driver name

Allow driver tests to work without specifying the netdevice names.
Introduce a possibility to search for available netdevices according to
set driver name. Allow test to specify the name by setting
NETIF_FIND_DRIVER variable.

Note that user overrides this either by passing netdevice names on the
command line or by declaring NETIFS array in custom forwarding.config
configuration file.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Benjamin Poirier <bpoirier@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
18 months agovirtio: add debugfs infrastructure to allow to debug virtio features
Jiri Pirko [Wed, 24 Apr 2024 10:40:45 +0000 (12:40 +0200)]
virtio: add debugfs infrastructure to allow to debug virtio features

Currently there is no way for user to set what features the driver
should obey or not, it is hard wired in the code.

In order to be able to debug the device behavior in case some feature is
disabled, introduce a debugfs infrastructure with couple of files
allowing user to see what features the device advertises and
to set filter for features used by driver.

Example:
$cat /sys/bus/virtio/devices/virtio0/features
1110010111111111111101010000110010000000100000000000000000000000
$ echo "5" >/sys/kernel/debug/virtio/virtio0/filter_feature_add
$ cat /sys/kernel/debug/virtio/virtio0/filter_features
5
$ echo "virtio0" > /sys/bus/virtio/drivers/virtio_net/unbind
$ echo "virtio0" > /sys/bus/virtio/drivers/virtio_net/bind
$ cat /sys/bus/virtio/devices/virtio0/features
1110000111111111111101010000110010000000100000000000000000000000

Note that sysfs "features" now already exists, this patch does not
touch it.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
18 months agoMerge branch 'net-hsr-add-support-for-hsr-san-redbox'
Paolo Abeni [Fri, 26 Apr 2024 10:04:45 +0000 (12:04 +0200)]
Merge branch 'net-hsr-add-support-for-hsr-san-redbox'

Lukasz Majewski says:

====================
net: hsr: Add support for HSR-SAN (RedBOX)

This patch set provides v6 of HSR-SAN (RedBOX) as well as hsr_redbox.sh
test script.

The most straightforward way to test those patches is to use buildroot
(2024.02.01) to create rootfs and QEMU based environment to run x86_64
Linux.

Then one shall run hsr_redbox.sh and hsr_ping.sh from
tools/testing/selftests/net/hsr.
====================

Link: https://lore.kernel.org/r/20240423124908.2073400-1-lukma@denx.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
18 months agotest: hsr: Add test for HSR RedBOX (HSR-SAN) mode of operation
Lukasz Majewski [Tue, 23 Apr 2024 12:49:08 +0000 (14:49 +0200)]
test: hsr: Add test for HSR RedBOX (HSR-SAN) mode of operation

This patch adds hsr_redbox.sh script to test if HSR-SAN mode of operation
works correctly.

Signed-off-by: Lukasz Majewski <lukma@denx.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
18 months agotest: hsr: Extract version agnostic information from ping command output
Lukasz Majewski [Tue, 23 Apr 2024 12:49:07 +0000 (14:49 +0200)]
test: hsr: Extract version agnostic information from ping command output

Current code checks if ping command output match hardcoded pattern:
"10 packets transmitted, 10 received, 0% packet loss,".

Such approach will work only from one ping program version (for which
this test has been originally written).
This patch address problem when ping with different summary output
like "10 packets transmitted, 10 packets received, 0% packet" is
used to run this test - for example one from busybox (as the test
system runs in QEMU with rootfs created with buildroot).

The fix is to modify output of ping command to be agnostic to ping
version used on the platform.

Signed-off-by: Lukasz Majewski <lukma@denx.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
18 months agotest: hsr: Move common code to hsr_common.sh file
Lukasz Majewski [Tue, 23 Apr 2024 12:49:06 +0000 (14:49 +0200)]
test: hsr: Move common code to hsr_common.sh file

Some of the code already present in the hsr_ping.sh test program can be
moved to a separate script file, so it can be reused by other HSR
functionality (like HSR-SAN) tests.

Signed-off-by: Lukasz Majewski <lukma@denx.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
18 months agotest: hsr: Remove script code already implemented in lib.sh
Lukasz Majewski [Tue, 23 Apr 2024 12:49:05 +0000 (14:49 +0200)]
test: hsr: Remove script code already implemented in lib.sh

Some parts (like netns creation and cleanup) of hsr_ping.sh script are
already implemented in ../lib.sh common script, so can be replaced by it.

Signed-off-by: Lukasz Majewski <lukma@denx.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
18 months agonet: hsr: Provide RedBox support (HSR-SAN)
Lukasz Majewski [Tue, 23 Apr 2024 12:49:04 +0000 (14:49 +0200)]
net: hsr: Provide RedBox support (HSR-SAN)

Introduce RedBox support (HSR-SAN to be more precise) for HSR networks.
Following traffic reduction optimizations have been implemented:
- Do not send HSR supervisory frames to Port C (interlink)
- Do not forward to HSR ring frames addressed to Port C
- Do not forward to Port C frames from HSR ring
- Do not send duplicate HSR frame to HSR ring when destination is Port C

The corresponding patch to modify iptable2 sources has already been sent:
https://lore.kernel.org/netdev/20240308145729.490863-1-lukma@denx.de/T/

Testing procedure (veth and netns):
-----------------------------------
One shall run:
linux-vanila/tools/testing/selftests/net/hsr/hsr_redbox.sh
(Detailed description of the setup one can find in the test
script file).

Testing procedure (real hardware):
----------------------------------
The EVB-KSZ9477 has been used for testing on net-next branch
(SHA1: 5fc68320c1fb3c7d456ddcae0b4757326a043e6f).

Ports 4/5 were used for SW managed HSR (hsr1) as first hsr0 for ports 1/2
(with HW offloading for ksz9477) was created. Port 3 has been used as
interlink port (single USB-ETH dongle).

Configuration - RedBox (EVB-KSZ9477):
if link set lan1 down;ip link set lan2 down
ip link add name hsr0 type hsr slave1 lan1 slave2 lan2 supervision 45 version 1
ip link add name hsr1 type hsr slave1 lan4 slave2 lan5 interlink lan3 supervision 45 version 1
ip link set lan4 up;ip link set lan5 up
ip link set lan3 up
ip addr add 192.168.0.11/24 dev hsr1
ip link set hsr1 up

Configuration - DAN-H (EVB-KSZ9477):

ip link set lan1 down;ip link set lan2 down
ip link add name hsr0 type hsr slave1 lan1 slave2 lan2 supervision 45 version 1
ip link add name hsr1 type hsr slave1 lan4 slave2 lan5 supervision 45 version 1
ip link set lan4 up;ip link set lan5 up
ip addr add 192.168.0.12/24 dev hsr1
ip link set hsr1 up

This approach uses only SW based HSR devices (hsr1).

--------------          -----------------       ------------
DAN-H  Port5 | <------> | Port5         |       |
       Port4 | <------> | Port4   Port3 | <---> | PC
             |          | (RedBox)      |       | (USB-ETH)
EVB-KSZ9477  |          | EVB-KSZ9477   |       |
--------------          -----------------       ------------

Signed-off-by: Lukasz Majewski <lukma@denx.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
18 months agonet/sched: fix false lockdep warning on qdisc root lock
Davide Caratti [Thu, 18 Apr 2024 13:50:11 +0000 (15:50 +0200)]
net/sched: fix false lockdep warning on qdisc root lock

Xiumei and Christoph reported the following lockdep splat, complaining of
the qdisc root lock being taken twice:

 ============================================
 WARNING: possible recursive locking detected
 6.7.0-rc3+ #598 Not tainted
 --------------------------------------------
 swapper/2/0 is trying to acquire lock:
 ffff888177190110 (&sch->q.lock){+.-.}-{2:2}, at: __dev_queue_xmit+0x1560/0x2e70

 but task is already holding lock:
 ffff88811995a110 (&sch->q.lock){+.-.}-{2:2}, at: __dev_queue_xmit+0x1560/0x2e70

 other info that might help us debug this:
  Possible unsafe locking scenario:

        CPU0
        ----
   lock(&sch->q.lock);
   lock(&sch->q.lock);

  *** DEADLOCK ***

  May be due to missing lock nesting notation

 5 locks held by swapper/2/0:
  #0: ffff888135a09d98 ((&in_dev->mr_ifc_timer)){+.-.}-{0:0}, at: call_timer_fn+0x11a/0x510
  #1: ffffffffaaee5260 (rcu_read_lock){....}-{1:2}, at: ip_finish_output2+0x2c0/0x1ed0
  #2: ffffffffaaee5200 (rcu_read_lock_bh){....}-{1:2}, at: __dev_queue_xmit+0x209/0x2e70
  #3: ffff88811995a110 (&sch->q.lock){+.-.}-{2:2}, at: __dev_queue_xmit+0x1560/0x2e70
  #4: ffffffffaaee5200 (rcu_read_lock_bh){....}-{1:2}, at: __dev_queue_xmit+0x209/0x2e70

 stack backtrace:
 CPU: 2 PID: 0 Comm: swapper/2 Not tainted 6.7.0-rc3+ #598
 Hardware name: Red Hat KVM, BIOS 1.13.0-2.module+el8.3.0+7353+9de0a3cc 04/01/2014
 Call Trace:
  <IRQ>
  dump_stack_lvl+0x4a/0x80
  __lock_acquire+0xfdd/0x3150
  lock_acquire+0x1ca/0x540
  _raw_spin_lock+0x34/0x80
  __dev_queue_xmit+0x1560/0x2e70
  tcf_mirred_act+0x82e/0x1260 [act_mirred]
  tcf_action_exec+0x161/0x480
  tcf_classify+0x689/0x1170
  prio_enqueue+0x316/0x660 [sch_prio]
  dev_qdisc_enqueue+0x46/0x220
  __dev_queue_xmit+0x1615/0x2e70
  ip_finish_output2+0x1218/0x1ed0
  __ip_finish_output+0x8b3/0x1350
  ip_output+0x163/0x4e0
  igmp_ifc_timer_expire+0x44b/0x930
  call_timer_fn+0x1a2/0x510
  run_timer_softirq+0x54d/0x11a0
  __do_softirq+0x1b3/0x88f
  irq_exit_rcu+0x18f/0x1e0
  sysvec_apic_timer_interrupt+0x6f/0x90
  </IRQ>

This happens when TC does a mirred egress redirect from the root qdisc of
device A to the root qdisc of device B. As long as these two locks aren't
protecting the same qdisc, they can be acquired in chain: add a per-qdisc
lockdep key to silence false warnings.
This dynamic key should safely replace the static key we have in sch_htb:
it was added to allow enqueueing to the device "direct qdisc" while still
holding the qdisc root lock.

v2: don't use static keys anymore in HTB direct qdiscs (thanks Eric Dumazet)

CC: Maxim Mikityanskiy <maxim@isovalent.com>
CC: Xiumei Mu <xmu@redhat.com>
Reported-by: Christoph Paasch <cpaasch@apple.com>
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/451
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Link: https://lore.kernel.org/r/7dc06d6158f72053cf877a82e2a7a5bd23692faa.1713448007.git.dcaratti@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
18 months agoMerge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next...
Jakub Kicinski [Fri, 26 Apr 2024 03:00:54 +0000 (20:00 -0700)]
Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue

Tony Nguyen says:

====================
net: intel: start The Great Code Dedup + Page Pool for iavf

Alexander Lobakin says:

Here's a two-shot: introduce {,Intel} Ethernet common library (libeth and
libie) and switch iavf to Page Pool. Details are in the commit messages;
here's a summary:

Not a secret there's a ton of code duplication between two and more Intel
ethernet modules. Before introducing new changes, which would need to be
copied over again, start decoupling the already existing duplicate
functionality into a new module, which will be shared between several
Intel Ethernet drivers. The first name that came to my mind was
"libie" -- "Intel Ethernet common library". Also this sounds like
"lovelie" (-> one word, no "lib I E" pls) and can be expanded as
"lib Internet Explorer" :P
The "generic", pure-software part is placed separately, so that it can be
easily reused in any driver by any vendor without linking to the Intel
pre-200G guts. In a few words, it's something any modern driver does the
same way, but nobody moved it level up (yet).
The series is only the beginning. From now on, adding every new feature
or doing any good driver refactoring will remove much more lines than add
for quite some time. There's a basic roadmap with some deduplications
planned already, not speaking of that touching every line now asks:
"can I share this?". The final destination is very ambitious: have only
one unified driver for at least i40e, ice, iavf, and idpf with a struct
ops for each generation. That's never gonna happen, right? But you still
can at least try.
PP conversion for iavf lands within the same series as these two are tied
closely. libie will support Page Pool model only, so that a driver can't
use much of the lib until it's converted. iavf is only the example, the
rest will eventually be converted soon on a per-driver basis. That is
when it gets really interesting. Stay tech.

* '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
  MAINTAINERS: add entry for libeth and libie
  iavf: switch to Page Pool
  iavf: pack iavf_ring more efficiently
  libeth: add Rx buffer management
  page_pool: add DMA-sync-for-CPU inline helper
  page_pool: constify some read-only function arguments
  slab: introduce kvmalloc_array_node() and kvcalloc_node()
  iavf: drop page splitting and recycling
  iavf: kill "legacy-rx" for good
  net: intel: introduce {, Intel} Ethernet common library
====================

Link: https://lore.kernel.org/r/20240424203559.3420468-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
18 months agoMerge branch 'net-lan966x-flower-validate-control-flags'
Jakub Kicinski [Fri, 26 Apr 2024 02:36:39 +0000 (19:36 -0700)]
Merge branch 'net-lan966x-flower-validate-control-flags'

Asbjørn Sloth Tønnesen says:

====================
net: lan966x: flower: validate control flags

This series adds flower control flags validation to the
lan966x driver, and changes it from assuming that it handles
all control flags, to instead reject rules if they have
masked any unknown/unsupported control flags.

v1: https://lore.kernel.org/netdev/20240423102720.228728-1-ast@fiberby.net/
====================

Link: https://lore.kernel.org/r/20240424125347.461995-1-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
18 months agonet: lan966x: flower: check for unsupported control flags
Asbjørn Sloth Tønnesen [Wed, 24 Apr 2024 12:53:40 +0000 (12:53 +0000)]
net: lan966x: flower: check for unsupported control flags

Use flow_rule_is_supp_control_flags() to reject filters with
unsupported control flags.

In case any unsupported control flags are masked,
flow_rule_is_supp_control_flags() sets a NL extended
error message, and we return -EOPNOTSUPP.

Only compile-tested.

Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Link: https://lore.kernel.org/r/20240424125347.461995-4-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
18 months agonet: lan966x: flower: rename goto in lan966x_tc_flower_handler_control_usage()
Asbjørn Sloth Tønnesen [Wed, 24 Apr 2024 12:53:39 +0000 (12:53 +0000)]
net: lan966x: flower: rename goto in lan966x_tc_flower_handler_control_usage()

Rename goto label, as the error message is specific to the fragment flags.

Only compile-tested.

Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Link: https://lore.kernel.org/r/20240424125347.461995-3-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
18 months agonet: lan966x: flower: add extack to lan966x_tc_flower_handler_control_usage()
Asbjørn Sloth Tønnesen [Wed, 24 Apr 2024 12:53:38 +0000 (12:53 +0000)]
net: lan966x: flower: add extack to lan966x_tc_flower_handler_control_usage()

Define extack locally, to reduce line lengths and aid future users.

Only compile-tested.

Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Link: https://lore.kernel.org/r/20240424125347.461995-2-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
18 months agoMerge branch 'net-sparx5-flower-validate-control-flags'
Jakub Kicinski [Fri, 26 Apr 2024 02:35:10 +0000 (19:35 -0700)]
Merge branch 'net-sparx5-flower-validate-control-flags'

Asbjørn Sloth Tønnesen says:

====================
net: sparx5: flower: validate control flags

This series adds flower control flags validation to the
sparx5 driver, and changes it from assuming that it handles
all control flags, to instead reject rules if they have
masked any unknown/unsupported control flags.
Reviewed-by: Daniel Machon <daniel.machon@microchip.com>
Tested-by: Daniel Machon <daniel.machon@microchip.com>
v1: https://lore.kernel.org/netdev/20240423102728.228765-1-ast@fiberby.net/
====================

Link: https://lore.kernel.org/r/20240424121632.459022-1-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
18 months agonet: sparx5: flower: check for unsupported control flags
Asbjørn Sloth Tønnesen [Wed, 24 Apr 2024 12:16:25 +0000 (12:16 +0000)]
net: sparx5: flower: check for unsupported control flags

Use flow_rule_is_supp_control_flags() to reject filters with
unsupported control flags.

In case any unsupported control flags are masked,
flow_rule_is_supp_control_flags() sets a NL extended
error message, and we return -EOPNOTSUPP.

Only compile-tested.

Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Daniel Machon <daniel.machon@microchip.com>
Tested-by: Daniel Machon <daniel.machon@microchip.com>
Link: https://lore.kernel.org/r/20240424121632.459022-5-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
18 months agonet: sparx5: flower: remove goto in sparx5_tc_flower_handler_control_usage()
Asbjørn Sloth Tønnesen [Wed, 24 Apr 2024 12:16:24 +0000 (12:16 +0000)]
net: sparx5: flower: remove goto in sparx5_tc_flower_handler_control_usage()

Remove goto, as it's only used once, and the error message is
specific to that context.

Only compile tested.

Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Daniel Machon <daniel.machon@microchip.com>
Tested-by: Daniel Machon <daniel.machon@microchip.com>
Link: https://lore.kernel.org/r/20240424121632.459022-4-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
18 months agonet: sparx5: flower: add extack to sparx5_tc_flower_handler_control_usage()
Asbjørn Sloth Tønnesen [Wed, 24 Apr 2024 12:16:23 +0000 (12:16 +0000)]
net: sparx5: flower: add extack to sparx5_tc_flower_handler_control_usage()

Define extack locally, to reduce line lengths and aid future users.

Only compile tested.

Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Daniel Machon <daniel.machon@microchip.com>
Tested-by: Daniel Machon <daniel.machon@microchip.com>
Link: https://lore.kernel.org/r/20240424121632.459022-3-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
18 months agonet: sparx5: flower: only do lookup if fragment flags are set
Asbjørn Sloth Tønnesen [Wed, 24 Apr 2024 12:16:22 +0000 (12:16 +0000)]
net: sparx5: flower: only do lookup if fragment flags are set

The fragment lookup should only be performed, when
at least one of the fragment flags are set.

This change was deliberately not included in commit
68aba00483c7 ("net: sparx5: flower: fix fragment flags handling")
as it's only needed for future proffing the code, since
"mask" is currently only set in conjunction with the
fragment flags.
(The 3rd flag FLOW_DIS_ENCAPSULATION is only used with "key")

Only compile tested.

Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Daniel Machon <daniel.machon@microchip.com>
Tested-by: Daniel Machon <daniel.machon@microchip.com>
Link: https://lore.kernel.org/r/20240424121632.459022-2-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
18 months agonet: wwan: t7xx: Un-embed dummy device
Breno Leitao [Wed, 24 Apr 2024 16:11:07 +0000 (09:11 -0700)]
net: wwan: t7xx: Un-embed dummy device

Embedding net_device into structures prohibits the usage of flexible
arrays in the net_device structure. For more details, see the discussion
at [1].

Un-embed the net_device from the private struct by converting it
into a pointer. Then use the leverage the new alloc_netdev_dummy()
helper to allocate and initialize dummy devices.

[1] https://lore.kernel.org/all/20240229225910.79e224cf@kernel.org/

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20240424161108.3397057-1-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
18 months agoMerge branch 'net-microchip-correct-spelling-in-comments'
Jakub Kicinski [Fri, 26 Apr 2024 02:13:28 +0000 (19:13 -0700)]
Merge branch 'net-microchip-correct-spelling-in-comments'

Simon Horman says:

====================
net: microchip: Correct spelling in comments

Correct spelling in comments in Microchip drivers.
Flagged by codespell.

v1: https://lore.kernel.org/r/20240419-lan743x-confirm-v1-0-2a087617a3e5@kernel.org
====================

Link: https://lore.kernel.org/r/20240424-lan743x-confirm-v2-0-f0480542e39f@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
18 months agonet: sparx5: Correct spelling in comments
Simon Horman [Wed, 24 Apr 2024 15:13:26 +0000 (16:13 +0100)]
net: sparx5: Correct spelling in comments

Correct spelling in comments, as flagged by codespell.

Signed-off-by: Simon Horman <horms@kernel.org>
Reviewed-by: Daniel Machon <daniel.machon@microchip.com>
Link: https://lore.kernel.org/r/20240424-lan743x-confirm-v2-4-f0480542e39f@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
18 months agonet: encx24j600: Correct spelling in comments
Simon Horman [Wed, 24 Apr 2024 15:13:25 +0000 (16:13 +0100)]
net: encx24j600: Correct spelling in comments

Correct spelling in comments, as flagged by codespell.

Signed-off-by: Simon Horman <horms@kernel.org>
Reviewed-by: Daniel Machon <daniel.machon@microchip.com>
Link: https://lore.kernel.org/r/20240424-lan743x-confirm-v2-3-f0480542e39f@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
18 months agonet: lan966x: Correct spelling in comments
Simon Horman [Wed, 24 Apr 2024 15:13:24 +0000 (16:13 +0100)]
net: lan966x: Correct spelling in comments

Correct spelling in comments, as flagged by codespell.

Signed-off-by: Simon Horman <horms@kernel.org>
Reviewed-by: Daniel Machon <daniel.machon@microchip.com>
Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Link: https://lore.kernel.org/r/20240424-lan743x-confirm-v2-2-f0480542e39f@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
18 months agonet: lan743x: Correct spelling in comments
Simon Horman [Wed, 24 Apr 2024 15:13:23 +0000 (16:13 +0100)]
net: lan743x: Correct spelling in comments

Correct spelling in comments, as flagged by codespell.

Signed-off-by: Simon Horman <horms@kernel.org>
Reviewed-by: Daniel Machon <daniel.machon@microchip.com>
Link: https://lore.kernel.org/r/20240424-lan743x-confirm-v2-1-f0480542e39f@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
18 months agor8152: replace dev_info with dev_dbg for loading firmware
Hayes Wang [Wed, 24 Apr 2024 08:45:32 +0000 (16:45 +0800)]
r8152: replace dev_info with dev_dbg for loading firmware

Someone complains the message appears continuously. This occurs
because the device is woken from UPS mode, and the driver re-loads
the firmware.

When the device enters runtime suspend and cable is unplugged, the
device would enter UPS mode. If the runtime resume occurs, and the
device is woken from UPS mode, the driver has to re-load the firmware
and causes the message. If someone wakes the device continuously, the
message would be shown continuously, too. Use dev_dbg to avoid it.

Note that, the function could be called before register_netdev(), so I
don't use netif_info() or netif_dbg().

Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Link: https://lore.kernel.org/r/20240424084532.159649-1-hayeswang@realtek.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
18 months agonet: usb: ax88179_178a: Add check for usbnet_get_endpoints()
Ma Ke [Wed, 24 Apr 2024 06:56:34 +0000 (14:56 +0800)]
net: usb: ax88179_178a: Add check for usbnet_get_endpoints()

To avoid the failure of usbnet_get_endpoints(), we should check the
return value of the usbnet_get_endpoints().

Signed-off-by: Ma Ke <make_ruc2021@163.com>
Reviewed-by: Hariprasad Kelam <hkelam@marvell.com>
Link: https://lore.kernel.org/r/20240424065634.1870027-1-make_ruc2021@163.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
18 months agonet: sfp: add quirk for ATS SFP-GE-T 1000Base-TX module
Daniel Golle [Tue, 23 Apr 2024 09:00:25 +0000 (11:00 +0200)]
net: sfp: add quirk for ATS SFP-GE-T 1000Base-TX module

Add quirk for ATS SFP-GE-T 1000Base-TX module.

This copper module comes with broken TX_FAULT indicator which must be
ignored for it to work.

Co-authored-by: Josef Schlehofer <pepe.schlehofer@gmail.com>
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
[ rebased on top of net-next ]
Signed-off-by: Marek Behún <kabel@kernel.org>
Link: https://lore.kernel.org/r/20240423090025.29231-1-kabel@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
18 months agonet: sfp: enhance quirk for Fibrestore 2.5G copper SFP module
Marek Behún [Tue, 23 Apr 2024 08:50:39 +0000 (10:50 +0200)]
net: sfp: enhance quirk for Fibrestore 2.5G copper SFP module

Enhance the quirk for Fibrestore 2.5G copper SFP module. The original
commit e27aca3760c0 ("net: sfp: add quirk for FS's 2.5G copper SFP")
introducing the quirk says that the PHY is inaccessible, but that is
not true.

The module uses Rollball protocol to talk to the PHY, and needs a 4
second wait before probing it, same as FS 10G module.

The PHY inside the module is Realtek RTL8221B-VB-CG PHY. The realtek
driver recently gained support to set it up via clause 45 accesses.

Signed-off-by: Marek Behún <kabel@kernel.org>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20240423085039.26957-2-kabel@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
18 months agonet: sfp: update comment for FS SFP-10G-T quirk
Marek Behún [Tue, 23 Apr 2024 08:50:38 +0000 (10:50 +0200)]
net: sfp: update comment for FS SFP-10G-T quirk

Update the comment for the Fibrestore SFP-10G-T module: since commit
e9301af385e7 ("net: sfp: fix PHY discovery for FS SFP-10G-T module")
we also do a 4 second wait before probing the PHY.

Fixes: e9301af385e7 ("net: sfp: fix PHY discovery for FS SFP-10G-T module")
Signed-off-by: Marek Behún <kabel@kernel.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20240423085039.26957-1-kabel@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
18 months agonet: add two more call_rcu_hurry()
Eric Dumazet [Tue, 23 Apr 2024 20:54:08 +0000 (20:54 +0000)]
net: add two more call_rcu_hurry()

I had failures with pmtu.sh selftests lately,
with netns dismantles firing ref_tracking alerts [1].

After much debugging, I found that some queued
rcu callbacks were delayed by minutes, because
of CONFIG_RCU_LAZY=y option.

Joel Fernandes had a similar issue in the past,
fixed with commit 483c26ff63f4 ("net: Use call_rcu_hurry()
for dst_release()")

In this commit, I make sure nexthop_free_rcu()
and free_fib_info_rcu() are not delayed too much
because they both can release device references.

tools/testing/selftests/net/pmtu.sh no longer fails.

Traces were:

[  968.179860] ref_tracker: veth_A-R1@00000000d0ff3fe2 has 3/5 users at
                    dst_alloc+0x76/0x160
                    ip6_dst_alloc+0x25/0x80
                    ip6_pol_route+0x2a8/0x450
                    ip6_pol_route_output+0x1f/0x30
                    fib6_rule_lookup+0x163/0x270
                    ip6_route_output_flags+0xda/0x190
                    ip6_dst_lookup_tail.constprop.0+0x1d0/0x260
                    ip6_dst_lookup_flow+0x47/0xa0
                    udp_tunnel6_dst_lookup+0x158/0x210
                    vxlan_xmit_one+0x4c2/0x1550 [vxlan]
                    vxlan_xmit+0x52d/0x14f0 [vxlan]
                    dev_hard_start_xmit+0x7b/0x1e0
                    __dev_queue_xmit+0x20b/0xe40
                    ip6_finish_output2+0x2ea/0x6e0
                    ip6_finish_output+0x143/0x320
                    ip6_output+0x74/0x140

[  968.179860] ref_tracker: veth_A-R1@00000000d0ff3fe2 has 1/5 users at
                    netdev_get_by_index+0xc0/0xe0
                    fib6_nh_init+0x1a9/0xa90
                    rtm_new_nexthop+0x6fa/0x1580
                    rtnetlink_rcv_msg+0x155/0x3e0
                    netlink_rcv_skb+0x61/0x110
                    rtnetlink_rcv+0x19/0x20
                    netlink_unicast+0x23f/0x380
                    netlink_sendmsg+0x1fc/0x430
                    ____sys_sendmsg+0x2ef/0x320
                    ___sys_sendmsg+0x86/0xd0
                    __sys_sendmsg+0x67/0xc0
                    __x64_sys_sendmsg+0x21/0x30
                    x64_sys_call+0x252/0x2030
                    do_syscall_64+0x6c/0x190
                    entry_SYSCALL_64_after_hwframe+0x76/0x7e

[  968.179860] ref_tracker: veth_A-R1@00000000d0ff3fe2 has 1/5 users at
                    ipv6_add_dev+0x136/0x530
                    addrconf_notify+0x19d/0x770
                    notifier_call_chain+0x65/0xd0
                    raw_notifier_call_chain+0x1a/0x20
                    call_netdevice_notifiers_info+0x54/0x90
                    register_netdevice+0x61e/0x790
                    veth_newlink+0x230/0x440
                    __rtnl_newlink+0x7d2/0xaa0
                    rtnl_newlink+0x4c/0x70
                    rtnetlink_rcv_msg+0x155/0x3e0
                    netlink_rcv_skb+0x61/0x110
                    rtnetlink_rcv+0x19/0x20
                    netlink_unicast+0x23f/0x380
                    netlink_sendmsg+0x1fc/0x430
                    ____sys_sendmsg+0x2ef/0x320
                    ___sys_sendmsg+0x86/0xd0
....
[ 1079.316024]  ? show_regs+0x68/0x80
[ 1079.316087]  ? __warn+0x8c/0x140
[ 1079.316103]  ? ref_tracker_free+0x1a0/0x270
[ 1079.316117]  ? report_bug+0x196/0x1c0
[ 1079.316135]  ? handle_bug+0x42/0x80
[ 1079.316149]  ? exc_invalid_op+0x1c/0x70
[ 1079.316162]  ? asm_exc_invalid_op+0x1f/0x30
[ 1079.316193]  ? ref_tracker_free+0x1a0/0x270
[ 1079.316208]  ? _raw_spin_unlock+0x1a/0x40
[ 1079.316222]  ? free_unref_page+0x126/0x1a0
[ 1079.316239]  ? destroy_large_folio+0x69/0x90
[ 1079.316251]  ? __folio_put+0x99/0xd0
[ 1079.316276]  dst_dev_put+0x69/0xd0
[ 1079.316308]  fib6_nh_release_dsts.part.0+0x3d/0x80
[ 1079.316327]  fib6_nh_release+0x45/0x70
[ 1079.316340]  nexthop_free_rcu+0x131/0x170
[ 1079.316356]  rcu_do_batch+0x1ee/0x820
[ 1079.316370]  ? rcu_do_batch+0x179/0x820
[ 1079.316388]  rcu_core+0x1aa/0x4d0
[ 1079.316405]  rcu_core_si+0x12/0x20
[ 1079.316417]  __do_softirq+0x13a/0x3dc
[ 1079.316435]  __irq_exit_rcu+0xa3/0x110
[ 1079.316449]  irq_exit_rcu+0x12/0x30
[ 1079.316462]  sysvec_apic_timer_interrupt+0x5b/0xe0
[ 1079.316474]  asm_sysvec_apic_timer_interrupt+0x1f/0x30
[ 1079.316569] RIP: 0033:0x7f06b65c63f0

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240423205408.39632-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
18 months agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Jakub Kicinski [Thu, 25 Apr 2024 19:40:48 +0000 (12:40 -0700)]
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Cross-merge networking fixes after downstream PR.

Conflicts:

drivers/net/ethernet/ti/icssg/icssg_prueth.c

net/mac80211/chan.c
  89884459a0b9 ("wifi: mac80211: fix idle calculation with multi-link")
  87f5500285fb ("wifi: mac80211: simplify ieee80211_assign_link_chanctx()")
https://lore.kernel.org/all/20240422105623.7b1fbda2@canb.auug.org.au/

net/unix/garbage.c
  1971d13ffa84 ("af_unix: Suppress false-positive lockdep splat for spin_lock() in __unix_gc().")
  4090fa373f0e ("af_unix: Replace garbage collection algorithm.")

drivers/net/ethernet/ti/icssg/icssg_prueth.c
drivers/net/ethernet/ti/icssg/icssg_common.c
  4dcd0e83ea1d ("net: ti: icssg-prueth: Fix signedness bug in prueth_init_rx_chns()")
  e2dc7bfd677f ("net: ti: icssg-prueth: Move common functions into a separate file")

No adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
18 months agotcp: avoid premature drops in tcp_add_backlog()
Eric Dumazet [Tue, 23 Apr 2024 12:56:20 +0000 (12:56 +0000)]
tcp: avoid premature drops in tcp_add_backlog()

While testing TCP performance with latest trees,
I saw suspect SOCKET_BACKLOG drops.

tcp_add_backlog() computes its limit with :

    limit = (u32)READ_ONCE(sk->sk_rcvbuf) +
            (u32)(READ_ONCE(sk->sk_sndbuf) >> 1);
    limit += 64 * 1024;

This does not take into account that sk->sk_backlog.len
is reset only at the very end of __release_sock().

Both sk->sk_backlog.len and sk->sk_rmem_alloc could reach
sk_rcvbuf in normal conditions.

We should double sk->sk_rcvbuf contribution in the formula
to absorb bubbles in the backlog, which happen more often
for very fast flows.

This change maintains decent protection against abuses.

Fixes: c377411f2494 ("net: sk_add_backlog() take rmem_alloc into account")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240423125620.3309458-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
18 months agoMerge tag 'wireless-next-2024-04-24' of git://git.kernel.org/pub/scm/linux/kernel...
Jakub Kicinski [Thu, 25 Apr 2024 18:49:35 +0000 (11:49 -0700)]
Merge tag 'wireless-next-2024-04-24' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next

Kalle Valo says:

====================
wireless-next patches for v6.10

The second "new features" pull request for v6.10 with changes both in
stack and in drivers. This time the pull request is rather small and
nothing special standing out except maybe that we have several
kernel-doc fixes. Great to see that we are getting warning free
wireless code (until new warnings are added).

Major changes:

rtl8xxxu:
 * enable Management Frame Protection (MFP) support

rtw88:
 * disable unsupported interface type of mesh point for all chips, and only
   support station mode for SDIO chips.

* tag 'wireless-next-2024-04-24' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (63 commits)
  wifi: mac80211: handle link ID during management Tx
  wifi: mac80211: handle sdata->u.ap.active flag with MLO
  wifi: cfg80211: add return docs for regulatory functions
  wifi: cfg80211: make some regulatory functions void
  wifi: mac80211: add return docs for sta_info_flush()
  wifi: mac80211: keep mac80211 consistent on link activation failure
  wifi: mac80211: simplify ieee80211_assign_link_chanctx()
  wifi: mac80211: reserve chanctx during find
  wifi: cfg80211: fix cfg80211 function kernel-doc
  wifi: mac80211_hwsim: Use wider regulatory for custom for 6GHz tests
  wifi: iwlwifi: mvm: Don't allow EMLSR when the RSSI is low
  wifi: iwlwifi: mvm: disable EMLSR when we suspend with wowlan
  wifi: iwlwifi: mvm: get periodic statistics in EMLSR
  wifi: iwlwifi: mvm: don't recompute EMLSR mode in can_activate_links
  wifi: iwlwifi: mvm: implement EMLSR prevention mechanism.
  wifi: iwlwifi: mvm: exit EMLSR upon missed beacon
  wifi: iwlwifi: mvm: init vif works only once
  wifi: iwlwifi: mvm: Add helper functions to update EMLSR status
  wifi: iwlwifi: mvm: Implement new link selection algorithm
  wifi: iwlwifi: mvm: move EMLSR/links code
  ...
====================

Link: https://lore.kernel.org/r/20240424100122.217AEC113CE@smtp.kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
18 months agoMerge branch 'net-dsa-b53-remove-adjust_link'
Jakub Kicinski [Thu, 25 Apr 2024 18:46:41 +0000 (11:46 -0700)]
Merge branch 'net-dsa-b53-remove-adjust_link'

Florian Fainelli says:

====================
net: dsa: b53: Remove adjust_link

b53 is now the only remaining driver that uses both PHYLIB's adjust_link
and PHYLINK's mac_ops callbacks, convert entirely to PHYLINK.
====================

Link: https://lore.kernel.org/r/20240423183339.1368511-1-florian.fainelli@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
18 months agonet: dsa: b53: provide own phylink MAC operations
Florian Fainelli [Tue, 23 Apr 2024 18:33:39 +0000 (11:33 -0700)]
net: dsa: b53: provide own phylink MAC operations

Convert b53 to provide its own phylink MAC operations, thus avoiding the
shim layer in DSA's port.c

Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://lore.kernel.org/r/20240423183339.1368511-9-florian.fainelli@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
18 months agonet: dsa: b53: Remove b53_adjust_link()
Florian Fainelli [Tue, 23 Apr 2024 18:33:38 +0000 (11:33 -0700)]
net: dsa: b53: Remove b53_adjust_link()

Only use the PHYLINK implementation from there on now that an equivalent
configuration is applied to all of the switch ports.

Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://lore.kernel.org/r/20240423183339.1368511-8-florian.fainelli@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
18 months agonet: dsa: b53: Call b53_eee_init() from b53_mac_link_up()
Florian Fainelli [Tue, 23 Apr 2024 18:33:37 +0000 (11:33 -0700)]
net: dsa: b53: Call b53_eee_init() from b53_mac_link_up()

And make sure this is done for the MLO_AN_PHY case, where it actually
makes sense, contrary to b53_adjust_link() which only did it for
fixed-PHY configurations where it does not make sense.

Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://lore.kernel.org/r/20240423183339.1368511-7-florian.fainelli@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
18 months agonet: dsa: b53: Configure RGMII for 531x5 and MII for 5325
Florian Fainelli [Tue, 23 Apr 2024 18:33:36 +0000 (11:33 -0700)]
net: dsa: b53: Configure RGMII for 531x5 and MII for 5325

Call b53_adjust_531x5_rgmii() and b53_adjust_5325_mii() from
b53_phylink_mac_config() when we have a fixed PHY in preparation for removing
b53_adjust_link(). Also move b53_adjust_63xx_rgmii() to
b53_phylink_mac_config() where it logically belongs.

Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://lore.kernel.org/r/20240423183339.1368511-6-florian.fainelli@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
18 months agonet: dsa: b53: Force flow control for BCM5301X CPU port(s)
Florian Fainelli [Tue, 23 Apr 2024 18:33:35 +0000 (11:33 -0700)]
net: dsa: b53: Force flow control for BCM5301X CPU port(s)

Just like what b53_adjust_link() does, force flow control for the
BCM5301X CPU port(s) by forcing rx_pause and tx_pause in
b53_phylink_mac_link_up(). Preparatory step for getting rid of
b53_adjust_link().

Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://lore.kernel.org/r/20240423183339.1368511-5-florian.fainelli@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
18 months agonet: dsa: b53: Introduce b53_adjust_5325_mii()
Florian Fainelli [Tue, 23 Apr 2024 18:33:34 +0000 (11:33 -0700)]
net: dsa: b53: Introduce b53_adjust_5325_mii()

Takes care of doing the 5325 switch series specific MII programming and
is called from b53_adjust_link() to allow the future removal of
b53_adjust_link().

Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://lore.kernel.org/r/20240423183339.1368511-4-florian.fainelli@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
18 months agonet: dsa: b53: Introduce b53_adjust_531x5_rgmii()
Florian Fainelli [Tue, 23 Apr 2024 18:33:33 +0000 (11:33 -0700)]
net: dsa: b53: Introduce b53_adjust_531x5_rgmii()

Takes care of doing the 531x5 switch series specific RGMII programming
and is called from b53_adjust_link() to allow the future removal of
b53_adjust_link().

Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://lore.kernel.org/r/20240423183339.1368511-3-florian.fainelli@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
18 months agonet: dsa: b53: Stop exporting b53_phylink_* routines
Florian Fainelli [Tue, 23 Apr 2024 18:33:32 +0000 (11:33 -0700)]
net: dsa: b53: Stop exporting b53_phylink_* routines

They are not used outside of the b53_common.c file, no need to be
exported.

Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://lore.kernel.org/r/20240423183339.1368511-2-florian.fainelli@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
18 months agoMerge tag 'net-6.9-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Linus Torvalds [Thu, 25 Apr 2024 18:19:38 +0000 (11:19 -0700)]
Merge tag 'net-6.9-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from netfilter, wireless and bluetooth.

  Nothing major, regression fixes are mostly in drivers, two more of
  those are flowing towards us thru various trees. I wish some of the
  changes went into -rc5, we'll try to keep an eye on frequency of PRs
  from sub-trees.

  Also disproportional number of fixes for bugs added in v6.4, strange
  coincidence.

  Current release - regressions:

   - igc: fix LED-related deadlock on driver unbind

   - wifi: mac80211: small fixes to recent clean up of the connection
     process

   - Revert "wifi: iwlwifi: bump FW API to 90 for BZ/SC devices", kernel
     doesn't have all the code to deal with that version, yet

   - Bluetooth:
       - set power_ctrl_enabled on NULL returned by gpiod_get_optional()
       - qca: fix invalid device address check, again

   - eth: ravb: fix registered interrupt names

  Current release - new code bugs:

   - wifi: mac80211: check EHT/TTLM action frame length

  Previous releases - regressions:

   - fix sk_memory_allocated_{add|sub} for architectures where
     __this_cpu_{add|sub}* are not IRQ-safe

   - dsa: mv88e6xx: fix link setup for 88E6250

  Previous releases - always broken:

   - ip: validate dev returned from __in_dev_get_rcu(), prevent possible
     null-derefs in a few places

   - switch number of for_each_rcu() loops using call_rcu() on the
     iterator to for_each_safe()

   - macsec: fix isolation of broadcast traffic in presence of offload

   - vxlan: drop packets from invalid source address

   - eth: mlxsw: trap and ACL programming fixes

   - eth: bnxt: PCIe error recovery fixes, fix counting dropped packets

   - Bluetooth:
       - lots of fixes for the command submission rework from v6.4
       - qca: fix NULL-deref on non-serdev suspend

  Misc:

   - tools: ynl: don't ignore errors in NLMSG_DONE messages"

* tag 'net-6.9-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (88 commits)
  af_unix: Suppress false-positive lockdep splat for spin_lock() in __unix_gc().
  net: b44: set pause params only when interface is up
  tls: fix lockless read of strp->msg_ready in ->poll
  dpll: fix dpll_pin_on_pin_register() for multiple parent pins
  net: ravb: Fix registered interrupt names
  octeontx2-af: fix the double free in rvu_npc_freemem()
  net: ethernet: ti: am65-cpts: Fix PTPv1 message type on TX packets
  ice: fix LAG and VF lock dependency in ice_reset_vf()
  iavf: Fix TC config comparison with existing adapter TC config
  i40e: Report MFS in decimal base instead of hex
  i40e: Do not use WQ_MEM_RECLAIM flag for workqueue
  net: ti: icssg-prueth: Fix signedness bug in prueth_init_rx_chns()
  net/mlx5e: Advertise mlx5 ethernet driver updates sk_buff md_dst for MACsec
  macsec: Detect if Rx skb is macsec-related for offloading devices that update md_dst
  ethernet: Add helper for assigning packet type when dest address does not match device address
  macsec: Enable devices to advertise whether they update sk_buff md_dst during offloads
  net: phy: dp83869: Fix MII mode failure
  netfilter: nf_tables: honor table dormant flag from netdev release event path
  eth: bnxt: fix counting packets discarded due to OOM and netpoll
  igc: Fix LED-related deadlock on driver unbind
  ...

18 months agoMerge tag 'nfsd-6.9-5' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Linus Torvalds [Thu, 25 Apr 2024 16:31:06 +0000 (09:31 -0700)]
Merge tag 'nfsd-6.9-5' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:

 - Revert some backchannel fixes that went into v6.9-rc

* tag 'nfsd-6.9-5' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  Revert "NFSD: Convert the callback workqueue to use delayed_work"
  Revert "NFSD: Reschedule CB operations when backchannel rpc_clnt is shut down"

18 months agoMerge tag 'for-linus-2024042501' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Thu, 25 Apr 2024 16:23:38 +0000 (09:23 -0700)]
Merge tag 'for-linus-2024042501' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid

Pull HID fixes from Benjamin Tissoires:

 - A couple of i2c-hid fixes (Kenny Levinsen & Nam Cao)

 - A config issue with mcp-2221 when CONFIG_IIO is not enabled
   (Abdelrahman Morsy)

 - A dev_err fix in intel-ish-hid (Zhang Lixu)

 - A couple of mouse fixes for both nintendo and Logitech-dj (Nuno
   Pereira and Yaraslau Furman)

 - I'm changing my main kernel email address as it's way simpler for me
   than the Red Hat one (Benjamin Tissoires)

* tag 'for-linus-2024042501' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid:
  HID: mcp-2221: cancel delayed_work only when CONFIG_IIO is enabled
  HID: logitech-dj: allow mice to use all types of reports
  HID: i2c-hid: Revert to await reset ACK before reading report descriptor
  HID: nintendo: Fix N64 controller being identified as mouse
  MAINTAINERS: update Benjamin's email address
  HID: intel-ish-hid: ipc: Fix dev_err usage with uninitialized dev->devc
  HID: i2c-hid: remove I2C_HID_READ_PENDING flag to prevent lock-up

18 months agoMerge tag 'nf-24-04-25' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Jakub Kicinski [Thu, 25 Apr 2024 15:46:53 +0000 (08:46 -0700)]
Merge tag 'nf-24-04-25' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf

Pablo Neira Ayuso says:

====================
Netfilter/IPVS fixes for net

The following patchset contains two Netfilter/IPVS fixes for net:

Patch #1 fixes SCTP checksumming for IPVS with gso packets,
 from Ismael Luceno.

Patch #2 honor dormant flag from netdev event path to fix a possible
 double hook unregistration.

* tag 'nf-24-04-25' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: nf_tables: honor table dormant flag from netdev release event path
  ipvs: Fix checksumming on GSO of SCTP packets
====================

Link: https://lore.kernel.org/r/20240425090149.1359547-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
18 months agoaf_unix: Suppress false-positive lockdep splat for spin_lock() in __unix_gc().
Kuniyuki Iwashima [Wed, 24 Apr 2024 17:04:43 +0000 (10:04 -0700)]
af_unix: Suppress false-positive lockdep splat for spin_lock() in __unix_gc().

syzbot reported a lockdep splat regarding unix_gc_lock and
unix_state_lock().

One is called from recvmsg() for a connected socket, and another
is called from GC for TCP_LISTEN socket.

So, the splat is false-positive.

Let's add a dedicated lock class for the latter to suppress the splat.

Note that this change is not necessary for net-next.git as the issue
is only applied to the old GC impl.

[0]:
WARNING: possible circular locking dependency detected
6.9.0-rc5-syzkaller-00007-g4d2008430ce8 #0 Not tainted
 -----------------------------------------------------
kworker/u8:1/11 is trying to acquire lock:
ffff88807cea4e70 (&u->lock){+.+.}-{2:2}, at: spin_lock include/linux/spinlock.h:351 [inline]
ffff88807cea4e70 (&u->lock){+.+.}-{2:2}, at: __unix_gc+0x40e/0xf70 net/unix/garbage.c:302

but task is already holding lock:
ffffffff8f6ab638 (unix_gc_lock){+.+.}-{2:2}, at: spin_lock include/linux/spinlock.h:351 [inline]
ffffffff8f6ab638 (unix_gc_lock){+.+.}-{2:2}, at: __unix_gc+0x117/0xf70 net/unix/garbage.c:261

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

 -> #1 (unix_gc_lock){+.+.}-{2:2}:
       lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5754
       __raw_spin_lock include/linux/spinlock_api_smp.h:133 [inline]
       _raw_spin_lock+0x2e/0x40 kernel/locking/spinlock.c:154
       spin_lock include/linux/spinlock.h:351 [inline]
       unix_notinflight+0x13d/0x390 net/unix/garbage.c:140
       unix_detach_fds net/unix/af_unix.c:1819 [inline]
       unix_destruct_scm+0x221/0x350 net/unix/af_unix.c:1876
       skb_release_head_state+0x100/0x250 net/core/skbuff.c:1188
       skb_release_all net/core/skbuff.c:1200 [inline]
       __kfree_skb net/core/skbuff.c:1216 [inline]
       kfree_skb_reason+0x16d/0x3b0 net/core/skbuff.c:1252
       kfree_skb include/linux/skbuff.h:1262 [inline]
       manage_oob net/unix/af_unix.c:2672 [inline]
       unix_stream_read_generic+0x1125/0x2700 net/unix/af_unix.c:2749
       unix_stream_splice_read+0x239/0x320 net/unix/af_unix.c:2981
       do_splice_read fs/splice.c:985 [inline]
       splice_file_to_pipe+0x299/0x500 fs/splice.c:1295
       do_splice+0xf2d/0x1880 fs/splice.c:1379
       __do_splice fs/splice.c:1436 [inline]
       __do_sys_splice fs/splice.c:1652 [inline]
       __se_sys_splice+0x331/0x4a0 fs/splice.c:1634
       do_syscall_x64 arch/x86/entry/common.c:52 [inline]
       do_syscall_64+0xf5/0x240 arch/x86/entry/common.c:83
       entry_SYSCALL_64_after_hwframe+0x77/0x7f

 -> #0 (&u->lock){+.+.}-{2:2}:
       check_prev_add kernel/locking/lockdep.c:3134 [inline]
       check_prevs_add kernel/locking/lockdep.c:3253 [inline]
       validate_chain+0x18cb/0x58e0 kernel/locking/lockdep.c:3869
       __lock_acquire+0x1346/0x1fd0 kernel/locking/lockdep.c:5137
       lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5754
       __raw_spin_lock include/linux/spinlock_api_smp.h:133 [inline]
       _raw_spin_lock+0x2e/0x40 kernel/locking/spinlock.c:154
       spin_lock include/linux/spinlock.h:351 [inline]
       __unix_gc+0x40e/0xf70 net/unix/garbage.c:302
       process_one_work kernel/workqueue.c:3254 [inline]
       process_scheduled_works+0xa10/0x17c0 kernel/workqueue.c:3335
       worker_thread+0x86d/0xd70 kernel/workqueue.c:3416
       kthread+0x2f0/0x390 kernel/kthread.c:388
       ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
       ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244

other info that might help us debug this:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(unix_gc_lock);
                               lock(&u->lock);
                               lock(unix_gc_lock);
  lock(&u->lock);

 *** DEADLOCK ***

3 locks held by kworker/u8:1/11:
 #0: ffff888015089148 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3229 [inline]
 #0: ffff888015089148 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_scheduled_works+0x8e0/0x17c0 kernel/workqueue.c:3335
 #1: ffffc90000107d00 (unix_gc_work){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3230 [inline]
 #1: ffffc90000107d00 (unix_gc_work){+.+.}-{0:0}, at: process_scheduled_works+0x91b/0x17c0 kernel/workqueue.c:3335
 #2: ffffffff8f6ab638 (unix_gc_lock){+.+.}-{2:2}, at: spin_lock include/linux/spinlock.h:351 [inline]
 #2: ffffffff8f6ab638 (unix_gc_lock){+.+.}-{2:2}, at: __unix_gc+0x117/0xf70 net/unix/garbage.c:261

stack backtrace:
CPU: 0 PID: 11 Comm: kworker/u8:1 Not tainted 6.9.0-rc5-syzkaller-00007-g4d2008430ce8 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024
Workqueue: events_unbound __unix_gc
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0x241/0x360 lib/dump_stack.c:114
 check_noncircular+0x36a/0x4a0 kernel/locking/lockdep.c:2187
 check_prev_add kernel/locking/lockdep.c:3134 [inline]
 check_prevs_add kernel/locking/lockdep.c:3253 [inline]
 validate_chain+0x18cb/0x58e0 kernel/locking/lockdep.c:3869
 __lock_acquire+0x1346/0x1fd0 kernel/locking/lockdep.c:5137
 lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5754
 __raw_spin_lock include/linux/spinlock_api_smp.h:133 [inline]
 _raw_spin_lock+0x2e/0x40 kernel/locking/spinlock.c:154
 spin_lock include/linux/spinlock.h:351 [inline]
 __unix_gc+0x40e/0xf70 net/unix/garbage.c:302
 process_one_work kernel/workqueue.c:3254 [inline]
 process_scheduled_works+0xa10/0x17c0 kernel/workqueue.c:3335
 worker_thread+0x86d/0xd70 kernel/workqueue.c:3416
 kthread+0x2f0/0x390 kernel/kthread.c:388
 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
 </TASK>

Fixes: 47d8ac011fe1 ("af_unix: Fix garbage collector racing against connect()")
Reported-and-tested-by: syzbot+fa379358c28cc87cc307@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=fa379358c28cc87cc307
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20240424170443.9832-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
18 months agonet: b44: set pause params only when interface is up
Peter Münster [Wed, 24 Apr 2024 13:51:52 +0000 (15:51 +0200)]
net: b44: set pause params only when interface is up

b44_free_rings() accesses b44::rx_buffers (and ::tx_buffers)
unconditionally, but b44::rx_buffers is only valid when the
device is up (they get allocated in b44_open(), and deallocated
again in b44_close()), any other time these are just a NULL pointers.

So if you try to change the pause params while the network interface
is disabled/administratively down, everything explodes (which likely
netifd tries to do).

Link: https://github.com/openwrt/openwrt/issues/13789
Fixes: 1da177e4c3f4 (Linux-2.6.12-rc2)
Cc: stable@vger.kernel.org
Reported-by: Peter Münster <pm@a16n.net>
Suggested-by: Jonas Gorski <jonas.gorski@gmail.com>
Signed-off-by: Vaclav Svoboda <svoboda@neng.cz>
Tested-by: Peter Münster <pm@a16n.net>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Peter Münster <pm@a16n.net>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/87y192oolj.fsf@a16n.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
18 months agotls: fix lockless read of strp->msg_ready in ->poll
Sabrina Dubroca [Wed, 24 Apr 2024 10:25:47 +0000 (12:25 +0200)]
tls: fix lockless read of strp->msg_ready in ->poll

tls_sk_poll is called without locking the socket, and needs to read
strp->msg_ready (via tls_strp_msg_ready). Convert msg_ready to a bool
and use READ_ONCE/WRITE_ONCE where needed. The remaining reads are
only performed when the socket is locked.

Fixes: 121dca784fc0 ("tls: suppress wakeups unless we have a full record")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://lore.kernel.org/r/0b7ee062319037cf86af6b317b3d72f7bfcd2e97.1713797701.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
18 months agodpll: fix dpll_pin_on_pin_register() for multiple parent pins
Arkadiusz Kubalewski [Wed, 24 Apr 2024 10:16:36 +0000 (12:16 +0200)]
dpll: fix dpll_pin_on_pin_register() for multiple parent pins

In scenario where pin is registered with multiple parent pins via
dpll_pin_on_pin_register(..), all belonging to the same dpll device.
A second call to dpll_pin_on_pin_unregister(..) would cause a call trace,
as it tries to use already released registration resources (due to fix
introduced in b446631f355e). In this scenario pin was registered twice,
so resources are not yet expected to be release until each registered
pin/pin pair is unregistered.

Currently, the following crash/call trace is produced when ice driver is
removed on the system with installed E810T NIC which includes dpll device:

WARNING: CPU: 51 PID: 9155 at drivers/dpll/dpll_core.c:809 dpll_pin_ops+0x20/0x30
RIP: 0010:dpll_pin_ops+0x20/0x30
Call Trace:
 ? __warn+0x7f/0x130
 ? dpll_pin_ops+0x20/0x30
 dpll_msg_add_pin_freq+0x37/0x1d0
 dpll_cmd_pin_get_one+0x1c0/0x400
 ? __nlmsg_put+0x63/0x80
 dpll_pin_event_send+0x93/0x140
 dpll_pin_on_pin_unregister+0x3f/0x100
 ice_dpll_deinit_pins+0xa1/0x230 [ice]
 ice_remove+0xf1/0x210 [ice]

Fix by adding a parent pointer as a cookie when creating a registration,
also when searching for it. For the regular pins pass NULL, this allows to
create separated registration for each parent the pin is registered with.

Fixes: b446631f355e ("dpll: fix dpll_xa_ref_*_del() for multiple registrations")
Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20240424101636.1491424-1-arkadiusz.kubalewski@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
18 months agonet: ravb: Fix registered interrupt names
Geert Uytterhoeven [Wed, 24 Apr 2024 07:45:21 +0000 (09:45 +0200)]
net: ravb: Fix registered interrupt names

As interrupts are now requested from ravb_probe(), before calling
register_netdev(), ndev->name still contains the template "eth%d",
leading to funny names in /proc/interrupts.  E.g. on R-Car E3:

89:  0      0  GICv2  93 Level  eth%d:ch22:multi
90:  0      3  GICv2  95 Level  eth%d:ch24:emac
91:  0  23484  GICv2  71 Level  eth%d:ch0:rx_be
92:  0      0  GICv2  72 Level  eth%d:ch1:rx_nc
93:  0  13735  GICv2  89 Level  eth%d:ch18:tx_be
94:  0      0  GICv2  90 Level  eth%d:ch19:tx_nc

Worse, on platforms with multiple RAVB instances (e.g. R-Car V4H), all
interrupts have similar names.

Fix this by using the device name instead, like is done in several other
drivers:

89:  0      0  GICv2  93 Level  e6800000.ethernet:ch22:multi
90:  0      1  GICv2  95 Level  e6800000.ethernet:ch24:emac
91:  0  28578  GICv2  71 Level  e6800000.ethernet:ch0:rx_be
92:  0      0  GICv2  72 Level  e6800000.ethernet:ch1:rx_nc
93:  0  14044  GICv2  89 Level  e6800000.ethernet:ch18:tx_be
94:  0      0  GICv2  90 Level  e6800000.ethernet:ch19:tx_nc

Rename the local variable dev_name, as it shadows the dev_name()
function, and pre-initialize it, to simplify the code.

Fixes: 32f012b8c01ca9fd ("net: ravb: Move getting/requesting IRQs in the probe() method")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Reviewed-by: Claudiu Beznea <claudiu.beznea.uj@bp.renesas.com>
Tested-by: Claudiu Beznea <claudiu.beznea.uj@bp.renesas.com> # on RZ/G3S
Link: https://lore.kernel.org/r/cde67b68adf115b3cf0b44c32334ae00b2fbb321.1713944647.git.geert+renesas@glider.be
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
18 months agoocteontx2-af: fix the double free in rvu_npc_freemem()
Su Hui [Wed, 24 Apr 2024 02:27:25 +0000 (10:27 +0800)]
octeontx2-af: fix the double free in rvu_npc_freemem()

Clang static checker(scan-build) warning:
drivers/net/ethernet/marvell/octeontx2/af/rvu_npc.c:line 2184, column 2
Attempt to free released memory.

npc_mcam_rsrcs_deinit() has released 'mcam->counters.bmap'. Deleted this
redundant kfree() to fix this double free problem.

Fixes: dd7842878633 ("octeontx2-af: Add new devlink param to configure maximum usable NIX block LFs")
Signed-off-by: Su Hui <suhui@nfschina.com>
Reviewed-by: Geetha sowjanya <gakula@marvell.com>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Reviewed-by: Hariprasad Kelam <hkelam@marvell.com>
Link: https://lore.kernel.org/r/20240424022724.144587-1-suhui@nfschina.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
18 months agonet: ethernet: ti: am65-cpts: Fix PTPv1 message type on TX packets
Jason Reeder [Wed, 24 Apr 2024 07:16:26 +0000 (12:46 +0530)]
net: ethernet: ti: am65-cpts: Fix PTPv1 message type on TX packets

The CPTS, by design, captures the messageType (Sync, Delay_Req, etc.)
field from the second nibble of the PTP header which is defined in the
PTPv2 (1588-2008) specification. In the PTPv1 (1588-2002) specification
the first two bytes of the PTP header are defined as the versionType
which is always 0x0001. This means that any PTPv1 packets that are
tagged for TX timestamping by the CPTS will have their messageType set
to 0x0 which corresponds to a Sync message type. This causes issues
when a PTPv1 stack is expecting a Delay_Req (messageType: 0x1)
timestamp that never appears.

Fix this by checking if the ptp_class of the timestamped TX packet is
PTP_CLASS_V1 and then matching the PTP sequence ID to the stored
sequence ID in the skb->cb data structure. If the sequence IDs match
and the packet is of type PTPv1 then there is a chance that the
messageType has been incorrectly stored by the CPTS so overwrite the
messageType stored by the CPTS with the messageType from the skb->cb
data structure. This allows the PTPv1 stack to receive TX timestamps
for Delay_Req packets which are necessary to lock onto a PTP Leader.

Signed-off-by: Jason Reeder <jreeder@ti.com>
Signed-off-by: Ravi Gunasekaran <r-gunasekaran@ti.com>
Tested-by: Ed Trexel <ed.trexel@hp.com>
Fixes: f6bd59526ca5 ("net: ethernet: ti: introduce am654 common platform time sync driver")
Link: https://lore.kernel.org/r/20240424071626.32558-1-r-gunasekaran@ti.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
18 months agoMerge branch 'intel-wired-lan-driver-updates-2024-04-23-i40e-iavf-ice'
Jakub Kicinski [Thu, 25 Apr 2024 15:04:28 +0000 (08:04 -0700)]
Merge branch 'intel-wired-lan-driver-updates-2024-04-23-i40e-iavf-ice'

Tony Nguyen says:

====================
Intel Wired LAN Driver Updates 2024-04-23 (i40e, iavf, ice)

This series contains updates to i40e, iavf, and ice drivers.

Sindhu removes WQ_MEM_RECLAIM flag from workqueue for i40e.

Erwan Velu adjusts message to avoid confusion on base being reported on
i40e.

Sudheer corrects insufficient check for TC equality on iavf.

Jake corrects ordering of locks to avoid possible deadlock on ice.
====================

Link: https://lore.kernel.org/r/20240423182723.740401-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
18 months agoMerge branch 'fix-isolation-of-broadcast-traffic-and-unmatched-unicast-traffic-with...
Jakub Kicinski [Thu, 25 Apr 2024 14:59:04 +0000 (07:59 -0700)]
Merge branch 'fix-isolation-of-broadcast-traffic-and-unmatched-unicast-traffic-with-macsec-offload'

Rahul Rameshbabu says:

====================
Fix isolation of broadcast traffic and unmatched unicast traffic with MACsec offload

Some device drivers support devices that enable them to annotate whether a
Rx skb refers to a packet that was processed by the MACsec offloading
functionality of the device. Logic in the Rx handling for MACsec offload
does not utilize this information to preemptively avoid forwarding to the
macsec netdev currently. Because of this, things like multicast messages or
unicast messages with an unmatched destination address such as ARP requests
are forwarded to the macsec netdev whether the message received was MACsec
encrypted or not. The goal of this patch series is to improve the Rx
handling for MACsec offload for devices capable of annotating skbs received
that were decrypted by the NIC offload for MACsec.

Here is a summary of the issue that occurs with the existing logic today.

    * The current design of the MACsec offload handling path tries to use
      "best guess" mechanisms for determining whether a packet associated
      with the currently handled skb in the datapath was processed via HW
      offload
    * The best guess mechanism uses the following heuristic logic (in order of
      precedence)
      - Check if header destination MAC address matches MACsec netdev MAC
        address -> forward to MACsec port
      - Check if packet is multicast traffic -> forward to MACsec port
      - MACsec security channel was able to be looked up from skb offload
        context (mlx5 only) -> forward to MACsec port
    * Problem: plaintext traffic can potentially solicit a MACsec encrypted
      response from the offload device
      - Core aspect of MACsec is that it identifies unauthorized LAN connections
        and excludes them from communication
        + This behavior can be seen when not enabling offload for MACsec
      - The offload behavior violates this principle in MACsec

I believe this behavior is a security bug since applications utilizing
MACsec could be exploited using this behavior, and the correct way to
resolve this is by having the hardware correctly indicate whether MACsec
offload occurred for the packet or not. In the patches in this series, I
leave a warning for when the problematic path occurs because I cannot
figure out a secure way to fix the security issue that applies to the core
MACsec offload handling in the Rx path without breaking MACsec offload for
other vendors.

Shown at the bottom is an example use case where plaintext traffic sent to
a physical port of a NIC configured for MACsec offload is unable to be
handled correctly by the software stack when the NIC provides awareness to
the kernel about whether the received packet is MACsec traffic or not. In
this specific example, plaintext ARP requests are being responded with
MACsec encrypted ARP replies (which leads to routing information being
unable to be built for the requester).

    Side 1

      ip link del macsec0
      ip address flush mlx5_1
      ip address add 1.1.1.1/24 dev mlx5_1
      ip link set dev mlx5_1 up
      ip link add link mlx5_1 macsec0 type macsec sci 1 encrypt on
      ip link set dev macsec0 address 00:11:22:33:44:66
      ip macsec offload macsec0 mac
      ip macsec add macsec0 tx sa 0 pn 1 on key 00 dffafc8d7b9a43d5b9a3dfbbf6a30c16
      ip macsec add macsec0 rx sci 2 on
      ip macsec add macsec0 rx sci 2 sa 0 pn 1 on key 00 ead3664f508eb06c40ac7104cdae4ce5
      ip address flush macsec0
      ip address add 2.2.2.1/24 dev macsec0
      ip link set dev macsec0 up

      # macsec0 enters promiscuous mode.
      # This enables all traffic received on macsec_vlan to be processed by
      # the macsec offload rx datapath. This however means that traffic
      # meant to be received by mlx5_1 will be incorrectly steered to
      # macsec0 as well.

      ip link add link macsec0 name macsec_vlan type vlan id 1
      ip link set dev macsec_vlan address 00:11:22:33:44:88
      ip address flush macsec_vlan
      ip address add 3.3.3.1/24 dev macsec_vlan
      ip link set dev macsec_vlan up

    Side 2

      ip link del macsec0
      ip address flush mlx5_1
      ip address add 1.1.1.2/24 dev mlx5_1
      ip link set dev mlx5_1 up
      ip link add link mlx5_1 macsec0 type macsec sci 2 encrypt on
      ip link set dev macsec0 address 00:11:22:33:44:77
      ip macsec offload macsec0 mac
      ip macsec add macsec0 tx sa 0 pn 1 on key 00 ead3664f508eb06c40ac7104cdae4ce5
      ip macsec add macsec0 rx sci 1 on
      ip macsec add macsec0 rx sci 1 sa 0 pn 1 on key 00 dffafc8d7b9a43d5b9a3dfbbf6a30c16
      ip address flush macsec0
      ip address add 2.2.2.2/24 dev macsec0
      ip link set dev macsec0 up

      # macsec0 enters promiscuous mode.
      # This enables all traffic received on macsec_vlan to be processed by
      # the macsec offload rx datapath. This however means that traffic
      # meant to be received by mlx5_1 will be incorrectly steered to
      # macsec0 as well.

      ip link add link macsec0 name macsec_vlan type vlan id 1
      ip link set dev macsec_vlan address 00:11:22:33:44:99
      ip address flush macsec_vlan
      ip address add 3.3.3.2/24 dev macsec_vlan
      ip link set dev macsec_vlan up

    Side 1

      ping -I mlx5_1 1.1.1.2
      PING 1.1.1.2 (1.1.1.2) from 1.1.1.1 mlx5_1: 56(84) bytes of data.
      From 1.1.1.1 icmp_seq=1 Destination Host Unreachable
      ping: sendmsg: No route to host
      From 1.1.1.1 icmp_seq=2 Destination Host Unreachable
      From 1.1.1.1 icmp_seq=3 Destination Host Unreachable

Changes:

  v2->v3:
    * Made dev paramater const for eth_skb_pkt_type helper as suggested by Sabrina
      Dubroca <sd@queasysnail.net>
  v1->v2:
    * Fixed series subject to detail the issue being fixed
    * Removed strange characters from cover letter
    * Added comment in example that illustrates the impact involving
      promiscuous mode
    * Added patch for generalizing packet type detection
    * Added Fixes: tags and targeting net
    * Removed pointless warning in the heuristic Rx path for macsec offload
    * Applied small refactor in Rx path offload to minimize scope of rx_sc
      local variable

Link: https://github.com/Binary-Eater/macsec-rx-offload/blob/trunk/MACsec_violation_in_core_stack_offload_rx_handling.pdf
Link: https://lore.kernel.org/netdev/20240419213033.400467-5-rrameshbabu@nvidia.com/
Link: https://lore.kernel.org/netdev/20240419011740.333714-1-rrameshbabu@nvidia.com/
Link: https://lore.kernel.org/netdev/87r0l25y1c.fsf@nvidia.com/
Link: https://lore.kernel.org/netdev/20231116182900.46052-1-rrameshbabu@nvidia.com/
====================

Link: https://lore.kernel.org/r/20240423181319.115860-1-rrameshbabu@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
18 months agoice: fix LAG and VF lock dependency in ice_reset_vf()
Jacob Keller [Tue, 23 Apr 2024 18:27:20 +0000 (11:27 -0700)]
ice: fix LAG and VF lock dependency in ice_reset_vf()

9f74a3dfcf83 ("ice: Fix VF Reset paths when interface in a failed over
aggregate"), the ice driver has acquired the LAG mutex in ice_reset_vf().
The commit placed this lock acquisition just prior to the acquisition of
the VF configuration lock.

If ice_reset_vf() acquires the configuration lock via the ICE_VF_RESET_LOCK
flag, this could deadlock with ice_vc_cfg_qs_msg() because it always
acquires the locks in the order of the VF configuration lock and then the
LAG mutex.

Lockdep reports this violation almost immediately on creating and then
removing 2 VF:

======================================================
WARNING: possible circular locking dependency detected
6.8.0-rc6 #54 Tainted: G        W  O
------------------------------------------------------
kworker/60:3/6771 is trying to acquire lock:
ff40d43e099380a0 (&vf->cfg_lock){+.+.}-{3:3}, at: ice_reset_vf+0x22f/0x4d0 [ice]

but task is already holding lock:
ff40d43ea1961210 (&pf->lag_mutex){+.+.}-{3:3}, at: ice_reset_vf+0xb7/0x4d0 [ice]

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (&pf->lag_mutex){+.+.}-{3:3}:
       __lock_acquire+0x4f8/0xb40
       lock_acquire+0xd4/0x2d0
       __mutex_lock+0x9b/0xbf0
       ice_vc_cfg_qs_msg+0x45/0x690 [ice]
       ice_vc_process_vf_msg+0x4f5/0x870 [ice]
       __ice_clean_ctrlq+0x2b5/0x600 [ice]
       ice_service_task+0x2c9/0x480 [ice]
       process_one_work+0x1e9/0x4d0
       worker_thread+0x1e1/0x3d0
       kthread+0x104/0x140
       ret_from_fork+0x31/0x50
       ret_from_fork_asm+0x1b/0x30

-> #0 (&vf->cfg_lock){+.+.}-{3:3}:
       check_prev_add+0xe2/0xc50
       validate_chain+0x558/0x800
       __lock_acquire+0x4f8/0xb40
       lock_acquire+0xd4/0x2d0
       __mutex_lock+0x9b/0xbf0
       ice_reset_vf+0x22f/0x4d0 [ice]
       ice_process_vflr_event+0x98/0xd0 [ice]
       ice_service_task+0x1cc/0x480 [ice]
       process_one_work+0x1e9/0x4d0
       worker_thread+0x1e1/0x3d0
       kthread+0x104/0x140
       ret_from_fork+0x31/0x50
       ret_from_fork_asm+0x1b/0x30

other info that might help us debug this:
 Possible unsafe locking scenario:
       CPU0                    CPU1
       ----                    ----
  lock(&pf->lag_mutex);
                               lock(&vf->cfg_lock);
                               lock(&pf->lag_mutex);
  lock(&vf->cfg_lock);

 *** DEADLOCK ***
4 locks held by kworker/60:3/6771:
 #0: ff40d43e05428b38 ((wq_completion)ice){+.+.}-{0:0}, at: process_one_work+0x176/0x4d0
 #1: ff50d06e05197e58 ((work_completion)(&pf->serv_task)){+.+.}-{0:0}, at: process_one_work+0x176/0x4d0
 #2: ff40d43ea1960e50 (&pf->vfs.table_lock){+.+.}-{3:3}, at: ice_process_vflr_event+0x48/0xd0 [ice]
 #3: ff40d43ea1961210 (&pf->lag_mutex){+.+.}-{3:3}, at: ice_reset_vf+0xb7/0x4d0 [ice]

stack backtrace:
CPU: 60 PID: 6771 Comm: kworker/60:3 Tainted: G        W  O       6.8.0-rc6 #54
Hardware name:
Workqueue: ice ice_service_task [ice]
Call Trace:
 <TASK>
 dump_stack_lvl+0x4a/0x80
 check_noncircular+0x12d/0x150
 check_prev_add+0xe2/0xc50
 ? save_trace+0x59/0x230
 ? add_chain_cache+0x109/0x450
 validate_chain+0x558/0x800
 __lock_acquire+0x4f8/0xb40
 ? lockdep_hardirqs_on+0x7d/0x100
 lock_acquire+0xd4/0x2d0
 ? ice_reset_vf+0x22f/0x4d0 [ice]
 ? lock_is_held_type+0xc7/0x120
 __mutex_lock+0x9b/0xbf0
 ? ice_reset_vf+0x22f/0x4d0 [ice]
 ? ice_reset_vf+0x22f/0x4d0 [ice]
 ? rcu_is_watching+0x11/0x50
 ? ice_reset_vf+0x22f/0x4d0 [ice]
 ice_reset_vf+0x22f/0x4d0 [ice]
 ? process_one_work+0x176/0x4d0
 ice_process_vflr_event+0x98/0xd0 [ice]
 ice_service_task+0x1cc/0x480 [ice]
 process_one_work+0x1e9/0x4d0
 worker_thread+0x1e1/0x3d0
 ? __pfx_worker_thread+0x10/0x10
 kthread+0x104/0x140
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x31/0x50
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1b/0x30
 </TASK>

To avoid deadlock, we must acquire the LAG mutex only after acquiring the
VF configuration lock. Fix the ice_reset_vf() to acquire the LAG mutex only
after we either acquire or check that the VF configuration lock is held.

Fixes: 9f74a3dfcf83 ("ice: Fix VF Reset paths when interface in a failed over aggregate")
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Dave Ertman <david.m.ertman@intel.com>
Reviewed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Tested-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://lore.kernel.org/r/20240423182723.740401-5-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
18 months agoiavf: Fix TC config comparison with existing adapter TC config
Sudheer Mogilappagari [Tue, 23 Apr 2024 18:27:19 +0000 (11:27 -0700)]
iavf: Fix TC config comparison with existing adapter TC config

Same number of TCs doesn't imply that underlying TC configs are
same. The config could be different due to difference in number
of queues in each TC. Add utility function to determine if TC
configs are same.

Fixes: d5b33d024496 ("i40evf: add ndo_setup_tc callback to i40evf")
Signed-off-by: Sudheer Mogilappagari <sudheer.mogilappagari@intel.com>
Tested-by: Mineri Bhange <minerix.bhange@intel.com> (A Contingent Worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://lore.kernel.org/r/20240423182723.740401-4-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
18 months agoi40e: Report MFS in decimal base instead of hex
Erwan Velu [Tue, 23 Apr 2024 18:27:18 +0000 (11:27 -0700)]
i40e: Report MFS in decimal base instead of hex

If the MFS is set below the default (0x2600), a warning message is
reported like the following :

MFS for port 1 has been set below the default: 600

This message is a bit confusing as the number shown here (600) is in
fact an hexa number: 0x600 = 1536

Without any explicit "0x" prefix, this message is read like the MFS is
set to 600 bytes.

MFS, as per MTUs, are usually expressed in decimal base.

This commit reports both current and default MFS values in decimal
so it's less confusing for end-users.

A typical warning message looks like the following :

MFS for port 1 (1536) has been set below the default (9728)

Signed-off-by: Erwan Velu <e.velu@criteo.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Tony Brelinski <tony.brelinski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Fixes: 3a2c6ced90e1 ("i40e: Add a check to see if MFS is set")
Link: https://lore.kernel.org/r/20240423182723.740401-3-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
18 months agoi40e: Do not use WQ_MEM_RECLAIM flag for workqueue
Sindhu Devale [Tue, 23 Apr 2024 18:27:17 +0000 (11:27 -0700)]
i40e: Do not use WQ_MEM_RECLAIM flag for workqueue

Issue reported by customer during SRIOV testing, call trace:
When both i40e and the i40iw driver are loaded, a warning
in check_flush_dependency is being triggered. This seems
to be because of the i40e driver workqueue is allocated with
the WQ_MEM_RECLAIM flag, and the i40iw one is not.

Similar error was encountered on ice too and it was fixed by
removing the flag. Do the same for i40e too.

[Feb 9 09:08] ------------[ cut here ]------------
[  +0.000004] workqueue: WQ_MEM_RECLAIM i40e:i40e_service_task [i40e] is
flushing !WQ_MEM_RECLAIM infiniband:0x0
[  +0.000060] WARNING: CPU: 0 PID: 937 at kernel/workqueue.c:2966
check_flush_dependency+0x10b/0x120
[  +0.000007] Modules linked in: snd_seq_dummy snd_hrtimer snd_seq
snd_timer snd_seq_device snd soundcore nls_utf8 cifs cifs_arc4
nls_ucs2_utils rdma_cm iw_cm ib_cm cifs_md4 dns_resolver netfs qrtr
rfkill sunrpc vfat fat intel_rapl_msr intel_rapl_common irdma
intel_uncore_frequency intel_uncore_frequency_common ice ipmi_ssif
isst_if_common skx_edac nfit libnvdimm x86_pkg_temp_thermal
intel_powerclamp gnss coretemp ib_uverbs rapl intel_cstate ib_core
iTCO_wdt iTCO_vendor_support acpi_ipmi mei_me ipmi_si intel_uncore
ioatdma i2c_i801 joydev pcspkr mei ipmi_devintf lpc_ich
intel_pch_thermal i2c_smbus ipmi_msghandler acpi_power_meter acpi_pad
xfs libcrc32c ast sd_mod drm_shmem_helper t10_pi drm_kms_helper sg ixgbe
drm i40e ahci crct10dif_pclmul libahci crc32_pclmul igb crc32c_intel
libata ghash_clmulni_intel i2c_algo_bit mdio dca wmi dm_mirror
dm_region_hash dm_log dm_mod fuse
[  +0.000050] CPU: 0 PID: 937 Comm: kworker/0:3 Kdump: loaded Not
tainted 6.8.0-rc2-Feb-net_dev-Qiueue-00279-gbd43c5687e05 #1
[  +0.000003] Hardware name: Intel Corporation S2600BPB/S2600BPB, BIOS
SE5C620.86B.02.01.0013.121520200651 12/15/2020
[  +0.000001] Workqueue: i40e i40e_service_task [i40e]
[  +0.000024] RIP: 0010:check_flush_dependency+0x10b/0x120
[  +0.000003] Code: ff 49 8b 54 24 18 48 8d 8b b0 00 00 00 49 89 e8 48
81 c6 b0 00 00 00 48 c7 c7 b0 97 fa 9f c6 05 8a cc 1f 02 01 e8 35 b3 fd
ff <0f> 0b e9 10 ff ff ff 80 3d 78 cc 1f 02 00 75 94 e9 46 ff ff ff 90
[  +0.000002] RSP: 0018:ffffbd294976bcf8 EFLAGS: 00010282
[  +0.000002] RAX: 0000000000000000 RBX: ffff94d4c483c000 RCX:
0000000000000027
[  +0.000001] RDX: ffff94d47f620bc8 RSI: 0000000000000001 RDI:
ffff94d47f620bc0
[  +0.000001] RBP: 0000000000000000 R08: 0000000000000000 R09:
00000000ffff7fff
[  +0.000001] R10: ffffbd294976bb98 R11: ffffffffa0be65e8 R12:
ffff94c5451ea180
[  +0.000001] R13: ffff94c5ab5e8000 R14: ffff94c5c20b6e05 R15:
ffff94c5f1330ab0
[  +0.000001] FS:  0000000000000000(0000) GS:ffff94d47f600000(0000)
knlGS:0000000000000000
[  +0.000002] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000001] CR2: 00007f9e6f1fca70 CR3: 0000000038e20004 CR4:
00000000007706f0
[  +0.000000] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[  +0.000001] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[  +0.000001] PKRU: 55555554
[  +0.000001] Call Trace:
[  +0.000001]  <TASK>
[  +0.000002]  ? __warn+0x80/0x130
[  +0.000003]  ? check_flush_dependency+0x10b/0x120
[  +0.000002]  ? report_bug+0x195/0x1a0
[  +0.000005]  ? handle_bug+0x3c/0x70
[  +0.000003]  ? exc_invalid_op+0x14/0x70
[  +0.000002]  ? asm_exc_invalid_op+0x16/0x20
[  +0.000006]  ? check_flush_dependency+0x10b/0x120
[  +0.000002]  ? check_flush_dependency+0x10b/0x120
[  +0.000002]  __flush_workqueue+0x126/0x3f0
[  +0.000015]  ib_cache_cleanup_one+0x1c/0xe0 [ib_core]
[  +0.000056]  __ib_unregister_device+0x6a/0xb0 [ib_core]
[  +0.000023]  ib_unregister_device_and_put+0x34/0x50 [ib_core]
[  +0.000020]  i40iw_close+0x4b/0x90 [irdma]
[  +0.000022]  i40e_notify_client_of_netdev_close+0x54/0xc0 [i40e]
[  +0.000035]  i40e_service_task+0x126/0x190 [i40e]
[  +0.000024]  process_one_work+0x174/0x340
[  +0.000003]  worker_thread+0x27e/0x390
[  +0.000001]  ? __pfx_worker_thread+0x10/0x10
[  +0.000002]  kthread+0xdf/0x110
[  +0.000002]  ? __pfx_kthread+0x10/0x10
[  +0.000002]  ret_from_fork+0x2d/0x50
[  +0.000003]  ? __pfx_kthread+0x10/0x10
[  +0.000001]  ret_from_fork_asm+0x1b/0x30
[  +0.000004]  </TASK>
[  +0.000001] ---[ end trace 0000000000000000 ]---

Fixes: 4d5957cbdecd ("i40e: remove WQ_UNBOUND and the task limit of our workqueue")
Signed-off-by: Sindhu Devale <sindhu.devale@intel.com>
Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Reviewed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Robert Ganzynkowicz <robert.ganzynkowicz@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://lore.kernel.org/r/20240423182723.740401-2-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
18 months agonet: ti: icssg-prueth: Fix signedness bug in prueth_init_rx_chns()
Dan Carpenter [Tue, 23 Apr 2024 16:15:22 +0000 (19:15 +0300)]
net: ti: icssg-prueth: Fix signedness bug in prueth_init_rx_chns()

The rx_chn->irq[] array is unsigned int but it should be signed for the
error handling to work.  Also if k3_udma_glue_rx_get_irq() returns zero
then we should return -ENXIO instead of success.

Fixes: 128d5874c082 ("net: ti: icssg-prueth: Add ICSSG ethernet driver")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Roger Quadros <rogerq@kernel.org>
Reviewed-by: MD Danish Anwar <danishanwar@ti.com>
Link: https://lore.kernel.org/r/05282415-e7f4-42f3-99f8-32fde8f30936@moroto.mountain
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
18 months agonet/mlx5e: Advertise mlx5 ethernet driver updates sk_buff md_dst for MACsec
Rahul Rameshbabu [Tue, 23 Apr 2024 18:13:05 +0000 (11:13 -0700)]
net/mlx5e: Advertise mlx5 ethernet driver updates sk_buff md_dst for MACsec

mlx5 Rx flow steering and CQE handling enable the driver to be able to
update an skb's md_dst attribute as MACsec when MACsec traffic arrives when
a device is configured for offloading. Advertise this to the core stack to
take advantage of this capability.

Cc: stable@vger.kernel.org
Fixes: b7c9400cbc48 ("net/mlx5e: Implement MACsec Rx data path using MACsec skb_metadata_dst")
Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Reviewed-by: Benjamin Poirier <bpoirier@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://lore.kernel.org/r/20240423181319.115860-5-rrameshbabu@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
18 months agomacsec: Detect if Rx skb is macsec-related for offloading devices that update md_dst
Rahul Rameshbabu [Tue, 23 Apr 2024 18:13:04 +0000 (11:13 -0700)]
macsec: Detect if Rx skb is macsec-related for offloading devices that update md_dst

Can now correctly identify where the packets should be delivered by using
md_dst or its absence on devices that provide it.

This detection is not possible without device drivers that update md_dst. A
fallback pattern should be used for supporting such device drivers. This
fallback mode causes multicast messages to be cloned to both the non-macsec
and macsec ports, independent of whether the multicast message received was
encrypted over MACsec or not. Other non-macsec traffic may also fail to be
handled correctly for devices in promiscuous mode.

Link: https://lore.kernel.org/netdev/ZULRxX9eIbFiVi7v@hog/
Cc: Sabrina Dubroca <sd@queasysnail.net>
Cc: stable@vger.kernel.org
Fixes: 860ead89b851 ("net/macsec: Add MACsec skb_metadata_dst Rx Data path support")
Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Reviewed-by: Benjamin Poirier <bpoirier@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://lore.kernel.org/r/20240423181319.115860-4-rrameshbabu@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
18 months agoethernet: Add helper for assigning packet type when dest address does not match devic...
Rahul Rameshbabu [Tue, 23 Apr 2024 18:13:03 +0000 (11:13 -0700)]
ethernet: Add helper for assigning packet type when dest address does not match device address

Enable reuse of logic in eth_type_trans for determining packet type.

Suggested-by: Sabrina Dubroca <sd@queasysnail.net>
Cc: stable@vger.kernel.org
Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://lore.kernel.org/r/20240423181319.115860-3-rrameshbabu@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
18 months agomacsec: Enable devices to advertise whether they update sk_buff md_dst during offloads
Rahul Rameshbabu [Tue, 23 Apr 2024 18:13:02 +0000 (11:13 -0700)]
macsec: Enable devices to advertise whether they update sk_buff md_dst during offloads

Cannot know whether a Rx skb missing md_dst is intended for MACsec or not
without knowing whether the device is able to update this field during an
offload. Assume that an offload to a MACsec device cannot support updating
md_dst by default. Capable devices can advertise that they do indicate that
an skb is related to a MACsec offloaded packet using the md_dst.

Cc: Sabrina Dubroca <sd@queasysnail.net>
Cc: stable@vger.kernel.org
Fixes: 860ead89b851 ("net/macsec: Add MACsec skb_metadata_dst Rx Data path support")
Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Reviewed-by: Benjamin Poirier <bpoirier@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://lore.kernel.org/r/20240423181319.115860-2-rrameshbabu@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
18 months agotools: testing: selftests: prefer TEST_PROGS for conntrack_dump_flush
Florian Westphal [Wed, 24 Apr 2024 09:58:20 +0000 (11:58 +0200)]
tools: testing: selftests: prefer TEST_PROGS for conntrack_dump_flush

Currently conntrack_dump_flush test program always runs when passing
TEST_PROGS argument:

% make -C tools/testing/selftests TARGETS=net/netfilter \
 TEST_PROGS=conntrack_ipip_mtu.sh run_tests
make: Entering [..]
TAP version 13
1..2 [..]
  selftests: net/netfilter: conntrack_dump_flush [..]

Move away from TEST_CUSTOM_PROGS to avoid this.  After this,
above command will only run the program specified in TEST_PROGS.

Link: https://lore.kernel.org/netdev/20240423191609.70c14c42@kernel.org/
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20240424095824.5555-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
18 months agonet: bridge: remove redundant check of f->dst
linke li [Tue, 23 Apr 2024 10:53:26 +0000 (18:53 +0800)]
net: bridge: remove redundant check of f->dst

In br_fill_forward_path(), f->dst is checked not to be NULL, then
immediately read using READ_ONCE and checked again. The first check is
useless, so this patch aims to remove the redundant check of f->dst.

Signed-off-by: linke li <lilinke99@qq.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
18 months agoMerge tag 'wireless-2024-04-23' of git://git.kernel.org/pub/scm/linux/kernel/git...
David S. Miller [Thu, 25 Apr 2024 11:18:37 +0000 (12:18 +0100)]
Merge tag 'wireless-2024-04-23' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless

Johannes berg says:

====================
Fixes for the current cycle:
 * ath11k: convert to correct RCU iteration of IPv6 addresses
 * iwlwifi: link ID, FW API version, scanning and PASN fixes
 * cfg80211: NULL-deref and tracing fixes
 * mac80211: connection mode, mesh fast-TX, multi-link and
             various other small fixes
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
18 months agonet: phy: dp83869: Fix MII mode failure
MD Danish Anwar [Tue, 23 Apr 2024 08:48:28 +0000 (14:18 +0530)]
net: phy: dp83869: Fix MII mode failure

The DP83869 driver sets the MII bit (needed for PHY to work in MII mode)
only if the op-mode is either DP83869_100M_MEDIA_CONVERT or
DP83869_RGMII_100_BASE.

Some drivers i.e. ICSSG support MII mode with op-mode as
DP83869_RGMII_COPPER_ETHERNET for which the MII bit is not set in dp83869
driver. As a result MII mode on ICSSG doesn't work and below log is seen.

TI DP83869 300b2400.mdio:0f: selected op-mode is not valid with MII mode
icssg-prueth icssg1-eth: couldn't connect to phy ethernet-phy@0
icssg-prueth icssg1-eth: can't phy connect port MII0

Fix this by setting MII bit for DP83869_RGMII_COPPER_ETHERNET op-mode as
well.

Fixes: 94e86ef1b801 ("net: phy: dp83869: support mii mode when rgmii strap cfg is used")
Signed-off-by: MD Danish Anwar <danishanwar@ti.com>
Reviewed-by: Ravi Gunasekaran <r-gunasekaran@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>