Nick Child [Wed, 9 Aug 2023 22:10:34 +0000 (17:10 -0500)]
ibmvnic: Enforce stronger sanity checks on login response
Ensure that all offsets in a login response buffer are within the size
of the allocated response buffer. Any offsets or lengths that surpass
the allocation are likely the result of an incomplete response buffer.
In these cases, a full reset is necessary.
When attempting to login, the ibmvnic device will allocate a response
buffer and pass a reference to the VIOS. The VIOS will then send the
ibmvnic device a LOGIN_RSP CRQ to signal that the buffer has been filled
with data. If the ibmvnic device does not get a response in 20 seconds,
the old buffer is freed and a new login request is sent. With 2
outstanding requests, any LOGIN_RSP CRQ's could be for the older
login request. If this is the case then the login response buffer (which
is for the newer login request) could be incomplete and contain invalid
data. Therefore, we must enforce strict sanity checks on the response
buffer values.
Testing has shown that the `off_rxadd_buff_size` value is filled in last
by the VIOS and will be the smoking gun for these circumstances.
Until VIOS can implement a mechanism for tracking outstanding response
buffers and a method for mapping a LOGIN_RSP CRQ to a particular login
response buffer, the best ibmvnic can do in this situation is perform a
full reset.
Fixes: dff515a3e71d ("ibmvnic: Harden device login requests") Signed-off-by: Nick Child <nnac123@linux.ibm.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20230809221038.51296-1-nnac123@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Souradeep Chakrabarti [Wed, 9 Aug 2023 10:22:05 +0000 (03:22 -0700)]
net: mana: Fix MANA VF unload when hardware is unresponsive
When unloading the MANA driver, mana_dealloc_queues() waits for the MANA
hardware to complete any inflight packets and set the pending send count
to zero. But if the hardware has failed, mana_dealloc_queues()
could wait forever.
Fix this by adding a timeout to the wait. Set the timeout to 120 seconds,
which is a somewhat arbitrary value that is more than long enough for
functional hardware to complete any sends.
Maciej Żenczykowski [Mon, 7 Aug 2023 10:25:32 +0000 (03:25 -0700)]
ipv6: adjust ndisc_is_useropt() to also return true for PIO
The upcoming (and nearly finalized):
https://datatracker.ietf.org/doc/draft-collink-6man-pio-pflag/
will update the IPv6 RA to include a new flag in the PIO field,
which will serve as a hint to perform DHCPv6-PD.
As we don't want DHCPv6 related logic inside the kernel, this piece of
information needs to be exposed to userspace. The simplest option is to
simply expose the entire PIO through the already existing mechanism.
Even without this new flag, the already existing PIO R (router address)
flag (from RFC6275) cannot AFAICT be handled entirely in kernel,
and provides useful information that should be exposed to userspace
(the router's global address, for use by Mobile IPv6).
Also cc'ing stable@ for inclusion in LTS, as while technically this is
not quite a bugfix, and instead more of a feature, it is absolutely
trivial and the alternative is manually cherrypicking into all Android
Common Kernel trees - and I know Greg will ask for it to be sent in via
LTS instead...
Cc: Jen Linkova <furry@google.com> Cc: Lorenzo Colitti <lorenzo@google.com> Cc: David Ahern <dsahern@gmail.com> Cc: YOSHIFUJI Hideaki / 吉藤英明 <yoshfuji@linux-ipv6.org> Cc: stable@vger.kernel.org Signed-off-by: Maciej Żenczykowski <maze@google.com> Link: https://lore.kernel.org/r/20230807102533.1147559-1-maze@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Wed, 9 Aug 2023 22:04:44 +0000 (15:04 -0700)]
Merge tag 'wireless-2023-08-09' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless
Johannes Berg says:
====================
Just a few small updates:
* fix an integer overflow in nl80211
* fix rtw89 8852AE disconnections
* fix a buffer overflow in ath12k
* fix AP_VLAN configuration lookups
* fix allocation failure handling in brcm80211
* update MAINTAINERS for some drivers
* tag 'wireless-2023-08-09' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
wifi: ath12k: Fix buffer overflow when scanning with extraie
wifi: nl80211: fix integer overflow in nl80211_parse_mbssid_elems()
wifi: cfg80211: fix sband iftype data lookup for AP_VLAN
wifi: rtw89: fix 8852AE disconnection caused by RX full flags
MAINTAINERS: Remove tree entry for rtl8180
MAINTAINERS: Update entry for rtl8187
wifi: brcm80211: handle params_v1 allocation failure
====================
Ido Schimmel [Tue, 8 Aug 2023 14:15:03 +0000 (17:15 +0300)]
selftests: forwarding: bridge_mdb: Make test more robust
Some test cases check that the group timer is (or isn't) 0. Instead of
grepping for "0.00" grep for " 0.00" as the former can also match
"260.00" which is the default group membership interval.
Ido Schimmel [Tue, 8 Aug 2023 14:15:02 +0000 (17:15 +0300)]
selftests: forwarding: bridge_mdb_max: Fix failing test with old libnet
As explained in commit 8bcfb4ae4d97 ("selftests: forwarding: Fix failing
tests with old libnet"), old versions of libnet (used by mausezahn) do
not use the "SO_BINDTODEVICE" socket option. For IP unicast packets,
this can be solved by prefixing mausezahn invocations with "ip vrf
exec". However, IP multicast packets do not perform routing and simply
egress the bound device, which does not exist in this case.
Fix by specifying the source and destination MAC of the packet which
will cause mausezahn to use a packet socket instead of an IP socket.
Fixes: 3446dcd7df05 ("selftests: forwarding: bridge_mdb_max: Add a new selftest") Reported-by: Mirsad Todorovac <mirsad.todorovac@alu.unizg.hr> Closes: https://lore.kernel.org/netdev/adc5e40d-d040-a65e-eb26-edf47dac5b02@alu.unizg.hr/ Signed-off-by: Ido Schimmel <idosch@nvidia.com> Tested-by: Mirsad Todorovac <mirsad.todorovac@alu.unizg.hr> Reviewed-by: Hangbin Liu <liuhangbin@gmail.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://lore.kernel.org/r/20230808141503.4060661-17-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ido Schimmel [Tue, 8 Aug 2023 14:15:01 +0000 (17:15 +0300)]
selftests: forwarding: bridge_mdb: Fix failing test with old libnet
As explained in commit 8bcfb4ae4d97 ("selftests: forwarding: Fix failing
tests with old libnet"), old versions of libnet (used by mausezahn) do
not use the "SO_BINDTODEVICE" socket option. For IP unicast packets,
this can be solved by prefixing mausezahn invocations with "ip vrf
exec". However, IP multicast packets do not perform routing and simply
egress the bound device, which does not exist in this case.
Fix by specifying the source and destination MAC of the packet which
will cause mausezahn to use a packet socket instead of an IP socket.
Ido Schimmel [Tue, 8 Aug 2023 14:15:00 +0000 (17:15 +0300)]
selftests: forwarding: tc_flower_l2_miss: Fix failing test with old libnet
As explained in commit 8bcfb4ae4d97 ("selftests: forwarding: Fix failing
tests with old libnet"), old versions of libnet (used by mausezahn) do
not use the "SO_BINDTODEVICE" socket option. For IP unicast packets,
this can be solved by prefixing mausezahn invocations with "ip vrf
exec". However, IP multicast packets do not perform routing and simply
egress the bound device, which does not exist in this case.
Fix by specifying the source and destination MAC of the packet which
will cause mausezahn to use a packet socket instead of an IP socket.
Fixes: 8c33266ae26a ("selftests: forwarding: Add layer 2 miss test cases") Reported-by: Mirsad Todorovac <mirsad.todorovac@alu.unizg.hr> Closes: https://lore.kernel.org/netdev/adc5e40d-d040-a65e-eb26-edf47dac5b02@alu.unizg.hr/ Signed-off-by: Ido Schimmel <idosch@nvidia.com> Tested-by: Mirsad Todorovac <mirsad.todorovac@alu.unizg.hr> Reviewed-by: Hangbin Liu <liuhangbin@gmail.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://lore.kernel.org/r/20230808141503.4060661-15-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ido Schimmel [Tue, 8 Aug 2023 14:14:59 +0000 (17:14 +0300)]
selftests: forwarding: tc_tunnel_key: Make filters more specific
The test installs filters that match on various IP fragments (e.g., no
fragment, first fragment) and expects a certain amount of packets to hit
each filter. This is problematic as the filters are not specific enough
and can match IP packets (e.g., IGMP) generated by the stack, resulting
in failures [1].
Fix by making the filters more specific and match on more fields in the
IP header: Source IP, destination IP and protocol.
[1]
# timeout set to 0
# selftests: net/forwarding: tc_tunnel_key.sh
# TEST: tunnel_key nofrag (skip_hw) [FAIL]
# packet smaller than MTU was not tunneled
# INFO: Could not test offloaded functionality
not ok 89 selftests: net/forwarding: tc_tunnel_key.sh # exit=1
Fixes: 533a89b1940f ("selftests: forwarding: add tunnel_key "nofrag" test case") Reported-by: Mirsad Todorovac <mirsad.todorovac@alu.unizg.hr> Closes: https://lore.kernel.org/netdev/adc5e40d-d040-a65e-eb26-edf47dac5b02@alu.unizg.hr/ Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Tested-by: Mirsad Todorovac <mirsad.todorovac@alu.unizg.hr> Acked-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Hangbin Liu <liuhangbin@gmail.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://lore.kernel.org/r/20230808141503.4060661-14-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The test checks that filters that match on source or destination MAC
were only hit once. A host can send more than one packet with a given
source or destination MAC, resulting in failures.
Fix by relaxing the success criterion and instead check that the filters
were not hit zero times. Using tc_check_at_least_x_packets() is also an
option, but it is not available in older kernels.
Ido Schimmel [Tue, 8 Aug 2023 14:14:57 +0000 (17:14 +0300)]
selftests: forwarding: tc_actions: Use ncat instead of nc
The test relies on 'nc' being the netcat version from the nmap project.
While this seems to be the case on Fedora, it is not the case on Ubuntu,
resulting in failures such as [1].
Fix by explicitly using the 'ncat' utility from the nmap project and the
skip the test in case it is not installed.
Ido Schimmel [Tue, 8 Aug 2023 14:14:56 +0000 (17:14 +0300)]
selftests: forwarding: ethtool_mm: Skip when MAC Merge is not supported
MAC Merge cannot be tested with veth pairs, resulting in failures:
# ./ethtool_mm.sh
[...]
TEST: Manual configuration with verification: swp1 to swp2 [FAIL]
Verification did not succeed
Fix by skipping the test when the interfaces do not support MAC Merge.
Fixes: e6991384ace5 ("selftests: forwarding: add a test for MAC Merge layer") Reported-by: Mirsad Todorovac <mirsad.todorovac@alu.unizg.hr> Closes: https://lore.kernel.org/netdev/adc5e40d-d040-a65e-eb26-edf47dac5b02@alu.unizg.hr/ Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Hangbin Liu <liuhangbin@gmail.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://lore.kernel.org/r/20230808141503.4060661-11-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ido Schimmel [Tue, 8 Aug 2023 14:14:55 +0000 (17:14 +0300)]
selftests: forwarding: hw_stats_l3_gre: Skip when using veth pairs
Layer 3 hardware stats cannot be used when the underlying interfaces are
veth pairs, resulting in failures:
# ./hw_stats_l3_gre.sh
TEST: ping gre flat [ OK ]
TEST: Test rx packets: [FAIL]
Traffic not reflected in the counter: 0 -> 0
TEST: Test tx packets: [FAIL]
Traffic not reflected in the counter: 0 -> 0
Fix by skipping the test when used with veth pairs.
Fixes: 813f97a26860 ("selftests: forwarding: Add a tunnel-based test for L3 HW stats") Reported-by: Mirsad Todorovac <mirsad.todorovac@alu.unizg.hr> Closes: https://lore.kernel.org/netdev/adc5e40d-d040-a65e-eb26-edf47dac5b02@alu.unizg.hr/ Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Tested-by: Mirsad Todorovac <mirsad.todorovac@alu.unizg.hr> Reviewed-by: Hangbin Liu <liuhangbin@gmail.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://lore.kernel.org/r/20230808141503.4060661-10-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ido Schimmel [Tue, 8 Aug 2023 14:14:52 +0000 (17:14 +0300)]
selftests: forwarding: Add a helper to skip test when using veth pairs
A handful of tests require physical loopbacks to be used instead of veth
pairs. Add a helper that these tests will invoke in order to be skipped
when executed with veth pairs.
Fixes: 64916b57c0b1 ("selftests: forwarding: Add speed and auto-negotiation test") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Tested-by: Mirsad Todorovac <mirsad.todorovac@alu.unizg.hr> Reviewed-by: Hangbin Liu <liuhangbin@gmail.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://lore.kernel.org/r/20230808141503.4060661-7-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ido Schimmel [Tue, 8 Aug 2023 14:14:51 +0000 (17:14 +0300)]
selftests: forwarding: Set default IPv6 traceroute utility
The test uses the 'TROUTE6' environment variable to encode the name of
the IPv6 traceroute utility. By default (without a configuration file),
this variable is not set, resulting in failures:
# ./ip6_forward_instats_vrf.sh
TEST: ping6 [ OK ]
TEST: Ip6InTooBigErrors [ OK ]
TEST: Ip6InHdrErrors [FAIL]
TEST: Ip6InAddrErrors [ OK ]
TEST: Ip6InDiscards [ OK ]
Fix by setting a default utility name and skip the test if the utility
is not present.
Fixes: 0857d6f8c759 ("ipv6: When forwarding count rx stats on the orig netdev") Reported-by: Mirsad Todorovac <mirsad.todorovac@alu.unizg.hr> Closes: https://lore.kernel.org/netdev/adc5e40d-d040-a65e-eb26-edf47dac5b02@alu.unizg.hr/ Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Tested-by: Mirsad Todorovac <mirsad.todorovac@alu.unizg.hr> Reviewed-by: Hangbin Liu <liuhangbin@gmail.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://lore.kernel.org/r/20230808141503.4060661-6-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ido Schimmel [Tue, 8 Aug 2023 14:14:50 +0000 (17:14 +0300)]
selftests: forwarding: bridge_mdb_max: Check iproute2 version
The selftest relies on iproute2 changes present in version 6.3, but the
test does not check for it, resulting in errors:
# ./bridge_mdb_max.sh
INFO: 802.1d tests
TEST: cfg4: port: ngroups reporting [FAIL]
Number of groups was null, now is null, but 5 expected
TEST: ctl4: port: ngroups reporting [FAIL]
Number of groups was null, now is null, but 5 expected
TEST: cfg6: port: ngroups reporting [FAIL]
Number of groups was null, now is null, but 5 expected
[...]
Fix by skipping the test if iproute2 is too old.
Fixes: 3446dcd7df05 ("selftests: forwarding: bridge_mdb_max: Add a new selftest") Reported-by: Mirsad Todorovac <mirsad.todorovac@alu.unizg.hr> Closes: https://lore.kernel.org/netdev/6b04b2ba-2372-6f6b-3ac8-b7cba1cfae83@alu.unizg.hr/ Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Tested-by: Mirsad Todorovac <mirsad.todorovac@alu.unizg.hr> Reviewed-by: Hangbin Liu <liuhangbin@gmail.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://lore.kernel.org/r/20230808141503.4060661-5-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ido Schimmel [Tue, 8 Aug 2023 14:14:49 +0000 (17:14 +0300)]
selftests: forwarding: bridge_mdb: Check iproute2 version
The selftest relies on iproute2 changes present in version 6.3, but the
test does not check for it, resulting in error:
# ./bridge_mdb.sh
INFO: # Host entries configuration tests
TEST: Common host entries configuration tests (IPv4) [FAIL]
Managed to add IPv4 host entry with a filter mode
TEST: Common host entries configuration tests (IPv6) [FAIL]
Managed to add IPv6 host entry with a filter mode
TEST: Common host entries configuration tests (L2) [FAIL]
Managed to add L2 host entry with a filter mode
INFO: # Port group entries configuration tests - (*, G)
Command "replace" is unknown, try "bridge mdb help".
[...]
Fix by skipping the test if iproute2 is too old.
Fixes: b6d00da08610 ("selftests: forwarding: Add bridge MDB test") Reported-by: Mirsad Todorovac <mirsad.todorovac@alu.unizg.hr> Closes: https://lore.kernel.org/netdev/6b04b2ba-2372-6f6b-3ac8-b7cba1cfae83@alu.unizg.hr/ Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Tested-by: Mirsad Todorovac <mirsad.todorovac@alu.unizg.hr> Reviewed-by: Hangbin Liu <liuhangbin@gmail.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://lore.kernel.org/r/20230808141503.4060661-4-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ido Schimmel [Tue, 8 Aug 2023 14:14:48 +0000 (17:14 +0300)]
selftests: forwarding: Switch off timeout
The default timeout for selftests is 45 seconds, but it is not enough
for forwarding selftests which can takes minutes to finish depending on
the number of tests cases:
# make -C tools/testing/selftests TARGETS=net/forwarding run_tests
TAP version 13
1..102
# timeout set to 45
# selftests: net/forwarding: bridge_igmp.sh
# TEST: IGMPv2 report 239.10.10.10 [ OK ]
# TEST: IGMPv2 leave 239.10.10.10 [ OK ]
# TEST: IGMPv3 report 239.10.10.10 is_include [ OK ]
# TEST: IGMPv3 report 239.10.10.10 include -> allow [ OK ]
#
not ok 1 selftests: net/forwarding: bridge_igmp.sh # TIMEOUT 45 seconds
Fix by switching off the timeout and setting it to 0. A similar change
was done for BPF selftests in commit 6fc5916cc256 ("selftests: bpf:
Switch off timeout").
Fixes: 81573b18f26d ("selftests/net/forwarding: add Makefile to install tests") Reported-by: Mirsad Todorovac <mirsad.todorovac@alu.unizg.hr> Closes: https://lore.kernel.org/netdev/8d149f8c-818e-d141-a0ce-a6bae606bc22@alu.unizg.hr/ Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Tested-by: Mirsad Todorovac <mirsad.todorovac@alu.unizg.hr> Reviewed-by: Hangbin Liu <liuhangbin@gmail.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://lore.kernel.org/r/20230808141503.4060661-3-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ido Schimmel [Tue, 8 Aug 2023 14:14:47 +0000 (17:14 +0300)]
selftests: forwarding: Skip test when no interfaces are specified
As explained in [1], the forwarding selftests are meant to be run with
either physical loopbacks or veth pairs. The interfaces are expected to
be specified in a user-provided forwarding.config file or as command
line arguments. By default, this file is not present and the tests fail:
# make -C tools/testing/selftests TARGETS=net/forwarding run_tests
[...]
TAP version 13
1..102
# timeout set to 45
# selftests: net/forwarding: bridge_igmp.sh
# Command line is not complete. Try option "help"
# Failed to create netif
not ok 1 selftests: net/forwarding: bridge_igmp.sh # exit=1
[...]
Fix by skipping a test if interfaces are not provided either via the
configuration file or command line arguments.
# make -C tools/testing/selftests TARGETS=net/forwarding run_tests
[...]
TAP version 13
1..102
# timeout set to 45
# selftests: net/forwarding: bridge_igmp.sh
# SKIP: Cannot create interface. Name not specified
ok 1 selftests: net/forwarding: bridge_igmp.sh # SKIP
[1] tools/testing/selftests/net/forwarding/README
Fixes: 81573b18f26d ("selftests/net/forwarding: add Makefile to install tests") Reported-by: Mirsad Todorovac <mirsad.todorovac@alu.unizg.hr> Closes: https://lore.kernel.org/netdev/856d454e-f83c-20cf-e166-6dc06cbc1543@alu.unizg.hr/ Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Tested-by: Mirsad Todorovac <mirsad.todorovac@alu.unizg.hr> Reviewed-by: Hangbin Liu <liuhangbin@gmail.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://lore.kernel.org/r/20230808141503.4060661-2-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Wed, 9 Aug 2023 20:42:09 +0000 (13:42 -0700)]
Merge branch 'nexthop-nexthop-dump-fixes'
Ido Schimmel says:
====================
nexthop: Nexthop dump fixes
Patches #1 and #3 fix two problems related to nexthops and nexthop
buckets dump, respectively. Patch #2 is a preparation for the third
patch.
The pattern described in these patches of splitting the NLMSG_DONE to a
separate response is prevalent in other rtnetlink dump callbacks. I
don't know if it's because I'm missing something or if this was done
intentionally to ensure the message is delivered to user space. After
commit 0642840b8bb0 ("af_netlink: ensure that NLMSG_DONE never fails in
dumps") this is no longer necessary and I can improve these dump
callbacks assuming this analysis is correct.
Ido Schimmel [Tue, 8 Aug 2023 07:52:33 +0000 (10:52 +0300)]
nexthop: Fix infinite nexthop bucket dump when using maximum nexthop ID
A netlink dump callback can return a positive number to signal that more
information needs to be dumped or zero to signal that the dump is
complete. In the second case, the core netlink code will append the
NLMSG_DONE message to the skb in order to indicate to user space that
the dump is complete.
The nexthop bucket dump callback always returns a positive number if
nexthop buckets were filled in the provided skb, even if the dump is
complete. This means that a dump will span at least two recvmsg() calls
as long as nexthop buckets are present. In the last recvmsg() call the
dump callback will not fill in any nexthop buckets because the previous
call indicated that the dump should restart from the last dumped nexthop
ID plus one.
# ip link add name dummy1 up type dummy
# ip nexthop add id 1 dev dummy1
# ip nexthop add id 10 group 1 type resilient buckets 2
# strace -e sendto,recvmsg -s 5 ip nexthop bucket
sendto(3, [[{nlmsg_len=24, nlmsg_type=RTM_GETNEXTHOPBUCKET, nlmsg_flags=NLM_F_REQUEST|NLM_F_DUMP, nlmsg_seq=1691396980, nlmsg_pid=0}, {family=AF_UNSPEC, data="\x00\x00\x00\x00\x00"...}], {nlmsg_len=0, nlmsg_type=0 /* NLMSG_??? */, nlmsg_flags=0, nlmsg_seq=0, nlmsg_pid=0}], 152, 0, NULL, 0) = 152
recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 128
recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[[{nlmsg_len=64, nlmsg_type=RTM_NEWNEXTHOPBUCKET, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691396980, nlmsg_pid=347}, {family=AF_UNSPEC, data="\x00\x00\x00\x00\x00"...}], [{nlmsg_len=64, nlmsg_type=RTM_NEWNEXTHOPBUCKET, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691396980, nlmsg_pid=347}, {family=AF_UNSPEC, data="\x00\x00\x00\x00\x00"...}]], iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 128
id 10 index 0 idle_time 6.66 nhid 1
id 10 index 1 idle_time 6.66 nhid 1
recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 20
recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691396980, nlmsg_pid=347}, 0], iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 20
+++ exited with 0 +++
This behavior is both inefficient and buggy. If the last nexthop to be
dumped had the maximum ID of 0xffffffff, then the dump will restart from
0 (0xffffffff + 1) and never end:
# ip link add name dummy1 up type dummy
# ip nexthop add id 1 dev dummy1
# ip nexthop add id $((2**32-1)) group 1 type resilient buckets 2
# ip nexthop bucket
id 4294967295 index 0 idle_time 5.55 nhid 1
id 4294967295 index 1 idle_time 5.55 nhid 1
id 4294967295 index 0 idle_time 5.55 nhid 1
id 4294967295 index 1 idle_time 5.55 nhid 1
[...]
Fix by adjusting the dump callback to return zero when the dump is
complete. After the fix only one recvmsg() call is made and the
NLMSG_DONE message is appended to the RTM_NEWNEXTHOPBUCKET responses:
# ip link add name dummy1 up type dummy
# ip nexthop add id 1 dev dummy1
# ip nexthop add id $((2**32-1)) group 1 type resilient buckets 2
# strace -e sendto,recvmsg -s 5 ip nexthop bucket
sendto(3, [[{nlmsg_len=24, nlmsg_type=RTM_GETNEXTHOPBUCKET, nlmsg_flags=NLM_F_REQUEST|NLM_F_DUMP, nlmsg_seq=1691396737, nlmsg_pid=0}, {family=AF_UNSPEC, data="\x00\x00\x00\x00\x00"...}], {nlmsg_len=0, nlmsg_type=0 /* NLMSG_??? */, nlmsg_flags=0, nlmsg_seq=0, nlmsg_pid=0}], 152, 0, NULL, 0) = 152
recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 148
recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[[{nlmsg_len=64, nlmsg_type=RTM_NEWNEXTHOPBUCKET, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691396737, nlmsg_pid=350}, {family=AF_UNSPEC, data="\x00\x00\x00\x00\x00"...}], [{nlmsg_len=64, nlmsg_type=RTM_NEWNEXTHOPBUCKET, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691396737, nlmsg_pid=350}, {family=AF_UNSPEC, data="\x00\x00\x00\x00\x00"...}], [{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691396737, nlmsg_pid=350}, 0]], iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 148
id 4294967295 index 0 idle_time 6.61 nhid 1
id 4294967295 index 1 idle_time 6.61 nhid 1
+++ exited with 0 +++
Note that if the NLMSG_DONE message cannot be appended because of size
limitations, then another recvmsg() will be needed, but the core netlink
code will not invoke the dump callback and simply reply with a
NLMSG_DONE message since it knows that the callback previously returned
zero.
Add a test that fails before the fix:
# ./fib_nexthops.sh -t basic_res
[...]
TEST: Maximum nexthop ID dump [FAIL]
[...]
And passes after it:
# ./fib_nexthops.sh -t basic_res
[...]
TEST: Maximum nexthop ID dump [ OK ]
[...]
Fixes: 8a1bbabb034d ("nexthop: Add netlink handlers for bucket dump") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20230808075233.3337922-4-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ido Schimmel [Tue, 8 Aug 2023 07:52:32 +0000 (10:52 +0300)]
nexthop: Make nexthop bucket dump more efficient
rtm_dump_nexthop_bucket_nh() is used to dump nexthop buckets belonging
to a specific resilient nexthop group. The function returns a positive
return code (the skb length) upon both success and failure.
The above behavior is problematic. When a complete nexthop bucket dump
is requested, the function that walks the different nexthops treats the
non-zero return code as an error. This causes buckets belonging to
different resilient nexthop groups to be dumped using different buffers
even if they can all fit in the same buffer:
# ip link add name dummy1 up type dummy
# ip nexthop add id 1 dev dummy1
# ip nexthop add id 10 group 1 type resilient buckets 1
# ip nexthop add id 20 group 1 type resilient buckets 1
# strace -e recvmsg -s 0 ip nexthop bucket
[...]
recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[...], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 64
id 10 index 0 idle_time 10.27 nhid 1
[...]
recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[...], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 64
id 20 index 0 idle_time 6.44 nhid 1
[...]
Fix by only returning a non-zero return code when an error occurred and
restarting the dump from the bucket index we failed to fill in. This
allows buckets belonging to different resilient nexthop groups to be
dumped using the same buffer:
# ip link add name dummy1 up type dummy
# ip nexthop add id 1 dev dummy1
# ip nexthop add id 10 group 1 type resilient buckets 1
# ip nexthop add id 20 group 1 type resilient buckets 1
# strace -e recvmsg -s 0 ip nexthop bucket
[...]
recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[...], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 128
id 10 index 0 idle_time 30.21 nhid 1
id 20 index 0 idle_time 26.7 nhid 1
[...]
While this change is more of a performance improvement change than an
actual bug fix, it is a prerequisite for a subsequent patch that does
fix a bug.
Fixes: 8a1bbabb034d ("nexthop: Add netlink handlers for bucket dump") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20230808075233.3337922-3-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ido Schimmel [Tue, 8 Aug 2023 07:52:31 +0000 (10:52 +0300)]
nexthop: Fix infinite nexthop dump when using maximum nexthop ID
A netlink dump callback can return a positive number to signal that more
information needs to be dumped or zero to signal that the dump is
complete. In the second case, the core netlink code will append the
NLMSG_DONE message to the skb in order to indicate to user space that
the dump is complete.
The nexthop dump callback always returns a positive number if nexthops
were filled in the provided skb, even if the dump is complete. This
means that a dump will span at least two recvmsg() calls as long as
nexthops are present. In the last recvmsg() call the dump callback will
not fill in any nexthops because the previous call indicated that the
dump should restart from the last dumped nexthop ID plus one.
This behavior is both inefficient and buggy. If the last nexthop to be
dumped had the maximum ID of 0xffffffff, then the dump will restart from
0 (0xffffffff + 1) and never end:
# ip nexthop add id $((2**32-1)) blackhole
# ip nexthop
id 4294967295 blackhole
id 4294967295 blackhole
[...]
Fix by adjusting the dump callback to return zero when the dump is
complete. After the fix only one recvmsg() call is made and the
NLMSG_DONE message is appended to the RTM_NEWNEXTHOP response:
Note that if the NLMSG_DONE message cannot be appended because of size
limitations, then another recvmsg() will be needed, but the core netlink
code will not invoke the dump callback and simply reply with a
NLMSG_DONE message since it knows that the callback previously returned
zero.
Add a test that fails before the fix:
# ./fib_nexthops.sh -t basic
[...]
TEST: Maximum nexthop ID dump [FAIL]
[...]
And passes after it:
# ./fib_nexthops.sh -t basic
[...]
TEST: Maximum nexthop ID dump [ OK ]
[...]
Fixes: ab84be7e54fc ("net: Initial nexthop code") Reported-by: Petr Machata <petrm@nvidia.com> Closes: https://lore.kernel.org/netdev/87sf91enuf.fsf@nvidia.com/ Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20230808075233.3337922-2-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vlad Buslov [Tue, 8 Aug 2023 09:35:21 +0000 (11:35 +0200)]
vlan: Fix VLAN 0 memory leak
The referenced commit intended to fix memleak of VLAN 0 that is implicitly
created on devices with NETIF_F_HW_VLAN_CTAG_FILTER feature. However, it
doesn't take into account that the feature can be re-set during the
netdevice lifetime which will cause memory leak if feature is disabled
during the device deletion as illustrated by [0]. Fix the leak by
unconditionally deleting VLAN 0 on NETDEV_DOWN event.
Wen Gong [Wed, 9 Aug 2023 08:12:41 +0000 (04:12 -0400)]
wifi: ath12k: Fix buffer overflow when scanning with extraie
If cfg80211 is providing extraie's for a scanning process then ath12k will
copy that over to the firmware. The extraie.len is a 32 bit value in struct
element_info and describes the amount of bytes for the vendor information
elements.
The problem is the allocation of the buffer. It has to align the TLV
sections by 4 bytes. But the code was using an u8 to store the newly
calculated length of this section (with alignment). And the new
calculated length was then used to allocate the skbuff. But the actual
code to copy in the data is using the extraie.len and not the calculated
"aligned" length.
The length of extraie with IEEE80211_HW_SINGLE_SCAN_ON_ALL_BANDS enabled
was 264 bytes during tests with a wifi card. But it only allocated 8
bytes (264 bytes % 256) for it. As consequence, the code to memcpy the
extraie into the skb was then just overwriting data after skb->end. Things
like shinfo were therefore corrupted. This could usually be seen by a crash
in skb_zcopy_clear which tried to call a ubuf_info callback (using a bogus
address).
Keith Yeo [Mon, 31 Jul 2023 03:47:20 +0000 (11:47 +0800)]
wifi: nl80211: fix integer overflow in nl80211_parse_mbssid_elems()
nl80211_parse_mbssid_elems() uses a u8 variable num_elems to count the
number of MBSSID elements in the nested netlink attribute attrs, which can
lead to an integer overflow if a user of the nl80211 interface specifies
256 or more elements in the corresponding attribute in userspace. The
integer overflow can lead to a heap buffer overflow as num_elems determines
the size of the trailing array in elems, and this array is thereafter
written to for each element in attrs.
Note that this vulnerability only affects devices with the
wiphy->mbssid_max_interfaces member set for the wireless physical device
struct in the device driver, and can only be triggered by a process with
CAP_NET_ADMIN capabilities.
Fix this by checking for a maximum of 255 elements in attrs.
Cc: stable@vger.kernel.org Fixes: dc1e3cb8da8b ("nl80211: MBSSID and EMA support in AP mode") Signed-off-by: Keith Yeo <keithyjy@gmail.com> Link: https://lore.kernel.org/r/20230731034719.77206-1-keithyjy@gmail.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
commit 0227f058aa29 ("net/smc: Unbind r/w buffer size from clcsock
and make them tunable") started to derive the effective buffer size for
SMC connections inconsistently in case a TCP fallback was used and
memory consumption of SMC with the default settings was doubled when
a connection negotiated SMC. That was not what we want.
This series consolidates the resulting effective buffer size that is
used with SMC sockets, which is based on Jan Karcher's effort (see
[1]). For all TCP exchanges (in particular in case of a fall back when
no SMC connection was possible) the values from net.ipv4.tcp_[rw]mem
are used. If SMC succeeds in establishing a SMC connection, the newly
introduced values from net.smc.[rw]mem are used.
net.smc.[rw]mem is initialized to 64kB, respectively. Internal test
have show this to be a good compromise between throughput/latency
and memory consumption. Also net.smc.[rw]mem is now decoupled completely
from any tuning through net.ipv4.tcp_[rw]mem.
If a user chose to tune a socket's receive or send buffer size with
setsockopt, this tuning is now consistently applied to either fall-back
TCP or proper SMC connections over the socket.
Thanks,
Gerd
v2 - v3:
- Rebase to and resolve conflict of second patch with latest net/master.
v1 - v2:
- In second patch, use sock_net() helper as suggested by Tony and demanded
by kernel test robot.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Gerd Bayer [Fri, 4 Aug 2023 17:06:24 +0000 (19:06 +0200)]
net/smc: Use correct buffer sizes when switching between TCP and SMC
Tuning of the effective buffer size through setsockopts was working for
SMC traffic only but not for TCP fall-back connections even before
commit 0227f058aa29 ("net/smc: Unbind r/w buffer size from clcsock and
make them tunable"). That change made it apparent that TCP fall-back
connections would use net.smc.[rw]mem as buffer size instead of
net.ipv4_tcp_[rw]mem.
Amend the code that copies attributes between the (TCP) clcsock and the
SMC socket and adjust buffer sizes appropriately:
- Copy over sk_userlocks so that both sockets agree on whether tuning
via setsockopt is active.
- When falling back to TCP use sk_sndbuf or sk_rcvbuf as specified with
setsockopt. Otherwise, use the sysctl value for TCP/IPv4.
- Likewise, use either values from setsockopt or from sysctl for SMC
(duplicated) on successful SMC connect.
In smc_tcp_listen_work() drop the explicit copy of buffer sizes as that
is taken care of by the attribute copy.
Fixes: 0227f058aa29 ("net/smc: Unbind r/w buffer size from clcsock and make them tunable") Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com> Reviewed-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: Gerd Bayer <gbayer@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Gerd Bayer [Fri, 4 Aug 2023 17:06:23 +0000 (19:06 +0200)]
net/smc: Fix setsockopt and sysctl to specify same buffer size again
Commit 0227f058aa29 ("net/smc: Unbind r/w buffer size from clcsock
and make them tunable") introduced the net.smc.rmem and net.smc.wmem
sysctls to specify the size of buffers to be used for SMC type
connections. This created a regression for users that specified the
buffer size via setsockopt() as the effective buffer size was now
doubled.
Re-introduce the division by 2 in the SMC buffer create code and level
this out by duplicating the net.smc.[rw]mem values used for initializing
sk_rcvbuf/sk_sndbuf at socket creation time. This gives users of both
methods (setsockopt or sysctl) the effective buffer size that they
expect.
Initialize net.smc.[rw]mem from its own constant of 64kB, respectively.
Internal performance tests show that this value is a good compromise
between throughput/latency and memory consumption. Also, this decouples
it from any tuning that was done to net.ipv4.tcp_[rw]mem[1] before the
module for SMC protocol was loaded. Check that no more than INT_MAX / 2
is assigned to net.smc.[rw]mem, in order to avoid any overflow condition
when that is doubled for use in sk_sndbuf or sk_rcvbuf.
While at it, drop the confusing sk_buf_size variable from
__smc_buf_create and name "compressed" buffer size variables more
consistently.
Background:
Before the commit mentioned above, SMC's buffer allocator in
__smc_buf_create() always used half of the sockets' sk_rcvbuf/sk_sndbuf
value as initial value to search for appropriate buffers. If the search
resorted to using a bigger buffer when all buffers of the specified
size were busy, the duplicate of the used effective buffer size is
stored back to sk_rcvbuf/sk_sndbuf.
When available, buffers of exactly the size that a user had specified as
input to setsockopt() were used, despite setsockopt()'s documentation in
"man 7 socket" talking of a mandatory duplication:
[...]
SO_SNDBUF
Sets or gets the maximum socket send buffer in bytes.
The kernel doubles this value (to allow space for book‐
keeping overhead) when it is set using setsockopt(2),
and this doubled value is returned by getsockopt(2).
The default value is set by the
/proc/sys/net/core/wmem_default file and the maximum
allowed value is set by the /proc/sys/net/core/wmem_max
file. The minimum (doubled) value for this option is
2048.
[...]
Fixes: 0227f058aa29 ("net/smc: Unbind r/w buffer size from clcsock and make them tunable") Co-developed-by: Jan Karcher <jaka@linux.ibm.com> Signed-off-by: Jan Karcher <jaka@linux.ibm.com> Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com> Reviewed-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: Gerd Bayer <gbayer@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
I'm not sure who should take this patch set (net maintainers or PCI
maintainers). Everyone could pick up just their part, and that would
work (no compile time dependencies). However, the entire series needs
ACK from both sides and Rob for sure.
Since commit 6fffbc7ae137 ("PCI: Honor firmware's device disabled
status"), this is redundant and does nothing, because enetc_pf_probe()
no longer even gets called.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Thu, 3 Aug 2023 13:58:57 +0000 (16:58 +0300)]
net: enetc: reimplement RFS/RSS memory clearing as PCI quirk
The workaround implemented in commit 3222b5b613db ("net: enetc:
initialize RFS/RSS memories for unused ports too") is no longer
effective after commit 6fffbc7ae137 ("PCI: Honor firmware's device
disabled status"). Thus, it has introduced a regression and we see AER
errors being reported again:
$ ip link set sw2p0 up && dhclient -i sw2p0 && ip addr show sw2p0
fsl_enetc 0000:00:00.2 eno2: configuring for fixed/internal link mode
fsl_enetc 0000:00:00.2 eno2: Link is Up - 2.5Gbps/Full - flow control rx/tx
mscc_felix 0000:00:00.5 swp2: configuring for fixed/sgmii link mode
mscc_felix 0000:00:00.5 swp2: Link is Up - 1Gbps/Full - flow control off
sja1105 spi2.2 sw2p0: configuring for phy/rgmii-id link mode
sja1105 spi2.2 sw2p0: Link is Up - 1Gbps/Full - flow control off
pcieport 0000:00:1f.0: AER: Multiple Corrected error received: 0000:00:00.0
pcieport 0000:00:1f.0: AER: can't find device of ID0000
Rob's suggestion is to reimplement the enetc driver workaround as a
PCI fixup, and to modify the PCI core to run the fixups for all PCI
functions. This change handles the first part.
We refactor the common code in enetc_psi_create() and enetc_psi_destroy(),
and use the PCI fixup only for those functions for which enetc_pf_probe()
won't get called. This avoids some work being done twice for the PFs
which are enabled.
Vladimir Oltean [Thu, 3 Aug 2023 13:58:56 +0000 (16:58 +0300)]
PCI: move OF status = "disabled" detection to dev->match_driver
The blamed commit has broken probing on
arch/arm64/boot/dts/freescale/fsl-ls1028a.dtsi when &enetc_port0
(PCI function 0) has status = "disabled".
Background: pci_scan_slot() has logic to say that if the function 0 of a
device is absent, the entire device is absent and we can skip the other
functions entirely. Traditionally, this has meant that
pci_bus_read_dev_vendor_id() returns an error code for that function.
However, since the blamed commit, there is an extra confounding
condition: function 0 of the device exists and has a valid vendor id,
but it is disabled in the device tree. In that case, pci_scan_slot()
would incorrectly skip the entire device instead of just that function.
In the case of NXP LS1028A, status = "disabled" does not mean that the
PCI function's config space is not available for reading. It is, but the
Ethernet port is just not functionally useful with a particular SerDes
protocol configuration (0x9999) due to pinmuxing constraints of the Soc.
So, pci_scan_slot() skips all other functions on the ENETC ECAM
(enetc_port1, enetc_port2, enetc_mdio_pf3 etc) when just enetc_port0 had
to not be probed.
There is an additional regression introduced by the change, caused by
its fundamental premise. The enetc driver needs to run code for all PCI
functions, regardless of whether they're enabled or not in the device
tree. That is no longer possible if the driver's probe function is no
longer called. But Rob recommends that we move the of_device_is_available()
detection to dev->match_driver, and this makes the PCI fixups still run
on all functions, while just probing drivers for those functions that
are enabled. So, a separate change in the enetc driver will have to move
the workarounds to a PCI fixup.
Piotr Gardocki [Mon, 7 Aug 2023 20:50:11 +0000 (13:50 -0700)]
iavf: fix potential races for FDIR filters
Add fdir_fltr_lock locking in unprotected places.
The change in iavf_fdir_is_dup_fltr adds a spinlock around a loop which
iterates over all filters and looks for a duplicate. The filter can be
removed from list and freed from memory at the same time it's being
compared. All other places where filters are deleted are already
protected with spinlock.
The remaining changes protect adapter->fdir_active_fltr variable so now
all its uses are under a spinlock.
Fixes: 527691bf0682 ("iavf: Support IPv4 Flow Director filters") Signed-off-by: Piotr Gardocki <piotrx.gardocki@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20230807205011.3129224-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Muhammad Husaini Zulkifli [Mon, 7 Aug 2023 20:51:29 +0000 (13:51 -0700)]
igc: Add lock to safeguard global Qbv variables
Access to shared variables through hrtimer requires locking in order
to protect the variables because actions to write into these variables
(oper_gate_closed, admin_gate_closed, and qbv_transition) might potentially
occur simultaneously. This patch provides a locking mechanisms to avoid
such scenarios.
Fixes: 175c241288c0 ("igc: Fix TX Hang issue when QBV Gate is closed") Suggested-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Muhammad Husaini Zulkifli <muhammad.husaini.zulkifli@intel.com> Tested-by: Naama Meir <naamax.meir@linux.intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://lore.kernel.org/r/20230807205129.3129346-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Tue, 8 Aug 2023 23:33:18 +0000 (16:33 -0700)]
Merge tag 'mlx5-fixes-2023-08-07' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5 fixes 2023-08-07
This series provides bug fixes to mlx5 driver.
* tag 'mlx5-fixes-2023-08-07' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
net/mlx5e: Add capability check for vnic counters
net/mlx5: Reload auxiliary devices in pci error handlers
net/mlx5: Skip clock update work when device is in error state
net/mlx5: LAG, Check correct bucket when modifying LAG
net/mlx5e: Unoffload post act rule when handling FIB events
net/mlx5: Fix devlink controller number for ECVF
net/mlx5: Allow 0 for total host VFs
net/mlx5: Return correct EC_VF function ID
net/mlx5: DR, Fix wrong allocation of modify hdr pattern
net/mlx5e: TC, Fix internal port memory leak
net/mlx5e: Take RTNL lock when needed before calling xdp_set_features()
====================
In externel_lb process, the hns3 driver call napi_disable()
first, then the reset happen, then the restore process of the
externel_lb will fail, and will not call napi_enable(). When
doing externel_lb again, napi_disable() will be double call,
cause a deadlock of rtnl_lock().
This patch use the HNS3_NIC_STATE_DOWN state to protect the
calling of napi_disable() and napi_enable() in externel_lb
process, just as the usage in ndo_stop() and ndo_start().
Fixes: 04b6ba143521 ("net: hns3: add support for external loopback test") Signed-off-by: Yonglong Liu <liuyonglong@huawei.com> Signed-off-by: Jijie Shao <shaojijie@huawei.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Link: https://lore.kernel.org/r/20230807113452.474224-5-shaojijie@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jie Wang [Mon, 7 Aug 2023 11:34:51 +0000 (19:34 +0800)]
net: hns3: add wait until mac link down
In some configure flow of hns3 driver, for example, change mtu, it will
disable MAC through firmware before configuration. But firmware disables
MAC asynchronously. The rx traffic may be not stopped in this case.
So fixes it by waiting until mac link is down.
Fixes: a9775bb64aa7 ("net: hns3: fix set and get link ksettings issue") Signed-off-by: Jie Wang <wangjie125@huawei.com> Signed-off-by: Jijie Shao <shaojijie@huawei.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Link: https://lore.kernel.org/r/20230807113452.474224-4-shaojijie@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David Rheinsberg [Mon, 7 Aug 2023 08:12:25 +0000 (10:12 +0200)]
net/unix: use consistent error code in SO_PEERPIDFD
Change the new (unreleased) SO_PEERPIDFD sockopt to return ENODATA
rather than ESRCH if a socket type does not support remote peer-PID
queries.
Currently, SO_PEERPIDFD returns ESRCH when the socket in question is
not an AF_UNIX socket. This is quite unexpected, given that one would
assume ESRCH means the peer process already exited and thus cannot be
found. However, in that case the sockopt actually returns EINVAL (via
pidfd_prepare()). This is rather inconsistent with other syscalls, which
usually return ESRCH if a given PID refers to a non-existant process.
This changes SO_PEERPIDFD to return ENODATA instead. This is also what
SO_PEERGROUPS returns, and thus keeps a consistent behavior across
sockopts.
Note that this code is returned in 2 cases: First, if the socket type is
not AF_UNIX, and secondly if the socket was not yet connected. In both
cases ENODATA seems suitable.
Signed-off-by: David Rheinsberg <david@readahead.eu> Reviewed-by: Christian Brauner <brauner@kernel.org> Acked-by: Luca Boccassi <bluca@debian.org> Fixes: 7b26952a91cf ("net: core: add getsockopt SO_PEERPIDFD") Link: https://lore.kernel.org/r/20230807081225.816199-1-david@readahead.eu Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ping-Ke Shih [Tue, 8 Aug 2023 00:54:26 +0000 (08:54 +0800)]
wifi: rtw89: fix 8852AE disconnection caused by RX full flags
RX full flags are raised if certain types of RX FIFO are full, and then
drop all following MPDU of AMPDU. In order to resume to receive MPDU
when RX FIFO becomes available, we clear the register bits by the
commit a0d99ebb3ecd ("wifi: rtw89: initialize DMA of CMAC"). But, 8852AE
needs more settings to support this. To quickly fix disconnection problem,
revert the behavior as before.
Fixes: a0d99ebb3ecd ("wifi: rtw89: initialize DMA of CMAC") Reported-by: Damian B <bronecki.damian@gmail.com> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217710 Cc: <Stable@vger.kernel.org> Signed-off-by: Ping-Ke Shih <pkshih@realtek.com> Tested-by: Damian B <bronecki.damian@gmail.com> Link: https://lore.kernel.org/r/20230808005426.5327-1-pkshih@realtek.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
$ ip route
10.0.0.0/24 dev swp5 proto kernel scope link src 10.0.0.1 rt_trap
10.0.1.0/24 nhid 20 via 10.0.0.2 dev swp5 rt_trap
When creating a route referencing a nexthop via its ID, the nexthop will
be stored in a separate nh pointer instead of the array of nexthops in
the fib_info struct. This causes issues since fib_info_nh() only handles
the nexthops array, but not the separate nh pointer, and will loudly
WARN about it.
In contrast fib_info_nhc() handles both, but returns a fib_nh_common
pointer instead of a fib_nh pointer. Luckily we only ever access fields
from the fib_nh_common parts, so we can just replace all instances of
fib_info_nh() with fib_info_nhc() and access the fields via their
fib_nh_common names.
This allows handling IPv4 routes with an external nexthop, and they now
get offloaded as expected:
$ ip route
10.0.0.0/24 dev swp5 proto kernel scope link src 10.0.0.1 rt_trap
10.0.1.0/24 nhid 20 via 10.0.0.2 dev swp5 offload rt_offload
xdp->frame_sz > PAGE_SIZE check was introduced in commit c8741e2bfe87
("xdp: Allow bpf_xdp_adjust_tail() to grow packet size"). But Jesper
Dangaard Brouer <jbrouer@redhat.com> noted that after introducing the
xdp_init_buff() which all XDP driver use - it's safe to remove this
check. The original intend was to catch cases where XDP drivers have
not been updated to use xdp.frame_sz, but that is not longer a concern
(since xdp_init_buff).
Running the initial syzkaller repro it was discovered that the
contiguous physical memory allocation is used for both xdp paths in
tun_get_user(), e.g. tun_build_skb() and tun_alloc_skb(). It was also
stated by Jesper Dangaard Brouer <jbrouer@redhat.com> that XDP can
work on higher order pages, as long as this is contiguous physical
memory (e.g. a page).
Andrew Kanner [Thu, 3 Aug 2023 18:59:48 +0000 (20:59 +0200)]
drivers: net: prevent tun_build_skb() to exceed the packet size limit
Using the syzkaller repro with reduced packet size it was discovered
that XDP_PACKET_HEADROOM is not checked in tun_can_build_skb(),
although pad may be incremented in tun_build_skb(). This may end up
with exceeding the PAGE_SIZE limit in tun_build_skb().
Jason Wang <jasowang@redhat.com> proposed to count XDP_PACKET_HEADROOM
always (e.g. without rcu_access_pointer(tun->xdp_prog)) in
tun_can_build_skb() since there's a window during which XDP program
might be attached between tun_can_build_skb() and tun_build_skb().
Jakub Kicinski [Mon, 7 Aug 2023 19:26:58 +0000 (12:26 -0700)]
Merge branch 'wireguard-fixes-for-6-5-rc6'
Jason A. Donenfeld says:
====================
wireguard fixes for 6.5-rc6
Just one patch this time, somewhat late in the cycle:
1) Fix an off-by-one calculation for the maximum node depth size in the
allowedips trie data structure, and also adjust the self-tests to hit
this case so it doesn't regress again in the future.
====================
Jason A. Donenfeld [Mon, 7 Aug 2023 13:21:27 +0000 (15:21 +0200)]
wireguard: allowedips: expand maximum node depth
In the allowedips self-test, nodes are inserted into the tree, but it
generated an even amount of nodes, but for checking maximum node depth,
there is of course the root node, which makes the total number
necessarily odd. With two few nodes added, it never triggered the
maximum depth check like it should have. So, add 129 nodes instead of
128 nodes, and do so with a more straightforward scheme, starting with
all the bits set, and shifting over one each time. Then increase the
maximum depth to 129, and choose a better name for that variable to
make it clear that it represents depth as opposed to bits.
Ziyang Xuan [Wed, 2 Aug 2023 11:43:20 +0000 (19:43 +0800)]
bonding: Fix incorrect deletion of ETH_P_8021AD protocol vid from slaves
BUG_ON(!vlan_info) is triggered in unregister_vlan_dev() with
following testcase:
# ip netns add ns1
# ip netns exec ns1 ip link add bond0 type bond mode 0
# ip netns exec ns1 ip link add bond_slave_1 type veth peer veth2
# ip netns exec ns1 ip link set bond_slave_1 master bond0
# ip netns exec ns1 ip link add link bond_slave_1 name vlan10 type vlan id 10 protocol 802.1ad
# ip netns exec ns1 ip link add link bond0 name bond0_vlan10 type vlan id 10 protocol 802.1ad
# ip netns exec ns1 ip link set bond_slave_1 nomaster
# ip netns del ns1
The logical analysis of the problem is as follows:
1. create ETH_P_8021AD protocol vlan10 for bond_slave_1:
register_vlan_dev()
vlan_vid_add()
vlan_info_alloc()
__vlan_vid_add() // add [ETH_P_8021AD, 10] vid to bond_slave_1
2. create ETH_P_8021AD protocol bond0_vlan10 for bond0:
register_vlan_dev()
vlan_vid_add()
__vlan_vid_add()
vlan_add_rx_filter_info()
if (!vlan_hw_filter_capable(dev, proto)) // condition established because bond0 without NETIF_F_HW_VLAN_STAG_FILTER
return 0;
if (netif_device_present(dev))
return dev->netdev_ops->ndo_vlan_rx_add_vid(dev, proto, vid); // will be never called
// The slaves of bond0 will not refer to the [ETH_P_8021AD, 10] vid.
3. detach bond_slave_1 from bond0:
__bond_release_one()
vlan_vids_del_by_dev()
list_for_each_entry(vid_info, &vlan_info->vid_list, list)
vlan_vid_del(dev, vid_info->proto, vid_info->vid);
// bond_slave_1 [ETH_P_8021AD, 10] vid will be deleted.
// bond_slave_1->vlan_info will be assigned NULL.
4. delete vlan10 during delete ns1:
default_device_exit_batch()
dev->rtnl_link_ops->dellink() // unregister_vlan_dev() for vlan10
vlan_info = rtnl_dereference(real_dev->vlan_info); // real_dev of vlan10 is bond_slave_1
BUG_ON(!vlan_info); // bond_slave_1->vlan_info is NULL now, bug is triggered!!!
Add S-VLAN tag related features support to bond driver. So the bond driver
will always propagate the VLAN info to its slaves.
Lama Kayal [Sun, 11 Jun 2023 13:29:13 +0000 (16:29 +0300)]
net/mlx5e: Add capability check for vnic counters
Add missing capability check for each of the vnic counters exposed by
devlink health reporter, and thus avoid unexpected behavior due to
invalid access to registers.
While at it, read only the exact number of bits for each counter whether
it was 32 bits or 64 bits.
Fixes: b0bc615df488 ("net/mlx5: Add vnic devlink health reporter to PFs/VFs") Fixes: a33682e4e78e ("net/mlx5e: Expose catastrophic steering error counters") Signed-off-by: Lama Kayal <lkayal@nvidia.com> Reviewed-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Reviewed-by: Maher Sanalla <msanalla@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
net/mlx5: Skip clock update work when device is in error state
When device is in error state, marked by the flag
MLX5_DEVICE_STATE_INTERNAL_ERROR, the HW and PCI may not be accessible
and so clock update work should be skipped. Furthermore, such access
through PCI in error state, after calling mlx5_pci_disable_device() can
result in failing to recover from pci errors.
Fixes: ef9814deafd0 ("net/mlx5e: Add HW timestamping (TS) support") Reported-and-tested-by: Ganesh G R <ganeshgr@linux.ibm.com> Closes: https://lore.kernel.org/netdev/9bdb9b9d-140a-7a28-f0de-2e64e873c068@nvidia.com Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Aya Levin <ayal@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
action order 2: tunnel_key set
src_ip 192.168.1.25
dst_ip 192.168.1.26
key_id 4
dst_port 4789
csum pipe
index 3 ref 1 bind 1 installed 3807 sec used 3779 sec firstused 3800 sec
Action statistics:
Sent 120 bytes 2 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
used_hw_stats delayed
action order 3: mirred (Egress Redirect to device vxlan1) stolen
index 9 ref 1 bind 1 installed 3807 sec used 3779 sec firstused 3800 sec
Action statistics:
Sent 120 bytes 2 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
used_hw_stats delayed
When handling FIB events, the rule in post act will not be deleted.
And because the post act rule has packet reformat and modify header
actions, also will hit the following syndromes:
mlx5_core 0000:08:00.0: mlx5_cmd_out_err:829:(pid 11613): DEALLOC_MODIFY_HEADER_CONTEXT(0x941) op_mod(0x0) failed, status bad resource state(0x9), syndrome (0x1ab444), err(-22)
mlx5_core 0000:08:00.0: mlx5_cmd_out_err:829:(pid 11613): DEALLOC_PACKET_REFORMAT_CONTEXT(0x93e) op_mod(0x0) failed, status bad resource state(0x9), syndrome (0x179e84), err(-22)
Fix it by unoffloading post act rule when handling FIB events.
Fixes: 314e1105831b ("net/mlx5e: Add post act offload/unoffload API") Signed-off-by: Chris Mi <cmi@nvidia.com> Reviewed-by: Vlad Buslov <vladbu@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Daniel Jurgens [Tue, 1 Aug 2023 18:46:00 +0000 (21:46 +0300)]
net/mlx5: Fix devlink controller number for ECVF
The controller number for ECVFs is always 0, because the ECPF must be
the eswitch owner for EC VFs to be enabled.
Fixes: dc13180824b7 ("net/mlx5: Enable devlink port for embedded cpu VF vports") Signed-off-by: Daniel Jurgens <danielj@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Daniel Jurgens [Mon, 10 Jul 2023 21:28:10 +0000 (00:28 +0300)]
net/mlx5: Allow 0 for total host VFs
When querying eswitch functions 0 is a valid number of host VFs. After
introducing ARM SRIOV falling through to getting the max value from PCI
results in using the total VFs allowed on the ARM for the host.
Fixes: 86eec50beaf3 ("net/mlx5: Support querying max VFs from device"); Signed-off-by: Daniel Jurgens <danielj@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Daniel Jurgens [Mon, 26 Jun 2023 19:55:03 +0000 (22:55 +0300)]
net/mlx5: Return correct EC_VF function ID
The ECVF function ID range is 1..max_ec_vfs. Currently
mlx5_vport_to_func_id returns 0..max_ec_vfs - 1. Which
results in a syndrome when querying the caps with more
recent firmware, or reading incorrect caps with older
firmware that supports EC VFs.
Fixes: 9ac0b128248e ("net/mlx5: Update vport caps query/set for EC VFs") Signed-off-by: Daniel Jurgens <danielj@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
net/mlx5: DR, Fix wrong allocation of modify hdr pattern
Fixing wrong calculation of the modify hdr pattern size,
where the previously calculated number would not be enough
to accommodate the required number of actions.
Jianbo Liu [Fri, 14 Apr 2023 08:48:20 +0000 (08:48 +0000)]
net/mlx5e: TC, Fix internal port memory leak
The flow rule can be splited, and the extra post_act rules are added
to post_act table. It's possible to trigger memleak when the rule
forwards packets from internal port and over tunnel, in the case that,
for example, CT 'new' state offload is allowed. As int_port object is
assigned to the flow attribute of post_act rule, and its refcnt is
incremented by mlx5e_tc_int_port_get(), but mlx5e_tc_int_port_put() is
not called, the refcnt is never decremented, then int_port is never
freed.
Gal Pressman [Sun, 16 Jul 2023 11:28:10 +0000 (14:28 +0300)]
net/mlx5e: Take RTNL lock when needed before calling xdp_set_features()
Hold RTNL lock when calling xdp_set_features() with a registered netdev,
as the call triggers the netdev notifiers. This could happen when
switching from uplink rep to nic profile for example.
Nitya Sunkad [Fri, 4 Aug 2023 20:56:22 +0000 (13:56 -0700)]
ionic: Add missing err handling for queue reconfig
ionic_start_queues_reconfig returns an error code if txrx_init fails.
Handle this error code in the relevant places.
This fixes a corner case where the device could get left in a detached
state if the CMB reconfig fails and the attempt to clean up the mess
also fails. Note that calling netif_device_attach when the netdev is
already attached does not lead to unexpected behavior.
Change goto name "errout" to "err_out" to maintain consistency across
goto statements.
Fixes: 40bc471dc714 ("ionic: add tx/rx-push support with device Component Memory Buffers") Fixes: 6f7d6f0fd7a3 ("ionic: pull reset_queues into tx_timeout handler") Signed-off-by: Nitya Sunkad <nitya.sunkad@amd.com> Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Fri, 4 Aug 2023 22:59:51 +0000 (15:59 -0700)]
net: tls: avoid discarding data on record close
TLS records end with a 16B tag. For TLS device offload we only
need to make space for this tag in the stream, the device will
generate and replace it with the actual calculated tag.
Long time ago the code would just re-reference the head frag
which mostly worked but was suboptimal because it prevented TCP
from combining the record into a single skb frag. I'm not sure
if it was correct as the first frag may be shorter than the tag.
The commit under fixes tried to replace that with using the page
frag and if the allocation failed rolling back the data, if record
was long enough. It achieves better fragment coalescing but is
also buggy.
We don't roll back the iterator, so unless we're at the end of
send we'll skip the data we designated as tag and start the
next record as if the rollback never happened.
There's also the possibility that the record was constructed
with MSG_MORE and the data came from a different syscall and
we already told the user space that we "got it".
Allocate a single dummy page and use it as fallback.
Found by code inspection, and proven by forcing allocation
failures.
Fixes: e7b159a48ba6 ("net/tls: remove the record tail optimization") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Sat, 5 Aug 2023 01:26:29 +0000 (18:26 -0700)]
Merge branch 'mptcp-more-fixes-for-v6-5'
Matthieu Baerts says:
====================
mptcp: more fixes for v6.5
Here is a new batch of fixes related to MPTCP for v6.5 and older.
Patches 1 and 2 fix issues with MPTCP Join selftest when manually
launched with '-i' parameter to use 'ip mptcp' tool instead of the
dedicated one (pm_nl_ctl). The issues have been there since v5.18.
Thank you Andrea for your first contributions to MPTCP code in the
upstream kernel!
Patch 3 avoids corrupting the data stream when trying to reset
connections that have fallen back to TCP. This can happen from v6.1.
Patch 4 fixes a race when doing a disconnect() and an accept() in
parallel on a listener socket. The issue only happens in rare cases if
the user is really unlucky since a fix that landed in v6.3 but
backported up to v6.1.
====================
Paolo Abeni [Thu, 3 Aug 2023 16:27:30 +0000 (18:27 +0200)]
mptcp: fix disconnect vs accept race
Despite commit 0ad529d9fd2b ("mptcp: fix possible divide by zero in
recvmsg()"), the mptcp protocol is still prone to a race between
disconnect() (or shutdown) and accept.
The root cause is that the mentioned commit checks the msk-level
flag, but mptcp_stream_accept() does acquire the msk-level lock,
as it can rely directly on the first subflow lock.
As reported by Christoph than can lead to a race where an msk
socket is accepted after that mptcp_subflow_queue_clean() releases
the listener socket lock and just before it takes destructive
actions leading to the following splat:
Address the issue by temporary removing the pending request socket
from the accept queue, so that racing accept() can't touch them.
After depleting the msk - the ssk still exists, as plain TCP sockets,
re-insert them into the accept queue, so that later inet_csk_listen_stop()
will complete the tcp socket disposal.
Fixes: 2a6a870e44dd ("mptcp: stops worker on unaccepted sockets at listener close") Cc: stable@vger.kernel.org Reported-by: Christoph Paasch <cpaasch@apple.com> Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/423 Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Link: https://lore.kernel.org/r/20230803-upstream-net-20230803-misc-fixes-6-5-v1-4-6671b1ab11cc@tessares.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Paolo Abeni [Thu, 3 Aug 2023 16:27:29 +0000 (18:27 +0200)]
mptcp: avoid bogus reset on fallback close
Since the blamed commit, the MPTCP protocol unconditionally sends
TCP resets on all the subflows on disconnect().
That fits full-blown MPTCP sockets - to implement the fastclose
mechanism - but causes unexpected corruption of the data stream,
caught as sporadic self-tests failures.
Fixes: d21f83485518 ("mptcp: use fastclose on more edge scenarios") Cc: stable@vger.kernel.org Tested-by: Matthieu Baerts <matthieu.baerts@tessares.net> Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/419 Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Link: https://lore.kernel.org/r/20230803-upstream-net-20230803-misc-fixes-6-5-v1-3-6671b1ab11cc@tessares.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Andrea Claudi [Thu, 3 Aug 2023 16:27:28 +0000 (18:27 +0200)]
selftests: mptcp: join: fix 'implicit EP' test
mptcp_join 'implicit EP' test currently fails when using ip mptcp:
$ ./mptcp_join.sh -iI
<snip>
001 implicit EP creation[fail] expected '10.0.2.2 10.0.2.2 id 1 implicit' found '10.0.2.2 id 1 rawflags 10 '
Error: too many addresses or duplicate one: -22.
ID change is prevented[fail] expected '10.0.2.2 10.0.2.2 id 1 implicit' found '10.0.2.2 id 1 rawflags 10 '
modif is allowed[fail] expected '10.0.2.2 10.0.2.2 id 1 signal' found '10.0.2.2 id 1 signal '
This happens because of two reasons:
- iproute v6.3.0 does not support the implicit flag, fixed with
iproute2-next commit 3a2535a41854 ("mptcp: add support for implicit
flag")
- pm_nl_check_endpoint wrongly expects the ip address to be repeated two
times in iproute output, and does not account for a final whitespace
in it.
This fixes the issue trimming the whitespace in the output string and
removing the double address in the expected string.
Fixes: 69c6ce7b6eca ("selftests: mptcp: add implicit endpoint test case") Cc: stable@vger.kernel.org Signed-off-by: Andrea Claudi <aclaudi@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Link: https://lore.kernel.org/r/20230803-upstream-net-20230803-misc-fixes-6-5-v1-2-6671b1ab11cc@tessares.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Andrea Claudi [Thu, 3 Aug 2023 16:27:27 +0000 (18:27 +0200)]
selftests: mptcp: join: fix 'delete and re-add' test
mptcp_join 'delete and re-add' test fails when using ip mptcp:
$ ./mptcp_join.sh -iI
<snip>
002 delete and re-add before delete[ ok ]
mptcp_info subflows=1 [ ok ]
Error: argument "ADDRESS" is wrong: invalid for non-zero id address
after delete[fail] got 2:2 subflows expected 1
This happens because endpoint delete includes an ip address while id is
not 0, contrary to what is indicated in the ip mptcp man page:
"When used with the delete id operation, an IFADDR is only included when
the ID is 0."
This fixes the issue using the $addr variable in pm_nl_del_endpoint()
only when id is 0.
Fixes: 34aa6e3bccd8 ("selftests: mptcp: add ip mptcp wrappers") Cc: stable@vger.kernel.org Signed-off-by: Andrea Claudi <aclaudi@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Link: https://lore.kernel.org/r/20230803-upstream-net-20230803-misc-fixes-6-5-v1-1-6671b1ab11cc@tessares.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Florian Westphal [Thu, 3 Aug 2023 15:26:50 +0000 (17:26 +0200)]
selftests: net: test vxlan pmtu exceptions with tcp
TCP might get stuck if a nonlinear skb exceeds the path MTU,
icmp error contains an incorrect icmp checksum in that case.
Extend the existing test for vxlan to also send at least 1MB worth of
data via TCP in addition to the existing 'large icmp packet adds
route exception'.
On my test VM this fails due to 0-size output file without
"tunnels: fix kasan splat when generating ipv4 pmtu error".
value changed: 0x0000000000000000 -> 0x0000000020000081
Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 19 Comm: ksoftirqd/1 Not tainted 6.4.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/27/2023
Fixes: 69e3c75f4d54 ("net: TX_RING and packet mmap") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/20230803145600.2937518-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean [Thu, 3 Aug 2023 13:42:53 +0000 (16:42 +0300)]
net: dsa: ocelot: call dsa_tag_8021q_unregister() under rtnl_lock() on driver remove
When the tagging protocol in current use is "ocelot-8021q" and we unbind
the driver, we see this splat:
$ echo '0000:00:00.2' > /sys/bus/pci/drivers/fsl_enetc/unbind
mscc_felix 0000:00:00.5 swp0: left promiscuous mode
sja1105 spi2.0: Link is Down
DSA: tree 1 torn down
mscc_felix 0000:00:00.5 swp2: left promiscuous mode
sja1105 spi2.2: Link is Down
DSA: tree 3 torn down
fsl_enetc 0000:00:00.2 eno2: left promiscuous mode
mscc_felix 0000:00:00.5: Link is Down
------------[ cut here ]------------
RTNL: assertion failed at net/dsa/tag_8021q.c (409)
WARNING: CPU: 1 PID: 329 at net/dsa/tag_8021q.c:409 dsa_tag_8021q_unregister+0x12c/0x1a0
Modules linked in:
CPU: 1 PID: 329 Comm: bash Not tainted 6.5.0-rc3+ #771
pc : dsa_tag_8021q_unregister+0x12c/0x1a0
lr : dsa_tag_8021q_unregister+0x12c/0x1a0
Call trace:
dsa_tag_8021q_unregister+0x12c/0x1a0
felix_tag_8021q_teardown+0x130/0x150
felix_teardown+0x3c/0xd8
dsa_tree_teardown_switches+0xbc/0xe0
dsa_unregister_switch+0x168/0x260
felix_pci_remove+0x30/0x60
pci_device_remove+0x4c/0x100
device_release_driver_internal+0x188/0x288
device_links_unbind_consumers+0xfc/0x138
device_release_driver_internal+0xe0/0x288
device_driver_detach+0x24/0x38
unbind_store+0xd8/0x108
drv_attr_store+0x30/0x50
---[ end trace 0000000000000000 ]---
------------[ cut here ]------------
RTNL: assertion failed at net/8021q/vlan_core.c (376)
WARNING: CPU: 1 PID: 329 at net/8021q/vlan_core.c:376 vlan_vid_del+0x1b8/0x1f0
CPU: 1 PID: 329 Comm: bash Tainted: G W 6.5.0-rc3+ #771
pc : vlan_vid_del+0x1b8/0x1f0
lr : vlan_vid_del+0x1b8/0x1f0
dsa_tag_8021q_unregister+0x8c/0x1a0
felix_tag_8021q_teardown+0x130/0x150
felix_teardown+0x3c/0xd8
dsa_tree_teardown_switches+0xbc/0xe0
dsa_unregister_switch+0x168/0x260
felix_pci_remove+0x30/0x60
pci_device_remove+0x4c/0x100
device_release_driver_internal+0x188/0x288
device_links_unbind_consumers+0xfc/0x138
device_release_driver_internal+0xe0/0x288
device_driver_detach+0x24/0x38
unbind_store+0xd8/0x108
drv_attr_store+0x30/0x50
DSA: tree 0 torn down
This was somewhat not so easy to spot, because "ocelot-8021q" is not the
default tagging protocol, and thus, not everyone who tests the unbinding
path may have switched to it beforehand. The default
felix_tag_npi_teardown() does not require rtnl_lock() to be held.
Xiang Yang [Thu, 3 Aug 2023 07:24:38 +0000 (07:24 +0000)]
mptcp: fix the incorrect judgment for msk->cb_flags
Coccicheck reports the error below:
net/mptcp/protocol.c:3330:15-28: ERROR: test of a variable/field address
Since the address of msk->cb_flags is used in __test_and_clear_bit, the
address should not be NULL. The judgment for if (unlikely(msk->cb_flags))
will always be true, we should check the real value of msk->cb_flags here.
Fixes: 65a569b03ca8 ("mptcp: optimize release_cb for the common case") Signed-off-by: Xiang Yang <xiangyang3@huawei.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Link: https://lore.kernel.org/r/20230803072438.1847500-1-xiangyang3@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Li Yang [Wed, 2 Aug 2023 19:13:47 +0000 (14:13 -0500)]
net: phy: at803x: remove set/get wol callbacks for AR8032
Since the AR8032 part does not support wol, remove related callbacks
from it.
Fixes: 5800091a2061 ("net: phy: at803x: add support for AR8032 PHY") Signed-off-by: Li Yang <leoyang.li@nxp.com> Cc: David Bauer <mail@david-bauer.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Li Yang [Wed, 2 Aug 2023 19:13:46 +0000 (14:13 -0500)]
net: phy: at803x: fix the wol setting functions
In commit 7beecaf7d507 ("net: phy: at803x: improve the WOL feature"), it
seems not correct to use a wol_en bit in a 1588 Control Register which is
only available on AR8031/AR8033(share the same phy_id) to determine if WoL
is enabled. Change it back to use AT803X_INTR_ENABLE_WOL for determining
the WoL status which is applicable on all chips supporting wol. Also update
the at803x_set_wol() function to only update the 1588 register on chips
having it. After this change, disabling wol at probe from commit d7cd5e06c9dd ("net: phy: at803x: disable WOL at probe") is no longer
needed. Change it to just disable the WoL bit in 1588 register for
AR8031/AR8033 to be aligned with AT803X_INTR_ENABLE_WOL in probe.
Fixes: 7beecaf7d507 ("net: phy: at803x: improve the WOL feature") Signed-off-by: Li Yang <leoyang.li@nxp.com> Reviewed-by: Viorel Suman <viorel.suman@nxp.com> Reviewed-by: Wei Fang <wei.fang@nxp.com> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
Nathan Chancellor [Wed, 2 Aug 2023 17:40:29 +0000 (10:40 -0700)]
mISDN: Update parameter type of dsp_cmx_send()
When booting a kernel with CONFIG_MISDN_DSP=y and CONFIG_CFI_CLANG=y,
there is a failure when dsp_cmx_send() is called indirectly from
call_timer_fn():
The function pointer prototype that call_timer_fn() expects is
void (*fn)(struct timer_list *)
whereas dsp_cmx_send() has a parameter type of 'void *', which causes
the control flow integrity checks to fail because the parameter types do
not match.
Change dsp_cmx_send()'s parameter type to be 'struct timer_list' to
match the expected prototype. The argument is unused anyways, so this
has no functional change, aside from avoiding the CFI failure.
Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202308020936.58787e6c-oliver.sang@intel.com Signed-off-by: Nathan Chancellor <nathan@kernel.org> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Fixes: e313ac12eb13 ("mISDN: Convert timers to use timer_setup()") Link: https://lore.kernel.org/r/20230802-fix-dsp_cmx_send-cfi-failure-v1-1-2f2e79b0178d@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Linus Torvalds [Thu, 3 Aug 2023 21:00:02 +0000 (14:00 -0700)]
Merge tag 'net-6.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from bpf and wireless.
Nothing scary here. Feels like the first wave of regressions from v6.5
is addressed - one outstanding fix still to come in TLS for the
sendpage rework.
Current release - regressions:
- udp: fix __ip_append_data()'s handling of MSG_SPLICE_PAGES
- dsa: fix older DSA drivers using phylink
Previous releases - regressions:
- gro: fix misuse of CB in udp socket lookup
- mlx5: unregister devlink params in case interface is down
- Revert "wifi: ath11k: Enable threaded NAPI"
Previous releases - always broken:
- sched: cls_u32: fix match key mis-addressing
- sched: bind logic fixes for cls_fw, cls_u32 and cls_route
- add bound checks to a number of places which hand-parse netlink
- bpf: disable preemption in perf_event_output helpers code
- qed: fix scheduling in a tasklet while getting stats
- avoid using APIs which are not hardirq-safe in couple of drivers,
when we may be in a hard IRQ (netconsole)
- wifi: cfg80211: fix return value in scan logic, avoid page
allocator warning
- wifi: mt76: mt7615: do not advertise 5 GHz on first PHY of MT7615D
(DBDC)
Misc:
- drop handful of inactive maintainers, put some new in place"
* tag 'net-6.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (98 commits)
MAINTAINERS: update TUN/TAP maintainers
test/vsock: remove vsock_perf executable on `make clean`
tcp_metrics: fix data-race in tcpm_suck_dst() vs fastopen
tcp_metrics: annotate data-races around tm->tcpm_net
tcp_metrics: annotate data-races around tm->tcpm_vals[]
tcp_metrics: annotate data-races around tm->tcpm_lock
tcp_metrics: annotate data-races around tm->tcpm_stamp
tcp_metrics: fix addr_same() helper
prestera: fix fallback to previous version on same major version
udp: Fix __ip_append_data()'s handling of MSG_SPLICE_PAGES
net/mlx5e: Set proper IPsec source port in L4 selector
net/mlx5: fs_core: Skip the FTs in the same FS_TYPE_PRIO_CHAINS fs_prio
net/mlx5: fs_core: Make find_closest_ft more generic
wifi: brcmfmac: Fix field-spanning write in brcmf_scan_params_v2_to_v1()
vxlan: Fix nexthop hash size
ip6mr: Fix skb_under_panic in ip6mr_cache_report()
s390/qeth: Don't call dev_close/dev_open (DOWN/UP)
net: tap_open(): set sk_uid from current_fsuid()
net: tun_chr_open(): set sk_uid from current_fsuid()
net: dcb: choose correct policy to parse DCB_ATTR_BCN
...
Jakub Kicinski [Wed, 2 Aug 2023 18:28:43 +0000 (11:28 -0700)]
MAINTAINERS: update TUN/TAP maintainers
Willem and Jason have agreed to take over the maintainer
duties for TUN/TAP, thank you!
There's an existing entry for TUN/TAP which only covers
the user mode Linux implementation.
Since we haven't heard from Maxim on the list for almost
a decade, extend that entry and take it over, rather than
adding a new one.
Jakub Kicinski [Thu, 3 Aug 2023 18:22:53 +0000 (11:22 -0700)]
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Martin KaFai Lau says:
====================
pull-request: bpf 2023-08-03
We've added 5 non-merge commits during the last 7 day(s) which contain
a total of 3 files changed, 37 insertions(+), 20 deletions(-).
The main changes are:
1) Disable preemption in perf_event_output helpers code,
from Jiri Olsa
2) Add length check for SK_DIAG_BPF_STORAGE_REQ_MAP_FD parsing,
from Lin Ma
3) Multiple warning splat fixes in cpumap from Hou Tao
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf, cpumap: Handle skb as well when clean up ptr_ring
bpf, cpumap: Make sure kthread is running before map update returns
bpf: Add length check for SK_DIAG_BPF_STORAGE_REQ_MAP_FD parsing
bpf: Disable preemption in bpf_event_output
bpf: Disable preemption in bpf_perf_event_output
====================
Jakub Kicinski [Thu, 3 Aug 2023 18:05:46 +0000 (11:05 -0700)]
Merge tag 'wireless-2023-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless
Kalle Valo says:
====================
wireless fixes for v6.5
We did some house cleaning in MAINTAINERS file so several patches
about that. Few regressions fixed and also fix some recently enabled
memcpy() warnings. Only small commits and nothing special standing
out.
* tag 'wireless-2023-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
wifi: brcmfmac: Fix field-spanning write in brcmf_scan_params_v2_to_v1()
wifi: ray_cs: Replace 1-element array with flexible array
MAINTAINERS: add Jeff as ath10k, ath11k and ath12k maintainer
MAINTAINERS: wifi: mark mlw8k as orphan
MAINTAINERS: wifi: mark b43 as orphan
MAINTAINERS: wifi: mark zd1211rw as orphan
MAINTAINERS: wifi: mark wl3501 as orphan
MAINTAINERS: wifi: mark rndis_wlan as orphan
MAINTAINERS: wifi: mark ar5523 as orphan
MAINTAINERS: wifi: mark cw1200 as orphan
MAINTAINERS: wifi: atmel: mark as orphan
MAINTAINERS: wifi: rtw88: change Ping as the maintainer
Revert "wifi: ath6k: silence false positive -Wno-dangling-pointer warning on GCC 12"
wifi: cfg80211: Fix return value in scan logic
Revert "wifi: ath11k: Enable threaded NAPI"
MAINTAINERS: Update mwifiex maintainer list
wifi: mt76: mt7615: do not advertise 5 GHz on first phy of MT7615D (DBDC)
====================
Eric Dumazet [Wed, 2 Aug 2023 13:15:00 +0000 (13:15 +0000)]
tcp_metrics: fix data-race in tcpm_suck_dst() vs fastopen
Whenever tcpm_new() reclaims an old entry, tcpm_suck_dst()
would overwrite data that could be read from tcp_fastopen_cache_get()
or tcp_metrics_fill_info().
We need to acquire fastopen_seqlock to maintain consistency.
For newly allocated objects, tcpm_new() can switch to kzalloc()
to avoid an extra fastopen_seqlock acquisition.
Fixes: 1fe4c481ba63 ("net-tcp: Fast Open client - cookie cache") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20230802131500.1478140-7-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Dumazet [Wed, 2 Aug 2023 13:14:59 +0000 (13:14 +0000)]
tcp_metrics: annotate data-races around tm->tcpm_net
tm->tcpm_net can be read or written locklessly.
Instead of changing write_pnet() and read_pnet() and potentially
hurt performance, add the needed READ_ONCE()/WRITE_ONCE()
in tm_net() and tcpm_new().
Fixes: 849e8a0ca8d5 ("tcp_metrics: Add a field tcpm_net and verify it matches on lookup") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20230802131500.1478140-6-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Dumazet [Wed, 2 Aug 2023 13:14:58 +0000 (13:14 +0000)]
tcp_metrics: annotate data-races around tm->tcpm_vals[]
tm->tcpm_vals[] values can be read or written locklessly.
Add needed READ_ONCE()/WRITE_ONCE() to document this,
and force use of tcp_metric_get() and tcp_metric_set()
Fixes: 51c5d0c4b169 ("tcp: Maintain dynamic metrics in local cache.") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>