Jakub Kicinski [Thu, 18 Jan 2024 20:45:04 +0000 (12:45 -0800)]
Merge tag 'nf-24-01-18' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following batch contains Netfilter fixes for net. Slightly larger
than usual because this batch includes several patches to tighten the
nf_tables control plane to reject inconsistent configuration:
1) Restrict NFTA_SET_POLICY to NFT_SET_POL_PERFORMANCE and
NFT_SET_POL_MEMORY.
2) Bail out if a nf_tables expression registers more than 16 netlink
attributes which is what struct nft_expr_info allows.
3) Bail out if NFT_EXPR_STATEFUL provides no .clone interface, remove
existing fallback to memcpy() when cloning which might accidentally
duplicate memory reference to the same object.
4) Fix br_netfilter interaction with neighbour layer. This requires
three preparation patches:
- Use nf_bridge_get_physinif() in nfnetlink_log
- Use nf_bridge_info_exists() to check in br_netfilter context
is available in nf_queue.
- Pass net to nf_bridge_get_physindev()
And finally, the fix which replaces physindev with physinif
in nf_bridge_info.
Patches from Pavel Tikhomirov.
5) Catch-all deactivation happens in the transaction, hence this
oneliner to check for the next generation. This bug uncovered after
the removal of the _BUSY bit, which happened in set elements back in
summer 2023.
6) Ensure set (total) key length size and concat field length description
is consistent, otherwise bail out.
7) Skip set element with the _DEAD flag on from the netlink dump path.
A tests occasionally shows that dump is mismatching because GC might
lose race to get rid of this element while a netlink dump is in
progress.
8) Reject NFT_SET_CONCAT for field_count < 1.
9) Use IP6_INC_STATS in ipvs to fix preemption BUG splat, patch
from Fedor Pchelkin.
* tag 'nf-24-01-18' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
ipvs: avoid stat macros calls from preemptible context
netfilter: nf_tables: reject NFT_SET_CONCAT with not field length description
netfilter: nf_tables: skip dead set elements in netlink dump
netfilter: nf_tables: do not allow mismatch field size and set key length
netfilter: nf_tables: check if catch-all set element is active in next generation
netfilter: bridge: replace physindev with physinif in nf_bridge_info
netfilter: propagate net to nf_bridge_get_physindev
netfilter: nf_queue: remove excess nf_bridge variable
netfilter: nfnetlink_log: use proper helper for fetching physinif
netfilter: nft_limit: do not ignore unsupported flags
netfilter: nf_tables: bail out if stateful expression provides no .clone
netfilter: nf_tables: validate .maxattr at expression registration
netfilter: nf_tables: reject invalid set policy
====================
Jakub Kicinski [Thu, 18 Jan 2024 17:54:24 +0000 (09:54 -0800)]
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:
====================
pull-request: bpf 2024-01-18
We've added 10 non-merge commits during the last 5 day(s) which contain
a total of 12 files changed, 806 insertions(+), 51 deletions(-).
The main changes are:
1) Fix an issue in bpf_iter_udp under backward progress which prevents
user space process from finishing iteration, from Martin KaFai Lau.
2) Fix BPF verifier to reject variable offset alu on registers with a type
of PTR_TO_FLOW_KEYS to prevent oob access, from Hao Sun.
3) Follow up fixes for kernel- and libbpf-side logic around handling
arg:ctx tagged arguments of BPF global subprogs, from Andrii Nakryiko.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
libbpf: warn on unexpected __arg_ctx type when rewriting BTF
selftests/bpf: add tests confirming type logic in kernel for __arg_ctx
bpf: enforce types for __arg_ctx-tagged arguments in global subprogs
bpf: extract bpf_ctx_convert_map logic and make it more reusable
libbpf: feature-detect arg:ctx tag support in kernel
selftests/bpf: Add test for alu on PTR_TO_FLOW_KEYS
bpf: Reject variable offset alu on PTR_TO_FLOW_KEYS
selftests/bpf: Test udp and tcp iter batching
bpf: Avoid iter->offset making backward progress in bpf_iter_udp
bpf: iter_udp: Retry with a larger batch size without going back to the previous bucket
====================
Tony Nguyen [Wed, 17 Jan 2024 17:25:32 +0000 (09:25 -0800)]
i40e: Include types.h to some headers
Commit 56df345917c0 ("i40e: Remove circular header dependencies and fix
headers") redistributed a number of includes from one large header file
to the locations they were needed. In some environments, types.h is not
included and causing compile issues. The driver should not rely on
implicit inclusion from other locations; explicitly include it to these
files.
Snippet of issue. Entire log can be seen through the Closes: link.
In file included from drivers/net/ethernet/intel/i40e/i40e_diag.h:7,
from drivers/net/ethernet/intel/i40e/i40e_diag.c:4:
drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h:33:9: error: unknown type name '__le16'
33 | __le16 flags;
| ^~~~~~
drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h:34:9: error: unknown type name '__le16'
34 | __le16 opcode;
| ^~~~~~
...
drivers/net/ethernet/intel/i40e/i40e_diag.h:22:9: error: unknown type name 'u32'
22 | u32 elements; /* number of elements if array */
| ^~~
drivers/net/ethernet/intel/i40e/i40e_diag.h:23:9: error: unknown type name 'u32'
23 | u32 stride; /* bytes between each element */
Reported-by: Martin Zaharinov <micron10@gmail.com> Closes: https://lore.kernel.org/netdev/21BBD62A-F874-4E42-B347-93087EEA8126@gmail.com/ Fixes: 56df345917c0 ("i40e: Remove circular header dependencies and fix headers") Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Arpana Arland <arpanax.arland@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://lore.kernel.org/r/20240117172534.3555162-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Nikita Zhandarovich [Wed, 17 Jan 2024 17:21:02 +0000 (09:21 -0800)]
ipv6: mcast: fix data-race in ipv6_mc_down / mld_ifc_work
idev->mc_ifc_count can be written over without proper locking.
Originally found by syzbot [1], fix this issue by encapsulating calls
to mld_ifc_stop_work() (and mld_gq_stop_work() for good measure) with
mutex_lock() and mutex_unlock() accordingly as these functions
should only be called with mc_lock per their declarations.
[1]
BUG: KCSAN: data-race in ipv6_mc_down / mld_ifc_work
write to 0xffff88813a80c832 of 1 bytes by task 22 on cpu 1:
mld_ifc_work+0x54c/0x7b0 net/ipv6/mcast.c:2653
process_one_work kernel/workqueue.c:2627 [inline]
process_scheduled_works+0x5b8/0xa30 kernel/workqueue.c:2700
worker_thread+0x525/0x730 kernel/workqueue.c:2781
...
Amit Cohen [Wed, 17 Jan 2024 15:04:21 +0000 (16:04 +0100)]
selftests: mlxsw: qos_pfc: Adjust the test to support 8 lanes
'qos_pfc' test checks PFC behavior. The idea is to limit the traffic
using a shaper somewhere in the flow of the packets. In this area, the
buffer is smaller than the buffer at the beginning of the flow, so it fills
up until there is no more space left. The test configures there PFC
which is supposed to notice that the headroom is filling up and send PFC
Xoff to indicate the transmitter to stop sending traffic for the priorities
sharing this PG.
The Xon/Xoff threshold is auto-configured and always equal to
2*(MTU rounded up to cell size). Even after sending the PFC Xoff packet,
traffic will keep arriving until the transmitter receives and processes
the PFC packet. This amount of traffic is known as the PFC delay allowance.
Currently the buffer for the delay traffic is configured as 100KB. The
MTU in the test is 10KB, therefore the threshold for Xoff is about 20KB.
This allows 80KB extra to be stored in this buffer.
8-lane ports use two buffers among which the configured buffer is split,
the Xoff threshold then applies to each buffer in parallel.
The test does not take into account the behavior of 8-lane ports, when the
ports are configured to 400Gbps with 8 lanes or 800Gbps with 8 lanes,
packets are dropped and the test fails.
Check if the relevant ports use 8 lanes, in such case double the size of
the buffer, as the headroom is split half-half.
Cc: Shuah Khan <shuah@kernel.org> Fixes: bfa804784e32 ("selftests: mlxsw: Add a PFC test") Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/23ff11b7dff031eb04a41c0f5254a2b636cd8ebb.1705502064.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Petr Machata [Wed, 17 Jan 2024 15:04:19 +0000 (16:04 +0100)]
mlxsw: spectrum_router: Register netdevice notifier before nexthop
If there are IPIP nexthops at the time when the driver is loaded (or the
devlink instance reloaded), the driver looks up the corresponding IPIP
entry. But IPIP entries are only created as a result of netdevice
notifications. Since the netdevice notifier is registered after the nexthop
notifier, mlxsw_sp_nexthop_type_init() never finds the IPIP entry,
registers the nexthop MLXSW_SP_NEXTHOP_TYPE_ETH, and fails to assign a CRIF
to the nexthop. Later on when the CRIF is necessary, the WARN_ON in
mlxsw_sp_nexthop_rif() triggers, causing the splat [1].
In order to fix the issue, reorder the netdevice notifier to be registered
before the nexthop one.
Ido Schimmel [Wed, 17 Jan 2024 15:04:18 +0000 (16:04 +0100)]
mlxsw: spectrum_acl_tcam: Fix stack corruption
When tc filters are first added to a net device, the corresponding local
port gets bound to an ACL group in the device. The group contains a list
of ACLs. In turn, each ACL points to a different TCAM region where the
filters are stored. During forwarding, the ACLs are sequentially
evaluated until a match is found.
One reason to place filters in different regions is when they are added
with decreasing priorities and in an alternating order so that two
consecutive filters can never fit in the same region because of their
key usage.
In Spectrum-2 and newer ASICs the firmware started to report that the
maximum number of ACLs in a group is more than 16, but the layout of the
register that configures ACL groups (PAGT) was not updated to account
for that. It is therefore possible to hit stack corruption [1] in the
rare case where more than 16 ACLs in a group are required.
Fix by limiting the maximum ACL group size to the minimum between what
the firmware reports and the maximum ACLs that fit in the PAGT register.
Add a test case to make sure the machine does not crash when this
condition is hit.
Ido Schimmel [Wed, 17 Jan 2024 15:04:17 +0000 (16:04 +0100)]
mlxsw: spectrum_acl_tcam: Fix NULL pointer dereference in error path
When calling mlxsw_sp_acl_tcam_region_destroy() from an error path after
failing to attach the region to an ACL group, we hit a NULL pointer
dereference upon 'region->group->tcam' [1].
Fix by retrieving the 'tcam' pointer using mlxsw_sp_acl_to_tcam().
Amit Cohen [Wed, 17 Jan 2024 15:04:16 +0000 (16:04 +0100)]
mlxsw: spectrum_acl_erp: Fix error flow of pool allocation failure
Lately, a bug was found when many TC filters are added - at some point,
several bugs are printed to dmesg [1] and the switch is crashed with
segmentation fault.
The issue starts when gen_pool_free() fails because of unexpected
behavior - a try to free memory which is already freed, this leads to BUG()
call which crashes the switch and makes many other bugs.
Trying to track down the unexpected behavior led to a bug in eRP code. The
function mlxsw_sp_acl_erp_table_alloc() gets a pointer to the allocated
index, sets the value and returns an error code. When gen_pool_alloc()
fails it returns address 0, we track it and return -ENOBUFS outside, BUT
the call for gen_pool_alloc() already override the index in erp_table
structure. This is a problem when such allocation is done as part of
table expansion. This is not a new table, which will not be used in case
of allocation failure. We try to expand eRP table and override the
current index (non-zero) with zero. Then, it leads to an unexpected
behavior when address 0 is freed twice. Note that address 0 is valid in
erp_table->base_index and indeed other tables use it.
gen_pool_alloc() fails in case that there is no space left in the
pre-allocated pool, in our case, the pool is limited to
ACL_MAX_ERPT_BANK_SIZE, which is read from hardware. When more than max
erp entries are required, we exceed the limit and return an error, this
error leads to "Failed to migrate vregion" print.
Fix this by changing erp_table->base_index only in case of a successful
allocation.
Add a test case for such a scenario. Without this fix it causes
segmentation fault:
$ TESTS="max_erp_entries_test" ./tc_flower.sh
./tc_flower.sh: line 988: 1560 Segmentation fault tc filter del dev $h2 ingress chain $i protocol ip pref $i handle $j flower &>/dev/null
Accessing an ethernet device that is powered off or clock gated might
cause the CPU to hang. Add ethnl_ops_begin/complete in
ethnl_set_features() to protect against this.
Benjamin Poirier [Tue, 16 Jan 2024 15:49:26 +0000 (10:49 -0500)]
selftests: bonding: Add more missing config options
As a followup to commit 03fb8565c880 ("selftests: bonding: add missing
build configs"), add more networking-specific config options which are
needed for bonding tests.
For testing, I used the minimal config generated by virtme-ng and I added
the options in the config file. All bonding tests passed.
Fixes: bbb774d921e2 ("net: Add tests for bonding and team address list management") # for ipv6 Fixes: 6cbe791c0f4e ("kselftest: bonding: add num_grat_arp test") # for tc options Fixes: 222c94ec0ad4 ("selftests: bonding: add tests for ether type changes") # for nlmon Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com> Link: https://lore.kernel.org/r/20240116154926.202164-1-bpoirier@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jakub Kicinski [Tue, 16 Jan 2024 15:43:11 +0000 (07:43 -0800)]
selftests: netdevsim: add a config file
netdevsim tests aren't very well integrated with kselftest,
which has its advantages and disadvantages. But regardless
of the intended integration - a config file to know what kernel
to build is very useful, add one.
====================
Tighten up arg:ctx type enforcement
Follow up fixes for kernel-side and libbpf-side logic around handling arg:ctx
(__arg_ctx) tagged arguments of BPF global subprogs.
Patch #1 adds libbpf feature detection of kernel-side __arg_ctx support to
avoid unnecessary rewriting BTF types. With stricter kernel-side type
enforcement this is now mandatory to avoid problems with using `struct
bpf_user_pt_regs_t` instead of actual typedef. For __arg_ctx tagged arguments
verifier is now supporting either `bpf_user_pt_regs_t` typedef or resolves it
down to the actual struct (pt_regs/user_pt_regs/user_regs_struct), depending
on architecture), but for old kernels without __arg_ctx support it's more
backwards compatible for libbpf to use `struct bpf_user_pt_regs_t` rewrite
which will work on wider range of kernels. So feature detection prevent libbpf
accidentally breaking global subprogs on new kernels.
We also adjust selftests to do similar feature detection (much simpler, but
potentially breaking due to kernel source code refactoring, which is fine for
selftests), and skip tests expecting libbpf's BTF type rewrites.
Patch #2 is preparatory refactoring for patch #3 which adds type enforcement
for arg:ctx tagged global subprog args. See the patch for specifics.
Patch #4 adds many new cases to ensure type logic works as expected.
Finally, patch #5 adds a relevant subset of kernel-side type checks to
__arg_ctx cases that libbpf supports rewrite of. In libbpf's case, type
violations are reported as warnings and BTF rewrite is not performed, which
will eventually lead to BPF verifier complaining at program verification time.
Good care was taken to avoid conflicts between bpf and bpf-next tree (which
has few follow up refactorings in the same code area). Once trees converge
some of the code will be moved around a bit (and some will be deleted), but
with no change to functionality or general shape of the code.
v2->v3:
- support `bpf_user_pt_regs_t` typedef for KPROBE and PERF_EVENT (CI);
v1->v2:
- add user_pt_regs and user_regs_struct support for PERF_EVENT (CI);
- drop FEAT_ARG_CTX_TAG enum leftover from patch #1;
- fix warning about default: without break in the switch (CI).
====================
Andrii Nakryiko [Thu, 18 Jan 2024 03:31:43 +0000 (19:31 -0800)]
libbpf: warn on unexpected __arg_ctx type when rewriting BTF
On kernel that don't support arg:ctx tag, before adjusting global
subprog BTF information to match kernel's expected canonical type names,
make sure that types used by user are meaningful, and if not, warn and
don't do BTF adjustments.
This is similar to checks that kernel performs, but narrower in scope,
as only a small subset of BPF program types can be accommodated by
libbpf using canonical type names.
Libbpf unconditionally allows `struct pt_regs *` for perf_event program
types, unlike kernel, which supports that conditionally on architecture.
This is done to keep things simple and not cause unnecessary false
positives. This seems like a minor and harmless deviation, which in
real-world programs will be caught by kernels with arg:ctx tag support
anyways. So KISS principle.
This logic is hard to test (especially on latest kernels), so manual
testing was performed instead. Libbpf emitted the following warning for
perf_event program with wrong context argument type:
libbpf: prog 'arg_tag_ctx_perf': subprog 'subprog_ctx_tag' arg#0 is expected to be of `struct bpf_perf_event_data *` type
Andrii Nakryiko [Thu, 18 Jan 2024 03:31:41 +0000 (19:31 -0800)]
bpf: enforce types for __arg_ctx-tagged arguments in global subprogs
Add enforcement of expected types for context arguments tagged with
arg:ctx (__arg_ctx) tag.
First, any program type will accept generic `void *` context type when
combined with __arg_ctx tag.
Besides accepting "canonical" struct names and `void *`, for a bunch of
program types for which program context is actually a named struct, we
allows a bunch of pragmatic exceptions to match real-world and expected
usage:
- for both kprobes and perf_event we allow `bpf_user_pt_regs_t *` as
canonical context argument type, where `bpf_user_pt_regs_t` is a
*typedef*, not a struct;
- for kprobes, we also always accept `struct pt_regs *`, as that's what
actually is passed as a context to any kprobe program;
- for perf_event, we resolve typedefs (unless it's `bpf_user_pt_regs_t`)
down to actual struct type and accept `struct pt_regs *`, or
`struct user_pt_regs *`, or `struct user_regs_struct *`, depending
on the actual struct type kernel architecture points `bpf_user_pt_regs_t`
typedef to; otherwise, canonical `struct bpf_perf_event_data *` is
expected;
- for raw_tp/raw_tp.w programs, `u64/long *` are accepted, as that's
what's expected with BPF_PROG() usage; otherwise, canonical
`struct bpf_raw_tracepoint_args *` is expected;
- tp_btf supports both `struct bpf_raw_tracepoint_args *` and `u64 *`
formats, both are coded as expections as tp_btf is actually a TRACING
program type, which has no canonical context type;
- iterator programs accept `struct bpf_iter__xxx *` structs, currently
with no further iterator-type specific enforcement;
- fentry/fexit/fmod_ret/lsm/struct_ops all accept `u64 *`;
- classic tracepoint programs, as well as syscall and freplace
programs allow any user-provided type.
In all other cases kernel will enforce exact match of struct name to
expected canonical type. And if user-provided type doesn't match that
expectation, verifier will emit helpful message with expected type name.
Note a bit unnatural way the check is done after processing all the
arguments. This is done to avoid conflict between bpf and bpf-next
trees. Once trees converge, a small follow up patch will place a simple
btf_validate_prog_ctx_type() check into a proper ARG_PTR_TO_CTX branch
(which bpf-next tree patch refactored already), removing duplicated
arg:ctx detection logic.
Andrii Nakryiko [Thu, 18 Jan 2024 03:31:40 +0000 (19:31 -0800)]
bpf: extract bpf_ctx_convert_map logic and make it more reusable
Refactor btf_get_prog_ctx_type() a bit to allow reuse of
bpf_ctx_convert_map logic in more than one places. Simplify interface by
returning btf_type instead of btf_member (field reference in BTF).
To do the above we need to touch and start untangling
btf_translate_to_vmlinux() implementation. We do the bare minimum to
not regress anything for btf_translate_to_vmlinux(), but its
implementation is very questionable for what it claims to be doing.
Mapping kfunc argument types to kernel corresponding types conceptually
is quite different from recognizing program context types. Fixing this
is out of scope for this change though.
Andrii Nakryiko [Thu, 18 Jan 2024 03:31:39 +0000 (19:31 -0800)]
libbpf: feature-detect arg:ctx tag support in kernel
Add feature detector of kernel-side arg:ctx (__arg_ctx) tag support. If
this is detected, libbpf will avoid doing any __arg_ctx-related BTF
rewriting and checks in favor of letting kernel handle this completely.
test_global_funcs/ctx_arg_rewrite subtest is adjusted to do the same
feature detection (albeit in much simpler, though round-about and
inefficient, way), and skip the tests. This is done to still be able to
execute this test on older kernels (like in libbpf CI).
Note, BPF token series ([0]) does a major refactor and code moving of
libbpf-internal feature detection "framework", so to avoid unnecessary
conflicts we keep newly added feature detection stand-alone with ad-hoc
result caching. Once things settle, there will be a small follow up to
re-integrate everything back and move code into its final place in
newly-added (by BPF token series) features.c file.
Fedor Pchelkin [Mon, 15 Jan 2024 14:39:22 +0000 (17:39 +0300)]
ipvs: avoid stat macros calls from preemptible context
Inside decrement_ttl() upon discovering that the packet ttl has exceeded,
__IP_INC_STATS and __IP6_INC_STATS macros can be called from preemptible
context having the following backtrace:
Pablo Neira Ayuso [Mon, 15 Jan 2024 11:50:29 +0000 (12:50 +0100)]
netfilter: nf_tables: reject NFT_SET_CONCAT with not field length description
It is still possible to set on the NFT_SET_CONCAT flag by specifying a
set size and no field description, report EINVAL in such case.
Fixes: 1b6345d4160e ("netfilter: nf_tables: check NFT_SET_CONCAT flag if field_count is specified") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso [Sun, 14 Jan 2024 23:14:38 +0000 (00:14 +0100)]
netfilter: nf_tables: skip dead set elements in netlink dump
Delete from packet path relies on the garbage collector to purge
elements with NFT_SET_ELEM_DEAD_BIT on.
Skip these dead elements from nf_tables_dump_setelem() path, I very
rarely see tests/shell/testcases/maps/typeof_maps_add_delete reports
[DUMP FAILED] showing a mismatch in the expected output with an element
that should not be there.
If the netlink dump happens before GC worker run, it might show dead
elements in the ruleset listing.
nft_rhash_get() already skips dead elements in nft_rhash_cmp(),
therefore, it already does not show the element when getting a single
element via netlink control plane.
Fixes: 5f68718b34a5 ("netfilter: nf_tables: GC transaction API to avoid race with control plane") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso [Fri, 12 Jan 2024 22:28:45 +0000 (23:28 +0100)]
netfilter: nf_tables: check if catch-all set element is active in next generation
When deactivating the catch-all set element, check the state in the next
generation that represents this transaction.
This bug uncovered after the recent removal of the element busy mark a2dd0233cbc4 ("netfilter: nf_tables: remove busy mark and gc batch API").
Fixes: aaa31047a6d2 ("netfilter: nftables: add catch-all set element support") Cc: stable@vger.kernel.org Reported-by: lonial con <kongln9170@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pavel Tikhomirov [Thu, 11 Jan 2024 15:06:40 +0000 (23:06 +0800)]
netfilter: bridge: replace physindev with physinif in nf_bridge_info
An skb can be added to a neigh->arp_queue while waiting for an arp
reply. Where original skb's skb->dev can be different to neigh's
neigh->dev. For instance in case of bridging dnated skb from one veth to
another, the skb would be added to a neigh->arp_queue of the bridge.
As skb->dev can be reset back to nf_bridge->physindev and used, and as
there is no explicit mechanism that prevents this physindev from been
freed under us (for instance neigh_flush_dev doesn't cleanup skbs from
different device's neigh queue) we can crash on e.g. this stack:
Let's use plain ifindex instead of net_device link. To peek into the
original net_device we will use dev_get_by_index_rcu(). Thus either we
get device and are safe to use it or we don't get it and drop skb.
Fixes: c4e70a87d975 ("netfilter: bridge: rename br_netfilter.c to br_netfilter_hooks.c") Suggested-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pavel Tikhomirov [Thu, 11 Jan 2024 15:06:39 +0000 (23:06 +0800)]
netfilter: propagate net to nf_bridge_get_physindev
This is a preparation patch for replacing physindev with physinif on
nf_bridge_info structure. We will use dev_get_by_index_rcu to resolve
device, when needed, and it requires net to be available.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We don't really need nf_bridge variable here. And nf_bridge_info_exists
is better replacement for nf_bridge_info_get in case we are only
checking for existence.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pavel Tikhomirov [Thu, 11 Jan 2024 15:06:37 +0000 (23:06 +0800)]
netfilter: nfnetlink_log: use proper helper for fetching physinif
We don't use physindev in __build_packet_message except for getting
physinif from it. So let's switch to nf_bridge_get_physinif to get what
we want directly.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso [Sun, 7 Jan 2024 22:00:15 +0000 (23:00 +0100)]
netfilter: nf_tables: bail out if stateful expression provides no .clone
All existing NFT_EXPR_STATEFUL provide a .clone interface, remove
fallback to copy content of stateful expression since this is never
exercised and bail out if .clone interface is not defined.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso [Sat, 6 Jan 2024 22:52:02 +0000 (23:52 +0100)]
netfilter: nf_tables: validate .maxattr at expression registration
struct nft_expr_info allows to store up to NFT_EXPR_MAXATTR (16)
attributes when parsing netlink attributes.
Rise a warning in case there is ever a nft expression whose .maxattr
goes beyond this number of expressions, in such case, struct nft_expr_info
needs to be updated.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Fixes: b63e78fca889 ("net: netdevsim: use mock PHC driver") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Tue, 16 Jan 2024 17:18:47 +0000 (18:18 +0100)]
mptcp: relax check on MPC passive fallback
While testing the blamed commit below, I was able to miss (!)
packetdrill failures in the fastopen test-cases.
On passive fastopen the child socket is created by incoming TCP MPC syn,
allow for both MPC_SYN and MPC_ACK header.
Fixes: 724b00c12957 ("mptcp: refine opt_mp_capable determination") Reviewed-by: Matthieu Baerts <matttbe@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Romain Gantois [Tue, 16 Jan 2024 12:19:17 +0000 (13:19 +0100)]
net: stmmac: Prevent DSA tags from breaking COE
Some DSA tagging protocols change the EtherType field in the MAC header
e.g. DSA_TAG_PROTO_(DSA/EDSA/BRCM/MTK/RTL4C_A/SJA1105). On TX these tagged
frames are ignored by the checksum offload engine and IP header checker of
some stmmac cores.
On RX, the stmmac driver wrongly assumes that checksums have been computed
for these tagged packets, and sets CHECKSUM_UNNECESSARY.
Add an additional check in the stmmac TX and RX hotpaths so that COE is
deactivated for packets with ethertypes that will not trigger the COE and
IP header checks.
Fixes: 6b2c6e4a938f ("net: stmmac: propagate feature flags to vlan") Cc: <stable@vger.kernel.org> Reported-by: Richard Tresidder <rtresidd@electromag.com.au> Link: https://lore.kernel.org/netdev/e5c6c75f-2dfa-4e50-a1fb-6bf4cdb617c2@electromag.com.au/ Reported-by: Romain Gantois <romain.gantois@bootlin.com> Link: https://lore.kernel.org/netdev/c57283ed-6b9b-b0e6-ee12-5655c1c54495@bootlin.com/ Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Romain Gantois <romain.gantois@bootlin.com> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Nicolas Dichtel [Mon, 15 Jan 2024 13:59:22 +0000 (14:59 +0100)]
selftests: rtnetlink: use setup_ns in bonding test
This is a follow-up of commit a159cbe81d3b ("selftests: rtnetlink: check
enslaving iface in a bond") after the merge of net-next into net.
The goal is to follow the new convention,
see commit d3b6b1116127 ("selftests/net: convert rtnetlink.sh to run it in
unique namespace") for more details.
Let's use also the generic dummy name instead of defining a new one.
Russell King (Oracle) [Mon, 15 Jan 2024 12:43:38 +0000 (12:43 +0000)]
net: sfp-bus: fix SFP mode detect from bitrate
The referenced commit moved the setting of the Autoneg and pause bits
early in sfp_parse_support(). However, we check whether the modes are
empty before using the bitrate to set some modes. Setting these bits
so early causes that test to always be false, preventing this working,
and thus some modules that used to work no longer do.
Move them just before the call to the quirk.
Fixes: 8110633db49d ("net: sfp-bus: allow SFP quirks to override Autoneg and pause bits") Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://lore.kernel.org/r/E1rPMJW-001Ahf-L0@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Hao Sun [Mon, 15 Jan 2024 08:20:28 +0000 (09:20 +0100)]
selftests/bpf: Add test for alu on PTR_TO_FLOW_KEYS
Add a test case for PTR_TO_FLOW_KEYS alu. Testing if alu with variable
offset on flow_keys is rejected. For the fixed offset success case, we
already have C code coverage to verify (e.g. via bpf_flow.c).
Hao Sun [Mon, 15 Jan 2024 08:20:27 +0000 (09:20 +0100)]
bpf: Reject variable offset alu on PTR_TO_FLOW_KEYS
For PTR_TO_FLOW_KEYS, check_flow_keys_access() only uses fixed off
for validation. However, variable offset ptr alu is not prohibited
for this ptr kind. So the variable offset is not checked.
Fix this by rejecting ptr alu with variable offset on flow_keys.
Applying the patch rejects the program with "R7 pointer arithmetic
on flow_keys prohibited".
Fixes: d58e468b1112 ("flow_dissector: implements flow dissector BPF hook") Signed-off-by: Hao Sun <sunhao.th@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/bpf/20240115082028.9992-1-sunhao.th@gmail.com
Jakub Kicinski [Tue, 16 Jan 2024 02:02:01 +0000 (18:02 -0800)]
selftests: bonding: add missing build configs
bonding tests also try to create bridge, veth and dummy
interfaces. These are not currently listed in config.
Fixes: bbb774d921e2 ("net: Add tests for bonding and team address list management") Fixes: c078290a2b76 ("selftests: include bonding tests into the kselftest infra") Acked-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Link: https://lore.kernel.org/r/20240116020201.1883023-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Sun, 14 Jan 2024 22:47:26 +0000 (14:47 -0800)]
selftests: netdevsim: sprinkle more udevadm settle
Number of tests are failing when netdev renaming is active
on the system. Add udevadm settle in logic determining
the names.
Fixes: 242aaf03dc9b ("selftests: add a test for ethtool pause stats") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240114224726.1210532-1-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Multiple disable_irq_wake() calls will keep decreasing the IRQ
wake_depth, When wake_depth is 0, calling disable_irq_wake() again,
will report the above calltrace.
Due to the need to appear in pairs, we cannot call disable_irq_wake()
without calling enable_irq_wake(). Fix this by making sure there are
no unbalanced disable_irq_wake() calls.
Fixes: 3172d3afa998 ("stmmac: support wake up irq from external sources (v3)") Signed-off-by: Qiang Ma <maqianga@uniontech.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240112021249.24598-1-maqianga@uniontech.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Benjamin Poirier [Wed, 10 Jan 2024 14:14:35 +0000 (09:14 -0500)]
selftests: bonding: Change script interpreter
The tests changed by this patch, as well as the scripts they source, use
features which are not part of POSIX sh (ex. 'source' and 'local'). As a
result, these tests fail when /bin/sh is dash such as on Debian. Change the
interpreter to bash so that these tests can run successfully.
Fixes: d43eff0b85ae ("selftests: bonding: up/down delay w/ slave link flapping") Tested-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Nikita Yushchenko [Sat, 13 Jan 2024 04:22:21 +0000 (10:22 +0600)]
net: ravb: Fix dma_addr_t truncation in error case
In ravb_start_xmit(), ravb driver uses u32 variable to store result of
dma_map_single() call. Since ravb hardware has 32-bit address fields in
descriptors, this works properly when mapping is successful - it is
platform's job to provide mapping addresses that fit into hardware
limitations.
However, in failure case dma_map_single() returns DMA_MAPPING_ERROR
constant that is 64-bit when dma_addr_t is 64-bit. Storing this constant
in u32 leads to truncation, and further call to dma_mapping_error()
fails to notice the error.
Fix that by storing result of dma_map_single() in a dma_addr_t
variable.
Fixes: c156633f1353 ("Renesas Ethernet AVB driver proper") Signed-off-by: Nikita Yushchenko <nikita.yoush@cogentembedded.com> Reviewed-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se> Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sun, 14 Jan 2024 12:17:14 +0000 (12:17 +0000)]
Merge branch 'tls-splice-hint-fixes'
John Fastabend says:
====================
tls fixes for SPLICE with more hint
Syzbot found a splat where it tried to splice data over a tls socket
with the more hint and sending greater than the number of frags that
fit in a msg scatterlist. This resulted in an error where we do not
correctly send the data when the msg sg is full. The more flag being
just a hint not a strict contract. This then results in the syzbot
warning on the next send.
Edward generated an initial patch for this which checked for a full
msg on entry to the sendmsg hook. This fixed the WARNING, but didn't
fully resolve the issue because the full msg_pl scatterlist was never
sent resulting in a stuck socket. In this series instead avoid the
situation by forcing the send on the splice that fills the scatterlist.
Also in original thread I mentioned it didn't seem to be enough to
simply fix the send on full sg problem. That was incorrect and was
really a bug in my test program that was hanging the test program.
I had setup a repair socket and wasn't handling it correctly so my
tester got stuck.
Thanks. Please review. Fix in patch 1 and test in patch 2.
v2: use SPLICE_F_ flag names instead of cryptic 0xe
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
John Fastabend [Sat, 13 Jan 2024 00:32:58 +0000 (16:32 -0800)]
net: tls, add test to capture error on large splice
syzbot found an error with how splice() is handled with a msg greater
than 32. This was fixed in previous patch, but lets add a test for
it to ensure it continues to work.
Signed-off-by: John Fastabend <john.fastabend@gmail.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
John Fastabend [Sat, 13 Jan 2024 00:32:57 +0000 (16:32 -0800)]
net: tls, fix WARNIING in __sk_msg_free
A splice with MSG_SPLICE_PAGES will cause tls code to use the
tls_sw_sendmsg_splice path in the TLS sendmsg code to move the user
provided pages from the msg into the msg_pl. This will loop over the
msg until msg_pl is full, checked by sk_msg_full(msg_pl). The user
can also set the MORE flag to hint stack to delay sending until receiving
more pages and ideally a full buffer.
If the user adds more pages to the msg than can fit in the msg_pl
scatterlist (MAX_MSG_FRAGS) we should ignore the MORE flag and send
the buffer anyways.
What actually happens though is we abort the msg to msg_pl scatterlist
setup and then because we forget to set 'full record' indicating we
can no longer consume data without a send we fallthrough to the 'continue'
path which will check if msg_data_left(msg) has more bytes to send and
then attempts to fit them in the already full msg_pl. Then next
iteration of sender doing send will encounter a full msg_pl and throw
the warning in the syzbot report.
To fix simply check if we have a full_record in splice code path and
if not send the msg regardless of MORE flag.
Reported-and-tested-by: syzbot+f2977222e0e95cec15c8@syzkaller.appspotmail.com Reported-by: Edward Adam Davis <eadavis@qq.com> Fixes: fe1e81d4f73b ("tls/sw: Support MSG_SPLICE_PAGES") Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
====================
bpf: Fix backward progress bug in bpf_iter_udp
From: Martin KaFai Lau <martin.lau@kernel.org>
This patch set fixes an issue in bpf_iter_udp that makes backward
progress and prevents the user space process from finishing. There is
a test at the end to reproduce the bug.
Please see individual patches for details.
v3:
- Fixed the iter_fd check and local_port check in the
patch 3 selftest. (Yonghong)
- Moved jhash2 to test_jhash.h in the patch 3. (Yonghong)
- Added explanation in the bucket selection in the patch 3. (Yonghong)
v2:
- Added patch 1 to fix another bug that goes back to
the previous bucket
- Simplify the fix in patch 2 to always reset iter->offset to 0
- Add a test case to close all udp_sk in a bucket while
in the middle of the iteration.
====================
Martin KaFai Lau [Fri, 12 Jan 2024 19:05:30 +0000 (11:05 -0800)]
selftests/bpf: Test udp and tcp iter batching
The patch adds a test to exercise the bpf_iter_udp batching
logic. It specifically tests the case that there are multiple
so_reuseport udp_sk in a bucket of the udp_table.
The test creates two sets of so_reuseport sockets and
each set on a different port. Meaning there will be
two buckets in the udp_table.
The test does the following:
1. read() 3 out of 4 sockets in the first bucket.
2. close() all sockets in the first bucket. This
will ensure the current bucket's offset in
the kernel does not affect the read() of the
following bucket.
3. read() all 4 sockets in the second bucket.
The test also reads one udp_sk at a time from
the bpf_iter_udp prog. The true case in
"do_test(..., bool onebyone)". This is the buggy case
that the previous patch fixed.
It also tests the "false" case in "do_test(..., bool onebyone)",
meaning the userspace reads the whole bucket. There is
no bug in this case but adding this test also while
at it.
Considering the way to have multiple tcp_sk in the same
bucket is similar (by using so_reuseport),
this patch also tests the bpf_iter_tcp even though the
bpf_iter_tcp batching logic works correctly.
Both IP v4 and v6 are exercising the same bpf_iter batching
code path, so only v6 is tested.
Martin KaFai Lau [Fri, 12 Jan 2024 19:05:29 +0000 (11:05 -0800)]
bpf: Avoid iter->offset making backward progress in bpf_iter_udp
There is a bug in the bpf_iter_udp_batch() function that stops
the userspace from making forward progress.
The case that triggers the bug is the userspace passed in
a very small read buffer. When the bpf prog does bpf_seq_printf,
the userspace read buffer is not enough to capture the whole bucket.
When the read buffer is not large enough, the kernel will remember
the offset of the bucket in iter->offset such that the next userspace
read() can continue from where it left off.
The kernel will skip the number (== "iter->offset") of sockets in
the next read(). However, the code directly decrements the
"--iter->offset". This is incorrect because the next read() may
not consume the whole bucket either and then the next-next read()
will start from offset 0. The net effect is the userspace will
keep reading from the beginning of a bucket and the process will
never finish. "iter->offset" must always go forward until the
whole bucket is consumed.
This patch fixes it by using a local variable "resume_offset"
and "resume_bucket". "iter->offset" is always reset to 0 before
it may be used. "iter->offset" will be advanced to the
"resume_offset" when it continues from the "resume_bucket" (i.e.
"state->bucket == resume_bucket"). This brings it closer to
the bpf_iter_tcp's offset handling which does not suffer
the same bug.
Cc: Aditi Ghag <aditi.ghag@isovalent.com> Fixes: c96dac8d369f ("bpf: udp: Implement batching for sockets iterator") Acked-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Reviewed-by: Aditi Ghag <aditi.ghag@isovalent.com> Link: https://lore.kernel.org/r/20240112190530.3751661-3-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Martin KaFai Lau [Fri, 12 Jan 2024 19:05:28 +0000 (11:05 -0800)]
bpf: iter_udp: Retry with a larger batch size without going back to the previous bucket
The current logic is to use a default size 16 to batch the whole bucket.
If it is too small, it will retry with a larger batch size.
The current code accidentally does a state->bucket-- before retrying.
This goes back to retry with the previous bucket which has already
been done. This patch fixed it.
It is hard to create a selftest. I added a WARN_ON(state->bucket < 0),
forced a particular port to be hashed to the first bucket,
created >16 sockets, and observed the for-loop went back
to the "-1" bucket.
Cc: Aditi Ghag <aditi.ghag@isovalent.com> Fixes: c96dac8d369f ("bpf: udp: Implement batching for sockets iterator") Acked-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Reviewed-by: Aditi Ghag <aditi.ghag@isovalent.com> Link: https://lore.kernel.org/r/20240112190530.3751661-2-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Marc Kleine-Budde [Fri, 12 Jan 2024 16:13:14 +0000 (17:13 +0100)]
net: netdev_queue: netdev_txq_completed_mb(): fix wake condition
netif_txq_try_stop() uses "get_desc >= start_thrs" as the check for
the call to netif_tx_start_queue().
Use ">=" i netdev_txq_completed_mb(), too.
Fixes: c91c46de6bbc ("net: provide macros for commonly copied lockless queue stop/wake code") Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de> Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 12 Jan 2024 12:28:16 +0000 (12:28 +0000)]
net: add more sanity check in virtio_net_hdr_to_skb()
syzbot/KMSAN reports access to uninitialized data from gso_features_check() [1]
The repro use af_packet, injecting a gso packet and hdrlen == 0.
We could fix the issue making gso_features_check() more careful
while dealing with NETIF_F_TSO_MANGLEID in fast path.
Or we can make sure virtio_net_hdr_to_skb() pulls minimal network and
transport headers as intended.
Note that for GSO packets coming from untrusted sources, SKB_GSO_DODGY
bit forces a proper header validation (and pull) before the packet can
hit any device ndo_start_xmit(), thus we do not need a precise disection
at virtio_net_hdr_to_skb() stage.
CPU: 0 PID: 5025 Comm: syz-executor279 Not tainted 6.7.0-rc7-syzkaller-00003-gfbafc3e621c3 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/17/2023
Reported-by: syzbot+7f4d0ea3df4d4fa9a65f@syzkaller.appspotmail.com Link: https://lore.kernel.org/netdev/0000000000005abd7b060eb160cd@google.com/ Fixes: 9274124f023b ("net: stricter validation of untrusted gso packets") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Fri, 12 Jan 2024 11:39:30 +0000 (12:39 +0100)]
net: sched: track device in tcf_block_get/put_ext() only for clsact binder types
Clsact/ingress qdisc is not the only one using shared block,
red is also using it. The device tracking was originally introduced
by commit 913b47d3424e ("net/sched: Introduce tc block netdev
tracking infra") for clsact/ingress only. Commit 94e2557d086a ("net:
sched: move block device tracking into tcf_block_get/put_ext()")
mistakenly enabled that for red as well.
Fix that by adding a check for the binder type being clsact when adding
device to the block->ports xarray.
Reported-by: Ido Schimmel <idosch@idosch.org> Closes: https://lore.kernel.org/all/ZZ6JE0odnu1lLPtu@shredder/ Fixes: 94e2557d086a ("net: sched: move block device tracking into tcf_block_get/put_ext()") Signed-off-by: Jiri Pirko <jiri@nvidia.com> Tested-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Tested-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 12 Jan 2024 10:44:27 +0000 (10:44 +0000)]
udp: annotate data-races around up->pending
up->pending can be read without holding the socket lock,
as pointed out by syzbot [1]
Add READ_ONCE() in lockless contexts, and WRITE_ONCE()
on write side.
[1]
BUG: KCSAN: data-race in udpv6_sendmsg / udpv6_sendmsg
write to 0xffff88814e5eadf0 of 4 bytes by task 15547 on cpu 1:
udpv6_sendmsg+0x1405/0x1530 net/ipv6/udp.c:1596
inet6_sendmsg+0x63/0x80 net/ipv6/af_inet6.c:657
sock_sendmsg_nosec net/socket.c:730 [inline]
__sock_sendmsg net/socket.c:745 [inline]
__sys_sendto+0x257/0x310 net/socket.c:2192
__do_sys_sendto net/socket.c:2204 [inline]
__se_sys_sendto net/socket.c:2200 [inline]
__x64_sys_sendto+0x78/0x90 net/socket.c:2200
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0x44/0x110 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x63/0x6b
read to 0xffff88814e5eadf0 of 4 bytes by task 15551 on cpu 0:
udpv6_sendmsg+0x22c/0x1530 net/ipv6/udp.c:1373
inet6_sendmsg+0x63/0x80 net/ipv6/af_inet6.c:657
sock_sendmsg_nosec net/socket.c:730 [inline]
__sock_sendmsg net/socket.c:745 [inline]
____sys_sendmsg+0x37c/0x4d0 net/socket.c:2586
___sys_sendmsg net/socket.c:2640 [inline]
__sys_sendmmsg+0x269/0x500 net/socket.c:2726
__do_sys_sendmmsg net/socket.c:2755 [inline]
__se_sys_sendmmsg net/socket.c:2752 [inline]
__x64_sys_sendmmsg+0x57/0x60 net/socket.c:2752
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0x44/0x110 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x63/0x6b
value changed: 0x00000000 -> 0x0000000a
Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 15551 Comm: syz-executor.1 Tainted: G W 6.7.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/17/2023
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: syzbot+8d482d0e407f665d9d10@syzkaller.appspotmail.com Link: https://lore.kernel.org/netdev/0000000000009e46c3060ebcdffd@google.com/ Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Sneh Shah [Tue, 9 Jan 2024 14:47:29 +0000 (20:17 +0530)]
net: stmmac: Fix ethool link settings ops for integrated PCS
Currently get/set_link_ksettings ethtool ops are dependent on PCS.
When PCS is integrated, it will not have separate link config.
Bypass configuring and checking PCS for integrated PCS.
Fixes: aa571b6275fb ("net: stmmac: add new switch to struct plat_stmmacenet_data") Tested-by: Andrew Halaney <ahalaney@redhat.com> # sa8775p-ride Signed-off-by: Sneh Shah <quic_snehshah@quicinc.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 11 Jan 2024 19:49:17 +0000 (19:49 +0000)]
mptcp: refine opt_mp_capable determination
OPTIONS_MPTCP_MPC is a combination of three flags.
It would be better to be strict about testing what
flag is expected, at least for code readability.
mptcp_parse_option() already makes the distinction.
- subflow_check_req() should use OPTION_MPTCP_MPC_SYN.
- mptcp_subflow_init_cookie_req() should use OPTION_MPTCP_MPC_ACK.
- subflow_finish_connect() should use OPTION_MPTCP_MPC_SYNACK
- subflow_syn_recv_sock should use OPTION_MPTCP_MPC_ACK
Suggested-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Fixes: 74c7dfbee3e1 ("mptcp: consolidate in_opt sub-options fields in a bitmask") Link: https://lore.kernel.org/r/20240111194917.4044654-6-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Dumazet [Thu, 11 Jan 2024 19:49:15 +0000 (19:49 +0000)]
mptcp: use OPTION_MPTCP_MPJ_SYNACK in subflow_finish_connect()
subflow_finish_connect() uses four fields (backup, join_id, thmac, none)
that may contain garbage unless OPTION_MPTCP_MPJ_SYNACK has been set
in mptcp_parse_option()
Fixes: f296234c98a8 ("mptcp: Add handling of incoming MP_JOIN requests") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Florian Westphal <fw@strlen.de> Cc: Peter Krystad <peter.krystad@linux.intel.com> Cc: Matthieu Baerts <matttbe@kernel.org> Cc: Mat Martineau <martineau@kernel.org> Cc: Geliang Tang <geliang.tang@linux.dev> Reviewed-by: Simon Horman <horms@kernel.org> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20240111194917.4044654-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Claudiu Beznea [Fri, 5 Jan 2024 08:52:42 +0000 (10:52 +0200)]
net: phy: micrel: populate .soft_reset for KSZ9131
The RZ/G3S SMARC Module has 2 KSZ9131 PHYs. In this setup, the KSZ9131 PHY
is used with the ravb Ethernet driver. It has been discovered that when
bringing the Ethernet interface down/up continuously, e.g., with the
following sh script:
$ while :; do ifconfig eth0 down; ifconfig eth0 up; done
the link speed and duplex are wrong after interrupting the bring down/up
operation even though the Ethernet interface is up. To recover from this
state the following configuration sequence is necessary (executed
manually):
$ ifconfig eth0 down
$ ifconfig eth0 up
The behavior has been identified also on the Microchip SAMA7G5-EK board
which runs the macb driver and uses the same PHY.
The order of PHY-related operations in ravb_open() is as follows:
ravb_open() ->
ravb_phy_start() ->
ravb_phy_init() ->
of_phy_connect() ->
phy_connect_direct() ->
phy_attach_direct() ->
phy_init_hw() ->
phydev->drv->soft_reset()
phydev->drv->config_init()
phydev->drv->config_intr()
phy_resume()
kszphy_resume()
The order of PHY-related operations in ravb_close is as follows:
ravb_close() ->
phy_stop() ->
phy_suspend() ->
kszphy_suspend() ->
genphy_suspend()
// set BMCR_PDOWN bit in MII_BMCR
In genphy_suspend() setting the BMCR_PDWN bit in MII_BMCR switches the PHY
to Software Power-Down (SPD) mode (according to the KSZ9131 datasheet).
Thus, when opening the interface after it has been previously closed (via
ravb_close()), the phydev->drv->config_init() and
phydev->drv->config_intr() reach the KSZ9131 PHY driver via the
ksz9131_config_init() and kszphy_config_intr() functions.
KSZ9131 specifies that the MII management interface remains operational
during SPD (Software Power-Down), but (according to manual):
- Only access to the standard registers (0 through 31) is supported.
- Access to MMD address spaces other than MMD address space 1 is possible
if the spd_clock_gate_override bit is set.
- Access to MMD address space 1 is not possible.
The spd_clock_gate_override bit is not used in the KSZ9131 driver.
ksz9131_config_init() configures RGMII delay, pad skews and LEDs by
accessesing MMD registers other than those in address space 1.
The datasheet for the KSZ9131 does not specify what happens if registers
from an unsupported address space are accessed while the PHY is in SPD.
To fix the issue the .soft_reset method has been instantiated for KSZ9131,
too. This resets the PHY to the default state before doing any
configurations to it, thus switching it out of SPD.
Fixes: bff5b4b37372 ("net: phy: micrel: add Microchip KSZ9131 initial driver") Signed-off-by: Claudiu Beznea <claudiu.beznea.uj@bp.renesas.com> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
Horatiu Vultur [Wed, 10 Jan 2024 11:37:30 +0000 (12:37 +0100)]
net: micrel: Fix PTP frame parsing for lan8841
The HW has the capability to check each frame if it is a PTP frame,
which domain it is, which ptp frame type it is, different ip address in
the frame. And if one of these checks fail then the frame is not
timestamp. Most of these checks were disabled except checking the field
minorVersionPTP inside the PTP header. Meaning that once a partner sends
a frame compliant to 8021AS which has minorVersionPTP set to 1, then the
frame was not timestamp because the HW expected by default a value of 0
in minorVersionPTP.
Fix this issue by removing this check so the userspace can decide on this.
Fixes: cafc3662ee3f ("net: micrel: Add PHC support for lan8841") Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com> Reviewed-by: Divya Koppera <divya.koppera@microchip.com> Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Taehee Yoo [Sun, 7 Jan 2024 14:42:41 +0000 (14:42 +0000)]
amt: do not use overwrapped cb area
amt driver uses skb->cb for storing tunnel information.
This job is worked before TC layer and then amt driver load tunnel info
from skb->cb after TC layer.
So, its cb area should not be overwrapped with CB area used by TC.
In order to not use cb area used by TC, it skips the biggest cb
structure used by TC, which was qdisc_skb_cb.
But it's not anymore.
Currently, biggest structure of TC's CB is tc_skb_cb.
So, it should skip size of tc_skb_cb instead of qdisc_skb_cb.
Fixes: ec624fe740b4 ("net/sched: Extend qdisc control block with tc control block") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://lore.kernel.org/r/20240107144241.4169520-1-ap420073@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
net: ethernet: ti: am65-cpsw: Allow for MTU values
The am65-cpsw-nuss driver has a fixed definition for the maximum ethernet
frame length of 1522 bytes (AM65_CPSW_MAX_PACKET_SIZE). This limits the switch
ports to only operate at a maximum MTU of 1500 bytes. When combining this CPSW
switch with a DSA switch connected to one of its ports this limitation shows up.
The extra 8 bytes the DSA subsystem adds internally to the ethernet frame
create resulting frames bigger than 1522 bytes (1518 for non VLAN + 8 for DSA
stuff) so they get dropped by the switch.
One of the issues with the the am65-cpsw-nuss driver is that the network device
max_mtu was being set to the same fixed value defined for the max total frame
length (1522 bytes). This makes the DSA subsystem believe that the MTU of the
interface can be set to 1508 bytes to make room for the extra 8 bytes of the DSA
headers. However, all packages created assuming the 1500 bytes payload get
dropped by the switch as oversized.
This series offers a solution to this problem. The max_mtu advertised on the
network device and the actual max frame size configured on the switch registers
are made consistent by letting the extra room needed for the ethernet headers
and the frame checksum (22 bytes including VLAN).
====================
net: ethernet: ti: am65-cpsw: Fix max mtu to fit ethernet frames
The value of AM65_CPSW_MAX_PACKET_SIZE represents the maximum length
of a received frame. This value is written to the register
AM65_CPSW_PORT_REG_RX_MAXLEN.
The maximum MTU configured on the network device should then leave
some room for the ethernet headers and frame check. Otherwise, if
the network interface is configured to its maximum mtu possible,
the frames will be larger than AM65_CPSW_MAX_PACKET_SIZE and will
get dropped as oversized.
The switch supports ethernet frame sizes between 64 and 2024 bytes
(including VLAN) as stated in the technical reference manual, so
define AM65_CPSW_MAX_PACKET_SIZE with that maximum size.
Zhu Yanjun [Thu, 4 Jan 2024 02:09:02 +0000 (10:09 +0800)]
virtio_net: Fix "‘%d’ directive writing between 1 and 11 bytes into a region of size 10" warnings
Fix the warnings when building virtio_net driver.
"
drivers/net/virtio_net.c: In function ‘init_vqs’:
drivers/net/virtio_net.c:4551:48: warning: ‘%d’ directive writing between 1 and 11 bytes into a region of size 10 [-Wformat-overflow=]
4551 | sprintf(vi->rq[i].name, "input.%d", i);
| ^~
In function ‘virtnet_find_vqs’,
inlined from ‘init_vqs’ at drivers/net/virtio_net.c:4645:8:
drivers/net/virtio_net.c:4551:41: note: directive argument in the range [-2147483643, 65534]
4551 | sprintf(vi->rq[i].name, "input.%d", i);
| ^~~~~~~~~~
drivers/net/virtio_net.c:4551:17: note: ‘sprintf’ output between 8 and 18 bytes into a destination of size 16
4551 | sprintf(vi->rq[i].name, "input.%d", i);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/net/virtio_net.c: In function ‘init_vqs’:
drivers/net/virtio_net.c:4552:49: warning: ‘%d’ directive writing between 1 and 11 bytes into a region of size 9 [-Wformat-overflow=]
4552 | sprintf(vi->sq[i].name, "output.%d", i);
| ^~
In function ‘virtnet_find_vqs’,
inlined from ‘init_vqs’ at drivers/net/virtio_net.c:4645:8:
drivers/net/virtio_net.c:4552:41: note: directive argument in the range [-2147483643, 65534]
4552 | sprintf(vi->sq[i].name, "output.%d", i);
| ^~~~~~~~~~~
drivers/net/virtio_net.c:4552:17: note: ‘sprintf’ output between 9 and 19 bytes into a destination of size 16
4552 | sprintf(vi->sq[i].name, "output.%d", i);
Nithin Dabilpuram [Mon, 8 Jan 2024 07:30:36 +0000 (13:00 +0530)]
octeontx2-af: CN10KB: Fix FIFO length calculation for RPM2
RPM0 and RPM1 on the CN10KB SoC have 8 LMACs each, whereas RPM2
has only 4 LMACs. Similarly, the RPM0 and RPM1 have 256KB FIFO,
whereas RPM2 has 128KB FIFO. This patch fixes an issue with
improper TX credit programming for the RPM2 link.
Fixes: b9d0fedc6234 ("octeontx2-af: cn10kb: Add RPM_USX MAC support") Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com> Signed-off-by: Naveen Mamindlapalli <naveenm@marvell.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240108073036.8766-1-naveenm@marvell.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
rtnetlink: allow to enslave with one msg an up interface
The first patch fixes a regression, introduced in linux v6.1, by reverting
a patch. The second patch adds a test to verify this API.
====================
The patch broke:
> ip link set dummy0 up
> ip link set dummy0 master bond0 down
This last command is useful to be able to enslave an interface with only
one netlink message.
After discussion, there is no good reason to support:
> ip link set dummy0 down
> ip link set dummy0 master bond0 up
because the bond interface already set the slave up when it is up.
Cc: stable@vger.kernel.org Fixes: a4abfa627c38 ("net: rtnetlink: Enslave device before bringing it up") Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Hangbin Liu <liuhangbin@gmail.com> Link: https://lore.kernel.org/r/20240108094103.2001224-2-nicolas.dichtel@6wind.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David Howells [Tue, 9 Jan 2024 15:10:48 +0000 (15:10 +0000)]
rxrpc: Fix use of Don't Fragment flag
rxrpc normally has the Don't Fragment flag set on the UDP packets it
transmits, except when it has decided that DATA packets aren't getting
through - in which case it turns it off just for the DATA transmissions.
This can be a problem, however, for RESPONSE packets that convey
authentication and crypto data from the client to the server as ticket may
be larger than can fit in the MTU.
In such a case, rxrpc gets itself into an infinite loop as the sendmsg
returns an error (EMSGSIZE), which causes rxkad_send_response() to return
-EAGAIN - and the CHALLENGE packet is put back on the Rx queue to retry,
leading to the I/O thread endlessly attempting to perform the transmission.
Fix this by disabling DF on RESPONSE packets for now. The use of DF and
best data MTU determination needs reconsidering at some point in the
future.
Fixes: 17926a79320a ("[AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both") Reported-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-afs@lists.infradead.org Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/1581852.1704813048@warthog.procyon.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Which is obviously bogus, because not all net_devices have a netdev_priv()
of type struct dsa_user_priv. But struct dsa_user_priv is fairly small,
and p->dp means dereferencing 8 bytes starting with offset 16. Most
drivers allocate that much private memory anyway, making our access not
fault, and we discard the bogus data quickly afterwards, so this wasn't
caught.
But the dummy interface is somewhat special in that it calls
alloc_netdev() with a priv size of 0. So every netdev_priv() dereference
is invalid, and we get this when we emit a NETDEV_PRECHANGEUPPER event
with a VLAN as its new upper:
$ ip link add dummy1 type dummy
$ ip link add link dummy1 name dummy1.100 type vlan id 100
[ 43.309174] ==================================================================
[ 43.316456] BUG: KASAN: slab-out-of-bounds in dsa_user_prechangeupper+0x30/0xe8
[ 43.323835] Read of size 8 at addr ffff3f86481d2990 by task ip/374
[ 43.330058]
[ 43.342436] Call trace:
[ 43.366542] dsa_user_prechangeupper+0x30/0xe8
[ 43.371024] dsa_user_netdevice_event+0xb38/0xee8
[ 43.375768] notifier_call_chain+0xa4/0x210
[ 43.379985] raw_notifier_call_chain+0x24/0x38
[ 43.384464] __netdev_upper_dev_link+0x3ec/0x5d8
[ 43.389120] netdev_upper_dev_link+0x70/0xa8
[ 43.393424] register_vlan_dev+0x1bc/0x310
[ 43.397554] vlan_newlink+0x210/0x248
[ 43.401247] rtnl_newlink+0x9fc/0xe30
[ 43.404942] rtnetlink_rcv_msg+0x378/0x580
Avoid the kernel oops by dereferencing after the type check, as customary.
Fixes: 4c3f80d22b2e ("net: dsa: walk through all changeupper notifier functions") Reported-and-tested-by: syzbot+d81bcd883824180500c8@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/0000000000001d4255060e87545c@google.com/ Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240110003354.2796778-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Lin Ma [Wed, 10 Jan 2024 06:14:00 +0000 (14:14 +0800)]
net: qualcomm: rmnet: fix global oob in rmnet_policy
The variable rmnet_link_ops assign a *bigger* maxtype which leads to a
global out-of-bounds read when parsing the netlink attributes. See bug
trace below:
==================================================================
BUG: KASAN: global-out-of-bounds in validate_nla lib/nlattr.c:386 [inline]
BUG: KASAN: global-out-of-bounds in __nla_validate_parse+0x24af/0x2750 lib/nlattr.c:600
Read of size 1 at addr ffffffff92c438d0 by task syz-executor.6/84207
The intel test robot uses only selftest's Makefile, not the top linux
Makefile:
> make W=1 O=/tmp/kselftest -C tools/testing/selftests
So, $(LINK.c) is determined by environment, rather than by kernel
Makefiles. On my machine (as well as other people that ran tcp-ao
selftests) GNU/Make implicit definition does use $(LDFLAGS):
But, according to build robot report, it's not the case for them.
While I could just avoid using pre-defined $(LINK.c), it's also used by
selftests/lib.mk by default.
Anyways, according to GNU/Make documentation [1], I should have used
$(LDLIBS) instead of $(LDFLAGS) in the first place, so let's just do it:
> LDFLAGS
> Extra flags to give to compilers when they are supposed to invoke
> the linker, ‘ld’, such as -L. Libraries (-lfoo) should be added
> to the LDLIBS variable instead.
> LDLIBS
> Library flags or names given to compilers when they are supposed
> to invoke the linker, ‘ld’. LOADLIBES is a deprecated (but still
> supported) alternative to LDLIBS. Non-library linker flags, such
> as -L, should go in the LDFLAGS variable.
Arnd Bergmann [Thu, 11 Jan 2024 16:27:53 +0000 (17:27 +0100)]
wangxunx: select CONFIG_PHYLINK where needed
The ngbe driver needs phylink:
arm-linux-gnueabi-ld: drivers/net/ethernet/wangxun/libwx/wx_ethtool.o: in function `wx_nway_reset':
wx_ethtool.c:(.text+0x458): undefined reference to `phylink_ethtool_nway_reset'
arm-linux-gnueabi-ld: drivers/net/ethernet/wangxun/ngbe/ngbe_main.o: in function `ngbe_remove':
ngbe_main.c:(.text+0x7c): undefined reference to `phylink_destroy'
arm-linux-gnueabi-ld: drivers/net/ethernet/wangxun/ngbe/ngbe_main.o: in function `ngbe_open':
ngbe_main.c:(.text+0xf90): undefined reference to `phylink_connect_phy'
arm-linux-gnueabi-ld: drivers/net/ethernet/wangxun/ngbe/ngbe_mdio.o: in function `ngbe_mdio_init':
ngbe_mdio.c:(.text+0x314): undefined reference to `phylink_create'
Add the missing Kconfig description for this.
Fixes: bc2426d74aa3 ("net: ngbe: convert phylib to phylink") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/20240111162828.68564-1-arnd@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Tue, 9 Jan 2024 16:45:17 +0000 (08:45 -0800)]
MAINTAINERS: ibmvnic: drop Dany from reviewers
I missed that Dany uses a different email address
when tagging patches (drt@linux.ibm.com)
and asked him if he's still actively working on ibmvnic.
He doesn't really fall under our removal criteria,
but he admitted that he already moved on to other projects.
Jakub Kicinski [Tue, 9 Jan 2024 16:45:16 +0000 (08:45 -0800)]
MAINTAINERS: mark ax25 as Orphan
We haven't heard from Ralf for two years, according to lore.
We get a constant stream of "fixes" to ax25 from people using
code analysis tools. Nobody is reviewing those, let's reflect
this reality in MAINTAINERS.
Jakub Kicinski [Tue, 9 Jan 2024 16:45:15 +0000 (08:45 -0800)]
MAINTAINERS: Bluetooth: retire Johan (for now?)
Johan moved to maintaining the Zephyr Bluetooth stack,
and we haven't heard from him on the ML in 3 years
(according to lore), and seen any tags in git in 4 years.
Trade the MAINTAINER entry for CREDITS, we can revert
whenever Johan comes back to Linux hacking :)
Subsystem BLUETOOTH SUBSYSTEM
Changes 173 / 986 (17%)
Last activity: 2023-12-22
Marcel Holtmann <marcel@holtmann.org>:
Author 91cb4c19118a 2022-01-27 00:00:00 52
Committer edcb185fa9c4 2022-05-23 00:00:00 446
Tags 000c2fa2c144 2023-04-23 00:00:00 523
Johan Hedberg <johan.hedberg@gmail.com>:
Luiz Augusto von Dentz <luiz.dentz@gmail.com>:
Author d03376c18592 2023-12-22 00:00:00 241
Committer da9065caa594 2023-12-22 00:00:00 341
Tags da9065caa594 2023-12-22 00:00:00 493
Top reviewers:
[33]: alainm@chromium.org
[31]: mcchou@chromium.org
[27]: abhishekpandit@chromium.org
INACTIVE MAINTAINER Johan Hedberg <johan.hedberg@gmail.com>
Jakub Kicinski [Tue, 9 Jan 2024 16:45:12 +0000 (08:45 -0800)]
MAINTAINERS: eth: mt7530: move Landen Chao to CREDITS
mt7530 is a pretty active driver and last we have heard
from Landen Chao on the list was March. There were total
of 4 message from them in the last 2.5 years.
I think it's time to move to CREDITS.
Linus Torvalds [Thu, 11 Jan 2024 18:07:29 +0000 (10:07 -0800)]
Merge tag 'net-next-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Paolo Abeni:
"The most interesting thing is probably the networking structs
reorganization and a significant amount of changes is around
self-tests.
Core & protocols:
- Analyze and reorganize core networking structs (socks, netdev,
netns, mibs) to optimize cacheline consumption and set up build
time warnings to safeguard against future header changes
This improves TCP performances with many concurrent connections up
to 40%
- Add page-pool netlink-based introspection, exposing the memory
usage and recycling stats. This helps indentify bad PP users and
possible leaks
- Refine TCP/DCCP source port selection to no longer favor even
source port at connect() time when IP_LOCAL_PORT_RANGE is set. This
lowers the time taken by connect() for hosts having many active
connections to the same destination
- Refactor the TCP bind conflict code, shrinking related socket
structs
- Refactor TCP SYN-Cookie handling, as a preparation step to allow
arbitrary SYN-Cookie processing via eBPF
- Tune optmem_max for 0-copy usage, increasing the default value to
128KB and namespecifying it
- Allow coalescing for cloned skbs coming from page pools, improving
RX performances with some common configurations
- Reduce extension header parsing overhead at GRO time
- Add bridge MDB bulk deletion support, allowing user-space to
request the deletion of matching entries
- Reorder nftables struct members, to keep data accessed by the
datapath first
- Introduce TC block ports tracking and use. This allows supporting
multicast-like behavior at the TC layer
- Remove UAPI support for retired TC qdiscs (dsmark, CBQ and ATM) and
classifiers (RSVP and tcindex)
- More data-race annotations
- Extend the diag interface to dump TCP bound-only sockets
- Conditional notification of events for TC qdisc class and actions
- Support for WPAN dynamic associations with nearby devices, to form
a sub-network using a specific PAN ID
- Implement SMCv2.1 virtual ISM device support
- Add support for Batman-avd mulicast packet type
BPF:
- Tons of verifier improvements:
- BPF register bounds logic and range support along with a large
test suite
- log improvements
- complete precision tracking support for register spills
- track aligned STACK_ZERO cases as imprecise spilled registers.
This improves the verifier "instructions processed" metric from
single digit to 50-60% for some programs
- support for user's global BPF subprogram arguments with few
commonly requested annotations for a better developer
experience
- support tracking of BPF_JNE which helps cases when the compiler
transforms (unsigned) "a > 0" into "if a == 0 goto xxx" and the
like
- several fixes
- Add initial TX metadata implementation for AF_XDP with support in
mlx5 and stmmac drivers. Two types of offloads are supported right
now, that is, TX timestamp and TX checksum offload
- Fix kCFI bugs in BPF all forms of indirect calls from BPF into
kernel and from kernel into BPF work with CFI enabled. This allows
BPF to work with CONFIG_FINEIBT=y
- Change BPF verifier logic to validate global subprograms lazily
instead of unconditionally before the main program, so they can be
guarded using BPF CO-RE techniques
- Support uid/gid options when mounting bpffs
- Add a new kfunc which acquires the associated cgroup of a task
within a specific cgroup v1 hierarchy where the latter is
identified by its id
- Extend verifier to allow bpf_refcount_acquire() of a map value
field obtained via direct load which is a use-case needed in
sched_ext
- Add BPF link_info support for uprobe multi link along with bpftool
integration for the latter
- Support for VLAN tag in XDP hints
- Remove deprecated bpfilter kernel leftovers given the project is
developed in user-space (https://github.com/facebook/bpfilter)
Misc:
- Support for parellel TC self-tests execution
- Increase MPTCP self-tests coverage
- Updated the bridge documentation, including several so-far
undocumented features
- Convert all the net self-tests to run in unique netns, to avoid
random failures due to conflict and allow concurrent runs
- Add TCP-AO self-tests
- Add kunit tests for both cfg80211 and mac80211
- Autogenerate Netlink families documentation from YAML spec
- Add yml-gen support for fixed headers and recursive nests, the tool
can now generate user-space code for all genetlink families for
which we have specs
- A bunch of additional module descriptions fixes
- Catch incorrect freeing of pages belonging to a page pool
Driver API:
- Rust abstractions for network PHY drivers; do not cover yet the
full C API, but already allow implementing functional PHY drivers
in rust
- Introduce queue and NAPI support in the netdev Netlink interface,
allowing complete access to the device <> NAPIs <> queues
relationship
- Introduce notifications filtering for devlink to allow control
application scale to thousands of instances
- Improve PHY validation, requesting rate matching information for
each ethtool link mode supported by both the PHY and host
- Add support for ethtool symmetric-xor RSS hash
- ACPI based Wifi band RFI (WBRF) mitigation feature for the AMD
platform
- Expose pin fractional frequency offset value over new DPLL generic
netlink attribute
- Convert older drivers to platform remove callback returning void
- Ethernet high-speed NICs:
- Intel (100G, ice, idpf):
- allow one by one port representors creation and removal
- add temperature and clock information reporting
- add get/set for ethtool's header split ringparam
- add again FW logging
- adds support switchdev hardware packet mirroring
- iavf: implement symmetric-xor RSS hash
- igc: add support for concurrent physical and free-running
timers
- i40e: increase the allowable descriptors
- nVidia/Mellanox:
- Preparation for Socket-Direct multi-dev netdev. That will
allow in future releases combining multiple PFs devices
attached to different NUMA nodes under the same netdev
- Broadcom (bnxt):
- TX completion handling improvements
- add basic ntuple filter support
- reduce MSIX vectors usage for MQPRIO offload
- add VXLAN support, USO offload and TX coalesce completion
for P7
- Marvell Octeon EP:
- xmit-more support
- add PF-VF mailbox support and use it for FW notifications
for VFs
- Wangxun (ngbe/txgbe):
- implement ethtool functions to operate pause param, ring
param, coalesce channel number and msglevel
- Netronome/Corigine (nfp):
- add flow-steering support
- support UDP segmentation offload
- Ethernet NICs embedded, slower, virtual:
- Xilinx AXI: remove duplicate DMA code adopting the dma engine
driver
- stmmac: add support for HW-accelerated VLAN stripping
- TI AM654x sw: add mqprio, frame preemption & coalescing
- gve: add support for non-4k page sizes.
- virtio-net: support dynamic coalescing moderation
- nVidia/Mellanox Ethernet datacenter switches:
- allow firmware upgrade without a reboot
- more flexible support for bridge flooding via the compressed
FID flooding mode
- Ethernet embedded switches:
- Microchip:
- fine-tune flow control and speed configurations in KSZ8xxx
- KSZ88X3: enable setting rmii reference
- Renesas:
- add jumbo frames support
- Marvell:
- 88E6xxx: add "eth-mac" and "rmon" stats support
- Ethernet PHYs:
- aquantia: add firmware load support
- at803x: refactor the driver to simplify adding support for more
chip variants
- NXP C45 TJA11xx: Add MACsec offload support
- Wifi:
- MediaTek (mt76):
- NVMEM EEPROM improvements
- mt7996 Extremely High Throughput (EHT) improvements
- mt7996 Wireless Ethernet Dispatcher (WED) support
- mt7996 36-bit DMA support
- Qualcomm (ath12k):
- support for a single MSI vector
- WCN7850: support AP mode
- Intel (iwlwifi):
- new debugfs file fw_dbg_clear
- allow concurrent P2P operation on DFS channels
- Bluetooth:
- QCA2066: support HFP offload
- ISO: more broadcast-related improvements
- NXP: better recovery in case receiver/transmitter get out of sync"
* tag 'net-next-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1714 commits)
lan78xx: remove redundant statement in lan78xx_get_eee
lan743x: remove redundant statement in lan743x_ethtool_get_eee
bnxt_en: Fix RCU locking for ntuple filters in bnxt_rx_flow_steer()
bnxt_en: Fix RCU locking for ntuple filters in bnxt_srxclsrldel()
bnxt_en: Remove unneeded variable in bnxt_hwrm_clear_vnic_filter()
tcp: Revert no longer abort SYN_SENT when receiving some ICMP
Revert "mlx5 updates 2023-12-20"
Revert "net: stmmac: Enable Per DMA Channel interrupt"
ipvlan: Remove usage of the deprecated ida_simple_xx() API
ipvlan: Fix a typo in a comment
net/sched: Remove ipt action tests
net: stmmac: Use interrupt mode INTM=1 for per channel irq
net: stmmac: Add support for TX/RX channel interrupt
net: stmmac: Make MSI interrupt routine generic
dt-bindings: net: snps,dwmac: per channel irq
net: phy: at803x: make read_status more generic
net: phy: at803x: add support for cdt cross short test for qca808x
net: phy: at803x: refactor qca808x cable test get status function
net: phy: at803x: generalize cdt fault length function
net: ethernet: cortina: Drop TSO support
...