]> www.infradead.org Git - users/hch/dma-mapping.git/log
users/hch/dma-mapping.git
21 months agobpf: Remove unnecessary cpu == 0 check in memalloc
Yonghong Song [Thu, 4 Jan 2024 16:57:44 +0000 (08:57 -0800)]
bpf: Remove unnecessary cpu == 0 check in memalloc

After merging the patch set [1] to reduce memory usage
for bpf_global_percpu_ma, Alexei found a redundant check (cpu == 0)
in function bpf_mem_alloc_percpu_unit_init() ([2]).
Indeed, the check is unnecessary since c->unit_size will
be all NULL or all non-NULL for all cpus before
for_each_possible_cpu() loop.
Removing the check makes code less confusing.

  [1] https://lore.kernel.org/all/20231222031729.1287957-1-yonghong.song@linux.dev/
  [2] https://lore.kernel.org/all/20231222031745.1289082-1-yonghong.song@linux.dev/

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240104165744.702239-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
21 months agoMerge branch 'libbpf-side-__arg_ctx-fallback-support'
Alexei Starovoitov [Thu, 4 Jan 2024 05:22:50 +0000 (21:22 -0800)]
Merge branch 'libbpf-side-__arg_ctx-fallback-support'

Andrii Nakryiko says:

====================
Libbpf-side __arg_ctx fallback support

Support __arg_ctx global function argument tag semantics even on older kernels
that don't natively support it through btf_decl_tag("arg:ctx").

Patches #2-#6 are preparatory work to allow to postpone BTF loading into the
kernel until after all the BPF program relocations (including global func
appending to main programs) are done. Patch #4 is perhaps the most important
and establishes pre-created stable placeholder FDs, so that relocations can
embed valid map FDs into ldimm64 instructions.

Once BTF is done after relocation, what's left is to adjust BTF information to
have each main program's copy of each used global subprog to point to its own
adjusted FUNC -> FUNC_PROTO type chain (if they use __arg_ctx) in such a way
as to satisfy type expectations of BPF verifier regarding the PTR_TO_CTX
argument definition. See patch #8 for details.

Patch #8 adds few more __arg_ctx use cases (edge cases like multiple arguments
having __arg_ctx, etc) to test_global_func_ctx_args.c, to make it simple to
validate that this logic indeed works on old kernels. It does. But just to be
100% sure patch #9 adds a test validating that libbpf uploads func_info with
properly modified BTF data.

v2->v3:
  - drop renaming patch (Alexei, Eduard);
  - use memfd_create() instead of /dev/null for placeholder FD (Eduard);
  - add one more test for validating BTF rewrite logic (Eduard);
  - fixed wrong -errno usage, reshuffled some BTF rewrite bits (Eduard);
v1->v2:
  - do internal functions renaming in patch #1 (Alexei);
  - extract cloning of FUNC -> FUNC_PROTO information into separate function
    (Alexei);
====================

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240104013847.3875810-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
21 months agoselftests/bpf: add __arg_ctx BTF rewrite test
Andrii Nakryiko [Thu, 4 Jan 2024 01:38:47 +0000 (17:38 -0800)]
selftests/bpf: add __arg_ctx BTF rewrite test

Add a test validating that libbpf uploads BTF and func_info with
rewritten type information for arguments of global subprogs that are
marked with __arg_ctx tag.

Suggested-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240104013847.3875810-10-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
21 months agoselftests/bpf: add arg:ctx cases to test_global_funcs tests
Andrii Nakryiko [Thu, 4 Jan 2024 01:38:46 +0000 (17:38 -0800)]
selftests/bpf: add arg:ctx cases to test_global_funcs tests

Add a few extra cases of global funcs with context arguments. This time
rely on "arg:ctx" decl_tag (__arg_ctx macro), but put it next to
"classic" cases where context argument has to be of an exact type that
BPF verifier expects (e.g., bpf_user_pt_regs_t for kprobe/uprobe).

Colocating all these cases separately from other global func args that
rely on arg:xxx decl tags (in verifier_global_subprogs.c) allows for
simpler backwards compatibility testing on old kernels. All the cases in
test_global_func_ctx_args.c are supposed to work on older kernels, which
was manually validated during development.

Acked-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240104013847.3875810-9-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
21 months agolibbpf: implement __arg_ctx fallback logic
Andrii Nakryiko [Thu, 4 Jan 2024 01:38:45 +0000 (17:38 -0800)]
libbpf: implement __arg_ctx fallback logic

Out of all special global func arg tag annotations, __arg_ctx is
practically is the most immediately useful and most critical to have
working across multitude kernel version, if possible. This would allow
end users to write much simpler code if __arg_ctx semantics worked for
older kernels that don't natively understand btf_decl_tag("arg:ctx") in
verifier logic.

Luckily, it is possible to ensure __arg_ctx works on old kernels through
a bit of extra work done by libbpf, at least in a lot of common cases.

To explain the overall idea, we need to go back at how context argument
was supported in global funcs before __arg_ctx support was added. This
was done based on special struct name checks in kernel. E.g., for
BPF_PROG_TYPE_PERF_EVENT the expectation is that argument type `struct
bpf_perf_event_data *` mark that argument as PTR_TO_CTX. This is all
good as long as global function is used from the same BPF program types
only, which is often not the case. If the same subprog has to be called
from, say, kprobe and perf_event program types, there is no single
definition that would satisfy BPF verifier. Subprog will have context
argument either for kprobe (if using bpf_user_pt_regs_t struct name) or
perf_event (with bpf_perf_event_data struct name), but not both.

This limitation was the reason to add btf_decl_tag("arg:ctx"), making
the actual argument type not important, so that user can just define
"generic" signature:

  __noinline int global_subprog(void *ctx __arg_ctx) { ... }

I won't belabor how libbpf is implementing subprograms, see a huge
comment next to bpf_object_relocate_calls() function. The idea is that
each main/entry BPF program gets its own copy of global_subprog's code
appended.

This per-program copy of global subprog code *and* associated func_info
.BTF.ext information, pointing to FUNC -> FUNC_PROTO BTF type chain
allows libbpf to simulate __arg_ctx behavior transparently, even if the
kernel doesn't yet support __arg_ctx annotation natively.

The idea is straightforward: each time we append global subprog's code
and func_info information, we adjust its FUNC -> FUNC_PROTO type
information, if necessary (that is, libbpf can detect the presence of
btf_decl_tag("arg:ctx") just like BPF verifier would do it).

The rest is just mechanical and somewhat painful BTF manipulation code.
It's painful because we need to clone FUNC -> FUNC_PROTO, instead of
reusing it, as same FUNC -> FUNC_PROTO chain might be used by another
main BPF program within the same BPF object, so we can't just modify it
in-place (and cloning BTF types within the same struct btf object is
painful due to constant memory invalidation, see comments in code).
Uploaded BPF object's BTF information has to work for all BPF
programs at the same time.

Once we have FUNC -> FUNC_PROTO clones, we make sure that instead of
using some `void *ctx` parameter definition, we have an expected `struct
bpf_perf_event_data *ctx` definition (as far as BPF verifier and kernel
is concerned), which will mark it as context for BPF verifier. Same
global subprog relocated and copied into another main BPF program will
get different type information according to main program's type. It all
works out in the end in a completely transparent way for end user.

Libbpf maintains internal program type -> expected context struct name
mapping internally. Note, not all BPF program types have named context
struct, so this approach won't work for such programs (just like it
didn't before __arg_ctx). So native __arg_ctx is still important to have
in kernel to have generic context support across all BPF program types.

Acked-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240104013847.3875810-8-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
21 months agolibbpf: move BTF loading step after relocation step
Andrii Nakryiko [Thu, 4 Jan 2024 01:38:44 +0000 (17:38 -0800)]
libbpf: move BTF loading step after relocation step

With all the preparations in previous patches done we are ready to
postpone BTF loading and sanitization step until after all the
relocations are performed.

Acked-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240104013847.3875810-7-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
21 months agolibbpf: move exception callbacks assignment logic into relocation step
Andrii Nakryiko [Thu, 4 Jan 2024 01:38:43 +0000 (17:38 -0800)]
libbpf: move exception callbacks assignment logic into relocation step

Move the logic of finding and assigning exception callback indices from
BTF sanitization step to program relocations step, which seems more
logical and will unblock moving BTF loading to after relocation step.

Exception callbacks discovery and assignment has no dependency on BTF
being loaded into the kernel, it only uses BTF information. It does need
to happen before subprogram relocations happen, though. Which is why the
split.

No functional changes.

Acked-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240104013847.3875810-6-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
21 months agolibbpf: use stable map placeholder FDs
Andrii Nakryiko [Thu, 4 Jan 2024 01:38:42 +0000 (17:38 -0800)]
libbpf: use stable map placeholder FDs

Move map creation to later during BPF object loading by pre-creating
stable placeholder FDs (utilizing memfd_create()). Use dup2()
syscall to then atomically make those placeholder FDs point to real
kernel BPF map objects.

This change allows to delay BPF map creation to after all the BPF
program relocations. That, in turn, allows to delay BTF finalization and
loading into kernel to after all the relocations as well. We'll take
advantage of the latter in subsequent patches to allow libbpf to adjust
BTF in a way that helps with BPF global function usage.

Clean up a few places where we close map->fd, which now shouldn't
happen, because map->fd should be a valid FD regardless of whether map
was created or not. Surprisingly and nicely it simplifies a bunch of
error handling code. If this change doesn't backfire, I'm tempted to
pre-create such stable FDs for other entities (progs, maybe even BTF).
We previously did some manipulations to make gen_loader work with fake
map FDs, with stable map FDs this hack is not necessary for maps (we
still have it for BTF, but I left it as is for now).

Acked-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240104013847.3875810-5-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
21 months agolibbpf: don't rely on map->fd as an indicator of map being created
Andrii Nakryiko [Thu, 4 Jan 2024 01:38:41 +0000 (17:38 -0800)]
libbpf: don't rely on map->fd as an indicator of map being created

With the upcoming switch to preallocated placeholder FDs for maps,
switch various getters/setter away from checking map->fd. Use
map_is_created() helper that detect whether BPF map can be modified based
on map->obj->loaded state, with special provision for maps set up with
bpf_map__reuse_fd().

For backwards compatibility, we take map_is_created() into account in
bpf_map__fd() getter as well. This way before bpf_object__load() phase
bpf_map__fd() will always return -1, just as before the changes in
subsequent patches adding stable map->fd placeholders.

We also get rid of all internal uses of bpf_map__fd() getter, as it's
more oriented for uses external to libbpf. The above map_is_created()
check actually interferes with some of the internal uses, if map FD is
fetched through bpf_map__fd().

Acked-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240104013847.3875810-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
21 months agolibbpf: use explicit map reuse flag to skip map creation steps
Andrii Nakryiko [Thu, 4 Jan 2024 01:38:40 +0000 (17:38 -0800)]
libbpf: use explicit map reuse flag to skip map creation steps

Instead of inferring whether map already point to previously
created/pinned BPF map (which user can specify with bpf_map__reuse_fd()) API),
use explicit map->reused flag that is set in such case.

Acked-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240104013847.3875810-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
21 months agolibbpf: make uniform use of btf__fd() accessor inside libbpf
Andrii Nakryiko [Thu, 4 Jan 2024 01:38:39 +0000 (17:38 -0800)]
libbpf: make uniform use of btf__fd() accessor inside libbpf

It makes future grepping and code analysis a bit easier.

Acked-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240104013847.3875810-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
21 months agoMerge branch 'bpf-reduce-memory-usage-for-bpf_global_percpu_ma'
Alexei Starovoitov [Thu, 4 Jan 2024 05:08:26 +0000 (21:08 -0800)]
Merge branch 'bpf-reduce-memory-usage-for-bpf_global_percpu_ma'

Yonghong Song says:

====================
bpf: Reduce memory usage for bpf_global_percpu_ma

Currently when a bpf program intends to allocate memory for percpu kptr,
the verifier will call bpf_mem_alloc_init() to prefill all supported
unit sizes and this caused memory consumption very big for large number
of cpus. For example, for 128-cpu system, the total memory consumption
with initial prefill is ~175MB. Things will become worse for systems
with even more cpus.

Patch 1 avoids unnecessary extra percpu memory allocation.
Patch 2 adds objcg to bpf_mem_alloc at init stage so objcg can be
associated with root cgroup and objcg can be passed to later
bpf_mem_alloc_percpu_unit_init().
Patch 3 addresses memory consumption issue by avoiding to prefill
with all unit sizes, i.e. only prefilling with user specified size.
Patch 4 further reduces memory consumption by limiting the
number of prefill entries for percpu memory allocation.
Patch 5 has much smaller low/high watermarks for percpu allocation
to reduce memory consumption.
Patch 6 rejects percpu memory allocation with bpf_global_percpu_ma
when allocation size is greater than 512 bytes.
Patch 7 fixed test_bpf_ma test due to Patch 5.
Patch 8 added one test to show the verification failure log message.

Changelogs:
  v5 -> v6:
    . Change bpf_mem_alloc_percpu_init() to add objcg as one of parameters.
      For bpf_global_percpu_ma, the objcg is NULL, corresponding root memcg.
  v4 -> v5:
    . Do not do bpf_global_percpu_ma initialization at init stage, instead
      doing initialization when the verifier knows it is going to be used
      by bpf prog.
    . Using much smaller low/high watermarks for percpu allocation.
  v3 -> v4:
    . Add objcg to bpf_mem_alloc during init stage.
    . Initialize objcg at init stage but use it in bpf_mem_alloc_percpu_unit_init().
    . Remove check_obj_size() in bpf_mem_alloc_percpu_unit_init().
  v2 -> v3:
    . Clear the bpf_mem_cache if prefill fails.
    . Change test_bpf_ma percpu allocation tests to use bucket_size
      as allocation size instead of bucket_size - 8.
    . Remove __GFP_ZERO flag from __alloc_percpu_gfp() call.
  v1 -> v2:
    . Avoid unnecessary extra percpu memory allocation.
    . Add a separate function to do bpf_global_percpu_ma initialization
    . promote.
    . Promote function static 'sizes' array to file static.
    . Add comments to explain to refill only one item for percpu alloc.
====================

Link: https://lore.kernel.org/r/20231222031729.1287957-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
21 months agoselftests/bpf: Add a selftest with > 512-byte percpu allocation size
Yonghong Song [Fri, 22 Dec 2023 03:18:12 +0000 (19:18 -0800)]
selftests/bpf: Add a selftest with > 512-byte percpu allocation size

Add a selftest to capture the verification failure when the allocation
size is greater than 512.

Acked-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231222031812.1293190-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
21 months agoselftests/bpf: Cope with 512 bytes limit with bpf_global_percpu_ma
Yonghong Song [Fri, 22 Dec 2023 03:18:07 +0000 (19:18 -0800)]
selftests/bpf: Cope with 512 bytes limit with bpf_global_percpu_ma

In the previous patch, the maximum data size for bpf_global_percpu_ma
is 512 bytes. This breaks selftest test_bpf_ma. The test is adjusted
in two aspects:
  - Since the maximum allowed data size for bpf_global_percpu_ma is
    512, remove all tests beyond that, names sizes 1024, 2048 and 4096.
  - Previously the percpu data size is bucket_size - 8 in order to
    avoid percpu allocation into the next bucket. This patch removed
    such data size adjustment thanks to Patch 1.

Also, a better way to generate BTF type is used than adding
a member to the value struct.

Acked-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231222031807.1292853-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
21 months agobpf: Limit up to 512 bytes for bpf_global_percpu_ma allocation
Yonghong Song [Fri, 22 Dec 2023 03:18:01 +0000 (19:18 -0800)]
bpf: Limit up to 512 bytes for bpf_global_percpu_ma allocation

For percpu data structure allocation with bpf_global_percpu_ma,
the maximum data size is 4K. But for a system with large
number of cpus, bigger data size (e.g., 2K, 4K) might consume
a lot of memory. For example, the percpu memory consumption
with unit size 2K and 1024 cpus will be 2K * 1K * 1k = 2GB
memory.

We should discourage such usage. Let us limit the maximum data
size to be 512 for bpf_global_percpu_ma allocation.

Acked-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231222031801.1290841-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
21 months agobpf: Use smaller low/high marks for percpu allocation
Yonghong Song [Fri, 22 Dec 2023 03:17:55 +0000 (19:17 -0800)]
bpf: Use smaller low/high marks for percpu allocation

Currently, refill low/high marks are set with the assumption
of normal non-percpu memory allocation. For example, for
an allocation size 256, for non-percpu memory allocation,
low mark is 32 and high mark is 96, resulting in the
batch allocation of 48 elements and the allocated memory
will be 48 * 256 = 12KB for this particular cpu.
Assuming an 128-cpu system, the total memory consumption
across all cpus will be 12K * 128 = 1.5MB memory.

This might be okay for non-percpu allocation, but may not be
good for percpu allocation, which will consume 1.5MB * 128 = 192MB
memory in the worst case if every cpu has a chance of memory
allocation.

In practice, percpu allocation is very rare compared to
non-percpu allocation. So let us have smaller low/high marks
which can avoid unnecessary memory consumption.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231222031755.1289671-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
21 months agobpf: Refill only one percpu element in memalloc
Yonghong Song [Fri, 22 Dec 2023 03:17:50 +0000 (19:17 -0800)]
bpf: Refill only one percpu element in memalloc

Typically for percpu map element or data structure, once allocated,
most operations are lookup or in-place update. Deletion are really
rare. Currently, for percpu data strcture, 4 elements will be
refilled if the size is <= 256. Let us just do with one element
for percpu data. For example, for size 256 and 128 cpus, the
potential saving will be 3 * 256 * 128 * 128 = 12MB.

Acked-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231222031750.1289290-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
21 months agobpf: Allow per unit prefill for non-fix-size percpu memory allocator
Yonghong Song [Fri, 22 Dec 2023 03:17:45 +0000 (19:17 -0800)]
bpf: Allow per unit prefill for non-fix-size percpu memory allocator

Commit 41a5db8d8161 ("Add support for non-fix-size percpu mem allocation")
added support for non-fix-size percpu memory allocation.
Such allocation will allocate percpu memory for all buckets on all
cpus and the memory consumption is in the order to quadratic.
For example, let us say, 4 cpus, unit size 16 bytes, so each
cpu has 16 * 4 = 64 bytes, with 4 cpus, total will be 64 * 4 = 256 bytes.
Then let us say, 8 cpus with the same unit size, each cpu
has 16 * 8 = 128 bytes, with 8 cpus, total will be 128 * 8 = 1024 bytes.
So if the number of cpus doubles, the number of memory consumption
will be 4 times. So for a system with large number of cpus, the
memory consumption goes up quickly with quadratic order.
For example, for 4KB percpu allocation, 128 cpus. The total memory
consumption will 4KB * 128 * 128 = 64MB. Things will become
worse if the number of cpus is bigger (e.g., 512, 1024, etc.)

In Commit 41a5db8d8161, the non-fix-size percpu memory allocation is
done in boot time, so for system with large number of cpus, the initial
percpu memory consumption is very visible. For example, for 128 cpu
system, the total percpu memory allocation will be at least
(16 + 32 + 64 + 96 + 128 + 196 + 256 + 512 + 1024 + 2048 + 4096)
  * 128 * 128 = ~138MB.
which is pretty big. It will be even bigger for larger number of cpus.

Note that the current prefill also allocates 4 entries if the unit size
is less than 256. So on top of 138MB memory consumption, this will
add more consumption with
3 * (16 + 32 + 64 + 96 + 128 + 196 + 256) * 128 * 128 = ~38MB.
Next patch will try to reduce this memory consumption.

Later on, Commit 1fda5bb66ad8 ("bpf: Do not allocate percpu memory
at init stage") moved the non-fix-size percpu memory allocation
to bpf verificaiton stage. Once a particular bpf_percpu_obj_new()
is called by bpf program, the memory allocator will try to fill in
the cache with all sizes, causing the same amount of percpu memory
consumption as in the boot stage.

To reduce the initial percpu memory consumption for non-fix-size
percpu memory allocation, instead of filling the cache with all
supported allocation sizes, this patch intends to fill the cache
only for the requested size. As typically users will not use large
percpu data structure, this can save memory significantly.
For example, the allocation size is 64 bytes with 128 cpus.
Then total percpu memory amount will be 64 * 128 * 128 = 1MB,
much less than previous 138MB.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231222031745.1289082-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
21 months agobpf: Add objcg to bpf_mem_alloc
Yonghong Song [Fri, 22 Dec 2023 03:17:39 +0000 (19:17 -0800)]
bpf: Add objcg to bpf_mem_alloc

The objcg is a bpf_mem_alloc level property since all bpf_mem_cache's
are with the same objcg. This patch made such a property explicit.
The next patch will use this property to save and restore objcg
for percpu unit allocator.

Acked-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231222031739.1288590-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
21 months agobpf: Avoid unnecessary extra percpu memory allocation
Yonghong Song [Fri, 22 Dec 2023 03:17:34 +0000 (19:17 -0800)]
bpf: Avoid unnecessary extra percpu memory allocation

Currently, for percpu memory allocation, say if the user
requests allocation size to be 32 bytes, the actually
calculated size will be 40 bytes and it further rounds
to 64 bytes, and eventually 64 bytes are allocated,
wasting 32-byte memory.

Change bpf_mem_alloc() to calculate the cache index
based on the user-provided allocation size so unnecessary
extra memory can be avoided.

Suggested-by: Hou Tao <houtao1@huawei.com>
Acked-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231222031734.1288400-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
21 months agoMerge branch 'fix sockmap + stream af_unix memleak'
Martin KaFai Lau [Thu, 4 Jan 2024 00:34:32 +0000 (16:34 -0800)]
Merge branch 'fix sockmap + stream  af_unix memleak'

John Fastabend says:

====================
There was a memleak when streaming af_unix sockets were inserted into
multiple sockmap slots and/or maps. This is because each insert would
call a proto update operatino and these must be allowed to be called
multiple times. The streaming af_unix implementation recently added
a refcnt to handle a use after free issue, however it introduced a
memleak when inserted into multiple maps.

This series fixes the memleak, adds a note in the code so we remember
that proto updates need to support this. And then we add three tests
for each of the slightly different iterations of adding sockets into
multiple maps. I kept them as 3 independent test cases here. I have
some slight preference for this they could however be a single test,
but then you don't get to run them independently which was sort of
useful while debugging.
====================

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
21 months agobpf: sockmap, add tests for proto updates replace socket
John Fastabend [Thu, 21 Dec 2023 23:23:27 +0000 (15:23 -0800)]
bpf: sockmap, add tests for proto updates replace socket

Add test that replaces the same socket with itself. This exercises a
corner case where old element and new element have the same posck.
Test protocols: TCP, UDP, stream af_unix and dgram af_unix.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/r/20231221232327.43678-6-john.fastabend@gmail.com
21 months agobpf: sockmap, add tests for proto updates single socket to many map
John Fastabend [Thu, 21 Dec 2023 23:23:26 +0000 (15:23 -0800)]
bpf: sockmap, add tests for proto updates single socket to many map

Add test with multiple maps where each socket is inserted in multiple
maps. Test protocols: TCP, UDP, stream af_unix and dgram af_unix.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/r/20231221232327.43678-5-john.fastabend@gmail.com
21 months agobpf: sockmap, add tests for proto updates many to single map
John Fastabend [Thu, 21 Dec 2023 23:23:25 +0000 (15:23 -0800)]
bpf: sockmap, add tests for proto updates many to single map

Add test with a single map where each socket is inserted multiple
times. Test protocols: TCP, UDP, stream af_unix and dgram af_unix.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/r/20231221232327.43678-4-john.fastabend@gmail.com
21 months agobpf: sockmap, added comments describing update proto rules
John Fastabend [Thu, 21 Dec 2023 23:23:24 +0000 (15:23 -0800)]
bpf: sockmap, added comments describing update proto rules

Add a comment describing that the psock update proto callbback can be
called multiple times and this must be safe.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/r/20231221232327.43678-3-john.fastabend@gmail.com
21 months agobpf: sockmap, fix proto update hook to avoid dup calls
John Fastabend [Thu, 21 Dec 2023 23:23:23 +0000 (15:23 -0800)]
bpf: sockmap, fix proto update hook to avoid dup calls

When sockets are added to a sockmap or sockhash we allocate and init a
psock. Then update the proto ops with sock_map_init_proto the flow is

  sock_hash_update_common
    sock_map_link
      psock = sock_map_psock_get_checked() <-returns existing psock
      sock_map_init_proto(sk, psock)       <- updates sk_proto

If the socket is already in a map this results in the sock_map_init_proto
being called multiple times on the same socket. We do this because when
a socket is added to multiple maps this might result in a new set of BPF
programs being attached to the socket requiring an updated ops struct.

This creates a rule where it must be safe to call psock_update_sk_prot
multiple times. When we added a fix for UAF through unix sockets in patch
4dd9a38a753fc we broke this rule by adding a sock_hold in that path
to ensure the sock is not released. The result is if a af_unix stream sock
is placed in multiple maps it results in a memory leak because we call
sock_hold multiple times with only a single sock_put on it.

Fixes: 8866730aed51 ("bpf, sockmap: af_unix stream sockets need to hold ref for pair sock")
Reported-by: Xingwei Lee <xrivendell7@gmail.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/r/20231221232327.43678-2-john.fastabend@gmail.com
21 months agoMerge branch 'bpf-volatile-compare'
Andrii Nakryiko [Wed, 3 Jan 2024 18:41:22 +0000 (10:41 -0800)]
Merge branch 'bpf-volatile-compare'

Alexei Starovoitov says:

====================
bpf: volatile compare

From: Alexei Starovoitov <ast@kernel.org>

v2->v3:
Debugged profiler.c regression. It was caused by basic block layout.
Introduce bpf_cmp_likely() and bpf_cmp_unlikely() macros.
Debugged redundant <<=32, >>=32 with u32 variables. Added cast workaround.

v1->v2:
Fixed issues pointed out by Daniel, added more tests, attempted to convert profiler.c,
but barrier_var() wins vs bpf_cmp(). To be investigated.
Patches 1-4 are good to go, but 5 needs more work.
====================

Link: https://lore.kernel.org/r/20231226191148.48536-1-alexei.starovoitov@gmail.com
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
21 months agoselftests/bpf: Convert profiler.c to bpf_cmp.
Alexei Starovoitov [Tue, 26 Dec 2023 19:11:48 +0000 (11:11 -0800)]
selftests/bpf: Convert profiler.c to bpf_cmp.

Convert profiler[123].c to "volatile compare" to compare barrier_var() approach vs bpf_cmp_likely() vs bpf_cmp_unlikely().

bpf_cmp_unlikely() produces correct code, but takes much longer to verify:

./veristat -C -e prog,insns,states before after_with_unlikely
Program                               Insns (A)  Insns (B)  Insns       (DIFF)  States (A)  States (B)  States     (DIFF)
------------------------------------  ---------  ---------  ------------------  ----------  ----------  -----------------
kprobe__proc_sys_write                     1603      19606  +18003 (+1123.08%)         123        1678  +1555 (+1264.23%)
kprobe__vfs_link                          11815      70305   +58490 (+495.05%)         971        4967   +3996 (+411.53%)
kprobe__vfs_symlink                        5464      42896   +37432 (+685.07%)         434        3126   +2692 (+620.28%)
kprobe_ret__do_filp_open                   5641      44578   +38937 (+690.25%)         446        3162   +2716 (+608.97%)
raw_tracepoint__sched_process_exec         2770      35962  +33192 (+1198.27%)         226        3121  +2895 (+1280.97%)
raw_tracepoint__sched_process_exit         1526       2135      +609 (+39.91%)         133         208      +75 (+56.39%)
raw_tracepoint__sched_process_fork          265        337       +72 (+27.17%)          19          24       +5 (+26.32%)
tracepoint__syscalls__sys_enter_kill      18782     140407  +121625 (+647.56%)        1286       12176  +10890 (+846.81%)

bpf_cmp_likely() is equivalent to barrier_var():

./veristat -C -e prog,insns,states before after_with_likely
Program                               Insns (A)  Insns (B)  Insns   (DIFF)  States (A)  States (B)  States (DIFF)
------------------------------------  ---------  ---------  --------------  ----------  ----------  -------------
kprobe__proc_sys_write                     1603       1663    +60 (+3.74%)         123         127    +4 (+3.25%)
kprobe__vfs_link                          11815      12090   +275 (+2.33%)         971         971    +0 (+0.00%)
kprobe__vfs_symlink                        5464       5448    -16 (-0.29%)         434         426    -8 (-1.84%)
kprobe_ret__do_filp_open                   5641       5739    +98 (+1.74%)         446         446    +0 (+0.00%)
raw_tracepoint__sched_process_exec         2770       2608   -162 (-5.85%)         226         216   -10 (-4.42%)
raw_tracepoint__sched_process_exit         1526       1526     +0 (+0.00%)         133         133    +0 (+0.00%)
raw_tracepoint__sched_process_fork          265        265     +0 (+0.00%)          19          19    +0 (+0.00%)
tracepoint__syscalls__sys_enter_kill      18782      18970   +188 (+1.00%)        1286        1286    +0 (+0.00%)
kprobe__proc_sys_write                     2700       2809   +109 (+4.04%)         107         109    +2 (+1.87%)
kprobe__vfs_link                          12238      12366   +128 (+1.05%)         267         269    +2 (+0.75%)
kprobe__vfs_symlink                        7139       7365   +226 (+3.17%)         167         175    +8 (+4.79%)
kprobe_ret__do_filp_open                   7264       7070   -194 (-2.67%)         180         182    +2 (+1.11%)
raw_tracepoint__sched_process_exec         3768       3453   -315 (-8.36%)         211         199   -12 (-5.69%)
raw_tracepoint__sched_process_exit         3138       3138     +0 (+0.00%)          83          83    +0 (+0.00%)
raw_tracepoint__sched_process_fork          265        265     +0 (+0.00%)          19          19    +0 (+0.00%)
tracepoint__syscalls__sys_enter_kill      26679      24327  -2352 (-8.82%)        1067        1037   -30 (-2.81%)
kprobe__proc_sys_write                     1833       1833     +0 (+0.00%)         157         157    +0 (+0.00%)
kprobe__vfs_link                           9995      10127   +132 (+1.32%)         803         803    +0 (+0.00%)
kprobe__vfs_symlink                        5606       5672    +66 (+1.18%)         451         451    +0 (+0.00%)
kprobe_ret__do_filp_open                   5716       5782    +66 (+1.15%)         462         462    +0 (+0.00%)
raw_tracepoint__sched_process_exec         3042       3042     +0 (+0.00%)         278         278    +0 (+0.00%)
raw_tracepoint__sched_process_exit         1680       1680     +0 (+0.00%)         146         146    +0 (+0.00%)
raw_tracepoint__sched_process_fork          299        299     +0 (+0.00%)          25          25    +0 (+0.00%)
tracepoint__syscalls__sys_enter_kill      18372      18372     +0 (+0.00%)        1558        1558    +0 (+0.00%)

default (mcpu=v3), no_alu32, cpuv4 have similar differences.

Note one place where bpf_nop_mov() is used to workaround the verifier lack of link
between the scalar register and its spill to stack.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20231226191148.48536-7-alexei.starovoitov@gmail.com
21 months agobpf: Add bpf_nop_mov() asm macro.
Alexei Starovoitov [Tue, 26 Dec 2023 19:11:47 +0000 (11:11 -0800)]
bpf: Add bpf_nop_mov() asm macro.

bpf_nop_mov(var) asm macro emits nop register move: rX = rX.
If 'var' is a scalar and not a fixed constant the verifier will assign ID to it.
If it's later spilled the stack slot will carry that ID as well.
Hence the range refining comparison "if rX < const" will update all copies
including spilled slot.
This macro is a temporary workaround until the verifier gets smarter.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20231226191148.48536-6-alexei.starovoitov@gmail.com
21 months agoselftests/bpf: Remove bpf_assert_eq-like macros.
Alexei Starovoitov [Tue, 26 Dec 2023 19:11:46 +0000 (11:11 -0800)]
selftests/bpf: Remove bpf_assert_eq-like macros.

Since the last user was converted to bpf_cmp, remove bpf_assert_eq/ne/... macros.

__bpf_assert_op() macro is kept for experiments, since it's slightly more efficient
than bpf_assert(bpf_cmp_unlikely()) until LLVM is fixed.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/bpf/20231226191148.48536-5-alexei.starovoitov@gmail.com
21 months agoselftests/bpf: Convert exceptions_assert.c to bpf_cmp
Alexei Starovoitov [Tue, 26 Dec 2023 19:11:45 +0000 (11:11 -0800)]
selftests/bpf: Convert exceptions_assert.c to bpf_cmp

Convert exceptions_assert.c to bpf_cmp_unlikely() macro.

Since

bpf_assert(bpf_cmp_unlikely(var, ==, 100));
other code;

will generate assembly code:

  if r1 == 100 goto L2;
  r0 = 0
  call bpf_throw
L1:
  other code;
  ...

L2: goto L1;

LLVM generates redundant basic block with extra goto. LLVM will be fixed eventually.
Right now it's less efficient than __bpf_assert(var, ==, 100) macro that produces:
  if r1 == 100 goto L1;
  r0 = 0
  call bpf_throw
L1:
  other code;

But extra goto doesn't hurt the verification process.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/bpf/20231226191148.48536-4-alexei.starovoitov@gmail.com
21 months agobpf: Introduce "volatile compare" macros
Alexei Starovoitov [Tue, 26 Dec 2023 19:11:44 +0000 (11:11 -0800)]
bpf: Introduce "volatile compare" macros

Compilers optimize conditional operators at will, but often bpf programmers
want to force compilers to keep the same operator in asm as it's written in C.
Introduce bpf_cmp_likely/unlikely(var1, conditional_op, var2) macros that can be used as:

-               if (seen >= 1000)
+               if (bpf_cmp_unlikely(seen, >=, 1000))

The macros take advantage of BPF assembly that is C like.

The macros check the sign of variable 'seen' and emits either
signed or unsigned compare.

For example:
int a;
bpf_cmp_unlikely(a, >, 0) will be translated to 'if rX s> 0 goto' in BPF assembly.

unsigned int a;
bpf_cmp_unlikely(a, >, 0) will be translated to 'if rX > 0 goto' in BPF assembly.

C type conversions coupled with comparison operator are tricky.
  int i = -1;
  unsigned int j = 1;
  if (i < j) // this is false.

  long i = -1;
  unsigned int j = 1;
  if (i < j) // this is true.

Make sure BPF program is compiled with -Wsign-compare then the macros will catch
the mistake.

The macros check LHS (left hand side) only to figure out the sign of compare.

'if 0 < rX goto' is not allowed in the assembly, so the users
have to use a variable on LHS anyway.

The patch updates few tests to demonstrate the use of the macros.

The macro allows to use BPF_JSET in C code, since LLVM doesn't generate it at
present. For example:

if (i & j) compiles into r0 &= r1; if r0 == 0 goto

while

if (bpf_cmp_unlikely(i, &, j)) compiles into if r0 & r1 goto

Note that the macros has to be careful with RHS assembly predicate.
Since:
u64 __rhs = 1ull << 42;
asm goto("if r0 < %[rhs] goto +1" :: [rhs] "ri" (__rhs));
LLVM will silently truncate 64-bit constant into s32 imm.

Note that [lhs] "r"((short)LHS) the type cast is a workaround for LLVM issue.
When LHS is exactly 32-bit LLVM emits redundant <<=32, >>=32 to zero upper 32-bits.
When LHS is 64 or 16 or 8-bit variable there are no shifts.
When LHS is 32-bit the (u64) cast doesn't help. Hence use (short) cast.
It does _not_ truncate the variable before it's assigned to a register.

Traditional likely()/unlikely() macros that use __builtin_expect(!!(x), 1 or 0)
have no effect on these macros, hence macros implement the logic manually.
bpf_cmp_unlikely() macro preserves compare operator as-is while
bpf_cmp_likely() macro flips the compare.

Consider two cases:
A.
  for() {
    if (foo >= 10) {
      bar += foo;
    }
    other code;
  }

B.
  for() {
    if (foo >= 10)
       break;
    other code;
  }

It's ok to use either bpf_cmp_likely or bpf_cmp_unlikely macros in both cases,
but consider that 'break' is effectively 'goto out_of_the_loop'.
Hence it's better to use bpf_cmp_unlikely in the B case.
While 'bar += foo' is better to keep as 'fallthrough' == likely code path in the A case.

When it's written as:
A.
  for() {
    if (bpf_cmp_likely(foo, >=, 10)) {
      bar += foo;
    }
    other code;
  }

B.
  for() {
    if (bpf_cmp_unlikely(foo, >=, 10))
       break;
    other code;
  }

The assembly will look like:
A.
  for() {
    if r1 < 10 goto L1;
      bar += foo;
  L1:
    other code;
  }

B.
  for() {
    if r1 >= 10 goto L2;
    other code;
  }
  L2:

The bpf_cmp_likely vs bpf_cmp_unlikely changes basic block layout, hence it will
greatly influence the verification process. The number of processed instructions
will be different, since the verifier walks the fallthrough first.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/bpf/20231226191148.48536-3-alexei.starovoitov@gmail.com
21 months agoselftests/bpf: Attempt to build BPF programs with -Wsign-compare
Alexei Starovoitov [Tue, 26 Dec 2023 19:11:43 +0000 (11:11 -0800)]
selftests/bpf: Attempt to build BPF programs with -Wsign-compare

GCC's -Wall includes -Wsign-compare while clang does not.
Since BPF programs are built with clang we need to add this flag explicitly
to catch problematic comparisons like:

  int i = -1;
  unsigned int j = 1;
  if (i < j) // this is false.

  long i = -1;
  unsigned int j = 1;
  if (i < j) // this is true.

C standard for reference:

- If either operand is unsigned long the other shall be converted to unsigned long.

- Otherwise, if one operand is a long int and the other unsigned int, then if a
long int can represent all the values of an unsigned int, the unsigned int
shall be converted to a long int; otherwise both operands shall be converted to
unsigned long int.

- Otherwise, if either operand is long, the other shall be converted to long.

- Otherwise, if either operand is unsigned, the other shall be converted to unsigned.

Unfortunately clang's -Wsign-compare is very noisy.
It complains about (s32)a == (u32)b which is safe and doen't have surprising behavior.

This patch fixes some of the issues. It needs a follow up to fix the rest.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/bpf/20231226191148.48536-2-alexei.starovoitov@gmail.com
21 months agoMerge branch 'bpf-simplify-checking-size-of-helper-accesses'
Andrii Nakryiko [Wed, 3 Jan 2024 18:37:57 +0000 (10:37 -0800)]
Merge branch 'bpf-simplify-checking-size-of-helper-accesses'

Andrei Matei says:

====================
bpf: Simplify checking size of helper accesses

v3->v4:
- kept only the minimal change, undoing debatable changes (Andrii)
- dropped the second patch from before, with changes to the error
  message (Andrii)
- extracted the new test into a separate patch (Andrii)
- added Acked by Andrii

v2->v3:
- split the error-logging function to a separate patch (Andrii)
- make the error buffers smaller (Andrii)
- include size of memory region for PTR_TO_MEM (Andrii)
- nits from Andrii and Eduard

v1->v2:
- make the error message include more info about the context of the
  zero-sized access (Andrii)
====================

Link: https://lore.kernel.org/r/20231221232225.568730-1-andreimatei1@gmail.com
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
21 months agobpf: Add a possibly-zero-sized read test
Andrei Matei [Thu, 21 Dec 2023 23:22:25 +0000 (18:22 -0500)]
bpf: Add a possibly-zero-sized read test

This patch adds a test for the condition that the previous patch mucked
with - illegal zero-sized helper memory access. As opposed to existing
tests, this new one uses a size whose lower bound is zero, as opposed to
a known-zero one.

Signed-off-by: Andrei Matei <andreimatei1@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20231221232225.568730-3-andreimatei1@gmail.com
21 months agobpf: Simplify checking size of helper accesses
Andrei Matei [Thu, 21 Dec 2023 23:22:24 +0000 (18:22 -0500)]
bpf: Simplify checking size of helper accesses

This patch simplifies the verification of size arguments associated to
pointer arguments to helpers and kfuncs. Many helpers take a pointer
argument followed by the size of the memory access performed to be
performed through that pointer. Before this patch, the handling of the
size argument in check_mem_size_reg() was confusing and wasteful: if the
size register's lower bound was 0, then the verification was done twice:
once considering the size of the access to be the lower-bound of the
respective argument, and once considering the upper bound (even if the
two are the same). The upper bound checking is a super-set of the
lower-bound checking(*), except: the only point of the lower-bound check
is to handle the case where zero-sized-accesses are explicitly not
allowed and the lower-bound is zero. This static condition is now
checked explicitly, replacing a much more complex, expensive and
confusing verification call to check_helper_mem_access().

Error messages change in this patch. Before, messages about illegal
zero-size accesses depended on the type of the pointer and on other
conditions, and sometimes the message was plain wrong: in some tests
that changed you'll see that the old message was something like "R1 min
value is outside of the allowed memory range", where R1 is the pointer
register; the error was wrongly claiming that the pointer was bad
instead of the size being bad. Other times the information that the size
came for a register with a possible range of values was wrong, and the
error presented the size as a fixed zero. Now the errors refer to the
right register. However, the old error messages did contain useful
information about the pointer register which is now lost; recovering
this information was deemed not important enough.

(*) Besides standing to reason that the checks for a bigger size access
are a super-set of the checks for a smaller size access, I have also
mechanically verified this by reading the code for all types of
pointers. I could convince myself that it's true for all but
PTR_TO_BTF_ID (check_ptr_to_btf_access). There, simply looking
line-by-line does not immediately prove what we want. If anyone has any
qualms, let me know.

Signed-off-by: Andrei Matei <andreimatei1@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20231221232225.568730-2-andreimatei1@gmail.com
21 months agonet/sched: cls_api: complement tcf_tfilter_dump_policy
Lin Ma [Thu, 28 Dec 2023 06:43:58 +0000 (14:43 +0800)]
net/sched: cls_api: complement tcf_tfilter_dump_policy

In function `tc_dump_tfilter`, the attributes array is parsed via
tcf_tfilter_dump_policy which only describes TCA_DUMP_FLAGS. However,
the NLA TCA_CHAIN is also accessed with `nla_get_u32`.

The access to TCA_CHAIN is introduced in commit 5bc1701881e3 ("net:
sched: introduce multichain support for filters") and no nla_policy is
provided for parsing at that point. Later on, tcf_tfilter_dump_policy is
introduced in commit f8ab1807a9c9 ("net: sched: introduce terse dump
flag") while still ignoring the fact that TCA_CHAIN needs a check. This
patch does that by complementing the policy to allow the access
discussed here can be safe as other cases just choose rtm_tca_policy as
the parsing policy.

Signed-off-by: Lin Ma <linma@zju.edu.cn>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
21 months agoppp: Fix spelling typo in comment in ppp_async_encode()
liyouhong [Wed, 27 Dec 2023 01:58:31 +0000 (09:58 +0800)]
ppp: Fix spelling typo in comment in ppp_async_encode()

Fix spelling typo in comment

Reported-by: k2ci <kernel-bot@kylinos.cn>
Signed-off-by: liyouhong <liyouhong@kylinos.cn>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20231227015831.289077-1-liyouhong@kylinos.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
21 months agonet: ethtool: Fix symmetric-xor RSS RX flow hash check
Gerhard Engleder [Tue, 26 Dec 2023 20:55:36 +0000 (21:55 +0100)]
net: ethtool: Fix symmetric-xor RSS RX flow hash check

Commit 13e59344fb9d ("net: ethtool: add support for symmetric-xor RSS hash")
adds a check to the ethtool set_rxnfc operation, which checks the RX
flow hash if the flag RXH_XFRM_SYM_XOR is set. This flag is introduced
with the same commit. It calls the ethtool get_rxfh operation to get the
RX flow hash data. If get_rxfh is not supported, then EOPNOTSUPP is
returned.

There are driver like tsnep, macb, asp2, genet, gianfar, mtk, ... which
support the ethtool operation set_rxnfc but not get_rxfh. This results
in EOPNOTSUPP returned by ethtool_set_rxnfc() without actually calling
the ethtool operation set_rxnfc. Thus, set_rxnfc got broken for all
these drivers.

Check RX flow hash in ethtool_set_rxnfc() only if driver supports RX
flow hash.

Fixes: 13e59344fb9d ("net: ethtool: add support for symmetric-xor RSS hash")
Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Reviewed-by: Ravi Gunasekaran <r-gunasekaran@ti.com>
Link: https://lore.kernel.org/r/20231226205536.32003-1-gerhard@engleder-embedded.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
21 months agoMerge branch 'bug-fixes-for-rss-symmetric-xor'
Jakub Kicinski [Wed, 3 Jan 2024 00:00:08 +0000 (16:00 -0800)]
Merge branch 'bug-fixes-for-rss-symmetric-xor'

Ahmed Zaki says:

====================
Bug fixes for RSS symmetric-xor

A couple of fixes for the symmetric-xor recently merged in net-next [1].

The first patch copies the xfrm value back to user-space when ethtool is
built with --disable-netlink. The second allows ethtool to change other
RSS attributes while not changing the xfrm values.

Link: https://lore.kernel.org/netdev/20231213003321.605376-1-ahmed.zaki@intel.com/
====================

Link: https://lore.kernel.org/r/20231221184235.9192-1-ahmed.zaki@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
21 months agonet: ethtool: add a NO_CHANGE uAPI for new RXFH's input_xfrm
Ahmed Zaki [Thu, 21 Dec 2023 18:42:35 +0000 (11:42 -0700)]
net: ethtool: add a NO_CHANGE uAPI for new RXFH's input_xfrm

Add a NO_CHANGE uAPI value for the new RXFH/RSS input_xfrm uAPI field.
This needed so that user-space can set other RSS values (hkey or indir
table) without affecting input_xfrm.

Should have been part of [1].

Link: https://lore.kernel.org/netdev/20231213003321.605376-1-ahmed.zaki@intel.com/
Fixes: 13e59344fb9d ("net: ethtool: add support for symmetric-xor RSS hash")
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Ahmed Zaki <ahmed.zaki@intel.com>
Link: https://lore.kernel.org/r/20231221184235.9192-3-ahmed.zaki@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
21 months agonet: ethtool: copy input_xfrm to user-space in ethtool_get_rxfh
Ahmed Zaki [Thu, 21 Dec 2023 18:42:34 +0000 (11:42 -0700)]
net: ethtool: copy input_xfrm to user-space in ethtool_get_rxfh

The ioctl path of ethtool's get channels is missing the final step of
copying the new input_xfrm field to user-space. This should have been
part of [1].

Link: https://lore.kernel.org/netdev/20231213003321.605376-1-ahmed.zaki@intel.com/
Fixes: 13e59344fb9d ("net: ethtool: add support for symmetric-xor RSS hash")
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Ahmed Zaki <ahmed.zaki@intel.com>
Link: https://lore.kernel.org/r/20231221184235.9192-2-ahmed.zaki@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
21 months agoxsk: make struct xsk_cb_desc available outside CONFIG_XDP_SOCKETS
Vladimir Oltean [Tue, 19 Dec 2023 11:02:05 +0000 (13:02 +0200)]
xsk: make struct xsk_cb_desc available outside CONFIG_XDP_SOCKETS

The ice driver fails to build when CONFIG_XDP_SOCKETS is disabled.

drivers/net/ethernet/intel/ice/ice_base.c:533:21: error:
variable has incomplete type 'struct xsk_cb_desc'
        struct xsk_cb_desc desc = {};
                           ^
include/net/xsk_buff_pool.h:15:8: note:
forward declaration of 'struct xsk_cb_desc'
struct xsk_cb_desc;
       ^

Fixes: d68d707dcbbf ("ice: Support XDP hints in AF_XDP ZC mode")
Closes: https://lore.kernel.org/netdev/8b76dad3-8847-475b-aa17-613c9c978f7a@infradead.org/
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Larysa Zaremba <larysa.zaremba@intel.com>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
Link: https://lore.kernel.org/r/20231219110205.1289506-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
21 months agoRevert "net: mdio: get/put device node during (un)registration"
Jakub Kicinski [Tue, 2 Jan 2024 22:23:34 +0000 (14:23 -0800)]
Revert "net: mdio: get/put device node during (un)registration"

This reverts commit cff9c565e65f3622e8dc1dcc21c1520a083dff35.

Revert based on feedback from Russell.

Link: https://lore.kernel.org/all/ZZPtUIRerqTI2%2Fyh@shell.armlinux.org.uk/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
21 months agoMerge branch 'renesas-rzg3s-add-support-for-ethernet'
David S. Miller [Tue, 2 Jan 2024 14:25:51 +0000 (14:25 +0000)]
Merge branch 'renesas-rzg3s-add-support-for-ethernet'

Claudiu Beznea says:

====================
renesas: rzg3s: Add support for Ethernet

Series adds Ethernet support for Renesas RZ/G3S.
Along with it preparatory cleanups and fixes were included.
====================

Link: https://lore.kernel.org/r/20231207070700.4156557-1-claudiu.beznea.uj@bp.renesas.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
21 months agodt-bindings: net: renesas,etheravb: Document RZ/G3S support
Claudiu Beznea [Thu, 7 Dec 2023 07:06:57 +0000 (09:06 +0200)]
dt-bindings: net: renesas,etheravb: Document RZ/G3S support

Document Ethernet RZ/G3S support. Ethernet IP is similar to the one
available on RZ/G2L devices.

Signed-off-by: Claudiu Beznea <claudiu.beznea.uj@bp.renesas.com>
Acked-by: Conor Dooley <conor.dooley@microchip.com>
Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
21 months agoMerge branch 'remove-retired-tc-uapi'
David S. Miller [Tue, 2 Jan 2024 14:25:51 +0000 (14:25 +0000)]
Merge branch 'remove-retired-tc-uapi'

Jamal Hadi Salim says:

====================
net/sched: Remove UAPI support for retired TC qdiscs and classifiers

Classifiers RSVP and tcindex as well as qdiscs dsmark, CBQ and ATM have already
been deleted. This patchset removes their UAPI support.

User space - with a focus on iproute2 - typically copies these UAPI headers for
different kernels.
These deletion patches are coordinated with the iproute2 maintainers to make
sure that they delete any user space code referencing removed objects at their
leisure.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
21 months agonet/sched: Remove uapi support for CBQ qdisc
Jamal Hadi Salim [Sat, 23 Dec 2023 14:01:54 +0000 (09:01 -0500)]
net/sched: Remove uapi support for CBQ qdisc

Commit 051d44209842 ("net/sched: Retire CBQ qdisc") retired the CBQ qdisc.
Remove UAPI for it. Iproute2 will sync by equally removing it from user space.

Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Reviewed-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
21 months agonet/sched: Remove uapi support for ATM qdisc
Jamal Hadi Salim [Sat, 23 Dec 2023 14:01:53 +0000 (09:01 -0500)]
net/sched: Remove uapi support for ATM qdisc

Commit fb38306ceb9e ("net/sched: Retire ATM qdisc") retired the ATM qdisc.
Remove UAPI for it. Iproute2 will sync by equally removing it from user space.

Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Reviewed-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
21 months agonet/sched: Remove uapi support for dsmark qdisc
Jamal Hadi Salim [Sat, 23 Dec 2023 14:01:52 +0000 (09:01 -0500)]
net/sched: Remove uapi support for dsmark qdisc

Commit bbe77c14ee61 ("net/sched: Retire dsmark qdisc") retired the dsmark
classifier. Remove UAPI support for it.
Iproute2 will sync by equally removing it from user space.

Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Reviewed-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
21 months agonet/sched: Remove uapi support for tcindex classifier
Jamal Hadi Salim [Sat, 23 Dec 2023 14:01:51 +0000 (09:01 -0500)]
net/sched: Remove uapi support for tcindex classifier

commit 8c710f75256b ("net/sched: Retire tcindex classifier") retired the TC
tcindex classifier.
Remove UAPI for it.  Iproute2 will sync by equally removing it from user space.

Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Reviewed-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
21 months agonet/sched: Remove uapi support for rsvp classifier
Jamal Hadi Salim [Sat, 23 Dec 2023 14:01:50 +0000 (09:01 -0500)]
net/sched: Remove uapi support for rsvp classifier

commit 265b4da82dbf ("net/sched: Retire rsvp classifier") retired the TC RSVP
classifier.
Remove UAPI for it. Iproute2 will sync by equally removing it from user space.

Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Reviewed-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
21 months agoMerge branch 'octeon_ep_vf-driver'
David S. Miller [Tue, 2 Jan 2024 14:19:54 +0000 (14:19 +0000)]
Merge branch 'octeon_ep_vf-driver'

Shinas Rasheed says:

====================
add octeon_ep_vf driver

This driver implements networking functionality of Marvell's Octeon
PCI Endpoint NIC VF.

This driver support following devices:
 * Network controller: Cavium, Inc. Device b203
 * Network controller: Cavium, Inc. Device b403
 * Network controller: Cavium, Inc. Device b103
 * Network controller: Cavium, Inc. Device b903
 * Network controller: Cavium, Inc. Device ba03
 * Network controller: Cavium, Inc. Device bc03
 * Network controller: Cavium, Inc. Device bd03

Changes:
V2:
  - Removed linux/version.h header file from inclusion in
    octep_vf_main.c
  - Corrected Makefile entry to include building octep_vf_mbox.c in
    [6/8] patch.
  - Removed redundant vzalloc pointer cast and vfree pointer check in
    [6/8] patch.

V1: https://lore.kernel.org/all/20231221092844.2885872-1-srasheed@marvell.com/
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
21 months agoocteon_ep_vf: update MAINTAINERS
Shinas Rasheed [Sat, 23 Dec 2023 13:40:00 +0000 (05:40 -0800)]
octeon_ep_vf: update MAINTAINERS

add MAINTAINERS for octeon_ep_vf driver.

Signed-off-by: Shinas Rasheed <srasheed@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
21 months agoocteon_ep_vf: add ethtool support
Shinas Rasheed [Sat, 23 Dec 2023 13:39:59 +0000 (05:39 -0800)]
octeon_ep_vf: add ethtool support

Add support for the following ethtool commands:

ethtool -i|--driver devname
ethtool devname
ethtool -S|--statistics devname

Signed-off-by: Shinas Rasheed <srasheed@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
21 months agoocteon_ep_vf: add Tx/Rx processing and interrupt support
Shinas Rasheed [Sat, 23 Dec 2023 13:39:58 +0000 (05:39 -0800)]
octeon_ep_vf: add Tx/Rx processing and interrupt support

Add support to enable MSI-x and register interrupts.
Add support to process Tx and Rx traffic. Includes processing
Tx completions and Rx refill.

Signed-off-by: Shinas Rasheed <srasheed@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
21 months agoocteon_ep_vf: add support for ndo ops
Shinas Rasheed [Sat, 23 Dec 2023 13:39:57 +0000 (05:39 -0800)]
octeon_ep_vf: add support for ndo ops

Add support for ndo ops to set MAC address, change MTU, get stats.
Add control path support to set MAC address, change MTU, get stats,
set speed, get and set link mode.

Signed-off-by: Shinas Rasheed <srasheed@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
21 months agoocteon_ep_vf: add Tx/Rx ring resource setup and cleanup
Shinas Rasheed [Sat, 23 Dec 2023 13:39:56 +0000 (05:39 -0800)]
octeon_ep_vf: add Tx/Rx ring resource setup and cleanup

Implement Tx/Rx ring resource allocation and cleanup.

Signed-off-by: Shinas Rasheed <srasheed@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
21 months agoocteon_ep_vf: add VF-PF mailbox communication.
Shinas Rasheed [Sat, 23 Dec 2023 13:39:55 +0000 (05:39 -0800)]
octeon_ep_vf: add VF-PF mailbox communication.

Implement VF-PF mailbox to send all control commands from VF to PF
and receive responses and notifications from PF to VF.

Signed-off-by: Shinas Rasheed <srasheed@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
21 months agoocteon_ep_vf: add hardware configuration APIs
Shinas Rasheed [Sat, 23 Dec 2023 13:39:54 +0000 (05:39 -0800)]
octeon_ep_vf: add hardware configuration APIs

Implement hardware resource init and shutdown helper APIs, like
hardware Tx/Rx queue init/enable/disable/reset.

Signed-off-by: Shinas Rasheed <srasheed@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
21 months agoocteon_ep_vf: Add driver framework and device initialization
Shinas Rasheed [Sat, 23 Dec 2023 13:39:53 +0000 (05:39 -0800)]
octeon_ep_vf: Add driver framework and device initialization

Add driver framework and device setup and initialization for Octeon
PCI Endpoint NIC VF.

Add implementation to load module, initialize, register network device,
cleanup and unload module.

Signed-off-by: Shinas Rasheed <srasheed@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
21 months agonet/ps3_gelic_net: Add gelic_descr structures
Geoff Levand [Sat, 23 Dec 2023 07:28:20 +0000 (16:28 +0900)]
net/ps3_gelic_net: Add gelic_descr structures

In an effort to make the PS3 gelic driver easier to maintain, create two
new structures, struct gelic_hw_regs and struct gelic_chain_link, and
replace the corresponding members of struct gelic_descr with the new
structures.

The new struct gelic_hw_regs holds the register variables used by the
gelic hardware device.  The new struct gelic_chain_link holds variables
used to manage the driver's linked list of gelic descr structures.

Signed-off-by: Geoff Levand <geoff@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
21 months agoMerge branch 'bnxt_en-ntuple-fuilter-support'
David S. Miller [Tue, 2 Jan 2024 13:52:28 +0000 (13:52 +0000)]
Merge branch 'bnxt_en-ntuple-fuilter-support'

Michael Chan says:

====================
bnxt_en: Add basic ntuple filter support

The current driver only supports ntuple filters added by aRFS.  This
patch series adds basic support for user defined TCP/UDP ntuple filters
added by the user using ethtool.  Many of the patches are refactoring
patches to make the existing code more general to support both aRFS
and user defined filters.  aRFS filters always have the Toeplitz hash
value from the NIC.  A Toepliz hash function is added in patch 5 to
get the same hash value for user defined filters.  The hash is used
to store all ntuple filters in the table and all filters must be
hashed identically using the same function and key.

v2: Fix compile error in patch #4 when CONFIG_BNXT_SRIOV is disabled.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
21 months agobnxt_en: Add support for ntuple filter deletion by ethtool.
Michael Chan [Sat, 23 Dec 2023 04:22:10 +0000 (20:22 -0800)]
bnxt_en: Add support for ntuple filter deletion by ethtool.

Add logic to delete a user specified ntuple filter from ethtool.

Reviewed-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
21 months agobnxt_en: Add support for ntuple filters added from ethtool.
Michael Chan [Sat, 23 Dec 2023 04:22:09 +0000 (20:22 -0800)]
bnxt_en: Add support for ntuple filters added from ethtool.

Add support for adding user defined ntuple TCP/UDP filters.  These
filters are similar to aRFS filters except that they don't get aged.
Source IP, destination IP, source port, or destination port can be
unspecifed as wildcard.  At least one of these tuples must be specifed.
If a tuple is specified, the full mask must be specified.

All ntuple related ethtool functions are now no longer compiled only
for CONFIG_RFS_ACCEL.

Reviewed-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
21 months agobnxt_en: Add ntuple matching flags to the bnxt_ntuple_filter structure.
Michael Chan [Sat, 23 Dec 2023 04:22:08 +0000 (20:22 -0800)]
bnxt_en: Add ntuple matching flags to the bnxt_ntuple_filter structure.

aRFS filters match all 5 tuples.  User defined ntuple filters may
specify some of the tuples as wildcards.  To support that, we add the
ntuple_flags to the bnxt_ntuple_filter struct to specify which tuple
fields are to be matched.  The matching tuple fields will then be
passed to the firmware in bnxt_hwrm_cfa_ntuple_filter_alloc() to create
the proper filter.

Reviewed-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
21 months agobnxt_en: Refactor ntuple filter removal logic in bnxt_cfg_ntp_filters().
Michael Chan [Sat, 23 Dec 2023 04:22:07 +0000 (20:22 -0800)]
bnxt_en: Refactor ntuple filter removal logic in bnxt_cfg_ntp_filters().

Refactor the logic into a new function bnxt_del_ntp_filters().  The
same call will be used when the user deletes an ntuple filter.

The bnxt_hwrm_cfa_ntuple_filter_free() function to call fw to free
the ntuple filter is exported so that the ethtool logic can call it.

Reviewed-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
21 months agobnxt_en: Refactor the hash table logic for ntuple filters.
Michael Chan [Sat, 23 Dec 2023 04:22:06 +0000 (20:22 -0800)]
bnxt_en: Refactor the hash table logic for ntuple filters.

Generalize the ethtool logic that walks the ntuple hash table now that
we have the common bnxt_filter_base structure.  This will allow the code
to easily extend to cover user defined ntuple or ether filters.

Reviewed-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
21 months agobnxt_en: Refactor filter insertion logic in bnxt_rx_flow_steer().
Michael Chan [Sat, 23 Dec 2023 04:22:05 +0000 (20:22 -0800)]
bnxt_en: Refactor filter insertion logic in bnxt_rx_flow_steer().

Add a new function bnxt_insert_ntp_filter() to insert the ntuple filter
into the hash table and other basic setup.  We'll use this function
to insert a user defined filter from ethtool.

Also, export bnxt_lookup_ntp_filter_from_idx() and bnxt_get_ntp_filter_idx()
for similar purposes.  All ntuple related functions are now no longer
compiled only for CONFIG_RFS_ACCEL

Reviewed-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
21 months agobnxt_en: Add new BNXT_FLTR_INSERTED flag to bnxt_filter_base struct.
Michael Chan [Sat, 23 Dec 2023 04:22:04 +0000 (20:22 -0800)]
bnxt_en: Add new BNXT_FLTR_INSERTED flag to bnxt_filter_base struct.

Change the unused flag to BNXT_FLTR_INSERTED.  To prepare for multiple
pathways that an ntuple filter can be deleted, we add this flag.  These
filter structures can be retreived from the RCU hash table but only
the caller that sees that the BNXT_FLTR_INSERTED flag is set can delete
the filter structure and clear the flag under spinlock.

Reviewed-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
21 months agobnxt_en: Add bnxt_lookup_ntp_filter_from_idx() function
Michael Chan [Sat, 23 Dec 2023 04:22:03 +0000 (20:22 -0800)]
bnxt_en: Add bnxt_lookup_ntp_filter_from_idx() function

Add the helper function to look up the ntuple filter from the
hash index and use it in bnxt_rx_flow_steer().  The helper function
will also be used by user defined ntuple filters in the next
patches.

Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
21 months agobnxt_en: Add function to calculate Toeplitz hash
Pavan Chebbi [Sat, 23 Dec 2023 04:22:02 +0000 (20:22 -0800)]
bnxt_en: Add function to calculate Toeplitz hash

For ntuple filters added by aRFS, the Toeplitz hash calculated by our
NIC is available and is used to store the ntuple filter for quick
retrieval.  In the next patches, user defined ntuple filter support
will be added and we need to calculate the same hash for these
filters.  The same hash function needs to be used so we can detect
duplicates.

Add the function bnxt_toeplitz() to calculate the Toeplitz hash for
user defined ntuple filters.  bnxt_toeplitz() uses the same Toeplitz
key and the same key length as the NIC.

bnxt_get_ntp_filter_idx() is added to return the hash index.  For
aRFS, the hash comes from the NIC.  For user defined ntuple, we call
bnxt_toeplitz() to calculate the hash index.

Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
21 months agobnxt_en: Refactor L2 filter alloc/free firmware commands.
Michael Chan [Sat, 23 Dec 2023 04:22:01 +0000 (20:22 -0800)]
bnxt_en: Refactor L2 filter alloc/free firmware commands.

Refactor the L2 filter alloc/free logic so that these filters can be
added/deleted by the user.

The bp->ntp_fltr_bmap allocated size is also increased to allow enough
IDs for L2 filters.

Reviewed-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
21 months agobnxt_en: Re-structure the bnxt_ntuple_filter structure.
Michael Chan [Sat, 23 Dec 2023 04:22:00 +0000 (20:22 -0800)]
bnxt_en: Re-structure the bnxt_ntuple_filter structure.

With the new bnxt_l2_filter structure, we can now re-structure the
bnxt_ntuple_filter structure to point to the bnxt_l2_filter structure.
We eliminate the L2 ether address info from the ntuple filter structure
as we can get the information from the L2 filter structure.  Note that
the source L2 MAC address is no longer used.

Reviewed-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
21 months agobnxt_en: Add bnxt_l2_filter hash table.
Michael Chan [Sat, 23 Dec 2023 04:21:59 +0000 (20:21 -0800)]
bnxt_en: Add bnxt_l2_filter hash table.

The current driver only has an array of 4 additional L2 unicast
addresses to support the netdev uc address list.  Generalize and
expand this infrastructure with an L2 address hash table so we can
support an expanded list of unicast addresses (for bridges,
macvlans, OVS, etc).  The L2 hash table infrastructure will also
allow more generalized n-tuple filter support.

This patch creates the bnxt_l2_filter structure and the hash table.
This L2 filter structure has the same bnxt_filter_base structure
as used in the bnxt_ntuple_filter structure.

All currently supported L2 filters will now have an entry in this
new table.

Note that L2 filters may be created for the VF.  VF filters should
not be freed when the PF goes down.  Add some logic in
bnxt_free_l2_filters() to allow keeping the VF filters or to free
everything during rmmod.

Reviewed-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
21 months agobnxt_en: Refactor bnxt_ntuple_filter structure.
Michael Chan [Sat, 23 Dec 2023 04:21:58 +0000 (20:21 -0800)]
bnxt_en: Refactor bnxt_ntuple_filter structure.

This is in preparation to support user defined L2 (ether) filters,
which will have many similarities with ntuple filters.  Refactor
bnxt_ntuple_filter structure to have a bnxt_filter_base structure
that can be re-used by the L2 filters.

Reviewed-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
21 months agoMerge tag 'for-net-next-2023-12-22' of git://git.kernel.org/pub/scm/linux/kernel...
David S. Miller [Tue, 2 Jan 2024 13:43:23 +0000 (13:43 +0000)]
Merge tag 'for-net-next-2023-12-22' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next

Luiz Augusto von Dentz says:

====================
bluetooth-next pull request for net-next:

 - btnxpuart: Fix recv_buf return value
 - L2CAP: Fix responding with multiple rejects
 - Fix atomicity violation in {min,max}_key_size_set
 - ISO: Allow binding a PA sync socket
 - ISO: Reassociate a socket with an active BIS
 - ISO: Avoid creating child socket if PA sync is terminating
 - Add device 13d3:3572 IMC Networks Bluetooth Radio
 - Don't suspend when there are connections
 - Remove le_restart_scan work
 - Fix bogus check for re-auth not supported with non-ssp
 - lib: Add documentation to exported functions
 - Support HFP offload for QCA2066
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
21 months agoDocumentation: add pyyaml to requirements.txt
Vegard Nossum [Fri, 22 Dec 2023 13:36:28 +0000 (14:36 +0100)]
Documentation: add pyyaml to requirements.txt

Commit f061c9f7d058 ("Documentation: Document each netlink family") added
a new Python script that is invoked during 'make htmldocs' and which reads
the netlink YAML spec files.

Using the virtualenv from scripts/sphinx-pre-install, we get this new
error wen running 'make htmldocs':

  Traceback (most recent call last):
    File "./tools/net/ynl/ynl-gen-rst.py", line 26, in <module>
      import yaml
  ModuleNotFoundError: No module named 'yaml'
  make[2]: *** [Documentation/Makefile:112: Documentation/networking/netlink_spec/rt_link.rst] Error 1
  make[1]: *** [Makefile:1708: htmldocs] Error 2

Fix this by adding 'pyyaml' to requirements.txt.

Note: This was somehow present in the original patch submission:
<https://lore.kernel.org/all/20231103135622.250314-1-leitao@debian.org/>
I'm not sure why the pyyaml requirement disappeared in the meantime.

Fixes: f061c9f7d058 ("Documentation: Document each netlink family")
Cc: Breno Leitao <leitao@debian.org>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
21 months agoMerge branch 'mptcp-mib-counters'
David S. Miller [Tue, 2 Jan 2024 13:33:58 +0000 (13:33 +0000)]
Merge branch 'mptcp-mib-counters'

Matthieu Baerts says:

====================
mptcp: add CurrEstab MIB counter

This MIB counter is similar to the one of TCP -- CurrEstab -- available
in /proc/net/snmp. This is useful to quickly list the number of MPTCP
connections without having to iterate over all of them.

Patch 1 prepares its support by adding new helper functions:

 - MPTCP_DEC_STATS(): similar to MPTCP_INC_STATS(), but this time to
   decrement a counter.

 - mptcp_set_state(): similar to tcp_set_state(), to change the state of
   an MPTCP socket, and to inc/decrement the new counter when needed.

Patch 2 uses mptcp_set_state() instead of directly calling
inet_sk_state_store() to change the state of MPTCP sockets.

Patch 3 and 4 validate the new feature in MPTCP "join" and "diag"
selftests.
====================

Signed-off-by: Matthieu Baerts <matttbe@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
21 months agoselftests: mptcp: diag: check CURRESTAB counters
Geliang Tang [Fri, 22 Dec 2023 12:47:25 +0000 (13:47 +0100)]
selftests: mptcp: diag: check CURRESTAB counters

This patch adds a new helper chk_msk_cestab() to check the current
established connections counter MIB_CURRESTAB in diag.sh. Invoke it
to check the counter during the connection after every chk_msk_inuse().

Signed-off-by: Geliang Tang <geliang.tang@linux.dev>
Reviewed-by: Matthieu Baerts <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts <matttbe@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
21 months agoselftests: mptcp: join: check CURRESTAB counters
Geliang Tang [Fri, 22 Dec 2023 12:47:24 +0000 (13:47 +0100)]
selftests: mptcp: join: check CURRESTAB counters

This patch adds a new helper chk_cestab_nr() to check the current
established connections counter MIB_CURRESTAB. Set the newly added
variables cestab_ns1 and cestab_ns2 to indicate how many connections
are expected in ns1 or ns2.

Invoke check_cestab() to check the counter during the connection in
do_transfer() and invoke chk_cestab_nr() to re-check it when the
connection closed. These checks are embedded in add_tests().

Signed-off-by: Geliang Tang <geliang.tang@linux.dev>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts <matttbe@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
21 months agomptcp: use mptcp_set_state
Geliang Tang [Fri, 22 Dec 2023 12:47:23 +0000 (13:47 +0100)]
mptcp: use mptcp_set_state

This patch replaces all the 'inet_sk_state_store()' calls under net/mptcp
with the new helper mptcp_set_state().

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/460
Signed-off-by: Geliang Tang <geliang.tang@linux.dev>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts <matttbe@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
21 months agomptcp: add CurrEstab MIB counter support
Geliang Tang [Fri, 22 Dec 2023 12:47:22 +0000 (13:47 +0100)]
mptcp: add CurrEstab MIB counter support

Add a new MIB counter named MPTCP_MIB_CURRESTAB to count current
established MPTCP connections, similar to TCP_MIB_CURRESTAB. This is
useful to quickly list the number of MPTCP connections without having to
iterate over all of them.

This patch adds a new helper function mptcp_set_state(): if the state
switches from or to ESTABLISHED state, this newly added counter is
incremented. This helper is going to be used in the following patch.

Similar to MPTCP_INC_STATS(), a new helper called MPTCP_DEC_STATS() is
also needed to decrement a MIB counter.

Signed-off-by: Geliang Tang <geliang.tang@linux.dev>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts <matttbe@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
21 months agoMerge branch 'selftests-tcp-ao'
David S. Miller [Tue, 2 Jan 2024 13:27:48 +0000 (13:27 +0000)]
Merge branch 'selftests-tcp-ao'

Dmitry Safonov says:

====================
selftest/net: Some more TCP-AO selftest post-merge fixups

Note that there's another post-merge fix for TCP-AO selftests, but that
doesn't conflict with these, so I don't resend that:

https://lore.kernel.org/all/20231219-b4-tcp-ao-selftests-out-of-tree-v1-1-0fff92d26eac@arista.com/T/#u
====================

Tested-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
21 months agoselftest/tcp-ao: Work on namespace-ified sysctl_optmem_max
Dmitry Safonov [Fri, 22 Dec 2023 01:59:07 +0000 (01:59 +0000)]
selftest/tcp-ao: Work on namespace-ified sysctl_optmem_max

Since commit f5769faeec36 ("net: Namespace-ify sysctl_optmem_max")
optmem_max is per-netns, so need of switching to root namespace.
It seems trivial to keep the old logic working, so going to keep it for
a while (at least, until kernel with netns-optmem_max will be release).

Currently, there is a test that checks that optmem_max limit applies to
TCP-AO keys and a little benchmark that measures linked-list TCP-AO keys
scaling, those are fixed by this.

Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
21 months agoselftest/tcp-ao: Set routes in a proper VRF table id
Dmitry Safonov [Fri, 22 Dec 2023 01:59:06 +0000 (01:59 +0000)]
selftest/tcp-ao: Set routes in a proper VRF table id

In unsigned-md5 selftests ip_route_add() is not needed in
client_add_ip(): the route was pre-setup in __test_init() => link_init()
for subnet, rather than a specific ip-address.

Currently, __ip_route_add() mistakenly always sets VRF table
to RT_TABLE_MAIN - this seems to have sneaked in during unsigned-md5
tests debugging. That also explains, why ip_route_add_vrf() ignored
EEXIST, returned by fib6.

Yet, keep EEXIST ignoring in bench-lookups selftests as it's expected
that those selftests may add the same (duplicate) routes.

Reported-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
21 months agoMerge tag 'wireless-next-2023-12-22' of git://git.kernel.org/pub/scm/linux/kernel...
David S. Miller [Tue, 2 Jan 2024 12:46:10 +0000 (12:46 +0000)]
Merge tag 'wireless-next-2023-12-22' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next

Kalle Valo says:

====================
wireless-next patches for v6.8

The third "new features" pull request for v6.8. This is a smaller one
to clear up our tree before the break and nothing really noteworthy
this time.

Major changes:

stack

* cfg80211: introduce cfg80211_ssid_eq() for SSID matching

* cfg80211: support P2P operation on DFS channels

* mac80211: allow 64-bit radiotap timestamps

iwlwifi

* AX210: allow concurrent P2P operation on DFS channels
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
21 months agoMerge branch 'net-tc-ipt-retire'
David S. Miller [Tue, 2 Jan 2024 12:41:16 +0000 (12:41 +0000)]
Merge branch 'net-tc-ipt-retire'

Jamal Hadi Salim says:

====================
net/sched: retire tc ipt action

In keeping up with my status as a hero who removes code: another one bites the
dust.
The tc ipt action was intended to run all netfilter/iptables target.
Unfortunately it has not benefitted over the years from proper updates when
netfilter changes, and for that reason it has remained rudimentary.
Pinging a bunch of people that i was aware were using this indicates that
removing it wont affect them.
Retire it to reduce maintenance efforts.
So Long, ipt, and Thanks for all the Fish.
====================

Acked-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
21 months agonet/sched: Remove CONFIG_NET_ACT_IPT from default configs
Jamal Hadi Salim [Thu, 21 Dec 2023 21:31:04 +0000 (16:31 -0500)]
net/sched: Remove CONFIG_NET_ACT_IPT from default configs

Now that we are retiring the IPT action.

Reviewed-by: Victor Noguiera <victor@mojatatu.com>
Reviewed-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
21 months agonet/sched: Retire ipt action
Jamal Hadi Salim [Thu, 21 Dec 2023 21:31:03 +0000 (16:31 -0500)]
net/sched: Retire ipt action

The tc ipt action was intended to run all netfilter/iptables target.
Unfortunately it has not benefitted over the years from proper updates when
netfilter changes, and for that reason it has remained rudimentary.
Pinging a bunch of people that i was aware were using this indicates that
removing it wont affect them.
Retire it to reduce maintenance efforts. Buh-bye.

Reviewed-by: Victor Noguiera <victor@mojatatu.com>
Reviewed-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
21 months agonet-device: move gso_partial_features to net_device_read_tx
Eric Dumazet [Thu, 21 Dec 2023 14:07:47 +0000 (14:07 +0000)]
net-device: move gso_partial_features to net_device_read_tx

dev->gso_partial_features is read from tx fast path for GSO packets.

Move it to appropriate section to avoid a cache line miss.

Fixes: 43a71cd66b9c ("net-device: reorganize net_device fast path variables")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Coco Li <lixiaoyan@google.com>
Cc: David Ahern <dsahern@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
21 months agoptp: ocp: Use DEFINE_RES_*() in place
Andy Shevchenko [Thu, 21 Dec 2023 14:06:07 +0000 (16:06 +0200)]
ptp: ocp: Use DEFINE_RES_*() in place

There is no need to have an intermediate functions as DEFINE_RES_*()
macros are represented by compound literals. Just use them in place.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
21 months agoMerge branch 'phy-listing-link_topology-tracking'
David S. Miller [Mon, 1 Jan 2024 18:38:57 +0000 (18:38 +0000)]
Merge branch 'phy-listing-link_topology-tracking'

Maxime Chevallier says:

====================
Introduce PHY listing and link_topology tracking

Here's a V5 of the multi-PHY support series.

At a glance, besides some minor fixes and R'd-by from Andrew, one of the
thing this series does is remove the ASSERT_RTNL() from the
topo_add_phy/del_phy operations.

These operations will take a PHY device and put it into the list of
devices associated to a netdevice. The main thing to protect here is the
list itself, but since we use xarrays, my naive understanding of it is
that it contains its own protection scheme. There shouldn't be a need
for more locking, as the insertion/deletion paths are already hooked
into the PHY connection to a netdev, or disconnection from it.

Now for the rest of the cover :

As a remainder, this ongoing work aims ultimately at supporting complex
link topologies that involve multiplexing multiple PHYs/SFPs on a single
netdevice. As a first step, it's required that we are able to enumerate the
PHYs on a given ethernet interface.

By just doing so, we also improve already-existing use-cases, namely the
copper SFP modules support when a media-converter is used (as we have 2
PHYs on the link, but only one is referenced by net_device.phydev, which
is used on a variety of netlink commands).

The series is architectured as follows :

- The first patch adds the notion of phy_link_topology, which tracks
all PHYs attached to a netdevice.

- Patches 2, 3 and 4 adds some plumbing into SFP and phylib to be able
  to connect the dots when building the topology tree, to know which PHY
  is connected to which SFP bus, trying not to be too invasive on phylib.

- Patch 5 allows passing a PHY_INDEX to ethnl commands. I'm uncertain about
  this, as there are at least 4 netlink commands ( 5 with the one introduced
  in patch 7 ) that targets PHYs directly or indirectly, which to me makes
  it worth-it to have a generic way to pass a PHY index to commands, however
  the approach taken may be too generic.

- Patch 6 is the netlink spec update + ethtool-user.c|h autogenerated code
update (the autogenerated code triggers checkpatch warning though)

- Patch 7 introduces a new netlink command set to list PHYs on a netdevice.
It implements a custom DUMP and GET operation to allow filtered dumps,
that lists all PHYs on a given netdevice. I couldn't use most of ethnl's
plumbing though.

- Patch 8 is the netlink spec update + ethtool-user.c|h update for that
new command

- Patch 8,9,10 and 11 updates the PLCA, strset, cable-test and pse netlink
commands to use the user-provided PHY instead of net_device.phydev.

- Finally patch 12 adds some documentation for this whole work.

Examples
========

Here's a short overview of the kind of operations you can have regarding
the PHY topology. These tests were performed on a MacchiatoBin, which
has 3 interfaces :

eth0 and eth1 have the following layout:

MAC - PHY - SFP

eth2 has this more classic topology :

MAC - PHY - RJ45

finally eth3 has the following topology :

MAC - SFP

When performing a dump with all interfaces down, we don't get any
result, as no PHY has been attached to their respective net_device :

None

The following output is with eth0, eth2 and eth3 up, but no SFP module
inserted in none of the interfaces :

[{'downstream-sfp-name': 'sfp-eth0',
  'drvname': 'mv88x3310',
  'header': {'dev-index': 2, 'dev-name': 'eth0'},
  'id': 0,
  'index': 1,
  'name': 'f212a600.mdio-mii:00',
  'upstream-type': 'mac'},
 {'drvname': 'Marvell 88E1510',
  'header': {'dev-index': 4, 'dev-name': 'eth2'},
  'id': 21040593,
  'index': 1,
  'name': 'f212a200.mdio-mii:00',
  'upstream-type': 'mac'}]

And now is a dump operation with a copper SFP in the eth0 port :

[{'downstream-sfp-name': 'sfp-eth0',
  'drvname': 'mv88x3310',
  'header': {'dev-index': 2, 'dev-name': 'eth0'},
  'id': 0,
  'index': 1,
  'name': 'f212a600.mdio-mii:00',
  'upstream-type': 'mac'},
 {'drvname': 'Marvell 88E1111',
  'header': {'dev-index': 2, 'dev-name': 'eth0'},
  'id': 21040322,
  'index': 2,
  'name': 'i2c:sfp-eth0:16',
  'upstream': {'index': 1, 'sfp-name': 'sfp-eth0'},
  'upstream-type': 'phy'},
 {'drvname': 'Marvell 88E1510',
  'header': {'dev-index': 4, 'dev-name': 'eth2'},
  'id': 21040593,
  'index': 1,
  'name': 'f212a200.mdio-mii:00',
  'upstream-type': 'mac'}]

 -- Note that this shouldn't actually work as the 88x3310 PHY doesn't allow
a 1G SFP to be connected to its SFP interface, and I don't have a 10G copper SFP,
so for the sake of the demo I applied the following modification, which
of courses gives a non-functionnal link, but the PHY attach still works,
which is what I want to demonstrate :

@@ -488,7 +488,7 @@ static int mv3310_sfp_insert(void *upstream, const struct sfp_eeprom_id *id)

        if (iface != PHY_INTERFACE_MODE_10GBASER) {
                dev_err(&phydev->mdio.dev, "incompatible SFP module inserted\n");
-               return -EINVAL;
+               //return -EINVAL;
        }
        return 0;
 }

Finally an example of the filtered DUMP operation that Jakub suggested
in V1 :

[{'downstream-sfp-name': 'sfp-eth0',
  'drvname': 'mv88x3310',
  'header': {'dev-index': 2, 'dev-name': 'eth0'},
  'id': 0,
  'index': 1,
  'name': 'f212a600.mdio-mii:00',
  'upstream-type': 'mac'},
 {'drvname': 'Marvell 88E1111',
  'header': {'dev-index': 2, 'dev-name': 'eth0'},
  'id': 21040322,
  'index': 2,
  'name': 'i2c:sfp-eth0:16',
  'upstream': {'index': 1, 'sfp-name': 'sfp-eth0'},
  'upstream-type': 'phy'}]

And a classic GET operation allows querying a single PHY's info :

{'drvname': 'Marvell 88E1111',
 'header': {'dev-index': 2, 'dev-name': 'eth0'},
 'id': 21040322,
 'index': 2,
 'name': 'i2c:sfp-eth0:16',
 'upstream': {'index': 1, 'sfp-name': 'sfp-eth0'},
 'upstream-type': 'phy'}

Changed in V5:
- Removed the RTNL assertion in the topology ops
- Made the phy_topo_get_phy inline
- Fixed the PSE-PD multi-PHY support by re-adding a wrongly dropped
  check
- Fixed some typos in the documentation
- Fixed reverse xmas trees

Changes in V4:
- Dropped the RFC flag
- Made the net_device integration independent to having phylib enabled
- Removed the autogenerated ethtool-user code for the YNL specs

Changes in V3:
- Added RTNL assertions where needed
- Fixed issues in the DUMP code for PHY_GET, which crashed when running it
  twice in a row
- Added the documentation, and moved in-source docs around
- renamed link_topology to phy_link_topology

Changes in V2:
- Added the DUMP operation
- Added much more information in the reported data, to be able to reconstruct
  precisely the topology tree
- renamed phy_list to link_topology
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
21 months agoDocumentation: networking: document phy_link_topology
Maxime Chevallier [Thu, 21 Dec 2023 18:00:46 +0000 (19:00 +0100)]
Documentation: networking: document phy_link_topology

The newly introduced phy_link_topology tracks all ethernet PHYs that are
attached to a netdevice. Document the base principle, internal and
external APIs. As the phy_link_topology is expected to be extended, this
documentation will hold any further improvements and additions made
relative to topology handling.

Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
21 months agonet: ethtool: strset: Allow querying phy stats by index
Maxime Chevallier [Thu, 21 Dec 2023 18:00:45 +0000 (19:00 +0100)]
net: ethtool: strset: Allow querying phy stats by index

The ETH_SS_PHY_STATS command gets PHY statistics. Use the phydev pointer
from the ethnl request to allow query phy stats from each PHY on the
link.

Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
21 months agonet: ethtool: cable-test: Target the command to the requested PHY
Maxime Chevallier [Thu, 21 Dec 2023 18:00:44 +0000 (19:00 +0100)]
net: ethtool: cable-test: Target the command to the requested PHY

Cable testing is a PHY-specific command. Instead of targeting the command
towards dev->phydev, use the request to pick the targeted PHY.

Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
21 months agonet: ethtool: pse-pd: Target the command to the requested PHY
Maxime Chevallier [Thu, 21 Dec 2023 18:00:43 +0000 (19:00 +0100)]
net: ethtool: pse-pd: Target the command to the requested PHY

PSE and PD configuration is a PHY-specific command. Instead of targeting
the command towards dev->phydev, use the request to pick the targeted
PHY device.

Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
21 months agonet: ethtool: plca: Target the command to the requested PHY
Maxime Chevallier [Thu, 21 Dec 2023 18:00:42 +0000 (19:00 +0100)]
net: ethtool: plca: Target the command to the requested PHY

PLCA is a PHY-specific command. Instead of targeting the command
towards dev->phydev, use the request to pick the targeted PHY.

Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
21 months agonetlink: specs: add ethnl PHY_GET command set
Maxime Chevallier [Thu, 21 Dec 2023 18:00:41 +0000 (19:00 +0100)]
netlink: specs: add ethnl PHY_GET command set

The PHY_GET command, supporting both DUMP and GET operations, is used to
retrieve the list of PHYs connected to a netdevice, and get topology
information to know where exactly it sits on the physical link.

Add the netlink specs corresponding to that command.

Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
21 months agonet: ethtool: Introduce a command to list PHYs on an interface
Maxime Chevallier [Thu, 21 Dec 2023 18:00:40 +0000 (19:00 +0100)]
net: ethtool: Introduce a command to list PHYs on an interface

As we have the ability to track the PHYs connected to a net_device
through the link_topology, we can expose this list to userspace. This
allows userspace to use these identifiers for phy-specific commands and
take the decision of which PHY to target by knowing the link topology.

Add PHY_GET and PHY_DUMP, which can be a filtered DUMP operation to list
devices on only one interface.

Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>