]> www.infradead.org Git - users/jedix/linux-maple.git/log
users/jedix/linux-maple.git
19 months agoselftests/bpf: Test struct_ops map definition with type suffix
Eduard Zingerman [Wed, 6 Mar 2024 10:45:18 +0000 (12:45 +0200)]
selftests/bpf: Test struct_ops map definition with type suffix

Extend struct_ops_module test case to check if it is possible to use
'___' suffixes for struct_ops type specification.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/bpf/20240306104529.6453-5-eddyz87@gmail.com
19 months agolibbpf: Honor autocreate flag for struct_ops maps
Eduard Zingerman [Wed, 6 Mar 2024 10:45:17 +0000 (12:45 +0200)]
libbpf: Honor autocreate flag for struct_ops maps

Skip load steps for struct_ops maps not marked for automatic creation.
This should allow to load bpf object in situations like below:

    SEC("struct_ops/foo") int BPF_PROG(foo) { ... }
    SEC("struct_ops/bar") int BPF_PROG(bar) { ... }

    struct test_ops___v1 {
     int (*foo)(void);
    };

    struct test_ops___v2 {
     int (*foo)(void);
     int (*does_not_exist)(void);
    };

    SEC(".struct_ops.link")
    struct test_ops___v1 map_for_old = {
     .test_1 = (void *)foo
    };

    SEC(".struct_ops.link")
    struct test_ops___v2 map_for_new = {
     .test_1 = (void *)foo,
     .does_not_exist = (void *)bar
    };

Suppose program is loaded on old kernel that does not have definition
for 'does_not_exist' struct_ops member. After this commit it would be
possible to load such object file after the following tweaks:

    bpf_program__set_autoload(skel->progs.bar, false);
    bpf_map__set_autocreate(skel->maps.map_for_new, false);

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/bpf/20240306104529.6453-4-eddyz87@gmail.com
19 months agolibbpf: Tie struct_ops programs to kernel BTF ids, not to local ids
Eduard Zingerman [Wed, 6 Mar 2024 10:45:16 +0000 (12:45 +0200)]
libbpf: Tie struct_ops programs to kernel BTF ids, not to local ids

Enforce the following existing limitation on struct_ops programs based
on kernel BTF id instead of program-local BTF id:

    struct_ops BPF prog can be re-used between multiple .struct_ops &
    .struct_ops.link as long as it's the same struct_ops struct
    definition and the same function pointer field

This allows reusing same BPF program for versioned struct_ops map
definitions, e.g.:

    SEC("struct_ops/test")
    int BPF_PROG(foo) { ... }

    struct some_ops___v1 { int (*test)(void); };
    struct some_ops___v2 { int (*test)(void); };

    SEC(".struct_ops.link") struct some_ops___v1 a = { .test = foo }
    SEC(".struct_ops.link") struct some_ops___v2 b = { .test = foo }

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240306104529.6453-3-eddyz87@gmail.com
19 months agolibbpf: Allow version suffixes (___smth) for struct_ops types
Eduard Zingerman [Wed, 6 Mar 2024 10:45:15 +0000 (12:45 +0200)]
libbpf: Allow version suffixes (___smth) for struct_ops types

E.g. allow the following struct_ops definitions:

    struct bpf_testmod_ops___v1 { int (*test)(void); };
    struct bpf_testmod_ops___v2 { int (*test)(void); };

    SEC(".struct_ops.link")
    struct bpf_testmod_ops___v1 a = { .test = ... }
    SEC(".struct_ops.link")
    struct bpf_testmod_ops___v2 b = { .test = ... }

Where both bpf_testmod_ops__v1 and bpf_testmod_ops__v2 would be
resolved as 'struct bpf_testmod_ops' from kernel BTF.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: David Vernet <void@manifault.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240306104529.6453-2-eddyz87@gmail.com
19 months agoMerge branch 'bpf-introduce-may_goto-and-cond_break'
Andrii Nakryiko [Wed, 6 Mar 2024 18:41:18 +0000 (10:41 -0800)]
Merge branch 'bpf-introduce-may_goto-and-cond_break'

Alexei Starovoitov says:

====================
bpf: Introduce may_goto and cond_break

From: Alexei Starovoitov <ast@kernel.org>

v5 -> v6:
- Rename BPF_JMA to BPF_JCOND
- Addressed Andrii's review comments

v4 -> v5:
- rewrote patch 1 to avoid fake may_goto_reg and use 'u32 may_goto_cnt' instead.
  This way may_goto handling is similar to bpf_loop() processing.
- fixed bug in patch 2 that RANGE_WITHIN should not use
  rold->type == NOT_INIT as a safe signal.
- patch 3 fixed negative offset computation in cond_break macro
- using bpf_arena and cond_break recompiled lib/glob.c as bpf prog
  and it works! It will be added as a selftest to arena series.

v3 -> v4:
- fix drained issue reported by John.
  may_goto insn could be implemented with sticky state (once
  reaches 0 it stays 0), but the verifier shouldn't assume that.
  It has to explore both branches.
  Arguably drained iterator state shouldn't be there at all.
  bpf_iter_css_next() is not sticky. Can be fixed, but auditing all
  iterators for stickiness. That's an orthogonal discussion.
- explained JMA name reasons in patch 1
- fixed test_progs-no_alu32 issue and added another test

v2 -> v3: Major change
- drop bpf_can_loop() kfunc and introduce may_goto instruction instead
  kfunc is a function call while may_goto doesn't consume any registers
  and LLVM can produce much better code due to less register pressure.
- instead of counting from zero to BPF_MAX_LOOPS start from it instead
  and break out of the loop when count reaches zero
- use may_goto instruction in cond_break macro
- recognize that 'exact' state comparison doesn't need to be truly exact.
  regsafe() should ignore precision and liveness marks, but range_within
  logic is safe to use while evaluating open coded iterators.
====================

Link: https://lore.kernel.org/r/20240306031929.42666-1-alexei.starovoitov@gmail.com
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
19 months agoselftests/bpf: Test may_goto
Alexei Starovoitov [Wed, 6 Mar 2024 03:19:29 +0000 (19:19 -0800)]
selftests/bpf: Test may_goto

Add tests for may_goto instruction via cond_break macro.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Tested-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20240306031929.42666-5-alexei.starovoitov@gmail.com
19 months agobpf: Add cond_break macro
Alexei Starovoitov [Wed, 6 Mar 2024 03:19:28 +0000 (19:19 -0800)]
bpf: Add cond_break macro

Use may_goto instruction to implement cond_break macro.
Ideally the macro should be written as:
  asm volatile goto(".byte 0xe5;
                     .byte 0;
                     .short %l[l_break] ...
                     .long 0;
but LLVM doesn't recognize fixup of 2 byte PC relative yet.
Hence use
  asm volatile goto(".byte 0xe5;
                     .byte 0;
                     .long %l[l_break] ...
                     .short 0;
that produces correct asm on little endian.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Tested-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20240306031929.42666-4-alexei.starovoitov@gmail.com
19 months agobpf: Recognize that two registers are safe when their ranges match
Alexei Starovoitov [Wed, 6 Mar 2024 03:19:27 +0000 (19:19 -0800)]
bpf: Recognize that two registers are safe when their ranges match

When open code iterators, bpf_loop or may_goto are used the following two
states are equivalent and safe to prune the search:

cur state: fp-8_w=scalar(id=3,smin=umin=smin32=umin32=2,smax=umax=smax32=umax32=11,var_off=(0x0; 0xf))
old state: fp-8_rw=scalar(id=2,smin=umin=smin32=umin32=1,smax=umax=smax32=umax32=11,var_off=(0x0; 0xf))

In other words "exact" state match should ignore liveness and precision
marks, since open coded iterator logic didn't complete their propagation,
reg_old->type == NOT_INIT && reg_cur->type != NOT_INIT is also not safe to
prune while looping, but range_within logic that applies to scalars,
ptr_to_mem, map_value, pkt_ptr is safe to rely on.

Avoid doing such comparison when regular infinite loop detection logic is
used, otherwise bounded loop logic will declare such "infinite loop" as
false positive. Such example is in progs/verifier_loops1.c
not_an_inifinite_loop().

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Tested-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20240306031929.42666-3-alexei.starovoitov@gmail.com
19 months agoMerge branch 'mm-enforce-ioremap-address-space-and-introduce-sparse-vm_area'
Andrii Nakryiko [Wed, 6 Mar 2024 18:19:04 +0000 (10:19 -0800)]
Merge branch 'mm-enforce-ioremap-address-space-and-introduce-sparse-vm_area'

Alexei Starovoitov says:

====================
mm: Enforce ioremap address space and introduce sparse vm_area

From: Alexei Starovoitov <ast@kernel.org>

v3 -> v4
- dropped VM_XEN patch for now. It will be in the follow up.
- fixed constant as pointed out by Mike

v2 -> v3
- added Christoph's reviewed-by to patch 1
- cap commit log lines to 75 chars
- factored out common checks in patch 3 into helper
- made vm_area_unmap_pages() return void

There are various users of kernel virtual address space:
vmalloc, vmap, ioremap, xen.

- vmalloc use case dominates the usage. Such vm areas have VM_ALLOC flag
and these areas are treated differently by KASAN.

- the areas created by vmap() function should be tagged with VM_MAP
(as majority of the users do).

- ioremap areas are tagged with VM_IOREMAP and vm area start is aligned
to size of the area unlike vmalloc/vmap.

- there is also xen usage that is marked as VM_IOREMAP, but it doesn't
call ioremap_page_range() unlike all other VM_IOREMAP users.

To clean this up a bit, enforce that ioremap_page_range() checks the range
and VM_IOREMAP flag.

In addition BPF would like to reserve regions of kernel virtual address
space and populate it lazily, similar to xen use cases.
For that reason, introduce VM_SPARSE flag and vm_area_[un]map_pages()
helpers to populate this sparse area.

In the end the /proc/vmallocinfo will show
"vmalloc"
"vmap"
"ioremap"
"sparse"
categories for different kinds of address regions.

ioremap, sparse will return zero when dumped through /proc/kcore
====================

Link: https://lore.kernel.org/r/20240305030516.41519-1-alexei.starovoitov@gmail.com
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
19 months agobpf: Introduce may_goto instruction
Alexei Starovoitov [Wed, 6 Mar 2024 03:19:26 +0000 (19:19 -0800)]
bpf: Introduce may_goto instruction

Introduce may_goto instruction that from the verifier pov is similar to
open coded iterators bpf_for()/bpf_repeat() and bpf_loop() helper, but it
doesn't iterate any objects.
In assembly 'may_goto' is a nop most of the time until bpf runtime has to
terminate the program for whatever reason. In the current implementation
may_goto has a hidden counter, but other mechanisms can be used.
For programs written in C the later patch introduces 'cond_break' macro
that combines 'may_goto' with 'break' statement and has similar semantics:
cond_break is a nop until bpf runtime has to break out of this loop.
It can be used in any normal "for" or "while" loop, like

  for (i = zero; i < cnt; cond_break, i++) {

The verifier recognizes that may_goto is used in the program, reserves
additional 8 bytes of stack, initializes them in subprog prologue, and
replaces may_goto instruction with:
aux_reg = *(u64 *)(fp - 40)
if aux_reg == 0 goto pc+off
aux_reg -= 1
*(u64 *)(fp - 40) = aux_reg

may_goto instruction can be used by LLVM to implement __builtin_memcpy,
__builtin_strcmp.

may_goto is not a full substitute for bpf_for() macro.
bpf_for() doesn't have induction variable that verifiers sees,
so 'i' in bpf_for(i, 0, 100) is seen as imprecise and bounded.

But when the code is written as:
for (i = 0; i < 100; cond_break, i++)
the verifier see 'i' as precise constant zero,
hence cond_break (aka may_goto) doesn't help to converge the loop.
A static or global variable can be used as a workaround:
static int zero = 0;
for (i = zero; i < 100; cond_break, i++) // works!

may_goto works well with arena pointers that don't need to be bounds
checked on access. Load/store from arena returns imprecise unbounded
scalar and loops with may_goto pass the verifier.

Reserve new opcode BPF_JMP | BPF_JCOND for may_goto insn.
JCOND stands for conditional pseudo jump.
Since goto_or_nop insn was proposed, it may use the same opcode.
may_goto vs goto_or_nop can be distinguished by src_reg:
code = BPF_JMP | BPF_JCOND
src_reg = 0 - may_goto
src_reg = 1 - goto_or_nop

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Tested-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20240306031929.42666-2-alexei.starovoitov@gmail.com
19 months agomm: Introduce VM_SPARSE kind and vm_area_[un]map_pages().
Alexei Starovoitov [Tue, 5 Mar 2024 03:05:16 +0000 (19:05 -0800)]
mm: Introduce VM_SPARSE kind and vm_area_[un]map_pages().

vmap/vmalloc APIs are used to map a set of pages into contiguous kernel
virtual space.

get_vm_area() with appropriate flag is used to request an area of kernel
address range. It's used for vmalloc, vmap, ioremap, xen use cases.
- vmalloc use case dominates the usage. Such vm areas have VM_ALLOC flag.
- the areas created by vmap() function should be tagged with VM_MAP.
- ioremap areas are tagged with VM_IOREMAP.

BPF would like to extend the vmap API to implement a lazily-populated
sparse, yet contiguous kernel virtual space. Introduce VM_SPARSE flag
and vm_area_map_pages(area, start_addr, count, pages) API to map a set
of pages within a given area.
It has the same sanity checks as vmap() does.
It also checks that get_vm_area() was created with VM_SPARSE flag
which identifies such areas in /proc/vmallocinfo
and returns zero pages on read through /proc/kcore.

The next commits will introduce bpf_arena which is a sparsely populated
shared memory region between bpf program and user space process. It will
map privately-managed pages into a sparse vm area with the following steps:

  // request virtual memory region during bpf prog verification
  area = get_vm_area(area_size, VM_SPARSE);

  // on demand
  vm_area_map_pages(area, kaddr, kend, pages);
  vm_area_unmap_pages(area, kaddr, kend);

  // after bpf program is detached and unloaded
  free_vm_area(area);

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Link: https://lore.kernel.org/bpf/20240305030516.41519-3-alexei.starovoitov@gmail.com
19 months agomm: Enforce VM_IOREMAP flag and range in ioremap_page_range.
Alexei Starovoitov [Tue, 5 Mar 2024 03:05:15 +0000 (19:05 -0800)]
mm: Enforce VM_IOREMAP flag and range in ioremap_page_range.

There are various users of get_vm_area() + ioremap_page_range() APIs.
Enforce that get_vm_area() was requested as VM_IOREMAP type and range
passed to ioremap_page_range() matches created vm_area to avoid
accidentally ioremap-ing into wrong address range.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/bpf/20240305030516.41519-2-alexei.starovoitov@gmail.com
19 months agoMerge branch 'Allow struct_ops maps with a large number of programs'
Martin KaFai Lau [Thu, 29 Feb 2024 23:23:12 +0000 (15:23 -0800)]
Merge branch 'Allow struct_ops maps with a large number of programs'

Kui-Feng Lee says:

====================
The BPF struct_ops previously only allowed for one page to be used for
the trampolines of all links in a map. However, we have recently run
out of space due to the large number of BPF program links. By
allocating additional pages when we exhaust an existing page, we can
accommodate more links in a single map.

The variable st_map->image has been changed to st_map->image_pages,
and its type has been changed to an array of pointers to buffers of
PAGE_SIZE. Additional pages are allocated when all existing pages are
exhausted.

The test case loads a struct_ops maps having 40 programs. Their
trampolines takes about 6.6k+ bytes over 1.5 pages on x86.
---
Major differences from v3:

 - Refactor buffer allocations to bpf_struct_ops_tramp_buf_alloc() and
   bpf_struct_ops_tramp_buf_free().

Major differences from v2:

 - Move image buffer allocation to bpf_struct_ops_prepare_trampoline().

Major differences from v1:

 - Always free pages if failing to update.

 - Allocate 8 pages at most.

v3: https://lore.kernel.org/all/20240224030302.1500343-1-thinker.li@gmail.com/
v2: https://lore.kernel.org/all/20240221225911.757861-1-thinker.li@gmail.com/
v1: https://lore.kernel.org/all/20240216182828.201727-1-thinker.li@gmail.com/
====================

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
19 months agoselftests/bpf: Test struct_ops maps with a large number of struct_ops program.
Kui-Feng Lee [Sat, 24 Feb 2024 22:34:18 +0000 (14:34 -0800)]
selftests/bpf: Test struct_ops maps with a large number of struct_ops program.

Create and load a struct_ops map with a large number of struct_ops
programs to generate trampolines taking a size over multiple pages. The
map includes 40 programs. Their trampolines takes 6.6k+, more than 1.5
pages, on x86.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240224223418.526631-4-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
19 months agobpf: struct_ops supports more than one page for trampolines.
Kui-Feng Lee [Sat, 24 Feb 2024 22:34:17 +0000 (14:34 -0800)]
bpf: struct_ops supports more than one page for trampolines.

The BPF struct_ops previously only allowed one page of trampolines.
Each function pointer of a struct_ops is implemented by a struct_ops
bpf program. Each struct_ops bpf program requires a trampoline.
The following selftest patch shows each page can hold a little more
than 20 trampolines.

While one page is more than enough for the tcp-cc usecase,
the sched_ext use case shows that one page is not always enough and hits
the one page limit. This patch overcomes the one page limit by allocating
another page when needed and it is limited to a total of
MAX_IMAGE_PAGES (8) pages which is more than enough for
reasonable usages.

The variable st_map->image has been changed to st_map->image_pages, and
its type has been changed to an array of pointers to pages.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240224223418.526631-3-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
19 months agobpf, net: validate struct_ops when updating value.
Kui-Feng Lee [Sat, 24 Feb 2024 22:34:16 +0000 (14:34 -0800)]
bpf, net: validate struct_ops when updating value.

Perform all validations when updating values of struct_ops maps. Doing
validation in st_ops->reg() and st_ops->update() is not necessary anymore.
However, tcp_register_congestion_control() has been called in various
places. It still needs to do validations.

Cc: netdev@vger.kernel.org
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240224223418.526631-2-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
19 months agoselftests/bpf: xdp_hw_metadata reduce sleep interval
Song Yoong Siang [Sun, 3 Mar 2024 08:32:24 +0000 (16:32 +0800)]
selftests/bpf: xdp_hw_metadata reduce sleep interval

In current ping-pong design, xdp_hw_metadata will wait until the packet
transmission completely done, then only start to receive the next packet.

The current sleep interval is 10ms, which is unnecessary large. Typically,
a NIC does not need such a long time to transmit a packet. Furthermore,
during this 10ms sleep time, the app is unable to receive incoming packets.

Therefore, this commit reduce sleep interval to 10us, so that
xdp_hw_metadata is able to support periodic packets with shorter interval.
10us * 500 = 5ms should be enough for packet transmission and status
retrieval.

Signed-off-by: Song Yoong Siang <yoong.siang.song@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20240303083225.1184165-2-yoong.siang.song@intel.com
19 months agoselftests/bpf: Extend uprobe/uretprobe triggering benchmarks
Andrii Nakryiko [Fri, 1 Mar 2024 21:45:51 +0000 (13:45 -0800)]
selftests/bpf: Extend uprobe/uretprobe triggering benchmarks

Settle on three "flavors" of uprobe/uretprobe, installed on different
kinds of instruction: nop, push, and ret. All three are testing
different internal code paths emulating or single-stepping instructions,
so are interesting to compare and benchmark separately.

To ensure `push rbp` instruction we ensure that uprobe_target_push() is
not a leaf function by calling (global __weak) noop function and
returning something afterwards (if we don't do that, compiler will just
do a tail call optimization).

Also, we need to make sure that compiler isn't skipping frame pointer
generation, so let's add `-fno-omit-frame-pointers` to Makefile.

Just to give an idea of where we currently stand in terms of relative
performance of different uprobe/uretprobe cases vs a cheap syscall
(getpgid()) baseline, here are results from my local machine:

$ benchs/run_bench_uprobes.sh
base           :    1.561 ± 0.020M/s
uprobe-nop     :    0.947 ± 0.007M/s
uprobe-push    :    0.951 ± 0.004M/s
uprobe-ret     :    0.443 ± 0.007M/s
uretprobe-nop  :    0.471 ± 0.013M/s
uretprobe-push :    0.483 ± 0.004M/s
uretprobe-ret  :    0.306 ± 0.007M/s

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20240301214551.1686095-1-andrii@kernel.org
19 months agolibbpf: Correct debug message in btf__load_vmlinux_btf
Chen Shen [Sat, 2 Mar 2024 06:22:18 +0000 (14:22 +0800)]
libbpf: Correct debug message in btf__load_vmlinux_btf

In the function btf__load_vmlinux_btf, the debug message incorrectly
refers to 'path' instead of 'sysfs_btf_path'.

Signed-off-by: Chen Shen <peterchenshen@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/bpf/20240302062218.3587-1-peterchenshen@gmail.com
19 months agobpf, docs: Rename legacy conformance group to packet
Dave Thaler [Sat, 2 Mar 2024 01:22:29 +0000 (17:22 -0800)]
bpf, docs: Rename legacy conformance group to packet

There could be other legacy conformance groups in the future,
so use a more descriptive name.  The status of the conformance
group in the IANA registry is what designates it as legacy,
not the name of the group.

Signed-off-by: Dave Thaler <dthaler1968@gmail.com>
Link: https://lore.kernel.org/r/20240302012229.16452-1-dthaler1968@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
19 months agobpf, docs: Use IETF format for field definitions in instruction-set.rst
Dave Thaler [Fri, 1 Mar 2024 22:23:37 +0000 (14:23 -0800)]
bpf, docs: Use IETF format for field definitions in instruction-set.rst

In preparation for publication as an IETF RFC, the WG chairs asked me
to convert the document to use IETF packet format for field layout, so
this patch attempts to make it consistent with other IETF documents.

Some fields that are not byte aligned were previously inconsistent
in how values were defined.  Some were defined as the value of the
byte containing the field (like 0x20 for a field holding the high
four bits of the byte), and others were defined as the value of the
field itself (like 0x2).  This PR makes them be consistent in using
just the values of the field itself, which is IETF convention.

As a result, some of the defines that used BPF_* would no longer
match the value in the spec, and so this patch also drops the BPF_*
prefix to avoid confusion with the defines that are the full-byte
equivalent values.  For consistency, BPF_* is then dropped from
other fields too.  BPF_<foo> is thus the Linux implementation-specific
define for <foo> as it appears in the BPF ISA specification.

The syntax BPF_ADD | BPF_X | BPF_ALU only worked for full-byte
values so the convention {ADD, X, ALU} is proposed for referring
to field values instead.

Also replace the redundant "LSB bits" with "least significant bits".

A preview of what the resulting Internet Draft would look like can
be seen at:
https://htmlpreview.github.io/?https://raw.githubusercontent.com/dthaler/ebp
f-docs-1/format/draft-ietf-bpf-isa.html

v1->v2: Fix sphinx issue as recommended by David Vernet

Signed-off-by: Dave Thaler <dthaler1968@gmail.com>
Acked-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20240301222337.15931-1-dthaler1968@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
19 months agoMerge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf...
Jakub Kicinski [Sun, 3 Mar 2024 04:50:59 +0000 (20:50 -0800)]
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2024-02-29

We've added 119 non-merge commits during the last 32 day(s) which contain
a total of 150 files changed, 3589 insertions(+), 995 deletions(-).

The main changes are:

1) Extend the BPF verifier to enable static subprog calls in spin lock
   critical sections, from Kumar Kartikeya Dwivedi.

2) Fix confusing and incorrect inference of PTR_TO_CTX argument type
   in BPF global subprogs, from Andrii Nakryiko.

3) Larger batch of riscv BPF JIT improvements and enabling inlining
   of the bpf_kptr_xchg() for RV64, from Pu Lehui.

4) Allow skeleton users to change the values of the fields in struct_ops
   maps at runtime, from Kui-Feng Lee.

5) Extend the verifier's capabilities of tracking scalars when they
   are spilled to stack, especially when the spill or fill is narrowing,
   from Maxim Mikityanskiy & Eduard Zingerman.

6) Various BPF selftest improvements to fix errors under gcc BPF backend,
   from Jose E. Marchesi.

7) Avoid module loading failure when the module trying to register
   a struct_ops has its BTF section stripped, from Geliang Tang.

8) Annotate all kfuncs in .BTF_ids section which eventually allows
   for automatic kfunc prototype generation from bpftool, from Daniel Xu.

9) Several updates to the instruction-set.rst IETF standardization
   document, from Dave Thaler.

10) Shrink the size of struct bpf_map resp. bpf_array,
    from Alexei Starovoitov.

11) Initial small subset of BPF verifier prepwork for sleepable bpf_timer,
    from Benjamin Tissoires.

12) Fix bpftool to be more portable to musl libc by using POSIX's
    basename(), from Arnaldo Carvalho de Melo.

13) Add libbpf support to gcc in CORE macro definitions,
    from Cupertino Miranda.

14) Remove a duplicate type check in perf_event_bpf_event,
    from Florian Lehner.

15) Fix bpf_spin_{un,}lock BPF helpers to actually annotate them
    with notrace correctly, from Yonghong Song.

16) Replace the deprecated bpf_lpm_trie_key 0-length array with flexible
    array to fix build warnings, from Kees Cook.

17) Fix resolve_btfids cross-compilation to non host-native endianness,
    from Viktor Malik.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (119 commits)
  selftests/bpf: Test if shadow types work correctly.
  bpftool: Add an example for struct_ops map and shadow type.
  bpftool: Generated shadow variables for struct_ops maps.
  libbpf: Convert st_ops->data to shadow type.
  libbpf: Set btf_value_type_id of struct bpf_map for struct_ops.
  bpf: Replace bpf_lpm_trie_key 0-length array with flexible array
  bpf, arm64: use bpf_prog_pack for memory management
  arm64: patching: implement text_poke API
  bpf, arm64: support exceptions
  arm64: stacktrace: Implement arch_bpf_stack_walk() for the BPF JIT
  bpf: add is_async_callback_calling_insn() helper
  bpf: introduce in_sleepable() helper
  bpf: allow more maps in sleepable bpf programs
  selftests/bpf: Test case for lacking CFI stub functions.
  bpf: Check cfi_stubs before registering a struct_ops type.
  bpf: Clarify batch lookup/lookup_and_delete semantics
  bpf, docs: specify which BPF_ABS and BPF_IND fields were zero
  bpf, docs: Fix typos in instruction-set.rst
  selftests/bpf: update tcp_custom_syncookie to use scalar packet offset
  bpf: Shrink size of struct bpf_map/bpf_array.
  ...
====================

Link: https://lore.kernel.org/r/20240301001625.8800-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
19 months agoMerge branch 'inet_dump_ifaddr-no-rtnl'
David S. Miller [Fri, 1 Mar 2024 11:09:40 +0000 (11:09 +0000)]
Merge branch 'inet_dump_ifaddr-no-rtnl'

Eric Dumazet says:

====================
inet: no longer use RTNL to protect inet_dump_ifaddr()

This series convert inet so that a dump of addresses (ip -4 addr)
no longer requires RTNL.
====================

Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoinet: use xa_array iterator to implement inet_dump_ifaddr()
Eric Dumazet [Thu, 29 Feb 2024 11:40:16 +0000 (11:40 +0000)]
inet: use xa_array iterator to implement inet_dump_ifaddr()

1) inet_dump_ifaddr() can can run under RCU protection
   instead of RTNL.

2) properly return 0 at the end of a dump, avoiding an
   an extra recvmsg() system call.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoinet: prepare inet_base_seq() to run without RTNL
Eric Dumazet [Thu, 29 Feb 2024 11:40:15 +0000 (11:40 +0000)]
inet: prepare inet_base_seq() to run without RTNL

In the following patch, inet_base_seq() will no longer be called
with RTNL held.

Add READ_ONCE()/WRITE_ONCE() annotations in dev_base_seq_inc()
and inet_base_seq().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoinet: annotate data-races around ifa->ifa_flags
Eric Dumazet [Thu, 29 Feb 2024 11:40:14 +0000 (11:40 +0000)]
inet: annotate data-races around ifa->ifa_flags

ifa->ifa_flags can be read locklessly.

Add appropriate READ_ONCE()/WRITE_ONCE() annotations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoinet: annotate data-races around ifa->ifa_preferred_lft
Eric Dumazet [Thu, 29 Feb 2024 11:40:13 +0000 (11:40 +0000)]
inet: annotate data-races around ifa->ifa_preferred_lft

ifa->ifa_preferred_lft can be read locklessly.

Add appropriate READ_ONCE()/WRITE_ONCE() annotations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoinet: annotate data-races around ifa->ifa_valid_lft
Eric Dumazet [Thu, 29 Feb 2024 11:40:12 +0000 (11:40 +0000)]
inet: annotate data-races around ifa->ifa_valid_lft

ifa->ifa_valid_lft can be read locklessly.

Add appropriate READ_ONCE()/WRITE_ONCE() annotations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoinet: annotate data-races around ifa->ifa_tstamp and ifa->ifa_cstamp
Eric Dumazet [Thu, 29 Feb 2024 11:40:11 +0000 (11:40 +0000)]
inet: annotate data-races around ifa->ifa_tstamp and ifa->ifa_cstamp

ifa->ifa_tstamp can be read locklessly.

Add appropriate READ_ONCE()/WRITE_ONCE() annotations.

Do the same for ifa->ifa_cstamp to prepare upcoming changes.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoMerge branch 'netdevsim-link'
David S. Miller [Fri, 1 Mar 2024 10:43:11 +0000 (10:43 +0000)]
Merge branch 'netdevsim-link'

David Wei says:

====================
netdevsim: link and forward skbs between ports

This patchset adds the ability to link two netdevsim ports together and
forward skbs between them, similar to veth. The goal is to use netdevsim
for testing features e.g. zero copy Rx using io_uring.

This feature was tested locally on QEMU, and a selftest is included.

I ran netdev selftests CI style and all tests but the following passed:
- gro.sh
- l2tp.sh
- ip_local_port_range.sh

gro.sh fails because virtme-ng mounts as read-only and it tries to write
to log.txt. This issue was reported to virtme-ng upstream.

l2tp.sh and ip_local_port_range.sh both fail for me on net-next/main as
well.

---
v13->v14:
- implement ndo_get_iflink()
- fix returning 0 if peer is already linked during linking or not linked
  during unlinking
- bump dropped counter if nsim_ipsec_tx() fails and generally reorder
  nsim_start_xmit()
- fix overflowing lines and indentations

v12->v13:
- wait for socat listening port to be ready before sending data in
  selftest

v11->v12:
- fix leaked netns refs
- fix rtnetlink.sh kci_test_ipsec_offload() selftest

v10->v11:
- add udevadm settle after creating netdevsims in selftest

v9->v10:
- fix not freeing skb when not there is no peer
- prevent possible id clashes in selftest
- cleanup selftest on error paths

v8->v9:
- switch to getting netns using fd rather than id
- prevent linking a netdevsim to itself
- update tests

v7->v8:
- fix not dereferencing RCU ptr using rcu_dereference()
- remove unused variables in selftest

v6->v7:
- change link syntax to netnsid:ifidx
- replace dev_get_by_index() with __dev_get_by_index()
- check for NULL peer when linking
- add a sysfs attribute for unlinking
- only update Tx stats if not dropped
- update selftest

v5->v6:
- reworked to link two netdevsims using sysfs attribute on the bus
  device instead of debugfs due to deadlock possibility if a netdevsim
  is removed during linking
- removed unnecessary patch maintaining a list of probed nsim_devs
- updated selftest

v4->v5:
- reduce nsim_dev_list_lock critical section
- fixed missing mutex unlock during unwind ladder
- rework nsim_dev_peer_write synchronization to take devlink lock as
  well as rtnl_lock
- return err msgs to user during linking if port doesn't exist or
  linking to self
- update tx stats outside of RCU lock

v3->v4:
- maintain a mutex protected list of probed nsim_devs instead of using
  nsim_bus_dev
- fixed synchronization issues by taking rtnl_lock
- track tx_dropped skbs

v2->v3:
- take lock when traversing nsim_bus_dev_list
- take device ref when getting a nsim_bus_dev
- return 0 if nsim_dev_peer_read cannot find the port
- address code formatting
- do not hard code values in selftests
- add Makefile for selftests

v1->v2:
- renamed debugfs file from "link" to "peer"
- replaced strstep() with sscanf() for consistency
- increased char[] buf sz to 22 for copying id + port from user
- added err msg w/ expected fmt when linking as a hint to user
- prevent linking port to itself
- protect peer ptr using RCU

====================

Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agonetdevsim: fix rtnetlink.sh selftest
David Wei [Wed, 28 Feb 2024 23:22:53 +0000 (15:22 -0800)]
netdevsim: fix rtnetlink.sh selftest

I cleared IFF_NOARP flag from netdevsim dev->flags in order to support
skb forwarding. This breaks the rtnetlink.sh selftest
kci_test_ipsec_offload() test because ipsec does not connect to peers it
cannot transmit to.

Fix the issue by adding a neigh entry manually. ipsec_offload test now
successfully pass.

Signed-off-by: David Wei <dw@davidwei.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agonetdevsim: add selftest for forwarding skb between connected ports
David Wei [Wed, 28 Feb 2024 23:22:52 +0000 (15:22 -0800)]
netdevsim: add selftest for forwarding skb between connected ports

Connect two netdevsim ports in different namespaces together, then send
packets between them using socat.

Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Maciek Machnikowski <maciek@machnikowski.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agonetdevsim: add ndo_get_iflink() implementation
David Wei [Wed, 28 Feb 2024 23:22:51 +0000 (15:22 -0800)]
netdevsim: add ndo_get_iflink() implementation

Add an implementation for ndo_get_iflink() in netdevsim that shows the
ifindex of the linked peer, if any.

Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Maciek Machnikowski <maciek@machnikowski.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agonetdevsim: forward skbs from one connected port to another
David Wei [Wed, 28 Feb 2024 23:22:50 +0000 (15:22 -0800)]
netdevsim: forward skbs from one connected port to another

Forward skbs sent from one netdevsim port to its connected netdevsim
port using dev_forward_skb, in a spirit similar to veth.

Add a tx_dropped variable to struct netdevsim, tracking the number of
skbs that could not be forwarded using dev_forward_skb().

The xmit() function accessing the peer ptr is protected by an RCU read
critical section. The rcu_read_lock() is functionally redundant as since
v5.0 all softirqs are implicitly RCU read critical sections; but it is
useful for human readers.

If another CPU is concurrently in nsim_destroy(), then it will first set
the peer ptr to NULL. This does not affect any existing readers that
dereferenced a non-NULL peer. Then, in unregister_netdevice(), there is
a synchronize_rcu() before the netdev is actually unregistered and
freed. This ensures that any readers i.e. xmit() that got a non-NULL
peer will complete before the netdev is freed.

Any readers after the RCU_INIT_POINTER() but before synchronize_rcu()
will dereference NULL, making it safe.

The codepath to nsim_destroy() and nsim_create() takes both the newly
added nsim_dev_list_lock and rtnl_lock. This makes it safe with
concurrent calls to linking two netdevsims together.

Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Maciek Machnikowski <maciek@machnikowski.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agonetdevsim: allow two netdevsim ports to be connected
David Wei [Wed, 28 Feb 2024 23:22:49 +0000 (15:22 -0800)]
netdevsim: allow two netdevsim ports to be connected

Add two netdevsim bus attribute to sysfs:
/sys/bus/netdevsim/link_device
/sys/bus/netdevsim/unlink_device

Writing "A M B N" to link_device will link netdevsim M in netnsid A with
netdevsim N in netnsid B.

Writing "A M" to unlink_device will unlink netdevsim M in netnsid A from
its peer, if any.

rtnl_lock is taken to ensure nothing changes during the linking.

Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Maciek Machnikowski <maciek@machnikowski.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoMerge branch 'selftests-xfail'
David S. Miller [Fri, 1 Mar 2024 10:30:30 +0000 (10:30 +0000)]
Merge branch 'selftests-xfail'

Jakub Kicinski says:

====================
selftests: kselftest_harness: support using xfail

When running selftests for our subsystem in our CI we'd like all
tests to pass. Currently some tests use SKIP for cases they
expect to fail, because the kselftest_harness limits the return
codes to pass/fail/skip. XFAIL which would be a great match
here cannot be used.

Remove the no_print handling and use vfork() to run the test in
a different process than the setup. This way we don't need to
pass "failing step" via the exit code. Further clean up the exit
codes so that we can use all KSFT_* values. Rewrite the result
printing to make handling XFAIL/XPASS easier. Support tests
declaring combinations of fixture + variant they expect to fail.

Merge plan is to put it on top of -rc6 and merge into net-next.
That way others should be able to pull the patches without
any networking changes.

v4:
 - rebase on top of Mickael's vfork() changes
v3: https://lore.kernel.org/all/20240220192235.2953484-1-kuba@kernel.org/
 - combine multiple series
 - change to "list of expected failures" rather than SKIP()-like handling
v2: https://lore.kernel.org/all/20240216002619.1999225-1-kuba@kernel.org/
 - fix alignment
follow up RFC: https://lore.kernel.org/all/20240216004122.2004689-1-kuba@kernel.org/
v1: https://lore.kernel.org/all/20240213154416.422739-1-kuba@kernel.org/
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoselftests: ip_local_port_range: use XFAIL instead of SKIP
Jakub Kicinski [Thu, 29 Feb 2024 00:59:19 +0000 (16:59 -0800)]
selftests: ip_local_port_range: use XFAIL instead of SKIP

SCTP does not support IP_LOCAL_PORT_RANGE and we know it,
so use XFAIL instead of SKIP.

Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoselftests: kselftest_harness: support using xfail
Jakub Kicinski [Thu, 29 Feb 2024 00:59:18 +0000 (16:59 -0800)]
selftests: kselftest_harness: support using xfail

Currently some tests report skip for things they expect to fail
e.g. when given combination of parameters is known to be unsupported.
This is confusing because in an ideal test environment and fully
featured kernel no tests should be skipped.

Selftest summary line already includes xfail and xpass counters,
e.g.:

  Totals: pass:725 fail:0 xfail:0 xpass:0 skip:0 error:0

but there's no way to use it from within the harness.

Add a new per-fixture+variant combination list of test cases
we expect to fail.

Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoselftests: kselftest_harness: let PASS / FAIL provide diagnostic
Jakub Kicinski [Thu, 29 Feb 2024 00:59:17 +0000 (16:59 -0800)]
selftests: kselftest_harness: let PASS / FAIL provide diagnostic

Switch to printing KTAP line for PASS / FAIL with ksft_test_result_code(),
this gives us the ability to report diagnostic messages.

Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoselftests: kselftest_harness: separate diagnostic message with # in ksft_test_result_...
Jakub Kicinski [Thu, 29 Feb 2024 00:59:16 +0000 (16:59 -0800)]
selftests: kselftest_harness: separate diagnostic message with # in ksft_test_result_code()

According to the spec we should always print a # if we add
a diagnostic message. Having the caller pass in the new line
as part of diagnostic message makes handling this a bit
counter-intuitive, so append the new line in the helper.

Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoselftests: kselftest_harness: print test name for SKIP
Jakub Kicinski [Thu, 29 Feb 2024 00:59:15 +0000 (16:59 -0800)]
selftests: kselftest_harness: print test name for SKIP

Jakub points out that for parsers it's rather useful to always
have the test name on the result line. Currently if we SKIP
(or soon XFAIL or XPASS), we will print:

ok 17 # SKIP SCTP doesn't support IP_BIND_ADDRESS_NO_PORT

     ^
     no test name

Always print the test name.
KTAP format seems to allow or even call for it, per:
https://docs.kernel.org/dev-tools/ktap.html

Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/all/87jzn6lnou.fsf@cloudflare.com/
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoselftests: kselftest: add ksft_test_result_code(), handling all exit codes
Jakub Kicinski [Thu, 29 Feb 2024 00:59:14 +0000 (16:59 -0800)]
selftests: kselftest: add ksft_test_result_code(), handling all exit codes

For generic test harness code it's more useful to deal with exit
codes directly, rather than having to switch on them and call
the right ksft_test_result_*() helper. Add such function to kselftest.h.

Note that "directive" and "diagnostic" are what ktap docs call
those parts of the message.

Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoselftests: kselftest_harness: use exit code to store skip
Jakub Kicinski [Thu, 29 Feb 2024 00:59:13 +0000 (16:59 -0800)]
selftests: kselftest_harness: use exit code to store skip

We always use skip in combination with exit_code being 0
(KSFT_PASS). This are basic KSFT / KTAP semantics.
Store the right KSFT_* code in exit_code directly.

This makes it easier to support tests reporting other
extended KSFT_* codes like XFAIL / XPASS.

Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoselftests: kselftest_harness: save full exit code in metadata
Jakub Kicinski [Thu, 29 Feb 2024 00:59:12 +0000 (16:59 -0800)]
selftests: kselftest_harness: save full exit code in metadata

Instead of tracking passed = 0/1 rename the field to exit_code
and invert the values so that they match the KSFT_* exit codes.
This will allow us to fold SKIP / XFAIL into the same value.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoselftests: kselftest_harness: generate test name once
Jakub Kicinski [Thu, 29 Feb 2024 00:59:11 +0000 (16:59 -0800)]
selftests: kselftest_harness: generate test name once

Since we added variant support generating full test case
name takes 4 string arguments. We're about to need it
in another two places. Stop the duplication and print
once into a temporary buffer.

Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoselftests: kselftest_harness: use KSFT_* exit codes
Jakub Kicinski [Thu, 29 Feb 2024 00:59:10 +0000 (16:59 -0800)]
selftests: kselftest_harness: use KSFT_* exit codes

Now that we no longer need low exit codes to communicate
assertion steps - use normal KSFT exit codes.

Acked-by: Kees Cook <keescook@chromium.org>
Tested-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoselftests/harness: Merge TEST_F_FORK() into TEST_F()
Mickaël Salaün [Thu, 29 Feb 2024 00:59:09 +0000 (16:59 -0800)]
selftests/harness: Merge TEST_F_FORK() into TEST_F()

Replace Landlock-specific TEST_F_FORK() with an improved TEST_F() which
brings four related changes:

Run TEST_F()'s tests in a grandchild process to make it possible to
drop privileges and delegate teardown to the parent.

Compared to TEST_F_FORK(), simplify handling of the test grandchild
process thanks to vfork(2), and makes it generic (e.g. no explicit
conversion between exit code and _metadata).

Compared to TEST_F_FORK(), run teardown even when tests failed with an
assert thanks to commit 63e6b2a42342 ("selftests/harness: Run TEARDOWN
for ASSERT failures").

Simplify the test harness code by removing the no_print and step fields
which are not used.  I added this feature just after I made
kselftest_harness.h more broadly available but this step counter
remained even though it wasn't needed after all. See commit 369130b63178
("selftests: Enhance kselftest_harness.h to print which assert failed").

Replace spaces with tabs in one line of __TEST_F_IMPL().

Cc: Günther Noack <gnoack@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Will Drewry <wad@chromium.org>
Signed-off-by: Mickaël Salaün <mic@digikod.net>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoselftests/landlock: Redefine TEST_F() as TEST_F_FORK()
Mickaël Salaün [Thu, 29 Feb 2024 00:59:08 +0000 (16:59 -0800)]
selftests/landlock: Redefine TEST_F() as TEST_F_FORK()

This has the effect of creating a new test process for either TEST_F()
or TEST_F_FORK(), which doesn't change tests but will ease potential
backports.  See next commit for the TEST_F_FORK() merge into TEST_F().

Cc: Günther Noack <gnoack@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Will Drewry <wad@chromium.org>
Signed-off-by: Mickaël Salaün <mic@digikod.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoMerge branch 'net-asp22-optimizations'
David S. Miller [Fri, 1 Mar 2024 09:22:50 +0000 (09:22 +0000)]
Merge branch 'net-asp22-optimizations'

Justin Chen says:

====================
Support for ASP 2.2 and optimizations

ASP 2.2 adds some power savings during low power modes.

Also make various improvements when entering low power modes and
reduce MDIO traffic by hooking up interrupts.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agonet: bcmasp: Add support for PHY interrupts
Justin Chen [Wed, 28 Feb 2024 22:54:00 +0000 (14:54 -0800)]
net: bcmasp: Add support for PHY interrupts

Hook up the phy interrupts for internal phys to reduce mdio traffic
and improve responsiveness of link changes.

Signed-off-by: Justin Chen <justin.chen@broadcom.com>
Acked-by: Florian Fainelli <florian.fainelli@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agonet: bcmasp: Keep buffers through power management
Justin Chen [Wed, 28 Feb 2024 22:53:59 +0000 (14:53 -0800)]
net: bcmasp: Keep buffers through power management

There is no advantage of freeing and re-allocating buffers through
suspend and resume. This waste cycles and makes suspend/resume time
longer. We also open ourselves to failed allocations in systems with
heavy memory fragmentation.

Signed-off-by: Justin Chen <justin.chen@broadcom.com>
Acked-by: Florian Fainelli <florian.fainelli@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agonet: phy: mdio-bcm-unimac: Add asp v2.2 support
Justin Chen [Wed, 28 Feb 2024 22:53:58 +0000 (14:53 -0800)]
net: phy: mdio-bcm-unimac: Add asp v2.2 support

Add mdio compat string for ASP 2.0 ethernet driver.

Signed-off-by: Justin Chen <justin.chen@broadcom.com>
Acked-by: Florian Fainelli <florian.fainelli@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agonet: bcmasp: Add support for ASP 2.2
Justin Chen [Wed, 28 Feb 2024 22:53:57 +0000 (14:53 -0800)]
net: bcmasp: Add support for ASP 2.2

ASP 2.2 improves power savings during low power modes.

A new register was added to toggle to a slower clock during low
power modes.

EEE was broken for ASP 2.0/2.1. A HW workaround was added for
ASP 2.2 that requires toggling a chicken bit.

Signed-off-by: Justin Chen <justin.chen@broadcom.com>
Acked-by: Florian Fainelli <florian.fainelli@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agodt-bindings: net: brcm,asp-v2.0: Add asp-v2.2
Justin Chen [Wed, 28 Feb 2024 22:53:56 +0000 (14:53 -0800)]
dt-bindings: net: brcm,asp-v2.0: Add asp-v2.2

Add support for ASP 2.2.

Signed-off-by: Justin Chen <justin.chen@broadcom.com>
Acked-by: Florian Fainelli <florian.fainelli@broadcom.com>
Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agodt-bindings: net: brcm,unimac-mdio: Add asp-v2.2
Justin Chen [Wed, 28 Feb 2024 22:53:55 +0000 (14:53 -0800)]
dt-bindings: net: brcm,unimac-mdio: Add asp-v2.2

The ASP 2.2 Ethernet controller uses a brcm unimac.

Signed-off-by: Justin Chen <justin.chen@broadcom.com>
Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Acked-by: Florian Fainelli <florian.fainelli@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoMerge branch 'qcom-phy-possible'
David S. Miller [Fri, 1 Mar 2024 08:56:39 +0000 (08:56 +0000)]
Merge branch 'qcom-phy-possible'

Robert Marko says:

====================
net: phy: qcom: qca808x: fill in possible_interfaces

QCA808x does not currently fill in the possible_interfaces.

This leads to Phylink not being aware that it supports 2500Base-X as well
so in cases where it is connected to a DSA switch like MV88E6393 it will
limit that port to phy-mode set in the DTS.

That means that if SGMII is used you are limited to 1G only while if
2500Base-X was set you are limited to 2.5G only.

Populating the possible_interfaces fixes this.

Changes in v2:
* Get rid of the if/else by Russels suggestion in the helper
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agonet: phy: qcom: qca808x: fill in possible_interfaces
Robert Marko [Wed, 28 Feb 2024 17:24:10 +0000 (18:24 +0100)]
net: phy: qcom: qca808x: fill in possible_interfaces

Currently QCA808x driver does not fill the possible_interfaces.
2.5G QCA808x support SGMII and 2500Base-X while 1G model only supports
SGMII, so fill the possible_interfaces accordingly.

Signed-off-by: Robert Marko <robimarko@gmail.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agonet: phy: qcom: qca808x: add helper for checking for 1G only model
Robert Marko [Wed, 28 Feb 2024 17:24:09 +0000 (18:24 +0100)]
net: phy: qcom: qca808x: add helper for checking for 1G only model

There are 2 versions of QCA808x, one 2.5G capable and one 1G capable.
Currently, this matter only in the .get_features call however, it will
be required for filling supported interface modes so lets add a helper
that can be reused.

Signed-off-by: Robert Marko <robimarko@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoSimplify net_dbg_ratelimited() dummy
Geert Uytterhoeven [Wed, 28 Feb 2024 14:05:29 +0000 (15:05 +0100)]
Simplify net_dbg_ratelimited() dummy

There is no need to wrap calls to the no_printk() helper inside an
always-false check, as no_printk() already does that internally.

Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoMerge branch 'ipv6-devconf-lockless'
David S. Miller [Fri, 1 Mar 2024 08:42:33 +0000 (08:42 +0000)]
Merge branch 'ipv6-devconf-lockless'

Eric Dumazet says:

====================
ipv6: lockless accesses to devconf

- First patch puts in a cacheline_group the fields used in fast paths.

- Annotate all data races around idev->cnf fields.

- Last patch in this series removes RTNL use for RTM_GETNETCONF dumps.

v3: addressed Jakub Kicinski feedback in addrconf_disable_ipv6()
    Added tags from Jiri and Florian.

v2: addressed Jiri Pirko feedback
 - Added "ipv6: addrconf_disable_ipv6() optimizations"
   and "ipv6: addrconf_disable_policy() optimization"
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoipv6: use xa_array iterator to implement inet6_netconf_dump_devconf()
Eric Dumazet [Wed, 28 Feb 2024 13:54:39 +0000 (13:54 +0000)]
ipv6: use xa_array iterator to implement inet6_netconf_dump_devconf()

1) inet6_netconf_dump_devconf() can run under RCU protection
   instead of RTNL.

2) properly return 0 at the end of a dump, avoiding an
   an extra recvmsg() system call.

3) Do not use inet6_base_seq() anymore, for_each_netdev_dump()
   has nice properties. Restarting a GETDEVCONF dump if a device has
   been added/removed or if net->ipv6.dev_addr_genid has changed is moot.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoipv6/addrconf: annotate data-races around devconf fields (II)
Eric Dumazet [Wed, 28 Feb 2024 13:54:38 +0000 (13:54 +0000)]
ipv6/addrconf: annotate data-races around devconf fields (II)

Final (?) round of this series.

Annotate lockless reads on following devconf fields,
because they be changed concurrently from /proc/net/ipv6/conf.

- accept_dad
- optimistic_dad
- use_optimistic
- use_oif_addrs_only
- ra_honor_pio_life
- keep_addr_on_down
- ndisc_notify
- ndisc_evict_nocarrier
- suppress_frag_ndisc
- addr_gen_mode
- seg6_enabled
- ioam6_enabled
- ioam6_id
- ioam6_id_wide
- drop_unicast_in_l2_multicast
- mldv[12]_unsolicited_report_interval
- force_mld_version
- force_tllao
- accept_untracked_na
- drop_unsolicited_na
- accept_source_route

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoipv6/addrconf: annotate data-races around devconf fields (I)
Eric Dumazet [Wed, 28 Feb 2024 13:54:37 +0000 (13:54 +0000)]
ipv6/addrconf: annotate data-races around devconf fields (I)

Annotate lockless reads and writes on following devconf fields:

- regen_min_advance
- regen_max_retry
- dad_transmits
- use_tempaddr
- max_addresses
- max_desync_factor
- temp_valid_lft
- rtr_solicits
- rtr_solicit_max_interval
- rtr_solicit_interval
- rtr_solicit_delay
- enhanced_dad
- accept_redirects

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoipv6: addrconf_disable_policy() optimization
Eric Dumazet [Wed, 28 Feb 2024 13:54:36 +0000 (13:54 +0000)]
ipv6: addrconf_disable_policy() optimization

Writing over /proc/sys/net/ipv6/conf/default/disable_policy
does not need to hold RTNL.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoipv6: annotate data-races around devconf->disable_policy
Eric Dumazet [Wed, 28 Feb 2024 13:54:35 +0000 (13:54 +0000)]
ipv6: annotate data-races around devconf->disable_policy

idev->cnf.disable_policy and net->ipv6.devconf_all->disable_policy
can be read locklessly. Add appropriate annotations on reads
and writes.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoipv6: annotate data-races around devconf->proxy_ndp
Eric Dumazet [Wed, 28 Feb 2024 13:54:34 +0000 (13:54 +0000)]
ipv6: annotate data-races around devconf->proxy_ndp

devconf->proxy_ndp can be read and written locklessly,
add appropriate annotations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoipv6: annotate data-races in rt6_probe()
Eric Dumazet [Wed, 28 Feb 2024 13:54:33 +0000 (13:54 +0000)]
ipv6: annotate data-races in rt6_probe()

Use READ_ONCE() while reading idev->cnf.rtr_probe_interval
while its value could be changed.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoipv6: annotate data-races around idev->cnf.ignore_routes_with_linkdown
Eric Dumazet [Wed, 28 Feb 2024 13:54:32 +0000 (13:54 +0000)]
ipv6: annotate data-races around idev->cnf.ignore_routes_with_linkdown

idev->cnf.ignore_routes_with_linkdown can be used without any locks,
add appropriate annotations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoipv6: annotate data-races in ndisc_router_discovery()
Eric Dumazet [Wed, 28 Feb 2024 13:54:31 +0000 (13:54 +0000)]
ipv6: annotate data-races in ndisc_router_discovery()

Annotate reads from in6_dev->cnf.XXX fields, as they could
change concurrently.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoipv6: annotate data-races around cnf.forwarding
Eric Dumazet [Wed, 28 Feb 2024 13:54:30 +0000 (13:54 +0000)]
ipv6: annotate data-races around cnf.forwarding

idev->cnf.forwarding and net->ipv6.devconf_all->forwarding
might be read locklessly, add appropriate READ_ONCE()
and WRITE_ONCE() annotations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoipv6: annotate data-races around cnf.hop_limit
Eric Dumazet [Wed, 28 Feb 2024 13:54:29 +0000 (13:54 +0000)]
ipv6: annotate data-races around cnf.hop_limit

idev->cnf.hop_limit and net->ipv6.devconf_all->hop_limit
might be read locklessly, add appropriate READ_ONCE()
and WRITE_ONCE() annotations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Florian Westphal <fw@strlen.de> # for netfilter parts
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoipv6: annotate data-races around cnf.mtu6
Eric Dumazet [Wed, 28 Feb 2024 13:54:28 +0000 (13:54 +0000)]
ipv6: annotate data-races around cnf.mtu6

idev->cnf.mtu6 might be read locklessly, add appropriate READ_ONCE()
and WRITE_ONCE() annotations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoipv6: addrconf_disable_ipv6() optimization
Eric Dumazet [Wed, 28 Feb 2024 13:54:27 +0000 (13:54 +0000)]
ipv6: addrconf_disable_ipv6() optimization

Writing over /proc/sys/net/ipv6/conf/default/disable_ipv6
does not need to hold RTNL.

v3: remove a wrong change (Jakub Kicinski feedback)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoipv6: annotate data-races around cnf.disable_ipv6
Eric Dumazet [Wed, 28 Feb 2024 13:54:26 +0000 (13:54 +0000)]
ipv6: annotate data-races around cnf.disable_ipv6

disable_ipv6 is read locklessly, add appropriate READ_ONCE()
and WRITE_ONCE() annotations.

v2: do not preload net before rtnl_trylock() in
    addrconf_disable_ipv6() (Jiri)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoipv6: add ipv6_devconf_read_txrx cacheline_group
Eric Dumazet [Wed, 28 Feb 2024 13:54:25 +0000 (13:54 +0000)]
ipv6: add ipv6_devconf_read_txrx cacheline_group

IPv6 TX and RX fast path use the following fields:

- disable_ipv6
- hop_limit
- mtu6
- forwarding
- disable_policy
- proxy_ndp

Place them in a group to increase data locality.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Jakub Kicinski [Thu, 29 Feb 2024 22:17:54 +0000 (14:17 -0800)]
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Cross-merge networking fixes after downstream PR.

Conflicts:

net/mptcp/protocol.c
  adf1bb78dab5 ("mptcp: fix snd_wnd initialization for passive socket")
  9426ce476a70 ("mptcp: annotate lockless access for RX path fields")
https://lore.kernel.org/all/20240228103048.19255709@canb.auug.org.au/

Adjacent changes:

drivers/dpll/dpll_core.c
  0d60d8df6f49 ("dpll: rely on rcu for netdev_dpll_pin()")
  e7f8df0e81bf ("dpll: move xa_erase() call in to match dpll_pin_alloc() error path order")

drivers/net/veth.c
  1ce7d306ea63 ("veth: try harder when allocating queue memory")
  0bef512012b1 ("net: add netdev_lockdep_set_classes() to virtual drivers")

drivers/net/wireless/intel/iwlwifi/mvm/d3.c
  8c9bef26e98b ("wifi: iwlwifi: mvm: d3: implement suspend with MLO")
  78f65fbf421a ("wifi: iwlwifi: mvm: ensure offloading TID queue exists")

net/wireless/nl80211.c
  f78c1375339a ("wifi: nl80211: reject iftype change with mesh ID change")
  414532d8aa89 ("wifi: cfg80211: use IEEE80211_MAX_MESH_ID_LEN appropriately")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
19 months agoMerge branch 'create-shadow-types-for-struct_ops-maps-in-skeletons'
Andrii Nakryiko [Thu, 29 Feb 2024 21:54:06 +0000 (13:54 -0800)]
Merge branch 'create-shadow-types-for-struct_ops-maps-in-skeletons'

Kui-Feng Lee says:

====================
Create shadow types for struct_ops maps in skeletons

This patchset allows skeleton users to change the values of the fields
in struct_ops maps at runtime. It will create a shadow type pointer in
a skeleton for each struct_ops map, allowing users to access the
values of fields through these pointers. For instance, if there is an
integer field named "FOO" in a struct_ops map called "testmap", you
can access the value of "FOO" in this way.

    skel->struct_ops.testmap->FOO = 13;

With this feature, the users can pass flags or other data along with
the map from the user space to the kernel without creating separate
struct_ops map for different values in BPF.

== Shadow Type ==

The shadow type of a struct_ops map is a variant of the original
struct type of the map. The code generator translates each field in
the original struct type to a field in the shadow type. The type of a
field in the shadow type may not be the same as the corresponding
field in the original struct type. For example, modifiers like
volatile, const, etc., are removed from the fields in a shadow
type. Function pointers are translated to pointers of struct
bpf_program.

Currently, only scalar types and function pointers are
supported. Fields belonging to structs, unions, non-function pointers,
arrays, or other types are not supported. For those unsupported
fields, they are converted to arrays of characters to preserve their
space within the original struct type.

The padding between consecutive fields is handled by padding fields
(__padding_*). This helps to maintain the memory layout consistent
with the original struct_type.

Here is an example of shadow types.
The origin struct type of a struct_ops map is

    struct bpf_testmod_ops {
     int (*test_1)(void);
     void (*test_2)(int a, int b);
     /* Used to test nullable arguments. */
     int (*test_maybe_null)(int dummy, struct task_struct *task);

     /* The following fields are used to test shadow copies. */
     char onebyte;
     struct {
     int a;
     int b;
     } unsupported;
     int data;
    };

The struct_ops map, named testmod_1, of this type will be translated
to a pointer in the shadow type.

    struct {
     struct my_skel__testmod_1__bpf_testmod_ops {
     const struct bpf_program *test_1;
     const struct bpf_program *test_2;
     const struct bpf_program *test_maybe_null;
     char onebyte;
     char __padding_4[3];
     char __unsupported_4[8];
     int data;
     } *testmod_1;
    } struct_ops;

== Convert st_ops->data to Shadow Type ==

libbpf converts st_ops->data to the format of the shadow type for each
struct_ops map. This means that the bytes where function pointers are
located are converted to the values of the pointers of struct
bpf_program. The fields of other types are kept as they were.

Libbpf will synchronize the pointers of struct bpf_program with
st_ops->progs[] so that users can change function pointers
(bpf_program) before loading the map.
---
Changes from v5:

 - Generate names for shadow types.

 - Check btf and the number of struct_ops maps in gen_st_ops_shadow()
   and gen_st_ops_shadow_init() instead of do_skeleton() and
   do_subskeleton().

 - Name unsupported fields in the pattern __unsupported_*.

 - Have a padding field for a unsupported fields as well if necessary.

 - Implement resolve_func_ptr() in gen.c instead of reusing the one in
   libbpf. (Remove the part 1 in v4.)

 - Fix stylistic issues.

Changes from v4:

 - Convert function pointers to the pointers to struct bpf_program in
   bpf_object__collect_st_ops_relos().

Changes from v3:

 - Add comment to avoid people from removing resolve_func_ptr() from
   libbpf_internal.h

 - Fix commit logs and comments.

 - Add an example about using the pointers of shadow types
   for struct_ops maps to bpftool-gen.8.

v5: https://lore.kernel.org/all/20240227010432.714127-1-thinker.li@gmail.com/
v4: https://lore.kernel.org/all/20240222222624.1163754-1-thinker.li@gmail.com/
v3: https://lore.kernel.org/all/20240221012329.1387275-1-thinker.li@gmail.com/
v2: https://lore.kernel.org/all/20240214020836.1845354-1-thinker.li@gmail.com/
v1: https://lore.kernel.org/all/20240124224130.859921-1-thinker.li@gmail.com/

Kui-Feng Lee (5):
  libbpf: set btf_value_type_id of struct bpf_map for struct_ops.
  libbpf: Convert st_ops->data to shadow type.
  bpftool: generated shadow variables for struct_ops maps.
  bpftool: Add an example for struct_ops map and shadow type.
  selftests/bpf: Test if shadow types work correctly.

 .../bpf/bpftool/Documentation/bpftool-gen.rst |  58 ++++-
 tools/bpf/bpftool/gen.c                       | 237 +++++++++++++++++-
 tools/lib/bpf/libbpf.c                        |  50 +++-
 .../selftests/bpf/bpf_testmod/bpf_testmod.c   |  11 +-
 .../selftests/bpf/bpf_testmod/bpf_testmod.h   |   8 +
 .../bpf/prog_tests/test_struct_ops_module.c   |  19 +-
 .../selftests/bpf/progs/struct_ops_module.c   |   8 +
 7 files changed, 377 insertions(+), 14 deletions(-)
====================

Link: https://lore.kernel.org/r/20240229064523.2091270-1-thinker.li@gmail.com
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
19 months agoselftests/bpf: Test if shadow types work correctly.
Kui-Feng Lee [Thu, 29 Feb 2024 06:45:23 +0000 (22:45 -0800)]
selftests/bpf: Test if shadow types work correctly.

Change the values of fields, including scalar types and function pointers,
and check if the struct_ops map works as expected.

The test changes the field "test_2" of "testmod_1" from the pointer to
test_2() to pointer to test_3() and the field "data" to 13. The function
test_2() and test_3() both compute a new value for "test_2_result", but in
different way. By checking the value of "test_2_result", it ensures the
struct_ops map works as expected with changes through shadow types.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240229064523.2091270-6-thinker.li@gmail.com
19 months agobpftool: Add an example for struct_ops map and shadow type.
Kui-Feng Lee [Thu, 29 Feb 2024 06:45:22 +0000 (22:45 -0800)]
bpftool: Add an example for struct_ops map and shadow type.

The example in bpftool-gen.8 explains how to use the pointer of the shadow
type to change the value of a field of a struct_ops map.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Quentin Monnet <quentin@isovalent.com>
Link: https://lore.kernel.org/bpf/20240229064523.2091270-5-thinker.li@gmail.com
19 months agobpftool: Generated shadow variables for struct_ops maps.
Kui-Feng Lee [Thu, 29 Feb 2024 06:45:21 +0000 (22:45 -0800)]
bpftool: Generated shadow variables for struct_ops maps.

Declares and defines a pointer of the shadow type for each struct_ops map.

The code generator will create an anonymous struct type as the shadow type
for each struct_ops map. The shadow type is translated from the original
struct type of the map. The user of the skeleton use pointers of them to
access the values of struct_ops maps.

However, shadow types only supports certain types of fields, including
scalar types and function pointers. Any fields of unsupported types are
translated into an array of characters to occupy the space of the original
field. Function pointers are translated into pointers of the struct
bpf_program. Additionally, padding fields are generated to occupy the space
between two consecutive fields.

The pointers of shadow types of struct_osp maps are initialized when
*__open_opts() in skeletons are called. For a map called FOO, the user can
access it through the pointer at skel->struct_ops.FOO.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Quentin Monnet <quentin@isovalent.com>
Link: https://lore.kernel.org/bpf/20240229064523.2091270-4-thinker.li@gmail.com
19 months agolibbpf: Convert st_ops->data to shadow type.
Kui-Feng Lee [Thu, 29 Feb 2024 06:45:20 +0000 (22:45 -0800)]
libbpf: Convert st_ops->data to shadow type.

Convert st_ops->data to the shadow type of the struct_ops map. The shadow
type of a struct_ops type is a variant of the original struct type
providing a way to access/change the values in the maps of the struct_ops
type.

bpf_map__initial_value() will return st_ops->data for struct_ops types. The
skeleton is going to use it as the pointer to the shadow type of the
original struct type.

One of the main differences between the original struct type and the shadow
type is that all function pointers of the shadow type are converted to
pointers of struct bpf_program. Users can replace these bpf_program
pointers with other BPF programs. The st_ops->progs[] will be updated
before updating the value of a map to reflect the changes made by users.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240229064523.2091270-3-thinker.li@gmail.com
19 months agolibbpf: Set btf_value_type_id of struct bpf_map for struct_ops.
Kui-Feng Lee [Thu, 29 Feb 2024 06:45:19 +0000 (22:45 -0800)]
libbpf: Set btf_value_type_id of struct bpf_map for struct_ops.

For a struct_ops map, btf_value_type_id is the type ID of it's struct
type. This value is required by bpftool to generate skeleton including
pointers of shadow types. The code generator gets the type ID from
bpf_map__btf_value_type_id() in order to get the type information of the
struct type of a map.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240229064523.2091270-2-thinker.li@gmail.com
19 months agobpf: Replace bpf_lpm_trie_key 0-length array with flexible array
Kees Cook [Thu, 22 Feb 2024 15:56:15 +0000 (07:56 -0800)]
bpf: Replace bpf_lpm_trie_key 0-length array with flexible array

Replace deprecated 0-length array in struct bpf_lpm_trie_key with
flexible array. Found with GCC 13:

../kernel/bpf/lpm_trie.c:207:51: warning: array subscript i is outside array bounds of 'const __u8[0]' {aka 'const unsigned char[]'} [-Warray-bounds=]
  207 |                                        *(__be16 *)&key->data[i]);
      |                                                   ^~~~~~~~~~~~~
../include/uapi/linux/swab.h:102:54: note: in definition of macro '__swab16'
  102 | #define __swab16(x) (__u16)__builtin_bswap16((__u16)(x))
      |                                                      ^
../include/linux/byteorder/generic.h:97:21: note: in expansion of macro '__be16_to_cpu'
   97 | #define be16_to_cpu __be16_to_cpu
      |                     ^~~~~~~~~~~~~
../kernel/bpf/lpm_trie.c:206:28: note: in expansion of macro 'be16_to_cpu'
  206 |                 u16 diff = be16_to_cpu(*(__be16 *)&node->data[i]
^
      |                            ^~~~~~~~~~~
In file included from ../include/linux/bpf.h:7:
../include/uapi/linux/bpf.h:82:17: note: while referencing 'data'
   82 |         __u8    data[0];        /* Arbitrary size */
      |                 ^~~~

And found at run-time under CONFIG_FORTIFY_SOURCE:

  UBSAN: array-index-out-of-bounds in kernel/bpf/lpm_trie.c:218:49
  index 0 is out of range for type '__u8 [*]'

Changing struct bpf_lpm_trie_key is difficult since has been used by
userspace. For example, in Cilium:

struct egress_gw_policy_key {
        struct bpf_lpm_trie_key lpm_key;
        __u32 saddr;
        __u32 daddr;
};

While direct references to the "data" member haven't been found, there
are static initializers what include the final member. For example,
the "{}" here:

        struct egress_gw_policy_key in_key = {
                .lpm_key = { 32 + 24, {} },
                .saddr   = CLIENT_IP,
                .daddr   = EXTERNAL_SVC_IP & 0Xffffff,
        };

To avoid the build time and run time warnings seen with a 0-sized
trailing array for struct bpf_lpm_trie_key, introduce a new struct
that correctly uses a flexible array for the trailing bytes,
struct bpf_lpm_trie_key_u8. As part of this, include the "header"
portion (which is just the "prefixlen" member), so it can be used
by anything building a bpf_lpr_trie_key that has trailing members that
aren't a u8 flexible array (like the self-test[1]), which is named
struct bpf_lpm_trie_key_hdr.

Unfortunately, C++ refuses to parse the __struct_group() helper, so
it is not possible to define struct bpf_lpm_trie_key_hdr directly in
struct bpf_lpm_trie_key_u8, so we must open-code the union directly.

Adjust the kernel code to use struct bpf_lpm_trie_key_u8 through-out,
and for the selftest to use struct bpf_lpm_trie_key_hdr. Add a comment
to the UAPI header directing folks to the two new options.

Reported-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Closes: https://paste.debian.net/hidden/ca500597/
Link: https://lore.kernel.org/all/202206281009.4332AA33@keescook/
Link: https://lore.kernel.org/bpf/20240222155612.it.533-kees@kernel.org
19 months agoMerge tag 'net-6.8-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Linus Torvalds [Thu, 29 Feb 2024 20:40:20 +0000 (12:40 -0800)]
Merge tag 'net-6.8-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from bluetooth, WiFi and netfilter.

  We have one outstanding issue with the stmmac driver, which may be a
  LOCKDEP false positive, not a blocker.

  Current release - regressions:

   - netfilter: nf_tables: re-allow NFPROTO_INET in
     nft_(match/target)_validate()

   - eth: ionic: fix error handling in PCI reset code

  Current release - new code bugs:

   - eth: stmmac: complete meta data only when enabled, fix null-deref

   - kunit: fix again checksum tests on big endian CPUs

  Previous releases - regressions:

   - veth: try harder when allocating queue memory

   - Bluetooth:
      - hci_bcm4377: do not mark valid bd_addr as invalid
      - hci_event: fix handling of HCI_EV_IO_CAPA_REQUEST

  Previous releases - always broken:

   - info leak in __skb_datagram_iter() on netlink socket

   - mptcp:
      - map v4 address to v6 when destroying subflow
      - fix potential wake-up event loss due to sndbuf auto-tuning
      - fix double-free on socket dismantle

   - wifi: nl80211: reject iftype change with mesh ID change

   - fix small out-of-bound read when validating netlink be16/32 types

   - rtnetlink: fix error logic of IFLA_BRIDGE_FLAGS writing back

   - ipv6: fix potential "struct net" ref-leak in inet6_rtm_getaddr()

   - ip_tunnel: prevent perpetual headroom growth with huge number of
     tunnels on top of each other

   - mctp: fix skb leaks on error paths of mctp_local_output()

   - eth: ice: fixes for DPLL state reporting

   - dpll: rely on rcu for netdev_dpll_pin() to prevent UaF

   - eth: dpaa: accept phy-interface-type = '10gbase-r' in the device
     tree"

* tag 'net-6.8-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (73 commits)
  dpll: fix build failure due to rcu_dereference_check() on unknown type
  kunit: Fix again checksum tests on big endian CPUs
  tls: fix use-after-free on failed backlog decryption
  tls: separate no-async decryption request handling from async
  tls: fix peeking with sync+async decryption
  tls: decrement decrypt_pending if no async completion will be called
  gtp: fix use-after-free and null-ptr-deref in gtp_newlink()
  net: hsr: Use correct offset for HSR TLV values in supervisory HSR frames
  igb: extend PTP timestamp adjustments to i211
  rtnetlink: fix error logic of IFLA_BRIDGE_FLAGS writing back
  tools: ynl: fix handling of multiple mcast groups
  selftests: netfilter: add bridge conntrack + multicast test case
  netfilter: bridge: confirm multicast packets before passing them up the stack
  netfilter: nf_tables: allow NFPROTO_INET in nft_(match/target)_validate()
  Bluetooth: qca: Fix triggering coredump implementation
  Bluetooth: hci_qca: Set BDA quirk bit if fwnode exists in DT
  Bluetooth: qca: Fix wrong event type for patch config command
  Bluetooth: Enforce validation on max value of connection interval
  Bluetooth: hci_event: Fix handling of HCI_EV_IO_CAPA_REQUEST
  Bluetooth: mgmt: Fix limited discoverable off timeout
  ...

19 months agoMerge tag 'landlock-6.8-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mic...
Linus Torvalds [Thu, 29 Feb 2024 20:29:23 +0000 (12:29 -0800)]
Merge tag 'landlock-6.8-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mic/linux

Pull Landlock fix from Mickaël Salaün:
 "Fix a potential issue when handling inodes with inconsistent
  properties"

* tag 'landlock-6.8-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mic/linux:
  landlock: Fix asymmetric private inodes referring

19 months agodpll: fix build failure due to rcu_dereference_check() on unknown type
Eric Dumazet [Thu, 29 Feb 2024 19:05:15 +0000 (11:05 -0800)]
dpll: fix build failure due to rcu_dereference_check() on unknown type

Tasmiya reports that their compiler complains that we deref
a pointer to unknown type with rcu_dereference_rtnl():

include/linux/rcupdate.h:439:9: error: dereferencing pointer to incomplete type ‘struct dpll_pin’

Unclear what compiler it is, at the moment, and we can't report
but since DPLL can't be a module - move the code from the header
into the source file.

Fixes: 0d60d8df6f49 ("dpll: rely on rcu for netdev_dpll_pin()")
Reported-by: Tasmiya Nalatwad <tasmiya@linux.vnet.ibm.com>
Link: https://lore.kernel.org/all/3fcf3a2c-1c1b-42c1-bacb-78fdcd700389@linux.vnet.ibm.com/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240229190515.2740221-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
19 months agokunit: Fix again checksum tests on big endian CPUs
Christophe Leroy [Fri, 23 Feb 2024 10:41:52 +0000 (11:41 +0100)]
kunit: Fix again checksum tests on big endian CPUs

Commit b38460bc463c ("kunit: Fix checksum tests on big endian CPUs")
fixed endianness issues with kunit checksum tests, but then
commit 6f4c45cbcb00 ("kunit: Add tests for csum_ipv6_magic and
ip_fast_csum") introduced new issues on big endian CPUs. Those issues
are once again reflected by the warnings reported by sparse.

So, fix them with the same approach, perform proper conversion in
order to support both little and big endian CPUs. Once the conversions
are properly done and the right types used, the sparse warnings are
cleared as well.

Reported-by: Erhard Furtner <erhard_f@mailbox.org>
Fixes: 6f4c45cbcb00 ("kunit: Add tests for csum_ipv6_magic and ip_fast_csum")
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Tested-by: Charlie Jenkins <charlie@rivosinc.com>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Palmer Dabbelt <palmer@rivosinc.com>
Link: https://lore.kernel.org/r/73df3a9e95c2179119398ad1b4c84cdacbd8dfb6.1708684443.git.christophe.leroy@csgroup.eu
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
19 months agoMerge tag 'for-net-2024-02-28' of git://git.kernel.org/pub/scm/linux/kernel/git/bluet...
Jakub Kicinski [Thu, 29 Feb 2024 17:10:24 +0000 (09:10 -0800)]
Merge tag 'for-net-2024-02-28' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth

Luiz Augusto von Dentz says:

====================
bluetooth pull request for net:

 - mgmt: Fix limited discoverable off timeout
 - hci_qca: Set BDA quirk bit if fwnode exists in DT
 - hci_bcm4377: do not mark valid bd_addr as invalid
 - hci_sync: Check the correct flag before starting a scan
 - Enforce validation on max value of connection interval
 - hci_sync: Fix accept_list when attempting to suspend
 - hci_event: Fix handling of HCI_EV_IO_CAPA_REQUEST
 - Avoid potential use-after-free in hci_error_reset
 - rfcomm: Fix null-ptr-deref in rfcomm_check_security
 - hci_event: Fix wrongly recorded wakeup BD_ADDR
 - qca: Fix wrong event type for patch config command
 - qca: Fix triggering coredump implementation

* tag 'for-net-2024-02-28' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
  Bluetooth: qca: Fix triggering coredump implementation
  Bluetooth: hci_qca: Set BDA quirk bit if fwnode exists in DT
  Bluetooth: qca: Fix wrong event type for patch config command
  Bluetooth: Enforce validation on max value of connection interval
  Bluetooth: hci_event: Fix handling of HCI_EV_IO_CAPA_REQUEST
  Bluetooth: mgmt: Fix limited discoverable off timeout
  Bluetooth: hci_event: Fix wrongly recorded wakeup BD_ADDR
  Bluetooth: rfcomm: Fix null-ptr-deref in rfcomm_check_security
  Bluetooth: hci_sync: Fix accept_list when attempting to suspend
  Bluetooth: Avoid potential use-after-free in hci_error_reset
  Bluetooth: hci_sync: Check the correct flag before starting a scan
  Bluetooth: hci_bcm4377: do not mark valid bd_addr as invalid
====================

Link: https://lore.kernel.org/r/20240228145644.2269088-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
19 months agoMerge branch 'tls-a-few-more-fixes-for-async-decrypt'
Jakub Kicinski [Thu, 29 Feb 2024 17:07:18 +0000 (09:07 -0800)]
Merge branch 'tls-a-few-more-fixes-for-async-decrypt'

Sabrina Dubroca says:

====================
tls: a few more fixes for async decrypt

The previous patchset [1] took care of "full async". This adds a few
fixes for cases where only part of the crypto operations go the async
route, found by extending my previous debug patch [2] to do N
synchronous operations followed by M asynchronous ops (with N and M
configurable).

[1] https://patchwork.kernel.org/project/netdevbpf/list/?series=823784&state=*
[2] https://lore.kernel.org/all/9d664093b1bf7f47497b2c40b3a085b45f3274a2.1694021240.git.sd@queasysnail.net/
====================

Link: https://lore.kernel.org/r/cover.1709132643.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
19 months agotls: fix use-after-free on failed backlog decryption
Sabrina Dubroca [Wed, 28 Feb 2024 22:44:00 +0000 (23:44 +0100)]
tls: fix use-after-free on failed backlog decryption

When the decrypt request goes to the backlog and crypto_aead_decrypt
returns -EBUSY, tls_do_decryption will wait until all async
decryptions have completed. If one of them fails, tls_do_decryption
will return -EBADMSG and tls_decrypt_sg jumps to the error path,
releasing all the pages. But the pages have been passed to the async
callback, and have already been released by tls_decrypt_done.

The only true async case is when crypto_aead_decrypt returns
 -EINPROGRESS. With -EBUSY, we already waited so we can tell
tls_sw_recvmsg that the data is available for immediate copy, but we
need to notify tls_decrypt_sg (via the new ->async_done flag) that the
memory has already been released.

Fixes: 859054147318 ("net: tls: handle backlogging of crypto requests")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://lore.kernel.org/r/4755dd8d9bebdefaa19ce1439b833d6199d4364c.1709132643.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
19 months agotls: separate no-async decryption request handling from async
Sabrina Dubroca [Wed, 28 Feb 2024 22:43:59 +0000 (23:43 +0100)]
tls: separate no-async decryption request handling from async

If we're not doing async, the handling is much simpler. There's no
reference counting, we just need to wait for the completion to wake us
up and return its result.

We should preferably also use a separate crypto_wait. I'm not seeing a
UAF as I did in the past, I think aec7961916f3 ("tls: fix race between
async notify and socket close") took care of it.

This will make the next fix easier.

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://lore.kernel.org/r/47bde5f649707610eaef9f0d679519966fc31061.1709132643.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
19 months agotls: fix peeking with sync+async decryption
Sabrina Dubroca [Wed, 28 Feb 2024 22:43:58 +0000 (23:43 +0100)]
tls: fix peeking with sync+async decryption

If we peek from 2 records with a currently empty rx_list, and the
first record is decrypted synchronously but the second record is
decrypted async, the following happens:
  1. decrypt record 1 (sync)
  2. copy from record 1 to the userspace's msg
  3. queue the decrypted record to rx_list for future read(!PEEK)
  4. decrypt record 2 (async)
  5. queue record 2 to rx_list
  6. call process_rx_list to copy data from the 2nd record

We currently pass copied=0 as skip offset to process_rx_list, so we
end up copying once again from the first record. We should skip over
the data we've already copied.

Seen with selftest tls.12_aes_gcm.recv_peek_large_buf_mult_recs

Fixes: 692d7b5d1f91 ("tls: Fix recvmsg() to be able to peek across multiple records")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://lore.kernel.org/r/1b132d2b2b99296bfde54e8a67672d90d6d16e71.1709132643.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
19 months agotls: decrement decrypt_pending if no async completion will be called
Sabrina Dubroca [Wed, 28 Feb 2024 22:43:57 +0000 (23:43 +0100)]
tls: decrement decrypt_pending if no async completion will be called

With mixed sync/async decryption, or failures of crypto_aead_decrypt,
we increment decrypt_pending but we never do the corresponding
decrement since tls_decrypt_done will not be called. In this case, we
should decrement decrypt_pending immediately to avoid getting stuck.

For example, the prequeue prequeue test gets stuck with mixed
modes (one async decrypt + one sync decrypt).

Fixes: 94524d8fc965 ("net/tls: Add support for async decryption of tls records")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://lore.kernel.org/r/c56d5fc35543891d5319f834f25622360e1bfbec.1709132643.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
19 months agogtp: fix use-after-free and null-ptr-deref in gtp_newlink()
Alexander Ofitserov [Wed, 28 Feb 2024 11:47:03 +0000 (14:47 +0300)]
gtp: fix use-after-free and null-ptr-deref in gtp_newlink()

The gtp_link_ops operations structure for the subsystem must be
registered after registering the gtp_net_ops pernet operations structure.

Syzkaller hit 'general protection fault in gtp_genl_dump_pdp' bug:

[ 1010.702740] gtp: GTP module unloaded
[ 1010.715877] general protection fault, probably for non-canonical address 0xdffffc0000000001: 0000 [#1] SMP KASAN NOPTI
[ 1010.715888] KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
[ 1010.715895] CPU: 1 PID: 128616 Comm: a.out Not tainted 6.8.0-rc6-std-def-alt1 #1
[ 1010.715899] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-alt1 04/01/2014
[ 1010.715908] RIP: 0010:gtp_newlink+0x4d7/0x9c0 [gtp]
[ 1010.715915] Code: 80 3c 02 00 0f 85 41 04 00 00 48 8b bb d8 05 00 00 e8 ed f6 ff ff 48 89 c2 48 89 c5 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 <80> 3c 02 00 0f 85 4f 04 00 00 4c 89 e2 4c 8b 6d 00 48 b8 00 00 00
[ 1010.715920] RSP: 0018:ffff888020fbf180 EFLAGS: 00010203
[ 1010.715929] RAX: dffffc0000000000 RBX: ffff88800399c000 RCX: 0000000000000000
[ 1010.715933] RDX: 0000000000000001 RSI: ffffffff84805280 RDI: 0000000000000282
[ 1010.715938] RBP: 000000000000000d R08: 0000000000000001 R09: 0000000000000000
[ 1010.715942] R10: 0000000000000001 R11: 0000000000000001 R12: ffff88800399cc80
[ 1010.715947] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000400
[ 1010.715953] FS:  00007fd1509ab5c0(0000) GS:ffff88805b300000(0000) knlGS:0000000000000000
[ 1010.715958] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1010.715962] CR2: 0000000000000000 CR3: 000000001c07a000 CR4: 0000000000750ee0
[ 1010.715968] PKRU: 55555554
[ 1010.715972] Call Trace:
[ 1010.715985]  ? __die_body.cold+0x1a/0x1f
[ 1010.715995]  ? die_addr+0x43/0x70
[ 1010.716002]  ? exc_general_protection+0x199/0x2f0
[ 1010.716016]  ? asm_exc_general_protection+0x1e/0x30
[ 1010.716026]  ? gtp_newlink+0x4d7/0x9c0 [gtp]
[ 1010.716034]  ? gtp_net_exit+0x150/0x150 [gtp]
[ 1010.716042]  __rtnl_newlink+0x1063/0x1700
[ 1010.716051]  ? rtnl_setlink+0x3c0/0x3c0
[ 1010.716063]  ? is_bpf_text_address+0xc0/0x1f0
[ 1010.716070]  ? kernel_text_address.part.0+0xbb/0xd0
[ 1010.716076]  ? __kernel_text_address+0x56/0xa0
[ 1010.716084]  ? unwind_get_return_address+0x5a/0xa0
[ 1010.716091]  ? create_prof_cpu_mask+0x30/0x30
[ 1010.716098]  ? arch_stack_walk+0x9e/0xf0
[ 1010.716106]  ? stack_trace_save+0x91/0xd0
[ 1010.716113]  ? stack_trace_consume_entry+0x170/0x170
[ 1010.716121]  ? __lock_acquire+0x15c5/0x5380
[ 1010.716139]  ? mark_held_locks+0x9e/0xe0
[ 1010.716148]  ? kmem_cache_alloc_trace+0x35f/0x3c0
[ 1010.716155]  ? __rtnl_newlink+0x1700/0x1700
[ 1010.716160]  rtnl_newlink+0x69/0xa0
[ 1010.716166]  rtnetlink_rcv_msg+0x43b/0xc50
[ 1010.716172]  ? rtnl_fdb_dump+0x9f0/0x9f0
[ 1010.716179]  ? lock_acquire+0x1fe/0x560
[ 1010.716188]  ? netlink_deliver_tap+0x12f/0xd50
[ 1010.716196]  netlink_rcv_skb+0x14d/0x440
[ 1010.716202]  ? rtnl_fdb_dump+0x9f0/0x9f0
[ 1010.716208]  ? netlink_ack+0xab0/0xab0
[ 1010.716213]  ? netlink_deliver_tap+0x202/0xd50
[ 1010.716220]  ? netlink_deliver_tap+0x218/0xd50
[ 1010.716226]  ? __virt_addr_valid+0x30b/0x590
[ 1010.716233]  netlink_unicast+0x54b/0x800
[ 1010.716240]  ? netlink_attachskb+0x870/0x870
[ 1010.716248]  ? __check_object_size+0x2de/0x3b0
[ 1010.716254]  netlink_sendmsg+0x938/0xe40
[ 1010.716261]  ? netlink_unicast+0x800/0x800
[ 1010.716269]  ? __import_iovec+0x292/0x510
[ 1010.716276]  ? netlink_unicast+0x800/0x800
[ 1010.716284]  __sock_sendmsg+0x159/0x190
[ 1010.716290]  ____sys_sendmsg+0x712/0x880
[ 1010.716297]  ? sock_write_iter+0x3d0/0x3d0
[ 1010.716304]  ? __ia32_sys_recvmmsg+0x270/0x270
[ 1010.716309]  ? lock_acquire+0x1fe/0x560
[ 1010.716315]  ? drain_array_locked+0x90/0x90
[ 1010.716324]  ___sys_sendmsg+0xf8/0x170
[ 1010.716331]  ? sendmsg_copy_msghdr+0x170/0x170
[ 1010.716337]  ? lockdep_init_map_type+0x2c7/0x860
[ 1010.716343]  ? lockdep_hardirqs_on_prepare+0x430/0x430
[ 1010.716350]  ? debug_mutex_init+0x33/0x70
[ 1010.716360]  ? percpu_counter_add_batch+0x8b/0x140
[ 1010.716367]  ? lock_acquire+0x1fe/0x560
[ 1010.716373]  ? find_held_lock+0x2c/0x110
[ 1010.716384]  ? __fd_install+0x1b6/0x6f0
[ 1010.716389]  ? lock_downgrade+0x810/0x810
[ 1010.716396]  ? __fget_light+0x222/0x290
[ 1010.716403]  __sys_sendmsg+0xea/0x1b0
[ 1010.716409]  ? __sys_sendmsg_sock+0x40/0x40
[ 1010.716419]  ? lockdep_hardirqs_on_prepare+0x2b3/0x430
[ 1010.716425]  ? syscall_enter_from_user_mode+0x1d/0x60
[ 1010.716432]  do_syscall_64+0x30/0x40
[ 1010.716438]  entry_SYSCALL_64_after_hwframe+0x62/0xc7
[ 1010.716444] RIP: 0033:0x7fd1508cbd49
[ 1010.716452] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ef 70 0d 00 f7 d8 64 89 01 48
[ 1010.716456] RSP: 002b:00007fff18872348 EFLAGS: 00000202 ORIG_RAX: 000000000000002e
[ 1010.716463] RAX: ffffffffffffffda RBX: 000055f72bf0eac0 RCX: 00007fd1508cbd49
[ 1010.716468] RDX: 0000000000000000 RSI: 0000000020000280 RDI: 0000000000000006
[ 1010.716473] RBP: 00007fff18872360 R08: 00007fff18872360 R09: 00007fff18872360
[ 1010.716478] R10: 00007fff18872360 R11: 0000000000000202 R12: 000055f72bf0e1b0
[ 1010.716482] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 1010.716491] Modules linked in: gtp(+) udp_tunnel ib_core uinput af_packet rfkill qrtr joydev hid_generic usbhid hid kvm_intel iTCO_wdt intel_pmc_bxt iTCO_vendor_support kvm snd_hda_codec_generic ledtrig_audio irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel snd_hda_intel nls_utf8 snd_intel_dspcfg nls_cp866 psmouse aesni_intel vfat crypto_simd fat cryptd glue_helper snd_hda_codec pcspkr snd_hda_core i2c_i801 snd_hwdep i2c_smbus xhci_pci snd_pcm lpc_ich xhci_pci_renesas xhci_hcd qemu_fw_cfg tiny_power_button button sch_fq_codel vboxvideo drm_vram_helper drm_ttm_helper ttm vboxsf vboxguest snd_seq_midi snd_seq_midi_event snd_seq snd_rawmidi snd_seq_device snd_timer snd soundcore msr fuse efi_pstore dm_mod ip_tables x_tables autofs4 virtio_gpu virtio_dma_buf drm_kms_helper cec rc_core drm virtio_rng virtio_scsi rng_core virtio_balloon virtio_blk virtio_net virtio_console net_failover failover ahci libahci libata evdev scsi_mod input_leds serio_raw virtio_pci intel_agp
[ 1010.716674]  virtio_ring intel_gtt virtio [last unloaded: gtp]
[ 1010.716693] ---[ end trace 04990a4ce61e174b ]---

Cc: stable@vger.kernel.org
Signed-off-by: Alexander Ofitserov <oficerovas@altlinux.org>
Fixes: 459aa660eb1d ("gtp: add initial driver for datapath of GPRS Tunneling Protocol (GTP-U)")
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20240228114703.465107-1-oficerovas@altlinux.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
19 months agoMerge branch 'net-collect-tstats-automatically'
Paolo Abeni [Thu, 29 Feb 2024 11:50:15 +0000 (12:50 +0100)]
Merge branch 'net-collect-tstats-automatically'

Breno Leitao says:

====================
net: collect tstats automatically

The commit 34d21de99cea9 ("net: Move {l,t,d}stats allocation to core and
convert veth & vrf") added a field in struct_netdevice, which tells what
type of statistics the driver supports.

That field is used primarily to allocate stats structures automatically,
but, it also could leveraged to simplify the drivers even further, such
as, if the driver relies in the default stats collection, then it
doesn't need to assign to .ndo_get_stats64. That means that drivers only
assign functions to .ndo_get_stats64 if they are using something
special.

I started to move some of these drivers[1][2][3] to use the core
allocation, and with this change in, I just need to touch the driver
once, and be able to simplify the whole stats allocation and collection
for generic case.

There are 44 devices today that could benefit from this simplification.

# grep -r .ndo_get_stats64 | grep dev_get_tstats64 | wc -l
44

As of today, netnext only has the `sit` driver fully ported to core
stats allocation, hence the second patch.

Links:
[1] https://lore.kernel.org/all/20240227182338.2739884-1-leitao@debian.org/
[2] https://lore.kernel.org/all/20240222144117.1370101-1-leitao@debian.org/
[3] https://lore.kernel.org/all/20240223115839.3572852-1-leitao@debian.org/
====================

Link: https://lore.kernel.org/r/20240228113125.3473685-1-leitao@debian.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
19 months agonet: sit: Do not set .ndo_get_stats64
Breno Leitao [Wed, 28 Feb 2024 11:31:22 +0000 (03:31 -0800)]
net: sit: Do not set .ndo_get_stats64

If the driver is using the network core allocation mechanism, by setting
NETDEV_PCPU_STAT_TSTATS, as this driver is, then, it doesn't need to set
the dev_get_tstats64() generic .ndo_get_stats64 function pointer. Since
the network core calls it automatically, and .ndo_get_stats64 should
only be set if the driver needs special treatment.

This simplifies the driver, since all the generic statistics is now
handled by core.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
19 months agonet: get stats64 if device if driver is configured
Breno Leitao [Wed, 28 Feb 2024 11:31:21 +0000 (03:31 -0800)]
net: get stats64 if device if driver is configured

If the network driver is relying in the net core to do stats allocation,
then we want to dev_get_tstats64() instead of netdev_stats_to_stats64(),
since there are per-cpu stats that needs to be taken in consideration.

This will also simplify the drivers in regard to statistics. Once the
driver sets NETDEV_PCPU_STAT_TSTATS, it doesn't not need to allocate the
stacks, neither it needs to set `.ndo_get_stats64 = dev_get_tstats64`
for the generic stats collection function anymore.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
19 months agonet: stmmac: fix typo in comment
Yanteng Si [Wed, 28 Feb 2024 11:24:47 +0000 (19:24 +0800)]
net: stmmac: fix typo in comment

This is just a trivial fix for a typo in a comment, no functional
changes.

Reviewed-by: Serge Semin <fancer.lancer@gmail.com>
Signed-off-by: Yanteng Si <siyanteng@loongson.cn>
Link: https://lore.kernel.org/r/20240228112447.1490926-1-siyanteng@loongson.cn
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
19 months agoMerge tag 'nf-24-02-29' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Paolo Abeni [Thu, 29 Feb 2024 11:16:07 +0000 (12:16 +0100)]
Merge tag 'nf-24-02-29' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

Patch #1 restores NFPROTO_INET with nft_compat, from Ignat Korchagin.

Patch #2 fixes an issue with bridge netfilter and broadcast/multicast
packets.

There is a day 0 bug in br_netfilter when used with connection tracking.

Conntrack assumes that an nf_conn structure that is not yet added to
hash table ("unconfirmed"), is only visible by the current cpu that is
processing the sk_buff.

For bridge this isn't true, sk_buff can get cloned in between, and
clones can be processed in parallel on different cpu.

This patch disables NAT and conntrack helpers for multicast packets.

Patch #3 adds a selftest to cover for the br_netfilter bug.

netfilter pull request 24-02-29

* tag 'nf-24-02-29' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  selftests: netfilter: add bridge conntrack + multicast test case
  netfilter: bridge: confirm multicast packets before passing them up the stack
  netfilter: nf_tables: allow NFPROTO_INET in nft_(match/target)_validate()
====================

Link: https://lore.kernel.org/r/20240229000135.8780-1-pablo@netfilter.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
19 months agonet: hsr: Use correct offset for HSR TLV values in supervisory HSR frames
Lukasz Majewski [Wed, 28 Feb 2024 08:56:44 +0000 (09:56 +0100)]
net: hsr: Use correct offset for HSR TLV values in supervisory HSR frames

Current HSR implementation uses following supervisory frame (even for
HSRv1 the HSR tag is not is not present):

00000000: 01 15 4e 00 01 2d XX YY ZZ 94 77 10 88 fb 00 01
00000010: 7e 1c 17 06 XX YY ZZ 94 77 10 1e 06 XX YY ZZ 94
00000020: 77 10 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000030: 00 00 00 00 00 00 00 00 00 00 00 00

The current code adds extra two bytes (i.e. sizeof(struct hsr_sup_tlv))
when offset for skb_pull() is calculated.
This is wrong, as both 'struct hsrv1_ethhdr_sp' and 'hsrv0_ethhdr_sp'
already have 'struct hsr_sup_tag' defined in them, so there is no need
for adding extra two bytes.

This code was working correctly as with no RedBox support, the check for
HSR_TLV_EOT (0x00) was off by two bytes, which were corresponding to
zeroed padded bytes for minimal packet size.

Fixes: eafaa88b3eb7 ("net: hsr: Add support for redbox supervision frames")
Signed-off-by: Lukasz Majewski <lukma@denx.de>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20240228085644.3618044-1-lukma@denx.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>