====================
This patch series continues the work to migrate the script tests into
prog_tests.
test_lwt_seg6local.sh tests some bpf_lwt_* helpers. It contains only one
test that uses a network topology quite different than the ones that
can be found in others prog_tests/lwt_*.c files so I add a new
prog_tests/lwt_seg6local.c file.
While working on the migration I noticed that some routes present in the
script weren't needed so PATCH 1 deletes them and then PATCH 2 migrates
the test into the test_progs framework.
====================
selftests/bpf: lwt_seg6local: Move test to test_progs
test_lwt_seg6local.sh isn't used by the BPF CI.
Add a new file in the test_progs framework to migrate the tests done by
test_lwt_seg6local.sh. It uses the same network topology and the same BPF
programs located in progs/test_lwt_seg6local.c.
Use the network helpers instead of `nc` to exchange the final packet.
Remove test_lwt_seg6local.sh and its Makefile entry.
Amery Hung [Wed, 5 Mar 2025 18:20:57 +0000 (10:20 -0800)]
selftests/bpf: Fix dangling stdout seen by traffic monitor thread
Traffic monitor thread may see dangling stdout as the main thread closes
and reassigns stdout without protection. This happens when the main thread
finishes one subtest and moves to another one in the same netns_new()
scope.
The issue can be reproduced by running test_progs repeatedly with traffic
monitor enabled:
for ((i=1;i<=100;i++)); do
./test_progs -a flow_dissector_skb* -m '*'
done
For restoring stdout in crash_handler(), since it does not really care
about closing stdout, simlpy flush stdout and restore it to the original
one.
Then, Fix the issue by consolidating stdio_restore_cleanup() and
stdio_restore(), and protecting the use/close/assignment of stdout with
a lock. The locking in the main thread is always performed regradless of
whether traffic monitor is running or not for simplicity. It won't have
any side-effect.
Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://patch.msgid.link/20250305182057.2802606-3-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Some routes in fb00:: are initialized during setup, even though they
aren't needed by the test as the UDP packets will travel through the
lightweight tunnels.
Amery Hung [Wed, 5 Mar 2025 18:20:55 +0000 (10:20 -0800)]
selftests/bpf: Clean up call sites of stdio_restore()
reset_affinity() and save_ns() are only called in run_one_test(). There is
no need to call stdio_restore() in reset_affinity() and save_ns() if
stdio_restore() is moved right after a test finishes in run_one_test().
Also remove an unnecessary check of env.stdout_saved in crash_handler()
by moving env.stdout_saved assignment to the beginning of main().
Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://patch.msgid.link/20250305182057.2802606-1-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
selftests/bpf: Move test_lwt_ip_encap to test_progs
test_lwt_ip_encap.sh isn't used by the BPF CI.
Add a new file in the test_progs framework to migrate the tests done by
test_lwt_ip_encap.sh. It uses the same network topology and the same BPF
programs located in progs/test_lwt_ip_encap.c.
Rework the GSO part to avoid using nc and dd.
Remove test_lwt_ip_encap.sh and its Makefile entry.
Alexei Starovoitov [Thu, 6 Mar 2025 06:04:27 +0000 (22:04 -0800)]
Merge branch 'arena-spin-lock'
Kumar Kartikeya Dwivedi says:
====================
Arena Spin Lock
This set provides an implementation of queued spin lock for arena.
There is no support for resiliency and recovering from deadlocks yet.
We will wait for the rqspinlock patch set to land before incorporating
support.
One minor change compared to the qspinlock algorithm in the kernel is
that we don't have the trylock fallback when nesting count exceeds 4.
The maximum number of supported CPUs is 1024, but this can be increased
in the future if necessary.
The API supports returning an error, so resiliency support can be added
in the future. Callers are still expected to check for and handle any
potential errors.
Errors are returned when the spin loops time out, when the number of
CPUs is greater than 1024, or when the extreme edge case of NMI
interrupting NMI interrupting HardIRQ interrupting SoftIRQ interrupting
task, all of them simultaneously in slow path, occurs, which is
unsupported.
* Add better comment and document LLVM bug for __unqual_typeof.
* Switch to precise counting in the selftest and simplify test.
* Add comment about return value handling.
* Reduce size for 100k to 50k to cap test runtime.
* Drop extra corruption handling case in decode_tail.
* Stick to 1, 1k, 100k critical section sizes.
* Fix unqual_typeof to not cast away arena tag for pointers.
* Remove hack to skip first qnode.
* Choose 100 as repeat count, 1000 is too much for 100k size.
* Use pthread_barrier in test.
* Rename to arena_spin_lock
* Introduce cond_break_label macro to jump to label from cond_break.
* Drop trylock fallback when nesting count exceeds 4.
* Fix bug in try_cmpxchg implementation.
* Add tests with critical sections of varying lengths.
* Add comments for _Generic trick to drop __arena tag.
* Fix bug due to qnodes being placed on first page, leading to CPU 0's
node being indistinguishable from NULL.
Implement queued spin lock algorithm as BPF program for lock words
living in BPF arena.
The algorithm is copied from kernel/locking/qspinlock.c and adapted for
BPF use.
We first implement abstract helpers for portable atomics and
acquire/release load instructions, by relying on X86_64 presence to
elide expensive barriers and rely on implementation details of the JIT,
and fall back to slow but correct implementations elsewhere. When
support for acquire/release load/stores lands, we can improve this
state.
Then, the qspinlock algorithm is adapted to remove dependence on
multi-word atomics due to lack of support in BPF ISA. For instance,
xchg_tail cannot use 16-bit xchg, and needs to be a implemented as a
32-bit try_cmpxchg loop.
Loops which are seemingly infinite from verifier PoV are annotated with
cond_break_label macro to return an error. Only 1024 NR_CPUs are
supported.
Note that the slow path is a global function, hence the verifier doesn't
know the return value's precision. The recommended way of usage is to
always test against zero for success, and not ret < 0 for error, as the
verifier would assume ret > 0 has not been accounted for. Add comments
in the function documentation about this quirk.
Add a new cond_break_label macro that jumps to the specified label when
the cond_break termination check fires, and allows us to better handle
the uncontrolled termination of the loop.
Eduard Zingerman [Wed, 5 Mar 2025 08:54:36 +0000 (00:54 -0800)]
bpf: correct use/def for may_goto instruction
may_goto instruction does not use any registers,
but in compute_insn_live_regs() it was treated as a regular
conditional jump of kind BPF_K with r0 as source register.
Thus unnecessarily marking r0 as used.
====================
bpf: simple DFA-based live registers analysis
This patch-set introduces a simple live registers DFA analysis.
Analysis is done as a separate step before main verification pass.
Results are stored in the env->insn_aux_data for each instruction.
The change helps with iterator/callback based loops handling,
as regular register liveness marks are not finalized while
loops are processed. See veristat results in patch #2.
Note: for regular subprogram calls analysis conservatively assumes
that r1-r5 are used, and r0 is used at each 'exit' instruction.
Experiments show that adding logic handling these cases precisely has
no impact on verification performance.
The patch set was tested by disabling the current register parentage
chain liveness computation, using DFA-based liveness for registers
while assuming all stack slots as live. See discussion in [1].
Changes v2 -> v3:
- added support for BPF_LOAD_ACQ, BPF_STORE_REL atomics (Alexei);
- correct use marks for r0 for BPF_CMPXCHG.
Changes v1 -> v2:
- added a refactoring commit extracting utility functions:
jmp_offset(), verbose_insn() (Alexei);
- added a refactoring commit extracting utility function
get_call_summary() in order to share helper/kfunc related code with
mark_fastcall_pattern_for_call() (Alexei);
- comment in the compute_insn_live_regs() extended (Alexei).
Changes RFC -> v1:
- parameter count for helpers and kfuncs is taken into account;
- copy_verifier_state() bugfix had been merged as a separate
patch-set and is no longer a part of this patch set.
====================
Introduce load-acquire and store-release BPF instructions
This patchset adds kernel support for BPF load-acquire and store-release
instructions (for background, please see [1]), including core/verifier
and arm64/x86-64 JIT compiler changes, as well as selftests. riscv64 is
also planned to be supported. The corresponding LLVM changes can be
found at:
https://github.com/llvm/llvm-project/pull/108636
The first 3 patches from v4 have already been applied:
- [bpf-next,v4,01/10] bpf/verifier: Factor out atomic_ptr_type_ok()
https://git.kernel.org/bpf/bpf-next/c/b2d9ef71d4c9
- [bpf-next,v4,02/10] bpf/verifier: Factor out check_atomic_rmw()
https://git.kernel.org/bpf/bpf-next/c/d430c46c7580
- [bpf-next,v4,03/10] bpf/verifier: Factor out check_load_mem() and check_store_reg()
https://git.kernel.org/bpf/bpf-next/c/d38ad248fb7a
Please refer to the LLVM PR and individual kernel patches for details.
Thanks!
o (kernel test robot) for 32-bit arches: make the verifier reject
64-bit load-acquires/store-releases, and fix
build error in interpreter changes
* tested ARCH=arc build following instructions from kernel test
robot
o (Alexei) drop Documentation/ patch (v4 10/10) for now
o (Alexei) change encoding to BPF_LOAD_ACQ=0x100, BPF_STORE_REL=0x110
o add Acked-by: tags from Ilya and Eduard
o make new selftests depend on:
* __clang_major__ >= 18, and
* ENABLE_ATOMICS_TESTS is defined (currently this means -mcpu=v3 or
v4), and
* JIT supports load_acq/store_rel (currenty only arm64)
o work around llvm-17 CI job failure by conditionally define
__arena_global variables as 64-bit if __clang_major__ < 18, to make
sure .addr_space.1 has no holes
o add Google copyright notice in new files
o (Eduard) for x86 and s390, make
bpf_jit_supports_insn(..., /*in_arena=*/true) return false
for load_acq/store_rel
o add Eduard's Acked-by: tag
o (Eduard) extract LDX and non-ATOMIC STX handling into helpers, see
PATCH v2 3/9
o allow unpriv programs to store-release pointers to stack
o (Alexei) make it clearer in the interpreter code (PATCH v2 4/9) that
only W and DW are supported for atomic RMW
o test misaligned load_acq/store_rel
o (Eduard) other selftests/ changes:
* test load_acq/store_rel with !atomic_ptr_type_ok() pointers:
- PTR_TO_CTX, for is_ctx_reg()
- PTR_TO_PACKET, for is_pkt_reg()
- PTR_TO_FLOW_KEYS, for is_flow_key_reg()
- PTR_TO_SOCKET, for is_sk_reg()
* drop atomics/ tests
* delete unnecessary 'pid' checks from arena_atomics/ tests
* avoid depending on __BPF_FEATURE_LOAD_ACQ_STORE_REL, use
__imm_insn() and inline asm macros instead
o 1-2/8: minor verifier.c refactoring patches
o 3/8: core/verifier changes
* (Eduard) handle load-acquire properly in backtrack_insn()
* (Eduard) avoid skipping checks (e.g.,
bpf_jit_supports_insn()) for load-acquires
* track the value stored by store-releases, just like how
non-atomic STX instructions are handled
* (Eduard) add missing link in commit message
* (Eduard) always print 'r' for disasm.c changes
o 4/8: arm64/insn: avoid treating load_acq/store_rel as
load_ex/store_ex
o 5/8: arm64/insn: add load_acq/store_rel
* (Xu) include Should-Be-One (SBO) bits in "mask" and "value",
to avoid setting fixed bits during runtime (JIT-compile
time)
o 6/8: arm64 JIT compiler changes
* (Xu) use emit_a64_add_i() for "pointer + offset" to optimize
code emission
o 7/8: selftests
* (Eduard) avoid adding new tests to the 'test_verifier' runner
* add more tests, e.g., checking mark_precise logic
o 8/8: instruction-set.rst changes
Eduard Zingerman [Tue, 4 Mar 2025 19:50:23 +0000 (11:50 -0800)]
bpf: use register liveness information for func_states_equal
Liveness analysis DFA computes a set of registers live before each
instruction. Leverage this information to skip comparison of dead
registers in func_states_equal(). This helps with convergance of
iterator processing loops, as bpf_reg_state->live marks can't be used
when loops are processed.
This has certain performance impact for selftests, here is a veristat
listing using `-f "insns_pct>5" -f "!insns<200"`
The last two tests are added to check if backtrack_insn() handles the
new instructions correctly.
Additionally, the last test also makes sure that the verifier
"remembers" the value (in src_reg) we store-release into e.g. a stack
slot. For example, if we take a look at the test program:
At #1, if the verifier doesn't remember that we wrote 8 to the stack,
then later at #4 we would be adding an unbounded scalar value to the
stack pointer, which would cause the program to be rejected:
VERIFIER LOG:
=============
...
math between fp pointer and register with unbounded min value is not allowed
For easier CI integration, instead of using built-ins like
__atomic_{load,store}_n() which depend on the new
__BPF_FEATURE_LOAD_ACQ_STORE_REL pre-defined macro, manually craft
load-acquire/store-release instructions using __imm_insn(), as suggested
by Eduard.
All new tests depend on:
(1) Clang major version >= 18, and
(2) ENABLE_ATOMICS_TESTS is defined (currently implies -mcpu=v3 or
v4), and
(3) JIT supports load-acquire/store-release (currently arm64 and
x86-64)
That 1-byte hole in the .addr_space.1 ELF section caused clang-17 to
crash:
fatal error: error in backend: unable to write nop sequence of 1 bytes
To work around such llvm-17 CI job failures, conditionally define
__arena_global variables as 64-bit if __clang_major__ < 18, to make sure
.addr_space.1 has no holes. Ideally we should avoid compiling this file
using clang-17 at all (arena tests depend on
__BPF_FEATURE_ADDR_SPACE_CAST, and are skipped for llvm-17 anyway), but
that is a separate topic.
Eduard Zingerman [Tue, 4 Mar 2025 19:50:22 +0000 (11:50 -0800)]
bpf: simple DFA-based live registers analysis
Compute may-live registers before each instruction in the program.
The register is live before the instruction I if it is read by I or
some instruction S following I during program execution and is not
overwritten between I and S.
This information would be used in the next patch as a hint in
func_states_equal().
Use a simple algorithm described in [1] to compute this information:
- define the following:
- I.use : a set of all registers read by instruction I;
- I.def : a set of all registers written by instruction I;
- I.in : a set of all registers that may be alive before I execution;
- I.out : a set of all registers that may be alive after I execution;
- I.successors : a set of instructions S that might immediately
follow I for some program execution;
- associate separate empty sets 'I.in' and 'I.out' with each instruction;
- visit each instruction in a postorder and update corresponding
'I.in' and 'I.out' sets as follows:
I.out = U [S.in for S in I.successors]
I.in = (I.out / I.def) U I.use
(where U stands for set union, / stands for set difference)
- repeat the computation while I.{in,out} changes for any instruction.
On implementation side keep things as simple, as possible:
- check_cfg() already marks instructions EXPLORED in post-order,
modify it to save the index of each EXPLORED instruction in a vector;
- represent I.{in,out,use,def} as bitmasks;
- don't split the program into basic blocks and don't maintain the
work queue, instead:
- do fixed-point computation by visiting each instruction;
- maintain a simple 'changed' flag if I.{in,out} for any instruction
change;
Measurements show that even such simplistic implementation does not
add measurable verification time overhead (for selftests, at-least).
Note on check_cfg() ex_insn_beg/ex_done change:
To avoid out of bounds access to env->cfg.insn_postorder array,
it should be guaranteed that instruction transitions to EXPLORED state
only once. Previously this was not the fact for incorrect programs
with direct calls to exception callbacks.
The 'align' selftest needs adjustment to skip computed insn/live
registers printout. Otherwise it matches lines from the live registers
printout.
Peilin Ye [Tue, 4 Mar 2025 01:06:40 +0000 (01:06 +0000)]
bpf, x86: Support load-acquire and store-release instructions
Recently we introduced BPF load-acquire (BPF_LOAD_ACQ) and store-release
(BPF_STORE_REL) instructions. For x86-64, simply implement them as
regular BPF_LDX/BPF_STX loads and stores. The verifier always rejects
misaligned load-acquires/store-releases (even if BPF_F_ANY_ALIGNMENT is
set), so emitted MOV* instructions are guaranteed to be atomic.
Arena accesses are supported. 8- and 16-bit load-acquires are
zero-extending (i.e., MOVZBQ, MOVZWQ).
Rename emit_atomic{,_index}() to emit_atomic_rmw{,_index}() to make it
clear that they only handle read-modify-write atomics, and extend their
@atomic_op parameter from u8 to u32, since we are starting to use more
than the lowest 8 bits of the 'imm' field.
Eduard Zingerman [Tue, 4 Mar 2025 19:50:21 +0000 (11:50 -0800)]
bpf: get_call_summary() utility function
Refactor mark_fastcall_pattern_for_call() to extract a utility
function get_call_summary(). For a helper or kfunc call this function
fills the following information: {num_params, is_void, fastcall}.
This function would be used in the next patch in order to get number
of parameters of a helper or kfunc call.
Peilin Ye [Tue, 4 Mar 2025 01:06:33 +0000 (01:06 +0000)]
bpf, arm64: Support load-acquire and store-release instructions
Support BPF load-acquire (BPF_LOAD_ACQ) and store-release
(BPF_STORE_REL) instructions in the arm64 JIT compiler. For example
(assuming little-endian):
Arena accesses are supported.
bpf_jit_supports_insn(..., /*in_arena=*/true) always returns true for
BPF_LOAD_ACQ and BPF_STORE_REL instructions, as they don't depend on
ARM64_HAS_LSE_ATOMICS.
Eduard Zingerman [Tue, 4 Mar 2025 19:50:20 +0000 (11:50 -0800)]
bpf: jmp_offset() and verbose_insn() utility functions
Extract two utility functions:
- One BPF jump instruction uses .imm field to encode jump offset,
while the rest use .off. Encapsulate this detail as jmp_offset()
function.
- Avoid duplicating instruction printing callback definitions by
defining a verbose_insn() function, which disassembles an
instruction into the verifier log while hiding this detail.
Peilin Ye [Tue, 4 Mar 2025 01:06:19 +0000 (01:06 +0000)]
arm64: insn: Add BIT(23) to {load,store}_ex's mask
We are planning to add load-acquire (LDAR{,B,H}) and store-release
(STLR{,B,H}) instructions to insn.{c,h}; add BIT(23) to mask of load_ex
and store_ex to prevent aarch64_insn_is_{load,store}_ex() from returning
false-positives for load-acquire and store-release instructions.
Reference: Arm Architecture Reference Manual (ARM DDI 0487K.a,
ID032224),
Alexei Starovoitov [Tue, 4 Mar 2025 01:40:14 +0000 (17:40 -0800)]
Merge branch 'timed-may_goto'
Kumar Kartikeya Dwivedi says:
====================
Timed may_goto
This series replaces the current implementation of cond_break, which
uses the may_goto instruction, and counts 8 million iterations per stack
frame, with an implementation based on sampling time locally on the CPU.
This is done to permit a longer time for a given loop per-program
invocation. The accounting is still done per-stack frame, but the count
is used to instead amortize the cost of the logic to sample and check
the time spent since the start.
This is needed for expressing more complicated algorithms (spin locks,
waiting loops, etc.) in BPF programs without false positive expiration
of the loop. For instance, the plan is to make use of this for
implementing spin locks for BPF arena [0].
For the loop as follows:
for (int i = 0;; i++) {}
Testing on a bare-metal Sapphire Rapids Intel server yields the following
table (taking an average of 25 runs).
Here, count is used to amortize the time sampling and checking logic.
Obviously, this is the limit of an empty loop. Given the complexity of
the loop body, the time spent in the loop can be longer. Cancellations
will address the task of imposing an upper bound on program runtime.
* Address comments from Alexei
* Use kernel comment style for new code.
* Remove p->count == 0 check in bpf_check_timed_may_goto.
* Add comments on AX as argument/retval calling convention.
* Add comments describing how the counting logic works.
* Use BPF_EMIT_CALL instead of open-coding instruction encoding.
* Change if ax != 1 goto pc+X condition to if ax != 0 goto pc+X.
====================
A "load-acquire" is a BPF_STX | BPF_ATOMIC instruction with the 'imm'
field set to BPF_LOAD_ACQ (0x100).
Similarly, a "store-release" is a BPF_STX | BPF_ATOMIC instruction with
the 'imm' field set to BPF_STORE_REL (0x110).
Unlike existing atomic read-modify-write operations that only support
BPF_W (32-bit) and BPF_DW (64-bit) size modifiers, load-acquires and
store-releases also support BPF_B (8-bit) and BPF_H (16-bit). As an
exception, however, 64-bit load-acquires/store-releases are not
supported on 32-bit architectures (to fix a build error reported by the
kernel test robot).
An 8- or 16-bit load-acquire zero-extends the value before writing it to
a 32-bit register, just like ARM64 instruction LDARH and friends.
Similar to existing atomic read-modify-write operations, misaligned
load-acquires/store-releases are not allowed (even if
BPF_F_ANY_ALIGNMENT is set).
As an example, consider the following 64-bit load-acquire BPF
instruction (assuming little-endian):
In arch/{arm64,s390,x86}/net/bpf_jit_comp.c, have
bpf_jit_supports_insn(..., /*in_arena=*/true) return false for the new
instructions, until the corresponding JIT compiler supports them in
arena.
We are introducing a new function in the libbpf API, bpf_object__prepare,
which provides more granular control over the process of loading a
bpf_object. bpf_object__prepare performs ELF processing, relocations,
prepares final state of BPF program instructions (accessible with
bpf_program__insns()), creates and potentially pins maps, and stops short
of loading BPF programs.
There are couple of anticipated usecases for this API:
* Use BPF token for freplace programs that might need to lookup BTF of
other programs (BPF token creation can't be moved to open step, as open
step is "no privilege assumption" step so that tools like bpftool can
generate skeleton, discover the structure of BPF object, etc).
* Stopping at prepare gives users finalized BPF program
instructions (with subprogs appended, everything relocated and
finalized, etc). And that property can be taken advantage of by
veristat (and similar tools) that might want to process one program at
a time, but would like to avoid relatively slow ELF parsing and
processing; and even BPF selftests itself (RUN_TESTS part of it at
least) would benefit from this by eliminating waste of re-processing
ELF many times.
====================
Implement the arch_bpf_timed_may_goto function using inline assembly to
have control over which registers are spilled, and use our special
protocol of using BPF_REG_AX as an argument into the function, and as
the return value when going back.
Emit call depth accounting for the call made from this stub, and ensure
we don't have naked returns (when rethunk mitigations are enabled) by
falling back to the RET macro (instead of retq). After popping all saved
registers, the return address into the BPF program should be on top of
the stack.
Since the JIT support is now enabled, ensure selftests which are
checking the produced may_goto sequences do not break by adjusting them.
Make sure we still test the old may_goto sequence on other
architectures, while testing the new sequence on x86_64.
Implement support in the verifier for replacing may_goto implementation
from a counter-based approach to one which samples time on the local CPU
to have a bigger loop bound.
We implement it by maintaining 16-bytes per-stack frame, and using 8
bytes for maintaining the count for amortizing time sampling, and 8
bytes for the starting timestamp. To minimize overhead, we need to avoid
spilling and filling of registers around this sequence, so we push this
cost into the time sampling function 'arch_bpf_timed_may_goto'. This is
a JIT-specific wrapper around bpf_check_timed_may_goto which returns us
the count to store into the stack through BPF_REG_AX. All caller-saved
registers (r0-r5) are guaranteed to remain untouched.
The loop can be broken by returning count as 0, otherwise we dispatch
into the function when the count drops to 0, and the runtime chooses to
refresh it (by returning count as BPF_MAX_TIMED_LOOPS) or returning 0
and aborting the loop on next iteration.
Since the check for 0 is done right after loading the count from the
stack, all subsequent cond_break sequences should immediately break as
well, of the same loop or subsequent loops in the program.
We pass in the stack_depth of the count (and thus the timestamp, by
adding 8 to it) to the arch_bpf_timed_may_goto call so that it can be
passed in to bpf_check_timed_may_goto as an argument after r1 is saved,
by adding the offset to r10/fp. This adjustment will be arch specific,
and the next patch will introduce support for x86.
Note that depending on loop complexity, time spent in the loop can be
more than the current limit (250 ms), but imposing an upper bound on
program runtime is an orthogonal problem which will be addressed when
program cancellations are supported.
The current time afforded by cond_break may not be enough for cases
where BPF programs want to implement locking algorithms inline, and use
cond_break as a promise to the verifier that they will eventually
terminate.
Below are some benchmarking numbers on the time taken per-iteration for
an empty loop that counts the number of iterations until cond_break
fires. For comparison, we compare it against bpf_for/bpf_repeat which is
another way to achieve the same number of spins (BPF_MAX_LOOPS). The
hardware used for benchmarking was a Sapphire Rapids Intel server with
performance governor enabled, mitigations were enabled.
Mykyta Yatsenko [Mon, 3 Mar 2025 13:57:51 +0000 (13:57 +0000)]
libbpf: Split bpf object load into prepare/load
Introduce bpf_object__prepare API: additional intermediate preparation
step that performs ELF processing, relocations, prepares final state of
BPF program instructions (accessible with bpf_program__insns()), creates
and (potentially) pins maps, and stops short of loading BPF programs.
We anticipate few use cases for this API, such as:
* Use prepare to initialize bpf_token, without loading freplace
programs, unlocking possibility to lookup BTF of other programs.
* Execute prepare to obtain finalized BPF program instructions without
loading programs, enabling tools like veristat to process one program at
a time, without incurring cost of ELF parsing and processing.
Mykyta Yatsenko [Mon, 3 Mar 2025 13:57:50 +0000 (13:57 +0000)]
libbpf: Introduce more granular state for bpf_object
We are going to split bpf_object loading into 2 stages: preparation and
loading. This will increase flexibility when working with bpf_object
and unlock some optimizations and use cases.
This patch substitutes a boolean flag (loaded) by more finely-grained
state for bpf_object.
Breno Leitao [Fri, 28 Feb 2025 18:43:34 +0000 (10:43 -0800)]
net: filter: Avoid shadowing variable in bpf_convert_ctx_access()
Rename the local variable 'off' to 'offset' to avoid shadowing the existing
'off' variable that is declared as an `int` in the outer scope of
bpf_convert_ctx_access().
This fixes a compiler warning:
net/core/filter.c:9679:8: warning: declaration shadows a local variable [-Wshadow]
Mykyta Yatsenko [Mon, 3 Mar 2025 13:57:49 +0000 (13:57 +0000)]
libbpf: Use map_is_created helper in map setters
Refactoring: use map_is_created helper in map setters that need to check
the state of the map. This helps to reduce the number of the places that
depend explicitly on the loaded flag, simplifying refactoring in the
next patch of this set.
====================
selftests/bpf: Migrate test_tunnel.sh to test_progs
Hi all,
This patch series continues the work to migrate the *.sh tests into
prog_tests framework.
The test_tunnel.sh script has already been partly migrated to
test_progs in prog_tests/test_tunnel.c so I add my work to it.
PATCH 1 & 2 create some helpers to avoid code duplication and ease the
migration in the following patches.
PATCH 3 to 9 migrate the tests of gre, ip6gre, erspan, ip6erspan,
geneve, ip6geneve and ip6tnl tunnels.
PATCH 10 removes test_tunnel.sh
====================
All tests from test_tunnel.sh have been migrated into test test_progs.
The last test remaining in the script is the test_ipip() that is already
covered in the test_prog framework by the NONE case of test_ipip_tunnel().
Remove the test_tunnel.sh script and its Makefile entry
Signed-off-by: Bastien Curutchet (eBPF Foundation) <bastien.curutchet@bootlin.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250303-tunnels-v2-10-8329f38f0678@bootlin.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
selftests/bpf: test_tunnel: Move ip6tnl tunnel tests to test_progs
ip6tnl tunnels are tested in the test_tunnel.sh but not in the test_progs
framework.
Add a new test in test_progs to test ip6tnl tunnels. It uses the same
network topology and the same BPF programs than the script.
Remove test_ipip6() and test_ip6ip6() from the script.
Signed-off-by: Bastien Curutchet (eBPF Foundation) <bastien.curutchet@bootlin.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250303-tunnels-v2-9-8329f38f0678@bootlin.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
selftests/bpf: test_tunnel: Move ip6geneve tunnel test to test_progs
ip6geneve tunnels are tested in the test_tunnel.sh but not in the
test_progs framework.
Add a new test in test_progs to test ip6geneve tunnels. It uses the same
network topology and the same BPF programs than the script.
Remove test_ip6geneve() from the script.
Signed-off-by: Bastien Curutchet (eBPF Foundation) <bastien.curutchet@bootlin.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250303-tunnels-v2-8-8329f38f0678@bootlin.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
selftests/bpf: test_tunnel: Move geneve tunnel test to test_progs
geneve tunnels are tested in the test_tunnel.sh but not in the test_progs
framework.
Add a new test in test_progs to test geneve tunnels. It uses the same
network topology and the same BPF programs than the script.
Remove test_geneve() from the script.
Signed-off-by: Bastien Curutchet (eBPF Foundation) <bastien.curutchet@bootlin.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250303-tunnels-v2-7-8329f38f0678@bootlin.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
selftests/bpf: test_tunnel: Move ip6erspan tunnel test to test_progs
ip6erspan tunnels are tested in the test_tunnel.sh but not in the
test_progs framework.
Add a new test in test_progs to test ip6erspan tunnels. It uses the same
network topology and the same BPF programs than the script.
Remove test_ip6erspan() from the script.
Signed-off-by: Bastien Curutchet (eBPF Foundation) <bastien.curutchet@bootlin.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250303-tunnels-v2-6-8329f38f0678@bootlin.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
selftests/bpf: test_tunnel: Move erspan tunnel tests to test_progs
erspan tunnels are tested in the test_tunnel.sh but not in the test_progs
framework.
Add a new test in test_progs to test erspan tunnels. It uses the same
network topology and the same BPF programs than the script.
Remove test_erspan() from the script.
Signed-off-by: Bastien Curutchet (eBPF Foundation) <bastien.curutchet@bootlin.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250303-tunnels-v2-5-8329f38f0678@bootlin.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
selftests/bpf: test_tunnel: Move ip6gre tunnel test to test_progs
ip6gre tunnels are tested in the test_tunnel.sh but not in the test_progs
framework.
Add a new test in test_progs to test ip6gre tunnels. It uses the same
network topology and the same BPF programs than the script. Disable the
IPv6 DAD feature because it can take lot of time and cause some tests to
fail depending on the environment they're run on.
Remove test_ip6gre() and test_ip6gretap() from the script.
Signed-off-by: Bastien Curutchet (eBPF Foundation) <bastien.curutchet@bootlin.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250303-tunnels-v2-4-8329f38f0678@bootlin.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
selftests/bpf: test_tunnel: Move gre tunnel test to test_progs
gre tunnels are tested in the test_tunnel.sh but not in the test_progs
framework.
Add a new test in test_progs to test gre tunnels. It uses the same
network topology and the same BPF programs than the script.
Remove test_gre() and test_gre_no_tunnel_key() from the script.
Signed-off-by: Bastien Curutchet (eBPF Foundation) <bastien.curutchet@bootlin.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250303-tunnels-v2-3-8329f38f0678@bootlin.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
====================
veristat: @files-list.txt notation for object files list
A few small veristat improvements:
- It is possible to hit command line parameters number limit,
e.g. when running veristat for all object files generated for
test_progs. This patch-set adds an option to read objects files list
from a file.
- Correct usage of strerror() function.
- Avoid printing log lines to CSV output.
Changelog:
- v1 -> v2:
- replace strerror(errno) with strerror(-err) in patch #2 (Andrii)
All tests use more or less the same ping commands as final validation.
Also test_ping()'s return value is checked with ASSERT_OK() while this
check is already done by the SYS() macro inside test_ping().
Create helpers around test_ping() and use them in the tests to avoid code
duplication.
Remove the unnecessary ASSERT_OK() from the tests.
Signed-off-by: Bastien Curutchet (eBPF Foundation) <bastien.curutchet@bootlin.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250303-tunnels-v2-2-8329f38f0678@bootlin.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Peilin Ye [Mon, 3 Mar 2025 05:37:25 +0000 (05:37 +0000)]
bpf: Factor out check_load_mem() and check_store_reg()
Extract BPF_LDX and most non-ATOMIC BPF_STX instruction handling logic
in do_check() into helper functions to be used later. While we are
here, make that comment about "reserved fields" more specific.
Eduard Zingerman [Sat, 1 Mar 2025 00:01:47 +0000 (16:01 -0800)]
veristat: Report program type guess results to sdterr
In order not to pollute CSV output, e.g.:
$ ./veristat -o csv exceptions_ext.bpf.o > test.csv
Using guessed program type 'sched_cls' for exceptions_ext.bpf.o/extension...
Using guessed program type 'sched_cls' for exceptions_ext.bpf.o/throwing_extension...
A fair amount of code duplication is present among tests to attach BPF
programs.
Create generic_attach* helpers that attach BPF programs to a given
interface.
Use ASSERT_OK_FD() instead of ASSERT_GE() to check fd's validity.
Use these helpers in all the available tests.
Signed-off-by: Bastien Curutchet (eBPF Foundation) <bastien.curutchet@bootlin.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250303-tunnels-v2-1-8329f38f0678@bootlin.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Peilin Ye [Mon, 3 Mar 2025 05:37:19 +0000 (05:37 +0000)]
bpf: Factor out check_atomic_rmw()
Currently, check_atomic() only handles atomic read-modify-write (RMW)
instructions. Since we are planning to introduce other types of atomic
instructions (i.e., atomic load/store), extract the existing RMW
handling logic into its own function named check_atomic_rmw().
Remove the @insn_idx parameter as it is not really necessary. Use
'env->insn_idx' instead, as in other places in verifier.c.
Eric Dumazet [Sat, 1 Mar 2025 19:13:15 +0000 (19:13 +0000)]
bpf: no longer acquire map_idr_lock in bpf_map_inc_not_zero()
bpf_sk_storage_clone() is the only caller of bpf_map_inc_not_zero()
and is holding rcu_read_lock().
map_idr_lock does not add any protection, just remove the cost
for passive TCP flows.
Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Kui-Feng Lee <kuifeng@meta.com> Cc: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://lore.kernel.org/r/20250301191315.1532629-1-edumazet@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
====================
Global subprogs in RCU/{preempt,irq}-disabled sections
Small change to allow non-sleepable global subprogs in
RCU, preempt-disabled, and irq-disabled sections. For
now, we don't lift the limitation for locks as it requires
more analysis, and will do this one resilient spin locks
land.
This surfaced a bug where sleepable global subprogs were
allowed in RCU read sections, that has been fixed. Tests
have been added to cover various cases.
* Rename subprog_info[i].sleepable to might_sleep, which more
accurately reflects the nature of the bit. 'sleepable' means whether
a given context is allowed to, while might_sleep captures if it
does.
* Disallow extensions that might sleep to attach to targets that don't
sleep, since they'd be permitted to be called in atomic contexts. (Eduard)
* Add tests for mixing non-sleepable and sleepable global function
calls, and extensions attaching to non-sleepable global functions. (Eduard)
* Rename changes_pkt_data -> summarization
====================
selftests/bpf: Add tests for extending sleepable global subprogs
Add tests for freplace behavior with the combination of sleepable
and non-sleepable global subprogs. The changes_pkt_data selftest
did all the hardwork, so simply rename it and include new support
for more summarization tests for might_sleep bit.
selftests/bpf: Test sleepable global subprogs in atomic contexts
Add tests for rejecting sleepable and accepting non-sleepable global
function calls in atomic contexts. For spin locks, we still reject
all global function calls. Once resilient spin locks land, we will
carefully lift in cases where we deem it safe.
Yonghong Song [Mon, 24 Feb 2025 23:01:16 +0000 (15:01 -0800)]
bpf: Allow pre-ordering for bpf cgroup progs
Currently for bpf progs in a cgroup hierarchy, the effective prog array
is computed from bottom cgroup to upper cgroups (post-ordering). For
example, the following cgroup hierarchy
root cgroup: p1, p2
subcgroup: p3, p4
have BPF_F_ALLOW_MULTI for both cgroup levels.
The effective cgroup array ordering looks like
p3 p4 p1 p2
and at run time, progs will execute based on that order.
But in some cases, it is desirable to have root prog executes earlier than
children progs (pre-ordering). For example,
- prog p1 intends to collect original pkt dest addresses.
- prog p3 will modify original pkt dest addresses to a proxy address for
security reason.
The end result is that prog p1 gets proxy address which is not what it
wants. Putting p1 to every child cgroup is not desirable either as it
will duplicate itself in many child cgroups. And this is exactly a use case
we are encountering in Meta.
To fix this issue, let us introduce a flag BPF_F_PREORDER. If the flag
is specified at attachment time, the prog has higher priority and the
ordering with that flag will be from top to bottom (pre-ordering).
For example, in the above example,
root cgroup: p1, p2
subcgroup: p3, p4
Let us say p2 and p4 are marked with BPF_F_PREORDER. The final
effective array ordering will be
p2 p4 p3 p1
The verifier currently does not permit global subprog calls when a lock
is held, preemption is disabled, or when IRQs are disabled. This is
because we don't know whether the global subprog calls sleepable
functions or not.
In case of locks, there's an additional reason: functions called by the
global subprog may hold additional locks etc. The verifier won't know
while verifying the global subprog whether it was called in context
where a spin lock is already held by the program.
Perform summarization of the sleepable nature of a global subprog just
like changes_pkt_data and then allow calls to global subprogs for
non-sleepable ones from atomic context.
While making this change, I noticed that RCU read sections had no
protection against sleepable global subprog calls, include it in the
checks and fix this while we're at it.
Care needs to be taken to not allow global subprog calls when regular
bpf_spin_lock is held. When resilient spin locks is held, we want to
potentially have this check relaxed, but not for now.
Also make sure extensions freplacing global functions cannot do so
in case the target is non-sleepable, but the extension is. The other
combination is ok.
Tests are included in the next patch to handle all special conditions.
====================
Optimize bpf selftest to increase CI success rate
1. Optimized some static bound port selftests to avoid port occupation
when running test_progs -j.
2. Optimized the retry logic for test_maps.
Some Failed CI:
https://github.com/kernel-patches/bpf/actions/runs/13275542359/job/37064974076
https://github.com/kernel-patches/bpf/actions/runs/13549227497/job/37868926343
https://github.com/kernel-patches/bpf/actions/runs/13548089029/job/37865812030
https://github.com/kernel-patches/bpf/actions/runs/13553536268/job/37883329296
(Perhaps it's due to the large number of pull requests requiring CI runs?)
====================
Jiayuan Chen [Thu, 27 Feb 2025 14:26:46 +0000 (22:26 +0800)]
selftests/bpf: Fixes for test_maps test
BPF CI has failed 3 times in the last 24 hours. Add retry for ENOMEM.
It's similar to the optimization plan:
commit 2f553b032cad ("selftsets/bpf: Retry map update for non-preallocated per-cpu map")
Introduce a new kfunc, bpf_dynptr_copy, which enables copying of
data from one dynptr to another. This functionality may be useful in
scenarios such as capturing XDP data to a ring buffer.
The patch set is split into 3 patches:
1. Refactor bpf_dynptr_read and bpf_dynptr_write by extracting code into
static functions, that allows calling them with no compiler warnings
2. Introduce bpf_dynptr_copy
3. Add tests for bpf_dynptr_copy
v2->v3:
* Implemented bpf_memcmp in dynptr_success.c test, as __builtin_memcmp
was not inlined on GCC-BPF.
====================
Mykyta Yatsenko [Wed, 26 Feb 2025 18:32:00 +0000 (18:32 +0000)]
bpf/helpers: Introduce bpf_dynptr_copy kfunc
Introducing bpf_dynptr_copy kfunc allowing copying data from one dynptr to
another. This functionality is useful in scenarios such as capturing XDP
data to a ring buffer.
The implementation consists of 4 branches:
* A fast branch for contiguous buffer capacity in both source and
destination dynptrs
* 3 branches utilizing __bpf_dynptr_read and __bpf_dynptr_write to copy
data to/from non-contiguous buffer
Mykyta Yatsenko [Wed, 26 Feb 2025 18:31:59 +0000 (18:31 +0000)]
bpf/helpers: Refactor bpf_dynptr_read and bpf_dynptr_write
Refactor bpf_dynptr_read and bpf_dynptr_write helpers: extract code
into the static functions namely __bpf_dynptr_read and
__bpf_dynptr_write, this allows calling these without compiler warnings.
====================
selftests/bpf: implement setting global variables in veristat
From: Mykyta Yatsenko <yatsenko@meta.com>
To better verify some complex BPF programs by veristat, it would be useful
to preset global variables. This patch set implements this functionality
and introduces tests for veristat.
v4->v5
* Rework parsing to use sscanf for integers
* Addressing nits
v3->v4:
* Fixing bug in set_global_var introduced by refactoring in previous patch set
* Addressed nits from Eduard
v2->v3:
* Reworked parsing of the presets, using sscanf to split into variable and
value, but still use strtoll/strtoull to support range checks when parsing
integers
* Fix test failures for no_alu32 & cpuv4 by checking if veristat binary is in
parent folder
* Introduce __CHECK_STR macro for simplifying checks in test
* Modify tests into sub-tests
====================
Mykyta Yatsenko [Tue, 25 Feb 2025 16:31:00 +0000 (16:31 +0000)]
selftests/bpf: Implement setting global variables in veristat
To better verify some complex BPF programs we'd like to preset global
variables.
This patch introduces CLI argument `--set-global-vars` or `-G` to
veristat, that allows presetting values to global variables defined
in BPF program. For example:
prog.c:
```
enum Enum { ELEMENT1 = 0, ELEMENT2 = 5 };
const volatile __s64 a = 5;
const volatile __u8 b = 5;
const volatile enum Enum c = ELEMENT2;
const volatile bool d = false;
char arr[4] = {0};
SEC("tp_btf/sched_switch")
int BPF_PROG(...)
{
bpf_printk("%c\n", arr[a]);
bpf_printk("%c\n", arr[b]);
bpf_printk("%c\n", arr[c]);
bpf_printk("%c\n", arr[d]);
return 0;
}
```
By default verification of the program fails:
```
./veristat prog.bpf.o
```
By presetting global variables, we can make verification pass:
```
./veristat wq.bpf.o -G "a = 0" -G "b = 1" -G "c = 2" -G "d = 3"
```
Alexei Starovoitov [Mon, 24 Feb 2025 22:16:37 +0000 (14:16 -0800)]
bpf: Fix deadlock between rcu_tasks_trace and event_mutex.
Fix the following deadlock:
CPU A
_free_event()
perf_kprobe_destroy()
mutex_lock(&event_mutex)
perf_trace_event_unreg()
synchronize_rcu_tasks_trace()
There are several paths where _free_event() grabs event_mutex
and calls sync_rcu_tasks_trace. Above is one such case.
CPU B
bpf_prog_test_run_syscall()
rcu_read_lock_trace()
bpf_prog_run_pin_on_cpu()
bpf_prog_load()
bpf_tracing_func_proto()
trace_set_clr_event()
mutex_lock(&event_mutex)
Delegate trace_set_clr_event() to workqueue to avoid
such lock dependency.
Yonghong Song [Thu, 7 Nov 2024 17:09:24 +0000 (09:09 -0800)]
docs/bpf: Document some special sdiv/smod operations
Patch [1] fixed possible kernel crash due to specific sdiv/smod operations
in bpf program. The following are related operations and the expected results
of those operations:
- LLONG_MIN/-1 = LLONG_MIN
- INT_MIN/-1 = INT_MIN
- LLONG_MIN%-1 = 0
- INT_MIN%-1 = 0
Those operations are replaced with codes which won't cause
kernel crash. This patch documents what operations may cause exception and
what replacement operations are.
Mahe Tardy [Tue, 25 Feb 2025 12:50:30 +0000 (12:50 +0000)]
bpf: add get_netns_cookie helper to cgroup_skb programs
This is needed in the context of Cilium and Tetragon to retrieve netns
cookie from hostns when traffic leaves Pod, so that we can correlate
skb->sk's netns cookie.
Amery Hung [Tue, 25 Feb 2025 23:35:45 +0000 (15:35 -0800)]
selftests/bpf: Test gen_pro/epilogue that generate kfuncs
Test gen_prologue and gen_epilogue that generate kfuncs that have not
been seen in the main program.
The main bpf program and return value checks are identical to
pro_epilogue.c introduced in commit 47e69431b57a ("selftests/bpf: Test
gen_prologue and gen_epilogue"). However, now when bpf_testmod_st_ops
detects a program name with prefix "test_kfunc_", it generates slightly
different prologue and epilogue: They still add 1000 to args->a in
prologue, add 10000 to args->a and set r0 to 2 * args->a in epilogue,
but involve kfuncs.
At high level, the alternative version of prologue and epilogue look
like this:
cgrp = bpf_cgroup_from_id(0);
if (cgrp)
bpf_cgroup_release(cgrp);
else
/* Perform what original bpf_testmod_st_ops prologue or
* epilogue does
*/
Since 0 is never a valid cgroup id, the original prologue or epilogue
logic will be performed. As a result, the __retval check should expect
the exact same return value.
Amery Hung [Tue, 25 Feb 2025 23:35:44 +0000 (15:35 -0800)]
bpf: Search and add kfuncs in struct_ops prologue and epilogue
Currently, add_kfunc_call() is only invoked once before the main
verification loop. Therefore, the verifier could not find the
bpf_kfunc_btf_tab of a new kfunc call which is not seen in user defined
struct_ops operators but introduced in gen_prologue or gen_epilogue
during do_misc_fixup(). Fix this by searching kfuncs in the patching
instruction buffer and add them to prog->aux->kfunc_tab.
Signed-off-by: Amery Hung <amery.hung@bytedance.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20250225233545.285481-1-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Nandakumar Edamana [Fri, 21 Feb 2025 21:01:11 +0000 (02:31 +0530)]
libbpf: Fix out-of-bound read
In `set_kcfg_value_str`, an untrusted string is accessed with the assumption
that it will be at least two characters long due to the presence of checks for
opening and closing quotes. But the check for the closing quote
(value[len - 1] != '"') misses the fact that it could be checking the opening
quote itself in case of an invalid input that consists of just the opening
quote.
This commit adds an explicit check to make sure the string is at least two
characters long.
Further investigation shows the reason is due to not 8-byte aligned
store of percpu pointer in htab_elem_set_ptr():
*(void __percpu **)(l->key + key_size) = pptr;
Note that the whole htab_elem alignment is 8 (for x86_64). If the key_size
is 4, that means pptr is stored in a location which is 4 byte aligned but
not 8 byte aligned. In mm/kmemleak.c, scan_block() scans the memory based
on 8 byte stride, so it won't detect above pptr, hence reporting the memory
leak.
So storing pptr with 8-byte alignment won't cause any problem and can fix
kmemleak too.
The issue can be reproduced with bpf selftest as well:
1. Enable CONFIG_DEBUG_KMEMLEAK config
2. Add a getchar() before skel destroy in test_hash_map() in prog_tests/for_each.c.
The purpose is to keep map available so kmemleak can be detected.
3. run './test_progs -t for_each/hash_map &' and a kmemleak should be reported.
Amery Hung [Fri, 21 Feb 2025 17:56:44 +0000 (09:56 -0800)]
bpf: Refactor check_ctx_access()
Reduce the variable passing madness surrounding check_ctx_access().
Currently, check_mem_access() passes many pointers to local variables to
check_ctx_access(). They are used to initialize "struct
bpf_insn_access_aux info" in check_ctx_access() and then passed to
is_valid_access(). Then, check_ctx_access() takes the data our from
info and write them back the pointers to pass them back. This can be
simpilified by moving info up to check_mem_access().
Amery Hung [Thu, 20 Feb 2025 22:15:31 +0000 (14:15 -0800)]
bpf: Do not allow tail call in strcut_ops program with __ref argument
Reject struct_ops programs with refcounted kptr arguments (arguments
tagged with __ref suffix) that tail call. Once a refcounted kptr is
passed to a struct_ops program from the kernel, it can be freed or
xchged into maps. As there is no guarantee a callee can get the same
valid refcounted kptr in the ctx, we cannot allow such usage.
Andrii Nakryiko [Thu, 20 Feb 2025 00:28:21 +0000 (16:28 -0800)]
libbpf: Fix hypothetical STT_SECTION extern NULL deref case
Fix theoretical NULL dereference in linker when resolving *extern*
STT_SECTION symbol against not-yet-existing ELF section. Not sure if
it's possible in practice for valid ELF object files (this would require
embedded assembly manipulations, at which point BTF will be missing),
but fix the s/dst_sym/dst_sec/ typo guarding this condition anyways.
Fixes: faf6ed321cf6 ("libbpf: Add BPF static linker APIs") Fixes: a46349227cd8 ("libbpf: Add linker extern resolution support for functions and global variables") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20250220002821.834400-1-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Hou Tao [Thu, 20 Feb 2025 04:22:59 +0000 (12:22 +0800)]
bpf: Use preempt_count() directly in bpf_send_signal_common()
bpf_send_signal_common() uses preemptible() to check whether or not the
current context is preemptible. If it is preemptible, it will use
irq_work to send the signal asynchronously instead of trying to hold a
spin-lock, because spin-lock is sleepable under PREEMPT_RT.
However, preemptible() depends on CONFIG_PREEMPT_COUNT. When
CONFIG_PREEMPT_COUNT is turned off (e.g., CONFIG_PREEMPT_VOLUNTARY=y),
!preemptible() will be evaluated as 1 and bpf_send_signal_common() will
use irq_work unconditionally.
Fix it by unfolding "!preemptible()" and using "preempt_count() != 0 ||
irqs_disabled()" instead.
Linus Torvalds [Thu, 20 Feb 2025 23:37:17 +0000 (15:37 -0800)]
Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Pull BPF fixes from Daniel Borkmann:
- Fix a soft-lockup in BPF arena_map_free on 64k page size kernels
(Alan Maguire)
- Fix a missing allocation failure check in BPF verifier's
acquire_lock_state (Kumar Kartikeya Dwivedi)
- Fix a NULL-pointer dereference in trace_kfree_skb by adding kfree_skb
to the raw_tp_null_args set (Kuniyuki Iwashima)
- Fix a deadlock when freeing BPF cgroup storage (Abel Wu)
- Fix a syzbot-reported deadlock when holding BPF map's freeze_mutex
(Andrii Nakryiko)
- Fix a use-after-free issue in bpf_test_init when eth_skb_pkt_type is
accessing skb data not containing an Ethernet header (Shigeru
Yoshida)
- Fix skipping non-existing keys in generic_map_lookup_batch (Yan Zhai)
- Several BPF sockmap fixes to address incorrect TCP copied_seq
calculations, which prevented correct data reads from recv(2) in user
space (Jiayuan Chen)
- Two fixes for BPF map lookup nullness elision (Daniel Xu)
- Fix a NULL-pointer dereference from vmlinux BTF lookup in
bpf_sk_storage_tracing_allowed (Jared Kangas)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
selftests: bpf: test batch lookup on array of maps with holes
bpf: skip non exist keys in generic_map_lookup_batch
bpf: Handle allocation failure in acquire_lock_state
bpf: verifier: Disambiguate get_constant_map_key() errors
bpf: selftests: Test constant key extraction on irrelevant maps
bpf: verifier: Do not extract constant map keys for irrelevant maps
bpf: Fix softlockup in arena_map_free on 64k page kernel
net: Add rx_skb of kfree_skb to raw_tp_null_args[].
bpf: Fix deadlock when freeing cgroup storage
selftests/bpf: Add strparser test for bpf
selftests/bpf: Fix invalid flag of recv()
bpf: Disable non stream socket for strparser
bpf: Fix wrong copied_seq calculation
strparser: Add read_sock callback
bpf: avoid holding freeze_mutex during mmap operation
bpf: unify VM_WRITE vs VM_MAYWRITE use in BPF map mmaping logic
selftests/bpf: Adjust data size to have ETH_HLEN
bpf, test_run: Fix use-after-free issue in eth_skb_pkt_type()
bpf: Remove unnecessary BTF lookups in bpf_sk_storage_tracing_allowed
- tcp:
- adjust rcvq_space after updating scaling ratio
- drop secpath at the same time as we currently drop dst
- eth: gtp: suppress list corruption splat in gtp_net_exit_batch_rtnl().
Previous releases - always broken:
- vsock:
- fix variables initialization during resuming
- for connectible sockets allow only connected
- eth:
- geneve: fix use-after-free in geneve_find_dev()
- ibmvnic: don't reference skb after sending to VIOS"
* tag 'net-6.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (34 commits)
Revert "net: skb: introduce and use a single page frag cache"
net: allow small head cache usage with large MAX_SKB_FRAGS values
nfp: bpf: Add check for nfp_app_ctrl_msg_alloc()
tcp: drop secpath at the same time as we currently drop dst
net: axienet: Set mac_managed_pm
arp: switch to dev_getbyhwaddr() in arp_req_set_public()
net: Add non-RCU dev_getbyhwaddr() helper
sctp: Fix undefined behavior in left shift operation
selftests/bpf: Add a specific dst port matching
flow_dissector: Fix port range key handling in BPF conversion
selftests/net/forwarding: Add a test case for tc-flower of mixed port and port-range
flow_dissector: Fix handling of mixed port and port-range keys
geneve: Suppress list corruption splat in geneve_destroy_tunnels().
gtp: Suppress list corruption splat in gtp_net_exit_batch_rtnl().
dev: Use rtnl_net_dev_lock() in unregister_netdev().
net: Fix dev_net(dev) race in unregister_netdevice_notifier_dev_net().
net: Add net_passive_inc() and net_passive_dec().
net: pse-pd: pd692x0: Fix power limit retrieval
MAINTAINERS: trim the GVE entry
gve: set xdp redirect target only when it is available
...
Linus Torvalds [Thu, 20 Feb 2025 16:59:00 +0000 (08:59 -0800)]
Merge tag 'v6.14-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fixes from Steve French:
- Fix for chmod regression
- Two reparse point related fixes
- One minor cleanup (for GCC 14 compiles)
- Fix for SMB3.1.1 POSIX Extensions reporting incorrect file type
* tag 'v6.14-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: Treat unhandled directory name surrogate reparse points as mount directory nodes
cifs: Throw -EOPNOTSUPP error on unsupported reparse point type from parse_reparse_point()
smb311: failure to open files of length 1040 when mounting with SMB3.1.1 POSIX extensions
smb: client, common: Avoid multiple -Wflex-array-member-not-at-end warnings
smb: client: fix chmod(2) regression with ATTR_READONLY
Linus Torvalds [Thu, 20 Feb 2025 16:51:57 +0000 (08:51 -0800)]
Merge tag 'bcachefs-2025-02-20' of git://evilpiepirate.org/bcachefs
Pull bcachefs fixes from Kent Overstreet:
"Small stuff:
- The fsck code for Hongbo's directory i_size patch was wrong, caught
by transaction restart injection: we now have the CI running
another test variant with restart injection enabled
- Another fixup for reflink pointers to missing indirect extents:
previous fix was for fsck code, this fixes the normal runtime paths
- Another small srcu lock hold time fix, reported by jpsollie"
* tag 'bcachefs-2025-02-20' of git://evilpiepirate.org/bcachefs:
bcachefs: Fix srcu lock warning in btree_update_nodes_written()
bcachefs: Fix bch2_indirect_extent_missing_error()
bcachefs: Fix fsck directory i_size checking
Linus Torvalds [Thu, 20 Feb 2025 16:48:55 +0000 (08:48 -0800)]
Merge tag 'xfs-fixes-6.14-rc4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Carlos Maiolino:
"Just a collection of bug fixes, nothing really stands out"
* tag 'xfs-fixes-6.14-rc4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: flush inodegc before swapon
xfs: rename xfs_iomap_swapfile_activate to xfs_vm_swap_activate
xfs: Do not allow norecovery mount with quotacheck
xfs: do not check NEEDSREPAIR if ro,norecovery mount.
xfs: fix data fork format filtering during inode repair
xfs: fix online repair probing when CONFIG_XFS_ONLINE_REPAIR=n