Tony Ambardar [Tue, 23 Jul 2024 05:54:30 +0000 (22:54 -0700)]
selftests/bpf: Fix error compiling bpf_iter_setsockopt.c with musl libc
Existing code calls getsockname() with a 'struct sockaddr_in6 *' argument
where a 'struct sockaddr *' argument is declared, yielding compile errors
when building for mips64el/musl-libc:
bpf_iter_setsockopt.c: In function 'get_local_port':
bpf_iter_setsockopt.c:98:30: error: passing argument 2 of 'getsockname' from incompatible pointer type [-Werror=incompatible-pointer-types]
98 | if (!getsockname(fd, &addr, &addrlen))
| ^~~~~
| |
| struct sockaddr_in6 *
In file included from .../netinet/in.h:10,
from .../arpa/inet.h:9,
from ./test_progs.h:17,
from bpf_iter_setsockopt.c:5:
.../sys/socket.h:391:23: note: expected 'struct sockaddr * restrict' but argument is of type 'struct sockaddr_in6 *'
391 | int getsockname (int, struct sockaddr *__restrict, socklen_t *__restrict);
| ^
cc1: all warnings being treated as errors
This compiled under glibc only because the argument is declared to be a
"funky" transparent union which includes both types above. Explicitly cast
the argument to allow compiling for both musl and glibc.
Tony Ambardar [Tue, 23 Jul 2024 05:54:29 +0000 (22:54 -0700)]
selftests/bpf: Fix compile error from rlim_t in sk_storage_map.c
Cast 'rlim_t' argument to match expected type of printf() format and avoid
compile errors seen building for mips64el/musl-libc:
In file included from map_tests/sk_storage_map.c:20:
map_tests/sk_storage_map.c: In function 'test_sk_storage_map_stress_free':
map_tests/sk_storage_map.c:414:56: error: format '%lu' expects argument of type 'long unsigned int', but argument 2 has type 'rlim_t' {aka 'long long unsigned int'} [-Werror=format=]
414 | CHECK(err, "setrlimit(RLIMIT_NOFILE)", "rlim_new:%lu errno:%d",
| ^~~~~~~~~~~~~~~~~~~~~~~
415 | rlim_new.rlim_cur, errno);
| ~~~~~~~~~~~~~~~~~
| |
| rlim_t {aka long long unsigned int}
./test_maps.h:12:24: note: in definition of macro 'CHECK'
12 | printf(format); \
| ^~~~~~
map_tests/sk_storage_map.c:414:68: note: format string is defined here
414 | CHECK(err, "setrlimit(RLIMIT_NOFILE)", "rlim_new:%lu errno:%d",
| ~~^
| |
| long unsigned int
| %llu
cc1: all warnings being treated as errors
Tony Ambardar [Tue, 23 Jul 2024 05:54:28 +0000 (22:54 -0700)]
selftests/bpf: Use pid_t consistently in test_progs.c
Use pid_t rather than __pid_t when allocating memory for 'worker_pids' in
'struct test_env', as this is its declared type and also avoids compile
errors seen building against musl libc on mipsel64:
test_progs.c:1738:49: error: '__pid_t' undeclared (first use in this function); did you mean 'pid_t'?
1738 | env.worker_pids = calloc(sizeof(__pid_t), env.workers);
| ^~~~~~~
| pid_t
test_progs.c:1738:49: note: each undeclared identifier is reported only once for each function it appears in
====================
no_caller_saved_registers attribute for helper calls
This patch-set seeks to allow using no_caller_saved_registers gcc/clang
attribute with some BPF helper functions (and kfuncs in the future).
As documented in [1], this attribute means that function scratches
only some of the caller saved registers defined by ABI.
For BPF the set of such registers could be defined as follows:
- R0 is scratched only if function is non-void;
- R1-R5 are scratched only if corresponding parameter type is defined
in the function prototype.
The goal of the patch-set is to implement no_caller_saved_registers
(nocsr for short) in a backwards compatible manner:
- for kernels that support the feature, gain some performance boost
from better register allocation;
- for kernels that don't support the feature, allow programs execution
with minor performance losses.
To achieve this, use a scheme suggested by Alexei Starovoitov:
- for nocsr calls clang allocates registers as-if relevant r0-r5
registers are not scratched by the call;
- as a post-processing step, clang visits each nocsr call and adds
spill/fill for every live r0-r5;
- stack offsets used for spills/fills are allocated as lowest
stack offsets in whole function and are not used for any other
purpose;
- when kernel loads a program, it looks for such patterns
(nocsr function surrounded by spills/fills) and checks if
spill/fill stack offsets are used exclusively in nocsr patterns;
- if so, and if current JIT inlines the call to the nocsr function
(e.g. a helper call), kernel removes unnecessary spill/fill pairs;
- when old kernel loads a program, presence of spill/fill pairs
keeps BPF program valid, albeit slightly less efficient.
Corresponding clang/llvm changes are available in [2].
The patch-set uses bpf_get_smp_processor_id() function as a canary,
making it the first helper with nocsr attribute.
Change list:
- v3 -> v4:
- When nocsr spills/fills are removed in the subprogram, allow these
spills/fills to reside in [-MAX_BPF_STACK-48..MAX_BPF_STACK) range
(suggested by Alexei);
- Dropped patches with special handling for bpf_probe_read_kernel()
(requested by Alexei);
- Reset aux .nocsr_pattern and .nocsr_spills_num fields in
check_nocsr_stack_contract() (requested by Andrii).
Andrii, I have not added an additional flag to
struct bpf_subprog_info, it currently does not have holes
and I really don't like adding a bool field there just as an
alternative indicator that nocsr is disabled.
Indicator at the moment:
- nocsr_stack_off >= S16_MIN means that nocsr rewrite is enabled;
- nocsr_stack_off == S16_MIN means that nocsr rewrite is disabled.
- v2 -> v3:
- As suggested by Andrii, 'nocsr_stack_off' is no longer checked at
rewrite time, instead mark_nocsr_patterns() now does two passes
over BPF program:
- on a first pass it computes the lowest stack spill offset for
the subprogram;
- on a second pass this offset is used to recognize nocsr pattern.
- As suggested by Alexei, a new mechanic is added to work around a
situation mentioned by Andrii, when more helper functions are
marked as nocsr at compile time than current kernel supports:
- all {spill*,helper call,fill*} patterns are now marked as
insn_aux_data[*].nocsr_pattern, thus relaxing failure condition
for check_nocsr_stack_contract();
- spill/fill pairs are not removed for patterns where helper can't
be inlined;
- see mark_nocsr_pattern_for_call() for details an example.
- As suggested by Alexei, subprogram stack depth is now adjusted
if all spill/fill pairs could be removed. This adjustment has
to take place before optimize_bpf_loop(), hence the rewrite
is moved from do_misc_fixups() to remove_nocsr_spills_fills()
(again).
- As suggested by Andrii, special measures are taken to work around
bpf_probe_read_kernel() access to BPF stack, see patches 11, 12.
Patch #11 is very simplistic, a more comprehensive solution would
be to change the type of the third parameter of the
bpf_probe_read_kernel() from ARG_ANYTHING to something else and
not only check nocsr contract, but also propagate stack slot
liveness information. However, such change would require update in
struct bpf_call_arg_meta processing, which currently implies that
every memory parameter is followed by a size parameter.
I can work on these changes, please comment.
- Stylistic changes suggested by Andrii.
- Added acks from Andrii.
- Dropped RFC tag.
- v1 -> v2:
- assume that functions inlined by either jit or verifier
conform to no_caller_saved_registers contract (Andrii, Puranjay);
- allow nocsr rewrite for bpf_get_smp_processor_id()
on arm64 and riscv64 architectures (Puranjay);
- __arch_{x86_64,arm64,riscv64} macro for test_loader;
- moved remove_nocsr_spills_fills() inside do_misc_fixups() (Andrii);
- moved nocsr pattern detection from check_cfg() to a separate pass
(Andrii);
- various stylistic/correctness changes according to Andrii's
comments.
Eduard Zingerman [Mon, 22 Jul 2024 23:38:44 +0000 (16:38 -0700)]
selftests/bpf: test no_caller_saved_registers spill/fill removal
Tests for no_caller_saved_registers processing logic
(see verifier.c:match_and_mark_nocsr_pattern()):
- a canary positive test case;
- a canary test case for arm64 and riscv64;
- various tests with broken patterns;
- tests with read/write fixed/varying stack access that violate nocsr
stack access contract;
- tests with multiple subprograms;
- tests using nocsr in combination with may_goto/bpf_loop,
as all of these features affect stack depth;
- tests for nocsr stack spills below max stack depth.
Eduard Zingerman [Mon, 22 Jul 2024 23:38:43 +0000 (16:38 -0700)]
selftests/bpf: __arch_* macro to limit test cases to specific archs
Add annotations __arch_x86_64, __arch_arm64, __arch_riscv64
to specify on which architecture the test case should be tested.
Several __arch_* annotations could be specified at once.
When test case is not run on current arch it is marked as skipped.
For example, the following would be tested only on arm64 and riscv64:
Eduard Zingerman [Mon, 22 Jul 2024 23:38:42 +0000 (16:38 -0700)]
selftests/bpf: allow checking xlated programs in verifier_* tests
Add a macro __xlated("...") for use with test_loader tests.
When such annotations are present for the test case:
- bpf_prog_get_info_by_fd() is used to get BPF program after all
rewrites are applied by verifier.
- the program is disassembled and patterns specified in __xlated are
searched for in the disassembly text.
__xlated matching follows the same mechanics as __msg:
each subsequent pattern is matched from the point where
previous pattern ended.
This allows to write tests like below, where the goal is to verify the
behavior of one of the of the transformations applied by verifier:
Eduard Zingerman [Mon, 22 Jul 2024 23:38:41 +0000 (16:38 -0700)]
selftests/bpf: extract test_loader->expect_msgs as a data structure
Non-functional change: use a separate data structure to represented
expected messages in test_loader.
This would allow to use the same functionality for expected set of
disassembled instructions in the follow-up commit.
Eduard Zingerman [Mon, 22 Jul 2024 23:38:40 +0000 (16:38 -0700)]
selftests/bpf: no need to track next_match_pos in struct test_loader
The call stack for validate_case() function looks as follows:
- test_loader__run_subtests()
- process_subtest()
- run_subtest()
- prepare_case(), which does 'tester->next_match_pos = 0';
- validate_case(), which increments tester->next_match_pos.
Hence, each subtest is run with next_match_pos freshly set to zero.
Meaning that there is no need to persist this variable in the
struct test_loader, use local variable instead.
Disassembles instruction 'insn' to a text buffer 'buf'.
Removes insn->code hex prefix added by kernel disassembly routine.
Returns a pointer to the next instruction
(increments insn by either 1 or 2).
The source code:
int iter_arr_with_actual_elem_count(const void *ctx)
{
int i, n = loop_data.n, sum = 0;
if (n > ARRAY_SIZE(loop_data.data))
return 0;
bpf_for(i, 0, n) {
/* no rechecking of i against ARRAY_SIZE(loop_data.n) */
sum += loop_data.data[i];
}
return sum;
}
The insn #14 is a sign-extenstion load which is related to 'int i'.
The insn #15 did a subreg comparision. Note that smin=0xffffffff80000000 and this caused later
insn #23 failed verification due to unbounded min value.
Actually insn #15 R1 smin range can be better. Before insn #15, we have
R1_w=scalar(smin=0xffffffff80000000,smax=0x7fffffff)
With the above range, we know for R1, upper 32bit can only be 0xffffffff or 0.
Otherwise, the value range for R1 could be beyond [smin=0xffffffff80000000,smax=0x7fffffff].
After insn #15, for the true patch, we know smin32=0 and smax32=32. With the upper 32bit 0xffffffff,
then the corresponding value is [0xffffffff00000000, 0xffffffff00000020]. The range is
obviously beyond the original range [smin=0xffffffff80000000,smax=0x7fffffff] and the
range is not possible. So the upper 32bit must be 0, which implies smin = smin32 and
smax = smax32.
This patch fixed the issue by adding additional register deduction after 32-bit compare
insn. If the signed 32-bit register range is non-negative then 64-bit smin is
in range of [S32_MIN, S32_MAX], then the actual 64-bit smin/smax should be the same
as 32-bit smin32/smax32.
With this patch, iters/iter_arr_with_actual_elem_count succeeded with better register range:
from 15 to 20: R0=rdonly_mem(id=7,ref_obj_id=2,sz=4) R1_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=31,var_off=(0x0; 0x1f)) R6=scalar(id=1,smin=umin=smin32=umin32=1,smax=umax=smax32=umax32=32,var_off=(0x0; 0x3f)) R7=scalar(id=9,smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff)) R8=scalar(id=9,smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff)) R10=fp0 fp-8=iter_num(ref_id=2,state=active,depth=3) refs=2
Eduard Zingerman [Mon, 22 Jul 2024 23:38:37 +0000 (16:38 -0700)]
bpf, x86, riscv, arm: no_caller_saved_registers for bpf_get_smp_processor_id()
The function bpf_get_smp_processor_id() is processed in a different
way, depending on the arch:
- on x86 verifier replaces call to bpf_get_smp_processor_id() with a
sequence of instructions that modify only r0;
- on riscv64 jit replaces call to bpf_get_smp_processor_id() with a
sequence of instructions that modify only r0;
- on arm64 jit replaces call to bpf_get_smp_processor_id() with a
sequence of instructions that modify only r0 and tmp registers.
These rewrites satisfy attribute no_caller_saved_registers contract.
Allow rewrite of no_caller_saved_registers patterns for
bpf_get_smp_processor_id() in order to use this function as a canary
for no_caller_saved_registers tests.
Yonghong Song [Tue, 23 Jul 2024 15:34:44 +0000 (08:34 -0700)]
selftests/bpf: Add tests for ldsx of pkt data/data_end/data_meta accesses
The following tests are added to verifier_ldsx.c:
- sign extension of data/data_end/data_meta for tcx programs.
The actual checking is in bpf_skb_is_valid_access() which
is called by sk_filter, cg_skb, lwt, tc(tcx) and sk_skb.
- sign extension of data/data_end/data_meta for xdp programs.
- sign extension of data/data_end for flow_dissector programs.
All newly-added tests have verification failure with message
"invalid bpf_context access". Without previous patch, all these
tests succeeded verification.
Eduard Zingerman [Mon, 22 Jul 2024 23:38:36 +0000 (16:38 -0700)]
bpf: no_caller_saved_registers attribute for helper calls
GCC and LLVM define a no_caller_saved_registers function attribute.
This attribute means that function scratches only some of
the caller saved registers defined by ABI.
For BPF the set of such registers could be defined as follows:
- R0 is scratched only if function is non-void;
- R1-R5 are scratched only if corresponding parameter type is defined
in the function prototype.
This commit introduces flag bpf_func_prot->allow_nocsr.
If this flag is set for some helper function, verifier assumes that
it follows no_caller_saved_registers calling convention.
The contract between kernel and clang allows to simultaneously use
such functions and maintain backwards compatibility with old
kernels that don't understand no_caller_saved_registers calls
(nocsr for short):
- clang generates a simple pattern for nocsr calls, e.g.:
- kernel removes unnecessary spills and fills, if called function is
inlined by verifier or current JIT (with assumption that patch
inserted by verifier or JIT honors nocsr contract, e.g. does not
scratch r3-r5 for the example above), e.g. the code above would be
transformed to:
Technically, the transformation is split into the following phases:
- function mark_nocsr_patterns(), called from bpf_check()
searches and marks potential patterns in instruction auxiliary data;
- upon stack read or write access,
function check_nocsr_stack_contract() is used to verify if
stack offsets, presumably reserved for nocsr patterns, are used
only from those patterns;
- function remove_nocsr_spills_fills(), called from bpf_check(),
applies the rewrite for valid patterns.
See comment in mark_nocsr_pattern_for_call() for more details.
Yonghong Song [Tue, 23 Jul 2024 15:34:39 +0000 (08:34 -0700)]
bpf: Fail verification for sign-extension of packet data/data_end/data_meta
syzbot reported a kernel crash due to
commit 1f1e864b6555 ("bpf: Handle sign-extenstin ctx member accesses").
The reason is due to sign-extension of 32-bit load for
packet data/data_end/data_meta uapi field.
The original code looks like:
r2 = *(s32 *)(r1 + 76) /* load __sk_buff->data */
r3 = *(u32 *)(r1 + 80) /* load __sk_buff->data_end */
r0 = r2
r0 += 8
if r3 > r0 goto +1
...
Note that __sk_buff->data load has 32-bit sign extension.
After verification and convert_ctx_accesses(), the final asm code looks like:
r2 = *(u64 *)(r1 +208)
r2 = (s32)r2
r3 = *(u64 *)(r1 +80)
r0 = r2
r0 += 8
if r3 > r0 goto pc+1
...
Note that 'r2 = (s32)r2' may make the kernel __sk_buff->data address invalid
which may cause runtime failure.
Currently, in C code, typically we have
void *data = (void *)(long)skb->data;
void *data_end = (void *)(long)skb->data_end;
...
and it will generate
r2 = *(u64 *)(r1 +208)
r3 = *(u64 *)(r1 +80)
r0 = r2
r0 += 8
if r3 > r0 goto pc+1
If we allow sign-extension,
void *data = (void *)(long)(int)skb->data;
void *data_end = (void *)(long)skb->data_end;
...
the generated code looks like
r2 = *(u64 *)(r1 +208)
r2 <<= 32
r2 s>>= 32
r3 = *(u64 *)(r1 +80)
r0 = r2
r0 += 8
if r3 > r0 goto pc+1
and this will cause verification failure since "r2 <<= 32" is not allowed
as "r2" is a packet pointer.
To fix this issue for case
r2 = *(s32 *)(r1 + 76) /* load __sk_buff->data */
this patch added additional checking in is_valid_access() callback
function for packet data/data_end/data_meta access. If those accesses
are with sign-extenstion, the verification will fail.
Tony Ambardar [Tue, 23 Jul 2024 00:30:45 +0000 (17:30 -0700)]
tools/runqslower: Fix LDFLAGS and add LDLIBS support
Actually use previously defined LDFLAGS during build and add support for
LDLIBS to link extra standalone libraries e.g. 'argp' which is not provided
by musl libc.
Tony Ambardar [Sat, 20 Jul 2024 05:25:35 +0000 (22:25 -0700)]
selftests/bpf: Fix wrong binary in Makefile log output
Make log output incorrectly shows 'test_maps' as the binary name for every
'CLNG-BPF' build step, apparently picking up the last value defined for the
$(TRUNNER_BINARY) variable. Update the 'CLANG_BPF_BUILD_RULE' variants to
fix this confusing output.
Fixes: a5d0c26a2784 ("selftests/bpf: Add a cpuv4 test runner for cpu=v4 testing") Fixes: 89ad7420b25c ("selftests/bpf: Drop the need for LLVM's llc") Signed-off-by: Tony Ambardar <tony.ambardar@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20240720052535.2185967-1-tony.ambardar@gmail.com
selftests/bpf: Fix compilation failure when CONFIG_NET_FOU!=y
Without CONFIG_NET_FOU bpf selftests are unable to build because of
missing definitions. Add ___local versions of struct bpf_fou_encap and
enum bpf_fou_encap_type to fix the issue.
Tony Ambardar [Tue, 23 Jul 2024 00:13:29 +0000 (17:13 -0700)]
selftests/bpf: Fix error linking uprobe_multi on mips
Linking uprobe_multi.c on mips64el fails due to relocation overflows, when
the GOT entries required exceeds the default maximum. Add a specific CFLAGS
(-mxgot) for uprobe_multi.c on MIPS that allows using a larger GOT and
avoids errors such as:
/tmp/ccBTNQzv.o: in function `bench':
uprobe_multi.c:49:(.text+0x1d7720): relocation truncated to fit: R_MIPS_GOT_DISP against `uprobe_multi_func_08188'
uprobe_multi.c:49:(.text+0x1d7730): relocation truncated to fit: R_MIPS_GOT_DISP against `uprobe_multi_func_08189'
...
collect2: error: ld returned 1 exit status
Tony Ambardar [Tue, 23 Jul 2024 00:13:28 +0000 (17:13 -0700)]
selftests/bpf: Add missing system defines for mips
Update get_sys_includes in Makefile with missing MIPS-related definitions
to fix many, many compilation errors building selftests/bpf. The following
added defines drive conditional logic in system headers for word-size and
endianness selection:
Song Liu [Tue, 23 Jul 2024 05:14:55 +0000 (22:14 -0700)]
selftests/bpf: Add a test for mmap-able map in map
Regular BPF hash map is not mmap-able from user space. However, map-in-map
with outer map of type BPF_MAP_TYPE_HASH_OF_MAPS and mmap-able array as
inner map can perform similar operations as a mmap-able hash map. This
can be used by applications that benefit from fast accesses to some local
data.
Martin KaFai Lau [Tue, 23 Jul 2024 17:45:51 +0000 (10:45 -0700)]
Merge branch 'use network helpers, part 10'
Geliang Tang says:
====================
This set is part 10 of series "use network helpers" all BPF selftests
wide.
Patches 1-3 drop local functions make_client(), make_socket() and
inetaddr_len() in sk_lookup.c. Patch 4 drops a useless function
__start_server() in network_helpers.c.
====================
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
selftests/bpf: Drop __start_server in network_helpers
The helper start_server_addr() is a wrapper of __start_server(), the
only difference between them is __start_server() accepts a sockaddr type
address parameter, but start_server_addr() accepts a sockaddr_storage one.
This patch drops __start_server(), and updates the callers to invoke
start_server_addr() instead.
This patch uses the public network helers client_socket() + make_sockaddr()
in sk_lookup.c to create the client socket, set the timeout sockopts, and
make the connecting address. The local defined function make_socket()
can be dropped then.
This patch uses the new helper connect_to_addr_str() in sk_lookup.c to
create the client socket and connect to the server, instead of using local
defined function make_client(). This local function can be dropped then.
====================
Add BPF LSM return value range check, BPF part
From: Xu Kuohai <xukuohai@huawei.com>
LSM BPF prog may make kernel panic when returning an unexpected value,
such as returning positive value on hook file_alloc_security.
To fix it, series [1] refactored LSM hook return values and added
BPF return value check on top of that. Since the refactoring of LSM
hooks and checking BPF prog return value patches is not closely related,
this series separates BPF-related patches from [1].
selftests/bpf: Add return value checks for failed tests
The return ranges of some bpf lsm test progs can not be deduced by
the verifier accurately. To avoid erroneous rejections, add explicit
return value checks for these progs.
The compiler optimized the two bpf progs in token_lsm.c to make return
value from the bool variable in the "return -1" path, causing an
unexpected rejection:
24: (b4) w0 = -1 ; R0_w=0xffffffff
; int BPF_PROG(test_int_hook, struct vm_area_struct *vma, @ lsm.c:89
25: (95) exit
At program exit the register R0 has smin=4294967295 smax=4294967295 should have been in [-4095, 0]
It can be seen that instruction "w0 = -1" zero extended -1 to 64-bit
register r0, setting both smin and smax values of r0 to 4294967295.
This resulted in a false reject when r0 was checked with range [-4095, 0].
Given bpf lsm does not return 64-bit values, this patch fixes it by changing
the compare between r0 and return range from 64-bit operation to 32-bit
operation for bpf lsm.
bpf: Prevent tail call between progs attached to different hooks
bpf progs can be attached to kernel functions, and the attached functions
can take different parameters or return different return values. If
prog attached to one kernel function tail calls prog attached to another
kernel function, the ctx access or return value verification could be
bypassed.
For example, if prog1 is attached to func1 which takes only 1 parameter
and prog2 is attached to func2 which takes two parameters. Since verifier
assumes the bpf ctx passed to prog2 is constructed based on func2's
prototype, verifier allows prog2 to access the second parameter from
the bpf ctx passed to it. The problem is that verifier does not prevent
prog1 from passing its bpf ctx to prog2 via tail call. In this case,
the bpf ctx passed to prog2 is constructed from func1 instead of func2,
that is, the assumption for ctx access verification is bypassed.
Another example, if BPF LSM prog1 is attached to hook file_alloc_security,
and BPF LSM prog2 is attached to hook bpf_lsm_audit_rule_known. Verifier
knows the return value rules for these two hooks, e.g. it is legal for
bpf_lsm_audit_rule_known to return positive number 1, and it is illegal
for file_alloc_security to return positive number. So verifier allows
prog2 to return positive number 1, but does not allow prog1 to return
positive number. The problem is that verifier does not prevent prog1
from calling prog2 via tail call. In this case, prog2's return value 1
will be used as the return value for prog1's hook file_alloc_security.
That is, the return value rule is bypassed.
This patch adds restriction for tail call to prevent such bypasses.
A bpf prog returning a positive number attached to file_alloc_security
hook makes kernel panic.
This happens because file system can not filter out the positive number
returned by the LSM prog using IS_ERR, and misinterprets this positive
number as a file pointer.
Given that hook file_alloc_security never returned positive number
before the introduction of BPF LSM, and other BPF LSM hooks may
encounter similar issues, this patch adds LSM return value check
in verifier, to ensure no unexpected value is returned.
Martin KaFai Lau [Mon, 22 Jul 2024 18:30:47 +0000 (11:30 -0700)]
selftests/bpf: Ensure the unsupported struct_ops prog cannot be loaded
There is an existing "bpf_tcp_ca/unsupp_cong_op" test to ensure
the unsupported tcp-cc "get_info" struct_ops prog cannot be loaded.
This patch adds a new test in the bpf_testmod such that the
unsupported ops test does not depend on other kernel subsystem
where its supporting ops may be changed in the future.
Martin KaFai Lau [Mon, 22 Jul 2024 18:30:46 +0000 (11:30 -0700)]
selftests/bpf: Fix the missing tramp_1 to tramp_40 ops in cfi_stubs
The tramp_1 to tramp_40 ops is not set in the cfi_stubs in the
bpf_testmod_ops. It fails the struct_ops_multi_pages test after
retiring the unsupported_ops in the earlier patch.
This patch initializes them in a loop during the bpf_testmod_init().
Martin KaFai Lau [Mon, 22 Jul 2024 18:30:45 +0000 (11:30 -0700)]
bpf: Check unsupported ops from the bpf_struct_ops's cfi_stubs
The bpf_tcp_ca struct_ops currently uses a "u32 unsupported_ops[]"
array to track which ops is not supported.
After cfi_stubs had been added, the function pointer in cfi_stubs is
also NULL for the unsupported ops. Thus, the "u32 unsupported_ops[]"
becomes redundant. This observation was originally brought up in the
bpf/cfi discussion:
https://lore.kernel.org/bpf/CAADnVQJoEkdjyCEJRPASjBw1QGsKYrF33QdMGc1RZa9b88bAEA@mail.gmail.com/
The recent bpf qdisc patch (https://lore.kernel.org/bpf/20240714175130.4051012-6-amery.hung@bytedance.com/)
also needs to specify quite many unsupported ops. It is a good time
to clean it up.
This patch removes the need of "u32 unsupported_ops[]" and tests for null-ness
in the cfi_stubs instead.
Testing the cfi_stubs is done in a new function bpf_struct_ops_supported().
The verifier will call bpf_struct_ops_supported() when loading the
struct_ops program. The ".check_member" is removed from the bpf_tcp_ca
in this patch. ".check_member" could still be useful for other subsytems
to enforce other restrictions (e.g. sched_ext checks for prog->sleepable).
Tao Chen [Sun, 21 Jul 2024 14:42:21 +0000 (22:42 +0800)]
bpftool: Add net attach/detach command to tcx prog
Now, attach/detach tcx prog supported in libbpf, so we can add new
command 'bpftool attach/detach tcx' to attach tcx prog with bpftool
for user.
# bpftool prog load tc_prog.bpf.o /sys/fs/bpf/tc_prog
# bpftool prog show
...
192: sched_cls name tc_prog tag 187aeb611ad00cfc gpl
loaded_at 2024-07-11T15:58:16+0800 uid 0
xlated 152B jited 97B memlock 4096B map_ids 100,99,97
btf_id 260
# bpftool net attach tcx_ingress name tc_prog dev lo
# bpftool net
...
tc:
lo(1) tcx/ingress tc_prog prog_id 29
# bpftool net detach tcx_ingress dev lo
# bpftool net
...
tc:
# bpftool net attach tcx_ingress name tc_prog dev lo
# bpftool net
tc:
lo(1) tcx/ingress tc_prog prog_id 29
Test environment: ubuntu_22_04, 6.7.0-060700-generic
The issue is confirmed in the discussions of
"bpf, x64: Fix tailcall infinite loop" [0].
The issue has been resolved on both x86_64 and arm64 [1].
I provide a long commit message in the "bpf, x64: Fix tailcall hierarchy"
patch to describe how the issue happens and how this patchset resolves the
issue in details.
How does this patchset resolve the issue?
In short, it stores tail_call_cnt on the stack of main prog, and propagates
tail_call_cnt_ptr to its subprogs.
First, at the prologue of main prog, it initializes tail_call_cnt and
prepares tail_call_cnt_ptr. And at the prologue of subprog, it reuses
the tail_call_cnt_ptr from caller.
Then, when a tailcall happens, it increments tail_call_cnt by its pointer.
v5 -> v6:
* Address comments from Eduard:
* Add JITed dumping along annotating comments
* Rewrite two selftests with RUN_TESTS macro.
v4 -> v5:
* Solution changes from tailcall run ctx to tail_call_cnt and its pointer.
It's because v4 solution is unable to handle the case that there is no
tailcall in subprog but there is tailcall in EXT prog which attaches to
the subprog.
v3 -> v4:
* Solution changes from per-task tail_call_cnt to tailcall run ctx.
As for per-cpu/per-task solution, there is a case it is unable to handle [2].
v2 -> v3:
* Solution changes from percpu tail_call_cnt to tail_call_cnt at task_struct.
v1 -> v2:
* Solution changes from extra run-time call insn to percpu tail_call_cnt.
* Address comments from Alexei:
* Use percpu tail_call_cnt.
* Use asm to make sure no callee saved registers are touched.
RFC v2 -> v1:
* Solution changes from propagating tail_call_cnt with its pointer to extra
run-time call insn.
* Address comments from Maciej:
* Replace all memcpy(prog, x86_nops[5], X86_PATCH_SIZE) with
emit_nops(&prog, X86_PATCH_SIZE)
RFC v1 -> RFC v2:
* Address comments from Stanislav:
* Separate moving emit_nops() as first patch.
Leon Hwang [Sun, 14 Jul 2024 12:39:00 +0000 (20:39 +0800)]
bpf, x64: Fix tailcall hierarchy
This patch fixes a tailcall issue caused by abusing the tailcall in
bpf2bpf feature.
As we know, tail_call_cnt propagates by rax from caller to callee when
to call subprog in tailcall context. But, like the following example,
MAX_TAIL_CALL_CNT won't work because of missing tail_call_cnt
back-propagation from callee to caller.
SEC("tc")
int entry(struct __sk_buff *skb)
{
volatile int ret = 1;
count++;
subprog_tail1(skb);
subprog_tail2(skb);
return ret;
}
char __license[] SEC("license") = "GPL";
At run time, the tail_call_cnt in entry() will be propagated to
subprog_tail1() and subprog_tail2(). But, when the tail_call_cnt in
subprog_tail1() updates when bpf_tail_call_static(), the tail_call_cnt
in entry() won't be updated at the same time. As a result, in entry(),
when tail_call_cnt in entry() is less than MAX_TAIL_CALL_CNT and
subprog_tail1() returns because of MAX_TAIL_CALL_CNT limit,
bpf_tail_call_static() in suprog_tail2() is able to run because the
tail_call_cnt in subprog_tail2() propagated from entry() is less than
MAX_TAIL_CALL_CNT.
So, how many tailcalls are there for this case if no error happens?
From top-down view, does it look like hierarchy layer and layer?
With this view, there will be 2+4+8+...+2^33 = 2^34 - 2 = 17,179,869,182
tailcalls for this case.
How about there are N subprog_tail() in entry()? There will be almost
N^34 tailcalls.
Then, in this patch, it resolves this case on x86_64.
In stead of propagating tail_call_cnt from caller to callee, it
propagates its pointer, tail_call_cnt_ptr, tcc_ptr for short.
However, where does it store tail_call_cnt?
It stores tail_call_cnt on the stack of main prog. When tail call
happens in subprog, it increments tail_call_cnt by tcc_ptr.
Meanwhile, it stores tail_call_cnt_ptr on the stack of main prog, too.
And, before jump to tail callee, it has to pop tail_call_cnt and
tail_call_cnt_ptr.
Then, at the prologue of subprog, it must not make rax as
tail_call_cnt_ptr again. It has to reuse tail_call_cnt_ptr from caller.
As a result, at run time, it has to recognize rax is tail_call_cnt or
tail_call_cnt_ptr at prologue by:
1. rax is tail_call_cnt if rax is <= MAX_TAIL_CALL_CNT.
2. rax is tail_call_cnt_ptr if rax is > MAX_TAIL_CALL_CNT, because a
pointer won't be <= MAX_TAIL_CALL_CNT.
====================
bpf: track find_equal_scalars history on per-instruction level
This is a fix for precision tracking bug reported in [0].
It supersedes my previous attempt to fix similar issue in commit [1].
Here is a minimized test case from [0]:
W/o this fix verifier incorrectly assumes that instruction at label
(8) is unreachable. The issue is caused by failure to infer
precision mark for r0 at checkpoint #1:
- first verification path is:
- (0-4): r0 range [0,1];
- (5): r8 range [0,0], propagated to r7;
- (6): r8.id is reset;
- (7): jump is predicted to happen;
- (9-10): safe exit.
- when jump at (7) is predicted mark_chain_precision() for r7 is
called and backtrack_insn() proceeds as follows:
- at (7) r7 is marked as precise;
- at (5) r8 is not currently tracked and thus r0 is not marked;
- at (4-5) boundary logic from [1] is triggered and r7,r8 are marked
as precise;
- => r0 precision mark is missed.
- when second branch of (4) is considered, verifier prunes the state
because r0 is not marked as precise in the visited state.
Basically, backtracking logic fails to notice that at (5)
range information is gained for both r7 and r8, and thus both
r8 and r0 have to be marked as precise.
This happens because [1] can only account for such range
transfers at parent/child state boundaries.
The solution suggested by Andrii Nakryiko in [0] is to use jump
history to remember which registers gained range as a result of
find_equal_scalars() [renamed to sync_linked_regs()] and use
this information in backtrack_insn().
Which is what this patch-set does.
The patch-set uses u64 value as a vector of 10-bit values that
identify registers gaining range in find_equal_scalars().
This amounts to maximum of 6 possible values.
To check if such capacity is sufficient I've instrumented kernel
to track a histogram for maximal amount of registers that gain range
in find_equal_scalars per program verification [2].
Measurements done for verifier selftests and Cilium bpf object files
from [3] show that number of such registers is *always* <= 4 and
in 98% of cases it is <= 2.
When tested on a subset of selftests identified by
selftests/bpf/veristat.cfg and Cilium bpf object files from [3]
this patch-set has minimal verification performance impact:
[0] https://lore.kernel.org/bpf/CAEf4BzZ0xidVCqB47XnkXcNhkPWF6_nTV7yt+_Lf0kcFEut2Mg@mail.gmail.com/
[1] commit 904e6ddf4133 ("bpf: Use scalar ids in mark_chain_precision()")
[2] https://github.com/eddyz87/bpf/tree/find-equal-scalars-in-jump-history-with-stats
[3] https://github.com/anakryiko/cilium
Changes:
- v2 -> v3:
A number of stylistic changes suggested by Andrii:
- renamings:
- struct reg_or_spill -> linked_reg;
- find_equal_scalars() -> collect_linked_regs;
- copy_known_reg() -> sync_linked_regs;
- collect_linked_regs() now returns linked regs set of
size 2 or larger;
- dropped usage of bit fields in struct linked_reg;
- added a patch changing references to find_equal_scalars() in
selftests comments.
- v1 -> v2:
- patch "bpf: replace env->cur_hist_ent with a getter function" is
dropped (Andrii);
- added structure linked_regs and helper functions to [de]serialize
u64 value as such structure (Andrii);
- bt_set_equal_scalars() renamed to bt_sync_linked_regs(), moved to
start and end of backtrack_insn() in order to untie linked
register logic from conditional jumps backtracking.
Andrii requested a more radical change of moving linked registers
processing to bt_set_xxx() functions, I did an experiment in this
direction:
https://github.com/eddyz87/bpf/tree/find-equal-scalars-in-jump-history--linked-regs-in-bt-set-reg
the end result of the experiment seems much uglier than version
presented in v2.
Eduard Zingerman [Thu, 18 Jul 2024 20:23:55 +0000 (13:23 -0700)]
selftests/bpf: Tests for per-insn sync_linked_regs() precision tracking
Add a few test cases to verify precision tracking for scalars gaining
range because of sync_linked_regs():
- check what happens when more than 6 registers might gain range in
sync_linked_regs();
- check if precision is propagated correctly when operand of
conditional jump gained range in sync_linked_regs() and one of
linked registers is marked precise;
- check if precision is propagated correctly when operand of
conditional jump gained range in sync_linked_regs() and a
other-linked operand of the conditional jump is marked precise;
- add a minimized reproducer for precision tracking bug reported in [0];
- Check that mark_chain_precision() for one of the conditional jump
operands does not trigger equal scalars precision propagation.
Eduard Zingerman [Thu, 18 Jul 2024 20:23:54 +0000 (13:23 -0700)]
bpf: Remove mark_precise_scalar_ids()
Function mark_precise_scalar_ids() is superseded by
bt_sync_linked_regs() and equal scalars tracking in jump history.
mark_precise_scalar_ids() propagates precision over registers sharing
same ID on parent/child state boundaries, while jump history records
allow bt_sync_linked_regs() to propagate same information with
instruction level granularity, which is strictly more precise.
This commit removes mark_precise_scalar_ids() and updates test cases
in progs/verifier_scalar_ids to reflect new verifier behavior.
The tests are updated in the following manner:
- mark_precise_scalar_ids() propagated precision regardless of
presence of conditional jumps, while new jump history based logic
only kicks in when conditional jumps are present.
Hence test cases are augmented with conditional jumps to still
trigger precision propagation.
- As equal scalars tracking no longer relies on parent/child state
boundaries some test cases are no longer interesting,
such test cases are removed, namely:
- precision_same_state and precision_cross_state are superseded by
linked_regs_bpf_k;
- precision_same_state_broken_link and equal_scalars_broken_link
are superseded by linked_regs_broken_link.
Eduard Zingerman [Thu, 18 Jul 2024 20:23:53 +0000 (13:23 -0700)]
bpf: Track equal scalars history on per-instruction level
Use bpf_verifier_state->jmp_history to track which registers were
updated by find_equal_scalars() (renamed to collect_linked_regs())
when conditional jump was verified. Use recorded information in
backtrack_insn() to propagate precision.
E.g. for the following program:
while verifying instructions
1: r1 = r0 |
2: if r1 < 8 goto ... | push r0,r1 as linked registers in jmp_history
3: if r0 > 16 goto ... | push r0,r1 as linked registers in jmp_history
4: r2 = r10 |
5: r2 += r0 v mark_chain_precision(r0)
while doing mark_chain_precision(r0)
5: r2 += r0 | mark r0 precise
4: r2 = r10 |
3: if r0 > 16 goto ... | mark r0,r1 as precise
2: if r1 < 8 goto ... | mark r0,r1 as precise
1: r1 = r0 v
Technically, do this as follows:
- Use 10 bits to identify each register that gains range because of
sync_linked_regs():
- 3 bits for frame number;
- 6 bits for register or stack slot number;
- 1 bit to indicate if register is spilled.
- Use u64 as a vector of 6 such records + 4 bits for vector length.
- Augment struct bpf_jmp_history_entry with a field 'linked_regs'
representing such vector.
- When doing check_cond_jmp_op() remember up to 6 registers that
gain range because of sync_linked_regs() in such a vector.
- Don't propagate range information and reset IDs for registers that
don't fit in 6-value vector.
- Push a pair {instruction index, linked registers vector}
to bpf_verifier_state->jmp_history.
- When doing backtrack_insn() check if any of recorded linked
registers is currently marked precise, if so mark all linked
registers as precise.
This also requires fixes for two test_verifier tests:
- precise: test 1
- precise: test 2
Both tests contain the following instruction sequence:
The call to bpf_probe_read_kernel() at (26) forces r2 to be precise.
Previously, this forced all registers with same id to become precise
immediately when mark_chain_precision() is called.
After this change, the precision is propagated to registers sharing
same id only when 'if' instruction is backtracked.
Hence verification log for both tests is changed:
regs=r2,r9 -> regs=r2 for instructions 25..20.
Fixes: 904e6ddf4133 ("bpf: Use scalar ids in mark_chain_precision()") Reported-by: Hao Sun <sunhao.th@gmail.com> Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20240718202357.1746514-2-eddyz87@gmail.com Closes: https://lore.kernel.org/bpf/CAEf4BzZ0xidVCqB47XnkXcNhkPWF6_nTV7yt+_Lf0kcFEut2Mg@mail.gmail.com/
selftests/bpf: Use auto-dependencies for test objects
Make use of -M compiler options when building .test.o objects to
generate .d files and avoid re-building all tests every time.
Previously, if a single test bpf program under selftests/bpf/progs/*.c
has changed, make would rebuild all the *.bpf.o, *.skel.h and *.test.o
objects, which is a lot of unnecessary work.
A typical dependency chain is:
progs/x.c -> x.bpf.o -> x.skel.h -> x.test.o -> trunner_binary
However for many tests it's not a 1:1 mapping by name, and so far
%.test.o have been simply dependent on all %.skel.h files, and
%.skel.h files on all %.bpf.o objects.
Avoid full rebuilds by instructing the compiler (via -MMD) to
produce *.d files with real dependencies, and appropriately including
them. Exploit make feature that rebuilds included makefiles if they
were changed by setting %.test.d as prerequisite for %.test.o files.
A couple of examples of compilation time speedup (after the first
clean build):
$ touch progs/verifier_and.c && time make -j8
Before: real 0m16.651s
After: real 0m2.245s
$ touch progs/read_vsyscall.c && time make -j8
Before: real 0m15.743s
After: real 0m1.575s
A drawback of this change is that now there is an overhead due to make
processing lots of .d files, which potentially may slow down unrelated
targets. However a time to make all from scratch hasn't changed
significantly:
$ make clean && time make -j8
Before: real 1m31.148s
After: real 1m30.309s
Martin KaFai Lau [Thu, 18 Jul 2024 19:07:19 +0000 (12:07 -0700)]
Merge branch 'use network helpers, part 9'
Geliang Tang says:
====================
v3:
- patch 2:
- clear errno before connect_to_fd_opts.
- print err logs in run_test.
- set err to -1 when fd >= 0.
- patch 3:
- drop "int err".
v2:
- update patch 2 as Martin suggested.
This is the 9th part of series "use network helpers" all BPF selftests
wide.
Patches 1-2 update network helpers interfaces suggested by Martin.
Patch 3 adds a new helper connect_to_addr_str() as Martin suggested
instead of adding connect_fd_to_addr_str().
Patch 4 uses this newly added helper in make_client().
Patch 5 uses make_client() in sk_lookup and drop make_socket().
====================
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Similar to connect_to_addr() helper for connecting to a server with the
given sockaddr_storage type address, this patch adds a new helper named
connect_to_addr_str() for connecting to a server with the given string
type address "addr_str", together with its "family" and "port" as other
parameters of connect_to_addr_str().
In connect_to_addr_str(), the parameters "family", "addr_str" and "port"
are used to create a sockaddr_storage type address "addr" by invoking
make_sockaddr(). Then pass this "addr" together with "addrlen", "type"
and "opts" to connect_to_addr().
selftests/bpf: Drop must_fail from network_helper_opts
The struct member "must_fail" of network_helper_opts() is only used in
cgroup_v1v2 tests, it makes sense to drop it from network_helper_opts.
Return value (fd) of connect_to_fd_opts() and the expect errno (EPERM)
can be checked in cgroup_v1v2.c directly, no need to check them in
connect_fd_to_addr() in network_helpers.c.
This also makes connect_fd_to_addr() function useless. It can be replaced
by connect().
The "type" parameter of connect_to_fd_opts() is redundant of "server_fd".
Since the "type" can be obtained inside by invoking getsockopt(SO_TYPE),
without passing it in as a parameter.
This patch drops the "type" parameter of connect_to_fd_opts() and updates
its callers.
Merge tag 'net-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from bpf and netfilter.
A lot of networking people were at a conference last week, busy
catching COVID, so relatively short PR.
Current release - regressions:
- tcp: process the 3rd ACK with sk_socket for TFO and MPTCP
Current release - new code bugs:
- l2tp: protect session IDR and tunnel session list with one lock,
make sure the state is coherent to avoid a warning
- eth: bnxt_en: update xdp_rxq_info in queue restart logic
- eth: airoha: fix location of the MBI_RX_AGE_SEL_MASK field
Previous releases - regressions:
- xsk: require XDP_UMEM_TX_METADATA_LEN to actuate tx_metadata_len,
the field reuses previously un-validated pad
Previous releases - always broken:
- tap/tun: drop short frames to prevent crashes later in the stack
- eth: ice: add a per-VF limit on number of FDIR filters
- af_unix: disable MSG_OOB handling for sockets in sockmap/sockhash"
* tag 'net-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (34 commits)
tun: add missing verification for short frame
tap: add missing verification for short frame
mISDN: Fix a use after free in hfcmulti_tx()
gve: Fix an edge case for TSO skb validity check
bnxt_en: update xdp_rxq_info in queue restart logic
tcp: process the 3rd ACK with sk_socket for TFO/MPTCP
selftests/bpf: Add XDP_UMEM_TX_METADATA_LEN to XSK TX metadata test
xsk: Require XDP_UMEM_TX_METADATA_LEN to actuate tx_metadata_len
bpf: Fix a segment issue when downgrading gso_size
net: mediatek: Fix potential NULL pointer dereference in dummy net_device handling
MAINTAINERS: make Breno the netconsole maintainer
MAINTAINERS: Update bonding entry
net: nexthop: Initialize all fields in dumped nexthops
net: stmmac: Correct byte order of perfect_match
selftests: forwarding: skip if kernel not support setting bridge fdb learning limit
tipc: Return non-zero value from tipc_udp_addr2str() on error
netfilter: nft_set_pipapo_avx2: disable softinterrupts
ice: Fix recipe read procedure
ice: Add a per-VF limit on number of FDIR filters
net: bonding: correctly annotate RCU in bond_should_notify_peers()
...
Merge tag 'constfy-sysctl-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl
Pull sysctl constification from Joel Granados:
"Treewide constification of the ctl_table argument of proc_handlers
using a coccinelle script and some manual code formatting fixups.
This is a prerequisite to moving the static ctl_table structs into
read-only data section which will ensure that proc_handler function
pointers cannot be modified"
* tag 'constfy-sysctl-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl:
sysctl: treewide: constify the ctl_table argument of proc_handlers
Merge tag 'efi-fixes-for-v6.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi
Pull EFI fixes from Ard Biesheuvel:
- Wipe screen_info after allocating it from the heap - used by arm32
and EFI zboot, other EFI architectures allocate it statically
- Revert to allocating boot_params from the heap on x86 when entering
via the native PE entrypoint, to work around a regression on older
Dell hardware
* tag 'efi-fixes-for-v6.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
x86/efistub: Revert to heap allocated boot_params for PE entrypoint
efi/libstub: Zero initialize heap allocated struct screen_info
Merge tag 'kgdb-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux
Pull kgdb updates from Daniel Thompson:
"Three small changes this cycle:
- Clean up an architecture abstraction that is no longer needed
because all the architectures have converged.
- Actually use the prompt argument to kdb_position_cursor() instead
of ignoring it (functionally this fix is a nop but that was due to
luck rather than good judgement)
- Fix a -Wformat-security warning"
* tag 'kgdb-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux:
kdb: Get rid of redundant kdb_curr_task()
kdb: Use the passed prompt in kdb_position_cursor()
kdb: address -Wformat-security warnings
Merge tag 'mips_6.11_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux
Pull MIPS updates from Thomas Bogendoerfer:
- Use improved timer sync for Loongson64
- Fix address of GCR_ACCESS register
- Add missing MODULE_DESCRIPTION
* tag 'mips_6.11_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux:
mips: sibyte: add missing MODULE_DESCRIPTION() macro
MIPS: SMP-CPS: Fix address for GCR_ACCESS register for CM3 and later
MIPS: Loongson64: Switch to SYNC_R4K
Merge tag 'parisc-for-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull parisc updates from Helge Deller:
"The gettimeofday() and clock_gettime() syscalls are now available as
vDSO functions, and Dave added a patch which allows to use NVMe cards
in the PCI slots as fast and easy alternative to SCSI discs.
Summary:
- add gettimeofday() and clock_gettime() vDSO functions
- enable PCI_MSI_ARCH_FALLBACKS to allow PCI to PCIe bridge adaptor
with PCIe NVME card to function in parisc machines
- allow users to reduce kernel unaligned runtime warnings
- minor code cleanups"
* tag 'parisc-for-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: Add support for CONFIG_SYSCTL_ARCH_UNALIGN_NO_WARN
parisc: Use max() to calculate parisc_tlb_flush_threshold
parisc: Fix warning at drivers/pci/msi/msi.h:121
parisc: Add 64-bit gettimeofday() and clock_gettime() vDSO functions
parisc: Add 32-bit gettimeofday() and clock_gettime() vDSO functions
parisc: Clean up unistd.h file
Merge tag 'uml-for-linus-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux
Pull UML updates from Richard Weinberger:
- Support for preemption
- i386 Rust support
- Huge cleanup by Benjamin Berg
- UBSAN support
- Removal of dead code
* tag 'uml-for-linus-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux: (41 commits)
um: vector: always reset vp->opened
um: vector: remove vp->lock
um: register power-off handler
um: line: always fill *error_out in setup_one_line()
um: remove pcap driver from documentation
um: Enable preemption in UML
um: refactor TLB update handling
um: simplify and consolidate TLB updates
um: remove force_flush_all from fork_handler
um: Do not flush MM in flush_thread
um: Delay flushing syscalls until the thread is restarted
um: remove copy_context_skas0
um: remove LDT support
um: compress memory related stub syscalls while adding them
um: Rework syscall handling
um: Add generic stub_syscall6 function
um: Create signal stack memory assignment in stub_data
um: Remove stub-data.h include from common-offsets.h
um: time-travel: fix signal blocking race/hang
um: time-travel: remove time_exit()
...
Merge tag 'driver-core-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here is the big set of driver core changes for 6.11-rc1.
Lots of stuff in here, with not a huge diffstat, but apis are evolving
which required lots of files to be touched. Highlights of the changes
in here are:
- platform remove callback api final fixups (Uwe took many releases
to get here, finally!)
- Rust bindings for basic firmware apis and initial driver-core
interactions.
It's not all that useful for a "write a whole driver in rust" type
of thing, but the firmware bindings do help out the phy rust
drivers, and the driver core bindings give a solid base on which
others can start their work.
There is still a long way to go here before we have a multitude of
rust drivers being added, but it's a great first step.
- driver core const api changes.
This reached across all bus types, and there are some fix-ups for
some not-common bus types that linux-next and 0-day testing shook
out.
This work is being done to help make the rust bindings more safe,
as well as the C code, moving toward the end-goal of allowing us to
put driver structures into read-only memory. We aren't there yet,
but are getting closer.
- minor devres cleanups and fixes found by code inspection
- arch_topology minor changes
- other minor driver core cleanups
All of these have been in linux-next for a very long time with no
reported problems"
* tag 'driver-core-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (55 commits)
ARM: sa1100: make match function take a const pointer
sysfs/cpu: Make crash_hotplug attribute world-readable
dio: Have dio_bus_match() callback take a const *
zorro: make match function take a const pointer
driver core: module: make module_[add|remove]_driver take a const *
driver core: make driver_find_device() take a const *
driver core: make driver_[create|remove]_file take a const *
firmware_loader: fix soundness issue in `request_internal`
firmware_loader: annotate doctests as `no_run`
devres: Correct code style for functions that return a pointer type
devres: Initialize an uninitialized struct member
devres: Fix memory leakage caused by driver API devm_free_percpu()
devres: Fix devm_krealloc() wasting memory
driver core: platform: Switch to use kmemdup_array()
driver core: have match() callback in struct bus_type take a const *
MAINTAINERS: add Rust device abstractions to DRIVER CORE
device: rust: improve safety comments
MAINTAINERS: add Danilo as FIRMWARE LOADER maintainer
MAINTAINERS: add Rust FW abstractions to FIRMWARE LOADER
firmware: rust: improve safety comments
...
Merge tag 'linux-watchdog-6.11-rc1' of git://www.linux-watchdog.org/linux-watchdog
Pull watchdog updates from Wim Van Sebroeck:
- make watchdog_class const
- rework of the rzg2l_wdt driver
- other small fixes and improvements
* tag 'linux-watchdog-6.11-rc1' of git://www.linux-watchdog.org/linux-watchdog:
dt-bindings: watchdog: dlg,da9062-watchdog: Drop blank space
watchdog: rzn1: Convert comma to semicolon
watchdog: lenovo_se10_wdt: Convert comma to semicolon
dt-bindings: watchdog: renesas,wdt: Document RZ/G3S support
watchdog: rzg2l_wdt: Add suspend/resume support
watchdog: rzg2l_wdt: Rely on the reset driver for doing proper reset
watchdog: rzg2l_wdt: Remove comparison with zero
watchdog: rzg2l_wdt: Remove reset de-assert from probe
watchdog: rzg2l_wdt: Check return status of pm_runtime_put()
watchdog: rzg2l_wdt: Use pm_runtime_resume_and_get()
watchdog: rzg2l_wdt: Make the driver depend on PM
watchdog: rzg2l_wdt: Restrict the driver to ARCH_RZG2L and ARCH_R9A09G011
watchdog: imx7ulp_wdt: keep already running watchdog enabled
watchdog: starfive: Add missing clk_disable_unprepare()
watchdog: Make watchdog_class const
The cited commit missed to check against the validity of the frame length
in the tun_xdp_one() path, which could cause a corrupted skb to be sent
downstack. Even before the skb is transmitted, the
tun_xdp_one-->eth_type_trans() may access the Ethernet header although it
can be less than ETH_HLEN. Once transmitted, this could either cause
out-of-bound access beyond the actual length, or confuse the underlayer
with incorrect or inconsistent header length in the skb metadata.
In the alternative path, tun_get_user() already prohibits short frame which
has the length less than Ethernet header size from being transmitted for
IFF_TAP.
This is to drop any frame shorter than the Ethernet header size just like
how tun_get_user() does.
CVE: CVE-2024-41091 Inspired-by: https://lore.kernel.org/netdev/1717026141-25716-1-git-send-email-si-wei.liu@oracle.com/ Fixes: 043d222f93ab ("tuntap: accept an array of XDP buffs through sendmsg()") Cc: stable@vger.kernel.org Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Reviewed-by: Si-Wei Liu <si-wei.liu@oracle.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Jason Wang <jasowang@redhat.com> Link: https://patch.msgid.link/20240724170452.16837-3-dongli.zhang@oracle.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Si-Wei Liu [Wed, 24 Jul 2024 17:04:51 +0000 (10:04 -0700)]
tap: add missing verification for short frame
The cited commit missed to check against the validity of the frame length
in the tap_get_user_xdp() path, which could cause a corrupted skb to be
sent downstack. Even before the skb is transmitted, the
tap_get_user_xdp()-->skb_set_network_header() may assume the size is more
than ETH_HLEN. Once transmitted, this could either cause out-of-bound
access beyond the actual length, or confuse the underlayer with incorrect
or inconsistent header length in the skb metadata.
In the alternative path, tap_get_user() already prohibits short frame which
has the length less than Ethernet header size from being transmitted.
This is to drop any frame shorter than the Ethernet header size just like
how tap_get_user() does.
The NIC requires each TSO segment to not span more than 10
descriptors. NIC further requires each descriptor to not exceed
16KB - 1 (GVE_TX_MAX_BUF_SIZE_DQO).
The descriptors for an skb are generated by
gve_tx_add_skb_no_copy_dqo() for DQO RDA queue format.
gve_tx_add_skb_no_copy_dqo() loops through each skb frag and
generates a descriptor for the entire frag if the frag size is
not greater than GVE_TX_MAX_BUF_SIZE_DQO. If the frag size is
greater than GVE_TX_MAX_BUF_SIZE_DQO, it is split into descriptor(s)
of size GVE_TX_MAX_BUF_SIZE_DQO and a descriptor is generated for
the remainder (frag size % GVE_TX_MAX_BUF_SIZE_DQO).
gve_can_send_tso() checks if the descriptors thus generated for an
skb would meet the requirement that each TSO-segment not span more
than 10 descriptors. However, the current code misses an edge case
when a TSO segment spans multiple descriptors within a large frag.
This change fixes the edge case.
gve_can_send_tso() relies on the assumption that max gso size (9728)
is less than GVE_TX_MAX_BUF_SIZE_DQO and therefore within an skb
fragment a TSO segment can never span more than 2 descriptors.
Fixes: a57e5de476be ("gve: DQO: Add TX path") Signed-off-by: Praveen Kaligineedi <pkaligineedi@google.com> Signed-off-by: Bailey Forrest <bcf@google.com> Reviewed-by: Jeroen de Borst <jeroendb@google.com> Cc: stable@vger.kernel.org Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20240724143431.3343722-1-pkaligineedi@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
bnxt_en: update xdp_rxq_info in queue restart logic
When the netdev_rx_queue_restart() restarts queues, the bnxt_en driver
updates(creates and deletes) a page_pool.
But it doesn't update xdp_rxq_info, so the xdp_rxq_info is still
connected to an old page_pool.
So, bnxt_rx_ring_info->page_pool indicates a new page_pool, but
bnxt_rx_ring_info->xdp_rxq is still connected to an old page_pool.
An old page_pool is no longer used so it is supposed to be
deleted by page_pool_destroy() but it isn't.
Because the xdp_rxq_info is holding the reference count for it and the
xdp_rxq_info is not updated, an old page_pool will not be deleted in
the queue restart logic.
Before restarting queues, an interface has 4 page_pools.
After restarting one queue, an interface has 5 page_pools, but it
should be 4, not 5.
The reason is that queue restarting logic creates a new page_pool and
an old page_pool is not deleted due to the absence of an update of
xdp_rxq_info logic.
Jakub Kicinski [Thu, 25 Jul 2024 14:40:24 +0000 (07:40 -0700)]
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:
====================
pull-request: bpf 2024-07-25
We've added 14 non-merge commits during the last 8 day(s) which contain
a total of 19 files changed, 177 insertions(+), 70 deletions(-).
The main changes are:
1) Fix af_unix to disable MSG_OOB handling for sockets in BPF sockmap and
BPF sockhash. Also add test coverage for this case, from Michal Luczaj.
2) Fix a segmentation issue when downgrading gso_size in the BPF helper
bpf_skb_adjust_room(), from Fred Li.
3) Fix a compiler warning in resolve_btfids due to a missing type cast,
from Liwei Song.
4) Fix stack allocation for arm64 to align the stack pointer at a 16 byte
boundary in the fexit_sleep BPF selftest, from Puranjay Mohan.
5) Fix a xsk regression to require a flag when actuating tx_metadata_len,
from Stanislav Fomichev.
6) Fix function prototype BTF dumping in libbpf for prototypes that have
no input arguments, from Andrii Nakryiko.
7) Fix stacktrace symbol resolution in perf script for BPF programs
containing subprograms, from Hou Tao.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
selftests/bpf: Add XDP_UMEM_TX_METADATA_LEN to XSK TX metadata test
xsk: Require XDP_UMEM_TX_METADATA_LEN to actuate tx_metadata_len
bpf: Fix a segment issue when downgrading gso_size
tools/resolve_btfids: Fix comparison of distinct pointer types warning in resolve_btfids
bpf, events: Use prog to emit ksymbol event for main program
selftests/bpf: Test sockmap redirect for AF_UNIX MSG_OOB
selftests/bpf: Parametrize AF_UNIX redir functions to accept send() flags
selftests/bpf: Support SOCK_STREAM in unix_inet_redir_to_connected()
af_unix: Disable MSG_OOB handling for sockets in sockmap/sockhash
bpftool: Fix typo in usage help
libbpf: Fix no-args func prototype BTF dumping syntax
MAINTAINERS: Update powerpc BPF JIT maintainers
MAINTAINERS: Update email address of Naveen
selftests/bpf: fexit_sleep: Fix stack allocation for arm64
====================
tcp: process the 3rd ACK with sk_socket for TFO/MPTCP
The 'Fixes' commit recently changed the behaviour of TCP by skipping the
processing of the 3rd ACK when a sk->sk_socket is set. The goal was to
skip tcp_ack_snd_check() in tcp_rcv_state_process() not to send an
unnecessary ACK in case of simultaneous connect(). Unfortunately, that
had an impact on TFO and MPTCP.
I started to look at the impact on MPTCP, because the MPTCP CI found
some issues with the MPTCP Packetdrill tests [1]. Then Paolo Abeni
suggested me to look at the impact on TFO with "plain" TCP.
For MPTCP, when receiving the 3rd ACK of a request adding a new path
(MP_JOIN), sk->sk_socket will be set, and point to the MPTCP sock that
has been created when the MPTCP connection got established before with
the first path. The newly added 'goto' will then skip the processing of
the segment text (step 7) and not go through tcp_data_queue() where the
MPTCP options are validated, and some actions are triggered, e.g.
sending the MPJ 4th ACK [2] as demonstrated by the new errors when
running a packetdrill test [3] establishing a second subflow.
This doesn't fully break MPTCP, mainly the 4th MPJ ACK that will be
delayed. Still, we don't want to have this behaviour as it delays the
switch to the fully established mode, and invalid MPTCP options in this
3rd ACK will not be caught any more. This modification also affects the
MPTCP + TFO feature as well, and being the reason why the selftests
started to be unstable the last few days [4].
For TFO, the existing 'basic-cookie-not-reqd' test [5] was no longer
passing: if the 3rd ACK contains data, and the connection is accept()ed
before receiving them, these data would no longer be processed, and thus
not ACKed.
One last thing about MPTCP, in case of simultaneous connect(), a
fallback to TCP will be done, which seems fine:
`../common/defaults.sh`
0 socket(..., SOCK_STREAM|SOCK_NONBLOCK, IPPROTO_MPTCP) = 3
+0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)
Simultaneous SYN-data crossing is also not supported by TFO, see [6].
Kuniyuki Iwashima suggested to restrict the processing to SYN+ACK only:
that's a more generic solution than the one initially proposed, and
also enough to fix the issues described above.
Later on, Eric Dumazet mentioned that an ACK should still be sent in
reaction to the second SYN+ACK that is received: not sending a DUPACK
here seems wrong and could hurt:
0 socket(..., SOCK_STREAM|SOCK_NONBLOCK, IPPROTO_TCP) = 3
+0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)
+0 > S 0:0(0) <mss 1460, sackOK, TS val 1000 ecr 0,nop,wscale 8>
+0 < S 0:0(0) win 1000 <mss 1000, sackOK, nop, nop>
+0 > S. 0:0(0) ack 1 <mss 1460, sackOK, TS val 3308134035 ecr 0,nop,wscale 8>
+0 < S. 0:0(0) ack 1 win 1000 <mss 1000, sackOK, nop, nop>
+0 > . 1:1(0) ack 1 <nop, nop, sack 0:1> // <== Here
So in this version, the 'goto consume' is dropped, to always send an ACK
when switching from TCP_SYN_RECV to TCP_ESTABLISHED. This ACK will be
seen as a DUPACK -- with DSACK if SACK has been negotiated -- in case of
simultaneous SYN crossing: that's what is expected here.
Stanislav Fomichev [Sat, 13 Jul 2024 01:52:51 +0000 (18:52 -0700)]
xsk: Require XDP_UMEM_TX_METADATA_LEN to actuate tx_metadata_len
Julian reports that commit 341ac980eab9 ("xsk: Support tx_metadata_len")
can break existing use cases which don't zero-initialize xdp_umem_reg
padding. Introduce new XDP_UMEM_TX_METADATA_LEN to make sure we
interpret the padding as tx_metadata_len only when being explicitly
asked.
Fixes: 341ac980eab9 ("xsk: Support tx_metadata_len") Reported-by: Julian Schindel <mail@arctic-alpaca.de> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/bpf/20240713015253.121248-2-sdf@fomichev.me
Move the freeing of the dummy net_device from mtk_free_dev() to
mtk_remove().
Previously, if alloc_netdev_dummy() failed in mtk_probe(),
eth->dummy_dev would be NULL. The error path would then call
mtk_free_dev(), which in turn called free_netdev() assuming dummy_dev
was allocated (but it was not), potentially causing a NULL pointer
dereference.
By moving free_netdev() to mtk_remove(), we ensure it's only called when
mtk_probe() has succeeded and dummy_dev is fully allocated. This
addresses a potential NULL pointer dereference detected by Smatch[1].
Fixes: b209bd6d0bff ("net: mediatek: mtk_eth_sock: allocate dummy net_device dynamically") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/all/4160f4e0-cbef-4a22-8b5d-42c4d399e1f7@stanley.mountain/ [1] Suggested-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20240724080524.2734499-1-leitao@debian.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Paolo Abeni [Thu, 25 Jul 2024 09:17:21 +0000 (11:17 +0200)]
Merge tag 'nf-24-07-24' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains a Netfilter fix for net:
Patch #1 if FPU is busy, then pipapo set backend falls back to standard
set element lookup. Moreover, disable bh while at this.
From Florian Westphal.
netfilter pull request 24-07-24
* tag 'nf-24-07-24' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nft_set_pipapo_avx2: disable softinterrupts
====================
Paolo Abeni [Thu, 25 Jul 2024 08:10:05 +0000 (10:10 +0200)]
Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue
Tony Nguyen says:
====================
This series contains updates to ice driver only.
Ahmed enforces the iavf per VF filter limit on ice (PF) driver to prevent
possible resource exhaustion.
Wojciech corrects assignment of l2 flags read from firmware.
* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
ice: Fix recipe read procedure
ice: Add a per-VF limit on number of FDIR filters
====================
Merge tag 'phy-for-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/phy/linux-phy
Pull phy updates from Vinod Koul:
"New Support
- Samsung Exynos gs101 drd combo phy
- Qualcomm SC8180x USB uniphy, IPQ9574 QMP PCIe phy
- Airoha EN7581 PCIe phy
- Freescale i.MX8Q HSIO SerDes phy
- Starfive jh7110 dphy tx
Updates:
- Resume support for j721e-wiz driver
- Updates to Exynos usbdrd driver
- Support for optional power domains in g12a usb2-phy driver
- Debugfs support and updates to zynqmp driver"
* tag 'phy-for-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/phy/linux-phy: (56 commits)
phy: airoha: Add dtime and Rx AEQ IO registers
dt-bindings: phy: airoha: Add dtime and Rx AEQ IO registers
dt-bindings: phy: rockchip-emmc-phy: Convert to dtschema
dt-bindings: phy: qcom,qmp-usb: fix spelling error
phy: exynos5-usbdrd: support Exynos USBDRD 3.1 combo phy (HS & SS)
phy: exynos5-usbdrd: convert Vbus supplies to regulator_bulk
phy: exynos5-usbdrd: convert (phy) register access clock to clk_bulk
phy: exynos5-usbdrd: convert core clocks to clk_bulk
phy: exynos5-usbdrd: support isolating HS and SS ports independently
dt-bindings: phy: samsung,usb3-drd-phy: add gs101 compatible
phy: core: Fix documentation of of_phy_get
phy: starfive: Correct the dphy configure process
phy: zynqmp: Add debugfs support
phy: zynqmp: Take the phy mutex in xlate
phy: zynqmp: Only wait for PLL lock "primary" instances
phy: zynqmp: Store instance instead of type
phy: zynqmp: Enable reference clock correctly
phy: cadence-torrent: Check return value on register read
phy: Fix the cacography in phy-exynos5250-usb2.c
phy: phy-rockchip-samsung-hdptx: Select CONFIG_MFD_SYSCON
...
Merge tag 'soundwire-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/soundwire
Pull soundwire updates from Vinod Koul:
- Simplification across subsystem using cleanup.h
- Support for debugfs to read/write commands
- Few Intel and Qualcomm driver updates
* tag 'soundwire-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/soundwire:
soundwire: debugfs: simplify with cleanup.h
soundwire: cadence: simplify with cleanup.h
soundwire: intel_ace2x: simplify with cleanup.h
soundwire: intel_ace2x: simplify return path in hw_params
soundwire: intel: simplify with cleanup.h
soundwire: intel: simplify return path in hw_params
soundwire: amd_init: simplify with cleanup.h
soundwire: amd: simplify with cleanup.h
soundwire: amd: simplify return path in hw_params
soundwire: intel_auxdevice: start the bus at default frequency
soundwire: intel_auxdevice: add cs42l43 codec to wake_capable_list
drivers:soundwire: qcom: cleanup port maask calculations
soundwire: bus: simplify by using local slave->prop
soundwire: generic_bandwidth_allocation: change port_bo parameter to pointer
soundwire: Intel: clarify Copyright information
soundwire: intel_ace2.x: add AC timing extensions for PantherLake
soundwire: bus: add stream refcount
soundwire: debugfs: add interface to read/write commands
Joel Granados [Wed, 24 Jul 2024 18:59:29 +0000 (20:59 +0200)]
sysctl: treewide: constify the ctl_table argument of proc_handlers
const qualify the struct ctl_table argument in the proc_handler function
signatures. This is a prerequisite to moving the static ctl_table
structs into .rodata data which will ensure that proc_handler function
pointers cannot be modified.
This patch has been generated by the following coccinelle script:
* Code formatting was adjusted in xfs_sysctl.c to comply with code
conventions. The xfs_stats_clear_proc_handler,
xfs_panic_mask_proc_handler and xfs_deprecated_dointvec_minmax where
adjusted.
* The ctl_table argument in proc_watchdog_common was const qualified.
This is called from a proc_handler itself and is calling back into
another proc_handler, making it necessary to change it as part of the
proc_handler migration.
Co-developed-by: Thomas Weißschuh <linux@weissschuh.net> Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Co-developed-by: Joel Granados <j.granados@samsung.com> Signed-off-by: Joel Granados <j.granados@samsung.com>
Merge tag 'random-6.11-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random
Pull random number generator updates from Jason Donenfeld:
"This adds getrandom() support to the vDSO.
First, it adds a new kind of mapping to mmap(2), MAP_DROPPABLE, which
lets the kernel zero out pages anytime under memory pressure, which
enables allocating memory that never gets swapped to disk but also
doesn't count as being mlocked.
Then, the vDSO implementation of getrandom() is introduced in a
generic manner and hooked into random.c.
Next, this is implemented on x86. (Also, though it's not ready for
this pull, somebody has begun an arm64 implementation already)
Finally, two vDSO selftests are added.
There are also two housekeeping cleanup commits"
* tag 'random-6.11-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random:
MAINTAINERS: add random.h headers to RNG subsection
random: note that RNDGETPOOL was removed in 2.6.9-rc2
selftests/vDSO: add tests for vgetrandom
x86: vdso: Wire up getrandom() vDSO implementation
random: introduce generic vDSO getrandom() implementation
mm: add MAP_DROPPABLE for designating always lazily freeable mappings
Merge tag 'vfs-6.11-rc1.fixes.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
"VFS:
- The new 64bit mount ids start after the old mount id, i.e., at the
first non-32 bit value. However, we started counting one id too
late and thus lost 4294967296 as the first valid id. Fix that.
- Update a few comments on some vfs_*() creation helpers.
- Move copying of the xattr name out from the locks required to start
a filesystem write.
- Extend the filelock lock UAF fix to the compat code as well.
- Now that we added the ability to look up an inode under RCU it's
possible that lockless hash lookup can find and lock an inode after
it gets I_FREEING set. It then waits until inode teardown in
evict() is finished.
The flag however is still set after evict() has woken up all
waiters. If the inode lock is taken late enough on the waiting side
after hash removal and wakeup happened the waiting thread will
never be woken.
Before RCU based lookup this was synchronized via the
inode_hash_lock. But since unhashing requires the inode lock as
well we can check whether the inode is unhashed while holding inode
lock even without holding inode_hash_lock.
pidfd:
- The nsproxy structure contains nearly all of the namespaces
associated with a task. When a namespace type isn't supported
nsproxy might contain a NULL pointer or always point to the initial
namespace type. The logic isn't consistent. So when deriving
namespace fds we need to ensure that the namespace type is
supported.
First, so that we don't risk dereferncing NULL pointers. The
correct bigger fix would be to change all namespaces to always set
a valid namespace pointer in struct nsproxy independent of whether
or not it is compiled in. But that requires quite a few changes.
Second, so that we don't allow deriving namespace fds when the
namespace type doesn't exist and thus when they couldn't also be
derived via /proc/self/ns/.
- Add missing selftests for the new pidfd ioctls to derive namespace
fds. This simply extends the already existing testsuite.
netfs:
- Fix debug logging and fix kconfig variable name so it actually
works.
- Fix writeback that goes both to the server and cache. The streams
are only activated once a subreq is added. When a server write
happens the subreq doesn't need to have finished by the time the
cache write is started. If the server write has already finished by
the time the cache write is about to start the cache write will
operate on a folio that might already have been reused. Fix this by
preactivating the cache write.
- Limit cachefiles subreq size for cache writes to MAX_RW_COUNT"
* tag 'vfs-6.11-rc1.fixes.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
inode: clarify what's locked
vfs: Fix potential circular locking through setxattr() and removexattr()
filelock: Fix fcntl/close race recovery compat path
fs: use all available ids
cachefiles: Set the max subreq size for cache writes to MAX_RW_COUNT
netfs: Fix writeback that needs to go to both server and cache
pidfs: add selftests for new namespace ioctls
pidfs: handle kernels without namespaces cleanly
pidfs: when time ns disabled add check for ioctl
vfs: correct the comments of vfs_*() helpers
vfs: handle __wait_on_freeing_inode() and evict() race
netfs: Rename CONFIG_FSCACHE_DEBUG to CONFIG_NETFS_DEBUG
netfs: Revert "netfs: Switch debug logging to pr_debug()"
Commit e3ec0fe944d2 ("hostfs: Convert hostfs_read_folio() to use a
folio") simplified hostfs_read_folio(), but in the process of converting
to using folios natively also mis-used the folio_zero_tail() function
due to the very confusing API of that function.
Very arguably it's folio_zero_tail() API itself that is buggy, since it
would make more sense (and the documentation kind of implies) that the
third argument would be the pointer to the beginning of the folio
buffer.
But no, the third argument to folio_zero_tail() is where we should start
zeroing the tail (even if we already also pass in the offset separately
as the second argument).
So fix the hostfs caller, and we can leave any folio_zero_tail() sanity
cleanup for later.
Reported-and-tested-by: Maciej Żenczykowski <maze@google.com> Fixes: e3ec0fe944d2 ("hostfs: Convert hostfs_read_folio() to use a folio") Link: https://lore.kernel.org/all/CANP3RGceNzwdb7w=vPf5=7BCid5HVQDmz1K5kC9JG42+HVAh_g@mail.gmail.com/ Cc: Matthew Wilcox <willy@infradead.org> Cc: Christian Brauner <brauner@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>