Geliang Tang [Thu, 30 May 2024 07:41:08 +0000 (15:41 +0800)]
selftests/bpf: Use connect_to_fd_opts in do_test in bpf_tcp_ca
This patch uses connect_to_fd_opts() instead of using connect_fd_to_fd()
and settcpca() in do_test() in prog_tests/bpf_tcp_ca.c to accept a struct
network_helper_opts argument.
Then define a dctcp dedicated post_socket_cb callback stg_post_socket_cb(),
invoking both settcpca() and bpf_map_update_elem() in it, and set it in
test_dctcp(). For passing map_fd into stg_post_socket_cb() callback, a new
member map_fd is added in struct cb_opts.
Add another "const struct network_helper_opts *cli_opts" to do_test() to
separate it from the server "opts".
Mykyta Yatsenko [Wed, 5 Jun 2024 17:51:35 +0000 (18:51 +0100)]
libbpf: Auto-attach struct_ops BPF maps in BPF skeleton
Similarly to `bpf_program`, support `bpf_map` automatic attachment in
`bpf_object__attach_skeleton`. Currently only struct_ops maps could be
attached.
On bpftool side, code-generate links in skeleton struct for struct_ops maps.
Similarly to `bpf_program_skeleton`, set links in `bpf_map_skeleton`.
On libbpf side, extend `bpf_map` with new `autoattach` field to support
enabling or disabling autoattach functionality, introducing
getter/setter for this field.
`bpf_object__(attach|detach)_skeleton` is extended with
attaching/detaching struct_ops maps logic.
Yonghong Song [Wed, 5 Jun 2024 20:12:03 +0000 (13:12 -0700)]
selftests/bpf: Fix send_signal test with nested CONFIG_PARAVIRT
Alexei reported that send_signal test may fail with nested CONFIG_PARAVIRT
configs. In this particular case, the base VM is AMD with 166 cpus, and I
run selftests with regular qemu on top of that and indeed send_signal test
failed. I also tried with an Intel box with 80 cpus and there is no issue.
In the test, two processes are communicating through pipe. Further debugging
with strace found that the above splat is triggered as read() syscall could
not receive the data even if the corresponding write() syscall in another
process successfully wrote data into the pipe.
The failed subtest is "send_signal_perf". The corresponding perf event has
sample_period 1 and config PERF_COUNT_SW_CPU_CLOCK. sample_period 1 means every
overflow event will trigger a call to the BPF program. So I suspect this may
overwhelm the system. So I increased the sample_period to 100,000 and the test
passed. The sample_period 10,000 still has the test failed.
In other parts of selftest, e.g., [1], sample_freq is used instead. So I
decided to use sample_freq = 1,000 since the test can pass as well.
Andrii Nakryiko [Wed, 5 Jun 2024 00:16:29 +0000 (17:16 -0700)]
libbpf: Remove callback-based type/string BTF field visitor helpers
Now that all libbpf/bpftool code switched to btf_field_iter, remove
btf_type_visit_type_ids() and btf_type_visit_str_offs() callback-based
helpers as not needed anymore.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Alan Maguire <alan.maguire@oracle.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20240605001629.4061937-6-andrii@kernel.org
Andrii Nakryiko [Wed, 5 Jun 2024 00:16:27 +0000 (17:16 -0700)]
libbpf: Make use of BTF field iterator in BTF handling code
Use new BTF field iterator logic to replace all the callback-based
visitor calls. There is still a .BTF.ext callback-based visitor APIs
that should be converted, which will happens as a follow up.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Alan Maguire <alan.maguire@oracle.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20240605001629.4061937-4-andrii@kernel.org
Andrii Nakryiko [Wed, 5 Jun 2024 00:16:25 +0000 (17:16 -0700)]
libbpf: Add BTF field iterator
Implement iterator-based type ID and string offset BTF field iterator.
This is used extensively in BTF-handling code and BPF linker code for
various sanity checks, rewriting IDs/offsets, etc. Currently this is
implemented as visitor pattern calling custom callbacks, which makes the
logic (especially in simple cases) unnecessarily obscure and harder to
follow.
Having equivalent functionality using iterator pattern makes for simpler
to understand and maintain code. As we add more code for BTF processing
logic in libbpf, it's best to switch to iterator pattern before adding
more callback-based code.
The idea for iterator-based implementation is to record offsets of
necessary fields within fixed btf_type parts (which should be iterated
just once), and, for kinds that have multiple members (based on vlen
field), record where in each member necessary fields are located.
Generic iteration code then just keeps track of last offset that was
returned and handles N members correctly. Return type is just u32
pointer, where NULL is returned when all relevant fields were already
iterated.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Alan Maguire <alan.maguire@oracle.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20240605001629.4061937-2-andrii@kernel.org
Yonghong Song [Tue, 4 Jun 2024 18:00:34 +0000 (11:00 -0700)]
selftests/bpf: Ignore .llvm.<hash> suffix in kallsyms_find()
I hit the following failure when running selftests with
internal backported upstream kernel:
test_ksyms:PASS:kallsyms_fopen 0 nsec
test_ksyms:FAIL:ksym_find symbol 'bpf_link_fops' not found
#123 ksyms:FAIL
In /proc/kallsyms, we have
$ cat /proc/kallsyms | grep bpf_link_fops ffffffff829f0cb0 d bpf_link_fops.llvm.12608678492448798416
The CONFIG_LTO_CLANG_THIN is enabled in the kernel which is responsible
for bpf_link_fops.llvm.12608678492448798416 symbol name.
In prog_tests/ksyms.c we have
kallsyms_find("bpf_link_fops", &link_fops_addr)
and kallsyms_find() compares "bpf_link_fops" with symbols
in /proc/kallsyms in order to find the entry. With
bpf_link_fops.llvm.<hash> in /proc/kallsyms, the kallsyms_find()
failed.
To fix the issue, in kallsyms_find(), if a symbol has suffix
.llvm.<hash>, that suffix will be ignored for comparison.
This fixed the test failure.
Song Liu [Tue, 4 Jun 2024 07:07:00 +0000 (00:07 -0700)]
selftests/bpf: Fix bpf_cookie and find_vma in nested VM
bpf_cookie and find_vma are flaky in nested VMs, which is used by some CI
systems. It turns out these failures are caused by unreliable perf event
in nested VM. Fix these by:
1. Use PERF_COUNT_SW_CPU_CLOCK in find_vma;
2. Increase sample_freq in bpf_cookie.
====================
Enable BPF programs to declare arrays of kptr, bpf_rb_root, and bpf_list_head.
Some types, such as type kptr, bpf_rb_root, and bpf_list_head, are
treated in a special way. Previously, these types could not be the
type of a field in a struct type that is used as the type of a global
variable. They could not be the type of a field in a struct type that
is used as the type of a field in the value type of a map either. They
could not even be the type of array elements. This means that they can
only be the type of global variables or of direct fields in the value
type of a map.
The patch set aims to enable the use of these specific types in arrays
and struct fields, providing flexibility. It examines the types of
global variables or the value types of maps, such as arrays and struct
types, recursively to identify these special types and generate field
information for them.
For example,
...
struct task_struct __kptr *ptr[3];
...
it will create 3 instances of "struct btf_field" in the "btf_record" of
the data section.
Assuming that the size of an element/struct A is 0x100 and "foo"
starts at 0x7000, it includes two kptr records at 0x7100 and 0x7200,
and two rbtree root records at 0x7108 and 0x7208.
All these field information will be flatten, for struct types, and
repeated, for arrays.
---
Changes from v6:
- Return BPF_KPTR_REF from btf_get_field_type() only if var_type is a
struct type.
- Pass btf and type to btf_get_field_type().
Changes from v5:
- Ensure field->offset values of kptrs are advanced correctly from
one nested struct/or array to another.
Changes from v4:
- Return -E2BIG for i == MAX_RESOLVE_DEPTH.
Changes from v3:
- Refactor the common code of btf_find_struct_field() and
btf_find_datasec_var().
- Limit the number of levels looking into a struct types.
Changes from v2:
- Support fields in nested struct type.
- Remove nelems and duplicate field information with offset
adjustments for arrays.
Changes from v1:
- Move the check of element alignment out of btf_field_cmp() to
btf_record_find().
- Change the order of the previous patch 4 "bpf:
check_map_kptr_access() compute the offset from the reg state" as
the patch 7 now.
- Reject BPF_RB_NODE and BPF_LIST_NODE with nelems > 1.
- Rephrase the commit log of the patch "bpf: check_map_access() with
the knowledge of arrays" to clarify the alignment on elements.
Kui-Feng Lee (9):
bpf: Remove unnecessary checks on the offset of btf_field.
bpf: Remove unnecessary call to btf_field_type_size().
bpf: refactor btf_find_struct_field() and btf_find_datasec_var().
bpf: create repeated fields for arrays.
bpf: look into the types of the fields of a struct type recursively.
bpf: limit the number of levels of a nested struct type.
selftests/bpf: Test kptr arrays and kptrs in nested struct fields.
selftests/bpf: Test global bpf_rb_root arrays and fields in nested
struct types.
selftests/bpf: Test global bpf_list_head arrays.
Kui-Feng Lee [Thu, 23 May 2024 17:42:00 +0000 (10:42 -0700)]
selftests/bpf: Test kptr arrays and kptrs in nested struct fields.
Make sure that BPF programs can declare global kptr arrays and kptr fields
in struct types that is the type of a global variable or the type of a
nested descendant field in a global variable.
An array with only one element is special case, that it treats the element
like a non-array kptr field. Nested arrays are also tested to ensure they
are handled properly.
Kui-Feng Lee [Thu, 23 May 2024 17:41:58 +0000 (10:41 -0700)]
bpf: look into the types of the fields of a struct type recursively.
The verifier has field information for specific special types, such as
kptr, rbtree root, and list head. These types are handled
differently. However, we did not previously examine the types of fields of
a struct type variable. Field information records were not generated for
the kptrs, rbtree roots, and linked_list heads that are not located at the
outermost struct type of a variable.
For example,
struct A {
struct task_struct __kptr * task;
};
struct B {
struct A mem_a;
}
struct B var_b;
It did not examine "struct A" so as not to generate field information for
the kptr in "struct A" for "var_b".
This patch enables BPF programs to define fields of these special types in
a struct type other than the direct type of a variable or in a struct type
that is the type of a field in the value type of a map.
Kui-Feng Lee [Thu, 23 May 2024 17:41:57 +0000 (10:41 -0700)]
bpf: create repeated fields for arrays.
The verifier uses field information for certain special types, such as
kptr, rbtree root, and list head. These types are treated
differently. However, we did not previously support these types in
arrays. This update examines arrays and duplicates field information the
same number of times as the length of the array if the element type is one
of the special types.
Kui-Feng Lee [Thu, 23 May 2024 17:41:55 +0000 (10:41 -0700)]
bpf: Remove unnecessary call to btf_field_type_size().
field->size has been initialized by bpf_parse_fields() with the value
returned by btf_field_type_size(). Use it instead of calling
btf_field_type_size() again.
Geliang Tang [Thu, 23 May 2024 06:50:00 +0000 (14:50 +0800)]
selftests/bpf: Replace tx_prog_fd with tx_prog in test_sockmap
bpf_program__attach_sockmap() needs to take a parameter of type bpf_program
instead of an fd, so tx_prog_fd becomes useless. This patch uses a pointer
tx_prog to point to an item in progs[] array.
Geliang Tang [Thu, 23 May 2024 06:49:59 +0000 (14:49 +0800)]
selftests/bpf: Use bpf_link attachments in test_sockmap
Switch attachments to bpf_link using bpf_program__attach_sockmap() instead
of bpf_prog_attach().
This patch adds a new array progs[] to replace prog_fd[] array, set in
populate_progs() for each program in bpf object.
And another new array links[] to save the attached bpf_link. It is
initalized as NULL in populate_progs, set as the return valuses of
bpf_program__attach_sockmap(), and detached by bpf_link__detach().
Geliang Tang [Thu, 23 May 2024 06:49:57 +0000 (14:49 +0800)]
selftests/bpf: Fix tx_prog_fd values in test_sockmap
The values of tx_prog_fd in run_options() should not be 0, so set it as -1
in else branch, and test it using "if (tx_prog_fd > 0)" condition, not
"if (tx_prog_fd)" or "if (tx_prog_fd >= 0)".
Xiao Wang [Fri, 24 May 2024 07:55:43 +0000 (15:55 +0800)]
riscv, bpf: Introduce shift add helper with Zba optimization
Zba extension is very useful for generating addresses that index into array
of basic data types. This patch introduces sh2add and sh3add helpers for
RV32 and RV64 respectively, to accelerate addressing for array of unsigned
long data.
Martin KaFai Lau [Thu, 30 May 2024 17:34:53 +0000 (10:34 -0700)]
Merge branch 'Notify user space when a struct_ops object is detached/unregistered'
Kui-Feng Lee says:
====================
The subsystems managing struct_ops objects may need to detach a
struct_ops object due to errors or other reasons. It would be useful
to notify user space programs so that error recovery or logging can be
carried out.
This patch set enables the detach feature for struct_ops links and
send an event to epoll when a link is detached. Subsystems could call
link->ops->detach() to detach a link and notify user space programs
through epoll.
The signatures of callback functions in "struct bpf_struct_ops" have
been changed as well to pass an extra link argument to
subsystems. Subsystems could detach the links received from reg() and
update() callbacks if there is. This also provides a way that
subsystems can distinguish registrations for an object that has been
registered multiple times for several links.
However, bpf struct_ops maps without BPF_F_LINK have no any link.
Subsystems will receive NULL link pointer for this case.
---
Changes from v6:
- Fix the missing header at patch 5.
- Move RCU_INIT_POINTER() back to its original position.
Changes from v5:
- Change the commit title of the patch for bpftool.
Changes from v4:
- Change error code for bpf_struct_ops_map_link_update()
- Always return 0 for bpf_struct_ops_map_link_detach()
- Hold update_mutex in bpf_struct_ops_link_create()
- Add a separated instance of file_operations for links supporting
poll.
- Fix bpftool for bpf_link_fops_poll.
Changes from v3:
- Add a comment to explain why holding update_mutex is not necessary
in bpf_struct_ops_link_create()
- Use rcu_access_pointer() in bpf_struct_ops_map_link_poll().
Changes from v2:
- Rephrased commit logs and comments.
- Addressed some mistakes from patch splitting.
- Replace mutex with spinlock in bpf_testmod.c to address lockdep
Splat and simplify the implementation.
- Fix an argument passing to rcu_dereference_protected().
Changes from v1:
- Pass a link to reg, unreg, and update callbacks.
- Provide a function to detach a link from underlying subsystems.
- Add a kfunc to mimic detachments from subsystems, and provide a
flexible way to control when to do detachments.
- Add two tests to detach a link from the subsystem after the refcount
of the link drops to zero.
Kui-Feng Lee [Thu, 30 May 2024 06:59:46 +0000 (23:59 -0700)]
bpftool: Change pid_iter.bpf.c to comply with the change of bpf_link_fops.
To support epoll, a new instance of file_operations, bpf_link_fops_poll,
has been added for links that support epoll. The pid_iter.bpf.c checks
f_ops for links and other BPF objects. The check should fail for struct_ops
links without this patch.
Kui-Feng Lee [Thu, 30 May 2024 06:59:43 +0000 (23:59 -0700)]
selftests/bpf: test struct_ops with epoll
Verify whether a user space program is informed through epoll with EPOLLHUP
when a struct_ops object is detached.
The BPF code in selftests/bpf/progs/struct_ops_module.c has become
complex. Therefore, struct_ops_detach.c has been added to segregate the BPF
code for detachment tests from the BPF code for other tests based on the
recommendation of Andrii Nakryiko.
Kui-Feng Lee [Thu, 30 May 2024 06:59:41 +0000 (23:59 -0700)]
bpf: support epoll from bpf struct_ops links.
Add epoll support to bpf struct_ops links to trigger EPOLLHUP event upon
detachment.
This patch implements the "poll" of the "struct file_operations" for BPF
links and introduces a new "poll" operator in the "struct bpf_link_ops". By
implementing "poll" of "struct bpf_link_ops" for the links of struct_ops,
the file descriptor of a struct_ops link can be added to an epoll file
descriptor to receive EPOLLHUP events.
Kui-Feng Lee [Thu, 30 May 2024 06:59:40 +0000 (23:59 -0700)]
bpf: enable detaching links of struct_ops objects.
Implement the detach callback in bpf_link_ops for struct_ops so that user
programs can detach a struct_ops link. The subsystems that struct_ops
objects are registered to can also use this callback to detach the links
being passed to them.
Kui-Feng Lee [Thu, 30 May 2024 06:59:39 +0000 (23:59 -0700)]
bpf: pass bpf_struct_ops_link to callbacks in bpf_struct_ops.
Pass an additional pointer of bpf_struct_ops_link to callback function reg,
unreg, and update provided by subsystems defined in bpf_struct_ops. A
bpf_struct_ops_map can be registered for multiple links. Passing a pointer
of bpf_struct_ops_link helps subsystems to distinguish them.
This pointer will be used in the later patches to let the subsystem
initiate a detachment on a link that was registered to it previously.
Andrii Nakryiko [Wed, 29 May 2024 23:01:48 +0000 (16:01 -0700)]
Merge branch 'bpf-add-a-generic-bits-iterator'
Yafang Shao says:
====================
bpf: Add a generic bits iterator
Three new kfuncs, namely bpf_iter_bits_{new,next,destroy}, have been
added for the new bpf_iter_bits functionality. These kfuncs enable the
iteration of the bits from a given address and a given number of bits.
- bpf_iter_bits_new
Initialize a new bits iterator for a given memory area. Due to the
limitation of bpf memalloc, the max number of bits to be iterated
over is (4096 * 8).
- bpf_iter_bits_next
Get the next bit in a bpf_iter_bits
- bpf_iter_bits_destroy
Destroy a bpf_iter_bits
The bits iterator can be used in any context and on any address.
Changes:
- v7->v8:
Refine the interface to avoid dealing with endianness (Andrii)
- v6->v7:
Fix endianness error for non-long-aligned data (Andrii)
- v5->v6:
Add positive tests (Andrii)
- v4->v5:
Simplify test cases (Andrii)
- v3->v4:
- Fix endianness error on s390x (Andrii)
- zero-initialize kit->bits_copy and zero out nr_bits (Andrii)
- v2->v3:
Optimization for u64/u32 mask (Andrii)
- v1->v2:
Simplify the CPU number verification code to avoid the failure on s390x
(Eduard)
- bpf: Add bpf_iter_cpumask
https://lwn.net/Articles/961104/
- bpf: Add new bpf helper bpf_for_each_cpu
https://lwn.net/Articles/939939/
====================
Yafang Shao [Fri, 17 May 2024 02:30:34 +0000 (10:30 +0800)]
selftests/bpf: Add selftest for bits iter
Add test cases for the bits iter:
- Positive cases
- Bit mask representing a single word (8-byte unit)
- Bit mask representing data spanning more than one word
- The index of the set bit
- Nagative cases
- bpf_iter_bits_destroy() is required after calling
bpf_iter_bits_new()
- bpf_iter_bits_destroy() can only destroy an initialized iter
- bpf_iter_bits_next() must use an initialized iter
- Bit mask representing zero words
- Bit mask representing fewer words than expected
- Case for ENOMEM
- Case for NULL pointer
Yafang Shao [Fri, 17 May 2024 02:30:33 +0000 (10:30 +0800)]
bpf: Add bits iterator
Add three new kfuncs for the bits iterator:
- bpf_iter_bits_new
Initialize a new bits iterator for a given memory area. Due to the
limitation of bpf memalloc, the max number of words (8-byte units) that
can be iterated over is limited to (4096 / 8).
- bpf_iter_bits_next
Get the next bit in a bpf_iter_bits
- bpf_iter_bits_destroy
Destroy a bpf_iter_bits
The bits iterator facilitates the iteration of the bits of a memory area,
such as cpumask. It can be used in any context and on any address.
Martin KaFai Lau [Wed, 29 May 2024 00:53:04 +0000 (17:53 -0700)]
Merge branch 'use network helpers, part 5'
Geliang Tang says:
====================
This patchset uses post_socket_cb callbacks of struct network_helper_opts
to refactor do_test() in bpf_tcp_ca.c.
v5:
- address Martin's comments in v4 (thanks)
- add patch 4, use start_server_str in test_dctcp_fallback too
- ASSERT_* is already used in settcpca, use this helper in cc_cb (patch 3).
v4:
- address Martin's comments in v3 (thanks).
- drop 2 patches, keep "type" as the individual arg to start_server_addr,
connect_to_addr and start_server_str.
v3:
- Add 4 new patches, 1-3 are cleanups. 4 adds a new helper.
- address Martin's comments in v2.
v2:
- rebased on commit "selftests/bpf: Add test for the use of new args in
cong_control"
====================
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Geliang Tang [Sat, 25 May 2024 12:08:19 +0000 (20:08 +0800)]
selftests/bpf: Use start_server_str in do_test in bpf_tcp_ca
This patch uses new helper start_server_str() in do_test() in bpf_tcp_ca.c
to accept a struct network_helper_opts argument instead of using
start_server() and settcpca(). Then change the type of the first paramenter
of do_test() into a struct network_helper_opts one.
Define its own cb_opts and opts for each test, set its own cc name into
cb_opts.cc, and cc_cb() into post_socket_cb callback, then pass it to
do_test().
Geliang Tang [Sat, 25 May 2024 12:08:18 +0000 (20:08 +0800)]
selftests/bpf: Use post_socket_cb in start_server_str
This patch uses start_server_str() helper in test_dctcp_fallback() in
bpf_tcp_ca.c, instead of using start_server() and settcpca(). For
support opts in start_server_str() helper, opts->cb_opts needs to be
passed to post_socket_cb() in __start_server().
Geliang Tang [Sat, 25 May 2024 12:08:17 +0000 (20:08 +0800)]
selftests/bpf: Use post_socket_cb in connect_to_fd_opts
Since the post_socket_cb() callback is added in struct network_helper_opts,
it's make sense to use it not only in __start_server(), but also in
connect_to_fd_opts(). Then it can be used to set TCP_CONGESTION sockopt.
Add a "void *" type member cb_opts into struct network_helper_opts, and add
a new struct named cb_opts in prog_tests/bpf_tcp_ca.c, then cc can be moved
into struct cb_opts from network_helper_opts. Define a new callback cc_cb()
to set TCP_CONGESTION sockopt, and set it to post_socket_cb pointer of opts.
Define a new cb_opts cubic, set it to cb_opts of opts. Pass this opts to
connect_to_fd_opts() in test_dctcp_fallback().
Geliang Tang [Sat, 25 May 2024 12:08:16 +0000 (20:08 +0800)]
selftests/bpf: Add start_server_str helper
It's a tech debt that start_server() does not take the "opts" argument.
It's pretty handy to have start_server() as a helper that takes string
address.
So this patch creates a new helper start_server_str(). Then start_server()
can be a wrapper of it.
Geliang Tang [Sat, 25 May 2024 12:08:15 +0000 (20:08 +0800)]
selftests/bpf: Drop struct post_socket_opts
It's not possible to have one generic/common "struct post_socket_opts"
for all tests. It's better to have the individual test define its own
callback opts struct.
So this patch drops struct post_socket_opts, and changes the second
parameter of post_socket_cb as "void *" type.
Mykyta Yatsenko [Fri, 24 May 2024 13:18:40 +0000 (14:18 +0100)]
libbpf: Configure log verbosity with env variable
Configure logging verbosity by setting LIBBPF_LOG_LEVEL environment
variable, which is applied only to default logger. Once user set their
custom logging callback, it is up to them to handle filtering.
We've added 23 non-merge commits during the last 11 day(s) which contain
a total of 45 files changed, 696 insertions(+), 277 deletions(-).
The main changes are:
1) Rename skb's mono_delivery_time to tstamp_type for extensibility
and add SKB_CLOCK_TAI type support to bpf_skb_set_tstamp(),
from Abhishek Chauhan.
2) Add netfilter CT zone ID and direction to bpf_ct_opts so that arbitrary
CT zones can be used from XDP/tc BPF netfilter CT helper functions,
from Brad Cowie.
3) Several tweaks to the instruction-set.rst IETF doc to address
the Last Call review comments, from Dave Thaler.
4) Small batch of riscv64 BPF JIT optimizations in order to emit more
compressed instructions to the JITed image for better icache efficiency,
from Xiao Wang.
5) Sort bpftool C dump output from BTF, aiming to simplify vmlinux.h
diffing and forcing more natural type definitions ordering,
from Mykyta Yatsenko.
6) Use DEV_STATS_INC() macro in BPF redirect helpers to silence
a syzbot/KCSAN race report for the tx_errors counter,
from Jiang Yunshui.
7) Un-constify bpf_func_info in bpftool to fix compilation with LLVM 17+
which started treating const structs as constants and thus breaking
full BTF program name resolution, from Ivan Babrou.
8) Fix up BPF program numbers in test_sockmap selftest in order to reduce
some of the test-internal array sizes, from Geliang Tang.
9) Small cleanup in Makefile.btf script to use test-ge check for v1.25-only
pahole, from Alan Maguire.
10) Fix bpftool's make dependencies for vmlinux.h in order to avoid needless
rebuilds in some corner cases, from Artem Savkov.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (23 commits)
bpf, net: Use DEV_STAT_INC()
bpf, docs: Fix instruction.rst indentation
bpf, docs: Clarify call local offset
bpf, docs: Add table captions
bpf, docs: clarify sign extension of 64-bit use of 32-bit imm
bpf, docs: Use RFC 2119 language for ISA requirements
bpf, docs: Move sentence about returning R0 to abi.rst
bpf: constify member bpf_sysctl_kern:: Table
riscv, bpf: Try RVC for reg move within BPF_CMPXCHG JIT
riscv, bpf: Use STACK_ALIGN macro for size rounding up
riscv, bpf: Optimize zextw insn with Zba extension
selftests/bpf: Handle forwarding of UDP CLOCK_TAI packets
net: Add additional bit to support clockid_t timestamp type
net: Rename mono_delivery_time to tstamp_type for scalabilty
selftests/bpf: Update tests for new ct zone opts for nf_conntrack kfuncs
net: netfilter: Make ct zone opts configurable for bpf ct helpers
selftests/bpf: Fix prog numbers in test_sockmap
bpf: Remove unused variable "prev_state"
bpftool: Un-const bpf_func_info to fix it for llvm 17 and newer
bpf: Fix order of args in call to bpf_map_kvcalloc
...
====================
'mlx4_port_config was added by
commit ab9c17a009ee ("mlx4_core: Modify driver initialization flow to
accommodate SRIOV for Ethernet")
but remained unused.
Remove it.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Gou Hao [Sun, 26 May 2024 14:57:18 +0000 (22:57 +0800)]
net/core: move the lockdep-init of sk_callback_lock to sk_init_common()
In commit cdfbabfb2f0c ("net: Work around lockdep limitation in
sockets that use sockets"), it introduces 'af_kern_callback_keys'
to lockdep-init of sk_callback_lock according to 'sk_kern_sock',
it modifies sock_init_data() only, and sk_clone_lock() calls
sk_init_common() to initialize sk_callback_lock too, so the
lockdep-init of sk_callback_lock should be moved to sk_init_common().
yunshui [Thu, 23 May 2024 03:35:20 +0000 (11:35 +0800)]
bpf, net: Use DEV_STAT_INC()
syzbot/KCSAN reported that races happen when multiple CPUs updating
dev->stats.tx_error concurrently. Adopt SMP safe DEV_STATS_INC() to
update the dev->stats fields.
Dave Thaler [Sat, 25 May 2024 15:33:32 +0000 (08:33 -0700)]
bpf, docs: Clarify call local offset
In the Jump instructions section it explains that the offset is
"relative to the instruction following the jump instruction".
But the program-local section confusingly said "referenced by
offset from the call instruction, similar to JA".
This patch updates that sentence with consistent wording, saying
it's relative to the instruction following the call instruction.
Dave Thaler [Mon, 20 May 2024 21:52:55 +0000 (14:52 -0700)]
bpf, docs: clarify sign extension of 64-bit use of 32-bit imm
imm is defined as a 32-bit signed integer.
{MOV, K, ALU64} says it does "dst = src" (where src is 'imm') and it
does do dst = (s64)imm, which in that sense does sign extend imm. The MOVSX
instruction is explained as sign extending, so added the example of
{MOV, K, ALU64} to make this more clear.
{JLE, K, JMP} says it does "PC += offset if dst <= src" (where src is 'imm',
and the comparison is unsigned). This was apparently ambiguous to some
readers as to whether the comparison was "dst <= (u64)(u32)imm" or
"dst <= (u64)(s64)imm" so added an example to make this more clear.
Dave Thaler [Fri, 17 May 2024 16:58:55 +0000 (09:58 -0700)]
bpf, docs: Use RFC 2119 language for ISA requirements
Per IETF convention and discussion at LSF/MM/BPF, use MUST etc.
keywords as requested by IETF Area Director review. Also as
requested, indicate that documenting BTF is out of scope of this
document and will be covered by a separate IETF specification.
Added paragraph about the terminology that is required IETF boilerplate
and must be worded exactly as such.
Dave Thaler [Fri, 17 May 2024 15:34:45 +0000 (08:34 -0700)]
bpf, docs: Move sentence about returning R0 to abi.rst
As discussed at LSF/MM/BPF, the sentence about using R0 for returning
values from calls is part of the calling convention and belongs in
abi.rst. Any further additions or clarifications to this text are left
for future patches on abi.rst. The current patch is simply to unblock
progression of instruction-set.rst to a standard.
In contrast, the restriction of register numbers to the range 0-10
is untouched, left in the instruction-set.rst definition of the
src_reg and dst_reg fields.
Thomas Weißschuh [Sat, 18 May 2024 14:58:47 +0000 (16:58 +0200)]
bpf: constify member bpf_sysctl_kern:: Table
The sysctl core is preparing to only expose instances of struct ctl_table
as "const". This will also affect the ctl_table argument of sysctl handlers,
for which bpf_sysctl_kern::table is also used.
As the function prototype of all sysctl handlers throughout the tree
needs to stay consistent that change will be done in one commit.
To reduce the size of that final commit, switch this utility type which
is not bound by "typedef proc_handler" to "const struct ctl_table".
Xiao Wang [Sun, 19 May 2024 05:05:07 +0000 (13:05 +0800)]
riscv, bpf: Try RVC for reg move within BPF_CMPXCHG JIT
We could try to emit compressed insn for reg move operation during CMPXCHG
JIT, the instruction compression has no impact on the jump offsets of
following forward and backward jump instructions.
Xiao Wang [Thu, 16 May 2024 09:04:30 +0000 (17:04 +0800)]
riscv, bpf: Optimize zextw insn with Zba extension
The Zba extension provides add.uw insn which can be used to implement
zext.w with rs2 set as ZERO.
Signed-off-by: Xiao Wang <xiao.w.wang@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Pu Lehui <pulehui@huawei.com> Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Reviewed-by: Pu Lehui <pulehui@huawei.com> Link: https://lore.kernel.org/bpf/20240516090430.493122-1-xiao.w.wang@intel.com
Martin KaFai Lau [Thu, 23 May 2024 21:13:57 +0000 (14:13 -0700)]
Merge branch 'Replace mono_delivery_time with tstamp_type'
Abhishek Chauhan says:
====================
Patch 1 :- This patch takes care of only renaming the mono delivery
timestamp to tstamp_type with no change in functionality of
existing available code in kernel also
Starts assigning tstamp_type with either mono or real and
introduces a new enum in the skbuff.h, again no change in functionality
of the existing available code in kernel , just making the code scalable.
Patch 2 :- Additional bit was added to support tai timestamp type to
avoid tstamp drops in the forwarding path when testing TC-ETF.
Patch is also updating bpf filter.c
Some updates to bpf header files with introduction to BPF_SKB_CLOCK_TAI
and documentation updates stating deprecation of BPF_SKB_TSTAMP_UNSPEC
and BPF_SKB_TSTAMP_DELIVERY_MONO
Patch 3:- Handles forwarding of UDP packets with TAI clock id tstamp_type
type with supported changes for tc_redirect/tc_redirect_dtime
to handle forwarding of UDP packets with TAI tstamp_type
====================
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Abhishek Chauhan [Thu, 9 May 2024 21:18:34 +0000 (14:18 -0700)]
selftests/bpf: Handle forwarding of UDP CLOCK_TAI packets
With changes in the design to forward CLOCK_TAI in the skbuff
framework, existing selftest framework needs modification
to handle forwarding of UDP packets with CLOCK_TAI as clockid.
Abhishek Chauhan [Thu, 9 May 2024 21:18:33 +0000 (14:18 -0700)]
net: Add additional bit to support clockid_t timestamp type
tstamp_type is now set based on actual clockid_t compressed
into 2 bits.
To make the design scalable for future needs this commit bring in
the change to extend the tstamp_type:1 to tstamp_type:2 to support
other clockid_t timestamp.
We now support CLOCK_TAI as part of tstamp_type as part of this
commit with existing support CLOCK_MONOTONIC and CLOCK_REALTIME.
Abhishek Chauhan [Thu, 9 May 2024 21:18:32 +0000 (14:18 -0700)]
net: Rename mono_delivery_time to tstamp_type for scalabilty
mono_delivery_time was added to check if skb->tstamp has delivery
time in mono clock base (i.e. EDT) otherwise skb->tstamp has
timestamp in ingress and delivery_time at egress.
Renaming the bitfield from mono_delivery_time to tstamp_type is for
extensibilty for other timestamps such as userspace timestamp
(i.e. SO_TXTIME) set via sock opts.
As we are renaming the mono_delivery_time to tstamp_type, it makes
sense to start assigning tstamp_type based on enum defined
in this commit.
Earlier we used bool arg flag to check if the tstamp is mono in
function skb_set_delivery_time, Now the signature of the functions
accepts tstamp_type to distinguish between mono and real time.
Also skb_set_delivery_type_by_clockid is a new function which accepts
clockid to determine the tstamp_type.
In future tstamp_type:1 can be extended to support userspace timestamp
by increasing the bitfield.
Linus Torvalds [Thu, 23 May 2024 19:49:37 +0000 (12:49 -0700)]
Merge tag 'net-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Quite smaller than usual. Notably it includes the fix for the unix
regression from the past weeks. The TCP window fix will require some
follow-up, already queued.
Current release - regressions:
- af_unix: fix garbage collection of embryos
Previous releases - regressions:
- af_unix: fix race between GC and receive path
- ipv6: sr: fix missing sk_buff release in seg6_input_core
- tcp: remove 64 KByte limit for initial tp->rcv_wnd value
- eth: r8169: fix rx hangup
- eth: lan966x: remove ptp traps in case the ptp is not enabled
- eth: ixgbe: fix link breakage vs cisco switches
- eth: ice: prevent ethtool from corrupting the channels
Previous releases - always broken:
- openvswitch: set the skbuff pkt_type for proper pmtud support
- tcp: Fix shift-out-of-bounds in dctcp_update_alpha()
Misc:
- a bunch of selftests stabilization patches"
* tag 'net-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (25 commits)
r8169: Fix possible ring buffer corruption on fragmented Tx packets.
idpf: Interpret .set_channels() input differently
ice: Interpret .set_channels() input differently
nfc: nci: Fix handling of zero-length payload packets in nci_rx_work()
net: relax socket state check at accept time.
tcp: remove 64 KByte limit for initial tp->rcv_wnd value
net: ti: icssg_prueth: Fix NULL pointer dereference in prueth_probe()
tls: fix missing memory barrier in tls_init
net: fec: avoid lock evasion when reading pps_enable
Revert "ixgbe: Manual AN-37 for troublesome link partners for X550 SFI"
testing: net-drv: use stats64 for testing
net: mana: Fix the extra HZ in mana_hwc_send_request
net: lan966x: Remove ptp traps in case the ptp is not enabled.
openvswitch: Set the skbuff pkt_type for proper pmtud support.
selftest: af_unix: Make SCM_RIGHTS into OOB data.
af_unix: Fix garbage collection of embryos carrying OOB with SCM_RIGHTS
tcp: Fix shift-out-of-bounds in dctcp_update_alpha().
selftests/net: use tc rule to filter the na packet
ipv6: sr: fix memleak in seg6_hmac_init_algo
af_unix: Update unix_sk(sk)->oob_skb under sk_receive_queue lock.
...
Linus Torvalds [Thu, 23 May 2024 19:36:38 +0000 (12:36 -0700)]
Merge tag 'trace-fixes-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
"Minor last minute fixes:
- Fix a very tight race between the ring buffer readers and resizing
the ring buffer
- Correct some stale comments in the ring buffer code
- Fix kernel-doc in the rv code
- Add a MODULE_DESCRIPTION to preemptirq_delay_test"
* tag 'trace-fixes-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
rv: Update rv_en(dis)able_monitor doc to match kernel-doc
tracing: Add MODULE_DESCRIPTION() to preemptirq_delay_test
ring-buffer: Fix a race between readers and resize checks
ring-buffer: Correct stale comments related to non-consuming readers
Linus Torvalds [Thu, 23 May 2024 19:32:15 +0000 (12:32 -0700)]
Merge tag 'trace-tools-v6.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing tool fix from Steven Rostedt:
"Fix printf format warnings in latency-collector.
Use the printf format string with %s to take a string instead of
taking in a string directly"
* tag 'trace-tools-v6.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tools/latency-collector: Fix -Wformat-security compile warns
Linus Torvalds [Thu, 23 May 2024 19:28:01 +0000 (12:28 -0700)]
Merge tag 'trace-assign-str-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing cleanup from Steven Rostedt:
"Remove second argument of __assign_str()
The __assign_str() macro logic of the TRACE_EVENT() macro was
optimized so that it no longer needs the second argument. The
__assign_str() is always matched with __string() field that takes a
field name and the source for that field:
__string(field, source)
The TRACE_EVENT() macro logic will save off the source value and then
use that value to copy into the ring buffer via the __assign_str().
Before commit c1fa617caeb0 ("tracing: Rework __assign_str() and
__string() to not duplicate getting the string"), the __assign_str()
needed the second argument which would perform the same logic as the
__string() source parameter did. Not only would this add overhead, but
it was error prone as if the __assign_str() source produced something
different, it may not have allocated enough for the string in the ring
buffer (as the __string() source was used to determine how much to
allocate)
Now that the __assign_str() just uses the same string that was used in
__string() it no longer needs the source parameter. It can now be
removed"
* tag 'trace-assign-str-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing/treewide: Remove second parameter of __assign_str()
Linus Torvalds [Thu, 23 May 2024 19:09:22 +0000 (12:09 -0700)]
Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Will Deacon:
"The major fix here is for a filesystem corruption issue reported on
Apple M1 as a result of buggy management of the floating point
register state introduced in 6.8. I initially reverted one of the
offending patches, but in the end Ard cooked a proper fix so there's a
revert+reapply in the series.
Aside from that, we've got some CPU errata workarounds and misc other
fixes.
- Fix broken FP register state tracking which resulted in filesystem
corruption when dm-crypt is used
- Workarounds for Arm CPU errata affecting the SSBS Spectre
mitigation
- Fix lockdep assertion in DMC620 memory controller PMU driver
- Fix alignment of BUG table when CONFIG_DEBUG_BUGVERBOSE is
disabled"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64/fpsimd: Avoid erroneous elide of user state reload
Reapply "arm64: fpsimd: Implement lazy restore for kernel mode FPSIMD"
arm64: asm-bug: Add .align 2 to the end of __BUG_ENTRY
perf/arm-dmc620: Fix lockdep assert in ->event_init()
Revert "arm64: fpsimd: Implement lazy restore for kernel mode FPSIMD"
arm64: errata: Add workaround for Arm errata 3194386 and 3312417
arm64: cputype: Add Neoverse-V3 definitions
arm64: cputype: Add Cortex-X4 definitions
arm64: barrier: Restore spec_bar() macro
Fix the following -Wformat-security compile warnings adding missing
format arguments:
latency-collector.c: In function ‘show_available’:
latency-collector.c:938:17: warning: format not a string literal and
no format arguments [-Wformat-security]
938 | warnx(no_tracer_msg);
| ^~~~~
latency-collector.c:943:17: warning: format not a string literal and
no format arguments [-Wformat-security]
943 | warnx(no_latency_tr_msg);
| ^~~~~
latency-collector.c: In function ‘find_default_tracer’:
latency-collector.c:986:25: warning: format not a string literal and
no format arguments [-Wformat-security]
986 | errx(EXIT_FAILURE, no_tracer_msg);
|
^~~~
latency-collector.c: In function ‘scan_arguments’:
latency-collector.c:1881:33: warning: format not a string literal and
no format arguments [-Wformat-security]
1881 | errx(EXIT_FAILURE, no_tracer_msg);
| ^~~~
Ken Milmore [Tue, 21 May 2024 22:45:50 +0000 (23:45 +0100)]
r8169: Fix possible ring buffer corruption on fragmented Tx packets.
An issue was found on the RTL8125b when transmitting small fragmented
packets, whereby invalid entries were inserted into the transmit ring
buffer, subsequently leading to calls to dma_unmap_single() with a null
address.
This was caused by rtl8169_start_xmit() not noticing changes to nr_frags
which may occur when small packets are padded (to work around hardware
quirks) in rtl8169_tso_csum_v2().
To fix this, postpone inspecting nr_frags until after any padding has been
applied.
The ice and idpf drivers can trigger a crash with AF_XDP due to incorrect
interpretation of the asymmetric Tx and Rx parameters in their
.set_channels() implementations:
1. ethtool -l <IFNAME> -> combined: 40
2. Attach AF_XDP to queue 30
3. ethtool -L <IFNAME> rx 15 tx 15
combined number is not specified, so command becomes {rx_count = 15,
tx_count = 15, combined_count = 40}.
4. ethnl_set_channels checks, if there are any AF_XDP of queues from the
new (combined_count + rx_count) to the old one, so from 55 to 40, check
does not trigger.
5. the driver interprets `rx 15 tx 15` as 15 combined channels and deletes
the queue that AF_XDP is attached to.
This is fundamentally a problem with interpreting a request for asymmetric
queues as symmetric combined queues.
Fix the ice and idpf drivers to stop interpreting such requests as a
request for combined queues. Due to current driver design for both ice and
idpf, it is not possible to support requests of the same count of Tx and Rx
queues with independent interrupts, (i.e. ethtool -L <IFNAME> rx 15 tx 15)
so such requests are now rejected.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
====================
Larysa Zaremba [Tue, 21 May 2024 19:39:54 +0000 (12:39 -0700)]
idpf: Interpret .set_channels() input differently
Unlike ice, idpf does not check, if user has requested at least 1 combined
channel. Instead, it relies on a check in the core code. Unfortunately, the
check does not trigger for us because of the hacky .set_channels()
interpretation logic that is not consistent with the core code.
This naturally leads to user being able to trigger a crash with an invalid
input. This is how:
1. ethtool -l <IFNAME> -> combined: 40
2. ethtool -L <IFNAME> rx 0 tx 0
combined number is not specified, so command becomes {rx_count = 0,
tx_count = 0, combined_count = 40}.
3. ethnl_set_channels checks, if there is at least 1 RX and 1 TX channel,
comparing (combined_count + rx_count) and (combined_count + tx_count)
to zero. Obviously, (40 + 0) is greater than zero, so the core code
deems the input OK.
4. idpf interprets `rx 0 tx 0` as 0 channels and tries to proceed with such
configuration.
The issue has to be solved fundamentally, as current logic is also known to
cause AF_XDP problems in ice [0].
Interpret the command in a way that is more consistent with ethtool
manual [1] (--show-channels and --set-channels) and new ice logic.
Considering that in the idpf driver only the difference between RX and TX
queues forms dedicated channels, change the correct way to set number of
channels to:
Larysa Zaremba [Tue, 21 May 2024 19:39:53 +0000 (12:39 -0700)]
ice: Interpret .set_channels() input differently
A bug occurs because a safety check guarding AF_XDP-related queues in
ethnl_set_channels(), does not trigger. This happens, because kernel and
ice driver interpret the ethtool command differently.
How the bug occurs:
1. ethtool -l <IFNAME> -> combined: 40
2. Attach AF_XDP to queue 30
3. ethtool -L <IFNAME> rx 15 tx 15
combined number is not specified, so command becomes {rx_count = 15,
tx_count = 15, combined_count = 40}.
4. ethnl_set_channels checks, if there are any AF_XDP of queues from the
new (combined_count + rx_count) to the old one, so from 55 to 40, check
does not trigger.
5. ice interprets `rx 15 tx 15` as 15 combined channels and deletes the
queue that AF_XDP is attached to.
Interpret the command in a way that is more consistent with ethtool
manual [0] (--show-channels and --set-channels).
Considering that in the ice driver only the difference between RX and TX
queues forms dedicated channels, change the correct way to set number of
channels to:
Ryosuke Yasuoka [Tue, 21 May 2024 15:34:42 +0000 (00:34 +0900)]
nfc: nci: Fix handling of zero-length payload packets in nci_rx_work()
When nci_rx_work() receives a zero-length payload packet, it should not
discard the packet and exit the loop. Instead, it should continue
processing subsequent packets.
Fixes: d24b03535e5e ("nfc: nci: Fix uninit-value in nci_dev_up and nci_ntf_packet") Signed-off-by: Ryosuke Yasuoka <ryasuoka@redhat.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Link: https://lore.kernel.org/r/20240521153444.535399-1-ryasuoka@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The reproducer invokes shutdown() before entering the listener status.
After commit 94062790aedb ("tcp: defer shutdown(SEND_SHUTDOWN) for
TCP_SYN_RECV sockets"), the above causes the child to reach the accept
syscall in FIN_WAIT1 status.
Eric noted we can relax the existing assertion in __inet_accept()
Reported-by: Christoph Paasch <cpaasch@apple.com> Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/490 Suggested-by: Eric Dumazet <edumazet@google.com> Fixes: 94062790aedb ("tcp: defer shutdown(SEND_SHUTDOWN) for TCP_SYN_RECV sockets") Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/23ab880a44d8cfd967e84de8b93dbf48848e3d8c.1716299669.git.pabeni@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jason Xing [Tue, 21 May 2024 13:42:20 +0000 (21:42 +0800)]
tcp: remove 64 KByte limit for initial tp->rcv_wnd value
Recently, we had some servers upgraded to the latest kernel and noticed
the indicator from the user side showed worse results than before. It is
caused by the limitation of tp->rcv_wnd.
In 2018 commit a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin
to around 64KB") limited the initial value of tp->rcv_wnd to 65535, most
CDN teams would not benefit from this change because they cannot have a
large window to receive a big packet, which will be slowed down especially
in long RTT. Small rcv_wnd means slow transfer speed, to some extent. It's
the side effect for the latency/time-sensitive users.
To avoid future confusion, current change doesn't affect the initial
receive window on the wire in a SYN or SYN+ACK packet which are set within
65535 bytes according to RFC 7323 also due to the limit in
__tcp_transmit_skb():
th->window = htons(min(tp->rcv_wnd, 65535U));
In one word, __tcp_transmit_skb() already ensures that constraint is
respected, no matter how large tp->rcv_wnd is. The change doesn't violate
RFC.
Let me provide one example if with or without the patch:
Before:
client --- SYN: rwindow=65535 ---> server
client <--- SYN+ACK: rwindow=65535 ---- server
client --- ACK: rwindow=65536 ---> server
Note: for the last ACK, the calculation is 512 << 7.
After:
client --- SYN: rwindow=65535 ---> server
client <--- SYN+ACK: rwindow=65535 ---- server
client --- ACK: rwindow=175232 ---> server
Note: I use the following command to make it work:
ip route change default via [ip] dev eth0 metric 100 initrwnd 120
For the last ACK, the calculation is 1369 << 7.
When we apply such a patch, having a large rcv_wnd if the user tweak this
knob can help transfer data more rapidly and save some rtts.
Fixes: a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB") Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Link: https://lore.kernel.org/r/20240521134220.12510-1-kerneljasonxing@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Romain Gantois [Tue, 21 May 2024 12:44:11 +0000 (14:44 +0200)]
net: ti: icssg_prueth: Fix NULL pointer dereference in prueth_probe()
In the prueth_probe() function, if one of the calls to emac_phy_connect()
fails due to of_phy_connect() returning NULL, then the subsequent call to
phy_attached_info() will dereference a NULL pointer.
Check the return code of emac_phy_connect and fail cleanly if there is an
error.
Dae R. Jeong [Tue, 21 May 2024 10:34:38 +0000 (19:34 +0900)]
tls: fix missing memory barrier in tls_init
In tls_init(), a write memory barrier is missing, and store-store
reordering may cause NULL dereference in tls_{setsockopt,getsockopt}.
CPU0 CPU1
----- -----
// In tls_init()
// In tls_ctx_create()
ctx = kzalloc()
ctx->sk_proto = READ_ONCE(sk->sk_prot) -(1)
// In update_sk_prot()
WRITE_ONCE(sk->sk_prot, tls_prots) -(2)
// In sock_common_setsockopt()
READ_ONCE(sk->sk_prot)->setsockopt()
// In tls_{setsockopt,getsockopt}()
ctx->sk_proto->setsockopt() -(3)
In the above scenario, when (1) and (2) are reordered, (3) can observe
the NULL value of ctx->sk_proto, causing NULL dereference.
To fix it, we rely on rcu_assign_pointer() which implies the release
barrier semantic. By moving rcu_assign_pointer() after ctx->sk_proto is
initialized, we can ensure that ctx->sk_proto are visible when
changing sk->sk_prot.
Wei Fang [Tue, 21 May 2024 02:38:00 +0000 (10:38 +0800)]
net: fec: avoid lock evasion when reading pps_enable
The assignment of pps_enable is protected by tmreg_lock, but the read
operation of pps_enable is not. So the Coverity tool reports a lock
evasion warning which may cause data race to occur when running in a
multithread environment. Although this issue is almost impossible to
occur, we'd better fix it, at least it seems more logically reasonable,
and it also prevents Coverity from continuing to issue warnings.
According to the commit, it implements a manual AN-37 for some
"troublesome" Juniper MX5 switches. This appears to be a workaround for a
particular switch.
It has been reported that this causes a severe breakage for other switches,
including a Cisco 3560CX-12PD-S.
The code appears to be a workaround for a specific switch which fails to
link in SFI mode. It expects to see AN-37 auto negotiation in order to
link. The Cisco switch is not expecting AN-37 auto negotiation. When the
device starts the manual AN-37, the Cisco switch decides that the port is
confused and stops attempting to link with it. This persists until a power
cycle. A simple driver unload and reload does not resolve the issue, even
if loading with a version of the driver which lacks this workaround.
The authors of the workaround commit have not responded with
clarifications, and the result of the workaround is complete failure to
connect with other switches.
This appears to be a case where the driver can either "correctly" link with
the Juniper MX5 switch, at the cost of bricking the link with the Cisco
switch, or it can behave properly for the Cisco switch, but fail to link
with the Junipir MX5 switch. I do not know enough about the standards
involved to clearly determine whether either switch is at fault or behaving
incorrectly. Nor do I know whether there exists some alternative fix which
corrects behavior with both switches.
Joe Damato [Mon, 20 May 2024 23:58:43 +0000 (23:58 +0000)]
testing: net-drv: use stats64 for testing
Testing a network device that has large numbers of bytes/packets may
overflow. Using stats64 when comparing fixes this problem.
I tripped on this while iterating on a qstats patch for mlx5. See below
for confirmation without my added code that this is a bug.
Before this patch (with added debugging output):
$ NETIF=eth0 tools/testing/selftests/drivers/net/stats.py
KTAP version 1
1..4
ok 1 stats.check_pause
ok 2 stats.check_fec
rstat: 481708634 qstat: 666201639514 key: tx-bytes
not ok 3 stats.pkt_byte_sum
ok 4 stats.qstat_by_ifindex
Note the huge delta above ^^^ in the rtnl vs qstats.
After this patch:
$ NETIF=eth0 tools/testing/selftests/drivers/net/stats.py
KTAP version 1
1..4
ok 1 stats.check_pause
ok 2 stats.check_fec
ok 3 stats.pkt_byte_sum
ok 4 stats.qstat_by_ifindex
It looks like rtnl_fill_stats in net/core/rtnetlink.c will attempt to
copy the 64bit stats into a 32bit structure which is probably why this
behavior is occurring.
To show this is happening, you can get the underlying stats that the
stats.py test uses like this:
Note that the above was taken from a system with an mlx5 NIC, which only
exposes ndo_get_stats64.
Based on the ethtool output and qstat output, it appears that stats.py
should be updated to use the 'stats64' structure for accurate
comparisons when packet/byte counters get very large.
To confirm that this was not related to the qstats code I was iterating
on, I booted a kernel without my driver changes and re-ran the test
which shows the qstats are skipped (as they don't exist for mlx5):
NETIF=eth0 tools/testing/selftests/drivers/net/stats.py
KTAP version 1
1..4
ok 1 stats.check_pause
ok 2 stats.check_fec
ok 3 stats.pkt_byte_sum # SKIP qstats not supported by the device
ok 4 stats.qstat_by_ifindex # SKIP No ifindex supports qstats
Linus Torvalds [Thu, 23 May 2024 01:59:29 +0000 (18:59 -0700)]
Merge tag 'mm-nonmm-stable-2024-05-22-17-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull more non-mm updates from Andrew Morton:
- A series ("kbuild: enable more warnings by default") from Arnd
Bergmann which enables a number of additional build-time warnings. We
fixed all the fallout which we could find, there may still be a few
stragglers.
- Samuel Holland has developed the series "Unified cross-architecture
kernel-mode FPU API". This does a lot of consolidation of
per-architecture kernel-mode FPU usage and enables the use of newer
AMD GPUs on RISC-V.
- Tao Su has fixed some selftests build warnings in the series
"Selftests: Fix compilation warnings due to missing _GNU_SOURCE
definition".
- This pull also includes a nilfs2 fixup from Ryusuke Konishi.
* tag 'mm-nonmm-stable-2024-05-22-17-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (23 commits)
nilfs2: make block erasure safe in nilfs_finish_roll_forward()
selftests/harness: use 1024 in place of LINE_MAX
Revert "selftests/harness: remove use of LINE_MAX"
selftests/fpu: allow building on other architectures
selftests/fpu: move FP code to a separate translation unit
drm/amd/display: use ARCH_HAS_KERNEL_FPU_SUPPORT
drm/amd/display: only use hard-float, not altivec on powerpc
riscv: add support for kernel-mode FPU
x86: implement ARCH_HAS_KERNEL_FPU_SUPPORT
powerpc: implement ARCH_HAS_KERNEL_FPU_SUPPORT
LoongArch: implement ARCH_HAS_KERNEL_FPU_SUPPORT
lib/raid6: use CC_FLAGS_FPU for NEON CFLAGS
arm64: crypto: use CC_FLAGS_FPU for NEON CFLAGS
arm64: implement ARCH_HAS_KERNEL_FPU_SUPPORT
ARM: crypto: use CC_FLAGS_FPU for NEON CFLAGS
ARM: implement ARCH_HAS_KERNEL_FPU_SUPPORT
arch: add ARCH_HAS_KERNEL_FPU_SUPPORT
x86/fpu: fix asm/fpu/types.h include guard
kbuild: enable -Wcast-function-type-strict unconditionally
kbuild: enable -Wformat-truncation on clang
...
Linus Torvalds [Thu, 23 May 2024 00:32:04 +0000 (17:32 -0700)]
Merge tag 'mm-stable-2024-05-22-17-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull more mm updates from Andrew Morton:
"A series from Dave Chinner which cleans up and fixes the handling of
nested allocations within stackdepot and page-owner"
* tag 'mm-stable-2024-05-22-17-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm/page-owner: use gfp_nested_mask() instead of open coded masking
stackdepot: use gfp_nested_mask() instead of open coded masking
mm: lift gfp_kmemleak_mask() to gfp.h