Maciej Fijalkowski [Tue, 31 Jan 2023 20:44:57 +0000 (21:44 +0100)]
ice: Pull out next_to_clean bump out of ice_put_rx_buf()
Plan is to move ice_put_rx_buf() to the end of ice_clean_rx_irq() so
in order to keep the ability of walking through HW Rx descriptors, pull
out next_to_clean handling out of ice_put_rx_buf().
Maciej Fijalkowski [Tue, 31 Jan 2023 20:44:56 +0000 (21:44 +0100)]
ice: Store page count inside ice_rx_buf
This will allow us to avoid carrying additional auxiliary array of page
counts when dealing with XDP multi buffer support. Previously combining
fragmented frame to skb was not affected in the same way as XDP would be
as whole frame is needed to be in place before executing XDP prog.
Therefore, when going through HW Rx descriptors one-by-one, calls to
ice_put_rx_buf() need to be taken *after* running XDP prog on a
potentially multi buffered frame, so some additional storage of
page count is needed.
By adding page count to rx buf, it will make it easier to walk through
processed entries at the end of rx cleaning routine and decide whether
or not buffers should be recycled.
While at it, bump ice_rx_buf::pagecnt_bias from u16 up to u32. It was
proven many times that calculations on variables smaller than standard
register size are harmful. This was also the case during experiments
with embedding page count to ice_rx_buf - when this was added as u16 it
had a performance impact.
Maciej Fijalkowski [Tue, 31 Jan 2023 20:44:55 +0000 (21:44 +0100)]
ice: Add xdp_buff to ice_rx_ring struct
In preparation for XDP multi-buffer support, let's store xdp_buff on
Rx ring struct. This will allow us to combine fragmented frames across
separate NAPI cycles in the same way as currently skb fragments are
handled. This means that skb pointer on Rx ring will become redundant
and will be removed. For now it is kept and layout of Rx ring struct was
not inspected, some member movement will be needed later on so that will
be the time to take care of it.
Maciej Fijalkowski [Tue, 31 Jan 2023 20:44:54 +0000 (21:44 +0100)]
ice: Prepare legacy-rx for upcoming XDP multi-buffer support
Rx path is going to be modified in a way that fragmented frame will be
gathered within xdp_buff in the first place. This approach implies that
underlying buffer has to provide tailroom for skb_shared_info. This is
currently the case when ring uses build_skb but not when legacy-rx knob
is turned on. This case configures 2k Rx buffers and has no way to
provide either headroom or tailroom - FWIW it currently has
XDP_PACKET_HEADROOM which is broken and in here it is removed. 2k Rx
buffers were used so driver in this setting was able to support 9k MTU
as it can chain up to 5 Rx buffers. With offset configuring HW writing
2k of a data was passing the half of the page which broke the assumption
of our internal page recycling tricks.
Now if above got fixed and legacy-rx path would be left as is, when
referring to skb_shared_info via xdp_get_shared_info_from_buff(),
packet's content would be corrupted again. Hence size of Rx buffer needs
to be lowered and therefore supported MTU. This operation will allow us
to keep the unified data path and with 8k MTU users (if any of
legacy-rx) would still be good to go. However, tendency is to drop the
support for this code path at some point.
Add ICE_RXBUF_1664 as vsi::rx_buf_len and ICE_MAX_FRAME_LEGACY_RX (8320)
as vsi::max_frame for legacy-rx. For bigger page sizes configure 3k Rx
buffers, not 2k.
Since headroom support is removed, disable data_meta support on legacy-rx.
When preparing XDP buff, rely on ice_rx_ring::rx_offset setting when
deciding whether to support data_meta or not.
Alexei Starovoitov [Mon, 30 Jan 2023 03:16:29 +0000 (19:16 -0800)]
Merge branch 'Support bpf trampoline for s390x'
Ilya Leoshkevich says:
====================
v2: https://lore.kernel.org/bpf/20230128000650.1516334-1-iii@linux.ibm.com/#t
v2 -> v3:
- Make __arch_prepare_bpf_trampoline static.
(Reported-by: kernel test robot <lkp@intel.com>)
- Support both old- and new- style map definitions in sk_assign. (Alexei)
- Trim DENYLIST.s390x. (Alexei)
- Adjust s390x vmlinux path in vmtest.sh.
- Drop merged fixes.
v1: https://lore.kernel.org/bpf/20230125213817.1424447-1-iii@linux.ibm.com/#t
v1 -> v2:
- Fix core_read_macros, sk_assign, test_profiler, test_bpffs (24/31;
I'm not quite happy with the fix, but don't have better ideas),
and xdp_synproxy. (Andrii)
- Prettify liburandom_read and verify_pkcs7_sig fixes. (Andrii)
- Fix bpf_usdt_arg using barrier_var(); prettify barrier_var(). (Andrii)
- Change BPF_MAX_TRAMP_LINKS to enum and query it using BTF. (Andrii)
- Improve bpf_jit_supports_kfunc_call() description. (Alexei)
- Always check sign_extend() return value.
- Cc: Alexander Gordeev.
Hi,
This series implements poke, trampoline, kfunc, and mixing subprogs
and tailcalls on s390x.
The following failures still remain:
#82 get_stack_raw_tp:FAIL
get_stack_print_output:FAIL:user_stack corrupted user stack
Known issue:
We cannot reliably unwind userspace on s390x without DWARF.
#101 ksyms_module:FAIL
address of kernel function bpf_testmod_test_mod_kfunc is out of range
Known issue:
Kernel and modules are too far away from each other on s390x.
#190 stacktrace_build_id:FAIL
Known issue:
We cannot reliably unwind userspace on s390x without DWARF.
Ilya Leoshkevich [Sun, 29 Jan 2023 19:05:01 +0000 (20:05 +0100)]
selftests/bpf: Trim DENYLIST.s390x
Now that trampoline is implemented, enable a number of tests on s390x.
18 of the remaining failures have to do with either lack of rethook
(fixed by [1]) or syscall symbols missing from BTF (fixed by [2]).
Do not re-classify the remaining failures for now; wait until the
s390/for-next fixes are merged and re-classify only the remaining few.
Ilya Leoshkevich [Sun, 29 Jan 2023 19:04:59 +0000 (20:04 +0100)]
s390/bpf: Implement bpf_jit_supports_kfunc_call()
Implement calling kernel functions from eBPF. In general, the eBPF ABI
is fairly close to that of s390x, with one important difference: on
s390x callers should sign-extend signed arguments. Handle that by using
information returned by bpf_jit_find_kfunc_model().
Here is an example of how sign extensions works. Suppose we need to
call the following function from BPF:
As per the s390x ABI, bpf_kfunc_call_test4() has the right to assume
that a, b and c are sign-extended by the caller, which results in using
64-bit additions (agr) without any additional conversions. Without sign
extension we would have the following on the JITed code side:
This first 3 llilfs are 32-bit loads, that need to be sign-extended
to 64 bits.
Note: at the moment bpf_jit_find_kfunc_model() does not seem to play
nicely with XDP metadata functions: add_kfunc_call() adds an "abstract"
bpf_*() version to kfunc_btf_tab, but then fixup_kfunc_call() puts the
concrete version into insn->imm, which bpf_jit_find_kfunc_model() cannot
find. But this seems to be a common code problem.
Ilya Leoshkevich [Sun, 29 Jan 2023 19:04:57 +0000 (20:04 +0100)]
s390/bpf: Implement arch_prepare_bpf_trampoline()
arch_prepare_bpf_trampoline() is used for direct attachment of eBPF
programs to various places, bypassing kprobes. It's responsible for
calling a number of eBPF programs before, instead and/or after
whatever they are attached to.
Add a s390x implementation, paying attention to the following:
- Reuse the existing JIT infrastructure, where possible.
- Like the existing JIT, prefer making multiple passes instead of
backpatching. Currently 2 passes is enough. If literal pool is
introduced, this needs to be raised to 3. However, at the moment
adding literal pool only makes the code larger. If branch
shortening is introduced, the number of passes needs to be
increased even further.
- Support both regular and ftrace calling conventions, depending on
the trampoline flags.
- Use expolines for indirect calls.
- Handle the mismatch between the eBPF and the s390x ABIs.
- Sign-extend fmod_ret return values.
invoke_bpf_prog() produces about 120 bytes; it might be possible to
slightly optimize this, but reaching 50 bytes, like on x86_64, looks
unrealistic: just loading cookie, __bpf_prog_enter, bpf_func, insnsi
and __bpf_prog_exit as literals already takes at least 5 * 12 = 60
bytes, and we can't use relative addressing for most of them.
Therefore, lower BPF_MAX_TRAMP_LINKS on s390x.
Ilya Leoshkevich [Sun, 29 Jan 2023 19:04:56 +0000 (20:04 +0100)]
s390/bpf: Implement bpf_arch_text_poke()
bpf_arch_text_poke() is used to hotpatch eBPF programs and trampolines.
s390x has a very strict hotpatching restriction: the only thing that is
allowed to be hotpatched is conditional branch mask.
Take the same approach as commit de5012b41e5c ("s390/ftrace: implement
hotpatching"): create a conditional jump to a "plt", which loads the
target address from memory and jumps to it; then first patch this
address, and then the mask.
Trampolines (introduced in the next patch) respect the ftrace calling
convention: the return address is in %r0, and %r1 is clobbered. With
that in mind, bpf_arch_text_poke() does not differentiate between jumps
and calls.
However, there is a simple optimization for jumps (for the epilogue_ip
case): if a jump already points to the destination, then there is no
"plt" and we can just flip the mask.
For simplicity, the "plt" template is defined in assembly, and its size
is used to define C arrays. There doesn't seem to be a way to convey
this size to C as a constant, so it's hardcoded and double-checked
during runtime.
Ilya Leoshkevich [Sun, 29 Jan 2023 19:04:54 +0000 (20:04 +0100)]
selftests/bpf: Fix sk_assign on s390x
sk_assign is failing on an s390x machine running Debian "bookworm" for
2 reasons: legacy server_map definition and uninitialized addrlen in
recvfrom() call.
Fix by adding a new-style server_map definition and dropping addrlen
(recvfrom() allows NULL values for src_addr and addrlen).
Since the test should support tc built without libbpf, build the prog
twice: with the old-style definition and with the new-style definition,
then select the right one at runtime. This could be done at compile
time too, but this would not be cross-compilation friendly.
Ilya Leoshkevich [Sat, 28 Jan 2023 00:06:44 +0000 (01:06 +0100)]
bpf: btf: Add BTF_FMODEL_SIGNED_ARG flag
s390x eBPF JIT needs to know whether a function return value is signed
and which function arguments are signed, in order to generate code
compliant with the s390x ABI.
Ilya Leoshkevich [Sat, 28 Jan 2023 00:06:43 +0000 (01:06 +0100)]
bpf: iterators: Split iterators.lskel.h into little- and big- endian versions
iterators.lskel.h is little-endian, therefore bpf iterator is currently
broken on big-endian systems. Introduce a big-endian version and add
instructions regarding its generation. Unfortunately bpftool's
cross-endianness capabilities are limited to BTF right now, so the
procedure requires access to a big-endian machine or a configured
emulator.
Ilya Leoshkevich [Sat, 28 Jan 2023 00:06:42 +0000 (01:06 +0100)]
libbpf: Fix BPF_PROBE_READ{_STR}_INTO() on s390x
BPF_PROBE_READ_INTO() and BPF_PROBE_READ_STR_INTO() should map to
bpf_probe_read() and bpf_probe_read_str() respectively in order to work
correctly on architectures with !ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE.
The reason is that, even though the C code enforces that
arg_num < BPF_USDT_MAX_ARG_CNT, the verifier cannot propagate this
constraint to the arg_spec assignment yet. Help it by forcing r1 back
to stack after comparison.
Ilya Leoshkevich [Sat, 28 Jan 2023 00:06:38 +0000 (01:06 +0100)]
selftests/bpf: Fix xdp_synproxy/tc on s390x
Use the correct datatype for the values map values; currently the test
works by accident, since on little-endian machines it is sometimes
acceptable to access u64 as u32.
Ilya Leoshkevich [Sat, 28 Jan 2023 00:06:33 +0000 (01:06 +0100)]
selftests/bpf: Add a sign-extension test for kfuncs
s390x ABI requires the caller to zero- or sign-extend the arguments.
eBPF already deals with zero-extension (by definition of its ABI), but
not with sign-extension.
Add a test to cover that potentially problematic area.
Ilya Leoshkevich [Sat, 28 Jan 2023 00:06:30 +0000 (01:06 +0100)]
selftests/bpf: Fix cgrp_local_storage on s390x
Sync the definition of socket_cookie between the eBPF program and the
test. Currently the test works by accident, since on little-endian it
is sometimes acceptable to access u64 as u32.
Ilya Leoshkevich [Sat, 28 Jan 2023 00:06:29 +0000 (01:06 +0100)]
selftests/bpf: Fix xdp_do_redirect on s390x
s390x cache line size is 256 bytes, so skb_shared_info must be aligned
on a much larger boundary than for x86. This makes the maximum packet
size smaller.
Ilya Leoshkevich [Sat, 28 Jan 2023 00:06:28 +0000 (01:06 +0100)]
selftests/bpf: Fix verify_pkcs7_sig on s390x
Use bpf_probe_read_kernel() instead of bpf_probe_read(), which is not
defined on all architectures.
While at it, improve the error handling: do not hide the verifier log,
and check the return values of bpf_probe_read_kernel() and
bpf_copy_from_user().
Ilya Leoshkevich [Sat, 28 Jan 2023 00:06:20 +0000 (01:06 +0100)]
bpf: Use ARG_CONST_SIZE_OR_ZERO for 3rd argument of bpf_tcp_raw_gen_syncookie_ipv{4,6}()
These functions already check that th_len < sizeof(*th), and
propagating the lower bound (th_len > 0) may be challenging
in complex code, e.g. as is the case with xdp_synproxy test on
s390x [1]. Switch to ARG_CONST_SIZE_OR_ZERO in order to make the
verifier accept code where it cannot prove that th_len > 0.
David Vernet [Sat, 28 Jan 2023 14:15:37 +0000 (08:15 -0600)]
bpf: Build-time assert that cpumask offset is zero
The first element of a struct bpf_cpumask is a cpumask_t. This is done
to allow struct bpf_cpumask to be cast to a struct cpumask. If this
element were ever moved to another field, any BPF program passing a
struct bpf_cpumask * to a kfunc expecting a const struct cpumask * would
immediately fail to load. Add a build-time assertion so this is
assumption is captured and verified.
Johannes Berg [Fri, 27 Jan 2023 07:45:06 +0000 (08:45 +0100)]
net: netlink: recommend policy range validation
For large ranges (outside of s16) the documentation currently
recommends open-coding the validation, but it's better to use
the NLA_POLICY_FULL_RANGE() or NLA_POLICY_FULL_RANGE_SIGNED()
policy validation instead; recommend that.
Jakub Kicinski [Sat, 28 Jan 2023 07:59:45 +0000 (23:59 -0800)]
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
bpf-next 2023-01-28
We've added 124 non-merge commits during the last 22 day(s) which contain
a total of 124 files changed, 6386 insertions(+), 1827 deletions(-).
The main changes are:
1) Implement XDP hints via kfuncs with initial support for RX hash and
timestamp metadata kfuncs, from Stanislav Fomichev and
Toke Høiland-Jørgensen.
Measurements on overhead: https://lore.kernel.org/bpf/875yellcx6.fsf@toke.dk
2) Extend libbpf's bpf_tracing.h support for tracing arguments of
kprobes/uprobes and syscall as a special case, from Andrii Nakryiko.
3) Significantly reduce the search time for module symbols by livepatch
and BPF, from Jiri Olsa and Zhen Lei.
4) Enable cpumasks to be used as kptrs, which is useful for tracing
programs tracking which tasks end up running on which CPUs
in different time intervals, from David Vernet.
5) Fix several issues in the dynptr processing such as stack slot liveness
propagation, missing checks for PTR_TO_STACK variable offset, etc,
from Kumar Kartikeya Dwivedi.
6) Various performance improvements, fixes, and introduction of more
than just one XDP program to XSK selftests, from Magnus Karlsson.
7) Big batch to BPF samples to reduce deprecated functionality,
from Daniel T. Lee.
8) Enable struct_ops programs to be sleepable in verifier,
from David Vernet.
9) Reduce pr_warn() noise on BTF mismatches when they are expected under
the CONFIG_MODULE_ALLOW_BTF_MISMATCH config anyway, from Connor O'Brien.
10) Describe modulo and division by zero behavior of the BPF runtime
in BPF's instruction specification document, from Dave Thaler.
11) Several improvements to libbpf API documentation in libbpf.h,
from Grant Seltzer.
12) Improve resolve_btfids header dependencies related to subcmd and add
proper support for HOSTCC, from Ian Rogers.
13) Add ipip6 and ip6ip decapsulation support for bpf_skb_adjust_room()
helper along with BPF selftests, from Ziyang Xuan.
14) Simplify the parsing logic of structure parameters for BPF trampoline
in the x86-64 JIT compiler, from Pu Lehui.
15) Get BTF working for kernels with CONFIG_RUST enabled by excluding
Rust compilation units with pahole, from Martin Rodriguez Reboredo.
16) Get bpf_setsockopt() working for kTLS on top of TCP sockets,
from Kui-Feng Lee.
17) Disable stack protection for BPF objects in bpftool given BPF backends
don't support it, from Holger Hoffstätte.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (124 commits)
selftest/bpf: Make crashes more debuggable in test_progs
libbpf: Add documentation to map pinning API functions
libbpf: Fix malformed documentation formatting
selftests/bpf: Properly enable hwtstamp in xdp_hw_metadata
selftests/bpf: Calls bpf_setsockopt() on a ktls enabled socket.
bpf: Check the protocol of a sock to agree the calls to bpf_setsockopt().
bpf/selftests: Verify struct_ops prog sleepable behavior
bpf: Pass const struct bpf_prog * to .check_member
libbpf: Support sleepable struct_ops.s section
bpf: Allow BPF_PROG_TYPE_STRUCT_OPS programs to be sleepable
selftests/bpf: Fix vmtest static compilation error
tools/resolve_btfids: Alter how HOSTCC is forced
tools/resolve_btfids: Install subcmd headers
bpf/docs: Document the nocast aliasing behavior of ___init
bpf/docs: Document how nested trusted fields may be defined
bpf/docs: Document cpumask kfuncs in a new file
selftests/bpf: Add selftest suite for cpumask kfuncs
selftests/bpf: Add nested trust selftests suite
bpf: Enable cpumasks to be queried and used as kptrs
bpf: Disallow NULLable pointers for trusted kfuncs
...
====================
Breno Leitao [Wed, 25 Jan 2023 18:52:30 +0000 (10:52 -0800)]
netpoll: Remove 4s sleep during carrier detection
This patch removes the msleep(4s) during netpoll_setup() if the carrier
appears instantly.
Here are some scenarios where this workaround is counter-productive in
modern ages:
Servers which have BMC communicating over NC-SI via the same NIC as gets
used for netconsole. BMC will keep the PHY up, hence the carrier
appearing instantly.
The link is fibre, SERDES getting sync could happen within 0.1Hz, and
the carrier also appears instantly.
Other than that, if a driver is reporting instant carrier and then
losing it, this is probably a driver bug.
drivers/net/ethernet/intel/ice/ice_main.c 418e53401e47 ("ice: move devlink port creation/deletion") 643ef23bd9dd ("ice: Introduce local var for readability")
https://lore.kernel.org/all/20230127124025.0dacef40@canb.auug.org.au/
https://lore.kernel.org/all/20230124005714.3996270-1-anthony.l.nguyen@intel.com/
drivers/net/ethernet/engleder/tsnep_main.c 3d53aaef4332 ("tsnep: Fix TX queue stop/wake for multiple queues") 25faa6a4c5ca ("tsnep: Replace TX spin_lock with __netif_tx_lock")
https://lore.kernel.org/all/20230127123604.36bb3e99@canb.auug.org.au/
net/netfilter/nf_conntrack_proto_sctp.c 13bd9b31a969 ("Revert "netfilter: conntrack: add sctp DATA_SENT state"") a44b7651489f ("netfilter: conntrack: unify established states for SCTP paths") f71cb8f45d09 ("netfilter: conntrack: sctp: use nf log infrastructure for invalid packets")
https://lore.kernel.org/all/20230127125052.674281f9@canb.auug.org.au/
https://lore.kernel.org/all/d36076f3-6add-a442-6d4b-ead9f7ffff86@tessares.net/
====================
net: xdp: execute xdp_do_flush() before napi_complete_done()
Make sure that xdp_do_flush() is always executed before
napi_complete_done(). This is important for two reasons. First, a
redirect to an XSKMAP assumes that a call to xdp_do_redirect() from
napi context X on CPU Y will be followed by a xdp_do_flush() from the
same napi context and CPU. This is not guaranteed if the
napi_complete_done() is executed before xdp_do_flush(), as it tells
the napi logic that it is fine to schedule napi context X on another
CPU. Details from a production system triggering this bug using the
veth driver can be found in [1].
The second reason is that the XDP_REDIRECT logic in itself relies on
being inside a single NAPI instance through to the xdp_do_flush() call
for RCU protection of all in-kernel data structures. Details can be
found in [2].
The drivers have only been compile-tested since I do not own any of
the HW below. So if you are a maintainer, it would be great if you
could take a quick look to make sure I did not mess something up.
Note that these were the drivers I found that violated the ordering by
running a simple script and manually checking the ones that came up as
potential offenders. But the script was not perfect in any way. There
might still be offenders out there, since the script can generate
false negatives.
Magnus Karlsson [Wed, 25 Jan 2023 07:49:01 +0000 (08:49 +0100)]
dpaa2-eth: execute xdp_do_flush() before napi_complete_done()
Make sure that xdp_do_flush() is always executed before
napi_complete_done(). This is important for two reasons. First, a
redirect to an XSKMAP assumes that a call to xdp_do_redirect() from
napi context X on CPU Y will be followed by a xdp_do_flush() from the
same napi context and CPU. This is not guaranteed if the
napi_complete_done() is executed before xdp_do_flush(), as it tells
the napi logic that it is fine to schedule napi context X on another
CPU. Details from a production system triggering this bug using the
veth driver can be found following the first link below.
The second reason is that the XDP_REDIRECT logic in itself relies on
being inside a single NAPI instance through to the xdp_do_flush() call
for RCU protection of all in-kernel data structures. Details can be
found in the second link below.
Magnus Karlsson [Wed, 25 Jan 2023 07:49:00 +0000 (08:49 +0100)]
dpaa_eth: execute xdp_do_flush() before napi_complete_done()
Make sure that xdp_do_flush() is always executed before
napi_complete_done(). This is important for two reasons. First, a
redirect to an XSKMAP assumes that a call to xdp_do_redirect() from
napi context X on CPU Y will be followed by a xdp_do_flush() from the
same napi context and CPU. This is not guaranteed if the
napi_complete_done() is executed before xdp_do_flush(), as it tells
the napi logic that it is fine to schedule napi context X on another
CPU. Details from a production system triggering this bug using the
veth driver can be found following the first link below.
The second reason is that the XDP_REDIRECT logic in itself relies on
being inside a single NAPI instance through to the xdp_do_flush() call
for RCU protection of all in-kernel data structures. Details can be
found in the second link below.
Magnus Karlsson [Wed, 25 Jan 2023 07:48:59 +0000 (08:48 +0100)]
virtio-net: execute xdp_do_flush() before napi_complete_done()
Make sure that xdp_do_flush() is always executed before
napi_complete_done(). This is important for two reasons. First, a
redirect to an XSKMAP assumes that a call to xdp_do_redirect() from
napi context X on CPU Y will be followed by a xdp_do_flush() from the
same napi context and CPU. This is not guaranteed if the
napi_complete_done() is executed before xdp_do_flush(), as it tells
the napi logic that it is fine to schedule napi context X on another
CPU. Details from a production system triggering this bug using the
veth driver can be found following the first link below.
The second reason is that the XDP_REDIRECT logic in itself relies on
being inside a single NAPI instance through to the xdp_do_flush() call
for RCU protection of all in-kernel data structures. Details can be
found in the second link below.
Magnus Karlsson [Wed, 25 Jan 2023 07:48:58 +0000 (08:48 +0100)]
lan966x: execute xdp_do_flush() before napi_complete_done()
Make sure that xdp_do_flush() is always executed before
napi_complete_done(). This is important for two reasons. First, a
redirect to an XSKMAP assumes that a call to xdp_do_redirect() from
napi context X on CPU Y will be followed by a xdp_do_flush() from the
same napi context and CPU. This is not guaranteed if the
napi_complete_done() is executed before xdp_do_flush(), as it tells
the napi logic that it is fine to schedule napi context X on another
CPU. Details from a production system triggering this bug using the
veth driver can be found following the first link below.
The second reason is that the XDP_REDIRECT logic in itself relies on
being inside a single NAPI instance through to the xdp_do_flush() call
for RCU protection of all in-kernel data structures. Details can be
found in the second link below.
Magnus Karlsson [Wed, 25 Jan 2023 07:48:57 +0000 (08:48 +0100)]
qede: execute xdp_do_flush() before napi_complete_done()
Make sure that xdp_do_flush() is always executed before
napi_complete_done(). This is important for two reasons. First, a
redirect to an XSKMAP assumes that a call to xdp_do_redirect() from
napi context X on CPU Y will be followed by a xdp_do_flush() from the
same napi context and CPU. This is not guaranteed if the
napi_complete_done() is executed before xdp_do_flush(), as it tells
the napi logic that it is fine to schedule napi context X on another
CPU. Details from a production system triggering this bug using the
veth driver can be found following the first link below.
The second reason is that the XDP_REDIRECT logic in itself relies on
being inside a single NAPI instance through to the xdp_do_flush() call
for RCU protection of all in-kernel data structures. Details can be
found in the second link below.
Grant Seltzer [Thu, 26 Jan 2023 02:47:49 +0000 (21:47 -0500)]
libbpf: Fix malformed documentation formatting
This fixes the doxygen format documentation above the
user_ring_buffer__* APIs. There has to be a newline
before the @brief, otherwise doxygen won't render them
for libbpf.readthedocs.org.
This patchset takes care of small cleanup of devlink params usage.
Some of the patches (first 2/3) are cosmetic, but I would like to
point couple of interesting ones:
Patch 9 is the main one of this set and introduces devlink instance
locking for params, similar to other devlink objects. That allows params
to be registered/unregistered when devlink instance is registered.
Patches 10-12 change mlx5 code to register non-driverinit params in the
code they are related to, and thanks to patch 8 this might be when
devlink instance is registered - for example during devlink reload.
---
v1->v2:
- Just small fix in the last patch
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Thu, 26 Jan 2023 07:58:38 +0000 (08:58 +0100)]
net/mlx5: Move eswitch port metadata devlink param to flow eswitch code
Move the param registration and handling code into the eswitch offloads
code as they are related to each other. No point in having the
devlink param registration done in separate file.
Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Thu, 26 Jan 2023 07:58:37 +0000 (08:58 +0100)]
net/mlx5: Move flow steering devlink param to flow steering code
Move the param registration and handling code into the flow steering
code as they are related to each other. No point in having the
devlink param registration done in separate file.
Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Thu, 26 Jan 2023 07:58:36 +0000 (08:58 +0100)]
net/mlx5: Move fw reset devlink param to fw reset code
Move the param registration and handling code into the fw reset code
as they are related to each other. No point in having the devlink param
registration done in separate file.
Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Thu, 26 Jan 2023 07:58:35 +0000 (08:58 +0100)]
devlink: protect devlink param list by instance lock
Commit 1d18bb1a4ddd ("devlink: allow registering parameters after
the instance") as the subject implies introduced possibility to register
devlink params even for already registered devlink instance. This is a
bit problematic, as the consistency or params list was originally
secured by the fact it is static during devlink lifetime. So in order to
protect the params list, take devlink instance lock during the params
operations. Introduce unlocked function variants and use them in drivers
in locked context. Put lock assertions to appropriate places.
Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Tested-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Thu, 26 Jan 2023 07:58:34 +0000 (08:58 +0100)]
devlink: put couple of WARN_ONs in devlink_param_driverinit_value_get()
Put couple of WARN_ONs in devlink_param_driverinit_value_get() function
to clearly indicate, that it is a driver bug if used without reload
support or for non-driverinit param.
Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Thu, 26 Jan 2023 07:58:33 +0000 (08:58 +0100)]
devlink: make devlink_param_driverinit_value_set() return void
devlink_param_driverinit_value_set() currently returns int with possible
error, but no user is checking it anyway. The only reason for a fail is
a driver bug. So convert the function to return void and put WARN_ONs
on error paths.
Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Thu, 26 Jan 2023 07:58:32 +0000 (08:58 +0100)]
qed: remove pointless call to devlink_param_driverinit_value_set()
devlink_param_driverinit_value_set() call makes sense only for "
driverinit" params. However here, the param is "runtime".
devlink_param_driverinit_value_set() returns -EOPNOTSUPP in such case
and does not do anything. So remove the pointless call to
devlink_param_driverinit_value_set() entirely.
Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Thu, 26 Jan 2023 07:58:31 +0000 (08:58 +0100)]
ice: remove pointless calls to devlink_param_driverinit_value_set()
devlink_param_driverinit_value_set() call makes sense only for
"driverinit" params. However here, both params are "runtime".
devlink_param_driverinit_value_set() returns -EOPNOTSUPP in such case
and does not do anything. So remove the pointless calls to
devlink_param_driverinit_value_set() entirely.
Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Thu, 26 Jan 2023 07:58:30 +0000 (08:58 +0100)]
devlink: don't work with possible NULL pointer in devlink_param_unregister()
There is a WARN_ON checking the param_item for being NULL when the param
is not inserted in the list. That indicates a driver BUG. Instead of
continuing to work with NULL pointer with its consequences, return.
Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Thu, 26 Jan 2023 07:58:29 +0000 (08:58 +0100)]
devlink: make devlink_param_register/unregister static
There is no user outside the devlink code, so remove the export and make
the functions static. Move them before callers to avoid forward
declarations.
Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Thu, 26 Jan 2023 07:58:28 +0000 (08:58 +0100)]
net/mlx5: Covert devlink params registration to use devlink_params_register/unregister()
Since mlx5 is the only user of devlink API to register/unregister a
single param, convert it to use array registration function allowing to
simplify the devlink API by removing the single param registration
functions.
Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Thu, 26 Jan 2023 07:58:27 +0000 (08:58 +0100)]
net/mlx5: Change devlink param register/unregister function names
The functions are registering and unregistering devlink params, so
change the names accordingly.
Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 27 Jan 2023 12:24:32 +0000 (12:24 +0000)]
Merge branch 'ethtool-netlink-next'
Jakub Kicinski says:
====================
ethtool: netlink: handle SET intro/outro in the common code
Factor out the boilerplate code from SET handlers to common code.
I volunteered to refactor the extack in GET in a conversation
with Vladimir but I gave up.
The handling of failures during dump in GET handlers is a bit
unclear to me. Some code uses presence of info as indication
of dump and tries to avoid reporting errors altogether
(including extack messages).
There's also the question of whether we should have a validation
callback (similar to .set_validate here) for GET. It looks like
.parse_request was expected to perform the validation. It takes
the extack and tb directly, not via info:
int (*parse_request)(struct ethnl_req_info *req_info,
struct nlattr **tb,
struct netlink_ext_ack *extack);
But .parse_request doesn't run under rtnl nor ethnl_ops_begin().
As a result some implementations defer validation until .prepare_data
where all the locks are held and they can call out to the driver.
All this makes me think that maybe we should refactor GET in the
same direction I'm refactoring SET. Split .prepare_data, take
more locks in the core, and add a validation helper which would
take extack directly:
- ret = ops->prepare_data(req_info, reply_data, info);
+ ret = ops->prepare_data_validate(req_info, reply_data, attrs, extack);
+ if (ret < 1) // if 0 -> skip for dump; -EOPNOTSUPP in do
+ goto err1;
+
+ ret = ethnl_ops_begin(dev);
+ if (ret)
+ goto err1;
+
+ ret = ops->prepare_data(req_info, reply_data); // no extack
+ ethnl_ops_complete(dev);
I'll file that away as a TODO for posterity / older me.
v2:
- invert checks for coalescing to avoid error code changes
- rebase and convert MM as well
This leads to a lot of copy / pasted code an bugs when people
mis-handle the error path.
Add a generic implementation of this pattern with a .set callback
in struct ethnl_request_ops called to "do stuff".
Also add an optional .set_validate which is called before
ethnl_ops_begin() -- a lot of implementations do basic request
capability / sanity checking at that point.
Because we want to avoid generating the notification when
no change happened - adopt a slightly hairy return values:
- 0 means nothing to do (no notification)
- 1 means done / continue
- negative error codes on error
Reuse .hdr_attr from struct ethnl_request_ops, GET and SET
use the same attr spaces in all cases.
Convert pause as an example (and to avoid unused function warnings).
Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Christian Marangi [Wed, 25 Jan 2023 20:35:17 +0000 (21:35 +0100)]
net: dsa: qca8k: convert to regmap read/write API
Convert qca8k to regmap read/write bulk API. The mgmt eth can write up
to 32 bytes of data at times. Currently we use a custom function to do
it but regmap now supports declaration of read/write bulk even without a
bus.
Drop the custom function and rework the regmap function to this new
implementation.
Rework the qca8k_fdb_read/write function to use the new
regmap_bulk_read/write as the old qca8k_bulk_read/write are now dropped.
Cc: Mark Brown <broonie@kernel.org> Signed-off-by: Christian Marangi <ansuelsmth@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Thu, 26 Jan 2023 07:14:23 +0000 (23:14 -0800)]
net: skbuff: drop the linux/hrtimer.h include
linux/hrtimer.h include was added because apparently it used
to contain ktime related code. This is no longer the case
and we include linux/time.h explicitly.
Sadly this change is currently a noop because linux/dma-mapping.h
and net/page_pool.h pull in half of the universe.
Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Thu, 26 Jan 2023 07:14:22 +0000 (23:14 -0800)]
net: skbuff: drop the linux/splice.h include
splice.h is included since commit a60e3cc7c929 ("net: make
skb_splice_bits more configureable") but really even then
all we needed is some forward declarations. Most of that
code is now gone, and remaining has fwd declarations.
Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Thu, 26 Jan 2023 07:14:20 +0000 (23:14 -0800)]
net: skbuff: drop the linux/sched.h include
linux/sched.h was added for skb_mstamp_* (all the way back
before linux/sched.h got split and linux/sched/clock.h created).
We don't need it in skbuff.h any more.
Sadly this change is currently a noop because linux/dma-mapping.h
and net/page_pool.h pull in half of the universe.
Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 27 Jan 2023 11:16:29 +0000 (11:16 +0000)]
Merge branch 'ipa-abstract-status'
Alex Elder says:
====================
net: ipa: abstract status parsing
Under some circumstances, IPA generates a "packet status" structure
that describes information about a packet. This is used, for
example, when offload hardware detects an error in a packet, or
otherwise discovers a packet needs special handling. In this case,
the status is delivered (along with the packet it describes) to a
"default" endpoint so that it can be handled by the AP.
Until now, the structure of this status information hasn't changed.
However, to support more than 32 endpoints, this structure required
some changes, such that some fields are rearranged in ways that are
tricky to represent using C code.
This series updates code related to the IPA status structure. The
first patch uses a local variable to avoid recomputing a packet
length more than once. The second stops using sizeof() to determine
the size of an IPA packet status structure. Patches 3-5 extend the
definitions for values held in packet status fields. Patch 6 does a
little general cleanup to make patch 7 simpler. Patch 7 stops using
a C structure to represent packet status; instead, a new function
fetches values "by name" from a buffer containing such a structure.
The last patch updates this function so it also supports IPA v5.0+.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder [Wed, 25 Jan 2023 20:45:45 +0000 (14:45 -0600)]
net: ipa: add IPA v5.0 packet status support
Update ipa_status_extract() to support IPA v5.0 and beyond. Because
the format of the IPA packet status depends on the version, pass an
IPA pointer to the function.
Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder [Wed, 25 Jan 2023 20:45:44 +0000 (14:45 -0600)]
net: ipa: introduce generalized status decoder
Stop assuming the IPA packet status has a fixed format (defined by
a C structure). Instead, use a function to extract each field from
a block of data interpreted as an IPA packet status. Define an
enumerated type that identifies the fields that can be extracted.
The current function extracts fields based on the existing
ipa_status structure format (which is no longer used).
Define IPA_STATUS_RULE_MISS, to replace the calls to field_max() to
represent that condition; those depended on the knowing the width of
a filter or router rule in the IPA packet status structure.
Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder [Wed, 25 Jan 2023 20:45:43 +0000 (14:45 -0600)]
net: ipa: IPA status preparatory cleanups
The next patch reworks how the IPA packet status structure is
interpreted. This patch does some preparatory work, to make it
easier to see the effect of that change:
- Change a few functions that access fields in a IPA packet status
structure to store field values in local variables with names
related to the field.
- Pass a void pointer rather than an (equivalent) status pointer
to two functions called by ipa_endpoint_status_parse().
- Use "rule" rather than "val" as the name of a variable that
holds a routing rule ID.
- Consistently use "IPA packet status" rather than "status
element" when referring to this data structure.
Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder [Wed, 25 Jan 2023 20:45:42 +0000 (14:45 -0600)]
net: ipa: define remaining IPA status field values
Define the remaining values for opcode and exception fields in the
IPA packet status structure. Most of these values are powers-of-2,
suggesting they are meant to be used as bitmasks, but that is not
the case. Add comments to be clear about this, and express the
values in decimal format.
Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder [Wed, 25 Jan 2023 20:45:41 +0000 (14:45 -0600)]
net: ipa: rename the NAT enumerated type
Rename the ipa_nat_en enumerated type to be ipa_nat_type, and rename
its symbols accordingly. Add a comment indicating those values are
also used in the IPA status nat_type field.
Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder [Wed, 25 Jan 2023 20:45:40 +0000 (14:45 -0600)]
net: ipa: define all IPA status mask bits
There is a 16 bit status mask defined in the IPA packet status
structure, of which only one (TAG_VALID) is currently used.
Define all other IPA status mask values in an enumerated type whose
numeric values are bit mask values (in CPU byte order) in the status
mask. Use the TAG_VALID value from that type rather than defining a
separate field mask.
Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder [Wed, 25 Jan 2023 20:45:39 +0000 (14:45 -0600)]
net: ipa: stop using sizeof(status)
The IPA packet status structure changes in IPA v5.0 in ways that are
difficult to represent cleanly. As a small step toward redefining
it as a parsed block of data, use a constant to define its size,
rather than the size of the IPA status structure type.
Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder [Wed, 25 Jan 2023 20:45:38 +0000 (14:45 -0600)]
net: ipa: refactor status buffer parsing
The packet length encoded in an IPA packet status buffer is computed
more than once in ipa_endpoint_status_parse(). It is also checked
again in ipa_endpoint_status_skip(), which that function calls.
Compute the length once, and use that computed value later rather
than recomputing it. Check for it being zero in the parse function
rather than in ipa_endpoint_status_skip().
Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Wed, 25 Jan 2023 14:57:16 +0000 (16:57 +0200)]
net: dsa: ocelot: build felix.c into a dedicated kernel module
The build system currently complains:
scripts/Makefile.build:252: drivers/net/dsa/ocelot/Makefile:
felix.o is added to multiple modules: mscc_felix mscc_seville
Since felix.c holds the DSA glue layer, create a mscc_felix_dsa_lib.ko.
This is similar to how mscc_ocelot_switch_lib.ko holds a library for
configuring the hardware.
Jakub Kicinski [Fri, 27 Jan 2023 07:29:12 +0000 (23:29 -0800)]
Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
Tony Nguyen says:
====================
virtchnl: update and refactor
Jesse Brandeburg says:
The virtchnl.h file is used by i40e/ice physical function (PF) drivers
and irdma when talking to the iavf driver. This series cleans up the
header file by removing unused elements, adding/cleaning some comments,
fixing the data structures so they are explicitly defined, including
padding, and finally does a long overdue rename of the IWARP members in
the structures to RDMA, since the ice driver and it's associated Intel
Ethernet E800 series adapters support both RDMA and IWARP.
The whole series should result in no functional change, but hopefully
clearer code.
* '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
virtchnl: i40e/iavf: rename iwarp to rdma
virtchnl: do structure hardening
virtchnl: update header and increase header clarity
virtchnl: remove unused structure declaration
====================
Stanislav Fomichev [Thu, 26 Jan 2023 22:50:30 +0000 (14:50 -0800)]
selftests/bpf: Properly enable hwtstamp in xdp_hw_metadata
The existing timestamping_enable() is a no-op because it applies
to the socket-related path that we are not verifying here
anymore. (but still leaving the code around hoping we can
have xdp->skb path verified here as well)
====================
tools: ynl: prevent reorder and fix flags
Some codegen improvements for YAML specs.
First, Lorenzon discovered when switching the XDP feature family
to use flags instead of pure enum that the kdoc got garbled.
The support for enum and flags is therefore unified.
Second when regenerating all families we discussed so far I noticed
that some netlink policies jumped around. We need to ensure we don't
render code based on their ordering in a hash.
====================
Jakub Kicinski [Thu, 26 Jan 2023 00:02:35 +0000 (16:02 -0800)]
tools: ynl: store ops in ordered dict to avoid random ordering
When rendering code we should walk the ops in the order in which
they are declared in the spec. This is both more intuitive and
prevents code from jumping around when hashing in the dict changes.
Jakub Kicinski [Thu, 26 Jan 2023 00:02:34 +0000 (16:02 -0800)]
tools: ynl: rename ops_list -> msg_list
ops_list contains all the operations, but the main iteration use
case is to walk only ops which define attrs. Rename ops_list to
msg_list, because now it looks like the contents are the same,
just the format is different. While at it convert from tuple
to just keys, none of the users care about the name of the op.