www.infradead.org Git - users/hch/misc.git/log

Merge branch 'resilient-queued-spin-lock'

Kumar Kartikeya Dwivedi says:

====================
Resilient Queued Spin Lock

Changelog:
----------
v3 -> v4
v4: https://lore.kernel.org/bpf/20250303152305.3195648-1-memxor@gmail.com

* Fix bisectability problem by reordering locktorture commit before
   Makefile commit.
* Add EXPORT_SYMBOL_GPL to all used symbols and variables by consumers.
* Skip BPF selftest when nrprocs < 2.
* Fix kdoc to describe return value for res_spin_lock, slowpath.
* Move kernel/locking/rqspinlock.{c,h} to kernel/bpf/rqspinlock.{c,h}.

v2 -> v3
v2: https://lore.kernel.org/bpf/20250206105435.2159977-1-memxor@gmail.com

* Add ifdef's to fallback to Ankur's patch when it gets in, until then
   copy-paste the implementation.
* Change the meaning of RES_DEF_TIMEOUT from two critical section
   lengths to one for clarity, and use RES_DEF_TIMEOUT * 2 where needed.
* Use NSEC_PER_SEC as timeout for TAS fallback.
* Add Closes: tags for known syzbot reports.
* Change timeout for TAS fallback to 1 second.
* Fix more kernel test robot errors.
* More comments about smp_wmb in release_held_lock_entry interaction.
* Change RES_NR_HELD to 31.
* Address comments from Peter, Eduard, Alexei.

v1 -> v2
v1: https://lore.kernel.org/bpf/20250107140004.2732830-1-memxor@gmail.com

* Address nits from Waiman and Peter
* Fix arm64 WFE bug pointed out by Peter.
* Fix incorrect memory ordering in release_held_lock_entry, and
   document subtleties. Explain why release is sufficient in unlock
   but not in release_held_lock_entry.
* Remove dependence on CONFIG_QUEUED_SPINLOCKS and introduce a
   test-and-set fallback when queued spinlock support is missing on an
   architecture.
* Enforce FIFO ordering for BPF program spin unlocks.
* Address comments from Eduard on verifier plumbing.
* Add comments as suggested by Waiman.
* Refactor paravirt TAS lock to use the implemented TAS fallback.
* Use rqspinlock_t as the type throughout so that it can be replaced
   with a non-qspinlock type in case of fallback.
* Testing and benchmarking on arm64, added numbers to the cover letter.
* Fix kernel test robot errors.
* Fix a BPF selftest bug leading to spurious failures on arm64.

Introduction
------------

This patch set introduces Resilient Queued Spin Lock (or rqspinlock with
res_spin_lock() and res_spin_unlock() APIs).

This is a qspinlock variant which recovers the kernel from a stalled
state when the lock acquisition path cannot make forward progress. This
can occur when a lock acquisition attempt enters a deadlock situation
(e.g. AA, or ABBA), or more generally, when the owner of the lock (which
we’re trying to acquire) isn’t making forward progress.

The cover letter provides an overview of the motivation, design, and
alternative approaches. We then provide evaluation numbers showcasing
that while rqspinlock incurs overhead, the performance of rqspinlock
approaches that of the normal qspinlock used by the kernel.

The evaluations for rqspinlock were performed by replacing the default
qspinlock implementation with it and booting the kernel to run the
experiments. Support for locktorture is also included with numbers in
this series.

The cover letter's design section provides an overview of the
algorithmic approach. A technical document describing the implementation
in more detail is available here:
https://github.com/kkdwivedi/rqspinlock/blob/main/rqspinlock.pdf

We have a WIP TLA+ proof for liveness and mutual exclusion of rqspinlock
built on top of the qspinlock TLA+ proof from Catalin Marinas [3]. We
will share more details and the links in the near future.

Motivation
----------

In regular kernel code, usage of locks is assumed to be correct, so as
to avoid deadlocks and stalls by construction, however, the same is not
true for BPF programs. Users write normal C code and the in-kernel eBPF
runtime ensures the safety of the kernel by rejecting unsafe programs.
Users can upload programs that use locks in an improper fashion, and may
cause deadlocks when these programs run inside the kernel. The verifier
is responsible for rejecting such programs from being loaded into the
kernel.

Until now, the eBPF verifier ensured deadlock safety by only permitting
one lock acquisition at a time, and by preventing any functions to be
called from within the critical section. Additionally, only a few
restricted program types are allowed to call spin locks. As the usage of
eBPF grows (e.g. with sched_ext) beyond its conventional application in
networking, tracing, and security, the limitations on locking are
becoming a bottleneck for users.

The rqspinlock implementation allows us to permit more flexible locking
patterns in BPF programs, without limiting them to the subset that can
be proven safe statically (which is fairly small, and requires complex
static analysis), while ensuring that the kernel will recover in case we
encounter a locking violation at runtime. We make a tradeoff here by
accepting programs that may potentially have deadlocks, and recover the
kernel quickly at runtime to ensure availability.

Additionally, eBPF programs attached to different parts of the kernel
can introduce new control flow into the kernel, which increases the
likelihood of deadlocks in code not written to handle reentrancy. There
have been multiple syzbot reports surfacing deadlocks in internal kernel
code due to the diverse ways in which eBPF programs can be attached to
different parts of the kernel.  By switching the BPF subsystem’s lock
usage to rqspinlock, all of these issues can be mitigated at runtime.

This spin lock implementation allows BPF maps to become safer and remove
mechanisms that have fallen short in assuring safety when nesting
programs in arbitrary ways in the same context or across different
contexts. The red diffs due to patches 16-18 demonstrate this
simplification.

>  kernel/bpf/hashtab.c         | 102 ++++++++++++++++++++++++++++++++--------------------------...
>  kernel/bpf/lpm_trie.c        |  25 ++++++++++++++-----------
>  kernel/bpf/percpu_freelist.c | 113 +++++++++++++++++++++++++---------------------------------...
>  kernel/bpf/percpu_freelist.h |   4 ++--
>  4 files changed, 73 insertions(+), 171 deletions(-)

Design
------

Deadlocks mostly manifest as stalls in the waiting loops of the
qspinlock slow path. Thus, using stalls as a signal for deadlocks avoids
introducing cost to the normal fast path, and ensures bounded
termination of the waiting loop. Our recovery algorithm is focused on
terminating the waiting loops of the qspinlock algorithm when it gets
stuck, and implementing bespoke recovery procedures for each class of
waiter to restore the lock to a usable state. Deadlock detection is the
main mechanism used to provide faster recovery, with the timeout
mechanism acting as a final line of defense.

Deadlock Detection
~~~~~~~~~~~~~~~~~~
We handle two cases of deadlocks: AA deadlocks (attempts to acquire the
same lock again), and ABBA deadlocks (attempts to acquire two locks in
the opposite order from two distinct threads). Variants of ABBA
deadlocks may be encountered with more than two locks being held in the
incorrect order. These are not diagnosed explicitly, as they reduce to
ABBA deadlocks.

Deadlock detection is triggered immediately when beginning the waiting
loop of a lock slow path.

While timeouts ensure that any waiting loops in the locking slow path
terminate and return to the caller, it can be excessively long in some
situations. While the default timeout is short (0.5s), a stall for this
duration inside the kernel can set off alerts for latency-critical
services with strict SLOs.  Ideally, the kernel should recover from an
undesired state of the lock as soon as possible.

A multi-step strategy is used to recover the kernel from waiting loops
in the locking algorithm which may fail to terminate in a bounded amount
of time.

* Each CPU maintains a table of held locks. Entries are inserted and
   removed upon entry into lock, and exit from unlock, respectively.
* Deadlock detection for AA locks is thus simple: we have an AA
   deadlock if we find a held lock entry for the lock we’re attempting
   to acquire on the same CPU.
* During deadlock detection for ABBA, we search through the tables of
   all other CPUs to find situations where we are holding a lock the
   remote CPU is attempting to acquire, and they are holding a lock we
   are attempting to acquire. Upon encountering such a condition, we
   report an ABBA deadlock.
* We divide the duration between entry time point into the waiting loop
   and the timeout time point into intervals of 1 ms, and perform
   deadlock detection until timeout happens. Upon entry into the slow
   path, and then completion of each 1 ms interval, we perform detection
   of both AA and ABBA deadlocks. In the event that deadlock detection
   yields a positive result, the recovery happens sooner than the
   timeout.  Otherwise, it happens as a last resort upon completion of
   the timeout.

Timeouts
~~~~~~~~
Timeouts act as final line of defense against stalls for waiting loops.
The ‘ktime_get_mono_fast_ns’ function is used to poll for the current
time, and it is compared to the timestamp indicating the end time in the
waiter loop. Each waiting loop is instrumented to check an extra
condition using a macro. Internally, the macro implementation amortizes
the checking of the timeout to avoid sampling the clock in every
iteration.  Precisely, the timeout checks are invoked every 64k
iterations.

Recovery
~~~~~~~~
There is extensive literature in academia on designing locks that
support timeouts [0][1], as timeouts can be used as a proxy for
detecting the presence of deadlocks and recovering from them, without
maintaining explicit metadata to construct a waits-for relationship
between two threads at runtime.

In case of rqspinlock, the key simplification in our algorithm comes
from the fact that upon a timeout, waiters always leave the queue in
FIFO order.  As such, the timeout is only enforced by the head of the
wait queue, while other waiters rely on the head to signal them when a
timeout has occurred and when they need to exit. We don’t have to
implement complex algorithms and do not need extra synchronization for
waiters in the middle of the queue timing out before their predecessor
or successor, unlike previous approaches [0][1].

There are three forms of waiters in the original queued spin lock
algorithm.  The first is the waiter which acquires the pending bit and
spins on the lock word without forming a wait queue. The second is the
head waiter that is the first waiter heading the wait queue. The third
form is of all the non-head waiters queued behind the head, waiting to
be signalled through their MCS node to overtake the responsibility of
the head.

In rqspinlock's recovery algorithm, we are concerned with the second and
third kind. First, we augment the waiting loop of the head of the wait
queue with a timeout. When this timeout happens, all waiters part of the
wait queue will abort their lock acquisition attempts. This happens in
three steps.

* First, the head breaks out of its loop waiting for pending and locked
   bits to turn to 0, and non-head waiters break out of their MCS node
   spin (more on that later).
* Next, every waiter (head or non-head) attempts to check whether they
   are also the tail waiter, in such a case they attempt to zero out the
   tail word and allow a new queue to be built up for this lock. If they
   succeed, they have no one to signal next in the queue to stop
   spinning.
* Otherwise, they signal the MCS node of the next waiter to break out
   of its spin and try resetting the tail word back to 0. This goes on
   until the tail waiter is found. In case of races, the new tail will
   be responsible for performing the same task, as the old tail will
   then fail to reset the tail word and wait for its next pointer to be
   updated before it signals the new tail to do the same.

Timeout Bound
~~~~~~~~~~~~~
The timeout is applied by two types of waiters: the pending bit waiter
and the wait queue head waiter. As such, for the pending waiter, only
the lock owner is ahead of it, and for the wait queue head waiter, only
the lock owner and the pending waiter take precedence in executing their
critical sections.

We define the timeout value to span at most 1 critical section length,
and then use the appropriate value (default, or default x 2) depending
on if we are the pending waiter or head of wait queue.

Therefore, the waiting loop wait can span at most 2 critical section
lengths, and thus, it is unaffected by the amount of contention or the
number of CPUs on the host. Non-head waiters simply wait for the wait
queue head to signal them on a timeout.

In Meta's production, we have noticed uncore PMU reads and SMIs
consuming tens of msecs. While these events are rare, a 0.25 second
timeout should absorb such tail events and not raise false alarms for
timeouts. We will continue monitoring this in production and adjust the
timeout if necessary in the future.

More details of the recovery algorithm is described in patch 9 and a
detailed description is available at [2].

Alternatives
------------

Lockdep: We do not rely on the lockdep facility for reporting violations
for primarily two reasons:

* Overhead: The lockdep infrastructure can add significant overhead to
  the lock acquisition path, and is not recommended for use in
  production due to this reason. While the report is more useful and
  exhaustive, the overhead can be prohibitive, especially as BPF
  programs run in hot paths of the kernel.  Moreover, it also increases
  the size of the lock word to store extra metadata, which is not
  feasible for BPF spin locks that are 4-bytes in size today (similar to
  qspinlock).

* Debug Tool: Lockdep is intended to be used as a debugging facility,
  providing extra context to the user about the locking violations
  occurring during runtime. It is always turned off on all production
  kernels, therefore isn’t available most of the time.

We require a mechanism for detecting common variants of deadlocks that
is always available in production kernels and never turned off. At the
same time, it must not introduce overhead in terms of time (for the slow
path) and memory (for the lock word size).

Evaluation
----------

We run benchmarks that stress locking scalability and perform comparison
against the baseline (qspinlock). For the rqspinlock case, we replace
the default qspinlock with it in the kernel, such that all spin locks in
the kernel use the rqspinlock slow path. As such, benchmarks that stress
kernel spin locks end up exercising rqspinlock.

Evaluation setup
~~~~~~~~~~~~~~~~

We set the CPU governor to performance for all experiments.

Note: Numbers for arm64 have been obtained without the no-WFE fallback
in this series, to perform a fair comparison with the WFE using
qspinlock baseline.

x86_64:

Intel Xeon Platinum 8468 (Sapphire Rapids)
96 cores (48 x 2 sockets)
2 threads per core, 0-95, siblings from 96-191
2 NUMA nodes (every 48 cores), 2 LLCs (every 48 cores), 1 LLC per NUMA node
Hyperthreading enabled

arm64:

Ampere Max Neoverse-N1 256-Core Processor
256 cores (128 cores x 2 sockets)
1 thread per core
2 NUMA nodes (every 128 cores), 1 L2 per core (256 instances), no shared L3
No hyperthreading available

The locktorture experiment is run for 30 seconds.
Average of 25 runs is used for will-it-scale, after an initial warm up.

More information on the locks contended in the will-it-scale experiments
is available in the evaluation section of the CNA paper, in table 1 [4].

Legend:
QL - qspinlock (avg. throughput)
RQL - rqspinlock (avg. throughput)

Results
~~~~~~~

locktorture - x86_64

Threads QL RQL Speedup
-----------------------------------------------
1 46910437 45057327 0.96
2 29871063 25085034 0.84
4 13876024 19242776 1.39
8 14638499 13346847 0.91
16 14380506 14104716 0.98
24 17278144 15293077 0.89
32 19494283 17826675 0.91
40 27760955 21002910 0.76
48 28638897 26432549 0.92
56 29336194 26512029 0.9
64 30040731 27421403 0.91
72 29523599 27010618 0.91
80 28846738 27885141 0.97
88 29277418 25963753 0.89
96 28472339 27423865 0.96
104 28093317 26634895 0.95
112 29914000 27872339 0.93
120 29199580 26682695 0.91
128 27755880 27314662 0.98
136 30349095 27092211 0.89
144 29193933 27805445 0.95
152 28956663 26071497 0.9
160 28950009 28183864 0.97
168 29383520 28135091 0.96
176 28475883 27549601 0.97
184 31958138 28602434 0.89
192 31342633 33394385 1.07

will-it-scale open1_threads - x86_64

Threads QL       QL stddev       stddev% RQL      RQL stddev      stddev% Speedup
-----------------------------------------------------------------------------------------------
1 1396323.92 7373.12 0.53 1366616.8 4152.08 0.3 0.98
2 1844403.8 3165.26 0.17 1700301.96 2396.58 0.14 0.92
4 2370590.6 24545.54 1.04 1655872.32 47938.71 2.9 0.7
8 2185227.04 9537.9 0.44 1691205.16 9783.25 0.58 0.77
16 2110672.36 10972.99 0.52 1781696.24 15021.43 0.84 0.84
24 1655042.72 18037.23 1.09 2165125.4 5422.54 0.25 1.31
32 1738928.24 7166.64 0.41 1829468.24 9081.59 0.5 1.05
40 1854430.52 6148.24 0.33 1731062.28 3311.95 0.19 0.93
48 1766529.96 5063.86 0.29 1749375.28 2311.27 0.13 0.99
56 1303016.28 6168.4 0.47 1452656 7695.29 0.53 1.11
64 1169557.96 4353.67 0.37 1287370.56 8477.2 0.66 1.1
72 1036023.4 7116.53 0.69 1135513.92 9542.55 0.84 1.1
80 1097913.64 11356 1.03 1176864.8 6771.41 0.58 1.07
88 1123907.36 12843.13 1.14 1072416.48 7412.25 0.69 0.95
96 1166981.52 9402.71 0.81 1129678.76 9499.14 0.84 0.97
104 1108954.04 8171.46 0.74 1032044.44 7840.17 0.76 0.93
112 1000777.76 8445.7 0.84 1078498.8 6551.47 0.61 1.08
120 1029448.4 6992.29 0.68 1093743 8378.94 0.77 1.06
128 1106670.36 10102.15 0.91 1241438.68 23212.66 1.87 1.12
136 1183776.88 6394.79 0.54 1116799.64 18111.38 1.62 0.94
144 1201122 25917.69 2.16 1301779.96 15792.6 1.21 1.08
152 1099737.08 13567.82 1.23 1053647.2 12704.29 1.21 0.96
160 1031186.32 9048.07 0.88 1069961.4 8293.18 0.78 1.04
168 1068817 16486.06 1.54 1096495.36 14021.93 1.28 1.03
176 966633.96 9623.27 1 1081129.84 9474.81 0.88 1.12
184 1004419.04 12111.11 1.21 1037771.24 12001.66 1.16 1.03
192 1088858.08 16522.93 1.52 1027943.12 14238.57 1.39 0.94

will-it-scale open2_threads - x86_64

Threads QL       QL stddev       stddev% RQL      RQL stddev      stddev% Speedup
-----------------------------------------------------------------------------------------------
1 1337797.76 4649.19 0.35 1332609.4 3813.14 0.29 1
2 1598300.2 1059.93 0.07 1771891.36 5667.12 0.32 1.11
4 1736573.76 13025.33 0.75 1396901.2 2682.46 0.19 0.8
8 1794367.84 4879.6 0.27 1917478.56 3751.98 0.2 1.07
16 1990998.44 8332.78 0.42 1864165.56 9648.59 0.52 0.94
24 1868148.56 4248.23 0.23 1710136.68 2760.58 0.16 0.92
32 1955180 6719 0.34 1936149.88 1980.87 0.1 0.99
40 1769646.4 4686.54 0.26 1729653.68 4551.22 0.26 0.98
48 1724861.16 4056.66 0.24 1764900 971.11 0.06 1.02
56 1318568 7758.86 0.59 1385660.84 7039.8 0.51 1.05
64 1143290.28 5351.43 0.47 1316686.6 5597.69 0.43 1.15
72 1196762.68 10655.67 0.89 1230173.24 9858.2 0.8 1.03
80 1126308.24 6901.55 0.61 1085391.16 7444.34 0.69 0.96
88 1035672.96 5452.95 0.53 1035541.52 8095.33 0.78 1
96 1030203.36 6735.71 0.65 1020113.48 8683.13 0.85 0.99
104 1039432.88 6583.59 0.63 1083902.48 5775.72 0.53 1.04
112 1113609.04 4380.62 0.39 1072010.36 8983.14 0.84 0.96
120 1109420.96 7183.5 0.65 1079424.12 10929.97 1.01 0.97
128 1095400.04 4274.6 0.39 1095475.2 12042.02 1.1 1
136 1071605.4 11103.73 1.04 1114757.2 10516.55 0.94 1.04
144 1104147.2 9714.75 0.88 1044954.16 7544.2 0.72 0.95
152 1164280.24 13386.15 1.15 1101213.92 11568.49 1.05 0.95
160 1084892.04 7941.25 0.73 1152273.76 9593.38 0.83 1.06
168 983654.76 11772.85 1.2 1111772.28 9806.83 0.88 1.13
176 1087544.24 11262.35 1.04 1077507.76 9442.02 0.88 0.99
184 1101682.4 24701.68 2.24 1095223.2 16707.29 1.53 0.99
192 983712.08 13453.59 1.37 1051244.2 15662.05 1.49 1.07

will-it-scale lock1_threads - x86_64

Threads QL       QL stddev       stddev% RQL      RQL stddev      stddev% Speedup
-----------------------------------------------------------------------------------------------
1 4307484.96 3959.31 0.09 4252908.56 10375.78 0.24 0.99
2 7701844.32 4169.88 0.05 7219233.52 6437.11 0.09 0.94
4 14781878.72 22854.85 0.15 15260565.12 37305.71 0.24 1.03
8 12949698.64 99270.42 0.77 9954660.4 142805.68 1.43 0.77
16 12947690.64 72977.27 0.56 10865245.12 49520.31 0.46 0.84
24 11142990.64 33200.39 0.3 11444391.68 37884.46 0.33 1.03
32 9652335.84 22369.48 0.23 9344086.72 21639.22 0.23 0.97
40 9185931.12 5508.96 0.06 8881506.32 5072.33 0.06 0.97
48 9084385.36 10871.05 0.12 8863579.12 4583.37 0.05 0.98
56 6595540.96 33100.59 0.5 6640389.76 46619.96 0.7 1.01
64 5946726.24 47160.5 0.79 6572155.84 91973.73 1.4 1.11
72 6744894.72 43166.65 0.64 5991363.36 80637.56 1.35 0.89
80 6234502.16 118983.16 1.91 5157894.32 73592.72 1.43 0.83
88 5053879.6 199713.75 3.95 4479758.08 36202.27 0.81 0.89
96 5184302.64 99199.89 1.91 5249210.16 122348.69 2.33 1.01
104 4612391.92 40803.05 0.88 4850209.6 26813.28 0.55 1.05
112 4809209.68 24070.68 0.5 4869477.84 27489.04 0.56 1.01
120 5130746.4 34265.5 0.67 4620047.12 44229.54 0.96 0.9
128 5376465.28 95028.05 1.77 4781179.6 43700.93 0.91 0.89
136 5453742.4 86718.87 1.59 5412457.12 40339.68 0.75 0.99
144 5805040.72 84669.31 1.46 5595382.48 68701.65 1.23 0.96
152 5842897.36 31120.33 0.53 5787587.12 43521.68 0.75 0.99
160 5837665.12 14179.44 0.24 5118808.72 45193.23 0.88 0.88
168 5660332.72 27467.09 0.49 5104959.04 40891.75 0.8 0.9
176 5180312.24 28656.39 0.55 4718407.6 58734.13 1.24 0.91
184 4706824.16 50469.31 1.07 4692962.64 92266.85 1.97 1
192 5126054.56 51082.02 1 4680866.8 58743.51 1.25 0.91

will-it-scale lock2_threads - x86_64

Threads QL       QL stddev       stddev% RQL      RQL stddev      stddev% Speedup
-----------------------------------------------------------------------------------------------
1 4316091.2 4933.28 0.11 4293104 30369.71 0.71 0.99
2 3500046.4 19852.62 0.57 4507627.76 23667.66 0.53 1.29
4 3639098.96 26370.65 0.72 3673166.32 30822.71 0.84 1.01
8 3714548.56 49953.44 1.34 4055818.56 71630.41 1.77 1.09
16 4188724.64 105414.49 2.52 4316077.12 68956.15 1.6 1.03
24 3737908.32 47391.46 1.27 3762254.56 55345.7 1.47 1.01
32 3820952.8 45207.66 1.18 3710368.96 52651.92 1.42 0.97
40 3791280.8 28630.55 0.76 3661933.52 37671.27 1.03 0.97
48 3765721.84 59553.83 1.58 3604738.64 50861.36 1.41 0.96
56 3175505.76 64336.17 2.03 2771022.48 66586.99 2.4 0.87
64 2620294.48 71651.34 2.73 2650171.68 44810.83 1.69 1.01
72 2861893.6 86542.61 3.02 2537437.2 84571.75 3.33 0.89
80 2976297.2 83566.43 2.81 2645132.8 85992.34 3.25 0.89
88 2547724.8 102014.36 4 2336852.16 80570.25 3.45 0.92
96 2945310.32 82673.25 2.81 2513316.96 45741.81 1.82 0.85
104 3028818.64 90643.36 2.99 2581787.52 52967.48 2.05 0.85
112 2546264.16 102605.82 4.03 2118812.64 62043.19 2.93 0.83
120 2917334.64 112220.01 3.85 2720418.64 64035.96 2.35 0.93
128 2906621.84 69428.1 2.39 2795310.32 56736.87 2.03 0.96
136 2841833.76 105541.11 3.71 3063404.48 62288.94 2.03 1.08
144 3032822.32 134796.56 4.44 3169985.6 149707.83 4.72 1.05
152 2557694.96 62218.15 2.43 2469887.6 68343.78 2.77 0.97
160 2810214.72 61468.79 2.19 2323768.48 54226.71 2.33 0.83
168 2651146.48 76573.27 2.89 2385936.64 52433.98 2.2 0.9
176 2720616.32 89026.19 3.27 2941400.08 59296.64 2.02 1.08
184 2696086 88541.24 3.28 2598225.2 76365.7 2.94 0.96
192 2908194.48 87023.91 2.99 2377677.68 53299.82 2.24 0.82

locktorture - arm64

Threads QL RQL Speedup
-----------------------------------------------
1 43320464 44718174 1.03
2 21056971 29255448 1.39
4 16040120 11563981 0.72
8 12786398 12838909 1
16 13646408 13436730 0.98
24 13597928 13669457 1.01
32 16456220 14600324 0.89
40 16667726 13883101 0.83
48 14347691 14608641 1.02
56 15624580 15180758 0.97
64 18105114 16009137 0.88
72 16606438 14772256 0.89
80 16550202 14124056 0.85
88 16716082 15930618 0.95
96 16489242 16817657 1.02
104 17915808 17165324 0.96
112 17217482 21343282 1.24
120 20449845 20576123 1.01
128 18700902 20286275 1.08
136 17913378 21142921 1.18
144 18225673 18971921 1.04
152 18374206 19229854 1.05
160 23136514 20129504 0.87
168 21096269 17167777 0.81
176 21376794 21594914 1.01
184 23542989 20638298 0.88
192 22793754 20655980 0.91
200 20933027 19628316 0.94
208 23105684 25572720 1.11
216 24158081 23173848 0.96
224 23388984 22485353 0.96
232 21916401 23899343 1.09
240 22292129 22831784 1.02
248 25812762 22636787 0.88
256 24294738 26127113 1.08

will-it-scale open1_threads - arm64

Threads QL       QL stddev       stddev% RQL      RQL stddev      stddev% Speedup
-----------------------------------------------------------------------------------------------
1 844452.32 801 0.09 804936.92 900.25 0.11 0.95
2 1309419.08 9495.78 0.73 1265080.24 3171.13 0.25 0.97
4 2113074.24 5363.19 0.25 2041158.28 7883.65 0.39 0.97
8 1916650.96 15749.86 0.82 2039850.04 7562.87 0.37 1.06
16 1835540.72 12940.45 0.7 1937398.56 11461.15 0.59 1.06
24 1876760.48 12581.67 0.67 1966659.16 10012.69 0.51 1.05
32 1834525.6 5571.08 0.3 1929180.4 6221.96 0.32 1.05
40 1851592.76 7848.18 0.42 1937504.44 5991.55 0.31 1.05
48 1845067 4118.68 0.22 1773331.56 6068.23 0.34 0.96
56 1742709.36 6874.03 0.39 1716184.92 6713.16 0.39 0.98
64 1685339.72 6688.91 0.4 1676046.16 5844.06 0.35 0.99
72 1694838.84 2433.41 0.14 1821189.6 2906.89 0.16 1.07
80 1738778.68 2916.74 0.17 1729212.6 3714.41 0.21 0.99
88 1753131.76 2734.34 0.16 1713294.32 4652.82 0.27 0.98
96 1694112.52 4449.69 0.26 1714438.36 5621.66 0.33 1.01
104 1780279.76 2420.52 0.14 1767679.12 3067.66 0.17 0.99
112 1700284.72 9796.23 0.58 1796674.6 4066.06 0.23 1.06
120 1760466.72 3978.65 0.23 1704706.08 4080.04 0.24 0.97
128 1634067.96 5187.94 0.32 1764115.48 3545.02 0.2 1.08
136 1170303.84 7602.29 0.65 1227188.04 8090.84 0.66 1.05
144 953186.16 7859.02 0.82 964822.08 10536.61 1.09 1.01
152 818893.96 7238.86 0.88 853412.44 5932.25 0.7 1.04
160 707460.48 3868.26 0.55 746985.68 10363.03 1.39 1.06
168 658380.56 4938.77 0.75 672101.12 5442.95 0.81 1.02
176 614692.04 3137.74 0.51 615143.36 6197.19 1.01 1
184 574808.88 4741.61 0.82 592395.08 8840.92 1.49 1.03
192 548142.92 6116.31 1.12 571299.68 8388.56 1.47 1.04
200 511621.96 2182.33 0.43 532144.88 5467.04 1.03 1.04
208 506583.32 6834.39 1.35 521427.08 10318.65 1.98 1.03
216 480438.04 3608.96 0.75 510697.76 8086.47 1.58 1.06
224 470644.96 3451.35 0.73 467433.92 5008.59 1.07 0.99
232 466973.72 6599.97 1.41 444345.92 2144.96 0.48 0.95
240 442927.68 2351.56 0.53 440503.56 4289.01 0.97 0.99
248 432991.16 5829.92 1.35 445462.6 5944.03 1.33 1.03
256 409455.44 1430.5 0.35 422219.4 4007.04 0.95 1.03

will-it-scale open2_threads - arm64

Threads QL       QL stddev       stddev% RQL      RQL stddev      stddev% Speedup
-----------------------------------------------------------------------------------------------
1 818645.4 1097.02 0.13 774110.24 1562.45 0.2 0.95
2 1281013.04 2188.78 0.17 1238346.24 2149.97 0.17 0.97
4 2058514.16 13105.36 0.64 1985375 3204.48 0.16 0.96
8 1920414.8 16154.63 0.84 1911667.92 8882.98 0.46 1
16 1943729.68 8714.38 0.45 1978946.72 7465.65 0.38 1.02
24 1915846.88 7749.9 0.4 1914442.72 9841.71 0.51 1
32 1964695.92 8854.83 0.45 1914650.28 9357.82 0.49 0.97
40 1845071.12 5103.26 0.28 1891685.44 4278.34 0.23 1.03
48 1838897.6 5123.61 0.28 1843498.2 5391.94 0.29 1
56 1823768.32 3214.14 0.18 1736477.48 5675.49 0.33 0.95
64 1627162.36 3528.1 0.22 1685727.16 6102.63 0.36 1.04
72 1725320.16 4709.83 0.27 1710174.4 6707.54 0.39 0.99
80 1692288.44 9110.89 0.54 1773676.24 4327.94 0.24 1.05
88 1725496.64 4249.71 0.25 1695173.84 5097.14 0.3 0.98
96 1766093.08 2280.09 0.13 1732782.64 3606.1 0.21 0.98
104 1647753 2926.83 0.18 1710876.4 4416.04 0.26 1.04
112 1763785.52 3838.26 0.22 1803813.76 1859.2 0.1 1.02
120 1684095.16 2385.31 0.14 1766903.08 3258.34 0.18 1.05
128 1733528.56 2800.62 0.16 1677446.32 3201.14 0.19 0.97
136 1179187.84 6804.86 0.58 1241839.52 10698.51 0.86 1.05
144 969456.36 6421.85 0.66 1018441.96 8732.19 0.86 1.05
152 839295.64 10422.66 1.24 817531.92 6778.37 0.83 0.97
160 743010.72 6957.98 0.94 749291.16 9388.47 1.25 1.01
168 666049.88 13159.73 1.98 689408.08 10192.66 1.48 1.04
176 609185.56 5685.18 0.93 653744.24 10847.35 1.66 1.07
184 602232.08 12089.72 2.01 597718.6 13856.45 2.32 0.99
192 563919.32 9870.46 1.75 560080.4 8388.47 1.5 0.99
200 522396.28 4155.61 0.8 539168.64 10456.64 1.94 1.03
208 520328.28 9353.14 1.8 510011.4 6061.19 1.19 0.98
216 479797.72 5824.58 1.21 486955.32 4547.05 0.93 1.01
224 467943.8 4484.86 0.96 473252.76 5608.58 1.19 1.01
232 456914.24 3129.5 0.68 457463.2 7474.83 1.63 1
240 450535 5149.78 1.14 437653.56 4604.92 1.05 0.97
248 435475.2 2350.87 0.54 435589.24 6176.01 1.42 1
256 416737.88 2592.76 0.62 424178.28 3932.2 0.93 1.02

will-it-scale lock1_threads - arm64

Threads QL       QL stddev       stddev% RQL      RQL stddev      stddev% Speedup
-----------------------------------------------------------------------------------------------
1 2512077.52 3026.1 0.12 2085365.92 1612.44 0.08 0.83
2 4840180.4 3646.31 0.08 4326922.24 3802.17 0.09 0.89
4 9358779.44 6673.07 0.07 8467588.56 5577.05 0.07 0.9
8 9374436.88 18826.26 0.2 8635110.16 4217.66 0.05 0.92
16 9527184.08 14111.94 0.15 8561174.16 3258.6 0.04 0.9
24 8873099.76 17242.32 0.19 9286778.72 4124.51 0.04 1.05
32 8457640.4 10790.92 0.13 8700401.52 5110 0.06 1.03
40 8478771.76 13250.8 0.16 8746198.16 7606.42 0.09 1.03
48 8329097.76 7958.92 0.1 8774265.36 6082.08 0.07 1.05
56 8330143.04 11586.93 0.14 8472426.48 7402.13 0.09 1.02
64 8334684.08 10478.03 0.13 7979193.52 8436.63 0.11 0.96
72 7941815.52 16031.38 0.2 8016885.52 12640.56 0.16 1.01
80 8042221.68 10219.93 0.13 8072222.88 12479.54 0.15 1
88 8190336.8 10751.38 0.13 8432977.6 11865.67 0.14 1.03
96 8235010.08 7267.8 0.09 8022101.28 11910.63 0.15 0.97
104 8154434.08 7770.8 0.1 7987812 7647.42 0.1 0.98
112 7738464.56 11067.72 0.14 7968483.92 20632.93 0.26 1.03
120 8228919.36 10395.79 0.13 8304329.28 11913.76 0.14 1.01
128 7798646.64 8877.8 0.11 8197938.4 7527.81 0.09 1.05
136 5567293.68 66259.82 1.19 5642017.12 126584.59 2.24 1.01
144 4425655.52 55729.96 1.26 4519874.64 82996.01 1.84 1.02
152 3871300.8 77793.78 2.01 3850025.04 80167.3 2.08 0.99
160 3558041.68 55108.3 1.55 3495924.96 83626.42 2.39 0.98
168 3302042.72 45011.89 1.36 3298002.8 59393.64 1.8 1
176 3066165.2 34896.54 1.14 3063027.44 58219.26 1.9 1
184 2817899.6 43585.27 1.55 2859393.84 45258.03 1.58 1.01
192 2690403.76 42236.77 1.57 2630652.24 35953.13 1.37 0.98
200 2563141.44 28145.43 1.1 2539964.32 38556.52 1.52 0.99
208 2502968.8 27687.81 1.11 2477757.28 28240.81 1.14 0.99
216 2474917.76 24128.71 0.97 2483161.44 32198.37 1.3 1
224 2386874.72 32954.66 1.38 2398068.48 37667.29 1.57 1
232 2379248.24 27413.4 1.15 2327601.68 24565.28 1.06 0.98
240 2302146.64 19914.19 0.87 2236074.64 20968.17 0.94 0.97
248 2241798.32 21542.52 0.96 2173312.24 26498.36 1.22 0.97
256 2198765.12 20832.66 0.95 2136159.52 25027.96 1.17 0.97

will-it-scale lock2_threads - arm64

Threads QL       QL stddev       stddev% RQL      RQL stddev      stddev% Speedup
-----------------------------------------------------------------------------------------------
1 2499414.32 1932.27 0.08 2075704.8 24589.71 1.18 0.83
2 3887820 34198.36 0.88 4057432.64 11896.04 0.29 1.04
4 3445307.6 7958.3 0.23 3869960.4 3788.5 0.1 1.12
8 4310597.2 14405.9 0.33 3931319.76 5845.33 0.15 0.91
16 3995159.84 22621.85 0.57 3953339.68 15668.9 0.4 0.99
24 4048456.88 22956.51 0.57 3887812.64 30584.77 0.79 0.96
32 3974808.64 20465.87 0.51 3718778.08 27407.24 0.74 0.94
40 3941154.88 15136.68 0.38 3551464.24 33378.67 0.94 0.9
48 3725436.32 17090.67 0.46 3714356.08 19035.26 0.51 1
56 3558449.44 10123.46 0.28 3449656.08 36476.87 1.06 0.97
64 3514616.08 16470.99 0.47 3493197.04 25639.82 0.73 0.99
72 3461700.88 16780.97 0.48 3376565.04 16930.19 0.5 0.98
80 3797008.64 17599.05 0.46 3505856.16 34320.34 0.98 0.92
88 3737459.44 10774.93 0.29 3631757.68 24231.29 0.67 0.97
96 3612816.16 21865.86 0.61 3545354.56 16391.15 0.46 0.98
104 3765167.36 17763.8 0.47 3466467.12 22235.45 0.64 0.92
112 3713386 15455.21 0.42 3402210 18349.66 0.54 0.92
120 3699986.08 15153.08 0.41 3580303.92 19823.01 0.55 0.97
128 3648694.56 11891.62 0.33 3426445.28 22993.32 0.67 0.94
136 800046.88 6039.73 0.75 784412.16 9062.03 1.16 0.98
144 769483.36 5231.74 0.68 714132.8 8953.57 1.25 0.93
152 821081.52 4249.12 0.52 743694.64 8155.18 1.1 0.91
160 789040.16 9187.4 1.16 834865.44 6159.29 0.74 1.06
168 867742.4 8967.66 1.03 734905.36 15582.75 2.12 0.85
176 838650.32 7949.72 0.95 846939.68 8959.8 1.06 1.01
184 854984.48 19475.51 2.28 794549.92 11924.54 1.5 0.93
192 846262.32 13795.86 1.63 899915.12 8639.82 0.96 1.06
200 942602.16 12665.42 1.34 900385.76 8592.23 0.95 0.96
208 954183.68 12853.22 1.35 1166186.96 13045.03 1.12 1.22
216 929319.76 10157.79 1.09 926773.76 10577.01 1.14 1
224 967896.56 9819.6 1.01 951144.32 12343.83 1.3 0.98
232 990621.12 7771.97 0.78 916361.2 17878.44 1.95 0.93
240 995285.04 20104.22 2.02 972119.6 12856.42 1.32 0.98
248 1029436 20404.97 1.98 965301.28 11102.95 1.15 0.94
256 1038724.8 19201.03 1.85 1029942.08 12563.07 1.22 0.99

Written By
----------
Alexei Starovoitov <ast@kernel.org>
Kumar Kartikeya Dwivedi <memxor@gmail.com>

  [0]: https://www.cs.rochester.edu/research/synchronization/pseudocode/timeout.html
  [1]: https://dl.acm.org/doi/10.1145/571825.571830
  [2]: https://github.com/kkdwivedi/rqspinlock/blob/main/rqspinlock.pdf
  [3]: https://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/kernel-tla.git/plain/qspinlock.tla
  [4]: https://arxiv.org/pdf/1810.05600
====================

Link: https://patch.msgid.link/20250316040541.108729-1-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Add tests for rqspinlock

Introduce selftests that trigger AA, ABBA deadlocks, and test the edge
case where the held locks table runs out of entries, since we then
fallback to the timeout as the final line of defense. Also exercise
verifier's AA detection where applicable.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250316040541.108729-26-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Maintain FIFO property for rqspinlock unlock

Since out-of-order unlocks are unsupported for rqspinlock, and irqsave
variants enforce strict FIFO ordering anyway, make the same change for
normal non-irqsave variants, such that FIFO ordering is enforced.

Two new verifier state fields (active_lock_id, active_lock_ptr) are used
to denote the top of the stack, and prev_id and prev_ptr are ascertained
whenever popping the topmost entry through an unlock.

Take special care to make these fields part of the state comparison in
refsafe.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250316040541.108729-25-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Implement verifier support for rqspinlock

Introduce verifier-side support for rqspinlock kfuncs. The first step is
allowing bpf_res_spin_lock type to be defined in map values and
allocated objects, so BTF-side is updated with a new BPF_RES_SPIN_LOCK
field to recognize and validate.

Any object cannot have both bpf_spin_lock and bpf_res_spin_lock, only
one of them (and at most one of them per-object, like before) must be
present. The bpf_res_spin_lock can also be used to protect objects that
require lock protection for their kfuncs, like BPF rbtree and linked
list.

The verifier plumbing to simulate success and failure cases when calling
the kfuncs is done by pushing a new verifier state to the verifier state
stack which will verify the failure case upon calling the kfunc. The
path where success is indicated creates all lock reference state and IRQ
state (if necessary for irqsave variants). In the case of failure, the
state clears the registers r0-r5, sets the return value, and skips kfunc
processing, proceeding to the next instruction.

When marking the return value for success case, the value is marked as
0, and for the failure case as [-MAX_ERRNO, -1]. Then, in the program,
whenever user checks the return value as 'if (ret)' or 'if (ret < 0)'
the verifier never traverses such branches for success cases, and would
be aware that the lock is not held in such cases.

We push the kfunc state in check_kfunc_call whenever rqspinlock kfuncs
are invoked. We introduce a kfunc_class state to avoid mixing lock
irqrestore kfuncs with IRQ state created by bpf_local_irq_save.

With all this infrastructure, these kfuncs become usable in programs
while satisfying all safety properties required by the kernel.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250316040541.108729-24-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Introduce rqspinlock kfuncs

Introduce four new kfuncs, bpf_res_spin_lock, and bpf_res_spin_unlock,
and their irqsave/irqrestore variants, which wrap the rqspinlock APIs.
bpf_res_spin_lock returns a conditional result, depending on whether the
lock was acquired (NULL is returned when lock acquisition succeeds,
non-NULL upon failure). The memory pointed to by the returned pointer
upon failure can be dereferenced after the NULL check to obtain the
error code.

Instead of using the old bpf_spin_lock type, introduce a new type with
the same layout, and the same alignment, but a different name to avoid
type confusion.

Preemption is disabled upon successful lock acquisition, however IRQs
are not. Special kfuncs can be introduced later to allow disabling IRQs
when taking a spin lock. Resilient locks are safe against AA deadlocks,
hence not disabling IRQs currently does not allow violation of kernel
safety.

__irq_flag annotation is used to accept IRQ flags for the IRQ-variants,
with the same semantics as existing bpf_local_irq_{save, restore}.

These kfuncs will require additional verifier-side support in subsequent
commits, to allow programs to hold multiple locks at the same time.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250316040541.108729-23-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Convert lpm_trie.c to rqspinlock

Convert all LPM trie usage of raw_spinlock to rqspinlock.

Note that rcu_dereference_protected in trie_delete_elem is switched over
to plain rcu_dereference, the RCU read lock should be held from BPF
program side or eBPF syscall path, and the trie->lock is just acquired
before the dereference. It is not clear the reason the protected variant
was used from the commit history, but the above reasoning makes sense so
switch over.

Closes: https://lore.kernel.org/lkml/000000000000adb08b061413919e@google.com
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250316040541.108729-22-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Convert percpu_freelist.c to rqspinlock

Convert the percpu_freelist.c code to use rqspinlock, and remove the
extralist fallback and trylock-based acquisitions to avoid deadlocks.

Key thing to note is the retained while (true) loop to search through
other CPUs when failing to push a node due to locking errors. This
retains the behavior of the old code, where it would keep trying until
it would be able to successfully push the node back into the freelist of
a CPU.

Technically, we should start iteration for this loop from
raw_smp_processor_id() + 1, but to avoid hitting the edge of nr_cpus,
we skip execution in the loop body instead.

Closes: https://lore.kernel.org/bpf/CAPPBnEa1_pZ6W24+WwtcNFvTUHTHO7KUmzEbOcMqxp+m2o15qQ@mail.gmail.com
Closes: https://lore.kernel.org/bpf/CAPPBnEYm+9zduStsZaDnq93q1jPLqO-PiKX9jy0MuL8LCXmCrQ@mail.gmail.com
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250316040541.108729-21-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Convert hashtab.c to rqspinlock

Convert hashtab.c from raw_spinlock to rqspinlock, and drop the hashed
per-cpu counter crud from the code base which is no longer necessary.

Closes: https://lore.kernel.org/bpf/675302fd.050a0220.2477f.0004.GAE@google.com
Closes: https://lore.kernel.org/bpf/000000000000b3e63e061eed3f6b@google.com
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250316040541.108729-20-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

rqspinlock: Add locktorture support

Introduce locktorture support for rqspinlock using the newly added
macros as the first in-kernel user and consumer. Guard the code with
CONFIG_BPF_SYSCALL ifdef since rqspinlock is not available otherwise.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250316040541.108729-19-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

rqspinlock: Add entry to Makefile, MAINTAINERS

Ensure that the rqspinlock code is only built when the BPF subsystem is
compiled in. Depending on queued spinlock support, we may or may not end
up building the queued spinlock slowpath, and instead fallback to the
test-and-set implementation. Also add entries to MAINTAINERS file.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250316040541.108729-18-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

rqspinlock: Add macros for rqspinlock usage

Introduce helper macros that wrap around the rqspinlock slow path and
provide an interface analogous to the raw_spin_lock API. Note that
in case of error conditions, preemption and IRQ disabling is
automatically unrolled before returning the error back to the caller.

Ensure that in absence of CONFIG_QUEUED_SPINLOCKS support, we fallback
to the test-and-set implementation.

Add some comments describing the subtle memory ordering logic during
unlock, and why it's safe.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250316040541.108729-17-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

rqspinlock: Add basic support for CONFIG_PARAVIRT

We ripped out PV and virtualization related bits from rqspinlock in an
earlier commit, however, a fair lock performs poorly within a virtual
machine when the lock holder is preempted. As such, retain the
virt_spin_lock fallback to test and set lock, but with timeout and
deadlock detection. We can do this by simply depending on the
resilient_tas_spin_lock implementation from the previous patch.

We don't integrate support for CONFIG_PARAVIRT_SPINLOCKS yet, as that
requires more involved algorithmic changes and introduces more
complexity. It can be done when the need arises in the future.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250316040541.108729-15-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

rqspinlock: Add a test-and-set fallback

Include a test-and-set fallback when queued spinlock support is not
available. Introduce a rqspinlock type to act as a fallback when
qspinlock support is absent.

Include ifdef guards to ensure the slow path in this file is only
compiled when CONFIG_QUEUED_SPINLOCKS=y. Subsequent patches will add
further logic to ensure fallback to the test-and-set implementation
when queued spinlock support is unavailable on an architecture.

Unlike other waiting loops in rqspinlock code, the one for test-and-set
has no theoretical upper bound under contention, therefore we need a
longer timeout than usual. Bump it up to a second in this case.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250316040541.108729-14-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

rqspinlock: Add deadlock detection and recovery

While the timeout logic provides guarantees for the waiter's forward
progress, the time until a stalling waiter unblocks can still be long.
The default timeout of 1/4 sec can be excessively long for some use
cases. Additionally, custom timeouts may exacerbate recovery time.

Introduce logic to detect common cases of deadlocks and perform quicker
recovery. This is done by dividing the time from entry into the locking
slow path until the timeout into intervals of 1 ms. Then, after each
interval elapses, deadlock detection is performed, while also polling
the lock word to ensure we can quickly break out of the detection logic
and proceed with lock acquisition.

A 'held_locks' table is maintained per-CPU where the entry at the bottom
denotes a lock being waited for or already taken. Entries coming before
it denote locks that are already held. The current CPU's table can thus
be looked at to detect AA deadlocks. The tables from other CPUs can be
looked at to discover ABBA situations. Finally, when a matching entry
for the lock being taken on the current CPU is found on some other CPU,
a deadlock situation is detected. This function can take a long time,
therefore the lock word is constantly polled in each loop iteration to
ensure we can preempt detection and proceed with lock acquisition, using
the is_lock_released check.

We set 'spin' member of rqspinlock_timeout struct to 0 to trigger
deadlock checks immediately to perform faster recovery.

Note: Extending lock word size by 4 bytes to record owner CPU can allow
faster detection for ABBA. It is typically the owner which participates
in a ABBA situation. However, to keep compatibility with existing lock
words in the kernel (struct qspinlock), and given deadlocks are a rare
event triggered by bugs, we choose to favor compatibility over faster
detection.

The release_held_lock_entry function requires an smp_wmb, while the
release store on unlock will provide the necessary ordering for us. Add
comments to document the subtleties of why this is correct. It is
possible for stores to be reordered still, but in the context of the
deadlock detection algorithm, a release barrier is sufficient and
needn't be stronger for unlock's case.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250316040541.108729-13-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

rqspinlock: Protect waiters in trylock fallback from stalls

When we run out of maximum rqnodes, the original queued spin lock slow
path falls back to a try lock. In such a case, we are again susceptible
to stalls in case the lock owner fails to make progress. We use the
timeout as a fallback to break out of this loop and return to the
caller. This is a fallback for an extreme edge case, when on the same
CPU we run out of all 4 qnodes. When could this happen? We are in slow
path in task context, we get interrupted by an IRQ, which while in the
slow path gets interrupted by an NMI, whcih in the slow path gets
another nested NMI, which enters the slow path. All of the interruptions
happen after node->count++.

We use RES_DEF_TIMEOUT as our spinning duration, but in the case of this
fallback, no fairness is guaranteed, so the duration may be too small
for contended cases, as the waiting time is not bounded. Since this is
an extreme corner case, let's just prefer timing out instead of
attempting to spin for longer.

Reviewed-by: Barret Rhoden <brho@google.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250316040541.108729-12-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

rqspinlock: Protect waiters in queue from stalls

Implement the wait queue cleanup algorithm for rqspinlock. There are
three forms of waiters in the original queued spin lock algorithm. The
first is the waiter which acquires the pending bit and spins on the lock
word without forming a wait queue. The second is the head waiter that is
the first waiter heading the wait queue. The third form is of all the
non-head waiters queued behind the head, waiting to be signalled through
their MCS node to overtake the responsibility of the head.

In this commit, we are concerned with the second and third kind. First,
we augment the waiting loop of the head of the wait queue with a
timeout. When this timeout happens, all waiters part of the wait queue
will abort their lock acquisition attempts. This happens in three steps.
First, the head breaks out of its loop waiting for pending and locked
bits to turn to 0, and non-head waiters break out of their MCS node spin
(more on that later). Next, every waiter (head or non-head) attempts to
check whether they are also the tail waiter, in such a case they attempt
to zero out the tail word and allow a new queue to be built up for this
lock. If they succeed, they have no one to signal next in the queue to
stop spinning. Otherwise, they signal the MCS node of the next waiter to
break out of its spin and try resetting the tail word back to 0. This
goes on until the tail waiter is found. In case of races, the new tail
will be responsible for performing the same task, as the old tail will
then fail to reset the tail word and wait for its next pointer to be
updated before it signals the new tail to do the same.

We terminate the whole wait queue because of two main reasons. Firstly,
we eschew per-waiter timeouts with one applied at the head of the wait
queue. This allows everyone to break out faster once we've seen the
owner / pending waiter not responding for the timeout duration from the
head. Secondly, it avoids complicated synchronization, because when not
leaving in FIFO order, prev's next pointer needs to be fixed up etc.

Lastly, all of these waiters release the rqnode and return to the
caller. This patch underscores the point that rqspinlock's timeout does
not apply to each waiter individually, and cannot be relied upon as an
upper bound. It is possible for the rqspinlock waiters to return early
from a failed lock acquisition attempt as soon as stalls are detected.

The head waiter cannot directly WRITE_ONCE the tail to zero, as it may
race with a concurrent xchg and a non-head waiter linking its MCS node
to the head's MCS node through 'prev->next' assignment.

One notable thing is that we must use RES_DEF_TIMEOUT * 2 as our maximum
duration for the waiting loop (for the wait queue head), since we may
have both the owner and pending bit waiter ahead of us, and in the worst
case, need to span their maximum permitted critical section lengths.

Reviewed-by: Barret Rhoden <brho@google.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250316040541.108729-11-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

rqspinlock: Protect pending bit owners from stalls

The pending bit is used to avoid queueing in case the lock is
uncontended, and has demonstrated benefits for the 2 contender scenario,
esp. on x86. In case the pending bit is acquired and we wait for the
locked bit to disappear, we may get stuck due to the lock owner not
making progress. Hence, this waiting loop must be protected with a
timeout check.

To perform a graceful recovery once we decide to abort our lock
acquisition attempt in this case, we must unset the pending bit since we
own it. All waiters undoing their changes and exiting gracefully allows
the lock word to be restored to the unlocked state once all participants
(owner, waiters) have been recovered, and the lock remains usable.
Hence, set the pending bit back to zero before returning to the caller.

Introduce a lockevent (rqspinlock_lock_timeout) to capture timeout
event statistics.

Reviewed-by: Barret Rhoden <brho@google.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250316040541.108729-10-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

rqspinlock: Hardcode cond_acquire loops for arm64

Currently, for rqspinlock usage, the implementation of
smp_cond_load_acquire (and thus, atomic_cond_read_acquire) are
susceptible to stalls on arm64, because they do not guarantee that the
conditional expression will be repeatedly invoked if the address being
loaded from is not written to by other CPUs. When support for
event-streams is absent (which unblocks stuck WFE-based loops every
~100us), we may end up being stuck forever.

This causes a problem for us, as we need to repeatedly invoke the
RES_CHECK_TIMEOUT in the spin loop to break out when the timeout
expires.

Let us import the smp_cond_load_acquire_timewait implementation Ankur is
proposing in [0], and then fallback to it once it is merged.

While we rely on the implementation to amortize the cost of sampling
check_timeout for us, it will not happen when event stream support is
unavailable. This is not the common case, and it would be difficult to
fit our logic in the time_expr_ns >= time_limit_ns comparison, hence
just let it be.

[0]: https://lore.kernel.org/lkml/20250203214911.898276-1-ankur.a.arora@oracle.com

Cc: Ankur Arora <ankur.a.arora@oracle.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250316040541.108729-9-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

rqspinlock: Add support for timeouts

Introduce policy macro RES_CHECK_TIMEOUT which can be used to detect
when the timeout has expired for the slow path to return an error. It
depends on being passed two variables initialized to 0: ts, ret. The
'ts' parameter is of type rqspinlock_timeout.

This macro resolves to the (ret) expression so that it can be used in
statements like smp_cond_load_acquire to break the waiting loop
condition.

The 'spin' member is used to amortize the cost of checking time by
dispatching to the implementation every 64k iterations. The
'timeout_end' member is used to keep track of the timestamp that denotes
the end of the waiting period. The 'ret' parameter denotes the status of
the timeout, and can be checked in the slow path to detect timeouts
after waiting loops.

The 'duration' member is used to store the timeout duration for each
waiting loop. The default timeout value defined in the header
(RES_DEF_TIMEOUT) is 0.25 seconds.

This macro will be used as a condition for waiting loops in the slow
path. Since each waiting loop applies a fresh timeout using the same
rqspinlock_timeout, we add a new RES_RESET_TIMEOUT as well to ensure the
values can be easily reinitialized to the default state.

Reviewed-by: Barret Rhoden <brho@google.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250316040541.108729-8-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

rqspinlock: Drop PV and virtualization support

Changes to rqspinlock in subsequent commits will be algorithmic
modifications, which won't remain in agreement with the implementations
of paravirt spin lock and virt_spin_lock support. These future changes
include measures for terminating waiting loops in slow path after a
certain point. While using a fair lock like qspinlock directly inside
virtual machines leads to suboptimal performance under certain
conditions, we cannot use the existing virtualization support before we
make it resilient as well. Therefore, drop it for now.

Note that we need to drop qspinlock_stat.h, as it's only relevant in
case of CONFIG_PARAVIRT_SPINLOCKS=y, but we need to keep lock_events.h
in the includes, which was indirectly pulled in before.

Reviewed-by: Barret Rhoden <brho@google.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250316040541.108729-7-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

rqspinlock: Add rqspinlock.h header

This header contains the public declarations usable in the rest of the
kernel for rqspinlock.

Let's also type alias qspinlock to rqspinlock_t to ensure consistent use
of the new lock type. We want to remove dependence on the qspinlock type
in later patches as we need to provide a test-and-set fallback, hence
begin abstracting away from now onwards.

Reviewed-by: Barret Rhoden <brho@google.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250316040541.108729-6-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

locking: Copy out qspinlock.c to kernel/bpf/rqspinlock.c

In preparation for introducing a new lock implementation, Resilient
Queued Spin Lock, or rqspinlock, we first begin our modifications by
using the existing qspinlock.c code as the base. Simply copy the code to
a new file and rename functions and variables from 'queued' to
'resilient_queued'.

Since we place the file in kernel/bpf, include needs to be relative.

This helps each subsequent commit in clearly showing how and where the
code is being changed. The only change after a literal copy in this
commit is renaming the functions where necessary, and rename qnodes to
rqnodes. Let's also use EXPORT_SYMBOL_GPL for rqspinlock slowpath.

Reviewed-by: Barret Rhoden <brho@google.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250316040541.108729-5-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

locking: Allow obtaining result of arch_mcs_spin_lock_contended

To support upcoming changes that require inspecting the return value
once the conditional waiting loop in arch_mcs_spin_lock_contended
terminates, modify the macro to preserve the result of
smp_cond_load_acquire. This enables checking the return value as needed,
which will help disambiguate the MCS node’s locked state in future
patches.

Reviewed-by: Barret Rhoden <brho@google.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250316040541.108729-4-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

locking: Move common qspinlock helpers to a private header

Move qspinlock helper functions that encode, decode tail word, set and
clear the pending and locked bits, and other miscellaneous definitions
and macros to a private header. To this end, create a qspinlock.h header
file in kernel/locking. Subsequent commits will introduce a modified
qspinlock slow path function, thus moving shared code to a private
header will help minimize unnecessary code duplication.

Reviewed-by: Barret Rhoden <brho@google.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250316040541.108729-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

locking: Move MCS struct definition to public header

Move the definition of the struct mcs_spinlock from the private
mcs_spinlock.h header in kernel/locking to the mcs_spinlock.h
asm-generic header, since we will need to reference it from the
qspinlock.h header in subsequent commits.

Reviewed-by: Barret Rhoden <brho@google.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250316040541.108729-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Make perf_event_read_output accessible in all program types.

The perf_event_read_event_output helper is currently only available to
tracing protrams, but is useful for other BPF programs like sched_ext
schedulers. When the helper is available, provide its bpf_func_proto
directly from the bpf base_proto.

Signed-off-by: Emil Tsalapatis (Meta) <emil@etsalapatis.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20250318030753.10949-1-emil@etsalapatis.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Merge branch 'bpftool-using-the-right-format-specifiers'

Jiayuan Chen says:

====================
bpftool: Using the right format specifiers

This patch adds the -Wformat-signedness compiler flag to detect and
prevent format string errors, where signed or unsigned types are
mismatched with format specifiers. Additionally, it fixes some format
string errors that were not fully addressed by the previous patch [1].

[1] https://lore.kernel.org/bpf/20250207123706.727928-1-mrpre@163.com/T/#u
---
v1->v2:
https://lore.kernel.org/bpf/20250310142037.45932-1-jiayuan.chen@linux.dev/
---
====================

Link: https://patch.msgid.link/20250311112809.81901-1-jiayuan.chen@linux.dev
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>

bpftool: Using the right format specifiers

Fixed some formatting specifiers errors, such as using %d for int and %u
for unsigned int, as well as other byte-length types.

Perform type cast using the type derived from the data type itself, for
example, if it's originally an int, it will be cast to unsigned int if
forced to unsigned.

Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250311112809.81901-3-jiayuan.chen@linux.dev

bpftool: Add -Wformat-signedness flag to detect format errors

This commit adds the -Wformat-signedness compiler flag to detect and
prevent printf format errors, where signed or unsigned types are
mismatched with format specifiers. This helps to catch potential issues at
compile-time, ensuring that our code is more robust and reliable. With
this flag, the compiler will now warn about incorrect format strings, such
as using %d with unsigned types or %u with signed types.

Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250311112809.81901-2-jiayuan.chen@linux.dev

Merge branch 'support-freplace-prog-from-user-namespace'

Mykyta Yatsenko says:

====================
Support freplace prog from user namespace

From: Mykyta Yatsenko <yatsenko@meta.com>

Freplace programs can't be loaded from user namespace, as
bpf_program__set_attach_target() requires searching for target prog BTF,
which is locked under CAP_SYS_ADMIN.
This patch set enables this use case by:
1. Relaxing capable check in bpf's BPF_BTF_GET_FD_BY_ID, check for CAP_BPF
instead of CAP_SYS_ADMIN, support BPF token in attr argument.
2. Pass BPF token around libbpf from bpf_program__set_attach_target() to
bpf syscall where capable check is.
3. Validate positive/negative scenarios in selftests

This patch set is enabled by the recent libbpf change[1], that
introduced bpf_object__prepare() API. Calling bpf_object__prepare() for
freplace program before bpf_program__set_attach_target() initializes BPF
token, which is then passed to bpf syscall by libbpf.

[1] https://lore.kernel.org/all/20250303135752.158343-1-mykyta.yatsenko5@gmail.com/
====================

Link: https://patch.msgid.link/20250317174039.161275-1-mykyta.yatsenko5@gmail.com
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>

selftests/bpf: Test freplace from user namespace

Add selftests to verify that it is possible to load freplace program
from user namespace if BPF token is initialized by bpf_object__prepare
before calling bpf_program__set_attach_target.
Negative test is added as well.

Modified type of the priv_prog to xdp, as kprobe did not work on aarch64
and s390x.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/bpf/20250317174039.161275-5-mykyta.yatsenko5@gmail.com

libbpf: Pass BPF token from find_prog_btf_id to BPF_BTF_GET_FD_BY_ID

Pass BPF token from bpf_program__set_attach_target to
BPF_BTF_GET_FD_BY_ID bpf command.
When freplace program attaches to target program, it needs to look up
for BTF of the target, this may require BPF token, if, for example,
running from user namespace.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/bpf/20250317174039.161275-4-mykyta.yatsenko5@gmail.com

bpf: Return prog btf_id without capable check

Return prog's btf_id from bpf_prog_get_info_by_fd regardless of capable
check. This patch enables scenario, when freplace program, running
from user namespace, requires to query target prog's btf.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/bpf/20250317174039.161275-3-mykyta.yatsenko5@gmail.com

bpf: BPF token support for BPF_BTF_GET_FD_BY_ID

Currently BPF_BTF_GET_FD_BY_ID requires CAP_SYS_ADMIN, which does not
allow running it from user namespace. This creates a problem when
freplace program running from user namespace needs to query target
program BTF.
This patch relaxes capable check from CAP_SYS_ADMIN to CAP_BPF and adds
support for BPF token that can be passed in attributes to syscall.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250317174039.161275-2-mykyta.yatsenko5@gmail.com

bpf, x86: Fix objtool warning for timed may_goto

Kernel test robot reported "call without frame pointer save/setup"
warning in objtool. This will make stack traces unreliable on
CONFIG_UNWINDER_FRAME_POINTER=y, however it works on
CONFIG_UNWINDER_ORC=y. Fix this by creating a stack frame for the
function.

Fixes: 2fb761823ead ("bpf, x86: Add x86 JIT support for timed may_goto")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202503071350.QOhsHVaW-lkp@intel.com/
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250315013039.1625048-1-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Check map->record at the beginning of check_and_free_fields()

When there are no special fields in the map value, there is no need to
invoke bpf_obj_free_fields(). Therefore, checking the validity of
map->record in advance.

After the change, the benchmark result of the per-cpu update case in
map_perf_test increased by 40% under a 16-CPU VM.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250315150930.1511727-1-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Fix sockopt selftest failure on powerpc

The SO_RCVLOWAT option is defined as 18 in the selftest header,
which matches the generic definition. However, on powerpc,
SO_RCVLOWAT is defined as 16. This discrepancy causes
sol_socket_sockopt() to fail with the default switch case on powerpc.

This commit fixes by defining SO_RCVLOWAT as 16 for powerpc.

Signed-off-by: Saket Kumar Bhaskar <skb99@linux.ibm.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Link: https://lore.kernel.org/bpf/20250311084647.3686544-1-skb99@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Fix string read in strncmp benchmark

The strncmp benchmark uses the bpf_strncmp helper and a hand-written
loop to compare two strings. The values of the strings are filled from
userspace. One of the strings is non-const (in .bss) while the other is
const (in .rodata) since that is the requirement of bpf_strncmp.

The problem is that in the hand-written loop, Clang optimizes the reads
from the const string to always return 0 which breaks the benchmark.

Use barrier_var to prevent the optimization.

The effect can be seen on the strncmp-no-helper variant.

Before this change:

    # ./bench strncmp-no-helper
    Setting up benchmark 'strncmp-no-helper'...
    Benchmark 'strncmp-no-helper' started.
    Iter   0 (112.309us): hits    0.000M/s (  0.000M/prod), drops    0.000M/s, total operations    0.000M/s
    Iter   1 (-23.238us): hits    0.000M/s (  0.000M/prod), drops    0.000M/s, total operations    0.000M/s
    Iter   2 ( 58.994us): hits    0.000M/s (  0.000M/prod), drops    0.000M/s, total operations    0.000M/s
    Iter   3 (-30.466us): hits    0.000M/s (  0.000M/prod), drops    0.000M/s, total operations    0.000M/s
    Iter   4 ( 29.996us): hits    0.000M/s (  0.000M/prod), drops    0.000M/s, total operations    0.000M/s
    Iter   5 ( 16.949us): hits    0.000M/s (  0.000M/prod), drops    0.000M/s, total operations    0.000M/s
    Iter   6 (-60.035us): hits    0.000M/s (  0.000M/prod), drops    0.000M/s, total operations    0.000M/s
    Summary: hits    0.000 ± 0.000M/s (  0.000M/prod), drops    0.000 ± 0.000M/s, total operations    0.000 ± 0.000M/s

After this change:

    # ./bench strncmp-no-helper
    Setting up benchmark 'strncmp-no-helper'...
    Benchmark 'strncmp-no-helper' started.
    Iter   0 ( 77.711us): hits    5.534M/s (  5.534M/prod), drops    0.000M/s, total operations    5.534M/s
    Iter   1 ( 11.215us): hits    6.006M/s (  6.006M/prod), drops    0.000M/s, total operations    6.006M/s
    Iter   2 (-14.253us): hits    5.931M/s (  5.931M/prod), drops    0.000M/s, total operations    5.931M/s
    Iter   3 ( 59.087us): hits    6.005M/s (  6.005M/prod), drops    0.000M/s, total operations    6.005M/s
    Iter   4 (-21.379us): hits    6.010M/s (  6.010M/prod), drops    0.000M/s, total operations    6.010M/s
    Iter   5 (-20.310us): hits    5.861M/s (  5.861M/prod), drops    0.000M/s, total operations    5.861M/s
    Iter   6 ( 53.937us): hits    6.004M/s (  6.004M/prod), drops    0.000M/s, total operations    6.004M/s
    Summary: hits    5.969 ± 0.061M/s (  5.969M/prod), drops    0.000 ± 0.000M/s, total operations    5.969 ± 0.061M/s

Fixes: 9c42652f8be3 ("selftests/bpf: Add benchmark for bpf_strncmp() helper")
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Viktor Malik <vmalik@redhat.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/bpf/20250313122852.1365202-1-vmalik@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Fix arena_spin_lock compilation on PowerPC

Venkat reported a compilation error for BPF selftests on PowerPC [0].
The crux of the error is the following message:
  In file included from progs/arena_spin_lock.c:7:
  /root/bpf-next/tools/testing/selftests/bpf/bpf_arena_spin_lock.h:122:8:
  error: member reference base type '__attribute__((address_space(1)))
  u32' (aka '__attribute__((address_space(1))) unsigned int') is not a
  structure or union
     122 |         old = atomic_read(&lock->val);

This is because PowerPC overrides the qspinlock type changing the
lock->val member's type from atomic_t to u32.

To remedy this, import the asm-generic version in the arena spin lock
header, name it __qspinlock (since it's aliased to arena_spinlock_t, the
actual name hardly matters), and adjust the selftest to not depend on
the type in vmlinux.h.

  [0]: https://lore.kernel.org/bpf/7bc80a3b-d708-4735-aa3b-6a8c21720f9d@linux.ibm.com

Fixes: 88d706ba7cc5 ("selftests/bpf: Introduce arena spin lock")
Reported-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Link: https://lore.kernel.org/bpf/20250311154244.3775505-1-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: preload: Add MODULE_DESCRIPTION

Modpost complains when extra warnings are enabled:

WARNING: modpost: missing MODULE_DESCRIPTION() in kernel/bpf/preload/bpf_preload.o

Add a description from the Kconfig help text.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250310134920.4123633-1-arnd@kernel.org
----
Not sure if that description actually fits what the module does. If not,
please add a different description instead.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: bpftool: Setting error code in do_loader()

We are missing setting error code in do_loader() when
bpf_object__open_file() fails. This means the command's exit status code
will be successful, even though the operation failed. So make sure to
return the correct error code. To maintain consistency with other
locations where bpf_object__open_file() is called, return -1.

[0] Closes: https://github.com/libbpf/bpftool/issues/156

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Sewon Nam <swnam0729@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Tested-by: Quentin Monnet <qmo@kernel.org>
Reviewed-by: Quentin Monnet <qmo@kernel.org>
Link: https://lore.kernel.org/bpf/d3b5b4b4-19bb-4619-b4dd-86c958c4a367@stanley.mountain/t/#u
Link: https://lore.kernel.org/bpf/20250311031238.14865-1-swnam0729@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Merge branch 'security-propagate-caller-information-in-bpf-hooks'

Blaise Boscaccy says:

====================
While trying to implement an eBPF gatekeeper program, we ran into an
issue whereas the LSM hooks are missing some relevant data.

Certain subcommands passed to the bpf() syscall can be invoked from
either the kernel or userspace. Additionally, some fields in the
bpf_attr struct contain pointers, and depending on where the
subcommand was invoked, they could point to either user or kernel
memory. One example of this is the bpf_prog_load subcommand and its
fd_array. This data is made available and used by the verifier but not
made available to the LSM subsystem. This patchset simply exposes that
information to applicable LSM hooks.

Change list:
- v6 -> v7
  - use gettid/pid in lieu of getpid/tgid in test condition
- v5 -> v6
  - fix regression caused by is_kernel renaming
  - simplify test logic
- v4 -> v5
  - merge v4 selftest breakout patch back into a single patch
  - change "is_kernel" to "kernel"
  - add selftest using new kernel flag
- v3 -> v4
  - split out selftest changes into a separate patch
- v2 -> v3
  - reorder params so that the new boolean flag is the last param
  - fixup function signatures in bpf selftests
- v1 -> v2
  - Pass a boolean flag in lieu of bpfptr_t

Revisions:
- v6
  https://lore.kernel.org/bpf/20250308013314.719150-1-bboscaccy@linux.microsoft.com/
- v5
  https://lore.kernel.org/bpf/20250307213651.3065714-1-bboscaccy@linux.microsoft.com/
- v4
  https://lore.kernel.org/bpf/20250304203123.3935371-1-bboscaccy@linux.microsoft.com/
- v3
  https://lore.kernel.org/bpf/20250303222416.3909228-1-bboscaccy@linux.microsoft.com/
- v2
  https://lore.kernel.org/bpf/20250228165322.3121535-1-bboscaccy@linux.microsoft.com/
- v1
  https://lore.kernel.org/bpf/20250226003055.1654837-1-bboscaccy@linux.microsoft.com/
====================

Link: https://patch.msgid.link/20250310221737.821889-1-bboscaccy@linux.microsoft.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Convert comma to semicolon

Replace comma between expressions with semicolons.

Using a ',' in place of a ';' can have unintended side effects.
Although that is not the case here, it is seems best to use ';'
unless ',' is intended.

Found by inspection.
No functional change intended.
Compile tested only.

Signed-off-by: Chen Ni <nichen@iscas.ac.cn>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Anton Protopopov <aspsk@isovalent.com>
Link: https://lore.kernel.org/bpf/20250310032045.651068-1-nichen@iscas.ac.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Add a kernel flag test for LSM bpf hook

This test exercises the kernel flag added to security_bpf by
effectively blocking light-skeletons from loading while allowing
normal skeletons to function as-is. Since this should work with any
arbitrary BPF program, an existing program from LSKELS_EXTRA was
used as a test payload.

Signed-off-by: Blaise Boscaccy <bboscaccy@linux.microsoft.com>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20250310221737.821889-3-bboscaccy@linux.microsoft.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Fix selection of static vs. dynamic LLVM

The Makefile uses the exit code of the `llvm-config --link-static --libs`
command to choose between statically-linked and dynamically-linked LLVMs.
The stdout and stderr of that command are redirected to /dev/null.
To redirect the output the "&>" construction is used, which might not be
supported by /bin/sh, which is executed by make for $(shell ...) commands.
On such systems the test will fail even if static LLVM is actually
supported. Replace "&>" by ">/dev/null 2>&1" to fix this.

Fixes: 2a9d30fac818 ("selftests/bpf: Support dynamically linking LLVM if static is not available")
Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/bpf/20250310145112.1261241-1-aspsk@isovalent.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

security: Propagate caller information in bpf hooks

Certain bpf syscall subcommands are available for usage from both
userspace and the kernel. LSM modules or eBPF gatekeeper programs may
need to take a different course of action depending on whether or not
a BPF syscall originated from the kernel or userspace.

Additionally, some of the bpf_attr struct fields contain pointers to
arbitrary memory. Currently the functionality to determine whether or
not a pointer refers to kernel memory or userspace memory is exposed
to the bpf verifier, but that information is missing from various LSM
hooks.

Here we augment the LSM hooks to provide this data, by simply passing
a boolean flag indicating whether or not the call originated in the
kernel, in any hook that contains a bpf_attr struct that corresponds
to a subcommand that may be called from the kernel.

Signed-off-by: Blaise Boscaccy <bboscaccy@linux.microsoft.com>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Paul Moore <paul@paul-moore.com>
Link: https://lore.kernel.org/r/20250310221737.821889-2-bboscaccy@linux.microsoft.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Merge branch 'bpf-introduce-helper-for-populating-bpf_cpumask'

Emil Tsalapatis says:

====================
bpf: introduce helper for populating bpf_cpumask

Some BPF programs like scx schedulers have their own internal CPU mask types,
mask types, which they must transform into struct bpf_cpumask instances
before passing them to scheduling-related kfuncs. There is currently no
way to efficiently populate the bitfield of a bpf_cpumask from BPF memory,
and programs must use multiple bpf_cpumask_[set, clear] calls to do so.
Introduce a kfunc helper to populate the bitfield of a bpf_cpumask from valid
BPF memory with a single call.

Changelog :
-----------
v6->v7
v6:https://lore.kernel.org/bpf/20250307153847.8530-1-emil@etsalapatis.com/

Addressed feedback by Hou Tao:
* Removed RUN_TESTS invocation causing tests to run twice
* Added is_test_task guard to new selftests
* Removed extraneous __success attribute from existing selftests

v5->v6
v5:https://lore.kernel.org/bpf/20250307041738.6665-1-emil@etsalapatis.com/

Addressed feedback by Hou Tao:
* Removed __success attributes from cpumask selftests
* Fixed stale patch description that used old function name

v4->v5
v4: https://lore.kernel.org/bpf/20250305211235.368399-1-emil@etsalapatis.com/

Addressed feedback by Hou Tao:
* Readded the tests in tools/selftests/bpf/prog_tests/cpumask.c,
turns out the selftest entries were not duplicates.
* Removed stray whitespace in selftest.
* Add patch the missing selftest to prog_tests/cpumask.c
* Explicitly annotate all cpumask selftests with __success

The last patch could very well be its own cleanup patch, but I rolled it into
this series because it came up in the discussion. If the last patch in the
series has any issues I'd be fine with applying the first 3 patches and dealing
with it separately.

v3->v4
v3: https://lore.kernel.org/bpf/20250305161327.203396-1-emil@etsalapatis.com/

* Removed new tests from tools/selftests/bpf/prog_tests/cpumask.c because
they were being run twice.

Addressed feedback by Alexei Starovoitov:
* Added missing return value in function kdoc
* Added an additional patch fixing some missing kdoc fields in
kernel/bpf/cpumask.c

Addressed feedback by Tejun Heo:
* Renamed the kfunc to bpf_cpumask_populate to avoid confusion
w/ bitmap_fill()

v2->v3
v2: https://lore.kernel.org/bpf/20250305021020.1004858-1-emil@etsalapatis.com/

Addressed feedback by Alexei Starovoitov:
* Added back patch descriptions dropped from v1->v2
* Elide the alignment check for archs with efficient
unaligned accesses

v1->v2
v1: https://lore.kernel.org/bpf/20250228003321.1409285-1-emil@etsalapatis.com/

Addressed feedback by Hou Tao:
* Add check that the input buffer is aligned to sizeof(long)
* Adjust input buffer size check to use bitmap_size()
* Add selftest for checking the bit pattern of the bpf_cpumask
* Moved all selftests into existing files

Signed-off-by: Emil Tsalapatis (Meta) <emil@etsalapatis.com>
====================

Link: https://patch.msgid.link/20250309230427.26603-1-emil@etsalapatis.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests: bpf: fix duplicate selftests in cpumask_success.

The BPF cpumask selftests are currently run twice in
test_progs/cpumask.c, once by traversing cpumask_success_testcases, and
once by invoking RUN_TESTS(cpumask_success). Remove the invocation of
RUN_TESTS to properly run the selftests only once.

Now that the tests are run only through cpumask_success_testscases, add
to it the missing test_refcount_null_tracking testcase. Also remove the
__success annotation from it, since it is now loaded and invoked by the
runner.

Signed-off-by: Emil Tsalapatis (Meta) <emil@etsalapatis.com>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20250309230427.26603-5-emil@etsalapatis.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Merge branch 'selftests-bpf-move-test_lwt_seg6local-to-test_progs'

Bastien Curutchet says:

====================
This patch series continues the work to migrate the script tests into
prog_tests.

test_lwt_seg6local.sh tests some bpf_lwt_* helpers. It contains only one
test that uses a network topology quite different than the ones that
can be found in others prog_tests/lwt_*.c files so I add a new
prog_tests/lwt_seg6local.c file.

While working on the migration I noticed that some routes present in the
script weren't needed so PATCH 1 deletes them and then PATCH 2 migrates
the test into the test_progs framework.
====================

Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250307-seg6local-v1-0-990fff8f180d@bootlin.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: fix missing kdoc string fields in cpumask.c

Some bpf_cpumask-related kfuncs have kdoc strings that are missing
return values. Add a the missing descriptions for the return values.

Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Emil Tsalapatis (Meta) <emil@etsalapatis.com>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20250309230427.26603-4-emil@etsalapatis.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Fix cap_enable_effective() return code

The caller of cap_enable_effective() expects negative error code.
Fix it.

Before:
  failed to restore CAP_SYS_ADMIN: -1, Unknown error -1

After:
  failed to restore CAP_SYS_ADMIN: -3, No such process
  failed to restore CAP_SYS_ADMIN: -22, Invalid argument

Signed-off-by: Feng Yang <yangfeng@kylinos.cn>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250305022234.44932-1-yangfeng59949@163.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: lwt_seg6local: Move test to test_progs

test_lwt_seg6local.sh isn't used by the BPF CI.

Add a new file in the test_progs framework to migrate the tests done by
test_lwt_seg6local.sh. It uses the same network topology and the same BPF
programs located in progs/test_lwt_seg6local.c.
Use the network helpers instead of `nc` to exchange the final packet.

Remove test_lwt_seg6local.sh and its Makefile entry.

Signed-off-by: Bastien Curutchet (eBPF Foundation) <bastien.curutchet@bootlin.com>
Link: https://lore.kernel.org/r/20250307-seg6local-v1-2-990fff8f180d@bootlin.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests: bpf: add bpf_cpumask_populate selftests

Add selftests for the bpf_cpumask_populate helper that sets a
bpf_cpumask to a bit pattern provided by a BPF program.

Signed-off-by: Emil Tsalapatis (Meta) <emil@etsalapatis.com>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20250309230427.26603-3-emil@etsalapatis.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Fix dangling stdout seen by traffic monitor thread

Traffic monitor thread may see dangling stdout as the main thread closes
and reassigns stdout without protection. This happens when the main thread
finishes one subtest and moves to another one in the same netns_new()
scope.

The issue can be reproduced by running test_progs repeatedly with traffic
monitor enabled:

for ((i=1;i<=100;i++)); do
./test_progs -a flow_dissector_skb* -m '*'
done

For restoring stdout in crash_handler(), since it does not really care
about closing stdout, simlpy flush stdout and restore it to the original
one.

Then, Fix the issue by consolidating stdio_restore_cleanup() and
stdio_restore(), and protecting the use/close/assignment of stdout with
a lock. The locking in the main thread is always performed regradless of
whether traffic monitor is running or not for simplicity. It won't have
any side-effect.

Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://patch.msgid.link/20250305182057.2802606-3-ameryhung@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: lwt_seg6local: Remove unused routes

Some routes in fb00:: are initialized during setup, even though they
aren't needed by the test as the UDP packets will travel through the
lightweight tunnels.

Remove these unnecessary routes.

Signed-off-by: Bastien Curutchet (eBPF Foundation) <bastien.curutchet@bootlin.com>
Link: https://lore.kernel.org/r/20250307-seg6local-v1-1-990fff8f180d@bootlin.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: add kfunc for populating cpumask bits

Add a helper kfunc that sets the bitmap of a bpf_cpumask from BPF memory.

Signed-off-by: Emil Tsalapatis (Meta) <emil@etsalapatis.com>
Acked-by: Hou Tao <houtao1@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20250309230427.26603-2-emil@etsalapatis.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Allow assigning traffic monitor print function

Allow users to change traffic monitor's print function. If not provided,
traffic monitor will print to stdout by default.

Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20250305182057.2802606-2-ameryhung@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Clean up call sites of stdio_restore()

reset_affinity() and save_ns() are only called in run_one_test(). There is
no need to call stdio_restore() in reset_affinity() and save_ns() if
stdio_restore() is moved right after a test finishes in run_one_test().

Also remove an unnecessary check of env.stdout_saved in crash_handler()
by moving env.stdout_saved assignment to the beginning of main().

Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://patch.msgid.link/20250305182057.2802606-1-ameryhung@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Move test_lwt_ip_encap to test_progs

test_lwt_ip_encap.sh isn't used by the BPF CI.

Add a new file in the test_progs framework to migrate the tests done by
test_lwt_ip_encap.sh. It uses the same network topology and the same BPF
programs located in progs/test_lwt_ip_encap.c.
Rework the GSO part to avoid using nc and dd.

Remove test_lwt_ip_encap.sh and its Makefile entry.

Signed-off-by: Bastien Curutchet (eBPF Foundation) <bastien.curutchet@bootlin.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20250304-lwt_ip-v1-1-8fdeb9e79a56@bootlin.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Merge branch 'arena-spin-lock'

Kumar Kartikeya Dwivedi says:

====================
Arena Spin Lock

This set provides an implementation of queued spin lock for arena.
There is no support for resiliency and recovering from deadlocks yet.
We will wait for the rqspinlock patch set to land before incorporating
support.

One minor change compared to the qspinlock algorithm in the kernel is
that we don't have the trylock fallback when nesting count exceeds 4.
The maximum number of supported CPUs is 1024, but this can be increased
in the future if necessary.

The API supports returning an error, so resiliency support can be added
in the future. Callers are still expected to check for and handle any
potential errors.

Errors are returned when the spin loops time out, when the number of
CPUs is greater than 1024, or when the extreme edge case of NMI
interrupting NMI interrupting HardIRQ interrupting SoftIRQ interrupting
task, all of them simultaneously in slow path, occurs, which is
unsupported.

Changelog:
----------
v4 -> v5
v4: https://lore.kernel.org/bpf/20250305045136.2614132-1-memxor@gmail.com

* Add better comment and document LLVM bug for __unqual_typeof.
* Switch to precise counting in the selftest and simplify test.
* Add comment about return value handling.
* Reduce size for 100k to 50k to cap test runtime.

v3 -> v4
v3: https://lore.kernel.org/bpf/20250305011849.1168917-1-memxor@gmail.com

* Drop extra corruption handling case in decode_tail.
* Stick to 1, 1k, 100k critical section sizes.
* Fix unqual_typeof to not cast away arena tag for pointers.
* Remove hack to skip first qnode.
* Choose 100 as repeat count, 1000 is too much for 100k size.
* Use pthread_barrier in test.

v2 -> v3
v2: https://lore.kernel.org/bpf/20250118162238.2621311-1-memxor@gmail.com

* Rename to arena_spin_lock
* Introduce cond_break_label macro to jump to label from cond_break.
* Drop trylock fallback when nesting count exceeds 4.
* Fix bug in try_cmpxchg implementation.
* Add tests with critical sections of varying lengths.
* Add comments for _Generic trick to drop __arena tag.
* Fix bug due to qnodes being placed on first page, leading to CPU 0's
node being indistinguishable from NULL.

v1 -> v2
v1: https://lore.kernel.org/bpf/20250117223754.1020174-1-memxor@gmail.com

* Fix definition of lock in selftest
====================

Link: https://patch.msgid.link/20250306035431.2186189-1-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf, docs: Fix broken link to renamed bpf_iter_task_vmas.c

This file was renamed from bpf_iter_task_vma.c.

Fixes: 45b38941c81f ("selftests/bpf: Rename bpf_iter_task_vma.c to bpf_iter_task_vmas.c")
Signed-off-by: T.J. Mercier <tjmercier@google.com>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20250304204520.201115-1-tjmercier@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Add tests for arena spin lock

Add some basic selftests for qspinlock built over BPF arena using
cond_break_label macro.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250306035431.2186189-4-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Introduce arena spin lock

Implement queued spin lock algorithm as BPF program for lock words
living in BPF arena.

The algorithm is copied from kernel/locking/qspinlock.c and adapted for
BPF use.

We first implement abstract helpers for portable atomics and
acquire/release load instructions, by relying on X86_64 presence to
elide expensive barriers and rely on implementation details of the JIT,
and fall back to slow but correct implementations elsewhere. When
support for acquire/release load/stores lands, we can improve this
state.

Then, the qspinlock algorithm is adapted to remove dependence on
multi-word atomics due to lack of support in BPF ISA. For instance,
xchg_tail cannot use 16-bit xchg, and needs to be a implemented as a
32-bit try_cmpxchg loop.

Loops which are seemingly infinite from verifier PoV are annotated with
cond_break_label macro to return an error. Only 1024 NR_CPUs are
supported.

Note that the slow path is a global function, hence the verifier doesn't
know the return value's precision. The recommended way of usage is to
always test against zero for success, and not ret < 0 for error, as the
verifier would assume ret > 0 has not been accounted for. Add comments
in the function documentation about this quirk.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250306035431.2186189-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Introduce cond_break_label

Add a new cond_break_label macro that jumps to the specified label when
the cond_break termination check fires, and allows us to better handle
the uncontrolled termination of the loop.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250306035431.2186189-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: correct use/def for may_goto instruction

may_goto instruction does not use any registers,
but in compute_insn_live_regs() it was treated as a regular
conditional jump of kind BPF_K with r0 as source register.
Thus unnecessarily marking r0 as used.

Fixes: 14c8552db644 ("bpf: simple DFA-based live registers analysis")
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250305085436.2731464-1-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Merge branch 'bpf-simple-dfa-based-live-registers-analysis'

Eduard Zingerman says:

====================
bpf: simple DFA-based live registers analysis

This patch-set introduces a simple live registers DFA analysis.
Analysis is done as a separate step before main verification pass.
Results are stored in the env->insn_aux_data for each instruction.

The change helps with iterator/callback based loops handling,
as regular register liveness marks are not finalized while
loops are processed. See veristat results in patch #2.

Note: for regular subprogram calls analysis conservatively assumes
that r1-r5 are used, and r0 is used at each 'exit' instruction.
Experiments show that adding logic handling these cases precisely has
no impact on verification performance.

The patch set was tested by disabling the current register parentage
chain liveness computation, using DFA-based liveness for registers
while assuming all stack slots as live. See discussion in [1].

Changes v2 -> v3:
- added support for BPF_LOAD_ACQ, BPF_STORE_REL atomics (Alexei);
- correct use marks for r0 for BPF_CMPXCHG.

Changes v1 -> v2:
- added a refactoring commit extracting utility functions:
  jmp_offset(), verbose_insn() (Alexei);
- added a refactoring commit extracting utility function
  get_call_summary() in order to share helper/kfunc related code with
  mark_fastcall_pattern_for_call() (Alexei);
- comment in the compute_insn_live_regs() extended (Alexei).

Changes RFC -> v1:
- parameter count for helpers and kfuncs is taken into account;
- copy_verifier_state() bugfix had been merged as a separate
  patch-set and is no longer a part of this patch set.

RFC: https://lore.kernel.org/bpf/20250122120442.3536298-1-eddyz87@gmail.com/
v1:  https://lore.kernel.org/bpf/20250228060032.1425870-1-eddyz87@gmail.com/
v2:  https://lore.kernel.org/bpf/20250304074239.2328752-1-eddyz87@gmail.com/
[1]  https://lore.kernel.org/bpf/cc29975fbaf163d0c2ed904a9a4d6d9452177542.camel@gmail.com/
====================

Link: https://patch.msgid.link/20250304195024.2478889-1-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: test cases for compute_live_registers()

Cover instructions from each kind:
- assignment
- arithmetic
- store/load
- endian conversion
- atomics
- branches, conditional branches, may_goto, calls
- LD_ABS/LD_IND
- address_space_cast

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250304195024.2478889-6-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Merge branch 'introduce-load-acquire-and-store-release-bpf-instructions'

Peilin Ye says:

====================
Introduce load-acquire and store-release BPF instructions

This patchset adds kernel support for BPF load-acquire and store-release
instructions (for background, please see [1]), including core/verifier
and arm64/x86-64 JIT compiler changes, as well as selftests.  riscv64 is
also planned to be supported.  The corresponding LLVM changes can be
found at:

  https://github.com/llvm/llvm-project/pull/108636

The first 3 patches from v4 have already been applied:

  - [bpf-next,v4,01/10] bpf/verifier: Factor out atomic_ptr_type_ok()
    https://git.kernel.org/bpf/bpf-next/c/b2d9ef71d4c9
  - [bpf-next,v4,02/10] bpf/verifier: Factor out check_atomic_rmw()
    https://git.kernel.org/bpf/bpf-next/c/d430c46c7580
  - [bpf-next,v4,03/10] bpf/verifier: Factor out check_load_mem() and check_store_reg()
    https://git.kernel.org/bpf/bpf-next/c/d38ad248fb7a

Please refer to the LLVM PR and individual kernel patches for details.
Thanks!

v5: https://lore.kernel.org/all/cover.1741046028.git.yepeilin@google.com/
v5..v6 change:

  o (Alexei) avoid using #ifndef in verifier.c

v4: https://lore.kernel.org/bpf/cover.1740978603.git.yepeilin@google.com/
v4..v5 notable changes:

  o (kernel test robot) for 32-bit arches: make the verifier reject
                        64-bit load-acquires/store-releases, and fix
                        build error in interpreter changes
    * tested ARCH=arc build following instructions from kernel test
      robot
  o (Alexei) drop Documentation/ patch (v4 10/10) for now

v3: https://lore.kernel.org/bpf/cover.1740009184.git.yepeilin@google.com/
v3..v4 notable changes:

  o (Alexei) add x86-64 JIT support (including arena)
  o add Acked-by: tags from Xu

v2: https://lore.kernel.org/bpf/cover.1738888641.git.yepeilin@google.com/
v2..v3 notable changes:

  o (Alexei) change encoding to BPF_LOAD_ACQ=0x100, BPF_STORE_REL=0x110
  o add Acked-by: tags from Ilya and Eduard
  o make new selftests depend on:
    * __clang_major__ >= 18, and
    * ENABLE_ATOMICS_TESTS is defined (currently this means -mcpu=v3 or
      v4), and
    * JIT supports load_acq/store_rel (currenty only arm64)
  o work around llvm-17 CI job failure by conditionally define
    __arena_global variables as 64-bit if __clang_major__ < 18, to make
    sure .addr_space.1 has no holes
  o add Google copyright notice in new files

v1: https://lore.kernel.org/all/cover.1737763916.git.yepeilin@google.com/
v1..v2 notable changes:

  o (Eduard) for x86 and s390, make
             bpf_jit_supports_insn(..., /*in_arena=*/true) return false
     for load_acq/store_rel
  o add Eduard's Acked-by: tag
  o (Eduard) extract LDX and non-ATOMIC STX handling into helpers, see
             PATCH v2 3/9
  o allow unpriv programs to store-release pointers to stack
  o (Alexei) make it clearer in the interpreter code (PATCH v2 4/9) that
             only W and DW are supported for atomic RMW
  o test misaligned load_acq/store_rel
  o (Eduard) other selftests/ changes:
    * test load_acq/store_rel with !atomic_ptr_type_ok() pointers:
      - PTR_TO_CTX, for is_ctx_reg()
      - PTR_TO_PACKET, for is_pkt_reg()
      - PTR_TO_FLOW_KEYS, for is_flow_key_reg()
      - PTR_TO_SOCKET, for is_sk_reg()
    * drop atomics/ tests
    * delete unnecessary 'pid' checks from arena_atomics/ tests
    * avoid depending on __BPF_FEATURE_LOAD_ACQ_STORE_REL, use
      __imm_insn() and inline asm macros instead

RFC v1: https://lore.kernel.org/all/cover.1734742802.git.yepeilin@google.com
RFC v1..v1 notable changes:

  o 1-2/8: minor verifier.c refactoring patches
  o   3/8: core/verifier changes
         * (Eduard) handle load-acquire properly in backtrack_insn()
         * (Eduard) avoid skipping checks (e.g.,
                    bpf_jit_supports_insn()) for load-acquires
         * track the value stored by store-releases, just like how
           non-atomic STX instructions are handled
         * (Eduard) add missing link in commit message
         * (Eduard) always print 'r' for disasm.c changes
  o   4/8: arm64/insn: avoid treating load_acq/store_rel as
           load_ex/store_ex
  o   5/8: arm64/insn: add load_acq/store_rel
         * (Xu) include Should-Be-One (SBO) bits in "mask" and "value",
                to avoid setting fixed bits during runtime (JIT-compile
                time)
  o   6/8: arm64 JIT compiler changes
         * (Xu) use emit_a64_add_i() for "pointer + offset" to optimize
                code emission
  o   7/8: selftests
         * (Eduard) avoid adding new tests to the 'test_verifier' runner
         * add more tests, e.g., checking mark_precise logic
  o   8/8: instruction-set.rst changes

[1] https://lore.kernel.org/all/20240729183246.4110549-1-yepeilin@google.com/

Thanks,
====================

Link: https://patch.msgid.link/cover.1741049567.git.yepeilin@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: use register liveness information for func_states_equal

Liveness analysis DFA computes a set of registers live before each
instruction. Leverage this information to skip comparison of dead
registers in func_states_equal(). This helps with convergance of
iterator processing loops, as bpf_reg_state->live marks can't be used
when loops are processed.

This has certain performance impact for selftests, here is a veristat
listing using `-f "insns_pct>5" -f "!insns<200"`

selftests:

File                  Program                        States (A)  States (B)  States  (DIFF)
--------------------  -----------------------------  ----------  ----------  --------------
arena_htab.bpf.o      arena_htab_llvm                        37          35     -2 (-5.41%)
arena_htab_asm.bpf.o  arena_htab_asm                         37          33    -4 (-10.81%)
arena_list.bpf.o      arena_list_add                         37          22   -15 (-40.54%)
dynptr_success.bpf.o  test_dynptr_copy                       22          16    -6 (-27.27%)
dynptr_success.bpf.o  test_dynptr_copy_xdp                   68          58   -10 (-14.71%)
iters.bpf.o           checkpoint_states_deletion            918          40  -878 (-95.64%)
iters.bpf.o           clean_live_states                     136          66   -70 (-51.47%)
iters.bpf.o           iter_nested_deeply_iters               43          37    -6 (-13.95%)
iters.bpf.o           iter_nested_iters                      72          62   -10 (-13.89%)
iters.bpf.o           iter_pass_iter_ptr_to_subprog          30          26    -4 (-13.33%)
iters.bpf.o           iter_subprog_iters                     68          59    -9 (-13.24%)
iters.bpf.o           loop_state_deps2                       35          32     -3 (-8.57%)
iters_css.bpf.o       iter_css_for_each                      32          29     -3 (-9.38%)
pyperf600_iter.bpf.o  on_event                              286         192   -94 (-32.87%)

Total progs: 3578
Old success: 2061
New success: 2061
States diff min:  -95.64%
States diff max:    0.00%
-100 .. -90  %: 1
-55 .. -45  %: 3
-45 .. -35  %: 2
-35 .. -25  %: 5
-20 .. -10  %: 12
-10 .. 0    %: 6

sched_ext:

File               Program                 States (A)  States (B)  States   (DIFF)
-----------------  ----------------------  ----------  ----------  ---------------
bpf.bpf.o          lavd_dispatch                 8950        7065  -1885 (-21.06%)
bpf.bpf.o          lavd_init                      516         480     -36 (-6.98%)
bpf.bpf.o          layered_dispatch               662         501   -161 (-24.32%)
bpf.bpf.o          layered_dump                   298         237    -61 (-20.47%)
bpf.bpf.o          layered_init                   523         423   -100 (-19.12%)
bpf.bpf.o          layered_init_task               24          22      -2 (-8.33%)
bpf.bpf.o          layered_runnable               151         125    -26 (-17.22%)
bpf.bpf.o          p2dq_dispatch                   66          53    -13 (-19.70%)
bpf.bpf.o          p2dq_init                      170         142    -28 (-16.47%)
bpf.bpf.o          refresh_layer_cpumasks         120          78    -42 (-35.00%)
bpf.bpf.o          rustland_init                   37          34      -3 (-8.11%)
bpf.bpf.o          rustland_init                   37          34      -3 (-8.11%)
bpf.bpf.o          rusty_select_cpu               125         108    -17 (-13.60%)
scx_central.bpf.o  central_dispatch                59          43    -16 (-27.12%)
scx_central.bpf.o  central_init                    39          28    -11 (-28.21%)
scx_nest.bpf.o     nest_init                       58          51     -7 (-12.07%)
scx_pair.bpf.o     pair_dispatch                  142         111    -31 (-21.83%)
scx_qmap.bpf.o     qmap_dispatch                  174         141    -33 (-18.97%)
scx_qmap.bpf.o     qmap_init                      768         654   -114 (-14.84%)

Total progs: 216
Old success: 186
New success: 186
States diff min:  -35.00%
States diff max:    0.00%
-35 .. -25  %: 3
-25 .. -20  %: 6
-20 .. -15  %: 6
-15 .. -5   %: 7
  -5 .. 0    %: 6

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250304195024.2478889-5-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Add selftests for load-acquire and store-release instructions

Add several ./test_progs tests:

  - arena_atomics/load_acquire
  - arena_atomics/store_release
  - verifier_load_acquire/*
  - verifier_store_release/*
  - verifier_precision/bpf_load_acquire
  - verifier_precision/bpf_store_release

The last two tests are added to check if backtrack_insn() handles the
new instructions correctly.

Additionally, the last test also makes sure that the verifier
"remembers" the value (in src_reg) we store-release into e.g. a stack
slot.  For example, if we take a look at the test program:

    #0:  r1 = 8;
      /* store_release((u64 *)(r10 - 8), r1); */
    #1:  .8byte %[store_release];
    #2:  r1 = *(u64 *)(r10 - 8);
    #3:  r2 = r10;
    #4:  r2 += r1;
    #5:  r0 = 0;
    #6:  exit;

At #1, if the verifier doesn't remember that we wrote 8 to the stack,
then later at #4 we would be adding an unbounded scalar value to the
stack pointer, which would cause the program to be rejected:

  VERIFIER LOG:
  =============
...
  math between fp pointer and register with unbounded min value is not allowed

For easier CI integration, instead of using built-ins like
__atomic_{load,store}_n() which depend on the new
__BPF_FEATURE_LOAD_ACQ_STORE_REL pre-defined macro, manually craft
load-acquire/store-release instructions using __imm_insn(), as suggested
by Eduard.

All new tests depend on:

  (1) Clang major version >= 18, and
  (2) ENABLE_ATOMICS_TESTS is defined (currently implies -mcpu=v3 or
      v4), and
  (3) JIT supports load-acquire/store-release (currently arm64 and
      x86-64)

In .../progs/arena_atomics.c:

  /* 8-byte-aligned */
  __u8 __arena_global load_acquire8_value = 0x12;
  /* 1-byte hole */
  __u16 __arena_global load_acquire16_value = 0x1234;

That 1-byte hole in the .addr_space.1 ELF section caused clang-17 to
crash:

  fatal error: error in backend: unable to write nop sequence of 1 bytes

To work around such llvm-17 CI job failures, conditionally define
__arena_global variables as 64-bit if __clang_major__ < 18, to make sure
.addr_space.1 has no holes.  Ideally we should avoid compiling this file
using clang-17 at all (arena tests depend on
__BPF_FEATURE_ADDR_SPACE_CAST, and are skipped for llvm-17 anyway), but
that is a separate topic.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Peilin Ye <yepeilin@google.com>
Link: https://lore.kernel.org/r/1b46c6feaf0f1b6984d9ec80e500cc7383e9da1a.1741049567.git.yepeilin@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: simple DFA-based live registers analysis

Compute may-live registers before each instruction in the program.
The register is live before the instruction I if it is read by I or
some instruction S following I during program execution and is not
overwritten between I and S.

This information would be used in the next patch as a hint in
func_states_equal().

Use a simple algorithm described in [1] to compute this information:
- define the following:
  - I.use : a set of all registers read by instruction I;
  - I.def : a set of all registers written by instruction I;
  - I.in  : a set of all registers that may be alive before I execution;
  - I.out : a set of all registers that may be alive after I execution;
  - I.successors : a set of instructions S that might immediately
                   follow I for some program execution;
- associate separate empty sets 'I.in' and 'I.out' with each instruction;
- visit each instruction in a postorder and update corresponding
  'I.in' and 'I.out' sets as follows:

      I.out = U [S.in for S in I.successors]
      I.in  = (I.out / I.def) U I.use

  (where U stands for set union, / stands for set difference)
- repeat the computation while I.{in,out} changes for any instruction.

On implementation side keep things as simple, as possible:
- check_cfg() already marks instructions EXPLORED in post-order,
  modify it to save the index of each EXPLORED instruction in a vector;
- represent I.{in,out,use,def} as bitmasks;
- don't split the program into basic blocks and don't maintain the
  work queue, instead:
  - do fixed-point computation by visiting each instruction;
  - maintain a simple 'changed' flag if I.{in,out} for any instruction
    change;
  Measurements show that even such simplistic implementation does not
  add measurable verification time overhead (for selftests, at-least).

Note on check_cfg() ex_insn_beg/ex_done change:
To avoid out of bounds access to env->cfg.insn_postorder array,
it should be guaranteed that instruction transitions to EXPLORED state
only once. Previously this was not the fact for incorrect programs
with direct calls to exception callbacks.

The 'align' selftest needs adjustment to skip computed insn/live
registers printout. Otherwise it matches lines from the live registers
printout.

[1] https://en.wikipedia.org/wiki/Live-variable_analysis

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250304195024.2478889-4-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf, x86: Support load-acquire and store-release instructions

Recently we introduced BPF load-acquire (BPF_LOAD_ACQ) and store-release
(BPF_STORE_REL) instructions.  For x86-64, simply implement them as
regular BPF_LDX/BPF_STX loads and stores.  The verifier always rejects
misaligned load-acquires/store-releases (even if BPF_F_ANY_ALIGNMENT is
set), so emitted MOV* instructions are guaranteed to be atomic.

Arena accesses are supported.  8- and 16-bit load-acquires are
zero-extending (i.e., MOVZBQ, MOVZWQ).

Rename emit_atomic{,_index}() to emit_atomic_rmw{,_index}() to make it
clear that they only handle read-modify-write atomics, and extend their
@atomic_op parameter from u8 to u32, since we are starting to use more
than the lowest 8 bits of the 'imm' field.

Signed-off-by: Peilin Ye <yepeilin@google.com>
Link: https://lore.kernel.org/r/d22bb3c69f126af1d962b7314f3489eff606a3b7.1741049567.git.yepeilin@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: get_call_summary() utility function

Refactor mark_fastcall_pattern_for_call() to extract a utility
function get_call_summary(). For a helper or kfunc call this function
fills the following information: {num_params, is_void, fastcall}.

This function would be used in the next patch in order to get number
of parameters of a helper or kfunc call.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250304195024.2478889-3-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf, arm64: Support load-acquire and store-release instructions

Support BPF load-acquire (BPF_LOAD_ACQ) and store-release
(BPF_STORE_REL) instructions in the arm64 JIT compiler.  For example
(assuming little-endian):

  db 10 00 00 00 01 00 00  r0 = load_acquire((u64 *)(r1 + 0x0))
  95 00 00 00 00 00 00 00  exit

  opcode (0xdb): BPF_ATOMIC | BPF_DW | BPF_STX
  imm (0x00000100): BPF_LOAD_ACQ

The JIT compiler would emit an LDAR instruction for the above, e.g.:

  ldar  x7, [x0]

Similarly, consider the following 16-bit store-release:

  cb 21 00 00 10 01 00 00  store_release((u16 *)(r1 + 0x0), w2)
  95 00 00 00 00 00 00 00  exit

  opcode (0xcb): BPF_ATOMIC | BPF_H | BPF_STX
  imm (0x00000110): BPF_STORE_REL

An STLRH instruction would be emitted, e.g.:

  stlrh  w1, [x0]

For a complete mapping:

  load-acquire     8-bit  LDARB
(BPF_LOAD_ACQ)   16-bit  LDARH
                  32-bit  LDAR (32-bit)
                  64-bit  LDAR (64-bit)
  store-release    8-bit  STLRB
(BPF_STORE_REL)  16-bit  STLRH
                  32-bit  STLR (32-bit)
                  64-bit  STLR (64-bit)

Arena accesses are supported.
bpf_jit_supports_insn(..., /*in_arena=*/true) always returns true for
BPF_LOAD_ACQ and BPF_STORE_REL instructions, as they don't depend on
ARM64_HAS_LSE_ATOMICS.

Acked-by: Xu Kuohai <xukuohai@huawei.com>
Signed-off-by: Peilin Ye <yepeilin@google.com>
Link: https://lore.kernel.org/r/51664a1300710238ba2d4d95142b57a52c4f0cae.1741049567.git.yepeilin@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: jmp_offset() and verbose_insn() utility functions

Extract two utility functions:
- One BPF jump instruction uses .imm field to encode jump offset,
  while the rest use .off. Encapsulate this detail as jmp_offset()
  function.
- Avoid duplicating instruction printing callback definitions by
  defining a verbose_insn() function, which disassembles an
  instruction into the verifier log while hiding this detail.

These functions will be used in the next patch.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250304195024.2478889-2-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

arm64: insn: Add load-acquire and store-release instructions

Add load-acquire ("load_acq", LDAR{,B,H}) and store-release
("store_rel", STLR{,B,H}) instructions.  Breakdown of encoding:

                                size        L   (Rs)  o0 (Rt2) Rn    Rt
             mask (0x3fdffc00): 00 111111 1 1 0 11111 1  11111 00000 00000
  value, load_acq (0x08dffc00): 00 001000 1 1 0 11111 1  11111 00000 00000
value, store_rel (0x089ffc00): 00 001000 1 0 0 11111 1  11111 00000 00000

As suggested by Xu [1], include all Should-Be-One (SBO) bits ("Rs" and
"Rt2" fields) in the "mask" and "value" numbers.

It is worth noting that we are adding the "no offset" variant of STLR
instead of the "pre-index" variant, which has a different encoding.

Reference: Arm Architecture Reference Manual (ARM DDI 0487K.a,
           ID032224),

  * C6.2.161 LDAR
  * C6.2.353 STLR

[1] https://lore.kernel.org/bpf/4e6641ce-3f1e-4251-8daf-4dd4b77d08c4@huaweicloud.com/

Acked-by: Xu Kuohai <xukuohai@huawei.com>
Signed-off-by: Peilin Ye <yepeilin@google.com>
Link: https://lore.kernel.org/r/ba92057b7502ce4c9c9b03b7d637abe5e178134e.1741049567.git.yepeilin@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

arm64: insn: Add BIT(23) to {load,store}_ex's mask

We are planning to add load-acquire (LDAR{,B,H}) and store-release
(STLR{,B,H}) instructions to insn.{c,h}; add BIT(23) to mask of load_ex
and store_ex to prevent aarch64_insn_is_{load,store}_ex() from returning
false-positives for load-acquire and store-release instructions.

Reference: Arm Architecture Reference Manual (ARM DDI 0487K.a,
           ID032224),

  * C6.2.228 LDXR
  * C6.2.165 LDAXR
  * C6.2.161 LDAR
  * C6.2.393 STXR
  * C6.2.360 STLXR
  * C6.2.353 STLR

Acked-by: Xu Kuohai <xukuohai@huawei.com>
Signed-off-by: Peilin Ye <yepeilin@google.com>
Link: https://lore.kernel.org/r/5a4d2a52b2cc022bf86d0b572789f0b3bc3d5162.1741049567.git.yepeilin@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Merge branch 'timed-may_goto'

Kumar Kartikeya Dwivedi says:

====================
Timed may_goto

This series replaces the current implementation of cond_break, which
uses the may_goto instruction, and counts 8 million iterations per stack
frame, with an implementation based on sampling time locally on the CPU.

This is done to permit a longer time for a given loop per-program
invocation. The accounting is still done per-stack frame, but the count
is used to instead amortize the cost of the logic to sample and check
the time spent since the start.

This is needed for expressing more complicated algorithms (spin locks,
waiting loops, etc.) in BPF programs without false positive expiration
of the loop. For instance, the plan is to make use of this for
implementing spin locks for BPF arena [0].

For the loop as follows:

for (int i = 0;; i++) {}

Testing on a bare-metal Sapphire Rapids Intel server yields the following
table (taking an average of 25 runs).

+-----------------------------+--------------+--------------+------------------+
| Loop type       | Iterations   | Time (ms)   | Time/iter (ns) |
+-----------------------------|--------------+--------------+------------------+
| may_goto       | 8388608      | 3     | 0.36        |
| timed_may_goto (count=65535)| 589674932    | 250     | 0.42        |
| bpf_for       | 8388608      | 10     | 1.19        |
+-----------------------------+--------------+--------------+------------------+

Here, count is used to amortize the time sampling and checking logic.

Obviously, this is the limit of an empty loop. Given the complexity of
the loop body, the time spent in the loop can be longer. Cancellations
will address the task of imposing an upper bound on program runtime.

For now, the implementation only supports x86.

  [0]: https://lore.kernel.org/bpf/20250118162238.2621311-1-memxor@gmail.com

Changelog:
----------
v1 -> v2
v1: https://lore.kernel.org/bpf/20250302201348.940234-1-memxor@gmail.com

* Address comments from Alexei
   * Use kernel comment style for new code.
   * Remove p->count == 0 check in bpf_check_timed_may_goto.
   * Add comments on AX as argument/retval calling convention.
   * Add comments describing how the counting logic works.
   * Use BPF_EMIT_CALL instead of open-coding instruction encoding.
   * Change if ax != 1 goto pc+X condition to if ax != 0 goto pc+X.
====================

Link: https://patch.msgid.link/20250304003239.2390751-1-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Introduce load-acquire and store-release instructions

Introduce BPF instructions with load-acquire and store-release
semantics, as discussed in [1].  Define 2 new flags:

  #define BPF_LOAD_ACQ    0x100
  #define BPF_STORE_REL   0x110

A "load-acquire" is a BPF_STX | BPF_ATOMIC instruction with the 'imm'
field set to BPF_LOAD_ACQ (0x100).

Similarly, a "store-release" is a BPF_STX | BPF_ATOMIC instruction with
the 'imm' field set to BPF_STORE_REL (0x110).

Unlike existing atomic read-modify-write operations that only support
BPF_W (32-bit) and BPF_DW (64-bit) size modifiers, load-acquires and
store-releases also support BPF_B (8-bit) and BPF_H (16-bit).  As an
exception, however, 64-bit load-acquires/store-releases are not
supported on 32-bit architectures (to fix a build error reported by the
kernel test robot).

An 8- or 16-bit load-acquire zero-extends the value before writing it to
a 32-bit register, just like ARM64 instruction LDARH and friends.

Similar to existing atomic read-modify-write operations, misaligned
load-acquires/store-releases are not allowed (even if
BPF_F_ANY_ALIGNMENT is set).

As an example, consider the following 64-bit load-acquire BPF
instruction (assuming little-endian):

  db 10 00 00 00 01 00 00  r0 = load_acquire((u64 *)(r1 + 0x0))

  opcode (0xdb): BPF_ATOMIC | BPF_DW | BPF_STX
  imm (0x00000100): BPF_LOAD_ACQ

Similarly, a 16-bit BPF store-release:

  cb 21 00 00 10 01 00 00  store_release((u16 *)(r1 + 0x0), w2)

  opcode (0xcb): BPF_ATOMIC | BPF_H | BPF_STX
  imm (0x00000110): BPF_STORE_REL

In arch/{arm64,s390,x86}/net/bpf_jit_comp.c, have
bpf_jit_supports_insn(..., /*in_arena=*/true) return false for the new
instructions, until the corresponding JIT compiler supports them in
arena.

[1] https://lore.kernel.org/all/20240729183246.4110549-1-yepeilin@google.com/

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: kernel test robot <lkp@intel.com>
Signed-off-by: Peilin Ye <yepeilin@google.com>
Link: https://lore.kernel.org/r/a217f46f0e445fbd573a1a024be5c6bf1d5fe716.1741049567.git.yepeilin@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Merge branch 'introduce-bpf_object__prepare'

Mykyta Yatsenko says:

====================
Introduce bpf_object__prepare

From: Mykyta Yatsenko <yatsenko@meta.com>

We are introducing a new function in the libbpf API, bpf_object__prepare,
which provides more granular control over the process of loading a
bpf_object. bpf_object__prepare performs ELF processing, relocations,
prepares final state of BPF program instructions (accessible with
bpf_program__insns()), creates and potentially pins maps, and stops short
of loading BPF programs.

There are couple of anticipated usecases for this API:
* Use BPF token for freplace programs that might need to lookup BTF of
other programs (BPF token creation can't be moved to open step, as open
step is "no privilege assumption" step so that tools like bpftool can
generate skeleton, discover the structure of BPF object, etc).
* Stopping at prepare gives users finalized BPF program
instructions (with subprogs appended, everything relocated and
finalized, etc). And that property can be taken advantage of by
veristat (and similar tools) that might want to process one program at
a time, but would like to avoid relatively slow ELF parsing and
processing; and even BPF selftests itself (RUN_TESTS part of it at
least) would benefit from this by eliminating waste of re-processing
ELF many times.
====================

Link: https://patch.msgid.link/20250303135752.158343-1-mykyta.yatsenko5@gmail.com
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf, x86: Add x86 JIT support for timed may_goto

Implement the arch_bpf_timed_may_goto function using inline assembly to
have control over which registers are spilled, and use our special
protocol of using BPF_REG_AX as an argument into the function, and as
the return value when going back.

Emit call depth accounting for the call made from this stub, and ensure
we don't have naked returns (when rethunk mitigations are enabled) by
falling back to the RET macro (instead of retq). After popping all saved
registers, the return address into the BPF program should be on top of
the stack.

Since the JIT support is now enabled, ensure selftests which are
checking the produced may_goto sequences do not break by adjusting them.
Make sure we still test the old may_goto sequence on other
architectures, while testing the new sequence on x86_64.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250304003239.2390751-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Add tests for bpf_object__prepare

Add selftests, checking that running bpf_object__prepare successfully
creates maps before load step.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250303135752.158343-5-mykyta.yatsenko5@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Add verifier support for timed may_goto

Implement support in the verifier for replacing may_goto implementation
from a counter-based approach to one which samples time on the local CPU
to have a bigger loop bound.

We implement it by maintaining 16-bytes per-stack frame, and using 8
bytes for maintaining the count for amortizing time sampling, and 8
bytes for the starting timestamp. To minimize overhead, we need to avoid
spilling and filling of registers around this sequence, so we push this
cost into the time sampling function 'arch_bpf_timed_may_goto'. This is
a JIT-specific wrapper around bpf_check_timed_may_goto which returns us
the count to store into the stack through BPF_REG_AX. All caller-saved
registers (r0-r5) are guaranteed to remain untouched.

The loop can be broken by returning count as 0, otherwise we dispatch
into the function when the count drops to 0, and the runtime chooses to
refresh it (by returning count as BPF_MAX_TIMED_LOOPS) or returning 0
and aborting the loop on next iteration.

Since the check for 0 is done right after loading the count from the
stack, all subsequent cond_break sequences should immediately break as
well, of the same loop or subsequent loops in the program.

We pass in the stack_depth of the count (and thus the timestamp, by
adding 8 to it) to the arch_bpf_timed_may_goto call so that it can be
passed in to bpf_check_timed_may_goto as an argument after r1 is saved,
by adding the offset to r10/fp. This adjustment will be arch specific,
and the next patch will introduce support for x86.

Note that depending on loop complexity, time spent in the loop can be
more than the current limit (250 ms), but imposing an upper bound on
program runtime is an orthogonal problem which will be addressed when
program cancellations are supported.

The current time afforded by cond_break may not be enough for cases
where BPF programs want to implement locking algorithms inline, and use
cond_break as a promise to the verifier that they will eventually
terminate.

Below are some benchmarking numbers on the time taken per-iteration for
an empty loop that counts the number of iterations until cond_break
fires. For comparison, we compare it against bpf_for/bpf_repeat which is
another way to achieve the same number of spins (BPF_MAX_LOOPS).  The
hardware used for benchmarking was a Sapphire Rapids Intel server with
performance governor enabled, mitigations were enabled.

+-----------------------------+--------------+--------------+------------------+
| Loop type                   | Iterations   |  Time (ms)   |   Time/iter (ns) |
+-----------------------------|--------------+--------------+------------------+
| may_goto                    | 8388608      |  3           |   0.36           |
| timed_may_goto (count=65535)| 589674932    |  250         |   0.42           |
| bpf_for                     | 8388608      |  10          |   1.19           |
+-----------------------------+--------------+--------------+------------------+

This gives a good approximation at low overhead while staying close to
the current implementation.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250304003239.2390751-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

libbpf: Split bpf object load into prepare/load

Introduce bpf_object__prepare API: additional intermediate preparation
step that performs ELF processing, relocations, prepares final state of
BPF program instructions (accessible with bpf_program__insns()), creates
and (potentially) pins maps, and stops short of loading BPF programs.

We anticipate few use cases for this API, such as:
* Use prepare to initialize bpf_token, without loading freplace
programs, unlocking possibility to lookup BTF of other programs.
* Execute prepare to obtain finalized BPF program instructions without
loading programs, enabling tools like veristat to process one program at
a time, without incurring cost of ELF parsing and processing.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250303135752.158343-4-mykyta.yatsenko5@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

libbpf: Introduce more granular state for bpf_object

We are going to split bpf_object loading into 2 stages: preparation and
loading. This will increase flexibility when working with bpf_object
and unlock some optimizations and use cases.
This patch substitutes a boolean flag (loaded) by more finely-grained
state for bpf_object.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250303135752.158343-3-mykyta.yatsenko5@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

net: filter: Avoid shadowing variable in bpf_convert_ctx_access()

Rename the local variable 'off' to 'offset' to avoid shadowing the existing
'off' variable that is declared as an `int` in the outer scope of
bpf_convert_ctx_access().

This fixes a compiler warning:

net/core/filter.c:9679:8: warning: declaration shadows a local variable [-Wshadow]

Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://patch.msgid.link/20250228-fix_filter-v1-1-ce13eae66fe9@debian.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

libbpf: Use map_is_created helper in map setters

Refactoring: use map_is_created helper in map setters that need to check
the state of the map. This helps to reduce the number of the places that
depend explicitly on the loaded flag, simplifying refactoring in the
next patch of this set.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250303135752.158343-2-mykyta.yatsenko5@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Merge branch 'selftests-bpf-migrate-test_tunnel-sh-to-test_progs'

Bastien Curutchet says:

====================
selftests/bpf: Migrate test_tunnel.sh to test_progs

Hi all,

This patch series continues the work to migrate the *.sh tests into
prog_tests framework.

The test_tunnel.sh script has already been partly migrated to
test_progs in prog_tests/test_tunnel.c so I add my work to it.

PATCH 1 & 2 create some helpers to avoid code duplication and ease the
migration in the following patches.
PATCH 3 to 9 migrate the tests of gre, ip6gre, erspan, ip6erspan,
geneve, ip6geneve and ip6tnl tunnels.
PATCH 10 removes test_tunnel.sh
====================

Link: https://patch.msgid.link/20250303-tunnels-v2-0-8329f38f0678@bootlin.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: test_tunnel: Remove test_tunnel.sh

All tests from test_tunnel.sh have been migrated into test test_progs.
The last test remaining in the script is the test_ipip() that is already
covered in the test_prog framework by the NONE case of test_ipip_tunnel().

Remove the test_tunnel.sh script and its Makefile entry

Signed-off-by: Bastien Curutchet (eBPF Foundation) <bastien.curutchet@bootlin.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250303-tunnels-v2-10-8329f38f0678@bootlin.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: test_tunnel: Move ip6tnl tunnel tests to test_progs

ip6tnl tunnels are tested in the test_tunnel.sh but not in the test_progs
framework.

Add a new test in test_progs to test ip6tnl tunnels. It uses the same
network topology and the same BPF programs than the script.
Remove test_ipip6() and test_ip6ip6() from the script.

Signed-off-by: Bastien Curutchet (eBPF Foundation) <bastien.curutchet@bootlin.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250303-tunnels-v2-9-8329f38f0678@bootlin.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: test_tunnel: Move ip6geneve tunnel test to test_progs

ip6geneve tunnels are tested in the test_tunnel.sh but not in the
test_progs framework.

Add a new test in test_progs to test ip6geneve tunnels. It uses the same
network topology and the same BPF programs than the script.
Remove test_ip6geneve() from the script.

Signed-off-by: Bastien Curutchet (eBPF Foundation) <bastien.curutchet@bootlin.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250303-tunnels-v2-8-8329f38f0678@bootlin.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: test_tunnel: Move geneve tunnel test to test_progs

geneve tunnels are tested in the test_tunnel.sh but not in the test_progs
framework.

Add a new test in test_progs to test geneve tunnels. It uses the same
network topology and the same BPF programs than the script.
Remove test_geneve() from the script.

Signed-off-by: Bastien Curutchet (eBPF Foundation) <bastien.curutchet@bootlin.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250303-tunnels-v2-7-8329f38f0678@bootlin.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: test_tunnel: Move ip6erspan tunnel test to test_progs

ip6erspan tunnels are tested in the test_tunnel.sh but not in the
test_progs framework.

Add a new test in test_progs to test ip6erspan tunnels. It uses the same
network topology and the same BPF programs than the script.
Remove test_ip6erspan() from the script.

Signed-off-by: Bastien Curutchet (eBPF Foundation) <bastien.curutchet@bootlin.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250303-tunnels-v2-6-8329f38f0678@bootlin.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: test_tunnel: Move erspan tunnel tests to test_progs

erspan tunnels are tested in the test_tunnel.sh but not in the test_progs
framework.

Add a new test in test_progs to test erspan tunnels. It uses the same
network topology and the same BPF programs than the script.
Remove test_erspan() from the script.

Signed-off-by: Bastien Curutchet (eBPF Foundation) <bastien.curutchet@bootlin.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250303-tunnels-v2-5-8329f38f0678@bootlin.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: test_tunnel: Move ip6gre tunnel test to test_progs

ip6gre tunnels are tested in the test_tunnel.sh but not in the test_progs
framework.

Add a new test in test_progs to test ip6gre tunnels. It uses the same
network topology and the same BPF programs than the script. Disable the
IPv6 DAD feature because it can take lot of time and cause some tests to
fail depending on the environment they're run on.
Remove test_ip6gre() and test_ip6gretap() from the script.

Signed-off-by: Bastien Curutchet (eBPF Foundation) <bastien.curutchet@bootlin.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250303-tunnels-v2-4-8329f38f0678@bootlin.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: test_tunnel: Move gre tunnel test to test_progs

gre tunnels are tested in the test_tunnel.sh but not in the test_progs
framework.

Add a new test in test_progs to test gre tunnels. It uses the same
network topology and the same BPF programs than the script.
Remove test_gre() and test_gre_no_tunnel_key() from the script.

Signed-off-by: Bastien Curutchet (eBPF Foundation) <bastien.curutchet@bootlin.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250303-tunnels-v2-3-8329f38f0678@bootlin.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Merge branch 'veristat-files-list-txt-notation-for-object-files-list'

Eduard Zingerman says:

====================
veristat: @files-list.txt notation for object files list

A few small veristat improvements:
- It is possible to hit command line parameters number limit,
  e.g. when running veristat for all object files generated for
  test_progs. This patch-set adds an option to read objects files list
  from a file.
- Correct usage of strerror() function.
- Avoid printing log lines to CSV output.

Changelog:
- v1 -> v2:
  - replace strerror(errno) with strerror(-err) in patch #2 (Andrii)

v1: https://lore.kernel.org/bpf/3ee39a16-bc54-4820-984a-0add2b5b5f86@gmail.com/T/
====================

Link: https://patch.msgid.link/20250301000147.1583999-1-eddyz87@gmail.com
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: test_tunnel: Add ping helpers

All tests use more or less the same ping commands as final validation.
Also test_ping()'s return value is checked with ASSERT_OK() while this
check is already done by the SYS() macro inside test_ping().

Create helpers around test_ping() and use them in the tests to avoid code
duplication.
Remove the unnecessary ASSERT_OK() from the tests.

Signed-off-by: Bastien Curutchet (eBPF Foundation) <bastien.curutchet@bootlin.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250303-tunnels-v2-2-8329f38f0678@bootlin.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Factor out check_load_mem() and check_store_reg()

Extract BPF_LDX and most non-ATOMIC BPF_STX instruction handling logic
in do_check() into helper functions to be used later. While we are
here, make that comment about "reserved fields" more specific.

Suggested-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Peilin Ye <yepeilin@google.com>
Link: https://lore.kernel.org/r/8b39c94eac2bb7389ff12392ca666f939124ec4f.1740978603.git.yepeilin@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

veristat: Report program type guess results to sdterr

In order not to pollute CSV output, e.g.:

  $ ./veristat -o csv exceptions_ext.bpf.o > test.csv
  Using guessed program type 'sched_cls' for exceptions_ext.bpf.o/extension...
  Using guessed program type 'sched_cls' for exceptions_ext.bpf.o/throwing_extension...

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Mykyta Yatsenko <mykyta.yatsenko5@gmail.com>
Link: https://lore.kernel.org/bpf/20250301000147.1583999-4-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>