Yanfei Xu [Sun, 16 May 2021 09:50:10 +0000 (17:50 +0800)]
rcu: Fix stall-warning deadlock due to non-release of rcu_node ->lock
If rcu_print_task_stall() is invoked on an rcu_node structure that does
not contain any tasks blocking the current grace period, it takes an
early exit that fails to release that rcu_node structure's lock. This
results in a self-deadlock, which is detected by lockdep.
This will also result in other complaints, including RCU's scheduler
hook complaining about blocking rather than preemption and an rcutorture
writer stall.
Only a partial RCU CPU stall warning message will be printed because of
the self-deadlock.
This commit therefore releases the lock on the rcu_print_task_stall()
function's early exit path.
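
As an illustration only, the early-exit pattern can be modeled in userspace C. This is a minimal sketch, not the kernel code: struct node_model, node_lock(), node_unlock() and print_task_stall_model() are hypothetical stand-ins, and the lock is a plain flag so that a missed release shows up as a failed assertion on the next acquisition.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for an rcu_node structure; not the kernel layout. */
struct node_model {
	bool locked;		/* models the rcu_node ->lock */
	int nr_blocked;		/* tasks blocking the current grace period */
};

static void node_lock(struct node_model *np)
{
	assert(!np->locked);	/* acquiring a held lock models the self-deadlock */
	np->locked = true;
}

static void node_unlock(struct node_model *np)
{
	assert(np->locked);
	np->locked = false;
}

/* Returns the number of blocked tasks that would be reported. */
int print_task_stall_model(struct node_model *np)
{
	int n;

	node_lock(np);
	if (np->nr_blocked == 0) {
		node_unlock(np);	/* the early exit must drop the lock too */
		return 0;
	}
	n = np->nr_blocked;
	node_unlock(np);
	return n;
}
```

Calling print_task_stall_model() twice on a node with no blocked tasks succeeds only because the early exit releases the lock. Without that release, the second call would trip the assertion in node_lock().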
Fixes: c583bcb8f5ed ("rcu: Don't invoke try_invoke_on_locked_down_task() with irqs disabled")
Signed-off-by: Yanfei Xu <yanfei.xu@windriver.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Yanfei Xu [Sat, 15 May 2021 16:45:11 +0000 (00:45 +0800)]
rcu: Fix to include first blocked task in stall warning
The for loop in rcu_print_task_stall() always omits ts[0], which points
to the first task blocking the stalled grace period. This in turn fails
to count this first task, which means that ndetected will be equal to
zero when all CPUs have passed through their quiescent states and only
one task is blocking the stalled grace period. This zero value for
ndetected will in turn result in an incorrect "All QSes seen" message:
rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
rcu: Tasks blocked on level-1 rcu_node (CPUs 12-23):
(detected by 15, t=6504 jiffies, g=164777, q=9011209)
rcu: All QSes seen, last rcu_preempt kthread activity 1 (4295252379-4295252378), jiffies_till_next_fqs=1, root ->qsmask 0x2
BUG: sleeping function called from invalid context at include/linux/uaccess.h:156
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 70613, name: msgstress04
INFO: lockdep is turned off.
Preemption disabled at:
[<ffff8000104031a4>] create_object.isra.0+0x204/0x4b0
CPU: 15 PID: 70613 Comm: msgstress04 Kdump: loaded Not tainted
5.12.2-yoctodev-standard #1
Hardware name: Marvell OcteonTX CN96XX board (DT)
Call trace:
dump_backtrace+0x0/0x2cc
show_stack+0x24/0x30
dump_stack+0x110/0x188
___might_sleep+0x214/0x2d0
__might_sleep+0x7c/0xe0
This commit therefore fixes the loop to include ts[0].
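
The off-by-one can be sketched in userspace C as follows. This is an illustration under assumptions, not the kernel loop: the function names are hypothetical, and the ts[] array holds integers rather than task pointers.

```c
#include <assert.h>

/* Buggy shape: the loop stops before index 0, so ts[0] is never counted. */
int count_blocked_buggy(const int *ts, int n)
{
	int ndetected = 0;

	assert(n >= 1);
	for (int i = n - 1; i > 0; i--)
		ndetected += (ts[i] != 0);
	return ndetected;
}

/* Fixed shape: pre-decrement inside the loop so that ts[0] is visited. */
int count_blocked_fixed(const int *ts, int n)
{
	int ndetected = 0;
	int i = n;

	assert(n >= 1);
	while (i)
		ndetected += (ts[--i] != 0);
	return ndetected;
}
```

With a single blocked task, the buggy form reports zero detections, which corresponds to the misleading "All QSes seen" case described above, while the fixed form reports one.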
Fixes: c583bcb8f5ed ("rcu: Don't invoke try_invoke_on_locked_down_task() with irqs disabled")
Signed-off-by: Yanfei Xu <yanfei.xu@windriver.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Paul E. McKenney [Tue, 11 May 2021 20:27:22 +0000 (13:27 -0700)]
torture: Make torture.sh accept --do-all and --donone
Currently, torture.sh accepts --doall on the one hand and --do-none
on the other, which is a bit inconsistent. This commit therefore adds
--do-all and --donone so that a fully consistent set of arguments may be used.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Paul E. McKenney [Mon, 10 May 2021 19:30:49 +0000 (12:30 -0700)]
torture: Add clocksource-watchdog testing to torture.sh
This commit adds three short tests of the clocksource-watchdog capability
to the torture.sh script, all to avoid otherwise-inevitable bitrot.
While in the area, fix an obsolete comment.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Paul E. McKenney [Tue, 4 May 2021 00:04:57 +0000 (17:04 -0700)]
refscale: Add measurement of clock readout
This commit adds a "clock" type to refscale, which checks the performance
of ktime_get_real_fast_ns(). Use the "clocksource=" kernel boot parameter
to select the underlying clock source.
[ paulmck: Work around compiler false positive per kernel test robot. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
clocksource: Print deviation in nanoseconds for unstable case
Currently when an unstable clocksource is detected, the raw counters
of that clocksource and watchdog will be printed, which can only be
understood after some math calculation. So print the existing delta in
nanoseconds to make it easier for humans to check the results.
[ paulmck: Fix typo. ]
Signed-off-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Paul E. McKenney [Fri, 30 Apr 2021 03:34:19 +0000 (20:34 -0700)]
clocksource: Provide kernel module to test clocksource watchdog
When the clocksource watchdog marks a clock as unstable, this might
be due to that clock being unstable or it might be due to delays that
happen to occur between the reads of the two clocks. It would be good
to have a way of testing the clocksource watchdog's ability to
distinguish between these two causes of clock skew and instability.
Therefore, provide a new clocksource-wdtest module selected by a new
TEST_CLOCKSOURCE_WATCHDOG Kconfig option. This module has a single module
parameter named "holdoff" that provides the number of seconds of delay
before testing should start, which defaults to zero when built as a module
and to 10 seconds when built directly into the kernel. Very large systems
that boot slowly may need to increase the value of this module parameter.
This module uses hand-crafted clocksource structures to do its testing,
thus avoiding messing up timing for the rest of the kernel and for user
applications. This module first verifies that the ->uncertainty_margin
fields of the clocksource structures are set sanely. It then tests the
delay-detection capability of the clocksource watchdog, increasing the
number of consecutive delays injected, first provoking console messages
complaining about the delays and finally forcing a clock-skew event.
Unexpected test results cause at least one WARN_ON_ONCE() console splat.
If there are no splats, the test has passed. Finally, it fuzzes the
value returned from a clocksource to test the clocksource watchdog's
ability to detect time skew.
This module checks the state of its clocksource after each test, and
uses WARN_ON_ONCE() to emit a console splat if there are any failures.
This should enable all types of test frameworks to detect any such
failures.
This facility is intended for diagnostic use only, and should be avoided
on production systems.
Paul E. McKenney [Wed, 28 Apr 2021 01:43:37 +0000 (18:43 -0700)]
clocksource: Reduce clocksource-skew threshold for TSC
Currently, WATCHDOG_THRESHOLD is set to detect a 62.5-millisecond skew in
a 500-millisecond WATCHDOG_INTERVAL. This requires that clocks be skewed
by more than 12.5% in order to be marked unstable. Except that a clock
that is skewed by that much is probably destroying unsuspecting software
right and left. And given that there are now checks for false-positive
skews due to delays between reading the two clocks, it should be possible
to greatly decrease WATCHDOG_THRESHOLD, at least for fine-grained clocks
such as TSC.
Therefore, add a new uncertainty_margin field to the clocksource
structure that contains the maximum uncertainty in nanoseconds for
the corresponding clock. This field may be initialized manually,
as it is for clocksource_tsc_early and clocksource_jiffies, which
is copied to refined_jiffies. If the field is not initialized
manually, it will be computed at clock-registry time as the period
of the clock in question based on the scale and freq parameters to
the __clocksource_update_freq_scale() function. If either of those two
parameters is zero, the tens-of-milliseconds WATCHDOG_THRESHOLD is
used as a cowardly alternative to dividing by zero. No matter how the
uncertainty_margin field is calculated, it is bounded below by twice
WATCHDOG_MAX_SKEW, that is, by 100 microseconds.
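
The margin rules above can be sketched as a standalone C function. This is a hedged illustration rather than the kernel implementation: compute_margin_ns() and FALLBACK_MARGIN_NS are hypothetical names, the 50-microsecond WATCHDOG_MAX_SKEW is inferred from the 100-microsecond floor, and the fallback simply reuses the 62.5-millisecond figure quoted earlier, which may differ from the value in the actual patch.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC		1000000000ULL
#define NSEC_PER_USEC		1000ULL
#define WATCHDOG_MAX_SKEW	(50 * NSEC_PER_USEC)	/* twice this is the 100 us floor */
#define FALLBACK_MARGIN_NS	62500000ULL		/* assumption: 62.5 ms */

/*
 * manual: a manually initialized margin in ns, or 0 if none was given.
 * scale, freq: together they give the clock frequency in Hz as scale * freq.
 */
uint32_t compute_margin_ns(uint32_t manual, uint32_t scale, uint32_t freq)
{
	uint64_t margin;

	if (manual)
		return manual;		/* manual values are not adjusted */
	if (!scale || !freq)
		return (uint32_t)FALLBACK_MARGIN_NS;	/* avoid dividing by zero */
	margin = NSEC_PER_SEC / ((uint64_t)scale * freq);	/* clock period in ns */
	if (margin < 2 * WATCHDOG_MAX_SKEW)
		margin = 2 * WATCHDOG_MAX_SKEW;
	assert(margin <= NSEC_PER_SEC);	/* largest computed value is 10^9 */
	return (uint32_t)margin;
}
```

For example, a 3 GHz clock (scale 1000, freq 3000000) has a sub-nanosecond period and is clamped to the 100-microsecond floor, while a 100 Hz clock gets a 10-millisecond margin.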
Note that manually initialized uncertainty_margin fields are not adjusted,
but there is a WARN_ON_ONCE() that triggers if any such field is less than
twice WATCHDOG_MAX_SKEW. This WARN_ON_ONCE() is intended to discourage
production use of the one-nanosecond uncertainty_margin values that are
used to test the clock-skew code itself.
The actual clock-skew check uses the sum of the uncertainty_margin fields
of the two clocksource structures being compared. Integer overflow is
avoided because the largest computed value of the uncertainty_margin
fields is one billion (10^9), and double that value fits into an
unsigned int. However, if someone manually specifies (say) UINT_MAX,
they will get what they deserve.
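
A minimal sketch of that comparison, assuming both intervals have already been converted to nanoseconds. The function name skew_exceeds_margin() is hypothetical, and the assertion on the sum is a choice made for this sketch, not something the text above says the kernel does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns true if the two measured intervals differ by more than the summed margins. */
bool skew_exceeds_margin(int64_t cs_nsec, int64_t wd_nsec,
			 uint32_t cs_margin, uint32_t wd_margin)
{
	int64_t delta = cs_nsec - wd_nsec;
	uint64_t abs_delta = delta < 0 ? (uint64_t)(-delta) : (uint64_t)delta;
	uint32_t md;

	/* Two computed margins sum to at most 2 * 10^9, which fits in 32 bits. */
	assert((uint64_t)cs_margin + wd_margin <= UINT32_MAX);
	md = cs_margin + wd_margin;
	return abs_delta > md;
}
```

With two 100-microsecond margins, a 150-microsecond difference is tolerated and a 300-microsecond difference in either direction is flagged.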
Note that the refined_jiffies uncertainty_margin field is initialized to
TICK_NSEC, which means that skew checks involving this clocksource will
be sufficiently forgiving. In a similar vein, the clocksource_tsc_early
uncertainty_margin field is initialized to 32*NSEC_PER_MSEC, which
replicates the current behavior and allows custom setting if needed
in order to address the rare skews detected for this clocksource in
current mainline.
Paul E. McKenney [Wed, 14 Apr 2021 00:52:18 +0000 (17:52 -0700)]
clocksource: Limit number of CPUs checked for clock synchronization
Currently, if skew is detected on a clock marked CLOCK_SOURCE_VERIFY_PERCPU,
that clock is checked on all CPUs. This is thorough, but might not be
what you want on a system with a few tens of CPUs, let alone a few hundred
of them.
Therefore, by default check only up to eight randomly chosen CPUs.
Also provide a new clocksource.verify_n_cpus kernel boot parameter.
A value of -1 says to check all of the CPUs, and a non-negative value says
to randomly select that number of CPUs, without concern about selecting
the same CPU multiple times. However, make use of a cpumask so that a
given CPU will be checked at most once.
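
The selection policy can be sketched in userspace C with a 64-bit integer standing in for the cpumask. This is an illustration under assumptions: choose_cpus_to_check() and count_cpus() are hypothetical names, rand() replaces the kernel's random-number source, and the sketch supports at most 64 CPUs.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Number of CPUs set in the mask. */
int count_cpus(uint64_t mask)
{
	int n = 0;

	for (; mask; mask &= mask - 1)
		n++;
	return n;
}

/* Returns a bitmask of the CPUs to check; bit i set means CPU i is checked. */
uint64_t choose_cpus_to_check(int nr_cpus, int verify_n_cpus)
{
	uint64_t mask = 0;

	assert(nr_cpus >= 1 && nr_cpus <= 64);
	if (verify_n_cpus < 0)		/* -1 means check every CPU */
		return nr_cpus == 64 ? ~0ULL : (1ULL << nr_cpus) - 1;
	/* Pick with replacement; duplicate picks collapse into a single bit. */
	for (int i = 0; i < verify_n_cpus; i++)
		mask |= 1ULL << (rand() % nr_cpus);
	return mask;
}
```

Because duplicates collapse in the mask, a request for eight CPUs checks somewhere between one and eight distinct CPUs, and each chosen CPU is checked only once.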
Paul E. McKenney [Mon, 21 Dec 2020 23:40:47 +0000 (15:40 -0800)]
clocksource: Check per-CPU clock synchronization when marked unstable
Some sorts of per-CPU clock sources have a history of going out of
synchronization with each other. However, this problem has purportedly
been solved in the past ten years. Except that it is all too possible
that the problem has instead simply been made less likely, which might
mean that some of the occasional "Marking clocksource 'tsc' as unstable"
messages might be due to desynchronization. How would anyone know?
Therefore apply CPU-to-CPU synchronization checking to newly unstable
clocksources that are marked with the new CLOCK_SOURCE_VERIFY_PERCPU flag.
Lists of desynchronized CPUs are printed, with the caveat that if it
is the reporting CPU that is itself desynchronized, it will appear that
all the other clocks are wrong. Just like in real life.
Paul E. McKenney [Thu, 17 Dec 2020 01:32:25 +0000 (17:32 -0800)]
clocksource: Retry clock read if long delays detected
When the clocksource watchdog marks a clock as unstable, this might
be due to that clock being unstable or it might be due to delays that
happen to occur between the reads of the two clocks. Yes, interrupts are
disabled across those two reads, but there is no shortage of things that
can delay interrupts-disabled regions of code ranging from SMI handlers
to vCPU preemption. It would be good to have some indication as to why
the clock was marked unstable.
Therefore, re-read the watchdog clock on either side of the read from
the clock under test. If the watchdog clock shows an excessive time
delta between its pair of reads, the reads are retried. The maximum
number of retries is specified by a new kernel boot parameter
clocksource.max_cswd_read_retries, which defaults to three, that
is, up to four reads, one initial and up to three retries. If more
than one retry was required, a message is printed on the console (the
occasional single retry is expected behavior, especially in guest OSes).
If the maximum number of retries is exceeded, the clock under test will
be marked unstable. However, the probability of this happening due
to various sorts of delays is quite small. In addition, the reason
(clock-read delays) for the unstable marking will be apparent.
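
The retry scheme can be sketched in userspace C as follows. This is a hedged model, not the kernel function: read_with_retries(), the function-pointer clock interface and the sim_* simulated clocks are all hypothetical, and interrupt disabling is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t (*read_ns_fn)(void);

/* Simulated clocks, for demonstration only. */
static uint64_t sim_now;		/* simulated time in nanoseconds */
static const uint64_t *sim_delays;	/* delay injected into each clock-under-test read */
static int sim_idx;

static uint64_t sim_read_wd(void)
{
	return sim_now;
}

static uint64_t sim_read_cs(void)
{
	sim_now += sim_delays[sim_idx++];
	return sim_now;
}

/*
 * Read the watchdog on either side of the clock under test. If the two
 * watchdog reads are too far apart, retry, for up to max_retries retries.
 * Returns false if every attempt was delayed.
 */
bool read_with_retries(read_ns_fn read_wd, read_ns_fn read_cs,
		       uint64_t max_delay_ns, int max_retries,
		       uint64_t *wd_now, uint64_t *cs_now, int *nretries)
{
	assert(max_retries >= 0);
	for (int n = 0; n <= max_retries; n++) {
		uint64_t wd_start = read_wd();
		uint64_t cs = read_cs();
		uint64_t wd_end = read_wd();

		if (wd_end - wd_start <= max_delay_ns) {
			*wd_now = wd_start;
			*cs_now = cs;
			*nretries = n;
			return true;
		}
	}
	*nretries = max_retries;
	return false;	/* caller would mark the clock unstable, citing read delays */
}
```

With max_retries set to three, the loop performs at most four read attempts, matching the one initial read plus up to three retries described above. A caller could log a message whenever *nretries exceeds one.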
Marco Elver [Wed, 14 Apr 2021 11:28:25 +0000 (13:28 +0200)]
kcsan: Document "value changed" line
Update the example reports based on the latest reports generated by
kcsan_test module, which now include the "value changed" line. Add a
brief description of the "value changed" line.
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Mark Rutland [Wed, 14 Apr 2021 11:28:24 +0000 (13:28 +0200)]
kcsan: Report observed value changes
When a thread detects that a memory location was modified without its
watchpoint being hit, the report notes that a change was detected, but
does not provide concrete values for the change. Knowing the concrete
values can be very helpful in tracking down any racy writers (e.g. as
specific values may only be written in some portions of code, or under
certain conditions).
When we detect a modification, let's report the concrete old/new values,
along with the access's mask of relevant bits (and which relevant bits
were modified). This can make it easier to identify potential racy
writers. As the snapshots are at most 8 bytes, we can only report values
for accesses up to this size, but this appears to cater for the common
case.
When we detect a race via a watchpoint, we may or may not have concrete
values for the modification. To be helpful, let's attempt to log them
when we do as they can be ignored where irrelevant.
The resulting reports appear as follows, with values zero-padded to the
access width:
| ==================================================================
| BUG: KCSAN: data-race in el0_svc_common+0x34/0x25c arch/arm64/kernel/syscall.c:96
|
| race at unknown origin, with read to 0xffff00007ae6aa00 of 8 bytes by task 223 on cpu 1:
| el0_svc_common+0x34/0x25c arch/arm64/kernel/syscall.c:96
| do_el0_svc+0x48/0xec arch/arm64/kernel/syscall.c:178
| el0_svc arch/arm64/kernel/entry-common.c:226 [inline]
| el0_sync_handler+0x1a4/0x390 arch/arm64/kernel/entry-common.c:236
| el0_sync+0x140/0x180 arch/arm64/kernel/entry.S:674
|
| value changed: 0x0000000000000000 -> 0x0000000000000002
|
| Reported by Kernel Concurrency Sanitizer on:
| CPU: 1 PID: 223 Comm: syz-executor.1 Not tainted 5.8.0-rc3-00094-ga73f923ecc8e-dirty #3
| Hardware name: linux,dummy-virt (DT)
| ==================================================================
If an access mask is set, it is shown underneath the "value changed"
line as "bits changed: 0x<bits changed> with mask 0x<non-zero mask>".
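
For illustration, the two report lines can be produced by a small userspace formatter. This is a sketch under assumptions, not the KCSAN reporting code: format_value_change() is a hypothetical name, and the exact column alignment of the real report is not reproduced.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Format the "value changed" line and, if mask is non-zero, the
 * "bits changed" line. Values are zero-padded to the access width.
 * Returns the number of characters written, as snprintf() does.
 */
int format_value_change(char *buf, size_t len, uint64_t old, uint64_t new,
			uint64_t mask, int size)
{
	int hex_len = size * 2;		/* two hex digits per byte */
	int n;

	assert(size >= 1 && size <= 8);
	n = snprintf(buf, len, "value changed: 0x%0*" PRIx64 " -> 0x%0*" PRIx64 "\n",
		     hex_len, old, hex_len, new);
	if (mask && n >= 0 && (size_t)n < len) {
		uint64_t diff = (old ^ new) & mask;	/* only the relevant bits */

		n += snprintf(buf + n, len - (size_t)n,
			      "bits changed: 0x%0*" PRIx64 " with mask 0x%0*" PRIx64 "\n",
			      hex_len, diff, hex_len, mask);
	}
	return n;
}
```

For an 8-byte access that changed from 0 to 2 with no mask set, this yields the same "value changed" line as in the report above.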
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
[ elver@google.com: align "value changed" and "bits changed" lines,
which required massaging the message; do not print bits+mask if no
mask set. ]
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Mark Rutland [Wed, 14 Apr 2021 11:28:23 +0000 (13:28 +0200)]
kcsan: Remove kcsan_report_type
Now that the reporting code has been refactored, it's clear by
construction that print_report() can only be passed
KCSAN_REPORT_RACE_SIGNAL or KCSAN_REPORT_RACE_UNKNOWN_ORIGIN, and these
can also be distinguished by the presence of `other_info`.
Let's simplify things and remove the report type enum, and instead let's
check `other_info` to distinguish these cases. This allows us to remove
code for cases which are impossible and generally makes the code simpler.
There should be no functional change as a result of this patch.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
[ elver@google.com: add updated comments to kcsan_report_*() functions ]
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Mark Rutland [Wed, 14 Apr 2021 11:28:22 +0000 (13:28 +0200)]
kcsan: Remove reporting indirection
Now that we have separate kcsan_report_*() functions, we can factor the
distinct logic for each of the report cases out of kcsan_report(). While
this means each case has to handle mutual exclusion independently, this
minimizes the conditionality of code and makes it easier to read, and
will permit passing distinct bits of information to print_report() in
future.
There should be no functional change as a result of this patch.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
[ elver@google.com: retain comment about lockdep_off() ]
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Mark Rutland [Wed, 14 Apr 2021 11:28:21 +0000 (13:28 +0200)]
kcsan: Refactor access_info initialization
In subsequent patches we'll want to split kcsan_report() into distinct
handlers for each report type. The largest bit of common work is
initializing the `access_info`, so let's factor this out into a helper,
and have the kcsan_report_*() functions pass the `access_info` as a
parameter to kcsan_report().
There should be no functional change as a result of this patch.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Mark Rutland [Wed, 14 Apr 2021 11:28:20 +0000 (13:28 +0200)]
kcsan: Fold panic() call into print_report()
So that we can add more callers of print_report(), let's fold the panic()
call into print_report() so the caller doesn't have to handle this
explicitly.
There should be no functional change as a result of this patch.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Mark Rutland [Wed, 14 Apr 2021 11:28:19 +0000 (13:28 +0200)]
kcsan: Refactor passing watchpoint/other_info
The `watchpoint_idx` argument to kcsan_report() isn't meaningful for
races which were not detected by a watchpoint, and it would be clearer
if callers passed the other_info directly so that a NULL value can be
passed in this case.
Given that callers manipulate their watchpoints before passing the index
into kcsan_report_*(), and given we index the `other_infos` array using
this before we sanity-check it, the subsequent sanity check isn't all
that useful.
Let's remove the `watchpoint_idx` sanity check, and move the job of
finding the `other_info` out of kcsan_report().
Other than the removal of the check, there should be no functional
change as a result of this patch.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Mark Rutland [Wed, 14 Apr 2021 11:28:18 +0000 (13:28 +0200)]
kcsan: Distinguish kcsan_report() calls
Currently kcsan_report() is used to handle three distinct cases:
* The caller hit a watchpoint when attempting an access. Some
information regarding the caller and access are recorded, but no
output is produced.
* A caller which previously set up a watchpoint detected that the
watchpoint has been hit, and possibly detected a change to the
location in memory being watched. This may result in output reporting
the interaction between this caller and the caller which hit the
watchpoint.
* A caller detected a modification to a memory location
which wasn't detected by a watchpoint, for which there is no
information on the other thread. This may result in output reporting
the unexpected change.
... depending on the specific case the caller has distinct pieces of
information available, but the prototype of kcsan_report() has to handle
all three cases. This means that in some cases we pass redundant
information, and in others we don't pass all the information we could
pass. This also means that the report code has to demux these three
cases.
So that we can pass some additional information while also simplifying
the callers and report code, add separate kcsan_report_*() functions for
the distinct cases, updating callers accordingly. As the watchpoint_idx
is unused in the case of kcsan_report_unknown_origin(), this passes a
dummy value into kcsan_report(). Subsequent patches will refactor the
report code to avoid this.
There should be no functional change as a result of this patch.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
[ elver@google.com: try to make kcsan_report_*() names more descriptive ]
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Mark Rutland [Wed, 14 Apr 2021 11:28:17 +0000 (13:28 +0200)]
kcsan: Simplify value change detection
In kcsan_setup_watchpoint() we store snapshots of a watched value into a
union of u8/u16/u32/u64 sized fields, modify this in place using a
consistent field, then later check for any changes via the u64 field.
We can achieve the same effect more simply by always treating the field
as a u64, as smaller values will be zero-extended. As the values are
zero-extended, we don't need to truncate the access_mask when we apply
it, and can always apply the full 64-bit access_mask to the 64-bit
value.
Finally, we can store the two snapshots and calculated difference
separately, which makes the code a little easier to read, and will
permit reporting the old/new values in subsequent patches.
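
The approach can be sketched in userspace C as follows. This is an illustration under assumptions: read_snapshot() and value_diff() are hypothetical names, and memcpy() stands in for the kernel's single-copy reads of the watched location.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Read a 1-, 2-, 4- or 8-byte value and zero-extend it to 64 bits. */
uint64_t read_snapshot(const void *ptr, size_t size)
{
	assert(size == 1 || size == 2 || size == 4 || size == 8);
	switch (size) {
	case 1: { uint8_t v;  memcpy(&v, ptr, sizeof(v)); return v; }
	case 2: { uint16_t v; memcpy(&v, ptr, sizeof(v)); return v; }
	case 4: { uint32_t v; memcpy(&v, ptr, sizeof(v)); return v; }
	default: { uint64_t v; memcpy(&v, ptr, sizeof(v)); return v; }
	}
}

/* Old and new snapshots are kept separately; the difference is derived. */
uint64_t value_diff(uint64_t old, uint64_t new, uint64_t access_mask)
{
	uint64_t diff = old ^ new;

	/* Zero-extension means the full 64-bit mask applies without truncation. */
	if (access_mask)
		diff &= access_mask;
	return diff;
}
```

Because the unsigned intermediate types zero-extend, a one-byte value of 0xff is stored as 0x00000000000000ff, so the upper bits never contribute to the difference.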
There should be no functional change as a result of this patch.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Paul E. McKenney [Fri, 5 Mar 2021 00:04:09 +0000 (16:04 -0800)]
kcsan: Add pointer to access-marking.txt to data_race() bullet
This commit references tools/memory-model/Documentation/access-marking.txt
in the bullet introducing data_race(). The access-marking.txt file
gives advice on when data_race() should and should not be used.
Suggested-by: Akira Yokosawa <akiyks@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Arnd Bergmann [Fri, 14 May 2021 14:00:08 +0000 (16:00 +0200)]
kcsan: Fix debugfs initcall return type
clang with CONFIG_LTO_CLANG points out that an initcall function should
return an 'int' due to the changes made to the initcall macros in commit 3578ad11f3fb ("init: lto: fix PREL32 relocations"):
kernel/kcsan/debugfs.c:274:15: error: returning 'void' from a function with incompatible result type 'int'
late_initcall(kcsan_debugfs_init);
~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~
include/linux/init.h:292:46: note: expanded from macro 'late_initcall'
#define late_initcall(fn) __define_initcall(fn, 7)
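
The shape of the fix can be illustrated in userspace C. This is only a sketch: the initcall_t typedef mirrors the int-returning signature that the error message implies, while example_debugfs_init(), run_initcall() and debugfs_ready are hypothetical stand-ins for the real function and the initcall machinery.

```c
#include <assert.h>
#include <stddef.h>

/* An initcall is expected to return int: 0 on success. */
typedef int (*initcall_t)(void);

static int debugfs_ready;	/* hypothetical state set by the init function */

/* Fixed shape: the function returns int rather than void. */
int example_debugfs_init(void)
{
	debugfs_ready = 1;	/* stands in for creating the debugfs file */
	return 0;
}

/* Hypothetical runner standing in for the initcall machinery. */
int run_initcall(initcall_t fn)
{
	assert(fn != NULL);
	return fn();
}
```

A function declared as returning void does not match initcall_t, which is the kind of mismatch the compiler output above reports.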
Fixes: e36299efe7d7 ("kcsan, debugfs: Move debugfs file creation out of early init")
Cc: stable <stable@vger.kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Marco Elver <elver@google.com>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Miguel Ojeda <ojeda@kernel.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Paul E. McKenney [Tue, 18 May 2021 17:56:19 +0000 (10:56 -0700)]
Merge branches 'bitmaprange.2021.05.10c', 'doc.2021.05.10c', 'fixes.2021.05.13a', 'kvfree_rcu.2021.05.10c', 'mmdumpobj.2021.05.10c', 'nocb.2021.05.12a', 'srcu.2021.05.12a', 'tasks.2021.05.18a' and 'torture.2021.05.10c' into HEAD
Paul E. McKenney [Tue, 20 Apr 2021 17:58:07 +0000 (10:58 -0700)]
tasks-rcu: Make show_rcu_tasks_gp_kthreads() be static inline
On some architectures, the no-op variant of show_rcu_tasks_gp_kthreads()
gets "no previous prototype" compiler warnings. These are false positives
given that kernel/rcu/tasks.h is included only once. But why put up
with the compiler noise?
This commit therefore adds "static inline" to this definition to force
the compiler to accept this situation, while also moving it to its proper
place in kernel/rcu/rcu.h.
Reported-by: kernel test robot <lkp@intel.com>
[ paulmck: Update per Stephen Rothwell feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Paul E. McKenney [Thu, 25 Mar 2021 00:08:48 +0000 (17:08 -0700)]
rcu-tasks: Make ksoftirqd provide RCU Tasks quiescent states
Heavy networking load can cause a CPU to execute continuously and
indefinitely within ksoftirqd, in which case there will be no voluntary
task switches and thus no RCU-tasks quiescent states. This commit
therefore causes the existing rcu_softirq_qs() to provide an RCU-tasks
quiescent state.
This of course means that __do_softirq() and its callers cannot be
invoked from within a tracing trampoline.
Reported-by: Toke Høiland-Jørgensen <toke@redhat.com>
Tested-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
The deferred quiescent states resulting from the consolidation of RCU-bh
and RCU-sched into RCU means that rcu_read_unlock() will no longer attempt
to acquire scheduler locks if interrupts were disabled across that call
to rcu_read_unlock(). The cautions in the rcu_read_unlock() header
comment are therefore obsolete. This commit therefore removes them.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
There are a number of places that call out the fact that preempt-disable
regions of code now act as RCU read-side critical sections, where
preempt-disable regions of code include irq-disable regions of code,
bh-disable regions of code, hardirq handlers, and NMI handlers. However,
someone relying solely on (for example) the call_rcu() header comment
might well have no idea that preempt-disable regions of code have RCU
semantics.
This commit therefore updates the header comments for
call_rcu(), synchronize_rcu(), rcu_dereference_bh_check(), and
rcu_dereference_sched_check() to call out these new(ish) forms of RCU
readers.
Reported-by: Michel Lespinasse <michel@lespinasse.org>
[ paulmck: Apply Matthew Wilcox and Michel Lespinasse feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Paul E. McKenney [Wed, 21 Apr 2021 21:30:54 +0000 (14:30 -0700)]
rcu: Create an unrcu_pointer() to remove __rcu from a pointer
The xchg() and cmpxchg() functions are sometimes used to carry out RCU
updates. Unfortunately, this can result in sparse warnings for both
the old-value and new-value arguments, as well as for the return value.
The arguments can be dealt with using RCU_INITIALIZER():
old_p = xchg(&p, RCU_INITIALIZER(new_p));
But a sparse warning still remains due to assigning the __rcu pointer
returned from xchg to the (most likely) non-__rcu pointer old_p.
This commit therefore provides an unrcu_pointer() macro that strips
the __rcu. This macro can be used as follows:
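
A minimal userspace sketch of such a use, under stated assumptions: the macro definitions below are simplified stand-ins written for this illustration and are not the kernel's definitions. __rcu is empty because it only matters to sparse, xchg() is modeled with the GCC/Clang __atomic_exchange_n() builtin, and struct foo, gp and replace_foo() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for illustration only. */
#define __rcu			/* sparse-only annotation, empty in a normal build */
#define RCU_INITIALIZER(v)	((__typeof__(*(v)) __rcu *)(v))
#define unrcu_pointer(p)	((__typeof__(*(p)) *)(p))
#define xchg(ptr, v)		__atomic_exchange_n((ptr), (v), __ATOMIC_SEQ_CST)

struct foo {
	int a;
};

static struct foo __rcu *gp;	/* hypothetical RCU-protected pointer */

/* Publish new_p and return the previous pointer with __rcu stripped. */
struct foo *replace_foo(struct foo *new_p)
{
	struct foo *old_p;

	assert(new_p != NULL);
	old_p = unrcu_pointer(xchg(&gp, RCU_INITIALIZER(new_p)));
	return old_p;
}
```

Wrapping the xchg() result in unrcu_pointer() is what would let the assignment to the plain old_p pointer pass sparse's address-space check in a real kernel build.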
Place an early call to start_poll_synchronize_srcu() before the invocation
of call_srcu() on the same srcu_struct structure.
After the later call to srcu_barrier(), the completion of the
first grace period should be visible to a subsequent invocation of
poll_state_synchronize_srcu(), and if not, warn.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Ingo Molnar [Tue, 23 Mar 2021 05:29:10 +0000 (22:29 -0700)]
rcu: Fix various typos in comments
Fix ~12 single-word typos in RCU code comments.
[ paulmck: Apply feedback from Randy Dunlap. ]
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Frederic Weisbecker [Tue, 23 Feb 2021 00:10:11 +0000 (01:10 +0100)]
rcu/nocb: Unify timers
Now that ->nocb_timer and ->nocb_bypass_timer have become quite similar,
this commit merges them together. A new RCU_NOCB_WAKE_BYPASS wake level
is introduced. As a result, timers perform all kinds of deferred wake
ups but other deferred wakeup callsites only handle non-bypass wakeups
in order not to wake up rcuo too early.
The timer also unconditionally executes a full barrier so as to order
timer_pending() and callback enqueue although the path performing
RCU_NOCB_WAKE_FORCE that makes use of it is debatable. It should also
test against the rdp leader instead of the current rdp.
This unconditional full barrier shouldn't bring visible overhead since
these timers almost never fire.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Frederic Weisbecker [Tue, 23 Feb 2021 00:10:10 +0000 (01:10 +0100)]
rcu/nocb: Prepare for fine-grained deferred wakeup
Tuning the deferred wakeup level must be done from a safe wakeup
point. Currently those sites are:
* ->nocb_timer
* user/idle/guest entry
* CPU down
* softirq/rcuc
All of these sites perform the wake up for both RCU_NOCB_WAKE and
RCU_NOCB_WAKE_FORCE.
In order to merge ->nocb_timer and ->nocb_bypass_timer together, we plan
to add a new RCU_NOCB_WAKE_BYPASS that really should be deferred until
a timer fires so that we don't wake up the NOCB-gp kthread too early.
To prepare for that, this commit specifies the per-callsite wakeup
level/limit.
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
[ paulmck: Fix non-NOCB rcu_nocb_need_deferred_wakeup() definition. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Frederic Weisbecker [Tue, 23 Feb 2021 00:10:08 +0000 (01:10 +0100)]
rcu/nocb: Delete bypass_timer upon nocb_gp wakeup
A NOCB-gp wake up can safely delete the ->nocb_bypass_timer because
nocb_gp_wait() will recheck the bypass state and rearm the bypass
timer if necessary. This commit therefore deletes this timer.
Reviewed-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Frederic Weisbecker [Tue, 23 Feb 2021 00:10:07 +0000 (01:10 +0100)]
rcu/nocb: Cancel nocb_timer upon nocb_gp wakeup
When waking up in nocb_gp_wait(), there is no need to keep the nocb_timer
around because this function will traverse the whole rdp list. Any
update performed before the timer was armed will now be visible after
the ->nocb_gp_lock acquire.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Frederic Weisbecker [Tue, 23 Feb 2021 00:10:06 +0000 (01:10 +0100)]
rcu/nocb: Allow de-offloading rdp leader
The only thing that prevented an rdp leader from being de-offloaded was
the nocb_bypass_timer that used to lock the nocb_lock of the rdp leader.
If an rdp gets de-offloaded, it will subtly ignore rcu_nocb_lock()
calls and do its job in the timer unsafely. Worse yet: If it gets
re-offloaded in the middle of the timer, rcu_nocb_unlock() would try to
unlock, leaving it imbalanced.
Now that the nocb_bypass_timer doesn't use the nocb_lock anymore,
de-offloading the rdp leader is now safe. This commit therefore allows
the rdp leader to be de-offloaded.
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Frederic Weisbecker [Tue, 23 Feb 2021 00:10:05 +0000 (01:10 +0100)]
rcu/nocb: Directly call __wake_nocb_gp() from bypass timer
The bypass timer calls __call_rcu_nocb_wake() instead of directly
calling __wake_nocb_gp(). The only difference here is that
rdp->qlen_last_fqs_check gets overridden. But resetting the deferred
force quiescent state base shouldn't be relevant for that timer. In fact
the bypass queue in question can be for any rdp from the group and not
necessarily the rdp leader on which the bypass timer is attached.
This commit therefore calls __wake_nocb_gp() directly. This way we
don't even need to lock the ->nocb_lock.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Neeraj Upadhyay <neeraju@codeaurora.org> Cc: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Paul E. McKenney [Thu, 15 Apr 2021 23:30:34 +0000 (16:30 -0700)]
rcu: Don't penalize priority boosting when there is nothing to boost
RCU priority boosting cannot do anything unless there is at least one
task blocking the current RCU grace period that was preempted within
the RCU read-side critical section that it still resides in. However,
the current rcu_torture_boost_failed() code will count this as an RCU
priority-boosting failure if there were no CPUs blocking the current
grace period. This situation can happen (for example) if the last CPU
blocking the current grace period was subjected to vCPU preemption,
which is always a risk for rcutorture guest OSes.
This commit therefore causes rcu_torture_boost_failed() to refrain from
reporting failure unless there is at least one task blocking the current
RCU grace period that was preempted within the RCU read-side critical
section that it still resides in.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Paul E. McKenney [Tue, 25 Jun 2019 05:30:32 +0000 (22:30 -0700)]
tools/memory-model: Use "-unroll 0" to keep --hw runs finite
Litmus tests involving atomic operations produce LL/SC loops on a number
of architectures, and unrolling these loops can result in excessive
verification times or even stack overflows. This commit therefore uses
the "-unroll 0" herd7 argument to avoid unrolling, on the grounds that
additional passes through an LL/SC loop should not change the verification.
Note, however, that certain bugs in the mapping of the LL/SC loop to
machine instructions may go undetected. On the other hand, herd7 might
not be the best vehicle for finding such bugs in any case. (You do
stress-test your architecture-specific code, don't you?)
Suggested-by: Luc Maranget <luc.maranget@inria.fr> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Paul E. McKenney [Thu, 6 Jun 2019 09:13:27 +0000 (02:13 -0700)]
tools/memory-model: Make judgelitmus.sh handle scripted Result: tag
The scripts that generate the litmus tests in the "auto" directory of
the https://github.com/paulmckrcu/litmus archive place the "Result:"
tag into a single-line ocaml comment, which judgelitmus.sh currently
does not recognize. This commit therefore makes judgelitmus.sh
recognize both the multiline comment format that it currently does
and the automatically generated single-line format.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Paul E. McKenney [Fri, 3 May 2019 14:34:20 +0000 (07:34 -0700)]
tools/memory-model: Add data-race capabilities to judgelitmus.sh
This commit adds functionality to judgelitmus.sh to allow it to handle
both the "DATARACE" markers in the "Result:" comments in litmus tests
and the "Flag data-race" markers in LKMM output. For C-language tests,
if either marker is present, the other must be present as well, at least for
litmus tests having a "Result:" comment. If the LKMM output indicates
a data race, then failures of the Always/Sometimes/Never portion of the
"Result:" prediction are forgiven.
The reason for forgiving "Result:" mispredictions is that data races can
result in "interesting" compiler optimizations, so that all bets are off
in the data-race case.
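The judging rule described above can be sketched as a small decision function. This is a hypothetical Python model rather than the judgelitmus.sh shell code: the function name, its parameters, and the returned strings are assumptions made for illustration only.

```python
def judge(expected, observed, result_says_datarace, lkmm_flags_datarace):
    """Model of the data-race rule for a C-language test that has a "Result:" comment.

    expected and observed are each "Always", "Sometimes", or "Never".
    """
    # The "DATARACE" marker and the "Flag data-race" marker must agree.
    if result_says_datarace != lkmm_flags_datarace:
        return "data-race marker mismatch"
    # With a data race, Always/Sometimes/Never mispredictions are forgiven.
    if lkmm_flags_datarace:
        return "ok"
    return "ok" if expected == observed else "misprediction"
```

For example, a test predicted "Never" but observed "Sometimes" is accepted when both markers indicate a data race, and reported as a misprediction when neither does.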
[ paulmck: Apply Akira Yokosawa feedback. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Paul E. McKenney [Thu, 2 May 2019 17:05:14 +0000 (10:05 -0700)]
tools/memory-model: Add checktheselitmus.sh to run specified litmus tests
This commit adds a checktheselitmus.sh script that runs the litmus tests
specified on the command line. This is useful for verifying fixes to
specific litmus tests.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Paul E. McKenney [Thu, 2 May 2019 16:51:57 +0000 (09:51 -0700)]
tools/memory-model: Add "--" to parseargs.sh for additional arguments
Currently, parseargs.sh expects to consume all the command-line arguments,
which prevents the calling script from having any of its own arguments.
This commit therefore causes parseargs.sh to stop consuming arguments
when it encounters a "--" argument, leaving any remaining arguments for
the calling script.
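The "--" convention can be illustrated with a short Python sketch. It is not the parseargs.sh implementation: the function name, the set of known options, and the error handling are assumptions chosen only to show where argument consumption stops.

```python
def parse_args(argv, known=("--jobs", "--timeout")):
    """Consume known "--option value" pairs; stop at "--" and return the rest."""
    opts = {}
    i = 0
    while i < len(argv):
        arg = argv[i]
        if arg == "--":
            # Everything after "--" is left for the calling script.
            return opts, argv[i + 1:]
        if arg not in known or i + 1 >= len(argv):
            raise ValueError(f"unknown or incomplete argument: {arg}")
        opts[arg] = argv[i + 1]
        i += 2
    return opts, []
```

Under these assumptions, parse_args(["--jobs", "4", "--", "a.litmus"]) hands "a.litmus" back to the caller instead of rejecting it.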
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Paul E. McKenney [Mon, 8 Apr 2019 17:02:23 +0000 (10:02 -0700)]
tools/memory-model: Make history-check scripts use mselect7
The history-check scripts currently use grep to ignore non-C-language
litmus tests, which is a bit fragile. This commit therefore enlists the
aid of "mselect7 -arch C", given Luc Maranget's recent modifications that
allow mselect7 to operate in filter mode.
Paul E. McKenney [Mon, 8 Apr 2019 16:27:28 +0000 (09:27 -0700)]
tools/memory-model: Make checkghlitmus.sh use mselect7
The checkghlitmus.sh script currently uses grep to ignore non-C-language
litmus tests, which is a bit fragile. This commit therefore enlists the
aid of "mselect7 -arch C", given Luc Maranget's recent modifications that
allow mselect7 to operate in filter mode.
Paul E. McKenney [Wed, 27 Mar 2019 18:47:14 +0000 (11:47 -0700)]
tools/memory-model: Fix scripting --jobs argument
The parseargs.sh regular expression for the --jobs argument incorrectly
requires that the number of jobs be at least 10, that is, have at least
two digits. This commit therefore adjusts this regular expression to
allow single-digit numbers of jobs to be specified.
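The kind of regular-expression slip described here is easy to reproduce. The two patterns below are hypothetical stand-ins, written in Python, for the "before" and "after" expressions; the actual parseargs.sh patterns are not quoted in this log.

```python
import re

# Assumed "before" pattern: a nonzero digit followed by one or more digits,
# which rejects every single-digit job count.
TWO_OR_MORE_DIGITS = re.compile(r"^[1-9][0-9]+$")

# Assumed "after" pattern: a nonzero digit followed by zero or more digits.
ONE_OR_MORE_DIGITS = re.compile(r"^[1-9][0-9]*$")


def accepts(pattern, value):
    """Return True if the job-count string matches the pattern."""
    return pattern.match(value) is not None
```

With these patterns, "4" is rejected by the first expression and accepted by the second, while "16" is accepted by both.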
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Paul E. McKenney [Sat, 23 Mar 2019 00:18:43 +0000 (17:18 -0700)]
tools/memory-model: Implement --hw support for checkghlitmus.sh
This commit enables the "--hw" argument for the checkghlitmus.sh script,
causing it to convert any applicable C-language litmus tests to the
specified flavor of assembly language, to verify these assembly-language
litmus tests, and to check compatibility of the outcomes.
Note that the conversion does not yet handle locking, RCU, SRCU, plain
C-language memory accesses, or casts.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Paul E. McKenney [Fri, 5 Apr 2019 19:34:56 +0000 (12:34 -0700)]
tools/memory-model: Add -v flag to jingle7 runs
Adding the -v flag to jingle7 invocations gives much useful information
on why jingle7 didn't like a given litmus test. This commit therefore
adds this flag and saves off any such information into a .err file.
Suggested-by: Luc Maranget <luc.maranget@inria.fr> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Paul E. McKenney [Tue, 26 Mar 2019 00:20:51 +0000 (17:20 -0700)]
tools/memory-model: Make runlitmus.sh check for jingle errors
It turns out that the jingle7 tool is currently a bit picky about
the litmus tests it is willing to process. This commit therefore
ensures that jingle7 failures are reported.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Paul E. McKenney [Fri, 22 Mar 2019 15:57:20 +0000 (08:57 -0700)]
tools/memory-model: Allow herd to deduce CPU type
Currently, the scripts specify the CPU's .cat file to herd. But this is
pointless because herd will select a good and sufficient .cat file from
the assembly-language litmus test itself. This commit therefore removes
the -model argument to herd, allowing herd to figure the CPU family out
itself.
Note that the user can override herd's choice using the "--herdopts"
argument to the scripts.
Suggested-by: Luc Maranget <luc.maranget@inria.fr> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit retains the assembly-language litmus tests generated from
the C-language litmus tests, appending the hardware tag to the original
C-language litmus test's filename. Thus, S+poonceonces.litmus.AArch64
contains the Armv8 assembly language corresponding to the C-language
S+poonceonces.litmus test.
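The naming rule can be written down in one line. The helper below is a hypothetical Python illustration (its name is an assumption), not code from the scripts.

```python
def hw_litmus_name(c_litmus_name, hw_tag):
    """Append the hardware tag to the C-language litmus test's filename."""
    return f"{c_litmus_name}.{hw_tag}"
```

Calling hw_litmus_name("S+poonceonces.litmus", "AArch64") yields the filename used in the example above.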
This commit also updates the .gitignore to avoid committing these
automatically generated assembly-language litmus tests.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Paul E. McKenney [Thu, 21 Mar 2019 21:06:27 +0000 (14:06 -0700)]
tools/memory-model: Move from .AArch64.litmus.out to .litmus.AArch.out
When the github scripts see ".litmus.out", they assume that there must be
a corresponding C-language ".litmus" file. Won't they be disappointed
when they instead see nothing, or, worse yet, the corresponding
assembly-language litmus test? This commit therefore swaps the hardware
tag with the "litmus" to avoid this sort of disappointment.
This commit also adjusts the .gitignore file so as to avoid adding these
new ".out" files to git.
[ paulmck: Apply Akira Yokosawa feedback. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Paul E. McKenney [Wed, 20 Mar 2019 23:41:41 +0000 (16:41 -0700)]
tools/memory-model: Make runlitmus.sh generate .litmus.out for --hw
In the absence of "Result:" comments, the runlitmus.sh script relies on
.litmus.out files from prior LKMM runs. This can be a bit user-hostile,
so this commit makes runlitmus.sh generate any needed .litmus.out files
that don't already exist.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Paul E. McKenney [Wed, 20 Mar 2019 21:57:56 +0000 (14:57 -0700)]
tools/memory-model: Split runlitmus.sh out of checklitmus.sh
This commit prepares for adding --hw capability to github litmus-test
scripts by splitting runlitmus.sh (which simply runs the verification)
out of checklitmus.sh (which also judges the results).
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Paul E. McKenney [Wed, 20 Mar 2019 21:37:46 +0000 (14:37 -0700)]
tools/memory-model: Make judgelitmus.sh ransack .litmus.out files
The judgelitmus.sh script currently relies solely on the "Result:"
comment in the .litmus file. This is problematic when using the --hw
argument, because it is necessary to check the hardware model against
LKMM even in the absence of "Result:" comments.
This commit therefore modifies judgelitmus.sh to check the observation
in a .litmus.out file, in case one was generated by a previous LKMM run.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Paul E. McKenney [Wed, 20 Mar 2019 19:39:27 +0000 (12:39 -0700)]
tools/memory-model: Hardware checking for check{,all}litmus.sh
This commit makes checklitmus.sh and checkalllitmus.sh check to see
if a hardware verification was specified (via the --hw command-line
argument, which sets the LKMM_HW_MAP_FILE environment variable).
If so, the C-language litmus test is converted to the specified type
of assembly-language litmus test and herd is run on it. Hardware is
permitted to be stronger than LKMM requires, so "Always" and "Never"
verifications of "Sometimes" C-language litmus tests are forgiven.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Paul E. McKenney [Tue, 19 Mar 2019 23:37:01 +0000 (16:37 -0700)]
tools/memory-model: Fix checkalllitmus.sh comment
The checkalllitmus.sh script runs litmus tests in the litmus-tests directory,
not those in the github archive, so this commit updates the comment to
reflect this reality.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Paul E. McKenney [Tue, 19 Mar 2019 21:39:10 +0000 (14:39 -0700)]
tools/memory-model: Make judgelitmus.sh handle hardware verifications
This commit makes the judgelitmus.sh script check the --hw argument
(AKA the LKMM_HW_MAP_FILE environment variable) and adjust its
judgment for a run where a C-language litmus test has been translated to
assembly and the assembly version verified. In this case, the assembly
verification output is checked against the C-language script's "Result:"
comment. However, because hardware can be stronger than LKMM requires,
the judgelitmus.sh script forgives verification mismatches featuring
a "Sometimes" in the C-language script and an "Always" or "Never"
assembly-language verification.
Note that deadlock is not forgiven. This should not normally be
an issue, given that C-language tests containing locking, RCU, or SRCU
cannot be translated to assembly. However, this issue can crop up in
litmus tests that mimic deadlock by using the "filter" clause to ignore
all executions. It can also crop up when certain herd arguments are
used to autofilter everything that does not match the "exists" clause
in cases where the "exists" clause cannot be satisfied.
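The forgiveness rule for hardware verifications can be summarized as a small function. This is a hypothetical Python model of the rule as described in this message, not the judgelitmus.sh code; the function name and the returned strings are assumptions.

```python
def hw_verdict(c_result, hw_observed):
    """Compare a C-language "Result:" with the assembly-language outcome.

    Both arguments are "Always", "Sometimes", "Never", or "DEADLOCK".
    """
    if hw_observed == c_result:
        return "ok"
    # Hardware may be stronger than LKMM requires, so a "Sometimes"
    # prediction tolerates an "Always" or "Never" observation.
    if c_result == "Sometimes" and hw_observed in ("Always", "Never"):
        return "forgiven"
    # Everything else, including an unexpected deadlock, is a mismatch.
    return "fail"
```

Note that the forgiveness is one-directional in this sketch: a "Never" prediction that is observed "Sometimes" on hardware is still reported.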
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Paul E. McKenney [Tue, 19 Mar 2019 22:59:26 +0000 (15:59 -0700)]
tools/memory-model: Update parseargs.sh for hardware verification
This commit adds a --hw argument to parseargs.sh to specify the CPU
family for a hardware verification. For example, "--hw AArch64" will
specify that a C-language litmus test is to be translated to ARMv8 and
the result verified. This will set the LKMM_HW_MAP_FILE environment
variable accordingly. If there is no --hw argument, this environment
variable will be set to the empty string.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Paul E. McKenney [Tue, 19 Mar 2019 21:27:06 +0000 (14:27 -0700)]
tools/memory-model: Make judgelitmus.sh detect hard deadlocks
If a litmus test specifies "Result: Never" and if it contains an
unconditional ("hard") deadlock, then running checklitmus.sh on it will
not flag any errors, despite the fact that there are no executions.
This commit therefore updates judgelitmus.sh to complain about tests
with no executions that are marked, but not as "Result: DEADLOCK".
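The new check amounts to a single condition. The Python predicate below is a hypothetical restatement of it; the function name and the way the "Result:" tag and execution count are passed in are assumptions.

```python
def unexpected_deadlock(result_tag, n_executions):
    """True if a marked test has no executions but is not marked DEADLOCK.

    result_tag is the text of the "Result:" comment, or None if the
    litmus test has no such comment.
    """
    return (result_tag is not None
            and result_tag != "DEADLOCK"
            and n_executions == 0)
```

A test marked "Result: Never" with zero executions therefore draws a complaint, whereas the same test marked "Result: DEADLOCK" does not.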
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Paul E. McKenney [Mon, 18 Mar 2019 20:40:57 +0000 (13:40 -0700)]
tools/memory-model: Make judgelitmus.sh identify bad macros
Currently, judgelitmus.sh treats use of unknown primitives (such as
srcu_read_lock() prior to SRCU support) as "!!! Verification error".
This can be misleading because it fails to call out typos and running
a version of LKMM on a litmus test requiring a feature not provided by
that version. This commit therefore changes judgelitmus.sh to check
for unknown primitives and to report them, for example, with:
'!!! Current LKMM version does not know "rcu_write_lock"'.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Paul E. McKenney [Mon, 18 Mar 2019 20:07:46 +0000 (13:07 -0700)]
tools/memory-model: Make cmplitmushist.sh note timeouts
Currently, cmplitmushist.sh treats timeouts (as in the "--timeout"
argument) as "Missing Observation line". This can be misleading because
it is quite possible that running the test longer would have produced
a verification. This commit therefore changes cmplitmushist.sh to check
for timeouts and to report them with "Timed out".
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Paul E. McKenney [Mon, 18 Mar 2019 18:53:50 +0000 (11:53 -0700)]
tools/memory-model: Make judgelitmus.sh note timeouts
Currently, judgelitmus.sh treats timeouts (as in the "--timeout" argument)
as "!!! Verification error". This can be misleading because it is quite
possible that running the test longer would have produced a verification.
This commit therefore changes judgelitmus.sh to check for timeouts and
to report them with "!!! Timeout".
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Paul E. McKenney [Fri, 14 Aug 2020 23:14:34 +0000 (16:14 -0700)]
tools/memory-model: Document locking corner cases
Most Linux-kernel uses of locking are straightforward, but there are
corner-case uses that rely on less well-known aspects of the lock and
unlock primitives. This commit therefore adds a locking.txt and litmus
tests in Documentation/litmus-tests/locking to explain these corner-case
uses.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
A misspelled git-grep regex revealed that smp_mb__after_spinlock()
was misspelled in explanation.txt. This commit adds the missing "_".
Fixes: 1c27b644c0fd ("Automate memory-barriers.txt; provide Linux-kernel memory model")
[ paulmck: Apply Alan Stern commit-log feedback. ] Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Paul E. McKenney [Sun, 11 Apr 2021 17:49:52 +0000 (10:49 -0700)]
rcu: Make rcu_gp_cleanup() be noinline for tracing
Although there are trace events for RCU grace periods, these are only
enabled in CONFIG_RCU_TRACE=y kernels. This commit therefore marks
rcu_gp_cleanup() noinline in order to provide a function that can be
traced that is invoked near the end of each grace period.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Paul E. McKenney [Wed, 7 Apr 2021 22:21:32 +0000 (15:21 -0700)]
rcu: Restrict RCU_STRICT_GRACE_PERIOD to at most four CPUs
Kernels built with CONFIG_RCU_STRICT_GRACE_PERIOD=y can experience
significant lock contention due to RCU's resulting focus on ending grace
periods as soon as possible. This is OK, but only if there are not very
many CPUs. This commit therefore puts this Kconfig option off-limits
to systems with more than four CPUs.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Paul E. McKenney [Wed, 7 Apr 2021 22:14:01 +0000 (15:14 -0700)]
rcu: Make show_rcu_gp_kthreads() dump rcu_node structures blocking GP
Currently, show_rcu_gp_kthreads() only dumps rcu_node structures that
have outdated ideas of the current grace-period number. This commit
also dumps those that are in any way blocking the current grace period.
This helps diagnose RCU priority boosting failures.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Paul E. McKenney [Tue, 6 Apr 2021 03:42:09 +0000 (20:42 -0700)]
rcu: Make RCU priority boosting work on single-CPU rcu_node structures
When any CPU comes online, it checks to see if an RCU-boost kthread has
already been created for that CPU's leaf rcu_node structure, and if
not, it creates one. Unfortunately, it also verifies that this leaf
rcu_node structure actually has at least one online CPU, and if not,
it declines to create the kthread. Although this behavior makes sense
during early boot, especially on systems that claim far more CPUs than
they actually have, it makes no sense for the first CPU to come online
for a given rcu_node structure. There is no point in checking because
we know there is a CPU on its way in.
The problem is that timing differences can cause this incoming CPU to not
yet be reflected in the various bit masks even at rcutree_online_cpu()
time, and there is no chance at rcutree_prepare_cpu() time. Plus it
would be better to create the RCU-boost kthread at rcutree_prepare_cpu()
to handle the case where the CPU is involved in an RCU priority inversion
very shortly after it comes online.
This commit therefore moves the checking to rcu_prepare_kthreads(), which
is called only at early boot, when the check is appropriate. In addition,
it makes rcutree_prepare_cpu() invoke rcu_spawn_one_boost_kthread(), which
no longer does any checking for online CPUs.
With this change, RCU priority boosting tests now pass for short rcutorture
runs, even with single-CPU leaf rcu_node structures.
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Scott Wood <swood@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Paul E. McKenney [Tue, 6 Apr 2021 23:31:42 +0000 (16:31 -0700)]
rcu: Add quiescent states and boost states to show_rcu_gp_kthreads() output
This commit adds each rcu_node structure's ->qsmask and "bBEG" output
indicating whether: (1) There is a boost kthread, (2) A reader needs
to be (or is in the process of being) boosted, (3) A reader is blocking
an expedited grace period, and (4) A reader is blocking a normal grace
period. This helps diagnose RCU priority boosting failures.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Paul E. McKenney [Mon, 5 Apr 2021 16:51:05 +0000 (09:51 -0700)]
rcu: Reject RCU_LOCKDEP_WARN() false positives
If another lockdep report runs concurrently with an RCU lockdep report
from RCU_LOCKDEP_WARN(), the following sequence of events can occur:
1. debug_lockdep_rcu_enabled() sees that lockdep is enabled
when called from (say) synchronize_rcu().
2. Lockdep is disabled by a concurrent lockdep report.
3. RCU_LOCKDEP_WARN() evaluates its lockdep-expression
argument, for example, lock_is_held(&rcu_bh_lock_map).
4. Because lockdep is now disabled, lock_is_held() plays it safe and
returns the constant 1.
5. But in this case, the constant 1 is not safe, because invoking
synchronize_rcu() under rcu_read_lock_bh() is disallowed.
6. RCU_LOCKDEP_WARN() wrongly invokes lockdep_rcu_suspicious(),
resulting in a false-positive splat.
This commit therefore changes RCU_LOCKDEP_WARN() to check
debug_lockdep_rcu_enabled() after checking the lockdep expression,
so that any "safe" returns from lock_is_held() are rejected by
debug_lockdep_rcu_enabled(). This requires memory ordering, which is
supplied by READ_ONCE(debug_locks). The resulting volatile accesses
prevent the compiler from reordering and the fact that only one variable
is being accessed prevents the underlying hardware from reordering.
The combination works for IA64, which can reorder reads to the same
location, but this is defeated by the volatile accesses, which compile
to load instructions that provide ordering.
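The effect of swapping the two checks can be seen in a toy model. The Python sketch below is single-threaded and purely illustrative: the class, the function names, and the disable_between flag (which injects the concurrent lockdep report between the two checks) are assumptions, and it says nothing about the memory-ordering details discussed above.

```python
class LockdepModel:
    """Toy stand-in for lockdep state; really_held is the ground truth."""

    def __init__(self, really_held=False):
        self.debug_locks = 1
        self.really_held = really_held

    def lock_is_held(self):
        # With lockdep disabled, play it safe and report the lock as held.
        if not self.debug_locks:
            return 1
        return 1 if self.really_held else 0

    def enabled(self):
        return self.debug_locks != 0


def warn_enabled_first(m, disable_between):
    """Old order: check that lockdep is enabled, then evaluate the condition."""
    if not m.enabled():
        return False
    if disable_between:
        m.debug_locks = 0  # the concurrent lockdep report lands here
    return bool(m.lock_is_held())


def warn_condition_first(m, disable_between):
    """New order: evaluate the condition, then check lockdep is still enabled."""
    cond = bool(m.lock_is_held())
    if disable_between:
        m.debug_locks = 0
    return cond and m.enabled()
```

In this model the old ordering warns even though the lock is not held, while the new ordering rejects the "safe" return value whether lockdep is disabled before or after the condition is evaluated.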
Reported-by: syzbot+dde0cc33951735441301@syzkaller.appspotmail.com Reported-by: Matthew Wilcox <willy@infradead.org> Reported-by: syzbot+88e4f02896967fe1ab0d@syzkaller.appspotmail.com Reported-by: Thomas Gleixner <tglx@linutronix.de> Suggested-by: Boqun Feng <boqun.feng@gmail.com> Reviewed-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Paul E. McKenney [Mon, 5 Apr 2021 16:47:59 +0000 (09:47 -0700)]
lockdep: Explicitly flag likely false-positive report
The reason that lockdep_rcu_suspicious() prints the value of debug_locks
is that a value of zero indicates a likely false positive. This can
work, but is a bit obtuse. This commit therefore explicitly calls out
the possibility of a false positive.
Reviewed-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Paul E. McKenney [Wed, 31 Mar 2021 17:59:05 +0000 (10:59 -0700)]
rcu: Invoke rcu_spawn_core_kthreads() from rcu_spawn_gp_kthread()
Currently, rcu_spawn_core_kthreads() is invoked via an early_initcall(),
which works, except that rcu_spawn_gp_kthread() is also invoked via an
early_initcall() and rcu_spawn_core_kthreads() relies on adjustments to
kthread_prio that are carried out by rcu_spawn_gp_kthread(). There is
no guarantee of ordering among early_initcall() handlers, and thus no
guarantee that kthread_prio will be properly checked and range-limited
at the time that rcu_spawn_core_kthreads() needs it.
In most cases, this bug is harmless. After all, the only reason that
rcu_spawn_gp_kthread() adjusts the value of kthread_prio is if the user
specified a nonsensical value for this boot parameter, which experience
indicates is rare.
Nevertheless, a bug is a bug. This commit therefore causes the
rcu_spawn_core_kthreads() function to be invoked directly from
rcu_spawn_gp_kthread() after any needed adjustments to kthread_prio have
been carried out.
Fixes: 48d07c04b4cc ("rcu: Enable elimination of Tree-RCU softirq processing") Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Paul E. McKenney [Tue, 30 Mar 2021 20:23:49 +0000 (13:23 -0700)]
rcu: Remove the unused rcu_irq_exit_preempt() function
Commit 9ee01e0f69a9 ("x86/entry: Clean up idtentry_enter/exit()
leftovers") left the rcu_irq_exit_preempt() in place in order to avoid
conflicts with the -rcu tree. Now that this change has long since hit
mainline, this commit removes the no-longer-used rcu_irq_exit_preempt()
function.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Paul E. McKenney [Mon, 3 May 2021 02:56:05 +0000 (19:56 -0700)]
rcutorture: Move mem_dump_obj() tests into separate function
To make the purpose of the code more apparent, this commit moves the
tests of mem_dump_obj() to a new rcu_torture_mem_dump_obj() function
and calls it from rcu_torture_cleanup().
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Paul E. McKenney [Tue, 27 Apr 2021 20:51:35 +0000 (13:51 -0700)]
torture: Don't cap remote runs by build-system number of CPUs
Currently, if a torture scenario requires more CPUs than are present
on the build system, kvm.sh and friends limit the CPUs available to
that scenario. This makes total sense when the build system and the
system running the scenarios are one and the same, but not so much when
remote systems might well have more CPUs.
This commit therefore introduces a --remote flag to kvm.sh that suppresses
this CPU-limiting behavior, and causes kvm-remote.sh to use this flag.
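The CPU-limiting behavior and its suppression can be stated compactly. The function below is a hypothetical Python model, not the kvm.sh logic itself; its name and parameters are assumptions.

```python
def cpus_for_scenario(required, build_system_cpus, remote=False):
    """Return the number of CPUs a scenario is allowed to use."""
    if remote:
        # The remote system may have more CPUs than the build system.
        return required
    # Local runs cannot use more CPUs than the build system has.
    return min(required, build_system_cpus)
```

For a 16-CPU scenario built on an 8-CPU system, this model returns 8 for a local run and 16 when the remote flag is set.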
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Paul E. McKenney [Tue, 27 Apr 2021 16:56:42 +0000 (09:56 -0700)]
torture: Make kvm-remote.sh account for network failure in pathname checks
In a long-duration kvm-remote.sh run, almost all of the remote accesses will
be simple file-existence checks. These are thus the most likely to be caught
out by network failures, which do happen from time to time.
This commit therefore takes a first step towards tolerating temporary
network outages by making the file-existence checks repeat in the face of
such an outage. They also print a message every minute during an outage,
allowing the user to take appropriate action.
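One way to picture the retry behavior is the loop below. It is a hypothetical Python sketch rather than the kvm-remote.sh code: the function name, the use of OSError to signal an unreachable system, the five-second polling interval, and the message text are all assumptions.

```python
def wait_for_answer(check, sleep, log, poll_seconds=5, report_every=60):
    """Repeat a remote file-existence check until the network answers.

    check() returns True or False for a definite answer and raises
    OSError when the remote system cannot be reached.
    """
    waited = 0
    while True:
        try:
            return check()
        except OSError:
            # Log at the start of the outage and once per report_every seconds.
            if waited % report_every == 0:
                log(f"network outage, waited {waited} seconds so far")
            sleep(poll_seconds)
            waited += poll_seconds
```

Passing sleep and log as parameters keeps the sketch testable without a real network or real delays; a caller would typically supply time.sleep and print.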
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Paul E. McKenney [Wed, 14 Apr 2021 20:00:10 +0000 (13:00 -0700)]
rcutorture: Don't count CPU-stalled time against priority boosting
It will frequently be the case that rcu_torture_boost() will get a
->start_gp_poll() cookie that needs almost all of the current grace period
plus an additional grace period to elapse before ->poll_gp_state() will
return true. It is quite possible that the current grace period will have
(say) two seconds of stall by a CPU failing to pass through a quiescent
state, followed by 300 milliseconds of delay due to a preempted reader.
The next grace period might suffer only one second of stall by a CPU,
followed by another 300 milliseconds of delay due to a preempted reader.
This is an example of RCU priority boosting doing its job, but the full
elapsed time of 3.6 seconds exceeds the 3.5-second limit. In addition,
there is no CPU stall in force at the 3.5-second mark, so this would
nevertheless currently be counted as an RCU priority boosting failure.
This commit therefore avoids this sort of false positive by resetting
the gp_state_time timestamp any time that the current grace period is
being blocked by a CPU. This results in extremely frequent calls to
the ->check_boost_failed() function, so this commit provides a lockless
fastpath that is selected by supplying a NULL CPU-number pointer.
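The arithmetic in the example above, and the effect of resetting the timestamp, can be checked with a few lines of Python. This is an illustrative model only: the phase list restates the numbers from the text in milliseconds, and the function names are assumptions.

```python
LIMIT_MS = 3500  # the 3.5-second limit from the text

# (cause, duration in milliseconds) for the two grace periods described above
PHASES = [("cpu", 2000), ("reader", 300), ("cpu", 1000), ("reader", 300)]


def elapsed_without_reset(phases):
    """Total elapsed time when the timestamp is never reset."""
    return sum(ms for _, ms in phases)


def longest_since_reset(phases):
    """Longest elapsed time when the clock restarts whenever a CPU is blocking."""
    longest = since_reset = 0
    for cause, ms in phases:
        since_reset = 0 if cause == "cpu" else since_reset + ms
        longest = max(longest, since_reset)
    return longest
```

Without the reset the total is 3600 ms, which exceeds the limit; with the reset, the longest interval charged to priority boosting in this model is 300 ms.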
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Paul E. McKenney [Thu, 8 Apr 2021 20:01:14 +0000 (13:01 -0700)]
rcutorture: Forgive RCU boost failures when CPUs don't pass through QS
Currently, rcu_torture_boost() runs CPU-bound at real-time priority
to force RCU priority inversions. It then checks that grace periods
progress during this CPU-bound time. If grace periods fail to progress,
it reports an RCU priority boosting failure.
However, it is possible (and sometimes does happen) that the grace period
fails to progress due to a CPU failing to pass through a quiescent state
for an extended time period (3.5 seconds by default). This can happen
due to vCPU preemption, long-running interrupts, and much else besides.
There is nothing that RCU priority boosting can do about these situations,
and so they should not be counted as RCU priority boosting failures.
This commit therefore checks for CPUs (as opposed to preempted tasks)
holding up a grace period, and flags the resulting RCU priority boosting
failures, but does not splat nor count them as errors. It does rate-limit
them to avoid flooding the console log.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Paul E. McKenney [Thu, 8 Apr 2021 17:46:55 +0000 (10:46 -0700)]
rcutorture: Make rcu_torture_boost_failed() check for GP end
It is possible that a delayed grace period that rcu_torture_boost()
was polling for ended while rcu_torture_boost_failed() was printing the
failure splat. It would be good to know when this happens. This commit
therefore has rcu_torture_boost_failed() recheck the grace period after
printing the splat, and print a message indicating whether or not
the grace period has ended.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Paul E. McKenney [Thu, 8 Apr 2021 03:00:00 +0000 (20:00 -0700)]
rcutorture: Consolidate rcu_torture_boost() timing and statistics
This commit consolidates two loops in rcu_torture_boost(), one of which
counts the number of boost-test episodes and the other of which computes
the start time of the next episode, into one loop that does both with but
a single acquisition of boost_mutex. This means that the count of the
number of boost-test episodes is incremented after an episode completes
rather than before it starts, but it also avoids the over-counting that
was possible previously.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Paul E. McKenney [Thu, 8 Apr 2021 00:09:37 +0000 (17:09 -0700)]
rcutorture: Delay-based false positives for RCU priority boosting tests
If an rcu_torture_boost() kthread determines that its grace period
has not yet ended, it invokes rcu_torture_boost_failed() which checks
whether enough time has elapsed for this to be considered a failure of
RCU priority boosting, and, if so, flags the error.
Unfortunately, that kthread might be preempted for some seconds between
the time that it checks the grace period and the time that it checks the
time. This delay can result in a false positive, featuring a complaint
that a particular grace period has not ended, followed by a diagnostic
dump featuring a much later grace period.
This commit avoids these false positives by rechecking for the end of
the grace period after the time check.
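The check, time check, recheck sequence can be sketched as follows. This is a hypothetical Python model, not the rcutorture code: the function name and the use of callables for the grace-period state and the clock are assumptions that let time pass between the individual checks.

```python
def boost_failed(gp_ended, now, start, limit):
    """Decide whether to flag an RCU priority boosting failure.

    gp_ended and now are callables so that a preemption between the
    checks can be modeled.
    """
    if gp_ended():
        return False
    if now() - start < limit:
        return False
    # Recheck: the kthread may have been preempted before reading the time,
    # and the grace period may have ended in the meantime.
    return not gp_ended()
```

If the grace period ends during a preemption between the first check and the time check, the final recheck suppresses the false positive.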
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Paul E. McKenney [Thu, 1 Apr 2021 22:26:56 +0000 (15:26 -0700)]
torture: Set kvm.sh language to English
Some of the code invoked directly and indirectly from kvm.sh parses
the output of commands. This parsing assumes English, which can cause
failures if the user has set some other language. In a few cases,
there are language-independent commands available, but this is not
always the case. Therefore, as an alternative to polyglot parsing,
this commit sets the LANG environment variable to en_US.UTF-8.
Reported-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Paul E. McKenney [Tue, 30 Mar 2021 23:30:32 +0000 (16:30 -0700)]
rcutorture: Judge RCU priority boosting on grace periods, not callbacks
Currently, rcutorture's testing of RCU priority boosting insists not
only that grace periods complete, but also that callbacks be invoked.
Although this is in fact what the user would want, ensuring that there
is sufficient CPU bandwidth devoted to callback execution is in fact
the user's responsibility. One could argue that rcutorture can take on
that responsibility, which is true in theory. But in practice, ensuring
sufficient CPU bandwidth to ksoftirqd, any rcuc kthreads, and any rcuo
kthreads is not particularly consistent with rcutorture's main job,
that of stress-testing RCU. In addition, if the system administrator
(say) makes very poor choices when pinning rcuo kthreads and then runs
rcutorture, there really isn't much rcutorture can do.
Besides, RCU priority boosting only boosts lagging readers, not all the
machinery required to invoke callbacks in a timely fashion.
This commit therefore switches rcutorture's evaluation of RCU priority
boosting from callback execution to grace-period completion by using
the new start_poll_synchronize_rcu() and poll_state_synchronize_rcu()
functions. When rcutorture is built in (as in when there is no innocent
workload to inconvenience), the ksoftirqd kthreads are boosted to real-time
priority 2 in order to allow timeouts to work properly in the face of
rcutorture's testing of RCU priority boosting.
Indeed, it is not as easy as it looks to create a reliable test of RCU
priority boosting without destroying the rest of the kernel!
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Paul E. McKenney [Fri, 26 Mar 2021 02:39:14 +0000 (19:39 -0700)]
torture: Make kvm-find-errors.sh account for kvm-remote.sh
Currently, kvm-find-errors.sh assumes that if "--buildonly" appears in
the log file, then the run did builds but ran no kernels. This breaks
with kvm-remote.sh, which uses kvm.sh to do a build, then kvm-again.sh
to run the kernels built on remote systems. This commit therefore adds
a check for a kvm-remote.sh run.
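The adjusted decision can be modeled in a few lines. The Python sketch below is hypothetical: how kvm-find-errors.sh actually recognizes a kvm-remote.sh run is not described here, so that fact is passed in as a parameter, and the regular expression, which accepts either spelling of the build-only flag, is likewise an assumption.

```python
import re

# Accept the flag with or without the hyphen between "build" and "only".
BUILD_ONLY = re.compile(r"--build-?only\b")


def ran_no_kernels(log_text, is_kvm_remote_run):
    """True if the log indicates a build-only run that never ran kernels."""
    if is_kvm_remote_run:
        # kvm-remote.sh builds with kvm.sh, then runs the kernels remotely.
        return False
    return BUILD_ONLY.search(log_text) is not None
```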
While in the area, this commit checks for "--build-only" as well as
"--buildonly".
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>