]> www.infradead.org Git - linux.git/log
linux.git
3 years agotracing: Add "(fault)" name injection to kernel probes
Steven Rostedt (Google) [Wed, 12 Oct 2022 10:40:57 +0000 (06:40 -0400)]
tracing: Add "(fault)" name injection to kernel probes

Have the specific functions for kernel probes that read strings to inject
the "(fault)" name directly. trace_probes.c does this too (for uprobes)
but as the code to read strings are going to be used by synthetic events
(and perhaps other utilities), it simplifies the code by making sure those
other uses do not need to implement the "(fault)" name injection as well.

Link: https://lkml.kernel.org/r/20221012104534.644803645@goodmis.org
Cc: stable@vger.kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tom Zanussi <zanussi@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Fixes: bd82631d7ccdc ("tracing: Add support for dynamic strings to synthetic events")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
3 years agotracing: Move duplicate code of trace_kprobe/eprobe.c into header
Steven Rostedt (Google) [Wed, 12 Oct 2022 10:40:56 +0000 (06:40 -0400)]
tracing: Move duplicate code of trace_kprobe/eprobe.c into header

The functions:

  fetch_store_strlen_user()
  fetch_store_strlen()
  fetch_store_string_user()
  fetch_store_string()

are identical in both trace_kprobe.c and trace_eprobe.c. Move them into
a new header file trace_probe_kernel.h to share it. This code will later
be used by the synthetic events as well.

Marked for stable as a fix for a crash in synthetic events requires it.

Link: https://lkml.kernel.org/r/20221012104534.467668078@goodmis.org
Cc: stable@vger.kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tom Zanussi <zanussi@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Fixes: bd82631d7ccdc ("tracing: Add support for dynamic strings to synthetic events")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
3 years agoring-buffer: Fix kernel-doc
Jiapeng Chong [Sun, 9 Oct 2022 02:06:42 +0000 (10:06 +0800)]
ring-buffer: Fix kernel-doc

kernel/trace/ring_buffer.c:895: warning: expecting prototype for ring_buffer_nr_pages_dirty(). Prototype was for ring_buffer_nr_dirty_pages() instead.
kernel/trace/ring_buffer.c:5313: warning: expecting prototype for ring_buffer_reset_cpu(). Prototype was for ring_buffer_reset_online_cpus() instead.
kernel/trace/ring_buffer.c:5382: warning: expecting prototype for rind_buffer_empty(). Prototype was for ring_buffer_empty() instead.

Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=2340
Link: https://lkml.kernel.org/r/20221009020642.12506-1-jiapeng.chong@linux.alibaba.com
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
3 years agoftrace: Fix char print issue in print_ip_ins()
Zheng Yejian [Tue, 11 Oct 2022 12:03:52 +0000 (12:03 +0000)]
ftrace: Fix char print issue in print_ip_ins()

When ftrace bug happened, following log shows every hex data in
problematic ip address:
  actual:   ffffffe8:6b:ffffffd9:01:21

But so many 'f's seem a little confusing, and that is because format
'%x' being used to print signed chars in array 'ins'. As suggested
by Joe, change to use format "%*phC" to print array 'ins'.

After this patch, the log is like:
  actual:   e8:6b:d9:01:21

Link: https://lkml.kernel.org/r/20221011120352.1878494-1-zhengyejian1@huawei.com
Fixes: 6c14133d2d3f ("ftrace: Do not blindly read the ip address in ftrace_bug()")
Suggested-by: Joe Perches <joe@perches.com>
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
3 years agoftrace: Create separate entry in MAINTAINERS for function hooks
Steven Rostedt (Google) [Thu, 6 Oct 2022 14:43:48 +0000 (10:43 -0400)]
ftrace: Create separate entry in MAINTAINERS for function hooks

The function hooks (ftrace) is a completely different subsystem from the
general tracing. It manages how to attach callbacks to most functions in
the kernel. It is also used by live kernel patching. It really is not part
of tracing, although tracing uses it.

Create a separate entry for FUNCTION HOOKS (FTRACE) to be separate from
tracing itself in the MAINTAINERS file.

Perhaps it should be moved out of the kernel/trace directory, but that's
for another time.

Link: https://lkml.kernel.org/r/20221006144439.459272364@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
3 years agotracing: Update MAINTAINERS to reflect new tracing git repo
Steven Rostedt (Google) [Thu, 6 Oct 2022 14:43:47 +0000 (10:43 -0400)]
tracing: Update MAINTAINERS to reflect new tracing git repo

The tracing git repo will no longer be housed in my personal git repo,
but instead live in trace/linux-trace.git.

Update the MAINTAINERS file appropriately.

Link: https://lkml.kernel.org/r/20221006144439.282193367@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
3 years agotracing: Do not free snapshot if tracer is on cmdline
Steven Rostedt (Google) [Wed, 5 Oct 2022 15:37:57 +0000 (11:37 -0400)]
tracing: Do not free snapshot if tracer is on cmdline

The ftrace_boot_snapshot and alloc_snapshot cmdline options allocate the
snapshot buffer at boot up for use later. The ftrace_boot_snapshot in
particular requires the snapshot to be allocated because it will take a
snapshot at the end of boot up allowing to see the traces that happened
during boot so that it's not lost when user space takes over.

When a tracer is registered (started) there's a path that checks if it
requires the snapshot buffer or not, and if it does not and it was
allocated it will do a synchronization and free the snapshot buffer.

This is only required if the previous tracer was using it for "max
latency" snapshots, as it needs to make sure all max snapshots are
complete before freeing. But this is only needed if the previous tracer
was using the snapshot buffer for latency (like irqoff tracer and
friends). But it does not make sense to free it, if the previous tracer
was not using it, and the snapshot was allocated by the cmdline
parameters. This basically takes away the point of allocating it in the
first place!

Note, the allocated snapshot worked fine for just trace events, but fails
when a tracer is enabled on the cmdline.

Further investigation, this goes back even further and it does not require
a tracer on the cmdline to fail. Simply enable snapshots and then enable a
tracer, and it will remove the snapshot.

Link: https://lkml.kernel.org/r/20221005113757.041df7fe@gandalf.local.home
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@vger.kernel.org
Fixes: 45ad21ca5530 ("tracing: Have trace_array keep track if snapshot buffer is allocated")
Reported-by: Ross Zwisler <zwisler@kernel.org>
Tested-by: Ross Zwisler <zwisler@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
3 years agoftrace: Still disable enabled records marked as disabled
Steven Rostedt (Google) [Wed, 5 Oct 2022 04:38:09 +0000 (00:38 -0400)]
ftrace: Still disable enabled records marked as disabled

Weak functions started causing havoc as they showed up in the
"available_filter_functions" and this confused people as to why some
functions marked as "notrace" were listed, but when enabled they did
nothing. This was because weak functions can still have fentry calls, and
these addresses get added to the "available_filter_functions" file.
kallsyms is what converts those addresses to names, and since the weak
functions are not listed in kallsyms, it would just pick the function
before that.

To solve this, there was a trick to detect weak functions listed, and
these records would be marked as DISABLED so that they do not get enabled
and are mostly ignored. As the processing of the list of all functions to
figure out what is weak or not can take a long time, this process is put
off into a kernel thread and run in parallel with the rest of start up.

Now the issue happens whet function tracing is enabled via the kernel
command line. As it starts very early in boot up, it can be enabled before
the records that are weak are marked to be disabled. This causes an issue
in the accounting, as the weak records are enabled by the command line
function tracing, but after boot up, they are not disabled.

The ftrace records have several accounting flags and a ref count. The
DISABLED flag is just one. If the record is enabled before it is marked
DISABLED it will get an ENABLED flag and also have its ref counter
incremented. After it is marked for DISABLED, neither the ENABLED flag nor
the ref counter is cleared. There's sanity checks on the records that are
performed after an ftrace function is registered or unregistered, and this
detected that there were records marked as ENABLED with ref counter that
should not have been.

Note, the module loading code uses the DISABLED flag as well to keep its
functions from being modified while its being loaded and some of these
flags may get set in this process. So changing the verification code to
ignore DISABLED records is a no go, as it still needs to verify that the
module records are working too.

Also, the weak functions still are calling a trampoline. Even though they
should never be called, it is dangerous to leave these weak functions
calling a trampoline that is freed, so they should still be set back to
nops.

There's two places that need to not skip records that have the ENABLED
and the DISABLED flags set. That is where the ftrace_ops is processed and
sets the records ref counts, and then later when the function itself is to
be updated, and the ENABLED flag gets removed. Add a helper function
"skip_record()" that returns true if the record has the DISABLED flag set
but not the ENABLED flag.

Link: https://lkml.kernel.org/r/20221005003809.27d2b97b@gandalf.local.home
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@vger.kernel.org
Fixes: b39181f7c6907 ("ftrace: Add FTRACE_MCOUNT_MAX_OFFSET to avoid adding weak function")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
3 years agotracing/user_events: Move pages/locks into groups to prepare for namespaces
Beau Belgrave [Sat, 1 Oct 2022 00:10:16 +0000 (17:10 -0700)]
tracing/user_events: Move pages/locks into groups to prepare for namespaces

In order to enable namespaces or any sort of isolation within
user_events the register lock and pages need to be broken up into
groups. Each event and file now has a group pointer which stores the
actual pages to map, lookup data and synchronization objects.

This only enables a single group that maps to init_user_ns, as IMA
namespace has done. This enables user_events to start the work of
supporting namespaces by walking the namespaces up to the init_user_ns.
Future patches will address other user namespaces and will align to the
approaches the IMA namespace uses.

Link: https://lore.kernel.org/linux-kernel/20220915193221.1728029-15-stefanb@linux.ibm.com/#t
Link: https://lkml.kernel.org/r/20221001001016.2832-2-beaub@linux.microsoft.com
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
3 years agotracing: Add Masami Hiramatsu as co-maintainer
Steven Rostedt (Google) [Fri, 30 Sep 2022 16:41:31 +0000 (12:41 -0400)]
tracing: Add Masami Hiramatsu as co-maintainer

Masami has been maintaining kprobes for a while now and that code has
been an integral part of tracing. He has also been an excellent reviewer
of all the tracing code and contributor as well.

The tracing subsystem needs another active maintainer to keep it running
smoothly, and I do not know anyone more qualified for the job than Masami.

Ingo has also told me that he has not been active in the tracing code for
some time and said he could be removed from the TRACING portion of the
MAINTAINERS file.

Link: https://lkml.kernel.org/r/20220930124131.7b6432dd@gandalf.local.home
Acked-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
3 years agotracing: Remove unused variable 'dups'
Chen Zhongjin [Fri, 30 Sep 2022 10:32:36 +0000 (18:32 +0800)]
tracing: Remove unused variable 'dups'

Reported by Clang [-Wunused-but-set-variable]

'commit c193707dde77 ("tracing: Remove code which merges duplicates")'
This commit removed the code which merges duplicates in detect_dups(),
but forgot to delete the variable 'dups' which used to merge
duplicates in the loop.

Now only 'total_dups' is needed, remove 'dups' for clean code.

Link: https://lkml.kernel.org/r/20220930103236.253985-1-chenzhongjin@huawei.com
Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
3 years agoMAINTAINERS: add myself as a tracing reviewer
Mark Rutland [Wed, 28 Sep 2022 11:46:21 +0000 (12:46 +0100)]
MAINTAINERS: add myself as a tracing reviewer

Since I'm actively involved in a number of arch bits that intersect
ftrace (e.g. the actual arch implementation on arm64, stacktracing,
entry management, and general instrumentation safety), add myself as a
reviewer of the core ftrace code so that I have the change to catch any
potential problems early.

I spoke with Steven about this at LPC, and it seemed to make sense to
add me as a reviewer.

Link: https://lkml.kernel.org/r/20220928114621.248038-1-mark.rutland@arm.com
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
3 years agoring-buffer: Fix race between reset page and reading page
Steven Rostedt (Google) [Thu, 29 Sep 2022 14:49:09 +0000 (10:49 -0400)]
ring-buffer: Fix race between reset page and reading page

The ring buffer is broken up into sub buffers (currently of page size).
Each sub buffer has a pointer to its "tail" (the last event written to the
sub buffer). When a new event is requested, the tail is locally
incremented to cover the size of the new event. This is done in a way that
there is no need for locking.

If the tail goes past the end of the sub buffer, the process of moving to
the next sub buffer takes place. After setting the current sub buffer to
the next one, the previous one that had the tail go passed the end of the
sub buffer needs to be reset back to the original tail location (before
the new event was requested) and the rest of the sub buffer needs to be
"padded".

The race happens when a reader takes control of the sub buffer. As readers
do a "swap" of sub buffers from the ring buffer to get exclusive access to
the sub buffer, it replaces the "head" sub buffer with an empty sub buffer
that goes back into the writable portion of the ring buffer. This swap can
happen as soon as the writer moves to the next sub buffer and before it
updates the last sub buffer with padding.

Because the sub buffer can be released to the reader while the writer is
still updating the padding, it is possible for the reader to see the event
that goes past the end of the sub buffer. This can cause obvious issues.

To fix this, add a few memory barriers so that the reader definitely sees
the updates to the sub buffer, and also waits until the writer has put
back the "tail" of the sub buffer back to the last event that was written
on it.

To be paranoid, it will only spin for 1 second, otherwise it will
warn and shutdown the ring buffer code. 1 second should be enough as
the writer does have preemption disabled. If the writer doesn't move
within 1 second (with preemption disabled) something is horribly
wrong. No interrupt should last 1 second!

Link: https://lore.kernel.org/all/20220830120854.7545-1-jiazi.li@transsion.com/
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216369
Link: https://lkml.kernel.org/r/20220929104909.0650a36c@gandalf.local.home
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@vger.kernel.org
Fixes: c7b0930857e22 ("ring-buffer: prevent adding write in discarded area")
Reported-by: Jiazi.Li <jiazi.li@transsion.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
3 years agotracing/user_events: Update ABI documentation to align to bits vs bytes
Beau Belgrave [Thu, 28 Jul 2022 23:33:09 +0000 (16:33 -0700)]
tracing/user_events: Update ABI documentation to align to bits vs bytes

Update the documentation to reflect the new ABI requirements and how to
use the byte index with the mask properly to check event status.

Link: https://lkml.kernel.org/r/20220728233309.1896-7-beaub@linux.microsoft.com
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
3 years agotracing/user_events: Use bits vs bytes for enabled status page data
Beau Belgrave [Thu, 28 Jul 2022 23:33:08 +0000 (16:33 -0700)]
tracing/user_events: Use bits vs bytes for enabled status page data

User processes may require many events and when they do the cache
performance of a byte index status check is less ideal than a bit index.
The previous event limit per-page was 4096, the new limit is 32,768.

This change adds a bitwise index to the user_reg struct. Programs check
that the bit at status_bit has a bit set within the status page(s).

Link: https://lkml.kernel.org/r/20220728233309.1896-6-beaub@linux.microsoft.com
Link: https://lore.kernel.org/all/2059213643.196683.1648499088753.JavaMail.zimbra@efficios.com/
Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
3 years agotracing/user_events: Use refcount instead of atomic for ref tracking
Beau Belgrave [Thu, 28 Jul 2022 23:33:07 +0000 (16:33 -0700)]
tracing/user_events: Use refcount instead of atomic for ref tracking

User processes could open up enough event references to cause rollovers.
These could cause use after free scenarios, which we do not want.
Switching to refcount APIs prevent this, but will leak memory once
saturated.

Once saturated, user processes can still use the events. This prevents
a bad user process from stopping existing telemetry from being emitted.

Link: https://lkml.kernel.org/r/20220728233309.1896-5-beaub@linux.microsoft.com
Link: https://lore.kernel.org/all/2059213643.196683.1648499088753.JavaMail.zimbra@efficios.com/
Reported-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
3 years agotracing/user_events: Ensure user provided strings are safely formatted
Beau Belgrave [Thu, 28 Jul 2022 23:33:06 +0000 (16:33 -0700)]
tracing/user_events: Ensure user provided strings are safely formatted

User processes can provide bad strings that may cause issues or leak
kernel details back out. Don't trust the content of these strings
when formatting strings for matching.

This also moves to a consistent dynamic length string creation model.

Link: https://lkml.kernel.org/r/20220728233309.1896-4-beaub@linux.microsoft.com
Link: https://lore.kernel.org/all/2059213643.196683.1648499088753.JavaMail.zimbra@efficios.com/
Reported-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
3 years agotracing/user_events: Use WRITE instead of READ for io vector import
Beau Belgrave [Thu, 28 Jul 2022 23:33:05 +0000 (16:33 -0700)]
tracing/user_events: Use WRITE instead of READ for io vector import

import_single_range expects the direction/rw to be where it came from,
not the protection/limit. Since the import is in a write path use WRITE.

Link: https://lkml.kernel.org/r/20220728233309.1896-3-beaub@linux.microsoft.com
Link: https://lore.kernel.org/all/2059213643.196683.1648499088753.JavaMail.zimbra@efficios.com/
Reported-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
3 years agotracing/user_events: Use NULL for strstr checks
Beau Belgrave [Thu, 28 Jul 2022 23:33:04 +0000 (16:33 -0700)]
tracing/user_events: Use NULL for strstr checks

Trivial fix to ensure strstr checks use NULL instead of 0.

Link: https://lkml.kernel.org/r/20220728233309.1896-2-beaub@linux.microsoft.com
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
3 years agotracing: Fix spelling mistake "preapre" -> "prepare"
Colin Ian King [Wed, 28 Sep 2022 21:58:28 +0000 (22:58 +0100)]
tracing: Fix spelling mistake "preapre" -> "prepare"

There is a spelling mistake in the trace text. Fix it.

Link: https://lkml.kernel.org/r/20220928215828.66325-1-colin.i.king@gmail.com
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
3 years agotracing: Wake up waiters when tracing is disabled
Steven Rostedt (Google) [Wed, 28 Sep 2022 22:22:20 +0000 (18:22 -0400)]
tracing: Wake up waiters when tracing is disabled

When tracing is disabled, there's no reason that waiters should stay
waiting, wake them up, otherwise tasks get stuck when they should be
flushing the buffers.

Cc: stable@vger.kernel.org
Fixes: e30f53aad2202 ("tracing: Do not busy wait in buffer splice")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
3 years agotracing: Add ioctl() to force ring buffer waiters to wake up
Steven Rostedt (Google) [Thu, 29 Sep 2022 13:50:29 +0000 (09:50 -0400)]
tracing: Add ioctl() to force ring buffer waiters to wake up

If a process is waiting on the ring buffer for data, there currently isn't
a clean way to force it to wake up. Add an ioctl call that will force any
tasks that are waiting on the trace_pipe_raw file to wake up.

Link: https://lkml.kernel.org/r/20220929095029.117f913f@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Fixes: e30f53aad2202 ("tracing: Do not busy wait in buffer splice")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
3 years agotracing: Wake up ring buffer waiters on closing of the file
Steven Rostedt (Google) [Tue, 27 Sep 2022 23:15:27 +0000 (19:15 -0400)]
tracing: Wake up ring buffer waiters on closing of the file

When the file that represents the ring buffer is closed, there may be
waiters waiting on more input from the ring buffer. Call
ring_buffer_wake_waiters() to wake up any waiters when the file is
closed.

Link: https://lkml.kernel.org/r/20220927231825.182416969@goodmis.org
Cc: stable@vger.kernel.org
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Fixes: e30f53aad2202 ("tracing: Do not busy wait in buffer splice")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
3 years agoring-buffer: Add ring_buffer_wake_waiters()
Steven Rostedt (Google) [Wed, 28 Sep 2022 17:39:38 +0000 (13:39 -0400)]
ring-buffer: Add ring_buffer_wake_waiters()

On closing of a file that represents a ring buffer or flushing the file,
there may be waiters on the ring buffer that needs to be woken up and exit
the ring_buffer_wait() function.

Add ring_buffer_wake_waiters() to wake up the waiters on the ring buffer
and allow them to exit the wait loop.

Link: https://lkml.kernel.org/r/20220928133938.28dc2c27@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Fixes: 15693458c4bc0 ("tracing/ring-buffer: Move poll wake ups into ring buffer code")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
3 years agoring-buffer: Check pending waiters when doing wake ups as well
Steven Rostedt (Google) [Tue, 27 Sep 2022 23:15:25 +0000 (19:15 -0400)]
ring-buffer: Check pending waiters when doing wake ups as well

The wake up waiters only checks the "wakeup_full" variable and not the
"full_waiters_pending". The full_waiters_pending is set when a waiter is
added to the wait queue. The wakeup_full is only set when an event is
triggered, and it clears the full_waiters_pending to avoid multiple calls
to irq_work_queue().

The irq_work callback really needs to check both wakeup_full as well as
full_waiters_pending such that this code can be used to wake up waiters
when a file is closed that represents the ring buffer and the waiters need
to be woken up.

Link: https://lkml.kernel.org/r/20220927231824.209460321@goodmis.org
Cc: stable@vger.kernel.org
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Fixes: 15693458c4bc0 ("tracing/ring-buffer: Move poll wake ups into ring buffer code")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
3 years agoring-buffer: Have the shortest_full queue be the shortest not longest
Steven Rostedt (Google) [Tue, 27 Sep 2022 23:15:24 +0000 (19:15 -0400)]
ring-buffer: Have the shortest_full queue be the shortest not longest

The logic to know when the shortest waiters on the ring buffer should be
woken up or not has uses a less than instead of a greater than compare,
which causes the shortest_full to actually be the longest.

Link: https://lkml.kernel.org/r/20220927231823.718039222@goodmis.org
Cc: stable@vger.kernel.org
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Fixes: 2c2b0a78b3739 ("ring-buffer: Add percentage of ring buffer full to wake up reader")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
3 years agoring-buffer: Allow splice to read previous partially read pages
Steven Rostedt (Google) [Tue, 27 Sep 2022 18:43:17 +0000 (14:43 -0400)]
ring-buffer: Allow splice to read previous partially read pages

If a page is partially read, and then the splice system call is run
against the ring buffer, it will always fail to read, no matter how much
is in the ring buffer. That's because the code path for a partial read of
the page does will fail if the "full" flag is set.

The splice system call wants full pages, so if the read of the ring buffer
is not yet full, it should return zero, and the splice will block. But if
a previous read was done, where the beginning has been consumed, it should
still be given to the splice caller if the rest of the page has been
written to.

This caused the splice command to never consume data in this scenario, and
let the ring buffer just fill up and lose events.

Link: https://lkml.kernel.org/r/20220927144317.46be6b80@gandalf.local.home
Cc: stable@vger.kernel.org
Fixes: 8789a9e7df6bf ("ring-buffer: read page interface")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
3 years agoftrace: Fix recursive locking direct_mutex in ftrace_modify_direct_caller
Song Liu [Tue, 27 Sep 2022 00:41:46 +0000 (17:41 -0700)]
ftrace: Fix recursive locking direct_mutex in ftrace_modify_direct_caller

Naveen reported recursive locking of direct_mutex with sample
ftrace-direct-modify.ko:

[   74.762406] WARNING: possible recursive locking detected
[   74.762887] 6.0.0-rc6+ #33 Not tainted
[   74.763216] --------------------------------------------
[   74.763672] event-sample-fn/1084 is trying to acquire lock:
[   74.764152] ffffffff86c9d6b0 (direct_mutex){+.+.}-{3:3}, at: \
    register_ftrace_function+0x1f/0x180
[   74.764922]
[   74.764922] but task is already holding lock:
[   74.765421] ffffffff86c9d6b0 (direct_mutex){+.+.}-{3:3}, at: \
    modify_ftrace_direct+0x34/0x1f0
[   74.766142]
[   74.766142] other info that might help us debug this:
[   74.766701]  Possible unsafe locking scenario:
[   74.766701]
[   74.767216]        CPU0
[   74.767437]        ----
[   74.767656]   lock(direct_mutex);
[   74.767952]   lock(direct_mutex);
[   74.768245]
[   74.768245]  *** DEADLOCK ***
[   74.768245]
[   74.768750]  May be due to missing lock nesting notation
[   74.768750]
[   74.769332] 1 lock held by event-sample-fn/1084:
[   74.769731]  #0: ffffffff86c9d6b0 (direct_mutex){+.+.}-{3:3}, at: \
    modify_ftrace_direct+0x34/0x1f0
[   74.770496]
[   74.770496] stack backtrace:
[   74.770884] CPU: 4 PID: 1084 Comm: event-sample-fn Not tainted ...
[   74.771498] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), ...
[   74.772474] Call Trace:
[   74.772696]  <TASK>
[   74.772896]  dump_stack_lvl+0x44/0x5b
[   74.773223]  __lock_acquire.cold.74+0xac/0x2b7
[   74.773616]  lock_acquire+0xd2/0x310
[   74.773936]  ? register_ftrace_function+0x1f/0x180
[   74.774357]  ? lock_is_held_type+0xd8/0x130
[   74.774744]  ? my_tramp2+0x11/0x11 [ftrace_direct_modify]
[   74.775213]  __mutex_lock+0x99/0x1010
[   74.775536]  ? register_ftrace_function+0x1f/0x180
[   74.775954]  ? slab_free_freelist_hook.isra.43+0x115/0x160
[   74.776424]  ? ftrace_set_hash+0x195/0x220
[   74.776779]  ? register_ftrace_function+0x1f/0x180
[   74.777194]  ? kfree+0x3e1/0x440
[   74.777482]  ? my_tramp2+0x11/0x11 [ftrace_direct_modify]
[   74.777941]  ? __schedule+0xb40/0xb40
[   74.778258]  ? register_ftrace_function+0x1f/0x180
[   74.778672]  ? my_tramp1+0xf/0xf [ftrace_direct_modify]
[   74.779128]  register_ftrace_function+0x1f/0x180
[   74.779527]  ? ftrace_set_filter_ip+0x33/0x70
[   74.779910]  ? __schedule+0xb40/0xb40
[   74.780231]  ? my_tramp1+0xf/0xf [ftrace_direct_modify]
[   74.780678]  ? my_tramp2+0x11/0x11 [ftrace_direct_modify]
[   74.781147]  ftrace_modify_direct_caller+0x5b/0x90
[   74.781563]  ? 0xffffffffa0201000
[   74.781859]  ? my_tramp1+0xf/0xf [ftrace_direct_modify]
[   74.782309]  modify_ftrace_direct+0x1b2/0x1f0
[   74.782690]  ? __schedule+0xb40/0xb40
[   74.783014]  ? simple_thread+0x2a/0xb0 [ftrace_direct_modify]
[   74.783508]  ? __schedule+0xb40/0xb40
[   74.783832]  ? my_tramp2+0x11/0x11 [ftrace_direct_modify]
[   74.784294]  simple_thread+0x76/0xb0 [ftrace_direct_modify]
[   74.784766]  kthread+0xf5/0x120
[   74.785052]  ? kthread_complete_and_exit+0x20/0x20
[   74.785464]  ret_from_fork+0x22/0x30
[   74.785781]  </TASK>

Fix this by using register_ftrace_function_nolock in
ftrace_modify_direct_caller.

Link: https://lkml.kernel.org/r/20220927004146.1215303-1-song@kernel.org
Fixes: 53cd885bc5c3 ("ftrace: Allow IPMODIFY and DIRECT ops on the same function")
Reported-and-tested-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
3 years agoftrace: Properly unset FTRACE_HASH_FL_MOD
Zheng Yejian [Mon, 26 Sep 2022 15:20:08 +0000 (15:20 +0000)]
ftrace: Properly unset FTRACE_HASH_FL_MOD

When executing following commands like what document said, but the log
"#### all functions enabled ####" was not shown as expect:
  1. Set a 'mod' filter:
    $ echo 'write*:mod:ext3' > /sys/kernel/tracing/set_ftrace_filter
  2. Invert above filter:
    $ echo '!write*:mod:ext3' >> /sys/kernel/tracing/set_ftrace_filter
  3. Read the file:
    $ cat /sys/kernel/tracing/set_ftrace_filter

By some debugging, I found that flag FTRACE_HASH_FL_MOD was not unset
after inversion like above step 2 and then result of ftrace_hash_empty()
is incorrect.

Link: https://lkml.kernel.org/r/20220926152008.2239274-1-zhengyejian1@huawei.com
Cc: <mingo@redhat.com>
Cc: stable@vger.kernel.org
Fixes: 8c08f0d5c6fb ("ftrace: Have cached module filters be an active filter")
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
3 years agotracing/eprobe: Fix alloc event dir failed when event name no set
Tao Chen [Sat, 24 Sep 2022 14:13:34 +0000 (22:13 +0800)]
tracing/eprobe: Fix alloc event dir failed when event name no set

The event dir will alloc failed when event name no set, using the
command:
"echo "e:esys/ syscalls/sys_enter_openat file=\$filename:string"
>> dynamic_events"
It seems that dir name="syscalls/sys_enter_openat" is not allowed
in debugfs. So just use the "sys_enter_openat" as the event name.

Link: https://lkml.kernel.org/r/1664028814-45923-1-git-send-email-chentao.kernel@linux.alibaba.com
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Cc: Linyu Yuan <quic_linyyuan@quicinc.com>
Cc: Tao Chen <chentao.kernel@linux.alibaba.com
Cc: stable@vger.kernel.org
Fixes: 95c104c378dc ("tracing: Auto generate event name when creating a group of events")
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Tao Chen <chentao.kernel@linux.alibaba.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
3 years agox86: kprobes: Remove unused macro stack_addr
Chen Zhongjin [Sat, 24 Sep 2022 07:26:29 +0000 (15:26 +0800)]
x86: kprobes: Remove unused macro stack_addr

An unused macro reported by [-Wunused-macros].

This macro is used to access the sp in pt_regs because at that time
x86_32 can only get sp by kernel_stack_pointer(regs).

'3c88c692c287 ("x86/stackframe/32: Provide consistent pt_regs")'
This commit have unified the pt_regs and from them we can get sp from
pt_regs with regs->sp easily. Nowhere is using this macro anymore.

Refrencing pt_regs directly is more clear. Remove this macro for
code cleaning.

Link: https://lkml.kernel.org/r/20220924072629.104759-1-chenzhongjin@huawei.com
Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
3 years agoftrace: Remove obsoleted code from ftrace and task_struct
Gaosheng Cui [Fri, 23 Sep 2022 09:00:12 +0000 (17:00 +0800)]
ftrace: Remove obsoleted code from ftrace and task_struct

The trace of "struct task_struct" was no longer used since
commit 345ddcc882d8 ("ftrace: Have set_ftrace_pid use the
bitmap like events do"), and the functions about flags for
current->trace is useless, so remove them.

Link: https://lkml.kernel.org/r/20220923090012.505990-1-cuigaosheng1@huawei.com
Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
3 years agotracing: Disable interrupt or preemption before acquiring arch_spinlock_t
Waiman Long [Thu, 22 Sep 2022 14:56:22 +0000 (10:56 -0400)]
tracing: Disable interrupt or preemption before acquiring arch_spinlock_t

It was found that some tracing functions in kernel/trace/trace.c acquire
an arch_spinlock_t with preemption and irqs enabled. An example is the
tracing_saved_cmdlines_size_read() function which intermittently causes
a "BUG: using smp_processor_id() in preemptible" warning when the LTP
read_all_proc test is run.

That can be problematic in case preemption happens after acquiring the
lock. Add the necessary preemption or interrupt disabling code in the
appropriate places before acquiring an arch_spinlock_t.

The convention here is to disable preemption for trace_cmdline_lock and
interupt for max_lock.

Link: https://lkml.kernel.org/r/20220922145622.1744826-1-longman@redhat.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: stable@vger.kernel.org
Fixes: a35873a0993b ("tracing: Add conditional snapshot")
Fixes: 939c7a4f04fc ("tracing: Introduce saved_cmdlines_size file")
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
3 years agorv/monitor: Add __init/__exit annotations to module init/exit funcs
Xiu Jianfeng [Thu, 22 Sep 2022 10:32:08 +0000 (18:32 +0800)]
rv/monitor: Add __init/__exit annotations to module init/exit funcs

Add missing __init/__exit annotations to module init/exit funcs.

Link: https://lkml.kernel.org/r/20220922103208.162869-1-xiujianfeng@huawei.com
Fixes: 24bce201d798 ("tools/rv: Add dot2k")
Fixes: 8812d21219b9 ("rv/monitor: Add the wip monitor skeleton created by dot2k")
Fixes: ccc319dcb450 ("rv/monitor: Add the wwnr monitor")
Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Acked-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
3 years agotracing/osnoise: Fix possible recursive locking in stop_per_cpu_kthreads
Nico Pache [Mon, 19 Sep 2022 14:49:32 +0000 (08:49 -0600)]
tracing/osnoise: Fix possible recursive locking in stop_per_cpu_kthreads

There is a recursive lock on the cpu_hotplug_lock.

In kernel/trace/trace_osnoise.c:<start/stop>_per_cpu_kthreads:
    - start_per_cpu_kthreads calls cpus_read_lock() and if
start_kthreads returns a error it will call stop_per_cpu_kthreads.
    - stop_per_cpu_kthreads then calls cpus_read_lock() again causing
      deadlock.

Fix this by calling cpus_read_unlock() before calling
stop_per_cpu_kthreads. This behavior can also be seen in commit
f46b16520a08 ("trace/hwlat: Implement the per-cpu mode").

This error was noticed during the LTP ftrace-stress-test:

WARNING: possible recursive locking detected
--------------------------------------------
sh/275006 is trying to acquire lock:
ffffffffb02f5400 (cpu_hotplug_lock){++++}-{0:0}, at: stop_per_cpu_kthreads

but task is already holding lock:
ffffffffb02f5400 (cpu_hotplug_lock){++++}-{0:0}, at: start_per_cpu_kthreads

other info that might help us debug this:
 Possible unsafe locking scenario:

      CPU0
      ----
 lock(cpu_hotplug_lock);
 lock(cpu_hotplug_lock);

 *** DEADLOCK ***

May be due to missing lock nesting notation

3 locks held by sh/275006:
 #0: ffff8881023f0470 (sb_writers#24){.+.+}-{0:0}, at: ksys_write
 #1: ffffffffb084f430 (trace_types_lock){+.+.}-{3:3}, at: rb_simple_write
 #2: ffffffffb02f5400 (cpu_hotplug_lock){++++}-{0:0}, at: start_per_cpu_kthreads

Link: https://lkml.kernel.org/r/20220919144932.3064014-1-npache@redhat.com
Fixes: c8895e271f79 ("trace/osnoise: Support hotplug operations")
Signed-off-by: Nico Pache <npache@redhat.com>
Acked-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
3 years agotracing: kprobe: Make gen test module work in arm and riscv
Yipeng Zou [Mon, 19 Sep 2022 12:56:29 +0000 (20:56 +0800)]
tracing: kprobe: Make gen test module work in arm and riscv

For now, this selftest module can only work in x86 because of the
kprobe cmd was fixed use of x86 registers.
This patch adapted to register names under arm and riscv, So that
this module can be worked on those platform.

Link: https://lkml.kernel.org/r/20220919125629.238242-3-zouyipeng@huawei.com
Cc: <linux-riscv@lists.infradead.org>
Cc: <mingo@redhat.com>
Cc: <paul.walmsley@sifive.com>
Cc: <palmer@dabbelt.com>
Cc: <aou@eecs.berkeley.edu>
Cc: <zanussi@kernel.org>
Cc: <liaochang1@huawei.com>
Cc: <chris.zjh@huawei.com>
Fixes: 64836248dda2 ("tracing: Add kprobe event command generation test module")
Signed-off-by: Yipeng Zou <zouyipeng@huawei.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
3 years agotracing: kprobe: Fix kprobe event gen test module on exit
Yipeng Zou [Mon, 19 Sep 2022 12:56:28 +0000 (20:56 +0800)]
tracing: kprobe: Fix kprobe event gen test module on exit

Correct gen_kretprobe_test clr event para on module exit.
This will make it can't to delete.

Link: https://lkml.kernel.org/r/20220919125629.238242-2-zouyipeng@huawei.com
Cc: <linux-riscv@lists.infradead.org>
Cc: <mingo@redhat.com>
Cc: <paul.walmsley@sifive.com>
Cc: <palmer@dabbelt.com>
Cc: <aou@eecs.berkeley.edu>
Cc: <zanussi@kernel.org>
Cc: <liaochang1@huawei.com>
Cc: <chris.zjh@huawei.com>
Fixes: 64836248dda2 ("tracing: Add kprobe event command generation test module")
Signed-off-by: Yipeng Zou <zouyipeng@huawei.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
3 years agox86/kprobes: Remove unused arch_kprobe_override_function() declaration
Gaosheng Cui [Wed, 14 Sep 2022 11:04:37 +0000 (19:04 +0800)]
x86/kprobes: Remove unused arch_kprobe_override_function() declaration

All uses of arch_kprobe_override_function() have been removed by
commit 540adea3809f ("error-injection: Separate error-injection
from kprobe"), so remove the declaration, too.

Link: https://lkml.kernel.org/r/20220914110437.1436353-3-cuigaosheng1@huawei.com
Cc: <mingo@redhat.com>
Cc: <tglx@linutronix.de>
Cc: <bp@alien8.de>
Cc: <dave.hansen@linux.intel.com>
Cc: <x86@kernel.org>
Cc: <hpa@zytor.com>
Cc: <mhiramat@kernel.org>
Cc: <peterz@infradead.org>
Cc: <ast@kernel.org>
Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
3 years agox86/ftrace: Remove unused modifying_ftrace_code declaration
Gaosheng Cui [Wed, 14 Sep 2022 11:04:36 +0000 (19:04 +0800)]
x86/ftrace: Remove unused modifying_ftrace_code declaration

All uses of modifying_ftrace_code have been removed by
commit 768ae4406a5c ("x86/ftrace: Use text_poke()"),
so remove the declaration, too.

Link: https://lkml.kernel.org/r/20220914110437.1436353-2-cuigaosheng1@huawei.com
Cc: <mingo@redhat.com>
Cc: <tglx@linutronix.de>
Cc: <bp@alien8.de>
Cc: <dave.hansen@linux.intel.com>
Cc: <x86@kernel.org>
Cc: <hpa@zytor.com>
Cc: <mhiramat@kernel.org>
Cc: <peterz@infradead.org>
Cc: <ast@kernel.org>
Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
3 years agotracepoint: Optimize the critical region of mutex_lock in tracepoint_module_coming()
Zhen Lei [Wed, 14 Sep 2022 06:14:16 +0000 (14:14 +0800)]
tracepoint: Optimize the critical region of mutex_lock in tracepoint_module_coming()

The memory allocation of 'tp_mod' does not require mutex_lock()
protection, move it out.

Link: https://lkml.kernel.org/r/20220914061416.1630-1-thunder.leizhen@huawei.com
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
3 years agotracing/filter: Call filter predicate functions directly via a switch statement
Steven Rostedt (Google) [Tue, 6 Sep 2022 22:53:17 +0000 (18:53 -0400)]
tracing/filter: Call filter predicate functions directly via a switch statement

Due to retpolines, indirect calls are much more expensive than direct
calls. The filters have a select set of functions it uses for the
predicates. Instead of using function pointers to call them, create a
filter_pred_fn_call() function that uses a switch statement to call the
predicate functions directly. This gives almost a 10% speedup to the
filter logic.

Using the histogram benchmark:

Before:

 # event histogram
 #
 # trigger info: hist:keys=delta:vals=hitcount:sort=delta:size=2048 if delta > 0 [active]
 #

{ delta:        113 } hitcount:        272
{ delta:        114 } hitcount:        840
{ delta:        118 } hitcount:        344
{ delta:        119 } hitcount:      25428
{ delta:        120 } hitcount:     350590
{ delta:        121 } hitcount:    1892484
{ delta:        122 } hitcount:    6205004
{ delta:        123 } hitcount:   11583521
{ delta:        124 } hitcount:   37590979
{ delta:        125 } hitcount:  108308504
{ delta:        126 } hitcount:  131672461
{ delta:        127 } hitcount:   88700598
{ delta:        128 } hitcount:   65939870
{ delta:        129 } hitcount:   45055004
{ delta:        130 } hitcount:   33174464
{ delta:        131 } hitcount:   31813493
{ delta:        132 } hitcount:   29011676
{ delta:        133 } hitcount:   22798782
{ delta:        134 } hitcount:   22072486
{ delta:        135 } hitcount:   17034113
{ delta:        136 } hitcount:    8982490
{ delta:        137 } hitcount:    2865908
{ delta:        138 } hitcount:     980382
{ delta:        139 } hitcount:    1651944
{ delta:        140 } hitcount:    4112073
{ delta:        141 } hitcount:    3963269
{ delta:        142 } hitcount:    1712508
{ delta:        143 } hitcount:     575941

After:

 # event histogram
 #
 # trigger info: hist:keys=delta:vals=hitcount:sort=delta:size=2048 if delta > 0 [active]
 #

{ delta:        103 } hitcount:         60
{ delta:        104 } hitcount:      16966
{ delta:        105 } hitcount:     396625
{ delta:        106 } hitcount:    3223400
{ delta:        107 } hitcount:   12053754
{ delta:        108 } hitcount:   20241711
{ delta:        109 } hitcount:   14850200
{ delta:        110 } hitcount:    4946599
{ delta:        111 } hitcount:    3479315
{ delta:        112 } hitcount:   18698299
{ delta:        113 } hitcount:   62388733
{ delta:        114 } hitcount:   95803834
{ delta:        115 } hitcount:   58278130
{ delta:        116 } hitcount:   15364800
{ delta:        117 } hitcount:    5586866
{ delta:        118 } hitcount:    2346880
{ delta:        119 } hitcount:    1131091
{ delta:        120 } hitcount:     620896
{ delta:        121 } hitcount:     236652
{ delta:        122 } hitcount:     105957
{ delta:        123 } hitcount:     119107
{ delta:        124 } hitcount:      54494
{ delta:        125 } hitcount:      63856
{ delta:        126 } hitcount:      64454
{ delta:        127 } hitcount:      34818
{ delta:        128 } hitcount:      41446
{ delta:        129 } hitcount:      51242
{ delta:        130 } hitcount:      28361
{ delta:        131 } hitcount:      23926

The peak before was 126ns per event, after the peak is 114ns, and the
fastest time went from 113ns to 103ns.

Link: https://lkml.kernel.org/r/20220906225529.781407172@goodmis.org
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
3 years agotracing: Move struct filter_pred into trace_events_filter.c
Steven Rostedt (Google) [Tue, 6 Sep 2022 22:53:16 +0000 (18:53 -0400)]
tracing: Move struct filter_pred into trace_events_filter.c

The structure filter_pred and the typedef of the function used are only
referenced by trace_events_filter.c. There's no reason to have it in an
external header file. Move them into the only file they are used in.

Link: https://lkml.kernel.org/r/20220906225529.598047132@goodmis.org
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
3 years agotracing/hist: Call hist functions directly via a switch statement
Steven Rostedt (Google) [Tue, 6 Sep 2022 22:53:15 +0000 (18:53 -0400)]
tracing/hist: Call hist functions directly via a switch statement

Due to retpolines, indirect calls are much more expensive than direct
calls. The histograms have a select set of functions it uses for the
histograms, instead of using function pointers to call them, create a
hist_fn_call() function that uses a switch statement to call the histogram
functions directly. This gives a 13% speedup to the histogram logic.

Using the histogram benchmark:

Before:

 # event histogram
 #
 # trigger info: hist:keys=delta:vals=hitcount:sort=delta:size=2048 if delta > 0 [active]
 #

{ delta:        129 } hitcount:       2213
{ delta:        130 } hitcount:     285965
{ delta:        131 } hitcount:    1146545
{ delta:        132 } hitcount:    5185432
{ delta:        133 } hitcount:   19896215
{ delta:        134 } hitcount:   53118616
{ delta:        135 } hitcount:   83816709
{ delta:        136 } hitcount:   68329562
{ delta:        137 } hitcount:   41859349
{ delta:        138 } hitcount:   46257797
{ delta:        139 } hitcount:   54400831
{ delta:        140 } hitcount:   72875007
{ delta:        141 } hitcount:   76193272
{ delta:        142 } hitcount:   49504263
{ delta:        143 } hitcount:   38821072
{ delta:        144 } hitcount:   47702679
{ delta:        145 } hitcount:   41357297
{ delta:        146 } hitcount:   22058238
{ delta:        147 } hitcount:    9720002
{ delta:        148 } hitcount:    3193542
{ delta:        149 } hitcount:     927030
{ delta:        150 } hitcount:     850772
{ delta:        151 } hitcount:    1477380
{ delta:        152 } hitcount:    2687977
{ delta:        153 } hitcount:    2865985
{ delta:        154 } hitcount:    1977492
{ delta:        155 } hitcount:    2475607
{ delta:        156 } hitcount:    3403612

After:

 # event histogram
 #
 # trigger info: hist:keys=delta:vals=hitcount:sort=delta:size=2048 if delta > 0 [active]
 #

{ delta:        113 } hitcount:        272
{ delta:        114 } hitcount:        840
{ delta:        118 } hitcount:        344
{ delta:        119 } hitcount:      25428
{ delta:        120 } hitcount:     350590
{ delta:        121 } hitcount:    1892484
{ delta:        122 } hitcount:    6205004
{ delta:        123 } hitcount:   11583521
{ delta:        124 } hitcount:   37590979
{ delta:        125 } hitcount:  108308504
{ delta:        126 } hitcount:  131672461
{ delta:        127 } hitcount:   88700598
{ delta:        128 } hitcount:   65939870
{ delta:        129 } hitcount:   45055004
{ delta:        130 } hitcount:   33174464
{ delta:        131 } hitcount:   31813493
{ delta:        132 } hitcount:   29011676
{ delta:        133 } hitcount:   22798782
{ delta:        134 } hitcount:   22072486
{ delta:        135 } hitcount:   17034113
{ delta:        136 } hitcount:    8982490
{ delta:        137 } hitcount:    2865908
{ delta:        138 } hitcount:     980382
{ delta:        139 } hitcount:    1651944
{ delta:        140 } hitcount:    4112073
{ delta:        141 } hitcount:    3963269
{ delta:        142 } hitcount:    1712508
{ delta:        143 } hitcount:     575941
{ delta:        144 } hitcount:     351427
{ delta:        145 } hitcount:     218077
{ delta:        146 } hitcount:     167297
{ delta:        147 } hitcount:     146198
{ delta:        148 } hitcount:     116122
{ delta:        149 } hitcount:      58993
{ delta:        150 } hitcount:      40228

The delta above is in nanoseconds. It brings the fastest time down from
129ns to 113ns, and the peak from 141ns to 126ns.

Link: https://lkml.kernel.org/r/20220906225529.411545333@goodmis.org
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
3 years agotracing: Add numeric delta time to the trace event benchmark
Steven Rostedt (Google) [Tue, 6 Sep 2022 22:53:14 +0000 (18:53 -0400)]
tracing: Add numeric delta time to the trace event benchmark

In order to testing filtering and histograms via the trace event
benchmark, record the delta time of the last event as a numeric value
(currently, it just saves it within the string) so that filters and
histograms can use it.

Link: https://lkml.kernel.org/r/20220906225529.213677569@goodmis.org
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
3 years agorv/dot2K: add 'static' qualifier for local variable
Zeng Heng [Wed, 24 Aug 2022 03:43:57 +0000 (11:43 +0800)]
rv/dot2K: add 'static' qualifier for local variable

Following Daniel's suggestion, fix similar warning
in template files, which would prevent new monitors
from such warning.

Link: https://lkml.kernel.org/r/20220824034357.2014202-3-zengheng4@huawei.com
Cc: <mingo@redhat.com>
Fixes: 24bce201d798 ("tools/rv: Add dot2k")
Suggested-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Zeng Heng <zengheng4@huawei.com>
Acked-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
3 years agorv/monitors: add 'static' qualifier for local symbols
Zeng Heng [Wed, 24 Aug 2022 03:43:56 +0000 (11:43 +0800)]
rv/monitors: add 'static' qualifier for local symbols

The sparse tool complains as follows:

kernel/trace/rv/monitors/wwnr/wwnr.c:18:19:
warning: symbol 'rv_wwnr' was not declared. Should it be static?

The `rv_wwnr` symbol is not dereferenced by other extern files,
so add static qualifier for it.

So does wip module.

Link: https://lkml.kernel.org/r/20220824034357.2014202-2-zengheng4@huawei.com
Cc: <mingo@redhat.com>
Fixes: ccc319dcb450 ("rv/monitor: Add the wwnr monitor")
Fixes: 8812d21219b9 ("rv/monitor: Add the wip monitor skeleton created by dot2k")
Signed-off-by: Zeng Heng <zengheng4@huawei.com>
Acked-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
3 years agoselftests/ftrace: Add eprobe syntax error testcase
Masami Hiramatsu (Google) [Mon, 1 Aug 2022 02:32:34 +0000 (11:32 +0900)]
selftests/ftrace: Add eprobe syntax error testcase

Add a syntax error test case for eprobe as same as kprobes.

Link: https://lkml.kernel.org/r/165932115471.2850673.8014722990775242727.stgit@devnote2
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
3 years agotracing/eprobe: Add eprobe filter support
Masami Hiramatsu (Google) [Mon, 1 Aug 2022 02:32:25 +0000 (11:32 +0900)]
tracing/eprobe: Add eprobe filter support

Add the filter option to the event probe. This is useful if user wants
to derive a new event based on the condition of the original event.

E.g.
 echo 'e:egroup/stat_runtime_4core sched/sched_stat_runtime \
        runtime=$runtime:u32 if cpu < 4' >> ../dynamic_events

Then it can filter the events only on first 4 cores.
Note that the fields used for 'if' must be the fields in the original
events, not eprobe events.

Link: https://lkml.kernel.org/r/165932114513.2850673.2592206685744598080.stgit@devnote2
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
3 years agoLinux 6.0-rc7 v6.0-rc7
Linus Torvalds [Sun, 25 Sep 2022 21:01:02 +0000 (14:01 -0700)]
Linux 6.0-rc7

3 years agoMerge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sun, 25 Sep 2022 16:03:31 +0000 (09:03 -0700)]
Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "Regression and bug fixes:

   - Performance regression fix from 5.18 on a Rasberry Pi

   - Fix extent parsing bug which triggers a BUG_ON when a (corrupted)
     extent tree has has a non-root node when zero entries.

   - Fix a livelock where in the right (wrong) circumstances a large
     number of nfsd threads can try to write to a nearly full file
     system, and retry for hours(!)"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: limit the number of retries after discarding preallocations blocks
  ext4: fix bug in extents parsing when eh_entries == 0 and eh_depth > 0
  ext4: use buckets for cr 1 block scan instead of rbtree
  ext4: use locality group preallocation for small closed files
  ext4: make directory inode spreading reflect flexbg size
  ext4: avoid unnecessary spreading of allocations among groups
  ext4: make mballoc try target group first even with mb_optimize_scan

3 years agoMerge tag 'dax-and-nvdimm-fixes-v6.0-final' of git://git.kernel.org/pub/scm/linux...
Linus Torvalds [Sun, 25 Sep 2022 15:53:52 +0000 (08:53 -0700)]
Merge tag 'dax-and-nvdimm-fixes-v6.0-final' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull NVDIMM and DAX fixes from Dan Williams:
 "A recently discovered one-line fix for devdax that further addresses a
  v5.5 regression, and (a bit embarrassing) a small batch of fixes that
  have been sitting in my fixes tree for weeks.

  The older fixes have soaked in linux-next during that time and address
  an fsdax infinite loop and some other minor fixups.

   - Fix a infinite loop bug in fsdax

   - Fix memory-type detection for devdax (EINJ regression)

   - Small cleanups"

* tag 'dax-and-nvdimm-fixes-v6.0-final' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  devdax: Fix soft-reservation memory description
  fsdax: Fix infinite loop in dax_iomap_rw()
  nvdimm/namespace: drop nested variable in create_namespace_pmem()
  ndtest: Cleanup all of blk namespace specific code
  pmem: fix a name collision

3 years agoMerge tag 'i2c-for-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa...
Linus Torvalds [Sun, 25 Sep 2022 15:44:46 +0000 (08:44 -0700)]
Merge tag 'i2c-for-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux

Pull i2c fixes from Wolfram Sang:
 "I2C driver bugfixes for mlxbf and imx, a few documentation fixes after
  the rework this cycle, and one hardening for the i2c-mux core"

* tag 'i2c-for-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
  i2c: mux: harden i2c_mux_alloc() against integer overflows
  i2c: mlxbf: Fix frequency calculation
  i2c: mlxbf: prevent stack overflow in mlxbf_i2c_smbus_start_transaction()
  i2c: mlxbf: incorrect base address passed during io write
  Documentation: i2c: fix references to other documents
  MAINTAINERS: remove Nehal Shah from AMD MP2 I2C DRIVER
  i2c: imx: If pm_runtime_get_sync() returned 1 device access is possible

3 years agoMerge branch 'for-6.0/dax' into libnvdimm-fixes
Dan Williams [Sun, 25 Sep 2022 01:14:12 +0000 (18:14 -0700)]
Merge branch 'for-6.0/dax' into libnvdimm-fixes

Pick up another "Soft Reservation" fix for v6.0-final on top of some
straggling nvdimm fixes that missed v5.19.

3 years agodevdax: Fix soft-reservation memory description
Dan Williams [Fri, 23 Sep 2022 22:05:56 +0000 (15:05 -0700)]
devdax: Fix soft-reservation memory description

The "hmem" platform-devices that are created to represent the
platform-advertised "Soft Reserved" memory ranges end up inserting a
resource that causes the iomem_resource tree to look like this:

340000000-43fffffff : hmem.0
  340000000-43fffffff : Soft Reserved
    340000000-43fffffff : dax0.0

This is because insert_resource() reparents ranges when they completely
intersect an existing range.

This matters because code that uses region_intersects() to scan for a
given IORES_DESC will only check that top-level 'hmem.0' resource and
not the 'Soft Reserved' descendant.

So, to support EINJ (via einj_error_inject()) to inject errors into
memory hosted by a dax-device, be sure to describe the memory as
IORES_DESC_SOFT_RESERVED. This is a follow-on to:

commit b13a3e5fd40b ("ACPI: APEI: Fix _EINJ vs EFI_MEMORY_SP")

...that fixed EINJ support for "Soft Reserved" ranges in the first
instance.

Fixes: 262b45ae3ab4 ("x86/efi: EFI soft reservation to E820 enumeration")
Reported-by: Ricardo Sandoval Torres <ricardo.sandoval.torres@intel.com>
Tested-by: Ricardo Sandoval Torres <ricardo.sandoval.torres@intel.com>
Cc: <stable@vger.kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Omar Avelar <omar.avelar@intel.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Mark Gross <markgross@kernel.org>
Link: https://lore.kernel.org/r/166397075670.389916.7435722208896316387.stgit@dwillia2-xfh.jf.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
3 years agoMerge tag 'kbuild-fixes-v6.0-3' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sun, 25 Sep 2022 00:41:17 +0000 (17:41 -0700)]
Merge tag 'kbuild-fixes-v6.0-3' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild fixes from Masahiro Yamada:

 - Fix build error for the combination of SYSTEM_TRUSTED_KEYRING=y and
   X509_CERTIFICATE_PARSER=m

 - Fix DEBUG_INFO_SPLIT to generate debug info for GCC 11+ and Clang 12+

 - Revive debug info for assembly files

 - Remove unused code

* tag 'kbuild-fixes-v6.0-3' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
  Makefile.debug: re-enable debug info for .S files
  Makefile.debug: set -g unconditional on CONFIG_DEBUG_INFO_SPLIT
  certs: make system keyring depend on built-in x509 parser
  Kconfig: remove unused function 'menu_get_root_menu'
  scripts/clang-tools: remove unused module

3 years agoMerge tag 's390-6.0-5' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Linus Torvalds [Sun, 25 Sep 2022 00:35:42 +0000 (17:35 -0700)]
Merge tag 's390-6.0-5' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux

Pull s390 fix from Vasily Gorbik:

 - Fix potential hangs in VFIO AP driver

* tag 's390-6.0-5' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
  s390/vfio-ap: bypass unnecessary processing of AP resources

3 years agoMerge tag 'pm-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Linus Torvalds [Sat, 24 Sep 2022 15:53:57 +0000 (08:53 -0700)]
Merge tag 'pm-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
 "These fix an uninitialized variable usage in the operating performance
  points code and add missing DT bindings for it.

  Specifics:

   - Fix uninitialized variable usage in dev_pm_opp_config_clks_simple()
     (Christophe JAILLET)

   - Add missing OPP DT properties (Rob Herring)"

* tag 'pm-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  dt-bindings: opp: Add missing (unevaluated|additional)Properties on child nodes
  OPP: Fix an un-initialized variable usage

3 years agoMerge tag 'char-misc-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh...
Linus Torvalds [Sat, 24 Sep 2022 15:46:07 +0000 (08:46 -0700)]
Merge tag 'char-misc-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc

Pull char/misc driver fixes from Greg KH:
 "Here are three tiny driver fixes for 6.0-rc7.  They include:

   - phy driver reset bugfix

   - fpga memleak bugfix

   - counter irq config bugfix

  The first two have been in linux-next for a while, the last one has
  only been added to my tree in the past few days, but was in linux-next
  under a different commit id. I couldn't pull directly from the counter
  tree due to some gpg key propagation issue, so I took the commit
  directly from email instead"

* tag 'char-misc-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
  counter: 104-quad-8: Fix skipped IRQ lines during events configuration
  fpga: m10bmc-sec: Fix possible memory leak of flash_buf
  phy: marvell: phy-mvebu-a3700-comphy: Remove broken reset support

3 years agoMerge tag 'tty-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Linus Torvalds [Sat, 24 Sep 2022 15:42:55 +0000 (08:42 -0700)]
Merge tag 'tty-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty

Pull tty/serial driver fixes from Greg KH:
 "Here are some small, and late, serial driver fixes for 6.0-rc7 to
  resolve some reported problems.

  Included in here are:

   - tegra icount accounting fixes, including a framework function that
     other drivers will be converted over to using in 6.1-rc1.

   - fsl_lpuart reset bugfix

   - 8250 omap 485 bugfix

   - sifive serial clock bugfix

  The last three patches have not shown up in linux-next due to them
  being added to my tree only 2 days ago, but they are tiny and
  self-contained and the developers say they resolve issues that they
  have with 6.0-rc. The other three have been in linux-next for a while
  with no reported issues"

* tag 'tty-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
  serial: sifive: enable clocks for UART when probed
  serial: 8250: omap: Use serial8250_em485_supported
  serial: fsl_lpuart: Reset prior to registration
  serial: tegra-tcu: Use uart_xmit_advance(), fixes icount.tx accounting
  serial: tegra: Use uart_xmit_advance(), fixes icount.tx accounting
  serial: Create uart_xmit_advance()

3 years agoMerge tag 'cgroup-for-6.0-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sat, 24 Sep 2022 15:36:10 +0000 (08:36 -0700)]
Merge tag 'cgroup-for-6.0-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup fixes from Tejun Heo:

 - Add Waiman Long as a cpuset maintainer

 - cgroup_get_from_id() could be fed a kernfs ID which doesn't point to
   a cgroup directory but a knob file and then crash. Error out if the
   lookup kernfs_node isn't a directory.

* tag 'cgroup-for-6.0-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: cgroup_get_from_id() must check the looked-up kn is a directory
  cpuset: Add Waiman Long as a cpuset maintainer

3 years agoMerge tag 'wq-for-6.0-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sat, 24 Sep 2022 15:32:59 +0000 (08:32 -0700)]
Merge tag 'wq-for-6.0-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq

Pull workqueue fix from Tejun Heo:
 "Just one patch to improve flush lockdep coverage"

* tag 'wq-for-6.0-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: don't skip lockdep work dependency in cancel_work_sync()

3 years agoMerge tag 'io_uring-6.0-2022-09-23' of git://git.kernel.dk/linux
Linus Torvalds [Sat, 24 Sep 2022 15:27:08 +0000 (08:27 -0700)]
Merge tag 'io_uring-6.0-2022-09-23' of git://git.kernel.dk/linux

Pull io_uring fix from Jens Axboe:
 "Just a single fix for an issue with un-reaped IOPOLL requests on ring
  exit"

* tag 'io_uring-6.0-2022-09-23' of git://git.kernel.dk/linux:
  io_uring: ensure that cached task references are always put on exit

3 years agoMerge tag 'block-6.0-2022-09-22' of git://git.kernel.dk/linux
Linus Torvalds [Sat, 24 Sep 2022 15:22:53 +0000 (08:22 -0700)]
Merge tag 'block-6.0-2022-09-22' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:
 "Fix a regression that's been plaguing us by reverting the offending
  commit, as attempts to both reproduce the issue and fix it in a saner
  fashion have failed.

  Fix for a potential oops condition in the s390 dasd block driver"

* tag 'block-6.0-2022-09-22' of git://git.kernel.dk/linux:
  Revert "block: freeze the queue earlier in del_gendisk"
  s390/dasd: fix Oops in dasd_alias_get_start_dev due to missing pavgroup

3 years agoMakefile.debug: re-enable debug info for .S files
Nick Desaulniers [Mon, 19 Sep 2022 17:45:47 +0000 (10:45 -0700)]
Makefile.debug: re-enable debug info for .S files

Alexey reported that the fraction of unknown filename instances in
kallsyms grew from ~0.3% to ~10% recently; Bill and Greg tracked it down
to assembler defined symbols, which regressed as a result of:

commit b8a9092330da ("Kbuild: do not emit debug info for assembly with LLVM_IAS=1")

In that commit, I allude to restoring debug info for assembler defined
symbols in a follow up patch, but it seems I forgot to do so in

commit a66049e2cf0e ("Kbuild: make DWARF version a choice")

Link: https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=31bf18645d98b4d3d7357353be840e320649a67d
Fixes: b8a9092330da ("Kbuild: do not emit debug info for assembly with LLVM_IAS=1")
Reported-by: Alexey Alexandrov <aalexand@google.com>
Reported-by: Bill Wendling <morbo@google.com>
Reported-by: Greg Thelen <gthelen@google.com>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Suggested-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
3 years agoMakefile.debug: set -g unconditional on CONFIG_DEBUG_INFO_SPLIT
Nick Desaulniers [Mon, 19 Sep 2022 17:30:30 +0000 (10:30 -0700)]
Makefile.debug: set -g unconditional on CONFIG_DEBUG_INFO_SPLIT

Dmitrii, Fangrui, and Mashahiro note:

  Before GCC 11 and Clang 12 -gsplit-dwarf implicitly uses -g2.

Fix CONFIG_DEBUG_INFO_SPLIT for gcc-11+ & clang-12+ which now need -g
specified in order for -gsplit-dwarf to work at all.

-gsplit-dwarf has been mutually exclusive with -g since support for
CONFIG_DEBUG_INFO_SPLIT was introduced in
commit 866ced950bcd ("kbuild: Support split debug info v4")
I don't think it ever needed to be.

Link: https://lore.kernel.org/lkml/20220815013317.26121-1-dmitrii.bundin.a@gmail.com/
Link: https://lore.kernel.org/lkml/CAK7LNARPAmsJD5XKAw7m_X2g7Fi-CAAsWDQiP7+ANBjkg7R7ng@mail.gmail.com/
Link: https://reviews.llvm.org/D80391
Cc: Andi Kleen <ak@linux.intel.com>
Reported-by: Dmitrii Bundin <dmitrii.bundin.a@gmail.com>
Reported-by: Fangrui Song <maskray@google.com>
Reported-by: Masahiro Yamada <masahiroy@kernel.org>
Suggested-by: Dmitrii Bundin <dmitrii.bundin.a@gmail.com>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
3 years agoio_uring: ensure that cached task references are always put on exit
Jens Axboe [Fri, 23 Sep 2022 19:44:56 +0000 (13:44 -0600)]
io_uring: ensure that cached task references are always put on exit

io_uring caches task references to avoid doing atomics for each of them
per request. If a request is put from the same task that allocated it,
then we can maintain a per-ctx cache of them. This obviously relies
on io_uring always pruning caches in a reliable way, and there's
currently a case off io_uring fd release where we can miss that.

One example is a ring setup with IOPOLL, which relies on the task
polling for completions, which will free them. However, if such a task
submits a request and then exits or closes the ring without reaping
the completion, then ring release will reap and put. If release happens
from that very same task, the completed request task refs will get
put back into the cache pool. This is problematic, as we're now beyond
the point of pruning caches.

Manually drop these caches after doing an IOPOLL reap. This releases
references from the current task, which is enough. If another task
happens to be doing the release, then the caching will not be
triggered and there's no issue.

Cc: stable@vger.kernel.org
Fixes: e98e49b2bbf7 ("io_uring: extend task put optimisations")
Reported-by: Homin Rhee <hominlab@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoMerge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Linus Torvalds [Fri, 23 Sep 2022 22:28:51 +0000 (15:28 -0700)]
Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 fixes from Will Deacon:
 "These are all very simple and self-contained, although the CFI
  jump-table fix touches the generic linker script as that's where the
  problematic macro lives.

   - Fix false positive "sleeping while atomic" warning resulting from
     the kPTI rework taking a mutex too early.

   - Fix possible overflow in AMU frequency calculation

   - Fix incorrect shift in CMN PMU driver which causes problems with
     newer versions of the IP

   - Reduce alignment of the CFI jump table to avoid huge kernel images
     and link errors with !4KiB page size configurations"

* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  vmlinux.lds.h: CFI: Reduce alignment of jump-table to function alignment
  perf/arm-cmn: Add more bits to child node address offset field
  arm64: topology: fix possible overflow in amu_fie_setup()
  arm64: mm: don't acquire mutex when rewriting swapper

3 years agocerts: make system keyring depend on built-in x509 parser
Masahiro Yamada [Mon, 12 Sep 2022 06:52:10 +0000 (15:52 +0900)]
certs: make system keyring depend on built-in x509 parser

Commit e90886291c7c ("certs: make system keyring depend on x509 parser")
is not the right fix because x509_load_certificate_list() can be modular.

The combination of CONFIG_SYSTEM_TRUSTED_KEYRING=y and
CONFIG_X509_CERTIFICATE_PARSER=m still results in the following error:

    LD      .tmp_vmlinux.kallsyms1
  ld: certs/system_keyring.o: in function `load_system_certificate_list':
  system_keyring.c:(.init.text+0x8c): undefined reference to `x509_load_certificate_list'
  make: *** [Makefile:1169: vmlinux] Error 1

Fixes: e90886291c7c ("certs: make system keyring depend on x509 parser")
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Tested-by: Adam Borowski <kilobyte@angband.pl>
3 years agoKconfig: remove unused function 'menu_get_root_menu'
Zeng Heng [Mon, 12 Sep 2022 09:48:38 +0000 (17:48 +0800)]
Kconfig: remove unused function 'menu_get_root_menu'

There is nowhere calling `menu_get_root_menu` function,
so remove it.

Signed-off-by: Zeng Heng <zengheng4@huawei.com>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
3 years agoscripts/clang-tools: remove unused module
yangxingwu [Tue, 13 Sep 2022 04:07:53 +0000 (04:07 +0000)]
scripts/clang-tools: remove unused module

Remove unused imported 'os' module.

Signed-off-by: yangxingwu <xingwu.yang@gmail.com>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
3 years agocgroup: cgroup_get_from_id() must check the looked-up kn is a directory
Ming Lei [Fri, 23 Sep 2022 11:51:19 +0000 (19:51 +0800)]
cgroup: cgroup_get_from_id() must check the looked-up kn is a directory

cgroup has to be one kernfs dir, otherwise kernel panic is caused,
especially cgroup id is provide from userspace.

Reported-by: Marco Patalano <mpatalan@redhat.com>
Fixes: 6b658c4863c1 ("scsi: cgroup: Add cgroup_get_from_id()")
Cc: Muneendra <muneendra.kumar@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Acked-by: Mukesh Ojha <quic_mojha@quicinc.com>
Cc: stable@vger.kernel.org # v5.14+
Signed-off-by: Tejun Heo <tj@kernel.org>
3 years agoMerge tag 'driver-core-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Fri, 23 Sep 2022 16:12:18 +0000 (09:12 -0700)]
Merge tag 'driver-core-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core fixes from Greg KH:
 "Here are two tiny driver core fixes for 6.0-rc7 that resolve some
  oft-reported problems.

  The first is a revert of the "fw_devlink.strict=1" default option that
  we keep trying to enable, but we keep finding platforms that this just
  breaks everything on. So again, we need it reverted and hopefully it
  can be worked on in future releases.

  The second is a sysfs file-size bugfix that resolves an issue that
  many people are starting to hit as the fix it is fixing also was
  backported to stable kernels. The util-linux developers are starting
  to get bugreports about sysfs files that contain no data because of
  this problem, and this fix which has been in linux-next in the
  bitfield tree for a long time, resolves it. I'm submitting it here as
  it needs to be merged for 6.0-final, not for 6.1-rc1.

  Both of these have been in linux-next with no reported issues, only
  reports were that these fixed problems"

* tag 'driver-core-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  drivers/base: Fix unsigned comparison to -1 in CPUMAP_FILE_MAX_BYTES
  Revert "driver core: Set fw_devlink.strict=1 by default"

3 years agoMerge tag 'usb-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Linus Torvalds [Fri, 23 Sep 2022 16:07:08 +0000 (09:07 -0700)]
Merge tag 'usb-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb

Pull USB / Thunderbolt driver fixes and ids from Greg KH:
 "Here are a few small USB and Thunderbolt driver fixes and new device
  ids for 6.0-rc7.

  They contain:

   - new usb-serial driver ids

   - documentation build warning fix in USB hub code

   - flexcop-usb long-posted bugfix (the v4l maintainer for this is MIA
     so I have finally picked this up as it is a fix for a reported
     problem.)

   - dwc3 64bit DMA bugfix

   - new thunderbolt device ids

   - typec build error fix

  All of these have been in linux-next with no reported issues"

* tag 'usb-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
  usb: typec: anx7411: Fix build error without CONFIG_POWER_SUPPLY
  media: flexcop-usb: fix endpoint type check
  USB: serial: option: add Quectel RM520N
  USB: serial: option: add Quectel BG95 0x0203 composition
  thunderbolt: Add support for Intel Maple Ridge single port controller
  usb: dwc3: core: leave default DMA if the controller does not support 64-bit DMA
  USB: core: Fix RST error in hub.c

3 years agoMerge tag 'landlock-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mic...
Linus Torvalds [Fri, 23 Sep 2022 15:59:16 +0000 (08:59 -0700)]
Merge tag 'landlock-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mic/linux

Pull landlock fix from Mickaël Salaün:
 "Fix out-of-tree builds for Landlock tests"

* tag 'landlock-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mic/linux:
  selftests/landlock: Fix out-of-tree builds

3 years agoMerge tag 'riscv-for-linus-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Fri, 23 Sep 2022 15:51:05 +0000 (08:51 -0700)]
Merge tag 'riscv-for-linus-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux

Pull RISC-V fixes from Palmer Dabbelt:

 - A handful of build fixes for the T-Head errata, including some
   functional issues the compilers found

 - A fix for a nasty sigreturn bug

* tag 'riscv-for-linus-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
  RISC-V: Avoid coupling the T-Head CMOs and Zicbom
  riscv: fix a nasty sigreturn bug...
  riscv: make t-head erratas depend on MMU
  riscv: fix RISCV_ISA_SVPBMT kconfig dependency warning
  RISC-V: Clean up the Zicbom block size probing

3 years agoMerge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Linus Torvalds [Fri, 23 Sep 2022 15:42:30 +0000 (08:42 -0700)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
 "As everyone back came back from conferences, here are the pending
  patches for Linux 6.0.

  ARM:

   - Fix for kmemleak with pKVM

  s390:

   - Fixes for VFIO with zPCI

   - smatch fix

  x86:

   - Ensure XSAVE-capable hosts always allow FP and SSE state to be
     saved and restored via KVM_{GET,SET}_XSAVE

   - Fix broken max_mmu_rmap_size stat

   - Fix compile error with old glibc that doesn't have gettid()"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: x86: Inject #UD on emulated XSETBV if XSAVES isn't enabled
  KVM: x86: Always enable legacy FP/SSE in allowed user XFEATURES
  KVM: x86: Reinstate kvm_vcpu_arch.guest_supported_xcr0
  KVM: x86/mmu: add missing update to max_mmu_rmap_size
  selftests: kvm: Fix a compile error in selftests/kvm/rseq_test.c
  KVM: s390: pci: register pci hooks without interpretation
  KVM: s390: pci: fix GAIT physical vs virtual pointers usage
  KVM: s390: Pass initialized arg even if unused
  KVM: s390: pci: fix plain integer as NULL pointer warnings
  KVM: arm64: Use kmemleak_free_part_phys() to unregister hyp_mem_base

3 years agoMerge tag 'for-linus-6.0-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Fri, 23 Sep 2022 15:31:24 +0000 (08:31 -0700)]
Merge tag 'for-linus-6.0-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull xen fix from Juergen Gross:
 "A single fix for an issue in the xenbus driver (initialization of
  multi-page rings for Xen PV devices)"

* tag 'for-linus-6.0-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  xen/xenbus: fix xenbus_setup_ring()

3 years agoMerge tag 'drm-fixes-2022-09-23-1' of git://anongit.freedesktop.org/drm/drm
Linus Torvalds [Fri, 23 Sep 2022 15:18:55 +0000 (08:18 -0700)]
Merge tag 'drm-fixes-2022-09-23-1' of git://anongit.freedesktop.org/drm/drm

Pull drm fixes from Dave Airlie:
 "Regular fixes for the week, i915, mediatek, hisilicon, mgag200 and
  panel have some small fixes.

  amdgpu has more stack size fixes for clang build, and fixes for new
  IPs, but all with low regression chances since they are for stuff new
  in v6.0.

  i915:
   - avoid a general protection failure when using perf/OA
   - avoid kernel warnings on driver release

  amdgpu:
   - SDMA 6.x fix
   - GPUVM TF fix
   - DCN 3.2.x fixes
   - DCN 3.1.x fixes
   - SMU 13.x fixes
   - Clang stack size fixes for recently enabled DML code
   - Fix drm dirty callback change on non-atomic cases
   - USB4 display fix

  mediatek:
   - dsi: Add atomic {destroy,duplicate}_state, reset callbacks
   - dsi: Move mtk_dsi_stop() call back to mtk_dsi_poweroff()
   - Fix wrong dither settings

  hisilicon:
   - Depend on MMU

  mgag200:
   - Fix console on G200ER

  panel:
   - Fix innolux_g121i1_l01 bus format"

* tag 'drm-fixes-2022-09-23-1' of git://anongit.freedesktop.org/drm/drm: (30 commits)
  MAINTAINERS: switch graphics to airlied other addresses
  drm/mediatek: dsi: Move mtk_dsi_stop() call back to mtk_dsi_poweroff()
  drm/amd/display: Reduce number of arguments of dml314's CalculateFlipSchedule()
  drm/amd/display: Reduce number of arguments of dml314's CalculateWatermarksAndDRAMSpeedChangeSupport()
  drm/amdgpu: don't register a dirty callback for non-atomic
  drm/amd/pm: drop the pptable related workarounds for SMU 13.0.0
  drm/amd/pm: add support for 3794 pptable for SMU13.0.0
  drm/amd/display: correct num_dsc based on HW cap
  drm/amd/display: Disable OTG WA for the plane_state NULL case on DCN314
  drm/amd/display: Add shift and mask for ICH_RESET_AT_END_OF_LINE
  drm/amd/display: increase dcn315 pstate change latency
  drm/amd/display: Fix DP MST timeslot issue when fallback happened
  drm/amd/display: Display distortion after hotplug 5K tiled display
  drm/amd/display: Update dummy P-state search to use DCN32 DML
  drm/amd/display: skip audio setup when audio stream is enabled
  drm/amd/display: update gamut remap if plane has changed
  drm/amd/display: Assume an LTTPR is always present on fixed_vs links
  drm/amd/display: fix dcn315 memory channel count and width read
  drm/amd/display: Fix double cursor on non-video RGB MPO
  drm/amd/display: Only consider pixle rate div policy for DCN32+
  ...

3 years agoMerge tag 'kvm-s390-master-6.0-2' of https://git.kernel.org/pub/scm/linux/kernel...
Paolo Bonzini [Fri, 23 Sep 2022 14:06:08 +0000 (10:06 -0400)]
Merge tag 'kvm-s390-master-6.0-2' of https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD

More pci fixes
Fix for a code analyser warning

3 years agovmlinux.lds.h: CFI: Reduce alignment of jump-table to function alignment
Will Deacon [Thu, 22 Sep 2022 21:57:15 +0000 (22:57 +0100)]
vmlinux.lds.h: CFI: Reduce alignment of jump-table to function alignment

Due to undocumented, hysterical raisins on x86, the CFI jump-table
sections in .text are needlessly aligned to PMD_SIZE in the vmlinux
linker script. When compiling a CFI-enabled arm64 kernel with a 64KiB
page-size, a PMD maps 512MiB of virtual memory and so the .text section
increases to a whopping 940MiB and blows the final Image up to 960MiB.
Others report a link failure.

Since the CFI jump-table requires only instruction alignment, reduce the
alignment directives to function alignment for parity with other parts
of the .text section. This reduces the size of the .text section for the
aforementioned 64KiB page size arm64 kernel to 19MiB for a much more
reasonable total Image size of 39MiB.

Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: "Mohan Rao .vanimina" <mailtoc.mohanrao@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/all/CAL_GTzigiNOMYkOPX1KDnagPhJtFNqSK=1USNbS0wUL4PW6-Uw@mail.gmail.com/
Fixes: cf68fffb66d6 ("add support for Clang CFI")
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220922215715.13345-1-will@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
3 years agoMAINTAINERS: switch graphics to airlied other addresses
Dave Airlie [Fri, 23 Sep 2022 05:36:12 +0000 (15:36 +1000)]
MAINTAINERS: switch graphics to airlied other addresses

My linux.ie address is in a bad place.
also add dri-devel for agpgart.

Signed-off-by: Dave Airlie <airlied@redhat.com>
3 years agoMerge tag 'drm-misc-fixes-2022-09-22' of git://anongit.freedesktop.org/drm/drm-misc...
Dave Airlie [Fri, 23 Sep 2022 03:18:21 +0000 (13:18 +1000)]
Merge tag 'drm-misc-fixes-2022-09-22' of git://anongit.freedesktop.org/drm/drm-misc into drm-fixes

Short summary of fixes pull

 * drm/hisilicon: Depend on MMU
 * drm/mgag200: Fix console on G200ER
 * drm/panel: Fix innolux_g121i1_l01 bus format

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Thomas Zimmermann <tzimmermann@suse.de>
Link: https://patchwork.freedesktop.org/patch/msgid/YyxtXS588at6S4wg@linux-uq9g
3 years agoMerge tag 'mediatek-drm-fixes-6.0' of https://git.kernel.org/pub/scm/linux/kernel...
Dave Airlie [Fri, 23 Sep 2022 03:15:29 +0000 (13:15 +1000)]
Merge tag 'mediatek-drm-fixes-6.0' of https://git.kernel.org/pub/scm/linux/kernel/git/chunkuang.hu/linux into drm-fixes

Mediatek DRM Fixes for Linux 6.0

1. dsi: Add atomic {destroy,duplicate}_state, reset callbacks
2. drm/mediatek: Fix wrong dither settings
3. dsi: Move mtk_dsi_stop() call back to mtk_dsi_poweroff()

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Chun-Kuang Hu <chunkuang.hu@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20220921235624.23580-1-chunkuang.hu@kernel.org
3 years agoMerge tag 'amd-drm-fixes-6.0-2022-09-21' of https://gitlab.freedesktop.org/agd5f...
Dave Airlie [Fri, 23 Sep 2022 01:12:06 +0000 (11:12 +1000)]
Merge tag 'amd-drm-fixes-6.0-2022-09-21' of https://gitlab.freedesktop.org/agd5f/linux into drm-fixes

amd-drm-fixes-6.0-2022-09-21:

amdgpu:
- SDMA 6.x fix
- GPUVM TF fix
- DCN 3.2.x fixes
- DCN 3.1.x fixes
- SMU 13.x fixes
- Clang stack size fixes for recently enabled DML code
- Fix drm dirty callback change on non-atomic cases
- USB4 display fix

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20220921220605.6136-1-alexander.deucher@amd.com
3 years agoMerge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Linus Torvalds [Thu, 22 Sep 2022 21:43:55 +0000 (14:43 -0700)]
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI fixes from James Bottomley:
 "Three small and pretty obvious fixes, all in drivers"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: mpt3sas: Fix return value check of dma_get_required_mask()
  scsi: qla2xxx: Fix memory leak in __qlt_24xx_handle_abts()
  scsi: qedf: Fix a UAF bug in __qedf_probe()

3 years agoMerge tag 'slab-for-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka...
Linus Torvalds [Thu, 22 Sep 2022 21:37:58 +0000 (14:37 -0700)]
Merge tag 'slab-for-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab

Pull slab fixes from Vlastimil Babka:

 - Fix a possible use-after-free in SLUB's kmem_cache removal,
   introduced in this cycle, by Feng Tang.

 - WQ_MEM_RECLAIM dependency fix for the workqueue-based cpu slab
   flushing introduced in 5.15, by Maurizio Lombardi.

 - Add missing KASAN hooks in two kmalloc entry paths, by Peter
   Collingbourne.

 - A BUG_ON() removal in SLUB's kmem_cache creation when allocation
   fails (too small to possibly happen in practice, syzbot used fault
   injection), by Chao Yu.

* tag 'slab-for-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
  mm: slub: fix flush_cpu_slab()/__free_slab() invocations in task context.
  mm/slab_common: fix possible double free of kmem_cache
  kasan: call kasan_malloc() from __kmalloc_*track_caller()
  mm/slub: fix to return errno if kmalloc() fails

3 years agoKVM: x86: Inject #UD on emulated XSETBV if XSAVES isn't enabled
Sean Christopherson [Wed, 24 Aug 2022 03:30:57 +0000 (03:30 +0000)]
KVM: x86: Inject #UD on emulated XSETBV if XSAVES isn't enabled

Inject #UD when emulating XSETBV if CR4.OSXSAVE is not set.  This also
covers the "XSAVE not supported" check, as setting CR4.OSXSAVE=1 #GPs if
XSAVE is not supported (and userspace gets to keep the pieces if it
forces incoherent vCPU state).

Add a comment to kvm_emulate_xsetbv() to call out that the CPU checks
CR4.OSXSAVE before checking for intercepts.  AMD'S APM implies that #UD
has priority (says that intercepts are checked before #GP exceptions),
while Intel's SDM says nothing about interception priority.  However,
testing on hardware shows that both AMD and Intel CPUs prioritize the #UD
over interception.

Fixes: 02d4160fbd76 ("x86: KVM: add xsetbv to the emulator")
Cc: stable@vger.kernel.org
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220824033057.3576315-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
3 years agoKVM: x86: Always enable legacy FP/SSE in allowed user XFEATURES
Dr. David Alan Gilbert [Wed, 24 Aug 2022 03:30:56 +0000 (03:30 +0000)]
KVM: x86: Always enable legacy FP/SSE in allowed user XFEATURES

Allow FP and SSE state to be saved and restored via KVM_{G,SET}_XSAVE on
XSAVE-capable hosts even if their bits are not exposed to the guest via
XCR0.

Failing to allow FP+SSE first showed up as a QEMU live migration failure,
where migrating a VM from a pre-XSAVE host, e.g. Nehalem, to an XSAVE
host failed due to KVM rejecting KVM_SET_XSAVE.  However, the bug also
causes problems even when migrating between XSAVE-capable hosts as
KVM_GET_SAVE won't set any bits in user_xfeatures if XSAVE isn't exposed
to the guest, i.e. KVM will fail to actually migrate FP+SSE.

Because KVM_{G,S}ET_XSAVE are designed to allowing migrating between
hosts with and without XSAVE, KVM_GET_XSAVE on a non-XSAVE (by way of
fpu_copy_guest_fpstate_to_uabi()) always sets the FP+SSE bits in the
header so that KVM_SET_XSAVE will work even if the new host supports
XSAVE.

Fixes: ad856280ddea ("x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0")
bz: https://bugzilla.redhat.com/show_bug.cgi?id=2079311
Cc: stable@vger.kernel.org
Cc: Leonardo Bras <leobras@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
[sean: add comment, massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220824033057.3576315-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
3 years agoKVM: x86: Reinstate kvm_vcpu_arch.guest_supported_xcr0
Sean Christopherson [Wed, 24 Aug 2022 03:30:55 +0000 (03:30 +0000)]
KVM: x86: Reinstate kvm_vcpu_arch.guest_supported_xcr0

Reinstate the per-vCPU guest_supported_xcr0 by partially reverting
commit 988896bb6182; the implicit assessment that guest_supported_xcr0 is
always the same as guest_fpu.fpstate->user_xfeatures was incorrect.

kvm_vcpu_after_set_cpuid() isn't the only place that sets user_xfeatures,
as user_xfeatures is set to fpu_user_cfg.default_features when guest_fpu
is allocated via fpu_alloc_guest_fpstate() => __fpstate_reset().
guest_supported_xcr0 on the other hand is zero-allocated.  If userspace
never invokes KVM_SET_CPUID2, supported XCR0 will be '0', whereas the
allowed user XFEATURES will be non-zero.

Practically speaking, the edge case likely doesn't matter as no sane
userspace will live migrate a VM without ever doing KVM_SET_CPUID2. The
primary motivation is to prepare for KVM intentionally and explicitly
setting bits in user_xfeatures that are not set in guest_supported_xcr0.

Because KVM_{G,S}ET_XSAVE can be used to svae/restore FP+SSE state even
if the host doesn't support XSAVE, KVM needs to set the FP+SSE bits in
user_xfeatures even if they're not allowed in XCR0, e.g. because XCR0
isn't exposed to the guest.  At that point, the simplest fix is to track
the two things separately (allowed save/restore vs. allowed XCR0).

Fixes: 988896bb6182 ("x86/kvm/fpu: Remove kvm_vcpu_arch.guest_supported_xcr0")
Cc: stable@vger.kernel.org
Cc: Leonardo Bras <leobras@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220824033057.3576315-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
3 years agoKVM: x86/mmu: add missing update to max_mmu_rmap_size
Miaohe Lin [Wed, 7 Sep 2022 08:06:57 +0000 (16:06 +0800)]
KVM: x86/mmu: add missing update to max_mmu_rmap_size

The update to statistic max_mmu_rmap_size is unintentionally removed by
commit 4293ddb788c1 ("KVM: x86/mmu: Remove redundant spte present check
in mmu_set_spte"). Add missing update to it or max_mmu_rmap_size will
always be nonsensical 0.

Fixes: 4293ddb788c1 ("KVM: x86/mmu: Remove redundant spte present check in mmu_set_spte")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Message-Id: <20220907080657.42898-1-linmiaohe@huawei.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
3 years agoselftests: kvm: Fix a compile error in selftests/kvm/rseq_test.c
Jinrong Liang [Tue, 2 Aug 2022 07:12:40 +0000 (15:12 +0800)]
selftests: kvm: Fix a compile error in selftests/kvm/rseq_test.c

The following warning appears when executing:
make -C tools/testing/selftests/kvm

rseq_test.c: In function ‘main’:
rseq_test.c:237:33: warning: implicit declaration of function ‘gettid’; did you mean ‘getgid’? [-Wimplicit-function-declaration]
          (void *)(unsigned long)gettid());
                                 ^~~~~~
                                 getgid
/usr/bin/ld: /tmp/ccr5mMko.o: in function `main':
../kvm/tools/testing/selftests/kvm/rseq_test.c:237: undefined reference to `gettid'
collect2: error: ld returned 1 exit status
make: *** [../lib.mk:173: ../kvm/tools/testing/selftests/kvm/rseq_test] Error 1

Use the more compatible syscall(SYS_gettid) instead of gettid() to fix it.
More subsequent reuse may cause it to be wrapped in a lib file.

Signed-off-by: Jinrong Liang <cloudliang@tencent.com>
Message-Id: <20220802071240.84626-1-cloudliang@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
3 years agoMerge tag 'kvmarm-fixes-6.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmar...
Paolo Bonzini [Thu, 22 Sep 2022 21:01:33 +0000 (17:01 -0400)]
Merge tag 'kvmarm-fixes-6.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 fixes for 6.0, take #2

- Fix kmemleak usage in Protected KVM (again)

3 years agomm: slub: fix flush_cpu_slab()/__free_slab() invocations in task context.
Maurizio Lombardi [Mon, 19 Sep 2022 16:39:29 +0000 (18:39 +0200)]
mm: slub: fix flush_cpu_slab()/__free_slab() invocations in task context.

Commit 5a836bf6b09f ("mm: slub: move flush_cpu_slab() invocations
__free_slab() invocations out of IRQ context") moved all flush_cpu_slab()
invocations to the global workqueue to avoid a problem related
with deactivate_slab()/__free_slab() being called from an IRQ context
on PREEMPT_RT kernels.

When the flush_all_cpu_locked() function is called from a task context
it may happen that a workqueue with WQ_MEM_RECLAIM bit set ends up
flushing the global workqueue, this will cause a dependency issue.

 workqueue: WQ_MEM_RECLAIM nvme-delete-wq:nvme_delete_ctrl_work [nvme_core]
   is flushing !WQ_MEM_RECLAIM events:flush_cpu_slab
 WARNING: CPU: 37 PID: 410 at kernel/workqueue.c:2637
   check_flush_dependency+0x10a/0x120
 Workqueue: nvme-delete-wq nvme_delete_ctrl_work [nvme_core]
 RIP: 0010:check_flush_dependency+0x10a/0x120[  453.262125] Call Trace:
 __flush_work.isra.0+0xbf/0x220
 ? __queue_work+0x1dc/0x420
 flush_all_cpus_locked+0xfb/0x120
 __kmem_cache_shutdown+0x2b/0x320
 kmem_cache_destroy+0x49/0x100
 bioset_exit+0x143/0x190
 blk_release_queue+0xb9/0x100
 kobject_cleanup+0x37/0x130
 nvme_fc_ctrl_free+0xc6/0x150 [nvme_fc]
 nvme_free_ctrl+0x1ac/0x2b0 [nvme_core]

Fix this bug by creating a workqueue for the flush operation with
the WQ_MEM_RECLAIM bit set.

Fixes: 5a836bf6b09f ("mm: slub: move flush_cpu_slab() invocations __free_slab() invocations out of IRQ context")
Cc: <stable@vger.kernel.org>
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
3 years agoMerge tag 'soc-fixes-6.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Linus Torvalds [Thu, 22 Sep 2022 18:10:11 +0000 (11:10 -0700)]
Merge tag 'soc-fixes-6.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc

Pull ARM SoC fixes from Arnd Bergmann:
 "Another set of fixes for fixes for the soc tree:

   - A fix for the interrupt number on at91/lan966 ethernet PHYs

   - A second round of fixes for NXP i.MX series, including a couple of
     build issues, and board specific DT corrections on TQMa8MPQL,
     imx8mp-venice-gw74xx and imx8mm-verdin for reliability and
     partially broken functionality

   - Several fixes for Rockchip SoCs, addressing a USB issue on
     BPI-R2-Pro, wakeup on Gru-Bob and reliability of high-speed SD
     cards, among other minor issues

   - A fix for a long-running naming mistake that prevented the moxart
     mmc driver from working at all

   - Multiple Arm SCMI firmware fixes for hardening some corner cases"

* tag 'soc-fixes-6.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (30 commits)
  arm64: dts: imx8mp-venice-gw74xx: fix port/phy validation
  ARM: dts: lan966x: Fix the interrupt number for internal PHYs
  arm64: dts: imx8mp-venice-gw74xx: fix ksz9477 cpu port
  arm64: dts: imx8mp-venice-gw74xx: fix CAN STBY polarity
  dt-bindings: memory-controllers: fsl,imx8m-ddrc: drop Leonard Crestez
  arm64: dts: tqma8mqml: Include phy-imx8-pcie.h header
  arm64: defconfig: enable ARCH_NXP
  arm64: dts: imx8mp-tqma8mpql-mba8mpxl: add missing pinctrl for RTC alarm
  ARM: dts: fix Moxa SDIO 'compatible', remove 'sdhci' misnomer
  arm64: dts: imx8mm-verdin: extend pmic voltages
  arm64: dts: rockchip: Remove 'enable-active-low' from rk3566-quartz64-a
  arm64: dts: rockchip: Remove 'enable-active-low' from rk3399-puma
  arm64: dts: rockchip: fix property for usb2 phy supply on rk3568-evb1-v10
  arm64: dts: rockchip: fix property for usb2 phy supply on rock-3a
  arm64: dts: imx8ulp: add #reset-cells for pcc
  arm64: dts: tqma8mpxl-ba8mpxl: Fix button GPIOs
  arm64: dts: imx8mn: remove GPU power domain reset
  arm64: dts: rockchip: Set RK3399-Gru PCLK_EDP to 24 MHz
  arm64: dts: imx8mm: Reverse CPLD_Dn GPIO label mapping on MX8Menlo
  arm64: dts: rockchip: fix upper usb port on BPI-R2-Pro
  ...

3 years agoMerge tag 'net-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Linus Torvalds [Thu, 22 Sep 2022 17:58:13 +0000 (10:58 -0700)]
Merge tag 'net-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from wifi, netfilter and can.

  A handful of awaited fixes here - revert of the FEC changes, bluetooth
  fix, fixes for iwlwifi spew.

  We added a warning in PHY/MDIO code which is triggering on a couple of
  platforms in a false-positive-ish way. If we can't iron that out over
  the week we'll drop it and re-add for 6.1.

  I've added a new "follow up fixes" section for fixes to fixes in
  6.0-rcs but it may actually give the false impression that those are
  problematic or that more testing time would have caught them. So
  likely a one time thing.

  Follow up fixes:

   - nf_tables_addchain: fix nft_counters_enabled underflow

   - ebtables: fix memory leak when blob is malformed

   - nf_ct_ftp: fix deadlock when nat rewrite is needed

  Current release - regressions:

   - Revert "fec: Restart PPS after link state change" and the related
     "net: fec: Use a spinlock to guard `fep->ptp_clk_on`"

   - Bluetooth: fix HCIGETDEVINFO regression

   - wifi: mt76: fix 5 GHz connection regression on mt76x0/mt76x2

   - mptcp: fix fwd memory accounting on coalesce

   - rwlock removal fall out:
      - ipmr: always call ip{,6}_mr_forward() from RCU read-side
        critical section
      - ipv6: fix crash when IPv6 is administratively disabled

   - tcp: read multiple skbs in tcp_read_skb()

   - mdio_bus_phy_resume state warning fallout:
      - eth: ravb: fix PHY state warning splat during system resume
      - eth: sh_eth: fix PHY state warning splat during system resume

  Current release - new code bugs:

   - wifi: iwlwifi: don't spam logs with NSS>2 messages

   - eth: mtk_eth_soc: enable XDP support just for MT7986 SoC

  Previous releases - regressions:

   - bonding: fix NULL deref in bond_rr_gen_slave_id

   - wifi: iwlwifi: mark IWLMEI as broken

  Previous releases - always broken:

   - nf_conntrack helpers:
      - irc: tighten matching on DCC message
      - sip: fix ct_sip_walk_headers
      - osf: fix possible bogus match in nf_osf_find()

   - ipvlan: fix out-of-bound bugs caused by unset skb->mac_header

   - core: fix flow symmetric hash

   - bonding, team: unsync device addresses on ndo_stop

   - phy: micrel: fix shared interrupt on LAN8814"

* tag 'net-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (83 commits)
  selftests: forwarding: add shebang for sch_red.sh
  bnxt: prevent skb UAF after handing over to PTP worker
  net: marvell: Fix refcounting bugs in prestera_port_sfp_bind()
  net: sched: fix possible refcount leak in tc_new_tfilter()
  net: sunhme: Fix packet reception for len < RX_COPY_THRESHOLD
  udp: Use WARN_ON_ONCE() in udp_read_skb()
  selftests: bonding: cause oops in bond_rr_gen_slave_id
  bonding: fix NULL deref in bond_rr_gen_slave_id
  net: phy: micrel: fix shared interrupt on LAN8814
  net/smc: Stop the CLC flow if no link to map buffers on
  ice: Fix ice_xdp_xmit() when XDP TX queue number is not sufficient
  net: atlantic: fix potential memory leak in aq_ndev_close()
  can: gs_usb: gs_usb_set_phys_id(): return with error if identify is not supported
  can: gs_usb: gs_can_open(): fix race dev->can.state condition
  can: flexcan: flexcan_mailbox_read() fix return value for drop = true
  net: sh_eth: Fix PHY state warning splat during system resume
  net: ravb: Fix PHY state warning splat during system resume
  netfilter: nf_ct_ftp: fix deadlock when nat rewrite is needed
  netfilter: ebtables: fix memory leak when blob is malformed
  netfilter: nf_tables: fix percpu memory leak at nf_tables_addchain()
  ...

3 years agoMerge tag 'efi-urgent-for-v6.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Thu, 22 Sep 2022 17:27:38 +0000 (10:27 -0700)]
Merge tag 'efi-urgent-for-v6.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi

Pull EFI fixes from Ard Biesheuvel:

 - Use the right variable to check for shim insecure mode

 - Wipe setup_data field when booting via EFI

 - Add missing error check to efibc driver

* tag 'efi-urgent-for-v6.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
  efi: libstub: check Shim mode using MokSBStateRT
  efi: x86: Wipe setup_data on pure EFI boot
  efi: efibc: Guard against allocation failure

3 years agoMerge tag 'gpio-fixes-for-v6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Thu, 22 Sep 2022 17:17:29 +0000 (10:17 -0700)]
Merge tag 'gpio-fixes-for-v6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux

Pull gpio fixes from Bartosz Golaszewski:

 - fix a NULL-pointer dereference at driver unbind and a potential
   resource leak in error path in gpio-mockup

 - make the irqchip immutable in gpio-ftgpio010

 - fix dereferencing a potentially uninitialized variable in gpio-tqmx86

 - fix interrupt registering in gpiolib's character device code

* tag 'gpio-fixes-for-v6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
  gpiolib: cdev: Set lineevent_state::irq after IRQ register successfully
  gpio: tqmx86: fix uninitialized variable girq
  gpio: ftgpio010: Make irqchip immutable
  gpio: mockup: Fix potential resource leakage when register a chip
  gpio: mockup: fix NULL pointer dereference when removing debugfs

3 years agoMerge tag 'perf-tools-fixes-for-v6.0-2022-09-21' of git://git.kernel.org/pub/scm...
Linus Torvalds [Thu, 22 Sep 2022 17:12:21 +0000 (10:12 -0700)]
Merge tag 'perf-tools-fixes-for-v6.0-2022-09-21' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux

Pull perf tools fixes from Arnaldo Carvalho de Melo:

 - Fix polling of system-wide events related to mixing per-cpu and
   per-thread events.

 - Do not check if /proc/modules is unchanged when copying /proc/kcore,
   that doesn't get in the way of post processing analysis.

 - Include program header in ELF files generated for JIT files, so that
   they can be opened by tools using elfutils libraries.

 - Enter namespaces when synthesizing build-ids.

 - Fix some bugs related to a recent cpu_map overhaul where we should be
   using an index and not the cpu number.

 - Fix BPF program ELF section name, using the naming expected by libbpf
   when using BPF counters in 'perf stat'.

 - Add a new test for perf stat cgroup BPF counter.

 - Adjust check on 'perf test wp' for older kernels, where the
   PERF_EVENT_IOC_MODIFY_ATTRIBUTES ioctl isn't supported.

 - Sync x86 cpufeatures with the kernel sources, no changes in tooling.

* tag 'perf-tools-fixes-for-v6.0-2022-09-21' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux:
  perf tools: Honor namespace when synthesizing build-ids
  tools headers cpufeatures: Sync with the kernel sources
  perf kcore_copy: Do not check /proc/modules is unchanged
  libperf evlist: Fix polling of system-wide events
  perf record: Fix cpu mask bit setting for mixed mmaps
  perf test: Skip wp modify test on old kernels
  perf jit: Include program header in ELF files
  perf test: Add a new test for perf stat cgroup BPF counter
  perf stat: Use evsel->core.cpus to iterate cpus in BPF cgroup counters
  perf stat: Fix cpu map index in bperf cgroup code
  perf stat: Fix BPF program section name

3 years agoext4: limit the number of retries after discarding preallocations blocks
Theodore Ts'o [Thu, 1 Sep 2022 22:03:14 +0000 (18:03 -0400)]
ext4: limit the number of retries after discarding preallocations blocks

This patch avoids threads live-locking for hours when a large number
threads are competing over the last few free extents as they blocks
getting added and removed from preallocation pools.  From our bug
reporter:

   A reliable way for triggering this has multiple writers
   continuously write() to files when the filesystem is full, while
   small amounts of space are freed (e.g. by truncating a large file
   -1MiB at a time). In the local filesystem, this can be done by
   simply not checking the return code of write (0) and/or the error
   (ENOSPACE) that is set. Over NFS with an async mount, even clients
   with proper error checking will behave this way since the linux NFS
   client implementation will not propagate the server errors [the
   write syscalls immediately return success] until the file handle is
   closed. This leads to a situation where NFS clients send a
   continuous stream of WRITE rpcs which result in ERRNOSPACE -- but
   since the client isn't seeing this, the stream of writes continues
   at maximum network speed.

   When some space does appear, multiple writers will all attempt to
   claim it for their current write. For NFS, we may see dozens to
   hundreds of threads that do this.

   The real-world scenario of this is database backup tooling (in
   particular, github.com/mdkent/percona-xtrabackup) which may write
   large files (>1TiB) to NFS for safe keeping. Some temporary files
   are written, rewound, and read back -- all before closing the file
   handle (the temp file is actually unlinked, to trigger automatic
   deletion on close/crash.) An application like this operating on an
   async NFS mount will not see an error code until TiB have been
   written/read.

   The lockup was observed when running this database backup on large
   filesystems (64 TiB in this case) with a high number of block
   groups and no free space. Fragmentation is generally not a factor
   in this filesystem (~thousands of large files, mostly contiguous
   except for the parts written while the filesystem is at capacity.)

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
3 years agoext4: fix bug in extents parsing when eh_entries == 0 and eh_depth > 0
Luís Henriques [Mon, 22 Aug 2022 09:42:35 +0000 (10:42 +0100)]
ext4: fix bug in extents parsing when eh_entries == 0 and eh_depth > 0

When walking through an inode extents, the ext4_ext_binsearch_idx() function
assumes that the extent header has been previously validated.  However, there
are no checks that verify that the number of entries (eh->eh_entries) is
non-zero when depth is > 0.  And this will lead to problems because the
EXT_FIRST_INDEX() and EXT_LAST_INDEX() will return garbage and result in this:

[  135.245946] ------------[ cut here ]------------
[  135.247579] kernel BUG at fs/ext4/extents.c:2258!
[  135.249045] invalid opcode: 0000 [#1] PREEMPT SMP
[  135.250320] CPU: 2 PID: 238 Comm: tmp118 Not tainted 5.19.0-rc8+ #4
[  135.252067] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b-rebuilt.opensuse.org 04/01/2014
[  135.255065] RIP: 0010:ext4_ext_map_blocks+0xc20/0xcb0
[  135.256475] Code:
[  135.261433] RSP: 0018:ffffc900005939f8 EFLAGS: 00010246
[  135.262847] RAX: 0000000000000024 RBX: ffffc90000593b70 RCX: 0000000000000023
[  135.264765] RDX: ffff8880038e5f10 RSI: 0000000000000003 RDI: ffff8880046e922c
[  135.266670] RBP: ffff8880046e9348 R08: 0000000000000001 R09: ffff888002ca580c
[  135.268576] R10: 0000000000002602 R11: 0000000000000000 R12: 0000000000000024
[  135.270477] R13: 0000000000000000 R14: 0000000000000024 R15: 0000000000000000
[  135.272394] FS:  00007fdabdc56740(0000) GS:ffff88807dd00000(0000) knlGS:0000000000000000
[  135.274510] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  135.276075] CR2: 00007ffc26bd4f00 CR3: 0000000006261004 CR4: 0000000000170ea0
[  135.277952] Call Trace:
[  135.278635]  <TASK>
[  135.279247]  ? preempt_count_add+0x6d/0xa0
[  135.280358]  ? percpu_counter_add_batch+0x55/0xb0
[  135.281612]  ? _raw_read_unlock+0x18/0x30
[  135.282704]  ext4_map_blocks+0x294/0x5a0
[  135.283745]  ? xa_load+0x6f/0xa0
[  135.284562]  ext4_mpage_readpages+0x3d6/0x770
[  135.285646]  read_pages+0x67/0x1d0
[  135.286492]  ? folio_add_lru+0x51/0x80
[  135.287441]  page_cache_ra_unbounded+0x124/0x170
[  135.288510]  filemap_get_pages+0x23d/0x5a0
[  135.289457]  ? path_openat+0xa72/0xdd0
[  135.290332]  filemap_read+0xbf/0x300
[  135.291158]  ? _raw_spin_lock_irqsave+0x17/0x40
[  135.292192]  new_sync_read+0x103/0x170
[  135.293014]  vfs_read+0x15d/0x180
[  135.293745]  ksys_read+0xa1/0xe0
[  135.294461]  do_syscall_64+0x3c/0x80
[  135.295284]  entry_SYSCALL_64_after_hwframe+0x46/0xb0

This patch simply adds an extra check in __ext4_ext_check(), verifying that
eh_entries is not 0 when eh_depth is > 0.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=215941
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216283
Cc: Baokun Li <libaokun1@huawei.com>
Cc: stable@kernel.org
Signed-off-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Link: https://lore.kernel.org/r/20220822094235.2690-1-lhenriques@suse.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>