We need to improve this to show the source code line numbers in the
annotation view, so one can go from that program block to the annotation view
and see those source code line numbers straight away.
auxtrace/Intel PT:
Adrian Hunter:
- Add support for AUX area sampling, requires new functionality that
will land in 5.5, its already in tip.
This includes kernel capability querying so that it fails gracefully
with older kernels, duimping aux area samples in 'perf report -D' and
'perf script'.
perf.data:
Alexey Budankov:
- Fix decompression of PERF_RECORD_COMPRESSED records.
core:
Arnaldo Carvalho de Melo:
- Use the 'dcacheline' cmp routine to find the right DSOs taking into
account the 'maj', 'min', 'ino' and 'ino_generation', that got moved
from 'struct map' to 'struct dso', where it belongs.
This further reduces the size of 'struct map', there is still more
work to do to maybe get it to max one cacheline.
libtraceevent:
Hewenliang:
- Fix memory leakage in copy_filter_type().
Sudip Mukherjee:
- Fix header installation.
perf parse:
Ian Rogers :
- Fix potential memory leak when handling tracepoint errors, found using
LLVM's libFuzzer.
perf probe:
Colin Ian King:
- Fix spelling mistake "addrees" -> "address".
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org>
Ian Rogers [Wed, 20 Nov 2019 18:09:25 +0000 (10:09 -0800)]
perf parse: Fix potential memory leak when handling tracepoint errors
An error may be in place when tracepoint_error is called, use
parse_events__handle_error to avoid a memory leak and to capture the
first and last error. Error detected by LLVM's libFuzzer using the
following event:
There is a spelling mistake in a pr_warning message. Fix it.
Signed-off-by: Colin King <colin.king@canonical.com> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: kernel-janitors@vger.kernel.org Link: http://lore.kernel.org/lkml/20191121092623.374896-1-colin.king@canonical.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Adrian Hunter [Fri, 15 Nov 2019 12:42:24 +0000 (14:42 +0200)]
perf intel-pt: Add support for decoding AUX area samples
Add support for dumping, queuing and decoding AUX area samples. Decoding
samples is the same as regular decoding, except in the case where there
are no timestamps, in which case buffers are decoded immediately before
the sample event.
Adrian Hunter [Fri, 15 Nov 2019 12:42:22 +0000 (14:42 +0200)]
perf pmu: When using default config, record which bits of config were changed by the user
Default config for a PMU is defined before selected events are parsed.
That allows the user-entered config to override the default config.
However that does not allow for changing the default config based on
other options.
For example, if the user chooses AUX area sampling mode, in the case of
Intel PT, the psb_period needs to be small for sampling, so there is a
need to set the default psb_period to 0 (2 KiB) in that case. However
that should not override a value set by the user. To allow for that,
when using default config, record which bits of config were changed by
the user.
Adrian Hunter [Fri, 15 Nov 2019 12:42:21 +0000 (14:42 +0200)]
perf auxtrace: Add support for queuing AUX area samples
Add functions to queue AUX area samples in advance
(auxtrace_queue_data()) or individually (auxtrace_queues__add_sample())
or find out what queue a sample belongs on
(auxtrace_queues__sample_queue()).
auxtrace_queue_data() can also queue snapshot data which keeps snapshots
and samples ordered with respect to each other in case support for that
is desired.
Adrian Hunter [Fri, 15 Nov 2019 12:42:20 +0000 (14:42 +0200)]
perf session: Add facility to peek at all events
AUX area samples are not limited in how far back in time the sample
could start. Consequently samples must be queued in advance to allow for
time-ordered processing. To achieve that, add
perf_session__peek_events() that walks and peeks at all the events.
Adrian Hunter [Fri, 15 Nov 2019 12:42:17 +0000 (14:42 +0200)]
perf record: Add aux-sample-size config term
To allow individual events to be selected for AUX area sampling, add
aux-sample-size config term. attr.aux_sample_size is updated by
auxtrace_parse_sample_options() so that the existing validation will see
the value. Any event that has a non-zero aux_sample_size will cause AUX
area sampling to be configured, irrespective of the --aux-sample option.
Adrian Hunter [Fri, 15 Nov 2019 12:42:16 +0000 (14:42 +0200)]
perf record: Add support for AUX area sampling
Add a 'perf record' option '--aux-sample' to request AUX area sampling.
AUX area sampling uses an overwriting buffer much like snapshot mode, so
adjust the AUX buffer mmapping accordingly. To make it easy to queue
samples for decoding, synthesize an ID index.
Adrian Hunter [Fri, 15 Nov 2019 12:42:15 +0000 (14:42 +0200)]
perf auxtrace: Add support for AUX area sample recording
Add support for parsing and validating AUX area sample options. At
present, the only option is the sample size, but it is also necessary to
ensure that events are in a group with an AUX area event as the leader.
Committer note:
Add missing 'static inline' in front of auxtrace_parse_sample_options()
for when we don't HAVE_AUXTRACE_SUPPORT.
Adrian Hunter [Fri, 15 Nov 2019 12:42:13 +0000 (14:42 +0200)]
perf record: Add a function to test for kernel support for AUX area sampling
Architectures are expected to know if AUX area sampling is supported by
the hardware. Add a function perf_can_aux_sample() which will determine
whether the kernel supports it.
Committer notes:
I reported that this message was taking place on a kernel without the
required bits:
# perf record --aux-sample -e '{intel_pt//u,branch-misses:u}'
Error:
The sys_perf_event_open() syscall returned with 7 (Argument list too long) for event (branch-misses:u).
/bin/dmesg | grep -i perf may provide additional information.
Adrian sent a patch addressing it, with this explanation:
----
perf_can_aux_sample_size() always returned true because it did not pass
the attribute size to sys_perf_event_open, nor correctly check the
return value and errno.
----
After applying it I get, later in the series, when --aux-sample is
added:
# perf record --aux-sample -e '{intel_pt//u,branch-misses:u}'
AUX area sampling is not supported by kernel
Adrian Hunter [Fri, 15 Nov 2019 12:42:11 +0000 (14:42 +0200)]
perf tools: Add kernel AUX area sampling definitions
Add kernel AUX area sampling definitions, which brings perf_event.h into
line with the kernel version.
New sample type PERF_SAMPLE_AUX requests a sample of the AUX area
buffer. New perf_event_attr member 'aux_sample_size' specifies the
desired size of the sample.
Also add support for parsing samples containing AUX area data i.e.
PERF_SAMPLE_AUX.
Committer notes:
I squashed the first two patches in this series to avoid breaking
automatic bisection, i.e. after applying only the original first patch
in this series we would have:
# perf test -v parsing
26: Sample parsing :
--- start ---
test child forked, pid 17018
sample format has changed, some new PERF_SAMPLE_ bit was introduced - test needs updating
test child finished with -1
---- end ----
Sample parsing: FAILED!
#
Alexander Shishkin [Wed, 20 Nov 2019 17:06:40 +0000 (19:06 +0200)]
perf/core: Make the mlock accounting simple again
Commit:
d44248a41337 ("perf/core: Rework memory accounting in perf_mmap()")
does a lot of things to the mlock accounting arithmetics, while the only
thing that actually needed to happen is subtracting the part that is
charged to the mm from the part that is charged to the user, so that the
former isn't charged twice.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Acked-by: Song Liu <songliubraving@fb.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Wanpeng Li <wanpengli@tencent.com> Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com> Cc: songliubraving@fb.com Link: https://lkml.kernel.org/r/20191120170640.54123-1-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
Jin Yao [Mon, 18 Nov 2019 14:08:49 +0000 (22:08 +0800)]
perf report: Jump to symbol source view from total cycles view
This patch supports jumping from tui total cycles view to symbol source
view.
For example,
perf record -b ./div
perf report --total-cycles
In total cycles view, we can select one entry and press 'a' or press
ENTER key to jump to symbol source view.
This patch also sets sort_order to NULL in cmd_report() which will use
the default branch sort order. The percent value in new annotate view
will be consistent with the percent in annotate view switched from perf
report (we observed the original percent gap with previous patches).
v2:
---
Fix the 'make NO_SLANG=1' error. (set __maybe_unused to
annotation_opts in block_hists_tui_browse()).
Signed-off-by: Jin Yao <yao.jin@linux.intel.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Jin Yao <yao.jin@intel.com> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lore.kernel.org/lkml/20191118140849.20714-2-yao.jin@linux.intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Jin Yao [Mon, 18 Nov 2019 14:08:48 +0000 (22:08 +0800)]
perf util: Move block TUI function to ui browsers
It would be nice if we could jump to the assembler/source view (like the
normal perf report) from total cycles view.
This patch moves the block_hists_tui_browse from block-info.c to
ui/browsers/hists.c in order to reuse some browser codes (i.e
do_annotate) for implementing new annotation view.
v2:
---
Fix the 'make NO_SLANG=1' error. (Change 'int block_hists_tui_browse()'
to 'static inline int block_hists_tui_browse()')
Signed-off-by: Jin Yao <yao.jin@linux.intel.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Jin Yao <yao.jin@intel.com> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lore.kernel.org/lkml/20191118140849.20714-1-yao.jin@linux.intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Alexey Budankov [Mon, 18 Nov 2019 14:21:03 +0000 (17:21 +0300)]
perf session: Fix decompression of PERF_RECORD_COMPRESSED records
Avoid termination of trace loading in case the last record in the
decompressed buffer partly resides in the following mmaped
PERF_RECORD_COMPRESSED record.
In this case NULL value returned by fetch_mmaped_event() means to
proceed to the next mmaped record then decompress it and load compressed
events.
The issue can be reproduced like this:
$ perf record -z -- some_long_running_workload
$ perf report --stdio -vv
decomp (B): 44519 to 163000
decomp (B): 48119 to 174800
decomp (B): 65527 to 131072
fetch_mmaped_event: head=0x1ffe0 event->header_size=0x28, mmap_size=0x20000: fuzzed perf.data?
Error:
failed to process sample
...
Testing:
71: Zstd perf.data compression/decompression : Ok
$ tools/perf/perf report -vv --stdio
decomp (B): 59593 to 262160
decomp (B): 4438 to 16512
decomp (B): 285 to 880
Looking at the vmlinux_path (8 entries long)
Using vmlinux for symbols
decomp (B): 57474 to 261248
prefetch_event: head=0x3fc78 event->header_size=0x28, mmap_size=0x3fc80: fuzzed or compressed perf.data?
decomp (B): 25 to 32
decomp (B): 52 to 120
...
Fixes: 57fc032ad643 ("perf session: Avoid infinite loop when seeing invalid header.size") Link: https://marc.info/?l=linux-kernel&m=156580812427554&w=2 Co-developed-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lore.kernel.org/lkml/cf782c34-f3f8-2f9f-d6ab-145cee0d5322@linux.intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Arnaldo Carvalho de Melo [Tue, 19 Nov 2019 21:44:22 +0000 (18:44 -0300)]
perf dso: Move dso_id from 'struct map' to 'struct dso'
And take it into account when looking up DSOs when we have the dso_id
fields obtained from somewhere, like from PERF_RECORD_MMAP2 records.
Instances of struct map pointing to the same DSO pathname but with
anything in dso_id different are in fact different DSOs, so better have
different 'struct dso' instances to reflect that. At some point we may
want to get copies of the contents of the different objects if we want
to do correct annotation or other analysis.
Arnaldo Carvalho de Melo [Tue, 19 Nov 2019 15:26:19 +0000 (12:26 -0300)]
perf map: Move maj/min/ino/ino_generation to separate struct
And this patch highlights where these fields are being used: in the sort
order where it uses it to compare maps and classify samples taking into
account not just the DSO, but those DSO id fields.
I think these should be used to differentiate DSOs with the same name
but different 'struct dso_id' fields, i.e. these fields should move to
'struct dso' and then be used as part of the key when doing lookups for
DSOs, in addition to the DSO name.
Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Link: https://lkml.kernel.org/n/tip-8v5isitqy0dup47nnwkpc80f@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
- Fix segfault in thread__resolve_callchain_sample().
perf probe:
- Line fixes to show only lines where probes can be used with 'perf probe -L',
and when reporting them via 'perf probe -l'.
- Support multiprobe events.
perf scripts python:
Adrian Hunter:
- Fix use of TRUE with SQLite < 3.23 in exported-sql-viewer.py.
perf maps:
- Trim 'struct map' by removing the rb_node member for sorting
by map name, as that is only needed for processing kernel maps,
and only when classifying symbols by section at load time.
Sort them by name using qsort() and do lookups using bsearch()
when map_groups__find_by_name() is used.
perf parse:
Ian Rogers:
- Report initial event parsing error, providing a less cryptic message
to state that a PMU wasn't found in the system.
perf vendor events:
James Clark:
- Fix commas so that PMU event files for arm64, power8 and power nine
become valid JSON.
libtraceevent:
Konstantin Khlebnikov:
- Fix parsing of event %o and %X argument types.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org>
Masami Hiramatsu [Mon, 18 Nov 2019 08:12:49 +0000 (17:12 +0900)]
perf probe: Trace a magic number if variable is not found
Trace a magic number as immediate value if the target variable is not
found at some probe points which is based on one probe event.
This feature is good for the case if you trace a source code line with
some local variables, which is compiled into several instructions and
some of the variables are optimized out on some instructions.
Even if so, with this feature, perf probe trace a magic number instead
of such disappeared variables and fold those probes on one event.
E.g. without this patch:
# perf probe -D "pud_page_vaddr pud"
Failed to find 'pud' in this function.
Failed to find 'pud' in this function.
Failed to find 'pud' in this function.
Failed to find 'pud' in this function.
Failed to find 'pud' in this function.
Failed to find 'pud' in this function.
Failed to find 'pud' in this function.
Failed to find 'pud' in this function.
Failed to find 'pud' in this function.
Failed to find 'pud' in this function.
Failed to find 'pud' in this function.
Failed to find 'pud' in this function.
Failed to find 'pud' in this function.
Failed to find 'pud' in this function.
Failed to find 'pud' in this function.
Failed to find 'pud' in this function.
p:probe/pud_page_vaddr _text+23480787 pud=%ax:x64
p:probe/pud_page_vaddr _text+23808453 pud=%bp:x64
p:probe/pud_page_vaddr _text+23558082 pud=%ax:x64
p:probe/pud_page_vaddr _text+328373 pud=%r8:x64
p:probe/pud_page_vaddr _text+348448 pud=%bx:x64
p:probe/pud_page_vaddr _text+23816818 pud=%bx:x64
With this patch:
# perf probe -D "pud_page_vaddr pud" | head
spurious_kernel_fault is blacklisted function, skip it.
vmalloc_fault is blacklisted function, skip it.
p:probe/pud_page_vaddr _text+23480787 pud=%ax:x64
p:probe/pud_page_vaddr _text+149051 pud=\deade12d:x64
p:probe/pud_page_vaddr _text+23808453 pud=%bp:x64
p:probe/pud_page_vaddr _text+315926 pud=\deade12d:x64
p:probe/pud_page_vaddr _text+23807209 pud=\deade12d:x64
p:probe/pud_page_vaddr _text+23557365 pud=%ax:x64
p:probe/pud_page_vaddr _text+314097 pud=%di:x64
p:probe/pud_page_vaddr _text+314015 pud=\deade12d:x64
p:probe/pud_page_vaddr _text+313893 pud=\deade12d:x64
p:probe/pud_page_vaddr _text+324083 pud=\deade12d:x64
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Tom Zanussi <tom.zanussi@linux.intel.com> Link: http://lore.kernel.org/lkml/157406476931.24476.6261475888681844285.stgit@devnote2 Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Masami Hiramatsu [Mon, 18 Nov 2019 08:12:30 +0000 (17:12 +0900)]
perf probe: Support multiprobe event
Support multiprobe event if the event is based on function and lines and
kernel supports it. In this case, perf probe creates the first probe
with an event, and tries to append following probes on that event, since
those probes must be on the same source code line.
Before this patch;
# perf probe -a vfs_read:18
Added new events:
probe:vfs_read_L18 (on vfs_read:18)
probe:vfs_read_L18_1 (on vfs_read:18)
You can now use it in all perf tools, such as:
perf record -e probe:vfs_read_L18_1 -aR sleep 1
#
After this patch (on multiprobe supported kernel)
# perf probe -a vfs_read:18
Added new events:
probe:vfs_read_L18 (on vfs_read:18)
probe:vfs_read_L18 (on vfs_read:18)
You can now use it in all perf tools, such as:
perf record -e probe:vfs_read_L18 -aR sleep 1
#
Committer testing:
On a kernel that doesn't support multiprobe events, after this patch:
# uname -a
Linux quaco 5.3.8-200.fc30.x86_64 #1 SMP Tue Oct 29 14:46:22 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
# grep append /sys/kernel/debug/tracing/README
be modified by appending '.descending' or '.ascending' to a
can be modified by appending any of the following modifiers
#
# perf probe -a vfs_read:18
Added new events:
probe:vfs_read_L18 (on vfs_read:18)
probe:vfs_read_L18_1 (on vfs_read:18)
You can now use it in all perf tools, such as:
perf record -e probe:vfs_read_L18_1 -aR sleep 1
# perf probe -l
probe:vfs_read_L18 (on vfs_read:18@fs/read_write.c)
probe:vfs_read_L18_1 (on vfs_read:18@fs/read_write.c)
#
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Tom Zanussi <tom.zanussi@linux.intel.com> Link: http://lore.kernel.org/lkml/157406475010.24476.586290752591512351.stgit@devnote2 Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Masami Hiramatsu [Mon, 18 Nov 2019 08:12:20 +0000 (17:12 +0900)]
perf probe: Generate event name with line number
Generate event name from function name with line number as
<function>_L<line_number>. Note that this is only for the new event
which is defined by the line number of function (except for line 0).
If there is another event on same line, you have to use
"-f" option. In that case, the new event has "_1" suffix.
e.g.
# perf probe -a kernel_read:2
Added new event:
probe:kernel_read_L2 (on kernel_read:2)
You can now use it in all perf tools, such as:
perf record -e probe:kernel_read_L2 -aR sleep 1
But if we omit the line number or 0th line, it will
have no suffix.
# perf probe -a kernel_read:0
Added new event:
probe:kernel_read (on kernel_read)
You can now use it in all perf tools, such as:
perf record -e probe:kernel_read -aR sleep 1
probe:kernel_read (on kernel_read@linux-5.0.0/fs/read_write.c)
probe:kernel_read_L2 (on kernel_read:2@linux-5.0.0/fs/read_write.c)
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Tom Zanussi <tom.zanussi@linux.intel.com> Link: http://lore.kernel.org/lkml/157406474026.24476.2828897745502059569.stgit@devnote2 Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Masami Hiramatsu [Mon, 18 Nov 2019 08:12:10 +0000 (17:12 +0900)]
perf probe: Do not show non representive lines by perf-probe -L
Since perf probe -L shows non representive lines, it can be mislead
users where user can put probes. This prevents to show such non
representive lines so that user can understand which lines user can
probe.
old_fs = get_fs();
6 set_fs(get_ds());
/* The cast to a user pointer is valid due to the set_fs() */
8 result = vfs_read(file, (void __user *)buf, count, pos);
9 set_fs(old_fs);
10 return result;
}
EXPORT_SYMBOL(kernel_read);
5 old_fs = get_fs();
6 set_fs(KERNEL_DS);
/* The cast to a user pointer is valid due to the set_fs() */
8 result = vfs_read(file, (void __user *)buf, count, pos);
9 set_fs(old_fs);
10 return result;
}
EXPORT_SYMBOL(kernel_read);
#
See the 1, 3, 5 lines? They shouldn't be there, after this patch:
old_fs = get_fs();
6 set_fs(KERNEL_DS);
/* The cast to a user pointer is valid due to the set_fs() */
8 result = vfs_read(file, (void __user *)buf, count, pos);
9 set_fs(old_fs);
10 return result;
}
EXPORT_SYMBOL(kernel_read);
#
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Reported-by: Arnaldo Carvalho de Melo <acme@redhat.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Tom Zanussi <tom.zanussi@linux.intel.com> Link: http://lore.kernel.org/lkml/157406473064.24476.2913278267727587314.stgit@devnote2 Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Masami Hiramatsu [Mon, 18 Nov 2019 08:12:00 +0000 (17:12 +0900)]
perf probe: Verify given line is a representive line
Verify user given probe line is a representive line (which doesn't share
the address with other lines or the line is the least line among the
lines which shares same address), and if not, it shows what is the
representive line.
Without this fix, user can put a probe on the lines which is not a a
representive line. But since this is not a representive line, perf probe
-l shows a representive line number instead of user given line number.
e.g. (put kernel_read:3, but listed as kernel_read:2)
# perf probe -a kernel_read:3
Added new event:
probe:kernel_read (on kernel_read:3)
You can now use it in all perf tools, such as:
perf record -e probe:kernel_read -aR sleep 1
# perf probe -l
probe:kernel_read (on kernel_read:2@linux-5.0.0/fs/read_write.c)
With this fix, perf probe doesn't allow user to put a probe on a
representive line, and tell what is the representive line.
# perf probe -a kernel_read:3
This line is sharing the addrees with other lines.
Please try to probe at kernel_read:2 instead.
Error: Failed to add events.
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Reported-by: Arnaldo Carvalho de Melo <acme@redhat.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Tom Zanussi <tom.zanussi@linux.intel.com> Link: http://lore.kernel.org/lkml/157406472071.24476.14915451439785001021.stgit@devnote2 Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Masami Hiramatsu [Mon, 18 Nov 2019 08:11:50 +0000 (17:11 +0900)]
perf probe: Show correct statement line number by perf probe -l
The dwarf_getsrc_die() can return the line which is not a statement nor
the least line number among the lines which shares same address.
This can lead perf probe --list shows incorrect line number for probed
address.
To fix this, this introduces cu_getsrc_die() which returns only a
statement line and which is the least line number (we call it the
representive line for an address), and use it in cu_find_lineinfo().
Also, if the given address is the entry address of a real function,
cu_find_lineinfo() returns the function declared line number instead of
the start line number of the function body.
For example, without this change perf probe -l shows incorrect line as
below.
# perf probe -a kernel_read:2
Added new event:
probe:kernel_read (on kernel_read:2)
You can now use it in all perf tools, such as:
perf record -e probe:kernel_read -aR sleep 1
# perf probe -l
probe:kernel_read (on kernel_read:1@linux-5.0.0/fs/read_write.c)
With this fix, it shows correct line number as below;
# perf probe -l
probe:kernel_read (on kernel_read:2@linux-5.0.0/fs/read_write.c)
Reported-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Tom Zanussi <tom.zanussi@linux.intel.com> Link: http://lore.kernel.org/lkml/157406471067.24476.17463149618465494448.stgit@devnote2 Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Adrian Hunter [Fri, 15 Nov 2019 13:54:47 +0000 (15:54 +0200)]
x86/insn: Add some Intel instructions to the opcode map
Add to the opcode map the following instructions:
cldemote
tpause
umonitor
umwait
movdiri
movdir64b
enqcmd
enqcmds
encls
enclu
enclv
pconfig
wbnoinvd
For information about the instructions, refer Intel SDM May 2019
(325462-070US) and Intel Architecture Instruction Set Extensions
May 2019 (319433-037).
The instruction decoding can be tested using the perf tools'
"x86 instruction decoder - new instructions" test as folllows:
Adrian Hunter [Fri, 15 Nov 2019 13:54:46 +0000 (15:54 +0200)]
x86/insn: perf tools: Add some instructions to the new instructions test
Add to the "x86 instruction decoder - new instructions" test the following
instructions:
cldemote
tpause
umonitor
umwait
movdiri
movdir64b
enqcmd
enqcmds
encls
enclu
enclv
pconfig
wbnoinvd
For information about the instructions, refer Intel SDM May 2019
(325462-070US) and Intel Architecture Instruction Set Extensions
May 2019 (319433-037).
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: x86@kernel.org Link: http://lore.kernel.org/lkml/20191115135447.6519-2-adrian.hunter@intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
/* size: 128, cachelines: 2, members: 17 */
/* sum members: 116, holes: 2, sum holes: 7 */
/* sum bitfield members: 2 bits, bit holes: 1, sum bit holes: 6 bits */
/* padding: 4 */
/* forced alignments: 1 */
} __attribute__((__aligned__(8)));
$
and 'flags' is seldom used when printing details about the map or with
the "cacheline" sort order, we can move them it to the second cacheline,
that will allow combining it with 'refcnt', that is only four bytes:
Arnaldo Carvalho de Melo [Mon, 18 Nov 2019 19:26:29 +0000 (16:26 -0300)]
perf map: Use bitmap for booleans
The map->priv and map->erange_warned are seldom used, the first only in
tests/vmlinux-kallsyms.c, the later only when hist_entry__inc_addr_samples()
returns -ERANGE in 'perf top', which are really rare occasions, so make
them a bool bitfield.
This will open up space for other members on the first cacheline.
Arnaldo Carvalho de Melo [Sun, 17 Nov 2019 14:38:13 +0000 (11:38 -0300)]
perf map_groups: Auto sort maps by name, if needed
There are still lots of lookups by name, even if just when loading
vmlinux, till that code is studied to figure out if its possible to do
away with those map lookup by names, provide a way to sort it using
libc's qsort/bsearch.
Doing it at the first lookup defers the sorting a bit, and as the code
stands now, is never done for user maps, just for the kernel ones.
# perf probe -x ~/bin/perf 'found=__map_groups__find_by_name:10 name:string'
Added new event:
probe_perf:found (on __map_groups__find_by_name:10 in /home/acme/bin/perf with name:string)
7 if (mg->last_search_by_name && strcmp(mg->last_search_by_name->dso->short_name, name) == 0) {
8 map = mg->last_search_by_name;
9 goto out_unlock;
}
/*
* If we have mg->maps_by_name, then the name isn't in the rbtree,
* as mg->maps_by_name mirrors the rbtree when lookups by name are
* made.
*/
16 map = __map_groups__find_by_name(mg, name);
17 if (map || mg->maps_by_name != NULL)
18 goto out_unlock;
/* Fallback to traversing the rbtree... */
21 maps__for_each_entry(maps, map)
22 if (strcmp(map->dso->short_name, name) == 0) {
23 mg->last_search_by_name = map;
24 goto out_unlock;
}
# perf probe -x ~/bin/perf 'fallback=map_groups__find_by_name:21 name:string'
Added new events:
probe_perf:fallback (on map_groups__find_by_name:21 in /home/acme/bin/perf with name:string)
probe_perf:fallback_1 (on map_groups__find_by_name:21 in /home/acme/bin/perf with name:string)
You can now use it in all perf tools, such as:
perf record -e probe_perf:fallback_1 -aR sleep 1
#
# perf probe -l
probe_perf:fallback (on map_groups__find_by_name:21@util/symbol.c in /home/acme/bin/perf with name_string)
probe_perf:fallback_1 (on map_groups__find_by_name:21@util/symbol.c in /home/acme/bin/perf with name_string)
probe_perf:found (on __map_groups__find_by_name:10@util/symbol.c in /home/acme/bin/perf with name_string)
#
# perf stat -e probe_perf:*
Now run 'perf top' in another term and then, after a while, stop 'perf stat':
Furthermore, if we ask for interval printing, we can see that that is done just
at the start of the workload:
Arnaldo Carvalho de Melo [Thu, 14 Nov 2019 15:28:41 +0000 (12:28 -0300)]
perf machine: No need to check if kernel module maps pre-exist
We'only populating maps for kernel modules either from perf.data file
PERF_RECORD_MMAP records or when parsing /proc/modules, so there is no
need to first look if we already have those module maps in the list,
that would mean the kernel has duplicate entries.
So ditch one use of looking up maps by name.
Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Link: https://lkml.kernel.org/n/tip-gnzjg2hhuz6jnrw91m35059y@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Alexander Shishkin [Fri, 15 Nov 2019 16:08:18 +0000 (18:08 +0200)]
perf/core: Fix the mlock accounting, again
Commit:
5e6c3c7b1ec2 ("perf/aux: Fix tracking of auxiliary trace buffer allocation")
tried to guess the correct combination of arithmetic operations that would
undo the AUX buffer's mlock accounting, and failed, leaking the bottom part
when an allocation needs to be charged partially to both user->locked_vm
and mm->pinned_vm, eventually leaving the user with no locked bonus:
$ perf record -e intel_pt//u -m1,128 uname
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.061 MB perf.data ]
$ perf record -e intel_pt//u -m1,128 uname
Permission error mapping pages.
Consider increasing /proc/sys/kernel/perf_event_mlock_kb,
or try again with a smaller value of -m/--mmap_pages.
(current value: 1,128)
Fix this by subtracting both locked and pinned counts when AUX buffer is
unmapped.
Reported-by: Thomas Richter <tmricht@linux.ibm.com> Tested-by: Thomas Richter <tmricht@linux.ibm.com> Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Ingo Molnar <mingo@kernel.org>
Arnaldo Carvalho de Melo [Thu, 14 Nov 2019 15:15:34 +0000 (12:15 -0300)]
perf record: No need to process the synthesized MMAP events twice
At the end of a 'perf record' session, by default, we'll process all
samples and populate the threads, maps, etc so as to find out which of
the DSOs got samples, to reduce the size of the build-id table we'll
add to the perf.data headers.
But we don't need to process the PERF_RECORD_MMAP events synthesized
for the kernel modules, as we have those already via
perf_session__create_kernel_maps(), so add mmap/mmap2 handlers that
first look at event->header.misc to see if the event is for a user map,
bailing out if not.
Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Link: https://lkml.kernel.org/n/tip-mofoxvcx2dryppcw3o689jdd@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Arnaldo Carvalho de Melo [Thu, 14 Nov 2019 13:46:45 +0000 (10:46 -0300)]
perf map: No need to adjust the long name of modules
At some point in the past we needed to make sure we would get the long
name of modules and not just what we get from /proc/modules, but that
need, as described in the cset that introduced the adjustment function:
Fixes: c03d5184f0e9 ("perf machine: Adjust dso->long_name for offline module")
Without using the buildid-cache:
# lsmod | grep trusted
# insmod trusted.ko
# lsmod | grep trusted
trusted 24576 0
# strace -e open,openat perf probe -m ./trusted.ko key_seal |& grep trusted
openat(AT_FDCWD, "/sys/module/trusted/notes/.note.gnu.build-id", O_RDONLY) = 4
openat(AT_FDCWD, "/sys/module/trusted/notes/.note.gnu.build-id", O_RDONLY) = 7
openat(AT_FDCWD, "/root/trusted.ko", O_RDONLY) = 3
openat(AT_FDCWD, "/root/.debug/root/trusted.ko/dd3d355d567394d540f527e093e0f64b95879584/probes", O_RDWR|O_CREAT, 0644) = 3
openat(AT_FDCWD, "/usr/lib/debug/root/trusted.ko.debug", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/debug/root/trusted.ko", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/root/.debug/trusted.ko", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/root/trusted.ko", O_RDONLY) = 3
openat(AT_FDCWD, "trusted.ko.debug", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, ".debug/trusted.ko.debug", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "trusted.ko.debug", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/root/trusted.ko", O_RDONLY) = 3
openat(AT_FDCWD, "/root/trusted.ko", O_RDONLY) = 3
openat(AT_FDCWD, "/root/trusted.ko", O_RDONLY) = 4
openat(AT_FDCWD, "/root/trusted.ko", O_RDONLY) = 3
probe:key_seal (on key_seal in trusted)
# perf probe -l
probe:key_seal (on key_seal in trusted)
#
No attempt at opening '[trusted]'.
Now using the build-id cache:
# rmmod trusted
# perf buildid-cache --add ./trusted.ko
# insmod trusted.ko
# strace -e open,openat perf probe -m ./trusted.ko key_seal |& grep trusted
openat(AT_FDCWD, "/sys/module/trusted/notes/.note.gnu.build-id", O_RDONLY) = 4
openat(AT_FDCWD, "/sys/module/trusted/notes/.note.gnu.build-id", O_RDONLY) = 7
openat(AT_FDCWD, "/root/trusted.ko", O_RDONLY) = 3
openat(AT_FDCWD, "/root/.debug/root/trusted.ko/dd3d355d567394d540f527e093e0f64b95879584/probes", O_RDWR|O_CREAT, 0644) = 3
openat(AT_FDCWD, "/usr/lib/debug/root/trusted.ko.debug", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/debug/root/trusted.ko", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/root/.debug/trusted.ko", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/root/trusted.ko", O_RDONLY) = 3
openat(AT_FDCWD, "trusted.ko.debug", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, ".debug/trusted.ko.debug", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "trusted.ko.debug", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/root/trusted.ko", O_RDONLY) = 3
openat(AT_FDCWD, "/root/trusted.ko", O_RDONLY) = 3
openat(AT_FDCWD, "/root/trusted.ko", O_RDONLY) = 4
openat(AT_FDCWD, "/root/trusted.ko", O_RDONLY) = 3
#
Again, no attempt at reading '[trusted]'.
Finally, adding a probe to that function and then using:
This was the only path I could find using the perf tools that reach at this
function, then as of november/2019, if we put a probe in the line where the
actuall setting of the dso->long_name is done:
To further test this I used kvm.ko as the offline module, i.e. removed
if from the buildid-cache by nuking it completely (rm -rf ~/.debug) and
moved it from the normal kernel distro path, removed the modules, stoped
the kvm guest, and then installed it manually, etc.
# rmmod kvm-intel
# rmmod kvm
# lsmod | grep kvm
# modprobe kvm-intel
modprobe: ERROR: ctx=0x55d3b1722260 path=/lib/modules/5.3.8-200.fc30.x86_64/kernel/arch/x86/kvm/kvm.ko.xz error=No such file or directory
modprobe: ERROR: ctx=0x55d3b1722260 path=/lib/modules/5.3.8-200.fc30.x86_64/kernel/arch/x86/kvm/kvm.ko.xz error=No such file or directory
modprobe: ERROR: could not insert 'kvm_intel': Unknown symbol in module, or unknown parameter (see dmesg)
# insmod ./kvm.ko
# modprobe kvm-intel
modprobe: ERROR: ctx=0x562f34026260 path=/lib/modules/5.3.8-200.fc30.x86_64/kernel/arch/x86/kvm/kvm.ko.xz error=No such file or directory
modprobe: ERROR: ctx=0x562f34026260 path=/lib/modules/5.3.8-200.fc30.x86_64/kernel/arch/x86/kvm/kvm.ko.xz error=No such file or directory
# lsmod | grep kvm
kvm_intel 299008 0
kvm 765952 1 kvm_intel
irqbypass 16384 1 kvm
#
# perf probe -x ~/bin/perf machine__findnew_module_map:12 mname=m.name:string filename=filename:string 'dso_long_name=map->dso->long_name:string' 'dso_name=map->dso->name:string'
# perf probe -l
probe_perf:machine__findnew_module_map (on machine__findnew_module_map:12@util/machine.c in /home/acme/bin/perf with mname filename dso_long_name dso_name)
# perf record
^C[ perf record: Woken up 2 times to write data ]
[ perf record: Captured and wrote 3.416 MB perf.data (33956 samples) ]
# perf trace -e probe_perf:machine*
<SNIP>
6.322 perf/23099 probe_perf:machine__findnew_module_map(__probe_ip: 5492493, mname: "[salsa20_generic]", filename: "/lib/modules/5.3.8-200.fc30.x86_64/kernel/crypto/salsa20_generic.ko.xz", dso_long_name: "/lib/modules/5.3.8-200.fc30.x86_64/kernel/crypto/salsa20_generic.ko.xz", dso_name: "[salsa20_generic]")
6.375 perf/23099 probe_perf:machine__findnew_module_map(__probe_ip: 5492493, mname: "[kvm]", filename: "[kvm]", dso_long_name: "[kvm]", dso_name: "[kvm]")
<SNIP>
The filename doesn't come with the path, no point in trying to set the dso->long_name.
[root@quaco ~]# strace -e open,openat perf probe -m ./kvm.ko kvm_apic_local_deliver |& egrep 'open.*kvm'
openat(AT_FDCWD, "/sys/module/kvm_intel/notes/.note.gnu.build-id", O_RDONLY) = 4
openat(AT_FDCWD, "/sys/module/kvm/notes/.note.gnu.build-id", O_RDONLY) = 4
openat(AT_FDCWD, "/lib/modules/5.3.8-200.fc30.x86_64/kernel/arch/x86/kvm", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 7
openat(AT_FDCWD, "/sys/module/kvm_intel/notes/.note.gnu.build-id", O_RDONLY) = 8
openat(AT_FDCWD, "/root/kvm.ko", O_RDONLY) = 3
openat(AT_FDCWD, "/root/.debug/root/kvm.ko/5955f426cb93f03f30f3e876814be2db80ab0b55/probes", O_RDWR|O_CREAT, 0644) = 3
openat(AT_FDCWD, "/usr/lib/debug/root/kvm.ko.debug", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/debug/root/kvm.ko", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/root/.debug/kvm.ko", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/root/kvm.ko", O_RDONLY) = 3
openat(AT_FDCWD, "kvm.ko.debug", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, ".debug/kvm.ko.debug", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "kvm.ko.debug", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/root/kvm.ko", O_RDONLY) = 3
openat(AT_FDCWD, "/root/kvm.ko", O_RDONLY) = 3
openat(AT_FDCWD, "/root/kvm.ko", O_RDONLY) = 4
openat(AT_FDCWD, "/root/kvm.ko", O_RDONLY) = 3
[root@quaco ~]#
Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Wang Nan <wangnan0@huawei.com> Link: https://lkml.kernel.org/n/tip-jlfew3lyb24d58egrp0o72o2@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Now add a probe to the place where we reuse the last search:
# perf probe -x ~/bin/perf map_groups__find_by_name:8
Added new event:
probe_perf:map_groups__find_by_name (on map_groups__find_by_name:8 in /home/acme/bin/perf)
You can now use it in all perf tools, such as:
perf record -e probe_perf:map_groups__find_by_name -aR sleep 1
#
Now lets do a system wide 'perf stat' counting those events:
# perf stat -e probe_perf:*
Leave it running and lets do a 'perf top', then, after a while, stop the
'perf stat':
# perf stat -e probe_perf:*
^C
Performance counter stats for 'system wide':
Arnaldo Carvalho de Melo [Wed, 13 Nov 2019 19:16:25 +0000 (16:16 -0300)]
perf maps: Do not use an rbtree to sort by map name
This is only used for the kernel maps, shave 24 bytes out 'struct map'
and just traverse the existing per ip rbtree to look for maps by name,
use a front end cache to reuse the last search if its the same name.
After this 'struct map' is down to just two cachelines:
Linus Torvalds [Sun, 17 Nov 2019 19:27:44 +0000 (11:27 -0800)]
Merge tag 'iommu-fixes-v5.4-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull iommu fixes from Joerg Roedel:
- Fix for Intel IOMMU to correct invalidation commands when in SVA
mode.
- Update MAINTAINERS entry for Intel IOMMU
* tag 'iommu-fixes-v5.4-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
iommu/vt-d: Fix QI_DEV_IOTLB_PFSID and QI_DEV_EIOTLB_PFSID macros
MAINTAINERS: Update for INTEL IOMMU (VT-d) entry
Linus Torvalds [Sun, 17 Nov 2019 16:15:41 +0000 (08:15 -0800)]
Merge branch 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:
"An I2C core fix to prevent a use-after-free in a rare error path,
and an I2C ACPI addition to work around broken HW/firmware related
to touchscreens"
* 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: core: fix use after free in of_i2c_notify
i2c: acpi: Force bus speed to 400KHz if a Silead touchscreen is present
Linus Torvalds [Sun, 17 Nov 2019 02:14:32 +0000 (18:14 -0800)]
Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto fix from Herbert Xu:
"This reverts a number of changes to the khwrng thread which feeds the
kernel random number pool from hwrng drivers. They were trying to fix
issues with suspend-and-resume but ended up causing regressions"
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
Revert "hwrng: core - Freeze khwrng thread during suspend"
Herbert Xu [Sun, 17 Nov 2019 00:48:17 +0000 (08:48 +0800)]
Revert "hwrng: core - Freeze khwrng thread during suspend"
This reverts commit 03a3bb7ae631 ("hwrng: core - Freeze khwrng
thread during suspend"), ff296293b353 ("random: Support freezable
kthreads in add_hwgenerator_randomness()") and 59b569480dc8 ("random:
Use wait_event_freezable() in add_hwgenerator_randomness()").
These patches introduced regressions and we need more time to
get them ready for mainline.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
1) Fix memory leak in xfrm_state code, from Steffen Klassert.
2) Fix races between devlink reload operations and device
setup/cleanup, from Jiri Pirko.
3) Null deref in NFC code, from Stephan Gerhold.
4) Refcount fixes in SMC, from Ursula Braun.
5) Memory leak in slcan open error paths, from Jouni Hogander.
6) Fix ETS bandwidth validation in hns3, from Yonglong Liu.
7) Info leak on short USB request answers in ax88172a driver, from
Oliver Neukum.
8) Release mem region properly in ep93xx_eth, from Chuhong Yuan.
9) PTP config timestamp flags validation, from Richard Cochran.
10) Dangling pointers after SKB data realloc in seg6, from Andrea Mayer.
11) Missing free_netdev() in gemini driver, from Chuhong Yuan.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (56 commits)
ipmr: Fix skb headroom in ipmr_get_route().
net: hns3: cleanup of stray struct hns3_link_mode_mapping
net/smc: fix fastopen for non-blocking connect()
rds: ib: update WR sizes when bringing up connection
net: gemini: add missed free_netdev
net: dsa: tag_8021q: Fix dsa_8021q_restore_pvid for an absent pvid
seg6: fix skb transport_header after decap_and_validate()
seg6: fix srh pointer in get_srh()
net: stmmac: Use the correct style for SPDX License Identifier
octeontx2-af: Use the correct style for SPDX License Identifier
ptp: Extend the test program to check the external time stamp flags.
mlx5: Reject requests to enable time stamping on both edges.
igb: Reject requests that fail to enable time stamping on both edges.
dp83640: Reject requests to enable time stamping on both edges.
mv88e6xxx: Reject requests to enable time stamping on both edges.
ptp: Introduce strict checking of external time stamp options.
renesas: reject unsupported external timestamp flags
mlx5: reject unsupported external timestamp flags
igb: reject unsupported external timestamp flags
dp83640: reject unsupported external timestamp flags
...
Guillaume Nault [Fri, 15 Nov 2019 17:29:52 +0000 (18:29 +0100)]
ipmr: Fix skb headroom in ipmr_get_route().
In route.c, inet_rtm_getroute_build_skb() creates an skb with no
headroom. This skb is then used by inet_rtm_getroute() which may pass
it to rt_fill_info() and, from there, to ipmr_get_route(). The later
might try to reuse this skb by cloning it and prepending an IPv4
header. But since the original skb has no headroom, skb_push() triggers
skb_under_panic():
Actually the original skb used to have enough headroom, but the
reserve_skb() call was lost with the introduction of
inet_rtm_getroute_build_skb() by commit 404eb77ea766 ("ipv4: support
sport, dport and ip_proto in RTM_GETROUTE").
We could reserve some headroom again in inet_rtm_getroute_build_skb(),
but this function shouldn't be responsible for handling the special
case of ipmr_get_route(). Let's handle that directly in
ipmr_get_route() by calling skb_realloc_headroom() instead of
skb_clone().
Fixes: 404eb77ea766 ("ipv4: support sport, dport and ip_proto in RTM_GETROUTE") Signed-off-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Ursula Braun [Fri, 15 Nov 2019 11:39:30 +0000 (12:39 +0100)]
net/smc: fix fastopen for non-blocking connect()
FASTOPEN does not work with SMC-sockets. Since SMC allows fallback to
TCP native during connection start, the FASTOPEN setsockopts trigger
this fallback, if the SMC-socket is still in state SMC_INIT.
But if a FASTOPEN setsockopt is called after a non-blocking connect(),
this is broken, and fallback does not make sense.
This change complements
commit cd2063604ea6 ("net/smc: avoid fallback in case of non-blocking connect")
and fixes the syzbot reported problem "WARNING in smc_unhash_sk".
Reported-by: syzbot+8488cc4cf1c9e09b8b86@syzkaller.appspotmail.com Fixes: e1bbdd570474 ("net/smc: reduce sock_put() for fallback sockets") Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Dag Moxnes [Fri, 15 Nov 2019 08:56:01 +0000 (09:56 +0100)]
rds: ib: update WR sizes when bringing up connection
Currently WR sizes are updated from rds_ib_sysctl_max_send_wr and
rds_ib_sysctl_max_recv_wr when a connection is shut down. As a result,
a connection being down while rds_ib_sysctl_max_send_wr or
rds_ib_sysctl_max_recv_wr are updated, will not update the sizes when
it comes back up.
Move resizing of WRs to rds_ib_setup_qp so that connections will be setup
with the most current WR sizes.
Signed-off-by: Dag Moxnes <dag.moxnes@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Chuhong Yuan [Fri, 15 Nov 2019 06:24:54 +0000 (14:24 +0800)]
net: gemini: add missed free_netdev
This driver forgets to free allocated netdev in remove like
what is done in probe failure.
Add the free to fix it.
Signed-off-by: Chuhong Yuan <hslester96@gmail.com> Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Sat, 16 Nov 2019 16:08:25 +0000 (18:08 +0200)]
net: dsa: tag_8021q: Fix dsa_8021q_restore_pvid for an absent pvid
This sequence of operations:
ip link set dev br0 type bridge vlan_filtering 1
bridge vlan del dev swp2 vid 1
ip link set dev br0 type bridge vlan_filtering 1
ip link set dev br0 type bridge vlan_filtering 0
apparently fails with the message:
[ 31.305716] sja1105 spi0.1: Reset switch and programmed static config. Reason: VLAN filtering
[ 31.322161] sja1105 spi0.1: Couldn't determine PVID attributes (pvid 0)
[ 31.328939] sja1105 spi0.1: Failed to setup VLAN tagging for port 1: -2
[ 31.335599] ------------[ cut here ]------------
[ 31.340215] WARNING: CPU: 1 PID: 194 at net/switchdev/switchdev.c:157 switchdev_port_attr_set_now+0x9c/0xa4
[ 31.349981] br0: Commit of attribute (id=6) failed.
[ 31.354890] Modules linked in:
[ 31.357942] CPU: 1 PID: 194 Comm: ip Not tainted 5.4.0-rc6-01792-gf4f632e07665-dirty #2062
[ 31.366167] Hardware name: Freescale LS1021A
[ 31.370437] [<c03144dc>] (unwind_backtrace) from [<c030e184>] (show_stack+0x10/0x14)
[ 31.378153] [<c030e184>] (show_stack) from [<c11d1c1c>] (dump_stack+0xe0/0x10c)
[ 31.385437] [<c11d1c1c>] (dump_stack) from [<c034c730>] (__warn+0xf4/0x10c)
[ 31.392373] [<c034c730>] (__warn) from [<c034c7bc>] (warn_slowpath_fmt+0x74/0xb8)
[ 31.399827] [<c034c7bc>] (warn_slowpath_fmt) from [<c11ca204>] (switchdev_port_attr_set_now+0x9c/0xa4)
[ 31.409097] [<c11ca204>] (switchdev_port_attr_set_now) from [<c117036c>] (__br_vlan_filter_toggle+0x6c/0x118)
[ 31.418971] [<c117036c>] (__br_vlan_filter_toggle) from [<c115d010>] (br_changelink+0xf8/0x518)
[ 31.427637] [<c115d010>] (br_changelink) from [<c0f8e9ec>] (__rtnl_newlink+0x3f4/0x76c)
[ 31.435613] [<c0f8e9ec>] (__rtnl_newlink) from [<c0f8eda8>] (rtnl_newlink+0x44/0x60)
[ 31.443329] [<c0f8eda8>] (rtnl_newlink) from [<c0f89f20>] (rtnetlink_rcv_msg+0x2cc/0x51c)
[ 31.451477] [<c0f89f20>] (rtnetlink_rcv_msg) from [<c1008df8>] (netlink_rcv_skb+0xb8/0x110)
[ 31.459796] [<c1008df8>] (netlink_rcv_skb) from [<c1008648>] (netlink_unicast+0x17c/0x1f8)
[ 31.468026] [<c1008648>] (netlink_unicast) from [<c1008980>] (netlink_sendmsg+0x2bc/0x3b4)
[ 31.476261] [<c1008980>] (netlink_sendmsg) from [<c0f43858>] (___sys_sendmsg+0x230/0x250)
[ 31.484408] [<c0f43858>] (___sys_sendmsg) from [<c0f44c84>] (__sys_sendmsg+0x50/0x8c)
[ 31.492209] [<c0f44c84>] (__sys_sendmsg) from [<c0301000>] (ret_fast_syscall+0x0/0x28)
[ 31.500090] Exception stack(0xedf47fa8 to 0xedf47ff0)
[ 31.505122] 7fa0: 00000002b6f2e06000000003beabd6a40000000000000000
[ 31.513265] 7fc0: 00000002b6f2e0605d6e321300000128000000000000000100000006000619c4
[ 31.521405] 7fe0: 00086078beabd6580005edbcb6e7ce68
Since VID 0 is an invalid pvid from the bridge's point of view, let's
add this check in dsa_8021q_restore_pvid to avoid restoring a pvid that
doesn't really exist.
Fixes: 5f33183b7fdf ("net: dsa: tag_8021q: Restore bridge VLANs when enabling vlan_filtering") Signed-off-by: Vladimir Oltean <olteanv@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
Andrea Mayer [Sat, 16 Nov 2019 15:05:53 +0000 (16:05 +0100)]
seg6: fix skb transport_header after decap_and_validate()
in the receive path (more precisely in ip6_rcv_core()) the
skb->transport_header is set to skb->network_header + sizeof(*hdr). As a
consequence, after routing operations, destination input expects to find
skb->transport_header correctly set to the next protocol (or extension
header) that follows the network protocol. However, decap behaviors (DX*,
DT*) remove the outer IPv6 and SRH extension and do not set again the
skb->transport_header pointer correctly. For this reason, the patch sets
the skb->transport_header to the skb->network_header + sizeof(hdr) in each
DX* and DT* behavior.
Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it> Signed-off-by: David S. Miller <davem@davemloft.net>
Nishad Kamdar [Sat, 16 Nov 2019 09:40:59 +0000 (15:10 +0530)]
net: stmmac: Use the correct style for SPDX License Identifier
This patch corrects the SPDX License Identifier style in
header files related to STMicroelectronics based Multi-Gigabit
Ethernet driver. For C header files Documentation/process/license-rules.rst
mandates C-like comments (opposed to C source files where
C++ style should be used).
Changes made by using a script provided by Joe Perches here:
https://lkml.org/lkml/2019/2/7/46.
Suggested-by: Joe Perches <joe@perches.com> Signed-off-by: Nishad Kamdar <nishadkamdar@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Nishad Kamdar [Sat, 16 Nov 2019 09:20:45 +0000 (14:50 +0530)]
octeontx2-af: Use the correct style for SPDX License Identifier
This patch corrects the SPDX License Identifier style in
header files related to Marvell OcteonTX2 network devices.
It uses an expilict block comment for the SPDX License
Identifier.
Changes made by using a script provided by Joe Perches here:
https://lkml.org/lkml/2019/2/7/46.
Suggested-by: Joe Perches <joe@perches.com> Signed-off-by: Nishad Kamdar <nishadkamdar@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Sat, 16 Nov 2019 16:20:43 +0000 (08:20 -0800)]
Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
"11 fixes"
MM fixes and one xz decompressor fix.
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mm/debug.c: PageAnon() is true for PageKsm() pages
mm/debug.c: __dump_page() prints an extra line
mm/page_io.c: do not free shared swap slots
mm/memory_hotplug: fix try_offline_node()
mm,thp: recheck each page before collapsing file THP
mm: slub: really fix slab walking for init_on_free
mm: hugetlb: switch to css_tryget() in hugetlb_cgroup_charge_cgroup()
mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm()
lib/xz: fix XZ_DYNALLOC to avoid useless memory reallocations
mm: fix trying to reclaim unevictable lru page when calling madvise_pageout
mm: mempolicy: fix the wrong return value and potential pages leak of mbind
Ralph Campbell [Sat, 16 Nov 2019 01:35:07 +0000 (17:35 -0800)]
mm/debug.c: PageAnon() is true for PageKsm() pages
PageAnon() and PageKsm() use the low two bits of the page->mapping
pointer to indicate the page type. PageAnon() only checks the LSB while
PageKsm() checks the least significant 2 bits are equal to 3.
Therefore, PageAnon() is true for KSM pages. __dump_page() incorrectly
will never print "ksm" because it checks PageAnon() first. Fix this by
checking PageKsm() first.
Link: http://lkml.kernel.org/r/20191113000651.20677-1-rcampbell@nvidia.com Fixes: 1c6fb1d89e73 ("mm: print more information about mapping in __dump_page") Signed-off-by: Ralph Campbell <rcampbell@nvidia.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Jerome Glisse <jglisse@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It looks like the intent was to use pr_cont() for printing "flags:" but
pr_cont() usage is discouraged so fix this by extending the format to
include the flags into a single line:
Vinayak Menon [Sat, 16 Nov 2019 01:35:00 +0000 (17:35 -0800)]
mm/page_io.c: do not free shared swap slots
The following race is observed due to which a processes faulting on a
swap entry, finds the page neither in swapcache nor swap. This causes
zram to give a zero filled page that gets mapped to the process,
resulting in a user space crash later.
Consider parent and child processes Pa and Pb sharing the same swap slot
with swap_count 2. Swap is on zram with SWP_SYNCHRONOUS_IO set.
Virtual address 'VA' of Pa and Pb points to the shared swap entry.
Pa Pb
fault on VA fault on VA
do_swap_page do_swap_page
lookup_swap_cache fails lookup_swap_cache fails
Pb scheduled out
swapin_readahead (deletes zram entry)
swap_free (makes swap_count 1)
Pb scheduled in
swap_readpage (swap_count == 1)
Takes SWP_SYNCHRONOUS_IO path
zram enrty absent
zram gives a zero filled page
Fix this by making sure that swap slot is freed only when swap count
drops down to one.
Link: http://lkml.kernel.org/r/1571743294-14285-1-git-send-email-vinmenon@codeaurora.org Fixes: aa8d22a11da9 ("mm: swap: SWP_SYNCHRONOUS_IO: skip swapcache only if swapped page has no other reference") Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org> Suggested-by: Minchan Kim <minchan@google.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Hildenbrand [Sat, 16 Nov 2019 01:34:57 +0000 (17:34 -0800)]
mm/memory_hotplug: fix try_offline_node()
try_offline_node() is pretty much broken right now:
- The node span is updated when onlining memory, not when adding it. We
ignore memory that was mever onlined. Bad.
- We touch possible garbage memmaps. The pfn_to_nid(pfn) can easily
trigger a kernel panic. Bad for memory that is offline but also bad
for subsection hotadd with ZONE_DEVICE, whereby the memmap of the
first PFN of a section might contain garbage.
- Sections belonging to mixed nodes are not properly considered.
As memory blocks might belong to multiple nodes, we would have to walk
all pageblocks (or at least subsections) within present sections.
However, we don't have a way to identify whether a memmap that is not
online was initialized (relevant for ZONE_DEVICE). This makes things
more complicated.
Luckily, we can piggy pack on the node span and the nid stored in memory
blocks. Currently, the node span is grown when calling
move_pfn_range_to_zone() - e.g., when onlining memory, and shrunk when
removing memory, before calling try_offline_node(). Sysfs links are
created via link_mem_sections(), e.g., during boot or when adding
memory.
If the node still spans memory or if any memory block belongs to the
nid, we don't set the node offline. As memory blocks that span multiple
nodes cannot get offlined, the nid stored in memory blocks is reliable
enough (for such online memory blocks, the node still spans the memory).
Introduce for_each_memory_block() to efficiently walk all memory blocks.
Note: We will soon stop shrinking the ZONE_DEVICE zone and the node span
when removing ZONE_DEVICE memory to fix similar issues (access of
garbage memmaps) - until we have a reliable way to identify whether
these memmaps were properly initialized. This implies later, that once
a node had ZONE_DEVICE memory, we won't be able to set a node offline -
which should be acceptable.
Since commit f1dd2cd13c4b ("mm, memory_hotplug: do not associate
hotadded memory to zones until online") memory that is added is not
assoziated with a zone/node (memmap not initialized). The introducing
commit 60a5a19e7419 ("memory-hotplug: remove sysfs file of node")
already missed that we could have multiple nodes for a section and that
the zone/node span is updated when onlining pages, not when adding them.
I tested this by hotplugging two DIMMs to a memory-less and cpu-less
NUMA node. The node is properly onlined when adding the DIMMs. When
removing the DIMMs, the node is properly offlined.
Song Liu [Sat, 16 Nov 2019 01:34:53 +0000 (17:34 -0800)]
mm,thp: recheck each page before collapsing file THP
In collapse_file(), for !is_shmem case, current check cannot guarantee
the locked page is up-to-date. Specifically, xas_unlock_irq() should
not be called before lock_page() and get_page(); and it is necessary to
recheck PageUptodate() after locking the page.
With this bug and CONFIG_READ_ONLY_THP_FOR_FS=y, madvise(HUGE)'ed .text
may contain corrupted data. This is because khugepaged mistakenly
collapses some not up-to-date sub pages into a huge page, and assumes
the huge page is up-to-date. This will NOT corrupt data in the disk,
because the page is read-only and never written back. Fix this by
properly checking PageUptodate() after locking the page. This check
replaces "VM_BUG_ON_PAGE(!PageUptodate(page), page);".
Also, move PageDirty() check after locking the page. Current khugepaged
should not try to collapse dirty file THP, because it is limited to
read-only .text. The only case we hit a dirty page here is when the
page hasn't been written since write. Bail out and retry when this
happens.
syzbot reported bug on previous version of this patch.
Link: http://lkml.kernel.org/r/20191106060930.2571389-2-songliubraving@fb.com Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS") Signed-off-by: Song Liu <songliubraving@fb.com> Reported-by: syzbot+efb9e48b9fbdc49bb34a@syzkaller.appspotmail.com Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Laura Abbott [Sat, 16 Nov 2019 01:34:50 +0000 (17:34 -0800)]
mm: slub: really fix slab walking for init_on_free
Commit 1b7e816fc80e ("mm: slub: Fix slab walking for init_on_free")
fixed one problem with the slab walking but missed a key detail: When
walking the list, the head and tail pointers need to be updated since we
end up reversing the list as a result. Without doing this, bulk free is
broken.
One way this is exposed is a NULL pointer with slub_debug=F:
=============================================================================
BUG skbuff_head_cache (Tainted: G T): Object already free
-----------------------------------------------------------------------------
Roman Gushchin [Sat, 16 Nov 2019 01:34:46 +0000 (17:34 -0800)]
mm: hugetlb: switch to css_tryget() in hugetlb_cgroup_charge_cgroup()
An exiting task might belong to an offline cgroup. In this case an
attempt to grab a cgroup reference from the task can end up with an
infinite loop in hugetlb_cgroup_charge_cgroup(), because neither the
cgroup will become online, neither the task will be migrated to a live
cgroup.
Fix this by switching over to css_tryget(). As css_tryget_online()
can't guarantee that the cgroup won't go offline, in most cases the
check doesn't make sense. In this particular case users of
hugetlb_cgroup_charge_cgroup() are not affected by this change.
A similar problem is described by commit 18fa84a2db0e ("cgroup: Use
css_tryget() instead of css_tryget_online() in task_get_css()").
Link: http://lkml.kernel.org/r/20191106225131.3543616-2-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Roman Gushchin [Sat, 16 Nov 2019 01:34:43 +0000 (17:34 -0800)]
mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm()
We've encountered a rcu stall in get_mem_cgroup_from_mm():
rcu: INFO: rcu_sched self-detected stall on CPU
rcu: 33-....: (21000 ticks this GP) idle=6c6/1/0x4000000000000002 softirq=35441/35441 fqs=5017
(t=21031 jiffies g=324821 q=95837) NMI backtrace for cpu 33
<...>
RIP: 0010:get_mem_cgroup_from_mm+0x2f/0x90
<...>
__memcg_kmem_charge+0x55/0x140
__alloc_pages_nodemask+0x267/0x320
pipe_write+0x1ad/0x400
new_sync_write+0x127/0x1c0
__kernel_write+0x4f/0xf0
dump_emit+0x91/0xc0
writenote+0xa0/0xc0
elf_core_dump+0x11af/0x1430
do_coredump+0xc65/0xee0
get_signal+0x132/0x7c0
do_signal+0x36/0x640
exit_to_usermode_loop+0x61/0xd0
do_syscall_64+0xd4/0x100
entry_SYSCALL_64_after_hwframe+0x44/0xa9
The problem is caused by an exiting task which is associated with an
offline memcg. We're iterating over and over in the do {} while
(!css_tryget_online()) loop, but obviously the memcg won't become online
and the exiting task won't be migrated to a live memcg.
Let's fix it by switching from css_tryget_online() to css_tryget().
As css_tryget_online() cannot guarantee that the memcg won't go offline,
the check is usually useless, except some rare cases when for example it
determines if something should be presented to a user.
A similar problem is described by commit 18fa84a2db0e ("cgroup: Use
css_tryget() instead of css_tryget_online() in task_get_css()").
Johannes:
: The bug aside, it doesn't matter whether the cgroup is online for the
: callers. It used to matter when offlining needed to evacuate all charges
: from the memcg, and so needed to prevent new ones from showing up, but we
: don't care now.
Link: http://lkml.kernel.org/r/20191106225131.3543616-1-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Shakeel Butt <shakeeb@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutn <mkoutny@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lasse Collin [Sat, 16 Nov 2019 01:34:39 +0000 (17:34 -0800)]
lib/xz: fix XZ_DYNALLOC to avoid useless memory reallocations
s->dict.allocated was initialized to 0 but never set after a successful
allocation, thus the code always thought that the dictionary buffer has
to be reallocated.
Link: http://lkml.kernel.org/r/20191104185107.3b6330df@tukaani.org Signed-off-by: Lasse Collin <lasse.collin@tukaani.org> Reported-by: Yu Sun <yusun2@cisco.com> Acked-by: Daniel Walker <danielwa@cisco.com> Cc: "Yixia Si (yisi)" <yisi@cisco.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
madvise_pageout() accesses the specified range of the vma and isolates
them, then runs shrink_page_list() to reclaim its memory. But it also
isolates the unevictable pages to reclaim. Hence, we can catch the
cases in shrink_page_list().
The root cause is that we scan the page tables instead of specific LRU
list. and so we need to filter out the unevictable lru pages from our
end.
Link: http://lkml.kernel.org/r/1572616245-18946-1-git-send-email-zhongjiang@huawei.com Fixes: 1a4e58cce84e ("mm: introduce MADV_PAGEOUT") Signed-off-by: zhong jiang <zhongjiang@huawei.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yang Shi [Sat, 16 Nov 2019 01:34:33 +0000 (17:34 -0800)]
mm: mempolicy: fix the wrong return value and potential pages leak of mbind
Commit d883544515aa ("mm: mempolicy: make the behavior consistent when
MPOL_MF_MOVE* and MPOL_MF_STRICT were specified") fixed the return value
of mbind() for a couple of corner cases. But, it altered the errno for
some other cases, for example, mbind() should return -EFAULT when part
or all of the memory range specified by nodemask and maxnode points
outside your accessible address space, or there was an unmapped hole in
the specified memory range specified by addr and len.
Fix this by preserving the errno returned by queue_pages_range(). And,
the pagelist may be not empty even though queue_pages_range() returns
error, put the pages back to LRU since mbind_range() is not called to
really apply the policy so those pages should not be migrated, this is
also the old behavior before the problematic commit.
Link: http://lkml.kernel.org/r/1572454731-3925-1-git-send-email-yang.shi@linux.alibaba.com Fixes: d883544515aa ("mm: mempolicy: make the behavior consistent when MPOL_MF_MOVE* and MPOL_MF_STRICT were specified") Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Reported-by: Li Xinhai <lixinhai.lxh@gmail.com> Reviewed-by: Li Xinhai <lixinhai.lxh@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: <stable@vger.kernel.org> [4.19 and 5.2+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Just got one of these for debugging some unrelated issues, and noticed
that Lenovo seems to have gone back to using RMI4 over smbus with
Synaptics touchpads on some of their new systems, particularly this one.
So, let's enable RMI mode for the X1 Extreme 2nd Generation.
Linus Torvalds [Fri, 15 Nov 2019 21:02:34 +0000 (13:02 -0800)]
Merge tag 'for-linus-20191115' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"A few fixes that should make it into this release. This contains:
- io_uring:
- The timeout command assumes sequence == 0 means that we want
one completion, but this kind of overloading is unfortunate as
it prevents users from doing a pure time based wait. Since
this operation was introduced in this cycle, let's correct it
now, while we can. (me)
- One-liner to fix an issue with dependent links and fixed
buffer reads. The actual IO completed fine, but the link got
severed since we stored the wrong expected value. (me)
- Add TIMEOUT to list of opcodes that don't need a file. (Pavel)
- rsxx missing workqueue destry calls. Old bug. (Chuhong)
- Fix blk-iocost active list check (Jiufei)
- Fix impossible-to-hit overflow merge condition, that still hit some
folks very rarely (Junichi)
- Fix bfq hang issue from 5.3. This didn't get marked for stable, but
will go into stable post this merge (Paolo)"
* tag 'for-linus-20191115' of git://git.kernel.dk/linux-block:
rsxx: add missed destroy_workqueue calls in remove
iocost: check active_list of all the ancestors in iocg_activate()
block, bfq: deschedule empty bfq_queues not referred by any process
io_uring: ensure registered buffer import returns the IO length
io_uring: Fix getting file for timeout
block: check bi_size overflow before merge
io_uring: make timeout sequence == 0 mean no sequence
Wen Yang [Fri, 8 Nov 2019 08:36:48 +0000 (16:36 +0800)]
i2c: core: fix use after free in of_i2c_notify
We can't use "adap->dev" after it has been freed.
Fixes: 5bf4fa7daea6 ("i2c: break out OF support into separate file") Signed-off-by: Wen Yang <wenyang@linux.alibaba.com> Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
Hans de Goede [Wed, 13 Nov 2019 18:29:38 +0000 (19:29 +0100)]
i2c: acpi: Force bus speed to 400KHz if a Silead touchscreen is present
Many cheap devices use Silead touchscreen controllers. Testing has shown
repeatedly that these touchscreen controllers work fine at 400KHz, but for
unknown reasons do not work properly at 100KHz. This has been seen on
both ARM and x86 devices using totally different i2c controllers.
On some devices the ACPI tables list another device at the same I2C-bus
as only being capable of 100KHz, testing has shown that these other
devices work fine at 400KHz (as can be expected of any recent I2C hw).
This commit makes i2c_acpi_find_bus_speed() always return 400KHz if a
Silead touchscreen controller is present, fixing the touchscreen not
working on devices which ACPI tables' wrongly list another device on the
same bus as only being capable of 100KHz.
Specifically this fixes the touchscreen on the Jumper EZpad 6 m4 not
working.
Reported-by: youling 257 <youling257@gmail.com> Tested-by: youling 257 <youling257@gmail.com> Signed-off-by: Hans de Goede <hdegoede@redhat.com> Reviewed-by: Jarkko Nikula <jarkko.nikula@linux.intel.com> Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com>
[wsa: rewording warning a little] Signed-off-by: Wolfram Sang <wsa@the-dreams.de> Cc: stable@kernel.org
====================
ptp: Validate the ancillary ioctl flags more carefully.
The flags passed to the ioctls for periodic output signals and
time stamping of external signals were never checked, and thus formed
a useless ABI inadvertently. More recently, a version 2 of the ioctls
was introduced in order make the flags meaningful. This series
tightens up the checks on the new ioctl flags.
- Patch 1 ensures at least one edge flag is set for the new ioctl.
- Patches 2-7 are Jacob's recent checks, picking up the tags.
- Patch 8 introduces a "strict" flag for passing to the drivers when the
new ioctl is used.
- Patches 9-12 implement the "strict" checking in the drivers.
- Patch 13 extends the test program to exercise combinations of flags.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Richard Cochran [Thu, 14 Nov 2019 18:45:07 +0000 (10:45 -0800)]
ptp: Extend the test program to check the external time stamp flags.
Because each driver and hardware has different capabilities, the test
cannot provide a simple pass/fail result, but it can at least show what
combinations of flags are supported.
Signed-off-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Richard Cochran [Thu, 14 Nov 2019 18:45:06 +0000 (10:45 -0800)]
mlx5: Reject requests to enable time stamping on both edges.
This driver enables rising edge or falling edge, but not both, and so
this patch validates that the request contains only one of the two
edges.
Signed-off-by: Richard Cochran <richardcochran@gmail.com> Reviewed-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Richard Cochran [Thu, 14 Nov 2019 18:45:02 +0000 (10:45 -0800)]
ptp: Introduce strict checking of external time stamp options.
User space may request time stamps on rising edges, falling edges, or
both. However, the particular mode may or may not be supported in the
hardware or in the driver. This patch adds a "strict" flag that tells
drivers to ensure that the requested mode will be honored.
Signed-off-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Fix the renesas PTP support to explicitly reject any future flags that
get added to the external timestamp request ioctl.
In order to maintain currently functioning code, this patch accepts all
three current flags. This is because the PTP_RISING_EDGE and
PTP_FALLING_EDGE flags have unclear semantics and each driver seems to
have interpreted them slightly differently.
Cc: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Richard Cochran <richardcochran@gmail.com> Reviewed-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jacob Keller [Thu, 14 Nov 2019 18:45:00 +0000 (10:45 -0800)]
mlx5: reject unsupported external timestamp flags
Fix the mlx5 core PTP support to explicitly reject any future flags that
get added to the external timestamp request ioctl.
In order to maintain currently functioning code, this patch accepts all
three current flags. This is because the PTP_RISING_EDGE and
PTP_FALLING_EDGE flags have unclear semantics and each driver seems to
have interpreted them slightly differently.
[ RC: I'm not 100% sure what this driver does, but if I'm not wrong it
follows the dp83640:
flags Meaning
---------------------------------------------------- --------------------------
PTP_ENABLE_FEATURE Time stamp rising edge
PTP_ENABLE_FEATURE|PTP_RISING_EDGE Time stamp rising edge
PTP_ENABLE_FEATURE|PTP_FALLING_EDGE Time stamp falling edge
PTP_ENABLE_FEATURE|PTP_RISING_EDGE|PTP_FALLING_EDGE Time stamp falling edge
]
Cc: Feras Daoud <ferasda@mellanox.com> Cc: Eugenia Emantayev <eugenia@mellanox.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Richard Cochran <richardcochran@gmail.com> Reviewed-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jacob Keller [Thu, 14 Nov 2019 18:44:59 +0000 (10:44 -0800)]
igb: reject unsupported external timestamp flags
Fix the igb PTP support to explicitly reject any future flags that
get added to the external timestamp request ioctl.
In order to maintain currently functioning code, this patch accepts all
three current flags. This is because the PTP_RISING_EDGE and
PTP_FALLING_EDGE flags have unclear semantics and each driver seems to
have interpreted them slightly differently.
This HW always time stamps both edges:
flags Meaning
---------------------------------------------------- --------------------------
PTP_ENABLE_FEATURE Time stamp both edges
PTP_ENABLE_FEATURE|PTP_RISING_EDGE Time stamp both edges
PTP_ENABLE_FEATURE|PTP_FALLING_EDGE Time stamp both edges
PTP_ENABLE_FEATURE|PTP_RISING_EDGE|PTP_FALLING_EDGE Time stamp both edges
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Fix the dp83640 PTP support to explicitly reject any future flags that
get added to the external timestamp request ioctl.
In order to maintain currently functioning code, this patch accepts all
three current flags. This is because the PTP_RISING_EDGE and
PTP_FALLING_EDGE flags have unclear semantics and each driver seems to
have interpreted them slightly differently.
For the record, the semantics of this driver are:
flags Meaning
---------------------------------------------------- --------------------------
PTP_ENABLE_FEATURE Time stamp rising edge
PTP_ENABLE_FEATURE|PTP_RISING_EDGE Time stamp rising edge
PTP_ENABLE_FEATURE|PTP_FALLING_EDGE Time stamp falling edge
PTP_ENABLE_FEATURE|PTP_RISING_EDGE|PTP_FALLING_EDGE Time stamp falling edge
Cc: Stefan Sørensen <stefan.sorensen@spectralink.com> Cc: Richard Cochran <richardcochran@gmail.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>