Namhyung Kim [Mon, 10 Mar 2025 22:49:22 +0000 (15:49 -0700)]
perf annotate: Pass hist_entry to annotate functions
It's a preparation to support code annotation and data type
annotation at the same time. Data type annotation needs more
information in the hist_entry so it needs to be passed deeper.
Also rename a function with the same name in the builtin-annotate.c
to hist_entry__stdio_annotate since it matches better to the command
line option. And change the condition inside to be simpler.
Namhyung Kim [Mon, 10 Mar 2025 22:49:21 +0000 (15:49 -0700)]
perf annotate: Pass annotation_options to annotation_line__print()
The annotation_line__print() has many arguments. But min_percent,
max_lines and percent_type are from struct annotation_options. So let's
pass a pointer to the option instead of passing them separately to
reduce the number of function arguments.
Actually it has a recursive call if 'queue' is set. Add a new option
instance to pass different values for the case.
Factor out a function to get the name of member field at the given
offset. This will be used in other places.
Also update the output of typeoff sort key a little bit. As we know
that some special types like (stack operation), (stack canary) and
(unknown) won't have fields, skip printing the offset and field.
For example, the following change is expected.
"(stack operation) +0 (no field)" ==> "(stack operation)"
Namhyung Kim [Thu, 27 Feb 2025 19:12:22 +0000 (11:12 -0800)]
perf ftrace: Remove an unnecessary condition check in BPF
The bucket_num is set based on the {max,min}_latency already in
cmd_ftrace(), so no need to check it again in BPF. Also I found
that it didn't pass the max_latency to BPF. :)
Namhyung Kim [Thu, 27 Feb 2025 19:12:21 +0000 (11:12 -0800)]
perf ftrace: Fix latency stats with BPF
When BPF collects the stats for the latency in usec, it first divides
the time by 1000. But that means it would have 0 if the delta is small
and won't update the total time properly.
Let's keep the stats in nsec always and adjust to usec before printing.
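The truncation described above can be illustrated with a small sketch. The struct and function names here are hypothetical and are not taken from the perf or BPF sources; the point is only that accumulating raw nanoseconds and dividing once at report time avoids losing deltas below 1000 ns.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical accumulator, not the actual perf/BPF structure. */
struct lat_stats {
	uint64_t total_ns;	/* always kept in nanoseconds */
	uint64_t count;
};

/* Accumulate the raw nanosecond delta; no unit conversion here. */
static void lat_stats_update(struct lat_stats *st, uint64_t delta_ns)
{
	assert(st != NULL);
	st->total_ns += delta_ns;
	st->count++;
}

/* Convert to usec only when reporting. */
static uint64_t lat_stats_total_usec(const struct lat_stats *st)
{
	return st->total_ns / 1000;
}
```

With ten samples of 300 ns each, dividing every delta first would add 0 ten times, while this version reports a total of 3 usec.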
Ian Rogers [Fri, 7 Mar 2025 02:39:06 +0000 (18:39 -0800)]
perf test stat: Additional topdown grouping tests
Add a loop and helper function to avoid repetition; the loop uses
arrays, so switch the shell to bash. Add additional topdown group tests
where a topdown event needs to be moved beyond others and the slots
event isn't first in the target group. This replicates issues that
occur on hybrid systems where the other events are for the cpu_atom
PMU. Test with both PMU and software events. Place the slots event
later in the event list.
Dapeng Mi [Fri, 7 Mar 2025 02:39:05 +0000 (18:39 -0800)]
perf x86 evlist: Update comments on topdown regrouping
Update to remove comments about groupings not working and with the:
```
perf stat -e "{instructions,slots},{cycles,topdown-retiring}"
```
case that now works.
Ian Rogers [Fri, 7 Mar 2025 02:39:04 +0000 (18:39 -0800)]
perf parse-events: Corrections to topdown sorting
In the case of '{instructions,slots},faults,topdown-retiring' the
first event that must be grouped, slots, is ignored causing the
topdown-retiring event not to be adjacent to the group it needs to be
inserted into. Don't ignore the group members when computing the
force_grouped_index.
Make the force_grouped_index be for the leader of the group it is
within and always use it first rather than a group leader index so
that topdown events may be sorted from one group into another.
As the PMU name comparison applies to moving events in the same group
ensure the name ordering is always respected.
Change the group splitting logic to not group if there are no other
topdown events and to fix cases where the force group leader wasn't
being grouped with the other members of its group.
Reported-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Closes: https://lore.kernel.org/lkml/20250224083306.71813-2-dapeng1.mi@linux.intel.com/ Closes: https://lore.kernel.org/lkml/f7e4f7e8-748c-4ec7-9088-0e844392c11a@linux.intel.com/ Signed-off-by: Ian Rogers <irogers@google.com> Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20250307023906.1135613-3-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Dapeng Mi [Fri, 7 Mar 2025 02:39:03 +0000 (18:39 -0800)]
perf x86/topdown: Fix topdown leader sampling test error on hybrid
When running topdown leader sampling test on Intel hybrid platforms,
such as LNL/ARL, we see the below error.
Topdown leader sampling test
Topdown leader sampling [Failed topdown events not reordered correctly]
It indicates the below command fails.
perf record -o "${perfdata}" -e "{instructions,slots,topdown-retiring}:S" true
The root cause is that the perf tool creates a perf event for each PMU type
if it can.
As for this command, there would be 5 perf events created,
cpu_atom/instructions/,cpu_atom/topdown_retiring/,
cpu_core/slots/,cpu_core/instructions/,cpu_core/topdown-retiring/
For these 5 events, the 2 cpu_atom events are in a group and the other 3
cpu_core events are in another group.
When arch_topdown_sample_read() traverses all these 5 events, events
cpu_atom/instructions/ and cpu_core/slots/ don't have the same group
leader, so it returns false directly. As a result, the cpu_core/slots/
event is used to sample, which is not allowed by the PMU driver.
It's overkill to return false directly if "evsel->core.leader !=
leader->core.leader" since there could be multiple groups in the event
list.
Just "continue" instead of "return false" to fix this issue.
Fixes: 1e53e9d1787b ("perf x86/topdown: Correct leader selection with sample_read enabled") Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Tested-by: Thomas Falcon <thomas.falcon@intel.com> Tested-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250307023906.1135613-2-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
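The effect of "continue" versus "return false" can be sketched with a simplified model. This is not the real arch_topdown_sample_read(); the struct ev type and the group_has_topdown() helper are hypothetical stand-ins that keep only the group-leader comparison.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for an evsel; only what the sketch needs. */
struct ev {
	const struct ev *leader;	/* group leader (itself if it leads) */
	bool is_topdown;
};

/*
 * Return true if any event in the same group as 'leader' is a topdown
 * event. Events that belong to other groups are skipped instead of
 * ending the search.
 */
static bool group_has_topdown(const struct ev *evs, size_t n,
			      const struct ev *leader)
{
	assert(leader != NULL);
	for (size_t i = 0; i < n; i++) {
		if (evs[i].leader != leader->leader)
			continue;	/* an early "return false" would stop here */
		if (evs[i].is_topdown)
			return true;
	}
	return false;
}
```

When the event list starts with a group that has a different leader, an early return would give up before the matching group is ever examined.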
Ian Rogers [Fri, 7 Mar 2025 02:39:02 +0000 (18:39 -0800)]
perf tools: Improve handling of hybrid PMUs in perf_event_attr__fprintf
Support the PMU name from the legacy hardware and hw_cache PMU
extended types. Remove some macros and make variables more intention
revealing, rather than just being called "value".
Signed-off-by: Ian Rogers <irogers@google.com> Reviewed-by: James Clark <james.clark@linaro.org> Tested-by: Thomas Falcon <thomas.falcon@intel.com> Tested-by: James Clark <james.clark@linaro.org> Link: https://lore.kernel.org/r/20250307023906.1135613-1-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Ian Rogers [Fri, 28 Feb 2025 22:23:06 +0000 (14:23 -0800)]
perf python: Add evlist all_cpus accessor
Add a means to get the reference counted all_cpus CPU map from an
evlist in its python form.
Reviewed-by: Howard Chu <howardchu95@gmail.com> Signed-off-by: Ian Rogers <irogers@google.com> Reviewed-by: Arnaldo Carvalho de Melo <acme@redhat.com> Link: https://lore.kernel.org/r/20250228222308.626803-10-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Ian Rogers [Fri, 28 Feb 2025 22:23:05 +0000 (14:23 -0800)]
perf python: Avoid duplicated code in get_tracepoint_field
The code replicates computations done in evsel__tp_format, reuse
evsel__tp_format to simplify the python C code.
Reviewed-by: Howard Chu <howardchu95@gmail.com> Signed-off-by: Ian Rogers <irogers@google.com> Reviewed-by: Arnaldo Carvalho de Melo <acme@redhat.com> Link: https://lore.kernel.org/r/20250228222308.626803-9-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Ian Rogers [Fri, 28 Feb 2025 22:23:04 +0000 (14:23 -0800)]
perf python: Update ungrouped evsel leader in clone
evsels are cloned in the python code as they form part of the Python
object pyrf_evsel. The cloning doesn't update the evsel's leader, do
this for the case of an evsel being ungrouped.
Reviewed-by: Howard Chu <howardchu95@gmail.com> Signed-off-by: Ian Rogers <irogers@google.com> Reviewed-by: Arnaldo Carvalho de Melo <acme@redhat.com> Link: https://lore.kernel.org/r/20250228222308.626803-8-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Ian Rogers [Fri, 28 Feb 2025 22:23:03 +0000 (14:23 -0800)]
perf python: Add optional cpus and threads arguments to parse_events
Used for the evlist initialization.
Reviewed-by: Howard Chu <howardchu95@gmail.com> Signed-off-by: Ian Rogers <irogers@google.com> Reviewed-by: Arnaldo Carvalho de Melo <acme@redhat.com> Link: https://lore.kernel.org/r/20250228222308.626803-7-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Ian Rogers [Fri, 28 Feb 2025 22:23:02 +0000 (14:23 -0800)]
perf python: Add member access to a number of evsel variables
Most variables are part of the perf_event_attr, so that they may be
queried and modified.
Reviewed-by: Howard Chu <howardchu95@gmail.com> Signed-off-by: Ian Rogers <irogers@google.com> Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com> Link: https://lore.kernel.org/r/20250228222308.626803-6-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Ian Rogers [Fri, 28 Feb 2025 22:23:01 +0000 (14:23 -0800)]
perf python: Add evlist enable and disable methods
By default the evsels from parse_events will be disabled. Add access
to the evlist functions so they can be enabled/disabled.
Reviewed-by: Howard Chu <howardchu95@gmail.com> Signed-off-by: Ian Rogers <irogers@google.com> Reviewed-by: Arnaldo Carvalho de Melo <acme@redhat.com> Link: https://lore.kernel.org/r/20250228222308.626803-5-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Ian Rogers [Fri, 28 Feb 2025 22:23:00 +0000 (14:23 -0800)]
perf evsel: tp_format accessing improvements
Ensure evsel__clone copies the tp_sys and tp_name variables.
In evsel__tp_format, if tp_sys isn't set, use the config value to find
the tp_format. This succeeds in python code where pyrf__tracepoint has
already found the format.
Reviewed-by: Howard Chu <howardchu95@gmail.com> Signed-off-by: Ian Rogers <irogers@google.com> Reviewed-by: Arnaldo Carvalho de Melo <acme@redhat.com> Link: https://lore.kernel.org/r/20250228222308.626803-4-irogers@google.com Fixes: 6c8310e8380d472c ("perf evsel: Allow evsel__newtp without libtraceevent") Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Ian Rogers [Fri, 28 Feb 2025 22:22:59 +0000 (14:22 -0800)]
perf evlist: Add success path to evlist__create_syswide_maps
Over various refactorings evlist__create_syswide_maps has been made to
only ever return with -ENOMEM. Fix this so that when
perf_evlist__set_maps is successfully called, 0 is returned.
Reviewed-by: Howard Chu <howardchu95@gmail.com> Signed-off-by: Ian Rogers <irogers@google.com> Reviewed-by: Arnaldo Carvalho de Melo <acme@redhat.com> Link: https://lore.kernel.org/r/20250228222308.626803-3-irogers@google.com Fixes: 8c0498b6891d7ca5 ("perf evlist: Fix create_syswide_maps() not propagating maps") Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Ian Rogers [Fri, 28 Feb 2025 22:22:58 +0000 (14:22 -0800)]
perf debug: Avoid stack overflow in recursive error message
In debug_file, pr_warning_once is called on error. As that function
calls debug_file the function will yield a stack overflow. Switch the
location of the call so the recursion is avoided.
Reviewed-by: Howard Chu <howardchu95@gmail.com> Signed-off-by: Ian Rogers <irogers@google.com> Reviewed-by: Arnaldo Carvalho de Melo <acme@redhat.com> Link: https://lore.kernel.org/r/20250228222308.626803-2-irogers@google.com Fixes: ec49230cf6dda704 ("perf debug: Expose debug file") Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Stephen Brennan [Fri, 7 Mar 2025 23:22:03 +0000 (15:22 -0800)]
perf symbol: Support .gnu_debugdata for symbols
Fedora introduced a "MiniDebuginfo" feature, in which an LZMA-compressed
ELF file is placed inside a section named ".gnu_debugdata". This file
contains nothing but a symbol table, which can be used to supplement the
.dynsym section which only contains required symbols for runtime.
It is supported by GDB for stack traces, but it should be useful for
tracing as well. Implement support for loading symbols from
.gnu_debugdata.
Stephen Brennan [Fri, 7 Mar 2025 23:22:02 +0000 (15:22 -0800)]
perf tools: Add LZMA decompression from FILE
Internally lzma_decompress_to_file() creates a FILE from the filename.
Add an API that takes an existing FILE directly. This allows
decompressing already-open files and even buffers opened by fmemopen().
It is necessary for supporting .gnu_debugdata in the next patch.
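As a rough illustration of why a FILE-based entry point is useful, the sketch below wraps an in-memory buffer with POSIX fmemopen() and passes it to a function that only knows about streams. consume_stream() and consume_buffer() are hypothetical helpers; a byte sum stands in for the decompressor, and no perf or liblzma API is used.

```c
#define _POSIX_C_SOURCE 200809L	/* for fmemopen() */
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical consumer that works on an already-open stream. It sums
 * the bytes it reads, standing in for a decompressor that takes a
 * FILE * rather than a filename.
 */
static unsigned long consume_stream(FILE *fp)
{
	unsigned long sum = 0;
	int c;

	assert(fp != NULL);
	while ((c = fgetc(fp)) != EOF)
		sum += (unsigned char)c;
	return sum;
}

/* Wrap an in-memory buffer in a FILE and reuse the stream-based code. */
static long consume_buffer(const void *buf, size_t len)
{
	/* Mode "r" never writes, so dropping const is safe here. */
	FILE *fp = fmemopen((void *)buf, len, "r");
	unsigned long sum;

	if (!fp)
		return -1;
	sum = consume_stream(fp);
	fclose(fp);
	return (long)sum;
}
```

The same consume_stream() could equally be given a stream from fopen(), which is the flexibility the commit message describes.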
Ian Rogers [Sat, 8 Mar 2025 01:28:53 +0000 (17:28 -0800)]
perf mem: Don't leak mem event names
When preparing the mem events for the argv, copies are intentionally
made. These copies are leaked and cause runs of perf using address
sanitizer to fail. Rather than leak the memory, allocate a chunk of
memory for the mem event names upfront and build the strings in this -
the storage is sized larger than the previous buffer size. The caller
is then responsible for clearing up this memory. As part of this
change, remove the mem_loads_name and mem_stores_name global buffers
then change the perf_pmu__mem_events_name to write to an out argument
buffer.
Eric Lin [Thu, 13 Feb 2025 01:21:40 +0000 (17:21 -0800)]
perf vendor events riscv: Add SiFive P650 events
The SiFive Performance P650 core (including the vector-enabled P670 and
area-optimized P450/P470 variants) updates the P550 microarchitecture.
It brings in the debug, trace, and counter events from newer Bullet
cores, and adds new events for iTLB and dTLB multi-hits.
All other PMU events are unchanged from the P550 core.
Signed-off-by: Eric Lin <eric.lin@sifive.com> Co-developed-by: Samuel Holland <samuel.holland@sifive.com> Signed-off-by: Samuel Holland <samuel.holland@sifive.com> Reviewed-by: Ian Rogers <irogers@google.com> Tested-by: Ian Rogers <irogers@google.com> Tested-by: Atish Patra <atishp@rivosinc.com> Link: https://lore.kernel.org/r/20250213220341.3215660-8-samuel.holland@sifive.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Eric Lin [Thu, 13 Feb 2025 01:21:39 +0000 (17:21 -0800)]
perf vendor events riscv: Add SiFive P550 events
The SiFive Performance P550 core features an out-of-order
microarchitecture which exposes the same PMU events as Bullet,
plus events for UTLB hits and PTE cache misses/hits.
Add support for specifying these events using symbolic names.
Signed-off-by: Eric Lin <eric.lin@sifive.com> Co-developed-by: Samuel Holland <samuel.holland@sifive.com> Signed-off-by: Samuel Holland <samuel.holland@sifive.com> Reviewed-by: Ian Rogers <irogers@google.com> Tested-by: Ian Rogers <irogers@google.com> Tested-by: Atish Patra <atishp@rivosinc.com> Link: https://lore.kernel.org/r/20250213220341.3215660-7-samuel.holland@sifive.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Eric Lin [Thu, 13 Feb 2025 01:21:38 +0000 (17:21 -0800)]
perf vendor events riscv: Add SiFive Bullet version 0x0d events
SiFive Bullet microarchitecture cores with mimpid values starting with
0x0d or greater add new PMU events to count TLB miss stall cycles.
All other PMU events are unchanged from earlier Bullet cores.
Signed-off-by: Eric Lin <eric.lin@sifive.com> Signed-off-by: Samuel Holland <samuel.holland@sifive.com> Reviewed-by: Ian Rogers <irogers@google.com> Tested-by: Ian Rogers <irogers@google.com> Tested-by: Atish Patra <atishp@rivosinc.com> Link: https://lore.kernel.org/r/20250213220341.3215660-6-samuel.holland@sifive.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Eric Lin [Thu, 13 Feb 2025 01:21:37 +0000 (17:21 -0800)]
perf vendor events riscv: Add SiFive Bullet version 0x07 events
SiFive Bullet microarchitecture cores with mimpid values starting with
0x07 or greater add new PMU events to support debug, trace, and counter
sampling and filtering (Sscofpmf).
All other PMU events are unchanged from earlier Bullet cores.
Signed-off-by: Eric Lin <eric.lin@sifive.com> Co-developed-by: Samuel Holland <samuel.holland@sifive.com> Signed-off-by: Samuel Holland <samuel.holland@sifive.com> Reviewed-by: Ian Rogers <irogers@google.com> Tested-by: Ian Rogers <irogers@google.com> Tested-by: Atish Patra <atishp@rivosinc.com> Link: https://lore.kernel.org/r/20250213220341.3215660-5-samuel.holland@sifive.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Regenerate the event lists from the original hardware description. This
makes them consistent with the event lists for newer versions of the
hardware, allowing most files to be reused across hardware versions.
Signed-off-by: Eric Lin <eric.lin@sifive.com> Co-developed-by: Samuel Holland <samuel.holland@sifive.com> Signed-off-by: Samuel Holland <samuel.holland@sifive.com> Reviewed-by: Ian Rogers <irogers@google.com> Tested-by: Ian Rogers <irogers@google.com> Tested-by: Atish Patra <atishp@rivosinc.com> Link: https://lore.kernel.org/r/20250213220341.3215660-4-samuel.holland@sifive.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Samuel Holland [Thu, 13 Feb 2025 01:21:35 +0000 (17:21 -0800)]
perf vendor events riscv: Remove leading zeroes
The EventCode field (as stored in the mhpmeventN CSRs) is actually 56
bits wide, but there is no need to keep leading zeroes in the JSON
files. Remove them to simplify review of the following change, which
regenerates the files in a way that does not include leading zeroes.
This change was performed automatically with `sed -i "s/0x0*/0x/"`.
Signed-off-by: Samuel Holland <samuel.holland@sifive.com> Reviewed-by: Ian Rogers <irogers@google.com> Tested-by: Ian Rogers <irogers@google.com> Tested-by: Atish Patra <atishp@rivosinc.com> Link: https://lore.kernel.org/r/20250213220341.3215660-3-samuel.holland@sifive.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Samuel Holland [Thu, 13 Feb 2025 01:21:34 +0000 (17:21 -0800)]
perf vendor events riscv: Rename U74 to Bullet
This set of PMU event descriptions applies not only to the SiFive U74
core configuration, but also to other SiFive cores that implement the
Bullet microarchitecture (such as U64, P270, and X280). Rename the
directory to be more generic.
Signed-off-by: Samuel Holland <samuel.holland@sifive.com> Reviewed-by: Ian Rogers <irogers@google.com> Tested-by: Ian Rogers <irogers@google.com> Tested-by: Atish Patra <atishp@rivosinc.com> Link: https://lore.kernel.org/r/20250213220341.3215660-2-samuel.holland@sifive.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
perf_color_default_config() was added in 2009 by
commit 8fc0321f1ad0 ("perf_counter tools: Add color terminal output
support")
but has remained unused.
Ian Rogers [Wed, 26 Feb 2025 23:01:09 +0000 (15:01 -0800)]
perf tests: Fix data symbol test with LTO builds
With LTO builds, although regular builds could also see this as
all the code is in one file, the datasym workload can realize the
buf1.reserved data is never accessed. The compiler moves the
variable to bss and only keeps the data1 and data2 parts as
separate variables. This causes the symbol check to fail in the
test. Make the variable volatile to disable the more aggressive
optimization. Rename the variable to make it clear which buf1 in perf is
being referred to.
Before:
$ perf test -vv "data symbol"
126: Test data symbol:
--- start ---
test child forked, pid 299808
perf does not have symbol 'buf1'
perf is missing symbols - skipping test
---- end(-2) ----
126: Test data symbol : Skip
$ nm perf|grep buf1
0000000000a5fa40 b buf1.0
0000000000a5fa48 b buf1.1
After:
$ nm perf|grep buf1
0000000000a53a00 d buf1
$ perf test -vv "data symbol"
126: Test data symbol:
--- start ---
test child forked, pid 302166
a53a00-a53a39 l buf1
perf does have symbol 'buf1'
Recording workload...
Waiting for "perf record has started" message
OK
Cleaning up files...
---- end(0) ----
126: Test data symbol : Ok
I found that hist_entry__delete() failed to release child entries in the
hierarchy tree (hroot_{in,out}). It needs to iterate over the child entries
and call hist_entry__delete() recursively.
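The recursive release can be sketched on a simplified tree. perf keeps the children in rb-trees (hroot_{in,out}); the linked child list, struct node, and the nr_freed counter below are assumptions made purely to keep the example short.

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified hierarchy node; the real entries use rb-trees instead. */
struct node {
	struct node *first_child;
	struct node *next_sibling;
};

static int nr_freed;	/* counts released nodes, for illustration only */

/* Allocate a node and, if a parent is given, link it as a child. */
static struct node *node_new(struct node *parent)
{
	struct node *n = calloc(1, sizeof(*n));

	if (n && parent) {
		n->next_sibling = parent->first_child;
		parent->first_child = n;
	}
	return n;
}

/* Release every child recursively before releasing the node itself. */
static void node_delete(struct node *n)
{
	struct node *child, *next;

	if (!n)
		return;
	for (child = n->first_child; child; child = next) {
		next = child->next_sibling;	/* save before child is freed */
		node_delete(child);
	}
	free(n);
	nr_freed++;
}
```

Freeing only the top-level node would leave every descendant allocated, which is the kind of leak the message describes.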
Athira Rajeev [Tue, 4 Mar 2025 15:41:14 +0000 (21:11 +0530)]
perf annotate: Return errors from disasm_line__parse_powerpc()
In disasm_line__parse_powerpc(), the return code from
disasm_line__parse() is ignored. This gives bad results
if disasm_line__parse() fails to disassemble the line. Use
the return code to fix this.
When doing "perf annotate", the perf tool provides an option to
use a specific disassembler like llvm/objdump/capstone. The
order is to try llvm first and, if that fails, fall back
to objdump, i.e. PERF_DISASM_LLVM, PERF_DISASM_CAPSTONE
and PERF_DISASM_OBJDUMP.
On powerpc, when using "data type" sort keys, the preferred
approach is to read the raw instruction from the DSO. If objdump
is specified in the "--objdump" option, it disassembles the symbol
using objdump. Currently the disasm_line__parse_powerpc() function
uses the length of the "line" to determine if objdump is used.
But there are a few cases where, if objdump doesn't recognise the
instruction, the disassembled string will be empty.
So depending on the length of the line will give bad results.
Add a new field to the annotation options structure,
"struct annotation_options", to save the disassembler used.
Use this info to determine if disassembly is done while
parsing the disasm line.
Namhyung Kim [Wed, 5 Mar 2025 23:28:38 +0000 (15:28 -0800)]
perf report: Do not process non-JIT BPF ksymbol events
The length of PERF_RECORD_KSYMBOL for BPF is a size of JITed code so
it'd be 0 when it's not JITed. The ksymbol is needed to symbolize the
code when it gets samples in the region but non-JITed code cannot get
samples. Thus it'd be ok to ignore them.
Actually it caused a performance issue in the perf tools on old ARM
kernels where it can refuse to JIT some BPF codes. It ended up
splitting the existing kernel map (kallsyms). And later lookup for a
kernel symbol would create a new kernel map from kallsyms and then
split it again and again. :(
Probably there's a bug in the kernel map/symbol handling in perf tools.
But I think we need to fix this anyway.
Ian Rogers [Wed, 5 Mar 2025 19:19:31 +0000 (11:19 -0800)]
perf test: Fix leak in "Synthesize attr update" test
The own_cpus map variable may be non-NULL and hold a reference, in
particular on hybrid machines. Do a put before overwriting the
variable to avoid a memory leak.
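The put-before-overwrite rule can be sketched with a toy reference-counted object. The cpu_map type and the cpu_map_*() and slot_assign() helpers are hypothetical and only mimic the get/put pattern; they are not the libperf API.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy reference-counted map, standing in for a CPU map. */
struct cpu_map {
	int refcnt;
};

static struct cpu_map *cpu_map_new(void)
{
	struct cpu_map *m = calloc(1, sizeof(*m));

	if (m)
		m->refcnt = 1;
	return m;
}

static struct cpu_map *cpu_map_get(struct cpu_map *m)
{
	if (m)
		m->refcnt++;
	return m;
}

static void cpu_map_put(struct cpu_map *m)
{
	if (!m)
		return;
	assert(m->refcnt > 0);
	if (--m->refcnt == 0)
		free(m);
}

/*
 * Store a new reference in *slot, dropping whatever it held before.
 * Taking the new reference first keeps self-assignment safe.
 */
static void slot_assign(struct cpu_map **slot, struct cpu_map *new_map)
{
	struct cpu_map *old = *slot;

	*slot = cpu_map_get(new_map);
	cpu_map_put(old);
}
```

Overwriting *slot without the put would leave the old map's count one too high forever, so it would never be freed.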
Namhyung Kim [Fri, 28 Feb 2025 21:17:34 +0000 (18:17 -0300)]
perf machine: Fix insertion of PERF_RECORD_KSYMBOL related kernel maps
This was detected at the end of a 'perf record' session when build-id
collection was enabled. The BPF programs put in place while the
session was running, some even put in place by perf itself, were
processed and inserted, and some overlaps related to BPF trampolines
and programs took place.
Using maps__fixup_overlap_and_insert() instead of maps__insert() "fixes"
the problem, in the sense that overlaps will be dealt with and then the
consistency will be kept, but it would be interesting to fully
understand why such overlaps take place and how to deal with them when
doing symbol resolution.
Arnaldo Carvalho de Melo [Fri, 28 Feb 2025 21:17:33 +0000 (18:17 -0300)]
perf maps: Add missing map__set_kmap_maps() when replacing a kernel map
Since in this case __maps__insert_sorted() is not called and thus
doesn't have the opportunity to do the needed map__set_kmap_maps() calls on
the new map.
Namhyung Kim [Fri, 28 Feb 2025 21:17:32 +0000 (18:17 -0300)]
perf maps: Fixup maps_by_name when modifying maps_by_address
We can't just replace the map in maps_by_address and not touch
maps_by_name; that would leave the refcount as 1 and thus trip
another consistency check, this one:
/*
 * Maps by name maps should be in maps_by_address, so
 * the reference count should be higher.
 */
assert(refcount_read(map__refcnt(map)) > 1);
Committer notice:
Initialize the newly added 'ni' variable. It really can't be
accessed uninitialized, but it trips some gcc versions, like:
12 20.00 archlinux:base : FAIL gcc version 13.2.1 20230801 (GCC)
util/maps.c: In function ‘__maps__fixup_overlap_and_insert’:
util/maps.c:896:54: error: ‘ni’ may be used uninitialized [-Werror=maybe-uninitialized]
896 | map__put(maps_by_name[ni]);
| ^
util/maps.c:816:25: note: ‘ni’ was declared here
816 | unsigned int i, ni;
| ^~
cc1: all warnings being treated as errors
make[3]: *** [/git/perf-6.14.0-rc1/tools/build/Makefile.build:138: util] Error 2
Arnaldo Carvalho de Melo [Fri, 28 Feb 2025 21:17:29 +0000 (18:17 -0300)]
perf maps: Introduce map__set_kmap_maps() for kernel maps
We need to set it in other places than __maps__insert(), so that we can
have access to the 'struct maps' from a kernel 'struct map'.
When building perf with 'DEBUG=1' we can notice it failing a consistency
check done in the check_invariants() function:
root@number:~# perf record -- perf test -w offcpu
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.040 MB perf.data (23 samples) ]
perf: util/maps.c:95: check_invariants: Assertion `map__end(prev) <= map__end(map)' failed.
Aborted (core dumped)
root@number:~#
The investigation into that bisected it to 876e80cf83d10585
("perf tools: Fixup end address of modules"), and the following patches
will plug the problems found; this patch is just legwork in that
direction.
Use the map__set_kmap_maps() name as per a review comment from Ian
Rogers, later there are further suggestions from him on getting rid of
the kmaps variable, see the thread referenced in the Link below.
The original commit message:
"
perf script output may show different fields on different core PMU's
that exist on heterogeneous platforms. For example,
perf record -e "{cpu_core/mem-loads-aux/,cpu_core/event=0xcd,\
umask=0x01,ldlat=3,name=MEM_UOPS_RETIRED.LOAD_LATENCY/}:upp"\
-c10000 -W -d -a -- sleep 1
Some fields, such as data_src, are not included by default.
The cause is that while one PMU may be assigned a type such as
PERF_TYPE_RAW, other core PMU's are dynamically allocated at boot time.
If this value does not match an existing PERF_TYPE_X value,
output_type(perf_event_attr.type) will return OUTPUT_TYPE_OTHER.
Instead search for a core PMU with a matching perf_event_attr type
and, if one is found, return PERF_TYPE_RAW to match output of other
core PMU's.
"
Suggested-by: Kan Liang <kan.liang@intel.com> Suggested-by: Ian Rogers <irogers@google.com> Signed-off-by: Thomas Falcon <thomas.falcon@intel.com> Reviewed-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250305163935.1605312-1-thomas.falcon@intel.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Thomas Richter [Tue, 4 Mar 2025 09:23:49 +0000 (10:23 +0100)]
perf bench: Fix perf bench syscall loop count
Command 'perf bench syscall fork -l 100000' offers option -l to run for
a specified number of iterations. However this option is not always
observed. The number is silently limited to 10000 iterations as can be
seen:
Namhyung Kim [Tue, 4 Mar 2025 02:28:37 +0000 (18:28 -0800)]
perf test: Simplify data symbol test
Now the workload will end after 1 second. Just run it with perf instead
of waiting for the background process.
Reviewed-by: Leo Yan <leo.yan@arm.com> Tested-by: Thomas Richter <tmricht@linux.ibm.com> Tested-by: Thomas Falcon <thomas.falcon@intel.com> Cc: Thomas Richter <tmricht@linux.ibm.com> Cc: Leo Yan <leo.yan@arm.com> Link: https://lore.kernel.org/r/20250304022837.1877845-7-namhyung@kernel.org Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Namhyung Kim [Tue, 4 Mar 2025 02:28:36 +0000 (18:28 -0800)]
perf test: Add timeout to datasym workload
Unlike others it has an infinite loop that makes it annoying to call.
Make it finish after 1 second and handle command-line argument to change
the setting.
Reviewed-by: Leo Yan <leo.yan@arm.com> Tested-by: Thomas Richter <tmricht@linux.ibm.com> Tested-by: Thomas Falcon <thomas.falcon@intel.com> Cc: Thomas Richter <tmricht@linux.ibm.com> Cc: Leo Yan <leo.yan@arm.com> Link: https://lore.kernel.org/r/20250304022837.1877845-6-namhyung@kernel.org Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Namhyung Kim [Tue, 4 Mar 2025 02:28:35 +0000 (18:28 -0800)]
perf test: Add trace record and replay test
It just checks that trace record and replay display correct output.
It uses a 'sleep' process and checks that there's a clock_nanosleep syscall.
$ sudo perf test -vv replay
108: perf trace record and replay:
--- start ---
test child forked, pid 1563219
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.077 MB /tmp/temporary_file.w1ApA (242 samples) ]
0.686 (1000.068 ms): sleep/1563226 clock_nanosleep(rqtp: 0x7ffc20ffee10, rmtp: 0x7ffc20ffee50) = 0
---- end(0) ----
108: perf trace record and replay : Ok
Namhyung Kim [Tue, 4 Mar 2025 02:28:33 +0000 (18:28 -0800)]
perf test: Skip perf probe tests when running as non-root
perf trace requires root because it needs to use [ku]probes.
Skip those tests when it's not run as root.
Before:
$ perf test probe
47: Probe SDT events : Ok
104: test perf probe of function from different CU : FAILED!
115: perftool-testsuite_probe : FAILED!
117: Add vfs_getname probe to get syscall args filenames : FAILED!
118: probe libc's inet_pton & backtrace it with ping : FAILED!
119: Use vfs_getname probe to get syscall args filenames : FAILED!
After:
$ perf test probe
47: Probe SDT events : Ok
104: test perf probe of function from different CU : Skip
115: perftool-testsuite_probe : Skip
117: Add vfs_getname probe to get syscall args filenames : Skip
118: probe libc's inet_pton & backtrace it with ping : Skip
119: Use vfs_getname probe to get syscall args filenames : Skip
Namhyung Kim [Tue, 4 Mar 2025 02:28:32 +0000 (18:28 -0800)]
perf test: Add --metric-only to perf stat output tests
Add a test case for --metric-only for std, csv, json output mode using
shadow IPC metric from instructions and cycles events. It should
produce 'insn per cycle' metric.
But currently JSON output has (none) 'GHz' as well. It looks like a bug
but I don't have enough time to debug it for now so I made it pass. :(
$ perf stat --metric-only -e instructions,cycles true
$ perf stat -x, --metric-only -e instructions,cycles true
0.55,,
$ perf stat -j --metric-only -e instructions,cycles true
{"insn per cycle" : "0.53", "GHz" : "none"}
$ perf test output -v
5: Test data source output : Ok
31: Sort output of hist entries : Ok
88: perf stat CSV output linter : Ok
90: perf stat JSON output linter : Ok
92: perf stat STD output linter : Ok
Leo Yan [Tue, 4 Mar 2025 11:12:40 +0000 (11:12 +0000)]
perf arm-spe: Support previous branch target (PBT) address
When FEAT_SPE_PBT is implemented, the previous branch target address
(named as PBT) before the sampled operation, will be recorded.
This commit first introduces a 'prev_br_tgt' field in the record for
saving the PBT address in the decoder.
If the current operation is a branch instruction, combining it with the PBT
creates a chain of two consecutive branches. The branch stack stores
branches in descending order, meaning a newer branch is stored in a
lower entry in the stack. Arm SPE therefore stores the latest branch
in the first entry of the branch stack, and the previous branch coming
from PBT is stored in the second entry.
Otherwise, if the current operation is not a branch, the last branch will
be saved for PBT only. PBT lacks associated information such as branch
source address, branch type, and events. The branch entry fills zeros
for the corresponding fields and sets only its target address.
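The entry ordering described above can be sketched as follows. The struct layout and fill_branch_stack() are hypothetical simplifications, and treating a zero prev_br_tgt as "no PBT recorded" is an assumption of this sketch rather than something stated in the commit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Simplified branch entry and two-slot branch stack. */
struct br_entry {
	uint64_t from;
	uint64_t to;
};

struct br_stack {
	uint64_t nr;
	struct br_entry entries[2];
};

/*
 * Newest branch goes first. The PBT entry only has a target address,
 * so its other fields stay zero.
 */
static void fill_branch_stack(struct br_stack *bs, bool is_branch,
			      uint64_t from, uint64_t to,
			      uint64_t prev_br_tgt)
{
	uint64_t i = 0;

	assert(bs != NULL);
	memset(bs, 0, sizeof(*bs));

	if (is_branch) {
		/* The sampled operation itself is the latest branch. */
		bs->entries[i].from = from;
		bs->entries[i].to = to;
		i++;
	}
	if (prev_br_tgt) {
		/* Assumption: zero means no PBT was recorded. */
		bs->entries[i].to = prev_br_tgt;
		i++;
	}
	bs->nr = i;
}
```

For a sampled branch with a PBT this yields two entries; for a non-branch operation it yields a single entry holding only the PBT target.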
Leo Yan [Tue, 4 Mar 2025 11:12:39 +0000 (11:12 +0000)]
perf arm-spe: Add branch stack
Although Arm SPE cannot generate continuous branch records, this commit
creates a branch stack with only one branch entry. A single branch info
can be used for performance optimization.
A branch stack structure is dynamically allocated in the decode queue.
The branch stack and stack flags are synthesized based on branch types
and associated events.
Leo Yan [Tue, 4 Mar 2025 11:12:38 +0000 (11:12 +0000)]
perf arm-spe: Set sample flags with supplement info
Based on the supplement information in the record, this commit sets the
sample flags for conditional branch, function call, return. It also
sets events in flags, such as mispredict, not taken, and in transaction.
Leo Yan [Tue, 4 Mar 2025 11:12:35 +0000 (11:12 +0000)]
perf arm-spe: Extend branch operations
In the Arm ARM (ARM DDI 0487, L.a), section "D18.2.7 Operation Type
packet", the branch subclass is extended with Call Return (CR) and
Guarded control stack data access (GCS).
This commit adds support for CR and GCS operations. The IND (indirect)
operation is defined only in bit [1], so its macro is updated accordingly.
Move the COND (Conditional) macro into the same group with other
operations for better maintenance.
Leo Yan [Tue, 4 Mar 2025 11:12:34 +0000 (11:12 +0000)]
perf arm-spe: Fix load-store operation checking
The ARM_SPE_OP_LD and ARM_SPE_OP_ST operations are second-level operation
types and overlap with other second-level operation types belonging to
SVE and branch operations. As a result, a non load-store operation can be
parsed for data source and memory sample.
To fix the issue, this commit introduces an is_ldst_op() macro for
checking LDST operations, and applies the check when synthesizing data
source and memory samples.
Fixes: a89dbc9b988f ("perf arm-spe: Set sample's data source field")
Signed-off-by: Leo Yan <leo.yan@arm.com>
Reviewed-by: James Clark <james.clark@linaro.org>
Link: https://lore.kernel.org/r/20250304111240.3378214-7-leo.yan@arm.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Leo Yan [Tue, 4 Mar 2025 11:12:31 +0000 (11:12 +0000)]
perf script: Separate events from branch types
Branch types and events are two different things. A branch type can be
a conditional branch, an indirect branch, a procedure call, a return, or
an exception taken, etc. The extra event information is provided for
what happens during a branch, e.g. if a branch is mispredicted or not
taken (specific to conditional branches).
To deliver information about branches, this commit separates events from
branch types. It parses branch types first, then appends event strings
enclosed in '/' characters. If multiple events occur, the events are
separated with a comma (,).
Also add a minor improvement by adding the char 'm' to the char array for
the branch mispredict event.
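A rough sketch of the resulting string layout, assuming the type comes first and the events follow between '/' characters. The function name and the example strings used with it are illustrative only, not the identifiers or exact output of perf script.

```python
from typing import Iterable


def format_branch(branch_type: str, events: Iterable[str]) -> str:
    """Branch type first, then events enclosed in '/' and joined by ','."""
    events = list(events)
    if not events:
        # No event information: print the branch type alone.
        return branch_type
    return f"{branch_type}/{','.join(events)}/"
```

For instance, `format_branch("jcc", ["miss", "not_taken"])` keeps the type and the two events visually separate, while a branch without events prints as the bare type.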
Leo Yan [Tue, 4 Mar 2025 11:12:30 +0000 (11:12 +0000)]
perf script: Refactor sample_flags_to_name() function
When generating a string for sample flags, the sample_flags_to_name()
function lacks the ability to parse the trace start bit or trace end bit.
Therefore, the function is invoked multiple times after clearing its
unsupported bits.
This commit improves the sample_flags_to_name() function to parse sample
flags in one go for three kinds of information:
- The prefix info for trace start, trace end, etc.
- Branch types.
- Extra info for transaction and interrupt related info.
As a result, the code is simplified to call sample_flags_to_name()
only once. No change in the perf script output is expected.
Leo Yan [Tue, 4 Mar 2025 11:12:29 +0000 (11:12 +0000)]
perf script: Make printing flags reliable
Add a check for the generated string of flags. Print out the raw number
if the string generation fails.
Use the SAMPLE_FLAGS_STR_ALIGNED_SIZE macro to replace the value '21'.
Reviewed-by: Ian Rogers <irogers@google.com>
Reviewed-by: James Clark <james.clark@linaro.org>
Signed-off-by: Leo Yan <leo.yan@arm.com>
Reviewed-by: Adrian Hunter <adrian.hunter@intel.com>
Link: https://lore.kernel.org/r/20250304111240.3378214-2-leo.yan@arm.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Legacy hybrid events have attr.type == PERF_TYPE_HARDWARE, so they look
like plain legacy events if we only look at attr.type. But legacy events
should still be uniquified if they were opened on a non-legacy PMU. Fix
it by checking if the evsel is hybrid and forcing needs_uniquify
before looking at the attr.type.
This restores PMU names on hybrid systems and also changes "perf stat
metrics (shadow stat) test" from a FAIL back to a SKIP (on hybrid). The
test was gated on "cycles" appearing alone, which doesn't happen here.
Before:
$ perf stat -- true
...
<not counted> instructions:u (0.00%)
162,536 instructions:u # 0.58 insn per cycle
...
After:
$ perf stat -- true
...
<not counted> cpu_atom/instructions/u (0.00%)
162,541 cpu_core/instructions/u # 0.62 insn per cycle
...
Fixes: 357b965deba9 ("perf stat: Changes to event name uniquification")
Suggested-by: Ian Rogers <irogers@google.com>
Signed-off-by: James Clark <james.clark@linaro.org>
Tested-by: Thomas Falcon <thomas.falcon@intel.com>
Link: https://lore.kernel.org/r/20250226145526.632380-1-james.clark@linaro.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
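The ordering of the two checks is the point of the fix. The sketch below is a reduced model under stated assumptions: `Evsel`, `is_hybrid` and `set_needs_uniquify` are hypothetical names, and every case other than the two discussed in the message is left out.

```python
from dataclasses import dataclass

PERF_TYPE_HARDWARE = 0  # value from the perf_event_open(2) ABI


@dataclass
class Evsel:
    # Hypothetical, reduced view of an event selector.
    attr_type: int
    is_hybrid: bool
    needs_uniquify: bool = False


def set_needs_uniquify(evsel: Evsel) -> None:
    # The hybrid check comes first: a legacy hybrid event also has
    # attr_type == PERF_TYPE_HARDWARE and would otherwise be skipped.
    if evsel.is_hybrid:
        evsel.needs_uniquify = True
        return
    if evsel.attr_type == PERF_TYPE_HARDWARE:
        # Plain legacy event: leave the name as it is.
        return
    # All remaining cases are outside the scope of this sketch.
```

Swapping the two `if` statements reproduces the regression in this model: the hybrid event returns early at the attr.type test and never gets its PMU name.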
Namhyung Kim [Wed, 26 Feb 2025 20:30:39 +0000 (12:30 -0800)]
perf tools: Skip BPF sideband event for userspace profiling
The BPF sideband information is tracked using a separate thread and
evlist. But it's only useful for profiling the kernel, and we can skip it
when users profile their application only.
It seems it already fails to open the sideband event in that case.
Let's remove the noise in the verbose output anyway.
Luca Ceresoli [Fri, 24 Jan 2025 13:06:08 +0000 (14:06 +0100)]
perf build: Fix in-tree build due to symbolic link
Building perf in-tree is broken after commit 890a1961c812 ("perf tools:
Create source symlink in perf object dir") which added a 'source' symlink
in the output dir pointing to the source dir.
With in-tree builds, the added 'SOURCE = ...' line is executed multiple
times (I observed 2 during the build plus 2 during installation). This is a
minor inefficiency, in theory not harmful because symlink creation is
assumed to be idempotent. But it is not.
1. ln -sf $(srctree)/tools/perf $(OUTPUT)/source
-> creates /absolute/path/to/linux/tools/perf/source
link to /absolute/path/to/linux/tools/perf
=> OK, that's what was intended
2. ln -sf $(srctree)/tools/perf $(OUTPUT)/source # same command as 1
-> creates /absolute/path/to/linux/tools/perf/perf
link to /absolute/path/to/linux/tools/perf
=> Not what was intended, not idempotent
3. Now the build _should_ create the 'perf' executable, but it fails
The reason is the tricky 'ln' command line. At the first invocation 'ln'
uses the 1st form:
ln [OPTION]... [-T] TARGET LINK_NAME
and creates a link to TARGET *called LINK_NAME*.
At the second invocation $(OUTPUT)/source exists, so 'ln' uses the 3rd
form:
ln [OPTION]... TARGET... DIRECTORY
and creates a link to TARGET *called TARGET* inside DIRECTORY.
Fix by adding -n/--no-dereference to "treat LINK_NAME as a normal file
if it is a symbolic link to a directory", as the manpage says.
Closes: https://lore.kernel.org/all/20241125182506.38af9907@booty/
Fixes: 890a1961c812 ("perf tools: Create source symlink in perf object dir")
Signed-off-by: Luca Ceresoli <luca.ceresoli@bootlin.com>
Reviewed-by: Charlie Jenkins <charlie@rivosinc.com>
Tested-by: Charlie Jenkins <charlie@rivosinc.com>
Link: https://lore.kernel.org/r/20250124-perf-fix-intree-build-v1-1-485dd7a855e4@bootlin.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
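The non-idempotent behaviour can be reproduced outside the build. The sketch below assumes a POSIX system whose `ln` supports `-n` (GNU coreutils does). `link_twice` and the temporary directory layout are illustrative and not part of the Makefile.

```python
import os
import subprocess
import tempfile


def link_twice(extra_flags):
    """Run the same 'ln -sf' twice and list what ended up in the directory."""
    with tempfile.TemporaryDirectory() as top:
        # 'perf' stands in for $(srctree)/tools/perf; for an in-tree build
        # $(OUTPUT) is the same directory, so the link lives inside it.
        src = os.path.join(top, "perf")
        os.mkdir(src)
        link = os.path.join(src, "source")
        for _ in range(2):
            subprocess.run(["ln", "-sf", *extra_flags, src, link], check=True)
        return sorted(os.listdir(src))
```

Calling `link_twice([])` leaves a stray `perf` link next to `source`, matching step 2 above, whereas `link_twice(["-n"])` leaves only `source`.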
Ian Rogers [Tue, 25 Feb 2025 19:36:00 +0000 (11:36 -0800)]
tools/x86: Fix linux/unaligned.h include path in lib/insn.c
tools/arch/x86/include/linux doesn't exist, but the build works by
virtue of a -I. Building with bazel, this fails. Use angle brackets to
include unaligned.h so there isn't an invalid relative include.
Fixes: 5f60d5f6bbc1 ("move asm/unaligned.h to linux/unaligned.h")
Signed-off-by: Ian Rogers <irogers@google.com>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/r/20250225193600.90037-1-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Leo Yan [Thu, 27 Feb 2025 08:55:44 +0000 (08:55 +0000)]
perf arm-spe: Report error if set frequency
When users set the parameter '-F' to specify a frequency for Arm SPE, the
tool reports an error:
perf record -F 1000 -e arm_spe_0// -- sleep 1
Error:
Invalid event (arm_spe_0//) in per-thread mode, enable system wide with '-a'.
The output is confusing and does not give the user a correct hint.
Arm SPE does not support frequency setting because it adopts a
statistics-based approach.
Arm SPE does support setting a period instead. This commit adds a check
for frequency setting. It reports an error and reminds users to set a
period instead.
After:
perf record -F 1000 -e arm_spe_0// -- sleep 1
Arm SPE: Frequency is not supported. Set period with -c option or PMU parameter (-e arm_spe_0/period=NUM/).
Chun-Tse Shao [Thu, 27 Feb 2025 00:28:56 +0000 (16:28 -0800)]
perf lock: Report owner stack in usermode
This patch parses `owner_lock_stat` into a RB tree, enabling ordered
reporting of owner lock statistics with stack traces. It also updates
the documentation for the `-o` option in contention mode, decouples `-o`
from `-t`, and issues a warning to inform users about the new behavior
of `-ov`.
Example output:
$ sudo ~/linux/tools/perf/perf lock con -abvo -Y mutex-spin -E3 perf bench sched pipe
...
contended total wait max wait avg wait type caller
171 1.55 ms 20.26 us 9.06 us mutex pipe_read+0x57
0xffffffffac6318e7 pipe_read+0x57
0xffffffffac623862 vfs_read+0x332
0xffffffffac62434b ksys_read+0xbb
0xfffffffface604b2 do_syscall_64+0x82
0xffffffffad00012f entry_SYSCALL_64_after_hwframe+0x76
36 193.71 us 15.27 us 5.38 us mutex pipe_write+0x50
0xffffffffac631ee0 pipe_write+0x50
0xffffffffac6241db vfs_write+0x3bb
0xffffffffac6244ab ksys_write+0xbb
0xfffffffface604b2 do_syscall_64+0x82
0xffffffffad00012f entry_SYSCALL_64_after_hwframe+0x76
4 51.22 us 16.47 us 12.80 us mutex do_epoll_wait+0x24d
0xffffffffac691f0d do_epoll_wait+0x24d
0xffffffffac69249b do_epoll_pwait.part.0+0xb
0xffffffffac693ba5 __x64_sys_epoll_pwait+0x95
0xfffffffface604b2 do_syscall_64+0x82
0xffffffffad00012f entry_SYSCALL_64_after_hwframe+0x76
=== owner stack trace ===
3 31.24 us 15.27 us 10.41 us mutex pipe_read+0x348
0xffffffffac631bd8 pipe_read+0x348
0xffffffffac623862 vfs_read+0x332
0xffffffffac62434b ksys_read+0xbb
0xfffffffface604b2 do_syscall_64+0x82
0xffffffffad00012f entry_SYSCALL_64_after_hwframe+0x76
...
Chun-Tse Shao [Thu, 27 Feb 2025 00:28:54 +0000 (16:28 -0800)]
perf lock: Retrieve owner callstack in bpf program
This implements per-callstack aggregation of lock owners in addition to
per-thread. The owner callstack is captured using `bpf_get_task_stack()`
at `contention_begin()` and it also adds a custom stackid function for the
owner stacks to be compared easily.
The owner info is kept in a hash map using lock addr as a key to handle
multiple waiters for the same lock. At `contention_end()`, it updates the
owner lock stat based on the info that was saved at `contention_begin()`.
If there are more waiters, it'd update the owner pid to itself as
`contention_end()` means it gets the lock now. But it also needs to check
the return value of the lock function in case the task was killed by a
signal or something.
Chun-Tse Shao [Thu, 27 Feb 2025 00:28:53 +0000 (16:28 -0800)]
perf lock: Add bpf maps for owner stack tracing
Add a struct and a few BPF maps in order to trace the owner stack.
`struct owner_tracing_data`: Contains owner's pid, stack id, timestamp for
when the owner acquires lock, and the count of lock waiters.
`stack_buf`: Percpu buffer for retrieving owner stacktrace.
`owner_stacks`: For tracing owner stacktrace to customized owner stack id.
`owner_data`: For tracing lock_address to `struct owner_tracing_data` in
bpf program.
`owner_stat`: For reporting owner stacktrace in usermode.
Ian Rogers [Mon, 10 Feb 2025 19:12:31 +0000 (11:12 -0800)]
perf cpumap: Reduce cpu size from int to int16_t
Fewer than 32k logical CPUs are currently supported by perf. A cpumap
is indexed by an integer (see perf_cpu_map__cpu) yielding a perf_cpu
that wraps a 4-byte int for the logical CPU - the wrapping is done
deliberately to avoid confusing a logical CPU with an index into a
cpumap. Using a 4-byte int within the perf_cpu is larger than required
so this patch reduces it to the 2-byte int16_t. For a cpumap
containing 16 entries this will reduce the array size from 64 to 32
bytes. For very large servers with lots of logical CPUs the size
savings will be greater.
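The arithmetic can be checked with a short sketch that uses Python's ctypes as a stand-in for the C types. The variable names are illustrative; the 16-entry example and the 32k limit come from the description above.

```python
import ctypes

NR_ENTRIES = 16  # example size used in the description


def array_bytes(ctype, nr_entries: int) -> int:
    """Size in bytes of an array of nr_entries elements of the given C type."""
    return ctypes.sizeof(ctype * nr_entries)


old_size = array_bytes(ctypes.c_int32, NR_ENTRIES)  # 4-byte int per CPU
new_size = array_bytes(ctypes.c_int16, NR_ENTRIES)  # 2-byte int16_t per CPU
max_cpu = 2**15 - 1  # largest value an int16_t can hold
```

The halved array size and the int16_t upper bound are consistent with the "fewer than 32k logical CPUs" limit stated above.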
Backtrace pointed to :
?? ()
perf_session.process_user_event ()
reader.read_event ()
perf_session.process_events ()
cmd_trace ()
run_builtin ()
handle_internal_command ()
main ()
Further debugging showed that the segmentation fault happens when
trying to access id_index. Code snippet:
case PERF_RECORD_ID_INDEX:
err = tool->id_index(session, event);
Since commit 15d4a6f41d72 ("perf tool: Remove
perf_tool__fill_defaults()"), perf_tool__fill_defaults() is
removed. All tools are initialized using perf_tool__init()
prior to use. But in builtin-trace, perf_tool__init() is not
used and hence the defaults are not initialized. Use
perf_tool__init() in perf trace to handle the initialization.
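The failure mode can be modelled in a few lines. This is a hedged Python analogy, not the perf code: `Tool`, `stub_id_index`, `tool_init` and `process_id_index` are hypothetical, and calling an unset handler raises TypeError here where the C code dereferences a null function pointer.

```python
class Tool:
    # Hypothetical, reduced callback table; only id_index is modelled.
    def __init__(self):
        self.id_index = None  # like a zeroed struct: no handler installed


def stub_id_index(session, event):
    """Default handler that ignores the event and reports success."""
    return 0


def tool_init(tool: Tool) -> None:
    # Stands in for perf_tool__init(): install defaults before use.
    tool.id_index = stub_id_index


def process_id_index(tool: Tool, session, event):
    # Mirrors the quoted call site: the handler is invoked unconditionally.
    return tool.id_index(session, event)
```

Because the call site does not test the handler, the only safe state is one where every tool has gone through the init step first.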
James Clark [Wed, 26 Feb 2025 10:41:01 +0000 (10:41 +0000)]
perf pmu: Don't double count common sysfs and json events
After pmu_add_cpu_aliases() is called, perf_pmu__num_events() returns an
incorrect value that double counts common events and doesn't match the
actual count of events in the alias list. This is because after
'cpu_aliases_added == true', the number of events returned is
'sysfs_aliases + cpu_json_aliases'. But when adding 'case
EVENT_SRC_SYSFS' events, 'sysfs_aliases' and 'cpu_json_aliases' are both
incremented together, failing to account for the fact that these overlap
and only add a single item to the list. Fix it by adding another counter
for overlapping events which doesn't influence 'cpu_json_aliases'.
There doesn't seem to be a current issue because it's used in perf list
before pmu_add_cpu_aliases() so the correct value is returned. Other
uses in tests may also miss it for other reasons like only looking at
uncore events. However it's marked as a fixes commit in case any new fix
with new uses of perf_pmu__num_events() is backported.
Fixes: d9c5f5f94c2d ("perf pmu: Count sys and cpuid JSON events separately")
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: James Clark <james.clark@linaro.org>
Link: https://lore.kernel.org/r/20250226104111.564443-3-james.clark@linaro.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
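The double count is easy to see in a toy model. The class below is a sketch under assumptions: `PmuCounts`, `add_json`, `add_sysfs` and `common_aliases` are hypothetical names. Only `sysfs_aliases`, `cpu_json_aliases` and the idea of a separate overlap counter come from the message.

```python
class PmuCounts:
    """Toy model of the alias counters (not the perf_pmu structure)."""

    def __init__(self, fixed: bool):
        self.fixed = fixed
        self.aliases = set()      # the actual alias list, one item per name
        self.sysfs_aliases = 0
        self.cpu_json_aliases = 0
        self.common_aliases = 0   # assumed name for the new overlap counter

    def add_json(self, name: str) -> None:
        self.aliases.add(name)
        self.cpu_json_aliases += 1

    def add_sysfs(self, name: str, also_in_json: bool) -> None:
        self.aliases.add(name)
        self.sysfs_aliases += 1
        if also_in_json:
            if self.fixed:
                self.common_aliases += 1    # tracked, but kept out of the total
            else:
                self.cpu_json_aliases += 1  # old behaviour: counted twice

    def num_events(self) -> int:
        return self.sysfs_aliases + self.cpu_json_aliases
```

With one event present in both sysfs and JSON plus one JSON-only event, the unfixed model reports three events for a two-item list, and the fixed model reports two.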
James Clark [Wed, 26 Feb 2025 10:41:00 +0000 (10:41 +0000)]
perf pmu: Dynamically allocate tool PMU
perf_pmus__destroy() treats all PMUs as allocated and frees them, so we
can't have any static PMUs that are added to the PMU lists. Fix it by
allocating the tool PMU in the same way as the others. Current users of
the tool PMU already use find_pmu() and not perf_pmus__tool_pmu(), so
rename the function to add 'new' to avoid it being misused in the
future.
perf_pmus__fake_pmu() can remain as static as it's not added to the
PMU lists.
Fixes the following error:
$ perf bench internals pmu-scan
# Running 'internals/pmu-scan' benchmark:
Computing performance of sysfs PMU event scan for 100 times
munmap_chunk(): invalid pointer
Aborted (core dumped)
Fixes: 240505b2d0ad ("perf tool_pmu: Factor tool events into their own PMU")
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: James Clark <james.clark@linaro.org>
Link: https://lore.kernel.org/r/20250226104111.564443-2-james.clark@linaro.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Code in synthesize_probe_trace_arg() accesses a null value, which results
in a segfault. The data structure which is null:
struct probe_trace_arg arg->value
We are hitting a case where arg->value is null in the probe point
"vfs_fstatat $params". This has been happening since commit e896474fe485
("getname_maybe_null() - the third variant of pathname copy-in").
Before the commit, probe point for vfs_fstatat was getting added only
for one location:
With this change, vfs_fstatat code is inlined for other locations in the
code:
Probe point found: __do_sys_lstat64+48
Probe point found: __do_sys_stat64+48
Probe point found: __do_sys_newlstat+48
Probe point found: __do_sys_newstat+48
Probe point found: vfs_fstatat+0
When trying to find the matching DWARF debugging information entry (DIE)
in the debuginfo, the code incorrectly picks a DIE which does
not refer to vfs_fstatat. Snippet from the DWARF entries in the vmlinux
debuginfo file:
The main abstract DIE is:
<1><4214883>: Abbrev Number: 147 (DW_TAG_subprogram)
<4214885> DW_AT_external : 1
<4214885> DW_AT_name : (indirect string, offset: 0x17b9f3): vfs_fstatat
While collecting variables/parameters for a probe point, the function
copy_variables_cb() also looks at DWARF debug entries based on the
instruction address. Snippet:
if (dwarf_haspc(die_mem, vf->pf->addr))
return DIE_FIND_CB_CONTINUE;
else
return DIE_FIND_CB_SIBLING;
But in the case of the inlined function instance for vfs_fstatat, there
are two entries which have the same instruction address as entry point.
Instance 1: the one for vfs_fstatat, whose DW_AT_abstract_origin points to
0x4214883 (see the main abstract DIE above)
But copy_variables_cb() continues to add parameters from the second
instance as well, based on the dwarf_haspc() check. This results in the
formal parameters for getname also being appended to params. But when
filling in args->value for these parameters, the value will be null since
these args are not part of the DWARF entry at offset "42131fa". These
incorrect args result in a segfault when the value field is accessed.
Save the DWARF DIE offset of the actual DW_TAG_subprogram as part of
"struct probe_finder". In copy_variables_cb(), include a check to make
sure DW_AT_abstract_origin points to the correct entry if
dwarf_haspc() matches the instruction address.
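The added condition can be sketched as below. This is a simplified model, not the probe-finder code: `Die`, `has_pc` and `copy_variables_filter` are hypothetical, the address ranges in the usage note are made up, and the return values are placeholder strings for the two enum constants quoted earlier.

```python
from dataclasses import dataclass

DIE_FIND_CB_CONTINUE = "continue"  # placeholders for the enum values
DIE_FIND_CB_SIBLING = "sibling"


@dataclass
class Die:
    # Hypothetical, reduced view of an inlined-instance DIE.
    low_pc: int
    high_pc: int
    abstract_origin: int  # offset of the DW_TAG_subprogram it was inlined from

    def has_pc(self, addr: int) -> bool:
        return self.low_pc <= addr < self.high_pc


def copy_variables_filter(die: Die, addr: int, target_offset: int) -> str:
    """Descend only into DIEs that cover addr and belong to the probed function."""
    if die.has_pc(addr) and die.abstract_origin == target_offset:
        return DIE_FIND_CB_CONTINUE
    return DIE_FIND_CB_SIBLING
```

Two instances that start at the same address now differ in outcome: the one whose abstract origin is 0x4214883 is followed, and the other is skipped even though the address test alone would accept it.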
Gabriele Monaco [Fri, 7 Feb 2025 08:04:45 +0000 (09:04 +0100)]
perf ftrace latency: allow to hide empty buckets
Especially while using several buckets, it isn't uncommon to have some
of them empty and reading the histogram may be a bit more complex:
# perf ftrace latency -a -T mutex_lock --bucket-range 5 --max-latency 200
# DURATION | COUNT | GRAPH |
0 - 5 us | 14816 | ###################################### |
5 - 10 us | 1228 | ### |
10 - 15 us | 438 | # |
15 - 20 us | 106 | |
20 - 25 us | 21 | |
25 - 30 us | 11 | |
30 - 35 us | 1 | |
35 - 40 us | 2 | |
40 - 45 us | 4 | |
45 - 50 us | 0 | |
50 - 55 us | 1 | |
55 - 60 us | 0 | |
60 - 65 us | 1 | |
65 - 70 us | 1 | |
70 - 75 us | 1 | |
75 - 80 us | 2 | |
80 - 85 us | 0 | |
85 - 90 us | 1 | |
90 - 95 us | 0 | |
95 - 100 us | 1 | |
100 - 105 us | 0 | |
105 - 110 us | 0 | |
110 - 115 us | 0 | |
115 - 120 us | 0 | |
120 - 125 us | 1 | |
125 - 130 us | 0 | |
130 - 135 us | 0 | |
135 - 140 us | 1 | |
140 - 145 us | 0 | |
145 - 150 us | 0 | |
150 - 155 us | 0 | |
155 - 160 us | 0 | |
160 - 165 us | 0 | |
165 - 170 us | 0 | |
170 - 175 us | 0 | |
175 - 180 us | 0 | |
180 - 185 us | 0 | |
185 - 190 us | 0 | |
190 - 195 us | 0 | |
195 - 200 us | 0 | |
200 - ... us | 2 | |
Allow the optional flag --hide-empty to remove buckets with no element
and produce a more compact graph. This feature could be misleading since
there is no clear indication of missing buckets; for this reason it's
disabled by default.
# perf ftrace latency -a -T mutex_lock --bucket-range 5 --max-latency 200 --hide-empty
# DURATION | COUNT | GRAPH |
0 - 5 us | 14816 | ###################################### |
5 - 10 us | 1228 | ### |
10 - 15 us | 438 | # |
15 - 20 us | 106 | |
20 - 25 us | 21 | |
25 - 30 us | 11 | |
30 - 35 us | 1 | |
35 - 40 us | 2 | |
40 - 45 us | 4 | |
50 - 55 us | 1 | |
60 - 65 us | 1 | |
65 - 70 us | 1 | |
70 - 75 us | 1 | |
75 - 80 us | 2 | |
85 - 90 us | 1 | |
95 - 100 us | 1 | |
120 - 125 us | 1 | |
135 - 140 us | 1 | |
200 - ... us | 2 | |
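The filtering itself is simple. The sketch below assumes buckets are held as (label, count) pairs; `visible_buckets` is a hypothetical helper, not a function in perf.

```python
from typing import List, Tuple

Bucket = Tuple[str, int]  # (label, count)


def visible_buckets(buckets: List[Bucket], hide_empty: bool) -> List[Bucket]:
    """Drop zero-count buckets only when hide_empty is requested."""
    if not hide_empty:
        return list(buckets)
    return [(label, count) for label, count in buckets if count != 0]
```

Since the surviving rows keep their original labels, a gap such as 45 - 50 us can only be noticed by reading the ranges, which is why the option is off by default.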
Gabriele Monaco [Fri, 7 Feb 2025 08:04:44 +0000 (09:04 +0100)]
perf ftrace latency: variable histogram buckets
The max-latency value can make the histogram smaller, but not larger: we
have a maximum of 22 buckets, and specifying a max-latency that would
require more buckets has no effect.
Dynamically allocate the buckets and compute the bucket number from the
max latency as (max-min) / range + 2.
If the maximum is not specified, we still set the bucket number to 22
and compute the maximum accordingly.
Fail if the maximum is smaller than min+range; this way we make sure we
always have at least 3 buckets: those below min, those above max and one
in the middle.
Since max-latency is not available in log2 mode, always use 22 buckets.
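A worked version of the formula for the linear case, as a sketch. `bucket_setup` is a hypothetical helper. Integer division and the way the default maximum is derived (inverting the formula for 22 buckets) are assumptions of this example.

```python
DEFAULT_BUCKETS = 22


def bucket_setup(min_latency: int, bucket_range: int, max_latency=None):
    """Return (bucket_num, max_latency) for linear buckets."""
    if max_latency is None:
        # No maximum given: keep 22 buckets and derive the maximum by
        # inverting the formula (an assumption of this sketch).
        derived = min_latency + (DEFAULT_BUCKETS - 2) * bucket_range
        return DEFAULT_BUCKETS, derived
    if max_latency < min_latency + bucket_range:
        raise ValueError("max-latency must be at least min-latency + bucket-range")
    # One bucket below min, one above max, the rest in between.
    return (max_latency - min_latency) // bucket_range + 2, max_latency
```

With a range of 5 and a maximum of 200 this gives 200 / 5 + 2 = 42 buckets, and the smallest accepted maximum (min + range) gives exactly 3.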
Namhyung Kim [Sun, 26 Jan 2025 21:02:42 +0000 (13:02 -0800)]
perf annotate-data: Handle direct use of stack pointer without fbreg
Sometimes the compiler generates code that uses the stack pointer register
without a frame pointer. As we know RSP is the stack register on x86,
let's treat it the same as fbreg. But the offset would be in the opposite
direction, so update the debug message accordingly.
Thomas Falcon [Thu, 20 Feb 2025 04:59:42 +0000 (22:59 -0600)]
perf report: Fix sample number stats for branch entry mode
Currently, stats->nr_samples is incremented per entry in the branch stack
instead of per sample taken. As a result, the sample statistics for data
recorded by perf record in --branch-filter or --branch-any mode do not
seem correct. Call hists__inc_nr_samples() for each sample taken
instead of for each entry in the branch stack.
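The difference between the two counting schemes, as a small sketch. Each sample is modelled as a list of branch stack entries; both function names are hypothetical.

```python
from typing import List, Sequence


def count_per_entry(samples: Sequence[List[int]]) -> int:
    """Old behaviour: one increment per branch stack entry."""
    return sum(len(branch_stack) for branch_stack in samples)


def count_per_sample(samples: Sequence[List[int]]) -> int:
    """Fixed behaviour: one increment per sample, whatever its stack depth."""
    return len(samples)
```

For three samples carrying 3, 1 and 2 branch entries, the old scheme reports 6 samples and the fixed scheme reports 3.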
Ian Rogers [Sat, 22 Feb 2025 06:10:13 +0000 (22:10 -0800)]
perf machine: Reuse module path buffer
Rather than copying the path and appending the directory entry in a
fresh path buffer, append to the path at the end of where it is for
the recursion level. This saves a PATH_MAX buffer per recursion level
and some unnecessary copying.
Ian Rogers [Sat, 22 Feb 2025 06:10:09 +0000 (22:10 -0800)]
perf header: Switch mem topology to io_dir__readdir
Switch memory_node__read and build_mem_topology from opendir/readdir
to io_dir__readdir, with smaller stack allocations. Reduces peak
memory consumption of perf record by 10kb.
Ian Rogers [Sat, 22 Feb 2025 06:10:06 +0000 (22:10 -0800)]
tools lib api: Add io_dir an allocation free readdir alternative
glibc's opendir allocates a minimum of 32kb; when called recursively
for a directory tree the memory consumption can add up - nearly 300kb
during perf start-up when processing modules. Add a stack allocated
variant of readdir sized a little more than 1kb.
As getdents64 may be missing from libc, add support using syscall. As
the system call number may be missing, add #defines for those.
Note, an earlier version of this patch had a feature test for
getdents64 but there were problems on certain distros where
getdents64 would be #define renamed to getdents, breaking the code. The
syscall use was made unconditional to work around this. There is
context in:
https://lore.kernel.org/lkml/20231207050433.1426834-1-irogers@google.com/