Signed-off-by: Nicolas Droux <nicolas.droux@oracle.com> Acked-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-by: Alan Maguire <alan.maguire@oracle.com>
dtrace: failing to allocate more ECB space can cause a crash
The existing code was not taking into consideration that when the
table of ECBs needs to be expanded, the memory allocation can fail.
This could lead to a NULL pointer access, and a kernel crash. We
now check the result of the allocation, and bail out if it fails.
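A minimal sketch of the added check (the allocation call and surrounding
names here are illustrative, not the actual dtrace_ecb code):
    necbs = vzalloc(nsize);
    if (necbs == NULL)
            return NULL;            /* bail out; the old ECB table stays intact */
    memcpy(necbs, state->dts_ecbs, osize);
    vfree(state->dts_ecbs);
    state->dts_ecbs = necbs;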
Orabug: 26503342 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Reviewed-by: Tomas Jedlicka <tomas.jedlicka@oracle.com>
Nick Alcock [Fri, 4 Aug 2017 16:54:24 +0000 (17:54 +0100)]
dtrace: work around libdtrace-ctf bug
The bug involves synthesising pointers to types (ipaddr_t *, in
particular) when such pointer types do not appear in the CTF files but
are needed by the CTF itself. This is working in standalone modules,
but not in modules with parent type containers.
As a workaround, pro tem before fixing this properly in libdtrace-ctf,
hack around it for the one type it is necessary for (a type that is used
in the DTrace system translators, so if this type does not resolve
correctly DTrace will not start). A suitable workaround is simply to
introduce a use of this pointer type in the C code, and it so happens
that we have a place where this would fit perfectly well.
Orabug: 26583958 Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-by: Tomas Jedlicka <tomas.jedlicka@oracle.com> Reviewed-by: Kris Van Hees <kris.van.hees@oracle.com>
Signed-off-by: Tomas Jedlicka <tomas.jedlicka@oracle.com> Reviewed-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-by: Kris Van Hees <kris.van.hees@oracle.com>
Tomas Jedlicka [Tue, 1 Aug 2017 13:15:44 +0000 (09:15 -0400)]
dtrace: Integrate DTrace Modules into kernel proper
This changeset integrates DTrace module sources into the main kernel
source tree under the GPLv2 license. Sources have been moved to
appropriate locations in the kernel tree.
In addition a new RPM package is introduced: kernel-headers-dtrace.
This package is responsible for installation of DTrace related header
files for its userspace component.
Signed-off-by: Tomas Jedlicka <tomas.jedlicka@oracle.com> Signed-off-by: David Mc Lean <david.mclean@oracle.com> Reviewed-by: Kris Van Hees <kris.van.hees@oracle.com>
Tomas Jedlicka [Fri, 7 Apr 2017 20:51:53 +0000 (16:51 -0400)]
dtrace: FBT module support and SPARC return probes
This fix adds two features to the FBT provider:
1) support for modules
2) support for return probes on SPARC
Module support on x86 was almost ready, as it does not rely on trampolines and
uses hashtables of tracepoints. This works well when we know the number of probes in
advance, so we can reserve the correct amount of memory at module load time. Unfortunately
that is not possible on SPARC, where we need to allocate trampolines dynamically.
The major part of this code is about removing all static assumptions about FBT from kernel
code and moving the responsibility to the dtrace modules. Trampolines for SPARC are now
allocated dynamically (including for the kernel's pseudo-module). This applies to SDT trampolines
too.
The second change adds a scan for return probes on SPARC, with small heuristics to quickly
skip over cases that are not interesting for DTrace. At the same time this patch
allocates a new SPARC trap for FBT.
Support for the .init section is not available on any platform. The .init section is freed
after a module is fully loaded and it is not possible to remove its probes without
further changes in the DTrace framework (modules). This is deferred for later work.
Signed-off-by: Tomas Jedlicka <tomas.jedlicka@oracle.com> Reviewed-by: Kris Van Hees <kris.van.hees@oracle.com> Reviewed-by: Rob Gardner <rob.gardner@oracle.com>
Tomas Jedlicka [Sun, 2 Jul 2017 19:06:10 +0000 (15:06 -0400)]
dtrace: Make dynamic variable cleanup self-throttling
With the addition of cyclic_reprogram() it is possible to make dynamic variable cleanup
self-throttling, simply by having the handler reprogram its own cyclic.
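As a rough sketch (the handler and field names are illustrative, and the
cyclic API is shown Solaris-style; the port's signatures may differ):
    static void dtrace_state_clean(void *arg)
    {
            dtrace_state_t *state = arg;

            dtrace_dynvar_clean(&state->dts_vstate.dtvs_dynvars);

            /* Throttle ourselves: schedule the next pass only now that this
             * one has finished, rather than on a fixed period. */
            cyclic_reprogram(state->dts_cleaner,
                             dtrace_gethrtime() + dtrace_cleanrate);
    }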
Tomas Jedlicka [Thu, 29 Jun 2017 11:22:52 +0000 (07:22 -0400)]
dtrace: DTrace state deadman must use dtrace_sync()
The dtrace_sync() function allows us to check that all CPUs have left probe context. Without
this code the deadman check would base its assumption on the state of the CPU that calls
the deadman cyclic, which is wrong.
Tomas Jedlicka [Fri, 7 Apr 2017 22:23:04 +0000 (18:23 -0400)]
dtrace: FBT module support and SPARC return probes
This fix contains the following changes:
- Modification of provider ops vector
- Move of fbt hash table to x86 platform specific code
- Instrumentation of return probes on SPARC
The DTrace provider framework allows a provider to create a probe on the fly or to provide
probes for a given module, but there is no way for a provider to attach its own per-module
data through the framework.
With this change a provider may allocate per-module data inside the provide_module() call.
Once a module is about to go away it will be notified by the framework via the destroy_module() op.
To stay binary compatible I extended the ops vector at its end, but used C99-style initializers
for the ops structures to keep callbacks grouped by their logic.
SPARC now does two passes over the available kernel symbols. The first one is used to
count how many symbols are present, so that a correctly sized trampoline can be allocated.
The second pass performs the actual disassembly and creates return probes. It is possible that
not every probe is instrumentable, so we may end up wasting some memory. It is a tradeoff
between speed and memory consumption.
It is not possible to instrument arbitrary return sites, so we support only the variants
that occur in the instruction stream. The current implementation relies on the use of JMPL,
so it is not possible to instrument returns from tail-call-optimized code.
Another change is in the patching of the code. The JMPL requires a NOP in its delay
slot. This prevents us from patching atomically on a running kernel, so we must stop the
CPUs for safety reasons.
Linux probes may fire from non-standard contexts like TL1, so it is not safe to assume
anything about the %g registers. Thanks to having a few free %l registers we are able to
temporarily store the %g registers and restore them afterwards, to avoid breaking trap handlers.
Tomas Jedlicka [Sun, 2 Jul 2017 16:17:01 +0000 (12:17 -0400)]
dtrace: Add support for manually triggered cyclics
In some scenarios it is better if a client of the cyclic subsystem can reprogram a cyclic on
its own. This is not possible with the current implementation.
This change adds cyclic_reprogram(), which can be used to schedule a cyclic from inside and
outside of its handler. A manually triggered cyclic is distinguished from other types
by having its interval set to -1.
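A sketch of the intended usage, with Solaris-style cyclic structure and
macro names that are assumptions and may differ in this port:
    cyc_handler_t hdlr;
    cyc_time_t when;
    cyclic_id_t cid;

    hdlr.cyh_func  = dtrace_state_clean;
    hdlr.cyh_arg   = state;
    hdlr.cyh_level = CY_LOW_LEVEL;

    when.cyt_when     = 0;
    when.cyt_interval = -1;         /* manual trigger only, never fires by itself */

    cid = cyclic_add(&hdlr, &when);

    /* later, from inside or outside the handler: */
    cyclic_reprogram(cid, dtrace_gethrtime() + NANOSEC);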
Tomas Jedlicka [Fri, 30 Jun 2017 13:17:06 +0000 (09:17 -0400)]
dtrace: LOW level cyclics should use workqueues
The HIGH level cyclics are meant to be run from an interrupt handler. This works on
Linux because the hrtimer is scheduled as a tasklet. The LOW level cyclics must be
interruptible and should not be scheduled as tasklets.
DTrace currently relies on being able to call dtrace_sync() from within a cyclic
handler. On Linux it is not safe to send IPIs from within interrupt/bottom-half
handlers.
This fix changes LOW level cyclics to use workqueues. At the moment we are using the
shared system workqueue, but it may become necessary to allocate our own if this causes
large latencies in our timer routines.
Signed-off-by: Tomas Jedlicka <tomas.jedlicka@oracle.com> Reviewed-by: Kris Van Hees <kris.van.hees@oracle.com> Acked-by: Chuck Anderson <chuck.anderson@oracle.com>
Kris Van Hees [Tue, 13 Jun 2017 16:33:04 +0000 (12:33 -0400)]
dtrace: FBT entry probes will now use int3
Due to some function prologues inserting an instruction between the
'push %rbp' and 'mov %rsp,%rbp' instructions *and* that instruction being one
that can validly take a LOCK prefix (e.g. inc), it is not safe to
continue using the LOCK prefix as a way to trigger an Invalid Opcode
trap for FBT entry probes. The new trigger uses int3 (like the return
probes already do).
Orabug: 26190412
Orabug: 26174895 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com>
Kris Van Hees [Mon, 12 Jun 2017 13:33:05 +0000 (09:33 -0400)]
dtrace: add kprobe-unsafe addresses to FBT blacklist
By means of the newly introduced API to add entries to the FBT
blacklist, we make sure to register addresses that are unsafe for
kprobes with the FBT blacklist because they are unsafe there also.
Orabug: 26190412 Signed-off-by: Tomas Jedlicka <tomas.jedlicka@oracle.com> Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Acked-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Kris Van Hees [Mon, 12 Jun 2017 13:29:16 +0000 (09:29 -0400)]
dtrace: convert FBT blacklist to RB-tree
The blacklist for FBT was implemented as a sorted list, populated from
a static list of functions. In order to allow functions to be added
from other places (i.e. programmatically), it has been converted to an
RB-tree with an API to add functions and to traverse the list. It is
still possible to add functions by address or to add them by symbol
name, to be resolved into the corresponding address.
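The shape of the resulting API is roughly as below; the function names are
hypothetical, derived from the description above rather than the source:
    /* Add a blacklist entry by resolved address ... */
    void dtrace_fbt_blacklist_add(unsigned long addr);
    /* ... or by symbol name, resolved to an address via kallsyms. */
    void dtrace_fbt_blacklist_add_name(const char *name);
    /* Walk the RB-tree in address order. */
    void dtrace_fbt_blacklist_walk(void (*cb)(unsigned long addr, void *arg),
                                   void *arg);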
Orabug: 26190412 Signed-off-by: Tomas Jedlicka <tomas.jedlicka@oracle.com> Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Acked-by: Nick Alcock <nick.alcock@oracle.com>
Alan Maguire [Wed, 14 Jun 2017 13:06:27 +0000 (14:06 +0100)]
DTrace: IP provider use-after-free for drop-out probe points
KASan warnings showed a possible use-after-free for skbs in the error-handling
codepath after netfilter hooks have run. Hooks may free the skb, so we
should not dereference it at drop-out probe points after NF_HOOK().
Nick Alcock [Thu, 15 Jun 2017 21:04:08 +0000 (22:04 +0100)]
ctf: fix a variety of memory leaks and use-after-free bugs
These fall into two classes, but are sufficiently intertwined that it's
easier to commit them in one go.
The first is outright leaks, which exceed 1GiB on a normal run, varying
from the tiny (failure to free getline()'s line), through the disastrous
(failure to free items filtered from a list by list_filter(), leading to
the leaking of nearly the whole of the named_structs state, which is
huge). We also leak the structs_seen hash due to recreating it on an
alias_fixup file switch without bothering to destroy it first.
The second is lifetime problems, centred around the stuff allocated and
freed in the detect_duplicates_tu_{init,done}() functions. These were
comparing the module name against a saved copy to see if a new vars_seen
needed to be allocated, or whether this was just a flip of TU without a
change of object file and we could get away with just flushing its
contents out -- but unfortunately the state->module_name is assigned
directly from its parameter, and *that* has a lifetime lasting only
within process_file() -- and a deduplication run, of course, involves
iterating over a great many object files. So everything works as long as
we're flipping from TU to TU within a single object file, and then we
switch object files and are suddenly strcmp()ing with freed memory.
Discard this faulty optimization entirely, and just flush the vars_seen
hash in tu_done() and both create and destroy it in scan_duplicates(),
right where we create and destroy related stuff too.
Something similar happens with the state->dwfl_file_name due to its
derivation from id_file->file_name: if no duplicates are found, we
list_filter() that id_file straight out of the structs_seen list and
free it, and then on the next call state->dwfl_file_name points to freed
memory.
Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Acked-by: Kris Van Hees <kris.van.hees@oracle.com> Acked-by: Tomas Jedlicka <tomas.jedlicka@oracle.com>
Orabug: 26283357
Signed-off-by: Nicolas Droux <nicolas.droux@oracle.com> Acked-by: Saar Maoz <Saar.Maoz@oracle.com> Acked-by: Rajan Shanmugavelu <rajan.shanmugavelu@oracle.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Nicolas Droux <nicolas.droux@oracle.com> Acked-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Shan Hai <shan.hai@oracle.com>
Signed-off-by: Tomas Jedlicka <tomas.jedlicka@oracle.com> Acked-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-by: Alan Maguire <alan.maguire@oracle.com>
Nick Alcock [Wed, 24 May 2017 16:57:05 +0000 (17:57 +0100)]
ctf: prevent modules on the dedup blacklist from sharing any types at all
The deduplication blacklist exists to deal with a few modules, mostly
old ISA sound drivers, which consist of a #define and a #include of
another module, which then uses #ifdef profusely, sometimes on structure
members. This leads to DWARF which has differently shaped structures
in the same source location, which conflicts with the purpose of the
deduplicator, which is to use the source location to identify identical
structures and merge them.
Before now, the blacklist worked by preventing the appearance of a
structure in a module on the blacklist from making a structure shared:
it could still be shared if other modules defined it. This fails to
work for two reasons.
Firstly, if it is used by only *one* other module, we'll hit an
assertion failure in the emission phase which checks that any type it
emits is either defined in the module being emitted or in the shared
type repo.
We could remove that assertion, but we'd still be in the wrong, because
you could easily have a type in some header used by many modules that
said
struct trouble {
#ifdef MODULE_FOO
/* one definition */
#else
/* a different definition */
#endif
}
and a module that says
#define MODULE_FOO 1
#include <other_module.c>
Even if we blacklisted this module (and we would), this would still
fail, because 'struct trouble' would end up in the shared type
repository, and the existing code would fail to emit a new definition of
it in module blah, even though it should because its definition is
different.
This shows that if a module is pulling tricks like this we cannot trust
its use of shared types at all, since we have no visibility into
preprocessor definitions: regardless of the space cost (about 40KiB per
module), we cannot let it share any types whatsoever with the rest of
the kernel. Rather than piling heaps of blacklist checks in all over
dwarf2ctf to implement this, we do it by exploiting the fact that the
deduplicator works entirely on the basis of type IDs. We add a
'blacklist type prefix' that type_id() can add to the start of its IDs
(with some extra decoration, because the start of type IDs is a file
path, so we want this to be an impossible path). If we set this prefix
to the module name if and only if the module is blacklisted, and do not
add one otherwise, then every blacklisted module will have a unique set
of IDs for all its types, which can be shared within the module but not
outside it, so every type in the module will be unique and none of them
will end up in the shared type repository.
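Conceptually (identifier and helper names here are illustrative), the prefix
is applied at the very start of ID construction:
    /* At the start of type_id(): a blacklisted module gets an impossible
     * "path" prefixed to every one of its IDs, so its types can never
     * compare equal to types from any other module. */
    if (blacklist_type_prefix != NULL)
            id = str_appendn(id, "//blacklist:", blacklist_type_prefix,
                             "//", NULL);
    /* normal IDs then begin with the declaring file's path */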
While we're at it, add yet another ancient ISA sound driver that plays
the same games to the blacklist.
This fix makes blacklisting modules much more space-expensive: each such
module expands the current size of the kernel module package by about
40KiB. (But there is only one blacklisted module built in current UEK,
so this is a tolerable cost.)
Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-by: vincent.lim@oracle.com
Orabug: 26137220
 [ 437f]  structure_type
          name                 (strp)  "IO_APIC_route_entry"
          byte_size            (data1) 8
 [ 438b]    member
            name                 (strp)  "vector"
            byte_size            (data1) 4
            bit_size             (data1) 8
            bit_offset           (data1) 24
            data_member_location (data1) 0
 [ 439a]    member
            name                 (strp)  "delivery_mode"
            byte_size            (data1) 4
            bit_size             (data1) 3
            bit_offset           (data1) 21
            data_member_location (data1) 0
 [ 43a9]    member
            name                 (strp)  "dest_mode"
            byte_size            (data1) 4
            bit_size             (data1) 1
            bit_offset           (data1) 20
            data_member_location (data1) 0
 [ 43b8]    member
            name                 (strp)  "delivery_status"
            byte_size            (data1) 4
            bit_size             (data1) 1
            bit_offset           (data1) 19
            data_member_location (data1) 0
But CTF on little-endian requires the opposite: it has special handling
for the first member of a structure which assumes that it is closest to
the start of memory: in effect, it wants structure member addresses to
always ascend, even within bitfields, regardless of endianness (which
makes some sense intellectually as well).
dwarf2ctf's emission code generally emits sequentially, so except where
deduplication has eliminated items or dependent type insertion has added
them it emits things in the CTF in the same order as in the DWARF. We
can avoid this for short runs, as in this case, by switching from
iteration to recursion: spotting a run at identical
data_member_location, recursing until we hit the end of the run, then
unwinding and emitting as we unwind until the recursion is over.
Orabug: 25815129 Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-by: tomas.jedlicka@oracle.com
Nick Alcock [Thu, 2 Feb 2017 00:22:05 +0000 (00:22 +0000)]
ctf: bitfield support
Support for bitfields in dwarf2ctf was embryonic and largely untested
before now due to bugs in libdtrace-ctf: with a fix for those in hand,
we can fix bitfields here too.
Bitfields in DWARF and CTF have annoyingly different representations.
In DWARF, a bitfield is represented something like this:
 [ 16561]    member
             name                 (string) "ihl"
             decl_file            (data1)  225
             decl_line            (data1)  87
             type                 (ref4)   [    38]
             byte_size            (data1)  1
             bit_size             (data1)  4
             bit_offset           (data1)  4
             data_member_location (sdata)  0
 [...]
 [    38]    typedef
             name                 (strp)   "__u8"
             decl_file            (data1)  36
             decl_line            (data1)  20
             type                 (ref4)   [    43]
i.e. the padding, size, and starting location are all represented in the
member, where you would conceptually expect it to be.
In CTF, the starting location of the conceptual containing type of a
bitfield is encoded in the member: but the size and starting location of
the bitfield itself is represented in the dependent type, which is added
as a "non-root" type (which cannot be looked up by name) so that it can
have the same name as the un-bitfielded base type without causing a name
clash.
We use the new DIE attribute override mechanism added in commit 8935199962
to override DW_AT_bit_size and DW_AT_bit_offset for such members (fixing
a pre-existing bug in the process: we were looking for the DW_AT_bit_size
on the structure as a whole!), and in the base-type emission function
checking for the existence of a DW_AT_bit_size/offset and responding to
them by overriding the size and offset derived from DW_AT_byte_size and
noting that this is a non-root type. (The override needed, annoyingly,
is endian-dependent, because CTF consumers assume that on little-endian
systems the offset relates to the least-significant edge of the bitfield,
counting from the LSB, while DWARF assumes the opposite).
But this is not enough: unless more is done, this type will appear
to have the same type ID as its non-bitfield equivalent, leading to
confusion about which CTF file it should appear in and quite possibly
leading to it ending up in a CTF file that the structure containing the
bitfield cannot even see. So augment type_id()'s representation of
base types from e.g. 'int' to something like 'int:4' if and only if
a DW_AT_bit_size or an override of it is present and that override is
a different size from the native bitness of the type itself (the
DW_AT_byte_size). We encode the bit_offset only if there is also a
bit_size, as something like int:4:2. (That's unambiguous because
these attributes always arrive in pairs in bitfields and never appear
in anything else in C-generated DWARF.)
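A sketch of the ID-suffix logic described above (the helper names are
assumptions, not the actual dwarf2ctf functions):
    /* Only when this really is a bitfield narrower than the base type... */
    if (bit_size != 0 && bit_size != byte_size * 8) {
            char suffix[32];

            snprintf(suffix, sizeof(suffix), ":%lu", (unsigned long) bit_size);
            id = str_append(id, suffix);            /* "int" -> "int:4"   */

            if (have_bit_offset) {
                    snprintf(suffix, sizeof(suffix), ":%lu",
                             (unsigned long) bit_offset);
                    id = str_append(id, suffix);    /* -> "int:4:2"       */
            }
    }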
Finally, this breaks an optimization in the deduplicator, which was that
all structure members reference some top-level type, so when marking a
type as seen, structure members could just be skipped. Now, they have
to be chased iff they are bitfields using the same override trickery as
above to change the DW_AT_bit_size/offset in the member's type DIE, and
that bitfield override needs to be passed down to type_id() when finally
marking duplicated types as shared too. (Avoid code duplication by
factoring out some related code from a horrible conditional in
detect_duplicates() into a new type_needs_sharing() function.)
Orabug: 25815129 Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-by: tomas.jedlicka@oracle.com
Nick Alcock [Thu, 27 Apr 2017 17:39:53 +0000 (18:39 +0100)]
ctf: emit file-scope static variables
For as long as dwarf2ctf has emitted variables into the CTF, it has
emitted only extern variables. This is for two reasons: firstly, a
misconception that global variables with static linkage did not appear
in /proc/kallmodsyms (they do), and secondly, an ambiguity problem.
We cannot usefully distinguish two variables with the same name in the
same module: they differ only by address, so if both are static
variables in different translation units, we can't tell which is which.
(Also, we cannot emit more than one variable with a given name into a
given module CTF file in any case). CTF's modular rather than
translation-unit-based variable/type scope bites us here.
I sought to avoid this bug by emitting only non-static variables, but
this does not save us, because there might be another static variable
with the same name in the same module, whereupon the ambiguity problem
arises all over again. We must identify such ambiguous cases and strip
them out (not emitting CTF for this variable at all): then we can emit
static file-scope variables into the CTF without worry.
We do this by introducing a new, module-scope vars_seen hashtable into
the deduplicator state, which gains an entry for every variable name
seen in this module, and indicates whether it is static or not. This
lets us tell if we have seen a variable with a given name more than
once, and if we have, whether any of the instances was static. Then
we can consult this blacklist at variable-emission time, and skip any
variable if it is in the blacklist.
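In outline (the hash API and field names below are illustrative only), the
two phases look like this:
    /* While deduplicating this module: remember each variable name and
     * whether we have seen it before. */
    entry = vars_seen_lookup(vars_seen, var_name);
    if (entry == NULL)
            vars_seen_insert(vars_seen, var_name, is_static);
    else if (entry->is_static || is_static)
            entry->ambiguous = 1;   /* same name twice, at least one static */

    /* At variable-emission time: */
    entry = vars_seen_lookup(vars_seen, var_name);
    if (entry != NULL && entry->ambiguous)
            return;                 /* skip: emit no CTF for this variable */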
Unfortunately, computing the name for the variable's entry in the
blacklist is fairly expensive, and has to be done for every variable:
worse yet, this increases the number of variables emitted drastically
(in vmlinux and the shared type repo alone, we go from 2247 to 10409
variables), and emitting that much CTF is not free: so the runtime goes
up by about 5%. We will reclaim this lost speed soon (and then some).
Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-by: tomas.jedlicka@oracle.com
Orabug: 25962387
Nick Alcock [Thu, 9 Mar 2017 14:31:33 +0000 (14:31 +0000)]
ctf: speed up the dwarf2ctf duplicate detector some more
The duplicate detector's alias detection pass is still doing unnecessary
work. When a non-opaque structure is marked shared, it is sometimes
necessary to do another deduplication scan, because the marking may have
marked types used within the structure as shared, which may require yet
more types (e.g. opaque uses of of that type in other modules) to be
marked shared as well. So while we know this pass can only affect
structures/ unions/enums that have names, and their interior types, it
would seem that we must keep scanning them to see if they need
deduplication until none remain.
However, there is one exception: if a non-opaque type and its
corresponding opaque type are both already marked shared, or if we have
just processed them and marked them accordingly, we know that we will
never need to re-mark those particular types again, since they can't be
more shared than they already are: so we can remove them from
consideration in future passes. Because we are only opening DWARF files
in this pass as needed now, this hugely cuts down the number of files we
process in subsequent passes: we still see the same number of passes,
but passes after the first (which marks tens of thousands of opaque
types as shared) only open a few files, mark a few hundred types, and
flash past in under a second.
In my tests, the alias fixup pass now takes under 10s, which can be
more or less ignored: all other passes other than initialization
and writeout are much more expensive. (Before this series, it took
over a minute on the fastest machine I have access to, and over three
minutes on SPARC.)
Orabug: 25815306 Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-by: tomas.jedlicka@oracle.com
Nick Alcock [Tue, 7 Mar 2017 20:31:12 +0000 (20:31 +0000)]
ctf: speed up the dwarf2ctf duplicate detector
dwarf2ctf is very slow. By far its most expensive pass is the duplicate
detector. Of necessity this has to scan every object file in the kernel
to identify shared types, which is pretty expensive even when the cache
is hot, but it's not doing this particularly efficiently and can be sped
up quite a lot.
As of commit f0d0ebd0b4, the duplicate detector's job was cut in two:
the first pass identifies all non-structure types and non-opaque
structures used by more than one module, and the second "alias fixup"
pass repeatedly unifies opaque and non-opaque structures with the same
name in different translation units, sharing their definitions. Both of
these passes generate or modify type IDs, so both need access to the
DWARF DIE of the types in question, since the type ID is derived
recursively from the DIE: but the second pass need not look at the DWARF
of any translation units that do not contain structures that might be
unified. However, the two are currently written in the same way, using
process_file() to traverse all the kernel's DWARF, even though the alias
fixup pass does almost nothing with that DWARF, and has less and less to
do on each iteration.
The sheer amount of wasted time here is remarkable. We traverse the
DWARF once for primary duplicate detection, once for CTF emission, but
often four or five or even seven or eight times for the alias fixup pass
(the precise count depends upon the relationships between types and the
order in which the DWARF files are traversed).
So improve things by tracking all types that the alias fixup pass is
interested in (structure types that are not anonymous inner structures
nor opaque nor used only as the types of array indices) and stash them
away during the first duplicate detection pass in a new temporary
singly-linked list, detect_duplicates_state.named_structs. We remember
the filename and DWARF DIE offset (so we can look the type up again) and
the type ID (because we just worked it out, so recomputing it would be a
waste of time). Then, rather than doing a process_file() for the alias
fixup pass, traverse the linked list, opening DWARF files as needed to
mark things as shared (but no more often than that: marking non-opaque
types needs the DIE so we can traverse into all its subtypes and mark
them shared too, but marking opaque types needs no DIE at all).
This has a significant effect on dwarf2ctf speed, slashing 25% off its
runtime in my tests, reducing the duplicate detector's share of the
runtime from about 80s to about 24s.
The dominant time consumer is now CTF emission rather than the
duplicate detector.
Orabug: 25815306 Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-by: tomas.jedlicka@oracle.com
Nick Alcock [Thu, 2 Feb 2017 00:23:16 +0000 (00:23 +0000)]
ctf: fix the size of int and avoid duplicating it
An upcoming bitfield-capable release of libdtrace-ctf adds extra
consistency checking which identified embarrassing but heretofore
harmless bugs in dwarf2ctf's representation of basic types.
dwarf2ctf has to emit a few basic types "by hand", since these are used
in the representation of types one of CTF or DWARF does not bother to
encode more complex forms for (function pointers, always encoded as
'int (*)()' in CTF, and 'void'). Embarrassingly, we were getting the
size of 'int' wrong: it should be in bits but we were emitting a count
of bytes instead, leading to a CTF representation of a 4-bit int.
This is always overridden by an accurate representation built into
DTrace in real use, but libdtrace-ctf finds the inconsistency anyway.
Worse is that it emits a representation of 'int' twice, once by hand in
init_ctf_table() and then again when it comes across the real 'int' type
DIE in the debuginfo. This is because we forgot to intern the types we
add by hand in the id_to_module and id_to_type hashes we use to detect
duplicate types, because we can only intern types we have a DWARF DIE
for, and these types either have no DIE at all or we just don't know
about it because we haven't even begun to traverse the debuginfo to look
up DIEs yet.
Fix this by adding a mark_shared_by_name() function which can be used to
intern basic types by name in these hashes. It has hardwired knowledge
of the type ID notation, but no more such knowledge than is already
present in detect_duplicates_alias_fixup() and friends (no more than
that types with no associated filename or line number are preceded by
"////".)
Orabug: 25815129 Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-by: tomas.jedlicka@oracle.com
Nick Alcock [Wed, 1 Feb 2017 23:51:21 +0000 (23:51 +0000)]
ctf: allow overriding of DIE attributes: use it for parent bias
Currently, dwarf2ctf's CTF type emission machinery emits any given CTF
type based purely on the data in the corresponding DWARF DIE, with
possible adjustments based on the DIE's structural parent (useful for
structure members, if not so much for top-level types). But from time
to time we need to adjust a CTF type depending on some property not of
its structural parent but of a type that depends on it: i.e., for
structures, we might want to adjust the offset of the members to cater
for the fact that this structure is being structurally merged with a
shorter assignment-compatible structure with identical name which
appeared in some translation unit that was processed earlier.
Currently, we handle this case, and only this case, by passing down a
'parent_bias' to all the CTF assembly functions. Replace this with a
more generic mechanism whereby an array of 'overrides' can be passed
down to construct_ctf_id(), die_to_ctf(), and all subordinate assembly
functions: these overrides consist of an array of die_override_t's,
where each element can either override or add to the value of one DWARF
attribute: this kicks in for specific DWARF tags only. (Only numerical
attributes are supported, obviously.)
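An illustrative shape for one such override entry (the field names are
assumptions, not the actual declaration):
    typedef struct die_override {
            int        tag;        /* DWARF tag this applies to, e.g. DW_TAG_member */
            int        attribute;  /* attribute to touch, e.g. DW_AT_bit_size       */
            enum { DIE_OVERRIDE_REPLACE,
                   DIE_OVERRIDE_ADD } op;   /* replace the value, or add to it      */
            Dwarf_Word value;
    } die_override_t;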
(We also pass the overrides down to type_id() so that overrides can
affect the ID of types and thus cause a single DWARF type to generate
multiple CTF types, though we do not use this facility in this commit.)
To process the attributes, we introduce a new private_find_override() to
search the override list, and private_dwarf_udata() to fetch a udata and
handle it. (We do not override dwarf_hasattr(): anything that wants to
override an attribute that may not exist has to call
private_find_override() itself. If this happens a lot, we can introduce
an override for dwarf_hasattr() too.)
Currently we use this in exactly one place: in assemble_ctf_su_member(),
to replace the use of parent_bias. Further uses will come in the next
commit: thanks to this commit, none of them will require adding new
parameters to all the CTF construction functions :)
Also rename the 'override' parameter on the CTF construction functions,
which was used by array assembly to indicate that CTF types should
replace their parent type, with a much less confusingly-named 'replace'
parameter. (It was badly named before, but now that we have a parameter
named 'overrides' it is devastatingly badly named.)
Orabug: 25815129 Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-by: tomas.jedlicka@oracle.com
Alan Maguire [Wed, 12 Apr 2017 21:34:56 +0000 (22:34 +0100)]
DTrace tcp/udp provider probes
This patch adds DTrace SDT probes for the TCP and UDP protocols. For tcp
the following probes are added:
tcp:::send                Fires when a tcp segment is transmitted
tcp:::receive             Fires when a tcp segment is received
tcp:::state-change        Fires when a tcp connection changes state
tcp:::connect-request     Fires when a SYN segment is sent
tcp:::connect-refused     Fires when a RST is received for connection attempt
tcp:::connect-established
                          Fires when three-way handshake completes for
                          initiator
tcp:::accept-refused      Fires when a RST is sent refusing connection attempt
tcp:::accept-established
                          Fires when three-way handshake succeeds for acceptor
Arguments for all of these probes are:
arg0 struct sk_buff *; to be translated into pktinfo_t * containing
implementation-independent packet data
arg1 struct sock *; to be translated into csinfo_t * containing
implementation-independent connection data
arg2 __dtrace_tcp_void_ip_t *; to be translated into ipinfo_t * containing
implementation-independent IP information. Custom type is used as
this gives DTrace a hint that we can source IP information from other
arguments if the IP header is not available.
arg3 struct tcp_sock *; to be translated into tcpsinfo_t * containing
implementation-independent TCP connection data
arg4 struct tcphdr *; to be translated into a tcpinfo_t * containing
implementation-independent TCP header data
arg5 int representing previous state; to be translated into a
tcplsinfo_t * which contains the previous state. Differs from
current state (arg6) for state-change probes only.
arg6 int representing current state. Cannot be sourced from struct
tcp_sock as we sometimes need to probe before state change is
reflected there
arg7 int representing direction of traffic for probe; values are
DTRACE_NET_PROBE_INBOUND for receipt of data and
DTRACE_NET_PROBE_OUTBOUND for transmission.
For udp the following probes are added:
udp:::send Fires when a udp datagram is sent
udp:::receive Fires when a udp datagram is received
Arguments for these probes are:
arg0 struct sk_buff *; to be translated into pktinfo_t * containing
implementation-independent packet data
arg1 struct sock *; to be translated into csinfo_t * containing
implementation-independent connection data
arg2 void_ip_t *; to be translated into ipinfo_t * containing
implementation-independent IP information.
arg3 struct udp_sock *; to be translated into a udpsinfo_t * containing
implementation-independent UDP connection data
arg4 struct udphdr *; to be translated into a udpinfo_t * containing
implementation-independent UDP header information.
Kris Van Hees [Wed, 24 May 2017 03:34:53 +0000 (23:34 -0400)]
dtrace: ensure limit is enforced even when pcs is NULL
The dtrace_user_stacktrace() functions for x86_64 and sparc64 were
not handling the specified limit (st->limit correctly if the buffer
for PC values (st->pcs) was NULL. This commit ensures that we
decrement the limit whenever we encounter a PC, whether it gets
stored or not.
Orabug: 25949692 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Acked-by: Nick Alcock <nick.alcock@oracle.com>
Kris Van Hees [Wed, 24 May 2017 05:00:34 +0000 (01:00 -0400)]
dtrace: ensure ustackdepth returns correct value
The implementation for ustackdepth was causing it to always return 1,
regardless of the depth of the ustack(). This commit ensures that the
underlying code can walk the stack (without actually collecting PCs)
and determine the depth.
Orabug: 25949692 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Acked-by: Nick Alcock <nick.alcock@oracle.com>
Kris Van Hees [Mon, 22 May 2017 15:16:45 +0000 (11:16 -0400)]
dtrace: FBT return probes on x86_64 run with in_irq() true
Because FBT return probes are implemented on x86_64 by means of a
breakpoint trap (int3), and because int3 (on Linux) causes HARDIRQ
to be incremented in the preempt counter, the DTrace core thinks
that the probe was triggered from IRQ context (which it may or may
not be).
This commit ensures that we can detect whether we're processing a
probe triggered using int3, and if so, it subtracts from the HARDIRQ
counter before testing it (to compensate for the int3-imposed
increment).
Orabug: 26089286 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Reviewed-by: Tomas Jedlicka <tomas.jedlicka@oracle.com>
Kris Van Hees [Tue, 16 May 2017 13:59:32 +0000 (09:59 -0400)]
dtrace: different probe trigger instruction for entry vs return
On x86_64, we cannot use the LOCK prefix byte to consistently cause an
invalid opcode trap for FBT return probes because the 'ret' instruction
may be followed by an instruction that can validly take the LOCK prefix.
So, we use a different trigger instruction (int3).
In order to make this possible, the functions to set and clear FBT
probes on x86_64 (dtrace_invop_add() and dtrace_invop_remove()) have
been modified to accept a 2nd argument that indicates the instruction
to patch the probe location with. This is needed because FBT return
probes need a different instruction on x86_64 (LOCK prefix to force
an invalid opcode trap isn't safe because we do not know what
instruction may follow the "ret").
Orabug: 25949048 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Reviewed-by: Tomas Jedlicka <tomas.jedlicka@oracle.com> Acked-by: Nick Alcock <nick.alcock@oracle.com>
Kris Van Hees [Tue, 16 May 2017 03:29:16 +0000 (23:29 -0400)]
dtrace: make x86_64 FBT return probe detection less restrictive
The FBT return probe detection mechanism on x86_64 was requiring that
the "ret" instruction be followed by a "push %rbp" or "nop", which is
much too restrictive. The new code allows probing of all "ret"
instructions that occur in a function regardless of what instruction
follows.
In order to make this possible, the functions to set and clear FBT
probes on x86_64 (dtrace_invop_add() and dtrace_invop_remove()) have
been modified to accept a 2nd argument that indicates the instruction
to patch the probe location with. This is needed because FBT return
probes need a different instruction on x86_64 (LOCK prefix to force
an invalid opcode trap isn't safe because we do not know what
instruction may follow the "ret").
This commit also fixes the declaration of the dtrace_bad_address()
function that was missing its return type.
Orabug: 25949048 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Acked-by: Nick Alcock <nick.alcock@oracle.com>
Kris Van Hees [Tue, 16 May 2017 03:25:09 +0000 (23:25 -0400)]
dtrace: support passing offset as arg0 to FBT return probes
FBT return probes pass the offset from the function start (in bytes)
as arg0. To make that possible, we pass the offset value in the call
to fbt_add_probe. For FBT entry probes we pass 0 (which is ignored).
Orabug: 25949086 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Acked-by: Nick Alcock <nick.alcock@oracle.com>
Kris Van Hees [Tue, 16 May 2017 03:05:41 +0000 (23:05 -0400)]
dtrace: make FBT entry probe detection less restrictive on x86_64
The logic on x86_64 to determine whether we can probe a function is
too restrictive. By placing the probe on the "push %rbp" instruction
we can cover more functions, in case the "mov %rsp,%rbp" instruction
does not follow it immediately.
Orabug: 25949030 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Reviewed-by: Tomas Jedlicka <tomas.jedlicka@oracle.com>
Kris Van Hees [Tue, 16 May 2017 02:42:30 +0000 (22:42 -0400)]
dtrace: adjust FBT entry probe detection for OL7
On OL7, function prologues can be prefixed by a (5-byte) call
instruction on x86_64, which breaks the logic to determine if
we can place an FBT entry probe on that function. The new logic
accounts for the possibility that the anticipated prologue does
not show up as first instruction of the function.
Orabug: 25921361 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Reviewed-by: Nick Alcock <nick.alcock@oracle.com>
Kris Van Hees [Tue, 16 May 2017 13:55:41 +0000 (09:55 -0400)]
dtrace: support passing offset as arg0 to FBT return probes
FBT return probes pass the offset from the function start (in bytes)
as arg0. To make that possible, we pass the offset value in the call
to fbt_add_probe. For FBT entry probes we pass 0 (which is ignored).
This commit also ensures that we emulate the 'ret' instruction on the
return path.
Orabug: 25949086 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Reviewed-by: Tomas Jedlicka <tomas.jedlicka@oracle.com> Acked-by: Nick Alcock <nick.alcock@oracle.com>
Kris Van Hees [Tue, 16 May 2017 13:38:04 +0000 (09:38 -0400)]
dtrace: improve probe execution debugging
The debugging code for probe execution had a few cases where the start
of execution was logged in debugging output but the completion was not
because of early termination conditions. Now all forms of completion
should be covered in debugging output.
Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Reviewed-by: Tomas Jedlicka <tomas.jedlicka@oracle.com> Reviewed-by: Nick Alcock <nick.alcock@oracle.com>
Nick Alcock [Mon, 24 Apr 2017 11:14:05 +0000 (12:14 +0100)]
dtrace: canload() for input of *_ntop(), *_nto*()
These functions (some newly added, some older) were not appropriately
checking if the caller could load from their inputs, so could be
used by the not-yet-implemented unprivileged DTrace to read arbitrary
memory.
Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-by: Tomas Jedlicka <tomas.jedlicka@oracle.com>
The signature of the link_ntop() DTrace subroutine is:
string link_ntop(int hardware_type, void *addr);
link_ntop() takes a pointer to a hardware address and returns a string
which is the translation of that address to a string representation,
with content depending on the provided hardware type. Supported
hardware types are ARPHRD_ETHER and ARPHRD_INFINIBAND, both of which
are defined for use in D programs.
This is the link-level equivalent of inet_ntop().
Signed-off-by: Girish Moodalbail <girish.moodalbail@oracle.com> Reviewed-by: Nick Alcock <nick.alcock@oracle.com>
[nca: reworded commit message a bit]
Orabug: 25931479
The dtrace_dif_variable() function is inlined during some compilations
and not during others, so the number of frames to skip in DTrace kernel
stack traces is not a constant. That causes incorrect
values for stackdepth to be reported.
This commit requests that dtrace_dif_variable() always be inlined, and
adjusts the aframes values to account for the inlining.
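In practice this is the standard kernel annotation; the prototype below is
abbreviated and the parameter list is an assumption:
    static __always_inline uint64_t dtrace_dif_variable(dtrace_mstate_t *mstate,
                                                         dtrace_state_t *state,
                                                         uint64_t v, uint64_t ndx);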
Orabug: 25872472 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Reviewed-by: Tomas jedlicka <tomas.jedlicka@oracle.com> Acked-by: Nick Alcock <nick.alcock@oracle.com>
dtrace: fix handling of save_stack_trace sentinel (x86 only)
On x86 only, when save_stack_trace() writes less stack frames to the
buffer than there is space for, a ULONG_MAX is added as sentinel. The
DTrace code was mistakenly treating the buffer as always ending with a
ULONG_MAX.
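The corrected handling amounts to something like this (variable names are
illustrative; struct stack_trace fields as in the kernels of that era):
    save_stack_trace(&trace);
    nframes = trace.nr_entries;

    /* x86 appends a ULONG_MAX sentinel only when the buffer was not
     * completely filled; strip it only in that case. */
    if (nframes > 0 && nframes < trace.max_entries &&
        trace.entries[nframes - 1] == ULONG_MAX)
            nframes--;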
Orabug: 25727046 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Reviewed-by: Tomas Jedlicka <tomas.jedlicka@oracle.com>
Signed-off-by: Tomas Jedlicka <tomas.jedlicka@oracle.com> Reviewed-by: Kris Van Hees <kris.van.hees@oracle.com> Acked-by: Nick Alcock <nick.alcock@oracle.com>
Kris Van Hees [Wed, 15 Mar 2017 16:52:01 +0000 (12:52 -0400)]
dtrace: incorrect aframes value and wrong logic messes up caller and stack
Due to a mistake in how we compensate for the potential ULONG_MAX
sentinel value being added to kernel stacks on x86_64 (by the
save_stack_trace() function), the caller was always reported as 0.
This in turn was hiding a problem with the aframes values that are
used to ensure we skip the right amount of frames when reporting a
stack, caller, and calculating the stackdepth. Effectively, it tells
the stack walker how many frames were added to the stack due to DTrace
processing.
Orabug: 25727046 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com>
Kris Van Hees [Wed, 15 Mar 2017 03:20:52 +0000 (23:20 -0400)]
dtrace: ensure we pass a limit to dtrace_stacktrace for stackdepth
When determining the (kernel) stackdepth, we pass scratch memory to the
dtrace_stacktrace() function because we are not interested in the actual
program counter values. However, we were passing in 0 as limit rather
than the actual maximum number of PCs that could fit in the remaining
scratch memory space.
We now also add no-fault protection to dtrace_getstackdepth().
Orabug: 25559321 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com>
Kris Van Hees [Wed, 1 Mar 2017 04:37:11 +0000 (23:37 -0500)]
dtrace: continuing the FBT implementation and fixes
This commit continues the implementation of Function Boundary Tracing
(FBT) and fixes various problems with the original implementation and
other things in DTrace that it caused to break. It is done as a single
commit due to the intertwined nature of the code it touches.
1. We were only handling unaligned memory access traps as part of the
NOFAULT access protection. This commit adds handling of data access and
instruction access traps as well.
2. When an OOPS takes place, we now add output about whether we are
in DTrace probe context and what the last probe was that was being
processed (if any). That last data item isn't guaranteed to always
have a valid value. But it is helpful.
3. New ustack stack walker implementation (moved from module to kernel
for consistency and because we need access to low level structures
like the page tables) for both x86 and sparc. The new code avoids
any locking or sleeping. The new user stack walker is accessed as a
sub-function of dtrace_stacktrace(), selected using the flags
field of stacktrace_state_t.
4. We added a new field to the dtrace_psinfo_t structure (ustack) to
hold the bottom address of the stack. This is needed in the stack
walker (specifically for x86) to know when we have reached the end
of the stack. It is initialized from copy_process (in DTrace
specific code) when stack_start is passed as parameter to clone.
It is also set from dtrace_psinfo_alloc() (which is generally called
from performing an exec), and there it gets its value from the
mm->start_stack value.
5. The FBT black lists have been updated with functions that may be
invoked during probe processing. In addition, for x86_64 we added
explicit filter out of functions that start with insn_* or inat_*
because they are used for instruction analysis during probe
processing.
6. On sparc64, per-cpu data is accessed by means of a global register
that holds the base address for this memory area. Some assembler
code clobbers that register in some cases, so it is not safe to
depend on this in probe context. Instead, we explicitly access
the data based on the smp_processor_id().
7. We added a new CPU DTrace flag (CPU_DTRACE_PROBE_CTX) to flag that
we are processing in DTrace probe context. It is primarily used
to detect attempts of re-entry into dtrace_probe().
Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Acked-by: Nick Alcock <nick.alcock@oracle.com>
Orabug: 21220305
Orabug: 24829326
Kris Van Hees [Mon, 27 Feb 2017 15:39:07 +0000 (10:39 -0500)]
dtrace: ensure DTrace can use get_user_pages safely
The processing of the DTrace-specific FOLL_IMMED flag was not robust
enough. We could still get into a situation where cond_resched() was
called (which is bad) or where the VMA area would get extended (which
is also bad). The only code that passes this flag is DTrace support
code, and when the flag is not passed, the execution flow is not at all
affected.
Orabug: 25640153 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Reviewed-by: Chuck Anderson <chuck.anderson@oracle.com> Reviewed-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-by: Tomas Jedlicka <tomas.jedlicka@oracle.com>
Kris Van Hees [Fri, 24 Feb 2017 23:40:40 +0000 (18:40 -0500)]
dtrace: enable paranoid mode and IST shift for xen_int3
The Xen PVM path into an INT3 trap was not using paranoid=1 mode nor was
it using an IST shift as is done for HW INT3 traps. This interferes with
the instruction emulation code check based on the handler return value.
Orabug: 25580519 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Kris Van Hees [Mon, 20 Feb 2017 12:16:48 +0000 (07:16 -0500)]
dtrace: ensure we skip the entire SDT probe point
With the introduction of FBT support, the logic for skipping instructions
(with potential emulation of the skipped instruction) changed. This change
did not take into account the fact that is-enabled probes on x86_64 use a
3-byte sequence for setting ax to 0, followed by a 2-byte NOP. The old logic
resulted in failing to skip the setting of ax correctly.
New logic uses the knowledge that all SDT probes on x86_64 are of the same
length (ASM_CALL_SIZE) and therefore we can simply skip that number of bytes
and continue without any emulation.
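So the handling reduces to a fixed skip, roughly as below (a sketch of the
idea only; the handler's exact return convention is described in the
invop-handler commit further down this log):
    /* All SDT probe sites on x86_64 are ASM_CALL_SIZE bytes long, so the
     * trap handler can simply report that the whole site should be
     * skipped; nothing needs to be emulated. */
    return ASM_CALL_SIZE;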
Orabug: 25557283 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Acked-by: Nick Alcock <nick.alcock@oracle.com>
Kris Van Hees [Fri, 3 Mar 2017 02:02:01 +0000 (21:02 -0500)]
dtrace: continuing the FBT implementation and fixes
This commit continues the implementation of Function Boundary Tracing
(FBT) and fixes various problems with the original implementation and
other things in DTrace that it caused to break. It is done as a single
commit due to the intertwined nature of the code it touches.
1. The sparc64 fast path implementation (dtrace_caller) for the D 'caller'
variable was trampling the %g4 register which Linux uses to hold the
'current' task pointer. By passing in a dummy argument, we ensure
that we can use the %i1 register to temporarily store %g4.
2. For consistency, we are now using stacktrace_state_t instead of
struct stacktrace_state.
3. We now call dtrace_stacktrace() under NOFAULT protection.
4. The ustack stack walker has been rewritten (in the kernel), so the
previous implementation has been removed.
5. We no longer process probes when the kernel panics, to avoid DTrace
disrupting output that could be crucial to debugging.
6. We now ensure that re-entry of dtrace_probe() can no longer happen,
except for the ERROR probe (which is a re-entry by design).
7. Since FBT now works, the restriction to only support SyS_* functions
has been removed.
Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Signed-off-by: Tomas Jedlicka <tomas.jedlicka@oracle.com> Reviewed-by: Nick Alcock <nick.alcock@oracle.com>
Orabug: 21220305
Orabug: 24829326
Alan Maguire [Mon, 23 Jan 2017 15:18:31 +0000 (15:18 +0000)]
dtrace: introduce and use typedef in6_addr_t
This is for consistency with the similar typedef in_addr_t: we have
to use the typedef in at least one place in the module so that the
compiler incorporates it into the DWARF and it ends up in the CTF
section. (Both the DTrace ip translators and, likely, the users
would expect that if one typedef exists, the other one does too.)
Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Acked-by: Nick Alcock <nick.alcock@oracle.com>
Orabug: 25557554
Kris Van Hees [Sat, 17 Dec 2016 23:08:44 +0000 (18:08 -0500)]
dtrace: SDT cleanup and bring in line with kernel
This commit performs some cleanup on the SDT provider, removing some
housekeeping tasks that are no longer needed (such as the need for an
arch-specific sdt_provide_module_arch() function).
This commit also contains a fix for the loop used in enabling and
disabling probes. It was failing to ensure that the enable/disable
function was being called with the correct SDT probe.
Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com>
Kris Van Hees [Sat, 17 Dec 2016 23:08:44 +0000 (18:08 -0500)]
dtrace: fix preemption checks
The macros to verify whether the current execution can be preempted
were wrong. This commit fixes that. It also ensures that we call the
functions (or macros) provided for enabling/disabling preemption by
the kernel itself.
Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com>
Kris Van Hees [Sat, 17 Dec 2016 23:08:44 +0000 (18:08 -0500)]
dtrace: when calling all modules do not forget kernel
For DTrace, the kernel is represented as a pseudo-module. When a loop
is made over all loaded modules in order to call a function for each
one of them, we need to also call that function for the kernel
pseudo-module.
Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com>
Kris Van Hees [Sat, 17 Dec 2016 23:08:44 +0000 (18:08 -0500)]
dtrace: remove cleanup_module support
There is no need anymore for providers to call a cleanup_module
function in provider modules. The functionality that this function
provided is being rewritten.
Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com>
Kris Van Hees [Wed, 23 Nov 2016 18:24:10 +0000 (18:24 +0000)]
dtrace: is-enabled probes for SDT
This is the module side of the is-enabled probe implementation. SDT
distinguishes is-enabled probes from normal probes by the leading ? in
their sdpd_name; at probe-firing time, the arch-dependent code arranges
to return 1 appropriately.
On x86, also arrange to jump past the probe's NOP region. There was no
need to do this before now, because a trap followed by a bunch of NOPs
is a perfectly valid instruction stream: but is-enabled probes have a
three-byte sequence implementing "xor %rax, %rax", and overwriting only
the first byte of that leaves us with a couple of bytes that must be
skipped. On SPARC, we drop the necessary return-value-changing
instruction into the delay slot of the call that used to be there
before we overwrote it with NOPs: the instruction already there
is setting up the function argument-and-return-value, which is 0
when the probe is disabled, so we can overwrite it safely.
(We make minor adjustments to allow sdt_provide_probe_arch() to
safely modify the sdp_patchpoint.)
Finally, add a test use of an is-enabled probe to dt_test, used by the
DTrace testsuite.
[nca: sparc implementation, ip address adjustment, commit msg] Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Acked-by: Kris Van Hees <kris.van.hees@oracle.com>
Orabug: 25143173
Kris Van Hees [Fri, 16 Dec 2016 23:54:10 +0000 (18:54 -0500)]
dtrace: function boundary tracing (FBT)
This patch implements Function Boundary Tracing (FBT) for x86 and
sparc. It covers generic, x86 and sparc specific changes.
Generic:
A new exported function (which will be provided by each supported
architecture) dtrace_fbt_init() is added to be called from module
code to initiate the discovery of FBT probes. Any eligible probe
(based on arch-dependent logic) is passed to the provided callback
function pointer for processing.
A new option is added to the DTrace kernel config to enable FBT.
The logic for determining the size of the pdata memory block (which
is arch-dependent) has changed at the architecture level, and the
code for setting up the kernel pseudo-module has been modified to
account for that change.
The post-processing script dtrace_sdt.sh is now determining how
many functions exist in the kernel as an upper bound for the number
of functions that can be traced with FBT. The logic ensures that
aliases are not counted.
x86:
On x86_64, entry FBT probes are implemented similarly to SDT probes on
that same architecture. A trap is triggered from the probe location,
which causes a call into the DTrace core. The entry probe is placed on
the 'mov %rsp,%rbp' instruction that immediately followed a 'push %rbp'
as part of the function prologue. If this instruction sequence is not
found, no entry probe will be created for the function.
sparc:
On sparc64, entry FBT probes are also implemented similar to SDT
probes on that same architecture. A call is made into a trampoline
(allocated within the limited addressing range for a single assembler
instruction) which in turn calls into the DTrace core. The entry probe
is placed on the location where typically a call is made to _mcount for
profiling purposes. Under normal operation, that instruction will be
a NOP.
Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Acked-by: David Mc Lean <david.mclean@oracle.com> Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com> Reviewed-by: Allen Pais <allen.pais@oracle.com>
Orabug: 21220305
Orabug: 24829326
Kris Van Hees [Thu, 22 Dec 2016 07:20:57 +0000 (02:20 -0500)]
dtrace: add support for passing return value from trap handlers
Prior to this patch, trap handlers were called to service traps without
any mechanism to report back to the lowest level trap entry point. The
DTrace FBT implementation on x86 needs to be able to do just that because
FBT probes are enabled by replacing a one-byte assembler instruction with
a one-byte instruction that causes a trap. After the trap is handled, we
need to emulate the instruction that was replaced prior to returning to
the original instruction stream. Because different instructions may occur
at FBT probe points, we need to be able to report back to the trap entry
point which instruction was replaced by the trap.
Trap handlers that do not use notify_die() always return 0. Those that do use
notify_die() to call handlers have been modified to return the value that
the handler itself returned.
Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Reviewed-By: Dan Duval <dan.duval@oracle.com>
Orabug: 25312278
Kris Van Hees [Fri, 16 Dec 2016 18:04:52 +0000 (13:04 -0500)]
dtrace: ensure that our die notifier gets executed amongst the first
The die notifier is crucial for implementing safe memory access in
DTrace. To avoid other handlers potentially causing interference, we
set the priority of our handler at the highest level.
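A minimal sketch of the idea (handler name and exact priority value are
illustrative assumptions, not the actual patch):

    #include <linux/kdebug.h>
    #include <linux/notifier.h>

    static struct notifier_block dtrace_die_nb = {
            .notifier_call  = dtrace_die_handler,  /* hypothetical handler */
            .priority       = INT_MAX,    /* run before other die handlers */
    };

    /* at init time */
    register_die_notifier(&dtrace_die_nb);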
Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Acked-by: David Mc Lean <david.mclean@oracle.com>
Nick Alcock [Mon, 28 Nov 2016 14:53:15 +0000 (14:53 +0000)]
dtrace: allow invop handler to specify number of insns to skip
Rather than unconditionally skipping one instruction, repurpose the
previously unused return value of the invop handler to let it specify
the number of instructions to skip. (In the common case, where the
handler always knows how many instructions to skip, this is more
efficient.)
Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Acked-by: Kris Van Hees <kris.van.hees@oracle.com>
Nick Alcock [Wed, 23 Nov 2016 17:50:09 +0000 (17:50 +0000)]
dtrace: is-enabled probes for SDT
"Is-enabled probes" are a conditional, long supported in userspace
probing, which lets you avoid doing expensive data-collection operations
needed only by DTrace probes unless those probes are active.
e.g. (an example using the core DTRACE_PROBE / DTRACE_IS_ENABLED macros,
rather than the DTRACE_providername macros used in practice, because
no such macros have been added to the kernel yet):
    if (DTRACE_IS_ENABLED(__io_wait__start)) {
            /* stuff done only when io:::wait-start is enabled */
    }
As with normal SDT probes, the DTRACE_IS_ENABLED() macro compiles to a
stub function call (named like __dtrace_isenabled_*()) which is replaced
at bootup/module load time with an architecture-dependent instruction
sequence analogous to a function that always returns false, though no
function call is generated. At probe enabling time, this is replaced
with a trap into dtrace just like normal dtrace probes, incurring a
performance hit, but only when the probe is active.
The probe name used in the various ELF sections that track SDT
probes begins with a ? character to help the module distinguish
is-enabled probes from normal probes: this is internal to the DTrace
implementation and is otherwise invisible.
(Thanks to Kris Van Hees for initial work on this.)
Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Acked-by: Kris Van Hees <kris.van.hees@oracle.com>
Orabug: 25143173
Nick Alcock [Mon, 31 Oct 2016 13:12:56 +0000 (13:12 +0000)]
dtrace: check for errors when getting a new fd
get_unused_fd_flags() can legitimately fail, e.g. when the fd table is
full. We need to diagnose that rather than trying to fd_install() the
resulting negative number.
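A minimal sketch of the intended pattern (hypothetical caller; 'file'
is assumed to come from elsewhere):

    int fd = get_unused_fd_flags(O_CLOEXEC);

    if (fd < 0)
            return fd;      /* e.g. -EMFILE when the fd table is full */
    fd_install(fd, file);
    return fd;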
Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Acked-by: Kris Van Hees <kris.van.hees@oracle.com> Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Orabug: 24977175
Nick Alcock [Mon, 31 Oct 2016 10:44:26 +0000 (10:44 +0000)]
dtrace: take mmap_sem in PTRACE_GETMAPFD
Without this, we may oops if the process exec()s and discards its
address space after we find_vma().
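A minimal sketch of the locking pattern (a simplified illustration, not
the literal patch):

    down_read(&mm->mmap_sem);
    vma = find_vma(mm, addr);
    if (vma && vma->vm_file)
            file = get_file(vma->vm_file);  /* vma cannot go away under us */
    up_read(&mm->mmap_sem);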
Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Acked-by: Kris Van Hees <kris.van.hees@oracle.com> Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Orabug: 24977175
Junxiao Bi [Tue, 1 Nov 2016 06:42:20 +0000 (14:42 +0800)]
ocfs2: fix not enough credit panic
The following panic was caught when running the ocfs2 disconfig single
test (block size 512 and cluster size 8192). ocfs2_journal_dirty()
returned -ENOSPC, which means the credits were used up. The total credit
should include 3 times the "num_dx_leaves" from ocfs2_dx_dir_rebalance(),
because 2 of them will be consumed in ocfs2_dx_dir_transfer_leaf() and
1 will be consumed in ocfs2_dx_dir_new_cluster()->
__ocfs2_dx_dir_new_cluster()->ocfs2_dx_dir_format_cluster(). But only
2 are included in ocfs2_dx_dir_rebalance_credits(); fix it.
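Conceptually, the credit accounting described above looks like this
(a sketch of the arithmetic, not the literal diff):

    /* in ocfs2_dx_dir_rebalance_credits(), conceptually: */
    credits += 2 * num_dx_leaves;  /* consumed in ocfs2_dx_dir_transfer_leaf() */
    credits += 1 * num_dx_leaves;  /* consumed via ocfs2_dx_dir_format_cluster() */
    /* previously only 2 * num_dx_leaves was reserved, so the journal ran
       out of credits and ocfs2_journal_dirty() returned -ENOSPC */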
Eric Ren [Fri, 30 Sep 2016 22:11:32 +0000 (15:11 -0700)]
ocfs2: fix deadlock on mmapped page in ocfs2_write_begin_nolock()
The testcase "mmaptruncate" of ocfs2-test deadlocks occasionally.
In this testcase, we create a 2*CLUSTER_SIZE file and mmap() it;
there are 2 processes repeatedly performing the following operations:
one keeps doing memset(mmaped_addr + 2*CLUSTER_SIZE - 1, 'a', 1),
while the other keeps calling ftruncate(fd, 2*CLUSTER_SIZE) and then
ftruncate(fd, CLUSTER_SIZE), again and again.
In ocfs2_write_begin_nolock(), we first grab the pages and then allocate
disk space for this write; ocfs2_try_to_free_truncate_log() will be
called if -ENOSPC is returned; if we are lucky enough to get the needed
clusters, which is usually the case, we start over again.
But in ocfs2_free_write_ctxt() the target page isn't unlocked, so we
will deadlock when trying to grab the target page again.
Also, -ENOMEM might be returned in ocfs2_grab_pages_for_write().
Another deadlock will happen in __do_page_mkwrite() if
ocfs2_page_mkwrite() returns something other than VM_FAULT_LOCKED
while leaving the target page locked.
These two errors fail on the same path, so fix them by unlocking the
target page manually before ocfs2_free_write_ctxt().
Jan Kara helped me sort out the JBD2 part and suggested the hint for
the root cause.
Changes since v1:
1. Also put ENOMEM error case into consideration.
Link: http://lkml.kernel.org/r/1474173902-32075-1-git-send-email-zren@suse.com Signed-off-by: Eric Ren <zren@suse.com> Reviewed-by: He Gang <ghe@suse.com> Acked-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit c33f0785bf292cf1d15f4fbe42869c63e205b21c)
Joseph Qi [Mon, 19 Sep 2016 21:43:55 +0000 (14:43 -0700)]
ocfs2/dlm: fix race between convert and migration
Commit ac7cf246dfdb ("ocfs2/dlm: fix race between convert and recovery")
checks whether the lockres master has changed to identify whether the
new master has finished recovery or not. This introduces a race: right
after the old master umounts (which means the master will change), a new
convert request comes in.
In this case, it will reset lockres state to DLM_RECOVERING and then
retry convert, and then fail with lockres->l_action being set to
OCFS2_AST_INVALID, which will cause inconsistent lock level between
ocfs2 and dlm, and then finally BUG.
Since dlm recovery will clear lock->convert_pending in
dlm_move_lockres_to_recovery_list, we can use it to correctly identify
the race case between convert and recovery. So fix it.
Fixes: ac7cf246dfdb ("ocfs2/dlm: fix race between convert and recovery") Link: http://lkml.kernel.org/r/57CE1569.8010704@huawei.com Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Jun Piao <piaojun@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit e6f0c6e6170fec175fe676495f29029aecdf486c)
jiangyiwen [Fri, 25 Mar 2016 21:21:35 +0000 (14:21 -0700)]
ocfs2: solve a problem of crossing the boundary in updating backups
In update_backups() there exists a problem of crossing the boundary as
follows:
assume the LUN is resized to 1TB (the cluster size is 32KB); it will then
contain clusters 0~33554431. In update_backups(), the super block is
backed up at the 1TB location, which is the 33554432nd cluster, so the
access crosses the boundary.
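The arithmetic behind the overflow, for illustration:

    /* 1TB = 2^40 bytes, cluster size 32KB = 2^15 bytes */
    u64 nr_clusters = (1ULL << 40) >> 15;   /* 33554432 clusters */
    /* valid cluster indexes are 0 .. 33554431, so a backup super placed
       at the 1TB mark (cluster 33554432) lies one past the end */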
Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Xue jiufei <xuejiufei@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 584dca3440732afa84fbca07567bb66e1453936a)
jiangyiwen [Tue, 15 Mar 2016 21:53:01 +0000 (14:53 -0700)]
ocfs2: use spinlock_irqsave() to downconvert lock in ocfs2_osb_dump()
Commit a75e9ccabd92 ("ocfs2: use spinlock irqsave for downconvert lock")
missed one unmodified place in ocfs2_osb_dump(), so a deadlock scenario
still exists.
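A sketch of the missed conversion (the lock name follows the downconvert
lock referenced in commit a75e9ccabd92; treat the details as assumptions):

    unsigned long flags;

    spin_lock_irqsave(&osb->dc_task_lock, flags);
    /* dump the downconvert state as before ... */
    spin_unlock_irqrestore(&osb->dc_task_lock, flags);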
Joseph Qi [Thu, 14 Jan 2016 23:17:44 +0000 (15:17 -0800)]
ocfs2: access orphan dinode before delete entry in ocfs2_orphan_del
In ocfs2_orphan_del, the entry is currently found and deleted first, and
the orphan dir dinode is accessed afterwards. This is a problem once
ocfs2_journal_access_di fails: the entry has already been removed from
the orphan dir, but the inode has not actually been deleted. In other
words, the file goes missing without really being deleted. So we should
access the orphan dinode first, as unlink and rename do.
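A simplified sketch of the reordering (argument lists abbreviated; an
illustration, not the literal diff):

    /* access the orphan dir dinode first ... */
    status = ocfs2_journal_access_di(handle, INODE_CACHE(orphan_dir_inode),
                                     orphan_dir_bh, OCFS2_JOURNAL_ACCESS_WRITE);
    if (status < 0)
            goto leave;     /* nothing removed yet, so nothing is lost */

    /* ... and only then find and delete the entry */
    status = ocfs2_delete_entry(handle, orphan_dir_inode, &lookup);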
Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Reviewed-by: Jiufei Xue <xuejiufei@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 074a6c655f6da12cb1123c8a84bfd8d781138800)
xuejiufei [Thu, 14 Jan 2016 23:17:41 +0000 (15:17 -0800)]
ocfs2/dlm: do not insert a new mle when another process is already migrating
When two processes are migrating the same lockres,
dlm_add_migration_mle() returns -EEXIST but still inserts a new mle into
the hash list. dlm_migrate_lockres() will then detach the old mle and
free the new one, which is already in the hash list, corrupting the list.
Signed-off-by: Jiufei Xue <xuejiufei@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 32e493265b2be96404aaa478fb2913be29b06887)
jiangyiwen [Thu, 14 Jan 2016 23:17:33 +0000 (15:17 -0800)]
ocfs2: fix slot overwritten if storage link down during mount
The following case leads to a slot being overwritten.
N1: mount ocfs2 volume, find and allocate slot 0, then set
    osb->slot_num to 0, begin to write slot info to disk
N2: mount ocfs2 volume, wait for super lock
N1: write block fail because of storage link down, unlock super lock
N2: got super lock and also allocate slot 0, then unlock super lock
N1: mount fail and then dismount, since osb->slot_num is 0, try to
    put invalid slot to disk. And it will succeed if storage link
    restores.
N2: slot info is now overwritten
Once another node, say N3, mounts, it will find and allocate slot 0
again, which will lead to a mount hang because the journal has already
been locked by N2. So when writing the slot info fails, invalidate the
slot in advance to avoid overwriting it.
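A sketch of the idea (helper names are my reading of fs/ocfs2/slot_map.c
and should be treated as assumptions):

    status = ocfs2_update_disk_slot(osb, si, osb->slot_num);
    if (status < 0) {
            /* forget the slot we thought we owned, so a later dismount
               cannot write stale slot info back to disk */
            ocfs2_invalidate_slot(si, osb->slot_num);
            osb->slot_num = OCFS2_INVALID_SLOT;
    }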
[akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 1247017f43a93eae3d64b7c25f3637dc545f5a47)
Xue jiufei [Thu, 14 Jan 2016 23:17:29 +0000 (15:17 -0800)]
ocfs2/dlm: return appropriate value when dlm_grab() returns NULL
dlm_grab() may return NULL when the node is unmounting. During code
review, we found that some dlm handlers return an error to the caller
when dlm_grab() returns NULL, making the caller BUG or causing other
problems.
Here is an example:
Node 1: receives migration message from node 3, and sends a migrate
        request to the other nodes
Node 2: starts unmounting
Node 2: receives the migrate request from node 1 and calls
        dlm_migrate_request_handler()
Node 2: the unmount thread unregisters the domain handlers and removes
        dlm_context from dlm_domains
Node 2: dlm_migrate_request_handler() returns -EINVAL to node 1
Node 1: exits migration without clearing the migration state and without
        sending the assert master message to node 3, which leaves node 3
        hung.
Signed-off-by: Jiufei Xue <xuejiufei@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit c372f2193a2e73d5936bf37259ae63ca388b4cbc)
jiangyiwen [Thu, 14 Jan 2016 23:17:23 +0000 (15:17 -0800)]
ocfs2/dlm: wait until DLM_LOCK_RES_SETREF_INPROG is cleared in dlm_deref_lockres_worker
Commit f3f854648de6 ("ocfs2_dlm: Ensure correct ordering of set/clear
refmap bit on lockres") still leaves a race in which the ordering is not
guaranteed to be correct.
Node1 Node2 Node3
umount, migrate
lockres to Node2
migrate finished,
send migrate request
to Node3
received migrate request,
create a migration_mle,
respond to Node2.
set DLM_LOCK_RES_SETREF_INPROG
and send assert master to
Node3
delete migration_mle in
assert_master_handler,
Node3 umount without response
dlm_thread purge
this lockres, send drop
deref message to Node2
found the flag of
DLM_LOCK_RES_SETREF_INPROG
is set, dispatch
dlm_deref_lockres_worker to
clear refmap, but in function of
dlm_deref_lockres_worker,
only if node in refmap it wait
DLM_LOCK_RES_SETREF_INPROG
to be cleared. So worker is
done successfully
purge lockres, send
assert master response
to Node1, and finish umount
set Node3 in refmap, and it
won't be cleared forever, thus
lead to umount hung
So, wait until DLM_LOCK_RES_SETREF_INPROG is cleared in
dlm_deref_lockres_worker().
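A simplified sketch of the fix named in the title (assuming the existing
__dlm_wait_on_lockres_flags() helper is used; not the literal diff):

    spin_lock(&res->spinlock);
    /* wait unconditionally, not only when the node is already in the refmap */
    __dlm_wait_on_lockres_flags(res, DLM_LOCK_RES_SETREF_INPROG);
    if (test_bit(node, res->refmap)) {
            dlm_lockres_clear_refmap_bit(dlm, res, node);
            cleared = 1;
    }
    spin_unlock(&res->spinlock);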
Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit b5560143385e18b4109ad6951c7719705e3dd995)
Xue jiufei [Thu, 14 Jan 2016 23:17:18 +0000 (15:17 -0800)]
ocfs2/dlm: fix a race between purge and migration
We found a race between purge and migration during code review.
Node A puts a lockres on the purge list before receiving the migrate
message from node B, which is the master. Node A then calls
dlm_mig_lockres_handler to handle this message.
dlm_mig_lockres_handler
    dlm_lookup_lockres
    >>>>>> race window, dlm_run_purge_list may run and send
           deref message to master, waiting the response
    spin_lock(&res->spinlock);
    res->state |= DLM_LOCK_RES_MIGRATING;
    spin_unlock(&res->spinlock);
    dlm_mig_lockres_handler returns

>>>>>> dlm_thread receives the response from master for the deref
       message and triggers the BUG because the lockres has the state
       DLM_LOCK_RES_MIGRATING with the following message:
xuejiufei [Tue, 29 Dec 2015 22:54:29 +0000 (14:54 -0800)]
ocfs2/dlm: clear migration_pending when migration target goes down
We have found a BUG on res->migration_pending when migrating lock
resources. The situation is as follows.
dlm_mark_lockres_migrating
    res->migration_pending = 1;
    __dlm_lockres_reserve_ast
    dlm_lockres_release_ast returns with res->migration_pending still
        set, because other threads reserve asts
    wait until dlm_migration_can_proceed returns 1
    >>>>>> o2hb finds that the target has gone down and removes the
           target from domain_map
    dlm_migration_can_proceed returns 1
    dlm_mark_lockres_migrating returns -EHOSTDOWN with
        res->migration_pending still set.
When dlm_mark_lockres_migrating() is reentered, it will trigger the
BUG_ON on res->migration_pending. So clear migration_pending when the
target goes down.
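A simplified sketch of the fix (placement and error value follow my
reading of dlm_mark_lockres_migrating(); treat the details as
assumptions):

    if (!test_bit(target, dlm->domain_map)) {
            /* the target left the domain: clear the flag before bailing
               out so that a retried migration does not hit the BUG_ON */
            spin_lock(&res->spinlock);
            res->migration_pending = 0;
            spin_unlock(&res->spinlock);
            ret = -EHOSTDOWN;
    }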
Signed-off-by: Jiufei Xue <xuejiufei@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit cc28d6d80f6ab494b10f0e2ec949eacd610f66e3)
Joseph Qi [Tue, 29 Dec 2015 22:54:06 +0000 (14:54 -0800)]
ocfs2: fix BUG when calculate new backup super
When resizing, it first extends the last gd. Once it needs to back up
the super in that gd, it calculates the new backup super location and
updates the corresponding value.
But it currently doesn't consider the situation where the backup super
has already been done. In this case, it still sets the bit in the gd
bitmap and then decreases bg_free_bits_count, which leads to a corrupted
gd and triggers the BUG in ocfs2_block_group_set_bits: