Signed-off-by: Nicolas Droux <nicolas.droux@oracle.com> Acked-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-by: Alan Maguire <alan.maguire@oracle.com>
dtrace: failing to allocate more ECB space can cause a crash
The existing code was not taking into consideration that when the
table of ECBs needs to be expanded, the memory allocation can fail.
This could lead to a NULL pointer access, and a kernel crash. We
now check the result of the allocation, and bail out if it fails.
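A minimal sketch of the added check (the allocation call and surrounding
names here are illustrative, not the actual dtrace_ecb code):
    necbs = vzalloc(nsize);
    if (necbs == NULL)
            return NULL;            /* bail out; the old ECB table stays intact */
    memcpy(necbs, state->dts_ecbs, osize);
    vfree(state->dts_ecbs);
    state->dts_ecbs = necbs;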
Orabug: 26503342 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Reviewed-by: Tomas Jedlicka <tomas.jedlicka@oracle.com>
Nick Alcock [Fri, 4 Aug 2017 16:54:24 +0000 (17:54 +0100)]
dtrace: work around libdtrace-ctf bug
The bug involves synthesising pointers to types (ipaddr_t *, in
particular) when such pointer types do not appear in the CTF files but
are needed by the CTF itself. This is working in standalone modules,
but not in modules with parent type containers.
As a workaround, pro tem before fixing this properly in libdtrace-ctf,
hack around it for the one type it is necessary for (a type that is used
in the DTrace system translators, so if this type does not resolve
correctly DTrace will not start). A suitable workaround is simply to
introduce a use of this pointer type in the C code, and it so happens
that we have a place where this would fit perfectly well.
Orabug: 26583958 Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-by: Tomas Jedlicka <tomas.jedlicka@oracle.com> Reviewed-by: Kris Van Hees <kris.van.hees@oracle.com>
Signed-off-by: Tomas Jedlicka <tomas.jedlicka@oracle.com> Reviewed-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-by: Kris Van Hees <kris.van.hees@oracle.com>
Tomas Jedlicka [Tue, 1 Aug 2017 13:15:44 +0000 (09:15 -0400)]
dtrace: Integrate DTrace Modules into kernel proper
This changeset integrates DTrace module sources into the main kernel
source tree under the GPLv2 license. Sources have been moved to
appropriate locations in the kernel tree.
In addition a new RPM package is introduced: kernel-headers-dtrace.
This package is responsible for installation of DTrace related header
files for its userspace component.
Signed-off-by: Tomas Jedlicka <tomas.jedlicka@oracle.com> Signed-off-by: David Mc Lean <david.mclean@oracle.com> Reviewed-by: Kris Van Hees <kris.van.hees@oracle.com>
Tomas Jedlicka [Fri, 7 Apr 2017 20:51:53 +0000 (16:51 -0400)]
dtrace: FBT module support and SPARC return probes
This fix adds two features to the FBT provider:
1) support for modules
2) support for return probes on SPARC
Module support on x86 was almost ready, as it does not rely on trampolines and
uses hashtables of tracepoints. This works well when we know the number of probes in
advance, so we can reserve the correct amount of memory at module load time. Unfortunately
that is not possible on SPARC, where we need to allocate trampolines dynamically.
The major part of this code is about removing all static assumptions about FBT from kernel
code and moving the responsibility to the dtrace modules. Trampolines for SPARC are now
allocated dynamically (including for the kernel's pseudo-module). This applies to SDT trampolines
too.
The second change adds a scan for return probes on SPARC, with small heuristics to quickly
skip over cases that are not interesting for DTrace. At the same time this patch
allocates a new SPARC trap for FBT.
Support for the .init section is not available on any platform. The .init section is freed
after a module is fully loaded and it is not possible to remove its probes without
further changes in the DTrace framework (modules). This is deferred for later work.
Signed-off-by: Tomas Jedlicka <tomas.jedlicka@oracle.com> Reviewed-by: Kris Van Hees <kris.van.hees@oracle.com> Reviewed-by: Rob Gardner <rob.gardner@oracle.com>
Tomas Jedlicka [Sun, 2 Jul 2017 19:06:10 +0000 (15:06 -0400)]
dtrace: Make dynamic variable cleanup self-throttling
With the addition of cyclic_reprogram() it is possible to make dynamic variable cleanup
self-throttling, simply by having the handler reprogram its own cyclic.
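As a rough sketch (the handler and field names are illustrative, and the
cyclic API is shown Solaris-style; the port's signatures may differ):
    static void dtrace_state_clean(void *arg)
    {
            dtrace_state_t *state = arg;

            dtrace_dynvar_clean(&state->dts_vstate.dtvs_dynvars);

            /* Throttle ourselves: schedule the next pass only now that this
             * one has finished, rather than on a fixed period. */
            cyclic_reprogram(state->dts_cleaner,
                             dtrace_gethrtime() + dtrace_cleanrate);
    }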
Tomas Jedlicka [Thu, 29 Jun 2017 11:22:52 +0000 (07:22 -0400)]
dtrace: DTrace state deadman must use dtrace_sync()
The dtrace_sync() function allows us to check that all CPUs have left probe context. Without
this code the deadman check would base its assumption on the state of the CPU that calls
the deadman cyclic, which is wrong.
Tomas Jedlicka [Fri, 7 Apr 2017 22:23:04 +0000 (18:23 -0400)]
dtrace: FBT module support and SPARC return probes
This fix contains the following changes:
- Modification of provider ops vector
- Move of fbt hash table to x86 platform specific code
- Instrumentation of return probes on SPARC
The DTrace provider framework allows a provider to create a probe on the fly or to provide
probes for a given module, but there is no way for a provider to attach its own per-module
data through the framework.
With this change a provider may allocate per-module data inside the provide_module() call.
Once a module is about to go away it will be notified by the framework via the destroy_module() op.
To stay binary compatible I extended the ops vector at its end, but used C99-style initializers
for the ops structures to keep callbacks grouped by their logic.
SPARC now does two passes over the available kernel symbols. The first one is used to
count how many symbols are present, so that a correctly sized trampoline can be allocated.
The second pass performs the actual disassembly and creates return probes. It is possible that
not every probe is instrumentable, so we may end up wasting some memory. It is a tradeoff
between speed and memory consumption.
It is not possible to instrument arbitrary return sites, so we support only the variants
that occur in the instruction stream. The current implementation relies on the use of JMPL,
so it is not possible to instrument returns from tail-call-optimized code.
Another change is in the patching of the code. The JMPL requires a NOP in its delay
slot. This prevents us from patching atomically on a running kernel, so we must stop the
CPUs for safety reasons.
Linux probes may fire from non-standard contexts like TL1, so it is not safe to assume
anything about the %g registers. Thanks to having a few free %l registers we are able to
temporarily store the %g registers and restore them afterwards, to avoid breaking trap handlers.
Tomas Jedlicka [Sun, 2 Jul 2017 16:17:01 +0000 (12:17 -0400)]
dtrace: Add support for manually triggered cyclics
In some scenarios it is better if a client of the cyclic subsystem can reprogram a cyclic on
its own. This is not possible with the current implementation.
This change adds cyclic_reprogram(), which can be used to schedule a cyclic from inside and
outside of its handler. A manually triggered cyclic is distinguished from other types
by having its interval set to -1.
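A sketch of the intended usage, with Solaris-style cyclic structure and
macro names that are assumptions and may differ in this port:
    cyc_handler_t hdlr;
    cyc_time_t when;
    cyclic_id_t cid;

    hdlr.cyh_func  = dtrace_state_clean;
    hdlr.cyh_arg   = state;
    hdlr.cyh_level = CY_LOW_LEVEL;

    when.cyt_when     = 0;
    when.cyt_interval = -1;         /* manual trigger only, never fires by itself */

    cid = cyclic_add(&hdlr, &when);

    /* later, from inside or outside the handler: */
    cyclic_reprogram(cid, dtrace_gethrtime() + NANOSEC);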
Tomas Jedlicka [Fri, 30 Jun 2017 13:17:06 +0000 (09:17 -0400)]
dtrace: LOW level cyclics should use workqueues
The HIGH level cyclics are meant to be run from an interrupt handler. This works on
Linux because the hrtimer is scheduled as a tasklet. The LOW level cyclics must be
interruptible and should not be scheduled as tasklets.
DTrace currently relies on being able to call dtrace_sync() from within a cyclic
handler. On Linux it is not safe to send IPIs from within interrupt/bottom-half
handlers.
This fix changes LOW level cyclics to use workqueues. At the moment we are using the
shared system workqueue, but it may become necessary to allocate our own if this causes
large latencies in our timer routines.
Signed-off-by: Tomas Jedlicka <tomas.jedlicka@oracle.com> Reviewed-by: Kris Van Hees <kris.van.hees@oracle.com> Acked-by: Chuck Anderson <chuck.anderson@oracle.com>
Kris Van Hees [Tue, 13 Jun 2017 16:33:04 +0000 (12:33 -0400)]
dtrace: FBT entry probes will now use int3
Due to some function prologues inserting an instruction between the
'push %rbp' and 'mov %rsp,%rbp' instructions *and* that instruction being one
that can validly take a LOCK prefix (e.g. inc), it is not safe to
continue using the LOCK prefix as a way to trigger an Invalid Opcode
trap for FBT entry probes. The new trigger uses int3 (like the return
probes already do).
Orabug: 26190412
Orabug: 26174895 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com>
Kris Van Hees [Mon, 12 Jun 2017 13:33:05 +0000 (09:33 -0400)]
dtrace: add kprobe-unsafe addresses to FBT blacklist
By means of the newly introduced API to add entries to the FBT
blacklist, we make sure to register addresses that are unsafe for
kprobes with the FBT blacklist because they are unsafe there also.
Orabug: 26190412 Signed-off-by: Tomas Jedlicka <tomas.jedlicka@oracle.com> Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Acked-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Kris Van Hees [Mon, 12 Jun 2017 13:29:16 +0000 (09:29 -0400)]
dtrace: convert FBT blacklist to RB-tree
The blacklist for FBT was implemented as a sorted list, populated from
a static list of functions. In order to allow functions to be added
from other places (i.e. programmatically), it has been converted to an
RB-tree with an API to add functions and to traverse the list. It is
still possible to add functions by address or to add them by symbol
name, to be resolved into the corresponding address.
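The shape of the resulting API is roughly as below; the function names are
hypothetical, derived from the description above rather than the source:
    /* Add a blacklist entry by resolved address ... */
    void dtrace_fbt_blacklist_add(unsigned long addr);
    /* ... or by symbol name, resolved to an address via kallsyms. */
    void dtrace_fbt_blacklist_add_name(const char *name);
    /* Walk the RB-tree in address order. */
    void dtrace_fbt_blacklist_walk(void (*cb)(unsigned long addr, void *arg),
                                   void *arg);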
Orabug: 26190412 Signed-off-by: Tomas Jedlicka <tomas.jedlicka@oracle.com> Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Acked-by: Nick Alcock <nick.alcock@oracle.com>
Alan Maguire [Wed, 14 Jun 2017 13:06:27 +0000 (14:06 +0100)]
DTrace: IP provider use-after-free for drop-out probe points
KASan warnings showed a possible use-after-free for skbs in the error-handling
codepath after netfilter hooks have run. Hooks may free the skb, so we
should not dereference it at drop-out probe points after NF_HOOK().
Nick Alcock [Thu, 15 Jun 2017 21:04:08 +0000 (22:04 +0100)]
ctf: fix a variety of memory leaks and use-after-free bugs
These fall into two classes, but are sufficiently intertwined that it's
easier to commit them in one go.
The first is outright leaks, which exceed 1GiB on a normal run, varying
from the tiny (failure to free getline()'s line), through the disastrous
(failure to free items filtered from a list by list_filter(), leading to
the leaking of nearly the whole of the named_structs state, which is
huge). We also leak the structs_seen hash due to recreating it on an
alias_fixup file switch without bothering to destroy it first.
The second is lifetime problems, centred around the stuff allocated and
freed in the detect_duplicates_tu_{init,done}() functions. These were
comparing the module name against a saved copy to see if a new vars_seen
needed to be allocated, or whether this was just a flip of TU without a
change of object file and we could get away with just flushing its
contents out -- but unfortunately the state->module_name is assigned
directly from its parameter, and *that* has a lifetime lasting only
within process_file() -- and a deduplication run, of course, involves
iterating over a great many object files. So everything works as long as
we're flipping from TU to TU within a single object file, and then we
switch object files and are suddenly strcmp()ing with freed memory.
Discard this faulty optimization entirely, and just flush the vars_seen
hash in tu_done() and both create and destroy it in scan_duplicates(),
right where we create and destroy related stuff too.
Something similar happens with the state->dwfl_file_name due to its
derivation from id_file->file_name: if no duplicates are found, we
list_filter() that id_file straight out of the structs_seen list and
free it, and then on the next call state->dwfl_file_name points to freed
memory.
Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Acked-by: Kris Van Hees <kris.van.hees@oracle.com> Acked-by: Tomas Jedlicka <tomas.jedlicka@oracle.com>
Orabug: 26283357
Signed-off-by: Nicolas Droux <nicolas.droux@oracle.com> Acked-by: Saar Maoz <Saar.Maoz@oracle.com> Acked-by: Rajan Shanmugavelu <rajan.shanmugavelu@oracle.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Nicolas Droux <nicolas.droux@oracle.com> Acked-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Shan Hai <shan.hai@oracle.com>
Signed-off-by: Tomas Jedlicka <tomas.jedlicka@oracle.com> Acked-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-by: Alan Maguire <alan.maguire@oracle.com>
Nick Alcock [Wed, 24 May 2017 16:57:05 +0000 (17:57 +0100)]
ctf: prevent modules on the dedup blacklist from sharing any types at all
The deduplication blacklist exists to deal with a few modules, mostly
old ISA sound drivers, which consist of a #define and a #include of
another module, which then uses #ifdef profusely, sometimes on structure
members. This leads to DWARF which has differently shaped structures
in the same source location, which conflicts with the purpose of the
deduplicator, which is to use the source location to identify identical
structures and merge them.
Before now, the blacklist worked by preventing the appearance of a
structure in a module on the blacklist from making a structure shared:
it could still be shared if other modules defined it. This fails to
work for two reasons.
Firstly, if it is used by only *one* other module, we'll hit an
assertion failure in the emission phase which checks that any type it
emits is either defined in the module being emitted or in the shared
type repo.
We could remove that assertion, but we'd still be in the wrong, because
you could easily have a type in some header used by many modules that
said
struct trouble {
#ifdef MODULE_FOO
/* one definition */
#else
/* a different definition */
#endif
}
and a module that says
#define MODULE_FOO 1
#include <other_module.c>
Even if we blacklisted this module (and we would), this would still
fail, because 'struct trouble' would end up in the shared type
repository, and the existing code would fail to emit a new definition of
it in module blah, even though it should because its definition is
different.
This shows that if a module is pulling tricks like this we cannot trust
its use of shared types at all, since we have no visibility into
preprocessor definitions: regardless of the space cost (about 40KiB per
module), we cannot let it share any types whatsoever with the rest of
the kernel. Rather than piling heaps of blacklist checks in all over
dwarf2ctf to implement this, we do it by exploiting the fact that the
deduplicator works entirely on the basis of type IDs. We add a
'blacklist type prefix' that type_id() can add to the start of its IDs
(with some extra decoration, because the start of type IDs is a file
path, so we want this to be an impossible path). If we set this prefix
to the module name if and only if the module is blacklisted, and do not
add one otherwise, then every blacklisted module will have a unique set
of IDs for all its types, which can be shared within the module but not
outside it, so every type in the module will be unique and none of them
will end up in the shared type repository.
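Conceptually (identifier and helper names here are illustrative), the prefix
is applied at the very start of ID construction:
    /* At the start of type_id(): a blacklisted module gets an impossible
     * "path" prefixed to every one of its IDs, so its types can never
     * compare equal to types from any other module. */
    if (blacklist_type_prefix != NULL)
            id = str_appendn(id, "//blacklist:", blacklist_type_prefix,
                             "//", NULL);
    /* normal IDs then begin with the declaring file's path */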
While we're at it, add yet another ancient ISA sound driver that plays
the same games to the blacklist.
This fix makes blacklisting modules much more space-expensive: each such
module expands the current size of the kernel module package by about
40KiB. (But there is only one blacklisted module built in current UEK,
so this is a tolerable cost.)
Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-by: vincent.lim@oracle.com
Orabug: 26137220
 [ 437f]  structure_type
          name                 (strp)  "IO_APIC_route_entry"
          byte_size            (data1) 8
 [ 438b]    member
            name                 (strp)  "vector"
            byte_size            (data1) 4
            bit_size             (data1) 8
            bit_offset           (data1) 24
            data_member_location (data1) 0
 [ 439a]    member
            name                 (strp)  "delivery_mode"
            byte_size            (data1) 4
            bit_size             (data1) 3
            bit_offset           (data1) 21
            data_member_location (data1) 0
 [ 43a9]    member
            name                 (strp)  "dest_mode"
            byte_size            (data1) 4
            bit_size             (data1) 1
            bit_offset           (data1) 20
            data_member_location (data1) 0
 [ 43b8]    member
            name                 (strp)  "delivery_status"
            byte_size            (data1) 4
            bit_size             (data1) 1
            bit_offset           (data1) 19
            data_member_location (data1) 0
But CTF on little-endian requires the opposite: it has special handling
for the first member of a structure which assumes that it is closest to
the start of memory: in effect, it wants structure member addresses to
always ascend, even within bitfields, regardless of endianness (which
makes some sense intellectually as well).
dwarf2ctf's emission code generally emits sequentially, so except where
deduplication has eliminated items or dependent type insertion has added
them it emits things in the CTF in the same order as in the DWARF. We
can avoid this for short runs, as in this case, by switching from
iteration to recursion: spotting a run at identical
data_member_location, recursing until we hit the end of the run, then
unwinding and emitting as we unwind until the recursion is over.
Orabug: 25815129 Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-by: tomas.jedlicka@oracle.com
Nick Alcock [Thu, 2 Feb 2017 00:22:05 +0000 (00:22 +0000)]
ctf: bitfield support
Support for bitfields in dwarf2ctf was embryonic and largely untested
before now due to bugs in libdtrace-ctf: with a fix for those in hand,
we can fix bitfields here too.
Bitfields in DWARF and CTF have annoyingly different representations.
In DWARF, a bitfield is represented something like this:
 [ 16561]    member
             name                 (string) "ihl"
             decl_file            (data1)  225
             decl_line            (data1)  87
             type                 (ref4)   [    38]
             byte_size            (data1)  1
             bit_size             (data1)  4
             bit_offset           (data1)  4
             data_member_location (sdata)  0
 [...]
 [    38]    typedef
             name                 (strp)   "__u8"
             decl_file            (data1)  36
             decl_line            (data1)  20
             type                 (ref4)   [    43]
i.e. the padding, size, and starting location are all represented in the
member, where you would conceptually expect it to be.
In CTF, the starting location of the conceptual containing type of a
bitfield is encoded in the member: but the size and starting location of
the bitfield itself is represented in the dependent type, which is added
as a "non-root" type (which cannot be looked up by name) so that it can
have the same name as the un-bitfielded base type without causing a name
clash.
We use the new DIE attribute override mechanism added in commit 8935199962
to override DW_AT_bit_size and DW_AT_bit_offset for such members (fixing
a pre-existing bug in the process: we were looking for the DW_AT_bit_size
on the structure as a whole!), and in the base-type emission function
checking for the existence of a DW_AT_bit_size/offset and responding to
them by overriding the size and offset derived from DW_AT_byte_size and
noting that this is a non-root type. (The override needed, annoyingly,
is endian-dependent, because CTF consumers assume that on little-endian
systems the offset relates to the least-significant edge of the bitfield,
counting from the LSB, while DWARF assumes the opposite).
But this is not enough: unless more is done, this type will appear
to have the same type ID as its non-bitfield equivalent, leading to
confusion about which CTF file it should appear in and quite possibly
leading to it ending up in a CTF file that the structure containing the
bitfield cannot even see. So augment type_id()'s representation of
base types from e.g. 'int' to something like 'int:4' if and only if
a DW_AT_bit_size or an override of it is present and that override is
a different size from the native bitness of the type itself (the
DW_AT_byte_size). We encode the bit_offset only if there is also a
bit_size, as something like int:4:2. (That's unambiguous because
these attributes always arrive in pairs in bitfields and never appear
in anything else in C-generated DWARF.)
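A sketch of the ID-suffix logic described above (the helper names are
assumptions, not the actual dwarf2ctf functions):
    /* Only when this really is a bitfield narrower than the base type... */
    if (bit_size != 0 && bit_size != byte_size * 8) {
            char suffix[32];

            snprintf(suffix, sizeof(suffix), ":%lu", (unsigned long) bit_size);
            id = str_append(id, suffix);            /* "int" -> "int:4"   */

            if (have_bit_offset) {
                    snprintf(suffix, sizeof(suffix), ":%lu",
                             (unsigned long) bit_offset);
                    id = str_append(id, suffix);    /* -> "int:4:2"       */
            }
    }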
Finally, this breaks an optimization in the deduplicator, which was that
all structure members reference some top-level type, so when marking a
type as seen, structure members could just be skipped. Now, they have
to be chased iff they are bitfields using the same override trickery as
above to change the DW_AT_bit_size/offset in the member's type DIE, and
that bitfield override needs to be passed down to type_id() when finally
marking duplicated types as shared too. (Avoid code duplication by
factoring out some related code from a horrible conditional in
detect_duplicates() into a new type_needs_sharing() function.)
Orabug: 25815129 Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-by: tomas.jedlicka@oracle.com
Nick Alcock [Thu, 27 Apr 2017 17:39:53 +0000 (18:39 +0100)]
ctf: emit file-scope static variables
For as long as dwarf2ctf has emitted variables into the CTF, it has
emitted only extern variables. This is for two reasons: firstly, a
misconception that global variables with static linkage did not appear
in /proc/kallmodsyms (they do), and secondly, an ambiguity problem.
We cannot usefully distinguish two variables with the same name in the
same module: they differ only by address, so if both are static
variables in different translation units, we can't tell which is which.
(Also, we cannot emit more than one variable with a given name into a
given module CTF file in any case). CTF's modular rather than
translation-unit-based variable/type scope bites us here.
I sought to avoid this bug by emitting only non-static variables, but
this does not save us, because there might be another static variable
with the same name in the same module, whereupon the ambiguity problem
arises all over again. We must identify such ambiguous cases and strip
them out (not emitting CTF for this variable at all): then we can emit
static file-scope variables into the CTF without worry.
We do this by introducing a new, module-scope vars_seen hashtable into
the deduplicator state, which gains an entry for every variable name
seen in this module, and indicates whether it is static or not. This
lets us tell if we have seen a variable with a given name more than
once, and if we have, whether any of the instances was static. Then
we can consult this blacklist at variable-emission time, and skip any
variable if it is in the blacklist.
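In outline (the hash API and field names below are illustrative only), the
two phases look like this:
    /* While deduplicating this module: remember each variable name and
     * whether we have seen it before. */
    entry = vars_seen_lookup(vars_seen, var_name);
    if (entry == NULL)
            vars_seen_insert(vars_seen, var_name, is_static);
    else if (entry->is_static || is_static)
            entry->ambiguous = 1;   /* same name twice, at least one static */

    /* At variable-emission time: */
    entry = vars_seen_lookup(vars_seen, var_name);
    if (entry != NULL && entry->ambiguous)
            return;                 /* skip: emit no CTF for this variable */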
Unfortunately, computing the name for the variable's entry in the
blacklist is fairly expensive, and has to be done for every variable:
worse yet, this increases the number of variables emitted drastically
(in vmlinux and the shared type repo alone, we go from 2247 to 10409
variables), and emitting that much CTF is not free: so the runtime goes
up by about 5%. We will reclaim this lost speed soon (and then some).
Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-by: tomas.jedlicka@oracle.com
Orabug: 25962387
Nick Alcock [Thu, 9 Mar 2017 14:31:33 +0000 (14:31 +0000)]
ctf: speed up the dwarf2ctf duplicate detector some more
The duplicate detector's alias detection pass is still doing unnecessary
work. When a non-opaque structure is marked shared, it is sometimes
necessary to do another deduplication scan, because the marking may have
marked types used within the structure as shared, which may require yet
more types (e.g. opaque uses of of that type in other modules) to be
marked shared as well. So while we know this pass can only affect
structures/ unions/enums that have names, and their interior types, it
would seem that we must keep scanning them to see if they need
deduplication until none remain.
However, there is one exception: if a non-opaque type and its
corresponding opaque type are both already marked shared, or if we have
just processed them and marked them accordingly, we know that we will
never need to re-mark those particular types again, since they can't be
more shared than they already are: so we can remove them from
consideration in future passes. Because we are only opening DWARF files
in this pass as needed now, this hugely cuts down the number of files we
process in subsequent passes: we still see the same number of passes,
but passes after the first (which marks tens of thousands of opaque
types as shared) only open a few files, mark a few hundred types, and
flash past in under a second.
In my tests, the alias fixup pass now takes under 10s, which can be
more or less ignored: all other passes other than initialization
and writeout are much more expensive. (Before this series, it took
over a minute on the fastest machine I have access to, and over three
minutes on SPARC.)
Orabug: 25815306 Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-by: tomas.jedlicka@oracle.com
Nick Alcock [Tue, 7 Mar 2017 20:31:12 +0000 (20:31 +0000)]
ctf: speed up the dwarf2ctf duplicate detector
dwarf2ctf is very slow. By far its most expensive pass is the duplicate
detector. Of necessity this has to scan every object file in the kernel
to identify shared types, which is pretty expensive even when the cache
is hot, but it's not doing this particularly efficiently and can be sped
up quite a lot.
As of commit f0d0ebd0b4, the duplicate detector's job was cut in two:
the first pass identifies all non-structure types and non-opaque
structures used by more than one module, and the second "alias fixup"
pass repeatedly unifies opaque and non-opaque structures with the same
name in different translation units, sharing their definitions. Both of
these passes generate or modify type IDs, so both need access to the
DWARF DIE of the types in question, since the type ID is derived
recursively from the DIE: but the second pass need not look at the DWARF
of any translation units that do not contain structures that might be
unified. However, the two are currently written in the same way, using
process_file() to traverse all the kernel's DWARF, even though the alias
fixup pass does almost nothing with that DWARF, and has less and less to
do on each iteration.
The sheer amount of wasted time here is remarkable. We traverse the
DWARF once for primary duplicate detection, once for CTF emission, but
often four or five or even seven or eight times for the alias fixup pass
(the precise count depends upon the relationships between types and the
order in which the DWARF files are traversed).
So improve things by tracking all types that the alias fixup pass is
interested in (structure types that are not anonymous inner structures
nor opaque nor used only as the types of array indices) and stash them
away during the first duplicate detection pass in a new temporary
singly-linked list, detect_duplicates_state.named_structs. We remember
the filename and DWARF DIE offset (so we can look the type up again) and
the type ID (because we just worked it out, so recomputing it would be a
waste of time). Then, rather than doing a process_file() for the alias
fixup pass, traverse the linked list, opening DWARF files as needed to
mark things as shared (but no more often than that: marking non-opaque
types needs the DIE so we can traverse into all its subtypes and mark
them shared too, but marking opaque types needs no DIE at all).
This has a significant effect on dwarf2ctf speed, slashing 25% off its
runtime in my tests, reducing the duplicate detector's share of the
runtime from about 80s to about 24s.
The dominant time consumer is now CTF emission rather than the
duplicate detector.
Orabug: 25815306 Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-by: tomas.jedlicka@oracle.com
Nick Alcock [Thu, 2 Feb 2017 00:23:16 +0000 (00:23 +0000)]
ctf: fix the size of int and avoid duplicating it
An upcoming bitfield-capable release of libdtrace-ctf adds extra
consistency checking which identified embarrassing but heretofore
harmless bugs in dwarf2ctf's representation of basic types.
dwarf2ctf has to emit a few basic types "by hand", since these are used
in the representation of types one of CTF or DWARF does not bother to
encode more complex forms for (function pointers, always encoded as
'int (*)()' in CTF, and 'void'). Embarrassingly, we were getting the
size of 'int' wrong: it should be in bits but we were emitting a count
of bytes instead, leading to a CTF representation of a 4-bit int.
This is always overridden by an accurate representation built into
DTrace in real use, but libdtrace-ctf finds the inconsistency anyway.
Worse is that it emits a representation of 'int' twice, once by hand in
init_ctf_table() and then again when it comes across the real 'int' type
DIE in the debuginfo. This is because we forgot to intern the types we
add by hand in the id_to_module and id_to_type hashes we use to detect
duplicate types, because we can only intern types we have a DWARF DIE
for, and these types either have no DIE at all or we just don't know
about it because we haven't even begun to traverse the debuginfo to look
up DIEs yet.
Fix this by adding a mark_shared_by_name() function which can be used to
intern basic types by name in these hashes. It has hardwired knowledge
of the type ID notation, but no more such knowledge than is already
present in detect_duplicates_alias_fixup() and friends (no more than
that types with no associated filename or line number are preceded by
"////".)
Orabug: 25815129 Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-by: tomas.jedlicka@oracle.com
Nick Alcock [Wed, 1 Feb 2017 23:51:21 +0000 (23:51 +0000)]
ctf: allow overriding of DIE attributes: use it for parent bias
Currently, dwarf2ctf's CTF type emission machinery emits any given CTF
type based purely on the data in the corresponding DWARF DIE, with
possible adjustments based on the DIE's structural parent (useful for
structure members, if not so much for top-level types). But from time
to time we need to adjust a CTF type depending on some property not of
its structural parent but of a type that depends on it: i.e., for
structures, we might want to adjust the offset of the members to cater
for the fact that this structure is being structurally merged with a
shorter assignment-compatible structure with identical name which
appeared in some translation unit that was processed earlier.
Currently, we handle this case, and only this case, by passing down a
'parent_bias' to all the CTF assembly functions. Replace this with a
more generic mechanism whereby an array of 'overrides' can be passed
down to construct_ctf_id(), die_to_ctf(), and all subordinate assembly
functions: these overrides consist of an array of die_override_t's,
where each element can either override or add to the value of one DWARF
attribute: this kicks in for specific DWARF tags only. (Only numerical
attributes are supported, obviously.)
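An illustrative shape for one such override entry (the field names are
assumptions, not the actual declaration):
    typedef struct die_override {
            int        tag;        /* DWARF tag this applies to, e.g. DW_TAG_member */
            int        attribute;  /* attribute to touch, e.g. DW_AT_bit_size       */
            enum { DIE_OVERRIDE_REPLACE,
                   DIE_OVERRIDE_ADD } op;   /* replace the value, or add to it      */
            Dwarf_Word value;
    } die_override_t;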
(We also pass the overrides down to type_id() so that overrides can
affect the ID of types and thus cause a single DWARF type to generate
multiple CTF types, though we do not use this facility in this commit.)
To process the attributes, we introduce a new private_find_override() to
search the override list, and private_dwarf_udata() to fetch a udata and
handle it. (We do not override dwarf_hasattr(): anything that wants to
override an attribute that may not exist has to call
private_find_override() itself. If this happens a lot, we can introduce
an override for dwarf_hasattr() too.)
Currently we use this in exactly one place: in assemble_ctf_su_member(),
to replace the use of parent_bias. Further uses will come in the next
commit: thanks to this commit, none of them will require adding new
parameters to all the CTF construction functions :)
Also rename the 'override' parameter on the CTF construction functions,
which was used by array assembly to indicate that CTF types should
replace their parent type, with a much less confusingly-named 'replace'
parameter. (It was badly named before, but now that we have a parameter
named 'overrides' it is devastatingly badly named.)
Orabug: 25815129 Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-by: tomas.jedlicka@oracle.com
Alan Maguire [Wed, 12 Apr 2017 21:34:56 +0000 (22:34 +0100)]
DTrace tcp/udp provider probes
This patch adds DTrace SDT probes for the TCP and UDP protocols. For tcp
the following probes are added:
tcp:::send                Fires when a tcp segment is transmitted
tcp:::receive             Fires when a tcp segment is received
tcp:::state-change        Fires when a tcp connection changes state
tcp:::connect-request     Fires when a SYN segment is sent
tcp:::connect-refused     Fires when a RST is received for connection attempt
tcp:::connect-established
                          Fires when three-way handshake completes for
                          initiator
tcp:::accept-refused      Fires when a RST is sent refusing connection attempt
tcp:::accept-established
                          Fires when three-way handshake succeeds for acceptor
Arguments for all of these probes are:
arg0 struct sk_buff *; to be translated into pktinfo_t * containing
implementation-independent packet data
arg1 struct sock *; to be translated into csinfo_t * containing
implementation-independent connection data
arg2 __dtrace_tcp_void_ip_t *; to be translated into ipinfo_t * containing
implementation-independent IP information. Custom type is used as
this gives DTrace a hint that we can source IP information from other
arguments if the IP header is not available.
arg3 struct tcp_sock *; to be translated into tcpsinfo_t * containing
implementation-independent TCP connection data
arg4 struct tcphdr *; to be translated into a tcpinfo_t * containing
implementation-independent TCP header data
arg5 int representing previous state; to be translated into a
tcplsinfo_t * which contains the previous state. Differs from
current state (arg6) for state-change probes only.
arg6 int representing current state. Cannot be sourced from struct
tcp_sock as we sometimes need to probe before state change is
reflected there
arg7 int representing direction of traffic for probe; values are
DTRACE_NET_PROBE_INBOUND for receipt of data and
DTRACE_NET_PROBE_OUTBOUND for transmission.
For udp the following probes are added:
udp:::send Fires when a udp datagram is sent
udp:::receive Fires when a udp datagram is received
Arguments for these probes are:
arg0 struct sk_buff *; to be translated into pktinfo_t * containing
implementation-independent packet data
arg1 struct sock *; to be translated into csinfo_t * containing
implementation-independent connection data
arg2 void_ip_t *; to be translated into ipinfo_t * containing
implementation-independent IP information.
arg3 struct udp_sock *; to be translated into a udpsinfo_t * containing
implementation-independent UDP connection data
arg4 struct udphdr *; to be translated into a udpinfo_t * containing
implementation-independent UDP header information.
Kris Van Hees [Wed, 24 May 2017 03:34:53 +0000 (23:34 -0400)]
dtrace: ensure limit is enforced even when pcs is NULL
The dtrace_user_stacktrace() functions for x86_64 and sparc64 were
not handling the specified limit (st->limit correctly if the buffer
for PC values (st->pcs) was NULL. This commit ensures that we
decrement the limit whenever we encounter a PC, whether it gets
stored or not.
Orabug: 25949692 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Acked-by: Nick Alcock <nick.alcock@oracle.com>
Kris Van Hees [Wed, 24 May 2017 05:00:34 +0000 (01:00 -0400)]
dtrace: ensure ustackdepth returns correct value
The implementation for ustackdepth was causing it to always return 1,
regardless of the depth of the ustack(). This commit ensures that the
underlying code can walk the stack (without actually collecting PCs)
and determine the depth.
Orabug: 25949692 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Acked-by: Nick Alcock <nick.alcock@oracle.com>
Kris Van Hees [Mon, 22 May 2017 15:16:45 +0000 (11:16 -0400)]
dtrace: FBT return probes on x86_64 run with in_irq() true
Because FBT return probes are implemented on x86_64 by means of a
breakpoint trap (int3), and because int3 (on Linux) causes HARDIRQ
to be incremented in the preempt counter, the DTrace core thinks
that the probe was triggered from IRQ context (which it may or may
not be).
This commit ensures that we can detect whether we're processing a
probe triggered using int3, and if so, it subtracts from the HARDIRQ
counter before testing it (to compensate for the int3-imposed
increment).
Orabug: 26089286 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Reviewed-by: Tomas Jedlicka <tomas.jedlicka@oracle.com>
Kris Van Hees [Tue, 16 May 2017 13:59:32 +0000 (09:59 -0400)]
dtrace: different probe trigger instruction for entry vs return
On x86_64, we cannot use the LOCK prefix byte to consistently cause an
invalid opcode trap for FBT return probes because the 'ret' instruction
may be followed by an instruction that can validly take the LOCK prefix.
So, we use a different trigger instruction (int3).
In order to make this possible, the functions to set and clear FBT
probes on x86_64 (dtrace_invop_add() and dtrace_invop_remove()) have
been modified to accept a 2nd argument that indicates the instruction
to patch the probe location with. This is needed because FBT return
probes need a different instruction on x86_64 (LOCK prefix to force
an invalid opcode trap isn't safe because we do not know what
instruction may follow the "ret").
Orabug: 25949048 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Reviewed-by: Tomas Jedlicka <tomas.jedlicka@oracle.com> Acked-by: Nick Alcock <nick.alcock@oracle.com>
Kris Van Hees [Tue, 16 May 2017 03:29:16 +0000 (23:29 -0400)]
dtrace: make x86_64 FBT return probe detection less restrictive
The FBT return probe detection mechanism on x86_64 was requiring that
the "ret" instruction be followed by a "push %rbp" or "nop", which is
much too restrictive. The new code allows probing of all "ret"
instructions that occur in a function regardless of what instruction
follows.
In order to make this possible, the functions to set and clear FBT
probes on x86_64 (dtrace_invop_add() and dtrace_invop_remove()) have
been modified to accept a 2nd argument that indicates the instruction
to patch the probe location with. This is needed because FBT return
probes need a different instruction on x86_64 (LOCK prefix to force
an invalid opcode trap isn't safe because we do not know what
instruction may follow the "ret").
This commit also fixes the declaration of the dtrace_bad_address()
function that was missing its return type.
Orabug: 25949048 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Acked-by: Nick Alcock <nick.alcock@oracle.com>
Kris Van Hees [Tue, 16 May 2017 03:25:09 +0000 (23:25 -0400)]
dtrace: support passing offset as arg0 to FBT return probes
FBT return probes pass the offset from the function start (in bytes)
as arg0. To make that possible, we pass the offset value in the call
to fbt_add_probe. For FBT entry probes we pass 0 (which is ignored).
Orabug: 25949086 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Acked-by: Nick Alcock <nick.alcock@oracle.com>
Kris Van Hees [Tue, 16 May 2017 03:05:41 +0000 (23:05 -0400)]
dtrace: make FBT entry probe detection less restrictive on x86_64
The logic on x86_64 to determine whether we can probe a function is
too restrictive. By placing the probe on the "push %rbp" instruction
we can cover more functions, in case the "mov %rsp,%rbp" instruction
does not follow it immediately.
Orabug: 25949030 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Reviewed-by: Tomas Jedlicka <tomas.jedlicka@oracle.com>
Kris Van Hees [Tue, 16 May 2017 02:42:30 +0000 (22:42 -0400)]
dtrace: adjust FBT entry probe detection for OL7
On OL7, function prologues can be prefixed by a (5-byte) call
instruction on x86_64, which breaks the logic to determine if
we can place an FBT entry probe on that function. The new logic
accounts for the possibility that the anticipated prologue does
not show up as first instruction of the function.
Orabug: 25921361 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Reviewed-by: Nick Alcock <nick.alcock@oracle.com>
Kris Van Hees [Tue, 16 May 2017 13:55:41 +0000 (09:55 -0400)]
dtrace: support passing offset as arg0 to FBT return probes
FBT return probes pass the offset from the function start (in bytes)
as arg0. To make that possible, we pass the offset value in the call
to fbt_add_probe. For FBT entry probes we pass 0 (which is ignored).
This commit also ensures that we emulate the 'ret' instruction on the
return path.
Orabug: 25949086 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Reviewed-by: Tomas Jedlicka <tomas.jedlicka@oracle.com> Acked-by: Nick Alcock <nick.alcock@oracle.com>
Kris Van Hees [Tue, 16 May 2017 13:38:04 +0000 (09:38 -0400)]
dtrace: improve probe execution debugging
The debugging code for probe execution had a few cases where the start
of execution was logged in debugging output but the completion was not
because of early termination conditions. Now all forms of completion
should be covered in debugging output.
Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Reviewed-by: Tomas Jedlicka <tomas.jedlicka@oracle.com> Reviewed-by: Nick Alcock <nick.alcock@oracle.com>
Nick Alcock [Mon, 24 Apr 2017 11:14:05 +0000 (12:14 +0100)]
dtrace: canload() for input of *_ntop(), *_nto*()
These functions (some newly added, some older) were not appropriately
checking if the caller could load from their inputs, so could be
used by the not-yet-implemented unprivileged DTrace to read arbitrary
memory.
Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-by: Tomas Jedlicka <tomas.jedlicka@oracle.com>
The signature of the link_ntop() DTrace subroutine is:
string link_ntop(int hardware_type, void *addr);
link_ntop() takes a pointer to a hardware address and returns a string
which is the translation of that address to a string representation,
with content depending on the provided hardware type. Supported
hardware types are ARPHRD_ETHER and ARPHRD_INFINIBAND, both of which
are defined for use in D programs.
This is the link-level equivalent of inet_ntop().
Signed-off-by: Girish Moodalbail <girish.moodalbail@oracle.com> Reviewed-by: Nick Alcock <nick.alcock@oracle.com>
[nca: reworded commit message a bit]
Orabug: 25931479
The dtrace_dif_variable() function is inlined during some compilations
and not during others, so the number of frames to skip in DTrace kernel
stack traces is not a constant. That causes incorrect
values for stackdepth to be reported.
This commit requests that dtrace_dif_variable() always be inlined, and
adjusts the aframes values to account for the inlining.
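In practice this is the standard kernel annotation; the prototype below is
abbreviated and the parameter list is an assumption:
    static __always_inline uint64_t dtrace_dif_variable(dtrace_mstate_t *mstate,
                                                         dtrace_state_t *state,
                                                         uint64_t v, uint64_t ndx);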
Orabug: 25872472 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Reviewed-by: Tomas jedlicka <tomas.jedlicka@oracle.com> Acked-by: Nick Alcock <nick.alcock@oracle.com>
dtrace: fix handling of save_stack_trace sentinel (x86 only)
On x86 only, when save_stack_trace() writes less stack frames to the
buffer than there is space for, a ULONG_MAX is added as sentinel. The
DTrace code was mistakenly treating the buffer as always ending with a
ULONG_MAX.
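The corrected handling amounts to something like this (variable names are
illustrative; struct stack_trace fields as in the kernels of that era):
    save_stack_trace(&trace);
    nframes = trace.nr_entries;

    /* x86 appends a ULONG_MAX sentinel only when the buffer was not
     * completely filled; strip it only in that case. */
    if (nframes > 0 && nframes < trace.max_entries &&
        trace.entries[nframes - 1] == ULONG_MAX)
            nframes--;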
Orabug: 25727046 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Reviewed-by: Tomas Jedlicka <tomas.jedlicka@oracle.com>
Signed-off-by: Tomas Jedlicka <tomas.jedlicka@oracle.com> Reviewed-by: Kris Van Hees <kris.van.hees@oracle.com> Acked-by: Nick Alcock <nick.alcock@oracle.com>
Kris Van Hees [Wed, 15 Mar 2017 16:52:01 +0000 (12:52 -0400)]
dtrace: incorrect aframes value and wrong logic messes up caller and stack
Due to a mistake in how we compensate for the potential ULONG_MAX
sentinel value being added to kernel stacks on x86_64 (by the
save_stack_trace() function), the caller was always reported as 0.
This in turn was hiding a problem with the aframes values that are
used to ensure we skip the right amount of frames when reporting a
stack, caller, and calculating the stackdepth. Effectively, it tells
the stack walker how many frames were added to the stack due to DTrace
processing.
Orabug: 25727046 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com>
Kris Van Hees [Wed, 15 Mar 2017 03:20:52 +0000 (23:20 -0400)]
dtrace: ensure we pass a limit to dtrace_stacktrace for stackdepth
When determining the (kernel) stackdepth, we pass scratch memory to the
dtrace_stacktrace() function because we are not interested in the actual
program counter values. However, we were passing in 0 as limit rather
than the actual maximum number of PCs that could fit in the remaining
scratch memory space.
We now also add no-fault protection to dtrace_getstackdepth().
Orabug: 25559321 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com>
Kris Van Hees [Wed, 1 Mar 2017 04:37:11 +0000 (23:37 -0500)]
dtrace: continuing the FBT implementation and fixes
This commit continues the implementation of Function Boundary Tracing
(FBT) and fixes various problems with the original implementation and
other things in DTrace that it caused to break. It is done as a single
commit due to the intertwined nature of the code it touches.
1. We were only handling unaligned memory access traps as part of the
NOFAULT access protection. This commit adds handling of data access and
instruction access traps as well.
2. When an OOPS takes place, we now add output about whether we are
in DTrace probe context and what the last probe was that was being
processed (if any). That last data item isn't guaranteed to always
have a valid value. But it is helpful.
3. New ustack stack walker implementation (moved from module to kernel
for consistency and because we need access to low level structures
like the page tables) for both x86 and sparc. The new code avoids
any locking or sleeping. The new user stack walker is accessed as a
sub-function of dtrace_stacktrace(), selected using the flags
field of stacktrace_state_t.
4. We added a new field to the dtrace_psinfo_t structure (ustack) to
hold the bottom address of the stack. This is needed in the stack
walker (specifically for x86) to know when we have reached the end
of the stack. It is initialized from copy_process (in DTrace
specific code) when stack_start is passed as parameter to clone.
It is also set from dtrace_psinfo_alloc() (which is generally called
from performing an exec), and there it gets its value from the
mm->start_stack value.
5. The FBT black lists have been updated with functions that may be
invoked during probe processing. In addition, for x86_64 we added
explicit filter out of functions that start with insn_* or inat_*
because they are used for instruction analysis during probe
processing.
6. On sparc64, per-cpu data is accessed by means of a global register
that holds the base address for this memory area. Some assembler
code clobbers that register in some cases, so it is not safe to
depend on this in probe context. Instead, we explicitly access
the data based on the smp_processor_id().
7. We added a new CPU DTrace flag (CPU_DTRACE_PROBE_CTX) to flag that
we are processing in DTrace probe context. It is primarily used
to detect attempts of re-entry into dtrace_probe().
Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Acked-by: Nick Alcock <nick.alcock@oracle.com>
Orabug: 21220305
Orabug: 24829326
Kris Van Hees [Mon, 27 Feb 2017 15:39:07 +0000 (10:39 -0500)]
dtrace: ensure DTrace can use get_user_pages safely
The processing of the DTrace-specific FOLL_IMMED flag was not robust
enough. We could still get into a situation where cond_resched() was
called (which is bad) or where the VMA area would get extended (which
is also bad). The only code that passes this flag is DTrace support
code, and when the flag is not passed, the execution flow is not at all
affected.
Orabug: 25640153 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Reviewed-by: Chuck Anderson <chuck.anderson@oracle.com> Reviewed-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-by: Tomas Jedlicka <tomas.jedlicka@oracle.com>
Kris Van Hees [Fri, 24 Feb 2017 23:40:40 +0000 (18:40 -0500)]
dtrace: enable paranoid mode and IST shift for xen_int3
The Xen PVM path into an INT3 trap was not using paranoid=1 mode nor was
it using an IST shift as is done for HW INT3 traps. This interferes with
the instruction emulation code check based on the handler return value.
Orabug: 25580519 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Kris Van Hees [Mon, 20 Feb 2017 12:16:48 +0000 (07:16 -0500)]
dtrace: ensure we skip the entire SDT probe point
With the introduction of FBT support, the logic for skipping instructions
(with potential emulation of the skipped instruction) changed. This change
did not take into account the fact that is-enabled probes on x86_64 use a
3-byte sequence for setting ax to 0, followed by a 2-byte NOP. The old logic
resulted in failing to skip the setting of ax correctly.
New logic uses the knowledge that all SDT probes on x86_64 are of the same
length (ASM_CALL_SIZE) and therefore we can simply skip that number of bytes
and continue without any emulation.
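So the handling reduces to a fixed skip, roughly as below (a sketch of the
idea only; the handler's exact return convention is described in the
invop-handler commit further down this log):
    /* All SDT probe sites on x86_64 are ASM_CALL_SIZE bytes long, so the
     * trap handler can simply report that the whole site should be
     * skipped; nothing needs to be emulated. */
    return ASM_CALL_SIZE;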
Orabug: 25557283 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Acked-by: Nick Alcock <nick.alcock@oracle.com>
Kris Van Hees [Fri, 3 Mar 2017 02:02:01 +0000 (21:02 -0500)]
dtrace: continuing the FBT implementation and fixes
This commit continues the implementation of Function Boundary Tracing
(FBT) and fixes various problems with the original implementation and
other things in DTrace that it caused to break. It is done as a single
commit due to the intertwined nature of the code it touches.
1. The sparc64 fast path implementation (dtrace_caller) for the D 'caller'
variable was trampling the %g4 register which Linux uses to hold the
'current' task pointer. By passing in a dummy argument, we ensure
that we can use the %i1 register to temporarily store %g4.
2. For consistency, we are now using stacktrace_state_t instead of
struct stacktrace_state.
3. We now call dtrace_stacktrace() under NOFAULT protection.
4. The ustack stack walker has been rewritten (in the kernel), so the
previous implementation has been removed.
5. We no longer process probes when the kernel panics, to avoid DTrace
disrupting output that could be crucial to debugging.
6. We now ensure that re-entry of dtrace_probe() can no longer happen,
except for the ERROR probe (which is a re-entry by design).
7. Since FBT now works, the restriction to only support SyS_* functions
has been removed.
Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Signed-off-by: Tomas Jedlicka <tomas.jedlicka@oracle.com> Reviewed-by: Nick Alcock <nick.alcock@oracle.com>
Orabug: 21220305
Orabug: 24829326
Alan Maguire [Mon, 23 Jan 2017 15:18:31 +0000 (15:18 +0000)]
dtrace: introduce and use typedef in6_addr_t
This is for consistency with the similar typedef in_addr_t: we have
to use the typedef in at least one place in the module so that the
compiler incorporates it into the DWARF and it ends up in the CTF
section. (Both the DTrace ip translators and, likely, the users
would expect that if one typedef exists, the other one does too.)
Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Acked-by: Nick Alcock <nick.alcock@oracle.com>
Orabug: 25557554
Kris Van Hees [Sat, 17 Dec 2016 23:08:44 +0000 (18:08 -0500)]
dtrace: SDT cleanup and bring in line with kernel
This commit performs some cleanup on the SDT provider, removing some
housekeeping tasks that are no longer needed (such as the need for an
arch-specific sdt_provide_module_arch() function).
This commit also contains a fix for the loop used in enabling and
disabling probes. It was failing to ensure that the enable/disable
function was being called with the correct SDT probe.
Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com>
Kris Van Hees [Sat, 17 Dec 2016 23:08:44 +0000 (18:08 -0500)]
dtrace: fix preemption checks
The macros to verify whether the current execution can be preempted
were wrong. This commit fixes that. It also ensures that we call the
functions (or macros) provided for enabling/disabling preemption by
the kernel itself.
Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com>
Kris Van Hees [Sat, 17 Dec 2016 23:08:44 +0000 (18:08 -0500)]
dtrace: when calling all modules do not forget kernel
For DTrace, the kernel is represented as a pseudo-module. When a loop
is made over all loaded modules in order to call a function for each
one of them, we need to also call that function for the kernel
pseudo-module.
Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com>
Kris Van Hees [Sat, 17 Dec 2016 23:08:44 +0000 (18:08 -0500)]
dtrace: remove cleanup_module support
There is no need anymore for providers to call a cleanup_module
function in provider modules. The functionality that this function
provided is being rewritten.
Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com>
Kris Van Hees [Wed, 23 Nov 2016 18:24:10 +0000 (18:24 +0000)]
dtrace: is-enabled probes for SDT
This is the module side of the is-enabled probe implementation. SDT
distinguishes is-enabled probes from normal probes by the leading ? in
their sdpd_name; at probe-firing time, the arch-dependent code arranges
to return 1 appropriately.
On x86, also arrange to jump past the probe's NOP region. There was no
need to do this before now, because a trap followed by a bunch of NOPs
is a perfectly valid instruction stream: but is-enabled probes have a
three-byte sequence implementing "xor %rax, %rax", and overwriting only
the first byte of that leaves us with a couple of bytes that must be
skipped. On SPARC, we drop the necessary return-value-changing
instruction into the delay slot of the call that used to be there
before we overwrote it with NOPs: the instruction already there
is setting up the function argument-and-return-value, which is 0
when the probe is disabled, so we can overwrite it safely.
(We make minor adjustments to allow sdt_provide_probe_arch() to
safely modify the sdp_patchpoint.)
Finally, add a test use of an is-enabled probe to dt_test, used by the
DTrace testsuite.
[nca: sparc implementation, ip address adjustment, commit msg] Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Acked-by: Kris Van Hees <kris.van.hees@oracle.com>
Orabug: 25143173
Kris Van Hees [Fri, 16 Dec 2016 23:54:10 +0000 (18:54 -0500)]
dtrace: function boundary tracing (FBT)
This patch implements Function Boundary Tracing (FBT) for x86 and
sparc. It covers generic, x86 and sparc specific changes.
Generic:
A new exported function (which will be provided by each supported
architecture) dtrace_fbt_init() is added to be called from module
code to initiate the discovery of FBT probes. Any eligible probe
(based on arch-dependent logic) is passed to the provided callback
function pointer for processing.
A new option is added to the DTrace kernel config to enable FBT.
The logic for determining the size of the pdata memory block (which
is arch-dependent) has changed at the architecture level, and the
code for setting up the kernel pseudo-module has been modified to
account for that change.
The post-processing script dtrace_sdt.sh is now determining how
many functions exist in the kernel as an upper bound for the number
of functions that can be traced with FBT. The logic ensures that
aliases are not counted.
x86:
On x86_64, entry FBT probes are implemented similarly to SDT probes on
that same architecture. A trap is triggered from the probe location,
which causes a call into the DTrace core. The entry probe is placed on
the 'mov %rsp,%rbp' instruction that immediately followed a 'push %rbp'
as part of the function prologue. If this instruction sequence is not
found, no entry probe will be created for the function.
sparc:
On sparc64, entry FBT probes are also implemented similar to SDT
probes on that same architecture. A call is made into a trampoline
(allocated within the limited addressing range for a single assembler
instruction) which in turn calls into the DTrace core. The entry probe
is placed on the location where typically a call is made to _mcount for
profiling purposes. Under normal operation, that instruction will be
a NOP.
Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Acked-by: David Mc Lean <david.mclean@oracle.com> Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com> Reviewed-by: Allen Pais <allen.pais@oracle.com>
Orabug: 21220305
Orabug: 24829326
Kris Van Hees [Thu, 22 Dec 2016 07:20:57 +0000 (02:20 -0500)]
dtrace: add support for passing return value from trap handlers
Prior to this patch, trap handlers were called to service traps without
any mechanism to report back to the lowest level trap entry point. The
DTrace FBT implementation on x86 needs to be able to do just that because
FBT probes are enabled by replacing a one-byte assembler instruction with
a one-byte instruction that causes a trap. After the trap is handled, we
need to emulate the instruction that was replaced prior to returning to
the original instruction stream. Because different instructions may occur
at FBT probe points, we need to be able to report back to the trap entry
point which instruction was replaced by the trap.
Trap handlers that do not use notify_die() always return 0. Those that do use
notify_die() to call handlers have been modified to return the value that
the handler itself returned.
Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Reviewed-By: Dan Duval <dan.duval@oracle.com>
Orabug: 25312278
Kris Van Hees [Fri, 16 Dec 2016 18:04:52 +0000 (13:04 -0500)]
dtrace: ensure that our die notifier gets executed amongst the first
The die notifier is crucial for implementing safe memory access in
DTrace. To avoid other handlers potentially causing interference, we
set the priority of our handler at the highest level.
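A minimal sketch of the idea (handler name and exact priority value are
illustrative assumptions, not the actual patch):

    #include <linux/kdebug.h>
    #include <linux/notifier.h>

    static struct notifier_block dtrace_die_nb = {
            .notifier_call  = dtrace_die_handler,  /* hypothetical handler */
            .priority       = INT_MAX,    /* run before other die handlers */
    };

    /* at init time */
    register_die_notifier(&dtrace_die_nb);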
Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Acked-by: David Mc Lean <david.mclean@oracle.com>
Nick Alcock [Mon, 28 Nov 2016 14:53:15 +0000 (14:53 +0000)]
dtrace: allow invop handler to specify number of insns to skip
Rather than unconditionally skipping one instruction, repurpose the
previously unused return value of the invop handler to let it specify
the number of instructions to skip. (In the common case, where the
handler always knows how many instructions to skip, this is more
efficient.)
Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Acked-by: Kris Van Hees <kris.van.hees@oracle.com>
Nick Alcock [Wed, 23 Nov 2016 17:50:09 +0000 (17:50 +0000)]
dtrace: is-enabled probes for SDT
"Is-enabled probes" are a conditional, long supported in userspace
probing, which lets you avoid doing expensive data-collection operations
needed only by DTrace probes unless those probes are active.
e.g. (an example using the core DTRACE_PROBE / DTRACE_IS_ENABLED macros,
rather than the DTRACE_providername macros used in practice, because
no such macros have been added to the kernel yet):
    if (DTRACE_IS_ENABLED(__io_wait__start)) {
            /* stuff done only when io:::wait-start is enabled */
    }
As with normal SDT probes, the DTRACE_IS_ENABLED() macro compiles to a
stub function call (named like __dtrace_isenabled_*()) which is replaced
at bootup/module load time with an architecture-dependent instruction
sequence analogous to a function that always returns false, though no
function call is generated. At probe enabling time, this is replaced
with a trap into dtrace just like normal dtrace probes, incurring a
performance hit, but only when the probe is active.
The probe name used in the various ELF sections that track SDT
probes begins with a ? character to help the module distinguish
is-enabled probes from normal probes: this is internal to the DTrace
implementation and is otherwise invisible.
(Thanks to Kris Van Hees for initial work on this.)
Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Acked-by: Kris Van Hees <kris.van.hees@oracle.com>
Orabug: 25143173
Nick Alcock [Mon, 31 Oct 2016 13:12:56 +0000 (13:12 +0000)]
dtrace: check for errors when getting a new fd
get_unused_fd_flags() can legitimately fail, e.g. when the fd table is
full. We need to diagnose that rather than trying to fd_install() the
resulting negative number.
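A minimal sketch of the intended pattern (hypothetical caller; 'file'
is assumed to come from elsewhere):

    int fd = get_unused_fd_flags(O_CLOEXEC);

    if (fd < 0)
            return fd;      /* e.g. -EMFILE when the fd table is full */
    fd_install(fd, file);
    return fd;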
Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Acked-by: Kris Van Hees <kris.van.hees@oracle.com> Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Orabug: 24977175
Nick Alcock [Mon, 31 Oct 2016 10:44:26 +0000 (10:44 +0000)]
dtrace: take mmap_sem in PTRACE_GETMAPFD
Without this, we may oops if the process exec()s and discards its
address space after we find_vma().
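A minimal sketch of the locking pattern (a simplified illustration, not
the literal patch):

    down_read(&mm->mmap_sem);
    vma = find_vma(mm, addr);
    if (vma && vma->vm_file)
            file = get_file(vma->vm_file);  /* vma cannot go away under us */
    up_read(&mm->mmap_sem);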
Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Acked-by: Kris Van Hees <kris.van.hees@oracle.com> Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Orabug: 24977175
Junxiao Bi [Tue, 1 Nov 2016 06:42:20 +0000 (14:42 +0800)]
ocfs2: fix not enough credit panic
The following panic was caught when running the ocfs2 disconfig single
test (block size 512 and cluster size 8192). ocfs2_journal_dirty()
returned -ENOSPC, which means the credits were used up. The total credit
should include 3 times the "num_dx_leaves" from ocfs2_dx_dir_rebalance(),
because 2 of them will be consumed in ocfs2_dx_dir_transfer_leaf() and
1 will be consumed in ocfs2_dx_dir_new_cluster()->
__ocfs2_dx_dir_new_cluster()->ocfs2_dx_dir_format_cluster(). But only
2 are included in ocfs2_dx_dir_rebalance_credits(); fix it.
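Conceptually, the credit accounting described above looks like this
(a sketch of the arithmetic, not the literal diff):

    /* in ocfs2_dx_dir_rebalance_credits(), conceptually: */
    credits += 2 * num_dx_leaves;  /* consumed in ocfs2_dx_dir_transfer_leaf() */
    credits += 1 * num_dx_leaves;  /* consumed via ocfs2_dx_dir_format_cluster() */
    /* previously only 2 * num_dx_leaves was reserved, so the journal ran
       out of credits and ocfs2_journal_dirty() returned -ENOSPC */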
Eric Ren [Fri, 30 Sep 2016 22:11:32 +0000 (15:11 -0700)]
ocfs2: fix deadlock on mmapped page in ocfs2_write_begin_nolock()
The testcase "mmaptruncate" of ocfs2-test deadlocks occasionally.
In this testcase, we create a 2*CLUSTER_SIZE file and mmap() it;
there are 2 processes repeatedly performing the following operations:
one keeps doing memset(mmaped_addr + 2*CLUSTER_SIZE - 1, 'a', 1),
while the other keeps calling ftruncate(fd, 2*CLUSTER_SIZE) and then
ftruncate(fd, CLUSTER_SIZE), again and again.
In ocfs2_write_begin_nolock(), we first grab the pages and then allocate
disk space for this write; ocfs2_try_to_free_truncate_log() will be
called if -ENOSPC is returned; if we are lucky enough to get the needed
clusters, which is usually the case, we start over again.
But in ocfs2_free_write_ctxt() the target page isn't unlocked, so we
will deadlock when trying to grab the target page again.
Also, -ENOMEM might be returned in ocfs2_grab_pages_for_write().
Another deadlock will happen in __do_page_mkwrite() if
ocfs2_page_mkwrite() returns something other than VM_FAULT_LOCKED
while leaving the target page locked.
These two errors fail on the same path, so fix them by unlocking the
target page manually before ocfs2_free_write_ctxt().
Jan Kara helped me sort out the JBD2 part and suggested the hint for
the root cause.
Changes since v1:
1. Also put ENOMEM error case into consideration.
Link: http://lkml.kernel.org/r/1474173902-32075-1-git-send-email-zren@suse.com Signed-off-by: Eric Ren <zren@suse.com> Reviewed-by: He Gang <ghe@suse.com> Acked-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit c33f0785bf292cf1d15f4fbe42869c63e205b21c)
Joseph Qi [Mon, 19 Sep 2016 21:43:55 +0000 (14:43 -0700)]
ocfs2/dlm: fix race between convert and migration
Commit ac7cf246dfdb ("ocfs2/dlm: fix race between convert and recovery")
checks whether the lockres master has changed to identify whether the
new master has finished recovery or not. This introduces a race: right
after the old master umounts (which means the master will change), a new
convert request comes in.
In this case, it will reset lockres state to DLM_RECOVERING and then
retry convert, and then fail with lockres->l_action being set to
OCFS2_AST_INVALID, which will cause inconsistent lock level between
ocfs2 and dlm, and then finally BUG.
Since dlm recovery will clear lock->convert_pending in
dlm_move_lockres_to_recovery_list, we can use it to correctly identify
the race case between convert and recovery. So fix it.
Fixes: ac7cf246dfdb ("ocfs2/dlm: fix race between convert and recovery") Link: http://lkml.kernel.org/r/57CE1569.8010704@huawei.com Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Jun Piao <piaojun@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit e6f0c6e6170fec175fe676495f29029aecdf486c)
jiangyiwen [Fri, 25 Mar 2016 21:21:35 +0000 (14:21 -0700)]
ocfs2: solve a problem of crossing the boundary in updating backups
In update_backups() there exists a problem of crossing the boundary as
follows:
assume the LUN is resized to 1TB (the cluster size is 32KB); it will then
contain clusters 0~33554431. In update_backups(), the super block is
backed up at the 1TB location, which is the 33554432nd cluster, so the
access crosses the boundary.
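The arithmetic behind the overflow, for illustration:

    /* 1TB = 2^40 bytes, cluster size 32KB = 2^15 bytes */
    u64 nr_clusters = (1ULL << 40) >> 15;   /* 33554432 clusters */
    /* valid cluster indexes are 0 .. 33554431, so a backup super placed
       at the 1TB mark (cluster 33554432) lies one past the end */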
Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Xue jiufei <xuejiufei@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 584dca3440732afa84fbca07567bb66e1453936a)
jiangyiwen [Tue, 15 Mar 2016 21:53:01 +0000 (14:53 -0700)]
ocfs2: use spinlock_irqsave() to downconvert lock in ocfs2_osb_dump()
Commit a75e9ccabd92 ("ocfs2: use spinlock irqsave for downconvert lock")
missed one unmodified place in ocfs2_osb_dump(), so a deadlock scenario
still exists.
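A sketch of the missed conversion (the lock name follows the downconvert
lock referenced in commit a75e9ccabd92; treat the details as assumptions):

    unsigned long flags;

    spin_lock_irqsave(&osb->dc_task_lock, flags);
    /* dump the downconvert state as before ... */
    spin_unlock_irqrestore(&osb->dc_task_lock, flags);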
Joseph Qi [Thu, 14 Jan 2016 23:17:44 +0000 (15:17 -0800)]
ocfs2: access orphan dinode before delete entry in ocfs2_orphan_del
In ocfs2_orphan_del, the entry is currently found and deleted first, and
the orphan dir dinode is accessed afterwards. This is a problem once
ocfs2_journal_access_di fails: the entry has already been removed from
the orphan dir, but the inode has not actually been deleted. In other
words, the file goes missing without really being deleted. So we should
access the orphan dinode first, as unlink and rename do.
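A simplified sketch of the reordering (argument lists abbreviated; an
illustration, not the literal diff):

    /* access the orphan dir dinode first ... */
    status = ocfs2_journal_access_di(handle, INODE_CACHE(orphan_dir_inode),
                                     orphan_dir_bh, OCFS2_JOURNAL_ACCESS_WRITE);
    if (status < 0)
            goto leave;     /* nothing removed yet, so nothing is lost */

    /* ... and only then find and delete the entry */
    status = ocfs2_delete_entry(handle, orphan_dir_inode, &lookup);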
Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Reviewed-by: Jiufei Xue <xuejiufei@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 074a6c655f6da12cb1123c8a84bfd8d781138800)
xuejiufei [Thu, 14 Jan 2016 23:17:41 +0000 (15:17 -0800)]
ocfs2/dlm: do not insert a new mle when another process is already migrating
When two processes are migrating the same lockres,
dlm_add_migration_mle() returns -EEXIST but still inserts a new mle into
the hash list. dlm_migrate_lockres() will then detach the old mle and
free the new one, which is already in the hash list, corrupting the list.
Signed-off-by: Jiufei Xue <xuejiufei@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 32e493265b2be96404aaa478fb2913be29b06887)
jiangyiwen [Thu, 14 Jan 2016 23:17:33 +0000 (15:17 -0800)]
ocfs2: fix slot overwritten if storage link down during mount
The following case leads to a slot being overwritten.
N1: mount ocfs2 volume, find and allocate slot 0, then set
    osb->slot_num to 0, begin to write slot info to disk
N2: mount ocfs2 volume, wait for super lock
N1: write block fail because of storage link down, unlock super lock
N2: got super lock and also allocate slot 0, then unlock super lock
N1: mount fail and then dismount, since osb->slot_num is 0, try to
    put invalid slot to disk. And it will succeed if storage link
    restores.
N2: slot info is now overwritten
Once another node, say N3, mounts, it will find and allocate slot 0
again, which will lead to a mount hang because the journal has already
been locked by N2. So when writing the slot info fails, invalidate the
slot in advance to avoid overwriting it.
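A sketch of the idea (helper names are my reading of fs/ocfs2/slot_map.c
and should be treated as assumptions):

    status = ocfs2_update_disk_slot(osb, si, osb->slot_num);
    if (status < 0) {
            /* forget the slot we thought we owned, so a later dismount
               cannot write stale slot info back to disk */
            ocfs2_invalidate_slot(si, osb->slot_num);
            osb->slot_num = OCFS2_INVALID_SLOT;
    }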
[akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 1247017f43a93eae3d64b7c25f3637dc545f5a47)
Xue jiufei [Thu, 14 Jan 2016 23:17:29 +0000 (15:17 -0800)]
ocfs2/dlm: return appropriate value when dlm_grab() returns NULL
dlm_grab() may return NULL when the node is unmounting. During code
review, we found that some dlm handlers return an error to the caller
when dlm_grab() returns NULL, making the caller BUG or causing other
problems.
Here is an example:
Node 1: receives migration message from node 3, and sends a migrate
        request to the other nodes
Node 2: starts unmounting
Node 2: receives the migrate request from node 1 and calls
        dlm_migrate_request_handler()
Node 2: the unmount thread unregisters the domain handlers and removes
        dlm_context from dlm_domains
Node 2: dlm_migrate_request_handler() returns -EINVAL to node 1
Node 1: exits migration without clearing the migration state and without
        sending the assert master message to node 3, which leaves node 3
        hung.
Signed-off-by: Jiufei Xue <xuejiufei@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit c372f2193a2e73d5936bf37259ae63ca388b4cbc)
jiangyiwen [Thu, 14 Jan 2016 23:17:23 +0000 (15:17 -0800)]
ocfs2/dlm: wait until DLM_LOCK_RES_SETREF_INPROG is cleared in dlm_deref_lockres_worker
Commit f3f854648de6 ("ocfs2_dlm: Ensure correct ordering of set/clear
refmap bit on lockres") still leaves a race in which the ordering is not
guaranteed to be correct.
Node1 Node2 Node3
umount, migrate
lockres to Node2
migrate finished,
send migrate request
to Node3
received migrate request,
create a migration_mle,
respond to Node2.
set DLM_LOCK_RES_SETREF_INPROG
and send assert master to
Node3
delete migration_mle in
assert_master_handler,
Node3 umount without response
dlm_thread purge
this lockres, send drop
deref message to Node2
found the flag of
DLM_LOCK_RES_SETREF_INPROG
is set, dispatch
dlm_deref_lockres_worker to
clear refmap, but in function of
dlm_deref_lockres_worker,
only if node in refmap it wait
DLM_LOCK_RES_SETREF_INPROG
to be cleared. So worker is
done successfully
purge lockres, send
assert master response
to Node1, and finish umount
set Node3 in refmap, and it
won't be cleared forever, thus
lead to umount hung
So, wait until DLM_LOCK_RES_SETREF_INPROG is cleared in
dlm_deref_lockres_worker().
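A simplified sketch of the fix named in the title (assuming the existing
__dlm_wait_on_lockres_flags() helper is used; not the literal diff):

    spin_lock(&res->spinlock);
    /* wait unconditionally, not only when the node is already in the refmap */
    __dlm_wait_on_lockres_flags(res, DLM_LOCK_RES_SETREF_INPROG);
    if (test_bit(node, res->refmap)) {
            dlm_lockres_clear_refmap_bit(dlm, res, node);
            cleared = 1;
    }
    spin_unlock(&res->spinlock);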
Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit b5560143385e18b4109ad6951c7719705e3dd995)
Xue jiufei [Thu, 14 Jan 2016 23:17:18 +0000 (15:17 -0800)]
ocfs2/dlm: fix a race between purge and migration
We found a race between purge and migration during code review.
Node A puts a lockres on the purge list before receiving the migrate
message from node B, which is the master. Node A then calls
dlm_mig_lockres_handler to handle this message.
dlm_mig_lockres_handler
    dlm_lookup_lockres
    >>>>>> race window, dlm_run_purge_list may run and send
           deref message to master, waiting the response
    spin_lock(&res->spinlock);
    res->state |= DLM_LOCK_RES_MIGRATING;
    spin_unlock(&res->spinlock);
    dlm_mig_lockres_handler returns

>>>>>> dlm_thread receives the response from master for the deref
       message and triggers the BUG because the lockres has the state
       DLM_LOCK_RES_MIGRATING with the following message:
xuejiufei [Tue, 29 Dec 2015 22:54:29 +0000 (14:54 -0800)]
ocfs2/dlm: clear migration_pending when migration target goes down
We have found a BUG on res->migration_pending when migrating lock
resources. The situation is as follows.
dlm_mark_lockres_migrating
    res->migration_pending = 1;
    __dlm_lockres_reserve_ast
    dlm_lockres_release_ast returns with res->migration_pending still
        set, because other threads reserve asts
    wait until dlm_migration_can_proceed returns 1
    >>>>>> o2hb finds that the target has gone down and removes the
           target from domain_map
    dlm_migration_can_proceed returns 1
    dlm_mark_lockres_migrating returns -EHOSTDOWN with
        res->migration_pending still set.
When dlm_mark_lockres_migrating() is reentered, it will trigger the
BUG_ON on res->migration_pending. So clear migration_pending when the
target goes down.
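A simplified sketch of the fix (placement and error value follow my
reading of dlm_mark_lockres_migrating(); treat the details as
assumptions):

    if (!test_bit(target, dlm->domain_map)) {
            /* the target left the domain: clear the flag before bailing
               out so that a retried migration does not hit the BUG_ON */
            spin_lock(&res->spinlock);
            res->migration_pending = 0;
            spin_unlock(&res->spinlock);
            ret = -EHOSTDOWN;
    }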
Signed-off-by: Jiufei Xue <xuejiufei@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit cc28d6d80f6ab494b10f0e2ec949eacd610f66e3)
Joseph Qi [Tue, 29 Dec 2015 22:54:06 +0000 (14:54 -0800)]
ocfs2: fix BUG when calculate new backup super
When resizing, it first extends the last gd. Once it needs to back up
the super in that gd, it calculates the new backup super location and
updates the corresponding value.
But it currently doesn't consider the situation where the backup super
has already been done. In this case, it still sets the bit in the gd
bitmap and then decreases bg_free_bits_count, which leads to a corrupted
gd and triggers the BUG in ocfs2_block_group_set_bits: