Tomas Jedlicka [Sun, 2 Jul 2017 16:17:01 +0000 (12:17 -0400)]
dtrace: Add support for manually triggered cyclics
In some scenarios it is better if a client of the cyclic subsystem can reprogram a
cyclic on its own. This is not possible with the current implementation.
This change adds cyclic_reprogram(), which can be used to schedule a cyclic from
inside and outside of its handler. A manually triggered cyclic is distinguished
from other types by having its interval set to -1.
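A minimal usage sketch, assuming the usual cyclic API shapes (the handler setup
and timing values below are illustrative, not the literal patch):
/* Create a cyclic that never fires on its own: interval == -1. */
cyc_handler_t hdlr = { .cyh_func = my_handler, .cyh_level = CY_LOW_LEVEL };
cyc_time_t when = { .cyt_when = 0, .cyt_interval = -1 };
cyclic_id_t id = cyclic_add(&hdlr, &when);

/* Later, from inside or outside the handler: fire once, 1s from now. */
cyclic_reprogram(id, gethrtime() + NANOSEC);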
Tomas Jedlicka [Fri, 30 Jun 2017 13:17:06 +0000 (09:17 -0400)]
dtrace: LOW level cyclics should use workqueues
The HIGH level cyclics are meant to be run from an interrupt handler. This works on
Linux because the hrtimer is scheduled as a tasklet. The LOW level cyclics must be
interruptible and should not be scheduled as tasklets.
DTrace currently relies on being able to call dtrace_sync() from within a cyclic
handler. On Linux it is not safe to send IPIs from within interrupt/bottom-half
handlers.
This fix changes LOW level cyclics to use workqueues. At the moment we are using the
shared system workqueue, but we may need to allocate our own if this causes
significant latency in our timer routines.
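A rough illustration of the pattern (the struct and its field names here are
hypothetical, not the actual implementation):
struct cyclic {                         /* hypothetical, for illustration */
        struct hrtimer          c_timer;
        struct work_struct      c_work;
        ktime_t                 c_interval;
        void                    (*c_handler)(void *);
        void                    *c_arg;
};

static void cyclic_low_fn(struct work_struct *work)
{
        struct cyclic *cyc = container_of(work, struct cyclic, c_work);

        /* Runs in process context: dtrace_sync()/IPIs are safe here. */
        cyc->c_handler(cyc->c_arg);
}

static enum hrtimer_restart cyclic_low_expire(struct hrtimer *timer)
{
        struct cyclic *cyc = container_of(timer, struct cyclic, c_timer);

        schedule_work(&cyc->c_work);    /* defer out of interrupt context */
        hrtimer_forward_now(timer, cyc->c_interval);
        return HRTIMER_RESTART;
}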
Signed-off-by: Tomas Jedlicka <tomas.jedlicka@oracle.com> Reviewed-by: Kris Van Hees <kris.van.hees@oracle.com> Acked-by: Chuck Anderson <chuck.anderson@oracle.com>
Kris Van Hees [Mon, 12 Jun 2017 13:33:05 +0000 (09:33 -0400)]
dtrace: add kprobe-unsafe addresses to FBT blacklist
Using the newly introduced API for adding entries to the FBT
blacklist, we register all addresses that are unsafe for kprobes,
since they are unsafe for FBT as well.
Orabug: 26190412 Signed-off-by: Tomas Jedlicka <tomas.jedlicka@oracle.com> Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Acked-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Kris Van Hees [Mon, 12 Jun 2017 13:29:16 +0000 (09:29 -0400)]
dtrace: convert FBT blacklist to RB-tree
The blacklist for FBT was implemented as a sorted list, populated from
a static list of functions. In order to allow functions to be added
from other places (i.e. programmatically), it has been converted to an
RB-tree with an API to add functions and to traverse the list. It is
still possible to add functions by address or to add them by symbol
name, to be resolved into the corresponding address.
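An illustrative sketch of an address-keyed RB-tree blacklist using the kernel
rbtree API (the entry type and function names are assumptions; the real FBT
code may differ):
struct fbt_blacklist_entry {
        struct rb_node  node;
        unsigned long   addr;
};

static struct rb_root fbt_blacklist = RB_ROOT;

static void fbt_blacklist_add(unsigned long addr)
{
        struct rb_node **p = &fbt_blacklist.rb_node, *parent = NULL;
        struct fbt_blacklist_entry *e;

        while (*p) {
                parent = *p;
                e = rb_entry(parent, struct fbt_blacklist_entry, node);
                if (addr < e->addr)
                        p = &parent->rb_left;
                else if (addr > e->addr)
                        p = &parent->rb_right;
                else
                        return;         /* already blacklisted */
        }

        e = kmalloc(sizeof(*e), GFP_KERNEL);
        if (!e)
                return;
        e->addr = addr;
        rb_link_node(&e->node, parent, p);
        rb_insert_color(&e->node, &fbt_blacklist);
}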
Orabug: 26190412 Signed-off-by: Tomas Jedlicka <tomas.jedlicka@oracle.com> Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Acked-by: Nick Alcock <nick.alcock@oracle.com>
Alan Maguire [Wed, 14 Jun 2017 13:06:27 +0000 (14:06 +0100)]
DTrace: IP provider use-after-free for drop-out probe points
KASan warnings showed a possible use-after-free for skbs in the error handling
codepath after netfilter hooks have run. Hooks may free the skb, so we
should not dereference it at drop-out probe points after NF_HOOK().
Nick Alcock [Thu, 15 Jun 2017 21:04:08 +0000 (22:04 +0100)]
ctf: fix a variety of memory leaks and use-after-free bugs
These fall into two classes, but are sufficiently intertwined that it's
easier to commit them in one go.
The first is outright leaks, which exceed 1GiB on a normal run, varying
from the tiny (failure to free getline()'s line), through the disastrous
(failure to free items filtered from a list by list_filter(), leading to
the leaking of nearly the whole of the named_structs state, which is
huge). We also leak the structs_seen hash by recreating it on
alias_fixup file switch without bothering to destroy it first.
The second is lifetime problems, centred around the stuff allocated and
freed in the detect_duplicates_tu_{init,done}() functions. These were
comparing the module name against a saved copy to see if a new vars_seen
needed to be allocated, or whether this was just a flip of TU without a
change of object file and we could get away with just flushing its
contents out -- but unfortunately the state->module_name is assigned
directly from its parameter, and *that* has a lifetime lasting only
within process_file() -- and a deduplication run, of course, involves
iterating over a great many object files. So everything works as long as
we're flipping from TU to TU within a single object file, and then we
switch object files and are suddenly strcmp()ing with freed memory.
Discard this faulty optimization entirely, and just flush the vars_seen
hash in tu_done() and both create and destroy it in scan_duplicates(),
right where we create and destroy related stuff too.
Something similar happens with the state->dwfl_file_name due to its
derivation from id_file->file_name: if no duplicates are found, we
list_filter() that id_file straight out of the structs_seen list and
free it, and then on the next call state->dwfl_file_name points to freed
memory.
Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Acked-by: Kris Van Hees <kris.van.hees@oracle.com> Acked-by: Tomas Jedlicka <tomas.jedlicka@oracle.com>
Orabug: 26283357
Nick Alcock [Wed, 24 May 2017 16:57:05 +0000 (17:57 +0100)]
ctf: prevent modules on the dedup blacklist from sharing any types at all
The deduplication blacklist exists to deal with a few modules, mostly
old ISA sound drivers, which consist of a #define and a #include of
another module, which then uses #ifdef profusely, sometimes on structure
members. This leads to DWARF which has differently shaped structures
in the same source location, which conflicts with the purpose of the
deduplicator, which is to use the source location to identify identical
structures and merge them.
Before now, the blacklist worked by preventing the appearance of a
structure in a module on the blacklist from making a structure shared:
it could still be shared if other modules defined it. This fails to
work for two reasons.
Firstly, if it is used by only *one* other module, we'll hit an
assertion failure in the emission phase which checks that any type it
emits is either defined in the module being emitted or in the shared
type repo.
We could remove that assertion, but we'd still be in the wrong, because
you could easily have a type in some header used by many modules that
said
struct trouble {
#ifdef MODULE_FOO
/* one definition */
#else
/* a different definition */
#endif
}
and a module that says
#define MODULE_FOO 1
#include <other_module.c>
Even if we blacklisted this module (and we would), this would still
fail, because 'struct trouble' would end up in the shared type
repository, and the existing code would fail to emit a new definition of
it in the blacklisted module, even though it should because its
definition is different.
This shows that if a module is pulling tricks like this we cannot trust
its use of shared types at all, since we have no visibility into
preprocessor definitions: regardless of the space cost (about 40KiB per
module), we cannot let it share any types whatsoever with the rest of
the kernel. Rather than piling heaps of blacklist checks in all over
dwarf2ctf to implement this, we do it by exploiting the fact that the
deduplicator works entirely on the basis of type IDs. We add a
'blacklist type prefix' that type_id() can add to the start of its IDs
(with some extra decoration, because the start of type IDs is a file
path, so we want this to be an impossible path). If we set this prefix
to the module name if and only if the module is blacklisted, and do not
add one otherwise, then every blacklisted module will have a unique set
of IDs for all its types, which can be shared within the module but not
outside it, so every type in the module will be unique and none of them
will end up in the shared type repository.
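In sketch form, the idea is just a conditional prefix at ID-construction time
(the helper names here are illustrative, not the real dwarf2ctf code):
if (dedup_blacklist_member(module_name))        /* hypothetical check */
        id = str_appendf(id, "//%s//", module_name); /* impossible path */
id = str_appendf(id, "%s:", decl_file_name);    /* normal file-path prefix */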
While we're at it, add yet another ancient ISA sound driver that plays
the same games to the blacklist.
This fix makes blacklisting modules much more space-expensive: each such
module expands the current size of the kernel module package by about
40KiB. (But there is only one blacklisted module built in current UEK,
so this is a tolerable cost.)
Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-by: vincent.lim@oracle.com
Orabug: 26137220
[ 437f]  structure_type
             name                 (strp) "IO_APIC_route_entry"
             byte_size            (data1) 8
[ 438b]    member
             name                 (strp) "vector"
             byte_size            (data1) 4
             bit_size             (data1) 8
             bit_offset           (data1) 24
             data_member_location (data1) 0
[ 439a]    member
             name                 (strp) "delivery_mode"
             byte_size            (data1) 4
             bit_size             (data1) 3
             bit_offset           (data1) 21
             data_member_location (data1) 0
[ 43a9]    member
             name                 (strp) "dest_mode"
             byte_size            (data1) 4
             bit_size             (data1) 1
             bit_offset           (data1) 20
             data_member_location (data1) 0
[ 43b8]    member
             name                 (strp) "delivery_status"
             byte_size            (data1) 4
             bit_size             (data1) 1
             bit_offset           (data1) 19
             data_member_location (data1) 0
But CTF on little-endian requires the opposite: it has special handling
for the first member of a structure which assumes that it is closest to
the start of memory: in effect, it wants structure member addresses to
always ascend, even within bitfields, regardless of endianness (which
makes some sense intellectually as well).
dwarf2ctf's emission code generally emits sequentially, so except where
deduplication has eliminated items or dependent type insertion has added
them it emits things in the CTF in the same order as in the DWARF. We
can avoid this for short runs, as in this case, by switching from
iteration to recursion in such cases, spotting a run at identical
data_member_location, recursing until we hit the end of the run, then
unwinding and emitting as we unwind until the recursion is over.
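A sketch of that recursion (the list node and emitter names are hypothetical):
/* Hypothetical member-list node, for illustration only. */
struct member {
        struct member *next;
        unsigned long location;         /* data_member_location */
};

static void emit_ascending_run(struct member *m, unsigned long location)
{
        /*
         * Recurse to the end of a run of members sharing one
         * data_member_location, then emit on the unwind so the run
         * is written in ascending (memory) order.
         */
        if (m->next && m->next->location == location)
                emit_ascending_run(m->next, location);
        emit_member(m);                 /* hypothetical emitter */
}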
Orabug: 25815129 Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-by: tomas.jedlicka@oracle.com
Nick Alcock [Thu, 2 Feb 2017 00:22:05 +0000 (00:22 +0000)]
ctf: bitfield support
Support for bitfields in dwarf2ctf was embryonic and largely untested
before now due to bugs in libdtrace-ctf: with a fix for those in hand,
we can fix bitfields here too.
Bitfields in DWARF and CTF have annoyingly different representations.
In DWARF, a bitfield is represented something like this:
[ 16561]   member
             name                 (string) "ihl"
             decl_file            (data1) 225
             decl_line            (data1) 87
             type                 (ref4) [ 38]
             byte_size            (data1) 1
             bit_size             (data1) 4
             bit_offset           (data1) 4
             data_member_location (sdata) 0
[...]
[ 38]      typedef
             name                 (strp) "__u8"
             decl_file            (data1) 36
             decl_line            (data1) 20
             type                 (ref4) [ 43]
i.e. the padding, size, and starting location are all represented in the
member, where you would conceptually expect them to be.
In CTF, the starting location of the conceptual containing type of a
bitfield is encoded in the member, but the size and starting location of
the bitfield itself are represented in the dependent type, which is added
as a "non-root" type (which cannot be looked up by name) so that it can
have the same name as the un-bitfielded base type without causing a name
clash.
We use the new DIE attribute override mechanism added in commit 8935199962
to override DW_AT_bit_size and DW_AT_bit_offset for such members (fixing
a pre-existing bug in the process: we were looking for the DW_AT_bit_size
on the structure as a whole!), and in the base-type emission function
checking for the existence of a DW_AT_bit_size/offset and responding to
them by overriding the size and offset derived from DW_AT_byte_size and
noting that this is a non-root type. (The override needed, annoyingly,
is endian-dependent, because CTF consumers assume that on little-endian
systems the offset relates to the least-significant edge of the bitfield,
counting from the LSB, while DWARF assumes the opposite).
But this is not enough: unless more is done, this type will appear
to have the same type ID as its non-bitfield equivalent, leading to
confusion about which CTF file it should appear in and quite possibly
leading to it ending up in a CTF file that the structure containing the
bitfield cannot even see. So augment type_id()'s representation of
base types from e.g. 'int' to something like 'int:4' if and only if
a DW_AT_bit_size or an override of it is present and that override is
a different size from the native bitness of the type itself (the
DW_AT_byte_size). We encode the bit_offset only if there is also a
bit_size, as something like int:4:2. (That's unambiguous because
these attributes always arrive in pairs in bitfields and never appear
in anything else in C-generated DWARF.)
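In sketch form (the variable names are illustrative):
char id[256];

/*
 * Augment the base-type ID only for genuine bitfields: 'int' stays
 * 'int' unless a bit_size differing from the type's native size is in
 * force, in which case it becomes 'int:4', or 'int:4:2' with an offset.
 */
if (bit_size != 0 && bit_size != byte_size * 8) {
        if (bit_offset != 0)
                snprintf(id, sizeof(id), "%s:%lu:%lu",
                         base_name, bit_size, bit_offset);
        else
                snprintf(id, sizeof(id), "%s:%lu", base_name, bit_size);
} else
        snprintf(id, sizeof(id), "%s", base_name);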
Finally, this breaks an optimization in the deduplicator, which was that
all structure members reference some top-level type, so when marking a
type as seen, structure members could just be skipped. Now, they have
to be chased iff they are bitfields using the same override trickery as
above to change the DW_AT_bit_size/offset in the member's type DIE, and
that bitfield override needs to be passed down to type_id() when finally
marking duplicated types as shared too. (Avoid code duplication by
factoring out some related code from a horrible conditional in
detect_duplicates() into a new type_needs_sharing() function.)
Orabug: 25815129 Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-by: tomas.jedlicka@oracle.com
Nick Alcock [Thu, 27 Apr 2017 17:39:53 +0000 (18:39 +0100)]
ctf: emit file-scope static variables
For as long as dwarf2ctf has emitted variables into the CTF, it has
emitted only extern variables. This is for two reasons: firstly, a
misconception that global variables with static linkage did not appear
in /proc/kallmodsyms (they do), and secondly, an ambiguity problem.
We cannot usefully distinguish two variables with the same name in the
same module: they differ only by address, so if both are static
variables in different translation units, we can't tell which is which.
(Also, we cannot emit more than one variable with a given name into a
given module CTF file in any case). CTF's modular rather than
translation-unit-based variable/type scope bites us here.
I sought to avoid this bug by emitting only non-static variables, but
this does not save us, because there might be another static variable
with the same name in the same module, whereupon the ambiguity problem
arises all over again. We must identify such ambiguous cases and strip
them out (not emitting CTF for this variable at all): then we can emit
static file-scope variables into the CTF without worry.
We do this by introducing a new, module-scope vars_seen hashtable into
the deduplicator state, which gains an entry for every variable name
seen in this module, and indicates whether it is static or not. This
lets us tell if we have seen a variable with a given name more than
once, and if we have, whether any of the instances was static. Then
we can consult this blacklist at variable-emission time, and skip any
variable if it is in the blacklist.
Unfortunately, computing the name for the variable's entry in the
blacklist is fairly expensive, and has to be done for every variable:
worse yet, this increases the number of variables emitted drastically
(in vmlinux and the shared type repo alone, we go from 2247 to 10409
variables), and emitting that much CTF is not free: so the runtime goes
up by about 5%. We will reclaim this lost speed soon (and then some).
Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-by: tomas.jedlicka@oracle.com
Orabug: 25962387
Nick Alcock [Thu, 9 Mar 2017 14:31:33 +0000 (14:31 +0000)]
ctf: speed up the dwarf2ctf duplicate detector some more
The duplicate detector's alias detection pass is still doing unnecessary
work. When a non-opaque structure is marked shared, it is sometimes
necessary to do another deduplication scan, because the marking may have
marked types used within the structure as shared, which may require yet
more types (e.g. opaque uses of that type in other modules) to be
marked shared as well. So while we know this pass can only affect
structures/unions/enums that have names, and their interior types, it
would seem that we must keep scanning them to see if they need
deduplication until none remain.
However, there is one exception: if a non-opaque type and its
corresponding opaque type are both already marked shared, or if we have
just processed them and marked them accordingly, we know that we will
never need to re-mark those particular types again, since they can't be
more shared than they already are: so we can remove them from
consideration in future passes. Because we are only opening DWARF files
in this pass as needed now, this hugely cuts down the number of files we
process in subsequent passes: we still see the same number of passes,
but passes after the first (which marks tens of thousands of opaque
types as shared) only open a few files, mark a few hundred types, and
flash past in under a second.
In my tests, the alias fixup pass now takes under 10s, which can be
more or less ignored: all other passes other than initialization
and writeout are much more expensive. (Before this series, it took
over a minute on the fastest machine I have access to, and over three
minutes on SPARC.)
Orabug: 25815306 Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-by: tomas.jedlicka@oracle.com
Nick Alcock [Tue, 7 Mar 2017 20:31:12 +0000 (20:31 +0000)]
ctf: speed up the dwarf2ctf duplicate detector
dwarf2ctf is very slow. By far its most expensive pass is the duplicate
detector. Of necessity this has to scan every object file in the kernel
to identify shared types, which is pretty expensive even when the cache
is hot, but it's not doing this particularly efficiently and can be sped
up quite a lot.
As of commit f0d0ebd0b4, the duplicate detector's job was cut in two:
the first pass identifies all non-structure types and non-opaque
structures used by more than one module, and the second "alias fixup"
pass repeatedly unifies opaque and non-opaque structures with the same
name in different translation units, sharing their definitions. Both of
these passes generate or modify type IDs, so both need access to the
DWARF DIE of the types in question, since the type ID is derived
recursively from the DIE: but the second pass need not look at the DWARF
of any translation units that do not contain structures that might be
unified. However, the two are currently written in the same way, using
process_file() to traverse all the kernel's DWARF, even though the alias
fixup pass does almost nothing with that DWARF, and has less and less to
do on each iteration.
The sheer amount of wasted time here is remarkable. We traverse the
DWARF once for primary duplicate detection, once for CTF emission, but
often four or five or even seven or eight times for the alias fixup pass
(the precise count depends upon the relationships between types and the
order in which the DWARF files are traversed).
So improve things by tracking all types that the alias fixup pass is
interested in (structure types that are not anonymous inner structures
nor opaque nor used only as the types of array indices) and stash them
away during the first duplicate detection pass in a new temporary
singly-linked list, detect_duplicates_state.named_structs. We remember
the filename and DWARF DIE offset (so we can look the type up again) and
the type ID (because we just worked it out, so recomputing it would be a
waste of time). Then, rather than doing a process_file() for the alias
fixup pass, traverse the linked list, opening DWARF files as needed to
mark things as shared (but no more often than that: marking non-opaque
types needs the DIE so we can traverse into all its subtypes and mark
them shared too, but marking opaque types needs no DIE at all).
This has a significant effect on dwarf2ctf speed, slashing 25% off its
runtime in my tests, reducing the duplicate detector's share of the
runtime from about 80s to about 24s.
The dominant time consumer is now CTF emission rather than the
duplicate detector.
Orabug: 25815306 Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-by: tomas.jedlicka@oracle.com
Nick Alcock [Thu, 2 Feb 2017 00:23:16 +0000 (00:23 +0000)]
ctf: fix the size of int and avoid duplicating it
An upcoming bitfield-capable release of libdtrace-ctf adds extra
consistency checking which identified embarrassing but heretofore
harmless bugs in dwarf2ctf's representation of basic types.
dwarf2ctf has to emit a few basic types "by hand", since these are used
in the representation of types one of CTF or DWARF does not bother to
encode more complex forms for (function pointers, always encoded as
'int (*)()' in CTF, and 'void'). Embarrassingly, we were getting the
size of 'int' wrong: it should be in bits but we were emitting a count
of bytes instead, leading to a CTF representation of a 4-bit int.
This is always overridden by an accurate representation built into
DTrace in real use, but libdtrace-ctf finds the inconsistency anyway.
Worse is that it emits a representation of 'int' twice, once by hand in
init_ctf_table() and then again when it comes across the real 'int' type
DIE in the debuginfo. This is because we forgot to intern the types we
add by hand in the id_to_module and id_to_type hashes we use to detect
duplicate types, because we can only intern types we have a DWARF DIE
for, and these types either have no DIE at all or we just don't know
about it because we haven't even begun to traverse the debuginfo to look
up DIEs yet.
Fix this by adding a mark_shared_by_name() function which can be used to
intern basic types by name in these hashes. It has hardwired knowledge
of the type ID notation, but no more such knowledge than is already
present in detect_duplicates_alias_fixup() and friends (no more than
the fact that types with no associated filename or line number are
preceded by "////").
Orabug: 25815129 Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-by: tomas.jedlicka@oracle.com
Nick Alcock [Wed, 1 Feb 2017 23:51:21 +0000 (23:51 +0000)]
ctf: allow overriding of DIE attributes: use it for parent bias
Currently, dwarf2ctf's CTF type emission machinery emits any given CTF
type based purely on the data in the corresponding DWARF DIE, with
possible adjustments based on the DIE's structural parent (useful for
structure members, if not so much for top-level types). But from time
to time we need to adjust a CTF type depending on some property not of
its structural parent but of a type that depends on it: i.e., for
structures, we might want to adjust the offset of the members to cater
for the fact that this structure is being structurally merged with a
shorter assignment-compatible structure with identical name which
appeared in some translation unit that was processed earlier.
Currently, we handle this case, and only this case, by passing down a
'parent_bias' to all the CTF assembly functions. Replace this with a
more generic mechanism whereby an array of 'overrides' can be passed
down to construct_ctf_id(), die_to_ctf(), and all subordinate assembly
functions: these overrides consist of an array of die_override_t's,
where each element can either override or add to the value of one DWARF
attribute: this kicks in for specific DWARF tags only. (Only numerical
attributes are supported, obviously.)
(We also pass the overrides down to type_id() so that overrides can
affect the ID of types and thus cause a single DWARF type to generate
multiple CTF types, though we do not use this facility in this commit.)
To process the attributes, we introduce a new private_find_override() to
search the override list, and private_dwarf_udata() to fetch a udata and
handle it. (We do not override dwarf_hasattr(): anything that wants to
override an attribute that may not exist has to call
private_find_override() itself. If this happens a lot, we can introduce
an override for dwarf_hasattr() too.)
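The shape of this might be roughly as follows (field names and the
zero-tag-terminated array convention are assumptions, sketched from the
description; the search helper mirrors what private_find_override() does):
typedef struct die_override {           /* field names assumed */
        int             tag;            /* e.g. DW_TAG_member */
        int             attribute;      /* e.g. DW_AT_bit_size */
        enum { DIE_OVERRIDE_REPLACE, DIE_OVERRIDE_ADD } op;
        Dwarf_Word      value;
} die_override_t;

static die_override_t *
find_override(die_override_t *overrides, int tag, int attribute)
{
        /* Assume a zero-tag-terminated array, purely for illustration. */
        for (; overrides != NULL && overrides->tag != 0; overrides++)
                if (overrides->tag == tag &&
                    overrides->attribute == attribute)
                        return overrides;
        return NULL;
}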
Currently we use this in exactly one place: in assemble_ctf_su_member(),
to replace the use of parent_bias. Further uses will come in the next
commit: thanks to this commit, none of them will require adding new
parameters to all the CTF construction functions :)
Also rename the 'override' parameter on the CTF construction functions,
which was used by array assembly to indicate that CTF types should
replace their parent type, with a much less confusingly-named 'replace'
parameter. (It was badly named before, but now that we have a parameter
named 'overrides' it is devastatingly badly named.)
Orabug: 25815129 Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-by: tomas.jedlicka@oracle.com
Alan Maguire [Wed, 12 Apr 2017 21:34:56 +0000 (22:34 +0100)]
DTrace tcp/udp provider probes
This patch adds DTrace SDT probes for the TCP and UDP protocols. For tcp
the following probes are added:
tcp:::send Fires when a tcp segment is transmitted
tcp:::receive Fires when a tcp segment is received
tcp:::state-change Fires when a tcp connection changes state
tcp:::connect-request Fires when a SYN segment is sent
tcp:::connect-refused Fires when a RST is received for connection attempt
tcp:::connect-established
Fires when three-way handshake completes for
initiator
tcp:::accept-refused Fires when a RST is sent refusing connection attempt
tcp:::accept-established
Fires when three-way handshake succeeds for acceptor
Arguments for all of these probes are:
arg0 struct sk_buff *; to be translated into pktinfo_t * containing
implementation-independent packet data
arg1 struct sock *; to be translated into csinfo_t * containing
implementation-independent connection data
arg2 __dtrace_tcp_void_ip_t *; to be translated into ipinfo_t * containing
implementation-independent IP information. Custom type is used as
this gives DTrace a hint that we can source IP information from other
arguments if the IP header is not available.
arg3 struct tcp_sock *; to be translated into tcpsinfo_t * containing
implementation-independent TCP connection data
arg4 struct tcphdr *; to be translated into a tcpinfo_t * containing
implementation-independent TCP header data
arg5 int representing previous state; to be translated into a
tcplsinfo_t * which contains the previous state. Differs from
current state (arg6) for state-change probes only.
arg6 int representing current state. Cannot be sourced from struct
tcp_sock as we sometimes need to probe before state change is
reflected there
arg7 int representing direction of traffic for probe; values are
DTRACE_NET_PROBE_INBOUND for receipt of data and
DTRACE_NET_PROBE_OUTBOUND for transmission.
For udp the following probes are added:
udp:::send Fires when a udp datagram is sent
udp:::receive Fires when a udp datagram is received
Arguments for these probes are:
arg0 struct sk_buff *; to be translated into pktinfo_t * containing
implementation-independent packet data
arg1 struct sock *; to be translated into csinfo_t * containing
implementation-independent connection data
arg2 void_ip_t *; to be translated into ipinfo_t * containing
implementation-independent IP information.
arg3 struct udp_sock *; to be translated into a udpsinfo_t * containing
implementation-independent UDP connection data
arg4 struct udphdr *; to be translated into a udpinfo_t * containing
implementation-independent UDP header information.
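For illustration only, a udp:::send probe site in this style might look
something like the line below; the macro name and argument convention are
assumptions based on the DTRACE_PROBE macros mentioned elsewhere in this log,
not the literal patch:
/* In the UDP transmit path, just before handing the skb down: */
DTRACE_PROBE(udp__send, skb, sk, ip_hdr(skb), udp_sk(sk), udp_hdr(skb));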
Kris Van Hees [Wed, 24 May 2017 03:34:53 +0000 (23:34 -0400)]
dtrace: ensure limit is enforced even when pcs is NULL
The dtrace_user_stacktrace() functions for x86_64 and sparc64 were
not handling the specified limit (st->limit) correctly if the buffer
for PC values (st->pcs) was NULL. This commit ensures that we
decrement the limit whenever we encounter a PC, whether it gets
stored or not.
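The essence of the fix, in sketch form (the walker loop is simplified and the
frame-walking helpers are hypothetical):
unsigned long pc = first_user_pc(&frame);       /* hypothetical walker */

while (pc != 0 && st->limit > 0) {
        if (st->pcs != NULL)
                *st->pcs++ = pc;
        st->limit--;            /* decrement whether stored or not */
        pc = next_user_pc(&frame);
}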
Orabug: 25949692 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Acked-by: Nick Alcock <nick.alcock@oracle.com>
Kris Van Hees [Tue, 16 May 2017 03:29:16 +0000 (23:29 -0400)]
dtrace: make x86_64 FBT return probe detection less restrictive
The FBT return probe detection mechanism on x86_64 was requiring that
the "ret" instruction be followed by a "push %rbp" or "nop", which is
much too restrictive. The new code allows probing of all "ret"
instructions that occur in a function, regardless of what instruction
follows.
In order to make this possible, the functions to set and clear FBT
probes on x86_64 (dtrace_invop_add() and dtrace_invop_remove()) have
been modified to accept a 2nd argument that indicates the instruction
to patch the probe location with. This is needed because FBT return
probes need a different instruction on x86_64 (LOCK prefix to force
an invalid opcode trap isn't safe because we do not know what
instruction may follow the "ret").
This commit also fixes the declaration of the dtrace_bad_address()
function that was missing its return type.
Orabug: 25949048 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Acked-by: Nick Alcock <nick.alcock@oracle.com>
Kris Van Hees [Tue, 16 May 2017 03:25:09 +0000 (23:25 -0400)]
dtrace: support passing offset as arg0 to FBT return probes
FBT return probes pass the offset from the function start (in bytes)
as arg0. To make that possible, we pass the offset value in the call
to fbt_add_probe. For FBT entry probes we pass 0 (which is ignored).
Orabug: 25949086 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Acked-by: Nick Alcock <nick.alcock@oracle.com>
Kris Van Hees [Tue, 16 May 2017 03:05:41 +0000 (23:05 -0400)]
dtrace: make FBT entry probe detection less restrictive on x86_64
The logic on x86_64 to determine whether we can probe a function is
too restrictive. By placing the probe on the "push %rbp" instruction
we can cover more functions, in case the "mov %rsp,%rbp" instruction
does not follow it immediately.
Orabug: 25949030 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Reviewed-by: Tomas Jedlicka <tomas.jedlicka@oracle.com>
Kris Van Hees [Tue, 16 May 2017 02:42:30 +0000 (22:42 -0400)]
dtrace: adjust FBT entry probe detection for OL7
On OL7, function prologues can be prefixed by a (5-byte) call
instruction on x86_64, which breaks the logic to determine if
we can place an FBT entry probe on that function. The new logic
accounts for the possibility that the anticipated prologue does
not show up as first instruction of the function.
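Schematically (the opcode checks are shown for illustration and the helper is
hypothetical; the real scanner is more careful):
uint8_t *insn = (uint8_t *)func_start;

if (insn[0] == 0xe8)            /* opcode of a 5-byte call rel32 */
        insn += 5;              /* skip the prefixed call */
if (insn[0] == 0x55)            /* push %rbp: the anticipated prologue */
        fbt_add_entry_probe(insn);      /* hypothetical helper */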
Orabug: 25921361 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Reviewed-by: Nick Alcock <nick.alcock@oracle.com>
dtrace: fix handling of save_stack_trace sentinel (x86 only)
On x86 only, when save_stack_trace() writes fewer stack frames to the
buffer than there is space for, a ULONG_MAX is added as a sentinel. The
DTrace code was mistakenly treating the buffer as always ending with a
ULONG_MAX.
Orabug: 25727046 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Reviewed-by: Tomas Jedlicka <tomas.jedlicka@oracle.com>
Kris Van Hees [Wed, 1 Mar 2017 04:37:11 +0000 (23:37 -0500)]
dtrace: continuing the FBT implementation and fixes
This commit continues the implementation of Function Boundary Tracing
(FBT) and fixes various problems with the original implementation and
other things in DTrace that it caused to break. It is done as a single
commit due to the intertwined nature of the code it touches.
1. We were only handling unaligned memory access traps as part of the
NOFAULT access protection. This commit adds handling of data access
and instruction access traps as well.
2. When an OOPS takes place, we now add output about whether we are
in DTrace probe context and what the last probe was that was being
processed (if any). That last data item isn't guaranteed to always
have a valid value. But it is helpful.
3. New ustack stack walker implementation (moved from module to kernel
for consistency, and because we need access to low-level structures
like the page tables) for both x86 and sparc. The new code avoids
any locking or sleeping. The new user stack walker is accessed as
a sub-function of dtrace_stacktrace(), selected using the flags
field of stacktrace_state_t.
4. We added a new field to the dtrace_psinfo_t structure (ustack) to
hold the bottom address of the stack. This is needed in the stack
walker (specifically for x86) to know when we have reached the end
of the stack. It is initialized from copy_process (in DTrace
specific code) when stack_start is passed as parameter to clone.
It is also set from dtrace_psinfo_alloc() (which is generally called
from performing an exec), and there it gets its value from the
mm->start_stack value.
5. The FBT black lists have been updated with functions that may be
invoked during probe processing. In addition, for x86_64 we added
explicit filter out of functions that start with insn_* or inat_*
because they are used for instruction analysis during probe
processing.
6. On sparc64, per-cpu data is accessed by means of a global register
that holds the base address for this memory area. Some assembler
code clobbers that register in some cases, so it is not safe to
depend on it in probe context. Instead, we explicitly access
the data based on smp_processor_id().
7. We added a new CPU DTrace flag (CPU_DTRACE_PROBE_CTX) to flag that
we are processing in DTrace probe context. It is primarily used
to detect attempts to re-enter dtrace_probe().
Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Acked-by: Nick Alcock <nick.alcock@oracle.com>
Orabug: 21220305
Orabug: 24829326
Kris Van Hees [Mon, 27 Feb 2017 15:39:07 +0000 (10:39 -0500)]
dtrace: ensure DTrace can use get_user_pages safely
The processing of the DTrace-specific FOLL_IMMED flag was not robust
enough. We could still get into a situation where cond_resched() was
called (which is bad) or where the VMA area would get extended (which
is also bad). The only code that passes this flag is DTrace support
code, and when the flag is not passed, the execution flow is not at all
affected.
Orabug: 25640153 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Reviewed-by: Chuck Anderson <chuck.anderson@oracle.com> Reviewed-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-by: Tomas Jedlicka <tomas.jedlicka@oracle.com>
Kris Van Hees [Fri, 24 Feb 2017 23:40:40 +0000 (18:40 -0500)]
dtrace: enable paranoid mode and IST shift for xen_int3
The Xen PVM path into an INT3 trap was not using paranoid=1 mode nor was
it using an IST shift as is done for HW INT3 traps. This interferes with
the instruction emulation code check based on the handler return value.
Orabug: 25580519 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Kris Van Hees [Mon, 20 Feb 2017 12:16:48 +0000 (07:16 -0500)]
dtrace: ensure we skip the entire SDT probe point
With the introduction of FBT support, the logic for skipping instructions
(with potential emulation of the skipped instruction) changed. This change
did not take into account the fact that is-enabled probes on x86_64 use a
3-byte sequence for setting ax to 0, followed by a 2-byte NOP. The old logic
resulted in failing to skip the setting of ax correctly.
New logic uses the knowledge that all SDT probes on x86_64 are of the same
length (ASM_CALL_SIZE) and therefore we can simply skip that number of bytes
and continue without any emulation.
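In sketch form, assuming a register-state argument like that used by the trap
handlers (the byte sequence shown is the usual encoding, given for
illustration):
/*
 * An is-enabled probe site on x86_64:
 *   48 31 c0   xor %rax,%rax   3 bytes: set the return value to 0
 *   66 90      nop             2-byte NOP
 * All SDT probe sites are ASM_CALL_SIZE (5) bytes long, so the handler
 * can skip the whole site without emulating anything:
 */
regs->ip += ASM_CALL_SIZE;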
Orabug: 25557283 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Acked-by: Nick Alcock <nick.alcock@oracle.com>
Kris Van Hees [Fri, 16 Dec 2016 23:54:10 +0000 (18:54 -0500)]
dtrace: function boundary tracing (FBT)
This patch implements Function Boundary Tracing (FBT) for x86 and
sparc. It covers generic, x86 and sparc specific changes.
Generic:
A new exported function (which will be provided by each supported
architecture) dtrace_fbt_init() is added to be called from module
code to initiate the discovery of FBT probes. Any eligible probe
(based on arch-dependent logic) is passed to the provided callback
function pointer for processing.
A new option is added to the DTrace kernel config to enable FBT.
The logic for determining the size of the pdata memory block (which
is arch-dependent) has changed at the architecture level, and the
code for setting up the kernel pseudo-module has been modified to
account for that change.
The post-processing script dtrace_sdt.sh now determines how
many functions exist in the kernel, as an upper bound on the number
of functions that can be traced with FBT. The logic ensures that
aliases are not counted.
x86:
On x86_64, entry FBT probes are implemented similarly to SDT probes on
that same architecture. A trap is triggered from the probe location,
which causes a call into the DTrace core. The entry probe is placed on
the 'mov %rsp,%rbp' instruction that immediately followed a 'push %rbp'
as part of the function prologue. If this instruction sequence is not
found, no entry probe will be created for the function.
sparc:
On sparc64, entry FBT probes are also implemented similarly to SDT
probes on that same architecture. A call is made into a trampoline
(allocated within the limited addressing range for a single assembler
instruction) which in turn calls into the DTrace core. The entry probe
is placed on the location where typically a call is made to _mcount for
profiling purposes. Under normal operation, that instruction will be
a NOP.
Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Acked-by: David Mc Lean <david.mclean@oracle.com> Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com> Reviewed-by: Allen Pais <allen.pais@oracle.com>
Orabug: 21220305
Orabug: 24829326
Kris Van Hees [Thu, 22 Dec 2016 07:20:57 +0000 (02:20 -0500)]
dtrace: add support for passing return value from trap handlers
Prior to this patch, trap handlers were called to service traps without
any mechanism to report back to the lowest level trap entry point. The
DTrace FBT implementation on x86 needs to be able to do just that because
FBT probes are enabled by replacing a one-byte assembler instruction with
a one-byte instruction that causes a trap. After the trap is handled, we
need to emulate the instruction that was replaced prior to returning to
the original instruction stream. Because different instructions may occur
at FBT probe points, we need to be able to report back to the trap entry
point which instruction was replaced by the trap.
Handlers that do not use notify_die() always return 0. Those that do use
notify_die() to call handlers have been modified to return the value that
the handler passed back on its return.
Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Reviewed-By: Dan Duval <dan.duval@oracle.com>
Orabug: 25312278
Kris Van Hees [Fri, 16 Dec 2016 18:04:52 +0000 (13:04 -0500)]
dtrace: ensure that our die notifier gets executed amongst the first
The die notifier is crucial for implementing safe memory access in
DTrace. To avoid other handlers potentially causing interference, we
set the priority of our handler at the highest level.
Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Acked-by: David Mc Lean <david.mclean@oracle.com>
Nick Alcock [Mon, 28 Nov 2016 14:53:15 +0000 (14:53 +0000)]
dtrace: allow invop handler to specify number of insns to skip
Rather than unconditionally skipping one instruction, repurpose the
unused return value of the invop handler to allow it to specify the
amount to skip. (In the common case where it knows how many
instructions to skip at all times, this is more efficient.)
Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Acked-by: Kris Van Hees <kris.van.hees@oracle.com>
Nick Alcock [Wed, 23 Nov 2016 17:50:09 +0000 (17:50 +0000)]
dtrace: is-enabled probes for SDT
"Is-enabled probes" are a conditional, long supported in userspace
probing, which lets you avoid doing expensive data-collection operations
needed only by DTrace probes unless those probes are active.
e.g. (an example using the core DTRACE_PROBE / DTRACE_IS_ENABLED macros,
rather than the DTRACE_providername macros used in practice, because
no such macros have been added to the kernel yet):
if (DTRACE_IS_ENABLED(__io_wait__start)) {
/* stuff done only when io:::wait-start is enabled */
}
As with normal SDT probes, the DTRACE_IS_ENABLED() macro compiles to a
stub function call (named like __dtrace_isenabled_*()) which is replaced
at bootup/module load time with an architecture-dependent instruction
sequence analogous to a function that always returns false, though no
function call is generated. At probe enabling time, this is replaced
with a trap into dtrace just like normal dtrace probes, incurring a
performance hit, but only when the probe is active.
The probe name used in the various ELF sections that track SDT
probes begins with a ? character to help the module distinguish
is-enabled probes from normal probes: this is internal to the DTrace
implementation and is otherwise invisible.
(Thanks to Kris Van Hees for initial work on this.)
Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Acked-by: Kris Van Hees <kris.van.hees@oracle.com>
Orabug: 25143173
Nick Alcock [Mon, 31 Oct 2016 13:12:56 +0000 (13:12 +0000)]
dtrace: check for errors when getting a new fd
get_unused_fd_flags() can legitimately fail, e.g. when the fd table is
full. We need to diagnose that rather than trying to fd_install() the
resulting negative number.
Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Acked-by: Kris Van Hees <kris.van.hees@oracle.com> Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Orabug: 24977175
Nick Alcock [Mon, 31 Oct 2016 10:44:26 +0000 (10:44 +0000)]
dtrace: take mmap_sem in PTRACE_GETMAPFD
Without this, we may oops if the process exec()s and discards its
address space after we find_vma().
Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Acked-by: Kris Van Hees <kris.van.hees@oracle.com> Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Orabug: 24977175
Junxiao Bi [Tue, 1 Nov 2016 06:42:20 +0000 (14:42 +0800)]
ocfs2: fix not enough credit panic
The following panic was caught when running the ocfs2 disconfig single test
(block size 512 and cluster size 8192). ocfs2_journal_dirty() returned
-ENOSPC, which means the credits were used up. The total credits should
include 3 times "num_dx_leaves" from ocfs2_dx_dir_rebalance(),
because 2 times will be consumed in ocfs2_dx_dir_transfer_leaf() and
1 time will be consumed in ocfs2_dx_dir_new_cluster()->
__ocfs2_dx_dir_new_cluster()->ocfs2_dx_dir_format_cluster(). But only
two times are included in ocfs2_dx_dir_rebalance_credits(), so fix it.
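Schematically, the fix amounts to the following (illustrative, not the
literal diff):
/* Account for all three consumers of num_dx_leaves credits above. */
credits += 3 * num_dx_leaves;           /* previously 2 * num_dx_leaves */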
Eric Ren [Fri, 30 Sep 2016 22:11:32 +0000 (15:11 -0700)]
ocfs2: fix deadlock on mmapped page in ocfs2_write_begin_nolock()
The testcase "mmaptruncate" of ocfs2-test deadlocks occasionally.
In this testcase, we create a 2*CLUSTER_SIZE file and mmap() on it;
there are 2 processes repeatedly performing the following operations
respectively: one is doing memset(mmaped_addr + 2*CLUSTER_SIZE - 1, 'a',
1), while the other is doing ftruncate(fd, 2*CLUSTER_SIZE) and then
ftruncate(fd, CLUSTER_SIZE), again and again.
In ocfs2_write_begin_nolock(), we first grab the pages and then allocate
disk space for this write; ocfs2_try_to_free_truncate_log() will be
called if -ENOSPC is returned; if we're lucky to get enough clusters,
which is usually the case, we start over again.
But in ocfs2_free_write_ctxt() the target page isn't unlocked, so we
will deadlock when trying to grab the target page again.
Also, -ENOMEM might be returned in ocfs2_grab_pages_for_write().
Another deadlock will happen in __do_page_mkwrite() if
ocfs2_page_mkwrite() returns non-VM_FAULT_LOCKED along with a
locked target page.
These two errors fail on the same path, so fix them by unlocking the
target page manually before ocfs2_free_write_ctxt().
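The shape of the fix, as a sketch (an error-path excerpt; the real error path
has more bookkeeping, and the write-context field name is assumed):
/* Unlock the target page before tearing down the write context, on
 * both the ENOSPC-retry and ENOMEM paths. */
if (wc->w_target_page)
        unlock_page(wc->w_target_page);
ocfs2_free_write_ctxt(wc);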
Jan Kara helped me clear out the JBD2 part and suggested the hint for the
root cause.
Changes since v1:
1. Also put ENOMEM error case into consideration.
Link: http://lkml.kernel.org/r/1474173902-32075-1-git-send-email-zren@suse.com Signed-off-by: Eric Ren <zren@suse.com> Reviewed-by: He Gang <ghe@suse.com> Acked-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit c33f0785bf292cf1d15f4fbe42869c63e205b21c)
Joseph Qi [Mon, 19 Sep 2016 21:43:55 +0000 (14:43 -0700)]
ocfs2/dlm: fix race between convert and migration
Commit ac7cf246dfdb ("ocfs2/dlm: fix race between convert and recovery")
checks if lockres master has changed to identify whether new master has
finished recovery or not. This introduces a race: right after the
old master umounts (meaning the master will change), a new convert
request comes in.
In this case, it will reset lockres state to DLM_RECOVERING and then
retry convert, and then fail with lockres->l_action being set to
OCFS2_AST_INVALID, which will cause inconsistent lock level between
ocfs2 and dlm, and then finally BUG.
Since dlm recovery will clear lock->convert_pending in
dlm_move_lockres_to_recovery_list, we can use it to correctly identify
the race case between convert and recovery. So fix it.
Fixes: ac7cf246dfdb ("ocfs2/dlm: fix race between convert and recovery") Link: http://lkml.kernel.org/r/57CE1569.8010704@huawei.com Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Jun Piao <piaojun@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit e6f0c6e6170fec175fe676495f29029aecdf486c)
jiangyiwen [Fri, 25 Mar 2016 21:21:35 +0000 (14:21 -0700)]
ocfs2: solve a problem of crossing the boundary in updating backups
In update_backups() there exists a problem of crossing the boundary as
follows:
we assume that the LUN will be resized to 1TB (cluster size 32KB), so it will
include clusters 0~33554431. In update_backups() it will back up the super
block at the 1TB location, which is the 33554432th cluster, so the
boundary is crossed.
Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Xue jiufei <xuejiufei@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 584dca3440732afa84fbca07567bb66e1453936a)
jiangyiwen [Tue, 15 Mar 2016 21:53:01 +0000 (14:53 -0700)]
ocfs2: use spinlock_irqsave() to downconvert lock in ocfs2_osb_dump()
Commit a75e9ccabd92 ("ocfs2: use spinlock irqsave for downconvert lock")
missed an unmodified place in ocfs2_osb_dump(), so a deadlock scenario
still exists.
Joseph Qi [Thu, 14 Jan 2016 23:17:44 +0000 (15:17 -0800)]
ocfs2: access orphan dinode before delete entry in ocfs2_orphan_del
In ocfs2_orphan_del, currently it finds and deletes entry first, and
then access orphan dir dinode. This will have a problem once
ocfs2_journal_access_di fails. In this case, entry will be removed from
orphan dir, but in deed the inode hasn't been deleted successfully. In
other words, the file is missing but not actually deleted. So we should
access orphan dinode first like unlink and rename.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Reviewed-by: Jiufei Xue <xuejiufei@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 074a6c655f6da12cb1123c8a84bfd8d781138800)
xuejiufei [Thu, 14 Jan 2016 23:17:41 +0000 (15:17 -0800)]
ocfs2/dlm: do not insert a new mle when another process is already migrating
When two processes are migrating the same lockres,
dlm_add_migration_mle() returns -EEXIST but inserts a new mle into the
hash list. dlm_migrate_lockres() will detach the old mle and free the
new one, which is already in the hash list; that will destroy the list.
Signed-off-by: Jiufei Xue <xuejiufei@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 32e493265b2be96404aaa478fb2913be29b06887)
jiangyiwen [Thu, 14 Jan 2016 23:17:33 +0000 (15:17 -0800)]
ocfs2: fix slot overwritten if storage link down during mount
The following case will lead to a slot being overwritten.
N1: mount ocfs2 volume; find and allocate slot 0, set osb->slot_num to 0,
    and begin to write slot info to disk
N2: mount ocfs2 volume; wait for the super lock
N1: write block fails because of storage link down; unlock the super lock
N2: get the super lock, also allocate slot 0, then unlock the super lock
N1: mount fails and then dismounts; since osb->slot_num is 0, try to put
    the invalid slot to disk (and this will succeed if the storage link
    restores)
N2's slot info is now overwritten.
Once another node say N3 mount, it will find and allocate slot 0 again,
which will lead to mount hung because journal has already been locked by
N2. so when write slot info failed, invalidate slot in advance to avoid
overwrite slot.
[akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 1247017f43a93eae3d64b7c25f3637dc545f5a47)
Xue jiufei [Thu, 14 Jan 2016 23:17:29 +0000 (15:17 -0800)]
ocfs2/dlm: return appropriate value when dlm_grab() returns NULL
dlm_grab() may return NULL when the node is doing unmount. While doing
code review, we found that some dlm handlers may return an error to the
caller when dlm_grab() returns NULL, making the caller BUG or causing
other problems.
Here is an example:
Node 1: receives migration message from node 3, and sends migrate request
        to others
Node 2: starts unmounting
Node 2: receives migrate request from node 1 and calls
        dlm_migrate_request_handler()
Node 2: unmount thread unregisters domain handlers and removes dlm_context
        from dlm_domains
Node 2: dlm_migrate_request_handler() returns -EINVAL to node 1
Node 1: exits migration, neither clearing the migration state nor sending
        an assert master message to node 3, which causes node 3 to hang
Signed-off-by: Jiufei Xue <xuejiufei@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit c372f2193a2e73d5936bf37259ae63ca388b4cbc)
jiangyiwen [Thu, 14 Jan 2016 23:17:23 +0000 (15:17 -0800)]
ocfs2/dlm: wait until DLM_LOCK_RES_SETREF_INPROG is cleared in dlm_deref_lockres_worker
Commit f3f854648de6 ("ocfs2_dlm: Ensure correct ordering of set/clear
refmap bit on lockres") still leaves a race, so the ordering is not
guaranteed to be exactly correct.
Node1: umount, migrate lockres to Node2
Node2: migrate finished, send migrate request to Node3
Node3: received migrate request, create a migration_mle, respond to Node2
Node2: set DLM_LOCK_RES_SETREF_INPROG and send assert master to Node3
Node3: delete migration_mle in assert_master_handler; Node3 umounts without
       response; dlm_thread purges this lockres and sends a drop deref
       message to Node2
Node2: finds the flag DLM_LOCK_RES_SETREF_INPROG set and dispatches
       dlm_deref_lockres_worker to clear the refmap; but
       dlm_deref_lockres_worker waits for DLM_LOCK_RES_SETREF_INPROG to be
       cleared only if the node is in the refmap, so the worker completes
       successfully
Node3: purge lockres, send assert master response to Node1, and finish
       umount
Node2: set Node3 in the refmap; it will never be cleared, which leads to
       the umount hanging
so wait until DLM_LOCK_RES_SETREF_INPROG is cleared in
dlm_deref_lockres_worker.
Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit b5560143385e18b4109ad6951c7719705e3dd995)
Xue jiufei [Thu, 14 Jan 2016 23:17:18 +0000 (15:17 -0800)]
ocfs2/dlm: fix a race between purge and migration
We found a race between purge and migration while doing code review.
Node A puts a lockres on the purge list before receiving the migrate
message from node B, which is the master. Node A calls
dlm_mig_lockres_handler to handle this message.
dlm_mig_lockres_handler
dlm_lookup_lockres
>>>>>> race window, dlm_run_purge_list may run and send
deref message to master, waiting the response
spin_lock(&res->spinlock);
res->state |= DLM_LOCK_RES_MIGRATING;
spin_unlock(&res->spinlock);
dlm_mig_lockres_handler returns
>>>>>> dlm_thread receives the response from master for the deref
message and triggers the BUG because the lockres has the state
DLM_LOCK_RES_MIGRATING with the following message:
xuejiufei [Tue, 29 Dec 2015 22:54:29 +0000 (14:54 -0800)]
ocfs2/dlm: clear migration_pending when migration target goes down
We have found a BUG on res->migration_pending when migrating lock
resources. The situation is as follows.
dlm_mark_lockres_migrating
    res->migration_pending = 1;
    __dlm_lockres_reserve_ast
    dlm_lockres_release_ast returns with res->migration_pending still set,
        because other threads reserve asts
    wait until dlm_migration_can_proceed returns 1
    >>>>>>> o2hb finds that the target goes down and removes the target
            from domain_map
    dlm_migration_can_proceed returns 1
    dlm_mark_lockres_migrating returns -ESHUTDOWN with
        res->migration_pending still set.
When reentering dlm_mark_lockres_migrating(), it will trigger the BUG_ON
with res->migration_pending. So clear migration_pending when the target
is down.
Signed-off-by: Jiufei Xue <xuejiufei@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit cc28d6d80f6ab494b10f0e2ec949eacd610f66e3)
Joseph Qi [Tue, 29 Dec 2015 22:54:06 +0000 (14:54 -0800)]
ocfs2: fix BUG when calculate new backup super
When resizing, it firstly extends the last gd. Once it should back up the
super in the gd, it calculates the new backup super and updates the
corresponding value.
But it currently doesn't consider the situation where the backup super has
already been done. In this case, it still sets the bit in the gd bitmap
and then decreases bg_free_bits_count, which leads to a corrupted gd and
triggers the BUG in ocfs2_block_group_set_bits:
alex chen [Fri, 6 Nov 2015 02:44:10 +0000 (18:44 -0800)]
ocfs2: should reclaim the inode if '__ocfs2_mknod_locked' returns an error
In ocfs2_mknod_locked, if '__ocfs2_mknod_locked' returns an error, we
should reclaim the inode successfully claimed above; otherwise, the
inode will never be reused. The case is described below:
ocfs2_mknod
ocfs2_mknod_locked
ocfs2_claim_new_inode
Successfully claim the inode
__ocfs2_mknod_locked
ocfs2_journal_access_di
Failed because of -ENOMEM or other reasons, the inode
lockres has not been initialized yet.
iput(inode)
ocfs2_evict_inode
ocfs2_delete_inode
ocfs2_inode_lock
ocfs2_inode_lock_full_nested
__ocfs2_cluster_lock
Return -EINVAL because of the inode
lockres has not been initialized.
So the following operations are not performed
ocfs2_wipe_inode
ocfs2_remove_inode
ocfs2_free_dinode
ocfs2_free_suballoc_bits
Signed-off-by: Alex Chen <alex.chen@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit b1529a41f777a48f95d4af29668b70ffe3360e1b)
Joseph Qi [Fri, 6 Nov 2015 02:44:07 +0000 (18:44 -0800)]
ocfs2: fix race between mount and delete node/cluster
There is a race case between mount and delete node/cluster, which will
lead o2hb_thread into a malfunctioning dead loop.
o2hb_thread
{
o2nm_depend_this_node();
<<<<<< race window: the node may have already been deleted; if the
loop is then entered, the o2hb thread will malfunction
because no configured nodes are found.
while (!kthread_should_stop() &&
!reg->hr_unclean_stop && !reg->hr_aborted_start) {
}
So checking the return value of o2nm_depend_this_node() is needed. If the
node has been deleted, do not enter the loop and let the mount fail.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 0986fe9b50f425ec81f25a1a85aaf3574b31d801)
Joseph Qi [Thu, 22 Oct 2015 20:32:29 +0000 (13:32 -0700)]
ocfs2/dlm: unlock lockres spinlock before dlm_lockres_put
dlm_lockres_put will call dlm_lockres_release if it drops the last
reference, and that in turn may call dlm_print_one_lock_resource and
take the lockres spinlock.
So unlock the lockres spinlock before dlm_lockres_put to avoid deadlock.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit b67de018b37a97548645a879c627d4188518e907)
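A userspace analogue of that ordering fix (a minimal sketch with invented names, not the dlm code): the release path re-takes the object's own lock, so the caller must drop that lock before the final put.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct res {
	pthread_mutex_t lock;
	int refs;
};

/* Stand-in for dlm_lockres_release: the release path takes the
 * object's lock again (as dlm_print_one_lock_resource does). */
static void res_release(struct res *r)
{
	pthread_mutex_lock(&r->lock);
	printf("releasing\n");
	pthread_mutex_unlock(&r->lock);
	free(r);
}

/* Stand-in for dlm_lockres_put. */
static void res_put(struct res *r)
{
	if (--r->refs == 0)
		res_release(r);
}

int main(void)
{
	struct res *r = malloc(sizeof(*r));

	pthread_mutex_init(&r->lock, NULL);
	r->refs = 1;

	pthread_mutex_lock(&r->lock);
	/* ... inspect or modify r ... */
	pthread_mutex_unlock(&r->lock);	/* unlock BEFORE the final put */
	res_put(r);			/* would self-deadlock if the lock
					 * were still held here */
	return 0;
}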
ocfs2: avoid access invalid address when read o2dlm debug messages
The following case can lead to a lockres being freed while it is still in use.
cat /sys/kernel/debug/o2dlm/locking_state        dlm_thread
lockres_seq_start
  -> lock dlm->track_lock
  -> get resA
                                                 resA->refs decrease to 0,
                                                 call dlm_lockres_release,
                                                 and wait for "cat" unlock.
Although resA->refs is already set to 0,
increase resA->refs, and then unlock
                                                 lock dlm->track_lock
                                                 -> list_del_init()
                                                 -> unlock
                                                 -> free resA
In such a race, an invalid address access may occur. So we should
delete res from the tracking list before resA->refs drops to 0.
Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit f57a22ddecd6f26040a67e2c12880f98f88b6e00)
When running dirop_fileop_racer we found a case where an inode
cannot be removed.
Two nodes, say Node A and Node B, mount the same ocfs2 volume. Create
two dirs /race/1/ and /race/2/ in the filesystem.
Node A                                  Node B
                                        rm -r /race/2/
mv /race/1/ /race/2/
                                        call ocfs2_unlink(), get
                                        the EX mode of /race/2/
wait for B unlock /race/2/
                                        decrease i_nlink of /race/2/ to 0,
                                        and add inode of /race/2/ into
                                        orphan dir, unlock /race/2/
got EX mode of /race/2/. because
/race/1/ is dir, so inc i_nlink
of /race/2/ and update into disk,
unlock /race/2/
Because i_nlink of /race/2/ is not zero, this inode will always remain
in the orphan dir.
This patch fixes the case by testing whether i_nlink of the new dir is zero.
Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Xue jiufei <xuejiufei@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 928dda1f9433f024ac48c3d97ae683bf83dd0e42)
The trusted extended attributes are only visible to processes which
have the CAP_SYS_ADMIN capability, but the check is missing in the ocfs2
xattr_handler trusted list. The check is important because trusted
xattrs will be used to implement mechanisms in userspace to which
ordinary processes should not have access.
Signed-off-by: Sanidhya Kashyap <sanidhya.gatech@gmail.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Taesoo kim <taesoo@gatech.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 0f5e7b41f91814447defc34e915fc5d6e52266d9)
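A userspace analogue of the missing check (a minimal sketch: the privileged flag stands in for the kernel's capable(CAP_SYS_ADMIN) test, and all names are invented):

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

static bool privileged;	/* stand-in for capable(CAP_SYS_ADMIN) */

/* List a trusted.* name only to a privileged caller. */
static bool xattr_visible(const char *name)
{
	if (strncmp(name, "trusted.", strlen("trusted.")) == 0)
		return privileged;
	return true;
}

int main(void)
{
	const char *names[] =
		{ "user.comment", "trusted.secret", "security.selinux" };

	for (unsigned int i = 0; i < sizeof(names) / sizeof(names[0]); i++)
		if (xattr_visible(names[i]))
			printf("%s\n", names[i]);  /* trusted.secret hidden */
	return 0;
}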
Xue jiufei [Wed, 24 Jun 2015 23:55:20 +0000 (16:55 -0700)]
ocfs2: fix NULL pointer dereference in function ocfs2_abort_trigger()
ocfs2_abort_trigger() uses bh->b_assoc_map to get the sb. But there is
no code in ocfs2 that sets bh->b_assoc_map, so calling this function
triggers a NULL pointer dereference. We can get the sb from
bh->b_bdev->bd_super instead of b_assoc_map.
[akpm@linux-foundation.org: update comment, per Joseph] Signed-off-by: joyce.xue <xuejiufei@huawei.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 74e364ad1b13fd518a0bd4e5aec56d5e8706152f)
Roman Kagan [Wed, 18 May 2016 14:48:20 +0000 (17:48 +0300)]
kvm:vmx: more complete state update on APICv on/off
The function to update APICv on/off state (in particular, to deactivate
it when enabling Hyper-V SynIC) is incomplete: it doesn't adjust
APICv-related fields among secondary processor-based VM-execution
controls. As a result, Windows 2012 guests got stuck when a SynIC-based
auto-EOI interrupt intersected with e.g. an IPI in the guest.
In addition, the MSR intercept bitmap isn't updated every time "virtualize
x2APIC mode" is toggled. This path can only be triggered by a malicious
guest, because Windows doesn't use x2APIC but rather its own synthetic
APIC access MSRs; however a guest running in a SynIC-enabled VM could
switch to x2APIC and thus obtain direct access to host APIC MSRs
(CVE-2016-4440).
The patch fixes those omissions.
Signed-off-by: Roman Kagan <rkagan@virtuozzo.com> Reported-by: Steve Rutherford <srutherford@google.com> Reported-by: Yang Zhang <yang.zhang.wz@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Orabug: 23347009
CVE: CVE-2016-4440 Signed-off-by: Manjunath Govindashetty <manjunath.govindashetty@oracle.com>
Nick Alcock [Fri, 16 Sep 2016 09:17:18 +0000 (10:17 +0100)]
dtrace: eliminate need for arg counting in sdt macros
The existing SDT probe macros are of the form
DTRACE_PROBEn(probe, ...)
where there must be as many args in ... as the 'n'.
This is implemented via terribly repetitive code in include/linux/sdt.h,
made much worse by the recent typed sdt additions. Reduce
repetitiveness, and drop the 'n' variants of the macros.
All the complexity is hidden away in include/linux/sdt_internal.h.
It's derived from the tricks the perf probes were already using, but
improved to handle the zero-argument case and armoured against namespace
pollution with copious __ing. Even when DTrace is turned off, we use a
do-nothing form of the same macro to verify that the number of arguments
in the probe is valid.
There is only one public macro now, DTRACE_PROBE(), and the usual thin
wrappers around it, which have also gone non-numbered (DTRACE_PROC,
DTRACE_IO, etc): all uses have been adjusted.
These macros use a family of semi-public macros, described below.
__DTRACE_DOUBLE_APPLY(type_macro, arg_macro, ...): This takes a type and
arg macro and a set of pairs of arguments and calls the first argument
on each pair with the type macro, the second with the arg macro, etc:
if called with (type, arg, a, b, c, d) it will expand to something like
type(a), arg(b), type(c), arg(d)
__DTRACE_DOUBLE_APPLY_NOCOMMA(type_macro, arg_macro, ...): As above, but
omits all commas between arguments: the example above will yield
type(a) arg(b) type(c) arg(d)
(suitable for string concatenation).
__DTRACE_TYPE_APPLY(type_macro, arg_macro, ...): as above, but only
calls the type_macro:
type(a), type(c)
__DTRACE_ARG_APPLY(type_macro, arg_macro, ...): as above, but only
calls the arg_macro:
arg(b), arg(d)
__DTRACE_TYPE_APPLY_DEFAULT(type_macro, arg_macro, def, ...): as above,
but takes an extra parameter to be used as a default when there are no
args. Calling it with (type, arg, void, a, b, c, d) will yield
type(a), arg(b), type(c), arg(d)
but calling it with (type, arg, void) will yield
void
All the macros above take any number of pairs of arguments up to eight:
providing any other number, including odd numbers, is a compilation
error.
__DTRACE_APPLY(macro, ...): This is pre-existing, and calls the macro
repeatedly with the specified args, comma-separated:
m(a), m(b), m(c), ...
__DTRACE_APPLY_DEFAULT(macro, def, ...): Like DTRACE_TYPE_APPLY_DEFAULT,
takes a default for use when there are no args.
Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Acked-by: Kris Van Hees <kris.van.hees@oracle.com>
Orabug: 24678897
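A standalone sketch of the double-apply trick (names invented, supporting only up to two (type, arg) pairs where the kernel extends the same pattern to eight; it relies on the GNU ", ##__VA_ARGS__" extension the kernel already uses):

#include <stdio.h>

/* Count the variadic args: 0, 2 or 4 here. Odd counts fail to expand,
 * mirroring the "compilation error" behaviour described above. */
#define DA_COUNT(...)	DA_COUNT_(x, ##__VA_ARGS__, 4, 3, 2, 1, 0)
#define DA_COUNT_(_0, _1, _2, _3, _4, n, ...) n

/* Dispatch on the count; the extra level lets DA_COUNT expand first. */
#define DA_APPLY(t, a, ...)	DA_APPLY_(DA_COUNT(__VA_ARGS__), t, a, ##__VA_ARGS__)
#define DA_APPLY_(n, t, a, ...)	DA_APPLY__(n, t, a, ##__VA_ARGS__)
#define DA_APPLY__(n, t, a, ...) DA_APPLY_##n(t, a, ##__VA_ARGS__)

#define DA_APPLY_0(t, a)
#define DA_APPLY_2(t, a, t1, a1)	 t(t1), a(a1)
#define DA_APPLY_4(t, a, t1, a1, t2, a2) t(t1), a(a1), t(t2), a(a2)

#define AS_STRING(x) #x

int main(void)
{
	/* Expands to "int", "foo", "long", "bar"; a real user would pass
	 * different type and arg macros. */
	const char *v[] = { DA_APPLY(AS_STRING, AS_STRING, int, foo, long, bar) };

	for (unsigned int i = 0; i < sizeof(v) / sizeof(v[0]); i++)
		printf("%s\n", v[i]);
	return 0;
}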
Nick Alcock [Wed, 14 Sep 2016 09:39:33 +0000 (10:39 +0100)]
dtrace: augment SDT probes with type information
With all the other work done, we can finally move the SDT probe
type information out of the static sdt_args array in the SDT kernel
module and put them in the probe definitions they relate to.
Most of it is already there -- all probes already carry type
information. All that is missing is translations, so we add those.
The syntax for these is
type: translation
or
type : (translation, ...)
i.e. a source type, followed by a colon, followed by zero or more
translated types (if there is more than one, they must be bracketed),
with optional spaces if you like. The resulting probe will have as many
arguments as there are translations. Obviously, if DTrace userspace is
to do anything with this knowledge it will need a translator from the
specified source type to each of the specified translated types: DTrace
picks these up from kernel-versioned subdirectories of
/usr/lib64/dtrace/ (e.g. /usr/lib64/dtrace/4.1).
If a probe appears more than once (say, in different functions), each
instance must presently have the same types. Changing translations
between instances of a single probe will be diagnosed if CONFIG_DT_DEBUG
is enabled.
Typos in the types are only diagnosed at runtime: a dtrace -vln of your
new probe will tell you if all is well.
Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Acked-by: Kris Van Hees <kris.van.hees@oracle.com>
Orabug: 24661801
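A hypothetical probe declaration in this syntax (illustrative only: the probe name, its argument, and the psinfo_t translation are assumptions, not taken from the tree):

DTRACE_PROBE(mymod__exit, struct task_struct * : psinfo_t *, tsk);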
Nick Alcock [Wed, 14 Sep 2016 01:16:46 +0000 (02:16 +0100)]
dtrace: import the sdt type information into per-sdt_probedesc state
Now that we have type and probe names in ELF sections, we have to do
something with them (they are initdata, so in the kernel, if not in
modules, they will be thrown away after initialization).
We marshal each arg string in turn, without parsing it, into a new field
in the sdt_probedesc_t, the per-probe structure used to communicate
information about SDT probes to the SDT kernel module. Before now, this
has been a simple task at runtime: the array is constructed by
dtrace_sdt.sh and just needs to be reformulated (for the core kernel) or
dropped unchanged into place (for modules). But this extra field is
derived from C macro expansion and cannot be determined by
dtrace_sdt.sh: so it leaves it 0 and we fill it in at boot / module load
time.
The fillout process is a little baroque to avoid slowing down the boot
too much, since we have to associate one array with another and want
to avoid linear scans. We build a hashtable mapping from probe name
to args string (i.e. from _dtrace_sdt_names to _dtrace_sdt_args entry),
in the process verifying that if the probe appears multiple times,
all its arg strings are the same. We can then easily use this hash
table to point the extra argument types field in the sdt_probedesc_t
array at the args string too.
One inefficiency remains: there can be *lots* of duplicate args strings,
and currently we make no attempt to deduplicate them. (We discard the
core kernel's array of probe names, but the array of arg strings
obviously must be preserved. We could deduplicate it, but do not.)
This is probably wasting no more than a few kilobytes at present, so
a deduplicator is not worth writing, but as the number of probes
(particularly perf probes) increases, one may become worth writing.
The new sdpd_args field into which the type info string is dropped is
protected from the ABI checker: the sdt_probedesc_t is internal to the
kernel and the DTrace module, and the usual ABI guarantees do not apply
to it.
Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Acked-by: Kris Van Hees <kris.van.hees@oracle.com>
Orabug: 24661801
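A userspace sketch of that fillout (a toy hash with invented names, not the module code): intern each (name, args) pair, diagnosing probes whose duplicates disagree.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct ent {
	const char *name;
	const char *args;
	struct ent *next;
};

#define NBUCKETS 64
static struct ent *buckets[NBUCKETS];

static unsigned int hash_name(const char *s)
{
	unsigned int h = 5381;

	while (*s)
		h = h * 33 + (unsigned char)*s++;
	return h % NBUCKETS;
}

/* Return the canonical args string for a probe name; warn when a
 * duplicate probe carries a different args string. */
static const char *intern_args(const char *name, const char *args)
{
	unsigned int b = hash_name(name);
	struct ent *e;

	for (e = buckets[b]; e; e = e->next)
		if (strcmp(e->name, name) == 0) {
			if (strcmp(e->args, args) != 0)
				fprintf(stderr, "%s: conflicting args\n", name);
			return e->args;
		}

	e = malloc(sizeof(*e));
	e->name = name;
	e->args = args;
	e->next = buckets[b];
	buckets[b] = e;
	return args;
}

int main(void)
{
	intern_args("probe_a", "int, char *");
	intern_args("probe_a", "int, char *");	/* consistent: fine */
	intern_args("probe_a", "long");		/* mismatch: warns */
	return 0;
}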
Nick Alcock [Wed, 14 Sep 2016 01:10:06 +0000 (02:10 +0100)]
dtrace: record SDT and perf probe types in a new ELF section
Before now, the types specified in the SDT probe macros in
include/linux/sdt.h were simply thrown away: DTrace was never
informed of them, and instead got type information for SDT probes
from a hardwired list in the out-of-tree sdt module.
As a start to fixing this, export the SDT type information into
a new ELF section named _dtrace_sdt_args. Both native SDT probes
and perf probes translated into SDT probes use the same section:
its format is a sequence of probe type strings, with individual
arguments separated by commas, each type string separated from the
next by a NUL:
type string, comma-separated\0
...
(This is the same format used by perf probe prototype declarations,
and is what you'd get if you removed the argument names from a DTrace
SDT probe specification.)
The only difference between SDT and perf probe type strings is that the
type string for perf probes is also used to define a tracing function,
so it includes argument names. The SDT format has various other
extensions which do not affect this commit so will be described later.
The type strings in the section are in no particular order (it depends on the
order in the source code) and may contain duplicates, so to associate
each string with a probe, there is another section, _dtrace_sdt_names,
which is a simple null-terminated array (string table) of names, associated
1:1 with the type strings in the _dtrace_sdt_args section.
At this point in the patch series, nothing uses the newly-added sections.
Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Acked-by: Kris Van Hees <kris.van.hees@oracle.com>
Orabug: 24661801
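A userspace GCC analogue of that layout (the recording macro is invented; only the section names and the NUL-separated, 1:1 format come from the description above):

#include <stdio.h>
#include <string.h>

#define __SDT_CAT(a, b)	a##b
#define __SDT_ID(p, c)	__SDT_CAT(p, c)
#define SDT_RECORD(name, types)						\
	static const char __SDT_ID(__sdt_args_, __COUNTER__)[]		\
		__attribute__((used, section("_dtrace_sdt_args"))) = types; \
	static const char __SDT_ID(__sdt_name_, __COUNTER__)[]		\
		__attribute__((used, section("_dtrace_sdt_names"))) = name;

SDT_RECORD("probe_one", "int, struct foo *")
SDT_RECORD("probe_two", "char *")

/* The linker synthesizes start/stop symbols for these orphan sections. */
extern const char __start__dtrace_sdt_names[], __stop__dtrace_sdt_names[];
extern const char __start__dtrace_sdt_args[];

int main(void)
{
	const char *n = __start__dtrace_sdt_names;
	const char *a = __start__dtrace_sdt_args;

	while (n < __stop__dtrace_sdt_names) {
		printf("%s: %s\n", n, a);	/* names pair 1:1 with types */
		n += strlen(n) + 1;		/* strings are NUL-separated */
		a += strlen(a) + 1;
	}
	return 0;
}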
dtrace: ensure new SDT info generation works on sparc64
Due to addresses on sparc64 being in the lower range of the memory map, the
calculated addresses for SDT probe points were represented with fewer hex
characters than their function base address counterparts (obtained from
objdump output). This messed up the sorting based on address values, and
resulted in all probe points being associated with the last function in the
list.
This commit ensures that all addresses are 16 hex characters long.
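An illustration of the sorting pitfall (values invented): unpadded hex strings do not sort like numbers, while 16-digit zero padding makes textual and numeric order agree.

#include <stdio.h>

int main(void)
{
	unsigned long probe = 0xffb0;	/* low sparc64-style address */
	unsigned long func  = 0x100000;

	/* As text, "ffb0" sorts after "100000" although it is smaller. */
	printf("unpadded: %lx vs %lx\n", probe, func);
	printf("padded:   %016lx vs %016lx\n", probe, func);
	return 0;
}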
Kris Van Hees [Tue, 16 Aug 2016 15:52:07 +0000 (11:52 -0400)]
dtrace: rework kernel sdtinfo generation to be more accurate
With changes in the vmlinux.lds.S linker script for x86 in the 4.3 kernel
series, the existing mechanism for determining the addresses of SDT probe
points in the kernel proper was failing. The assumption that various
flavours of text sections were being concatenated into a single .text
section no longer holds.
The new implementation uses a relocation-retaining linking step in the
vmlinux linking process, which enables us to gain access to the final
addresses of symbols in the .text section, and also makes it very easy
to determine the addresses of SDT probes (.text base address + relocation
offset). We re-use the .tmp_vmlinux1 output file for the output of this
linking step because it is performed after that output file is no longer
needed in the kernel linking process.
Due to the relocation-retaining linking that occurs, the assertion test
for kernel image size had to be modified, because the _end symbol value
ends up being a relative offset rather than an absolute address.
Nick Alcock [Thu, 14 Jul 2016 17:05:24 +0000 (18:05 +0100)]
ctf: fix CONFIG_CTF && !CONFIG_DTRACE and CONFIG_DT_DISABLE_CTF
These combinations of config options, one of which should enable
CTF without enabling DTrace, the other one of which should enable
DTrace without enabling CTF, are both broken.
Fix them by making the CTF <-> DT_DISABLE_CTF relationship bidirectional,
by making dwarf2ctf compilation depend properly on CONFIG_CTF rather than
CONFIG_DTRACE, and by fixing the newly-introduced order-only dependency
of CTF on the sdtinfo stubs files so that it is only enforced when
both CTF and DTrace are enabled.
Orabug: 23859082 Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-By: Todd Vierling <todd.vierling@oracle.com>
At LSF we decided that if we truncate up from isize we shouldn't trim
fallocated blocks that were fallocated with KEEP_SIZE and are past the
new i_size. This patch fixes ext4 to do this.
[ Completely reworked patch so that i_disksize would actually get set
when truncating up. Also reworked the code for handling truncate so
that it's easier to handle. -- tytso ]
Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Lukas Czerner <lczerner@redhat.com>
(cherry picked from commit 3da40c7b089810ac9cf2bb1e59633f619f3a7312) Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Conflicts:
fs/ext4/inode.c
Kris Van Hees [Sat, 21 May 2016 13:43:55 +0000 (06:43 -0700)]
dtrace: ensure pdata is large enough
With the sudden (large) increase of SDT probe points, the previously
statically defined default size for the kernel pdata block was not
sufficient. The code has been modified to calculate the required size
for the pdata block at runtime.
This primarily affects sparc where the pdata also covers the memory
block used for SDT trampolines. It is now dependent on the actual
number of SDT probes in the kernel.
This does not affect modules because they were already allocating the
needed memory block at load time based on the actual number of probes.
Orabug: 23004534 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Acked-by: Nick Alcock <nick.alcock@oracle.com>
For every trace point defined, also define an SDT probe. This allows
DTrace to expose probe points maintained by upstream.
[nca: fixed TODOs: added DTRACE_UINTPTR_CAST_EACH so that tracepoints
that pass structures by value will still compile: we jam the
structure into a uintptr if it will fit, otherwise passing its
address in.
This is only partial, so far: there is no CTF type info for any
of these new probes, but this is a relatively minor issue that can
be fixed later.]
Orabug: 23004534 Signed-off-by: Timothy J Fontaine <tj.fontaine@oracle.com> Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Acked-by: Kris Van Hees <kris.van.hees@oracle.com>
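A sketch of that cast in function form (the name is invented, and the kernel does this in a macro): small structs travel by value in a uintptr_t, larger ones by address.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

static uintptr_t pack_or_ref(const void *obj, size_t size)
{
	uintptr_t v = 0;

	if (size <= sizeof(v)) {
		memcpy(&v, obj, size);	/* fits: jam the bytes in */
		return v;
	}
	return (uintptr_t)obj;		/* too big: pass the address */
}

struct small { int x; };
struct big { long a, b, c; };

int main(void)
{
	struct small s = { 42 };
	struct big b = { 1, 2, 3 };

	printf("small -> %#lx\n", (unsigned long)pack_or_ref(&s, sizeof(s)));
	printf("big   -> %#lx\n", (unsigned long)pack_or_ref(&b, sizeof(b)));
	return 0;
}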
Kris Van Hees [Sat, 18 Jun 2016 14:19:03 +0000 (10:19 -0400)]
dtrace: add support for probes in sections other than .text
This commit adds support for SDT probes in sections other than the
regular .text section, both for the kernel and kernel modules.
The core of the problem is that probe locations (for modules and
for the kernel proper) were stored as an offset relative to the start
of the code section they occur in. Processing them at load time (or
boot time, for the kernel) resulted in incorrect addresses for any
probes in sections other than .text because the offset was getting
applied to the wrong base address.
I.e. a probe at offset 0x1d0 in section .text.unlikely would be
resolved as
BASE(.text) + 0x1d0
rather than
BASE(.text.unlikely) + 0x1d0
The end result was that (using x86_64 as example) we would be writing
a 5-byte NOP sequence in a location that was really not expecting that,
with very odd and entirely unpredictable results.
Solving this required two distinct mechanisms because modules are
linked in a way where the distinct code sections are retained. Only
at module load time are they loaded consecutively into a single block
of memory that starts at module_core; worse yet, probes can be
duplicated by the compiler into multiple such sections. This means
that a simple offset will not do; the only thing we can rely on is the
function name (since symbol names are guaranteed unique, and the
compiler synthesizes new ones not valid in C when it clones and
specializes functions), and its start address.
The question is how to identify the functions. We can't use the symbol
table index, because the symbol table is rewritten by strip(1) as part
of RPM generation: and we can't directly refer to the names given to
cloned functions because they are not valid in C: even if valid, they
might well be local to the translation unit in any case. So we run the
.o file through a new tool, scripts/kmodsdt, which identifies all
functions containing DTrace probes by hunting for relocations to the
__dtrace_probe_* probe calls and, if the containing functions are local,
introduces an alias to each named __dta_<function>_<symindex> (the
function name and index are necessary because these functions are local,
so there could be several of them with the same name in different
translation units: the symindex makes the aliases unique). The function
name is rewritten to ensure it is valid in C by translating . into _.
The address in the sdtinfo then references each function symbol (or
alias) and simply adds the offset from that function symbol to the probe
to it in perfectly normal C code, and the linker then deals with all the
problems involving keeping the addresses valid across stripping, module
loading, etc.
When the module is loaded, the kernel resolves all relocations. This
means that the sdtinfo data that is compiled into the module will have
the exact address of all probe locations by the time we need it.
The kernel is a different story altogether...
The kernel performs a final link into an executable object, merging
all non-init code sections into a single .text. This means that for
the kernel, we need to calculate the absolute address of each probe
location based on the knowledge that section merging occurs.
We make use of the multi-step linking process that the kernel uses to
ensure stability of the kallsyms data. Three steps take place in the
sdtinfo generation process (though they will appear as a single step,
because they are a single pipeline):
- a list of code sections is extracted from vmlinux.o, and for each
section we take note of the function with the lowest offset in
the section
- temporary kernel image .tmp_vmlinux1 is used to determine the
absolute load address of each merged section by searching for the
first function of each respective section, and subtracting its
offset from its address.
- using the list of absolute base addresses for each section, the
list of probes (associated with the function they occur in, and
the section the function belongs in) gets its absolute addresses
for the probe locations by adding the probe offset (relative to
its encapsulating section) to the section base address.
(Note: if two distinct sections in the kernel happen to have functions
at the lowest offset with identical names, this process cannot
distinguish between those two sections. That is a limitation
of the heuristic but is extremely unlikely. If it were to
occur, a fix would require to use a more complex comparison to
determine whether a function in the final kernel image is the
start of one section or another.)
Orabug: 23344927 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Acked-by: Nick Alcock <nick.alcock@oracle.com>
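The address arithmetic behind those three steps, as a sketch with invented values:

#include <stdio.h>

/* Merged-section base = first function's final address minus its
 * offset within the original section. */
static unsigned long section_base(unsigned long first_fn_addr,
				  unsigned long first_fn_off)
{
	return first_fn_addr - first_fn_off;
}

int main(void)
{
	/* e.g. first function of .text.unlikely: offset 0x40 in
	 * vmlinux.o, address 0xffffffff81e00040 in .tmp_vmlinux1 */
	unsigned long base = section_base(0xffffffff81e00040UL, 0x40UL);
	unsigned long probe_off = 0x1d0;  /* probe offset in that section */

	printf("probe address: %#lx\n", base + probe_off);
	return 0;
}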
Nick Alcock [Fri, 1 Jul 2016 14:38:01 +0000 (15:38 +0100)]
dtrace, ctf: build sdtstubs and CTF after sdtinfo; sdtinfo follows modpost
In the following commit, sdtinfo generation will begin to
rewrite the .o file. When this is happening, we cannot
have sdtstubs generation, dwarf2ctf, or modpost run in parallel
with it, since they all read the .o file (in modpost's case, for
symbol versions; in sdtstubs' case, for the list of symbols;
in dwarf2ctf's case, for the debugging info). So have sdtstubs
generation and dwarf2ctf depend on the sdtinfo.c files that the sdtinfo pass
generates. We force sdtinfo generation to run after modpost
by depending on the .mod.c files modpost generates.
For the CTF side of things, we use an order-only prerequisite
because sdtinfo's changes do not change anything which might
require dwarf2ctf to rerun, and so as not to disturb $^, which
we use to build the list of files dwarf2ctf should run over.
Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Acked-by: Kris Van Hees <kris.van.hees@oracle.com>
Inodes for delayed iput allocate a trivial helper structure; let's place
the list hook directly into the inode and save a kmalloc (killing a
__GFP_NOFAIL as a bonus) at the cost of increasing the size of btrfs_inode.
The inode can be put into the delayed_iputs list more than once and we
have to keep the count. This means we can't use list_splice to
process a bunch of inodes, because we'd lose track of the count if the
inode were put onto the delayed iputs list again while being processed.
Signed-off-by: David Sterba <dsterba@suse.com>
(cherry picked from commit 8089fe62c6603860f6796ca80519b92391292f21) Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Conflicts:
fs/btrfs/inode.c
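A sketch of the layout and queueing rule described in the entry above (userspace types, invented names): the hook and count are embedded, and an inode is only linked on its first queueing.

#include <stdio.h>

struct list_node {
	struct list_node *prev, *next;
};

struct my_inode {
	/* ... other inode fields ... */
	struct list_node delayed_iput;	/* hook embedded in the inode */
	int delayed_iput_count;		/* may be queued more than once */
};

static void delayed_iput_add(struct list_node *head, struct my_inode *inode)
{
	/* Link only on the first queueing; later puts just bump the
	 * count, which is why a bulk list_splice grab would lose it. */
	if (inode->delayed_iput_count++ == 0) {
		inode->delayed_iput.next = head->next;
		inode->delayed_iput.prev = head;
		head->next->prev = &inode->delayed_iput;
		head->next = &inode->delayed_iput;
	}
}

int main(void)
{
	struct list_node head = { &head, &head };
	struct my_inode ino = { .delayed_iput_count = 0 };

	delayed_iput_add(&head, &ino);
	delayed_iput_add(&head, &ino);	/* second put: count only */
	printf("count = %d\n", ino.delayed_iput_count);
	return 0;
}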
Yuval Shaia and team discovered that the aforementioned commit
resulted in a substantial (900-to-1) performance loss on a TCP
bandwidth test over IPoIB. Upstream is aware of the problem
but there is currently no solution in hand, so just revert the
offending commit.
Note that the original commit purports to fix a kernel crash,
but so far we haven't seen the crash.
Signed-off-by: Dan Duval <dan.duval@oracle.com> Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com> Signed-off-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com> Tested-by: Rose Wang <rose.wang@oracle.com>
Nick Alcock [Thu, 19 May 2016 20:45:56 +0000 (21:45 +0100)]
dtrace: support SDT in single-file modules
A limitation of long standing in the SDT machinery is that it didn't
support modules that were implemented entirely in one file. This is
because the SDT stubs generation machinery (which emits zero-byte
assembler stubs for all DTrace probe points) was implemented at the
point at which multi-file modules are linked into a single file prior to
modpost. Since single-file modules *have* no such stage, they never got
their stubs, and probes never fired. This becomes essential to fix so
that perf_events probes will work, because some of those are in
single-file modules.
We fix it by tearing out the sdtstub code from its prior home in
Makefile.build, and moving it into the late modpost stage, at the same
point at which sdtinfo files are generated. We use the same prerequisites
trick as is used for CTF files to link the sdtstub files into the
resulting modules; because this all happens at the very last possible
moment, it shares the same code path for single- and multifile modules,
and is a good bit simpler to boot (no need to manipulate the link
process or transform filenames intricately during stub generation).
There is one additional wrinkle: because the dtrace probes are now
always undefined until the last minute, we must adjust modpost to
disregard all undefined symbols that start "__dtrace_probe_".
Orabug: 23316392 Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Acked-by: Kris Van Hees <kris.van.hees@oracle.com>
Backport of 81ad4276b505e987dd8ebbdf63605f92cd172b52 failed to adjust
for intervening ->get_trip_temp() argument type change, thus causing
stack protector to panic.
drivers/thermal/thermal_core.c: In function ‘thermal_zone_device_register’:
drivers/thermal/thermal_core.c:1569:41: warning: passing argument 3 of
‘tz->ops->get_trip_temp’ from incompatible pointer type [-Wincompatible-pointer-types]
if (tz->ops->get_trip_temp(tz, count, &trip_temp))
^
drivers/thermal/thermal_core.c:1569:41: note: expected ‘long unsigned int *’
but argument is of type ‘int *’
CC: <stable@vger.kernel.org> #3.18,#4.1 Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> Signed-off-by: Dan Duval <dan.duval@oracle.com>
(cherry picked from commit 5640c4c37eee293451388cd5ee74dfed3a30f32d)
Tracking idle time in bictcp_cwnd_event() is imprecise, as epoch_start
is normally set at ACK processing time, not at send time.
Doing a proper fix would need to add an additional state variable,
and does not seem worth the trouble, given the CUBIC bug has been there
forever before Jana noticed it.
Let's simply not set epoch_start in the future, otherwise
bictcp_update() could overflow and CUBIC would again
grow cwnd too fast.
This was detected thanks to a packetdrill test Neal wrote that was flaky
before applying this fix.
Fixes: 30927520dbae ("tcp_cubic: better follow cubic curve after idle period") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Cc: Jana Iyengar <jri@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit af05df01c7a9f4b3cf3dc29365f4c7c4e2cd53ff)
The problem seems to be that if a new device is detected
while we have already removed the shared HCD, then many of the
xhci operations (e.g. xhci_alloc_dev(), xhci_setup_device())
hang as the command never completes.
I don't think XHCI can operate without the shared HCD as we've
already called xhci_halt() in xhci_only_stop_hcd() when shared HCD
goes away. We need to prevent new commands from being queued
not only when HCD is dying but also when HCD is halted.
The following lockup was detected while testing the otg state
machine.
[ 178.199951] xhci-hcd xhci-hcd.0.auto: xHCI Host Controller
[ 178.205799] xhci-hcd xhci-hcd.0.auto: new USB bus registered, assigned bus number 1
[ 178.214458] xhci-hcd xhci-hcd.0.auto: hcc params 0x0220f04c hci version 0x100 quirks 0x00010010
[ 178.223619] xhci-hcd xhci-hcd.0.auto: irq 400, io mem 0x48890000
[ 178.230677] usb usb1: New USB device found, idVendor=1d6b, idProduct=0002
[ 178.237796] usb usb1: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[ 178.245358] usb usb1: Product: xHCI Host Controller
[ 178.250483] usb usb1: Manufacturer: Linux 4.0.0-rc1-00024-g6111320 xhci-hcd
[ 178.257783] usb usb1: SerialNumber: xhci-hcd.0.auto
[ 178.267014] hub 1-0:1.0: USB hub found
[ 178.272108] hub 1-0:1.0: 1 port detected
[ 178.278371] xhci-hcd xhci-hcd.0.auto: xHCI Host Controller
[ 178.284171] xhci-hcd xhci-hcd.0.auto: new USB bus registered, assigned bus number 2
[ 178.294038] usb usb2: New USB device found, idVendor=1d6b, idProduct=0003
[ 178.301183] usb usb2: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[ 178.308776] usb usb2: Product: xHCI Host Controller
[ 178.313902] usb usb2: Manufacturer: Linux 4.0.0-rc1-00024-g6111320 xhci-hcd
[ 178.321222] usb usb2: SerialNumber: xhci-hcd.0.auto
[ 178.329061] hub 2-0:1.0: USB hub found
[ 178.333126] hub 2-0:1.0: 1 port detected
[ 178.567585] dwc3 48890000.usb: usb_otg_start_host 0
[ 178.572707] xhci-hcd xhci-hcd.0.auto: remove, state 4
[ 178.578064] usb usb2: USB disconnect, device number 1
[ 178.586565] xhci-hcd xhci-hcd.0.auto: USB bus 2 deregistered
[ 178.592585] xhci-hcd xhci-hcd.0.auto: remove, state 1
[ 178.597924] usb usb1: USB disconnect, device number 1
[ 178.603248] usb 1-1: new high-speed USB device number 2 using xhci-hcd
[ 190.597337] INFO: task kworker/u4:0:6 blocked for more than 10 seconds.
[ 190.604273] Not tainted 4.0.0-rc1-00024-g6111320 #1058
[ 190.610228] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 190.618443] kworker/u4:0 D c05c0ac0 0 6 2 0x00000000
[ 190.625120] Workqueue: usb_otg usb_otg_work
[ 190.629533] [<c05c0ac0>] (__schedule) from [<c05c10ac>] (schedule+0x34/0x98)
[ 190.636915] [<c05c10ac>] (schedule) from [<c05c1318>] (schedule_preempt_disabled+0xc/0x10)
[ 190.645591] [<c05c1318>] (schedule_preempt_disabled) from [<c05c23d0>] (mutex_lock_nested+0x1ac/0x3fc)
[ 190.655353] [<c05c23d0>] (mutex_lock_nested) from [<c046cf8c>] (usb_disconnect+0x3c/0x208)
[ 190.664043] [<c046cf8c>] (usb_disconnect) from [<c0470cf0>] (_usb_remove_hcd+0x98/0x1d8)
[ 190.672535] [<c0470cf0>] (_usb_remove_hcd) from [<c0485da8>] (usb_otg_start_host+0x50/0xf4)
[ 190.681299] [<c0485da8>] (usb_otg_start_host) from [<c04849a4>] (otg_set_protocol+0x5c/0xd0)
[ 190.690153] [<c04849a4>] (otg_set_protocol) from [<c0484b88>] (otg_set_state+0x170/0xbfc)
[ 190.698735] [<c0484b88>] (otg_set_state) from [<c0485740>] (otg_statemachine+0x12c/0x470)
[ 190.707326] [<c0485740>] (otg_statemachine) from [<c0053c84>] (process_one_work+0x1b4/0x4a0)
[ 190.716162] [<c0053c84>] (process_one_work) from [<c00540f8>] (worker_thread+0x154/0x44c)
[ 190.724742] [<c00540f8>] (worker_thread) from [<c0058f88>] (kthread+0xd4/0xf0)
[ 190.732328] [<c0058f88>] (kthread) from [<c000e810>] (ret_from_fork+0x14/0x24)
[ 190.739898] 5 locks held by kworker/u4:0/6:
[ 190.744274] #0: ("%s""usb_otg"){.+.+.+}, at: [<c0053bf4>] process_one_work+0x124/0x4a0
[ 190.752799] #1: ((&otgd->work)){+.+.+.}, at: [<c0053bf4>] process_one_work+0x124/0x4a0
[ 190.761326] #2: (&otgd->fsm.lock){+.+.+.}, at: [<c048562c>] otg_statemachine+0x18/0x470
[ 190.769934] #3: (usb_bus_list_lock){+.+.+.}, at: [<c0470ce8>] _usb_remove_hcd+0x90/0x1d8
[ 190.778635] #4: (&dev->mutex){......}, at: [<c046cf8c>] usb_disconnect+0x3c/0x208
[ 190.786700] INFO: task kworker/1:0:14 blocked for more than 10 seconds.
[ 190.793633] Not tainted 4.0.0-rc1-00024-g6111320 #1058
[ 190.799567] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 190.807783] kworker/1:0 D c05c0ac0 0 14 2 0x00000000
[ 190.814457] Workqueue: usb_hub_wq hub_event
[ 190.818866] [<c05c0ac0>] (__schedule) from [<c05c10ac>] (schedule+0x34/0x98)
[ 190.826252] [<c05c10ac>] (schedule) from [<c05c4e40>] (schedule_timeout+0x13c/0x1ec)
[ 190.834377] [<c05c4e40>] (schedule_timeout) from [<c05c19f0>] (wait_for_common+0xbc/0x150)
[ 190.843062] [<c05c19f0>] (wait_for_common) from [<bf068a3c>] (xhci_setup_device+0x164/0x5cc [xhci_hcd])
[ 190.852986] [<bf068a3c>] (xhci_setup_device [xhci_hcd]) from [<c046b7f4>] (hub_port_init+0x3f4/0xb10)
[ 190.862667] [<c046b7f4>] (hub_port_init) from [<c046eb64>] (hub_event+0x704/0x1018)
[ 190.870704] [<c046eb64>] (hub_event) from [<c0053c84>] (process_one_work+0x1b4/0x4a0)
[ 190.878919] [<c0053c84>] (process_one_work) from [<c00540f8>] (worker_thread+0x154/0x44c)
[ 190.887503] [<c00540f8>] (worker_thread) from [<c0058f88>] (kthread+0xd4/0xf0)
[ 190.895076] [<c0058f88>] (kthread) from [<c000e810>] (ret_from_fork+0x14/0x24)
[ 190.902650] 5 locks held by kworker/1:0/14:
[ 190.907023] #0: ("usb_hub_wq"){.+.+.+}, at: [<c0053bf4>] process_one_work+0x124/0x4a0
[ 190.915454] #1: ((&hub->events)){+.+.+.}, at: [<c0053bf4>] process_one_work+0x124/0x4a0
[ 190.924070] #2: (&dev->mutex){......}, at: [<c046e490>] hub_event+0x30/0x1018
[ 190.931768] #3: (&port_dev->status_lock){+.+.+.}, at: [<c046eb50>] hub_event+0x6f0/0x1018
[ 190.940558] #4: (&bus->usb_address0_mutex){+.+.+.}, at: [<c046b458>] hub_port_init+0x58/0xb10
Starting with 4.1 the tracing subsystem has its own filesystem
which is automounted in the tracing subdirectory of debugfs.
Prior to this debugfs could be bind mounted in a cloned mount
namespace, but if tracefs has been mounted under debugfs this
now fails because there is a locked child mount. This creates
a regression for container software which bind mounts debugfs
to satisfy the assumption of some userspace software.
In other pseudo filesystems such as proc and sysfs we're already
creating mountpoints like this in such a way that no dirents can
be created in the directories, allowing them to be exceptions to
some MNT_LOCKED tests. In fact we already do this for the
tracefs mountpoint in sysfs.
Do the same in debugfs_create_automount(), since the intention
here is clearly to create a mountpoint. This fixes the regression,
as locked child mounts on permanently empty directories do not
cause a bind mount to fail.
If we rename an inode A (be it a file or a directory), create a new
inode B with the old name of inode A and under the same parent directory,
fsync inode B and then power fail, at log tree replay time we end up
removing inode A completely. If inode A is a directory then all its files
are gone too.
Example scenarios where this happens:
This is reproducible with the following steps, taken from a couple of
test cases written for fstests which are going to be submitted upstream
soon:
The next time the fs is mounted, log tree replay happens and
the directory "y" does not exist nor do the files "foo" and
"bar" exist anywhere (neither in "y" nor in "x", nor the root
nor anywhere).
The next time the fs is mounted, log tree replay happens and the
file "bar" does not exists anymore. A file with the name "foo"
exists and it matches the second file we created.
Another related problem that does not involve file/data loss is when a
new inode is created with the name of a deleted snapshot and we fsync it:
The next time the fs is mounted the log replay procedure fails because
it attempts to delete the snapshot entry (which has dir item key type
of BTRFS_ROOT_ITEM_KEY) as if it were a regular (non-root) entry,
resulting in the following error that causes mount to fail:
Fix this by forcing a transaction commit when such cases happen.
This means we check in the commit root of the subvolume tree if there
was any other inode with the same reference when the inode we are
fsync'ing is a new inode (created in the current transaction).
Test cases for fstests, covering all the scenarios given above, were
submitted upstream for fstests:
* fstests: generic test for fsync after renaming directory
https://patchwork.kernel.org/patch/8694281/
* fstests: generic test for fsync after renaming file
https://patchwork.kernel.org/patch/8694301/
* fstests: add btrfs test for fsync after snapshot deletion
https://patchwork.kernel.org/patch/8670671/
Cc: stable@vger.kernel.org Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit 033ad030df0ea932a21499582fea59e1df95769b)
When we have the no_holes feature enabled, if we truncate a file to a
smaller size, truncate it again but to a size greater than or equals to
its original size and fsync it, the log tree will not have any information
about the hole covering the range [truncate_1_offset, new_file_size[.
This means that if the fsync log is replayed, the file will remain in the
state it had before both truncate operations.
Without the no_holes feature this does not happen, since when the inode
is logged (full sync flag is set) it will find in the fs/subvol tree a
leaf with a generation matching the current transaction id that has an
explicit extent item representing the hole.
Fix this by adding an explicit extent item representing a hole between
the last extent and the inode's i_size if we are doing a full sync.
The issue is easy to reproduce with the following test case for fstests:
_need_to_be_root
_supported_fs generic
_supported_os Linux
_require_scratch
_require_dm_flakey
# This test was motivated by an issue found in btrfs when the btrfs
# no-holes feature is enabled (introduced in kernel 3.14). So enable
# the feature if the fs being tested is btrfs.
if [ $FSTYP == "btrfs" ]; then
_require_btrfs_fs_feature "no_holes"
_require_btrfs_mkfs_feature "no-holes"
MKFS_OPTIONS="$MKFS_OPTIONS -O no-holes"
fi
# Now truncate our file foo to a smaller size (64Kb) and then truncate
# it to the size it had before the shrinking truncate (125Kb). Then
# fsync our file. If a power failure happens after the fsync, we expect
# our file to have a size of 125Kb, with the first 64Kb of data having
# the value 0xaa and the second 61Kb of data having the value 0x00.
$XFS_IO_PROG -c "truncate 64K" \
-c "truncate 125K" \
-c "fsync" \
$SCRATCH_MNT/foo
# Do something similar to our file bar, but the first truncation sets
# the file size to 0 and the second truncation expands the size to the
# double of what it was initially.
$XFS_IO_PROG -c "truncate 0" \
-c "truncate 253K" \
-c "fsync" \
$SCRATCH_MNT/bar
# Allow writes again, mount to trigger log replay and validate file
# contents.
_load_flakey_table $FLAKEY_ALLOW_WRITES
_mount_flakey
# We expect foo to have a size of 125Kb, the first 64Kb of data all
# having the value 0xaa and the remaining 61Kb to be a hole (all bytes
# with value 0x00).
echo "File foo content after log replay:"
od -t x1 $SCRATCH_MNT/foo
# We expect bar to have a size of 253Kb and no extents (any byte read
# from bar has the value 0x00).
echo "File bar content after log replay:"
od -t x1 $SCRATCH_MNT/bar
status=0
exit
The expected file contents in the golden output are:
File foo content after log replay:
0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
*
0200000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0372000
File bar content after log replay:
0000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0772000
Without this fix, their contents are:
File foo content after log replay:
0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
*
0200000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
*
0372000
File bar content after log replay:
0000000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
*
0200000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
*
0372000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0772000
A test case submission for fstests follows soon.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit 091537b5b3102af4765256d2b527a2a1de4fb184)
After commit 4f764e515361 ("Btrfs: remove deleted xattrs on fsync log
replay"), we can end up in a situation where during log replay we end up
deleting xattrs that were never deleted when their file was last fsynced.
This happens in the fast fsync path (flag BTRFS_INODE_NEEDS_FULL_SYNC is
not set in the inode) if the inode has the flag BTRFS_INODE_COPY_EVERYTHING
set, the xattr was added in a past transaction and the leaf where the
xattr is located was not updated (COWed or created) in the current
transaction. In this scenario the xattr item never ends up in the log
tree; at log replay time this makes the replay code delete the xattr
from the fs/subvol tree, as it thinks that xattr was deleted prior to
the last fsync.
Fix this by always logging all xattrs, which is the simplest and most
reliable way to detect deleted xattrs and replay the deletes at log replay
time.
This issue is reproducible with the following test case for fstests:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
here=`pwd`
tmp=/tmp/$$
status=1 # failure is the default!
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
. ./common/dmflakey
. ./common/attr
# real QA test starts here
# We create a lot of xattrs for a single file. Only btrfs and xfs are currently
# able to store such a large amount of xattrs per file; other filesystems such
# as ext3/4 and f2fs for example, fail with ENOSPC even if we attempt to add
# less than 1000 xattrs with very small values.
_supported_fs btrfs xfs
_supported_os Linux
_need_to_be_root
_require_scratch
_require_dm_flakey
_require_attrs
_require_metadata_journaling $SCRATCH_DEV
# Create the test file with some initial data and make sure everything is
# durably persisted.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0 32k" $SCRATCH_MNT/foo | _filter_xfs_io
sync
# Add many small xattrs to our file.
# We create such a large amount because it's needed to trigger the issue found
# in btrfs - we need to have an amount that causes the fs to have at least 3
# btree leafs with xattrs stored in them, and it must work on any leaf size
# (maximum leaf/node size is 64Kb).
num_xattrs=2000
for ((i = 1; i <= $num_xattrs; i++)); do
name="user.attr_$(printf "%04d" $i)"
$SETFATTR_PROG -n $name -v "val_$(printf "%04d" $i)" $SCRATCH_MNT/foo
done
# Sync the filesystem to force a commit of the current btrfs transaction, this
# is a necessary condition to trigger the bug on btrfs.
sync
# Now update our file's data and fsync the file.
# After a successful fsync, if the fsync log/journal is replayed we expect to
# see all the xattrs we added before with the same values (and the updated file
# data of course). Btrfs used to delete some of these xattrs when it replayed
# its fsync log/journal.
$XFS_IO_PROG -c "pwrite -S 0xbb 8K 16K" \
-c "fsync" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Simulate a crash/power loss.
_load_flakey_table $FLAKEY_DROP_WRITES
_unmount_flakey
# Allow writes again and mount. This makes the fs replay its fsync log.
_load_flakey_table $FLAKEY_ALLOW_WRITES
_mount_flakey
echo "File content after crash and log replay:"
od -t x1 $SCRATCH_MNT/foo
echo "File xattrs after crash and log replay:"
for ((i = 1; i <= $num_xattrs; i++)); do
name="user.attr_$(printf "%04d" $i)"
echo -n "$name="
$GETFATTR_PROG --absolute-names -n $name --only-values $SCRATCH_MNT/foo
echo
done
status=0
exit
The golden output expects all xattrs to be available, and with the correct
values, after the fsync log is replayed.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit db4043dffa5b9f77160dfba088bbbbaa64dd9ed1)
In assoc_array_insert_into_terminal_node(), we call the
compare_object() method on all non-empty slots, even when they're
not leaves, passing a pointer to an unexpected structure to
compare_object(). Currently it causes an out-of-bound read access
in keyring_compare_object detected by KASan (see below). The issue
is easily reproduced with keyutils testsuite.
Only call compare_object() when the slot is a leaf.
KASan warning:
==================================================================
BUG: KASAN: slab-out-of-bounds in keyring_compare_object+0x213/0x240 at addr ffff880060a6f838
Read of size 8 by task keyctl/1655
=============================================================================
BUG kmalloc-192 (Not tainted): kasan: bad access detected
-----------------------------------------------------------------------------