Tomas Jedlicka [Sun, 2 Jul 2017 16:17:01 +0000 (12:17 -0400)]
dtrace: Add support for manually triggered cyclics
In some scenarios it is better if a client of the cyclic subsystem can reprogram a
cyclic on its own. This is not possible with the current implementation.
This change adds cyclic_reprogram(), which can be used to schedule a cyclic from
inside and outside of its handler. A manually triggered cyclic is distinguished
from other types by having its interval set to -1.
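A minimal usage sketch, assuming the usual cyclic API shapes (the handler setup
and timing values below are illustrative, not the literal patch):
/* Create a cyclic that never fires on its own: interval == -1. */
cyc_handler_t hdlr = { .cyh_func = my_handler, .cyh_level = CY_LOW_LEVEL };
cyc_time_t when = { .cyt_when = 0, .cyt_interval = -1 };
cyclic_id_t id = cyclic_add(&hdlr, &when);

/* Later, from inside or outside the handler: fire once, 1s from now. */
cyclic_reprogram(id, gethrtime() + NANOSEC);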
Tomas Jedlicka [Fri, 30 Jun 2017 13:17:06 +0000 (09:17 -0400)]
dtrace: LOW level cyclics should use workqueues
The HIGH level cyclics are meant to be run from an interrupt handler. This works on
Linux because the hrtimer is scheduled as a tasklet. The LOW level cyclics must be
interruptible and should not be scheduled as tasklets.
DTrace currently relies on being able to call dtrace_sync() from within a cyclic
handler. On Linux it is not safe to send IPIs from within interrupt/bottom-half
handlers.
This fix changes LOW level cyclics to use workqueues. At the moment we are using the
shared system workqueue, but we may need to allocate our own if this causes
significant latency in our timer routines.
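A rough illustration of the pattern (the struct and its field names here are
hypothetical, not the actual implementation):
struct cyclic {                         /* hypothetical, for illustration */
        struct hrtimer          c_timer;
        struct work_struct      c_work;
        ktime_t                 c_interval;
        void                    (*c_handler)(void *);
        void                    *c_arg;
};

static void cyclic_low_fn(struct work_struct *work)
{
        struct cyclic *cyc = container_of(work, struct cyclic, c_work);

        /* Runs in process context: dtrace_sync()/IPIs are safe here. */
        cyc->c_handler(cyc->c_arg);
}

static enum hrtimer_restart cyclic_low_expire(struct hrtimer *timer)
{
        struct cyclic *cyc = container_of(timer, struct cyclic, c_timer);

        schedule_work(&cyc->c_work);    /* defer out of interrupt context */
        hrtimer_forward_now(timer, cyc->c_interval);
        return HRTIMER_RESTART;
}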
Signed-off-by: Tomas Jedlicka <tomas.jedlicka@oracle.com> Reviewed-by: Kris Van Hees <kris.van.hees@oracle.com> Acked-by: Chuck Anderson <chuck.anderson@oracle.com>
Kris Van Hees [Mon, 12 Jun 2017 13:33:05 +0000 (09:33 -0400)]
dtrace: add kprobe-unsafe addresses to FBT blacklist
Using the newly introduced API for adding entries to the FBT
blacklist, we register all addresses that are unsafe for kprobes,
since they are unsafe for FBT as well.
Orabug: 26190412 Signed-off-by: Tomas Jedlicka <tomas.jedlicka@oracle.com> Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Acked-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Kris Van Hees [Mon, 12 Jun 2017 13:29:16 +0000 (09:29 -0400)]
dtrace: convert FBT blacklist to RB-tree
The blacklist for FBT was implemented as a sorted list, populated from
a static list of functions. In order to allow functions to be added
from other places (i.e. programmatically), it has been converted to an
RB-tree with an API to add functions and to traverse the list. It is
still possible to add functions by address or to add them by symbol
name, to be resolved into the corresponding address.
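An illustrative sketch of an address-keyed RB-tree blacklist using the kernel
rbtree API (the entry type and function names are assumptions; the real FBT
code may differ):
struct fbt_blacklist_entry {
        struct rb_node  node;
        unsigned long   addr;
};

static struct rb_root fbt_blacklist = RB_ROOT;

static void fbt_blacklist_add(unsigned long addr)
{
        struct rb_node **p = &fbt_blacklist.rb_node, *parent = NULL;
        struct fbt_blacklist_entry *e;

        while (*p) {
                parent = *p;
                e = rb_entry(parent, struct fbt_blacklist_entry, node);
                if (addr < e->addr)
                        p = &parent->rb_left;
                else if (addr > e->addr)
                        p = &parent->rb_right;
                else
                        return;         /* already blacklisted */
        }

        e = kmalloc(sizeof(*e), GFP_KERNEL);
        if (!e)
                return;
        e->addr = addr;
        rb_link_node(&e->node, parent, p);
        rb_insert_color(&e->node, &fbt_blacklist);
}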
Orabug: 26190412 Signed-off-by: Tomas Jedlicka <tomas.jedlicka@oracle.com> Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Acked-by: Nick Alcock <nick.alcock@oracle.com>
Alan Maguire [Wed, 14 Jun 2017 13:06:27 +0000 (14:06 +0100)]
DTrace: IP provider use-after-free for drop-out probe points
KASan warnings showed a possible use-after-free for skbs in the error handling
codepath after netfilter hooks have run. Hooks may free the skb, so we
should not dereference it at drop-out probe points after NF_HOOK().
Nick Alcock [Thu, 15 Jun 2017 21:04:08 +0000 (22:04 +0100)]
ctf: fix a variety of memory leaks and use-after-free bugs
These fall into two classes, but are sufficiently intertwined that it's
easier to commit them in one go.
The first is outright leaks, which exceed 1GiB on a normal run, varying
from the tiny (failure to free getline()'s line), through the disastrous
(failure to free items filtered from a list by list_filter(), leading to
the leaking of nearly the whole of the named_structs state, which is
huge). We also leak the structs_seen hash by recreating it on
alias_fixup file switch without bothering to destroy it first.
The second is lifetime problems, centred around the stuff allocated and
freed in the detect_duplicates_tu_{init,done}() functions. These were
comparing the module name against a saved copy to see if a new vars_seen
needed to be allocated, or whether this was just a flip of TU without a
change of object file and we could get away with just flushing its
contents out -- but unfortunately the state->module_name is assigned
directly from its parameter, and *that* has a lifetime lasting only
within process_file() -- and a deduplication run, of course, involves
iterating over a great many object files. So everything works as long as
we're flipping from TU to TU within a single object file, and then we
switch object files and are suddenly strcmp()ing with freed memory.
Discard this faulty optimization entirely, and just flush the vars_seen
hash in tu_done() and both create and destroy it in scan_duplicates(),
right where we create and destroy related stuff too.
Something similar happens with the state->dwfl_file_name due to its
derivation from id_file->file_name: if no duplicates are found, we
list_filter() that id_file straight out of the structs_seen list and
free it, and then on the next call state->dwfl_file_name points to freed
memory.
Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Acked-by: Kris Van Hees <kris.van.hees@oracle.com> Acked-by: Tomas Jedlicka <tomas.jedlicka@oracle.com>
Orabug: 26283357
Nick Alcock [Wed, 24 May 2017 16:57:05 +0000 (17:57 +0100)]
ctf: prevent modules on the dedup blacklist from sharing any types at all
The deduplication blacklist exists to deal with a few modules, mostly
old ISA sound drivers, which consist of a #define and a #include of
another module, which then uses #ifdef profusely, sometimes on structure
members. This leads to DWARF which has differently shaped structures
in the same source location, which conflicts with the purpose of the
deduplicator, which is to use the source location to identify identical
structures and merge them.
Before now, the blacklist worked by preventing the appearance of a
structure in a module on the blacklist from making a structure shared:
it could still be shared if other modules defined it. This fails to
work for two reasons.
Firstly, if it is used by only *one* other module, we'll hit an
assertion failure in the emission phase which checks that any type it
emits is either defined in the module being emitted or in the shared
type repo.
We could remove that assertion, but we'd still be in the wrong, because
you could easily have a type in some header used by many modules that
said
struct trouble {
#ifdef MODULE_FOO
/* one definition */
#else
/* a different definition */
#endif
}
and a module that says
#define MODULE_FOO 1
#include <other_module.c>
Even if we blacklisted this module (and we would), this would still
fail, because 'struct trouble' would end up in the shared type
repository, and the existing code would fail to emit a new definition of
it in the blacklisted module, even though it should because its
definition is different.
This shows that if a module is pulling tricks like this we cannot trust
its use of shared types at all, since we have no visibility into
preprocessor definitions: regardless of the space cost (about 40KiB per
module), we cannot let it share any types whatsoever with the rest of
the kernel. Rather than piling heaps of blacklist checks in all over
dwarf2ctf to implement this, we do it by exploiting the fact that the
deduplicator works entirely on the basis of type IDs. We add a
'blacklist type prefix' that type_id() can add to the start of its IDs
(with some extra decoration, because the start of type IDs is a file
path, so we want this to be an impossible path). If we set this prefix
to the module name if and only if the module is blacklisted, and do not
add one otherwise, then every blacklisted module will have a unique set
of IDs for all its types, which can be shared within the module but not
outside it, so every type in the module will be unique and none of them
will end up in the shared type repository.
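In sketch form, the idea is just a conditional prefix at ID-construction time
(the helper names here are illustrative, not the real dwarf2ctf code):
if (dedup_blacklist_member(module_name))        /* hypothetical check */
        id = str_appendf(id, "//%s//", module_name); /* impossible path */
id = str_appendf(id, "%s:", decl_file_name);    /* normal file-path prefix */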
While we're at it, add yet another ancient ISA sound driver that plays
the same games to the blacklist.
This fix makes blacklisting modules much more space-expensive: each such
module expands the current size of the kernel module package by about
40KiB. (But there is only one blacklisted module built in current UEK,
so this is a tolerable cost.)
Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-by: vincent.lim@oracle.com
Orabug: 26137220
[ 437f]  structure_type
             name                 (strp) "IO_APIC_route_entry"
             byte_size            (data1) 8
[ 438b]    member
             name                 (strp) "vector"
             byte_size            (data1) 4
             bit_size             (data1) 8
             bit_offset           (data1) 24
             data_member_location (data1) 0
[ 439a]    member
             name                 (strp) "delivery_mode"
             byte_size            (data1) 4
             bit_size             (data1) 3
             bit_offset           (data1) 21
             data_member_location (data1) 0
[ 43a9]    member
             name                 (strp) "dest_mode"
             byte_size            (data1) 4
             bit_size             (data1) 1
             bit_offset           (data1) 20
             data_member_location (data1) 0
[ 43b8]    member
             name                 (strp) "delivery_status"
             byte_size            (data1) 4
             bit_size             (data1) 1
             bit_offset           (data1) 19
             data_member_location (data1) 0
But CTF on little-endian requires the opposite: it has special handling
for the first member of a structure which assumes that it is closest to
the start of memory: in effect, it wants structure member addresses to
always ascend, even within bitfields, regardless of endianness (which
makes some sense intellectually as well).
dwarf2ctf's emission code generally emits sequentially, so except where
deduplication has eliminated items or dependent type insertion has added
them it emits things in the CTF in the same order as in the DWARF. We
can avoid this for short runs, as in this case, by switching from
iteration to recursion in such cases, spotting a run at identical
data_member_location, recursing until we hit the end of the run, then
unwinding and emitting as we unwind until the recursion is over.
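A sketch of that recursion (the list node and emitter names are hypothetical):
/* Hypothetical member-list node, for illustration only. */
struct member {
        struct member *next;
        unsigned long location;         /* data_member_location */
};

static void emit_ascending_run(struct member *m, unsigned long location)
{
        /*
         * Recurse to the end of a run of members sharing one
         * data_member_location, then emit on the unwind so the run
         * is written in ascending (memory) order.
         */
        if (m->next && m->next->location == location)
                emit_ascending_run(m->next, location);
        emit_member(m);                 /* hypothetical emitter */
}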
Orabug: 25815129 Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-by: tomas.jedlicka@oracle.com
Nick Alcock [Thu, 2 Feb 2017 00:22:05 +0000 (00:22 +0000)]
ctf: bitfield support
Support for bitfields in dwarf2ctf was embryonic and largely untested
before now due to bugs in libdtrace-ctf: with a fix for those in hand,
we can fix bitfields here too.
Bitfields in DWARF and CTF have annoyingly different representations.
In DWARF, a bitfield is represented something like this:
[ 16561]   member
             name                 (string) "ihl"
             decl_file            (data1) 225
             decl_line            (data1) 87
             type                 (ref4) [ 38]
             byte_size            (data1) 1
             bit_size             (data1) 4
             bit_offset           (data1) 4
             data_member_location (sdata) 0
[...]
[ 38]      typedef
             name                 (strp) "__u8"
             decl_file            (data1) 36
             decl_line            (data1) 20
             type                 (ref4) [ 43]
i.e. the padding, size, and starting location are all represented in the
member, where you would conceptually expect them to be.
In CTF, the starting location of the conceptual containing type of a
bitfield is encoded in the member, but the size and starting location of
the bitfield itself are represented in the dependent type, which is added
as a "non-root" type (which cannot be looked up by name) so that it can
have the same name as the un-bitfielded base type without causing a name
clash.
We use the new DIE attribute override mechanism added in commit 8935199962
to override DW_AT_bit_size and DW_AT_bit_offset for such members (fixing
a pre-existing bug in the process: we were looking for the DW_AT_bit_size
on the structure as a whole!), and in the base-type emission function
checking for the existence of a DW_AT_bit_size/offset and responding to
them by overriding the size and offset derived from DW_AT_byte_size and
noting that this is a non-root type. (The override needed, annoyingly,
is endian-dependent, because CTF consumers assume that on little-endian
systems the offset relates to the least-significant edge of the bitfield,
counting from the LSB, while DWARF assumes the opposite).
But this is not enough: unless more is done, this type will appear
to have the same type ID as its non-bitfield equivalent, leading to
confusion about which CTF file it should appear in and quite possibly
leading to it ending up in a CTF file that the structure containing the
bitfield cannot even see. So augment type_id()'s representation of
base types from e.g. 'int' to something like 'int:4' if and only if
a DW_AT_bit_size or an override of it is present and that override is
a different size from the native bitness of the type itself (the
DW_AT_byte_size). We encode the bit_offset only if there is also a
bit_size, as something like int:4:2. (That's unambiguous because
these attributes always arrive in pairs in bitfields and never appear
in anything else in C-generated DWARF.)
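In sketch form (the variable names are illustrative):
char id[256];

/*
 * Augment the base-type ID only for genuine bitfields: 'int' stays
 * 'int' unless a bit_size differing from the type's native size is in
 * force, in which case it becomes 'int:4', or 'int:4:2' with an offset.
 */
if (bit_size != 0 && bit_size != byte_size * 8) {
        if (bit_offset != 0)
                snprintf(id, sizeof(id), "%s:%lu:%lu",
                         base_name, bit_size, bit_offset);
        else
                snprintf(id, sizeof(id), "%s:%lu", base_name, bit_size);
} else
        snprintf(id, sizeof(id), "%s", base_name);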
Finally, this breaks an optimization in the deduplicator, which was that
all structure members reference some top-level type, so when marking a
type as seen, structure members could just be skipped. Now, they have
to be chased iff they are bitfields using the same override trickery as
above to change the DW_AT_bit_size/offset in the member's type DIE, and
that bitfield override needs to be passed down to type_id() when finally
marking duplicated types as shared too. (Avoid code duplication by
factoring out some related code from a horrible conditional in
detect_duplicates() into a new type_needs_sharing() function.)
Orabug: 25815129 Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-by: tomas.jedlicka@oracle.com
Nick Alcock [Thu, 27 Apr 2017 17:39:53 +0000 (18:39 +0100)]
ctf: emit file-scope static variables
For as long as dwarf2ctf has emitted variables into the CTF, it has
emitted only extern variables. This is for two reasons: firstly, a
misconception that global variables with static linkage did not appear
in /proc/kallmodsyms (they do), and secondly, an ambiguity problem.
We cannot usefully distinguish two variables with the same name in the
same module: they differ only by address, so if both are static
variables in different translation units, we can't tell which is which.
(Also, we cannot emit more than one variable with a given name into a
given module CTF file in any case). CTF's modular rather than
translation-unit-based variable/type scope bites us here.
I sought to avoid this bug by emitting only non-static variables, but
this does not save us, because there might be another static variable
with the same name in the same module, whereupon the ambiguity problem
arises all over again. We must identify such ambiguous cases and strip
them out (not emitting CTF for this variable at all): then we can emit
static file-scope variables into the CTF without worry.
We do this by introducing a new, module-scope vars_seen hashtable into
the deduplicator state, which gains an entry for every variable name
seen in this module, and indicates whether it is static or not. This
lets us tell if we have seen a variable with a given name more than
once, and if we have, whether any of the instances was static. Then
we can consult this blacklist at variable-emission time, and skip any
variable if it is in the blacklist.
Unfortunately, computing the name for the variable's entry in the
blacklist is fairly expensive, and has to be done for every variable:
worse yet, this increases the number of variables emitted drastically
(in vmlinux and the shared type repo alone, we go from 2247 to 10409
variables), and emitting that much CTF is not free: so the runtime goes
up by about 5%. We will reclaim this lost speed soon (and then some).
Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-by: tomas.jedlicka@oracle.com
Orabug: 25962387
Nick Alcock [Thu, 9 Mar 2017 14:31:33 +0000 (14:31 +0000)]
ctf: speed up the dwarf2ctf duplicate detector some more
The duplicate detector's alias detection pass is still doing unnecessary
work. When a non-opaque structure is marked shared, it is sometimes
necessary to do another deduplication scan, because the marking may have
marked types used within the structure as shared, which may require yet
more types (e.g. opaque uses of that type in other modules) to be
marked shared as well. So while we know this pass can only affect
structures/unions/enums that have names, and their interior types, it
would seem that we must keep scanning them to see if they need
deduplication until none remain.
However, there is one exception: if a non-opaque type and its
corresponding opaque type are both already marked shared, or if we have
just processed them and marked them accordingly, we know that we will
never need to re-mark those particular types again, since they can't be
more shared than they already are: so we can remove them from
consideration in future passes. Because we are only opening DWARF files
in this pass as needed now, this hugely cuts down the number of files we
process in subsequent passes: we still see the same number of passes,
but passes after the first (which marks tens of thousands of opaque
types as shared) only open a few files, mark a few hundred types, and
flash past in under a second.
In my tests, the alias fixup pass now takes under 10s, which can be
more or less ignored: all other passes other than initialization
and writeout are much more expensive. (Before this series, it took
over a minute on the fastest machine I have access to, and over three
minutes on SPARC.)
Orabug: 25815306 Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-by: tomas.jedlicka@oracle.com
Nick Alcock [Tue, 7 Mar 2017 20:31:12 +0000 (20:31 +0000)]
ctf: speed up the dwarf2ctf duplicate detector
dwarf2ctf is very slow. By far its most expensive pass is the duplicate
detector. Of necessity this has to scan every object file in the kernel
to identify shared types, which is pretty expensive even when the cache
is hot, but it's not doing this particularly efficiently and can be sped
up quite a lot.
As of commit f0d0ebd0b4, the duplicate detector's job was cut in two:
the first pass identifies all non-structure types and non-opaque
structures used by more than one module, and the second "alias fixup"
pass repeatedly unifies opaque and non-opaque structures with the same
name in different translation units, sharing their definitions. Both of
these passes generate or modify type IDs, so both need access to the
DWARF DIE of the types in question, since the type ID is derived
recursively from the DIE: but the second pass need not look at the DWARF
of any translation units that do not contain structures that might be
unified. However, the two are currently written in the same way, using
process_file() to traverse all the kernel's DWARF, even though the alias
fixup pass does almost nothing with that DWARF, and has less and less to
do on each iteration.
The sheer amount of wasted time here is remarkable. We traverse the
DWARF once for primary duplicate detection, once for CTF emission, but
often four or five or even seven or eight times for the alias fixup pass
(the precise count depends upon the relationships between types and the
order in which the DWARF files are traversed).
So improve things by tracking all types that the alias fixup pass is
interested in (structure types that are not anonymous inner structures
nor opaque nor used only as the types of array indices) and stash them
away during the first duplicate detection pass in a new temporary
singly-linked list, detect_duplicates_state.named_structs. We remember
the filename and DWARF DIE offset (so we can look the type up again) and
the type ID (because we just worked it out, so recomputing it would be a
waste of time). Then, rather than doing a process_file() for the alias
fixup pass, traverse the linked list, opening DWARF files as needed to
mark things as shared (but no more often than that: marking non-opaque
types needs the DIE so we can traverse into all its subtypes and mark
them shared too, but marking opaque types needs no DIE at all).
This has a significant effect on dwarf2ctf speed, slashing 25% off its
runtime in my tests, reducing the duplicate detector's share of the
runtime from about 80s to about 24s.
The dominant time consumer is now CTF emission rather than the
duplicate detector.
Orabug: 25815306 Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-by: tomas.jedlicka@oracle.com
Nick Alcock [Thu, 2 Feb 2017 00:23:16 +0000 (00:23 +0000)]
ctf: fix the size of int and avoid duplicating it
An upcoming bitfield-capable release of libdtrace-ctf adds extra
consistency checking which identified embarrassing but heretofore
harmless bugs in dwarf2ctf's representation of basic types.
dwarf2ctf has to emit a few basic types "by hand", since these are used
in the representation of types one of CTF or DWARF does not bother to
encode more complex forms for (function pointers, always encoded as
'int (*)()' in CTF, and 'void'). Embarrassingly, we were getting the
size of 'int' wrong: it should be in bits but we were emitting a count
of bytes instead, leading to a CTF representation of a 4-bit int.
This is always overridden by an accurate representation built into
DTrace in real use, but libdtrace-ctf finds the inconsistency anyway.
Worse is that it emits a representation of 'int' twice, once by hand in
init_ctf_table() and then again when it comes across the real 'int' type
DIE in the debuginfo. This is because we forgot to intern the types we
add by hand in the id_to_module and id_to_type hashes we use to detect
duplicate types, because we can only intern types we have a DWARF DIE
for, and these types either have no DIE at all or we just don't know
about it because we haven't even begun to traverse the debuginfo to look
up DIEs yet.
Fix this by adding a mark_shared_by_name() function which can be used to
intern basic types by name in these hashes. It has hardwired knowledge
of the type ID notation, but no more such knowledge than is already
present in detect_duplicates_alias_fixup() and friends (no more than
the fact that types with no associated filename or line number are
preceded by "////").
Orabug: 25815129 Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-by: tomas.jedlicka@oracle.com
Nick Alcock [Wed, 1 Feb 2017 23:51:21 +0000 (23:51 +0000)]
ctf: allow overriding of DIE attributes: use it for parent bias
Currently, dwarf2ctf's CTF type emission machinery emits any given CTF
type based purely on the data in the corresponding DWARF DIE, with
possible adjustments based on the DIE's structural parent (useful for
structure members, if not so much for top-level types). But from time
to time we need to adjust a CTF type depending on some property not of
its structural parent but of a type that depends on it: i.e., for
structures, we might want to adjust the offset of the members to cater
for the fact that this structure is being structurally merged with a
shorter assignment-compatible structure with identical name which
appeared in some translation unit that was processed earlier.
Currently, we handle this case, and only this case, by passing down a
'parent_bias' to all the CTF assembly functions. Replace this with a
more generic mechanism whereby an array of 'overrides' can be passed
down to construct_ctf_id(), die_to_ctf(), and all subordinate assembly
functions: these overrides consist of an array of die_override_t's,
where each element can either override or add to the value of one DWARF
attribute: this kicks in for specific DWARF tags only. (Only numerical
attributes are supported, obviously.)
(We also pass the overrides down to type_id() so that overrides can
affect the ID of types and thus cause a single DWARF type to generate
multiple CTF types, though we do not use this facility in this commit.)
To process the attributes, we introduce a new private_find_override() to
search the override list, and private_dwarf_udata() to fetch a udata and
handle it. (We do not override dwarf_hasattr(): anything that wants to
override an attribute that may not exist has to call
private_find_override() itself. If this happens a lot, we can introduce
an override for dwarf_hasattr() too.)
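The shape of this might be roughly as follows (field names and the
zero-tag-terminated array convention are assumptions, sketched from the
description; the search helper mirrors what private_find_override() does):
typedef struct die_override {           /* field names assumed */
        int             tag;            /* e.g. DW_TAG_member */
        int             attribute;      /* e.g. DW_AT_bit_size */
        enum { DIE_OVERRIDE_REPLACE, DIE_OVERRIDE_ADD } op;
        Dwarf_Word      value;
} die_override_t;

static die_override_t *
find_override(die_override_t *overrides, int tag, int attribute)
{
        /* Assume a zero-tag-terminated array, purely for illustration. */
        for (; overrides != NULL && overrides->tag != 0; overrides++)
                if (overrides->tag == tag &&
                    overrides->attribute == attribute)
                        return overrides;
        return NULL;
}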
Currently we use this in exactly one place: in assemble_ctf_su_member(),
to replace the use of parent_bias. Further uses will come in the next
commit: thanks to this commit, none of them will require adding new
parameters to all the CTF construction functions :)
Also rename the 'override' parameter on the CTF construction functions,
which was used by array assembly to indicate that CTF types should
replace their parent type, with a much less confusingly-named 'replace'
parameter. (It was badly named before, but now that we have a parameter
named 'overrides' it is devastatingly badly named.)
Orabug: 25815129 Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-by: tomas.jedlicka@oracle.com
Alan Maguire [Wed, 12 Apr 2017 21:34:56 +0000 (22:34 +0100)]
DTrace tcp/udp provider probes
This patch adds DTrace SDT probes for the TCP and UDP protocols. For tcp
the following probes are added:
tcp:::send Fires when a tcp segment is transmitted
tcp:::receive Fires when a tcp segment is received
tcp:::state-change Fires when a tcp connection changes state
tcp:::connect-request Fires when a SYN segment is sent
tcp:::connect-refused Fires when a RST is received for connection attempt
tcp:::connect-established
Fires when three-way handshake completes for
initiator
tcp:::accept-refused Fires when a RST is sent refusing connection attempt
tcp:::accept-established
Fires when three-way handshake succeeds for acceptor
Arguments for all of these probes are:
arg0 struct sk_buff *; to be translated into pktinfo_t * containing
implementation-independent packet data
arg1 struct sock *; to be translated into csinfo_t * containing
implementation-independent connection data
arg2 __dtrace_tcp_void_ip_t *; to be translated into ipinfo_t * containing
implementation-independent IP information. Custom type is used as
this gives DTrace a hint that we can source IP information from other
arguments if the IP header is not available.
arg3 struct tcp_sock *; to be translated into tcpsinfo_t * containing
implementation-independent TCP connection data
arg4 struct tcphdr *; to be translated into a tcpinfo_t * containing
implementation-independent TCP header data
arg5 int representing previous state; to be translated into a
tcplsinfo_t * which contains the previous state. Differs from
current state (arg6) for state-change probes only.
arg6 int representing current state. Cannot be sourced from struct
tcp_sock as we sometimes need to probe before state change is
reflected there
arg7 int representing direction of traffic for probe; values are
DTRACE_NET_PROBE_INBOUND for receipt of data and
DTRACE_NET_PROBE_OUTBOUND for transmission.
For udp the following probes are added:
udp:::send Fires when a udp datagram is sent
udp:::receive Fires when a udp datagram is received
Arguments for these probes are:
arg0 struct sk_buff *; to be translated into pktinfo_t * containing
implementation-independent packet data
arg1 struct sock *; to be translated into csinfo_t * containing
implementation-independent connection data
arg2 void_ip_t *; to be translated into ipinfo_t * containing
implementation-independent IP information.
arg3 struct udp_sock *; to be translated into a udpsinfo_t * containing
implementation-independent UDP connection data
arg4 struct udphdr *; to be translated into a udpinfo_t * containing
implementation-independent UDP header information.
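For illustration only, a udp:::send probe site in this style might look
something like the line below; the macro name and argument convention are
assumptions based on the DTRACE_PROBE macros mentioned elsewhere in this log,
not the literal patch:
/* In the UDP transmit path, just before handing the skb down: */
DTRACE_PROBE(udp__send, skb, sk, ip_hdr(skb), udp_sk(sk), udp_hdr(skb));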
Kris Van Hees [Wed, 24 May 2017 03:34:53 +0000 (23:34 -0400)]
dtrace: ensure limit is enforced even when pcs is NULL
The dtrace_user_stacktrace() functions for x86_64 and sparc64 were
not handling the specified limit (st->limit) correctly if the buffer
for PC values (st->pcs) was NULL. This commit ensures that we
decrement the limit whenever we encounter a PC, whether it gets
stored or not.
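The essence of the fix, in sketch form (the walker loop is simplified and the
frame-walking helpers are hypothetical):
unsigned long pc = first_user_pc(&frame);       /* hypothetical walker */

while (pc != 0 && st->limit > 0) {
        if (st->pcs != NULL)
                *st->pcs++ = pc;
        st->limit--;            /* decrement whether stored or not */
        pc = next_user_pc(&frame);
}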
Orabug: 25949692 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Acked-by: Nick Alcock <nick.alcock@oracle.com>
Kris Van Hees [Tue, 16 May 2017 03:29:16 +0000 (23:29 -0400)]
dtrace: make x86_64 FBT return probe detection less restrictive
The FBT return probe detection mechanism on x86_64 was requiring that
the "ret" instruction be followed by a "push %rbp" or "nop", which is
much too restrictive. The new code allows probing of all "ret"
instructions that occur in a function, regardless of what instruction
follows.
In order to make this possible, the functions to set and clear FBT
probes on x86_64 (dtrace_invop_add() and dtrace_invop_remove()) have
been modified to accept a 2nd argument that indicates the instruction
to patch the probe location with. This is needed because FBT return
probes need a different instruction on x86_64 (LOCK prefix to force
an invalid opcode trap isn't safe because we do not know what
instruction may follow the "ret").
This commit also fixes the declaration of the dtrace_bad_address()
function that was missing its return type.
Orabug: 25949048 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Acked-by: Nick Alcock <nick.alcock@oracle.com>
Kris Van Hees [Tue, 16 May 2017 03:25:09 +0000 (23:25 -0400)]
dtrace: support passing offset as arg0 to FBT return probes
FBT return probes pass the offset from the function start (in bytes)
as arg0. To make that possible, we pass the offset value in the call
to fbt_add_probe. For FBT entry probes we pass 0 (which is ignored).
Orabug: 25949086 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Acked-by: Nick Alcock <nick.alcock@oracle.com>
Kris Van Hees [Tue, 16 May 2017 03:05:41 +0000 (23:05 -0400)]
dtrace: make FBT entry probe detection less restrictive on x86_64
The logic on x86_64 to determine whether we can probe a function is
too restrictive. By placing the probe on the "push %rbp" instruction
we can cover more functions, in case the "mov %rsp,%rbp" instruction
does not follow it immediately.
Orabug: 25949030 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Reviewed-by: Tomas Jedlicka <tomas.jedlicka@oracle.com>
Kris Van Hees [Tue, 16 May 2017 02:42:30 +0000 (22:42 -0400)]
dtrace: adjust FBT entry probe detection for OL7
On OL7, function prologues can be prefixed by a (5-byte) call
instruction on x86_64, which breaks the logic to determine if
we can place an FBT entry probe on that function. The new logic
accounts for the possibility that the anticipated prologue does
not show up as first instruction of the function.
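Schematically (the opcode checks are shown for illustration and the helper is
hypothetical; the real scanner is more careful):
uint8_t *insn = (uint8_t *)func_start;

if (insn[0] == 0xe8)            /* opcode of a 5-byte call rel32 */
        insn += 5;              /* skip the prefixed call */
if (insn[0] == 0x55)            /* push %rbp: the anticipated prologue */
        fbt_add_entry_probe(insn);      /* hypothetical helper */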
Orabug: 25921361 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Reviewed-by: Nick Alcock <nick.alcock@oracle.com>
dtrace: fix handling of save_stack_trace sentinel (x86 only)
On x86 only, when save_stack_trace() writes fewer stack frames to the
buffer than there is space for, a ULONG_MAX is added as a sentinel. The
DTrace code was mistakenly treating the buffer as always ending with a
ULONG_MAX.
Orabug: 25727046 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Reviewed-by: Tomas Jedlicka <tomas.jedlicka@oracle.com>
Kris Van Hees [Wed, 1 Mar 2017 04:37:11 +0000 (23:37 -0500)]
dtrace: continuing the FBT implementation and fixes
This commit continues the implementation of Function Boundary Tracing
(FBT) and fixes various problems with the original implementation and
other things in DTrace that it caused to break. It is done as a single
commit due to the intertwined nature of the code it touches.
1. We were only handling unaligned memory access traps as part of the
NOFAULT access protection. This commit adds handling of data access
and instruction access traps as well.
2. When an OOPS takes place, we now add output about whether we are
in DTrace probe context and what the last probe was that was being
processed (if any). That last data item isn't guaranteed to always
have a valid value. But it is helpful.
3. New ustack stack walker implementation (moved from module to kernel
for consistency, and because we need access to low-level structures
like the page tables) for both x86 and sparc. The new code avoids
any locking or sleeping. The new user stack walker is accessed as
a sub-function of dtrace_stacktrace(), selected using the flags
field of stacktrace_state_t.
4. We added a new field to the dtrace_psinfo_t structure (ustack) to
hold the bottom address of the stack. This is needed in the stack
walker (specifically for x86) to know when we have reached the end
of the stack. It is initialized from copy_process (in DTrace
specific code) when stack_start is passed as parameter to clone.
It is also set from dtrace_psinfo_alloc() (which is generally called
from performing an exec), and there it gets its value from the
mm->start_stack value.
5. The FBT black lists have been updated with functions that may be
invoked during probe processing. In addition, for x86_64 we added
explicit filter out of functions that start with insn_* or inat_*
because they are used for instruction analysis during probe
processing.
6. On sparc64, per-cpu data is accessed by means of a global register
that holds the base address for this memory area. Some assembler
code clobbers that register in some cases, so it is not safe to
depend on it in probe context. Instead, we explicitly access
the data based on smp_processor_id().
7. We added a new CPU DTrace flag (CPU_DTRACE_PROBE_CTX) to flag that
we are processing in DTrace probe context. It is primarily used
to detect attempts to re-enter dtrace_probe().
Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Acked-by: Nick Alcock <nick.alcock@oracle.com>
Orabug: 21220305
Orabug: 24829326
Kris Van Hees [Mon, 27 Feb 2017 15:39:07 +0000 (10:39 -0500)]
dtrace: ensure DTrace can use get_user_pages safely
The processing of the DTrace-specific FOLL_IMMED flag was not robust
enough. We could still get into a situation where cond_resched() was
called (which is bad) or where the VMA area would get extended (which
is also bad). The only code that passes this flag is DTrace support
code, and when the flag is not passed, the execution flow is not at all
affected.
Orabug: 25640153 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Reviewed-by: Chuck Anderson <chuck.anderson@oracle.com> Reviewed-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-by: Tomas Jedlicka <tomas.jedlicka@oracle.com>
Kris Van Hees [Fri, 24 Feb 2017 23:40:40 +0000 (18:40 -0500)]
dtrace: enable paranoid mode and IST shift for xen_int3
The Xen PVM path into an INT3 trap was not using paranoid=1 mode nor was
it using an IST shift as is done for HW INT3 traps. This interferes with
the instruction emulation code check based on the handler return value.
Orabug: 25580519 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Kris Van Hees [Mon, 20 Feb 2017 12:16:48 +0000 (07:16 -0500)]
dtrace: ensure we skip the entire SDT probe point
With the introduction of FBT support, the logic for skipping instructions
(with potential emulation of the skipped instruction) changed. This change
did not take into account the fact that is-enabled probes on x86_64 use a
3-byte sequence for setting ax to 0, followed by a 2-byte NOP. The old logic
resulted in failing to skip the setting of ax correctly.
New logic uses the knowledge that all SDT probes on x86_64 are of the same
length (ASM_CALL_SIZE) and therefore we can simply skip that number of bytes
and continue without any emulation.
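In sketch form, assuming a register-state argument like that used by the trap
handlers (the byte sequence shown is the usual encoding, given for
illustration):
/*
 * An is-enabled probe site on x86_64:
 *   48 31 c0   xor %rax,%rax   3 bytes: set the return value to 0
 *   66 90      nop             2-byte NOP
 * All SDT probe sites are ASM_CALL_SIZE (5) bytes long, so the handler
 * can skip the whole site without emulating anything:
 */
regs->ip += ASM_CALL_SIZE;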
Orabug: 25557283 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Acked-by: Nick Alcock <nick.alcock@oracle.com>
Kris Van Hees [Fri, 16 Dec 2016 23:54:10 +0000 (18:54 -0500)]
dtrace: function boundary tracing (FBT)
This patch implements Function Boundary Tracing (FBT) for x86 and
sparc. It covers generic, x86 and sparc specific changes.
Generic:
A new exported function (which will be provided by each supported
architecture) dtrace_fbt_init() is added to be called from module
code to initiate the discovery of FBT probes. Any eligible probe
(based on arch-dependent logic) is passed to the provided callback
function pointer for processing.
A new option is added to the DTrace kernel config to enable FBT.
The logic for determining the size of the pdata memory block (which
is arch-dependent) has changed at the architecture level, and the
code for setting up the kernel pseudo-module has been modified to
account for that change.
The post-processing script dtrace_sdt.sh now determines how
many functions exist in the kernel, as an upper bound on the number
of functions that can be traced with FBT. The logic ensures that
aliases are not counted.
x86:
On x86_64, entry FBT probes are implemented similarly to SDT probes on
that same architecture. A trap is triggered from the probe location,
which causes a call into the DTrace core. The entry probe is placed on
the 'mov %rsp,%rbp' instruction that immediately followed a 'push %rbp'
as part of the function prologue. If this instruction sequence is not
found, no entry probe will be created for the function.
sparc:
On sparc64, entry FBT probes are also implemented similarly to SDT
probes on that same architecture. A call is made into a trampoline
(allocated within the limited addressing range for a single assembler
instruction) which in turn calls into the DTrace core. The entry probe
is placed on the location where typically a call is made to _mcount for
profiling purposes. Under normal operation, that instruction will be
a NOP.
Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Acked-by: David Mc Lean <david.mclean@oracle.com> Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com> Reviewed-by: Allen Pais <allen.pais@oracle.com>
Orabug: 21220305
Orabug: 24829326
Kris Van Hees [Thu, 22 Dec 2016 07:20:57 +0000 (02:20 -0500)]
dtrace: add support for passing return value from trap handlers
Prior to this patch, trap handlers were called to service traps without
any mechanism to report back to the lowest level trap entry point. The
DTrace FBT implementation on x86 needs to be able to do just that because
FBT probes are enabled by replacing a one-byte assembler instruction with
a one-byte instruction that causes a trap. After the trap is handled, we
need to emulate the instruction that was replaced prior to returning to
the original instruction stream. Because different instructions may occur
at FBT probe points, we need to be able to report back to the trap entry
point which instruction was replaced by the trap.
Handlers that do not use notify_die() always return 0. Those that do use
notify_die() to call handlers have been modified to return the value that
the handler passed back on its return.
Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Reviewed-By: Dan Duval <dan.duval@oracle.com>
Orabug: 25312278
Kris Van Hees [Fri, 16 Dec 2016 18:04:52 +0000 (13:04 -0500)]
dtrace: ensure that our die notifier gets executed amongst the first
The die notifier is crucial for implementing safe memory access in
DTrace. To avoid other handlers potentially causing interference, we
set the priority of our handler at the highest level.
Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Acked-by: David Mc Lean <david.mclean@oracle.com>
Nick Alcock [Mon, 28 Nov 2016 14:53:15 +0000 (14:53 +0000)]
dtrace: allow invop handler to specify number of insns to skip
Rather than unconditionally skipping one instruction, repurpose the
unused return value of the invop handler to allow it to specify the
amount to skip. (In the common case where it knows how many
instructions to skip at all times, this is more efficient.)
Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Acked-by: Kris Van Hees <kris.van.hees@oracle.com>
Nick Alcock [Wed, 23 Nov 2016 17:50:09 +0000 (17:50 +0000)]
dtrace: is-enabled probes for SDT
"Is-enabled probes" are a conditional, long supported in userspace
probing, which lets you avoid doing expensive data-collection operations
needed only by DTrace probes unless those probes are active.
e.g. (an example using the core DTRACE_PROBE / DTRACE_IS_ENABLED macros,
rather than the DTRACE_providername macros used in practice, because
no such macros have been added to the kernel yet):
if (DTRACE_IS_ENABLED(__io_wait__start)) {
/* stuff done only when io:::wait-start is enabled */
}
As with normal SDT probes, the DTRACE_IS_ENABLED() macro compiles to a
stub function call (named like __dtrace_isenabled_*()) which is replaced
at bootup/module load time with an architecture-dependent instruction
sequence analogous to a function that always returns false, though no
function call is generated. At probe enabling time, this is replaced
with a trap into dtrace just like normal dtrace probes, incurring a
performance hit, but only when the probe is active.
The probe name used in the various ELF sections that track SDT
probes begins with a ? character to help the module distinguish
is-enabled probes from normal probes: this is internal to the DTrace
implementation and is otherwise invisible.
(Thanks to Kris Van Hees for initial work on this.)
Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Acked-by: Kris Van Hees <kris.van.hees@oracle.com>
Orabug: 25143173
Nick Alcock [Mon, 31 Oct 2016 13:12:56 +0000 (13:12 +0000)]
dtrace: check for errors when getting a new fd
get_unused_fd_flags() can legitimately fail, e.g. when the fd table is
full. We need to diagnose that rather than trying to fd_install() the
resulting negative number.
Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Acked-by: Kris Van Hees <kris.van.hees@oracle.com> Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Orabug: 24977175
Nick Alcock [Mon, 31 Oct 2016 10:44:26 +0000 (10:44 +0000)]
dtrace: take mmap_sem in PTRACE_GETMAPFD
Without this, we may oops if the process exec()s and discards its
address space after we find_vma().
Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Acked-by: Kris Van Hees <kris.van.hees@oracle.com> Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Orabug: 24977175
Junxiao Bi [Tue, 1 Nov 2016 06:42:20 +0000 (14:42 +0800)]
ocfs2: fix not enough credit panic
The following panic was caught when running the ocfs2 disconfig single test
(block size 512 and cluster size 8192). ocfs2_journal_dirty() returned
-ENOSPC, which means the credits were used up. The total credits should
include 3 times "num_dx_leaves" from ocfs2_dx_dir_rebalance(),
because 2 times will be consumed in ocfs2_dx_dir_transfer_leaf() and
1 time will be consumed in ocfs2_dx_dir_new_cluster()->
__ocfs2_dx_dir_new_cluster()->ocfs2_dx_dir_format_cluster(). But only
two times are included in ocfs2_dx_dir_rebalance_credits(), so fix it.
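Schematically, the fix amounts to the following (illustrative, not the
literal diff):
/* Account for all three consumers of num_dx_leaves credits above. */
credits += 3 * num_dx_leaves;           /* previously 2 * num_dx_leaves */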
Eric Ren [Fri, 30 Sep 2016 22:11:32 +0000 (15:11 -0700)]
ocfs2: fix deadlock on mmapped page in ocfs2_write_begin_nolock()
The testcase "mmaptruncate" of ocfs2-test deadlocks occasionally.
In this testcase, we create a 2*CLUSTER_SIZE file and mmap() on it;
there are 2 processes repeatedly performing the following operations
respectively: one is doing memset(mmaped_addr + 2*CLUSTER_SIZE - 1, 'a',
1), while the other is doing ftruncate(fd, 2*CLUSTER_SIZE) and then
ftruncate(fd, CLUSTER_SIZE), again and again.
In ocfs2_write_begin_nolock(), we first grab the pages and then allocate
disk space for this write; ocfs2_try_to_free_truncate_log() will be
called if -ENOSPC is returned; if we're lucky to get enough clusters,
which is usually the case, we start over again.
But in ocfs2_free_write_ctxt() the target page isn't unlocked, so we
will deadlock when trying to grab the target page again.
Also, -ENOMEM might be returned in ocfs2_grab_pages_for_write().
Another deadlock will happen in __do_page_mkwrite() if
ocfs2_page_mkwrite() returns non-VM_FAULT_LOCKED along with a
locked target page.
These two errors fail on the same path, so fix them by unlocking the
target page manually before ocfs2_free_write_ctxt().
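The shape of the fix, as a sketch (an error-path excerpt; the real error path
has more bookkeeping, and the write-context field name is assumed):
/* Unlock the target page before tearing down the write context, on
 * both the ENOSPC-retry and ENOMEM paths. */
if (wc->w_target_page)
        unlock_page(wc->w_target_page);
ocfs2_free_write_ctxt(wc);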
Jan Kara helped me clear out the JBD2 part and suggested the hint for the
root cause.
Changes since v1:
1. Also put ENOMEM error case into consideration.
Link: http://lkml.kernel.org/r/1474173902-32075-1-git-send-email-zren@suse.com Signed-off-by: Eric Ren <zren@suse.com> Reviewed-by: He Gang <ghe@suse.com> Acked-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit c33f0785bf292cf1d15f4fbe42869c63e205b21c)
Joseph Qi [Mon, 19 Sep 2016 21:43:55 +0000 (14:43 -0700)]
ocfs2/dlm: fix race between convert and migration
Commit ac7cf246dfdb ("ocfs2/dlm: fix race between convert and recovery")
checks if lockres master has changed to identify whether new master has
finished recovery or not. This introduces a race: right after the
old master umounts (meaning the master will change), a new convert
request comes in.
In this case, it will reset lockres state to DLM_RECOVERING and then
retry convert, and then fail with lockres->l_action being set to
OCFS2_AST_INVALID, which will cause inconsistent lock level between
ocfs2 and dlm, and then finally BUG.
Since dlm recovery will clear lock->convert_pending in
dlm_move_lockres_to_recovery_list, we can use it to correctly identify
the race case between convert and recovery. So fix it.
Fixes: ac7cf246dfdb ("ocfs2/dlm: fix race between convert and recovery") Link: http://lkml.kernel.org/r/57CE1569.8010704@huawei.com Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Jun Piao <piaojun@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit e6f0c6e6170fec175fe676495f29029aecdf486c)
jiangyiwen [Fri, 25 Mar 2016 21:21:35 +0000 (14:21 -0700)]
ocfs2: solve a problem of crossing the boundary in updating backups
In update_backups() there exists a problem of crossing the boundary as
follows:
we assume that the LUN will be resized to 1TB (cluster size 32KB), so it will
include clusters 0~33554431. In update_backups() it will back up the super
block at the 1TB location, which is the 33554432th cluster, so the
boundary is crossed.
Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Xue jiufei <xuejiufei@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 584dca3440732afa84fbca07567bb66e1453936a)
jiangyiwen [Tue, 15 Mar 2016 21:53:01 +0000 (14:53 -0700)]
ocfs2: use spinlock_irqsave() to downconvert lock in ocfs2_osb_dump()
Commit a75e9ccabd92 ("ocfs2: use spinlock irqsave for downconvert lock")
missed an unmodified place in ocfs2_osb_dump(), so a deadlock scenario
still exists.
Joseph Qi [Thu, 14 Jan 2016 23:17:44 +0000 (15:17 -0800)]
ocfs2: access orphan dinode before delete entry in ocfs2_orphan_del
In ocfs2_orphan_del, currently it finds and deletes entry first, and
then access orphan dir dinode. This will have a problem once
ocfs2_journal_access_di fails. In this case, entry will be removed from
orphan dir, but in deed the inode hasn't been deleted successfully. In
other words, the file is missing but not actually deleted. So we should
access orphan dinode first like unlink and rename.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Reviewed-by: Jiufei Xue <xuejiufei@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 074a6c655f6da12cb1123c8a84bfd8d781138800)
xuejiufei [Thu, 14 Jan 2016 23:17:41 +0000 (15:17 -0800)]
ocfs2/dlm: do not insert a new mle when another process is already migrating
When two processes are migrating the same lockres,
dlm_add_migration_mle() returns -EEXIST but inserts a new mle into the
hash list. dlm_migrate_lockres() will detach the old mle and free the
new one, which is already in the hash list; that will destroy the list.
Signed-off-by: Jiufei Xue <xuejiufei@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 32e493265b2be96404aaa478fb2913be29b06887)
jiangyiwen [Thu, 14 Jan 2016 23:17:33 +0000 (15:17 -0800)]
ocfs2: fix slot overwritten if storage link down during mount
The following case will lead to a slot being overwritten.
N1: mount ocfs2 volume; find and allocate slot 0, set osb->slot_num to 0,
    and begin to write slot info to disk
N2: mount ocfs2 volume; wait for the super lock
N1: write block fails because of storage link down; unlock the super lock
N2: get the super lock, also allocate slot 0, then unlock the super lock
N1: mount fails and then dismounts; since osb->slot_num is 0, try to put
    the invalid slot to disk (and this will succeed if the storage link
    restores)
N2's slot info is now overwritten.
Once another node say N3 mount, it will find and allocate slot 0 again,
which will lead to mount hung because journal has already been locked by
N2. so when write slot info failed, invalidate slot in advance to avoid
overwrite slot.
[akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 1247017f43a93eae3d64b7c25f3637dc545f5a47)
Xue jiufei [Thu, 14 Jan 2016 23:17:29 +0000 (15:17 -0800)]
ocfs2/dlm: return appropriate value when dlm_grab() returns NULL
dlm_grab() may return NULL when the node is doing unmount. While doing
code review, we found that some dlm handlers may return an error to the
caller when dlm_grab() returns NULL, making the caller BUG or causing
other problems.
Here is an example:
Node 1: receives migration message from node 3, and sends migrate request
        to others
Node 2: starts unmounting
Node 2: receives migrate request from node 1 and calls
        dlm_migrate_request_handler()
Node 2: unmount thread unregisters domain handlers and removes dlm_context
        from dlm_domains
Node 2: dlm_migrate_request_handler() returns -EINVAL to node 1
Node 1: exits migration, neither clearing the migration state nor sending
        an assert master message to node 3, which causes node 3 to hang
Signed-off-by: Jiufei Xue <xuejiufei@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit c372f2193a2e73d5936bf37259ae63ca388b4cbc)
jiangyiwen [Thu, 14 Jan 2016 23:17:23 +0000 (15:17 -0800)]
ocfs2/dlm: wait until DLM_LOCK_RES_SETREF_INPROG is cleared in dlm_deref_lockres_worker
Commit f3f854648de6 ("ocfs2_dlm: Ensure correct ordering of set/clear
refmap bit on lockres") still leaves a race, so the ordering is not
guaranteed to be exactly correct.
Node1: umount, migrate lockres to Node2
Node2: migrate finished, send migrate request to Node3
Node3: received migrate request, create a migration_mle, respond to Node2
Node2: set DLM_LOCK_RES_SETREF_INPROG and send assert master to Node3
Node3: delete migration_mle in assert_master_handler; Node3 umounts without
       response; dlm_thread purges this lockres and sends a drop deref
       message to Node2
Node2: finds the flag DLM_LOCK_RES_SETREF_INPROG set and dispatches
       dlm_deref_lockres_worker to clear the refmap; but
       dlm_deref_lockres_worker waits for DLM_LOCK_RES_SETREF_INPROG to be
       cleared only if the node is in the refmap, so the worker completes
       successfully
Node3: purge lockres, send assert master response to Node1, and finish
       umount
Node2: set Node3 in the refmap; it will never be cleared, which leads to
       the umount hanging
so wait until DLM_LOCK_RES_SETREF_INPROG is cleared in
dlm_deref_lockres_worker.
Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit b5560143385e18b4109ad6951c7719705e3dd995)
Xue jiufei [Thu, 14 Jan 2016 23:17:18 +0000 (15:17 -0800)]
ocfs2/dlm: fix a race between purge and migration
We found a race between purge and migration while doing code review.
Node A puts a lockres on the purge list before receiving the migrate
message from node B, which is the master. Node A calls
dlm_mig_lockres_handler to handle this message.
dlm_mig_lockres_handler
dlm_lookup_lockres
>>>>>> race window, dlm_run_purge_list may run and send
deref message to master, waiting the response
spin_lock(&res->spinlock);
res->state |= DLM_LOCK_RES_MIGRATING;
spin_unlock(&res->spinlock);
dlm_mig_lockres_handler returns
>>>>>> dlm_thread receives the response from master for the deref
message and triggers the BUG because the lockres has the state
DLM_LOCK_RES_MIGRATING with the following message:
xuejiufei [Tue, 29 Dec 2015 22:54:29 +0000 (14:54 -0800)]
ocfs2/dlm: clear migration_pending when migration target goes down
We have found a BUG on res->migration_pending when migrating lock
resources. The situation is as follows.
dlm_mark_lockres_migrating
    res->migration_pending = 1;
    __dlm_lockres_reserve_ast
    dlm_lockres_release_ast returns with res->migration_pending still set,
        because other threads reserve asts
    wait until dlm_migration_can_proceed returns 1
    >>>>>>> o2hb finds that the target goes down and removes the target
            from domain_map
    dlm_migration_can_proceed returns 1
    dlm_mark_lockres_migrating returns -ESHUTDOWN with
        res->migration_pending still set.
When reentering dlm_mark_lockres_migrating(), it will trigger the BUG_ON
with res->migration_pending. So clear migration_pending when the target
is down.
Signed-off-by: Jiufei Xue <xuejiufei@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit cc28d6d80f6ab494b10f0e2ec949eacd610f66e3)
Joseph Qi [Tue, 29 Dec 2015 22:54:06 +0000 (14:54 -0800)]
ocfs2: fix BUG when calculate new backup super
When resizing, it firstly extends the last gd. Once it should back up the
super in the gd, it calculates the new backup super and updates the
corresponding value.
But it currently doesn't consider the situation where the backup super has
already been done. In this case, it still sets the bit in the gd bitmap
and then decreases bg_free_bits_count, which leads to a corrupted gd and
triggers the BUG in ocfs2_block_group_set_bits:
alex chen [Fri, 6 Nov 2015 02:44:10 +0000 (18:44 -0800)]
ocfs2: should reclaim the inode if '__ocfs2_mknod_locked' returns an error
In ocfs2_mknod_locked, if '__ocfs2_mknod_locked' returns an error, we
should reclaim the inode successfully claimed above; otherwise, the
inode will never be reused. The case is described below:
ocfs2_mknod
ocfs2_mknod_locked
ocfs2_claim_new_inode
Successfully claim the inode
__ocfs2_mknod_locked
ocfs2_journal_access_di
Failed because of -ENOMEM or other reasons, the inode
lockres has not been initialized yet.
iput(inode)
ocfs2_evict_inode
ocfs2_delete_inode
ocfs2_inode_lock
ocfs2_inode_lock_full_nested
__ocfs2_cluster_lock
Return -EINVAL because of the inode
lockres has not been initialized.
So the following operations are not performed
ocfs2_wipe_inode
ocfs2_remove_inode
ocfs2_free_dinode
ocfs2_free_suballoc_bits
Signed-off-by: Alex Chen <alex.chen@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit b1529a41f777a48f95d4af29668b70ffe3360e1b)
Joseph Qi [Fri, 6 Nov 2015 02:44:07 +0000 (18:44 -0800)]
ocfs2: fix race between mount and delete node/cluster
There is a race case between mount and delete node/cluster, which will
lead o2hb_thread into a malfunctioning dead loop.
o2hb_thread
{
o2nm_depend_this_node();
<<<<<< race window: the node may have already been deleted; if the
loop is then entered, the o2hb thread will malfunction
because no configured nodes are found.
while (!kthread_should_stop() &&
!reg->hr_unclean_stop && !reg->hr_aborted_start) {
}
So checking the return value of o2nm_depend_this_node() is needed. If the
node has been deleted, do not enter the loop and let the mount fail.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 0986fe9b50f425ec81f25a1a85aaf3574b31d801)
Joseph Qi [Thu, 22 Oct 2015 20:32:29 +0000 (13:32 -0700)]
ocfs2/dlm: unlock lockres spinlock before dlm_lockres_put
dlm_lockres_put will call dlm_lockres_release if it drops the last
reference, and that in turn may call dlm_print_one_lock_resource and
take the lockres spinlock.
So unlock the lockres spinlock before dlm_lockres_put to avoid deadlock.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit b67de018b37a97548645a879c627d4188518e907)
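A userspace analogue of that ordering fix (a minimal sketch with invented names, not the dlm code): the release path re-takes the object's own lock, so the caller must drop that lock before the final put.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct res {
	pthread_mutex_t lock;
	int refs;
};

/* Stand-in for dlm_lockres_release: the release path takes the
 * object's lock again (as dlm_print_one_lock_resource does). */
static void res_release(struct res *r)
{
	pthread_mutex_lock(&r->lock);
	printf("releasing\n");
	pthread_mutex_unlock(&r->lock);
	free(r);
}

/* Stand-in for dlm_lockres_put. */
static void res_put(struct res *r)
{
	if (--r->refs == 0)
		res_release(r);
}

int main(void)
{
	struct res *r = malloc(sizeof(*r));

	pthread_mutex_init(&r->lock, NULL);
	r->refs = 1;

	pthread_mutex_lock(&r->lock);
	/* ... inspect or modify r ... */
	pthread_mutex_unlock(&r->lock);	/* unlock BEFORE the final put */
	res_put(r);			/* would self-deadlock if the lock
					 * were still held here */
	return 0;
}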
ocfs2: avoid access invalid address when read o2dlm debug messages
The following case can lead to a lockres being freed while it is still in use.
cat /sys/kernel/debug/o2dlm/locking_state        dlm_thread
lockres_seq_start
  -> lock dlm->track_lock
  -> get resA
                                                 resA->refs decrease to 0,
                                                 call dlm_lockres_release,
                                                 and wait for "cat" unlock.
Although resA->refs is already set to 0,
increase resA->refs, and then unlock
                                                 lock dlm->track_lock
                                                 -> list_del_init()
                                                 -> unlock
                                                 -> free resA
In such a race, an invalid address access may occur. So we should
delete res from the tracking list before resA->refs drops to 0.
Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit f57a22ddecd6f26040a67e2c12880f98f88b6e00)
When running dirop_fileop_racer we found a case where an inode
cannot be removed.
Two nodes, say Node A and Node B, mount the same ocfs2 volume. Create
two dirs /race/1/ and /race/2/ in the filesystem.
Node A                                  Node B
                                        rm -r /race/2/
mv /race/1/ /race/2/
                                        call ocfs2_unlink(), get
                                        the EX mode of /race/2/
wait for B unlock /race/2/
                                        decrease i_nlink of /race/2/ to 0,
                                        and add inode of /race/2/ into
                                        orphan dir, unlock /race/2/
got EX mode of /race/2/. because
/race/1/ is dir, so inc i_nlink
of /race/2/ and update into disk,
unlock /race/2/
Because i_nlink of /race/2/ is not zero, this inode will always remain
in the orphan dir.
This patch fixes the case by testing whether i_nlink of the new dir is zero.
Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Xue jiufei <xuejiufei@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 928dda1f9433f024ac48c3d97ae683bf83dd0e42)
The trusted extended attributes are only visible to processes which
have the CAP_SYS_ADMIN capability, but the check is missing in the ocfs2
xattr_handler trusted list. The check is important because trusted
xattrs will be used to implement mechanisms in userspace to which
ordinary processes should not have access.
Signed-off-by: Sanidhya Kashyap <sanidhya.gatech@gmail.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Taesoo kim <taesoo@gatech.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 0f5e7b41f91814447defc34e915fc5d6e52266d9)
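A userspace analogue of the missing check (a minimal sketch: the privileged flag stands in for the kernel's capable(CAP_SYS_ADMIN) test, and all names are invented):

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

static bool privileged;	/* stand-in for capable(CAP_SYS_ADMIN) */

/* List a trusted.* name only to a privileged caller. */
static bool xattr_visible(const char *name)
{
	if (strncmp(name, "trusted.", strlen("trusted.")) == 0)
		return privileged;
	return true;
}

int main(void)
{
	const char *names[] =
		{ "user.comment", "trusted.secret", "security.selinux" };

	for (unsigned int i = 0; i < sizeof(names) / sizeof(names[0]); i++)
		if (xattr_visible(names[i]))
			printf("%s\n", names[i]);  /* trusted.secret hidden */
	return 0;
}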
Xue jiufei [Wed, 24 Jun 2015 23:55:20 +0000 (16:55 -0700)]
ocfs2: fix NULL pointer dereference in function ocfs2_abort_trigger()
ocfs2_abort_trigger() uses bh->b_assoc_map to get the sb. But there is
no code in ocfs2 that sets bh->b_assoc_map, so calling this function
triggers a NULL pointer dereference. We can get the sb from
bh->b_bdev->bd_super instead of b_assoc_map.
[akpm@linux-foundation.org: update comment, per Joseph] Signed-off-by: joyce.xue <xuejiufei@huawei.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 74e364ad1b13fd518a0bd4e5aec56d5e8706152f)
Roman Kagan [Wed, 18 May 2016 14:48:20 +0000 (17:48 +0300)]
kvm:vmx: more complete state update on APICv on/off
The function to update APICv on/off state (in particular, to deactivate
it when enabling Hyper-V SynIC) is incomplete: it doesn't adjust
APICv-related fields among secondary processor-based VM-execution
controls. As a result, Windows 2012 guests got stuck when a SynIC-based
auto-EOI interrupt intersected with e.g. an IPI in the guest.
In addition, the MSR intercept bitmap isn't updated every time "virtualize
x2APIC mode" is toggled. This path can only be triggered by a malicious
guest, because Windows doesn't use x2APIC but rather its own synthetic
APIC access MSRs; however a guest running in a SynIC-enabled VM could
switch to x2APIC and thus obtain direct access to host APIC MSRs
(CVE-2016-4440).
The patch fixes those omissions.
Signed-off-by: Roman Kagan <rkagan@virtuozzo.com> Reported-by: Steve Rutherford <srutherford@google.com> Reported-by: Yang Zhang <yang.zhang.wz@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Orabug: 23347009
CVE: CVE-2016-4440 Signed-off-by: Manjunath Govindashetty <manjunath.govindashetty@oracle.com>
Nick Alcock [Fri, 16 Sep 2016 09:17:18 +0000 (10:17 +0100)]
dtrace: eliminate need for arg counting in sdt macros
The existing SDT probe macros are of the form
DTRACE_PROBEn(probe, ...)
where there must be as many args in ... as the 'n'.
This is implemented via terribly repetitive code in include/linux/sdt.h,
made much worse by the recent typed sdt additions. Reduce
repetitiveness, and drop the 'n' variants of the macros.
All the complexity is hidden away in include/linux/sdt_internal.h.
It's derived from the tricks the perf probes were already using, but
improved to handle the zero-argument case and armoured against namespace
pollution with copious __ing. Even when DTrace is turned off, we use a
do-nothing form of the same macro to verify that the number of arguments
in the probe is valid.
There is only one public macro now, DTRACE_PROBE(), and the usual thin
wrappers around it, which have also gone non-numbered (DTRACE_PROC,
DTRACE_IO, etc): all uses have been adjusted.
These macros use a family of semi-public macros, described below.
__DTRACE_DOUBLE_APPLY(type_macro, arg_macro, ...): This takes a type and
arg macro and a set of pairs of arguments and calls the first argument
on each pair with the type macro, the second with the arg macro, etc:
if called with (type, arg, a, b, c, d) it will expand to something like
type(a), arg(b), type(c), arg(d)
__DTRACE_DOUBLE_APPLY_NOCOMMA(type_macro, arg_macro, ...): As above, but
omits all commas between arguments: the example above will yield
type(a) arg(b) type(c) arg(d)
(suitable for string concatenation).
__DTRACE_TYPE_APPLY(type_macro, arg_macro, ...): as above, but only
calls the type_macro:
type(a), type(c)
__DTRACE_ARG_APPLY(type_macro, arg_macro, ...): as above, but only
calls the arg_macro:
arg(b), arg(d)
__DTRACE_TYPE_APPLY_DEFAULT(type_macro, arg_macro, def, ...): as above,
but takes an extra parameter to be used as a default when there are no
args. Calling it with (type, arg, void, a, b, c, d) will yield
type(a), arg(b), type(c), arg(d)
but calling it with (type, arg, void) will yield
void
All the macros above take any number of pairs of arguments up to eight:
providing any other number, including odd numbers, is a compilation
error.
__DTRACE_APPLY(macro, ...): This is pre-existing, and calls the macro
repeatedly with the specified args, comma-separated:
m(a), m(b), m(c), ...
__DTRACE_APPLY_DEFAULT(macro, def, ...): Like DTRACE_TYPE_APPLY_DEFAULT,
takes a default for use when there are no args.
Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Acked-by: Kris Van Hees <kris.van.hees@oracle.com>
Orabug: 24678897
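A standalone sketch of the double-apply trick (names invented, supporting only up to two (type, arg) pairs where the kernel extends the same pattern to eight; it relies on the GNU ", ##__VA_ARGS__" extension the kernel already uses):

#include <stdio.h>

/* Count the variadic args: 0, 2 or 4 here. Odd counts fail to expand,
 * mirroring the "compilation error" behaviour described above. */
#define DA_COUNT(...)	DA_COUNT_(x, ##__VA_ARGS__, 4, 3, 2, 1, 0)
#define DA_COUNT_(_0, _1, _2, _3, _4, n, ...) n

/* Dispatch on the count; the extra level lets DA_COUNT expand first. */
#define DA_APPLY(t, a, ...)	DA_APPLY_(DA_COUNT(__VA_ARGS__), t, a, ##__VA_ARGS__)
#define DA_APPLY_(n, t, a, ...)	DA_APPLY__(n, t, a, ##__VA_ARGS__)
#define DA_APPLY__(n, t, a, ...) DA_APPLY_##n(t, a, ##__VA_ARGS__)

#define DA_APPLY_0(t, a)
#define DA_APPLY_2(t, a, t1, a1)	 t(t1), a(a1)
#define DA_APPLY_4(t, a, t1, a1, t2, a2) t(t1), a(a1), t(t2), a(a2)

#define AS_STRING(x) #x

int main(void)
{
	/* Expands to "int", "foo", "long", "bar"; a real user would pass
	 * different type and arg macros. */
	const char *v[] = { DA_APPLY(AS_STRING, AS_STRING, int, foo, long, bar) };

	for (unsigned int i = 0; i < sizeof(v) / sizeof(v[0]); i++)
		printf("%s\n", v[i]);
	return 0;
}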
Nick Alcock [Wed, 14 Sep 2016 09:39:33 +0000 (10:39 +0100)]
dtrace: augment SDT probes with type information
With all the other work done, we can finally move the SDT probe
type information out of the static sdt_args array in the SDT kernel
module and put them in the probe definitions they relate to.
Most of it is already there -- all probes already carry type
information. All that is missing is translations, so we add those.
The syntax for these is
type: translation
or
type : (translation, ...)
i.e. a source type, followed by a colon, followed by zero or more
translated types (if there is more than one, they must be bracketed),
with optional spaces if you like. The resulting probe will have as many
arguments as there are translations. Obviously, if DTrace userspace is
to do anything with this knowledge it will need a translator from the
specified source type to each of the specified translated types: DTrace
picks these up from kernel-versioned subdirectories of
/usr/lib64/dtrace/ (e.g. /usr/lib64/dtrace/4.1).
If a probe appears more than once (say, in different functions), each
instance must presently have the same types. Changing translations
between instances of a single probe will be diagnosed if CONFIG_DT_DEBUG
is enabled.
Typos in the types are only diagnosed at runtime: a dtrace -vln of your
new probe will tell you if all is well.
Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Acked-by: Kris Van Hees <kris.van.hees@oracle.com>
Orabug: 24661801
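A hypothetical probe declaration in this syntax (illustrative only: the probe name, its argument, and the psinfo_t translation are assumptions, not taken from the tree):

DTRACE_PROBE(mymod__exit, struct task_struct * : psinfo_t *, tsk);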
Nick Alcock [Wed, 14 Sep 2016 01:16:46 +0000 (02:16 +0100)]
dtrace: import the sdt type information into per-sdt_probedesc state
Now that we have type and probe names in ELF sections, we have to do
something with them (they are initdata, so in the kernel, if not in
modules, they will be thrown away after initialization).
We marshal each arg string in turn, without parsing it, into a new field
in the sdt_probedesc_t, the per-probe structure used to communicate
information about SDT probes to the SDT kernel module. Before now, this
has been a simple task at runtime: the array is constructed by
dtrace_sdt.sh and just needs to be reformulated (for the core kernel) or
dropped unchanged into place (for modules). But this extra field is
derived from C macro expansion and cannot be determined by
dtrace_sdt.sh: so it leaves it 0 and we fill it in at boot / module load
time.
The fillout process is a little baroque to avoid slowing down the boot
too much, since we have to associate one array with another and want
to avoid linear scans. We build a hashtable mapping from probe name
to args string (i.e. from _dtrace_sdt_names to _dtrace_sdt_args entry),
in the process verifying that if the probe appears multiple times,
all its arg strings are the same. We can then easily use this hash
table to point the extra argument types field in the sdt_probedesc_t
array at the args string too.
One inefficiency remains: there can be *lots* of duplicate args strings,
and currently we make no attempt to deduplicate them. (We discard the
core kernel's array of probe names, but the array of arg strings
obviously must be preserved. We could deduplicate it, but do not.)
This is probably wasting no more than a few kilobytes at present, so
a deduplicator is not worth writing, but as the number of probes
(particularly perf probes) increases, one may become worth writing.
The new sdpd_args field into which the type info string is dropped is
protected from the ABI checker: the sdt_probedesc_t is internal to the
kernel and the DTrace module, and the usual ABI guarantees do not apply
to it.
Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Acked-by: Kris Van Hees <kris.van.hees@oracle.com>
Orabug: 24661801
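A userspace sketch of that fillout (a toy hash with invented names, not the module code): intern each (name, args) pair, diagnosing probes whose duplicates disagree.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct ent {
	const char *name;
	const char *args;
	struct ent *next;
};

#define NBUCKETS 64
static struct ent *buckets[NBUCKETS];

static unsigned int hash_name(const char *s)
{
	unsigned int h = 5381;

	while (*s)
		h = h * 33 + (unsigned char)*s++;
	return h % NBUCKETS;
}

/* Return the canonical args string for a probe name; warn when a
 * duplicate probe carries a different args string. */
static const char *intern_args(const char *name, const char *args)
{
	unsigned int b = hash_name(name);
	struct ent *e;

	for (e = buckets[b]; e; e = e->next)
		if (strcmp(e->name, name) == 0) {
			if (strcmp(e->args, args) != 0)
				fprintf(stderr, "%s: conflicting args\n", name);
			return e->args;
		}

	e = malloc(sizeof(*e));
	e->name = name;
	e->args = args;
	e->next = buckets[b];
	buckets[b] = e;
	return args;
}

int main(void)
{
	intern_args("probe_a", "int, char *");
	intern_args("probe_a", "int, char *");	/* consistent: fine */
	intern_args("probe_a", "long");		/* mismatch: warns */
	return 0;
}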
Nick Alcock [Wed, 14 Sep 2016 01:10:06 +0000 (02:10 +0100)]
dtrace: record SDT and perf probe types in a new ELF section
Before now, the types specified in the SDT probe macros in
include/linux/sdt.h were simply thrown away: DTrace was never
informed of them, and instead got type information for SDT probes
from a hardwired list in the out-of-tree sdt module.
As a start to fixing this, export the SDT type information into
a new ELF section named _dtrace_sdt_args. Both native SDT probes
and perf probes translated into SDT probes use the same section:
its format is a sequence of probe type strings, with individual
arguments separated by commas, each type string separated from the
next by a NUL:
type string, comma-separated\0
...
(This is the same format used by perf probe prototype declarations,
and is what you'd get if you removed the argument names from a DTrace
SDT probe specification.)
The only difference between SDT and perf probe type strings is that the
type string for perf probes is also used to define a tracing function,
so it includes argument names. The SDT format has various other
extensions which do not affect this commit so will be described later.
The type strings in the section are in no particular order (it depends on the
order in the source code) and may contain duplicates, so to associate
each string with a probe, there is another section, _dtrace_sdt_names,
which is a simple null-terminated array (string table) of names, associated
1:1 with the type strings in the _dtrace_sdt_args section.
At this point in the patch series, nothing uses the newly-added sections.
Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Acked-by: Kris Van Hees <kris.van.hees@oracle.com>
Orabug: 24661801
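A userspace GCC analogue of that layout (the recording macro is invented; only the section names and the NUL-separated, 1:1 format come from the description above):

#include <stdio.h>
#include <string.h>

#define __SDT_CAT(a, b)	a##b
#define __SDT_ID(p, c)	__SDT_CAT(p, c)
#define SDT_RECORD(name, types)						\
	static const char __SDT_ID(__sdt_args_, __COUNTER__)[]		\
		__attribute__((used, section("_dtrace_sdt_args"))) = types; \
	static const char __SDT_ID(__sdt_name_, __COUNTER__)[]		\
		__attribute__((used, section("_dtrace_sdt_names"))) = name;

SDT_RECORD("probe_one", "int, struct foo *")
SDT_RECORD("probe_two", "char *")

/* The linker synthesizes start/stop symbols for these orphan sections. */
extern const char __start__dtrace_sdt_names[], __stop__dtrace_sdt_names[];
extern const char __start__dtrace_sdt_args[];

int main(void)
{
	const char *n = __start__dtrace_sdt_names;
	const char *a = __start__dtrace_sdt_args;

	while (n < __stop__dtrace_sdt_names) {
		printf("%s: %s\n", n, a);	/* names pair 1:1 with types */
		n += strlen(n) + 1;		/* strings are NUL-separated */
		a += strlen(a) + 1;
	}
	return 0;
}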
dtrace: ensure new SDT info generation works on sparc64
Due to addresses on sparc64 being in the lower range of the memory map, the
calculated addresses for SDT probe points were represented with fewer hex
characters than their function base address counterparts (obtained from
objdump output). This messed up the sorting based on address values, and
resulted in all probe points being associated with the last function in the
list.
This commit ensures that all addresses are 16 hex characters long.
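An illustration of the sorting pitfall (values invented): unpadded hex strings do not sort like numbers, while 16-digit zero padding makes textual and numeric order agree.

#include <stdio.h>

int main(void)
{
	unsigned long probe = 0xffb0;	/* low sparc64-style address */
	unsigned long func  = 0x100000;

	/* As text, "ffb0" sorts after "100000" although it is smaller. */
	printf("unpadded: %lx vs %lx\n", probe, func);
	printf("padded:   %016lx vs %016lx\n", probe, func);
	return 0;
}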
Kris Van Hees [Tue, 16 Aug 2016 15:52:07 +0000 (11:52 -0400)]
dtrace: rework kernel sdtinfo generation to be more accurate
With changes in the vmlinux.lds.S linker script for x86 in the 4.3 kernel
series, the existing mechanism for determining the addresses of SDT probe
points in the kernel proper was failing. The assumption that various
flavours of text sections were being concatenated into a single .text
section no longer holds.
The new implementation uses a relocation-retaining linking step in the
vmlinux linking process, which enables us to gain access to the final
addresses of symbols in the .text section, and also makes it very easy
to determine the addresses of SDT probes (.text base address + relocation
offset). We re-use the .tmp_vmlinux1 output file for the output of this
linking step because it is performed after that output file is no longer
needed in the kernel linking process.
Due to the relocation-retaining linking that occurs, the assertion test
for kernel image size had to be modified, because the _end symbol value
ends up being a relative offset rather than an absolute address.
Nick Alcock [Thu, 14 Jul 2016 17:05:24 +0000 (18:05 +0100)]
ctf: fix CONFIG_CTF && !CONFIG_DTRACE and CONFIG_DT_DISABLE_CTF
These combinations of config options, one of which should enable
CTF without enabling DTrace, the other one of which should enable
DTrace without enabling CTF, are both broken.
Fix them by making the CTF <-> DT_DISABLE_CTF relationship bidirectional,
by making dwarf2ctf compilation depend properly on CONFIG_CTF rather than
CONFIG_DTRACE, and by fixing the newly-introduced order-only dependency
of CTF on the sdtinfo stubs files so that it is only enforced when
both CTF and DTrace are enabled.
Orabug: 23859082 Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-By: Todd Vierling <todd.vierling@oracle.com>
At LSF we decided that if we truncate up from isize we shouldn't trim
fallocated blocks that were fallocated with KEEP_SIZE and are past the
new i_size. This patch fixes ext4 to do this.
[ Completely reworked patch so that i_disksize would actually get set
when truncating up. Also reworked the code for handling truncate so
that it's easier to handle. -- tytso ]
Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Lukas Czerner <lczerner@redhat.com>
(cherry picked from commit 3da40c7b089810ac9cf2bb1e59633f619f3a7312) Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Conflicts:
fs/ext4/inode.c
Kris Van Hees [Sat, 21 May 2016 13:43:55 +0000 (06:43 -0700)]
dtrace: ensure pdata is large enough
With the sudden (large) increase of SDT probe points, the previously
statically defined default size for the kernel pdata block was not
sufficient. The code has been modified to calculate the required size
for the pdata block at runtime.
This primarily affects sparc where the pdata also covers the memory
block used for SDT trampolines. It is now dependent on the actual
number of SDT probes in the kernel.
This does not affect modules because they were already allocating the
needed memory block at load time based on the actual number of probes.
Orabug: 23004534 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Acked-by: Nick Alcock <nick.alcock@oracle.com>
For every trace point defined, also define an SDT probe. This allows
DTrace to expose probe points maintained by upstream.
[nca: fixed TODOs: added DTRACE_UINTPTR_CAST_EACH so that tracepoints
that pass structures by value will still compile: we jam the
structure into a uintptr if it will fit, otherwise passing its
address in.
This is only partial, so far: there is no CTF type info for any
of these new probes, but this is a relatively minor issue that can
be fixed later.]
Orabug: 23004534 Signed-off-by: Timothy J Fontaine <tj.fontaine@oracle.com> Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Acked-by: Kris Van Hees <kris.van.hees@oracle.com>
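A sketch of that cast in function form (the name is invented, and the kernel does this in a macro): small structs travel by value in a uintptr_t, larger ones by address.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

static uintptr_t pack_or_ref(const void *obj, size_t size)
{
	uintptr_t v = 0;

	if (size <= sizeof(v)) {
		memcpy(&v, obj, size);	/* fits: jam the bytes in */
		return v;
	}
	return (uintptr_t)obj;		/* too big: pass the address */
}

struct small { int x; };
struct big { long a, b, c; };

int main(void)
{
	struct small s = { 42 };
	struct big b = { 1, 2, 3 };

	printf("small -> %#lx\n", (unsigned long)pack_or_ref(&s, sizeof(s)));
	printf("big   -> %#lx\n", (unsigned long)pack_or_ref(&b, sizeof(b)));
	return 0;
}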
Kris Van Hees [Sat, 18 Jun 2016 14:19:03 +0000 (10:19 -0400)]
dtrace: add support for probes in sections other than .text
This commit adds support for SDT probes in sections other than the
regular .text section, both for the kernel and kernel modules.
The core of the problem is that probe locations (for modules and
for the kernel proper) were stored as an offset relative to the start
of the code section they occur in. Processing them at load time (or
boot time, for the kernel) resulted in incorrect addresses for any
probes in sections other than .text because the offset was getting
applied to the wrong base address.
I.e. a probe at offset 0x1d0 in section .text.unlikely would be
resolved as
BASE(.text) + 0x1d0
rather than
BASE(.text.unlikely) + 0x1d0
The end result was that (using x86_64 as example) we would be writing
a 5-byte NOP sequence in a location that was really not expecting that,
with very odd and entirely unpredictable results.
Solving this required two distinct mechanisms because modules are
linked in a way where the distinct code sections are retained. Only
at module load time are they loaded consecutively into a single block
of memory that starts at module_core; worse yet, probes can be
duplicated by the compiler into multiple such sections. This means
that a simple offset will not do; the only thing we can rely on is the
function name (since symbol names are guaranteed unique, and the
compiler synthesizes new ones not valid in C when it clones and
specializes functions), and its start address.
The question is how to identify the functions. We can't use the symbol
table index, because the symbol table is rewritten by strip(1) as part
of RPM generation: and we can't directly refer to the names given to
cloned functions because they are not valid in C: even if valid, they
might well be local to the translation unit in any case. So we run the
.o file through a new tool, scripts/kmodsdt, which identifies all
functions containing DTrace probes by hunting for relocations to the
__dtrace_probe_* probe calls and, if the containing functions are local,
introduces an alias to each named __dta_<function>_<symindex> (the
function name and index are necessary because these functions are local,
so there could be several of them with the same name in different
translation units: the symindex makes the aliases unique). The function
name is rewritten to ensure it is valid in C by translating . into _.
The address in the sdtinfo then references each function symbol (or
alias) and simply adds the offset from that function symbol to the probe
to it in perfectly normal C code, and the linker then deals with all the
problems involving keeping the addresses valid across stripping, module
loading, etc.
When the module is loaded, the kernel resolves all relocations. This
means that the sdtinfo data that is compiled into the module will have
the exact address of all probe locations by the time we need it.
The kernel is a different story altogether...
The kernel performs a final link into an executable object, merging
all non-init code sections into a single .text. This means that for
the kernel, we need to calculate the absolute address of each probe
location based on the knowledge that section merging occurs.
We make use of the multi-step linking process that the kernel uses to
ensure stability of the kallsyms data. Three steps take place in the
sdtinfo generation process (though they will appear as a single step,
because they are a single pipeline):
- a list of code sections is extracted from vmlinux.o, and for each
section we take note of the function with the lowest offset in
the section
- temporary kernel image .tmp_vmlinux1 is used to determine the
absolute load address of each merged section by searching for the
first function of each respective section, and subtracting its
offset from its address.
- using the list of absolute base addresses for each section, the
list of probes (associated with the function they occur in, and
the section the function belongs in) gets its absolute addresses
for the probe locations by adding the probe offset (relative to
its encapsulating section) to the section base address.
(Note: if two distinct sections in the kernel happen to have functions
at the lowest offset with identical names, this process cannot
distinguish between those two sections. That is a limitation
of the heuristic but is extremely unlikely. If it were to
occur, a fix would require to use a more complex comparison to
determine whether a function in the final kernel image is the
start of one section or another.)
Orabug: 23344927 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Acked-by: Nick Alcock <nick.alcock@oracle.com>
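The address arithmetic behind those three steps, as a sketch with invented values:

#include <stdio.h>

/* Merged-section base = first function's final address minus its
 * offset within the original section. */
static unsigned long section_base(unsigned long first_fn_addr,
				  unsigned long first_fn_off)
{
	return first_fn_addr - first_fn_off;
}

int main(void)
{
	/* e.g. first function of .text.unlikely: offset 0x40 in
	 * vmlinux.o, address 0xffffffff81e00040 in .tmp_vmlinux1 */
	unsigned long base = section_base(0xffffffff81e00040UL, 0x40UL);
	unsigned long probe_off = 0x1d0;  /* probe offset in that section */

	printf("probe address: %#lx\n", base + probe_off);
	return 0;
}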
Nick Alcock [Fri, 1 Jul 2016 14:38:01 +0000 (15:38 +0100)]
dtrace, ctf: build sdtstubs and CTF after sdtinfo; sdtinfo follows modpost
In the following commit, sdtinfo generation will begin to
rewrite the .o file. When this is happening, we cannot
have sdtstubs generation, dwarf2ctf, or modpost run in parallel
with it, since they all read the .o file (in modpost's case, for
symbol versions; in sdtstubs' case, for the list of symbols;
in dwarf2ctf's case, for the debugging info). So have sdtstubs
generation and dwarf2ctf depend on the sdtinfo.c files that the sdtinfo pass
generates. We force sdtinfo generation to run after modpost
by depending on the .mod.c files modpost generates.
For the CTF side of things, we use an order-only prerequisite
because sdtinfo's changes do not change anything which might
require dwarf2ctf to rerun, and so as not to disturb $^, which
we use to build the list of files dwarf2ctf should run over.
Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Acked-by: Kris Van Hees <kris.van.hees@oracle.com>
Inodes for delayed iput allocate a trivial helper structure; let's place
the list hook directly into the inode and save a kmalloc (killing a
__GFP_NOFAIL as a bonus) at the cost of increasing the size of btrfs_inode.
The inode can be put into the delayed_iputs list more than once and we
have to keep the count. This means we can't use list_splice to
process a bunch of inodes, because we'd lose track of the count if the
inode were put onto the delayed iputs list again while being processed.
Signed-off-by: David Sterba <dsterba@suse.com>
(cherry picked from commit 8089fe62c6603860f6796ca80519b92391292f21) Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Conflicts:
fs/btrfs/inode.c
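A sketch of the layout and queueing rule described in the entry above (userspace types, invented names): the hook and count are embedded, and an inode is only linked on its first queueing.

#include <stdio.h>

struct list_node {
	struct list_node *prev, *next;
};

struct my_inode {
	/* ... other inode fields ... */
	struct list_node delayed_iput;	/* hook embedded in the inode */
	int delayed_iput_count;		/* may be queued more than once */
};

static void delayed_iput_add(struct list_node *head, struct my_inode *inode)
{
	/* Link only on the first queueing; later puts just bump the
	 * count, which is why a bulk list_splice grab would lose it. */
	if (inode->delayed_iput_count++ == 0) {
		inode->delayed_iput.next = head->next;
		inode->delayed_iput.prev = head;
		head->next->prev = &inode->delayed_iput;
		head->next = &inode->delayed_iput;
	}
}

int main(void)
{
	struct list_node head = { &head, &head };
	struct my_inode ino = { .delayed_iput_count = 0 };

	delayed_iput_add(&head, &ino);
	delayed_iput_add(&head, &ino);	/* second put: count only */
	printf("count = %d\n", ino.delayed_iput_count);
	return 0;
}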
Yuval Shaia and team discovered that the aforementioned commit
resulted in a substantial (900-to-1) performance loss on a TCP
bandwidth test over IPoIB. Upstream is aware of the problem
but there is currently no solution in hand, so just revert the
offending commit.
Note that the original commit purports to fix a kernel crash,
but so far we haven't seen the crash.
Signed-off-by: Dan Duval <dan.duval@oracle.com> Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com> Signed-off-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com> Tested-by: Rose Wang <rose.wang@oracle.com>
Nick Alcock [Thu, 19 May 2016 20:45:56 +0000 (21:45 +0100)]
dtrace: support SDT in single-file modules
A limitation of long standing in the SDT machinery is that it didn't
support modules that were implemented entirely in one file. This is
because the SDT stubs generation machinery (which emits zero-byte
assembler stubs for all DTrace probe points) was implemented at the
point at which multi-file modules are linked into a single file prior to
modpost. Since single-file modules *have* no such stage, they never got
their stubs, and probes never fired. This becomes essential to fix so
that perf_events probes will work, because some of those are in
single-file modules.
We fix it by tearing out the sdtstub code from its prior home in
Makefile.build, and moving it into the late modpost stage, at the same
point at which sdtinfo files are generated. We use the same prerequisites
trick as is used for CTF files to link the sdtstub files into the
resulting modules; because this all happens at the very last possible
moment, it shares the same code path for single- and multifile modules,
and is a good bit simpler to boot (no need to manipulate the link
process or transform filenames intricately during stub generation).
There is one additional wrinkle: because the dtrace probes are now
always undefined until the last minute, we must adjust modpost to
disregard all undefined symbols that start "__dtrace_probe_".
Orabug: 23316392 Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Acked-by: Kris Van Hees <kris.van.hees@oracle.com>
Backport of 81ad4276b505e987dd8ebbdf63605f92cd172b52 failed to adjust
for intervening ->get_trip_temp() argument type change, thus causing
stack protector to panic.
drivers/thermal/thermal_core.c: In function ‘thermal_zone_device_register’:
drivers/thermal/thermal_core.c:1569:41: warning: passing argument 3 of
‘tz->ops->get_trip_temp’ from incompatible pointer type [-Wincompatible-pointer-types]
if (tz->ops->get_trip_temp(tz, count, &trip_temp))
^
drivers/thermal/thermal_core.c:1569:41: note: expected ‘long unsigned int *’
but argument is of type ‘int *’
CC: <stable@vger.kernel.org> #3.18,#4.1 Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> Signed-off-by: Dan Duval <dan.duval@oracle.com>
(cherry picked from commit 5640c4c37eee293451388cd5ee74dfed3a30f32d)
Tracking idle time in bictcp_cwnd_event() is imprecise, as epoch_start
is normally set at ACK processing time, not at send time.
Doing a proper fix would need to add an additional state variable,
and does not seem worth the trouble, given the CUBIC bug has been there
forever before Jana noticed it.
Let's simply not set epoch_start in the future, otherwise
bictcp_update() could overflow and CUBIC would again
grow cwnd too fast.
This was detected thanks to a packetdrill test Neal wrote that was flaky
before applying this fix.
Fixes: 30927520dbae ("tcp_cubic: better follow cubic curve after idle period") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Cc: Jana Iyengar <jri@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit af05df01c7a9f4b3cf3dc29365f4c7c4e2cd53ff)
The problem seems to be that if a new device is detected
while we have already removed the shared HCD, then many of the
xhci operations (e.g. xhci_alloc_dev(), xhci_setup_device())
hang as the command never completes.
I don't think XHCI can operate without the shared HCD as we've
already called xhci_halt() in xhci_only_stop_hcd() when shared HCD
goes away. We need to prevent new commands from being queued
not only when HCD is dying but also when HCD is halted.
The following lockup was detected while testing the otg state
machine.
[ 178.199951] xhci-hcd xhci-hcd.0.auto: xHCI Host Controller
[ 178.205799] xhci-hcd xhci-hcd.0.auto: new USB bus registered, assigned bus number 1
[ 178.214458] xhci-hcd xhci-hcd.0.auto: hcc params 0x0220f04c hci version 0x100 quirks 0x00010010
[ 178.223619] xhci-hcd xhci-hcd.0.auto: irq 400, io mem 0x48890000
[ 178.230677] usb usb1: New USB device found, idVendor=1d6b, idProduct=0002
[ 178.237796] usb usb1: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[ 178.245358] usb usb1: Product: xHCI Host Controller
[ 178.250483] usb usb1: Manufacturer: Linux 4.0.0-rc1-00024-g6111320 xhci-hcd
[ 178.257783] usb usb1: SerialNumber: xhci-hcd.0.auto
[ 178.267014] hub 1-0:1.0: USB hub found
[ 178.272108] hub 1-0:1.0: 1 port detected
[ 178.278371] xhci-hcd xhci-hcd.0.auto: xHCI Host Controller
[ 178.284171] xhci-hcd xhci-hcd.0.auto: new USB bus registered, assigned bus number 2
[ 178.294038] usb usb2: New USB device found, idVendor=1d6b, idProduct=0003
[ 178.301183] usb usb2: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[ 178.308776] usb usb2: Product: xHCI Host Controller
[ 178.313902] usb usb2: Manufacturer: Linux 4.0.0-rc1-00024-g6111320 xhci-hcd
[ 178.321222] usb usb2: SerialNumber: xhci-hcd.0.auto
[ 178.329061] hub 2-0:1.0: USB hub found
[ 178.333126] hub 2-0:1.0: 1 port detected
[ 178.567585] dwc3 48890000.usb: usb_otg_start_host 0
[ 178.572707] xhci-hcd xhci-hcd.0.auto: remove, state 4
[ 178.578064] usb usb2: USB disconnect, device number 1
[ 178.586565] xhci-hcd xhci-hcd.0.auto: USB bus 2 deregistered
[ 178.592585] xhci-hcd xhci-hcd.0.auto: remove, state 1
[ 178.597924] usb usb1: USB disconnect, device number 1
[ 178.603248] usb 1-1: new high-speed USB device number 2 using xhci-hcd
[ 190.597337] INFO: task kworker/u4:0:6 blocked for more than 10 seconds.
[ 190.604273] Not tainted 4.0.0-rc1-00024-g6111320 #1058
[ 190.610228] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 190.618443] kworker/u4:0 D c05c0ac0 0 6 2 0x00000000
[ 190.625120] Workqueue: usb_otg usb_otg_work
[ 190.629533] [<c05c0ac0>] (__schedule) from [<c05c10ac>] (schedule+0x34/0x98)
[ 190.636915] [<c05c10ac>] (schedule) from [<c05c1318>] (schedule_preempt_disabled+0xc/0x10)
[ 190.645591] [<c05c1318>] (schedule_preempt_disabled) from [<c05c23d0>] (mutex_lock_nested+0x1ac/0x3fc)
[ 190.655353] [<c05c23d0>] (mutex_lock_nested) from [<c046cf8c>] (usb_disconnect+0x3c/0x208)
[ 190.664043] [<c046cf8c>] (usb_disconnect) from [<c0470cf0>] (_usb_remove_hcd+0x98/0x1d8)
[ 190.672535] [<c0470cf0>] (_usb_remove_hcd) from [<c0485da8>] (usb_otg_start_host+0x50/0xf4)
[ 190.681299] [<c0485da8>] (usb_otg_start_host) from [<c04849a4>] (otg_set_protocol+0x5c/0xd0)
[ 190.690153] [<c04849a4>] (otg_set_protocol) from [<c0484b88>] (otg_set_state+0x170/0xbfc)
[ 190.698735] [<c0484b88>] (otg_set_state) from [<c0485740>] (otg_statemachine+0x12c/0x470)
[ 190.707326] [<c0485740>] (otg_statemachine) from [<c0053c84>] (process_one_work+0x1b4/0x4a0)
[ 190.716162] [<c0053c84>] (process_one_work) from [<c00540f8>] (worker_thread+0x154/0x44c)
[ 190.724742] [<c00540f8>] (worker_thread) from [<c0058f88>] (kthread+0xd4/0xf0)
[ 190.732328] [<c0058f88>] (kthread) from [<c000e810>] (ret_from_fork+0x14/0x24)
[ 190.739898] 5 locks held by kworker/u4:0/6:
[ 190.744274] #0: ("%s""usb_otg"){.+.+.+}, at: [<c0053bf4>] process_one_work+0x124/0x4a0
[ 190.752799] #1: ((&otgd->work)){+.+.+.}, at: [<c0053bf4>] process_one_work+0x124/0x4a0
[ 190.761326] #2: (&otgd->fsm.lock){+.+.+.}, at: [<c048562c>] otg_statemachine+0x18/0x470
[ 190.769934] #3: (usb_bus_list_lock){+.+.+.}, at: [<c0470ce8>] _usb_remove_hcd+0x90/0x1d8
[ 190.778635] #4: (&dev->mutex){......}, at: [<c046cf8c>] usb_disconnect+0x3c/0x208
[ 190.786700] INFO: task kworker/1:0:14 blocked for more than 10 seconds.
[ 190.793633] Not tainted 4.0.0-rc1-00024-g6111320 #1058
[ 190.799567] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 190.807783] kworker/1:0 D c05c0ac0 0 14 2 0x00000000
[ 190.814457] Workqueue: usb_hub_wq hub_event
[ 190.818866] [<c05c0ac0>] (__schedule) from [<c05c10ac>] (schedule+0x34/0x98)
[ 190.826252] [<c05c10ac>] (schedule) from [<c05c4e40>] (schedule_timeout+0x13c/0x1ec)
[ 190.834377] [<c05c4e40>] (schedule_timeout) from [<c05c19f0>] (wait_for_common+0xbc/0x150)
[ 190.843062] [<c05c19f0>] (wait_for_common) from [<bf068a3c>] (xhci_setup_device+0x164/0x5cc [xhci_hcd])
[ 190.852986] [<bf068a3c>] (xhci_setup_device [xhci_hcd]) from [<c046b7f4>] (hub_port_init+0x3f4/0xb10)
[ 190.862667] [<c046b7f4>] (hub_port_init) from [<c046eb64>] (hub_event+0x704/0x1018)
[ 190.870704] [<c046eb64>] (hub_event) from [<c0053c84>] (process_one_work+0x1b4/0x4a0)
[ 190.878919] [<c0053c84>] (process_one_work) from [<c00540f8>] (worker_thread+0x154/0x44c)
[ 190.887503] [<c00540f8>] (worker_thread) from [<c0058f88>] (kthread+0xd4/0xf0)
[ 190.895076] [<c0058f88>] (kthread) from [<c000e810>] (ret_from_fork+0x14/0x24)
[ 190.902650] 5 locks held by kworker/1:0/14:
[ 190.907023] #0: ("usb_hub_wq"){.+.+.+}, at: [<c0053bf4>] process_one_work+0x124/0x4a0
[ 190.915454] #1: ((&hub->events)){+.+.+.}, at: [<c0053bf4>] process_one_work+0x124/0x4a0
[ 190.924070] #2: (&dev->mutex){......}, at: [<c046e490>] hub_event+0x30/0x1018
[ 190.931768] #3: (&port_dev->status_lock){+.+.+.}, at: [<c046eb50>] hub_event+0x6f0/0x1018
[ 190.940558] #4: (&bus->usb_address0_mutex){+.+.+.}, at: [<c046b458>] hub_port_init+0x58/0xb10
Starting with 4.1 the tracing subsystem has its own filesystem
which is automounted in the tracing subdirectory of debugfs.
Prior to this debugfs could be bind mounted in a cloned mount
namespace, but if tracefs has been mounted under debugfs this
now fails because there is a locked child mount. This creates
a regression for container software which bind mounts debugfs
to satisfy the assumption of some userspace software.
In other pseudo filesystems such as proc and sysfs we're already
creating mountpoints like this in such a way that no dirents can
be created in the directories, allowing them to be exceptions to
some MNT_LOCKED tests. In fact we already do this for the
tracefs mountpoint in sysfs.
Do the same in debugfs_create_automount(), since the intention
here is clearly to create a mountpoint. This fixes the regression,
as locked child mounts on permanently empty directories do not
cause a bind mount to fail.
If we rename an inode A (be it a file or a directory), create a new
inode B with the old name of inode A and under the same parent directory,
fsync inode B and then power fail, at log tree replay time we end up
removing inode A completely. If inode A is a directory then all its files
are gone too.
Example scenarios where this happens:
This is reproducible with the following steps, taken from a couple of
test cases written for fstests which are going to be submitted upstream
soon:
The next time the fs is mounted, log tree replay happens and
the directory "y" does not exist nor do the files "foo" and
"bar" exist anywhere (neither in "y" nor in "x", nor the root
nor anywhere).
The next time the fs is mounted, log tree replay happens and the
file "bar" does not exists anymore. A file with the name "foo"
exists and it matches the second file we created.
Another related problem that does not involve file/data loss is when a
new inode is created with the name of a deleted snapshot and we fsync it:
The next time the fs is mounted the log replay procedure fails because
it attempts to delete the snapshot entry (which has dir item key type
of BTRFS_ROOT_ITEM_KEY) as if it were a regular (non-root) entry,
resulting in the following error that causes mount to fail:
Fix this by forcing a transaction commit when such cases happen.
This means we check in the commit root of the subvolume tree if there
was any other inode with the same reference when the inode we are
fsync'ing is a new inode (created in the current transaction).
Test cases for fstests, covering all the scenarios given above, were
submitted upstream for fstests:
* fstests: generic test for fsync after renaming directory
https://patchwork.kernel.org/patch/8694281/
* fstests: generic test for fsync after renaming file
https://patchwork.kernel.org/patch/8694301/
* fstests: add btrfs test for fsync after snapshot deletion
https://patchwork.kernel.org/patch/8670671/
Cc: stable@vger.kernel.org Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit 033ad030df0ea932a21499582fea59e1df95769b)
When we have the no_holes feature enabled, if we truncate a file to a
smaller size, truncate it again but to a size greater than or equals to
its original size and fsync it, the log tree will not have any information
about the hole covering the range [truncate_1_offset, new_file_size[.
This means that if the fsync log is replayed, the file will remain in the
state it had before both truncate operations.
Without the no_holes feature this does not happen, since when the inode
is logged (full sync flag is set) it will find in the fs/subvol tree a
leaf with a generation matching the current transaction id that has an
explicit extent item representing the hole.
Fix this by adding an explicit extent item representing a hole between
the last extent and the inode's i_size if we are doing a full sync.
The issue is easy to reproduce with the following test case for fstests:
_need_to_be_root
_supported_fs generic
_supported_os Linux
_require_scratch
_require_dm_flakey
# This test was motivated by an issue found in btrfs when the btrfs
# no-holes feature is enabled (introduced in kernel 3.14). So enable
# the feature if the fs being tested is btrfs.
if [ $FSTYP == "btrfs" ]; then
_require_btrfs_fs_feature "no_holes"
_require_btrfs_mkfs_feature "no-holes"
MKFS_OPTIONS="$MKFS_OPTIONS -O no-holes"
fi
# Now truncate our file foo to a smaller size (64Kb) and then truncate
# it to the size it had before the shrinking truncate (125Kb). Then
# fsync our file. If a power failure happens after the fsync, we expect
# our file to have a size of 125Kb, with the first 64Kb of data having
# the value 0xaa and the second 61Kb of data having the value 0x00.
$XFS_IO_PROG -c "truncate 64K" \
-c "truncate 125K" \
-c "fsync" \
$SCRATCH_MNT/foo
# Do something similar to our file bar, but the first truncation sets
# the file size to 0 and the second truncation expands the size to the
# double of what it was initially.
$XFS_IO_PROG -c "truncate 0" \
-c "truncate 253K" \
-c "fsync" \
$SCRATCH_MNT/bar
# Allow writes again, mount to trigger log replay and validate file
# contents.
_load_flakey_table $FLAKEY_ALLOW_WRITES
_mount_flakey
# We expect foo to have a size of 125Kb, the first 64Kb of data all
# having the value 0xaa and the remaining 61Kb to be a hole (all bytes
# with value 0x00).
echo "File foo content after log replay:"
od -t x1 $SCRATCH_MNT/foo
# We expect bar to have a size of 253Kb and no extents (any byte read
# from bar has the value 0x00).
echo "File bar content after log replay:"
od -t x1 $SCRATCH_MNT/bar
status=0
exit
The expected file contents in the golden output are:
File foo content after log replay:
0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
*
0200000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0372000
File bar content after log replay:
0000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0772000
Without this fix, their contents are:
File foo content after log replay:
0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
*
0200000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
*
0372000
File bar content after log replay:
0000000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
*
0200000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
*
0372000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0772000
A test case submission for fstests follows soon.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit 091537b5b3102af4765256d2b527a2a1de4fb184)
After commit 4f764e515361 ("Btrfs: remove deleted xattrs on fsync log
replay"), we can end up in a situation where during log replay we end up
deleting xattrs that were never deleted when their file was last fsynced.
This happens in the fast fsync path (flag BTRFS_INODE_NEEDS_FULL_SYNC is
not set in the inode) if the inode has the flag BTRFS_INODE_COPY_EVERYTHING
set, the xattr was added in a past transaction and the leaf where the
xattr is located was not updated (COWed or created) in the current
transaction. In this scenario the xattr item never ends up in the log
tree; at log replay time this makes the replay code delete the xattr
from the fs/subvol tree, as it thinks that xattr was deleted prior to
the last fsync.
Fix this by always logging all xattrs, which is the simplest and most
reliable way to detect deleted xattrs and replay the deletes at log replay
time.
This issue is reproducible with the following test case for fstests:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
here=`pwd`
tmp=/tmp/$$
status=1 # failure is the default!
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
. ./common/dmflakey
. ./common/attr
# real QA test starts here
# We create a lot of xattrs for a single file. Only btrfs and xfs are currently
# able to store such a large amount of xattrs per file; other filesystems such
# as ext3/4 and f2fs for example, fail with ENOSPC even if we attempt to add
# less than 1000 xattrs with very small values.
_supported_fs btrfs xfs
_supported_os Linux
_need_to_be_root
_require_scratch
_require_dm_flakey
_require_attrs
_require_metadata_journaling $SCRATCH_DEV
# Create the test file with some initial data and make sure everything is
# durably persisted.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0 32k" $SCRATCH_MNT/foo | _filter_xfs_io
sync
# Add many small xattrs to our file.
# We create such a large amount because it's needed to trigger the issue found
# in btrfs - we need to have an amount that causes the fs to have at least 3
# btree leafs with xattrs stored in them, and it must work on any leaf size
# (maximum leaf/node size is 64Kb).
num_xattrs=2000
for ((i = 1; i <= $num_xattrs; i++)); do
name="user.attr_$(printf "%04d" $i)"
$SETFATTR_PROG -n $name -v "val_$(printf "%04d" $i)" $SCRATCH_MNT/foo
done
# Sync the filesystem to force a commit of the current btrfs transaction, this
# is a necessary condition to trigger the bug on btrfs.
sync
# Now update our file's data and fsync the file.
# After a successful fsync, if the fsync log/journal is replayed we expect to
# see all the xattrs we added before with the same values (and the updated file
# data of course). Btrfs used to delete some of these xattrs when it replayed
# its fsync log/journal.
$XFS_IO_PROG -c "pwrite -S 0xbb 8K 16K" \
-c "fsync" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Simulate a crash/power loss.
_load_flakey_table $FLAKEY_DROP_WRITES
_unmount_flakey
# Allow writes again and mount. This makes the fs replay its fsync log.
_load_flakey_table $FLAKEY_ALLOW_WRITES
_mount_flakey
echo "File content after crash and log replay:"
od -t x1 $SCRATCH_MNT/foo
echo "File xattrs after crash and log replay:"
for ((i = 1; i <= $num_xattrs; i++)); do
name="user.attr_$(printf "%04d" $i)"
echo -n "$name="
$GETFATTR_PROG --absolute-names -n $name --only-values $SCRATCH_MNT/foo
echo
done
status=0
exit
The golden output expects all xattrs to be available, and with the correct
values, after the fsync log is replayed.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit db4043dffa5b9f77160dfba088bbbbaa64dd9ed1)
In assoc_array_insert_into_terminal_node(), we call the
compare_object() method on all non-empty slots, even when they're
not leaves, passing a pointer to an unexpected structure to
compare_object(). Currently it causes an out-of-bound read access
in keyring_compare_object detected by KASan (see below). The issue
is easily reproduced with keyutils testsuite.
Only call compare_object() when the slot is a leaf.
KASan warning:
==================================================================
BUG: KASAN: slab-out-of-bounds in keyring_compare_object+0x213/0x240 at addr ffff880060a6f838
Read of size 8 by task keyctl/1655
=============================================================================
BUG kmalloc-192 (Not tainted): kasan: bad access detected
-----------------------------------------------------------------------------