]> www.infradead.org Git - users/willy/xarray.git/log
users/willy/xarray.git
11 years agoperf tools: Account entry stats when it's added to the output tree
Namhyung Kim [Tue, 22 Apr 2014 02:44:21 +0000 (11:44 +0900)]
perf tools: Account entry stats when it's added to the output tree

Currently, accounting each sample is done in multiple places - once
when adding them to the input tree, other when adding them to the
output tree.  It's not only confusing but also can cause a subtle
problem since concurrent processing like in perf top might see the
updated stats before adding entries into the output tree - like seeing
more (blank) lines at the end and/or slight inaccurate percentage.

To fix this, only account the entries when it's moved into the output
tree so that they cannot be seen prematurely.  There're some
exceptional cases here and there - they should be addressed separately
with comments.

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: http://lkml.kernel.org/r/1398327843-31845-7-git-send-email-namhyung@kernel.org
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
11 years agoperf hists: Collapse expanded callchains after filter is applied
Namhyung Kim [Thu, 24 Apr 2014 07:44:16 +0000 (16:44 +0900)]
perf hists: Collapse expanded callchains after filter is applied

When a filter is applied a hist entry checks whether its callchain was
folded and account it to the output stat.  But this is rather hacky
and only TUI-specific.  Simply fold the callchains for the entry looks
like a simpler and more generic solution IMHO.

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: http://lkml.kernel.org/r/1398327843-31845-6-git-send-email-namhyung@kernel.org
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
11 years agoperf hists: Add a couple of hists stat helper functions
Namhyung Kim [Thu, 24 Apr 2014 07:37:26 +0000 (16:37 +0900)]
perf hists: Add a couple of hists stat helper functions

Add hists__{reset,inc}_[filter_]stats() functions to cleanup accesses
to hist stats (for output).  Note that number of samples in the stat
is not handled here since it belongs to the input stage.

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: http://lkml.kernel.org/r/1398327843-31845-5-git-send-email-namhyung@kernel.org
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
11 years agoperf hists: Move column length calculation out of hists__inc_stats()
Namhyung Kim [Thu, 24 Apr 2014 07:25:19 +0000 (16:25 +0900)]
perf hists: Move column length calculation out of hists__inc_stats()

It's not the part of logic of hists__inc_stats() so it'd be better to
move it out of the function.

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: http://lkml.kernel.org/r/1398327843-31845-4-git-send-email-namhyung@kernel.org
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
11 years agoperf hists: Rename hists__inc_stats()
Namhyung Kim [Thu, 24 Apr 2014 07:21:46 +0000 (16:21 +0900)]
perf hists: Rename hists__inc_stats()

The existing hists__inc_nr_entries() is a misnomer as it's not only
increasing ->nr_entries but also other stats.  So rename it to more
general hists__inc_stats().

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: http://lkml.kernel.org/r/1398327843-31845-3-git-send-email-namhyung@kernel.org
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
11 years agoperf report: Count number of entries separately
Namhyung Kim [Tue, 22 Apr 2014 00:47:25 +0000 (09:47 +0900)]
perf report: Count number of entries separately

The hists->nr_entries is counted in multiple places so that they can
confuse readers of the code.  This is a preparation of later change
and do not intend any functional difference.

Note that report__collapse_hists() now changed to return nothing since
its return value (nr_samples) is only for checking if there's any data
in the input file and this can be acheived by checking ->nr_entries.

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: http://lkml.kernel.org/r/1398327843-31845-2-git-send-email-namhyung@kernel.org
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
11 years agoMerge tag 'perf-core-for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git...
Ingo Molnar [Tue, 22 Apr 2014 18:28:23 +0000 (20:28 +0200)]
Merge tag 'perf-core-for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf into perf/core

Pull perf/core improvements and fixes from Jiri Olsa:

Infrastructure changes:

  * Making some code (cpu node map and report parse callchain callback) global
    to be usable by upcomming changes (Don Zickus)

  * Fix pmu object compilation error (Jiri Olsa)

Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
11 years agoperf callchain: Add generic report parse callchain callback function
Don Zickus [Mon, 7 Apr 2014 18:55:24 +0000 (14:55 -0400)]
perf callchain: Add generic report parse callchain callback function

This takes the parse_callchain_opt function and copies it into the
callchain.c file.  Now the c2c tool can use it too without duplicating.

Update perf-report to use the new routine too.

Signed-off-by: Don Zickus <dzickus@redhat.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Link: http://lkml.kernel.org/r/1396896924-129847-5-git-send-email-dzickus@redhat.com
[ Adding missing braces to multiline if condition ]
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
11 years agoperf kmem: Utilize the new generic cpunode_map
Don Zickus [Mon, 7 Apr 2014 18:55:23 +0000 (14:55 -0400)]
perf kmem: Utilize the new generic cpunode_map

Use the previous patch implementation of cpunode_map for builtin-kmem.c
Should not be any functional difference.

Signed-off-by: Don Zickus <dzickus@redhat.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Link: http://lkml.kernel.org/r/1396896924-129847-4-git-send-email-dzickus@redhat.com
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
11 years agoperf tools: Use cpu/possible instead of cpu/kernel_max
Don Zickus [Mon, 7 Apr 2014 18:55:22 +0000 (14:55 -0400)]
perf tools: Use cpu/possible instead of cpu/kernel_max

The system's max configuration is represented by cpu/possible and
cpu/kernel_max can be huge (4096 vs. 128), so save space by keeping
smaller structures.

Signed-off-by: Don Zickus <dzickus@redhat.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Link: http://lkml.kernel.org/r/1396896924-129847-3-git-send-email-dzickus@redhat.com
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
11 years agoperf tools: Allow ability to map cpus to nodes easily
Don Zickus [Mon, 7 Apr 2014 18:55:21 +0000 (14:55 -0400)]
perf tools: Allow ability to map cpus to nodes easily

This patch figures out the max number of cpus and nodes that are on the
system and creates a map of cpu to node.  This allows us to provide a cpu
and quickly get the node associated with it.

It was mostly copied from builtin-kmem.c and tweaked slightly to use less memory
(use possible cpus instead of max).  It also calculates the max number of nodes.

Signed-off-by: Don Zickus <dzickus@redhat.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Link: http://lkml.kernel.org/r/1396896924-129847-2-git-send-email-dzickus@redhat.com
[ Removing out label code in init_cpunode_map ]
[ Adding check for snprintf error ]
[ Removing unneeded returns ]
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
11 years agoperf tools: Fix pmu object compilation error
Jiri Olsa [Wed, 16 Apr 2014 18:49:02 +0000 (20:49 +0200)]
perf tools: Fix pmu object compilation error

After applying some patches got another shadowing error:

  CC       util/pmu.o
util/pmu.c: In function ‘pmu_alias_terms’:
util/pmu.c:287:35: error: declaration of ‘clone’ shadows a global declaration [-Werror=shadow]

Renaming clone to cloned.

Acked-by: David Ahern <dsahern@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: http://lkml.kernel.org/r/1397674818-27054-1-git-send-email-jolsa@redhat.com
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
11 years agoperf/x86: Export perf_assign_events()
Yan, Zheng [Tue, 18 Mar 2014 08:56:43 +0000 (16:56 +0800)]
perf/x86: Export perf_assign_events()

export perf_assign_events to allow building perf Intel uncore driver
as module

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1395133004-23205-3-git-send-email-zheng.z.yan@intel.com
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: eranian@google.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
11 years agohrtimer: Export __hrtimer_start_range_ns()
Yan, Zheng [Tue, 18 Mar 2014 08:56:42 +0000 (16:56 +0800)]
hrtimer: Export __hrtimer_start_range_ns()

Export __hrtimer_start_range_ns() to allow building perf Intel uncore
driver as a module.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1395133004-23205-2-git-send-email-zheng.z.yan@intel.com
Cc: eranian@google.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
11 years agoperf: Allow building PMU drivers as modules
Yan, Zheng [Tue, 18 Mar 2014 08:56:41 +0000 (16:56 +0800)]
perf: Allow building PMU drivers as modules

This patch adds support for building PMU driver as module. It exports
the functions perf_pmu_{register,unregister}() and adds reference tracking
for the PMU driver module.

When the PMU driver is built as a module, each active event of the PMU
holds a reference to the driver module.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1395133004-23205-1-git-send-email-zheng.z.yan@intel.com
Cc: eranian@google.com
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
11 years agoMerge branch 'perf/urgent' into perf/core, to pick up PMU driver fixes.
Ingo Molnar [Fri, 18 Apr 2014 10:14:55 +0000 (12:14 +0200)]
Merge branch 'perf/urgent' into perf/core, to pick up PMU driver fixes.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
11 years agoperf/x86/intel: Use rdmsrl_safe() when initializing RAPL PMU
Venkatesh Srinivas [Thu, 13 Mar 2014 19:36:26 +0000 (12:36 -0700)]
perf/x86/intel: Use rdmsrl_safe() when initializing RAPL PMU

CPUs which should support the RAPL counters according to
Family/Model/Stepping may still issue #GP when attempting to access
the RAPL MSRs. This may happen when Linux is running under KVM and
we are passing-through host F/M/S data, for example. Use rdmsrl_safe
to first access the RAPL_POWER_UNIT MSR; if this fails, do not
attempt to use this PMU.

Signed-off-by: Venkatesh Srinivas <venkateshs@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1394739386-22260-1-git-send-email-venkateshs@google.com
Cc: zheng.z.yan@intel.com
Cc: eranian@google.com
Cc: ak@linux.intel.com
Cc: linux-kernel@vger.kernel.org
[ The patch also silently fixes another bug: rapl_pmu_init() didn't handle the memory alloc failure case previously. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
11 years agoMerge branch 'uprobes/core' of git://git.kernel.org/pub/scm/linux/kernel/git/oleg...
Ingo Molnar [Fri, 18 Apr 2014 07:49:14 +0000 (09:49 +0200)]
Merge branch 'uprobes/core' of git://git.kernel.org/pub/scm/linux/kernel/git/oleg/misc into perf/core

Pull uprobes fixes and cleanups from Oleg Nesterov:

  "Any probed jmp/call can kill the application, see the changelog in 11/15."

Signed-off-by: Ingo Molnar <mingo@kernel.org>
11 years agouprobes/x86: Emulate relative conditional "near" jmp's
Oleg Nesterov [Mon, 7 Apr 2014 14:22:58 +0000 (16:22 +0200)]
uprobes/x86: Emulate relative conditional "near" jmp's

Change branch_setup_xol_ops() to simply use opc1 = OPCODE2(insn) - 0x10
if OPCODE1() == 0x0f; this matches the "short" jmp which checks the same
condition.

Thanks to lib/insn.c, it does the rest correctly. branch->ilen/offs are
correct no matter if this jmp is "near" or "short".

Reported-by: Jonathan Lebon <jlebon@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
11 years agouprobes/x86: Emulate relative conditional "short" jmp's
Oleg Nesterov [Sun, 6 Apr 2014 19:53:47 +0000 (21:53 +0200)]
uprobes/x86: Emulate relative conditional "short" jmp's

Teach branch_emulate_op() to emulate the conditional "short" jmp's which
check regs->flags.

Note: this doesn't support jcxz/jcexz, loope/loopz, and loopne/loopnz.
They all are rel8 and thus they can't trigger the problem, but perhaps
we will add the support in future just for completeness.

Reported-by: Jonathan Lebon <jlebon@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
11 years agouprobes/x86: Emulate relative call's
Oleg Nesterov [Sun, 6 Apr 2014 16:11:02 +0000 (18:11 +0200)]
uprobes/x86: Emulate relative call's

See the previous "Emulate unconditional relative jmp's" which explains
why we can not execute "jmp" out-of-line, the same applies to "call".

Emulating of rip-relative call is trivial, we only need to additionally
push the ret-address. If this fails, we execute this instruction out of
line and this should trigger the trap, the probed application should die
or the same insn will be restarted if a signal handler expands the stack.
We do not even need ->post_xol() for this case.

But there is a corner (and almost theoretical) case: another thread can
expand the stack right before we execute this insn out of line. In this
case it hit the same problem we are trying to solve. So we simply turn
the probed insn into "call 1f; 1:" and add ->post_xol() which restores
->sp and restarts.

Many thanks to Jonathan who finally found the standalone reproducer,
otherwise I would never resolve the "random SIGSEGV's under systemtap"
bug-report. Now that the problem is clear we can write the simplified
test-case:

void probe_func(void), callee(void);

int failed = 1;

asm (
".text\n"
".align 4096\n"
".globl probe_func\n"
"probe_func:\n"
"call callee\n"
"ret"
);

/*
 * This assumes that:
 *
 * - &probe_func = 0x401000 + a_bit, aligned = 0x402000
 *
 * - xol_vma->vm_start = TASK_SIZE_MAX - PAGE_SIZE = 0x7fffffffe000
 *   as xol_add_vma() asks; the 1st slot = 0x7fffffffe080
 *
 * so we can target the non-canonical address from xol_vma using
 * the simple math below, 100 * 4096 is just the random offset
 */
asm (".org . + 0x800000000000 - 0x7fffffffe080 - 5 - 1  + 100 * 4096\n");

void callee(void)
{
failed = 0;
}

int main(void)
{
probe_func();
return failed;
}

It SIGSEGV's if you probe "probe_func" (although this is not very reliable,
randomize_va_space/etc can change the placement of xol area).

Note: as Denys Vlasenko pointed out, amd and intel treat "callw" (0x66 0xe8)
differently. This patch relies on lib/insn.c and thus implements the intel's
behaviour: 0x66 is simply ignored. Fortunately nothing sane should ever use
this insn, so we postpone the fix until we decide what should we do; emulate
or not, support or not, etc.

Reported-by: Jonathan Lebon <jlebon@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
11 years agouprobes/x86: Emulate nop's using ops->emulate()
Oleg Nesterov [Sat, 5 Apr 2014 19:06:10 +0000 (21:06 +0200)]
uprobes/x86: Emulate nop's using ops->emulate()

Finally we can kill the ugly (and very limited) code in __skip_sstep().
Just change branch_setup_xol_ops() to treat "nop" as jmp to the next insn.

Thanks to lib/insn.c, it is clever enough. OPCODE1() == 0x90 includes
"(rep;)+ nop;" at least, and (afaics) much more.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
11 years agouprobes/x86: Emulate unconditional relative jmp's
Oleg Nesterov [Sat, 5 Apr 2014 18:05:02 +0000 (20:05 +0200)]
uprobes/x86: Emulate unconditional relative jmp's

Currently we always execute all insns out-of-line, including relative
jmp's and call's. This assumes that even if regs->ip points to nowhere
after the single-step, default_post_xol_op(UPROBE_FIX_IP) logic will
update it correctly.

However, this doesn't work if this regs->ip == xol_vaddr + insn_offset
is not canonical. In this case CPU generates #GP and general_protection()
kills the task which tries to execute this insn out-of-line.

Now that we have uprobe_xol_ops we can teach uprobes to emulate these
insns and solve the problem. This patch adds branch_xol_ops which has
a single branch_emulate_op() hook, so far it can only handle rel8/32
relative jmp's.

TODO: move ->fixup into the union along with rip_rela_target_address.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Jonathan Lebon <jlebon@redhat.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
11 years agouprobes/x86: Introduce sizeof_long(), cleanup adjust_ret_addr() and arch_uretprobe_hi...
Oleg Nesterov [Sun, 6 Apr 2014 15:16:10 +0000 (17:16 +0200)]
uprobes/x86: Introduce sizeof_long(), cleanup adjust_ret_addr() and arch_uretprobe_hijack_return_addr()

1. Add the trivial sizeof_long() helper and change other callers of
   is_ia32_task() to use it.

   TODO: is_ia32_task() is not what we actually want, TS_COMPAT does
   not necessarily mean 32bit. Fortunately syscall-like insns can't be
   probed so it actually works, but it would be better to rename and
   use is_ia32_frame().

2. As Jim pointed out "ncopied" in arch_uretprobe_hijack_return_addr()
   and adjust_ret_addr() should be named "nleft". And in fact only the
   last copy_to_user() in arch_uretprobe_hijack_return_addr() actually
   needs to inspect the non-zero error code.

TODO: adjust_ret_addr() should die. We can always calculate the value
we need to write into *regs->sp, just UPROBE_FIX_CALL should record
insn->length.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
11 years agouprobes/x86: Teach arch_uprobe_post_xol() to restart if possible
Oleg Nesterov [Thu, 3 Apr 2014 18:52:19 +0000 (20:52 +0200)]
uprobes/x86: Teach arch_uprobe_post_xol() to restart if possible

SIGILL after the failed arch_uprobe_post_xol() should only be used as
a last resort, we should try to restart the probed insn if possible.

Currently only adjust_ret_addr() can fail, and this can only happen if
another thread unmapped our stack after we executed "call" out-of-line.
Most probably the application if buggy, but even in this case it can
have a handler for SIGSEGV/etc. And in theory it can be even correct
and do something non-trivial with its memory.

Of course we can't restart unconditionally, so arch_uprobe_post_xol()
does this only if ->post_xol() returns -ERESTART even if currently this
is the only possible error.

default_post_xol_op(UPROBE_FIX_CALL) can always restart, but as Jim
pointed out it should not forget to pop off the return address pushed
by this insn executed out-of-line.

Note: this is not "perfect", we do not want the extra handler_chain()
after restart, but I think this is the best solution we can realistically
do without too much uglifications.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
11 years agouprobes/x86: Send SIGILL if arch_uprobe_post_xol() fails
Oleg Nesterov [Thu, 3 Apr 2014 18:20:10 +0000 (20:20 +0200)]
uprobes/x86: Send SIGILL if arch_uprobe_post_xol() fails

Currently the error from arch_uprobe_post_xol() is silently ignored.
This doesn't look good and this can lead to the hard-to-debug problems.

1. Change handle_singlestep() to loudly complain and send SIGILL.

   Note: this only affects x86, ppc/arm can't fail.

2. Change arch_uprobe_post_xol() to call arch_uprobe_abort_xol() and
   avoid TF games if it is going to return an error.

   This can help to to analyze the problem, if nothing else we should
   not report ->ip = xol_slot in the core-file.

   Note: this means that handle_riprel_post_xol() can be called twice,
   but this is fine because it is idempotent.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
11 years agouprobes/x86: Conditionalize the usage of handle_riprel_insn()
Oleg Nesterov [Mon, 31 Mar 2014 15:24:14 +0000 (17:24 +0200)]
uprobes/x86: Conditionalize the usage of handle_riprel_insn()

arch_uprobe_analyze_insn() calls handle_riprel_insn() at the start,
but only "0xff" and "default" cases need the UPROBE_FIX_RIP_ logic.
Move the callsite into "default" case and change the "0xff" case to
fall-through.

We are going to add the various hooks to handle the rip-relative
jmp/call instructions (and more), we need this change to enforce the
fact that the new code can not conflict with is_riprel_insn() logic
which, after this change, can only be used by default_xol_ops.

Note: arch_uprobe_abort_xol() still calls handle_riprel_post_xol()
directly. This is fine unless another _xol_ops we may add later will
need to reuse "UPROBE_FIX_RIP_AX|UPROBE_FIX_RIP_CX" bits in ->fixup.
In this case we can add uprobe_xol_ops->abort() hook, which (perhaps)
we will need anyway in the long term.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
11 years agouprobes/x86: Introduce uprobe_xol_ops and arch_uprobe->ops
Oleg Nesterov [Mon, 31 Mar 2014 19:01:31 +0000 (21:01 +0200)]
uprobes/x86: Introduce uprobe_xol_ops and arch_uprobe->ops

Introduce arch_uprobe->ops pointing to the "struct uprobe_xol_ops",
move the current UPROBE_FIX_{RIP*,IP,CALL} code into the default
set of methods and change arch_uprobe_pre/post_xol() accordingly.

This way we can add the new uprobe_xol_ops's to handle the insns
which need the special processing (rip-relative jmp/call at least).

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
11 years agouprobes/x86: move the UPROBE_FIX_{RIP,IP,CALL} code at the end of pre/post hooks
Oleg Nesterov [Mon, 31 Mar 2014 17:38:09 +0000 (19:38 +0200)]
uprobes/x86: move the UPROBE_FIX_{RIP,IP,CALL} code at the end of pre/post hooks

No functional changes. Preparation to simplify the review of the next
change. Just reorder the code in arch_uprobe_pre/post_xol() functions
so that UPROBE_FIX_{RIP_*,IP,CALL} logic goes to the end.

Also change arch_uprobe_pre_xol() to use utask instead of autask, to
make the code more symmetrical with arch_uprobe_post_xol().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
11 years agouprobes/x86: Gather "riprel" functions together
Oleg Nesterov [Mon, 31 Mar 2014 16:35:09 +0000 (18:35 +0200)]
uprobes/x86: Gather "riprel" functions together

Cosmetic. Move pre_xol_rip_insn() and handle_riprel_post_xol() up to
the closely related handle_riprel_insn(). This way it is simpler to
read and understand this code, and this lessens the number of ifdef's.

While at it, update the comment in handle_riprel_post_xol() as Jim
suggested.

TODO: rename them somehow to make the naming consistent.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
11 years agouprobes/x86: Kill the "ia32_compat" check in handle_riprel_insn(), remove "mm" arg
Oleg Nesterov [Mon, 31 Mar 2014 16:09:36 +0000 (18:09 +0200)]
uprobes/x86: Kill the "ia32_compat" check in handle_riprel_insn(), remove "mm" arg

Kill the "mm->context.ia32_compat" check in handle_riprel_insn(), if
it is true insn_rip_relative() must return false. validate_insn_bits()
passed "ia32_compat" as !x86_64 to insn_init(), and insn_rip_relative()
checks insn->x86_64.

Also, remove the no longer needed "struct mm_struct *mm" argument and
the unnecessary "return" at the end.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
11 years agouprobes/x86: Fold prepare_fixups() into arch_uprobe_analyze_insn()
Oleg Nesterov [Mon, 31 Mar 2014 13:16:22 +0000 (15:16 +0200)]
uprobes/x86: Fold prepare_fixups() into arch_uprobe_analyze_insn()

No functional changes, preparation.

Shift the code from prepare_fixups() to arch_uprobe_analyze_insn()
with the following modifications:

- Do not call insn_get_opcode() again, it was already called
  by validate_insn_bits().

- Move "case 0xea" up. This way "case 0xff" can fall through
  to default case.

- change "case 0xff" to use the nested "switch (MODRM_REG)",
  this way the code looks a bit simpler.

- Make the comments look consistent.

While at it, kill the initialization of rip_rela_target_address and
->fixups, we can rely on kzalloc(). We will add the new members into
arch_uprobe, it would be better to assume that everything is zero by
default.

TODO: cleanup/fix the mess in validate_insn_bits() paths:

- validate_insn_64bits() and validate_insn_32bits() should be
  unified.

- "ifdef" is not used consistently; if good_insns_64 depends
  on CONFIG_X86_64, then probably good_insns_32 should depend
  on CONFIG_X86_32/EMULATION

- the usage of mm->context.ia32_compat looks wrong if the task
  is TIF_X32.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
11 years agouprobes: Kill UPROBE_SKIP_SSTEP and can_skip_sstep()
Oleg Nesterov [Sun, 30 Mar 2014 16:56:22 +0000 (18:56 +0200)]
uprobes: Kill UPROBE_SKIP_SSTEP and can_skip_sstep()

UPROBE_COPY_INSN, UPROBE_SKIP_SSTEP, and uprobe->flags must die. This
patch kills UPROBE_SKIP_SSTEP. I never understood why it was added;
not only it doesn't help, it harms.

It can only help to avoid arch_uprobe_skip_sstep() if it was already
called before and failed. But this is ugly, if we want to know whether
we can emulate this instruction or not we should do this analysis in
arch_uprobe_analyze_insn(), not when we hit this probe for the first
time.

And in fact this logic is simply wrong. arch_uprobe_skip_sstep() can
fail or not depending on the task/register state, if this insn can be
emulated but, say, put_user() fails we need to xol it this time, but
this doesn't mean we shouldn't try to emulate it when this or another
thread hits this bp next time.

And this is the actual reason for this change. We need to emulate the
"call" insn, but push(return-address) can obviously fail.

Per-arch notes:

x86: __skip_sstep() can only emulate "rep;nop". With this
     change it will be called every time and most probably
     for no reason.

     This will be fixed by the next changes. We need to
     change this suboptimal code anyway.

arm: Should not be affected. It has its own "bool simulate"
     flag checked in arch_uprobe_skip_sstep().

ppc: Looks like, it can emulate almost everything. Does it
     actually need to record the fact that emulate_step()
     failed? Hopefully not. But if yes, it can add the ppc-
     specific flag into arch_uprobe.

TODO: rename arch_uprobe_skip_sstep() to arch_uprobe_emulate_insn(),

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: David A. Long <dave.long@linaro.org>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
11 years agokprobes/x86: Fix page-fault handling logic
Masami Hiramatsu [Thu, 17 Apr 2014 08:16:44 +0000 (17:16 +0900)]
kprobes/x86: Fix page-fault handling logic

Current kprobes in-kernel page fault handler doesn't
expect that its single-stepping can be interrupted by
an NMI handler which may cause a page fault(e.g. perf
with callback tracing).

In that case, the page-fault handled by kprobes and it
misunderstands the page-fault has been caused by the
single-stepping code and tries to recover IP address
to probed address.

But the truth is the page-fault has been caused by the
NMI handler, and do_page_fault failes to handle real
page fault because the IP address is modified and
causes Kernel BUGs like below.

 ----
 [ 2264.726905] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
 [ 2264.727190] IP: [<ffffffff813c46e0>] copy_user_generic_string+0x0/0x40

To handle this correctly, I fixed the kprobes fault
handler to ensure the faulted ip address is its own
single-step buffer instead of checking current kprobe
state.

Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Sandeepa Prabhu <sandeepa.prabhu@linaro.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: fche@redhat.com
Cc: systemtap@sourceware.org
Link: http://lkml.kernel.org/r/20140417081644.26341.52351.stgit@ltc230.yrl.intra.hitachi.co.jp
Signed-off-by: Ingo Molnar <mingo@kernel.org>
11 years agoMerge tag 'perf-core-for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git...
Ingo Molnar [Thu, 17 Apr 2014 08:06:47 +0000 (10:06 +0200)]
Merge tag 'perf-core-for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf into perf/core

Pull perf/core improvements and fixes from Jiri Olsa:

User visible changes:

  * Add --percentage option to control absolute/relative percentage output (Namhyung Kim)

Plumbing changes:

  * Add --list-cmds to 'kmem', 'mem', 'lock' and 'sched', for use by completion scripts (Ramkumar Ramachandra)

Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
11 years agoMerge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Wed, 16 Apr 2014 23:40:18 +0000 (16:40 -0700)]
Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Ingo Molnar:
 "Various fixes:

   - reboot regression fix
   - build message spam fix
   - GPU quirk fix
   - 'make kvmconfig' fix

  plus the wire-up of the renameat2() system call on i386"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86: Remove the PCI reboot method from the default chain
  x86/build: Supress "Nothing to be done for ..." messages
  x86/gpu: Fix sign extension issue in Intel graphics stolen memory quirks
  x86/platform: Fix "make O=dir kvmconfig"
  i386: Wire up the renameat2() syscall

11 years agoMerge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Wed, 16 Apr 2014 23:38:57 +0000 (16:38 -0700)]
Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf fixes from Ingo Molnar:
 "Tooling fixes, plus a simple hardware-enablement patch for the Intel
  RAPL PMU (energy use measurement) on Haswell CPUs, which I hope is
  still fine at this stage"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf tools: Instead of redirecting flex output, use -o
  perf tools: Fix double free in perf test 21 (code-reading.c)
  perf stat: Initialize statistics correctly
  perf bench: Set more defaults in the 'numa' suite
  perf bench: Fix segfault at the end of an 'all' execution
  perf bench: Update manpage to mention numa and futex
  perf probe: Use dwarf_getcfi_elf() instead of dwarf_getcfi()
  perf probe: Fix to handle errors in line_range searching
  perf probe: Fix --line option behavior
  perf tools: Pick up libdw without explicit LIBDW_DIR
  MAINTAINERS: Change e-mail to kernel.org one
  perf callchains: Disable unwind libraries when libelf isn't found
  tools lib traceevent: Do not call warning() directly
  tools lib traceevent: Print event name when show warning if possible
  perf top: Fix documentation of invalid -s option
  perf/x86: Enable DRAM RAPL support on Intel Haswell

11 years agoMerge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Wed, 16 Apr 2014 23:36:00 +0000 (16:36 -0700)]
Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq fix from Ingo Molnar:
 "ARM VIC (Vectored Irq Controller) irqchip driver fix"

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip: vic: Properly chain the cascaded IRQs

11 years agoMerge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Wed, 16 Apr 2014 23:35:18 +0000 (16:35 -0700)]
Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking fixes from Ingo Molnar:
 "liblockdep fixes and mutex debugging fixes"

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/mutex: Fix debug_mutexes
  tools/liblockdep: Add proper versioning to the shared obj
  tools/liblockdep: Ignore asmlinkage and visible

11 years agoMerge tag 'fbdev-fixes-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tomba...
Linus Torvalds [Wed, 16 Apr 2014 23:03:24 +0000 (16:03 -0700)]
Merge tag 'fbdev-fixes-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tomba/linux

Pull fbdev fixes from Tomi Valkeinen:
 - fix build errors for bf54x-lq043fb and imxfb
 - fbcon fix for da8xx-fb
 - omapdss fixes for hdmi audio, irq handling and fclk calculation

* tag 'fbdev-fixes-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tomba/linux:
  video: bf54x-lq043fb: fix build error
  OMAPDSS: Change struct reg_field to dispc_reg_field
  OMAPDSS: Take pixelclock unit change into account in hdmi_compute_acr()
  OMAPDSS: fix shared irq handlers
  video: imxfb: Select LCD_CLASS_DEVICE unconditionally
  OMAPDSS: fix rounding when calculating fclk rate
  video: da8xx-fb: Fix casting of info->pseudo_palette

11 years agoMerge tag 'pinctrl-v3.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw...
Linus Torvalds [Wed, 16 Apr 2014 22:59:16 +0000 (15:59 -0700)]
Merge tag 'pinctrl-v3.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl

Pull pincontrol fixes from Linus Walleij:
 "A first set of pin control fixes for the v3.15 series:

   - Fix a couple of barnsjukdomar on the Rockchip driver.

   - Remove an idiotic debug print I happened to leave behind in the
     Nomadik driver.

   - Fixup the Qualcomm MSM interrupt handling code for the TLMM v2.

   - Three patches renaming the Broadcom Capri driver to BCM28155.  This
     has been falling between the chairs for some time due to some
     cross-tree synchronization misunderstandings, now I'm fed up with
     this and just rename it in this -rc1 phase"

* tag 'pinctrl-v3.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
  pinctrl: fix typo in bindings documentation
  Update bcm_defconfig with new pinctrl CONFIG
  pinctrl: Rename Broadcom Capri pinctrl driver
  pinctrl: msm: Correct interrupt code for TLMM v2
  pinctrl: nomadik: delete stray debug print
  pinctrl: rockchip: handle first half of rk3188-bank0 correctly
  pinctrl: rockchip: add return value to rockchip_set_mux
  pinctrl: rockchip: fix offset of mux registers for rk3188

11 years agoMerge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Linus Torvalds [Wed, 16 Apr 2014 18:28:25 +0000 (11:28 -0700)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux

Pull s390 patches from Martin Schwidefsky:
 "An update to the oops output with additional information about the
  crash.  The renameat2 system call is enabled.  Two patches in regard
  to the PTR_ERR_OR_ZERO cleanup.  And a bunch of bug fixes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
  s390/sclp_cmd: replace PTR_RET with PTR_ERR_OR_ZERO
  s390/sclp: replace PTR_RET with PTR_ERR_OR_ZERO
  s390/sclp_vt220: Fix kernel panic due to early terminal input
  s390/compat: fix typo
  s390/uaccess: fix possible register corruption in strnlen_user_srst()
  s390: add 31 bit warning message
  s390: wire up sys_renameat2
  s390: show_registers() should not map user space addresses to kernel symbols
  s390/mm: print control registers and page table walk on crash
  s390/smp: fix smp_stop_cpu() for !CONFIG_SMP
  s390: fix control register update

11 years agoMerge tag 'please-pull-ia64-erratum' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Wed, 16 Apr 2014 18:22:45 +0000 (11:22 -0700)]
Merge tag 'please-pull-ia64-erratum' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux

Pull itanium erratum fix from Tony Luck:
 "Small workaround for a rare, but annoying, erratum #237"

* tag 'please-pull-ia64-erratum' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux:
  [IA64] Change default PSR.ac from '1' to '0' (Fix erratum #237)

11 years ago[IA64] Change default PSR.ac from '1' to '0' (Fix erratum #237)
Tony Luck [Fri, 28 Mar 2014 21:42:07 +0000 (14:42 -0700)]
[IA64] Change default PSR.ac from '1' to '0' (Fix erratum #237)

April 2014 Itanium processor specification update:

http://www.intel.com/content/www/us/en/processors/itanium/itanium-specification-update.html

describes this erratum:

=========================================================================
237. Under a complex set of conditions, store to load forwarding for a
sub 8-byte load may complete incorrectly

Problem: A load instruction may complete incorrectly when a code sequence
using 4-byte or smaller load and store operations to the same address
is executed in combination with specific timing of all the following
concurrent conditions: store to load forwarding, alignment checking
enabled, a mis-predicted branch, and complex cache utilization activity.

Implication: The affected sub 8-byte instruction may complete
incorrectly resulting in unpredictable system behavior. There is an
extremely low probability of exposure due to the significant number of
complex microarchitectural concurrent conditions required to encounter
the erratum.

Workaround: Set PSR.ac = 0 to completely avoid the erratum. Disabling
Hyper-Threading will significantly reduce exposure to the conditions
that contribute to encountering the erratum.

Status: See the Summary Table of Changes for the affected steppings.
=========================================================================

[Table of changes essentially lists all models from McKinley to Tukwila]

The PSR.ac bit controls whether the processor will always generate
an unaligned reference trap (0x5a00) for a misaligned data access
(when PSR.ac=1) or if it will let the access succeed when running
on a cpu that implements logic to handle some unaligned accesses.

Way back in 2008 in commit b704882e70d87d7f56db5ff17e2253f3fa90e4f3
  [IA64] Rationalize kernel mode alignment checking
we made the decision to always enable strict checking. We were
already doing so in trap/interrupt context because the common
preamble code set this bit - but the rest of supervisor code
(and by inheritance user code) ran with PSR.ac=0.

We now reverse that decision and set PSR.ac=0 everywhere in the
kernel (also inherited by user processes). This will avoid the
erratum using the method described in the Itanium specification
update.  Net effect for users is that the processor will handle
unaligned access when it can (typically with a tiny performance
bubble in the pipeline ... but much less invasive than taking a
trap and having the OS perform the access).

Signed-off-by: Tony Luck <tony.luck@intel.com>
11 years agoperf sched: Introduce --list-cmds for use by scripts
Ramkumar Ramachandra [Sat, 15 Mar 2014 03:17:54 +0000 (23:17 -0400)]
perf sched: Introduce --list-cmds for use by scripts

Signed-off-by: Ramkumar Ramachandra <artagnon@gmail.com>
Acked-by: David Ahern <dsahern@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/1394853474-31019-5-git-send-email-artagnon@gmail.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
11 years agoperf lock: Introduce --list-cmds for use by scripts
Ramkumar Ramachandra [Sat, 15 Mar 2014 03:17:53 +0000 (23:17 -0400)]
perf lock: Introduce --list-cmds for use by scripts

Signed-off-by: Ramkumar Ramachandra <artagnon@gmail.com>
Acked-by: David Ahern <dsahern@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/1394853474-31019-4-git-send-email-artagnon@gmail.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
11 years agoperf mem: Introduce --list-cmds for use by scripts
Ramkumar Ramachandra [Sat, 15 Mar 2014 03:17:52 +0000 (23:17 -0400)]
perf mem: Introduce --list-cmds for use by scripts

Signed-off-by: Ramkumar Ramachandra <artagnon@gmail.com>
Acked-by: David Ahern <dsahern@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/1394853474-31019-3-git-send-email-artagnon@gmail.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
11 years agoperf kmem: Introduce --list-cmds for use by scripts
Ramkumar Ramachandra [Sat, 15 Mar 2014 03:17:51 +0000 (23:17 -0400)]
perf kmem: Introduce --list-cmds for use by scripts

Signed-off-by: Ramkumar Ramachandra <artagnon@gmail.com>
Acked-by: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/1394853474-31019-2-git-send-email-artagnon@gmail.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
11 years agoperf tools: Show absolute percentage by default
Namhyung Kim [Tue, 14 Jan 2014 03:05:27 +0000 (12:05 +0900)]
perf tools: Show absolute percentage by default

Now perf report will show absolute percentage on filter entries by
default.

Suggested-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: http://lkml.kernel.org/r/1397145720-8063-8-git-send-email-namhyung@kernel.org
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
11 years agoperf ui/tui: Add 'F' hotkey to toggle percentage output
Namhyung Kim [Mon, 10 Feb 2014 02:20:10 +0000 (11:20 +0900)]
perf ui/tui: Add 'F' hotkey to toggle percentage output

Add 'F' hotkey to toggle relative and absolute percentage of filtered
entries.

Suggested-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: http://lkml.kernel.org/r/1397145720-8063-7-git-send-email-namhyung@kernel.org
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
11 years agoperf tools: Add hist.percentage config option
Namhyung Kim [Tue, 14 Jan 2014 03:02:15 +0000 (12:02 +0900)]
perf tools: Add hist.percentage config option

Add hist.percentage option for setting default value of the
symbol_conf.filter_relative.  It affects the output of various perf
commands (like perf report, top and diff) only if filter(s) applied.

An user can write .perfconfig file like below to show absolute
percentage of filtered entries by default:

  $ cat ~/.perfconfig
  [hist]
  percentage = absolute

And it can be changed through command line:

  $ perf report --percentage relative

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: http://lkml.kernel.org/r/1397145720-8063-6-git-send-email-namhyung@kernel.org
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
11 years agoperf diff: Add --percentage option
Namhyung Kim [Fri, 7 Feb 2014 03:06:07 +0000 (12:06 +0900)]
perf diff: Add --percentage option

The --percentage option is for controlling overhead percentage
displayed.  It can only receive either of "relative" or "absolute" and
affects -c delta output only.

For more information, please see previous commit same thing done to
"perf report".

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: http://lkml.kernel.org/r/1397145720-8063-5-git-send-email-namhyung@kernel.org
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
11 years agoperf top: Add --percentage option
Namhyung Kim [Fri, 7 Feb 2014 03:06:07 +0000 (12:06 +0900)]
perf top: Add --percentage option

The --percentage option is for controlling overhead percentage
displayed.  It can only receive either of "relative" or "absolute".
Move the parser callback function into a common location since it's
used by multiple commands now.

For more information, please see previous commit same thing done to
"perf report".

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: http://lkml.kernel.org/r/1397145720-8063-4-git-send-email-namhyung@kernel.org
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
11 years agoperf report: Add --percentage option
Namhyung Kim [Tue, 14 Jan 2014 02:52:48 +0000 (11:52 +0900)]
perf report: Add --percentage option

The --percentage option is for controlling overhead percentage
displayed.  It can only receive either of "relative" or "absolute".

"relative" means it's relative to filtered entries only so that the
sum of shown entries will be always 100%.  "absolute" means it retains
the original value before and after the filter is applied.

  $ perf report -s comm
  # Overhead       Command
  # ........  ............
  #
      74.19%           cc1
       7.61%           gcc
       6.11%            as
       4.35%            sh
       4.14%          make
       1.13%        fixdep
  ...

  $ perf report -s comm -c cc1,gcc --percentage absolute
  # Overhead       Command
  # ........  ............
  #
      74.19%           cc1
       7.61%           gcc

  $ perf report -s comm -c cc1,gcc --percentage relative
  # Overhead       Command
  # ........  ............
  #
      90.69%           cc1
       9.31%           gcc

Note that it has zero effect if no filter was applied.

Suggested-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: http://lkml.kernel.org/r/1397145720-8063-3-git-send-email-namhyung@kernel.org
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
11 years agoperf hists: Add support for showing relative percentage
Namhyung Kim [Thu, 26 Dec 2013 06:11:52 +0000 (15:11 +0900)]
perf hists: Add support for showing relative percentage

When filtering by thread, dso or symbol on TUI it also update total
period so that the output shows different result than no filter - the
percentage changed to relative to filtered entries only.  Sometimes
this is not desired since users might expect same results with filter.

So new filtered_* fields to hists->stats to count them separately.
They'll be controlled/used by user later.

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: http://lkml.kernel.org/r/1397145720-8063-2-git-send-email-namhyung@kernel.org
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
11 years agox86: Remove the PCI reboot method from the default chain
Ingo Molnar [Fri, 4 Apr 2014 06:41:26 +0000 (08:41 +0200)]
x86: Remove the PCI reboot method from the default chain

Steve reported a reboot hang and bisected it back to this commit:

  a4f1987e4c54 x86, reboot: Add EFI and CF9 reboot methods into the default list

He heroically tested all reboot methods and found the following:

  reboot=t       # triple fault                  ok
  reboot=k       # keyboard ctrl                 FAIL
  reboot=b       # BIOS                          ok
  reboot=a       # ACPI                          FAIL
  reboot=e       # EFI                           FAIL   [system has no EFI]
  reboot=p       # PCI 0xcf9                     FAIL

And I think it's pretty obvious that we should only try PCI 0xcf9 as a
last resort - if at all.

The other observation is that (on this box) we should never try
the PCI reboot method, but close with either the 'triple fault'
or the 'BIOS' (terminal!) reboot methods.

Thirdly, CF9_COND is a total misnomer - it should be something like
CF9_SAFE or CF9_CAREFUL, and 'CF9' should be 'CF9_FORCE' ...

So this patch fixes the worst problems:

 - it orders the actual reboot logic to follow the reboot ordering
   pattern - it was in a pretty random order before for no good
   reason.

 - it fixes the CF9 misnomers and uses BOOT_CF9_FORCE and
   BOOT_CF9_SAFE flags to make the code more obvious.

 - it tries the BIOS reboot method before the PCI reboot method.
   (Since 'BIOS' is a terminal reboot method resulting in a hang
    if it does not work, this is essentially equivalent to removing
    the PCI reboot method from the default reboot chain.)

 - just for the miraculous possibility of terminal (resulting
   in hang) reboot methods of triple fault or BIOS returning
   without having done their job, there's an ordering between
   them as well.

Reported-and-bisected-and-tested-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Li Aubrey <aubrey.li@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Garrett <mjg59@srcf.ucam.org>
Link: http://lkml.kernel.org/r/20140404064120.GB11877@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
11 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Linus Torvalds [Wed, 16 Apr 2014 03:30:30 +0000 (20:30 -0700)]
Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

Pull networking fixes from David Miller:

 1) Fix BPF filter validation of netlink attribute accesses, from
    Mathias Kruase.

 2) Netfilter conntrack generation seqcount not initialized properly,
    from Andrey Vagin.

 3) Fix comparison mask computation on big-endian in nft_cmp_fast(),
    from Patrick McHardy.

 4) Properly limit MTU over ipv6, from Eric Dumazet.

 5) Fix seccomp system call argument population on 32-bit, from Daniel
    Borkmann.

 6) skb_network_protocol() should not use hard-coded ETH_HLEN, instead
    skb->mac_len needs to be used.  From Vlad Yasevich.

 7) We have several cases of using socket based communications to
    implement a tunnel.  For example, some tunnels are encapsulations
    over UDP so we use an internal kernel UDP socket to do the
    transmits.

    These tunnels should behave just like other software devices and
    pass the packets on down to the next layer.

    Most importantly we want the top-level socket (eg TCP) that created
    the traffic to be charged for the SKB memory.

    However, once you get into the IP output path, we have code that
    assumed that whatever was attached to skb->sk is an IP socket.

    To keep the top-level socket being charged for the SKB memory,
    whilst satisfying the needs of the IP output path, we now pass in an
    explicit 'sk' argument.

    From Eric Dumazet.

 8) ping_init_sock() leaks group info, from Xiaoming Wang.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (33 commits)
  cxgb4: use the correct max size for firmware flash
  qlcnic: Fix MSI-X initialization code
  ip6_gre: don't allow to remove the fb_tunnel_dev
  ipv4: add a sock pointer to dst->output() path.
  ipv4: add a sock pointer to ip_queue_xmit()
  driver/net: cosa driver uses udelay incorrectly
  at86rf230: fix __at86rf230_read_subreg function
  at86rf230: remove check if AVDD settled
  net: cadence: Add architecture dependencies
  net: Start with correct mac_len in skb_network_protocol
  Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer"
  cxgb4: Save the correct mac addr for hw-loopback connections in the L2T
  net: filter: seccomp: fix wrong decoding of BPF_S_ANC_SECCOMP_LD_W
  seccomp: fix populating a0-a5 syscall args in 32-bit x86 BPF
  qlcnic: Do not disable SR-IOV when VFs are assigned to VMs
  qlcnic: Fix QLogic application/driver interface for virtual NIC configuration
  qlcnic: Fix PVID configuration on eSwitch port.
  qlcnic: Fix max ring count calculation
  qlcnic: Fix to send INIT_NIC_FUNC as first mailbox.
  qlcnic: Fix panic due to uninitialzed delayed_work struct in use.
  ...

11 years agocxgb4: use the correct max size for firmware flash
Steve Wise [Tue, 15 Apr 2014 19:22:34 +0000 (14:22 -0500)]
cxgb4: use the correct max size for firmware flash

The wrong max fw size was being used and causing false
"too big" errors running ethtool -f.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agoqlcnic: Fix MSI-X initialization code
Alexander Gordeev [Tue, 15 Apr 2014 09:37:14 +0000 (11:37 +0200)]
qlcnic: Fix MSI-X initialization code

Function qlcnic_setup_tss_rss_intr() might enter endless
loop in case pci_enable_msix() contiguously returns a
positive number of MSI-Xs that could have been allocated.
Besides, the function contains 'err = -EIO;' assignment
that never could be reached. This update fixes the
aforementioned issues.

Cc: Shahed Shaikh <shahed.shaikh@qlogic.com>
Cc: Dept-HSGLinuxNICDev@qlogic.com
Cc: netdev@vger.kernel.org
Cc: linux-pci@vger.kernel.org
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Acked-by: Shahed Shaikh <shahed.shaikh@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agoip6_gre: don't allow to remove the fb_tunnel_dev
Nicolas Dichtel [Mon, 14 Apr 2014 15:11:38 +0000 (17:11 +0200)]
ip6_gre: don't allow to remove the fb_tunnel_dev

It's possible to remove the FB tunnel with the command 'ip link del ip6gre0' but
this is unsafe, the module always supposes that this device exists. For example,
ip6gre_tunnel_lookup() may use it unconditionally.

Let's add a rtnl handler for dellink, which will never remove the FB tunnel (we
let ip6gre_destroy_tunnels() do the job).

Introduced by commit c12b395a4664 ("gre: Support GRE over IPv6").

CC: Dmitry Kozlov <xeb@mail.ru>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agoipv4: add a sock pointer to dst->output() path.
Eric Dumazet [Tue, 15 Apr 2014 17:47:15 +0000 (13:47 -0400)]
ipv4: add a sock pointer to dst->output() path.

In the dst->output() path for ipv4, the code assumes the skb it has to
transmit is attached to an inet socket, specifically via
ip_mc_output() : The sk_mc_loop() test triggers a WARN_ON() when the
provider of the packet is an AF_PACKET socket.

The dst->output() method gets an additional 'struct sock *sk'
parameter. This needs a cascade of changes so that this parameter can
be propagated from vxlan to final consumer.

Fixes: 8f646c922d55 ("vxlan: keep original skb ownership")
Reported-by: lucien xin <lucien.xin@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agoMerge tag 'perf-urgent-for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git...
Ingo Molnar [Tue, 15 Apr 2014 17:25:07 +0000 (19:25 +0200)]
Merge tag 'perf-urgent-for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf into perf/urgent

Pull perf/urgent fixes from Jiri Olsa:

  * Instead of redirecting flex output, use -o (Cody P Schafer)

  * Fix double free in perf test 21 (Adrian Hunter)

Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
11 years agoipv4: add a sock pointer to ip_queue_xmit()
Eric Dumazet [Tue, 15 Apr 2014 16:58:34 +0000 (12:58 -0400)]
ipv4: add a sock pointer to ip_queue_xmit()

ip_queue_xmit() assumes the skb it has to transmit is attached to an
inet socket. Commit 31c70d5956fc ("l2tp: keep original skb ownership")
changed l2tp to not change skb ownership and thus broke this assumption.

One fix is to add a new 'struct sock *sk' parameter to ip_queue_xmit(),
so that we do not assume skb->sk points to the socket used by l2tp
tunnel.

Fixes: 31c70d5956fc ("l2tp: keep original skb ownership")
Reported-by: Zhan Jianyu <nasa4836@gmail.com>
Tested-by: Zhan Jianyu <nasa4836@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agoirqchip: vic: Properly chain the cascaded IRQs
Linus Walleij [Tue, 15 Apr 2014 08:28:04 +0000 (10:28 +0200)]
irqchip: vic: Properly chain the cascaded IRQs

We are flagging the parent IRQ as chained, then we must also
make sure to call the chained_irq_[enter|exit] functions for
things to work smoothly.

Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Link: http://lkml.kernel.org/r/1397550484-7119-1-git-send-email-linus.walleij@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
11 years agoperf tools: Instead of redirecting flex output, use -o
Cody P Schafer [Mon, 14 Apr 2014 10:47:01 +0000 (12:47 +0200)]
perf tools: Instead of redirecting flex output, use -o

This gives us a real filename instead of having '<stdout>' show up all
over the place when debugging.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1396652539-2416-1-git-send-email-cody@linux.vnet.ibm.com
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
11 years agoperf tools: Fix double free in perf test 21 (code-reading.c)
Adrian Hunter [Thu, 10 Apr 2014 09:02:54 +0000 (12:02 +0300)]
perf tools: Fix double free in perf test 21 (code-reading.c)

perf_evlist__delete() deletes attached cpu and thread maps
but the test is still using them, so remove them from the
evlist before deleting it.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Link: http://lkml.kernel.org/r/53465E3E.8070201@intel.com
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
11 years agovideo: bf54x-lq043fb: fix build error
Steven Miao [Tue, 15 Apr 2014 07:17:19 +0000 (15:17 +0800)]
video: bf54x-lq043fb: fix build error

Fix build error by including linux/gpio.h. Also drop asm/gpio.h which is
not needed.

Signed-off-by: Steven Miao <realmz6@gmail.com>
Signed-off-by: Tomi Valkeinen <tomi.valkeinen@ti.com>
11 years agodriver/net: cosa driver uses udelay incorrectly
Li, Zhen-Hua [Tue, 15 Apr 2014 01:53:11 +0000 (09:53 +0800)]
driver/net: cosa driver uses udelay incorrectly

In cosa driver, udelay with more than 20000 may cause __bad_udelay.
Use msleep for instead.

Signed-off-by: Li, Zhen-Hua <zhen-hual@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agoat86rf230: fix __at86rf230_read_subreg function
Alexander Aring [Mon, 14 Apr 2014 16:48:02 +0000 (18:48 +0200)]
at86rf230: fix __at86rf230_read_subreg function

The __at86rf230_read_subreg function don't mask and shift register
contents which it should do. This patch adds the necessary masks and
shift operations in this function.

Since we have csma support this can make some trouble on state changes.
Since CSMA support turned on some bits in the TRX_STATUS register that
used to be zero, not masking broke checking of the TRX_STATUS field
after commanding a state change.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Reviewed-by: Werner Almesberger <werner@almesberger.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agoat86rf230: remove check if AVDD settled
Alexander Aring [Mon, 14 Apr 2014 16:48:01 +0000 (18:48 +0200)]
at86rf230: remove check if AVDD settled

The AVDD regulator is only enabled when the RF section is active TX_ON
(PLL_ON) state. Since commit 7dcbd22a97eb0689e6c583ad630ae0e7341e34c1
("ieee802154: ensure that first RF212 state comes from TRX_OFF").
We are in TRX_OFF state at the time at86rf230_hw_init is run.

Note that this test would only fail in case of a severe hardware
malfunction (faulty/shorted power supply, etc.) so it wasn't all that
useful in the first place.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Reviewed-by: Werner Almesberger <werner@almesberger.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agonet: cadence: Add architecture dependencies
Jean Delvare [Mon, 14 Apr 2014 13:38:49 +0000 (15:38 +0200)]
net: cadence: Add architecture dependencies

The Cadence ethernet chipsets are only used on specific ARM
architectures. Add Kconfig dependencies so that drivers for these
chipsets are only buildable on the relevant architectures.

Signed-off-by: Jean Delvare <jdelvare@suse.de>
Cc: Nicolas Ferre <nicolas.ferre@atmel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agoMerge git://git.kernel.org/pub/scm/virt/kvm/kvm
Linus Torvalds [Mon, 14 Apr 2014 23:21:28 +0000 (16:21 -0700)]
Merge git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM fixes from Marcelo Tosatti:
 - Fix for guest triggerable BUG_ON (CVE-2014-0155)
 - CR4.SMAP support
 - Spurious WARN_ON() fix

* git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: x86: remove WARN_ON from get_kernel_ns()
  KVM: Rename variable smep to cr4_smep
  KVM: expose SMAP feature to guest
  KVM: Disable SMAP for guests in EPT realmode and EPT unpaging mode
  KVM: Add SMAP support when setting CR4
  KVM: Remove SMAP bit from CR4_RESERVED_BITS
  KVM: ioapic: try to recover if pending_eoi goes out of range
  KVM: ioapic: fix assignment of ioapic->rtc_status.pending_eoi (CVE-2014-0155)

11 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Linus Torvalds [Mon, 14 Apr 2014 23:04:14 +0000 (16:04 -0700)]
Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6

Pull bmc2835 crypto fix from Herbert Xu:
 "This fixes a potential boot crash on bcm2835 due to the recent change
  that now causes hardware RNGs to be accessed on registration"

* git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  hwrng: bcm2835 - fix oops when rng h/w is accessed during registration

11 years agouser namespace: fix incorrect memory barriers
Mikulas Patocka [Mon, 14 Apr 2014 20:58:55 +0000 (16:58 -0400)]
user namespace: fix incorrect memory barriers

smp_read_barrier_depends() can be used if there is data dependency between
the readers - i.e. if the read operation after the barrier uses address
that was obtained from the read operation before the barrier.

In this file, there is only control dependency, no data dependecy, so the
use of smp_read_barrier_depends() is incorrect. The code could fail in the
following way:
* the cpu predicts that idx < entries is true and starts executing the
  body of the for loop
* the cpu fetches map->extent[0].first and map->extent[0].count
* the cpu fetches map->nr_extents
* the cpu verifies that idx < extents is true, so it commits the
  instructions in the body of the for loop

The problem is that in this scenario, the cpu read map->extent[0].first
and map->nr_extents in the wrong order. We need a full read memory barrier
to prevent it.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
11 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
David S. Miller [Mon, 14 Apr 2014 23:00:10 +0000 (19:00 -0400)]
Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains three Netfilter fixes for your net tree,
they are:

* Fix missing generation sequence initialization which results in a splat
  if lockdep is enabled, it was introduced in the recent works to improve
  nf_conntrack scalability, from Andrey Vagin.

* Don't flush the GRE keymap list in nf_conntrack when the pptp helper is
  disabled otherwise this crashes due to a double release, from Andrey
  Vagin.

* Fix nf_tables cmp fast in big endian, from Patrick McHardy.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agonet: Start with correct mac_len in skb_network_protocol
Vlad Yasevich [Mon, 14 Apr 2014 21:37:26 +0000 (17:37 -0400)]
net: Start with correct mac_len in skb_network_protocol

Sometimes, when the packet arrives at skb_mac_gso_segment()
its skb->mac_len already accounts for some of the mac lenght
headers in the packet.  This seems to happen when forwarding
through and OpenSSL tunnel.

When we start looking for any vlan headers in skb_network_protocol()
we seem to ignore any of the already known mac headers and start
with an ETH_HLEN.  This results in an incorrect offset, dropped
TSO frames and general slowness of the connection.

We can start counting from the known skb->mac_len
and return at least that much if all mac level headers
are known and accounted for.

Fixes: 53d6471cef17262d3ad1c7ce8982a234244f68ec (net: Account for all vlan headers in skb_mac_gso_segment)
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Daniel Borkman <dborkman@redhat.com>
Tested-by: Martin Filip <nexus+kernel@smoula.net>
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agoKVM: x86: remove WARN_ON from get_kernel_ns()
Marcelo Tosatti [Thu, 10 Apr 2014 21:19:12 +0000 (18:19 -0300)]
KVM: x86: remove WARN_ON from get_kernel_ns()

Function and callers can be preempted.

https://bugzilla.kernel.org/show_bug.cgi?id=73721

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
11 years agoKVM: Rename variable smep to cr4_smep
Feng Wu [Tue, 1 Apr 2014 09:56:48 +0000 (17:56 +0800)]
KVM: Rename variable smep to cr4_smep

Rename variable smep to cr4_smep, which can better reflect the
meaning of the variable.

Signed-off-by: Feng Wu <feng.wu@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
11 years agoKVM: expose SMAP feature to guest
Feng Wu [Tue, 1 Apr 2014 09:46:36 +0000 (17:46 +0800)]
KVM: expose SMAP feature to guest

This patch exposes SMAP feature to guest

Signed-off-by: Feng Wu <feng.wu@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
11 years agoKVM: Disable SMAP for guests in EPT realmode and EPT unpaging mode
Feng Wu [Tue, 1 Apr 2014 09:46:35 +0000 (17:46 +0800)]
KVM: Disable SMAP for guests in EPT realmode and EPT unpaging mode

SMAP is disabled if CPU is in non-paging mode in hardware.
However KVM always uses paging mode to emulate guest non-paging
mode with TDP. To emulate this behavior, SMAP needs to be
manually disabled when guest switches to non-paging mode.

Signed-off-by: Feng Wu <feng.wu@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
11 years agoKVM: Add SMAP support when setting CR4
Feng Wu [Tue, 1 Apr 2014 09:46:34 +0000 (17:46 +0800)]
KVM: Add SMAP support when setting CR4

This patch adds SMAP handling logic when setting CR4 for guests

Thanks a lot to Paolo Bonzini for his suggestion to use the branchless
way to detect SMAP violation.

Signed-off-by: Feng Wu <feng.wu@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
11 years agoKVM: Remove SMAP bit from CR4_RESERVED_BITS
Feng Wu [Tue, 1 Apr 2014 09:46:33 +0000 (17:46 +0800)]
KVM: Remove SMAP bit from CR4_RESERVED_BITS

This patch removes SMAP bit from CR4_RESERVED_BITS.

Signed-off-by: Feng Wu <feng.wu@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
11 years agoRevert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver...
Daniel Borkmann [Mon, 14 Apr 2014 19:45:17 +0000 (21:45 +0200)]
Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer"

This reverts commit ef2820a735f7 ("net: sctp: Fix a_rwnd/rwnd management
to reflect real state of the receiver's buffer") as it introduced a
serious performance regression on SCTP over IPv4 and IPv6, though a not
as dramatic on the latter. Measurements are on 10Gbit/s with ixgbe NICs.

Current state:

[root@Lab200slot2 ~]# iperf3 --sctp -4 -c 192.168.241.3 -V -l 1452 -t 60
iperf version 3.0.1 (10 January 2014)
Linux Lab200slot2 3.14.0 #1 SMP Thu Apr 3 23:18:29 EDT 2014 x86_64
Time: Fri, 11 Apr 2014 17:56:21 GMT
Connecting to host 192.168.241.3, port 5201
      Cookie: Lab200slot2.1397238981.812898.548918
[  4] local 192.168.241.2 port 38616 connected to 192.168.241.3 port 5201
Starting Test: protocol: SCTP, 1 streams, 1452 byte blocks, omitting 0 seconds, 60 second test
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-1.09   sec  20.8 MBytes   161 Mbits/sec
[  4]   1.09-2.13   sec  10.8 MBytes  86.8 Mbits/sec
[  4]   2.13-3.15   sec  3.57 MBytes  29.5 Mbits/sec
[  4]   3.15-4.16   sec  4.33 MBytes  35.7 Mbits/sec
[  4]   4.16-6.21   sec  10.4 MBytes  42.7 Mbits/sec
[  4]   6.21-6.21   sec  0.00 Bytes    0.00 bits/sec
[  4]   6.21-7.35   sec  34.6 MBytes   253 Mbits/sec
[  4]   7.35-11.45  sec  22.0 MBytes  45.0 Mbits/sec
[  4]  11.45-11.45  sec  0.00 Bytes    0.00 bits/sec
[  4]  11.45-11.45  sec  0.00 Bytes    0.00 bits/sec
[  4]  11.45-11.45  sec  0.00 Bytes    0.00 bits/sec
[  4]  11.45-12.51  sec  16.0 MBytes   126 Mbits/sec
[  4]  12.51-13.59  sec  20.3 MBytes   158 Mbits/sec
[  4]  13.59-14.65  sec  13.4 MBytes   107 Mbits/sec
[  4]  14.65-16.79  sec  33.3 MBytes   130 Mbits/sec
[  4]  16.79-16.79  sec  0.00 Bytes    0.00 bits/sec
[  4]  16.79-17.82  sec  5.94 MBytes  48.7 Mbits/sec
(etc)

[root@Lab200slot2 ~]#  iperf3 --sctp -6 -c 2001:db8:0:f101::1 -V -l 1400 -t 60
iperf version 3.0.1 (10 January 2014)
Linux Lab200slot2 3.14.0 #1 SMP Thu Apr 3 23:18:29 EDT 2014 x86_64
Time: Fri, 11 Apr 2014 19:08:41 GMT
Connecting to host 2001:db8:0:f101::1, port 5201
      Cookie: Lab200slot2.1397243321.714295.2b3f7c
[  4] local 2001:db8:0:f101::2 port 55804 connected to 2001:db8:0:f101::1 port 5201
Starting Test: protocol: SCTP, 1 streams, 1400 byte blocks, omitting 0 seconds, 60 second test
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-1.00   sec   169 MBytes  1.42 Gbits/sec
[  4]   1.00-2.00   sec   201 MBytes  1.69 Gbits/sec
[  4]   2.00-3.00   sec   188 MBytes  1.58 Gbits/sec
[  4]   3.00-4.00   sec   174 MBytes  1.46 Gbits/sec
[  4]   4.00-5.00   sec   165 MBytes  1.39 Gbits/sec
[  4]   5.00-6.00   sec   199 MBytes  1.67 Gbits/sec
[  4]   6.00-7.00   sec   163 MBytes  1.36 Gbits/sec
[  4]   7.00-8.00   sec   174 MBytes  1.46 Gbits/sec
[  4]   8.00-9.00   sec   193 MBytes  1.62 Gbits/sec
[  4]   9.00-10.00  sec   196 MBytes  1.65 Gbits/sec
[  4]  10.00-11.00  sec   157 MBytes  1.31 Gbits/sec
[  4]  11.00-12.00  sec   175 MBytes  1.47 Gbits/sec
[  4]  12.00-13.00  sec   192 MBytes  1.61 Gbits/sec
[  4]  13.00-14.00  sec   199 MBytes  1.67 Gbits/sec
(etc)

After patch:

[root@Lab200slot2 ~]#  iperf3 --sctp -4 -c 192.168.240.3 -V -l 1452 -t 60
iperf version 3.0.1 (10 January 2014)
Linux Lab200slot2 3.14.0+ #1 SMP Mon Apr 14 12:06:40 EDT 2014 x86_64
Time: Mon, 14 Apr 2014 16:40:48 GMT
Connecting to host 192.168.240.3, port 5201
      Cookie: Lab200slot2.1397493648.413274.65e131
[  4] local 192.168.240.2 port 50548 connected to 192.168.240.3 port 5201
Starting Test: protocol: SCTP, 1 streams, 1452 byte blocks, omitting 0 seconds, 60 second test
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-1.00   sec   240 MBytes  2.02 Gbits/sec
[  4]   1.00-2.00   sec   239 MBytes  2.01 Gbits/sec
[  4]   2.00-3.00   sec   240 MBytes  2.01 Gbits/sec
[  4]   3.00-4.00   sec   239 MBytes  2.00 Gbits/sec
[  4]   4.00-5.00   sec   245 MBytes  2.05 Gbits/sec
[  4]   5.00-6.00   sec   240 MBytes  2.01 Gbits/sec
[  4]   6.00-7.00   sec   240 MBytes  2.02 Gbits/sec
[  4]   7.00-8.00   sec   239 MBytes  2.01 Gbits/sec

With the reverted patch applied, the SCTP/IPv4 performance is back
to normal on latest upstream for IPv4 and IPv6 and has same throughput
as 3.4.2 test kernel, steady and interval reports are smooth again.

Fixes: ef2820a735f7 ("net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer")
Reported-by: Peter Butler <pbutler@sonusnet.com>
Reported-by: Dongsheng Song <dongsheng.song@gmail.com>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Tested-by: Peter Butler <pbutler@sonusnet.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Matija Glavinic Pecotic <matija.glavinic-pecotic.ext@nsn.com>
Cc: Alexander Sverdlin <alexander.sverdlin@nsn.com>
Cc: Vlad Yasevich <vyasevich@gmail.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agocxgb4: Save the correct mac addr for hw-loopback connections in the L2T
Steve Wise [Mon, 14 Apr 2014 19:22:43 +0000 (14:22 -0500)]
cxgb4: Save the correct mac addr for hw-loopback connections in the L2T

Hardware needs the local device mac address to support hw loopback for
rdma loopback connections.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agonet: filter: seccomp: fix wrong decoding of BPF_S_ANC_SECCOMP_LD_W
Daniel Borkmann [Mon, 14 Apr 2014 19:20:12 +0000 (21:20 +0200)]
net: filter: seccomp: fix wrong decoding of BPF_S_ANC_SECCOMP_LD_W

While reviewing seccomp code, we found that BPF_S_ANC_SECCOMP_LD_W has
been wrongly decoded by commit a8fc927780 ("sk-filter: Add ability to
get socket filter program (v2)") into the opcode BPF_LD|BPF_B|BPF_ABS
although it should have been decoded as BPF_LD|BPF_W|BPF_ABS.

In practice, this should not have much side-effect though, as such
conversion is/was being done through prctl(2) PR_SET_SECCOMP. Reverse
operation PR_GET_SECCOMP will only return the current seccomp mode, but
not the filter itself. Since the transition to the new BPF infrastructure,
it's also not used anymore, so we can simply remove this as it's
unreachable.

Fixes: a8fc927780 ("sk-filter: Add ability to get socket filter program (v2)")
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agoseccomp: fix populating a0-a5 syscall args in 32-bit x86 BPF
Daniel Borkmann [Mon, 14 Apr 2014 19:02:59 +0000 (21:02 +0200)]
seccomp: fix populating a0-a5 syscall args in 32-bit x86 BPF

Linus reports that on 32-bit x86 Chromium throws the following seccomp
resp. audit log messages:

  audit: type=1326 audit(1397359304.356:28108): auid=500 uid=500
gid=500 ses=2 subj=unconfined_u:unconfined_r:chrome_sandbox_t:s0-s0:c0.c1023
pid=3677 comm="chrome" exe="/opt/google/chrome/chrome" sig=0
syscall=172 compat=0 ip=0xb2dd9852 code=0x30000

  audit: type=1326 audit(1397359304.356:28109): auid=500 uid=500
gid=500 ses=2 subj=unconfined_u:unconfined_r:chrome_sandbox_t:s0-s0:c0.c1023
pid=3677 comm="chrome" exe="/opt/google/chrome/chrome" sig=0 syscall=5
compat=0 ip=0xb2dd9852 code=0x50000

These audit messages are being triggered via audit_seccomp() through
__secure_computing() in seccomp mode (BPF) filter with seccomp return
codes 0x30000 (== SECCOMP_RET_TRAP) and 0x50000 (== SECCOMP_RET_ERRNO)
during filter runtime. Moreover, Linus reports that x86_64 Chromium
seems fine.

The underlying issue that explains this is that the implementation of
populate_seccomp_data() is wrong. Our seccomp data structure sd that
is being shared with user ABI is:

  struct seccomp_data {
    int nr;
    __u32 arch;
    __u64 instruction_pointer;
    __u64 args[6];
  };

Therefore, a simple cast to 'unsigned long *' for storing the value of
the syscall argument via syscall_get_arguments() is just wrong as on
32-bit x86 (or any other 32bit arch), it would result in storing a0-a5
at wrong offsets in args[] member, and thus i) could leak stack memory
to user space and ii) tampers with the logic of seccomp BPF programs
that read out and check for syscall arguments:

  syscall_get_arguments(task, regs, 0, 1, (unsigned long *) &sd->args[0]);

Tested on 32-bit x86 with Google Chrome, unfortunately only via remote
test machine through slow ssh X forwarding, but it fixes the issue on
my side. So fix it up by storing args in type correct variables, gcc
is clever and optimizes the copy away in other cases, e.g. x86_64.

Fixes: bd4cf0ed331a ("net: filter: rework/optimize internal BPF interpreter's instruction set")
Reported-and-bisected-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Eric Paris <eparis@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agoMerge branch 'qlcnic'
David S. Miller [Mon, 14 Apr 2014 17:43:58 +0000 (13:43 -0400)]
Merge branch 'qlcnic'

Shahed Shaikh says:

====================
qlcnic: Bug fixes

This patch series contains following bug fixes -

* Send INIT_NIC_FUNC mailbox command as first mailbox
* Fix a panic because of uninitialized delayed_work.
* Fix inconsistent calculation of max rings count.
* Fix PVID configuration issue. Driver needs to clear older
  PVID before adding new one.
* Fix QLogic application/driver interface by packing vNIC information
  array.
* Fix a crash when user tries to disable SR-IOV while VFs are
  still assigned to VMs.

Please apply to net.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agoqlcnic: Do not disable SR-IOV when VFs are assigned to VMs
Manish Chopra [Mon, 14 Apr 2014 14:02:23 +0000 (10:02 -0400)]
qlcnic: Do not disable SR-IOV when VFs are assigned to VMs

o While disabling SR-IOV when VFs are assigned to VMs causes host crash
  so return -EPERM when user request to disable SR-IOV using pci sysfs in
  case of VFs are assigned to VMs.

Signed-off-by: Manish Chopra <manish.chopra@qlogic.com>
Signed-off-by: Shahed Shaikh <shahed.shaikh@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agoqlcnic: Fix QLogic application/driver interface for virtual NIC configuration
Jitendra Kalsaria [Mon, 14 Apr 2014 14:02:22 +0000 (10:02 -0400)]
qlcnic: Fix QLogic application/driver interface for virtual NIC configuration

o Application expect vNIC number as the array index but driver interface
return configuration in array index form.

o Pack the vNIC information array in the buffer such that application can
access it using vNIC number as the array index.

Signed-off-by: Jitendra Kalsaria <jitendra.kalsaria@qlogic.com>
Signed-off-by: Shahed Shaikh <shahed.shaikh@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agoqlcnic: Fix PVID configuration on eSwitch port.
Jitendra Kalsaria [Mon, 14 Apr 2014 14:02:21 +0000 (10:02 -0400)]
qlcnic: Fix PVID configuration on eSwitch port.

Clear older PVID before adding a newer PVID to the eSwicth port

Signed-off-by: Jitendra Kalsaria <jitendra.kalsaria@qlogic.com>
Signed-off-by: Shahed Shaikh <shahed.shaikh@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agoqlcnic: Fix max ring count calculation
Shahed Shaikh [Mon, 14 Apr 2014 14:02:20 +0000 (10:02 -0400)]
qlcnic: Fix max ring count calculation

Do not read max rings count from qlcnic_get_nic_info(). Use driver defined
values for 82xx adapters. In case of 83xx adapters, use minimum of firmware
provided and driver defined values.

Signed-off-by: Shahed Shaikh <shahed.shaikh@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agoqlcnic: Fix to send INIT_NIC_FUNC as first mailbox.
Sucheta Chakraborty [Mon, 14 Apr 2014 14:02:19 +0000 (10:02 -0400)]
qlcnic: Fix to send INIT_NIC_FUNC as first mailbox.

o INIT_NIC_FUNC should be first mailbox sent. Sending DCB capability and
  parameter query commands after that command.

Signed-off-by: Sucheta Chakraborty <sucheta.chakraborty@qlogic.com>
Signed-off-by: Shahed Shaikh <shahed.shaikh@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agoqlcnic: Fix panic due to uninitialzed delayed_work struct in use.
Sucheta Chakraborty [Mon, 14 Apr 2014 14:02:18 +0000 (10:02 -0400)]
qlcnic: Fix panic due to uninitialzed delayed_work struct in use.

o AEN event was being received before initializing delayed_work struct
  and handlers for it. This was resulting in crash. This patch fixes it.

Signed-off-by: Sucheta Chakraborty <sucheta.chakraborty@qlogic.com>
Signed-off-by: Shahed Shaikh <shahed.shaikh@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agoMerge branch 'be2net'
David S. Miller [Mon, 14 Apr 2014 17:41:42 +0000 (13:41 -0400)]
Merge branch 'be2net'

Sathya Perla says:

====================
be2net: patch set

Patch 1/2 is a v2 of a patch that was submitted earlier (as a part of a
different patch-set). v2 incorporates a suggestion given by David Laight
for how long to poll for pending TX completions while disabling a device.

Patch 2/2 fixes a crash in be_remove()->be_close()
path after be2net has aborted an EEH error recovery
due to a permanant failure.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agobe2net: Fix invocation of be_close() after be_clear()
Kalesh AP [Mon, 14 Apr 2014 10:42:41 +0000 (16:12 +0530)]
be2net: Fix invocation of be_close() after be_clear()

In the EEH error recovery path, when a permanent failure occurs,
we clean up adapter structure (i.e. destroy queues etc) by calling
be_clear() and return PCI_ERS_RESULT_DISCONNECT.
After this the stack tries to remove device from bus and calls
be_remove() which invokes netdev_unregister()->be_close().
be_close() operating on destroyed queues results in a
NULL dereference.

This patch fixes this problem by introducing a flag to keep track
of the setup state.

Signed-off-by: Kalesh AP <kalesh.purayil@emulex.com>
Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agobe2net: Fix to reap TX compls till HW doesn't respond for some time
Vasundhara Volam [Mon, 14 Apr 2014 10:42:40 +0000 (16:12 +0530)]
be2net: Fix to reap TX compls till HW doesn't respond for some time

be_close() currently waits for a max of 200ms to receive all pending
TX compls. This timeout value was roughly calculated based on 10G
transmission speeds and the TX queue depth. This timeout may not be
enough when the link is operating at lower speeds or in multi-channel/SR-IOV
configs with TX-rate limiting setting.

It is hard to calculate a "proper timeout value" that works in all
configurations.  This patch solves this problem by continuing to reap
TX completions till the HW is completely silent for a period of 10ms or
a HW error is detected.

v2: implements the new scheme (as suggested by David Laight) instead of
just waiting longer than 200ms for reaping all completions.

Signed-off-by: Vasundhara Volam <vasundhara.volam@emulex.com>
Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agonet/mlx4_core: Defer VF initialization till PF is fully initialized
Amir Vadai [Mon, 14 Apr 2014 08:17:22 +0000 (11:17 +0300)]
net/mlx4_core: Defer VF initialization till PF is fully initialized

Fix in commit [1] is not sufficient since a deferred VF initialization
could happen after pci_enable_sriov() is finished, but before the PF is
fully initialized.
Need to prevent VFs from initializing till the PF is fully ready and
comm channel is operational.

[1] - 9798935 "net/mlx4_core: mlx4_init_slave() shouldn't access comm
      channel before PF is ready"

CC: Stuart Hayes <Stuart_Hayes@Dell.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agobnx2: Don't build unused suspend/resume functions not enabled
Daniel J Blueman [Fri, 11 Apr 2014 08:14:26 +0000 (16:14 +0800)]
bnx2: Don't build unused suspend/resume functions not enabled

When CONFIG_PM_SLEEP isn't enabled, bnx2_suspend/resume are unused; don't
build them when they aren't used.

Signed-off-by: Daniel J Blueman <daniel@quora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agoipv6: Limit mtu to 65575 bytes
Eric Dumazet [Fri, 11 Apr 2014 04:23:36 +0000 (21:23 -0700)]
ipv6: Limit mtu to 65575 bytes

Francois reported that setting big mtu on loopback device could prevent
tcp sessions making progress.

We do not support (yet ?) IPv6 Jumbograms and cook corrupted packets.

We must limit the IPv6 MTU to (65535 + 40) bytes in theory.

Tested:

ifconfig lo mtu 70000
netperf -H ::1

Before patch : Throughput :   0.05 Mbits

After patch : Throughput : 35484 Mbits

Reported-by: Francois WELLENREITER <f.wellenreiter@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agoMerge branch 'perf-core-for-mingo' into perf/urgent
Ingo Molnar [Mon, 14 Apr 2014 14:45:39 +0000 (16:45 +0200)]
Merge branch 'perf-core-for-mingo' into perf/urgent

Conflicts:
tools/perf/bench/numa.c

Pull perf fixes from Jiri Olsa.

Signed-off-by: Ingo Molnar <mingo@kernel.org>