The kernel is very erratic as to which pagetables have _PAGE_USER set. The
vsyscall page gets lucky: it seems that all of the relevant pagetables are
among the apparently arbitrary ones that set _PAGE_USER. Rather than
relying on chance, just explicitly set _PAGE_USER.
This will let us clean up pagetable setup to stop setting _PAGE_USER. The
added code can also be reused by pagetable isolation to manage the
_PAGE_USER bit in the usermode tables.
[ tglx: Folded paravirt fix from Juergen Gross ]
Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Laight <David.Laight@aculab.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The address hints are a trainwreck. The array entry numbers have to kept
magically in sync with the actual hints, which is doomed as some of the
array members are initialized at runtime via the entry numbers.
Designated initializers have been around before this code was
implemented....
Use the entry numbers to populate the address hints array and add the
missing bits and pieces. Split 32 and 64 bit for readability sake.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
is bogus. If a PTE is set to PAGE_NONE then the pgprot_val is not zero and
the entry is decoded in bogus ways, e.g. as RX GLB. That is confusing when
analyzing mapping correctness. Check for the present bit to make an
informed decision.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[note, only the inat.h portion, to get objtool back in sync - gregkh]
b0caa8c8c6bbc422bc3c32b64852d6d618f32b49 Mon Sep 17 00:00:00 2001
When computing a linear address and segmentation is used, we need to know
the base address of the segment involved in the computation. In most of
the cases, the segment base address will be zero as in USER_DS/USER32_DS.
However, it may be possible that a user space program defines its own
segments via a local descriptor table. In such a case, the segment base
address may not be zero. Thus, the segment base address is needed to
calculate correctly the linear address.
If running in protected mode, the segment selector to be used when
computing a linear address is determined by either any of segment override
prefixes in the instruction or inferred from the registers involved in the
computation of the effective address; in that order. Also, there are cases
when the segment override prefixes shall be ignored (i.e., code segments
are always selected by the CS segment register; string instructions always
use the ES segment register when using rDI register as operand). In long
mode, segment registers are ignored, except for FS and GS. In these two
cases, base addresses are obtained from the respective MSRs.
For clarity, this process can be split into four steps (and an equal
number of functions): determine if segment prefixes overrides can be used;
parse the segment override prefixes, and use them if found; if not found
or cannot be used, use the default segment registers associated with the
operand registers. Once the segment register to use has been identified,
read its value to obtain the segment selector.
The method to obtain the segment selector depends on several factors. In
32-bit builds, segment selectors are saved into a pt_regs structure
when switching to kernel mode. The same is also true for virtual-8086
mode. In 64-bit builds, segmentation is mostly ignored, except when
running a program in 32-bit legacy mode. In this case, CS and SS can be
obtained from pt_regs. DS, ES, FS and GS can be read directly from
the respective segment registers.
In order to identify the segment registers, a new set of #defines is
introduced. It also includes two special identifiers. One of them
indicates when the default segment register associated with instruction
operands shall be used. Another one indicates that the contents of the
segment register shall be ignored; this identifier is used when in long
mode.
Improvements-by: Borislav Petkov <bp@suse.de> Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Borislav Petkov <bp@suse.de> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: ricardo.neri@intel.com Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Huang Rui <ray.huang@amd.com> Cc: Qiaowei Ren <qiaowei.ren@intel.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Brian Gerst <brgerst@gmail.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Colin Ian King <colin.king@canonical.com> Cc: Chen Yucong <slaoub@gmail.com> Cc: Adam Buchbinder <adam.buchbinder@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Thomas Garnier <thgarnie@google.com> Link: https://lkml.kernel.org/r/1509135945-13762-14-git-send-email-ricardo.neri-calderon@linux.intel.com Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The new ORC unwinder breaks the build of a 64-bit kernel on a 32-bit
host. Building the kernel on a i386 or x32 host fails with:
orc_dump.c: In function 'orc_dump':
orc_dump.c:105:26: error: passing argument 2 of 'elf_getshdrnum' from incompatible pointer type [-Werror=incompatible-pointer-types]
if (elf_getshdrnum(elf, &nr_sections)) {
^
In file included from /usr/local/include/gelf.h:32:0,
from elf.h:22,
from warn.h:26,
from orc_dump.c:20:
/usr/local/include/libelf.h:304:12: note: expected 'size_t * {aka unsigned int *}' but argument is of type 'long unsigned int *'
extern int elf_getshdrnum (Elf *__elf, size_t *__dst);
^~~~~~~~~~~~~~
orc_dump.c:190:17: error: format '%lx' expects argument of type 'long unsigned int', but argument 3 has type 'Elf64_Sxword {aka long long int}' [-Werror=format=]
printf("%s+%lx:", name, rela.r_addend);
~~^ ~~~~~~~~~~~~~
%llx
Fix the build failure.
Another problem is that if the user specifies HOSTCC or HOSTLD
variables, they are ignored in the objtool makefile. Change the
Makefile to respect these variables.
linux/compiler.h is included indirectly by linux/types.h via
uapi/linux/types.h -> uapi/linux/posix_types.h -> linux/stddef.h
-> uapi/linux/stddef.h and is needed to provide a proper definition of
offsetof.
Unfortunately, compiler.h requires a definition of
smp_read_barrier_depends() for defining lockless_dereference() and soon
for defining READ_ONCE(), which means that all
users of READ_ONCE() will need to include asm/barrier.h to avoid splats
such as:
In file included from include/uapi/linux/stddef.h:1:0,
from include/linux/stddef.h:4,
from arch/h8300/kernel/asm-offsets.c:11:
include/linux/list.h: In function 'list_empty':
>> include/linux/compiler.h:343:2: error: implicit declaration of function 'smp_read_barrier_depends' [-Werror=implicit-function-declaration]
smp_read_barrier_depends(); /* Enforce dependency ordering from x */ \
^
A better alternative is to include asm/barrier.h in linux/compiler.h,
but this requires a type definition for "bool" on some architectures
(e.g. x86), which is defined later by linux/types.h. Type "bool" is also
used directly in linux/compiler.h, so the whole thing is pretty fragile.
This patch splits compiler.h in two: compiler_types.h contains type
annotations, definitions and the compiler-specific parts, whereas
compiler.h #includes compiler-types.h and additionally defines macros
such as {READ,WRITE.ACCESS}_ONCE().
uapi/linux/stddef.h and linux/linkage.h are then moved over to include
linux/compiler_types.h, which fixes the build for h8 and blackfin.
Signed-off-by: Will Deacon <will.deacon@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1508840570-22169-2-git-send-email-will.deacon@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
- MOV with both zero-extended and sign-extended immediates
- implicit truncation of register contents via ALU32/MOV32
- implicit 32-bit truncation of ALU32 output
- oversized register source operand for ALU32 shift
- right-shift of a number that could be positive or negative
- map access where adding the operation size to the offset causes signed
32-bit overflow
- direct stack access at a ~4GiB offset
Also remove the F_LOAD_WITH_STRICT_ALIGNMENT flag from a bunch of tests
that should fail independent of what flags userspace passes.
There were various issues related to the limited size of integers used in
the verifier:
- `off + size` overflow in __check_map_access()
- `off + reg->off` overflow in check_mem_access()
- `off + reg->var_off.value` overflow or 32-bit truncation of
`reg->var_off.value` in check_mem_access()
- 32-bit truncation in check_stack_boundary()
Make sure that any integer math cannot overflow by not allowing
pointer math with large values.
Also reduce the scope of "scalar op scalar" tracking.
This could be made safe by passing through a reference to env and checking
for env->allow_ptr_leaks, but it would only work one way and is probably
not worth the hassle - not doing it will not directly lead to program
rejection.
Force strict alignment checks for stack pointers because the tracking of
stack spills relies on it; unaligned stack accesses can lead to corruption
of spilled registers, which is exploitable.
Properly handle register truncation to a smaller size.
The old code first mirrors the clearing of the high 32 bits in the bitwise
tristate representation, which is correct. But then, it computes the new
arithmetic bounds as the intersection between the old arithmetic bounds and
the bounds resulting from the bitwise tristate representation. Therefore,
when coerce_reg_to_32() is called on a number with bounds
[0xffff'fff8, 0x1'0000'0007], the verifier computes
[0xffff'fff8, 0xffff'ffff] as bounds of the truncated number.
This is incorrect: The truncated number could also be in the range [0, 7],
and no meaningful arithmetic bounds can be computed in that case apart from
the obvious [0, 0xffff'ffff].
Starting with v4.14, this is exploitable by unprivileged users as long as
the unprivileged_bpf_disabled sysctl isn't set.
Debian assigned CVE-2017-16996 for this issue.
v2:
- flip the mask during arithmetic bounds calculation (Ben Hutchings)
v3:
- add CVE number (Ben Hutchings)
Fixes: b03c9f9fdc37 ("bpf/verifier: track signed and unsigned min/max values") Signed-off-by: Jann Horn <jannh@google.com> Acked-by: Edward Cree <ecree@solarflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Distinguish between
BPF_ALU64|BPF_MOV|BPF_K (load 32-bit immediate, sign-extended to 64-bit)
and BPF_ALU|BPF_MOV|BPF_K (load 32-bit immediate, zero-padded to 64-bit);
only perform sign extension in the first case.
Starting with v4.14, this is exploitable by unprivileged users as long as
the unprivileged_bpf_disabled sysctl isn't set.
Debian assigned CVE-2017-16995 for this issue.
v3:
- add CVE number (Ben Hutchings)
Fixes: 484611357c19 ("bpf: allow access into map value arrays") Signed-off-by: Jann Horn <jannh@google.com> Acked-by: Edward Cree <ecree@solarflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Incorrect signed bounds were being computed.
If the old upper signed bound was positive and the old lower signed bound was
negative, this could cause the new upper signed bound to be too low,
leading to security issues.
Fixes: b03c9f9fdc37 ("bpf/verifier: track signed and unsigned min/max values") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Edward Cree <ecree@solarflare.com> Acked-by: Alexei Starovoitov <ast@kernel.org>
[jannh@google.com: changed description to reflect bug impact] Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When LD_ABS/IND is used in the program, and we have a BPF helper
call that changes packet data (bpf_helper_changes_pkt_data() returns
true), then in case of sparc JIT, we try to reload cached skb data
from bpf2sparc[BPF_REG_6]. However, there is no such guarantee or
assumption that skb sits in R6 at this point, all helpers changing
skb data only have a guarantee that skb sits in R1. Therefore,
store BPF R1 in L7 temporarily and after procedure call use L7 to
reload cached skb data. skb sitting in R6 is only true at the time
when LD_ABS/IND is executed.
Fixes: 7a12b5031c6b ("sparc64: Add eBPF JIT.") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The assumption of unconditionally reloading skb pointers on
BPF helper calls where bpf_helper_changes_pkt_data() holds
true is wrong. There can be different contexts where the helper
would enforce a reload such as in case of XDP. Here, we do
have a struct xdp_buff instead of struct sk_buff as context,
thus this will access garbage.
JITs only ever need to deal with cached skb pointer reload
when ld_abs/ind was seen, therefore guard the reload behind
SEEN_SKB.
Fixes: 156d0e290e96 ("powerpc/ebpf/jit: Implement JIT compiler for extended BPF") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Tested-by: Sandipan Das <sandipan@linux.vnet.ibm.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The assumption of unconditionally reloading skb pointers on
BPF helper calls where bpf_helper_changes_pkt_data() holds
true is wrong. There can be different contexts where the
BPF helper would enforce a reload such as in case of XDP.
Here, we do have a struct xdp_buff instead of struct sk_buff
as context, thus this will access garbage.
JITs only ever need to deal with cached skb pointer reload
when ld_abs/ind was seen, therefore guard the reload behind
SEEN_SKB only. Tested on s390x.
Fixes: 9db7f2b81880 ("s390/bpf: recache skb->data/hlen for skb_vlan_push/pop") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When tracing and networking programs are both attached in the
system and both use event-output helpers that eventually call
into perf_event_output(), then we could end up in a situation
where the tracing attached program runs in user context while
a cls_bpf program is triggered on that same CPU out of softirq
context.
Since both rely on the same per-cpu perf_sample_data, we could
potentially corrupt it. This can only ever happen in a combination
of the two types; all tracing programs use a bpf_prog_active
counter to bail out in case a program is already running on
that CPU out of a different context. XDP and cls_bpf programs
by themselves don't have this issue as they run in the same
context only. Therefore, split both perf_sample_data so they
cannot be accessed from each other.
Fixes: 20b9d7ac4852 ("bpf: avoid excessive stack usage for perf_sample_data") Reported-by: Alexei Starovoitov <ast@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Song Liu <songliubraving@fb.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
when the verifier detects that register contains a runtime constant
and it's compared with another constant it will prune exploration
of the branch that is guaranteed not to be taken at runtime.
This is all correct, but malicious program may be constructed
in such a way that it always has a constant comparison and
the other branch is never taken under any conditions.
In this case such path through the program will not be explored
by the verifier. It won't be taken at run-time either, but since
all instructions are JITed the malicious program may cause JITs
to complain about using reserved fields, etc.
To fix the issue we have to track the instructions explored by
the verifier and sanitize instructions that are dead at run time
with NOPs. We cannot reject such dead code, since llvm generates
it for valid C code, since it doesn't do as much data flow
analysis as the verifier does.
83e3c48729d9 ("mm/sparsemem: Allocate mem_section at runtime for CONFIG_SPARSEMEM_EXTREME=y")
we allocate the mem_section array dynamically in sparse_memory_present_with_active_regions(),
but some architectures, like arm64, don't call the routine to initialize sparsemem.
Let's move the initialization into memory_present() it should cover all
architectures.
Reported-and-tested-by: Sudeep Holla <sudeep.holla@arm.com> Tested-by: Bjorn Andersson <bjorn.andersson@linaro.org> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Will Deacon <will.deacon@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Fixes: 83e3c48729d9 ("mm/sparsemem: Allocate mem_section at runtime for CONFIG_SPARSEMEM_EXTREME=y") Link: http://lkml.kernel.org/r/20171107083337.89952-1-kirill.shutemov@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Dan Rue <dan.rue@linaro.org> Cc: Naresh Kamboju <naresh.kamboju@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Sending the switch state change twice within the same frame is invalid
evdev protocol and only works if the client handles keys immediately as
well. Processing events immediately is incorrect, it forces a fake
order of events that does not exist on the device.
Recent versions of libinput changed to only process the device state and
SYN_REPORT time, so now the key event is lost.
The DT specifies a threshold of 65000, we setup the register with a value in
the temperature resolution for the controller, 64656.
When we reach 64656, the interrupt fires, the interrupt is disabled. Then the
irq thread runs and calls thermal_zone_device_update() which will call in turn
hisi_thermal_get_temp().
The function will look if the temperature decreased, assuming it was more than
65000, but that is not the case because the current temperature is 64656
(because of the rounding when setting the threshold). This condition being
true, we re-enable the interrupt which fires immediately after exiting the irq
thread. That happens again and again until the temperature goes to more than
65000.
Potentially, there is here an interrupt storm if the temperature stabilizes at
this temperature. A very unlikely case but possible.
In any case, it does not make sense to handle dozens of alarm interrupt for
nothing.
Fix this by rounding the threshold value to the controller resolution so the
check against the threshold is consistent with the one set in the controller.
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Reviewed-by: Leo Yan <leo.yan@linaro.org> Tested-by: Leo Yan <leo.yan@linaro.org> Signed-off-by: Eduardo Valentin <edubezval@gmail.com> Signed-off-by: Kevin Wangtao <kevin.wangtao@hisilicon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The step and the base temperature are fixed values, we can simplify the
computation by converting the base temperature to milli celsius and use a
pre-computed step value. That saves us a lot of mult + div for nothing at
runtime.
Take also the opportunity to change the function names to be consistent with
the rest of the code.
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Reviewed-by: Leo Yan <leo.yan@linaro.org> Tested-by: Leo Yan <leo.yan@linaro.org> Signed-off-by: Eduardo Valentin <edubezval@gmail.com> Signed-off-by: Kevin Wangtao <kevin.wangtao@hisilicon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The threaded interrupt for the alarm interrupt is requested before the
temperature controller is setup. This one can fire an interrupt immediately
leading to a kernel panic as the sensor data is not initialized.
In order to prevent that, move the threaded irq after the Tsensor is setup.
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Reviewed-by: Leo Yan <leo.yan@linaro.org> Tested-by: Leo Yan <leo.yan@linaro.org> Signed-off-by: Eduardo Valentin <edubezval@gmail.com> Signed-off-by: Kevin Wangtao <kevin.wangtao@hisilicon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
As we are always enabling the interrupt, it is pointless to check if
the interrupt is masked or not, just set irq_enabled to 'true'.
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Reviewed-by: Leo Yan <leo.yan@linaro.org> Tested-by: Leo Yan <leo.yan@linaro.org> Signed-off-by: Eduardo Valentin <edubezval@gmail.com> Signed-off-by: Kevin Wangtao <kevin.wangtao@hisilicon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The incoming mode might have a missing vrefresh field if it came from
drmModeSetCrtc(), which the kernel is supposed to calculate using
drm_mode_vrefresh(). We could either use that or the adjusted_mode's
original vrefresh value.
However, we can maintain a more exact vrefresh value (not just the
integer approximation), by scaling by the ratio of our clocks.
v2: Use math suggested by Andrzej Hajda instead.
v3: Simplify math now that adjusted_mode->clock isn't padded.
v4: Drop some parens.
When failing to enter broadcast timer mode for an idle state that
requires it, a new state is selected that does not require broadcast,
but the broadcast variable remains set. This causes
tick_broadcast_exit to be called despite not having entered broadcast
mode.
This causes the WARN_ON_ONCE(!irqs_disabled()) to trigger in some
cases. It does not appear to cause problems for code today, but seems
to violate the interface so should be fixed.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If there is any non expired timer in the queue, the RTC alarm is never set.
This is an issue when adding a timer that expires before the next non
expired timer.
With the commit 76174004a0f19785 (tcp: do not slow start when cwnd equals
ssthresh), the comparison to the reduced cwnd in tcp_vegas_ssthresh() would
under-evaluate the ssthresh.
Signed-off-by: Hoang Tran <hoang.tran@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
After the loop in hns_roce_v1_mr_free_work_fn function, it is possible that
all qps will have been freed (in which case ne will be 0). If that
happens, then later in the function when we dereference hr_qp we will
get an exception. Check ne is not 0 to make sure we actually have an
hr_qp left to work on.
This patch fixes the smatch error as below:
drivers/infiniband/hw/hns/hns_roce_hw_v1.c:1009 hns_roce_v1_mr_free_work_fn()
error: we previously assumed 'hr_qp' could be null
The NS for DAD are sent on admin up as long as a valid qdisc is found.
A race condition exists by which these packets will not egress the
interface if the operational state of the lower device is not yet up.
The solution is to delay DAD until the link is operationally up
according to RFC2863. Rather than only doing this, follow the existing
code checks by deferring IPv6 device initialization altogether. The fix
allows DAD on devices like tunnels that are controlled by userspace
control plane. The fix has no impact on regular deployments, but means
that there is no IPv6 connectivity until the port has been opened in
the case of port-based network access control, which should be
desirable.
Signed-off-by: Mike Manning <mmanning@brocade.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
State is initially reported as UNKNOWN. Before register call
netif_carrier_off(). Once the device is opened, call netif_carrier_on() in
order to set the state to UP.
Signed-off-by: Mick Tarsel <mjtarsel@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When we process VF mailboxes, the driver is likely going to also queue
up messages to the switch manager. This process merely queues up the
FIFO, but doesn't actually begin the transmission process. Because we
hold the mailbox lock during this VF processing, the PF<->SM mailbox is
not getting processed at this time. Ensure that we actually process the
PF<->SM mailbox in between each PF<->VF mailbox.
This should ensure prompt transmission of the messages queued up after
each VF message is received and handled.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Krishneil Singh <krishneil.k.singh@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Odroid HC1 board has built-in JMicron USB to SATA bridge, which supports
UAS protocol. Compile-in support for it (instead of enabling it as module)
to make sure that all built-in storage devices are available for rootfs.
The bridge itself also supports fallback to standard USB Mass Storage
protocol, but USB Mass Storage class doesn't bind to it when UAS is
compiled as module and modules are not (yet) available.
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
With virtual PCI-Express chipsets, we now see userspace/guest drivers
trying to match the physical MPS setting to a virtual downstream port.
Of course a lone physical device surrounded by virtual interconnects
cannot make a correct decision for a proper MPS setting. Instead,
let's virtualize the MPS control register so that writes through to
hardware are disallowed. Userspace drivers like QEMU assume they can
write anything to the device and we'll filter out anything dangerous.
Since mismatched MPS can lead to AER and other faults, let's add it
to the kernel side rather than relying on userspace virtualization to
handle it.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Currently there is a bug in which the PF driver fails to inform clients
of a VF reset which then causes clients to leak resources. The bug
exists because we were incorrectly checking the I40E_VF_STATE_PRE_ENABLE
bit.
When a VF is first init we go through a reset to initialize variables
and allocate resources but we don't want to inform clients of this first
reset since the client isn't fully enabled yet so we set a state bit
signifying we're in a "pre-enabled" client state. During the first
reset we should be clearing the bit, allowing all following resets to
notify the client of the reset when the bit is not set. This patch
fixes the issue by negating the 'test_and_clear_bit' check to accurately
reflect the behavior we want.
Signed-off-by: Alan Brady <alan.brady@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Warning messages when NVME_TARGET_FC not defined on ppc builds
The lpfc_nvmet_replenish_context() function is only meaningful when NVME
target mode enabled. Surround the function body with ifdefs for target
mode enablement.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: James Smart <james.smart@broadcom.com> Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Local Reject/Invalid RPI errors seen during discovery.
Temporary RPI cleanup was occurring regardless of SLI rev. It's only
necessary on SLI-4.
Adjust the test for whether cleanup is necessary.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: James Smart <james.smart@broadcom.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Firmware update fails with: status x17 add_status x56 on the final write
If multiple DMA buffers are used for the download, some firmware revs
have difficulty with signatures and crcs split across the dma buffer
boundaries. Resolve by making all writes be a single 4k page in length.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: James Smart <james.smart@broadcom.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We've had support for setting both a minimum and maximum bandwidth via
.ndo_set_vf_bw since commit 883a9ccbae56 ("fm10k: Add support for SR-IOV
to driver", 2014-09-20).
Likely because we do not support minimum rates, the declaration
mis-ordered the "unused" parameter, which causes warnings when analyzed
with cppcheck.
Fix this warning by properly declaring the min_rate and max_rate
variables in the declaration and definition (rather than using
"unused"). Also rename "rate" to max_rate so as to clarify that we only
support setting the maximum rate.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Krishneil Singh <krishneil.k.singh@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If the driver is built as a module, autoload won't work because the module
alias information is not filled. So user-space can't match the registered
device with the corresponding module.
Export the module alias information using the MODULE_DEVICE_TABLE() macro.
Before this patch:
$ modinfo snd_soc_msm8916_analog | grep alias
$
After this patch:
$ modinfo snd_soc_msm8916_analog | grep alias
alias: of:N*T*Cqcom,pm8916-wcd-analog-codecC*
alias: of:N*T*Cqcom,pm8916-wcd-analog-codec
Signed-off-by: Nicolas Dechesne <nicolas.dechesne@linaro.org> Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
As SCTP supports up to 65535 streams, that can lead to very large
allocations in sctp_stream_init(). As Xin Long noticed, systems with
small amounts of memory are more prone to not have enough memory and
dump warnings on dmesg initiated by user actions. Thus, silence them.
Also, if the reallocation of stream->out is not necessary, skip it and
keep the memory we already have.
Reported-by: Xin Long <lucien.xin@gmail.com> Tested-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In xmon, touch_nmi_watchdog() is not expected to be checking that
other CPUs have not touched the watchdog, so the code will just call
touch_nmi_watchdog() once before re-enabling hard interrupts.
Just update our CPU's state, and ignore apparently stuck SMP threads.
Arguably touch_nmi_watchdog should check for SMP lockups, and callers
should be fixed, but that's not trivial for the input code of xmon.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The SMP hardlockup watchdog cross-checks other CPUs for lockups, which
causes xmon headaches because it's assuming interrupts hard disabled
means no watchdog troubles. Try to improve that by calling
touch_nmi_watchdog() in obvious places where secondaries are spinning.
Also annotate these spin loops with spin_begin/end calls.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Add pm_runtime_get_sync and pm_runtime_put calls to set_fmt callback
function. This fixes a bus error during boot when CONFIG_SUSPEND is
defined when this function gets called while the device is runtime
disabled and device registers are accessed while the clock is disabled.
Signed-off-by: Ed Blake <ed.blake@sondrel.com> Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The macro used to set the microphone bias level causes the
snd_soc_write() call to overwrite other fields in the CDC_A_MICB_1_VAL
register. The macro also does not return the proper level value
to use. This fixes this by preserving all bits from the register
that are not the level while setting the level.
Signed-off-by: Jean-François Têtu <jean-francois.tetu@savoirfairelinux.com> Acked-by: Srinivas Kandagatla <srinivas.kandagatla@linaro.org> Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
There are a small number of 'generic fields' (comm/COMM/cpu/CPU) that
are found by trace_find_event_field() but are only meant for
filtering. Specifically, they unlike normal fields, they have a size
of 0 and thus wreak havoc when used as a histogram key.
Exclude these (return -EINVAL) when used as histogram keys.
Previously, if an non-fatal error was reported by an endpoint, we
called report_error_detected() for the endpoint, every sibling on the
bus, and their descendents. If any of them did not implement the
.error_detected() method, do_recovery() failed, leaving all these
devices unrecovered.
For example, the system described in the bugzilla below has two devices:
0000:74:02.0 [19e5:a230] SAS controller, driver has .error_detected()
0000:74:03.0 [19e5:a235] SATA controller, driver lacks .error_detected()
When a device such as 74:02.0 reported a non-fatal error, do_recovery()
failed because 74:03.0 lacked an .error_detected() method. But per PCIe
r3.1, sec 6.2.2.2.2, such an error does not compromise the Link and
does not affect 74:03.0:
Non-fatal errors are uncorrectable errors which cause a particular
transaction to be unreliable but the Link is otherwise fully functional.
Isolating Non-fatal from Fatal errors provides Requester/Receiver logic
in a device or system management software the opportunity to recover from
the error without resetting the components on the Link and disturbing
other transactions in progress. Devices not associated with the
transaction in error are not impacted by the error.
Report non-fatal errors only to the endpoint that reported them. We really
want to check for AER_NONFATAL here, but the current code structure doesn't
allow that. Looking for pci_channel_io_normal is the best we can do now.
Currently, when setting up the IRQ for a q_vector, we set an affinity
hint based on the v_idx of that q_vector. Meaning a loop iterates on
v_idx, which is an incremental value, and the cpumask is created based
on this value.
This is a problem in systems with multiple logical CPUs per core (like in
simultaneous multithreading (SMT) scenarios). If we disable some logical
CPUs, by turning SMT off for example, we will end up with a sparse
cpu_online_mask, i.e., only the first CPU in a core is online, and
incremental filling in q_vector cpumask might lead to multiple offline
CPUs being assigned to q_vectors.
Example: if we have a system with 8 cores each one containing 8 logical
CPUs (SMT == 8 in this case), we have 64 CPUs in total. But if SMT is
disabled, only the 1st CPU in each core remains online, so the
cpu_online_mask in this case would have only 8 bits set, in a sparse way.
In general case, when SMT is off the cpu_online_mask has only C bits set:
0, 1*N, 2*N, ..., C*(N-1) where
C == # of cores;
N == # of logical CPUs per core.
In our example, only bits 0, 8, 16, 24, 32, 40, 48, 56 would be set.
Instead, we should only assign hints for CPUs which are online. Even
better, the kernel already provides a function, cpumask_local_spread()
which takes an index and returns a CPU, spreading the interrupts across
local NUMA nodes first, and then remote ones if necessary.
Since we generally have a 1:1 mapping between vectors and CPUs, there
is no real advantage to spreading vectors to local CPUs first. In order
to avoid mismatch of the default XPS hints, we'll pass -1 so that it
spreads across all CPUs without regard to the node locality.
Note that we don't need to change the q_vector->affinity_mask as this is
initialized to cpu_possible_mask, until an actual affinity is set and
then notified back to us.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This commit fixes 2 issues with host-wake irq trigger type handling
in hci_bcm:
1) bcm_setup_sleep sets sleep_params.host_wake_active based on
bcm_device.irq_polarity, but bcm_request_irq was always requesting
IRQF_TRIGGER_RISING as trigger type independent of irq_polarity.
This was a problem when the irq is described as a GpioInt rather then
an Interrupt in the DSDT as for GpioInt-s the value passed to request_irq
is honored. This commit fixes this by requesting the correct trigger
type depending on bcm_device.irq_polarity.
2) bcm_device.irq_polarity was used to directly store an ACPI polarity
value (ACPI_ACTIVE_*). This is undesirable because hci_bcm is also
used with device-tree and checking for something like ACPI_ACTIVE_LOW
in a non ACPI specific function like bcm_request_irq feels wrong.
This commit fixes this by renaming irq_polarity to irq_active_low
and changing its type to a bool.
Signed-off-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The PCA9552 lines can be used either for driving LEDs or as GPIOs. The
manual states that for LEDs, the operation is open-drain:
The LSn LED select registers determine the source of the LED data.
00 = output is set LOW (LED on)
01 = output is set high-impedance (LED off; default)
10 = output blinks at PWM0 rate
11 = output blinks at PWM1 rate
For GPIOs it suggests a pull-up so that the open-case drives the line
high:
For use as output, connect external pull-up resistor to the pin
and size it according to the DC recommended operating
characteristics. LED output pin is HIGH when the output is
programmed as high-impedance, and LOW when the output is
programmed LOW through the ‘LED selector’ register. The output
can be pulse-width controlled when PWM0 or PWM1 are used.
Now, I have a hardware design that uses the LED controller to control
LEDs. However, for $reasons, we're using the leds-gpio driver to drive
the them. The reasons are here are a tangent but lead to the discovery
of the inversion, which manifested as the LEDs being set to full
brightness at boot when we expected them to be off.
As we're driving the LEDs through leds-gpio, this means wending our way
through the gpiochip abstractions. So with that in mind we need to
describe an active-low GPIO configuration to drive the LEDs as though
they were GPIOs.
The set() gpiochip callback in leds-pca955x does the following:
...
if (val)
pca955x_led_set(&led->led_cdev, LED_FULL);
else
pca955x_led_set(&led->led_cdev, LED_OFF);
...
Where LED_FULL = 255. pca955x_led_set() in turn does:
...
switch (value) {
case LED_FULL:
ls = pca955x_ledsel(ls, ls_led, PCA955X_LS_LED_ON);
break;
...
Where PCA955X_LS_LED_ON is defined as:
#define PCA955X_LS_LED_ON 0x0 /* Output LOW */
So here we have some type confusion: We've crossed domains from GPIO
behaviour to LED behaviour without accounting for possible inversions
in the process.
Stepping back to leds-gpio for a moment, during probe() we call
create_gpio_led(), which eventually executes:
if (template->default_state == LEDS_GPIO_DEFSTATE_KEEP) {
state = gpiod_get_value_cansleep(led_dat->gpiod);
if (state < 0)
return state;
} else {
state = (template->default_state == LEDS_GPIO_DEFSTATE_ON);
}
...
ret = gpiod_direction_output(led_dat->gpiod, state);
In the devicetree the GPIO is annotated as active-low, and
gpiod_get_value_cansleep() handles this for us:
int gpiod_get_value_cansleep(const struct gpio_desc *desc)
{
int value;
might_sleep_if(extra_checks);
VALIDATE_DESC(desc);
value = _gpiod_get_raw_value(desc);
if (value < 0)
return value;
if (test_bit(FLAG_ACTIVE_LOW, &desc->flags))
value = !value;
return value;
}
_gpiod_get_raw_value() in turn calls through the get() callback for the
gpiochip implementation, so returning to our get() implementation in
leds-pca955x we find we extract the raw value from hardware:
This behaviour is not symmetric with that of set(), where the val is
inverted by the driver.
Closing the loop on the GPIO_ACTIVE_LOW inversions,
gpiod_direction_output(), like gpiod_get_value_cansleep(), handles it
for us:
int gpiod_direction_output(struct gpio_desc *desc, int value)
{
VALIDATE_DESC(desc);
if (test_bit(FLAG_ACTIVE_LOW, &desc->flags))
value = !value;
else
value = !!value;
return _gpiod_direction_output_raw(desc, value);
}
All-in-all, with a value of 'keep' for default-state property in a
leds-gpio child node, the current state of the hardware will in-fact be
inverted; precisely the opposite of what was intended.
Rework leds-pca955x so that we avoid the incorrect inversion and clarify
the semantics with respect to GPIO.
After rwlock is replaced with rcu and spinlock, ip6_pol_route() will be
called with only rcu held. That means rt6 route deletion could happen
simultaneously with rt6_make_pcpu_rt(). This could potentially cause
memory leak if rt6_release() is called right before rt6_make_pcpu_rt()
on the same route.
This patch grabs rt->rt6i_ref safely before calling rt6_make_pcpu_rt()
to make sure rt6_release() will not get triggered while
rt6_make_pcpu_rt() is in progress. And rt6_release() is called after
rt6_make_pcpu_rt() is finished.
Note: As we are incrementing rt->rt6i_ref in ip6_pol_route(), there is a
very slim chance that fib6_purge_rt() will be triggered unnecessarily
when deleting a route if ip6_pol_route() running on another thread picks
this route as well and tries to make pcpu cache for it.
Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Similarly to early patch for erspan_xmit(), the ARPHDR_ETHER device
is the length of the whole ether packet. So skb->len should subtract
the dev->hard_header_len.
Fixes: 1a66a836da63 ("gre: add collect_md mode to ERSPAN tunnel") Fixes: 84e54fe0a5ea ("gre: introduce native tunnel support for ERSPAN") Signed-off-by: William Tu <u9012063@gmail.com> Cc: Xin Long <lucien.xin@gmail.com> Cc: David Laight <David.Laight@aculab.com> Reviewed-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Since commit 4ad23a976413 ("MD: use per-cpu counter for writes_pending"),
the wait_queue is only got invoked if THREAD_WAKEUP is not set previously.
With above change, I can see process_metadata_update could always hang on
the wait queue, because mddev->thread could stay on 'D' status and the
THREAD_WAKEUP flag is not cleared since there are lots of place to wake up
mddev->thread. Then deadlock happened as follows:
Similarly to CFQ, BFQ has its write-throttling heuristics, and it
is better not to combine them with further write-throttling
heuristics of a different nature.
So this commit disables write-back throttling for a device if BFQ
is used as I/O scheduler for that device.
The allocation for elem may fail (especially because we're using
GFP_ATOMIC) so best to check for a null return. This fixes a potential
null pointer dereference when assigning elem->pool.
Detected by CoverityScan CID#1357507 ("Dereference null return value")
Fixes: 8700e3e7c485 ("Soft RoCE driver") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Doug Ledford <dledford@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Introduce register mask for data-ready status register since
pressure sensors (e.g. LPS22HB) export just two channels
(BIT(0) and BIT(1)) and BIT(2) is marked reserved while in
st_sensors_new_samples_available() value read from status register
is masked using 0x7.
Moreover do not mask status register using active_scan_mask since
now status value is properly masked and if the result is not zero the
interrupt has to be consumed by the driver. This fix an issue on LPS25H
and LPS331AP where channel definition is swapped respect to status
register.
Furthermore that change allows to properly support new devices
(e.g LIS2DW12) that report just ZYXDA (data-ready) field in status register
to figure out if the interrupt has been generated by the device.
Fixes: 97865fe41322 (iio: st_sensors: verify interrupt event to status) Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@st.com> Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This patch replaces hash_for_each function with hash_for_each_safe
when calling __i40e_del_filter. The hash_for_each_safe function is
the right one to use when iterating over a hash table to safely remove
a hash entry. Otherwise, incorrect values may be read from freed memory.
Detected by CoverityScan, CID 1402048 Read from pointer after free
Signed-off-by: Lihong Yang <lihong.yang@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
On some i.MX6 platforms which do not have speed grading
check, opp table will not be created in platform code,
so cpufreq driver prints the following error message:
cpu cpu0: dev_pm_opp_get_opp_count: OPP table not found (-19)
However, this is not really an error in this case because the
imx6q-cpufreq driver first calls dev_pm_opp_get_opp_count()
and if it fails, it means that platform code does not provide
OPP and then dev_pm_opp_of_add_table() will be called.
In order to avoid such confusing error message, move it to
debug level.
It is up to the caller of dev_pm_opp_get_opp_count() to check its
return value and decide if it will print an error or not.
Signed-off-by: Fabio Estevam <fabio.estevam@nxp.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When creating virtual functions, create the "virtfn%u" and "physfn" links
in sysfs *before* attaching the driver instead of after. When we attach
the driver to the new virtual network interface first, there is a race when
the driver attaches to the new sends out an "add" udev event, and the
network interface naming software (biosdevname or systemd, for example)
tries to look at these links.
Signed-off-by: Stuart Hayes <stuart.w.hayes@gmail.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Whenever an I/O for a RAID volume fails with IOCStatus
MPI2_IOCSTATUS_SCSI_IOC_TERMINATED and SCSIStatus equal to
(MPI2_SCSI_STATE_TERMINATED | MPI2_SCSI_STATE_NO_SCSI_STATUS) then
return the I/O to SCSI midlayer with "DID_RESET" (i.e. retry the IO
infinite times) set in the host byte.
Previously, the driver was completing the I/O with "DID_SOFT_ERROR"
which causes the I/O to be quickly retried. However, firmware needed
more time and hence I/Os were failing.
Signed-off-by: Sreekanth Reddy <Sreekanth.Reddy@broadcom.com> Reviewed-by: Tomas Henzl <thenzl@redhat.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In case of connection reset Tx skb queue can have some skbs which are
not transmitted so purge Tx skb queue in release_offload_resources() to
avoid skb leak.
Signed-off-by: Varun Prakash <varun@chelsio.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When checking to see if a PCI bus can safely be reset, we previously
checked to see if any of the children had their PCI_DEV_FLAGS_NO_BUS_RESET
flag set. Children marked with that flag are known not to behave well
after a bus reset.
Some PCIe root port bridges also do not behave well after a bus reset,
sometimes causing the devices behind the bridge to become unusable.
Add a check for PCI_DEV_FLAGS_NO_BUS_RESET being set in the bridge device
to allow these bridges to be flagged, and prevent their secondary buses
from being reset.
Signed-off-by: David Daney <david.daney@cavium.com>
[jglauber@cavium.com: fixed typo] Signed-off-by: Jan Glauber <jglauber@cavium.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Change the return error code to EINVAL if the MAC
address is not valid in the set_wol function.
Signed-off-by: Dan Murphy <dmurphy@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The kvm slabs can consume a significant amount of system memory
and indeed in our production environment we have observed that
a lot of machines are spending significant amount of memory that
can not be left as system memory overhead. Also the allocations
from these slabs can be triggered directly by user space applications
which has access to kvm and thus a buggy application can leak
such memory. So, these caches should be accounted to kmemcg.
The post-divider for the audio PLL is in bits [29:26], as specified
in the user manual, not [19:16] as currently programmed in the code.
The post-divider has a default register value of 2, i.e. a divider
of 3. This means the clock rate fed to the audio codec would be off.
This was discovered when porting sigma-delta modulation for the PLL
to sun5i, which needs the post-divider to be 1.
Fix the bit offset, so we do actually force the post-divider to a
certain value.
The round_rate callback for N-M-factor style clocks does not check if
the requested clock rate is supported by the fractional clock mode.
While this doesn't affect usage in practice, since the clock rates
are also supported through N-M factors, it does not match the set_rate
code.
Add a check to the round_rate callback so it matches the set_rate
callback.
>From the CI builds, its been observed that during a driver
reload/insert, dp dual mode read function sometimes fails to
read from LSPCON device over i2c-over-aux channel.
This patch:
- adds some delay and few retries, allowing a scope for these
devices to settle down and respond.
- changes one error log's level from ERROR->DEBUG as we want
to call it an error only after all the retries are exhausted.
V2: Addressed review comments from Jani (for loop for retry)
V3: Addressed review comments from Imre (break on partial read too)
V3: Addressed review comments from Ville/Imre (Add the retries
exclusively for LSPCON, not for all dp_dual_mode devices)
V4: Added r-b from Imre, sending it to dri-devel (Jani)
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=102294
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=102295
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=102359
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=103186 Cc: Ville Syrjala <ville.syrjala@linux.intel.com> Cc: Imre Deak <imre.deak@intel.com> Cc: Jani Nikula <jani.nikula@linux.intel.com> Reviewed-by: Imre Deak <imre.deak@intel.com> Acked-by: Dave Airlie <airlied@gmail.com> Signed-off-by: Shashank Sharma <shashank.sharma@intel.com> Signed-off-by: Jani Nikula <jani.nikula@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/1507826408-19322-1-git-send-email-shashank.sharma@intel.com Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This fixes an overflow condition that can happen with high max
brightness and period values in compute_duty_cycle. This fixes it by
using a 64 bit variable for computing the duty cycle.
Signed-off-by: Derek Basehore <dbasehore@chromium.org> Acked-by: Thierry Reding <thierry.reding@gmail.com> Reviewed-by: Brian Norris <briannorris@chromium.org> Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The first node supplied to of_find_matching_node() has its reference
counter decreased as part of call to that function. In optee_driver_init()
after calling of_find_matching_node() it's invalid to call of_node_put() on
the supplied node again.
So remove the invalid call to of_node_put().
Reported-by: Alex Shi <alex.shi@linaro.org> Signed-off-by: Jens Wiklander <jens.wiklander@linaro.org> Cc: <andi@linux-stable.l.notmuch.email> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
There is currently no way to force CPU bug bits like CPU feature bits. That
makes it impossible to set a bug bit once at boot and have it stick for all
upcoming CPUs.
Extend the force set/clear arrays to handle bug bits as well.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Borislav Petkov <bp@suse.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bpetkov@suse.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Laight <David.Laight@aculab.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Eduardo Valentin <eduval@amazon.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Will Deacon <will.deacon@arm.com> Cc: aliguori@amazon.com Cc: daniel.gruss@iaik.tugraz.at Cc: hughd@google.com Cc: keescook@google.com Link: https://lkml.kernel.org/r/20171204150606.992156574@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
There is no generic way to test whether a kernel is running on a specific
hypervisor. But that's required to prevent the upcoming user address space
separation feature in certain guest modes.
Make the hypervisor type enum unconditionally available and provide a
helper function which allows to test for a specific type.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Juergen Gross <jgross@suse.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bpetkov@suse.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Laight <David.Laight@aculab.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Eduardo Valentin <eduval@amazon.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Will Deacon <will.deacon@arm.com> Cc: aliguori@amazon.com Cc: daniel.gruss@iaik.tugraz.at Cc: hughd@google.com Cc: keescook@google.com Link: https://lkml.kernel.org/r/20171204150606.912938129@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The TSS is a fairly juicy target for exploits, and, now that the TSS
is in the cpu_entry_area, it's no longer protected by kASLR. Make it
read-only on x86_64.
On x86_32, it can't be RO because it's written by the CPU during task
switches, and we use a task gate for double faults. I'd also be
nervous about errata if we tried to make it RO even on configurations
without double fault handling.
[ tglx: AMD confirmed that there is no problem on 64-bit with TSS RO. So
it's probably safe to assume that it's a non issue, though Intel
might have been creative in that area. Still waiting for
confirmation. ]
Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Borislav Petkov <bpetkov@suse.de> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Laight <David.Laight@aculab.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Eduardo Valentin <eduval@amazon.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Will Deacon <will.deacon@arm.com> Cc: aliguori@amazon.com Cc: daniel.gruss@iaik.tugraz.at Cc: hughd@google.com Cc: keescook@google.com Link: https://lkml.kernel.org/r/20171204150606.733700132@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The existing code was a mess, mainly because C arrays are nasty. Turn
SYSENTER_stack into a struct, add a helper to find it, and do all the
obvious cleanups this enables.
Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Borislav Petkov <bpetkov@suse.de> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Laight <David.Laight@aculab.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Eduardo Valentin <eduval@amazon.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Will Deacon <will.deacon@arm.com> Cc: aliguori@amazon.com Cc: daniel.gruss@iaik.tugraz.at Cc: hughd@google.com Cc: keescook@google.com Link: https://lkml.kernel.org/r/20171204150606.653244723@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The IST stacks are needed when an IST exception occurs and are accessed
before any kernel code at all runs. Move them into struct cpu_entry_area.
The IST stacks are unlike the rest of cpu_entry_area: they're used even for
entries from kernel mode. This means that they should be set up before we
load the final IDT. Move cpu_entry_area setup to trap_init() for the boot
CPU and set it up for all possible CPUs at once in native_smp_prepare_cpus().
Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Borislav Petkov <bp@suse.de> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bpetkov@suse.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Laight <David.Laight@aculab.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Eduardo Valentin <eduval@amazon.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Will Deacon <will.deacon@arm.com> Cc: aliguori@amazon.com Cc: daniel.gruss@iaik.tugraz.at Cc: hughd@google.com Cc: keescook@google.com Link: https://lkml.kernel.org/r/20171204150606.480598743@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Handling SYSCALL is tricky: the SYSCALL handler is entered with every
single register (except FLAGS), including RSP, live. It somehow needs
to set RSP to point to a valid stack, which means it needs to save the
user RSP somewhere and find its own stack pointer. The canonical way
to do this is with SWAPGS, which lets us access percpu data using the
%gs prefix.
With PAGE_TABLE_ISOLATION-like pagetable switching, this is
problematic. Without a scratch register, switching CR3 is impossible, so
%gs-based percpu memory would need to be mapped in the user pagetables.
Doing that without information leaks is difficult or impossible.
Instead, use a different sneaky trick. Map a copy of the first part
of the SYSCALL asm at a different address for each CPU. Now RIP
varies depending on the CPU, so we can use RIP-relative memory access
to access percpu memory. By putting the relevant information (one
scratch slot and the stack address) at a constant offset relative to
RIP, we can make SYSCALL work without relying on %gs.
A nice thing about this approach is that we can easily switch it on
and off if we want pagetable switching to be configurable.
The compat variant of SYSCALL doesn't have this problem in the first
place -- there are plenty of scratch registers, since we don't care
about preserving r8-r15. This patch therefore doesn't touch SYSCALL32
at all.
This patch actually seems to be a small speedup. With this patch,
SYSCALL touches an extra cache line and an extra virtual page, but
the pipeline no longer stalls waiting for SWAPGS. It seems that, at
least in a tight loop, the latter outweights the former.
Thanks to David Laight for an optimization tip.
Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Borislav Petkov <bpetkov@suse.de> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Laight <David.Laight@aculab.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Eduardo Valentin <eduval@amazon.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Will Deacon <will.deacon@arm.com> Cc: aliguori@amazon.com Cc: daniel.gruss@iaik.tugraz.at Cc: hughd@google.com Cc: keescook@google.com Link: https://lkml.kernel.org/r/20171204150606.403607157@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
By itself, this is useless. It gives us the ability to run some final code
before exit that cannnot run on the kernel stack. This could include a CR3
switch a la PAGE_TABLE_ISOLATION or some kernel stack erasing, for
example. (Or even weird things like *changing* which kernel stack gets
used as an ASLR-strengthening mechanism.)
The SYSRET32 path is not covered yet. It could be in the future or
we could just ignore it and force the slow path if needed.
Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Borislav Petkov <bp@suse.de> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bpetkov@suse.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Laight <David.Laight@aculab.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Eduardo Valentin <eduval@amazon.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Will Deacon <will.deacon@arm.com> Cc: aliguori@amazon.com Cc: daniel.gruss@iaik.tugraz.at Cc: hughd@google.com Cc: keescook@google.com Link: https://lkml.kernel.org/r/20171204150606.306546484@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Historically, IDT entries from usermode have always gone directly
to the running task's kernel stack. Rearrange it so that we enter on
a per-CPU trampoline stack and then manually switch to the task's stack.
This touches a couple of extra cachelines, but it gives us a chance
to run some code before we touch the kernel stack.
The asm isn't exactly beautiful, but I think that fully refactoring
it can wait.
Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Borislav Petkov <bp@suse.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bpetkov@suse.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Laight <David.Laight@aculab.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Eduardo Valentin <eduval@amazon.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Will Deacon <will.deacon@arm.com> Cc: aliguori@amazon.com Cc: daniel.gruss@iaik.tugraz.at Cc: hughd@google.com Cc: keescook@google.com Link: https://lkml.kernel.org/r/20171204150606.225330557@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When we start using an entry trampoline, a #GP from userspace will
be delivered on the entry stack, not on the task stack. Fix the
espfix64 #DF fixup to set up #GP according to TSS.SP0, rather than
assuming that pt_regs + 1 == SP0. This won't change anything
without an entry stack, but it will make the code continue to work
when an entry stack is added.
While we're at it, improve the comments to explain what's actually
going on.
Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Borislav Petkov <bp@suse.de> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bpetkov@suse.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Laight <David.Laight@aculab.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Eduardo Valentin <eduval@amazon.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Will Deacon <will.deacon@arm.com> Cc: aliguori@amazon.com Cc: daniel.gruss@iaik.tugraz.at Cc: hughd@google.com Cc: keescook@google.com Link: https://lkml.kernel.org/r/20171204150606.130778051@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>