Convert indirect jumps in core 32/64bit entry assembler code to use
non-speculative sequences when CONFIG_RETPOLINE is enabled.
Don't use CALL_NOSPEC in entry_SYSCALL_64_fastpath because the return
address after the 'call' instruction must be *precisely* at the
.Lentry_SYSCALL_64_after_fastpath label for stub_ptregs_64 to work,
and the use of alternatives will mess that up unless we play horrid
games to prepend with NOPs and make the variants the same length. It's
not worth it; in the case where we ALTERNATIVE out the retpoline, the
first instruction at __x86.indirect_thunk.rax is going to be a bare
jmp *%rax anyway.
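For reference, a minimal sketch of one of these thunks, written as global
asm in a C file (the in-kernel thunks are generated by assembler macros and
patched via ALTERNATIVE, so names and layout here are illustrative):

asm(
"	.globl __x86_indirect_thunk_rax\n"
"__x86_indirect_thunk_rax:\n"
"	call 1f\n"		/* push address of the capture loop */
"2:	pause\n"		/* speculative execution spins harmlessly here */
"	lfence\n"
"	jmp 2b\n"
"1:	mov %rax, (%rsp)\n"	/* overwrite return address with real target */
"	ret\n"			/* architecturally jumps to *%rax */
);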
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: gnomes@lxorguk.ukuu.org.uk
Cc: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: thomas.lendacky@amd.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Kees Cook <keescook@google.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Greg Kroah-Hartman <gregkh@linux-foundation.org>
Cc: Paul Turner <pjt@google.com>
Link: https://lkml.kernel.org/r/1515707194-20531-7-git-send-email-dwmw@amazon.co.uk
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Razvan Ghitulete <rga@amazon.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 028083cb02db69237e73950576bc81ac579693dc)
Orabug: 27477743
CVE: CVE-2017-5715
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Conflicts:
arch/x86/kernel/entry_32.S
(dmj:
- patch had arch/x86/entry/entry_32.S
- extra retpolines needed for sys_call_table)
arch/x86/kernel/entry_64.S
(dmj: patch had arch/x86/entry/entry_64.S)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Konrad Rzeszutek Wilk [Thu, 1 Feb 2018 15:40:43 +0000 (10:40 -0500)]
x86/spectre_v2: Figure out if STUFF_RSB macro needs to be used.
If IBRS is on, STUFF_RSB only needs to be enabled if the CPU
does not have SMEP enabled.
We change the macro to depend on a synthetic one called
X86_FEATURE_STUFF_RSB - which will only be turned on
if 'spectre_v2=ibrs' is set and the BIOS does not have SMEP
enabled.
(And naturally also if the kernel is compiled without retpoline
and the CPU does not have SMEP).
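A hedged sketch of how the synthetic bit might be set during feature setup
(everything here except X86_FEATURE_STUFF_RSB and X86_FEATURE_SMEP is an
assumed name, not the actual UEK4 code):

/* Force RSB stuffing only when IBRS was selected (or no retpoline is
 * built in) and the CPU lacks SMEP. */
static void __init update_stuff_rsb(struct cpuinfo_x86 *c)
{
	if ((ibrs_selected || !IS_ENABLED(CONFIG_RETPOLINE)) &&
	    !cpu_has(c, X86_FEATURE_SMEP))
		setup_force_cpu_cap(X86_FEATURE_STUFF_RSB);
}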
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Konrad Rzeszutek Wilk [Thu, 1 Feb 2018 15:13:37 +0000 (10:13 -0500)]
x86/spectre_v2: Figure out when to use IBRS.
IBRS is used if:
a) on the boot line you have 'spectre_v2=ibrs' _and_ microcode
is loaded that exposes this. And nobody used 'noibrs'.
If you have 'spectre_v2=ibrs noibrs' we end up falling
back to 'lfence'. Unless somebody did 'spectre_v2=ibrs noibrs nolfence',
in which case we go back to automatic discovery.
b) the kernel is compiled without retpoline and the microcode is
available.
c) the kernel is compiled without retpoline and there is no microcode with IBRS,
in which case we pick 'lfence' on system calls.
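In pseudo-kernel C the decision tree reads roughly as follows (a sketch;
the flag and enum names are illustrative, not the actual bugs_64.c code):

static enum spectre_v2_mitigation __init select_spectre_v2(void)
{
	if (cmdline_ibrs && !cmdline_noibrs && have_ibrs_microcode)
		return SPECTRE_V2_IBRS;			/* case a) */
	if (cmdline_ibrs && cmdline_noibrs && !cmdline_nolfence)
		return SPECTRE_V2_LFENCE;		/* 'ibrs noibrs' */
	if (!IS_ENABLED(CONFIG_RETPOLINE))
		return have_ibrs_microcode ? SPECTRE_V2_IBRS	/* case b) */
					   : SPECTRE_V2_LFENCE;	/* case c) */
	return SPECTRE_V2_RETPOLINE;	/* automatic discovery */
}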
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Konrad Rzeszutek Wilk [Thu, 1 Feb 2018 14:45:27 +0000 (09:45 -0500)]
x86/spectre: Add IBRS option.
The spectre_v2_mitigation already has an IBRS option; let's make
the override possible. But don't select it by default if
the kernel has been compiled with retpoline.
If it has not (no compiler support), then fall back to IBRS.
Orabug: 27477743
CVE: CVE-2017-5715
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Add a spectre_v2= option to select the mitigation used for the indirect
branch speculation vulnerability.
Currently, the only option available is retpoline, in its various forms.
This will be expanded to cover the new IBRS/IBPB microcode features.
The RETPOLINE_AMD feature relies on a serializing LFENCE for speculation
control. For AMD hardware, only set RETPOLINE_AMD if LFENCE is a
serializing instruction, which is indicated by the LFENCE_RDTSC feature.
[ tglx: Folded back the LFENCE/AMD fixes and reworked it so IBRS
integration becomes simple ]
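The RETPOLINE_AMD gating described above amounts to something like this
sketch (using the standard cpufeature helpers; the exact call site may
differ):

/* Only use the lfence-based retpoline where LFENCE is known to be a
 * serializing instruction, which AMD signals via LFENCE_RDTSC. */
if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD &&
    boot_cpu_has(X86_FEATURE_LFENCE_RDTSC))
	setup_force_cpu_cap(X86_FEATURE_RETPOLINE_AMD);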
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: gnomes@lxorguk.ukuu.org.uk
Cc: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: thomas.lendacky@amd.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Kees Cook <keescook@google.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Greg Kroah-Hartman <gregkh@linux-foundation.org>
Cc: Paul Turner <pjt@google.com>
Link: https://lkml.kernel.org/r/1515707194-20531-5-git-send-email-dwmw@amazon.co.uk
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 9f789bc5711bcacb5df003594b992f0c1cc19df4)
Orabug: 27477743
CVE: CVE-2017-5715
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Conflicts:
arch/x86/kernel/cpu/bugs_64.c
(dmj:
- patch used arch/x86/kernel/cpu/bugs.c
- UEK4-specific IBPB and IBRS code merged with retpoline
cmdline parsing
- disabled IBRS in the case of full retpoline mitigation)
(konrad:
- disabled lfence on IBRS paths in case of full retpoline)
arch/x86/kernel/cpu/common.c
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Enable the use of -mindirect-branch=thunk-extern in newer GCC, and provide
the corresponding thunks. Provide assembler macros for invoking the thunks
in the same way that GCC does, from native and inline assembler.
This adds X86_FEATURE_RETPOLINE and sets it by default on all CPUs. In
some circumstances, IBRS microcode features may be used instead, and the
retpoline can be disabled.
On AMD CPUs if lfence is serialising, the retpoline can be dramatically
simplified to a simple "lfence; jmp *\reg". A future patch, after it has
been verified that lfence really is serialising in all circumstances, can
enable this by setting the X86_FEATURE_RETPOLINE_AMD feature bit in addition
to X86_FEATURE_RETPOLINE.
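A sketch of how the two forms can be selected with alternatives
(illustrative; the real CALL_NOSPEC/JMP_NOSPEC macros are more involved):

/* Default: plain indirect jump. With RETPOLINE, go via the thunk;
 * with RETPOLINE_AMD, a serializing lfence up front is enough. */
#define JMP_NOSPEC_RAX					\
	ALTERNATIVE_2("jmp *%rax",			\
		      "jmp __x86_indirect_thunk_rax",	\
		      X86_FEATURE_RETPOLINE,		\
		      "lfence; jmp *%rax",		\
		      X86_FEATURE_RETPOLINE_AMD)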
Do not align the retpoline in the altinstr section, because there is no
guarantee that it stays aligned when it's copied over the oldinstr during
alternative patching.
[ Andi Kleen: Rename the macros, add CONFIG_RETPOLINE option, export thunks]
[ tglx: Put actual function CALL/JMP in front of the macros, convert to
symbolic labels ]
[ dwmw2: Convert back to numeric labels, merge objtool fixes ]
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: gnomes@lxorguk.ukuu.org.uk
Cc: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: thomas.lendacky@amd.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Kees Cook <keescook@google.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Greg Kroah-Hartman <gregkh@linux-foundation.org>
Cc: Paul Turner <pjt@google.com>
Link: https://lkml.kernel.org/r/1515707194-20531-4-git-send-email-dwmw@amazon.co.uk
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
[ 4.4 backport: removed objtool annotation since there is no objtool ]
Signed-off-by: Razvan Ghitulete <rga@amazon.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 3c5e10905263dbe9fbc621d1889b85e9c867da25)
Orabug: 27477743
CVE: CVE-2017-5715
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Conflicts:
arch/x86/kernel/cpu/common.c
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
The macro MODULE is not a config option, it is a per-file build
option. So, config_enabled(MODULE) is not sensible. (There is
another case in include/linux/export.h, where config_enabled() is
used against a non-config option.)
This commit renames some macros in include/linux/kconfig.h for the
use for non-config macros and replaces config_enabled(MODULE) with
__is_defined(MODULE).
I am keeping config_enabled() because it is still referenced from
some places, but I expect it would be deprecated in the future.
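The trick behind __is_defined() is the usual argument-shifting game; a
sketch (helper names besides __is_defined are illustrative):

#define __ARG_PLACEHOLDER_1 0,
#define __is_defined(x)			___is_defined(x)
#define ___is_defined(val)		____is_defined(__ARG_PLACEHOLDER_##val)
#define ____is_defined(arg1_or_junk)	__second_arg(arg1_or_junk 1, 0)
#define __second_arg(__ignored, val, ...) val

/* MODULE defined as 1: pasting yields __ARG_PLACEHOLDER_1, i.e. "0,",
 * so the list becomes (0, 1, 0) and the second argument is 1.
 * MODULE undefined: the pasted token is junk, the list is (junk 1, 0)
 * and the second argument is 0. */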
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Michal Marek <mmarek@suse.com>
Signed-off-by: Razvan Ghitulete <rga@amazon.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Orabug: 27477743
CVE: CVE-2017-5715
(cherry picked from commit 675901851fd2a00c7a70bcf327303efba3b81e2f)
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Add asm-usable variants of EXPORT_SYMBOL/EXPORT_SYMBOL_GPL. This
commit just adds the default implementation; most of the architectures
can simply add export.h to asm/Kbuild and start using <asm/export.h>
from assembler. The rest needs to have their <asm/export.h> define
several macros and then explicitly include <asm-generic/export.h>.
One area where the things might diverge from default is the alignment;
normally it's 8 bytes on 64bit targets and 4 on 32bit ones, both for
unsigned long and for struct kernel_symbol. Unfortunately, amd64 and
m68k are unusual - m68k aligns to 2 bytes (for both) and amd64 aligns
struct kernel_symbol to 16 bytes. For those we'll need asm/export.h to
override the constants used by generic version - KSYM_ALIGN and KCRC_ALIGN
for kernel_symbol and unsigned long resp. And no, __alignof__ would not
do the trick - on amd64 __alignof__ of struct kernel_symbol is 8, not 16.
More serious source of unpleasantness is treatment of function
descriptors on architectures that have those. Things like ppc64,
parisc, ia64, etc. need more than the address of the first insn to
call an arbitrary function. As the result, their representation of
pointers to functions is not the typical "address of the entry point" -
it's an address of a small static structure containing all the required
information (including the entry point, of course). Sadly, the asm-side
conventions differ in what the function name refers to - entry point or
the function descriptor. On ppc64 we do the latter;
bar: .quad foo
is what void (*bar)(void) = foo; turns into and the rare places where
we need to explicitly work with the label of entry point are dealt with
as DOTSYM(foo). For our purposes it's ideal - generic macros are usable.
However, parisc would have foo and P%foo used for label of entry point
and address of the function descriptor and
bar: .long P%foo
would be used instead. ia64 is similar to parisc in that respect,
except that there it's @fptr(foo) rather than P%foo. Such architectures
need to define KSYM_FUNC that would turn a function name into whatever
is needed to refer to function descriptor.
What's more, on such architectures we need to know whether we are exporting
a function or an object - in assembler we have to tell that explicitly, to
decide whether we want EXPORT_SYMBOL(foo) produce e.g.
__ksymtab_foo: .quad foo
or
__ksymtab_foo: .quad @fptr(foo)
For that reason we introduce EXPORT_DATA_SYMBOL{,_GPL}(), to be used for
exports of data objects. On normal architectures it's the same thing
as EXPORT_SYMBOL{,_GPL}(), but on parisc-like ones they differ and the
right one needs to be used. Most of the exports are functions, so we
keep EXPORT_SYMBOL for those...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Razvan Ghitulete <rga@amazon.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Orabug: 27477743
CVE: CVE-2017-5715
(cherry picked from commit a88693d006986b8fc9ae5036852ac0156106de2a)
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Commit 4efca4ed ("kbuild: modversions for EXPORT_SYMBOL() for asm") adds
modversion support for symbols exported from asm files. Architectures
must include C-style declarations for those symbols in asm/asm-prototypes.h
in order for them to be versioned.
Add these declarations for x86, and an architecture-independent file that
can be used for common symbols.
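As an illustration, a trimmed sketch of what such headers contain (the
exact include list and declarations are assumptions):

/* arch/x86/include/asm/asm-prototypes.h (sketch) */
#include <asm/string.h>		/* memcpy, memset, memmove */
#include <asm/page.h>		/* copy_page, clear_page */
#include <asm/checksum.h>
#include <asm-generic/asm-prototypes.h>

/* the architecture-independent file declares compiler-runtime helpers
 * that are implemented in asm, e.g.: */
extern long long __ashldi3(long long, int);
extern long long __muldi3(long long, long long);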
With f27c2f6 reverting 8ab2ae6 ("default exported asm symbols to zero") we
produce a scary warning on x86; this commit fixes that.
Signed-off-by: Adam Borowski <kilobyte@angband.pl>
Tested-by: Kalle Valo <kvalo@codeaurora.org>
Acked-by: Nicholas Piggin <npiggin@gmail.com>
Tested-by: Peter Wu <peter@lekensteyn.nl>
Tested-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Michal Marek <mmarek@suse.com>
Signed-off-by: Razvan Ghitulete <rga@amazon.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Orabug: 27477743
CVE: CVE-2017-5715
(cherry picked from commit b76ac90af34dc08f057da15aa34584a0a3d381b5)
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Currently we use the current_stack_pointer() function to get the value
of the stack pointer register. Since commit:
f5caf621ee35 ("x86/asm: Fix inline asm call constraints for Clang")
... we have a stack register variable declared. It can be used instead of
the current_stack_pointer() function, which allows us to optimize away some
excessive "mov %rsp, %<dst>" instructions:
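The declaration in question (per f5caf621ee35) is a global register
variable; a sketch:

/* _ASM_SP expands to "esp" on 32-bit and "rsp" on 64-bit; reading this
 * variable compiles to a direct use of the stack pointer, with no
 * intermediate "mov %rsp, %<dst>". */
register unsigned long current_stack_pointer asm(_ASM_SP);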
Where an ALTERNATIVE is used in the middle of an inline asm block, this
would otherwise lead to the following instruction being appended directly
to the trailing ".popsection", and a failed compile.
Fixes: 9cebed423c84 ("x86, alternative: Use .pushsection/.popsection")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: gnomes@lxorguk.ukuu.org.uk
Cc: Rik van Riel <riel@redhat.com>
Cc: ak@linux.intel.com
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Turner <pjt@google.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Kees Cook <keescook@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linux-foundation.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20180104143710.8961-8-dwmw@amazon.co.uk
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 999d4f1961fa002bda138ddfe9119965421f85da)
Orabug: 27477743
CVE: CVE-2017-5715
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
The alternatives code checks only whether the first byte is a NOP, but
having NOPs in front of the payload and actual instructions after them
breaks the "optimized" test.
Make sure to scan all bytes before deciding to optimize the NOPs in there.
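A sketch of the fixed check, scanning the whole candidate range instead
of byte zero only (the helper name is illustrative; the real change is
in optimize_nops()):

static bool insn_is_nop_padding(const u8 *instr, unsigned int len)
{
	unsigned int i;

	for (i = 0; i < len; i++)
		if (instr[i] != 0x90)	/* 0x90 is the 1-byte NOP */
			return false;
	return true;
}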
Reported-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Andrew Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linux-foundation.org>
Cc: Paul Turner <pjt@google.com>
Link: https://lkml.kernel.org/r/20180110112815.mgciyf5acwacphkq@pd.tnic
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit e997d991ab2b1dc9f9cdad999a891626c2aecf21)
Orabug: 27477743
CVE: CVE-2017-5715
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
We are checking for gaps to the previous bio_vec, which can
only detect back merge gaps. Moreover, at the point where
we check for a gap, we don't know whether we will attempt a back
or a front merge. Thus, check for a gap to prev in a back merge
attempt and for a gap to next in a front merge attempt.
Signed-off-by: Jens Axboe <axboe@fb.com>
[sagig: Minor rename change]
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
(cherry picked from commit 5e7c4274a70aa2d6f485996d0ca1dad52d0039ca)
For drivers that don't support gaps in the SG lists handed to
them we must bounce (copy the user buffers) and pass a bio that
does not include gaps. This doesn't matter for any current user,
but will help to allow iser which can't handle gaps to use the
block virtual boundary instead of using driver-local bounce
buffering when handling SG_IO commands.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 46348456c1791053dcbe5a9e21825b10a3c8a8fb)
Keith Busch [Thu, 1 Feb 2018 19:37:06 +0000 (11:37 -0800)]
block: Replace SG_GAPS with new queue limits mask
The SG_GAPS queue flag caused checks for bio vector alignment against
PAGE_SIZE, but the device may have different constraints. This patch
adds a queue limit so a driver with such constraints can set it to allow
requests that would otherwise have been unnecessarily split. The new gaps
check takes the request_queue as a parameter to simplify the logic around
invoking this function.
This new limit makes the queue flag redundant, so remove it and
all its usage. Device-mappers will inherit the correct settings through
blk_stack_limits().
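The resulting helper, following upstream 03100aada96f (a sketch; the
queue's virt_boundary mask replaces the old PAGE_SIZE assumption):

static inline bool bvec_gap_to_prev(struct request_queue *q,
				    struct bio_vec *bprv, unsigned int offset)
{
	if (!queue_virt_boundary(q))
		return false;
	/* a gap exists unless the next segment starts on the boundary
	 * and the previous one ended on it */
	return offset ||
	       ((bprv->bv_offset + bprv->bv_len) & queue_virt_boundary(q));
}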
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 03100aada96f0645bbcb89aea24c01f02d0ef1fa)
In UEK4, the decision is made to do complete switch to
queue_virt_boundary discarding the use of SG_GAPS flags. Hence,
this partial porting of new queue limit mask is reverted. Also,
this partial patch is broken for stacked devices.
The following soft lockup was caught. This is a deadlock caused by
recursive locking.
Process kworker/u40:1:28016 was holding spin lock "mbx->queue_lock" in
qlcnic_83xx_mailbox_worker() when a softirq came in and asked for the same
spin lock in qlcnic_83xx_enqueue_mbx_cmd(). This lock should be held with
bottom halves disabled.
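The fix pattern, per upstream 233ac3891607: take the queue lock with
bottom halves disabled in the process-context paths so the softirq path
cannot deadlock against them:

/* in qlcnic_83xx_mailbox_worker() and the other process-context users */
spin_lock_bh(&mbx->queue_lock);
/* ... dequeue and complete the mailbox command ... */
spin_unlock_bh(&mbx->queue_lock);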
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 233ac3891607f501f08879134d623b303838f478)
Reviewed-by: Dongli Zhang <dongli.zhang@oracle.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Ankur Arora [Mon, 5 Feb 2018 03:35:07 +0000 (22:35 -0500)]
x86/entry: RESTORE_IBRS needs to be done under kernel CR3
RESTORE_IBRS_CLOBBER executes after we have already switched
to the USER_CR3. This blows up because RESTORE_IBRS_CLOBBER
looks at a kernel variable (use_ibrs).
This fixes a corner case that is caused by a race of dio write vs dio
read/write.
Here is how the race could happen.
Suppose that no extent map has been loaded into memory yet.
There is a file extent [0, 32K), two jobs are running concurrently
against it, t1 is doing dio write to [8K, 32K) and t2 is doing dio
read from [0, 4K) or [4K, 8K).
t1 goes ahead of t2 and splits em [0, 32K) to em [0K, 8K) and [8K 32K).
When add_extent_mapping() gets -EEXIST for adding em [0, 32k),
search_extent_mapping() would return [0, 8k) as existing em, even
though start == existing->start, em is [0, 32k) and
extent_map_end(em) > extent_map_end(existing), ie. 32k > 8k,
then it goes thru merge_extent_mapping() which tries to add a [8k, 8k)
(with a length 0), and btrfs_get_extent() ends up returning -EEXIST,
and dio read/write will get -EEXIST which is confusing applications.
Here I also concluded all possible situations,
1) start < existing->start
After going thru the above case by case, it turns out that if start is
within existing em (front inclusive), then the existing em should be
returned, otherwise, we try our best to merge candidate em with
sibling ems to form a larger em.
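In code, the rule reads roughly like this (a sketch of the -EEXIST
handling in btrfs_get_extent(); error unwinding is elided):

if (existing->start <= start && start < extent_map_end(existing)) {
	/* start falls inside the existing em (front inclusive):
	 * return it instead of building a zero-length candidate */
	free_extent_map(em);
	em = existing;
} else {
	/* otherwise try to merge em with its sibling ems */
	err = merge_extent_mapping(em_tree, existing, em, start);
	free_extent_map(existing);
	if (err)
		em = ERR_PTR(err);
}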
Reported-by: David Vallender <david.vallender@landmark.co.uk>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
%block_len could be checked when deciding whether two ems are mergeable.
merge_extent_mapping() has only added the front pad if the front part
of the em gets truncated, but it's possible that the end part gets
truncated.
For both compressed extents and inline extents, em->block_len is not
adjusted accordingly, while for regular extents em->block_len always
equals em->len; hence this sets em->block_len to em->len.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
My QEMU VM was seeing inexplicable I/O errors that I tracked down to
errors coming from the qcow2 virtual drive in the host system. The qcow2
file is a nocow file on my Btrfs drive, which QEMU opens with O_DIRECT.
Every once in a while, pread() or pwrite() would return EEXIST, which
makes no sense. This turned out to be a bug in btrfs_get_extent().
Commit 8dff9c853410 ("Btrfs: deal with duplciates during extent_map
insertion in btrfs_get_extent") fixed a case in btrfs_get_extent() where
two threads race on adding the same extent map to an inode's extent map
tree. However, if the added em is merged with an adjacent em in the
extent tree, then we'll end up with an existing extent that is not
identical to but instead encompasses the extent we tried to add. When we
call merge_extent_mapping() to find the nonoverlapping part of the new
em, the arithmetic overflows because there is no such thing. We then end
up trying to add a bogus em to the em_tree, which results in a EEXIST
that can bubble all the way up to userspace.
Fix it by extending the identical extent map special case.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
(cherry picked from commit 8e2bd3b7fac91b79a6115fd1511ca20b2a09696d)
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
When dealing with inline extents, btrfs_get_extent will incorrectly try
to insert a duplicate extent_map. The dup hits -EEXIST from
add_extent_map, but then we try to merge with the existing one and end
up trying to insert a zero length extent_map.
This actually works most of the time, except when there are extent maps
past the end of the inline extent. rocksdb will trigger this sometimes
because it preallocates an extent and then truncates down.
You'll get EIOs trying to read the inline extent after this because
add_extent_map keeps returning EEXIST.
Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 8dff9c85341032767d7b519217a79ea04cd676b0)
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: John Haxby <john.haxby@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
K. Y. Srinivasan [Fri, 23 Dec 2016 00:54:03 +0000 (16:54 -0800)]
Drivers: hv: util: Backup: Fix a rescind processing issue
VSS may use a char device to support the communication between
the user level daemon and the driver. When the VSS channel is rescinded
we need to make sure that the char device is fully cleaned up before
we can process a new VSS offer from the host. Implement this logic.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Orabug: 27426063
(cherry picked from commit d77044d142e960f7b5f814a91ecb8bcf86aa552c)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Ashok Vairavan <ashok.vairavan@oracle.com>
Alex Ng [Sun, 6 Nov 2016 21:14:11 +0000 (13:14 -0800)]
Drivers: hv: vss: Operation timeouts should match host expectation
Increase the timeout of backup operations. When system is under I/O load,
it needs more time to freeze. These timeout values should also match the
host timeout values more closely.
Signed-off-by: Alex Ng <alexng@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Orabug: 27426063
(cherry picked from commit b357fd3908c1191f2f56e38aa77f2aecdae18bc8)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Ashok Vairavan <ashok.vairavan@oracle.com>
Alex Ng [Sun, 6 Nov 2016 21:14:10 +0000 (13:14 -0800)]
Drivers: hv: vss: Improve log messages.
Adding log messages to help troubleshoot error cases and transaction
handling.
Signed-off-by: Alex Ng <alexng@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Orabug: 27426063
(cherry picked from commit 23d2cc0c29eb0e7c6fe4cac88098306c31c40208)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Ashok Vairavan <ashok.vairavan@oracle.com>
Alex Ng [Fri, 2 Sep 2016 12:58:25 +0000 (05:58 -0700)]
Drivers: hv: utils: Check VSS daemon is listening before a hot backup
Hyper-V host will send a VSS_OP_HOT_BACKUP request to check if guest is
ready for a live backup/snapshot. The driver should respond to the check
only if the daemon is running and listening to requests. This allows the
host to fallback to standard snapshots in case the VSS daemon is not
running.
Signed-off-by: Alex Ng <alexng@messages.microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Orabug: 27426063
(cherry picked from commit db886e4d24c2b3d334be2cc1bd1bd05d547eb4c4)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Ashok Vairavan <ashok.vairavan@oracle.com>
Alex Ng [Fri, 2 Sep 2016 12:58:24 +0000 (05:58 -0700)]
Drivers: hv: utils: Continue to poll VSS channel after handling requests.
Multiple VSS_OP_HOT_BACKUP requests may arrive in quick succession, even
though the host only signals once. The driver was handling the first
request while ignoring the others in the ring buffer. We should poll the
VSS channel after handling a request to continue processing other requests.
Signed-off-by: Alex Ng <alexng@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Orabug: 27426063
(cherry picked from commit 497af84b81b98b27e9ba7aebb8a373412e328497)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Ashok Vairavan <ashok.vairavan@oracle.com>
Vitaly Kuznetsov [Fri, 10 Jun 2016 00:08:57 +0000 (17:08 -0700)]
Drivers: hv: utils: fix a race on userspace daemons registration
Background: userspace daemons registration protocol for Hyper-V utilities
drivers has two steps:
1) daemon writes its own version to kernel
2) kernel reads it and replies with module version
at this point we consider the handshake procedure being completed and we
do hv_poll_channel() transitioning the utility device to HVUTIL_READY
state. At this point we're ready to handle messages from kernel.
When hvutil_transport is in HVUTIL_TRANSPORT_CHARDEV mode we have a
single buffer for the outgoing message. hvutil_transport_send() puts the
message into this buffer, and until the buffer is cleared by hvt_op_read()
it returns -EFAULT to all subsequent calls. The host<->guest protocol
guarantees there is no more than one request at a time and that we will
not get new requests until we reply to the previous one, so this single
message buffer is enough.
Now to the race. When we finish the negotiation procedure and send the
kernel module version to userspace with hvutil_transport_send(), it goes
into the above mentioned buffer; if the daemon is slow enough to read it
from there, we can get a collision when a request from the host comes in:
we won't be able to put anything into the buffer, so the request will be
lost. To solve the issue we need to know when the negotiation is really
done (when the version message is read by the daemon) and transition to
HVUTIL_READY state after this happens. Implement a callback on read to
support this.
Old style netlink communication is not affected by the change, we don't
really know when these messages are delivered but we don't have a single
message buffer there.
Reported-by: Barry Davis <barry_davis@stormagic.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Orabug: 27426063
(cherry picked from commit e0fa3e5e7df61eb2c339c9f0067c202c0cdeec2c)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Ashok Vairavan <ashok.vairavan@oracle.com>
Olaf Hering [Tue, 15 Dec 2015 00:01:42 +0000 (16:01 -0800)]
Drivers: hv: vss: run only on supported host versions
The Backup integration service on WS2012 apparently has trouble
negotiating with a guest which does not support the provided util version.
Currently the VSS driver supports only version 5/0. A WS2012 host offers only
versions 1/x and 3/x, and vmbus_prep_negotiate_resp correctly returns an
empty icframe_vercnt/icmsg_vercnt. But the host ignores that and
continues to send ICMSGTYPE_NEGOTIATE messages. The result is weird
errors during boot and general misbehaviour.
Check the Windows version to work around the host bug, skip hv_vss_init
on WS2012 and older.
Signed-off-by: Olaf Hering <olaf@aepfle.de>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Orabug: 27426063
(cherry picked from commit ed9ba608e4851144af8c7061cbb19f751c73e998)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Ashok Vairavan <ashok.vairavan@oracle.com>
There is still a security hole left in the mem.c driver -- when securelevel
is set, a userland application can access PCI configuration space via this
driver.
--
Attempting to write via mmap() API using acc_test.
[root@ban92uut054 ~]# ./acc_test mmap 0x846000c4=0x1
Using mmap() API for access
mmap write wrote 0x1
/* acc_test.c - argument parsing part of the test program; the actual
 * pread()/pwrite() and mmap() access paths against /dev/mem follow it. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <limits.h>

int main(int argc, char *argv[])
{
	int access_type = 0;
	int operation = 0;	/* 0 = read, 1 = write */
	unsigned long addr, val = 0;
	char *tmp = NULL;

	if (argc < 3) {
		fprintf(stderr, "usage: %s <rw|mmap> <addr>[=<val>]\n", argv[0]);
		return -1;
	}

	if (strcmp("rw", argv[1]) == 0) {
		access_type = 1;
		printf("Using pread()/pwrite() API for access\n");
	}
	else if (strcmp("mmap", argv[1]) == 0) {
		access_type = 2;
		printf("Using mmap() API for access\n");
	}
	else {
		printf("Illegal access type: must be rw or mmap\n");
		return -1;
	}

	errno = 0;	/* strtoul() only sets errno on failure */
	addr = strtoul(argv[2], &tmp, 16);
	if ((tmp && (*tmp != '=')) &&
	    ((*tmp != '\0') || (errno == EINVAL) ||
	     (addr == ULONG_MAX && errno == ERANGE))) {
		fprintf(stderr, "Invalid address specified; must be hex based\n");
		if (errno) perror("error : ");
		exit(1);
	}
	else if (tmp && (*tmp == '=')) { // write case
		tmp++;
		val = strtoul(tmp, NULL, 16);
		operation = 1;
	}

	/* ... open /dev/mem and perform the requested access here ... */
	printf("addr=0x%lx val=0x%lx op=%d type=%d\n", addr, val, operation, access_type);
	return 0;
}
This patch fixes the hole whereby, when securelevel is set, one could write to
/dev/mem via the mmap() API. The fix is to disallow opening /dev/mem or /dev/kmem
for access, checking at open time rather than having get_securelevel() called at
the various write/read locations.
Signed-off-by: James Puthukattukaran <james.puthukattukaran@oracle.com>
Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
Reviewed-by: Eric Snowberg <eric.snowberg@oracle.com>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle>
(cherry picked from commit 8ca81822fffb2536aa1076fb17d1ea1413df3e7f)
Add this commit back to UEK4 now that the regressions have been fixed
Ka-Cheong Poon [Mon, 18 Dec 2017 15:45:18 +0000 (07:45 -0800)]
rds: Calling getsockname() on unbound socket generates seg fault
If a socket is not yet bound, calling getsockname() should return an
unspecified address with family AF_UNSPEC as an RDS socket can be
bound to either an IPv4 or an IPv6 address. Hence returning either
family is incorrect. Currently, the returned address is set to an
unspecified address with family AF_INET6. If the passed in buffer is
smaller than sizeof(struct sockaddr_in6), the app may get a
segmentation fault error. This is similar to passing a buffer smaller
than sizeof(struct sockaddr_in) before the IPv6 changes.
Zhu Yanjun [Fri, 1 Dec 2017 02:13:20 +0000 (21:13 -0500)]
net/mlx4_core: allow QPs with enable_smi_admin enabled
Commit 64453f042519 ("net/mlx4_core: Disallow creation of RAW QPs
on a VF") disallows some QPs. But when enable_smi_admin is enabled,
such QPs should be allowed to pass.
Håkon Bugge [Fri, 8 Dec 2017 15:55:22 +0000 (16:55 +0100)]
net/rds: Fix incorrect error handling
Commit 5f58d7e81c2f ("net/rds: reduce memory footprint during
ib_post_recv in IB transport") removes order two allocations used to
receive fragments. Instead, zero order allocations are used. However,
said commit has incorrect error handling.
Konrad Rzeszutek Wilk [Sat, 13 Jan 2018 03:32:23 +0000 (22:32 -0500)]
x86: Move STUFF_RSB in to the idt macro
instead of it sitting in paranoid_entry or error_entry.
The idea behind STUFF_RSB is that it be done _before_
any calls are made. Which means we really want it in the idt
macro that is used for exceptions - such as device not available,
which currently looks like this:
[Ignore the callq *0x40.. that gets converted to a 'cld']
<device_not_available>:
nop
nop
nop
callq *0x40d0b7(%rip) # ffffffff81b55330 <pv_irq_ops+0x30> <= patched to cld
pushq $0xffffffffffffffff
sub $0x78,%rsp
callq ffffffff81748ea0 <error_entry> <=== call!
mov %rsp,%rdi
xor %esi,%esi
callq ffffffff81018830 <do_device_not_available>
test %rax,%rax
jne ffffffff81747f10 <dtrace_error_exit>
jmpq ffffffff817490a0 <error_exit>
nopl 0x0(%rax)
By stuffing the RSB before the call to error_entry (or
paranoid_entry) we remove the chance of this becoming an attack vector.
While at it, remove the useless comment - we don't encode any frames
in UEK4.
OraBug: 27417150
Reviewed-by: Kris Van Hees <kris.van.hees@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Konrad Rzeszutek Wilk [Sat, 13 Jan 2018 02:05:45 +0000 (21:05 -0500)]
x86/spec: STUFF_RSB _before_ ENABLE_IBRS
And also we need to STUFF_RSB _before_ calls.
In our case we have a bunch of ENABLE_INTERRUPTS
which are (in objdump):
callq *0x40b379(%rip) <pv_cpu_ops+0x128>
During bootup they do change to 'cld' (on baremetal).
On Xen PV they end up being those calls and STUFF_RSB is still
in effect which means it should be done before those calls are made.
Also the semantics of the IBRS MSR are: "If IBRS is set, .. indirect
calls will not allow their predicted target address to be controlled ...
so long as all RSB entries from the previous less privileged prediction
mode are overwritten."
In other words - STUFF_RSB, then ENABLE_IBRS.
Xen hypervisor code follows that religiously and so shall we.
OraBug: 27448169
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Ankur Arora <ankur.a.arora@oracle.com>
Reviewed-by: Kris Van Hees <kris.van.hees@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Boris Ostrovsky [Tue, 23 Jan 2018 16:02:42 +0000 (11:02 -0500)]
x86/IBRS: Don't try to change IBRS mode if IBRS is not available
sysctl_ibrs_enabled is set properly when IBRS support is discovered
in init_scattered_cpuid_features(). We should simply return an error
when an attempt is made to change the IBRS mode when the feature is not
supported.
There is also no need to call set_ibrs_inuse() when changing the mode to 1:
this is already done by clear_ibrs_disabled().
Boris Ostrovsky [Tue, 23 Jan 2018 16:02:41 +0000 (11:02 -0500)]
x86/IBRS: Remove support for IBRS_ENABLED_USER mode
This mode was added based on our understanding of IBRS_ATT (IBRS
All The Time) described in early versions of Intel documentation.
We assumed that while "basic" IBRS protects kernel from using
predictions created by userland, IBRS_ATT will provide similar
defence between usermode tasks.
This understanding was incorrect.
Instead, IBRS_ATT (also referred to as "Enhanced IBRS") allows the
kernel to write IBRS MSR once, during boot, and never have to write
it again. This is in contrast to basic IBRS where every change of
protection mode required an MSR write, which is somewhat expensive.
Enhanced IBRS is not available on existing processors. Until it
becomes available we remove IBRS_ENABLED_USER.
While doing this, also add a test in ibrs_enabled_write() that will
only process input if the mode will actually change.
Qing Huang [Tue, 16 Jan 2018 19:27:32 +0000 (11:27 -0800)]
mlx4: add mstflint secure boot access kernel support
Source files are copied from:
https://github.com/Mellanox/mstflint.git master_devel
(under kernel sub-directory - changes up to commit a8a84fae7459aba01ff6ad6cbc40622b7615b110)
Due to recent security changes in UEK4, the mstflint FW tool may
lose the ability to access CX3 HCAs in a Secure Boot enabled
environment. This kernel patch, in addition to an enhanced
version of the mstflint tool, will let us regain the ability to
manage CX3 FW when Secure Boot is enabled while running the latest
UEK4 kernels.
Jia Zhang [Mon, 1 Jan 2018 02:04:47 +0000 (10:04 +0800)]
x86/microcode/intel: Extend BDW late-loading with a revision check
Instead of blacklisting all model 79 CPUs when attempting a late
microcode loading, limit that only to CPUs with microcode revisions <
0x0b000021 because only on those late loading may cause a system hang.
For such processors either:
a) a BIOS update which might contain a newer microcode revision
or
b) the early microcode loading method
should be considered.
Processors with revisions 0x0b000021 or higher will not experience such
hangs.
For more details, see erratum BDF90 in document #334165 (Intel Xeon
Processor E7-8800/4800 v4 Product Family Specification Update) from
September 2017.
[ bp: Heavily massage commit message and pr_* statements. ]
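The gate itself is small; a sketch following the upstream change (model
79 is Broadwell-EP/EX; the real patch also checks the stepping, omitted
here):

static bool is_blacklisted(unsigned int cpu)
{
	struct cpuinfo_x86 *c = &cpu_data(cpu);

	if (c->x86 == 6 && c->x86_model == 79 &&	/* BDW-EP/EX */
	    c->microcode < 0x0b000021) {
		pr_err_once("late loading blocked on this revision; update BIOS or use early loading\n");
		return true;
	}
	return false;
}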
Ian Kent [Mon, 19 Sep 2016 21:44:12 +0000 (14:44 -0700)]
autofs: use dentry flags to block walks during expire
Somewhere along the way the autofs expire operation has changed to hold
a spin lock over expired dentry selection. The autofs indirect mount
expired dentry selection is complicated and quite lengthy so it isn't
appropriate to hold a spin lock over the operation.
Commit 47be61845c77 ("fs/dcache.c: avoid soft-lockup in dput()") added a
might_sleep() to dput() causing a WARN_ONCE() about this usage to be
issued.
But the spin lock doesn't need to be held over this check; the autofs
dentry info flags are enough to block walks into dentries during the
expire.
I've left the direct mount expire as it is (for now) because it is much
simpler and quicker than the indirect mount expire, and adding spin lock
release and re-acquires would do nothing more than add overhead.
Fixes: 47be61845c77 ("fs/dcache.c: avoid soft-lockup in dput()")
Link: http://lkml.kernel.org/r/20160912014017.1773.73060.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Reported-by: Takashi Iwai <tiwai@suse.de>
Tested-by: Takashi Iwai <tiwai@suse.de>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: NeilBrown <neilb@suse.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Orabug: 26032471
(cherry picked from commit 7cbdb4a286a60c5d519cb9223fe2134d26870d39)
Signed-off-by: Mingming Cao <mingming.cao@oracle.com>
Reviewed-by: Shirley Ma <shirley.ma@oracle.com>
Al Viro [Sun, 12 Jun 2016 15:24:46 +0000 (11:24 -0400)]
autofs races
* make autofs4_expire_indirect() skip the dentries being in process of
expiry
* do *not* mess with list_move(); making sure that dentry with
AUTOFS_INF_EXPIRING are not picked for expiry is enough.
* do not remove NO_RCU when we set EXPIRING, don't bother with smp_mb()
there. Clear it at the same time we clear EXPIRING. Makes a bunch of
tests simpler.
* rename NO_RCU to WANT_EXPIRE, which is what it really is.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Orabug: 26032471
(cherry picked from commit ea01a18494b3d7a91b2f1f2a6a5aaef4741bc294)
Signed-off-by: Mingming Cao <mingming.cao@oracle.com>
Reviewed-by: Shirley Ma <shirley.ma@oracle.com>
* tag 'v4.1.12-122.bug26670475#v3':
xen-blkback: add pending_req allocation stats
xen-blkback: move indirect req allocation out-of-line
xen-blkback: pull nseg validation out in a function
xen-blkback: make struct pending_req less monolithic
Kanth Ghatraju [Thu, 11 Jan 2018 22:27:49 +0000 (17:27 -0500)]
x86: Display correct settings for the SPECTRE_V2 bug
Update the display message for Spectre v2. Move the setup of the
bug bits to the identify_cpu routine so they remain persistent, since this
routine reinitializes the data structure.
Thomas Gleixner [Thu, 11 Jan 2018 22:04:43 +0000 (17:04 -0500)]
sysfs/cpu: Add vulnerability folder
As the meltdown/spectre problem affects several CPU architectures, it makes
sense to have common way to express whether a system is affected by a
particular vulnerability or not. If affected the way to express the
mitigation should be common as well.
Create /sys/devices/system/cpu/vulnerabilities folder and files for
meltdown, spectre_v1 and spectre_v2.
Allow architectures to override the show function.
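The pattern, following upstream 87590ce6e373: weak default show
functions that architectures can override, wired up as read-only files:

ssize_t __weak cpu_show_meltdown(struct device *dev,
				 struct device_attribute *attr, char *buf)
{
	return sprintf(buf, "Not affected\n");
}

static DEVICE_ATTR(meltdown, 0444, cpu_show_meltdown, NULL);
/* spectre_v1 and spectre_v2 follow the same pattern, grouped under
 * /sys/devices/system/cpu/vulnerabilities/ */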
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linuxfoundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lkml.kernel.org/r/20180107214913.096657732@linutronix.de
(cherry picked from commit 87590ce6e373d1a5401f6539f0c59ef92dd924a9)
Orabug: 27353383
Signed-off-by: Kanth Ghatraju <kanth.ghatraju@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Conflicts:
Documentation/ABI/testing/sysfs-devices-system-cpu
include/linux/cpu.h
Conflicts resolved by picking only the changes from original
patch.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
David Woodhouse [Thu, 11 Jan 2018 21:59:40 +0000 (16:59 -0500)]
x86/cpufeatures: Add X86_BUG_SPECTRE_V[12]
Add the bug bits for spectre v1/2 and force them unconditionally for all
cpus.
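Upstream this is a two-liner in early CPU identification (note that this
backport sets only the V2 bit, per the conflict note below):

/* all x86 CPUs are assumed affected until proven otherwise */
setup_force_cpu_bug(X86_BUG_SPECTRE_V1);
setup_force_cpu_bug(X86_BUG_SPECTRE_V2);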
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: gnomes@lxorguk.ukuu.org.uk
Cc: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Kees Cook <keescook@google.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Greg Kroah-Hartman <gregkh@linux-foundation.org>
Cc: Paul Turner <pjt@google.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/1515239374-23361-2-git-send-email-dwmw@amazon.co.uk
(cherry picked from commit 99c6fa2511d8a683e61468be91b83f85452115fa)
Orabug: 27353383
Signed-off-by: Kanth Ghatraju <kanth.ghatraju@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Conflicts:
arch/x86/include/asm/cpufeatures.h
arch/x86/kernel/cpu/common.c
Resolved the conflict to pick only the change required in common.c. The changes
in cpufeatures.h have been implemented in cpufeature.h. We do not set the
bit for the SPECTRE_V1 bug.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Kanth Ghatraju [Thu, 11 Jan 2018 21:52:30 +0000 (16:52 -0500)]
x86/cpufeatures: Add X86_BUG_CPU_MELTDOWN
Add the BUG bit to indicate that the CPU is affected by the leak due to
lack of isolation of kernel and user space page tables. Currently AMD
CPUs are not affected by this.
Andrew Honig [Wed, 10 Jan 2018 18:12:03 +0000 (10:12 -0800)]
KVM: x86: Add memory barrier on vmcs field lookup
This adds a memory barrier when performing a lookup into
the vmcs_field_to_offset_table. This is related to
CVE-2017-5753.
This particular scenario would involve an L1 hypervisor using
vmread/vmwrite to try to execute a variant 1 side channel leak on the host.
In general variant 1 relies on a bounds check that gets bypassed
speculatively. However, it requires a fairly specific code pattern to
actually be useful for an exploit, which is why most bounds checks do
not require a speculation barrier. It requires two memory references
close to each other: one that is out of bounds and attacker
controlled, and one whose memory address is based on the memory
read in the first access. The first memory reference is a read of the
memory that the attacker wants to leak, and the second reference
creates a side channel in the cache where the line accessed represents
the data to be leaked.
This code has that pattern because a potentially very large value for
field could be used in the vmcs_field_to_offset_table lookup, which will be
put into f. Then very shortly thereafter, and potentially still in
the speculation window, f will be dereferenced in the vmcs_field_to_offset
function.
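With the barrier in place the lookup reads roughly like this (a sketch;
as noted below, this backport uses the osb() macro where the upstream
commit used a bare lfence):

static inline short vmcs_field_to_offset(unsigned long field)
{
	const size_t size = ARRAY_SIZE(vmcs_field_to_offset_table);

	if (field >= size)
		return -ENOENT;

	osb();	/* no speculative load with the attacker-controlled index */

	if (vmcs_field_to_offset_table[field] == 0)
		return -ENOENT;

	return vmcs_field_to_offset_table[field];
}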
OraBug: 27380809
Signed-off-by: Andrew Honig <ahonig@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit 75f139aaf896d6fdeec2e468ddfa4b2fe469bf40)
[The upstream commit used asm('lfence') but we already have the osb()
macro so changed that out]
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
KVM allows guests to directly access I/O port 0x80 on Intel hosts. If
the guest floods this port with writes it generates exceptions and
instability in the host kernel, leading to a crash. With this change
guest writes to port 0x80 on Intel will behave the same as they
currently behave on AMD systems.
Prevent the flooding by removing the code that sets port 0x80 as a
passthrough port. This is essentially the same as upstream patch 99f85a28a78e96d28907fe036e1671a218fee597, except that patch was
for AMD chipsets and this patch is for Intel.
Signed-off-by: Andrew Honig <ahonig@google.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
(cherry picked from commit d59d51f088014f25c2562de59b9abff4f42a7468)
Orabug: 27206805
CVE: CVE-2017-1000407
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Acked-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Joao Martins [Tue, 16 Jan 2018 19:03:12 +0000 (19:03 +0000)]
ixgbevf: handle mbox_api_13 in ixgbevf_change_mtu
Commit 180603fe7 added a new API but failed to update one place,
specifically ixgbevf_change_mtu. This omission leads to
MTU set failures on 82599 VFs.
Orabug: 27397028
Fixes: 180603fe7 ("ixgbevf: Add support for VF promiscuous mode")
Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Tested-by: Chuan Liu <chuan.liu@oracle.com>
struct pending_req is allocated ahead-of-time (in connect_ring())
for each ring slot. This is potentially a large number of allocations
(number-of-queues * ring-slots) and given that the structure is sized
for the worst case (MAX_INDIRECT_SEGMENTS), each element is 16616 bytes
on 64-bit.
The allocation itself is via kmalloc so this becomes multiple order-3
allocations for each vbd.
This patch slims down the structure by limiting the pre-allocated
structures to BLKIF_MAX_SEGMENTS_PER_REQUEST. Requests larger than
this are allocated dynamically. On my machine (E5-2660 0 @ 2.20GHz),
without any memory pressure, this adds an average of about 1us to do
the indirect allocation path.
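In outline (a sketch with illustrative helper names; the real patch
reworks xen-blkback's pending_req handling):

/* ring slots are preallocated for the common case only */
if (nseg <= BLKIF_MAX_SEGMENTS_PER_REQUEST)
	pending_req = alloc_req(ring);		/* preallocated pool */
else
	pending_req = kzalloc(sizeof(*pending_req) +
			      nseg * sizeof(struct seg_buf), GFP_KERNEL);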
Tim Tianyang Chen [Tue, 9 Jan 2018 23:57:28 +0000 (15:57 -0800)]
x86/fpu: Don't let userspace set bogus xcomp_bv
On x86, userspace can use the ptrace() or rt_sigreturn() system calls to
set a task's extended state (xstate) or "FPU" registers. ptrace() can
set them for another task using the PTRACE_SETREGSET request with
NT_X86_XSTATE, while rt_sigreturn() can set them for the current task.
In either case, registers can be set to any value, but the kernel
assumes that the XSAVE area itself remains valid in the sense that the
CPU can restore it.
However, in the case where the kernel is using the uncompacted xstate
format (which it does whenever the XSAVES instruction is unavailable),
it was possible for userspace to set the xcomp_bv field in the
xstate_header to an arbitrary value. However, all bits in that field
are reserved in the uncompacted case, so when switching to a task with
nonzero xcomp_bv, the XRSTOR instruction failed with a #GP fault. This
caused the WARN_ON_FPU(err) in copy_kernel_to_xregs() to be hit. In
addition, since the error is otherwise ignored, the FPU registers from
the task previously executing on the CPU were leaked.
Fix the bug by checking that the user-supplied value of xcomp_bv is 0 in
the uncompacted case, and returning an error otherwise.
The reason for validating xcomp_bv rather than simply overwriting it
with 0 is that we want userspace to see an error if it (incorrectly)
provides an XSAVE area in compacted format rather than in uncompacted
format.
Note that as before, in case of error we clear the task's FPU state.
This is perhaps non-ideal, especially for PTRACE_SETREGSET; it might be
better to return an error before changing anything. But it seems the
"clear on error" behavior is fine for now, and it's a little tricky to
do otherwise because it would mean we couldn't simply copy the full
userspace state into kernel memory in one __copy_from_user().
This bug was found by syzkaller, which hit the above-mentioned
WARN_ON_FPU():
Here is a C reproducer. The expected behavior is that the program spin
forever with no output. However, on a buggy kernel running on a
processor with the "xsave" feature but without the "xsaves" feature
(e.g. Sandy Bridge through Broadwell for Intel), within a second or two
the program reports that the xmm registers were corrupted, i.e. were not
restored correctly. With CONFIG_X86_DEBUG_FPU=y it also hits the above
kernel warning.
Note: the program only tests for the bug using the ptrace() system call.
The bug can also be reproduced using the rt_sigreturn() system call, but
only when called from a 32-bit program, since for 64-bit programs the
kernel restores the FPU state from the signal frame by doing XRSTOR
directly from userspace memory (with proper error checking).
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: <stable@vger.kernel.org> [v3.17+]
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Eric Biggers <ebiggers3@gmail.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Kevin Hao <haokexin@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Halcrow <mhalcrow@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: Yu-cheng Yu <yu-cheng.yu@intel.com>
Cc: kernel-hardening@lists.openwall.com
Fixes: 0b29643 ("x86/xsaves: Change compacted format xsave area header")
Link: http://lkml.kernel.org/r/20170922174156.16780-2-ebiggers3@gmail.com
Link: http://lkml.kernel.org/r/20170923130016.21448-25-mingo@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 814fb7bb7db5433757d76f4c4502c96fc53b0b5e)
Signed-off-by: Tim Tianyang Chen <tianyang.chen@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Hand-picked because UEK4 is missing some commits that refactored FPU
functions out. For UEK4, xstateregs_set() is in arch/x86/kernel/i387.c
and __fpu__restore_sig() is in arch/x86/kernel/xsave.c;
xsave->header is xsave->xsave_hdr_struct and
xregs_state is xsave.
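The validation itself is small; a sketch in the UEK4 terms above (the
surrounding copy-in code is elided):

/* In the uncompacted format every bit of xcomp_bv is reserved, so a
 * nonzero user-supplied value can only mean a compacted-format buffer
 * (or garbage) that XRSTOR would #GP on. */
if (!cpu_has_xsaves && xsave->xsave_hdr_struct.xcomp_bv)
	return -EINVAL;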
Xin Long [Tue, 17 Oct 2017 15:26:10 +0000 (23:26 +0800)]
sctp: do not peel off an assoc from one netns to another one
Now when peeling off an association to a sock in another netns, all
transports in this assoc are not rehashed and keep using the old
key in the hashtable.
As a transport uses sk->net as the hash key to insert into the hashtable,
it would miss removing these transports from the hashtable due to the new
netns when closing the sock while all transports are being freed; then
later a use-after-free issue could be caused when looking up an asoc
and dereferencing those transports.
This is a very old issue, present since the very beginning. ChunYu found it
with syzkaller fuzz testing with this series:
This patch blocks this call when peeling one assoc off from one
netns to another one, so that the netns of all transports would not
go out of sync with the key in the hashtable.
Note that this patch didn't fix it by rehashing transports, as it's
difficult to handle the situation when the tuple is already in use
in the new netns. Besides, no one would like to peel off one assoc
to another netns, considering ipaddrs, ifaces, etc. are usually
different.
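The guard itself, per upstream df80cd9b28b9, sits in sctp_do_peeloff():

/* refuse to peel off an asoc into a socket in a different netns:
 * the transports' hash keys would go stale */
if (!net_eq(current->nsproxy->net_ns, sock_net(sk)))
	return -EINVAL;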
Reported-by: ChunYu Wang <chunwang@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit df80cd9b28b9ebaa284a41df611dbf3a2d05ca74)
Andrey Konovalov [Thu, 2 Nov 2017 14:38:21 +0000 (10:38 -0400)]
media: dib0700: fix invalid dvb_detach argument
dvb_detach(arg) calls symbol_put_addr(arg), where arg should be a pointer
to a function. Right now a pointer to state->dib7000p_ops is passed to
dvb_detach(), which causes a BUG() in symbol_put_addr() as discovered by
syzkaller. Pass state->dib7000p_ops.set_wbd_ref instead.
Linus Torvalds [Sun, 20 Aug 2017 20:26:27 +0000 (13:26 -0700)]
Sanitize 'move_pages()' permission checks
The 'move_pages()' system call was introduced long long ago with the
same permission checks as for sending a signal (except using
CAP_SYS_NICE instead of CAP_KILL for the overriding capability).
That turns out to not be a great choice - while the system call really
only moves physical page allocations around (and you need other
capabilities to do a lot of it), you can check the return value to map
out some the virtual address choices and defeat ASLR of a binary that
still shares your uid.
So change the access checks to the more common 'ptrace_may_access()'
model instead.
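The check becomes (per upstream 197e7e521384, in the move_pages() syscall
body):

/* replaces the old uid/euid comparison */
if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) {
	rcu_read_unlock();
	err = -EPERM;
	goto out;
}
rcu_read_unlock();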
This tightens the access checks for the uid, and also effectively
changes the CAP_SYS_NICE check to CAP_SYS_PTRACE, but it's unlikely that
anybody really _uses_ this legacy system call any more (we have better
NUMA placement models these days), so I expect nobody to notice.
Famous last words.
Reported-by: Otto Ebeling <otto.ebeling@iki.fi>
Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 197e7e521384a23b9e585178f3f11c9fa08274b9)
David Howells [Wed, 11 Oct 2017 22:32:27 +0000 (23:32 +0100)]
assoc_array: Fix a buggy node-splitting case
This fixes CVE-2017-12193.
Fix a case in the assoc_array implementation in which a new leaf is
added that needs to go into a node that happens to be full, where the
existing leaves in that node cluster together at that level to the
exclusion of the new leaf.
What needs to happen is that the existing leaves get moved out to a new
node, N1, at level + 1 and the existing node needs replacing with one,
N0, that has pointers to the new leaf and to N1.
The code that tries to do this gets this wrong in two ways:
(1) The pointer that should've pointed from N0 to N1 is set to point
recursively to N0 instead.
(2) The backpointer from N0 needs to be set correctly in the case N0 is
either the root node or reached through a shortcut.
Fix this by removing this path and using the split_node path instead,
which achieves the same end, but in a more general way (thanks to Eric
Biggers for spotting the redundancy).
The problem manifests itself as:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
IP: assoc_array_apply_edit+0x59/0xe5
Fixes: 3cb989501c26 ("Add a generic associative array implementation.")
Reported-and-tested-by: WU Fan <u3536072@connect.hku.hk>
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: stable@vger.kernel.org [v3.13-rc1+]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit ea6789980fdaa610d7eb63602c746bf6ec70cd2b)
Mohamed Ghannam [Sun, 10 Dec 2017 03:50:58 +0000 (03:50 +0000)]
net: ipv4: fix for a race condition in raw_sendmsg
inet->hdrincl is racy, and could lead to uninitialized stack pointer
usage, so its value should be read only once.
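The fix latches the flag once into a local, per upstream 8f659a03a0ba:

/* inet->hdrincl can be flipped concurrently by setsockopt(); read it
 * once so the header-included decision cannot change mid-sendmsg */
int hdrincl = READ_ONCE(inet->hdrincl);
/* ... every later test uses 'hdrincl' instead of inet->hdrincl ... */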
Fixes: c008ba5bdc9f ("ipv4: Avoid reading user iov twice after raw_probe_proto_opt")
Signed-off-by: Mohamed Ghannam <simo.ghannam@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 8f659a03a0ba9289b9aeb9b4470e6fb263d6f483)
Pavel Tatashin [Fri, 12 Jan 2018 00:17:49 +0000 (19:17 -0500)]
x86/pti/efi: broken conversion from efi to kernel page table
In entry_64.S we have code like this:
/* Unconditionally use kernel CR3 for do_nmi() */
/* %rax is saved above, so OK to clobber here */
ALTERNATIVE "jmp 2f", "movq %cr3, %rax", X86_FEATURE_KAISER
/* If PCID enabled, NOFLUSH now and NOFLUSH on return */
ALTERNATIVE "", "bts $63, %rax", X86_FEATURE_PCID
pushq %rax
/* mask off "user" bit of pgd address and 12 PCID bits: */
andq $(~(X86_CR3_PCID_ASID_MASK | KAISER_SHADOW_PGD_OFFSET)), %rax
movq %rax, %cr3
2:
/* paranoidentry do_nmi, 0; without TRACE_IRQS_OFF */
call do_nmi
With this instruction:
andq $(~(X86_CR3_PCID_ASID_MASK | KAISER_SHADOW_PGD_OFFSET)), %rax
We unconditionally switch from whatever our CR3 was to kernel page table.
But in arch/x86/platform/efi/efi_64.c we temporarily set a different page
table that does not have the kernel page table at the 0x1000 offset from it.
Look in efi_thunk() and efi_thunk_set_virtual_address_map().
So, while CR3 points to the other page table, we get an NMI interrupt,
and clear 0x1000 from CR3, resulting in a bogus CR3 if the 0x1000 bit was
set.
The efi page table comes from realmode/rm/trampoline_64.S:
Notice: the alignment is PAGE_SIZE, so after applying KAISER_SHADOW_PGD_OFFSET,
which equals PAGE_SIZE, we can get a different page table.
But even if we fix the alignment here, the trampoline binary is later copied
into dynamically allocated memory in reserve_real_mode(), so we need to
fix that place as well.
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Steven Sistare <steven.sistare@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>