Zhu Yanjun [Fri, 1 Dec 2017 02:13:20 +0000 (21:13 -0500)]
net/mlx4_core: allow QPs with enable_smi_admin enabled
Commit 64453f042519 ("net/mlx4_core: Disallow creation of RAW QPs
on a VF") disallows some QPs. But when enable_smi_admin is enabled,
such QPs should be allowed to pass.
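A minimal, heavily hedged sketch of the intended exception; the call site and
error path are assumptions, while mlx4_vf_smi_enabled() is the existing helper
that reflects the enable_smi_admin setting:
/* Hedged sketch, not the actual UEK diff: keep the restriction on a VF
 * unless the admin has enabled SMI for that VF/port via enable_smi_admin. */
if (mlx4_is_slave(dev) && !mlx4_vf_smi_enabled(dev, slave, port))
        return -EPERM;  /* still disallowed when SMI is not admin-enabled */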
Håkon Bugge [Fri, 8 Dec 2017 15:55:22 +0000 (16:55 +0100)]
net/rds: Fix incorrect error handling
Commit 5f58d7e81c2f ("net/rds: reduce memory footprint during
ib_post_recv in IB transport") removes the order-two allocations used to
receive fragments. Instead, zero-order allocations are used. However,
said commit has incorrect error handling.
Konrad Rzeszutek Wilk [Sat, 13 Jan 2018 03:32:23 +0000 (22:32 -0500)]
x86: Move STUFF_RSB into the idt macro
instead of it sitting in paranoid_entry or error_entry.
The idea behind STUFF_RSB is that it needs to be done _before_
any calls are made, which means we really want this in the idt
macro that is used for exceptions - such as device not available,
which currently looks like so:
[Ignore the callq *0x40.. that gets converted to a 'cld']
<device_not_available>:
nop
nop
nop
callq *0x40d0b7(%rip) # ffffffff81b55330 <pv_irq_ops+0x30> <= patched to cld
pushq $0xffffffffffffffff
sub $0x78,%rsp
callq ffffffff81748ea0 <error_entry> <=== call!
mov %rsp,%rdi
xor %esi,%esi
callq ffffffff81018830 <do_device_not_available>
test %rax,%rax
jne ffffffff81747f10 <dtrace_error_exit>
jmpq ffffffff817490a0 <error_exit>
nopl 0x0(%rax)
By stuffing the RSB before the call to error_entry (or
paranoid_entry) we remove the chance of this becoming an attack vector.
While at it, remove the useless comment - we don't encode any frames
in UEK4.
OraBug: 27417150 Reviewed-by: Kris Van Hees <kris.van.hees@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Konrad Rzeszutek Wilk [Sat, 13 Jan 2018 02:05:45 +0000 (21:05 -0500)]
x86/spec: STUFF_RSB _before_ ENABLE_IBRS
We also need to STUFF_RSB _before_ any calls.
In our case we have a bunch of ENABLE_INTERRUPTS macros,
which are (in objdump):
callq *0x40b379(%rip) <pv_cpu_ops+0x128>
During bootup they are changed to 'cld' (on bare metal).
On Xen PV they end up being those calls, and since STUFF_RSB is still
in effect, it should be done before those calls are made.
Also, the semantics of the IBRS MSR is "If IBRS is set, .. indirect
calls will not allow their predicted target address to be controlled ...
so long as all RSB entries from the previous less privileged prediction
mode are overwritten."
In other words - STUFF_RSB, then ENABLE_IBRS.
Xen hypervisor code follows that religiously and so shall we.
OraBug: 27448169 Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Ankur Arora <ankur.a.arora@oracle.com> Reviewed-by: Kris Van Hees <kris.van.hees@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Boris Ostrovsky [Tue, 23 Jan 2018 16:02:42 +0000 (11:02 -0500)]
x86/IBRS: Don't try to change IBRS mode if IBRS is not available
sysctl_ibrs_enabled is set properly when IBRS support is discovered
in init_scattered_cpuid_features(). We should simply return an error
when an attempt is made to change the IBRS mode while the feature is not supported.
There is also no need to call set_ibrs_inuse() when changing mode to 1:
this is already done by clear_ibrs_disabled().
Boris Ostrovsky [Tue, 23 Jan 2018 16:02:41 +0000 (11:02 -0500)]
x86/IBRS: Remove support for IBRS_ENABLED_USER mode
This mode was added based on our understanding of IBRS_ATT (IBRS
All The Time) described in early versions of Intel documentation.
We assumed that while "basic" IBRS protects the kernel from using
predictions created by userland, IBRS_ATT would provide a similar
defence between usermode tasks.
This understanding was incorrect.
Instead, IBRS_ATT (also referred to as "Enhanced IBRS") allows the
kernel to write IBRS MSR once, during boot, and never have to write
it again. This is in contrast to basic IBRS where every change of
protection mode required an MSR write, which is somewhat expensive.
Enhanced IBRS is not available on existing processors. Until it
becomes available we remove IBRS_ENABLED_USER.
While doing this, also add a test in ibrs_enabled_write() that will
only process input if the mode will actually change.
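A hedged sketch of the write handler combining the checks described in this and
the preceding entry; sysctl_ibrs_enabled and clear_ibrs_disabled() follow this
log's naming, while ibrs_supported, new_mode and set_ibrs_disabled() are
assumptions for illustration only:
/* Hedged sketch of ibrs_enabled_write() behaviour, not the actual UEK code. */
if (!ibrs_supported)
        return -ENODEV;                 /* IBRS was not discovered by init_scattered_cpuid_features() */

if (new_mode == sysctl_ibrs_enabled)
        return count;                   /* mode unchanged - nothing to do */

if (new_mode == IBRS_ENABLED)
        clear_ibrs_disabled();          /* already marks IBRS in use; no extra set_ibrs_inuse() needed */
else
        set_ibrs_disabled();
sysctl_ibrs_enabled = new_mode;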
Qing Huang [Tue, 16 Jan 2018 19:27:32 +0000 (11:27 -0800)]
mlx4: add mstflint secure boot access kernel support
Source files are copied from:
https://github.com/Mellanox/mstflint.git master_devel
(under kernel sub-directory - changes up to commit a8a84fae7459aba01ff6ad6cbc40622b7615b110)
Due to recent security changes in UEK4, the mstflint FW tool may
lose the ability to access CX3 HCAs in a Secure Boot enabled
environment. This kernel patch, in addition to an enhanced
version of the mstflint tool, will let us regain the ability to
manage CX3 FW when Secure Boot is enabled while running the latest
UEK4 kernels.
Jia Zhang [Mon, 1 Jan 2018 02:04:47 +0000 (10:04 +0800)]
x86/microcode/intel: Extend BDW late-loading with a revision check
Instead of blacklisting all model 79 CPUs when attempting a late
microcode loading, limit that only to CPUs with microcode revisions <
0x0b000021 because only on those late loading may cause a system hang.
For such processors either:
a) a BIOS update which might contain a newer microcode revision
or
b) the early microcode loading method
should be considered.
Processors with revisions 0x0b000021 or higher will not experience such
hangs.
For more details, see erratum BDF90 in document #334165 (Intel Xeon
Processor E7-8800/4800 v4 Product Family Specification Update) from
September 2017.
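For reference, a simplified sketch of the upstream check in
arch/x86/kernel/cpu/microcode/intel.c; the real is_blacklisted() also looks at
stepping and LLC size, so treat this as an outline, not the exact UEK4 hunk:
static bool is_blacklisted(unsigned int cpu)
{
        struct cpuinfo_x86 *c = &cpu_data(cpu);

        /* BDF90: Broadwell model 79 with microcode older than 0x0b000021
         * may hang on late loading, so refuse only in that case. */
        if (c->x86 == 6 && c->x86_model == 79 && c->microcode < 0x0b000021) {
                pr_err_once("late loading disabled for this CPU, microcode revision 0x%x < 0x0b000021\n",
                            c->microcode);
                return true;
        }

        return false;
}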
[ bp: Heavily massage commit message and pr_* statements. ]
Ian Kent [Mon, 19 Sep 2016 21:44:12 +0000 (14:44 -0700)]
autofs: use dentry flags to block walks during expire
Somewhere along the way the autofs expire operation has changed to hold
a spin lock over expired dentry selection. The autofs indirect mount
expired dentry selection is complicated and quite lengthy so it isn't
appropriate to hold a spin lock over the operation.
Commit 47be61845c77 ("fs/dcache.c: avoid soft-lockup in dput()") added a
might_sleep() to dput() causing a WARN_ONCE() about this usage to be
issued.
But the spin lock doesn't need to be held over this check; the autofs
dentry info flags are enough to block walks into dentries during the
expire.
I've left the direct mount expire as it is (for now) because it is much
simpler and quicker than the indirect mount expire and adding spin lock
release and re-acquires would do nothing more than add overhead.
Fixes: 47be61845c77 ("fs/dcache.c: avoid soft-lockup in dput()") Link: http://lkml.kernel.org/r/20160912014017.1773.73060.stgit@pluto.themaw.net Signed-off-by: Ian Kent <raven@themaw.net> Reported-by: Takashi Iwai <tiwai@suse.de> Tested-by: Takashi Iwai <tiwai@suse.de> Cc: Takashi Iwai <tiwai@suse.de> Cc: NeilBrown <neilb@suse.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Orabug: 26032471
(cherry picked from commit 7cbdb4a286a60c5d519cb9223fe2134d26870d39) Signed-off-by: Mingming Cao <mingming.cao@oracle.com> Reviewed-by: Shirley Ma <shirley.ma@oracle.com>
Al Viro [Sun, 12 Jun 2016 15:24:46 +0000 (11:24 -0400)]
autofs races
* make autofs4_expire_indirect() skip the dentries that are in the process of
expiry
* do *not* mess with list_move(); making sure that dentries with
AUTOFS_INF_EXPIRING are not picked for expiry is enough.
* do not remove NO_RCU when we set EXPIRING, don't bother with smp_mb()
there. Clear it at the same time we clear EXPIRING. Makes a bunch of
tests simpler.
* rename NO_RCU to WANT_EXPIRE, which is what it really is.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Orabug: 26032471
(cherry picked from commit ea01a18494b3d7a91b2f1f2a6a5aaef4741bc294) Signed-off-by: Mingming Cao <mingming.cao@oracle.com> Reviewed-by: Shirley Ma <shirley.ma@oracle.com>
* tag 'v4.1.12-122.bug26670475#v3':
xen-blkback: add pending_req allocation stats
xen-blkback: move indirect req allocation out-of-line
xen-blkback: pull nseg validation out in a function
xen-blkback: make struct pending_req less monolithic
Kanth Ghatraju [Thu, 11 Jan 2018 22:27:49 +0000 (17:27 -0500)]
x86: Display correct settings for the SPECTRE_V2 bug
Update the display message for Spectre v2. Move the setup of the
bug bits to the identify_cpu routine so that they remain persistent, as this
routine reinitializes the data structure.
Thomas Gleixner [Thu, 11 Jan 2018 22:04:43 +0000 (17:04 -0500)]
sysfs/cpu: Add vulnerability folder
As the meltdown/spectre problem affects several CPU architectures, it makes
sense to have a common way to express whether a system is affected by a
particular vulnerability or not. If affected, the way to express the
mitigation should be common as well.
Create /sys/devices/system/cpu/vulnerabilities folder and files for
meltdown, spectre_v1 and spectre_v2.
Allow architectures to override the show function.
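A condensed sketch of the plumbing in drivers/base/cpu.c: weak default show
functions that architectures can override, grouped under the new
"vulnerabilities" directory. Only the meltdown attribute is spelled out here;
spectre_v1 and spectre_v2 follow the same pattern:
ssize_t __weak cpu_show_meltdown(struct device *dev,
                                 struct device_attribute *attr, char *buf)
{
        return sprintf(buf, "Not affected\n");
}

static DEVICE_ATTR(meltdown, 0444, cpu_show_meltdown, NULL);

static struct attribute *cpu_root_vulnerabilities_attrs[] = {
        &dev_attr_meltdown.attr,
        /* spectre_v1 and spectre_v2 attributes are registered the same way */
        NULL
};

static const struct attribute_group cpu_root_vulnerabilities_group = {
        .name  = "vulnerabilities",
        .attrs = cpu_root_vulnerabilities_attrs,
};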
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linuxfoundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: David Woodhouse <dwmw@amazon.co.uk> Link: https://lkml.kernel.org/r/20180107214913.096657732@linutronix.de
(cherry picked from commit 87590ce6e373d1a5401f6539f0c59ef92dd924a9)
Orabug: 27353383 Signed-off-by: Kanth Ghatraju <kanth.ghatraju@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Conflicts:
Documentation/ABI/testing/sysfs-devices-system-cpu
include/linux/cpu.h
Conflicts resolved by picking only the changes from original
patch. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
David Woodhouse [Thu, 11 Jan 2018 21:59:40 +0000 (16:59 -0500)]
x86/cpufeatures: Add X86_BUG_SPECTRE_V[12]
Add the bug bits for spectre v1/2 and force them unconditionally for all
cpus.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: gnomes@lxorguk.ukuu.org.uk Cc: Rik van Riel <riel@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Jiri Kosina <jikos@kernel.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Kees Cook <keescook@google.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Greg Kroah-Hartman <gregkh@linux-foundation.org> Cc: Paul Turner <pjt@google.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/1515239374-23361-2-git-send-email-dwmw@amazon.co.uk
(cherry picked from commit 99c6fa2511d8a683e61468be91b83f85452115fa)
Orabug: 27353383 Signed-off-by: Kanth Ghatraju <kanth.ghatraju@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Conflicts:
arch/x86/include/asm/cpufeatures.h
arch/x86/kernel/cpu/common.c
Resolved the conflict to pick only the change required in common.c. The changes
in cpufeatures.h have been implemented in cpufeature.h. We do not set the
bit for SPECTRE_V1 bug. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Kanth Ghatraju [Thu, 11 Jan 2018 21:52:30 +0000 (16:52 -0500)]
x86/cpufeatures: Add X86_BUG_CPU_MELTDOWN
Add the BUG bit to indicate that the CPU is affected by the leak due to
lack of isolation of kernel and user space page tables. Currently AMD
CPUs are not affected by this.
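A sketch of the bug-bit setup in the common CPU identification code, together
with the Spectre bits from the previous entry; note that the conflict note in
the previous entry says the SPECTRE_V1 bit is not set in the UEK4 backport:
/* Sketch of the early bug-bit setup: AMD is currently not affected by
 * Meltdown, so that bit is skipped there; the Spectre bits are forced
 * unconditionally for all CPUs. */
if (c->x86_vendor != X86_VENDOR_AMD)
        setup_force_cpu_bug(X86_BUG_CPU_MELTDOWN);

setup_force_cpu_bug(X86_BUG_SPECTRE_V1);   /* not set in the UEK4 backport */
setup_force_cpu_bug(X86_BUG_SPECTRE_V2);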
Andrew Honig [Wed, 10 Jan 2018 18:12:03 +0000 (10:12 -0800)]
KVM: x86: Add memory barrier on vmcs field lookup
This adds a memory barrier when performing a lookup into
the vmcs_field_to_offset_table. This is related to
CVE-2017-5753.
This particular scenario would involve an L1 hypervisor using
vmread/vmwrite to try to execute a variant 1 side channel leak on the host.
In general, variant 1 relies on a bounds check that gets bypassed
speculatively. However, it requires a fairly specific code pattern to
actually be useful for an exploit, which is why most bounds checks do
not require a speculation barrier. It requires two memory references
close to each other: one that is out of bounds and attacker
controlled, and one where the memory address is based on the memory
read in the first access. The first memory reference is a read of the
memory that the attacker wants to leak, and the second reference
creates a side channel in the cache where the line accessed represents
the data to be leaked.
This code has that pattern because a potentially very large value for
field could be used in the vmcs_field_to_offset_table lookup, whose result
is put into f. Then, very shortly thereafter and potentially still in
the speculation window, f is dereferenced in the vmcs_field_to_offset
function.
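A sketch of the hardened lookup; upstream inserts a bare lfence, while the
backport note below says the existing osb() speculation-barrier macro is used
instead:
static inline short vmcs_field_to_offset(unsigned long field)
{
        if (field >= ARRAY_SIZE(vmcs_field_to_offset_table) ||
            vmcs_field_to_offset_table[field] == 0)
                return -ENOENT;

        /* Speculation barrier: keep a mispredicted bounds check from being
         * used to feed an attacker-controlled index into the table load
         * (upstream: asm("lfence"); this backport: osb()). */
        osb();

        return vmcs_field_to_offset_table[field];
}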
OraBug: 27380809 Signed-off-by: Andrew Honig <ahonig@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit 75f139aaf896d6fdeec2e468ddfa4b2fe469bf40)
[The upstream commit used asm('lfence') but we already have the osb()
macro so changed that out] Reviewed-by: Boris.Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
KVM allows guests to directly access I/O port 0x80 on Intel hosts. If
the guest floods this port with writes it generates exceptions and
instability in the host kernel, leading to a crash. With this change
guest writes to port 0x80 on Intel will behave the same as they
currently behave on AMD systems.
Prevent the flooding by removing the code that sets port 0x80 as a
passthrough port. This is essentially the same as upstream patch 99f85a28a78e96d28907fe036e1671a218fee597, except that patch was
for AMD chipsets and this patch is for Intel.
Signed-off-by: Andrew Honig <ahonig@google.com> Signed-off-by: Jim Mattson <jmattson@google.com>
(cherry picked from commit d59d51f088014f25c2562de59b9abff4f42a7468)
Orabug: 27206805
CVE: CVE-2017-1000407 Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Acked-by: Liran Alon <liran.alon@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Joao Martins [Tue, 16 Jan 2018 19:03:12 +0000 (19:03 +0000)]
ixgbevf: handle mbox_api_13 in ixgbevf_change_mtu
Commit 180603fe7 added a new API but failed to update one place,
specifically in ixgbevf_change_mtu. The lack of it leads to
MTU set failures on 82599 VFs.
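A sketch of the missing case in ixgbevf_change_mtu()'s mailbox-API switch; the
exact enumerator spelling for the new API is an assumption:
        switch (adapter->hw.api_version) {
        case ixgbe_mbox_api_11:
        case ixgbe_mbox_api_12:
        case ixgbe_mbox_api_13:         /* the case the new API forgot to add */
                max_possible_frame = IXGBE_MAX_JUMBO_FRAME_SIZE;
                break;
        default:
                if (adapter->hw.mac.type != ixgbe_mac_82599_vf)
                        max_possible_frame = IXGBE_MAX_JUMBO_FRAME_SIZE;
                break;
        }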
Orabug: 27397028 Fixes: 180603fe7 ("ixgbevf: Add support for VF promiscuous mode") Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com> Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Tested-by: Chuan Liu <chuan.liu@oracle.com>
struct pending_req is allocated ahead-of-time (in connect_ring())
for each ring slot. This is potentially a large number of allocations
(number-of-queues * ring-slots) and given that the structure is sized
for the worst case (MAX_INDIRECT_SEGMENTS), each element is 16616 bytes
on 64-bit.
The allocation itself is via kmalloc so this becomes multiple order-3
allocations for each vbd.
This patch slims down the structure by limiting the pre-allocated
structures to BLKIF_MAX_SEGMENTS_PER_REQUEST. Requests larger than
this are allocated dynamically. On my machine (E5-2660 0 @ 2.20GHz),
without any memory pressure, this adds an average of about 1us to
the indirect allocation path.
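A heavily hedged sketch of the resulting allocation split; the helper names are
hypothetical, chosen only to illustrate the fast/slow paths described above:
/* Hedged sketch: keep the per-ring pre-allocation small and go to the
 * allocator only for indirect requests that exceed it. */
static struct pending_req *get_pending_req(struct xen_blkif_ring *ring,
                                           unsigned int nseg)
{
        if (nseg <= BLKIF_MAX_SEGMENTS_PER_REQUEST)
                return pop_prealloc_req(ring);          /* hypothetical fast-path helper */

        /* Indirect request: dynamic allocation, ~1us extra on average. */
        return alloc_indirect_req(nseg, GFP_KERNEL);    /* hypothetical slow-path helper */
}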
Tim Tianyang Chen [Tue, 9 Jan 2018 23:57:28 +0000 (15:57 -0800)]
x86/fpu: Don't let userspace set bogus xcomp_bv
On x86, userspace can use the ptrace() or rt_sigreturn() system calls to
set a task's extended state (xstate) or "FPU" registers. ptrace() can
set them for another task using the PTRACE_SETREGSET request with
NT_X86_XSTATE, while rt_sigreturn() can set them for the current task.
In either case, registers can be set to any value, but the kernel
assumes that the XSAVE area itself remains valid in the sense that the
CPU can restore it.
However, in the case where the kernel is using the uncompacted xstate
format (which it does whenever the XSAVES instruction is unavailable),
it was possible for userspace to set the xcomp_bv field in the
xstate_header to an arbitrary value. However, all bits in that field
are reserved in the uncompacted case, so when switching to a task with
nonzero xcomp_bv, the XRSTOR instruction failed with a #GP fault. This
caused the WARN_ON_FPU(err) in copy_kernel_to_xregs() to be hit. In
addition, since the error is otherwise ignored, the FPU registers from
the task previously executing on the CPU were leaked.
Fix the bug by checking that the user-supplied value of xcomp_bv is 0 in
the uncompacted case, and returning an error otherwise.
The reason for validating xcomp_bv rather than simply overwriting it
with 0 is that we want userspace to see an error if it (incorrectly)
provides an XSAVE area in compacted format rather than in uncompacted
format.
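A sketch of the check as it looks upstream in xstateregs_set(); per the
backport note at the end of this entry, the UEK4 field is xsave_hdr_struct
rather than header:
        ret = user_regset_copyin(&pos, &count, &kbuf, &ubuf, xsave, 0, -1);

        /* In the uncompacted (no-XSAVES) format every bit of xcomp_bv is
         * reserved, so reject a nonzero user-supplied value here instead of
         * letting XRSTOR #GP and leak the previous task's registers. */
        if (!ret && xsave->header.xcomp_bv)
                ret = -EINVAL;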
Note that as before, in case of error we clear the task's FPU state.
This is perhaps non-ideal, especially for PTRACE_SETREGSET; it might be
better to return an error before changing anything. But it seems the
"clear on error" behavior is fine for now, and it's a little tricky to
do otherwise because it would mean we couldn't simply copy the full
userspace state into kernel memory in one __copy_from_user().
This bug was found by syzkaller, which hit the above-mentioned
WARN_ON_FPU():
Here is a C reproducer. The expected behavior is that the program spins
forever with no output. However, on a buggy kernel running on a
processor with the "xsave" feature but without the "xsaves" feature
(e.g. Sandy Bridge through Broadwell for Intel), within a second or two
the program reports that the xmm registers were corrupted, i.e. were not
restored correctly. With CONFIG_X86_DEBUG_FPU=y it also hits the above
kernel warning.
Note: the program only tests for the bug using the ptrace() system call.
The bug can also be reproduced using the rt_sigreturn() system call, but
only when called from a 32-bit program, since for 64-bit programs the
kernel restores the FPU state from the signal frame by doing XRSTOR
directly from userspace memory (with proper error checking).
Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: <stable@vger.kernel.org> [v3.17+] Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Eric Biggers <ebiggers3@gmail.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Kevin Hao <haokexin@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michael Halcrow <mhalcrow@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Cc: Yu-cheng Yu <yu-cheng.yu@intel.com> Cc: kernel-hardening@lists.openwall.com Fixes: 0b29643 ("x86/xsaves: Change compacted format xsave area header") Link: http://lkml.kernel.org/r/20170922174156.16780-2-ebiggers3@gmail.com Link: http://lkml.kernel.org/r/20170923130016.21448-25-mingo@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 814fb7bb7db5433757d76f4c4502c96fc53b0b5e)
Signed-off-by: Tim Tianyang Chen <tianyang.chen@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Hand picked because this tree is missing some commits that refactored fpu
functions out. For UEK4, xstateregs_set() is in arch/x86/kernel/i387.c
and __fpu__restore_sig() is in arch/x86/kernel/xsave.c.
xsave->header is xsave->xsave_hdr_struct and
xregs_state is xsave.
Xin Long [Tue, 17 Oct 2017 15:26:10 +0000 (23:26 +0800)]
sctp: do not peel off an assoc from one netns to another one
Now when peeling off an association to a sock in another netns, the
transports in this assoc are not rehashed and keep using the old
key in the hashtable.
As a transport uses sk->net as the hash key to insert into the hashtable,
these transports would be missed when removing entries from the hashtable
due to the new netns when closing the sock and freeing all the transports;
later, a use-after-free issue could be caused when looking up an asoc
and dereferencing those transports.
This is a very old issue, present since the very beginning; ChunYu found it
with syzkaller fuzz testing with this series:
This patch is to block this call when peeling one assoc off from one
netns to another one, so that the netns of all transports would not
go out of sync with the key in the hashtable.
Note that this patch didn't fix it by rehashing transports, as it's
difficult to handle the situation when the tuple is already in use
in the new netns. Besides, no one would like to peel off one assoc
to another netns, considering ipaddrs, ifaces, etc. are usually
different.
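The block amounts to a single netns comparison in sctp_do_peeloff(); a sketch:
        /* Reject peeling an association off into a socket that lives in a
         * different network namespace than the original socket. */
        if (!net_eq(current->nsproxy->net_ns, sock_net(sk)))
                return -EINVAL;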
Reported-by: ChunYu Wang <chunwang@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit df80cd9b28b9ebaa284a41df611dbf3a2d05ca74)
Andrey Konovalov [Thu, 2 Nov 2017 14:38:21 +0000 (10:38 -0400)]
media: dib0700: fix invalid dvb_detach argument
dvb_detach(arg) calls symbol_put_addr(arg), where arg should be a pointer
to a function. Right now a pointer to state->dib7000p_ops is passed to
dvb_detach(), which causes a BUG() in symbol_put_addr() as discovered by
syzkaller. Pass state->dib7000p_ops.set_wbd_ref instead.
Linus Torvalds [Sun, 20 Aug 2017 20:26:27 +0000 (13:26 -0700)]
Sanitize 'move_pages()' permission checks
The 'move_pages()' system call was introduced long long ago with the
same permission checks as for sending a signal (except using
CAP_SYS_NICE instead of CAP_SYS_KILL for the overriding capability).
That turns out to not be a great choice - while the system call really
only moves physical page allocations around (and you need other
capabilities to do a lot of it), you can check the return value to map
out some of the virtual address choices and defeat ASLR of a binary that
still shares your uid.
So change the access checks to the more common 'ptrace_may_access()'
model instead.
This tightens the access checks for the uid, and also effectively
changes the CAP_SYS_NICE check to CAP_SYS_PTRACE, but it's unlikely that
anybody really _uses_ this legacy system call any more (we have better
NUMA placement models these days), so I expect nobody to notice.
Famous last words.
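A sketch of the replacement check in the move_pages() syscall (error unwinding
abbreviated):
        /*
         * Check if this process has the right to modify the specified
         * process: same model as ptrace attach, instead of the old
         * uid/euid comparison plus CAP_SYS_NICE override.
         */
        if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) {
                rcu_read_unlock();
                err = -EPERM;
                goto out;
        }
        rcu_read_unlock();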
Reported-by: Otto Ebeling <otto.ebeling@iki.fi> Acked-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Willy Tarreau <w@1wt.eu> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 197e7e521384a23b9e585178f3f11c9fa08274b9)
David Howells [Wed, 11 Oct 2017 22:32:27 +0000 (23:32 +0100)]
assoc_array: Fix a buggy node-splitting case
This fixes CVE-2017-12193.
Fix a case in the assoc_array implementation in which a new leaf is
added that needs to go into a node that happens to be full, where the
existing leaves in that node cluster together at that level to the
exclusion of the new leaf.
What needs to happen is that the existing leaves get moved out to a new
node, N1, at level + 1 and the existing node needs replacing with one,
N0, that has pointers to the new leaf and to N1.
The code that tries to do this gets this wrong in two ways:
(1) The pointer that should've pointed from N0 to N1 is set to point
recursively to N0 instead.
(2) The backpointer from N0 needs to be set correctly in the case N0 is
either the root node or reached through a shortcut.
Fix this by removing this path and using the split_node path instead,
which achieves the same end, but in a more general way (thanks to Eric
Biggers for spotting the redundancy).
The problem manifests itself as:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
IP: assoc_array_apply_edit+0x59/0xe5
Fixes: 3cb989501c26 ("Add a generic associative array implementation.") Reported-and-tested-by: WU Fan <u3536072@connect.hku.hk> Signed-off-by: David Howells <dhowells@redhat.com> Cc: stable@vger.kernel.org [v3.13-rc1+] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit ea6789980fdaa610d7eb63602c746bf6ec70cd2b)
Mohamed Ghannam [Sun, 10 Dec 2017 03:50:58 +0000 (03:50 +0000)]
net: ipv4: fix for a race condition in raw_sendmsg
inet->hdrincl is racy, and could lead to uninitialized stack pointer
usage, so its value should be read only once.
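A sketch of the fix in raw_sendmsg(): read inet->hdrincl once into a local and
use only the local afterwards; READ_ONCE() cannot be applied to the bit field
directly, hence the indirection, as in the upstream commit:
        int hdrincl;

        /* hdrincl should be READ_ONCE(inet->hdrincl), but READ_ONCE()
         * doesn't work with bit fields; doing it indirectly yields the
         * same result, and every later test uses the local copy. */
        hdrincl = inet->hdrincl;
        hdrincl = READ_ONCE(hdrincl);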
Fixes: c008ba5bdc9f ("ipv4: Avoid reading user iov twice after raw_probe_proto_opt") Signed-off-by: Mohamed Ghannam <simo.ghannam@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 8f659a03a0ba9289b9aeb9b4470e6fb263d6f483)
Pavel Tatashin [Fri, 12 Jan 2018 00:17:49 +0000 (19:17 -0500)]
x86/pti/efi: broken conversion from efi to kernel page table
In entry_64.S we have code like this:
/* Unconditionally use kernel CR3 for do_nmi() */
/* %rax is saved above, so OK to clobber here */
ALTERNATIVE "jmp 2f", "movq %cr3, %rax", X86_FEATURE_KAISER
/* If PCID enabled, NOFLUSH now and NOFLUSH on return */
ALTERNATIVE "", "bts $63, %rax", X86_FEATURE_PCID
pushq %rax
/* mask off "user" bit of pgd address and 12 PCID bits: */
andq $(~(X86_CR3_PCID_ASID_MASK | KAISER_SHADOW_PGD_OFFSET)), %rax
movq %rax, %cr3
2:
/* paranoidentry do_nmi, 0; without TRACE_IRQS_OFF */
call do_nmi
With this instruction:
andq $(~(X86_CR3_PCID_ASID_MASK | KAISER_SHADOW_PGD_OFFSET)), %rax
We unconditionally switch from whatever our CR3 was to kernel page table.
But, in arch/x86/platform/efi/efi_64.c we temporarily set a different page
table that does not have the kernel page table at a 0x1000 offset from it.
Look in efi_thunk() and efi_thunk_set_virtual_address_map().
So, if we get an NMI interrupt while CR3 points to the other page table,
we clear 0x1000 from CR3, resulting in a bogus CR3 if the 0x1000 bit was
set.
The efi page table comes from realmode/rm/trampoline_64.S:
Notice: the alignment is PAGE_SIZE, so after applying KAISER_SHADOW_PGD_OFFSET,
which equals PAGE_SIZE, we can get a different page table.
But, even if we fix alignment, here the trampoline binary is later copied
into dynamically allocated memory in reserve_real_mode(), so we need to
fix that place as well.
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Steven Sistare <steven.sistare@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Boris Ostrovsky [Wed, 10 Jan 2018 00:08:45 +0000 (19:08 -0500)]
x86/IBRS: Make sure we restore MSR_IA32_SPEC_CTRL to a valid value
It is possible to (re-)enable IBRS between invocations of
ENABLE_IBRS_SAVE_AND_CLOBBER and RESTORE_IBRS_CLOBBER. If that happens,
the latter will be trying to write MSR_IA32_SPEC_CTRL with an
uninitialized value, possibly triggering a #GPF.
To avoid this, let's make sure that we always save a valid value into
the save register. If IBRS is disabled, that safe value will be
SPEC_CTRL_FEATURE_ENABLE_IBRS.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
v2: Instead of setting to zero we set it to SPEC_CTRL_FEATURE_ENABLE_IBRS
Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Boris Ostrovsky [Wed, 10 Jan 2018 00:08:44 +0000 (19:08 -0500)]
x86/IBRS/IBPB: Set sysctl_ibrs/ibpb_enabled properly
init_scattered_cpuid_features() is called twice, the first time from
early_cpu_init(), before the 'noibrs' or 'noibpb' options are parsed.
This results in the sysctl_* variables being set. When we call
init_scattered_cpuid_features() the second time, after the boot options
have been parsed, it will leave the sysctl_*
parameters unchanged (i.e. already set).
To avoid this, always set those variables based on ibrs/ibpb_inuse.
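A hedged sketch of what "always set" means here; the variable names follow this
log and are otherwise assumptions:
        /* Hedged sketch: recompute the sysctl view from the *_inuse state on
         * every call, rather than only on the first, pre-cmdline pass. */
        sysctl_ibrs_enabled = ibrs_inuse ? 1 : 0;
        sysctl_ibpb_enabled = ibpb_inuse ? 1 : 0;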
Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Jamie Iles [Tue, 9 Jan 2018 12:16:43 +0000 (12:16 +0000)]
x86/entry_64: TRACE_IRQS_OFF before re-enabling.
Our TRACE_IRQS_OFF call introduced in d572bdfdeb7a (x86/entry: Stuff RSB
for entry to kernel for non-SMEP platform) is after we have already
called ENABLE_INTERRUPTS, resulting in:
Jamie Iles [Tue, 9 Jan 2018 12:13:23 +0000 (12:13 +0000)]
ptrace: remove unlocked RCU dereference.
Commit 02bc4c7f77877 (x86/mm: Only set IBPB when the new thread cannot
ptrace current thread) reworked ___ptrace_may_access to take an
arbitrary task, but getting the task credentials needs to be done inside
an RCU critical section.
Move the dereference into the rcu_read_lock() below, preventing a boot
splat like:
Konrad Rzeszutek Wilk [Tue, 9 Jan 2018 17:40:25 +0000 (12:40 -0500)]
x86/ia32: Adds code hygiene for 32bit SYSCALL instruction entry.
This is a followup to commit 111ba91464f2e29fc6417b50a1c1425e2080bc59
(*INCOMPLETE* x86/syscall: Clear unused extra registers on syscall entrance)
where we didn't completely finish adding the clearing of these
registers. This fixes it on the 32-bit system call entrances.
The movq R8(%rsp),%r8 is there to update r8, as
CLEAR_R8_TO_R15 clears that register, so we have to fetch it
from pt_regs->r8.
We also remove the SAVE_EXTRA_REGS from the ptrace code: since we
clear them (r8-r15), the extra SAVE_EXTRA_REGS would end
up putting zeros in pt_regs->[r8-r15].
Reviewed-by: Kris Van Hees <kris.van.hees@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Konrad Rzeszutek Wilk [Tue, 9 Jan 2018 04:09:53 +0000 (23:09 -0500)]
x86/ia32: don't save registers on audit call
This is a followup to (x86/ia32: save and clear registers on syscall.),
where we would save the registers at the start of the system call
and also clear them (r8-r15). But the ptrace syscall path would do
the same thing (save), which meant we would end up overwriting them
with zeros.
Reviewed-by: Kris Van Hees <kris.van.hees@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Konrad Rzeszutek Wilk [Tue, 9 Jan 2018 17:11:51 +0000 (12:11 -0500)]
x86/ia32: Move STUFF_RSB And ENABLE_IBRS
The:
x86/entry: Stuff RSB for entry to kernel for non-SMEP platform
x86/enter: Use IBRS on syscall and interrupts
backports put the macros after the ENABLE_INTERRUPTS, but in case
the ENABLE_INTERRUPTS macro unrolls, let us put them above it.
Orabug: 27344012
CVE:CVE-2017-5715 Reported-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Konrad Rzeszutek Wilk [Tue, 9 Jan 2018 19:17:30 +0000 (14:17 -0500)]
x86/spec: Always set IBRS to guest value on VMENTER and host on VMEXIT.
The paper says to "set IBRS even if it was already set".
The Intel drop does not have that (it checks to see if it was enabled, and
if so does not do the WRMSR).
Furthermore, it says that on VM Entry we should restore the guest value.
But the patches from Intel again do that _only_ if the guest
has the IBRS set to zero.
Xen does it that way (as the PDF).
Red Hat code follows the same way as Intel.
It is confusing. Upstream Arjan says:
IBRS will ensure that, when set after the ring transition, no earlier
branch prediction data is used for indirect branches while IBRS is set
What is a ring transition? Upon more clarification it is not a
ring transition, but a prediction mode change. And a
VMX non-root to VMX root transition is a prediction mode change, and
a 1 setting in the less privileged mode is not sufficient for VMX root mode.
In effect we do want to make a write to the MSR setting IBRS
(even if the value is already set to 1).
Jamie Iles [Mon, 8 Jan 2018 23:21:44 +0000 (23:21 +0000)]
x86/ia32: save and clear registers on syscall.
This is a followup to 111ba91464f2 (x86/syscall: Clear unused extra
registers on syscall entrance) and a1aa2e658e0af (Re-introduce clearing
of r12-15, rbp, rbx), making sure that we also save and clear registers
on the compat syscalls. Otherwise we see segfaults when running a
32-bit binary on a 64-bit kernel.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Pavel Tatashin [Mon, 8 Jan 2018 21:02:27 +0000 (16:02 -0500)]
pti: Rename X86_FEATURE_KAISER to X86_FEATURE_PTI
cat /proc/cpuinfo still shows the kaiser feature, and we want only pti
to be visible to users. Therefore, rename this macro to get the
correct user-visible output.
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Reported-by: Håkon Bugge <haakon.bugge@oracle.com> Signed-off-by: Jun Nakajima <jun.nakajima@intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Konrad Rzeszutek Wilk [Mon, 8 Jan 2018 01:22:26 +0000 (20:22 -0500)]
x86/kvm: Set IBRS on VMEXIT if guest disabled it.
If the guest does not write FEATURE_ENABLE_IBRS to
MSR_IA32_SPEC_CTRL, then KVM will not issue such a write afterwards
(Indirect Branch Prediction Injection).
Right before VMENTER we set the MSR to zero (if the guest
had it set to zero), or leave it at 1 (if the guest
had it set to 1).
But on the VMEXIT, if the guest decided to set it to _zero_
before the VMEXIT, then we will leave it at zero and _not_
do the wrmsr to set it to 1!
That is wrong.
And also, if the guest did set it to 1, then we write 1 to it again.
This fix turns the check around so that the MSR will always
be set to 1 - with the optimization that if the guest had
set it, we just keep it at 1.
Kris Van Hees [Sun, 7 Jan 2018 20:18:42 +0000 (12:18 -0800)]
Re-introduce clearing of r12-15, rbp, rbx
Re-introduce the clearing of the extra registers (r12-r15, rbp, rbx)
upon entry into a system call. This commit ensures that we do not
save the extra registers after they got cleared, because that causes
NULL values to get written in place of the saved values.
Konrad Rzeszutek Wilk [Sun, 7 Jan 2018 04:35:11 +0000 (23:35 -0500)]
x86/enter: Use IBRS on syscall and interrupts - fix ia32 path
The backports missed a tiny bit of changes.
The easier of them is ia32_syscall - there are two ways it returns
back to userspace: it goes to int_ret_from_sys_call and from there eventually
ends up either in syscall_return_via_sysret or opportunistic_sysret_failed.
syscall_return_via_sysret had it, but opportunistic_sysret_failed did not.
That is because we optimized a bit and stuck the DISABLE_IBRS
in restore_c_regs_and_iret, which was called from opportunistic_sysret_failed
and retint_swapgs.
But with KPTI, doing IBRS_DISABLE from within restore_c_regs_and_iret is
not good - as we are touching a kernel variable and restore_c_regs_and_iret is
running with user-mode cr3!
So "x86: Fix spectre/kpti integration" fixed it by adding the DISABLE_IBRS
to syscall_return_via_sysret.
(If you look at the original commit you would think that we should
also fix opportunistic_sysret_failed, but that is fixed in
"x86: Fix spectre/kpti integration")
The secondary issue is that we did not call DISABLE_IBRS from
sysexit_from_sys_call. This patch adds that in too.
Konrad Rzeszutek Wilk [Sun, 7 Jan 2018 04:25:30 +0000 (23:25 -0500)]
x86: Fix spectre/kpti integration
The issue is that the first operation of DISABLE_IBRS (and pretty much all
of the *_IBRS macros) is touching a kernel variable. restore_c_regs_and_iret
already runs with the user-space cr3, so we page fault.
The fix is simple - do not run any of the IBRS macros from within
restore_c_regs_and_iret. Which means that the three functions that
used to call it now have to call IBRS_DISABLE by themselves:
retint_swapgs, opportunistic_sysret_failed, and nmi.
Adding in the IBRS_DISABLE in opportunistic_sysret_failed also
fixes another bug - which is more clearly explained in
"x86/enter: Use IBRS on syscall and interrupts - fix ia32 path"
Orabug: 27333760
CVE: CVE-2017-5754 Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Jiri Kosina [Fri, 5 Jan 2018 19:21:38 +0000 (11:21 -0800)]
PTI: unbreak EFI old_memmap
old_memmap's efi_call_phys_prolog() calls set_pgd() with a swapper PGD that
has _PAGE_USER set, which makes PTI set NX on it, and therefore EFI can't
execute its code.
Fix that by forcefully clearing _PAGE_NX from the PGD (this can't be done
by the pgprot API).
_PAGE_NX will be automatically reintroduced in efi_call_phys_epilog(), as
_set_pgd() will again notice that this is _PAGE_USER, and set _PAGE_NX on
it.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Jamie Iles [Fri, 5 Jan 2018 18:13:10 +0000 (18:13 +0000)]
x86/ldt: fix crash in ldt freeing.
94b1f3e2c4b7 (kaiser: merged update) factored out __free_ldt_struct() to
use vfree/free_page, but in the page allocation case the LDT is actually
allocated with kmalloc, so it needs to be freed with kfree and not
free_page().
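A sketch of the corrected helper, pairing the allocator and the free routine;
field names are as in the mainline ldt code, and the UEK4 version may differ
slightly:
static void __free_ldt_struct(struct ldt_struct *ldt)
{
        /* Large LDTs come from vmalloc(); page-sized ones from kmalloc(),
         * so they must be freed with kfree(), not free_page(). */
        if (ldt->size * LDT_ENTRY_SIZE > PAGE_SIZE)
                vfree(ldt->entries);
        else
                kfree(ldt->entries);
        kfree(ldt);
}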
x86/entry: Define 'cpu_current_top_of_stack' for 64-bit code
32-bit code has PER_CPU_VAR(cpu_current_top_of_stack).
64-bit code uses the somewhat more obscure PER_CPU_VAR(cpu_tss + TSS_sp0).
Define the 'cpu_current_top_of_stack' macro on CONFIG_X86_64
as well so that the PER_CPU_VAR(cpu_current_top_of_stack)
expression can be used in both 32-bit and 64-bit code.
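The definition itself is a one-liner in the x86 thread_info header, sketched
here as in the upstream commit (its exact location in this tree may differ):
#ifdef CONFIG_X86_64
/* Same per-CPU location that 64-bit entry code already reads as
 * PER_CPU_VAR(cpu_tss + TSS_sp0), now nameable from both 32- and 64-bit code. */
# define cpu_current_top_of_stack (cpu_tss + TSS_sp0)
#endif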
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com> Cc: Alexei Starovoitov <ast@plumgrid.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Drewry <wad@chromium.org> Link: http://lkml.kernel.org/r/1429889495-27850-3-git-send-email-dvlasenk@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 3a23208e69679597e767cf3547b1a30dd845d9b5)
Orabug: 27333760
CVE: CVE-2017-5754 Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Conflicts:
arch/x86/ia32/ia32entry.S
Guenter Roeck [Thu, 4 Jan 2018 21:41:55 +0000 (13:41 -0800)]
kaiser: Set _PAGE_NX only if supported
This resolves a crash if loaded under qemu + haxm under windows.
See https://www.spinics.net/lists/kernel/msg2689835.html for details.
Here is a boot log (the log is from chromeos-4.4, but Tao Wu says that
the same log is also seen with vanilla v4.4.110-rc1).
The crash part of this problem may be solved with the following patch
(thanks to Hugh for the hint). There is still another problem, though -
with this patch applied, the qemu session aborts with "VCPU Shutdown
request", whatever that means.
Let kaiser_flush_tlb_on_return_to_user() do the X86_FEATURE_PCID
check, instead of each caller doing it inline first: nobody needs
to optimize for the noPCID case, it's clearer this way, and better
suits later changes. Replace those no-op X86_CR3_PCID_KERN_FLUSH lines
by a BUILD_BUG_ON() in load_new_mm_cr3(), in case something changes.
Hugh Dickins [Sun, 5 Nov 2017 01:23:24 +0000 (18:23 -0700)]
kaiser: asm/tlbflush.h handle noPGE at lower level
I found asm/tlbflush.h too twisty, and think it safer not to avoid
__native_flush_tlb_global_irq_disabled() in the kaiser_enabled case,
but instead let it handle kaiser_enabled along with cr3: it can just
use __native_flush_tlb() for that, no harm in re-disabling preemption.
(This is not the same change as Kirill and Dave have suggested for
upstream, flipping PGE in cr4: that's neat, but needs a cpu_has_pge
check; cr3 is enough for kaiser, and thought to be cheaper than cr4.)
Also delete the X86_FEATURE_INVPCID invpcid_flush_all_nonglobals()
preference from __native_flush_tlb(): unlike the invpcid_flush_all()
preference in __native_flush_tlb_global(), it's not seen in upstream
4.14, and was recently reported to be surprisingly slow.
Hugh Dickins [Sun, 29 Oct 2017 18:36:19 +0000 (11:36 -0700)]
kaiser: drop is_atomic arg to kaiser_pagetable_walk()
I have not observed a might_sleep() warning from setup_fixmap_gdt()'s
use of kaiser_add_mapping() in our tree (why not?), but like upstream
we have not provided a way for that to pass is_atomic true down to
kaiser_pagetable_walk(), and at startup it's far from a likely source
of trouble: so just delete the walk's is_atomic arg and might_sleep().
Hugh Dickins [Wed, 4 Oct 2017 03:49:04 +0000 (20:49 -0700)]
kaiser: use ALTERNATIVE instead of x86_cr3_pcid_noflush
Now that we're playing the ALTERNATIVE game, use that more efficient
method: instead of user-mapping an extra page, and reading an extra
cacheline each time for x86_cr3_pcid_noflush.
Neel has found that __stringify(bts $X86_CR3_PCID_NOFLUSH_BIT, %rax)
is a working substitute for the "bts $63, %rax" in these ALTERNATIVEs;
but the one line with $63 in looks clearer, so let's stick with that.
Worried about what happens with an ALTERNATIVE between the jump and
jump label in another ALTERNATIVE? I was, but have checked the
combinations in SWITCH_KERNEL_CR3_NO_STACK at entry_SYSCALL_64,
and it does a good job.
Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 2dff99eb0335f9e0817410696a180dba25ca7371)
Orabug: 27333760
CVE: CVE-2017-5754 Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Conflicts:
arch/x86/entry/entry_64.S (not in this tree)
arch/x86/kernel/entry_64.S (patched instead of that)
kaiser: add "nokaiser" boot option, using ALTERNATIVE
Added "nokaiser" boot option: an early param like "noinvpcid".
Most places now check int kaiser_enabled (#defined 0 when not
CONFIG_KAISER) instead of #ifdef CONFIG_KAISER; but entry_64.S
and entry_64_compat.S are using the ALTERNATIVE technique, which
patches in the preferred instructions at runtime. That technique
is tied to x86 cpu features, so X86_FEATURE_KAISER is fabricated.
Prior to "nokaiser", Kaiser #defined _PAGE_GLOBAL 0: revert that,
but be careful with both _PAGE_GLOBAL and CR4.PGE: setting them when
nokaiser like when !CONFIG_KAISER, but not setting either when kaiser -
neither matters on its own, but it's hard to be sure that _PAGE_GLOBAL
won't get set in some obscure corner, or something add PGE into CR4.
By omitting _PAGE_GLOBAL from __supported_pte_mask when kaiser_enabled,
all page table setup which uses pte_pfn() masks it out of the ptes.
It's slightly shameful that the same declaration versus definition of
kaiser_enabled appears in not one, not two, but in three header files
(asm/kaiser.h, asm/pgtable.h, asm/tlbflush.h). I felt safer that way,
than with #including any of those in any of the others; and did not
feel it worth an asm/kaiser_enabled.h - kernel/cpu/common.c includes
them all, so we shall hear about it if they get out of synch.
Cleanups while in the area: removed the silly #ifdef CONFIG_KAISER
from kaiser.c; removed the unused native_get_normal_pgd(); removed
the spurious reg clutter from SWITCH_*_CR3 macro stubs; corrected some
comments. But more interestingly, set CR4.PSE in secondary_startup_64:
the manual is clear that it does not matter whether it's 0 or 1 when
4-level-pts are enabled, but I was distracted to find cr4 different on
BSP and auxiliaries - BSP alone was adding PSE, in probe_page_size_mask().
Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit e345dcc9481543edf4a0a5df4c4c2f9597b0a997)
Orabug: 27333760
CVE: CVE-2017-5754 Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Conflicts:
arch/x86/entry/entry_64.S (not in this tree)
arch/x86/kernel/entry_64.S (patched instead of that)