Zhu Yanjun [Fri, 1 Dec 2017 02:13:20 +0000 (21:13 -0500)]
net/mlx4_core: allow QPs with enable_smi_admin enabled
Commit 64453f042519 ("net/mlx4_core: Disallow creation of RAW QPs
on a VF") disallows some QPs. But when enable_smi_admin is enabled,
such QPs should be allowed to pass.
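A minimal, heavily hedged sketch of the intended exception; the call site and
error path are assumptions, while mlx4_vf_smi_enabled() is the existing helper
that reflects the enable_smi_admin setting:
/* Hedged sketch, not the actual UEK diff: keep the restriction on a VF
 * unless the admin has enabled SMI for that VF/port via enable_smi_admin. */
if (mlx4_is_slave(dev) && !mlx4_vf_smi_enabled(dev, slave, port))
        return -EPERM;  /* still disallowed when SMI is not admin-enabled */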
Håkon Bugge [Fri, 8 Dec 2017 15:55:22 +0000 (16:55 +0100)]
net/rds: Fix incorrect error handling
Commit 5f58d7e81c2f ("net/rds: reduce memory footprint during
ib_post_recv in IB transport") removes the order-two allocations used to
receive fragments. Instead, zero-order allocations are used. However,
said commit has incorrect error handling.
Konrad Rzeszutek Wilk [Sat, 13 Jan 2018 03:32:23 +0000 (22:32 -0500)]
x86: Move STUFF_RSB into the idt macro
instead of it sitting in paranoid_entry or error_entry.
The idea behind STUFF_RSB is that it needs to be done _before_
any calls are made, which means we really want this in the idt
macro that is used for exceptions - such as device not available,
which currently looks like so:
[Ignore the callq *0x40.. that gets converted to a 'cld']
<device_not_available>:
nop
nop
nop
callq *0x40d0b7(%rip) # ffffffff81b55330 <pv_irq_ops+0x30> <= patched to cld
pushq $0xffffffffffffffff
sub $0x78,%rsp
callq ffffffff81748ea0 <error_entry> <=== call!
mov %rsp,%rdi
xor %esi,%esi
callq ffffffff81018830 <do_device_not_available>
test %rax,%rax
jne ffffffff81747f10 <dtrace_error_exit>
jmpq ffffffff817490a0 <error_exit>
nopl 0x0(%rax)
By stuffing the RSB before the call to error_entry (or
paranoid_entry) we remove the chance of this becoming an attack vector.
While at it, remove the useless comment - we don't encode any frames
in UEK4.
OraBug: 27417150 Reviewed-by: Kris Van Hees <kris.van.hees@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Konrad Rzeszutek Wilk [Sat, 13 Jan 2018 02:05:45 +0000 (21:05 -0500)]
x86/spec: STUFF_RSB _before_ ENABLE_IBRS
We also need to STUFF_RSB _before_ any calls.
In our case we have a bunch of ENABLE_INTERRUPTS macros,
which are (in objdump):
callq *0x40b379(%rip) <pv_cpu_ops+0x128>
During bootup they are changed to 'cld' (on bare metal).
On Xen PV they end up being those calls, and since STUFF_RSB is still
in effect, it should be done before those calls are made.
Also, the semantics of the IBRS MSR is "If IBRS is set, .. indirect
calls will not allow their predicted target address to be controlled ...
so long as all RSB entries from the previous less privileged prediction
mode are overwritten."
In other words - STUFF_RSB, then ENABLE_IBRS.
Xen hypervisor code follows that religiously and so shall we.
OraBug: 27448169 Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Ankur Arora <ankur.a.arora@oracle.com> Reviewed-by: Kris Van Hees <kris.van.hees@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Boris Ostrovsky [Tue, 23 Jan 2018 16:02:42 +0000 (11:02 -0500)]
x86/IBRS: Don't try to change IBRS mode if IBRS is not available
sysctl_ibrs_enabled is set properly when IBRS support is discovered
in init_scattered_cpuid_features(). We should simply return an error
when an attempt is made to change the IBRS mode while the feature is not supported.
There is also no need to call set_ibrs_inuse() when changing mode to 1:
this is already done by clear_ibrs_disabled().
Boris Ostrovsky [Tue, 23 Jan 2018 16:02:41 +0000 (11:02 -0500)]
x86/IBRS: Remove support for IBRS_ENABLED_USER mode
This mode was added based on our understanding of IBRS_ATT (IBRS
All The Time) described in early versions of Intel documentation.
We assumed that while "basic" IBRS protects the kernel from using
predictions created by userland, IBRS_ATT would provide a similar
defence between usermode tasks.
This understanding was incorrect.
Instead, IBRS_ATT (also referred to as "Enhanced IBRS") allows the
kernel to write IBRS MSR once, during boot, and never have to write
it again. This is in contrast to basic IBRS where every change of
protection mode required an MSR write, which is somewhat expensive.
Enhanced IBRS is not available on existing processors. Until it
becomes available we remove IBRS_ENABLED_USER.
While doing this, also add a test in ibrs_enabled_write() that will
only process input if the mode will actually change.
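A hedged sketch of the write handler combining the checks described in this and
the preceding entry; sysctl_ibrs_enabled and clear_ibrs_disabled() follow this
log's naming, while ibrs_supported, new_mode and set_ibrs_disabled() are
assumptions for illustration only:
/* Hedged sketch of ibrs_enabled_write() behaviour, not the actual UEK code. */
if (!ibrs_supported)
        return -ENODEV;                 /* IBRS was not discovered by init_scattered_cpuid_features() */

if (new_mode == sysctl_ibrs_enabled)
        return count;                   /* mode unchanged - nothing to do */

if (new_mode == IBRS_ENABLED)
        clear_ibrs_disabled();          /* already marks IBRS in use; no extra set_ibrs_inuse() needed */
else
        set_ibrs_disabled();
sysctl_ibrs_enabled = new_mode;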
Qing Huang [Tue, 16 Jan 2018 19:27:32 +0000 (11:27 -0800)]
mlx4: add mstflint secure boot access kernel support
Source files are copied from:
https://github.com/Mellanox/mstflint.git master_devel
(under kernel sub-directory - changes up to commit a8a84fae7459aba01ff6ad6cbc40622b7615b110)
Due to recent security changes in UEK4, the mstflint FW tool may
lose the ability to access CX3 HCAs in a Secure Boot enabled
environment. This kernel patch, in addition to an enhanced
version of the mstflint tool, will let us regain the ability to
manage CX3 FW when Secure Boot is enabled while running the latest
UEK4 kernels.
Jia Zhang [Mon, 1 Jan 2018 02:04:47 +0000 (10:04 +0800)]
x86/microcode/intel: Extend BDW late-loading with a revision check
Instead of blacklisting all model 79 CPUs when attempting a late
microcode loading, limit that only to CPUs with microcode revisions <
0x0b000021 because only on those late loading may cause a system hang.
For such processors either:
a) a BIOS update which might contain a newer microcode revision
or
b) the early microcode loading method
should be considered.
Processors with revisions 0x0b000021 or higher will not experience such
hangs.
For more details, see erratum BDF90 in document #334165 (Intel Xeon
Processor E7-8800/4800 v4 Product Family Specification Update) from
September 2017.
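For reference, a simplified sketch of the upstream check in
arch/x86/kernel/cpu/microcode/intel.c; the real is_blacklisted() also looks at
stepping and LLC size, so treat this as an outline, not the exact UEK4 hunk:
static bool is_blacklisted(unsigned int cpu)
{
        struct cpuinfo_x86 *c = &cpu_data(cpu);

        /* BDF90: Broadwell model 79 with microcode older than 0x0b000021
         * may hang on late loading, so refuse only in that case. */
        if (c->x86 == 6 && c->x86_model == 79 && c->microcode < 0x0b000021) {
                pr_err_once("late loading disabled for this CPU, microcode revision 0x%x < 0x0b000021\n",
                            c->microcode);
                return true;
        }

        return false;
}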
[ bp: Heavily massage commit message and pr_* statements. ]
Ian Kent [Mon, 19 Sep 2016 21:44:12 +0000 (14:44 -0700)]
autofs: use dentry flags to block walks during expire
Somewhere along the way the autofs expire operation has changed to hold
a spin lock over expired dentry selection. The autofs indirect mount
expired dentry selection is complicated and quite lengthy so it isn't
appropriate to hold a spin lock over the operation.
Commit 47be61845c77 ("fs/dcache.c: avoid soft-lockup in dput()") added a
might_sleep() to dput() causing a WARN_ONCE() about this usage to be
issued.
But the spin lock doesn't need to be held over this check; the autofs
dentry info flags are enough to block walks into dentries during the
expire.
I've left the direct mount expire as it is (for now) because it is much
simpler and quicker than the indirect mount expire and adding spin lock
release and re-acquires would do nothing more than add overhead.
Fixes: 47be61845c77 ("fs/dcache.c: avoid soft-lockup in dput()") Link: http://lkml.kernel.org/r/20160912014017.1773.73060.stgit@pluto.themaw.net Signed-off-by: Ian Kent <raven@themaw.net> Reported-by: Takashi Iwai <tiwai@suse.de> Tested-by: Takashi Iwai <tiwai@suse.de> Cc: Takashi Iwai <tiwai@suse.de> Cc: NeilBrown <neilb@suse.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Orabug: 26032471
(cherry picked from commit 7cbdb4a286a60c5d519cb9223fe2134d26870d39) Signed-off-by: Mingming Cao <mingming.cao@oracle.com> Reviewed-by: Shirley Ma <shirley.ma@oracle.com>
Al Viro [Sun, 12 Jun 2016 15:24:46 +0000 (11:24 -0400)]
autofs races
* make autofs4_expire_indirect() skip the dentries that are in the process of
expiry
* do *not* mess with list_move(); making sure that dentries with
AUTOFS_INF_EXPIRING are not picked for expiry is enough.
* do not remove NO_RCU when we set EXPIRING, don't bother with smp_mb()
there. Clear it at the same time we clear EXPIRING. Makes a bunch of
tests simpler.
* rename NO_RCU to WANT_EXPIRE, which is what it really is.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Orabug: 26032471
(cherry picked from commit ea01a18494b3d7a91b2f1f2a6a5aaef4741bc294) Signed-off-by: Mingming Cao <mingming.cao@oracle.com> Reviewed-by: Shirley Ma <shirley.ma@oracle.com>
* tag 'v4.1.12-122.bug26670475#v3':
xen-blkback: add pending_req allocation stats
xen-blkback: move indirect req allocation out-of-line
xen-blkback: pull nseg validation out in a function
xen-blkback: make struct pending_req less monolithic
Kanth Ghatraju [Thu, 11 Jan 2018 22:27:49 +0000 (17:27 -0500)]
x86: Display correct settings for the SPECTRE_V2 bug
Update the display message for Spectre v2. Move the setup of the
bug bits to the identify_cpu routine so that they remain persistent, as this
routine reinitializes the data structure.
Thomas Gleixner [Thu, 11 Jan 2018 22:04:43 +0000 (17:04 -0500)]
sysfs/cpu: Add vulnerability folder
As the meltdown/spectre problem affects several CPU architectures, it makes
sense to have a common way to express whether a system is affected by a
particular vulnerability or not. If affected, the way to express the
mitigation should be common as well.
Create /sys/devices/system/cpu/vulnerabilities folder and files for
meltdown, spectre_v1 and spectre_v2.
Allow architectures to override the show function.
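A condensed sketch of the plumbing in drivers/base/cpu.c: weak default show
functions that architectures can override, grouped under the new
"vulnerabilities" directory. Only the meltdown attribute is spelled out here;
spectre_v1 and spectre_v2 follow the same pattern:
ssize_t __weak cpu_show_meltdown(struct device *dev,
                                 struct device_attribute *attr, char *buf)
{
        return sprintf(buf, "Not affected\n");
}

static DEVICE_ATTR(meltdown, 0444, cpu_show_meltdown, NULL);

static struct attribute *cpu_root_vulnerabilities_attrs[] = {
        &dev_attr_meltdown.attr,
        /* spectre_v1 and spectre_v2 attributes are registered the same way */
        NULL
};

static const struct attribute_group cpu_root_vulnerabilities_group = {
        .name  = "vulnerabilities",
        .attrs = cpu_root_vulnerabilities_attrs,
};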
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linuxfoundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: David Woodhouse <dwmw@amazon.co.uk> Link: https://lkml.kernel.org/r/20180107214913.096657732@linutronix.de
(cherry picked from commit 87590ce6e373d1a5401f6539f0c59ef92dd924a9)
Orabug: 27353383 Signed-off-by: Kanth Ghatraju <kanth.ghatraju@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Conflicts:
Documentation/ABI/testing/sysfs-devices-system-cpu
include/linux/cpu.h
Conflicts resolved by picking only the changes from original
patch. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
David Woodhouse [Thu, 11 Jan 2018 21:59:40 +0000 (16:59 -0500)]
x86/cpufeatures: Add X86_BUG_SPECTRE_V[12]
Add the bug bits for spectre v1/2 and force them unconditionally for all
cpus.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: gnomes@lxorguk.ukuu.org.uk Cc: Rik van Riel <riel@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Jiri Kosina <jikos@kernel.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Kees Cook <keescook@google.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Greg Kroah-Hartman <gregkh@linux-foundation.org> Cc: Paul Turner <pjt@google.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/1515239374-23361-2-git-send-email-dwmw@amazon.co.uk
(cherry picked from commit 99c6fa2511d8a683e61468be91b83f85452115fa)
Orabug: 27353383 Signed-off-by: Kanth Ghatraju <kanth.ghatraju@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Conflicts:
arch/x86/include/asm/cpufeatures.h
arch/x86/kernel/cpu/common.c
Resolved the conflict to pick only the change required in common.c. The changes
in cpufeatures.h have been implemented in cpufeature.h. We do not set the
bit for SPECTRE_V1 bug. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Kanth Ghatraju [Thu, 11 Jan 2018 21:52:30 +0000 (16:52 -0500)]
x86/cpufeatures: Add X86_BUG_CPU_MELTDOWN
Add the BUG bit to indicate that the CPU is affected by the leak due to
lack of isolation of kernel and user space page tables. Currently AMD
CPUs are not affected by this.
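A sketch of the bug-bit setup in the common CPU identification code, together
with the Spectre bits from the previous entry; note that the conflict note in
the previous entry says the SPECTRE_V1 bit is not set in the UEK4 backport:
/* Sketch of the early bug-bit setup: AMD is currently not affected by
 * Meltdown, so that bit is skipped there; the Spectre bits are forced
 * unconditionally for all CPUs. */
if (c->x86_vendor != X86_VENDOR_AMD)
        setup_force_cpu_bug(X86_BUG_CPU_MELTDOWN);

setup_force_cpu_bug(X86_BUG_SPECTRE_V1);   /* not set in the UEK4 backport */
setup_force_cpu_bug(X86_BUG_SPECTRE_V2);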
Andrew Honig [Wed, 10 Jan 2018 18:12:03 +0000 (10:12 -0800)]
KVM: x86: Add memory barrier on vmcs field lookup
This adds a memory barrier when performing a lookup into
the vmcs_field_to_offset_table. This is related to
CVE-2017-5753.
This particular scenario would involve an L1 hypervisor using
vmread/vmwrite to try to execute a variant 1 side channel leak on the host.
In general, variant 1 relies on a bounds check that gets bypassed
speculatively. However, it requires a fairly specific code pattern to
actually be useful for an exploit, which is why most bounds checks do
not require a speculation barrier. It requires two memory references
close to each other: one that is out of bounds and attacker
controlled, and one where the memory address is based on the memory
read in the first access. The first memory reference is a read of the
memory that the attacker wants to leak, and the second reference
creates a side channel in the cache where the line accessed represents
the data to be leaked.
This code has that pattern because a potentially very large value for
field could be used in the vmcs_field_to_offset_table lookup, whose result
is put into f. Then, very shortly thereafter and potentially still in
the speculation window, f is dereferenced in the vmcs_field_to_offset
function.
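A sketch of the hardened lookup; upstream inserts a bare lfence, while the
backport note below says the existing osb() speculation-barrier macro is used
instead:
static inline short vmcs_field_to_offset(unsigned long field)
{
        if (field >= ARRAY_SIZE(vmcs_field_to_offset_table) ||
            vmcs_field_to_offset_table[field] == 0)
                return -ENOENT;

        /* Speculation barrier: keep a mispredicted bounds check from being
         * used to feed an attacker-controlled index into the table load
         * (upstream: asm("lfence"); this backport: osb()). */
        osb();

        return vmcs_field_to_offset_table[field];
}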
OraBug: 27380809 Signed-off-by: Andrew Honig <ahonig@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit 75f139aaf896d6fdeec2e468ddfa4b2fe469bf40)
[The upstream commit used asm('lfence') but we already have the osb()
macro so changed that out] Reviewed-by: Boris.Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
KVM allows guests to directly access I/O port 0x80 on Intel hosts. If
the guest floods this port with writes it generates exceptions and
instability in the host kernel, leading to a crash. With this change
guest writes to port 0x80 on Intel will behave the same as they
currently behave on AMD systems.
Prevent the flooding by removing the code that sets port 0x80 as a
passthrough port. This is essentially the same as upstream patch 99f85a28a78e96d28907fe036e1671a218fee597, except that patch was
for AMD chipsets and this patch is for Intel.
Signed-off-by: Andrew Honig <ahonig@google.com> Signed-off-by: Jim Mattson <jmattson@google.com>
(cherry picked from commit d59d51f088014f25c2562de59b9abff4f42a7468)
Orabug: 27206805
CVE: CVE-2017-1000407 Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Acked-by: Liran Alon <liran.alon@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Joao Martins [Tue, 16 Jan 2018 19:03:12 +0000 (19:03 +0000)]
ixgbevf: handle mbox_api_13 in ixgbevf_change_mtu
Commit 180603fe7 added a new API but failed to update one place,
specifically in ixgbevf_change_mtu. The lack of it leads to
MTU set failures on 82599 VFs.
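A sketch of the missing case in ixgbevf_change_mtu()'s mailbox-API switch; the
exact enumerator spelling for the new API is an assumption:
        switch (adapter->hw.api_version) {
        case ixgbe_mbox_api_11:
        case ixgbe_mbox_api_12:
        case ixgbe_mbox_api_13:         /* the case the new API forgot to add */
                max_possible_frame = IXGBE_MAX_JUMBO_FRAME_SIZE;
                break;
        default:
                if (adapter->hw.mac.type != ixgbe_mac_82599_vf)
                        max_possible_frame = IXGBE_MAX_JUMBO_FRAME_SIZE;
                break;
        }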
Orabug: 27397028 Fixes: 180603fe7 ("ixgbevf: Add support for VF promiscuous mode") Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com> Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Tested-by: Chuan Liu <chuan.liu@oracle.com>
struct pending_req is allocated ahead-of-time (in connect_ring())
for each ring slot. This is potentially a large number of allocations
(number-of-queues * ring-slots) and given that the structure is sized
for the worst case (MAX_INDIRECT_SEGMENTS), each element is 16616 bytes
on 64-bit.
The allocation itself is via kmalloc so this becomes multiple order-3
allocations for each vbd.
This patch slims down the structure by limiting the pre-allocated
structures to BLKIF_MAX_SEGMENTS_PER_REQUEST. Requests larger than
this are allocated dynamically. On my machine (E5-2660 0 @ 2.20GHz),
without any memory pressure, this adds an average of about 1us to
the indirect allocation path.
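A heavily hedged sketch of the resulting allocation split; the helper names are
hypothetical, chosen only to illustrate the fast/slow paths described above:
/* Hedged sketch: keep the per-ring pre-allocation small and go to the
 * allocator only for indirect requests that exceed it. */
static struct pending_req *get_pending_req(struct xen_blkif_ring *ring,
                                           unsigned int nseg)
{
        if (nseg <= BLKIF_MAX_SEGMENTS_PER_REQUEST)
                return pop_prealloc_req(ring);          /* hypothetical fast-path helper */

        /* Indirect request: dynamic allocation, ~1us extra on average. */
        return alloc_indirect_req(nseg, GFP_KERNEL);    /* hypothetical slow-path helper */
}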
Tim Tianyang Chen [Tue, 9 Jan 2018 23:57:28 +0000 (15:57 -0800)]
x86/fpu: Don't let userspace set bogus xcomp_bv
On x86, userspace can use the ptrace() or rt_sigreturn() system calls to
set a task's extended state (xstate) or "FPU" registers. ptrace() can
set them for another task using the PTRACE_SETREGSET request with
NT_X86_XSTATE, while rt_sigreturn() can set them for the current task.
In either case, registers can be set to any value, but the kernel
assumes that the XSAVE area itself remains valid in the sense that the
CPU can restore it.
However, in the case where the kernel is using the uncompacted xstate
format (which it does whenever the XSAVES instruction is unavailable),
it was possible for userspace to set the xcomp_bv field in the
xstate_header to an arbitrary value. However, all bits in that field
are reserved in the uncompacted case, so when switching to a task with
nonzero xcomp_bv, the XRSTOR instruction failed with a #GP fault. This
caused the WARN_ON_FPU(err) in copy_kernel_to_xregs() to be hit. In
addition, since the error is otherwise ignored, the FPU registers from
the task previously executing on the CPU were leaked.
Fix the bug by checking that the user-supplied value of xcomp_bv is 0 in
the uncompacted case, and returning an error otherwise.
The reason for validating xcomp_bv rather than simply overwriting it
with 0 is that we want userspace to see an error if it (incorrectly)
provides an XSAVE area in compacted format rather than in uncompacted
format.
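A sketch of the check as it looks upstream in xstateregs_set(); per the
backport note at the end of this entry, the UEK4 field is xsave_hdr_struct
rather than header:
        ret = user_regset_copyin(&pos, &count, &kbuf, &ubuf, xsave, 0, -1);

        /* In the uncompacted (no-XSAVES) format every bit of xcomp_bv is
         * reserved, so reject a nonzero user-supplied value here instead of
         * letting XRSTOR #GP and leak the previous task's registers. */
        if (!ret && xsave->header.xcomp_bv)
                ret = -EINVAL;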
Note that as before, in case of error we clear the task's FPU state.
This is perhaps non-ideal, especially for PTRACE_SETREGSET; it might be
better to return an error before changing anything. But it seems the
"clear on error" behavior is fine for now, and it's a little tricky to
do otherwise because it would mean we couldn't simply copy the full
userspace state into kernel memory in one __copy_from_user().
This bug was found by syzkaller, which hit the above-mentioned
WARN_ON_FPU():
Here is a C reproducer. The expected behavior is that the program spins
forever with no output. However, on a buggy kernel running on a
processor with the "xsave" feature but without the "xsaves" feature
(e.g. Sandy Bridge through Broadwell for Intel), within a second or two
the program reports that the xmm registers were corrupted, i.e. were not
restored correctly. With CONFIG_X86_DEBUG_FPU=y it also hits the above
kernel warning.
Note: the program only tests for the bug using the ptrace() system call.
The bug can also be reproduced using the rt_sigreturn() system call, but
only when called from a 32-bit program, since for 64-bit programs the
kernel restores the FPU state from the signal frame by doing XRSTOR
directly from userspace memory (with proper error checking).
Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: <stable@vger.kernel.org> [v3.17+] Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Eric Biggers <ebiggers3@gmail.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Kevin Hao <haokexin@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michael Halcrow <mhalcrow@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Cc: Yu-cheng Yu <yu-cheng.yu@intel.com> Cc: kernel-hardening@lists.openwall.com Fixes: 0b29643 ("x86/xsaves: Change compacted format xsave area header") Link: http://lkml.kernel.org/r/20170922174156.16780-2-ebiggers3@gmail.com Link: http://lkml.kernel.org/r/20170923130016.21448-25-mingo@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 814fb7bb7db5433757d76f4c4502c96fc53b0b5e)
Signed-off-by: Tim Tianyang Chen <tianyang.chen@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Hand picked because this tree is missing some commits that refactored fpu
functions out. For UEK4, xstateregs_set() is in arch/x86/kernel/i387.c
and __fpu__restore_sig() is in arch/x86/kernel/xsave.c.
xsave->header is xsave->xsave_hdr_struct and
xregs_state is xsave.
Xin Long [Tue, 17 Oct 2017 15:26:10 +0000 (23:26 +0800)]
sctp: do not peel off an assoc from one netns to another one
Now when peeling off an association to a sock in another netns, the
transports in this assoc are not rehashed and keep using the old
key in the hashtable.
As a transport uses sk->net as the hash key to insert into the hashtable,
these transports would be missed when removing entries from the hashtable
due to the new netns when closing the sock and freeing all the transports;
later, a use-after-free issue could be caused when looking up an asoc
and dereferencing those transports.
This is a very old issue, present since the very beginning; ChunYu found it
with syzkaller fuzz testing with this series:
This patch is to block this call when peeling one assoc off from one
netns to another one, so that the netns of all transports would not
go out of sync with the key in the hashtable.
Note that this patch didn't fix it by rehashing transports, as it's
difficult to handle the situation when the tuple is already in use
in the new netns. Besides, no one would like to peel off one assoc
to another netns, considering ipaddrs, ifaces, etc. are usually
different.
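The block amounts to a single netns comparison in sctp_do_peeloff(); a sketch:
        /* Reject peeling an association off into a socket that lives in a
         * different network namespace than the original socket. */
        if (!net_eq(current->nsproxy->net_ns, sock_net(sk)))
                return -EINVAL;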
Reported-by: ChunYu Wang <chunwang@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit df80cd9b28b9ebaa284a41df611dbf3a2d05ca74)
Andrey Konovalov [Thu, 2 Nov 2017 14:38:21 +0000 (10:38 -0400)]
media: dib0700: fix invalid dvb_detach argument
dvb_detach(arg) calls symbol_put_addr(arg), where arg should be a pointer
to a function. Right now a pointer to state->dib7000p_ops is passed to
dvb_detach(), which causes a BUG() in symbol_put_addr() as discovered by
syzkaller. Pass state->dib7000p_ops.set_wbd_ref instead.
Linus Torvalds [Sun, 20 Aug 2017 20:26:27 +0000 (13:26 -0700)]
Sanitize 'move_pages()' permission checks
The 'move_pages()' system call was introduced long long ago with the
same permission checks as for sending a signal (except using
CAP_SYS_NICE instead of CAP_SYS_KILL for the overriding capability).
That turns out to not be a great choice - while the system call really
only moves physical page allocations around (and you need other
capabilities to do a lot of it), you can check the return value to map
out some of the virtual address choices and defeat ASLR of a binary that
still shares your uid.
So change the access checks to the more common 'ptrace_may_access()'
model instead.
This tightens the access checks for the uid, and also effectively
changes the CAP_SYS_NICE check to CAP_SYS_PTRACE, but it's unlikely that
anybody really _uses_ this legacy system call any more (we have better
NUMA placement models these days), so I expect nobody to notice.
Famous last words.
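A sketch of the replacement check in the move_pages() syscall (error unwinding
abbreviated):
        /*
         * Check if this process has the right to modify the specified
         * process: same model as ptrace attach, instead of the old
         * uid/euid comparison plus CAP_SYS_NICE override.
         */
        if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) {
                rcu_read_unlock();
                err = -EPERM;
                goto out;
        }
        rcu_read_unlock();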
Reported-by: Otto Ebeling <otto.ebeling@iki.fi> Acked-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Willy Tarreau <w@1wt.eu> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 197e7e521384a23b9e585178f3f11c9fa08274b9)
David Howells [Wed, 11 Oct 2017 22:32:27 +0000 (23:32 +0100)]
assoc_array: Fix a buggy node-splitting case
This fixes CVE-2017-12193.
Fix a case in the assoc_array implementation in which a new leaf is
added that needs to go into a node that happens to be full, where the
existing leaves in that node cluster together at that level to the
exclusion of the new leaf.
What needs to happen is that the existing leaves get moved out to a new
node, N1, at level + 1 and the existing node needs replacing with one,
N0, that has pointers to the new leaf and to N1.
The code that tries to do this gets this wrong in two ways:
(1) The pointer that should've pointed from N0 to N1 is set to point
recursively to N0 instead.
(2) The backpointer from N0 needs to be set correctly in the case N0 is
either the root node or reached through a shortcut.
Fix this by removing this path and using the split_node path instead,
which achieves the same end, but in a more general way (thanks to Eric
Biggers for spotting the redundancy).
The problem manifests itself as:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
IP: assoc_array_apply_edit+0x59/0xe5
Fixes: 3cb989501c26 ("Add a generic associative array implementation.") Reported-and-tested-by: WU Fan <u3536072@connect.hku.hk> Signed-off-by: David Howells <dhowells@redhat.com> Cc: stable@vger.kernel.org [v3.13-rc1+] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit ea6789980fdaa610d7eb63602c746bf6ec70cd2b)
Mohamed Ghannam [Sun, 10 Dec 2017 03:50:58 +0000 (03:50 +0000)]
net: ipv4: fix for a race condition in raw_sendmsg
inet->hdrincl is racy, and could lead to uninitialized stack pointer
usage, so its value should be read only once.
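A sketch of the fix in raw_sendmsg(): read inet->hdrincl once into a local and
use only the local afterwards; READ_ONCE() cannot be applied to the bit field
directly, hence the indirection, as in the upstream commit:
        int hdrincl;

        /* hdrincl should be READ_ONCE(inet->hdrincl), but READ_ONCE()
         * doesn't work with bit fields; doing it indirectly yields the
         * same result, and every later test uses the local copy. */
        hdrincl = inet->hdrincl;
        hdrincl = READ_ONCE(hdrincl);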
Fixes: c008ba5bdc9f ("ipv4: Avoid reading user iov twice after raw_probe_proto_opt") Signed-off-by: Mohamed Ghannam <simo.ghannam@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 8f659a03a0ba9289b9aeb9b4470e6fb263d6f483)
Pavel Tatashin [Fri, 12 Jan 2018 00:17:49 +0000 (19:17 -0500)]
x86/pti/efi: broken conversion from efi to kernel page table
In entry_64.S we have code like this:
/* Unconditionally use kernel CR3 for do_nmi() */
/* %rax is saved above, so OK to clobber here */
ALTERNATIVE "jmp 2f", "movq %cr3, %rax", X86_FEATURE_KAISER
/* If PCID enabled, NOFLUSH now and NOFLUSH on return */
ALTERNATIVE "", "bts $63, %rax", X86_FEATURE_PCID
pushq %rax
/* mask off "user" bit of pgd address and 12 PCID bits: */
andq $(~(X86_CR3_PCID_ASID_MASK | KAISER_SHADOW_PGD_OFFSET)), %rax
movq %rax, %cr3
2:
/* paranoidentry do_nmi, 0; without TRACE_IRQS_OFF */
call do_nmi
With this instruction:
andq $(~(X86_CR3_PCID_ASID_MASK | KAISER_SHADOW_PGD_OFFSET)), %rax
We unconditionally switch from whatever our CR3 was to kernel page table.
But, in arch/x86/platform/efi/efi_64.c we temporarily set a different page
table that does not have the kernel page table at a 0x1000 offset from it.
Look in efi_thunk() and efi_thunk_set_virtual_address_map().
So, if we get an NMI interrupt while CR3 points to the other page table,
we clear 0x1000 from CR3, resulting in a bogus CR3 if the 0x1000 bit was
set.
The efi page table comes from realmode/rm/trampoline_64.S:
Notice: the alignment is PAGE_SIZE, so after applying KAISER_SHADOW_PGD_OFFSET,
which equals PAGE_SIZE, we can get a different page table.
But, even if we fix alignment, here the trampoline binary is later copied
into dynamically allocated memory in reserve_real_mode(), so we need to
fix that place as well.
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Steven Sistare <steven.sistare@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Boris Ostrovsky [Wed, 10 Jan 2018 00:08:45 +0000 (19:08 -0500)]
x86/IBRS: Make sure we restore MSR_IA32_SPEC_CTRL to a valid value
It is possible to (re-)enable IBRS between invocations of
ENABLE_IBRS_SAVE_AND_CLOBBER and RESTORE_IBRS_CLOBBER. If that happens,
the latter will be trying to write MSR_IA32_SPEC_CTRL with an
uninitialized value, possibly triggering a #GPF.
To avoid this, let's make sure that we always save a valid value into
the save register. If IBRS is disabled, that safe value will be
SPEC_CTRL_FEATURE_ENABLE_IBRS.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
v2: Instead of setting to zero we set it to SPEC_CTRL_FEATURE_ENABLE_IBRS
Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Boris Ostrovsky [Wed, 10 Jan 2018 00:08:44 +0000 (19:08 -0500)]
x86/IBRS/IBPB: Set sysctl_ibrs/ibpb_enabled properly
init_scattered_cpuid_features() is called twice, the first time from
early_cpu_init(), before the 'noibrs' or 'noibpb' options are parsed.
This results in the sysctl_* variables being set. When we call
init_scattered_cpuid_features() the second time, after the boot options
have been parsed, it will leave the sysctl_*
parameters unchanged (i.e. already set).
To avoid this, always set those variables based on ibrs/ibpb_inuse.
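A hedged sketch of what "always set" means here; the variable names follow this
log and are otherwise assumptions:
        /* Hedged sketch: recompute the sysctl view from the *_inuse state on
         * every call, rather than only on the first, pre-cmdline pass. */
        sysctl_ibrs_enabled = ibrs_inuse ? 1 : 0;
        sysctl_ibpb_enabled = ibpb_inuse ? 1 : 0;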
Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Jamie Iles [Tue, 9 Jan 2018 12:16:43 +0000 (12:16 +0000)]
x86/entry_64: TRACE_IRQS_OFF before re-enabling.
Our TRACE_IRQS_OFF call introduced in d572bdfdeb7a (x86/entry: Stuff RSB
for entry to kernel for non-SMEP platform) is after we have already
called ENABLE_INTERRUPTS, resulting in:
Jamie Iles [Tue, 9 Jan 2018 12:13:23 +0000 (12:13 +0000)]
ptrace: remove unlocked RCU dereference.
Commit 02bc4c7f77877 (x86/mm: Only set IBPB when the new thread cannot
ptrace current thread) reworked ___ptrace_may_access to take an
arbitrary task, but getting the task credentials needs to be done inside
an RCU critical section.
Move the dereference into the rcu_read_lock() below, preventing a boot
splat like:
Konrad Rzeszutek Wilk [Tue, 9 Jan 2018 17:40:25 +0000 (12:40 -0500)]
x86/ia32: Adds code hygiene for 32bit SYSCALL instruction entry.
This is a followup to commit 111ba91464f2e29fc6417b50a1c1425e2080bc59
(*INCOMPLETE* x86/syscall: Clear unused extra registers on syscall entrance)
where we didn't completely finish adding the clearing of these
registers. This fixes it on the 32-bit system call entrances.
The movq R8(%rsp),%r8 is there to update r8, as
CLEAR_R8_TO_R15 clears that register, so we have to fetch it
from pt_regs->r8.
We also remove the SAVE_EXTRA_REGS from the ptrace code: since we
clear them (r8-r15), the extra SAVE_EXTRA_REGS would end
up putting zeros in pt_regs->[r8-r15].
Reviewed-by: Kris Van Hees <kris.van.hees@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Konrad Rzeszutek Wilk [Tue, 9 Jan 2018 04:09:53 +0000 (23:09 -0500)]
x86/ia32: don't save registers on audit call
This is a followup to (x86/ia32: save and clear registers on syscall.),
where we would save the registers at the start of the system call
and also clear them (r8-r15). But the ptrace syscall path would do
the same thing (save), which meant we would end up overwriting them
with zeros.
Reviewed-by: Kris Van Hees <kris.van.hees@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Konrad Rzeszutek Wilk [Tue, 9 Jan 2018 17:11:51 +0000 (12:11 -0500)]
x86/ia32: Move STUFF_RSB And ENABLE_IBRS
The:
x86/entry: Stuff RSB for entry to kernel for non-SMEP platform
x86/enter: Use IBRS on syscall and interrupts
backports put the macros after the ENABLE_INTERRUPTS, but in case
the ENABLE_INTERRUPTS macro unrolls, let us put them above it.
Orabug: 27344012
CVE:CVE-2017-5715 Reported-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Konrad Rzeszutek Wilk [Tue, 9 Jan 2018 19:17:30 +0000 (14:17 -0500)]
x86/spec: Always set IBRS to guest value on VMENTER and host on VMEXIT.
The paper says to "set IBRS even if it was already set".
The Intel drop does not have that (it checks to see if it was enabled, and
if so does not do the WRMSR).
Furthermore, it says that on VM Entry we should restore the guest value.
But the patches from Intel again do that _only_ if the guest
has the IBRS set to zero.
Xen does it that way (as the PDF).
Red Hat code follows the same way as Intel.
It is confusing. Upstream Arjan says:
IBRS will ensure that, when set after the ring transition, no earlier
branch prediction data is used for indirect branches while IBRS is set
What is a ring transition? Upon more clarification it is not a
ring transition, but a prediction mode change. And a
VMX non-root to VMX root transition is a prediction mode change, and
a 1 setting in the less privileged mode is not sufficient for VMX root mode.
In effect we do want to make a write to the MSR setting IBRS
(even if the value is already set to 1).
Jamie Iles [Mon, 8 Jan 2018 23:21:44 +0000 (23:21 +0000)]
x86/ia32: save and clear registers on syscall.
This is a followup to 111ba91464f2 (x86/syscall: Clear unused extra
registers on syscall entrance) and a1aa2e658e0af (Re-introduce clearing
of r12-15, rbp, rbx), making sure that we also save and clear registers
on the compat syscalls. Otherwise we see segfaults when running a
32-bit binary on a 64-bit kernel.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Pavel Tatashin [Mon, 8 Jan 2018 21:02:27 +0000 (16:02 -0500)]
pti: Rename X86_FEATURE_KAISER to X86_FEATURE_PTI
cat /proc/cpuinfo still shows the kaiser feature, and we want only pti
to be visible to users. Therefore, rename this macro to get the
correct user-visible output.
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Reported-by: Håkon Bugge <haakon.bugge@oracle.com> Signed-off-by: Jun Nakajima <jun.nakajima@intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Konrad Rzeszutek Wilk [Mon, 8 Jan 2018 01:22:26 +0000 (20:22 -0500)]
x86/kvm: Set IBRS on VMEXIT if guest disabled it.
If the guest does not write FEATURE_ENABLE_IBRS to
MSR_IA32_SPEC_CTRL, then KVM will not issue such a write afterwards
(Indirect Branch Prediction Injection).
Right before VMENTER we set the MSR to zero (if the guest
had it set to zero), or leave it at 1 (if the guest
had it set to 1).
But on the VMEXIT, if the guest decided to set it to _zero_
before the VMEXIT, then we will leave it at zero and _not_
do the wrmsr to set it to 1!
That is wrong.
And also, if the guest did set it to 1, then we write 1 to it again.
This fix turns the check around so that the MSR will always
be set to 1 - with the optimization that if the guest had
set it, we just keep it at 1.
Kris Van Hees [Sun, 7 Jan 2018 20:18:42 +0000 (12:18 -0800)]
Re-introduce clearing of r12-15, rbp, rbx
Re-introduce the clearing of the extra registers (r12-r15, rbp, rbx)
upon entry into a system call. This commit ensures that we do not
save the extra registers after they got cleared, because that causes
NULL values to get written in place of the saved values.
Konrad Rzeszutek Wilk [Sun, 7 Jan 2018 04:35:11 +0000 (23:35 -0500)]
x86/enter: Use IBRS on syscall and interrupts - fix ia32 path
The backports missed a tiny bit of changes.
The easier of them is ia32_syscall - there are two ways it returns
back to userspace: it goes to int_ret_from_sys_call and from there eventually
ends up either in syscall_return_via_sysret or opportunistic_sysret_failed.
syscall_return_via_sysret had it, but opportunistic_sysret_failed did not.
That is because we optimized a bit and stuck the DISABLE_IBRS
in restore_c_regs_and_iret, which was called from opportunistic_sysret_failed
and retint_swapgs.
But with KPTI, doing IBRS_DISABLE from within restore_c_regs_and_iret is
not good - as we are touching a kernel variable and restore_c_regs_and_iret is
running with user-mode cr3!
So "x86: Fix spectre/kpti integration" fixed it by adding the DISABLE_IBRS
to syscall_return_via_sysret.
(If you look at the original commit you would think that we should
also fix opportunistic_sysret_failed, but that is fixed in
"x86: Fix spectre/kpti integration")
The secondary issue is that we did not call DISABLE_IBRS from
sysexit_from_sys_call. This patch adds that in too.
Konrad Rzeszutek Wilk [Sun, 7 Jan 2018 04:25:30 +0000 (23:25 -0500)]
x86: Fix spectre/kpti integration
The issue is that the first operation of DISABLE_IBRS (and pretty much all
of the *_IBRS macros) is touching a kernel variable. restore_c_regs_and_iret
already runs with the user-space cr3, so we page fault.
The fix is simple - do not run any of the IBRS macros from within
restore_c_regs_and_iret. Which means that the three functions that
used to call it now have to call IBRS_DISABLE by themselves:
retint_swapgs, opportunistic_sysret_failed, and nmi.
Adding in the IBRS_DISABLE in opportunistic_sysret_failed also
fixes another bug - which is more clearly explained in
"x86/enter: Use IBRS on syscall and interrupts - fix ia32 path"
Orabug: 27333760
CVE: CVE-2017-5754 Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Jiri Kosina [Fri, 5 Jan 2018 19:21:38 +0000 (11:21 -0800)]
PTI: unbreak EFI old_memmap
old_memmap's efi_call_phys_prolog() calls set_pgd() with a swapper PGD that
has _PAGE_USER set, which makes PTI set NX on it, and therefore EFI can't
execute its code.
Fix that by forcefully clearing _PAGE_NX from the PGD (this can't be done
by the pgprot API).
_PAGE_NX will be automatically reintroduced in efi_call_phys_epilog(), as
_set_pgd() will again notice that this is _PAGE_USER, and set _PAGE_NX on
it.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Jamie Iles [Fri, 5 Jan 2018 18:13:10 +0000 (18:13 +0000)]
x86/ldt: fix crash in ldt freeing.
94b1f3e2c4b7 (kaiser: merged update) factored out __free_ldt_struct() to
use vfree/free_page, but in the page allocation case the LDT is actually
allocated with kmalloc, so it needs to be freed with kfree and not
free_page().
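A sketch of the corrected helper, pairing the allocator and the free routine;
field names are as in the mainline ldt code, and the UEK4 version may differ
slightly:
static void __free_ldt_struct(struct ldt_struct *ldt)
{
        /* Large LDTs come from vmalloc(); page-sized ones from kmalloc(),
         * so they must be freed with kfree(), not free_page(). */
        if (ldt->size * LDT_ENTRY_SIZE > PAGE_SIZE)
                vfree(ldt->entries);
        else
                kfree(ldt->entries);
        kfree(ldt);
}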
x86/entry: Define 'cpu_current_top_of_stack' for 64-bit code
32-bit code has PER_CPU_VAR(cpu_current_top_of_stack).
64-bit code uses the somewhat more obscure PER_CPU_VAR(cpu_tss + TSS_sp0).
Define the 'cpu_current_top_of_stack' macro on CONFIG_X86_64
as well so that the PER_CPU_VAR(cpu_current_top_of_stack)
expression can be used in both 32-bit and 64-bit code.
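The definition itself is a one-liner in the x86 thread_info header, sketched
here as in the upstream commit (its exact location in this tree may differ):
#ifdef CONFIG_X86_64
/* Same per-CPU location that 64-bit entry code already reads as
 * PER_CPU_VAR(cpu_tss + TSS_sp0), now nameable from both 32- and 64-bit code. */
# define cpu_current_top_of_stack (cpu_tss + TSS_sp0)
#endif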
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com> Cc: Alexei Starovoitov <ast@plumgrid.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Drewry <wad@chromium.org> Link: http://lkml.kernel.org/r/1429889495-27850-3-git-send-email-dvlasenk@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 3a23208e69679597e767cf3547b1a30dd845d9b5)
Orabug: 27333760
CVE: CVE-2017-5754 Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Conflicts:
arch/x86/ia32/ia32entry.S
Guenter Roeck [Thu, 4 Jan 2018 21:41:55 +0000 (13:41 -0800)]
kaiser: Set _PAGE_NX only if supported
This resolves a crash if loaded under qemu + haxm under windows.
See https://www.spinics.net/lists/kernel/msg2689835.html for details.
Here is a boot log (the log is from chromeos-4.4, but Tao Wu says that
the same log is also seen with vanilla v4.4.110-rc1).
The crash part of this problem may be solved with the following patch
(thanks to Hugh for the hint). There is still another problem, though -
with this patch applied, the qemu session aborts with "VCPU Shutdown
request", whatever that means.
Let kaiser_flush_tlb_on_return_to_user() do the X86_FEATURE_PCID
check, instead of each caller doing it inline first: nobody needs
to optimize for the noPCID case, it's clearer this way, and better
suits later changes. Replace those no-op X86_CR3_PCID_KERN_FLUSH lines
by a BUILD_BUG_ON() in load_new_mm_cr3(), in case something changes.
Hugh Dickins [Sun, 5 Nov 2017 01:23:24 +0000 (18:23 -0700)]
kaiser: asm/tlbflush.h handle noPGE at lower level
I found asm/tlbflush.h too twisty, and think it safer not to avoid
__native_flush_tlb_global_irq_disabled() in the kaiser_enabled case,
but instead let it handle kaiser_enabled along with cr3: it can just
use __native_flush_tlb() for that, no harm in re-disabling preemption.
(This is not the same change as Kirill and Dave have suggested for
upstream, flipping PGE in cr4: that's neat, but needs a cpu_has_pge
check; cr3 is enough for kaiser, and thought to be cheaper than cr4.)
Also delete the X86_FEATURE_INVPCID invpcid_flush_all_nonglobals()
preference from __native_flush_tlb(): unlike the invpcid_flush_all()
preference in __native_flush_tlb_global(), it's not seen in upstream
4.14, and was recently reported to be surprisingly slow.
Hugh Dickins [Sun, 29 Oct 2017 18:36:19 +0000 (11:36 -0700)]
kaiser: drop is_atomic arg to kaiser_pagetable_walk()
I have not observed a might_sleep() warning from setup_fixmap_gdt()'s
use of kaiser_add_mapping() in our tree (why not?), but like upstream
we have not provided a way for that to pass is_atomic true down to
kaiser_pagetable_walk(), and at startup it's far from a likely source
of trouble: so just delete the walk's is_atomic arg and might_sleep().
Hugh Dickins [Wed, 4 Oct 2017 03:49:04 +0000 (20:49 -0700)]
kaiser: use ALTERNATIVE instead of x86_cr3_pcid_noflush
Now that we're playing the ALTERNATIVE game, use that more efficient
method: instead of user-mapping an extra page, and reading an extra
cacheline each time for x86_cr3_pcid_noflush.
Neel has found that __stringify(bts $X86_CR3_PCID_NOFLUSH_BIT, %rax)
is a working substitute for the "bts $63, %rax" in these ALTERNATIVEs;
but the one line with $63 in looks clearer, so let's stick with that.
Worried about what happens with an ALTERNATIVE between the jump and
jump label in another ALTERNATIVE? I was, but have checked the
combinations in SWITCH_KERNEL_CR3_NO_STACK at entry_SYSCALL_64,
and it does a good job.
Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 2dff99eb0335f9e0817410696a180dba25ca7371)
Orabug: 27333760
CVE: CVE-2017-5754 Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Conflicts:
arch/x86/entry/entry_64.S (not in this tree)
arch/x86/kernel/entry_64.S (patched instead of that)
kaiser: add "nokaiser" boot option, using ALTERNATIVE
Added "nokaiser" boot option: an early param like "noinvpcid".
Most places now check int kaiser_enabled (#defined 0 when not
CONFIG_KAISER) instead of #ifdef CONFIG_KAISER; but entry_64.S
and entry_64_compat.S are using the ALTERNATIVE technique, which
patches in the preferred instructions at runtime. That technique
is tied to x86 cpu features, so X86_FEATURE_KAISER is fabricated.
Prior to "nokaiser", Kaiser #defined _PAGE_GLOBAL 0: revert that,
but be careful with both _PAGE_GLOBAL and CR4.PGE: setting them when
nokaiser like when !CONFIG_KAISER, but not setting either when kaiser -
neither matters on its own, but it's hard to be sure that _PAGE_GLOBAL
won't get set in some obscure corner, or something add PGE into CR4.
By omitting _PAGE_GLOBAL from __supported_pte_mask when kaiser_enabled,
all page table setup which uses pte_pfn() masks it out of the ptes.
It's slightly shameful that the same declaration versus definition of
kaiser_enabled appears in not one, not two, but in three header files
(asm/kaiser.h, asm/pgtable.h, asm/tlbflush.h). I felt safer that way,
than with #including any of those in any of the others; and did not
feel it worth an asm/kaiser_enabled.h - kernel/cpu/common.c includes
them all, so we shall hear about it if they get out of synch.
Cleanups while in the area: removed the silly #ifdef CONFIG_KAISER
from kaiser.c; removed the unused native_get_normal_pgd(); removed
the spurious reg clutter from SWITCH_*_CR3 macro stubs; corrected some
comments. But more interestingly, set CR4.PSE in secondary_startup_64:
the manual is clear that it does not matter whether it's 0 or 1 when
4-level-pts are enabled, but I was distracted to find cr4 different on
BSP and auxiliaries - BSP alone was adding PSE, in probe_page_size_mask().
Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit e345dcc9481543edf4a0a5df4c4c2f9597b0a997)
Orabug: 27333760
CVE: CVE-2017-5754 Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Conflicts:
arch/x86/entry/entry_64.S (not in this tree)
arch/x86/kernel/entry_64.S (patched instead of that)