users/jedix/linux-maple.git
7 years ago  x86/KVM/VMX: Move the l1tf_flush_l1d test to vmx_l1d_flush()
Nicolai Stange [Sat, 21 Jul 2018 20:35:28 +0000 (22:35 +0200)]
x86/KVM/VMX: Move the l1tf_flush_l1d test to vmx_l1d_flush()

Currently, vmx_vcpu_run() checks if l1tf_flush_l1d is set and invokes
vmx_l1d_flush() if so.

This test is unnecessary for the "always flush L1D" mode.

Move the check to vmx_l1d_flush()'s conditional mode code path.

Notes:
- vmx_l1d_flush() is likely to get inlined anyway and thus, there's no
  extra function call.

- This inverts the (static) branch prediction, but there hadn't been any
  explicit likely()/unlikely() annotations before and so it stays as is.

Signed-off-by: Nicolai Stange <nstange@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Orabug: 28220625
CVE: CVE-2018-3646

(cherry picked from commit 5b6ccc6c3b1a477fbac9ec97a0b4c1c48e765209)

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/kvm/vmx.c
Contextual: different content caused by not having all static key features

7 years ago  x86/KVM/VMX: Replace 'vmx_l1d_flush_always' with 'vmx_l1d_flush_cond'
Nicolai Stange [Sat, 21 Jul 2018 20:25:00 +0000 (22:25 +0200)]
x86/KVM/VMX: Replace 'vmx_l1d_flush_always' with 'vmx_l1d_flush_cond'

The vmx_l1d_flush_always static key is only ever evaluated if
vmx_l1d_should_flush is enabled. In that case however, there are only two
L1d flushing modes possible: "always" and "conditional".

The "conditional" mode's implementation tends to require more sophisticated
logic than the "always" mode.

Avoid inverted logic by replacing the 'vmx_l1d_flush_always' static key
with a 'vmx_l1d_flush_cond' one.

There is no change in functionality.

Signed-off-by: Nicolai Stange <nstange@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Orabug: 28220625
CVE: CVE-2018-3646

(cherry picked from commit 427362a142441f08051369db6fbe7f61c73b3dca)

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/kvm/vmx.c
Contextual: different content caused by not having all static key features

7 years ago  x86/KVM/VMX: Don't set l1tf_flush_l1d to true from vmx_l1d_flush()
Nicolai Stange [Sat, 21 Jul 2018 20:16:56 +0000 (22:16 +0200)]
x86/KVM/VMX: Don't set l1tf_flush_l1d to true from vmx_l1d_flush()

vmx_l1d_flush() gets invoked only if l1tf_flush_l1d is true. There's no
point in setting l1tf_flush_l1d to true from there again.

Signed-off-by: Nicolai Stange <nstange@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Orabug: 28220625
CVE: CVE-2018-3646

(cherry picked from commit 379fd0c7e6a391e5565336a646f19f218fb98c6c)

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/kvm/vmx.c
Contextual: different content caused by not having all static key features.

7 years ago  KVM: VMX: support MSR_IA32_ARCH_CAPABILITIES as a feature MSR
Paolo Bonzini [Mon, 25 Jun 2018 12:04:37 +0000 (14:04 +0200)]
KVM: VMX: support MSR_IA32_ARCH_CAPABILITIES as a feature MSR

This lets userspace read the MSR_IA32_ARCH_CAPABILITIES and check that all
requested features are available on the host.
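
A sketch of the VMX side, assuming the feature-MSR hook introduced by
the commits below (the host MSR value is passed straight through):

    static int vmx_get_msr_feature(struct kvm_msr_entry *msr)
    {
            switch (msr->index) {
            case MSR_IA32_ARCH_CAPABILITIES:
                    if (!boot_cpu_has(X86_FEATURE_ARCH_CAPABILITIES))
                            return 1;
                    /* Report the host's value to userspace */
                    rdmsrl(msr->index, msr->data);
                    break;
            default:
                    return 1;
            }
            return 0;
    }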

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Orabug: 28220625
CVE: CVE-2018-3646

(cherry picked from commit cd28325249a1ca0d771557ce823e0308ad629f98)

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/kvm/x86.c
Contextual: different content

7 years ago  KVM: X86: Introduce kvm_get_msr_feature()
Wanpeng Li [Wed, 28 Feb 2018 06:03:30 +0000 (14:03 +0800)]
KVM: X86: Introduce kvm_get_msr_feature()

Introduce kvm_get_msr_feature() to handle the msrs which are supported
by different vendors and sharing the same emulation logic.
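
A sketch of the common helper; MSRs without shared emulation logic are
deferred to the vendor module via the new kvm_x86_ops callback:

    static int kvm_get_msr_feature(struct kvm_msr_entry *msr)
    {
            switch (msr->index) {
            default:
                    if (kvm_x86_ops->get_msr_feature(msr))
                            return 1;
            }
            return 0;
    }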

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Liran Alon <liran.alon@oracle.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Orabug: 28220674
CVE: CVE-2018-3646

(cherry picked from commit 66421c1ec340096b291af763ed5721314cdd9c5c)

This was needed in order to backport commit cd28325249a1ca0d771557ce823e0308ad629f98.

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>

7 years ago  KVM: x86: Add a framework for supporting MSR-based features
Tom Lendacky [Wed, 21 Feb 2018 19:39:51 +0000 (13:39 -0600)]
KVM: x86: Add a framework for supporting MSR-based features

Provide a new KVM capability that allows bits within MSRs to be recognized
as features.  Two new ioctls are added to the /dev/kvm ioctl routine to
retrieve the list of these MSRs and then retrieve their values. A kvm_x86_ops
callback is used to determine support for the listed MSR-based features.
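
An illustrative userspace probe (hypothetical snippet, error handling
omitted), assuming the extended KVM_GET_MSRS system ioctl takes the
usual struct kvm_msrs layout:

    int kvm = open("/dev/kvm", O_RDWR);
    struct {
            struct kvm_msrs hdr;
            struct kvm_msr_entry entry;
    } req = { .hdr.nmsrs = 1 };

    req.entry.index = MSR_IA32_ARCH_CAPABILITIES;
    /* On the /dev/kvm fd this reads feature MSRs, not guest state */
    if (ioctl(kvm, KVM_GET_MSRS, &req) == 1)
            printf("ARCH_CAPABILITIES: 0x%llx\n",
                   (unsigned long long)req.entry.data);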

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[Tweaked documentation. - Radim]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Orabug: 28220674
CVE: CVE-2018-3646

(cherry picked from commit 801e459a6f3a63af9d447e6249088c76ae16efc4)

This was needed in order to backport commit cd28325249a1ca0d771557ce823e0308ad629f98.

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/include/asm/kvm_host.h
arch/x86/kvm/svm.c
arch/x86/kvm/vmx.c
arch/x86/kvm/x86.c
include/uapi/linux/kvm.h
Contextual: different content

7 years ago  cpu/hotplug: detect SMT disabled by BIOS
Josh Poimboeuf [Tue, 24 Jul 2018 16:17:40 +0000 (18:17 +0200)]
cpu/hotplug: detect SMT disabled by BIOS

If SMT is disabled in BIOS, the CPU code doesn't properly detect it.
The /sys/devices/system/cpu/smt/control file shows 'on', and the 'l1tf'
vulnerabilities file shows SMT as vulnerable.

Fix it by forcing 'cpu_smt_control' to CPU_SMT_NOT_SUPPORTED in such a
case.  Unfortunately the detection can only be done after bringing all
the CPUs online, so we have to overwrite any previous writes to the
variable.
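
A sketch of the detection, assuming it runs once all present CPUs have
been brought online:

    void __init cpu_smt_check_topology(void)
    {
            /*
             * No siblings were enumerated, so SMT is either not
             * supported or disabled in BIOS; override earlier writes.
             */
            if (!topology_smt_supported())
                    cpu_smt_control = CPU_SMT_NOT_SUPPORTED;
    }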

Reported-by: Joe Mario <jmario@redhat.com>
Tested-by: Jiri Kosina <jkosina@suse.cz>
Fixes: f048c399e0f7 ("x86/topology: Provide topology_smt_supported()")
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Orabug: 28220674
CVE: CVE-2018-3620

(cherry picked from commit 73d5e2b472640b1fcdb61ae8be389912ef211bda)

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>

7 years ago  Documentation/l1tf: Fix typos
Tony Luck [Thu, 19 Jul 2018 20:49:58 +0000 (13:49 -0700)]
Documentation/l1tf: Fix typos

Fix spelling and other typos

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Orabug: 28220674
CVE: CVE-2018-3620

(cherry picked from commit 1949f9f49792d65dba2090edddbe36a5f02e3ba3)

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>

7 years ago  x86/KVM/VMX: Initialize the vmx_l1d_flush_pages' content
Nicolai Stange [Wed, 18 Jul 2018 17:07:38 +0000 (19:07 +0200)]
x86/KVM/VMX: Initialize the vmx_l1d_flush_pages' content

The slow path in vmx_l1d_flush() reads from vmx_l1d_flush_pages in order
to evict the L1d cache.

However, these pages are never cleared and, in theory, their data could be
leaked.

More importantly, KSM could merge a nested hypervisor's vmx_l1d_flush_pages
to fewer than 1 << L1D_CACHE_ORDER host physical pages and this would break
the L1d flushing algorithm: L1D on x86_64 is tagged by physical addresses.

Fix this by initializing the individual vmx_l1d_flush_pages with a
different pattern each.

Rename the "empty_zp" asm constraint identifier in vmx_l1d_flush() to
"flush_pages" to reflect this change.

Fixes: a47dd5f06714 ("x86/KVM/VMX: Add L1D flush algorithm")
Signed-off-by: Nicolai Stange <nstange@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Orabug: 28220674
CVE: CVE-2018-3646

(cherry picked from commit 288d152c23dcf3c09da46c5c481903ca10ebfef7)

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>

7 years ago  x86/speculation/l1tf: Unbreak !__HAVE_ARCH_PFN_MODIFY_ALLOWED architectures
Jiri Kosina [Sat, 14 Jul 2018 19:56:13 +0000 (21:56 +0200)]
x86/speculation/l1tf: Unbreak !__HAVE_ARCH_PFN_MODIFY_ALLOWED architectures

pfn_modify_allowed() and arch_has_pfn_modify_check() are outside of the
!__ASSEMBLY__ section in include/asm-generic/pgtable.h, which confuses
assembler on archs that don't have __HAVE_ARCH_PFN_MODIFY_ALLOWED (e.g.
ia64) and breaks build:

    include/asm-generic/pgtable.h: Assembler messages:
    include/asm-generic/pgtable.h:538: Error: Unknown opcode `static inline bool pfn_modify_allowed(unsigned long pfn,pgprot_t prot)'
    include/asm-generic/pgtable.h:540: Error: Unknown opcode `return true'
    include/asm-generic/pgtable.h:543: Error: Unknown opcode `static inline bool arch_has_pfn_modify_check(void)'
    include/asm-generic/pgtable.h:545: Error: Unknown opcode `return false'
    arch/ia64/kernel/entry.S:69: Error: `mov' does not fit into bundle

Move those two static inlines into the !__ASSEMBLY__ section so that they
don't confuse the asm build pass.
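
In sketch form, the stubs simply move inside the !__ASSEMBLY__ guard:

    #ifndef __ASSEMBLY__

    #ifndef __HAVE_ARCH_PFN_MODIFY_ALLOWED
    static inline bool pfn_modify_allowed(unsigned long pfn, pgprot_t prot)
    {
            return true;
    }

    static inline bool arch_has_pfn_modify_check(void)
    {
            return false;
    }
    #endif

    #endif /* !__ASSEMBLY__ */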

Fixes: 42e4089c7890 ("x86/speculation/l1tf: Disallow non privileged high MMIO PROT_NONE mappings")
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Orabug: 28220674
CVE: CVE-2018-3620

(cherry picked from commit 6c26fcd2abfe0a56bbd95271fce02df2896cfd24)

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
include/asm-generic/pgtable.h
Contextual: different content

7 years ago  Documentation: Add section about CPU vulnerabilities
Thomas Gleixner [Fri, 13 Jul 2018 14:23:26 +0000 (16:23 +0200)]
Documentation: Add section about CPU vulnerabilities

Add documentation for the L1TF vulnerability and the mitigation mechanisms:

  - Explain the problem and risks
  - Document the mitigation mechanisms
  - Document the command line controls
  - Document the sysfs files

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lkml.kernel.org/r/20180713142323.287429944@linutronix.de
Orabug: 28220674
CVE: CVE-2018-3620

(cherry picked from commit 3ec8ce5d866ec6a08a9cfab82b62acf4a830b35f)

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
Documentation/admin-guide/index.rst
Documentation/admin-guide/l1tf.rst
Contextual: The admin-guide is introduced later by 9d85025b "docs-rst: create
an user's manual book", so l1tf is added in the Documentation folder.

7 years ago  x86/bugs, kvm: Introduce boot-time control of L1TF mitigations
Jiri Kosina [Fri, 13 Jul 2018 14:23:25 +0000 (16:23 +0200)]
x86/bugs, kvm: Introduce boot-time control of L1TF mitigations

Introduce the 'l1tf=' kernel command line option to allow for boot-time
switching of mitigation that is used on processors affected by L1TF.

The possible values are:

  full
Provides all available mitigations for the L1TF vulnerability. Disables
SMT and enables all mitigations in the hypervisors. SMT control via
/sys/devices/system/cpu/smt/control is still possible after boot.
Hypervisors will issue a warning when the first VM is started in
a potentially insecure configuration, i.e. SMT enabled or L1D flush
disabled.

  full,force
Same as 'full', but disables SMT control. Implies the 'nosmt=force'
command line option. sysfs control of SMT and the hypervisor flush
control is disabled.

  flush
Leaves SMT enabled and enables the conditional hypervisor mitigation.
Hypervisors will issue a warning when the first VM is started in a
potentially insecure configuration, i.e. SMT enabled or L1D flush
disabled.

  flush,nosmt
Disables SMT and enables the conditional hypervisor mitigation. SMT
control via /sys/devices/system/cpu/smt/control is still possible
after boot. If SMT is reenabled or flushing disabled at runtime
hypervisors will issue a warning.

  flush,nowarn
Same as 'flush', but hypervisors will not warn when
a VM is started in a potentially insecure configuration.

  off
Disables hypervisor mitigations and doesn't emit any warnings.

Default is 'flush'.

Let KVM adhere to these semantics, which means:

  - 'l1tf=full,force'   : Perform L1D flushes. No runtime control
                          possible.

  - 'l1tf=full'
  - 'l1tf=flush'
  - 'l1tf=flush,nosmt'  : Perform L1D flushes and warn on VM start if
                          SMT has been runtime enabled or L1D flushing
                          has been runtime disabled.

  - 'l1tf=flush,nowarn' : Perform L1D flushes and no warnings are emitted.

  - 'l1tf=off'          : L1D flushes are not performed and no warnings
                          are emitted.

KVM can always override the L1D flushing behavior using its 'vmentry_l1d_flush'
module parameter except when l1tf=full,force is set.

This makes KVM's private 'nosmt' option redundant, and as it is a bit
non-systematic anyway (this is something to control globally, not on
hypervisor level), remove that option.

Add the missing Documentation entry for the l1tf vulnerability sysfs file
while at it.
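
A sketch of the early-parameter mapping (upstream shape; the backport
applies it in bugs_64.c per the conflict note below):

    static int __init l1tf_cmdline(char *str)
    {
            if (!boot_cpu_has_bug(X86_BUG_L1TF))
                    return 0;

            if (!str)
                    return -EINVAL;

            if (!strcmp(str, "off"))
                    l1tf_mitigation = L1TF_MITIGATION_OFF;
            else if (!strcmp(str, "flush,nowarn"))
                    l1tf_mitigation = L1TF_MITIGATION_FLUSH_NOWARN;
            else if (!strcmp(str, "flush"))
                    l1tf_mitigation = L1TF_MITIGATION_FLUSH;
            else if (!strcmp(str, "flush,nosmt"))
                    l1tf_mitigation = L1TF_MITIGATION_FLUSH_NOSMT;
            else if (!strcmp(str, "full"))
                    l1tf_mitigation = L1TF_MITIGATION_FULL;
            else if (!strcmp(str, "full,force"))
                    l1tf_mitigation = L1TF_MITIGATION_FULL_FORCE;

            return 0;
    }
    early_param("l1tf", l1tf_cmdline);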

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lkml.kernel.org/r/20180713142323.202758176@linutronix.de
Orabug: 28220674
CVE: CVE-2018-3646

(cherry picked from commit d90a7a0ec83fb86622cd7dae23255d3c50a99ec8)

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/include/asm/processor.h
Documentation/admin-guide/kernel-parameters.txt
arch/x86/kernel/cpu/bugs.c
arch/x86/kvm/vmx.c
Contextual: different content; modified bugs_64.c instead of bugs.c and
Documentation/kernel-parameters.txt instead of
Documentation/admin-guide/kernel-parameters.txt.

7 years ago  cpu/hotplug: Set CPU_SMT_NOT_SUPPORTED early
Thomas Gleixner [Fri, 13 Jul 2018 14:23:24 +0000 (16:23 +0200)]
cpu/hotplug: Set CPU_SMT_NOT_SUPPORTED early

The CPU_SMT_NOT_SUPPORTED state is set (if the processor does not support
SMT) when the sysfs SMT control file is initialized.

That was fine so far as this was only required to make the output of the
control file correct and to prevent writes in that case.

With the upcoming l1tf command line parameter, this needs to be set up
before the L1TF mitigation selection and command line parsing happens.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lkml.kernel.org/r/20180713142323.121795971@linutronix.de
Orabug: 28220674
CVE: CVE-2018-3620

(cherry picked from commit fee0aede6f4739c87179eca76136f83210953b86)

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/kernel/cpu/bugs.c
kernel/cpu.c
Contextual: Modified bugs_64.c instead of bugs.c

7 years ago  cpu/hotplug: Expose SMT control init function
Jiri Kosina [Fri, 13 Jul 2018 14:23:23 +0000 (16:23 +0200)]
cpu/hotplug: Expose SMT control init function

The L1TF mitigation will gain a command line parameter which allows setting
a combination of hypervisor mitigation and SMT control.

Expose cpu_smt_disable() so the command line parser can tweak SMT settings.

[ tglx: Split out of larger patch and made it preserve an already existing
   force off state ]
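
A sketch of the exposed helper, preserving an existing force-off state:

    void __init cpu_smt_disable(bool force)
    {
            if (cpu_smt_control == CPU_SMT_FORCE_DISABLED ||
                cpu_smt_control == CPU_SMT_NOT_SUPPORTED)
                    return;

            if (force) {
                    pr_info("SMT: Force disabled\n");
                    cpu_smt_control = CPU_SMT_FORCE_DISABLED;
            } else {
                    cpu_smt_control = CPU_SMT_DISABLED;
            }
    }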

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lkml.kernel.org/r/20180713142323.039715135@linutronix.de
Orabug: 28220674
CVE: CVE-2018-3620

(cherry picked from commit 8e1b706b6e819bed215c0db16345568864660393)

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
kernel/cpu.c
Contextual: different content

7 years ago  x86/kvm: Allow runtime control of L1D flush
Thomas Gleixner [Fri, 13 Jul 2018 14:23:22 +0000 (16:23 +0200)]
x86/kvm: Allow runtime control of L1D flush

All mitigation modes can be switched at run time with a static key now:

 - Use sysfs_streq() instead of strcmp() to handle the trailing new line
   from sysfs writes correctly.
 - Make the static key management handle multiple invocations properly.
 - Set the module parameter file to RW
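
A sketch of the parse helper; sysfs_streq() treats a trailing newline as
equal, which plain strcmp() does not (vmentry_l1d_param is the option
table from the 'Add module argument' patch further down):

    static int vmentry_l1d_flush_parse(const char *s)
    {
            unsigned int i;

            if (s) {
                    for (i = 0; i < ARRAY_SIZE(vmentry_l1d_param); i++) {
                            if (sysfs_streq(s, vmentry_l1d_param[i].option))
                                    return vmentry_l1d_param[i].cmd;
                    }
            }
            return -EINVAL;
    }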

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lkml.kernel.org/r/20180713142322.954525119@linutronix.de
Orabug: 28220674
CVE: CVE-2018-3646

(cherry picked from commit 895ae47f9918833c3a880fbccd41e0692b37e7d9)

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/kernel/cpu/bugs.c
arch/x86/kvm/vmx.c
Contextual: different content; bugs_64.c was modified instead of bugs.c

7 years ago  x86/kvm: Serialize L1D flush parameter setter
Thomas Gleixner [Fri, 13 Jul 2018 14:23:21 +0000 (16:23 +0200)]
x86/kvm: Serialize L1D flush parameter setter

Writes to the parameter files are not serialized at the sysfs core
level, so local serialization is required.
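
A sketch of the serialized setter, assuming a file-local mutex (helper
names follow the upstream patches):

    static DEFINE_MUTEX(vmx_l1d_flush_mutex);

    static int vmentry_l1d_flush_set(const char *s,
                                     const struct kernel_param *kp)
    {
            int l1tf, ret;

            l1tf = vmentry_l1d_flush_parse(s);
            if (l1tf < 0)
                    return l1tf;

            mutex_lock(&vmx_l1d_flush_mutex);
            ret = vmx_setup_l1d_flush(l1tf);
            mutex_unlock(&vmx_l1d_flush_mutex);
            return ret;
    }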

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lkml.kernel.org/r/20180713142322.873642605@linutronix.de
Orabug: 28220674
CVE: CVE-2018-3646

(cherry picked from commit dd4bfa739a72508b75760b393d129ed7b431daab)

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/kvm/vmx.c
Contextual: different content

7 years ago  x86/kvm: Add static key for flush always
Thomas Gleixner [Fri, 13 Jul 2018 14:23:20 +0000 (16:23 +0200)]
x86/kvm: Add static key for flush always

Avoid the conditional in the L1D flush control path.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lkml.kernel.org/r/20180713142322.790914912@linutronix.de
Orabug: 28220674
CVE: CVE-2018-3646

(cherry picked from commit 4c6523ec59fe895ea352a650218a6be0653910b1)

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/kvm/vmx.c
Contextual: different content; Use unlikely(static_key_enabled) instead
of static_branch_unlikely because 11276d530 "locking/static_keys: Add a
new static_key interface" is not present in this version of the kernel.

7 years ago  x86/kvm: Move l1tf setup function
Thomas Gleixner [Fri, 13 Jul 2018 14:23:19 +0000 (16:23 +0200)]
x86/kvm: Move l1tf setup function

In preparation of allowing run time control for L1D flushing, move the
setup code to the module parameter handler.

In case of pre module init parsing, just store the value and let vmx_init()
do the actual setup after running kvm_init() so that enable_ept is having
the correct state.

During run-time invoke it directly from the parameter setter to prepare for
run-time control.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lkml.kernel.org/r/20180713142322.694063239@linutronix.de
Orabug: 28220625
CVE: CVE-2018-3646

(cherry picked from commit 7db92e165ac814487264632ab2624e832f20ae38)

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Conflicts:
arch/x86/kvm/vmx.c
Contextual: different content

7 years ago  x86/l1tf: Handle EPT disabled state proper
Thomas Gleixner [Fri, 13 Jul 2018 14:23:18 +0000 (16:23 +0200)]
x86/l1tf: Handle EPT disabled state proper

If Extended Page Tables (EPT) are disabled or not supported, no L1D
flushing is required. The setup function can just avoid setting up the L1D
flush for the EPT=n case.

Invoke it after the hardware setup has been done and enable_ept has the
correct state and expose the EPT disabled state in the mitigation status as
well.
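
In sketch form, the setup function can now bail out early:

    /* If EPT is disabled, no L1D flushing is required */
    if (!enable_ept) {
            l1tf_vmx_mitigation = VMENTER_L1D_FLUSH_EPT_DISABLED;
            return 0;
    }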

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lkml.kernel.org/r/20180713142322.612160168@linutronix.de
Orabug: 28220625
CVE: CVE-2018-3620

(cherry picked from commit a7b9020b06ec6d7c3f3b0d4ef1a9eba12654f4f7)

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Conflicts:
arch/x86/kernel/cpu/bugs_64.c
arch/x86/kvm/vmx.c
Contextual: different content; Modified arch/x86/kernel/cpu/bugs_64.c
instead of arch/x86/kernel/cpu/bugs.c.

7 years ago  x86/kvm: Drop L1TF MSR list approach
Thomas Gleixner [Fri, 13 Jul 2018 14:23:17 +0000 (16:23 +0200)]
x86/kvm: Drop L1TF MSR list approach

The VMX module parameter to control the L1D flush should become
writeable.

The MSR list is set up at VM init per guest VCPU, but the run time
switching is based on a static key which is global. Toggling the MSR list
at run time might be feasible, but for now drop this optimization and use
the regular MSR write to make run-time switching possible.

The default mitigation is the conditional flush anyway, so for extra
paranoid setups this will add some small overhead, but the extra code
executed is in the noise compared to the flush itself.

Aside of that the EPT disabled case is not handled correctly at the moment
and the MSR list magic is in the way for fixing that as well.

If it's really providing a significant advantage, then this needs to be
revisited after the code is correct and the control is writable.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lkml.kernel.org/r/20180713142322.516940445@linutronix.de
Orabug: 28220674
CVE: CVE-2018-3646

(cherry picked from commit 2f055947ae5e2741fb2dc5bba1033c417ccf4faa)

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/kvm/vmx.c
Contextual: different content

7 years ago  x86/litf: Introduce vmx status variable
Thomas Gleixner [Fri, 13 Jul 2018 14:23:16 +0000 (16:23 +0200)]
x86/litf: Introduce vmx status variable

Store the effective mitigation of VMX in a status variable and use it to
report the VMX state in the l1tf sysfs file.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lkml.kernel.org/r/20180713142322.433098358@linutronix.de
Orabug: 28220674
CVE: CVE-2018-3620

(cherry picked from commit 72c6d2db64fa18c996ece8f06e499509e6c9a37e)

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/kvm/vmx.c
arch/x86/kernel/cpu/bugs.c
Contextual: different content; Changes were done in arch/x86/kernel/cpu/bugs_64.c
instead of arch/x86/kernel/cpu/bugs.c.
__ro_after_init was replaced with __read_mostly for l1tf_vmx_mitigation to avoid
cherry-picking c74ba8b3.

7 years ago  cpu/hotplug: Online siblings when SMT control is turned on
Thomas Gleixner [Sat, 7 Jul 2018 09:40:18 +0000 (11:40 +0200)]
cpu/hotplug: Online siblings when SMT control is turned on

Writing 'off' to /sys/devices/system/cpu/smt/control offlines all SMT
siblings. Writing 'on' merely enables the ability to online them, but does
not online them automatically.

Make 'on' more useful by onlining all offline siblings.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Orabug: 28220674
CVE: CVE-2018-3620

(cherry picked from commit 215af5499d9e2b55f111d2431ea20218115f29b3)

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>

7 years ago  x86/KVM/VMX: Use MSR save list for IA32_FLUSH_CMD if required
Konrad Rzeszutek Wilk [Thu, 28 Jun 2018 21:10:36 +0000 (17:10 -0400)]
x86/KVM/VMX: Use MSR save list for IA32_FLUSH_CMD if required

If the L1D flush module parameter is set to 'always' and the IA32_FLUSH_CMD
MSR is available, optimize the VMENTER code with the MSR save list.
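
A sketch of the optimization (the exact static key test differs in this
backport, as noted elsewhere in the series):

    /*
     * If flushing on every VMENTER is enforced and the MSR is
     * enumerated, let the atomic switch list issue the flush.
     */
    if (static_key_enabled(&vmx_l1d_flush_always) &&
        static_cpu_has(X86_FEATURE_FLUSH_L1D))
            add_atomic_switch_msr(vmx, MSR_IA32_FLUSH_CMD,
                                  L1D_FLUSH, 0, true);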

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Orabug: 28220674
CVE: CVE-2018-3646

(cherry picked from commit 390d975e0c4e60ce70d4157e0dd91ede37824603)

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/kvm/vmx.c
Contextual: different content

7 years ago  x86/KVM/VMX: Extend add_atomic_switch_msr() to allow VMENTER only MSRs
Konrad Rzeszutek Wilk [Thu, 21 Jun 2018 02:01:22 +0000 (22:01 -0400)]
x86/KVM/VMX: Extend add_atomic_switch_msr() to allow VMENTER only MSRs

The IA32_FLUSH_CMD MSR needs only to be written on VMENTER. Extend
add_atomic_switch_msr() with an entry_only parameter to allow storing the
MSR only in the guest (ENTRY) MSR array.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Orabug: 28220674
CVE: CVE-2018-3646

(cherry picked from commit 989e3992d2eca32c3f1404f2bc91acda3aa122d8)

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>

7 years ago  x86/KVM/VMX: Separate the VMX AUTOLOAD guest/host number accounting
Konrad Rzeszutek Wilk [Thu, 21 Jun 2018 02:00:47 +0000 (22:00 -0400)]
x86/KVM/VMX: Separate the VMX AUTOLOAD guest/host number accounting

This allows loading a different number of MSRs depending on the context:
VMEXIT or VMENTER.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Orabug: 28220674
CVE: CVE-2018-3646

(cherry picked from commit 3190709335dd31fe1aeeebfe4ffb6c7624ef971f)

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>

7 years ago  x86/KVM/VMX: Add find_msr() helper function
Konrad Rzeszutek Wilk [Thu, 21 Jun 2018 00:11:39 +0000 (20:11 -0400)]
x86/KVM/VMX: Add find_msr() helper function

.. to help find the MSR on either the guest or host MSR list.
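
The helper in sketch form, operating on the split MSR arrays:

    static int find_msr(struct vmx_msrs *m, unsigned int msr)
    {
            unsigned int i;

            for (i = 0; i < m->nr; ++i) {
                    if (m->val[i].index == msr)
                            return i;
            }
            return -ENOENT;
    }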

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Orabug: 28220674
CVE: CVE-2018-3646

(cherry picked from commit ca83b4a7f2d068da79a029d323024aa45decb250)

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>

7 years ago  x86/KVM/VMX: Split the VMX MSR LOAD structures to have an host/guest numbers
Konrad Rzeszutek Wilk [Wed, 20 Jun 2018 17:58:37 +0000 (13:58 -0400)]
x86/KVM/VMX: Split the VMX MSR LOAD structures to have an host/guest numbers

There is no semantic change, but this change allows an unbalanced number of
MSRs to be loaded on VMEXIT and VMENTER, i.e. the number of MSRs to save or
restore on VMEXIT or VMENTER may be different.
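
A sketch of the split structure, giving the guest and host lists
independent counts:

    struct vmx_msrs {
            unsigned int            nr;
            struct vmx_msr_entry    val[NR_AUTOLOAD_MSRS];
    };

    /* inside struct vcpu_vmx */
    struct msr_autoload {
            struct vmx_msrs guest;
            struct vmx_msrs host;
    } msr_autoload;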

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Orabug: 28220674
CVE: CVE-2018-3646

(cherry picked from commit 33966dd6b2d2c352fae55412db2ea8cfff5df13a)

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/kvm/vmx.c
Contextual: different content; in this version, prepare_vmcs02_full
doesn't exist and nested_vmx_vmexit lacks the lines modified by the
original patch.

7 years ago  x86/KVM/VMX: Add L1D flush logic
Paolo Bonzini [Mon, 2 Jul 2018 11:07:14 +0000 (13:07 +0200)]
x86/KVM/VMX: Add L1D flush logic

Add the logic for flushing L1D on VMENTER. The flush depends on the static
key being enabled and the new l1tf_flush_l1d flag being set.

The flags is set:
 - Always, if the flush module parameter is 'always'

 - Conditionally at:
   - Entry to vcpu_run(), i.e. after executing user space

   - From the sched_in notifier, i.e. when switching to a vCPU thread.

   - From vmexit handlers which are considered unsafe, i.e. where
     sensitive data can be brought into L1D:

     - The emulator, which could be a good target for other speculative
       execution-based threats,

     - The MMU, which can bring host page tables in the L1 cache.

     - External interrupts

     - Nested operations that require the MMU (see above). That is
       vmptrld, vmptrst, vmclear,vmwrite,vmread.

     - When handling invept,invvpid

[ tglx: Split out from combo patch and reduced to a single flag ]
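
For illustration, each conditional site reduces to setting one per-vCPU
flag, which the next VMENTER consumes (sketch):

    /* e.g. in the sched_in notifier, when switching to a vCPU thread */
    void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
    {
            vcpu->arch.l1tf_flush_l1d = true;
    }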

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Orabug: 28220674
CVE: CVE-2018-3646

(cherry picked from commit c595ceee45707f00f64f61c54fb64ef0cc0b4e85)

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/include/asm/kvm_host.h
arch/x86/kvm/mmu.c
arch/x86/kvm/vmx.c
arch/x86/kvm/x86.c
Contextual: different content
kvm_handle_page_fault doesn't exist, so l1tf_flush_l1d was set to true in
handle_exception.
static_branch_unlikely was replaced with unlikely(static_key_enabled) because
11276d53 "locking/static_keys: Add a new static_key interface" is missing in
the current version.

7 years ago  x86/KVM/VMX: Add L1D MSR based flush
Paolo Bonzini [Mon, 2 Jul 2018 11:03:48 +0000 (13:03 +0200)]
x86/KVM/VMX: Add L1D MSR based flush

336996-Speculative-Execution-Side-Channel-Mitigations.pdf defines a new MSR
(IA32_FLUSH_CMD aka 0x10B) which has similar write-only semantics to other
MSRs defined in the document.

The semantics of this MSR is to allow "finer granularity invalidation of
caching structures than existing mechanisms like WBINVD. It will writeback
and invalidate the L1 data cache, including all cachelines brought in by
preceding instructions, without invalidating all caches (eg. L2 or
LLC). Some processors may also invalidate the first level instruction
cache on a L1D_FLUSH command. The L1 data and instruction caches may be
shared across the logical processors of a core."

Use it instead of the loop based L1 flush algorithm.

A copy of this document is available at
   https://bugzilla.kernel.org/show_bug.cgi?id=199511

[ tglx: Avoid allocating pages when the MSR is available ]
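
A sketch of the resulting fast path in vmx_l1d_flush():

    if (static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
            /* A single WRMSR writes back and invalidates L1D */
            wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
            return;
    }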

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Orabug: 28220674
CVE: CVE-2018-3646

(cherry picked from commit 3fa045be4c720146b18a19cea7a767dc6ad5df94)

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/include/asm/msr-index.h
arch/x86/kvm/vmx.c
Contextual: different content; arch/x86/include/uapi/asm/msr-index.h
was modified instead of arch/x86/include/asm/msr-index.h which doesn't
exist in this version.

7 years ago  x86/KVM/VMX: Add L1D flush algorithm
Paolo Bonzini [Mon, 2 Jul 2018 10:47:38 +0000 (12:47 +0200)]
x86/KVM/VMX: Add L1D flush algorithm

To mitigate the L1 Terminal Fault vulnerability it's required to flush L1D
on VMENTER to prevent rogue guests from snooping host memory.

CPUs will have a new control MSR via a microcode update to flush L1D with a
single MSR write, but in the absence of microcode a fallback to a software
based flush algorithm is required.

Add a software flush loop which is based on code from Intel.

[ tglx: Split out from combo patch ]
[ bpetkov: Polish the asm code ]
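
A simplified C rendition of the two-pass loop (the real implementation
is inline asm; this sketch only conveys the structure, assuming the
flush pages are addressed as a char array):

    int size = PAGE_SIZE << L1D_CACHE_ORDER;
    int i;

    /* First pass: touch each page so its translation is in the TLB */
    for (i = 0; i < size; i += PAGE_SIZE)
            READ_ONCE(vmx_l1d_flush_pages[i]);

    /* Second pass: read every cache line to displace the L1D contents */
    for (i = 0; i < size; i += 64)
            READ_ONCE(vmx_l1d_flush_pages[i]);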

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Orabug: 28220674
CVE: CVE-2018-3646

(cherry picked from commit a47dd5f06714c844b33f3b5f517b6f3e81ce57b5)

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
    arch/x86/kvm/vmx.c
Contextual: different content

7 years ago  x86/KVM/VMX: Add module argument for L1TF mitigation
Konrad Rzeszutek Wilk [Mon, 2 Jul 2018 10:29:30 +0000 (12:29 +0200)]
x86/KVM/VMX: Add module argument for L1TF mitigation

Add a mitigation mode parameter "vmentry_l1d_flush" for CVE-2018-3646, aka
L1 terminal fault. The valid arguments are:

 - "always"  L1D cache flush on every VMENTER.
 - "cond" Conditional L1D cache flush, explained below
 - "never" Disable the L1D cache flush mitigation

"cond" is trying to avoid L1D cache flushes on VMENTER if the code executed
between VMEXIT and VMENTER is considered safe, i.e. is not bringing any
interesting information into L1D which might be exploited.

[ tglx: Split out from a larger patch ]
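
A sketch of the option table behind the parameter:

    enum vmx_l1d_flush_state {
            VMENTER_L1D_FLUSH_NEVER,
            VMENTER_L1D_FLUSH_COND,
            VMENTER_L1D_FLUSH_ALWAYS,
    };

    static const struct {
            const char *option;
            enum vmx_l1d_flush_state cmd;
    } vmentry_l1d_param[] = {
            {"never",  VMENTER_L1D_FLUSH_NEVER},
            {"cond",   VMENTER_L1D_FLUSH_COND},
            {"always", VMENTER_L1D_FLUSH_ALWAYS},
    };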

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Orabug: 28220674
CVE: CVE-2018-3646

(cherry picked from commit a399477e52c17e148746d3ce9a483f681c2aa9a0)

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/kvm/vmx.c
Documentation/admin-guide/kernel-parameters.txt
Contextual: different content.
Replaced DEFINE_STATIC_KEY_FALSE and static_branch_enable with their
existent equivalent and cherry-picked e33886b38.

7 years ago  locking/static_keys: Add static_key_{en,dis}able() helpers
Peter Zijlstra [Fri, 24 Jul 2015 13:03:40 +0000 (15:03 +0200)]
locking/static_keys: Add static_key_{en,dis}able() helpers

Add two helpers to make it easier to treat the refcount as boolean.
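
The helpers in sketch form, treating the refcount as a boolean:

    static inline void static_key_enable(struct static_key *key)
    {
            int count = static_key_count(key);

            WARN_ON_ONCE(count < 0 || count > 1);

            if (!count)
                    static_key_slow_inc(key);
    }

    static inline void static_key_disable(struct static_key *key)
    {
            int count = static_key_count(key);

            WARN_ON_ONCE(count < 0 || count > 1);

            if (count)
                    static_key_slow_dec(key);
    }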

Suggested-by: Jason Baron <jasonbaron0@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Orabug: 28220674
CVE: CVE-2018-3620

(cherry picked from commit e33886b38cc82a9fc3b2d655dfc7f50467594138)
This commit was necessary for porting 13b7b7.

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>

7 years ago  x86/KVM: Warn user if KVM is loaded SMT and L1TF CPU bug being present
Konrad Rzeszutek Wilk [Wed, 20 Jun 2018 15:29:53 +0000 (11:29 -0400)]
x86/KVM: Warn user if KVM is loaded SMT and L1TF CPU bug being present

If the L1TF CPU bug is present we allow the KVM module to be loaded as the
majority of users that use Linux and KVM have trusted guests and do not want a
broken setup.

Cloud vendors are the ones that are uncomfortable with CVE 2018-3620 and as
such they are the ones that should set nosmt to one.

Setting 'nosmt' means that the system administrator also needs to disable
SMT (Hyper-threading) in the BIOS, or via the 'nosmt' command line
parameter, or via the /sys/devices/system/cpu/smt/control. See commit
05736e4ac13c ("cpu/hotplug: Provide knobs to control SMT").

Other mitigations are to use task affinity, cpu sets, interrupt binding,
etc - anything to make sure that _only_ the same guests vCPUs are running
on sibling threads.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Orabug: 28220674
CVE: CVE-2018-3646

(cherry picked from commit 26acfb666a473d960f0fd971fe68f3e3ad16c70b)

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
Documentation/admin-guide/kernel-parameters.txt
arch/x86/kvm/vmx.c
kernel/cpu.c
Contextual: different content; This commit needs 03543133
"KVM: x86: Introducing kvm_x86_ops VM init/destroy hooks"

7 years ago  KVM: x86: Introducing kvm_x86_ops VM init/destroy hooks
Suravee Suthikulpanit [Wed, 4 May 2016 19:09:42 +0000 (14:09 -0500)]
KVM: x86: Introducing kvm_x86_ops VM init/destroy hooks

Add function pointers in struct kvm_x86_ops so the processor-specific
layer can provide hooks for when KVM initializes and destroys a VM.
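
In sketch form:

    struct kvm_x86_ops {
            ...
            int  (*vm_init)(struct kvm *kvm);
            void (*vm_destroy)(struct kvm *kvm);
            ...
    };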

Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Orabug: 28220674
CVE: CVE-2018-3646

(cherry picked from commit 03543133cea646406307870823343912c1ef0a3a)

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/kvm/x86.c
Contextual: different content
The commit was necessary for 47009da5 "x86/KVM: Warn user if KVM
is loaded SMT and L1TF CPU bug being present"

7 years ago  cpu/hotplug: Boot HT siblings at least once
Thomas Gleixner [Fri, 29 Jun 2018 14:05:48 +0000 (16:05 +0200)]
cpu/hotplug: Boot HT siblings at least once

Due to the way Machine Check Exceptions work on X86 hyperthreads it's
required to boot up _all_ logical cores at least once in order to set the
CR4.MCE bit.

So instead of ignoring the sibling threads right away, let them boot up
once so they can configure themselves. After they came out of the initial
boot stage, check whether it's a "secondary" sibling and cancel the operation
which puts the CPU back into offline state.

Reported-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Tony Luck <tony.luck@intel.com>
Orabug: 28220674
CVE: CVE-2018-3620

(cherry picked from commit 0cc3cd21657be04cb0559fe8063f2130493f92cf)

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
kernel/cpu.c
Contextual: The following changes were applied to the initial commit:
- used a per-cpu variable booted_once because struct cpuhp_cpu_state doesn't
exist in the current version
- cpu_smt_allowed check from bringup_wait_for_ap is done in _cpu_up
- taking into account that boot_cpu_state_init doesn't exist in the current
version (it was added by cff7d378: cpu/hotplug: Convert to a state machine
for the control processor), the per-cpu variable booted_once isn't made static
and it is set on true in start_kernel

7 years agoRevert "x86/apic: Ignore secondary threads if nosmt=force"
Thomas Gleixner [Fri, 29 Jun 2018 14:05:47 +0000 (16:05 +0200)]
Revert "x86/apic: Ignore secondary threads if nosmt=force"

Dave Hansen reported, that it's outright dangerous to keep SMT siblings
disabled completely so they are stuck in the BIOS and wait for SIPI.

The reason is that Machine Check Exceptions are broadcasted to siblings and
the soft disabled sibling has CR4.MCE = 0. If a MCE is delivered to a
logical core with CR4.MCE = 0, it asserts IERR#, which shuts down or
reboots the machine. The MCE chapter in the SDM contains the following
blurb:

    Because the logical processors within a physical package are tightly
    coupled with respect to shared hardware resources, both logical
    processors are notified of machine check errors that occur within a
    given physical processor. If machine-check exceptions are enabled when
    a fatal error is reported, all the logical processors within a physical
    package are dispatched to the machine-check exception handler. If
    machine-check exceptions are disabled, the logical processors enter the
    shutdown state and assert the IERR# signal. When enabling machine-check
    exceptions, the MCE flag in control register CR4 should be set for each
    logical processor.

Reverting the commit which ignores siblings at enumeration time solves only
half of the problem. The core cpuhotplug logic needs to be adjusted as
well.

This thoughtfully engineered mechanism also turns the boot process on all
Intel HT enabled systems into an MCE lottery. MCE is enabled on the boot CPU
before the secondary CPUs are brought up. Depending on the number of
physical cores the window in which this situation can happen is smaller or
larger. On a HSW-EX it's about 750ms:

MCE is enabled on the boot CPU:

[    0.244017] mce: CPU supports 22 MCE banks

The corresponding sibling #72 boots:

[    1.008005] .... node  #0, CPUs:    #72

That means if an MCE hits on physical core 0 (logical CPUs 0 and 72)
between these two points the machine is going to shutdown. At least it's a
known safe state.

It's obvious that the early boot can be hit by an MCE as well and then runs
into the same situation because MCEs are not yet enabled on the boot CPU.
But after enabling them on the boot CPU, it does not make any sense to
prevent the kernel from recovering.

Adjust the nosmt kernel parameter documentation as well.

Reverts: 2207def700f9 ("x86/apic: Ignore secondary threads if nosmt=force")
Reported-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Tony Luck <tony.luck@intel.com>
Orabug: 28220674
CVE: CVE-2018-3620

(cherry picked from commit 506a66f374891ff08e064a058c446b336c5ac760)

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/include/asm/apic.h
arch/x86/kernel/apic/apic.c
Contextual

7 years ago  x86/speculation/l1tf: Fix up pte->pfn conversion for PAE
Michal Hocko [Wed, 27 Jun 2018 15:46:50 +0000 (17:46 +0200)]
x86/speculation/l1tf: Fix up pte->pfn conversion for PAE

Jan has noticed that pte_pfn() and friends, and likewise pfn_pte(), are
incorrect for CONFIG_PAE because phys_addr_t is wider than unsigned long,
so the pte_val() result (resp. the left shift) would get truncated. Fix
this up by using proper types.
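
A sketch of the typed conversion; phys_addr_t is 64bit with PAE, so the
mask and shift no longer truncate:

    static inline unsigned long pte_pfn(pte_t pte)
    {
            phys_addr_t pfn = pte_val(pte);

            pfn ^= protnone_mask(pfn);
            return (pfn & PTE_PFN_MASK) >> PAGE_SHIFT;
    }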

Fixes: 6b28baca9b1f ("x86/speculation/l1tf: Protect PROT_NONE PTEs against speculation")
Reported-by: Jan Beulich <JBeulich@suse.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Orabug: 28220674
CVE: CVE-2018-3620

(cherry picked from commit e14d7dfb41f5807a0c1c26a13f2b8ef16af24935)

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/include/asm/pgtable.h
Contextual and we do not have pfn_pud.

7 years ago  x86/speculation/l1tf: Protect PAE swap entries against L1TF
Vlastimil Babka [Fri, 22 Jun 2018 15:39:33 +0000 (17:39 +0200)]
x86/speculation/l1tf: Protect PAE swap entries against L1TF

The PAE 3-level paging code currently doesn't mitigate L1TF by flipping the
offset bits, and uses the high PTE word, thus bits 32-36 for type, 37-63 for
offset. The lower word is zeroed, thus systems with less than 4GB memory are
safe. With 4GB to 128GB the swap type selects the memory locations vulnerable
to L1TF; with even more memory, the swap offset also influences the address.
This might be a problem with 32bit PAE guests running on large 64bit hosts.

By continuing to keep the whole swap entry in either high or low 32bit word of
PTE we would limit the swap size too much. Thus this patch uses the whole PAE
PTE with the same layout as the 64bit version does. The macros just become a
bit tricky since they assume the arch-dependent swp_entry_t to be 32bit.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Orabug: 28220674
CVE: CVE-2018-3620

(cherry picked from commit 0d0f6249058834ffe1ceaad0bb31464af66f6e7a)

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>

7 years ago  x86/CPU/AMD: Move TOPOEXT reenablement before reading smp_num_siblings
Borislav Petkov [Fri, 22 Jun 2018 09:34:11 +0000 (11:34 +0200)]
x86/CPU/AMD: Move TOPOEXT reenablement before reading smp_num_siblings

The TOPOEXT reenablement is a workaround for broken BIOSen which didn't
enable the CPUID bit. amd_get_topology_early(), however, relies on
that bit being set so that it can read out the CPUID leaf and set
smp_num_siblings properly.

Move the reenablement up to early_init_amd(). While at it, simplify
amd_get_topology_early().

Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Orabug: 28220674
CVE: CVE-2018-3620

(cherry picked from commit 7ce2f0393ea2396142b7faf6ee9b1f3676d08a5f)

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/kernel/cpu/amd.c
Contextual

7 years ago  x86/cpufeatures: Add detection of L1D cache flush support.
Konrad Rzeszutek Wilk [Wed, 20 Jun 2018 20:42:58 +0000 (16:42 -0400)]
x86/cpufeatures: Add detection of L1D cache flush support.

336996-Speculative-Execution-Side-Channel-Mitigations.pdf defines a new MSR
(IA32_FLUSH_CMD) which is detected by CPUID.7.EDX[28]=1 bit being set.

This new MSR "gives software a way to invalidate structures with finer
granularity than other architectural methods like WBINVD."

A copy of this document is available at
  https://bugzilla.kernel.org/show_bug.cgi?id=199511

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Orabug: 28220674
CVE: CVE-2018-3620

(cherry picked from commit 11e34e64e4103955fc4568750914c75d65ea87ee)

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/include/asm/cpufeatures.h
We do not have word 18. To preserve kABI compat we will use word 2 which has
free entries.

7 years ago  x86/speculation/l1tf: Extend 64bit swap file size limit
Vlastimil Babka [Thu, 21 Jun 2018 10:36:29 +0000 (12:36 +0200)]
x86/speculation/l1tf: Extend 64bit swap file size limit

The previous patch has limited swap file size so that large offsets cannot
clear bits above MAX_PA/2 in the pte and interfere with L1TF mitigation.

It assumed that offsets are encoded starting with bit 12, same as pfn. But
on x86_64, offsets are encoded starting with bit 9.

Thus the limit can be raised by 3 bits. That means 16TB with 42bit MAX_PA
and 256TB with 46bit MAX_PA.
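
The adjustment in sketch form: swap offsets start 3 bits below the pfn
position, so the usable limit grows by the same 3 bits (names as in the
upstream swap-limit code):

    unsigned long long l1tf_limit = l1tf_pfn_limit();

    /* Offsets start at bit 9, pfns at bit 12: 3 extra offset bits */
    l1tf_limit <<= PAGE_SHIFT - SWP_OFFSET_FIRST_BIT;
    pages = min_t(unsigned long long, l1tf_limit, pages);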

Fixes: 377eeaa8e11f ("x86/speculation/l1tf: Limit swap file size to MAX_PA/2")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Orabug: 28220674
CVE: CVE-2018-3620

(cherry picked from commit 1a7ed1ba4bba6c075d5ad61bb75e3fbc870840d6)

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>

7 years ago  x86/apic: Ignore secondary threads if nosmt=force
Thomas Gleixner [Tue, 5 Jun 2018 12:00:11 +0000 (14:00 +0200)]
x86/apic: Ignore secondary threads if nosmt=force

nosmt on the kernel command line merely prevents the onlining of the
secondary SMT siblings.

nosmt=force makes the APIC detection code ignore the secondary SMT siblings
completely, so they even do not show up as possible CPUs. That reduces the
amount of memory allocations for per cpu variables and saves other
resources from being allocated too large.

This is not fully equivalent to disabling SMT in the BIOS because the low
level SMT enabling in the BIOS can result in partitioning of resources
between the siblings, which is not undone by just ignoring them. Some CPUs
can use the full resources when their sibling is not onlined, but this is
depending on the CPU family and model and it's not well documented whether
this applies to all partitioned resources. That means depending on the
workload disabling SMT in the BIOS might result in better performance.

Linus analysis of the Intel manual:

  The intel optimization manual is not very clear on what the partitioning
  rules are.

  I find:

    "In general, the buffers for staging instructions between major pipe
     stages  are partitioned. These buffers include µop queues after the
     execution trace cache, the queues after the register rename stage, the
     reorder buffer which stages instructions for retirement, and the load
     and store buffers.

     In the case of load and store buffers, partitioning also provided an
     easier implementation to maintain memory ordering for each logical
     processor and detect memory ordering violations"

  but some of that partitioning may be relaxed if the HT thread is "not
  active":

    "In Intel microarchitecture code name Sandy Bridge, the micro-op queue
     is statically partitioned to provide 28 entries for each logical
     processor,  irrespective of software executing in single thread or
     multiple threads. If one logical processor is not active in Intel
     microarchitecture code name Ivy Bridge, then a single thread executing
     on that processor  core can use the 56 entries in the micro-op queue"

  but I do not know what "not active" means, and how dynamic it is. Some of
  that partitioning may be entirely static and depend on the early BIOS
  disabling of HT, and even if we park the cores, the resources will just be
  wasted.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Orabug: 28220674
CVE: CVE-2018-3620

(cherry picked from commit 2207def700f902f169fc237b717252c326f9e464)

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/include/asm/apic.h
arch/x86/kernel/apic/apic.c
Contextual

7 years ago  x86/cpu/AMD: Evaluate smp_num_siblings early
Thomas Gleixner [Tue, 5 Jun 2018 22:57:38 +0000 (00:57 +0200)]
x86/cpu/AMD: Evaluate smp_num_siblings early

To support force disabling of SMT it's required to know the number of
thread siblings early. amd_get_topology() cannot be called before the APIC
driver is selected, so split out the part which initializes
smp_num_siblings and invoke it from amd_early_init().

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@kernel.org>
Orabug: 28220674
CVE: CVE-2018-3620

(cherry picked from commit 1e1d7e25fd759eddf96d8ab39d0a90a1979b2d8c)

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/kernel/cpu/amd.c
Contextual

7 years ago  x86/CPU/AMD: Do not check CPUID max ext level before parsing SMP info
Borislav Petkov [Fri, 15 Jun 2018 18:48:39 +0000 (20:48 +0200)]
x86/CPU/AMD: Do not check CPUID max ext level before parsing SMP info

Old code used to check whether CPUID ext max level is >= 0x80000008 because
that last leaf contains the number of cores of the physical CPU.  The three
functions called there now do not depend on that leaf anymore so the check
can go.

Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@kernel.org>
Orabug: 28220674
CVE: CVE-2018-3620

(cherry picked from commit 119bff8a9c9bb00116a844ec68be7bc4b1c768f5)

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/kernel/cpu/amd.c
Contextual: we only have two functions

7 years ago  x86/cpu/intel: Evaluate smp_num_siblings early
Thomas Gleixner [Tue, 5 Jun 2018 23:00:55 +0000 (01:00 +0200)]
x86/cpu/intel: Evaluate smp_num_siblings early

Make use of the new early detection function to initialize smp_num_siblings
on the boot cpu before the MP-Table or ACPI/MADT scan happens. That's
required for force disabling SMT.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Orabug: 28220674
CVE: CVE-2018-3620

(cherry picked from commit 1910ad5624968f93be48e8e265513c54d66b897c)

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/kernel/cpu/intel.c
Contextual

7 years ago  x86/cpu/topology: Provide detect_extended_topology_early()
Thomas Gleixner [Tue, 5 Jun 2018 22:55:39 +0000 (00:55 +0200)]
x86/cpu/topology: Provide detect_extended_topology_early()

To support force disabling of SMT it's required to know the number of
thread siblings early. detect_extended_topology() cannot be called before
the APIC driver is selected, so split out the part which initializes
smp_num_siblings.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Orabug: 28220674
CVE: CVE-2018-3620

(cherry picked from commit 95f3d39ccf7aaea79d1ffdac1c887c2e100ec1b6)

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Conflicts:
arch/x86/kernel/cpu/cpu.h
arch/x86/kernel/cpu/topology.c
Contextual

7 years ago  x86/cpu/common: Provide detect_ht_early()
Thomas Gleixner [Tue, 5 Jun 2018 22:53:57 +0000 (00:53 +0200)]
x86/cpu/common: Provide detect_ht_early()

To support force disabling of SMT it's required to know the number of
thread siblings early. detect_ht() cannot be called before the APIC driver
is selected, so split out the part which initializes smp_num_siblings.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Orabug: 28220674
CVE: CVE-2018-3620

(cherry picked from commit 545401f4448a807b963ff17b575e0a393e68b523)

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/kernel/cpu/common.c
arch/x86/kernel/cpu/cpu.h
Contextual

7 years agox86/cpu/AMD: Remove the pointless detect_ht() call
Thomas Gleixner [Tue, 5 Jun 2018 22:47:10 +0000 (00:47 +0200)]
x86/cpu/AMD: Remove the pointless detect_ht() call

Real 32bit AMD CPUs do not have SMT and the only value of the call was to
reach the magic printout which got removed.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Orabug: 28220674
CVE: CVE-2018-3620

(cherry picked from commit 44ca36de56d1bf196dca2eb67cd753a46961ffe6)

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
7 years agox86/cpu: Remove the pointless CPU printout
Thomas Gleixner [Tue, 5 Jun 2018 22:36:15 +0000 (00:36 +0200)]
x86/cpu: Remove the pointless CPU printout

The value of this printout is dubious at best and there is no point in
having it in two different places along with convoluted ways to reach it.

Remove it completely.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Orabug: 28220674
CVE: CVE-2018-3620

(cherry picked from commit 55e6d279abd92cfd7576bba031e7589be8475edb)

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/kernel/cpu/common.c
arch/x86/kernel/cpu/topology.c
Contextual

7 years agocpu/hotplug: Provide knobs to control SMT
Thomas Gleixner [Tue, 29 May 2018 15:48:27 +0000 (17:48 +0200)]
cpu/hotplug: Provide knobs to control SMT

Provide a command line and a sysfs knob to control SMT.

The command line options are:

 'nosmt': Enumerate secondary threads, but do not online them

 'nosmt=force': Ignore secondary threads completely during enumeration
  via MP table and ACPI/MADT.

The sysfs control file has the following states (read/write):

 'on':           SMT is enabled. Secondary threads can be freely onlined
 'off':          SMT is disabled. Secondary threads, even if enumerated,
                 cannot be onlined
 'forceoff':     SMT is permanently disabled. Writes to the control
                 file are rejected.
 'notsupported': SMT is not supported by the CPU

The command line option 'nosmt' sets the sysfs control to 'off'. This
can be changed to 'on' to reenable SMT during runtime.

The command line option 'nosmt=force' sets the sysfs control to
'forceoff'. This cannot be changed during runtime.

When SMT is 'on' and the control file is changed to 'off' then all online
secondary threads are offlined and attempts to online a secondary thread
later on are rejected.

When SMT is 'off' and the control file is changed to 'on' then secondary
threads can be onlined again. The 'off' -> 'on' transition does not
automatically online the secondary threads.

When the control file is set to 'forceoff', the behaviour is the same as
setting it to 'off', but the operation is irreversible and later writes to
the control file are rejected.

When the control status is 'notsupported' then writes to the control file
are rejected.
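
For illustration, the control flow boils down to roughly the following
(a sketch following the upstream commit; the UEK4 backport differs as
noted in the conflicts below):

    enum cpuhp_smt_control {
            CPU_SMT_ENABLED,
            CPU_SMT_DISABLED,
            CPU_SMT_FORCE_DISABLED,
            CPU_SMT_NOT_SUPPORTED,
    };

    /* 'nosmt' / 'nosmt=force' are parsed before secondary CPU bringup */
    static int __init smt_cmdline_disable(char *str)
    {
            cpu_smt_control = CPU_SMT_DISABLED;
            if (str && !strcmp(str, "force")) {
                    pr_info("SMT: Force disabled\n");
                    cpu_smt_control = CPU_SMT_FORCE_DISABLED;
            }
            return 0;
    }
    early_param("nosmt", smt_cmdline_disable);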

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Orabug: 28220674
CVE: CVE-2018-3620

(cherry picked from commit 05736e4ac13c08a4a9b1ef2de26dd31a32cbee57)

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
Documentation/admin-guide/kernel-parameters.txt
arch/Kconfig
arch/x86/Kconfig
kernel/cpu.c

All contextual, but in kernel/cpu.c it was a little more complicated as we did
not have cpuhp_sysfs_init, so I created a device_initcall for
cpu_smt_state_init.

7 years agox86/topology: Add topology_max_smt_threads()
Andi Kleen [Fri, 20 May 2016 00:09:55 +0000 (17:09 -0700)]
x86/topology: Add topology_max_smt_threads()

For SMT specific workarounds it is useful to know if SMT is active
on any online CPU in the system. This currently requires a loop
over all online CPUs.

Add a global variable that is updated with the maximum number
of smt threads on any CPU on online/offline, and use it for
topology_max_smt_threads().

The single call is easier to use than a loop.

Not exported to user space because user space already can use
the existing sibling interfaces to find this out.
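
A minimal sketch of the resulting interface (simplified; as noted in the
conflicts below, the UEK4 backport uses topology_thread_cpumask() in
smpboot.c):

    /* arch/x86/include/asm/topology.h */
    extern int __max_smt_threads;
    #define topology_max_smt_threads()      __max_smt_threads

    /* updated from the sibling-map code on CPU online/offline: */
    threads = cpumask_weight(topology_sibling_cpumask(cpu));
    if (threads > __max_smt_threads)
            __max_smt_threads = threads;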

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: acme@kernel.org
Cc: jolsa@kernel.org
Link: http://lkml.kernel.org/r/1463703002-19686-2-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Orabug: 28220674
CVE: CVE-2018-3620

(cherry picked from commit 70b8301f6b8f7bc053377a9cbd0c4e42e29d9807)

This patch was backported as a dependency; it was not part of the original
L1TF patch list.

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/include/asm/topology.h
Contextual.
For smpboot.c we replaced topology_sibling_cpumask with topology_thread_cpumask.

7 years agocpu/hotplug: Split do_cpu_down()
Thomas Gleixner [Tue, 29 May 2018 15:49:05 +0000 (17:49 +0200)]
cpu/hotplug: Split do_cpu_down()

Split out the inner workings of do_cpu_down() to allow reuse of that
function for the upcoming SMT disabling mechanism.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Orabug: 28220674
CVE: CVE-2018-3620

(cherry picked from commit cc1fe215e1efa406b03aa4389e6269b61342dec5)

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
kernel/cpu.c
Contextual and we do not have cpuhp_state target.

7 years agox86/topology: Provide topology_smt_supported()
Thomas Gleixner [Thu, 21 Jun 2018 08:37:20 +0000 (10:37 +0200)]
x86/topology: Provide topology_smt_supported()

Provide information on whether SMT is supported by the CPUs. Preparatory patch
for the SMT control mechanism.

Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@kernel.org>
Orabug: 28220674
CVE: CVE-2018-3620

(cherry picked from commit f048c399e0f7490ab7296bc2c255d37eb14a9675)

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/kernel/smpboot.c
Contextual

7 years agox86/smp: Provide topology_is_primary_thread()
Thomas Gleixner [Tue, 29 May 2018 15:50:22 +0000 (17:50 +0200)]
x86/smp: Provide topology_is_primary_thread()

If the CPU is supporting SMT then the primary thread can be found by
checking the lower APIC ID bits for zero. smp_num_siblings is used to build
the mask for the APIC ID bits which need to be taken into account.

This uses the MPTABLE or ACPI/MADT supplied APIC ID, which can be different
than the initial APIC ID in CPUID. But according to AMD the lower bits have
to be consistent. Intel gave a tentative confirmation as well.

Preparatory patch to support disabling SMT at boot/runtime.
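
The check itself is small; roughly (sketch):

    /* The primary thread has the SMT bits of its APIC ID cleared */
    bool apic_id_is_primary_thread(unsigned int apicid)
    {
            u32 mask;

            if (smp_num_siblings == 1)
                    return true;
            /* Isolate the SMT bit(s) in the APICID and check for 0 */
            mask = (1U << (fls(smp_num_siblings) - 1)) - 1;
            return !(apicid & mask);
    }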

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Orabug: 28220674
CVE: CVE-2018-3620

(cherry picked from commit 6a4d2657e048f096c7ffcad254010bd94891c8c0)

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/include/asm/apic.h
arch/x86/include/asm/topology.h
arch/x86/kernel/apic/apic.c
arch/x86/kernel/smpboot.c
Contextual

7 years agox86/bugs: Move the l1tf function and define pr_fmt properly
Konrad Rzeszutek Wilk [Wed, 20 Jun 2018 20:42:57 +0000 (16:42 -0400)]
x86/bugs: Move the l1tf function and define pr_fmt properly

The pr_warn in l1tf_select_mitigation would have used the prior pr_fmt
which was defined as "Spectre V2 : ".

Move the function to be past SSBD and also define the pr_fmt.

Fixes: 17dbca119312 ("x86/speculation/l1tf: Add sysfs reporting for l1tf")
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Orabug: 28220674
CVE: CVE-2018-3620

(cherry picked from commit 56563f53d3066afa9e63d6c997bf67e76a8b05c0)

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/kernel/cpu/bugs.c
Different filename.

7 years agox86/speculation/l1tf: Limit swap file size to MAX_PA/2
Andi Kleen [Wed, 13 Jun 2018 22:48:28 +0000 (15:48 -0700)]
x86/speculation/l1tf: Limit swap file size to MAX_PA/2

For the L1TF workaround it's necessary to limit the swap file size to below
MAX_PA/2, so that the higher bits of the swap offset inverted never point
to valid memory.

Add a mechanism for the architecture to override the swap file size check
in swapfile.c and add an x86-specific max swapfile check function that
enforces that limit.

The check is only enabled if the CPU is vulnerable to L1TF.

In VMs with 42bit MAX_PA the typical limit is 2TB now, on a native system
with 46bit PA it is 32TB. The limit is only per individual swap file, so
it's always possible to exceed these limits with multiple swap files or
partitions.
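
The override mechanism looks roughly like this (sketch):

    /* mm/swapfile.c defines a __weak max_swapfile_size() that just
     * returns generic_max_swapfile_size(); x86 overrides it: */
    unsigned long max_swapfile_size(void)
    {
            unsigned long pages = generic_max_swapfile_size();

            if (boot_cpu_has_bug(X86_BUG_L1TF)) {
                    /* Limit the swap file size to MAX_PA/2 for L1TF */
                    pages = min_t(unsigned long long,
                                  l1tf_pfn_limit() + 1, pages);
            }
            return pages;
    }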

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Orabug: 28220674
CVE: CVE-2018-3620

(cherry picked from commit 377eeaa8e11fe815b1d07c81c4a0e2843a8c15eb)

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
7 years agox86/speculation/l1tf: Disallow non privileged high MMIO PROT_NONE mappings
Andi Kleen [Wed, 13 Jun 2018 22:48:27 +0000 (15:48 -0700)]
x86/speculation/l1tf: Disallow non privileged high MMIO PROT_NONE mappings

For L1TF PROT_NONE mappings are protected by inverting the PFN in the page
table entry. This sets the high bits in the CPU's address space, thus
making sure that an unmapped entry does not point to valid cached memory.

Some server system BIOSes put the MMIO mappings high up in the physical
address space. If such a high mapping was mapped to unprivileged users
they could attack low memory by setting such a mapping to PROT_NONE. This
could happen through a special device driver which is not access
protected. Normal /dev/mem is of course access protected.

To avoid this, forbid PROT_NONE mappings or mprotect for high MMIO mappings.

Valid page mappings are allowed because the system is then unsafe anyway.

It's not expected that users commonly use PROT_NONE on MMIO. But to
minimize any impact this is only enforced if the mapping actually refers to
a high MMIO address (defined as the MAX_PA-1 bit being set), and the check
is also skipped for root.

For mmaps this is straightforward and can be handled in vm_insert_pfn and
in remap_pfn_range().

For mprotect it's a bit trickier. At the point where the actual PTEs are
accessed a lot of state has been changed and it would be difficult to undo
on an error. Since this is an uncommon case, use a separate early page
table walk pass for MMIO PROT_NONE mappings that checks for this condition
early. For non-MMIO and non-PROT_NONE there are no changes.
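
The gatekeeping check shared by both paths is roughly (sketch):

    static inline bool pfn_modify_allowed(unsigned long pfn, pgprot_t prot)
    {
            if (!boot_cpu_has_bug(X86_BUG_L1TF))
                    return true;
            if (!__pte_needs_invert(pgprot_val(prot)))
                    return true;
            /* If it's real memory always allow */
            if (pfn_valid(pfn))
                    return true;
            /* high MMIO: refuse PROT_NONE for unprivileged users */
            if (pfn > l1tf_pfn_limit() && !capable(CAP_SYS_ADMIN))
                    return false;
            return true;
    }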

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Orabug: 28220674
CVE: CVE-2018-3620

(cherry picked from commit 42e4089c7890725fcd329999252dc489b72f2921)

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
include/asm-generic/pgtable.h
mm/memory.c
pgtable.h: contextual, not the same functions
memory.c: different content of the modified functions

7 years agox86/speculation/l1tf: Add sysfs reporting for l1tf
Andi Kleen [Wed, 13 Jun 2018 22:48:26 +0000 (15:48 -0700)]
x86/speculation/l1tf: Add sysfs reporting for l1tf

L1TF core kernel workarounds are cheap and normally always enabled. However,
they should still be reported in sysfs if the system is vulnerable or
mitigated. Add the necessary CPU feature/bug bits.

- Extend the existing checks for Meltdowns to determine if the system is
  vulnerable. All CPUs which are not vulnerable to Meltdown are also not
  vulnerable to L1TF

- Check for 32bit non PAE and emit a warning as there is no practical way
  for mitigation due to the limited physical address bits

- If the system has more than MAX_PA/2 physical memory the invert page
  workarounds don't protect the system against the L1TF attack anymore,
  because an inverted physical address will also point to valid
  memory. Print a warning in this case and report that the system is
  vulnerable.

Add a function which returns the PFN limit for the L1TF mitigation, which
will be used in follow up patches for sanity and range checks.
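
The PFN limit function is essentially one line (sketch, modulo exact
types and off-by-one conventions):

    /* Last PFN that is still below MAX_PA/2 */
    static inline unsigned long long l1tf_pfn_limit(void)
    {
            return BIT_ULL(boot_cpu_data.x86_phys_bits - 1 - PAGE_SHIFT) - 1;
    }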

[ tglx: Renamed the CPU feature bit to L1TF_PTEINV ]

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Orabug: 28220674
CVE: CVE-2018-3620

(cherry picked from commit 17dbca119312b4e8173d4e25ff64262119fcef38)

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts for UEK5:
arch/x86/include/asm/cpufeatures.h
arch/x86/kernel/cpu/bugs.c
[
cpufeatures.h: we do not have that entry free; use another.
bugs.c: we have the ibrs logic in place, compared to upstream
]

Conflicts for UEK4:
arch/x86/include/asm/cpufeatures.h
arch/x86/kernel/cpu/bugs.c
arch/x86/kernel/cpu/common.c
cpufeatures.h: different file name; also moved the feature to word 2, whose
Transmeta bits we removed as we won't use Transmeta on UEK4
bugs.c: different file name (bugs_64.c)
common.c: contextual (cpu_set_bug_bits didn't have __init)

7 years agox86/speculation/l1tf: Make sure the first page is always reserved
Andi Kleen [Wed, 13 Jun 2018 22:48:25 +0000 (15:48 -0700)]
x86/speculation/l1tf: Make sure the first page is always reserved

The L1TF workaround doesn't make any attempt to mitigate speculative accesses
to the first physical page for zeroed PTEs. Normally it only contains some
data from the early real mode BIOS.

It's not entirely clear that the first page is reserved in all
configurations, so add an extra reservation call to make sure it is really
reserved. In most configurations (e.g.  with the standard reservations)
it's likely a nop.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Orabug: 28220674
CVE: CVE-2018-3620

(cherry picked from commit 10a70416e1f067f6c4efda6ffd8ea96002ac4223)

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
7 years agox86/speculation/l1tf: Protect PROT_NONE PTEs against speculation
Andi Kleen [Wed, 13 Jun 2018 22:48:24 +0000 (15:48 -0700)]
x86/speculation/l1tf: Protect PROT_NONE PTEs against speculation

When PTEs are set to PROT_NONE the kernel just clears the Present bit and
preserves the PFN, which creates attack surface for L1TF speculation
attacks.

This is important inside guests, because L1TF speculation bypasses physical
page remapping. While the host has its own mitigations preventing leaking
data from other VMs into the guest, this would still risk leaking the wrong
page inside the current guest.

This uses the same technique as Linus' swap entry patch: while an entry
is in PROTNONE state, invert the complete PFN part of it. This ensures
that the highest bit will point to non-existing memory.

The invert is done by pte/pmd_modify and pfn/pmd/pud_pte for PROTNONE and
pte/pmd/pud_pfn undo it.

This assumes that no code path touches the PFN part of a PTE directly
without using these primitives.
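
The primitives themselves are tiny; roughly (sketch):

    static inline bool __pte_needs_invert(u64 val)
    {
            return !(val & _PAGE_PRESENT);
    }

    /* XOR mask for the PFN part: all ones for inverted (PROTNONE) entries */
    static inline u64 protnone_mask(u64 val)
    {
            return __pte_needs_invert(val) ? ~0ull : 0;
    }

    /* pte_pfn() and friends then undo the inversion transparently, e.g.
     *   pfn = pte_val(pte) ^ protnone_mask(pte_val(pte));
     *   return (pfn & PTE_PFN_MASK) >> PAGE_SHIFT;
     */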

This doesn't handle the case that MMIO is on the top of the CPU physical
memory. If such an MMIO region was exposed by an unprivileged driver for
mmap it would be possible to attack some real memory.  However this
situation is all rather unlikely.

For 32bit non PAE the inversion is not done because there are really not
enough bits to protect anything.

Q: Why does the guest need to be protected when the HyperVisor already has
   L1TF mitigations?

A: Here's an example:

   Physical pages 1 and 2 get mapped into a guest as
   GPA 1 -> PA 2
   GPA 2 -> PA 1
   through EPT.

   The L1TF speculation ignores the EPT remapping.

   Now the guest kernel maps GPA 1 to process A and GPA 2 to process B, and
   they belong to different users and should be isolated.

   A sets the GPA 1 PA 2 PTE to PROT_NONE to bypass the EPT remapping and
   gets read access to the underlying physical page. Which in this case
   points to PA 2, so it can read process B's data, if it happened to be in
   L1, so isolation inside the guest is broken.

   There's nothing the hypervisor can do about this. This mitigation has to
   be done in the guest itself.

[ tglx: Massaged changelog ]

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Orabug: 28220674
CVE: CVE-2018-3620

(cherry picked from commit 6b28baca9b1f0d4a42b865da7a05b1c81424bd5c)

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/include/asm/pgtable.h
[We do not have the wrapper check_pgprot (which was just adding some debug
config and calls massage_pgprot). Just use massage_pgprot.]

Conflicts:
arch/x86/include/asm/pgtable-3level.h
arch/x86/include/asm/pgtable.h
arch/x86/include/asm/pgtable_64.h
pgtable-3level.h: contextual
pgtable.h: contextual as we didn't have some of the functions
pgtable_64.h: contextual

7 years agox86/speculation/l1tf: Protect swap entries against L1TF
Linus Torvalds [Wed, 13 Jun 2018 22:48:23 +0000 (15:48 -0700)]
x86/speculation/l1tf: Protect swap entries against L1TF

With L1 terminal fault the CPU speculates into unmapped PTEs, and the
resulting side effects allow reading the memory the PTE is pointing to,
if its value is still in the L1 cache.

For swapped out pages Linux uses unmapped PTEs and stores a swap entry into
them.

To protect against L1TF it must be ensured that the swap entry is not
pointing to valid memory, which requires setting higher bits (between bit
36 and bit 45) that are inside the CPUs physical address space, but outside
any real memory.

To do this invert the offset to make sure the higher bits are always set,
as long as the swap file is not too big.
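
The resulting 64bit encode/decode is roughly (sketch):

    /* Store the offset inverted so its high bits are set... */
    #define __swp_entry(type, offset) ((swp_entry_t) { \
            (~(unsigned long)(offset) << SWP_OFFSET_SHIFT >> SWP_TYPE_BITS) \
            | ((unsigned long)(type) << (64 - SWP_TYPE_BITS)) })

    /* ...and invert it back on extraction */
    #define __swp_offset(x) (~(x).val << SWP_TYPE_BITS >> SWP_OFFSET_SHIFT)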

Note there is no workaround for 32bit !PAE, or on systems which have more
than MAX_PA/2 worth of memory. The latter case is very unlikely to happen on
real systems.

[ AK: Updated description and minor tweaks. Split out from the original
     patch ]

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Orabug: 28220674
CVE: CVE-2018-3620

(cherry picked from commit 2f22b4cd45b67b3496f4aa4c7180a1271c6452f6)

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/include/asm/pgtable_64.h
Do not have comment with " * | TYPE (59-63) | ~OFFSET (9-58)  |0|0|X|X| X| X|X|SD|0| <- swp entry".

7 years agox86/speculation/l1tf: Change order of offset/type in swap entry
Linus Torvalds [Wed, 13 Jun 2018 22:48:22 +0000 (15:48 -0700)]
x86/speculation/l1tf: Change order of offset/type in swap entry

If pages are swapped out, the swap entry is stored in the corresponding
PTE, which has the Present bit cleared. CPUs vulnerable to L1TF speculate
on PTE entries which have the present bit set and would treat the swap
entry as a physical address (PFN). To mitigate that, the upper bits of the
PTE must be set so the PTE points to non-existent memory.

The swap entry stores the type and the offset of a swapped out page in the
PTE. The type is stored in bits 9-13 and the offset in bits 14-63. The
hardware ignores the bits beyond the physical address space limit, so to make
the mitigation effective it's required to start 'offset' at the lowest possible
bit so that even large swap offsets do not reach into the physical address
space limit bits.

Move the offset to bits 9-58 and the type to bits 59-63, which are the bits
that hardware generally doesn't care about.

That, in turn, means that if you are on a desktop chip with only 40 bits of
physical addressing, now that the offset starts at bit 9, there needs to be
30 bits of offset actually *in use* until bit 39 ends up being set, which
means when inverted it will again point into existing memory.

So that's 4 terabyte of swap space (because the offset is counted in pages,
so 30 bits of offset is 42 bits of actual coverage). With bigger physical
addressing, that obviously grows further, until the limit of the offset is
hit (at 50 bits of offset - 62 bits of actual swap file coverage).
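
In macro form the new layout is roughly the following (sketch; as the
conflict note below says, SWP_OFFSET_FIRST_BIT did not yet exist in this
tree):

    #define SWP_TYPE_BITS           5
    /* The offset starts right above the PROTNONE/soft-dirty bits */
    #define SWP_OFFSET_FIRST_BIT    (_PAGE_BIT_PROTNONE + 1)
    #define SWP_OFFSET_SHIFT        (SWP_OFFSET_FIRST_BIT + SWP_TYPE_BITS)

    #define __swp_type(x)           ((x).val >> (64 - SWP_TYPE_BITS))
    #define __swp_offset(x)         ((x).val << SWP_TYPE_BITS >> SWP_OFFSET_SHIFT)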

This is a preparatory change for the actual swap entry inversion to protect
against L1TF.

[ AK: Updated description and minor tweaks. Split into two parts ]
[ tglx: Massaged changelog ]

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Orabug: 28220674
CVE: CVE-2018-3620

(cherry picked from commit bcd11afa7adad8d720e7ba5ef58bdcd9775cf45f)

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/include/asm/pgtable_64.h
SWP_OFFSET_FIRST_BIT didn't exist.

7 years agox86/speculation/l1tf: Increase 32bit PAE __PHYSICAL_PAGE_SHIFT
Andi Kleen [Wed, 13 Jun 2018 22:48:21 +0000 (15:48 -0700)]
x86/speculation/l1tf: Increase 32bit PAE __PHYSICAL_PAGE_SHIFT

L1 Terminal Fault (L1TF) is a speculation related vulnerability. The CPU
speculates on PTE entries which do not have the PRESENT bit set, if the
content of the resulting physical address is available in the L1D cache.

The OS side mitigation makes sure that a !PRESENT PTE entry points to a
physical address outside the actually existing and cachable memory
space. This is achieved by inverting the upper bits of the PTE. Due to the
address space limitations this only works for 64bit and 32bit PAE kernels,
but not for 32bit non PAE.

This mitigation applies to both host and guest kernels, but in case of a
64bit host (hypervisor) and a 32bit PAE guest, inverting the upper bits of
the PAE address space (44bit) is not enough if the host has more than 43
bits of populated memory address space, because the speculation treats the
PTE content as a physical host address bypassing EPT.

The host (hypervisor) protects itself against the guest by flushing L1D as
needed, but pages inside the guest are not protected against attacks from
other processes inside the same guest.

For the guest the inverted PTE mask has to match the host to provide the
full protection for all pages the host could possibly map into the
guest. The hosts populated address space is not known to the guest, so the
mask must cover the possible maximal host address space, i.e. 52 bit.

On 32bit PAE the maximum PTE mask is currently set to 44 bit because that
is the limit imposed by 32bit unsigned long PFNs in the VMs. This limits
the mask to be below what the host could possibly use for physical pages.

The L1TF PROT_NONE protection code uses the PTE masks to determine which
bits to invert to make sure the higher bits are set for unmapped entries to
prevent L1TF speculation attacks against EPT inside guests.

In order to invert all bits that could be used by the host, increase
__PHYSICAL_PAGE_SHIFT to 52 to match 64bit.

The real limit for a 32bit PAE kernel is still 44 bits because all Linux
PTEs are created from unsigned long PFNs, so they cannot be higher than 44
bits on a 32bit kernel. So these extra PFN bits should never be set. The
only users of this macro are using it to look at PTEs, so it's safe.
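
The change itself is a one-liner (sketch):

    /* arch/x86/include/asm/page_32_types.h, CONFIG_X86_PAE */
    #define __PHYSICAL_MASK_SHIFT   52      /* raised from 44 to match 64bit */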

[ tglx: Massaged changelog ]

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Orabug: 28220674
CVE: CVE-2018-3620

(cherry picked from commit 50896e180c6aa3a9c61a26ced99e15d602666a4c)

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
7 years agox86/mm: Limit mmap() of /dev/mem to valid physical addresses
Craig Bergstrom [Wed, 15 Nov 2017 22:29:51 +0000 (15:29 -0700)]
x86/mm: Limit mmap() of /dev/mem to valid physical addresses

One thing /dev/mem access APIs should verify is that there's no way
that excessively large PFNs can leak into the high bits of the
page table entry.

In particular, if people can use "very large physical page addresses"
through /dev/mem to set the bits past bit 58 - SOFTW4 and permission
key bits and NX bit, that could *really* confuse the kernel.

We had an earlier attempt:

  ce56a86e2ade ("x86/mm: Limit mmap() of /dev/mem to valid physical addresses")

... which turned out to be too restrictive (breaking mem=... bootups for example) and
had to be reverted in:

  90edaac62729 ("Revert "x86/mm: Limit mmap() of /dev/mem to valid physical addresses"")

This v2 attempt modifies the original patch and makes sure that mmap(/dev/mem)
limits the PFNs so that they at least fit in the actual pteval_t architecturally:

 - Make sure mmap_mem() actually validates that the offset fits in phys_addr_t

    ( This may be indirectly true due to some other check, but it's not
      entirely obvious. )

 - Change valid_mmap_phys_addr_range() to just use phys_addr_valid()
   on the top byte

    ( Top byte is sufficient, because mmap_mem() has already checked that
      it cannot wrap. )

 - Add a few comments about what the valid_phys_addr_range() vs.
   valid_mmap_phys_addr_range() difference is.
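
In code, the first two points look roughly like this (sketch):

    /* Checking the last byte is sufficient, because mmap_mem() has
     * already ensured the range cannot wrap */
    static inline int valid_mmap_phys_addr_range(unsigned long pfn, size_t size)
    {
            return phys_addr_valid(((u64)pfn << PAGE_SHIFT) + size - 1);
    }

    /* in mmap_mem(): does the offset even fit in phys_addr_t? */
    if (offset >> PAGE_SHIFT != vma->vm_pgoff)
            return -EINVAL;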

Signed-off-by: Craig Bergstrom <craigb@google.com>
[ Fixed the checks and added comments. ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[ Collected the discussion and patches into a commit. ]
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hans Verkuil <hans.verkuil@cisco.com>
Cc: Mauro Carvalho Chehab <mchehab@s-opensource.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sander Eikelenboom <linux@eikelenboom.it>
Cc: Sean Young <sean@mess.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/CA+55aFyEcOMb657vWSmrM13OxmHxC-XxeBmNis=DwVvpJUOogQ@mail.gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Orabug: 28220674
CVE: CVE-2018-3620

(cherry picked from commit be62a32044061cb4a3b70a10598e093f1319102e)

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
drivers/char/mem.c
Contextual: we have another structure. Just ditch that check.

7 years agox86/mm: Prevent non-MAP_FIXED mapping across DEFAULT_MAP_WINDOW border
Kirill A. Shutemov [Wed, 15 Nov 2017 14:36:06 +0000 (17:36 +0300)]
x86/mm: Prevent non-MAP_FIXED mapping across DEFAULT_MAP_WINDOW border

In case of 5-level paging, the kernel does not place any mapping above
47-bit, unless userspace explicitly asks for it.

Userspace can request an allocation from the full address space by
specifying the mmap address hint above 47-bit.

Nicholas noticed that the current implementation violates this interface:

  If user space requests a mapping at the end of the 47-bit address space
  with a length which causes the mapping to cross the 47-bit border
  (DEFAULT_MAP_WINDOW), then the vma is partially in the address space
  below and above.

Sanity check the mmap address hint so that start and end of the resulting
vma are on the same side of the 47-bit border. If that's not the case fall
back to the code path which ignores the address hint and allocate from the
regular address space below 47-bit.

To make the checks consistent, mask out the address hint's lower bits
(either PAGE_MASK or huge_page_mask()) instead of using ALIGN() which can
push them up to the next boundary.
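
The sanity check reduces to the following (sketch; as noted in the
conflicts below, the UEK4 backport uses TASK_SIZE_MAX in place of
DEFAULT_MAP_WINDOW):

    static bool mmap_address_hint_valid(unsigned long addr, unsigned long len)
    {
            if (TASK_SIZE - len < addr)
                    return false;
            /* start and end must fall on the same side of the border */
            return (addr > DEFAULT_MAP_WINDOW) ==
                   (addr + len > DEFAULT_MAP_WINDOW);
    }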

[ tglx: Moved the address check to a function and massaged comment and
   changelog ]

Reported-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: linux-mm@kvack.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lkml.kernel.org/r/20171115143607.81541-1-kirill.shutemov@linux.intel.com
Orabug: 28220674
CVE: CVE-2018-3620

(cherry picked from commit 1e0f25dbf2464df8445dd40881f4d9e732434947)

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/include/asm/elf.h
Contextual: we do not have the rest of the functions nearby.
Also, we do not have DEFAULT_MAP_WINDOW; we used TASK_SIZE_MAX.

7 years agox86/speculation: Support per-process SSBD with IBRS
Alexandre Chartre [Mon, 9 Jul 2018 14:53:05 +0000 (16:53 +0200)]
x86/speculation: Support per-process SSBD with IBRS

Currently per-process SSBD can't be used with IBRS because both SSBD
and IBRS have to concurrently modify the spec_ctrl state which is globally
stored across all cpus. As a consequence, when using IBRS, Speculative
Store Bypass is disabled for userspace and it can't be used per-process
with prctl.

Commit fabdd62357ac ("x86/speculation: Implement per-cpu IBRS control")
has implemented per-cpu IBRS control and a similar change is needed to
maintain part of the spec_ctrl state (x86_spec_ctrl_priv) per-cpu too.
Then spec_ctrl state will be maintained per-cpu and per-process SSBD
can be used with IBRS.

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from UEK5 commit 9275aeff3b17bf85c7fe99d98d73b69628c14796)

Orabug: 28354043

[Backport: assembly code changes in spec_ctrl.h are similar but the
 code is organized slightly differently on UEK5 and UEK4. On UEK5,
 changes are directly in assembly macros (.macro). On UEK4, changes
 are in preprocessor macros (#define) which are used by assembly macros.]

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
7 years agoRevert "xen-swiotlb: fix the check condition for xen_swiotlb_free_coherent"
Dongli Zhang [Fri, 3 Aug 2018 04:36:14 +0000 (12:36 +0800)]
Revert "xen-swiotlb: fix the check condition for xen_swiotlb_free_coherent"

This reverts commit ddcf804ba7f4856849323145e348d75180d40406.

The dom0 would not panic if commit 7fc30809bfa8 ("xen-swiotlb: fix
the check condition for xen_swiotlb_free_coherent") is reverted. It is not
clear if the issue is due to this commit or the issue is just revealed by
this commit. This patch is temporarily reverted and will be merged again
once the root cause is identified.

The dom0 always panics because the page table entry is not set correctly,
e.g., the pte write bit is not set, or the pte is 0. This might be
directly/indirectly related to the patch.

There are usually warnings as below before the panic:

[10998.691856] 6 multicall(s) failed: cpu 10
[10998.692048] CPU: 10 PID: 2078 Comm: sort Tainted: G        W        4.1.12-124.16.1.el6uek.bug28258102debug4.x86_64 #2
[10998.692049] Hardware name: Oracle Corporation ORACLE SERVER X5-2/ASM,MOTHERBOARD,1U, BIOS 30130400 02/08/2018
[10998.692050]  0000000000000000 ffff880401a9bcb8 ffffffff816e5253 ffff8804a6c8d2c0
[10998.692052]  0000000000000006 ffff880401a9bd18 ffffffff810055d9 ffffffff81007220
[10998.692053]  ffff880497137080 00000000000000ff 0000000000000000 0000000000000030
[10998.692055] Call Trace:
[10998.692057]  [<ffffffff816e5253>] dump_stack+0x63/0x81
[10998.692059]  [<ffffffff810055d9>] xen_mc_flush+0x209/0x2d0
[10998.692061]  [<ffffffff81007220>] ? xen_pin_page+0x150/0x150
[10998.692063]  [<ffffffff81007bf9>] __xen_pgd_unpin+0x109/0x250
[10998.692065]  [<ffffffff81007eb7>] xen_exit_mmap+0x177/0x2a0
[10998.692067]  [<ffffffff811c3628>] exit_mmap+0x48/0x160
[10998.692068]  [<ffffffff81105da2>] ? exit_robust_list+0x62/0x130
[10998.692070]  [<ffffffff81081cf3>] mmput+0x63/0x100
[10998.692072]  [<ffffffff81087263>] do_exit+0x343/0xa90
[10998.692074]  [<ffffffff8112c794>] ? __audit_syscall_entry+0xb4/0x110
[10998.692075]  [<ffffffff81025a2c>] ? do_audit_syscall_entry+0x6c/0x70
[10998.692077]  [<ffffffff81087a45>] do_group_exit+0x45/0xb0
[10998.692079]  [<ffffffff81087ac4>] SyS_exit_group+0x14/0x20
[10998.692081]  [<ffffffff816edc1c>] system_call_fastpath+0x18/0xd6
[10998.692083]   call  1/8: op=14 arg=[ffff8804595ba000] result=-16 xen_unpin_page+0x155/0x160
[10998.692466]   call  2/8: op=26 arg=[ffff8804a6c8e3d0] result=0 xen_extend_mmuext_op+0x62/0x100
[10998.692871]   call  3/8: op=14 arg=[ffff880468f6c000] result=-16 xen_unpin_page+0x155/0x160
[10998.693255]   call  4/8: op=14 arg=[ffff88048342a000] result=-16 xen_unpin_page+0x60/0x160
[10998.693657]   call  5/8: op=14 arg=[ffff88043effc000] result=-16 xen_unpin_page+0x60/0x160
[10998.694038]   call  6/8: op=26 arg=[ffff8804a6c8e3e8] result=0 xen_extend_mmuext_op+0x62/0x100
[10998.694422]   call  7/8: op=14 arg=[ffff88043effd000] result=-16 xen_unpin_page+0x155/0x160
[10998.694804]   call  8/8: op=14 arg=[ffff8804466d7000] result=-16 xen_unpin_page+0x60/0x160

There are error logs in xen hypervisor as well:

(XEN) mm.c:2560:d0 Bad type (saw 2400000000000001 != exp 7000000000000000) for mfn 447f42a (pfn 407a72)
(XEN) mm.c:986:d0 Could not get page type PGT_writable_page
(XEN) mm.c:1038:d0 Error getting mfn 447f42a (pfn 407a72) from L1 entry 800000447f42a063 for l1e_owner=0, pg_owner=0
(XEN) mm.c:2560:d0 Bad type (saw 1400000000000001 != exp 7000000000000000) for mfn 447f42b (pfn 407a73)
(XEN) mm.c:986:d0 Could not get page type PGT_writable_page
(XEN) mm.c:1038:d0 Error getting mfn 447f42b (pfn 407a73) from L1 entry 800000447f42b063 for l1e_owner=0, pg_owner=0

Orabug: 28441054

Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Acked-by: Joe Jin <joe.jin@oracle.com>
Reviewed-by: John Sobecki <john.sobecki@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
7 years agox86/speculation: Implement per-cpu IBRS control
Alexandre Chartre [Mon, 25 Jun 2018 09:45:16 +0000 (11:45 +0200)]
x86/speculation: Implement per-cpu IBRS control

If a system is booted with non-IBRS microcode, starts a microcode update
which enables IBRS, flags IBRS as supported, then an NMI is taken on a CPU
that hasn't actually received the update, we get a #GP since the SPEC_CTRL
MSR will get frobbed on the unsupported CPU anyway.

IBRS usage is now defined globally as well as per-cpu. The boot cpu defines
the initial IBRS usage, then each cpu defines its own per-cpu IBRS usage
based on the global IBRS usage and its cpu capabilities.
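
Conceptually the per-cpu derivation looks like this (illustrative only;
the actual UEK variable and function names differ):

    static DEFINE_PER_CPU(unsigned int, ibrs_usage);    /* hypothetical */

    void cpu_init_ibrs(void)                            /* hypothetical */
    {
            unsigned int usage = ibrs_global_usage;     /* hypothetical */

            /* a CPU without updated microcode never frobs SPEC_CTRL */
            if (!this_cpu_has(X86_FEATURE_SPEC_CTRL))
                    usage = IBRS_UNSUPPORTED;           /* hypothetical */
            this_cpu_write(ibrs_usage, usage);
    }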

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from UEK5 commit fabdd62357acce2531c9e16c5510e04862eab52d
 and part of UEK5 commit 5055b50f67b59ec07c64c833cb351f088486cb02)

[Backport: backport includes part of UEK5 commit 5055b50f67b59e
 ("x86/topology: Avoid wasting 128k for package id array") to track when
 cpu data are initialized.]

Orabug: 28064081

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
7 years agoIB/mad: Use ID allocator routines to allocate agent number
Hans Westgaard Ry [Fri, 1 Jun 2018 10:24:40 +0000 (12:24 +0200)]
IB/mad: Use ID allocator routines to allocate agent number

The agent TID is a 64 bit value split in two dwords.  The least
significant dword is the TID running counter. The most significant
dword is the agent number. In the CX-3 shared port model, the mlx4
driver uses the most significant byte of the agent number to store the
slave number, making agent numbers greater than or equal to 2^24 (3 bytes)
unusable. The current codebase uses a variable which is incremented
for each new agent, giving too-large agent numbers over time.
The IDA set of functions is used instead of the simple counter
approach. This allows re-use of agent numbers.
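
Sketched with the generic IDA helpers (the surrounding MAD code is
omitted):

    static DEFINE_IDA(agent_ida);

    /* allocate: agent numbers stay below 2^24 and can be recycled */
    int agent_num = ida_simple_get(&agent_ida, 0, 1 << 24, GFP_KERNEL);
    if (agent_num < 0)
            return ERR_PTR(agent_num);

    /* on unregister, free the number so it can be reused */
    ida_simple_remove(&agent_ida, agent_num);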

The signature of the bug is a MAD layer that stops working and the
console is flooded with messages like:
mlx4_ib: egress mad has non-null tid msb:1 class:4 slave:0

Orabug: 28339815

Signed-off-by: Hans Westgaard Ry <hans.westgaard.ry@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
7 years agox86/mcheck: Reorganize the hotplug callbacks
Sebastian Andrzej Siewior [Thu, 10 Nov 2016 17:44:45 +0000 (18:44 +0100)]
x86/mcheck: Reorganize the hotplug callbacks

Initially I wanted to remove mcheck_cpu_init() from identify_cpu() and let it
become an independent early hotplug callback. The main problem here was that
the init on the boot CPU may happen too late
(device_initcall_sync(mcheck_init_device)) and nobody wanted to risk receiving
an MCE event at boot time leading to a shutdown (if the MCE feature is not yet
enabled).

Here is attempt two: the timing stays as-is but the ordering of the functions
is changed:
- mcheck_cpu_init() (which is run from identify_cpu()) will set up the timer
  struct but won't fire the timer. This is moved to CPU_ONLINE since its
  cleanup part is in CPU_DOWN_PREPARE. So if it is okay to stop the timer early
  in the shutdown phase, it should be okay to start it late in the bring-up phase.

- CPU_DOWN_PREPARE disables the MCE feature flags for !INTEL CPUs in
  mce_disable_cpu(). If a failure occurs it would be re-enabled on all vendor
  CPUs (including Intel where it was not disabled during shutdown). To keep this
  working I am moving it to CPU_ONLINE. smp_call_function_single() is dropped
  because the notifier nowadays runs on the target CPU.

- CPU_ONLINE is invoking mce_device_create() + mce_threshold_create_device()
  but its cleanup part is in CPU_DEAD (mce_threshold_remove_device() and
  mce_device_remove()). In order to keep this symmetrical I am moving the
  cleanup from CPU_DEAD to CPU_DOWN_PREPARE.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Borislav Petkov <bp@alien8.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: rt@linutronix.de
Cc: linux-edac@vger.kernel.org
Link: http://lkml.kernel.org/r/20161110174447.11848-6-bigeasy@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Orabug: 28387566

(cherry picked from commit 39f152ff)

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/kernel/cpu/mcheck/mce.c
In this cherry-pick we preserved all the functions that were used in UEK4 with
the new logic (didn't make sense to backport all the patches that were also
modifying the interfaces for various reasons not related to this warning). So
we have:
- setup_timer instead of setup_pinned_timer (in UEK4 we do not have this interface)
- mce_{disable|reenable}_cpu: we call them using smp_call_function_single, not directly
- threshold_cpu_callback instead of mce_threshold_remove_device

Signed-off-by: Brian Maly <brian.maly@oracle.com>
7 years agortnetlink: fix rtnl_vfinfo_size
Sabrina Dubroca [Tue, 15 Nov 2016 09:39:03 +0000 (10:39 +0100)]
rtnetlink: fix rtnl_vfinfo_size

The size reported by rtnl_vfinfo_size doesn't match the space used by
rtnl_fill_vfinfo.

rtnl_vfinfo_size currently doesn't account for the nest attributes
used by statistics (added in commit 3b766cd83232), nor for struct
ifla_vf_tx_rate (since commit ed616689a3d9, which added ifla_vf_rate
to the dump without removing ifla_vf_tx_rate, but replaced
ifla_vf_tx_rate with ifla_vf_rate in the size computation).
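
The fix is to make the size computation mirror what rtnl_fill_vfinfo
actually emits, roughly (sketch, trimmed to the attributes present in
this tree; the VF stats nest does not exist here yet, per the conflict
note below):

    size += num_vfs *
            (nla_total_size(sizeof(struct ifla_vf_mac)) +
             nla_total_size(sizeof(struct ifla_vf_vlan)) +
             nla_total_size(sizeof(struct ifla_vf_tx_rate)) +
             nla_total_size(sizeof(struct ifla_vf_rate)));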

Fixes: ed616689a3d9 ("net-next:v4: Add support to configure SR-IOV VF minimum and maximum Tx rate through ip tool")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Orabug: 27998927

(cherry picked from commit 7e75f74a171a8146cc3ee92d5562878b40c25fb5)
Signed-off-by: Divya Indi <divya.indi@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
net/core/rtnetlink.c
(Precedes commits that added VF stats, VF trust, vlan info)

Signed-off-by: Brian Maly <brian.maly@oracle.com>
7 years agords: IB: avoid migration to port that is down
Zhu Yanjun [Sat, 19 May 2018 11:55:49 +0000 (07:55 -0400)]
rds: IB: avoid migration to port that is down

When rds does failback, it will migrate to a new ib device after a delay
of several seconds. During that delay, the target ib device may go down,
so it is necessary to check the target ib device's state.
The following is an example.
"
...
May 11 10:37:23 kernel: mlx4_core 0000:03:00.0: mlx4_ib: Port 2 logical link is down
May 11 10:37:23 kernel: RDS/IP: IP 192.168.10.2 migrated from ib1 to ib0:P02
May 11 10:38:18 kernel: mlx4_core 0000:03:00.0: mlx4_ib: Port 2 logical link is up
May 11 10:38:23 kernel: mlx4_core 0000:03:00.0: mlx4_ib: Port 2 logical link is down
May 11 10:38:28 kernel: RDS/IP: IP 192.168.10.2 migrated from ib0:P02 to ib1
...
"
When ib1 is up, failback is in progress, but this failback has a delay of
several seconds. During this delay, ib1 goes down. So rds finally migrates
to an ib device that is down.
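
A check along these lines addresses it (sketch; ib_query_port() is the
standard verbs API, the surrounding RDS migration code is omitted):

    struct ib_port_attr attr;

    /* skip failback if the target port is no longer ACTIVE */
    if (ib_query_port(ibdev, port_num, &attr) ||
        attr.state != IB_PORT_ACTIVE)
            return;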

Orabug: 28097129

Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
7 years agonet/rds: Fix kernel panic caused by a race between setup/teardown
Hans Westgaard Ry [Tue, 10 Jul 2018 11:14:37 +0000 (13:14 +0200)]
net/rds: Fix kernel panic caused by a race between setup/teardown

Running rds-stress with --reset option in a tight loop provokes a
"NULL pointer dereference" in rds_ib_recv_refill().

 IP: [<..fa04fefa1>] rds_ib_recv_refill+0x2a1/0x630 [rds_rdma]
 Call Trace:
  [<..fa04fc140>] rds_ib_cm_connect_complete+0x2f0/0x360 [rds_rdma]
  [<..f810ba337>] ? wake_up_process+0x27/0x50
  [<..f810a1b54>] ? wake_up_worker+0x24/0x30
  [<..f810a2902>] ? insert_work+0x62/0xa0
  [<..fa04f4405>] rds_rdma_cm_event_handler_cmn+0x405/0x8c0 [rds_rdma]
  [<..f810a2d82>] ? __queue_delayed_work+0xb2/0x1a0
  [<..fa04f48d0>] rds_rdma_cm_event_handler+0x10/0x20 [rds_rdma]
  [<..fa03bf85f>] cma_ib_handler+0x10f/0x280 [rdma_cm]
  [<..fa03b1f5b>] cm_process_work+0x2b/0x140 [ib_cm]
  [<..fa03b414b>] cm_work_handler+0x99b/0x1630 [ib_cm]
  [<..f810a5375>] process_one_work+0x165/0x470
  [<..f810a5b82>] worker_thread+0x112/0x540
  [<..f810a5a70>] ? rescuer_thread+0x3f0/0x3f0
  [<..f810ab51a>] kthread+0xda/0xf0
  [<..f810ab440>] ? kthread_create_on_node+0x1b0/0x1b0
  [<..f817521b8>] ret_from_fork+0x58/0x90
  [<..f810ab440>] ? kthread_create_on_node+0x1b0/0x1b0

This is due to a race condition between rds_ib_cm_connect_complete()
trying to post a recv (using the QP) and rds_ib_conn_path_shutdown()
destroying the QP and zeroing it. By waiting for and reserving the
RDS_RECV_REFILL bit before calling shutdown we avoid the problem.

Also add the path transition from CONNECTING to DISCONNECTING for
completeness.
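
The reservation is conceptually (illustrative sketch):

    /* own the refill bit before tearing down the QP; a concurrent
     * rds_ib_recv_refill() holds it while posting receives */
    while (test_and_set_bit(RDS_RECV_REFILL, &conn->c_flags))
            msleep(1);
    /* ... now destroy/zero the QP; no refill can race anymore ... */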

Orabug: 28326553

Signed-off-by: Hans Westgaard Ry <hans.westgaard.ry@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
7 years agords: IB: fix returned value not set error
Zhu Yanjun [Tue, 17 Jul 2018 07:03:05 +0000 (03:03 -0400)]
rds: IB: fix returned value not set error

When the IB port link goes up/down too frequently, the following will occur.
"
16:57:08  kernel: mlx5_core 0000:af:00.1 enp175s0f1: Link up
16:57:08  kernel: RDS/IB: PORT-EVENT: port active, PORT:
          mlx5_0/port_2/enp175s0f1 : port state transition to UP
          (portlayers 0x7)
16:57:08  kernel: RDS/IB: NET-EVENT: NETDEV-CHANGE, PORT
          mlx5_0/port_2/enp175s0f1 : port state transition NONE -
          port retained in state UP (portlayers 0x7)
16:57:13  kernel: mlx5_core 0000:af:00.1 enp175s0f1: Link down
16:57:13  kernel: RDS/IB: PORT-EVENT: port error, PORT:
          mlx5_0/port_2/enp175s0f1 : port state transition to DOWN
          (portlayers 0x4)
16:57:13  kernel: RDS/IB: NET-EVENT: NETDEV-CHANGE, PORT
          mlx5_0/port_2/enp175s0f1 : port state transition NONE -
          port retained in state DOWN (portlayers 0x4)
16:57:13  kernel: RDS/IB: Address already exist
16:57:17  kernel: mlx5_core 0000:af:00.1 enp175s0f1: Link up
16:57:17  kernel: RDS/IB: PORT-EVENT: port active, PORT:
          mlx5_0/port_2/enp175s0f1 : port state transition to UP
          (portlayers 0x7)
16:57:17  kernel: RDS/IB: NET-EVENT: NETDEV-CHANGE, PORT
          mlx5_0/port_2/enp175s0f1 : port state transition NONE -
          port retained in state UP (portlayers 0x7)
16:57:18  kernel: RDS/IB: IPv4 172.16.2.52 migrated from enp175s0f0
          (port 1) to enp175s0f1 (port 2)
"
From the above, rds_ib_addr_exist is called to check the IP address. If it
already exists, the function goes directly to the end while ret is left
unset. This is not correct and leaves some global variables in a bad state.

Orabug: 28356474

Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Reviewed-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
7 years agofs: proc: array.c: fix Speculation_Store_Bypass print format
Mihai Carabas [Fri, 29 Jun 2018 09:13:41 +0000 (12:13 +0300)]
fs: proc: array.c: fix Speculation_Store_Bypass print format

This patch fixes commit f590427c8b01298b416f043107e654918107c579. While
backporting, we introduced a '\n' in front of Speculation_Store_Bypass,
ending up with a supplementary newline which affects different tools (like
iotop):
Seccomp:        0

Speculation_Store_Bypass:       thread vulnerable
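
The fix amounts to moving the newline to the end of the format string,
e.g. (illustrative; 'state' is a placeholder for the computed string):

    /* before (backport bug): the leading \n produced a stray blank line */
    seq_printf(m, "\nSpeculation_Store_Bypass:\t%s", state);
    /* after: */
    seq_printf(m, "Speculation_Store_Bypass:\t%s\n", state);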

Orabug: 28128750

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: John Haxby <john.haxby@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
7 years agoxfs: don't chain ioends during writepage submission
Dave Chinner [Mon, 15 Feb 2016 06:23:12 +0000 (17:23 +1100)]
xfs: don't chain ioends during writepage submission

Currently we can build a long ioend chain during ->writepages that
gets attached to the writepage context. IO submission only then
occurs when we finish all the writepage processing. This means we
can have many ioends allocated and pending, and this violates the
mempool guarantees that we need to give about forwards progress.
i.e. we really should only have one ioend being built at a time,
otherwise we may drain the mempool trying to allocate a new ioend
and that blocks submission, completion and freeing of ioends that
are already in progress.

To prevent this situation from happening, we need to submit ioends
for IO as soon as they are ready for dispatch rather than queuing
them for later submission. This means the ioends have bios built
immediately and they get queued on any plug that is currently active.
Hence if we schedule away from writeback, the ioends that have been
built will make forwards progress due to the plug flushing on
context switch. This will also prevent context switches from
creating unnecessary IO submission latency.

We can't completely avoid having nested IO allocation - when we have
a block size smaller than a page size, we still need to hold the
ioend submission until after we have marked the current page dirty.
Hence we may need multiple ioends to be held while the current page
is completely mapped and made ready for IO dispatch. We cannot avoid
this problem - the current code already has this ioend chaining
within a page so we can mostly ignore that it occurs.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
(cherry picked from commit e10de3723c53378e7cf441529f563c316fdc0dd3)

Orabug: 28193043

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
7 years agoxfs: factor mapping out of xfs_do_writepage
Dave Chinner [Mon, 15 Feb 2016 06:21:37 +0000 (17:21 +1100)]
xfs: factor mapping out of xfs_do_writepage

Separate out the bufferhead based mapping from the writepage code so
that we have a clear separation of the page operations and the
bufferhead state.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
(cherry picked from commit bfce7d2e2d5ee05e9d465888905c66a70a9c243c)

Orabug: 28193043

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
7 years agoxfs: xfs_cluster_write is redundant
Dave Chinner [Mon, 15 Feb 2016 06:21:31 +0000 (17:21 +1100)]
xfs: xfs_cluster_write is redundant

xfs_cluster_write() is not necessary now that xfs_vm_writepages()
aggregates writepage calls across a single mapping. This means we no
longer need to do page lookups in xfs_cluster_write, so writeback
only needs to look up the page cache once per page being written.
This also removes a large amount of mostly duplicate code between
xfs_do_writepage() and xfs_convert_page().

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
(cherry picked from commit ad68972acb82c3e8bba316d542ab204984cb1f1c)

Orabug: 28193043

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
7 years agoxfs: Introduce writeback context for writepages
Dave Chinner [Mon, 15 Feb 2016 06:21:19 +0000 (17:21 +1100)]
xfs: Introduce writeback context for writepages

xfs_vm_writepages() calls generic_writepages to writeback a range of
a file, but then xfs_vm_writepage() clusters pages itself as it does
not have any context it can pass between ->writepage calls from
__write_cache_pages().

Introduce a writeback context for xfs_vm_writepages() and call
__write_cache_pages directly with our own writepage callback so that
we can pass that context to each writepage invocation. This
encapsulates the current mapping, whether it is valid or not, the
current ioend and its IO type and the ioend chain being built.

This requires us to move the ioend submission up to the level where
the writepage context is declared. This does mean we do not submit
IO until we have packaged the entire writeback range, but with the block
plugging in the writepages call this is the way IO is submitted,
anyway.

It also means that we need to handle discontiguous page ranges.  If
the pages sent down by write_cache_pages to the writepage callback
are discontiguous, we need to detect this and put each discontiguous
page range into individual ioends. This is needed to ensure that the
ioend accurately represents the range of the file that it covers so
that file size updates during IO completion set the size correctly.
Failure to take into account the discontiguous ranges results in
files being too small when writeback patterns are non-sequential.
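
The context is roughly (sketch following the upstream commit; the UEK4
version differs slightly):

    struct xfs_writepage_ctx {
            struct xfs_bmbt_irec    imap;
            bool                    imap_valid;
            unsigned int            io_type;
            struct xfs_ioend        *ioend;
            sector_t                last_block;
    };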

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
(cherry picked from commit fbcc025613590d7b1d15521555dcc6393a148a6b)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
fs/xfs/xfs_aops.c

Orabug: 28193043

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
7 years agoxfs: remove xfs_cancel_ioend
Dave Chinner [Mon, 15 Feb 2016 06:21:12 +0000 (17:21 +1100)]
xfs: remove xfs_cancel_ioend

We currently have code to cancel ioends being built because we
change bufferhead state as we build the ioend. On error, this needs
to be unwound and so we have cancelling code that walks the buffers
on the ioend chain and undoes these state changes.

However, the IO submission path already handles state changes for
buffers when a submission error occurs, so we don't really need a
separate cancel function to do this - we can simply submit the
ioend chain with the specific error and it will be cancelled rather
than submitted.

Hence we can remove the explicit cancel code and just rely on
submission to deal with the error correctly.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
(cherry picked from commit 150d5be09ce49a9bed6feb7b7dc4e5ae188778ec)

Orabug: 28193043

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
7 years agoxfs: remove nonblocking mode from xfs_vm_writepage
Dave Chinner [Mon, 15 Feb 2016 06:20:50 +0000 (17:20 +1100)]
xfs: remove nonblocking mode from xfs_vm_writepage

Remove the nonblocking optimisation done for mapping lookups during
writeback. It's not clear that leaving a hole in the writeback range
just because we couldn't get a lock is really a win, as it makes us
do another small random IO later on rather than a large sequential
IO now.

As this gets in the way of sane error handling later on, just remove
for the moment and we can re-introduce an equivalent optimisation in
future if we see problems due to extent map lock contention.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
(cherry picked from commit 988ef927926aa3481cbf144f235c0cefd7deb9e4)

Orabug: 28193043

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
7 years agords: tcp: cancel all worker threads before shutting down socket
Sowmini Varadhan [Tue, 18 Jul 2017 14:36:12 +0000 (17:36 +0300)]
rds: tcp: cancel all worker threads before shutting down socket

We could end up executing rds_conn_shutdown before the rds_recv_worker
thread; rds_conn_shutdown -> rds_tcp_conn_shutdown can then do a
sock_release and set sock->sk to NULL, which may interleave in bad
ways with rds_recv_worker, e.g., it could result in:

    "BUG: unable to handle kernel NULL pointer dereference at 0000000000000078"
       [ffff881769f6fd70] release_sock at ffffffff815f337b
       [ffff881769f6fd90] rds_tcp_recv at ffffffffa043c888 [rds_tcp]
       [ffff881769f6fdb0] rds_recv_worker at ffffffffa04a4810 [rds]
       [ffff881769f6fde0] process_one_work at ffffffff810a14c1
       [ffff881769f6fe40] worker_thread at ffffffff810a1940
       [ffff881769f6fec0] kthread at ffffffff810a6b1e

Cancel the send/recv worker threads before shutting down the connection.

Also, do not enqueue any new shutdown workq items when the connection is
shutting down (this may happen for rds-tcp in softirq mode, if a FIN
or CLOSE is received while the module is in the middle of an unload).
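
A minimal sketch of the idea (helper and field names assumed from the
RDS connection code, not the literal UEK diff): quiesce the workers
synchronously before the socket teardown, so neither can run against a
released socket.

    /* sketch: in the rds-tcp shutdown path, stop the workers first */
    cancel_delayed_work_sync(&conn->c_send_w);
    cancel_delayed_work_sync(&conn->c_recv_w);

    /* only now is it safe to release the socket; sock->sk may
     * become NULL after this point */
    sock_release(sock);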

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Wei Lin Guay <wei.lin.guay@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
(cherry picked from UEK4 commit bf1f0c8c007c58eae14622c6f9a3fe36f6b19d5a)

Orabug: 28298156

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
7 years agords: signedness bug
Dan Carpenter [Sat, 18 Sep 2010 13:42:25 +0000 (13:42 +0000)]
rds: signedness bug

In the original code, if the copy_from_user() fails in rds_rdma_pages(),
the error handling fails and we get a stack trace from kmalloc().
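
The gist of the fix (a sketch along the lines of the upstream patch;
variable names approximate): rds_rdma_pages() can return a negative
errno, so the result must be kept signed and checked before use.

    int nr_pages;    /* was unsigned int, which hid negative errors */

    nr_pages = rds_rdma_pages(args);
    if (nr_pages < 0)
        goto out;    /* previously the error wrapped to a huge count
                      * and was handed to the allocator */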

<amended commit message by hbugge>

This bug was originally fixed in v2.6.36-rc5. However, commit
edc31ebe9e36 ("[patch1/3] Merge for Mellanox OFED R2, 0080 release")
re-introduces the bug, most probably because the Mellanox OFED didn't
have the upstream fix.

So, this bug has been present since v2.6.39-400.9.0 and is present in
all our releases from uek2 to uek5.

</amended>

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 9b9d2e00bfa592aceda7b43da76c670df61faa97)

Orabug: 28319166

Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Fixes: edc31ebe9e36d87dcd08931784a49c2e75eed359
Reviewed-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
7 years agoblock: update integrity interval after queue limits change
Ritika Srivastava [Tue, 3 Jul 2018 19:05:39 +0000 (12:05 -0700)]
block: update integrity interval after queue limits change

A disk's integrity interval is calculated based on the logical block
size of its queue. If the queue or its block size is not set, a default
integrity interval of 512 is used.

The issue is that the integrity interval of a dm device is set before its
queue limits are set, and it is not updated once the appropriate queue
limits are in place. So the dm integrity interval stays at the default of
512, which causes the following integrity mismatch errors for devices
with a logical block size != 512, and further leads to dm paths not being
set up correctly:

blk_integrity_compare: dm-X/sdx protection interval 512 != 4096

Solution is to set integrity interval all the time in
blk_integrity_register() according to the logical block size of the
queue. Once logical block size of the queue is updated, the integrity
interval will be set correctly on the next blk_integrity_register().
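
A sketch of the idea (field names are approximate for the UEK4 block
layer; this is not the literal patch): derive the interval from the
queue's logical block size on every registration instead of keeping a
stale default.

    /* sketch: recompute the protection interval each time
     * blk_integrity_register() runs; 512 only as a last resort */
    if (disk->queue && queue_logical_block_size(disk->queue))
        bi->interval = queue_logical_block_size(disk->queue);
    else
        bi->interval = 512;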

This issue does not occur in UEK5 or upstream, as they contain the
following commits, which resolve this and a couple of other issues as
well. When backported to UEK4, however, those commits cause KABI
breakage, so the current patch works around the problem instead.

aff34e192e4eeacfb8b5ffc68e10a240f2c0c6d7
0f8087ecdeac921fc4920f1328f55c15080bc6aa
a48f041d91bf1aee599fa2adb53b780ed20c2ee5
4c241d08dbfcbdc7a949b91d72707a289d464954
25520d55cdb6ee289abc68f553d364d22478ff54
2859323e35ab5fc42f351fbda23ab544eaa85945

Orabug: 27586756

Signed-off-by: Ritika Srivastava <ritika.srivastava@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
7 years agodccp: check sk for closed state in dccp_sendmsg()
Alexey Kodanev [Tue, 6 Mar 2018 19:57:01 +0000 (22:57 +0300)]
dccp: check sk for closed state in dccp_sendmsg()

dccp_disconnect() sets the 'dp->dccps_hc_tx_ccid' tx handler to NULL;
therefore, if a DCCP socket is disconnected and dccp_sendmsg() is
called afterwards, it will cause a NULL pointer dereference in
dccp_write_xmit().
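
The shape of the fix (a sketch; the error label name is assumed): bail
out of dccp_sendmsg() under the socket lock if the socket was closed.

    lock_sock(sk);

    /* a disconnected socket has dccps_hc_tx_ccid == NULL, so do not
     * queue anything that would reach dccp_write_xmit() */
    if (sk->sk_state == DCCP_CLOSED) {
        rc = -ENOTCONN;
        goto out_release;
    }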

This crash and the reproducer were reported by syzbot. It looks like the
issue is reproducible once commit 69c64866ce07 ("dccp: CVE-2017-8824:
use-after-free in DCCP code") is applied.

Reported-by: syzbot+f99ab3887ab65d70f816@syzkaller.appspotmail.com
Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 67f93df79aeefc3add4e4b31a752600f834236e2)

Orabug: 28001529
CVE: CVE-2018-1130

Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
7 years agonet/rds: Implement ARP flushing correctly
Håkon Bugge [Thu, 21 Jun 2018 11:06:30 +0000 (13:06 +0200)]
net/rds: Implement ARP flushing correctly

If a remote peer has moved its IP address from one port to the other,
the local node may have an incorrect ARP entry in its cache. During
connection management, we will then get back a route-error-event from
the CM.

The current code attempts to flush the ARP entry from the cache. However,
1) it does not check return values, 2) it does not supply the
device name, 3) it does not iterate over all possible device names,
and 4) it doesn't supply the correct flags.

Due to 2-4 above, the flushing doesn't work.

This commit fixes this.
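
A hedged sketch of what correct flushing looks like (the RDS code is
out-of-tree, so peer_ip and the surrounding context are illustrative):
look the entry up per device, under RTNL, and invalidate it with
admin-override flags so that it gets re-resolved.

    struct net_device *dev;
    struct neighbour *n;

    for_each_netdev(&init_net, dev) {
        n = neigh_lookup(&arp_tbl, &peer_ip, dev);
        if (!n)
            continue;
        /* NUD_FAILED plus OVERRIDE|ADMIN actually evicts the
         * stale entry instead of silently failing */
        neigh_update(n, NULL, NUD_FAILED,
                     NEIGH_UPDATE_F_OVERRIDE | NEIGH_UPDATE_F_ADMIN);
        neigh_release(n);
    }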

On a system with a single CX-3 and 16 VFs, fail-over just after a
fail-back is reduced from ~60 seconds down to ~10 seconds with the fix
(1156 RDS connections).

Orabug: 28219857

Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
---

V1 -> V2:
- Move NOTE section down as suggested by Santosh
- Remove unnecessary initialization of ret
- Fix a nit in the commit message (2-4 instead of 1-4)
- Use pr_err() instead of printk() as suggested by Yuval
- Use sizeof(*r) as suggested by Yuval

NOTE:
* The uek-4-QU7 version of this bug has an approved exception from
  Greg Marsden, that it can be merged first in uek-4-QU7, before
  uek5/master-next.

Signed-off-by: Brian Maly <brian.maly@oracle.com>
7 years agonet/rds: Fix incorrect bigger vs. smaller IP address check
Håkon Bugge [Tue, 26 Jun 2018 14:14:06 +0000 (16:14 +0200)]
net/rds: Fix incorrect bigger vs. smaller IP address check

In commit d8bd5dfb5de4 ("net/rds: use one sided reconnection during a
race"), the return value of rds_addr_cmp() was used as a boolean instead
of being compared against less-than-zero or greater-than-or-equal-to-zero.

This led to a situation where both peers thought they were the winner,
and consequently, racing RDS connections didn't form or took a long
time to form.

The issue is fixed by this commit.

Orabug: 28236599

Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Suggested-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com>
Fixes: d8bd5dfb5de4 ("net/rds: use one sided reconnection during a race")
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
(cherry picked from commit 653ccd08be01f82c6129f5060bd8110489aca110
repo UEK/linux-hbugge-public.git)

Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
   * Had to "rename" the SHA of the Fixes: uek-5-master commit to
     the correct one for uek-4.1-qu7.

Signed-off-by: Brian Maly <brian.maly@oracle.com>
7 years agoocfs2: Fix locking for res->tracking and dlm->tracking_list
Ashish Samant [Thu, 21 Jun 2018 22:20:44 +0000 (15:20 -0700)]
ocfs2: Fix locking for res->tracking and dlm->tracking_list

Orabug: 28256391

In dlm_init_lockres() we access and modify res->tracking and
dlm->tracking_list without holding dlm->track_lock. This can cause list
corruption and can end up in a kernel panic.

Fix this by locking res->tracking and dlm->tracking_list with
dlm->track_lock instead of dlm->spinlock.
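
The fix is essentially the following (a sketch of the dlm_init_lockres()
hunk, per the description above):

    /* res->tracking and dlm->tracking_list are protected by
     * dlm->track_lock, not by dlm->spinlock */
    spin_lock(&dlm->track_lock);
    list_add_tail(&res->tracking, &dlm->tracking_list);
    spin_unlock(&dlm->track_lock);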

Signed-off-by: Ashish Samant <ashish.samant@oracle.com>
CC: stable@vger.kernel.org
Reviewed-by: Changwei Ge <ge.changwei@h3c.com>
Acked-by: Joseph Qi <jiangqi903@gmail.com>
Acked-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
7 years agoxfrm: policy: check policy direction value
Vladis Dronov [Wed, 2 Aug 2017 17:50:14 +0000 (19:50 +0200)]
xfrm: policy: check policy direction value

Orabug: 28256487
CVE: CVE-2017-11600

The 'dir' parameter in xfrm_migrate() is a user-controlled byte which is
used as an array index. This can lead to an out-of-bounds access, a
kernel lockup and DoS. Add a check for the 'dir' value.

This fixes CVE-2017-11600.
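
The added validation is small (a sketch of the xfrm_migrate() check,
per the description above):

    /* 'dir' comes from userspace and indexes per-direction arrays */
    if (dir >= XFRM_POLICY_MAX) {
        err = -EINVAL;
        goto out;
    }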

References: https://bugzilla.redhat.com/show_bug.cgi?id=1474928
Fixes: 80c9abaabf42 ("[XFRM]: Extension for dynamic update of endpoint address(es)")
Cc: <stable@vger.kernel.org> # v2.6.21-rc1
Reported-by: "bo Zhang" <zhangbo5891001@gmail.com>
Signed-off-by: Vladis Dronov <vdronov@redhat.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
(cherry picked from commit 7bab09631c2a303f87a7eb7e3d69e888673b9b7e)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
7 years agoadd kernel param to pre-allocate NICs
Brian Maly [Sat, 23 Jun 2018 01:29:26 +0000 (21:29 -0400)]
add kernel param to pre-allocate NICs

Orabug: 27870400

This is so that eth0, eth1, ... ethX are available for renaming via
"ip link set" for certain applications (that will not be named).

Signed-off-by: Brian Maly <brian.maly@oracle.com>
Reviewed-by: Saar Maoz <Saar.Maoz@oracle.com>
7 years agomm/mempolicy.c: fix error handling in set_mempolicy and mbind.
Chris Salls [Sat, 8 Apr 2017 06:48:11 +0000 (23:48 -0700)]
mm/mempolicy.c: fix error handling in set_mempolicy and mbind.

Orabug: 28242475
CVE: CVE-2017-7616

In the case that compat_get_bitmap() fails, we do not want to copy the
bitmap to the user, as it would contain uninitialized stack data and
leak sensitive data.
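
The shape of the fix (a sketch of the compat syscall path; alloc_size
and nm as in the surrounding code): return before the uninitialized
on-stack bitmap can reach copy_to_user().

    DECLARE_BITMAP(bm, MAX_NUMNODES);

    if (compat_get_bitmap(bm, nmask, nr_bits))
        return -EFAULT;    /* bail before 'bm' is copied anywhere */
    nm = compat_alloc_user_space(alloc_size);
    if (copy_to_user(nm, bm, alloc_size))
        return -EFAULT;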

Signed-off-by: Chris Salls <salls@cs.ucsb.edu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit cf01fb9985e8deb25ccf0ea54d916b8871ae0e62)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
7 years agoxhci: Fix USB3 NULL pointer dereference at logical disconnect.
Mathias Nyman [Mon, 14 May 2018 08:57:23 +0000 (11:57 +0300)]
xhci: Fix USB3 NULL pointer dereference at logical disconnect.

Hub driver will try to disable a USB3 device twice at logical disconnect,
racing with xhci_free_dev() callback from the first port disable.

This can be triggered with "udisksctl power-off --block-device <disk>"
or by writing "1" to the "remove" sysfs file for a USB3 device
in 4.17-rc4.

USB3 devices don't have a similar disabled link state as USB2 devices,
and use a U3 suspended link state instead. In this state the port
is still enabled and connected.

hub_port_connect() first disconnects the device; later, when it notices
that the device is still enabled (due to the U3 state), it will try to
disable the port again (set it to U3).

The xhci_free_dev() called during device disable is async, so the check
for an existing xhci->devs[i] when setting the link state to U3 the
second time succeeded even if the device was being freed.

The regression was caused by, and the whole issue revealed by,
commit 44a182b9d177 ("xhci: Fix use-after-free in xhci_free_virt_device"),
which sets xhci->devs[i]->udev to NULL before xhci_virt_dev() returns,
causing a NULL pointer dereference the second time we try to set U3.

Fix this by checking xhci->devs[i]->udev exists before setting link state.
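
A sketch of the added guard (simplified from the hub-control path;
error handling elided):

    slot_id = xhci_find_slot_id_by_port(hcd, xhci, wIndex + 1);
    if (!slot_id || !xhci->devs[slot_id] ||
        !xhci->devs[slot_id]->udev) {
        xhci_warn(xhci, "No device to set U3 link state for\n");
        goto error;
    }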

The original patch went to stable so this fix needs to be applied there as
well.

Fixes: 44a182b9d177 ("xhci: Fix use-after-free in xhci_free_virt_device")
Cc: <stable@vger.kernel.org>
Reported-by: Jordan Glover <Golden_Miller83@protonmail.ch>
Tested-by: Jordan Glover <Golden_Miller83@protonmail.ch>
Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 2278446e2b7cd33ad894b32e7eb63afc7db6c86e)

Orabug: 27426023
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
   Hand merge the change since multiple patches have been added to
   xhci-hub.c since this cherry pick became available.

Signed-off-by: Joe Moriarty <joe.moriarty@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
7 years agomlx4_core: restore optimal ICM memory allocation
Eric Dumazet [Thu, 31 May 2018 12:52:24 +0000 (05:52 -0700)]
mlx4_core: restore optimal ICM memory allocation

Commit 1383cb8103bb ("mlx4_core: allocate ICM memory in page size chunks")
brought two regressions caught in our regression suite.

The big one is an additional cost of 256 bytes of overhead per 4096
bytes, or 6.25%, which is unacceptable since ICM can be pretty large.

This comes from having to allocate one struct mlx4_icm_chunk (256 bytes)
per MLX4_TABLE_CHUNK, which the buggy commit shrank to 4KB
(instead of the prior 256KB).

Note that mlx4_alloc_icm() is already able to try high order allocations
and fallback to low-order allocations under high memory pressure.

Most of these allocations happen right after boot time, when we have
plenty of non-fragmented memory; there is really no point in being so
pessimistic and breaking huge pages into order-0 ones just for fun.

We only have to tweak gfp_mask a bit, to help falling back faster,
without risking OOM killings.
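
Conceptually the tweak looks like this (a sketch, not the exact diff):
attempt the high order opportunistically, fail fast instead of
reclaiming, and fall back to order-0.

    /* opportunistic high-order allocation: no reclaim/compaction,
     * no warning; just fall back to single pages on failure */
    gfp_t mask = gfp_mask | __GFP_NOWARN;

    if (order)
        mask |= __GFP_NORETRY;
    page = alloc_pages_node(node, mask, order);
    if (!page && order > 0)
        page = alloc_pages_node(node, gfp_mask, 0);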

The second regression is a KASAN fault that will need further
investigation.

Fixes: 1383cb8103bb ("mlx4_core: allocate ICM memory in page size chunks")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Tariq Toukan <tariqt@mellanox.com>
Cc: John Sperbeck <jsperbeck@google.com>
Cc: Tarick Bedeir <tarick@google.com>
Cc: Qing Huang <qing.huang@oracle.com>
Cc: Daniel Jurgens <danielj@mellanox.com>
Cc: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry-picked from upstream commit
885892fb378dc096693557ba4f2b875188619b36)

Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
  replaced __GFP_DIRECT_RECLAIM with __GFP_WAIT due to kernel difference
  between UEK4 and upstream

Orabug: 27718303

Signed-off-by: Qing Huang <qing.huang@oracle.com>
Tested-by: June Wang <june.wang@oracle.com>
Tested-by: Rose Wang <rose.wang@oracle.com>
Reviewed-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
7 years agomlx4_core: allocate ICM memory in page size chunks
Qing Huang [Wed, 23 May 2018 23:22:46 +0000 (16:22 -0700)]
mlx4_core: allocate ICM memory in page size chunks

When a system is under memory pressure (high usage with fragmentation),
the original 256KB ICM chunk allocations will likely force kernel
memory management into its slow path, doing memory compaction/migration
ops in order to complete high-order memory allocations.

When that happens, user processes calling uverbs APIs may easily get
stuck for more than 120s, even though there are a lot of free pages
in smaller chunks available in the system.

Syslog:
...
Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
oracle_205573_e:205573 blocked for more than 120 seconds.
...

With 4KB ICM chunk size on x86_64 arch, the above issue is fixed.

However in order to support smaller ICM chunk size, we need to fix
another issue in large size kcalloc allocations.

E.g.
Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk
size, each ICM chunk can only hold 512 mtt entries (8 bytes for each mtt
entry). So we need a 16MB allocation for a table->icm pointer array to
hold 2M pointers which can easily cause kcalloc to fail.

The solution is to use kvzalloc to replace kcalloc which will fall back
to vmalloc automatically if kmalloc fails.
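
The allocation change is then (a sketch; UEK4 predates kvzalloc(), so
an equivalent backported helper is assumed here):

    /* 2M pointers * 8 bytes = 16MB: too large for kmalloc under
     * fragmentation, so fall back to vmalloc transparently */
    table->icm = kvzalloc(num_icm * sizeof(*table->icm), GFP_KERNEL);
    if (!table->icm)
        return -ENOMEM;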

Signed-off-by: Qing Huang <qing.huang@oracle.com>
Acked-by: Daniel Jurgens <danielj@mellanox.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry-picked from upstream commit
1383cb8103bb166e50cbab1543bb3b5118fccf82)

Conflicts:
  drivers/net/ethernet/mellanox/mlx4/icm.c

Orabug: 27718303

Signed-off-by: Qing Huang <qing.huang@oracle.com>
Tested-by: June Wang <june.wang@oracle.com>
Tested-by: Rose Wang <rose.wang@oracle.com>
Reviewed-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
7 years agokernel/signal.c: avoid undefined behaviour in kill_something_info
mridula shastry [Fri, 25 May 2018 16:34:06 +0000 (09:34 -0700)]
kernel/signal.c: avoid undefined behaviour in kill_something_info

When running kill(72057458746458112, 0) in userspace I hit the following
issue.

  UBSAN: Undefined behaviour in kernel/signal.c:1462:11
  negation of -2147483648 cannot be represented in type 'int':
  CPU: 226 PID: 9849 Comm: test Tainted: G    B          ---- -------   3.10.0-327.53.58.70.x86_64_ubsan+ #116
  Hardware name: Huawei Technologies Co., Ltd. RH8100 V3/BC61PBIA, BIOS BLHSV028 11/11/2014
  Call Trace:
    dump_stack+0x19/0x1b
    ubsan_epilogue+0xd/0x50
    __ubsan_handle_negate_overflow+0x109/0x14e
    SYSC_kill+0x43e/0x4d0
    SyS_kill+0xe/0x10
    system_call_fastpath+0x16/0x1b

Add code to avoid the UBSAN detection.
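
The guard is a two-liner in kill_something_info() (matching the fix as
described above):

    /* negating INT_MIN is undefined, and no such pid exists anyway */
    if (pid == INT_MIN)
        return -ESRCH;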

[akpm@linux-foundation.org: tweak comment]
Link: http://lkml.kernel.org/r/1496670008-59084-1-git-send-email-zhongjiang@huawei.com
Signed-off-by: zhongjiang <zhongjiang@huawei.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(Cherry-picked from commit 4ea77014af0d)

Orabug: 28078687
CVE: CVE-2018-10124

Signed-off-by: mridula shastry <mridula.c.shastry@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
7 years agords: tcp: compute m_ack_seq as offset from ->write_seq
Sowmini Varadhan [Thu, 18 Jan 2018 21:11:07 +0000 (13:11 -0800)]
rds: tcp: compute m_ack_seq as offset from ->write_seq

rds-tcp uses m_ack_seq to track the tcp ack# that indicates
that the peer has received a rds_message. The m_ack_seq is
used in rds_tcp_is_acked() to figure out when it is safe to
drop the rds_message from the RDS retransmit queue.

The m_ack_seq must be calculated as an offset from the right
edge of the in-flight tcp buffer, i.e., it should be based on
the ->write_seq, not the ->snd_nxt.
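
Conceptually the change looks like this (a sketch based on the upstream
commit; helper and field names approximate):

    static u32 rds_tcp_write_seq(struct rds_tcp_connection *tc)
    {
        /* the right edge of all data queued on the socket */
        return tcp_sk(tc->t_sock->sk)->write_seq;
    }

    /* in rds_tcp_xmit(), when the first byte of rm is sent: */
    tc->t_last_sent_nxt = rds_tcp_write_seq(tc);  /* was snd_nxt */
    rm->m_ack_seq = tc->t_last_sent_nxt +
                    sizeof(struct rds_header) +
                    be32_to_cpu(rm->m_inc.i_hdr.h_len) - 1;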

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit b589513e6354a5fd6934823b7fd66bffad41137a)

Orabug: 28085214

Signed-off-by: Hans Westgaard Ry <hans.westgaard.ry@oracle.com>
Reviewed-by: Qing Huang <qing.huang@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
7 years agoext4: fix bitmap position validation
Lukas Czerner [Tue, 24 Apr 2018 15:31:44 +0000 (11:31 -0400)]
ext4: fix bitmap position validation

[ Upstream commit 22be37acce25d66ecf6403fc8f44df9c5ded2372 ]

Currently in ext4_valid_block_bitmap() we expect the bitmap to be
positioned anywhere between 0 and s_blocksize clusters, but that's
wrong because the bitmap can be placed anywhere in the block group. This
causes false positives when validating bitmaps on perfectly valid file
system layouts. Fix it by checking whether the bitmap is within the group
boundary.
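
The corrected bound is the group's cluster count rather than the block
size (a sketch of the check; variable names approximate):

    /* the bitmap may live anywhere inside its own block group */
    max_bit = EXT4_CLUSTERS_PER_GROUP(sb);
    blk = ext4_block_bitmap(sb, desc);
    offset = blk - group_first_block;
    if (offset < 0 || EXT4_B2C(sbi, offset) >= max_bit)
        return blk;    /* bitmap placed outside the group: invalid */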

The problem can be reproduced using the following:

mkfs -t ext3 -E stride=256 /dev/vdb1
mount /dev/vdb1 /mnt/test
cd /mnt/test
wget https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.16.3.tar.xz
tar xf linux-4.16.3.tar.xz

This will result in the warnings in the logs

EXT4-fs error (device vdb1): ext4_validate_block_bitmap:399: comm tar: bg 84: block 2774529: invalid block bitmap

[ Changed slightly for clarity and to not drop a overflow test -- TYT ]

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reported-by: Ilya Dryomov <idryomov@gmail.com>
Fixes: 7dac4a1726a9 ("ext4: add validity checks for bitmap block numbers")
Cc: stable@vger.kernel.org
(cherry picked from commit 22be37acce25d66ecf6403fc8f44df9c5ded2372)

Orabug: 28167032
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
7 years agonet/rds: Fix bug in failover_group parsing
Håkon Bugge [Mon, 18 Jun 2018 11:39:14 +0000 (13:39 +0200)]
net/rds: Fix bug in failover_group parsing

When supplying a module parameter value for the fail-over group as:

ib0,ib1;ib2,ib3;ib4,ib5;ib6,ib7;ib8,ib9

rdmaip / rds_rdma is unable to parse it correctly. Based on debug
prints, it does:

lab38 kernel: RDS/IB: ib0 is designated group  1
lab38 kernel: RDS/IB: ib1 is designated group  1
lab38 kernel: RDS/IB: ib2 is designated group  2
(no more prints).

This implies that, for Xn-8 systems, fail-over will not be distributed
correctly. We see:

lab38 kernel: RDS/IP: IP 192.2.23.100 migrated from ib0 to ib1:P01
lab38 kernel: RDS/IP: IP 192.2.23.102 migrated from ib2 to ib1:P03
lab38 kernel: RDS/IP: IP 192.2.23.104 migrated from ib4 to ib3:P05
lab38 kernel: RDS/IP: IP 192.2.23.106 migrated from ib6 to ib3:P07
lab38 kernel: RDS/IP: IP 192.2.23.108 migrated from ib8 to ib3:P09

That is, all fail-overs except for ib0 are migrated to ib3.

This commit fixes this. The following values for the module parameter
have been tested (in user space):

ib0
ib0,ib1
ib0,ib1,ib2
ib0,ib1,ib2,ib3
ib0;ib1
ib0,ib1;ib2,ib3
ib0,ib1,ib2;ib3,ib4,ib5
ib0,ib1,ib2,ib3;ib4,ib5,ib6,ib7
ib0;ib1;ib2
ib0,ib1;ib2,ib3;ib4,ib5
ib0,ib1,ib2;ib3,ib4,ib5;ib6,ib7,ib8
ib0,ib1,ib2,ib3;ib4,ib5,ib6,ib7;ib8,ib9,ib10,ib11
ib0;ib1;ib2;ib3
ib0,ib1;ib2,ib3;ib4,ib5;ib6,ib7
ib0,ib1,ib2;ib3,ib4,ib5;ib6,ib7,ib8;ib9,ib10,ib11
ib0,ib1,ib2,ib3;ib4,ib5,ib6,ib7;ib8,ib9,ib10,ib11;ib12,ib13,ib14,ib15
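
For reference, a sketch of the ';'/',' tokenization that handles all of
the above (illustrative only; the in-kernel parser differs in detail):

    char *groups = param, *grp, *name;
    unsigned int grp_no = 0;

    while ((grp = strsep(&groups, ";")) != NULL) {
        grp_no++;
        while ((name = strsep(&grp, ",")) != NULL)
            pr_debug("RDS/IB: %s is designated group %u\n",
                     name, grp_no);
    }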

Orabug: 28198749

Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Avinash Repaka <avinash.repaka@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@oracle.com>
(inspired by uek-5-next commit bf8cd0080482fa23ee859cc6118c964693cb3a72)

Signed-off-by: Brian Maly <brian.maly@oracle.com>
7 years agosctp: verify size of a new chunk in _sctp_make_chunk()
Alexey Kodanev [Fri, 9 Feb 2018 14:35:23 +0000 (17:35 +0300)]
sctp: verify size of a new chunk in _sctp_make_chunk()

When SCTP makes an INIT or INIT_ACK packet, the total chunk length
can exceed SCTP_MAX_CHUNK_LEN, which leads to a kernel panic when
transmitting these packets, e.g. the crash on sending INIT_ACK:

[  597.804948] skbuff: skb_over_panic: text:00000000ffae06e4 len:120168
               put:120156 head:000000007aa47635 data:00000000d991c2de
               tail:0x1d640 end:0xfec0 dev:<NULL>
...
[  597.976970] ------------[ cut here ]------------
[  598.033408] kernel BUG at net/core/skbuff.c:104!
[  600.314841] Call Trace:
[  600.345829]  <IRQ>
[  600.371639]  ? sctp_packet_transmit+0x2095/0x26d0 [sctp]
[  600.436934]  skb_put+0x16c/0x200
[  600.477295]  sctp_packet_transmit+0x2095/0x26d0 [sctp]
[  600.540630]  ? sctp_packet_config+0x890/0x890 [sctp]
[  600.601781]  ? __sctp_packet_append_chunk+0x3b4/0xd00 [sctp]
[  600.671356]  ? sctp_cmp_addr_exact+0x3f/0x90 [sctp]
[  600.731482]  sctp_outq_flush+0x663/0x30d0 [sctp]
[  600.788565]  ? sctp_make_init+0xbf0/0xbf0 [sctp]
[  600.845555]  ? sctp_check_transmitted+0x18f0/0x18f0 [sctp]
[  600.912945]  ? sctp_outq_tail+0x631/0x9d0 [sctp]
[  600.969936]  sctp_cmd_interpreter.isra.22+0x3be1/0x5cb0 [sctp]
[  601.041593]  ? sctp_sf_do_5_1B_init+0x85f/0xc30 [sctp]
[  601.104837]  ? sctp_generate_t1_cookie_event+0x20/0x20 [sctp]
[  601.175436]  ? sctp_eat_data+0x1710/0x1710 [sctp]
[  601.233575]  sctp_do_sm+0x182/0x560 [sctp]
[  601.284328]  ? sctp_has_association+0x70/0x70 [sctp]
[  601.345586]  ? sctp_rcv+0xef4/0x32f0 [sctp]
[  601.397478]  ? sctp6_rcv+0xa/0x20 [sctp]
...

Here the chunk size for the INIT_ACK packet becomes too big, mostly
because of the state cookie (the INIT packet has a large size with
many address parameters), plus additional server parameters.

Later this chunk causes the panic in skb_put_data():

  skb_packet_transmit()
      sctp_packet_pack()
          skb_put_data(nskb, chunk->skb->data, chunk->skb->len);

'nskb' (head skb) was previously allocated with packet->size
from u16 'chunk->chunk_hdr->length'.

As suggested by Marcelo, we should check the chunk's length in
_sctp_make_chunk() before trying to allocate an skb for it, and
discard the chunk if its size is bigger than SCTP_MAX_CHUNK_LEN.
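
The added guard in _sctp_make_chunk() is roughly (a sketch; the chunk
header struct is named sctp_chunkhdr_t on older kernels):

    /* the chunk header length field is a u16 and the transmit path
     * trusts it, so refuse over-sized chunks up front */
    if (sizeof(struct sctp_chunkhdr) + paylen > SCTP_MAX_CHUNK_LEN)
        goto nodata;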

Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leinter@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 07f2c7ab6f8d0a7e7c5764c4e6cc9c52951b9d9c)

Orabug: 28240074
CVE: CVE-2018-5803

Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
net/sctp/sm_make_chunk.c

Signed-off-by: Brian Maly <brian.maly@oracle.com>
7 years agonetfilter: xt_TCPMSS: add more sanity tests on tcph->doff
Eric Dumazet [Mon, 3 Apr 2017 17:55:11 +0000 (10:55 -0700)]
netfilter: xt_TCPMSS: add more sanity tests on tcph->doff

Orabug: 27896802
CVE: CVE-2017-18017

Denys provided an awesome KASAN report pointing to a use-after-free
in xt_TCPMSS.

I have provided three patches to fix this issue, either in xt_TCPMSS or
in xt_tcpudp.c. It seems the xt_TCPMSS patch has the smallest possible
impact.
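
The added sanity test is along these lines (a sketch per the upstream
patch; surrounding variables as in tcpmss_mangle_packet()):

    tcp_hdrlen = tcph->doff * 4;
    /* doff can be forged; a value below the minimal TCP header would
     * send the option-walking loop out of bounds */
    if (len < tcp_hdrlen || tcp_hdrlen < sizeof(struct tcphdr))
        return -1;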

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Denys Fedoryshchenko <nuclearcat@nuclearcat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
(cherry picked from commit 2638fd0f92d4397884fd991d8f4925cb3f081901)
Signed-off-by: Brian Maly <brian.maly@oracle.com>