x86/bugs: Don't lie when fallback retpoline is engaged

That is, we actually have two mitigations in effect: retpoline
and IBRS.

OraBug: 27897282

Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: John Haxby <john.haxby@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

fs: aio: fix the increment of aio-nr and counting against aio-max-nr

Currently, aio-nr is incremented in steps of 'num_possible_cpus() * 8'
for io_setup(nr_events, ..) with 'nr_events < num_possible_cpus() * 4':

    ioctx_alloc()
    ...
        nr_events = max(nr_events, num_possible_cpus() * 4);
        nr_events *= 2;
    ...
        ctx->max_reqs = nr_events;
    ...
        aio_nr += ctx->max_reqs;
    ....

This limits the number of aio contexts actually available to much less
than aio-max-nr, and is increasingly worse with greater number of CPUs.

For example, with 64 CPUs, only 256 aio contexts are actually available
(with aio-max-nr = 65536) because the increment is 512 in that scenario.

Note: 65536 [max aio contexts] / (64*4*2) [increment per aio context]
is 128, but make it 256 (double) as counting against 'aio-max-nr * 2':

    ioctx_alloc()
    ...
        if (aio_nr + nr_events > (aio_max_nr * 2UL) ||
        ...
            goto err_ctx;
    ...

This patch uses the original value of nr_events (from userspace) to
increment aio-nr and count against aio-max-nr, which resolves those.
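
The arithmetic above can be checked with a small user-space model. This is
only a sketch and not the kernel code: contexts_before() and contexts_after()
are hypothetical helpers, and the "after" case assumes the limit is plain
aio-max-nr, as the paragraph above describes.

```c
#include <assert.h>

static unsigned long max_ul(unsigned long a, unsigned long b)
{
    return a > b ? a : b;
}

/* Old accounting: charge the scaled-up value against aio_max_nr * 2. */
static unsigned long contexts_before(unsigned long aio_max_nr,
                                     unsigned long ncpus,
                                     unsigned long nr_events)
{
    unsigned long charged, aio_nr = 0, contexts = 0;

    assert(nr_events > 0);
    charged = max_ul(nr_events, ncpus * 4) * 2;
    while (aio_nr + charged <= aio_max_nr * 2UL) {
        aio_nr += charged;
        contexts++;
    }
    return contexts;
}

/* New accounting (assumed): charge the userspace value against aio_max_nr. */
static unsigned long contexts_after(unsigned long aio_max_nr,
                                    unsigned long nr_events)
{
    unsigned long aio_nr = 0, contexts = 0;

    assert(nr_events > 0);
    while (aio_nr + nr_events <= aio_max_nr) {
        aio_nr += nr_events;
        contexts++;
    }
    return contexts;
}
```

With aio-max-nr = 65536, 64 CPUs and nr_events = 1, the old model admits 256
contexts, matching the example above, while the new model admits one context
per requested event.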

Signed-off-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
Reported-by: Lekshmi C. Pillai <lekshmi.cpillai@in.ibm.com>
Tested-by: Lekshmi C. Pillai <lekshmi.cpillai@in.ibm.com>
Tested-by: Paul Nguyen <nguyenp@us.ibm.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Orabug: 28079082

(cherry picked from commit 2a8a98673c13cb2a61a6476153acf8344adfa992)
Signed-off-by: Rajan Shanmugavelu <rajan.shanmugavelu@oracle.com>
Reviewed-by: Calum Mackay <calum.mackay@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

qla2xxx: Enable buffer boundary check when DIF bundling is on.

The QLogic firmware requires that the upper 32 bits of a DMA buffer
address do not change within a buffer. This can happen when DIF bundling
is enabled and the SGE buffer address plus length crosses into a
different upper 32-bit address; in that case a local buffer is used for
the DIF information.

Orabug: 28130589

Co-authored-by: Giri Malavali <giridhar.malavali@cavium.com>
Co-authored-by: Joe Carnuccio <joe.carnuccio@cavium.com>
Reviewed-by: Giri Malavali <giridhar.malavali@cavium.com>
Signed-off-by: Rajan Shanmugavelu <rajan.shanmugavelu@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

kernel: sys.c: missing break for prctl spec ctrl

In the process of backporting speculation control bits a break was missed which
is causing prctl with PR_SET_SPECULATION_CTRL to also fail.
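
How a missing break produces this kind of failure can be sketched in
isolation. The option numbers and handlers here are hypothetical, and the
exact position of the missing break in the UEK4 sys.c is not stated in this
message; the sketch only assumes the case fell through into the default
branch.

```c
#include <errno.h>

/* Hypothetical option numbers, chosen only for this sketch. */
enum { OPT_GET_SPEC = 1, OPT_SET_SPEC = 2 };

static int handle_get_spec(void) { return 0; }
static int handle_set_spec(void) { return 0; }

/* Buggy dispatch: OPT_SET_SPEC runs its handler, then hits default. */
static int dispatch_buggy(int option)
{
    int error;

    switch (option) {
    case OPT_GET_SPEC:
        error = handle_get_spec();
        break;
    case OPT_SET_SPEC:
        error = handle_set_spec();
        /* fall through - the break is missing here */
    default:
        error = -EINVAL;
        break;
    }
    return error;
}

/* Fixed dispatch: the handler's result is returned to the caller. */
static int dispatch_fixed(int option)
{
    int error;

    switch (option) {
    case OPT_GET_SPEC:
        error = handle_get_spec();
        break;
    case OPT_SET_SPEC:
        error = handle_set_spec();
        break;
    default:
        error = -EINVAL;
        break;
    }
    return error;
}
```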

Orabug: 28144775

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/bugs/IBRS: Keep SSBD mitigation in effect if spectre_v2=ibrs is selected

From: Boris Ostrovsky <boris.ostrovsky@oracle.com>

If the system admin chooses to disable memory disambiguation at bootup
(spec_store_bypass_disable=on) and to enable IBRS (spectre_v2=ibrs), we
end up briefly disabling memory disambiguation at bootup, and then the
IBRS SPEC_CTRL handling kicks in and memory disambiguation is enabled again.

The logic is there for the 'auto' case, but we missed it for
the other ones. Let's fix it up.

OraBug: 28071800

Fixes: 89981b51b9240ec16e506304990ce2311e93285b ("x86/speculation: Add prctl for Speculative Store Bypass mitigation")
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

fs/pstore: update the backend parameter in pstore module

This patch updates the module parameter backend, so it is visible
through /sys/module/pstore/parameters/backend.

For example:
if pstore backend is ramoops, with this patch:
# cat /sys/module/pstore/parameters/backend
ramoops
and without this patch:
# cat /sys/module/pstore/parameters/backend
(null)

Signed-off-by: Wang Long <long.wanglong@huawei.com>
Acked-by: Mark Salyzyn <salyzyn@android.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
(cherry picked from commit 42222c2a5d5da7fe4839491d5c44034f40761071)
Orabug: 27994372
Signed-off-by: Eric DeVolder <eric.devolder@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

kvm: vmx: Reinstate support for CPUs without virtual NMI

[ Upstream commit 8a1b43922d0d1279e7936ba85c4c2a870403c95f ]

This is more or less a revert of commit 2c82878b0cb3 ("KVM: VMX: require
virtual NMI support", 2017-03-27); it turns out that Core 2 Duo machines
only had virtual NMIs in some SKUs.

The revert is not trivial because in the meanwhile there have been several
fixes to nested NMI injection.  Therefore, the entire vNMI state is moved
to struct loaded_vmcs.

Another change compared to before the patch is a simplification here:

       if (unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked &&
           !(is_guest_mode(vcpu) && nested_cpu_has_virtual_nmis(
                                       get_vmcs12(vcpu))))) {

The final condition here is always true (because nested_cpu_has_virtual_nmis
is always false) and is removed.

Fixes: 2c82878b0cb38fd516fd612c67852a6bbf282003
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1490803
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
(cherry picked from commit 8a1b43922d0d1279e7936ba85c4c2a870403c95f)

Orabug: 28041210

Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/kvm/vmx.c

Conflicts are due to the fact that a large number of patches affecting this
area of the code have not been ported to UEK4-QU7. Portions of the
cherry-picked patch had to be manually inserted in the correct places, but
no logical changes were required.

Signed-off-by: Brian Maly <brian.maly@oracle.com>

dm crypt: add big-endian variant of plain64 IV

The big-endian IV (plain64be) is needed to map images from extracted
disks that are used in some external (on-chip FDE) disk encryption
drives, e.g.: data recovery from external USB/SATA drives that support
"internal" encryption.
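
As an illustration of what a big-endian plain64 IV looks like, here is a
user-space sketch. plain64be_iv() is a hypothetical helper, and placing the
sector number in the last 8 bytes of a zeroed IV is an assumption of this
sketch; the message above does not spell out the layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Zero the IV, then store the 64-bit sector number most significant
 * byte first in the final 8 bytes (assumed layout).
 */
static void plain64be_iv(uint8_t *iv, size_t iv_size, uint64_t sector)
{
    size_t i;

    assert(iv_size >= sizeof(uint64_t));
    memset(iv, 0, iv_size);
    for (i = 0; i < sizeof(uint64_t); i++)
        iv[iv_size - 1 - i] = (uint8_t)(sector >> (8 * i));
}
```

For a 16-byte IV and sector 0x0102030405060708, bytes 0..7 stay zero and
bytes 8..15 hold 01 02 03 04 05 06 07 08.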

Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
(cherry picked from commit 7e3fd855ad66ffc0dd926911da23dd21e59f9462)
Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
Tested-by: Qiang Wang <qiang.z.wang@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Orabug: 28043932
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/md/dm-crypt.c

Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/bugs: Rename SSBD_NO to SSB_NO

The "336996 Speculative Execution Side Channel Mitigations" from
May defines this as SSB_NO, hence let's sync up.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Orabug: 28063992
CVE: CVE-2018-3639

(cherry picked from commit 240da953)
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/include/asm/msr-index.h
[msr-index.h: different file location]

Signed-off-by: Brian Maly <brian.maly@oracle.com>

KVM: SVM: Implement VIRT_SPEC_CTRL support for SSBD

Expose the new virtualized architectural mechanism, VIRT_SSBD, for using
speculative store bypass disable (SSBD) under SVM. This will allow guests
to use SSBD on hardware that uses non-architectural mechanisms for enabling
SSBD.

[ tglx: Folded the migration fixup from Paolo Bonzini ]

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Orabug: 28063992
CVE: CVE-2018-3639

(cherry picked from commit bc226f07)
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/include/asm/kvm_host.h
arch/x86/kernel/cpu/common.c
arch/x86/kvm/cpuid.c
arch/x86/kvm/svm.c
arch/x86/kvm/vmx.c
arch/x86/kvm/x86.c
[
We did not have cpu_has_high_real_mode_segbase entry at all.
Also msr_info is not in this patchset, I will take care of it
in Orabug: 28069548 in a future patchset.
]

Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/speculation, KVM: Implement support for VIRT_SPEC_CTRL/LS_CFG

Add the necessary logic for supporting the emulated VIRT_SPEC_CTRL MSR to
x86_virt_spec_ctrl(). If either X86_FEATURE_LS_CFG_SSBD or
X86_FEATURE_VIRT_SPEC_CTRL is set then use the new guest_virt_spec_ctrl
argument to check whether the state must be modified on the host. The
update reuses speculative_store_bypass_update() so the ZEN-specific sibling
coordination can be reused.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Orabug: 28063992
CVE: CVE-2018-3639

(cherry picked from commit 47c61b39)
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/include/asm/spec-ctrl.h
arch/x86/kernel/cpu/bugs.c
[
bugs.c: different file name
spec-ctrl.h: contextual changes. Diff couldn't be applied
]

Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/bugs: Rework spec_ctrl base and mask logic

x86_spec_ctrl_mask is intended to mask out bits from a MSR_SPEC_CTRL value
which are not to be modified. However the implementation is not really used
and the bitmask was inverted to make a check easier, which was removed in
"x86/bugs: Remove x86_spec_ctrl_set()"

Aside from that, it is missing the STIBP bit if it is supported by the
platform, so if the mask were used in x86_virt_spec_ctrl() it would
prevent a guest from setting STIBP.

Add the STIBP bit if supported and use the mask in x86_virt_spec_ctrl() to
sanitize the value which is supplied by the guest.
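
The sanitizing step can be sketched as follows. guest_spec_ctrl_value() is a
hypothetical helper, and the sketch assumes the mask holds the bits a guest
is allowed to modify: host bits outside the mask are kept, guest bits inside
the mask are taken.

```c
#include <stdint.h>

/* SPEC_CTRL bit positions. */
#define SPEC_CTRL_IBRS  (UINT64_C(1) << 0)
#define SPEC_CTRL_STIBP (UINT64_C(1) << 1)
#define SPEC_CTRL_SSBD  (UINT64_C(1) << 2)

/* Combine the host base value with the guest request under a mask. */
static uint64_t guest_spec_ctrl_value(uint64_t host_base, uint64_t mask,
                                      uint64_t guest_request)
{
    uint64_t val = host_base & ~mask;   /* bits the guest may not touch */

    val |= guest_request & mask;        /* bits the guest may set */
    return val;
}
```

With a mask that lacks STIBP, a guest request for STIBP is silently dropped,
which is the problem described above; adding STIBP to the mask lets it
through.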

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Orabug: 28063992
CVE: CVE-2018-3639

(cherry picked from commit be6fcb54)
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/kernel/cpu/bugs.c

[Konrad:
As we have the IBRS support and boy that makes it double hard.
The first part of this patch is to invert the mask, no biggie.

But the mask for the IBRS mode (that is - we want to set SPEC_CTRL
MSR to 1<<0 in kernel space, but in user-space we want it to be
1<<1) didn't set the SSBD bit as we should not set the SSBD
in kernel mode. But with the inversion that is OK.

Next part is the two values - x86_spec_ctrl_base and x86_spec_ctrl_priv.

The x86_spec_ctrl_base is what userspace is going to have (so
tack on SSBD), and x86_spec_ctrl_priv what runs in kernel (so
tack on IBRS, but _NOT_ SSBD).

That means the whole logic of filtering the supported SPEC_CTRL
value depending on what the host supports should be seeded
with x86_spec_ctrl_priv.

With all that the logic works - we end up ANDing our mask
and what we can support (and the initial boot-time value of the
MSR), and then ORing what the guest wants with our mask.

All the while supporting any other bits in the SPEC_CTRL that
may come in the future.

And this logic is fine on AMD too - where the SSBD bit does not
show up in the SPEC_CTRL mask

P.S.
To make it more fun the x86_spec_ctrl_priv |= IBRS is set in a
header (see set_ibrs_inuse).]

Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/bugs: Expose x86_spec_ctrl_base directly

x86_spec_ctrl_base is the system wide default value for the SPEC_CTRL MSR.
x86_spec_ctrl_get_default() returns x86_spec_ctrl_base and was intended to
prevent modification to that variable. However, the variable is already
read-only after init and globally visible.

Remove the function and export the variable instead.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Orabug: 28063992
CVE: CVE-2018-3639

(cherry picked from commit fa8ac498)
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/include/asm/nospec-branch.h
arch/x86/include/asm/spec-ctrl.h
arch/x86/kernel/cpu/bugs.c
[Contextual changes: things weren't in the expected place]

Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/bugs: Unify x86_spec_ctrl_{set_guest,restore_host}

Function bodies are very similar and are going to grow more almost
identical code. Add a bool arg to determine whether SPEC_CTRL is being set
for the guest or restored to the host.

No functional changes.

Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Orabug: 28063992
CVE: CVE-2018-3639

(cherry picked from commit cc69b349)
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/kernel/cpu/bugs.c
[Different filename bugs_64.c]

Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/speculation: Rework speculative_store_bypass_update()

The upcoming support for the virtual SPEC_CTRL MSR on AMD needs to reuse
speculative_store_bypass_update() to avoid code duplication. Add an
argument for supplying a thread info (TIF) value and create a wrapper
speculative_store_bypass_update_current() which is used at the existing
call site.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Orabug: 28063992
CVE: CVE-2018-3639

(cherry picked from commit 0270be3e)
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/kernel/cpu/bugs.c
[Different filename (bugs_64.c)]

Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/speculation: Add virtualized speculative store bypass disable support

Some AMD processors only support a non-architectural means of enabling
speculative store bypass disable (SSBD). To allow a simplified view of
this to a guest, an architectural definition has been created through a new
CPUID bit, 0x80000008_EBX[25], and a new MSR, 0xc001011f. With this, a
hypervisor can virtualize the existence of this definition and provide an
architectural method for using SSBD to a guest.

Add the new CPUID feature, the new MSR and update the existing SSBD
support to use this MSR when present.
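
A minimal sketch of the feature test, using the CPUID bit and MSR number
quoted above. cpuid_has_virt_ssbd() and the macro names are hypothetical, and
the caller is assumed to have already read EBX of CPUID leaf 0x80000008.

```c
#include <stdbool.h>
#include <stdint.h>

#define VIRT_SPEC_CTRL_MSR 0xc001011fu /* MSR number from the text */
#define VIRT_SSBD_EBX_BIT  25          /* CPUID 0x80000008 EBX bit from the text */

/* Return true if the EBX value advertises the virtualized SSBD definition. */
static bool cpuid_has_virt_ssbd(uint32_t ebx_80000008)
{
    return ((ebx_80000008 >> VIRT_SSBD_EBX_BIT) & 1u) != 0;
}
```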

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Orabug: 28063992
CVE: CVE-2018-3639

(cherry picked from commit 11fb0683)
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/include/asm/cpufeatures.h
arch/x86/include/asm/msr-index.h
arch/x86/kernel/cpu/bugs.c
arch/x86/kernel/process.c
[
cpufeatures.h: different file name and different index
msr-index.h: different file location
bugs.c: different file name
process.c: different file structure
common.c: This is because we skipped the first two patches from the patch
series. We do not have enough feature bits to align all the actual features in
our cpufeature structure. We created a synthetic feature and this is where we
detect and set it.
]

Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/bugs, KVM: Extend speculation control for VIRT_SPEC_CTRL

AMD is proposing a VIRT_SPEC_CTRL MSR to handle the Speculative Store
Bypass Disable via MSR_AMD64_LS_CFG so that guests do not have to care
about the bit position of the SSBD bit and thus facilitate migration.
Also, the sibling coordination on Family 17H CPUs can only be done on
the host.

Extend x86_spec_ctrl_set_guest() and x86_spec_ctrl_restore_host() with an
extra argument for the VIRT_SPEC_CTRL MSR.

Hand in 0 from VMX and in SVM add a new virt_spec_ctrl member to the CPU
data structure which is going to be used in later patches for the actual
implementation.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Orabug: 28063992
CVE: CVE-2018-3639

(cherry picked from commit ccbcd267)
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/kernel/cpu/bugs.c
arch/x86/kvm/svm.c
arch/x86/kvm/vmx.c
[
We skipped cherry-picking two commits from upstream:
x86/cpufeatures: Disentangle SSBD enumeration ...
x86/cpufeatures: Disentangle MSR_SPEC_CTRL enumeration from IBRS ...
x86/speculation: Use synthetic bits for IBRS/IBPB/STIBP ...

Because we do not have enough space in word 7 of synthetic bits for cpufeature. Also
we do not have the word 13 to move the bits around (like upstream did). We
cannot add word 13 because we will break kABI. So we do not have VIRT_SPEC_CTRL
cpufeature but we have VIRT_SSBD cpufeature which is a bit in virt_spec_ctrl
MSR.
]

Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/speculation: Handle HT correctly on AMD

The AMD64_LS_CFG MSR is a per core MSR on Family 17H CPUs. That means when
hyperthreading is enabled the SSBD bit toggle needs to take both sibling
threads into account. Otherwise the following situation can happen:

CPU0                CPU1

disable SSB
                    disable SSB
                    enable  SSB <- Enables it for the Core, i.e. for CPU0 as well

So after the SSB enable on CPU1 the task on CPU0 runs with SSB enabled
again.

On Intel the SSBD control is per core as well, but the synchronization
logic is implemented behind the per thread SPEC_CTRL MSR. It works like
this:

CORE_SPEC_CTRL = THREAD0_SPEC_CTRL | THREAD1_SPEC_CTRL

i.e. if one of the threads enables a mitigation then this affects both and
the mitigation is only disabled in the core when both threads disabled it.

Add the necessary synchronization logic for AMD family 17H. Unfortunately
that requires a spinlock to serialize the access to the MSR, but the locks
are only shared between siblings.
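
The coordination idea can be modelled with a per-core counter. This is a
single-threaded sketch with hypothetical names, not the kernel
implementation: it leaves out the spinlock mentioned above and assumes each
sibling calls disable and enable in balanced pairs.

```c
#include <stdbool.h>

/* Hypothetical per-core state shared by the sibling threads. */
struct core_ssb_state {
    unsigned int disable_count; /* siblings that currently want SSB disabled */
    bool ssbd_bit;              /* models the SSBD bit in the per-core MSR */
};

/* A sibling asks for SSB to be disabled: the first request sets the bit. */
static void thread_disable_ssb(struct core_ssb_state *core)
{
    if (core->disable_count++ == 0)
        core->ssbd_bit = true;
}

/* A sibling drops its request: only the last one clears the bit. */
static void thread_enable_ssb(struct core_ssb_state *core)
{
    if (core->disable_count > 0 && --core->disable_count == 0)
        core->ssbd_bit = false;
}
```

In the CPU0/CPU1 sequence shown earlier, the enable on CPU1 leaves the bit
set because CPU0 still holds a request.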

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Orabug: 28063992
CVE: CVE-2018-3639

(cherry picked from commit 1f50ddb4)
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/kernel/process.c
arch/x86/kernel/smpboot.c
[Contextual changes: in UEK4 files are pretty different and do not match the diff]

Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/cpufeatures: Add FEATURE_ZEN

Add a ZEN feature bit so family-dependent static_cpu_has() optimizations
can be built for ZEN.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Orabug: 28063992
CVE: CVE-2018-3639

(cherry picked from commit d1035d97)
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/include/asm/cpufeatures.h
[Different filename and different bit for FEATURE_ZEN]

Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/cpu/AMD: Fix erratum 1076 (CPB bit)

CPUID Fn8000_0007_EDX[CPB] is wrongly 0 on models up to B1. But they do
support CPB (AMD's Core Performance Boosting cpufreq CPU feature), so fix that.

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sherry Hurwitz <sherry.hurwitz@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170907170821.16021-1-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Orabug: 28063992
CVE: CVE-2018-3639

(cherry picked from commit f7f3dc00)
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

perf/hwbp: Simplify the perf-hwbp code, fix documentation

Orabug: 27947602
CVE: CVE-2018-1000199

Annoyingly, modify_user_hw_breakpoint() unnecessarily complicates the
modification of a breakpoint - simplify it and remove the pointless
local variables.

Also update the stale Docbook while at it.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@vger.kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit f67b15037a7a50c57f72e69a6d59941ad90a0f0f)
Signed-off-by: Brian Maly <brian.maly@oracle.com>

Revert "perf/hwbp: Simplify the perf-hwbp code, fix documentation"

Orabug: 27947602

Wrong CVE tag was included in this commit

This reverts commit 2ff296f2eb2475d4c3f206e145c4b2132d86bc2c.

KVM: SVM: Move spec control call after restore of GS

svm_vcpu_run() invokes x86_spec_ctrl_restore_host() after VMEXIT, but
before the host GS is restored. x86_spec_ctrl_restore_host() uses 'current'
to determine the host SSBD state of the thread. 'current' is GS based, but
host GS is not yet restored and the access causes a triple fault.

Move the call after the host GS restore.

OraBug: 28041771
CVE: CVE-2018-3639

Fixes: 885f82bfbc6f x86/process: Allow runtime control of Speculative Store Bypass
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit 15e6c22fd8e5a42c5ed6d487b7c9fe44c2517765)
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/kvm/svm.c

Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/bugs: Fix the parameters alignment and missing void

Fixes: 7bb4d366c ("x86/bugs: Make cpu_show_common() static")
Fixes: 24f7fc83b ("x86/bugs: Provide boot parameters for the spec_store_bypass_disable mitigation")
OraBug: 28041771
CVE: CVE-2018-3639

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
(cherry picked from commit ffed645e3be0e32f8e9ab068d257aee8d0fe8eec)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Conflicts:
arch/x86/kernel/cpu/bugs.c
[It is called bugs_64 in UEK4]

Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/bugs: Make cpu_show_common() static

cpu_show_common() is not used outside of arch/x86/kernel/cpu/bugs.c, so
make it static.

OraBug: 28041771
CVE: CVE-2018-3639

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
(cherry picked from commit 7bb4d366cba992904bffa4820d24e70a3de93e76)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Conflicts:
arch/x86/kernel/cpu/bugs.c
[It is called bugs_64.c]

Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/bugs: Fix __ssb_select_mitigation() return type

__ssb_select_mitigation() returns one of the members of enum ssb_mitigation,
not ssb_mitigation_cmd; fix the prototype to reflect that.

OraBug: 28041771
CVE: CVE-2018-3639

Fixes: 24f7fc83b9204 ("x86/bugs: Provide boot parameters for the spec_store_bypass_disable mitigation")
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
(cherry picked from commit d66d8ff3d21667b41eddbe86b35ab411e40d8c5f)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Conflicts:
arch/x86/kernel/cpu/bugs.c
[It is called bugs_64.c]

Signed-off-by: Brian Maly <brian.maly@oracle.com>

Documentation/spec_ctrl: Do some minor cleanups

Fix some typos, improve formulations, end sentences with a fullstop.

OraBug: 28041771
CVE: CVE-2018-3639

Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
(cherry picked from commit dd0792699c4058e63c0715d9a7c2d40226fcdddc)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

proc: Use underscores for SSBD in 'status'

The style for the 'status' file is either CamelCase or words joined by
underscores ('_').

Fixes: fae1fa0fc ("proc: Provide details on speculation flaw mitigations")
OraBug: 28041771
CVE: CVE-2018-3639

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
(cherry picked from commit e96f46ee8587607a828f783daa6eb5b44d25004d)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/bugs: Rename _RDS to _SSBD

Intel collateral will reference the SSB mitigation bit in IA32_SPEC_CTRL[2]
as SSBD (Speculative Store Bypass Disable).

Hence changing it.

It is unclear yet what the MSR_IA32_ARCH_CAPABILITIES (0x10a) Bit(4) name
is going to be. Following the rename it would be SSBD_NO but that rolls out
to Speculative Store Bypass Disable No.

Also fixed the missing space in X86_FEATURE_AMD_SSBD.

[ tglx: Fixup x86_amd_rds_enable() and rds_tif_to_amd_ls_cfg() as well ]

OraBug: 28041771
CVE: CVE-2018-3639

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
(cherry picked from commit 9f65fb29374ee37856dbad847b4e121aab72b510)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Conflicts:
arch/x86/include/asm/cpufeatures.h
arch/x86/include/asm/msr-index.h
arch/x86/include/asm/spec-ctrl.h
arch/x86/include/asm/thread_info.h
arch/x86/kernel/cpu/bugs.c
arch/x86/kernel/cpu/intel.c
arch/x86/kernel/process.c
arch/x86/kvm/cpuid.c
arch/x86/kvm/vmx.c

Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/speculation: Make "seccomp" the default mode for Speculative Store Bypass

Unless explicitly opted out of, anything running under seccomp will have
SSB mitigations enabled. Choosing the "prctl" mode will disable this.

[ tglx: Adjusted it to the new arch_seccomp_spec_mitigate() mechanism ]

OraBug: 28041771
CVE: CVE-2018-3639

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
(cherry picked from commit f21b53b20c754021935ea43364dbf53778eeba32)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Conflicts:
Documentation/admin-guide/kernel-parameters.txt
[It is called Documentation/kernel-parameters.txt]

arch/x86/include/asm/nospec-branch.h

[Different name..]
arch/x86/kernel/cpu/bugs.c
[And again, bugs_64.c, and also we did provide the SPEC_STORE_BYPASS_USERSPACE]

Signed-off-by: Brian Maly <brian.maly@oracle.com>

seccomp: Move speculation mitigation control to arch code

The mitigation control is simpler to implement in architecture code as it
avoids the extra function call to check the mode. Aside of that having an
explicit seccomp enabled mode in the architecture mitigations would require
even more workarounds.

Move it into architecture code and provide a weak function in the seccomp
code. Remove the 'which' argument as this allows the architecture to decide
which mitigations are relevant for seccomp.

OraBug: 28041771
CVE: CVE-2018-3639

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
(cherry picked from commit 8bf37d8c067bb7eb8e7c381bdadf9bd89182b6bc)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Conflicts:
arch/x86/kernel/cpu/bugs.c
[Which is called bugs_64.c in UEK4]
include/linux/nospec.h

[Which is called nospec-branch.h]

Signed-off-by: Brian Maly <brian.maly@oracle.com>

seccomp: Add filter flag to opt-out of SSB mitigation

If a seccomp user is not interested in Speculative Store Bypass mitigation
by default, it can set the new SECCOMP_FILTER_FLAG_SPEC_ALLOW flag when
adding filters.

OraBug: 28041771
CVE: CVE-2018-3639

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
(cherry picked from commit 00a02d0c502a06d15e07b857f8ff921e3e402675)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Conflicts:
include/linux/seccomp.h
include/uapi/linux/seccomp.h
tools/testing/selftests/seccomp/seccomp_bpf.c
[No eBPF in UEK4]

Signed-off-by: Brian Maly <brian.maly@oracle.com>

seccomp: Use PR_SPEC_FORCE_DISABLE

Use PR_SPEC_FORCE_DISABLE in seccomp() because seccomp does not allow to
widen restrictions.

OraBug: 28041771
CVE: CVE-2018-3639

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
(cherry picked from commit b849a812f7eb92e96d1c8239b06581b2cfd8b275)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

prctl: Add force disable speculation

For certain use cases it is desired to enforce mitigations so they cannot
be undone afterwards. That's important for loader stubs which want to
prevent a child from disabling the mitigation again. Will also be used for
seccomp(). The extra state preserving of the prctl state for SSB is a
preparatory step for EBPF dynamic speculation control.
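
The intended semantics can be sketched as a small state machine. All names
here are hypothetical, and the -EPERM return for re-enabling after a forced
disable is an assumption of this sketch, not something stated in this
message.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical command names for this sketch. */
enum spec_cmd { SPEC_ENABLE, SPEC_DISABLE, SPEC_FORCE_DISABLE };

struct task_spec_state {
    bool ssb_disabled;       /* mitigation active: speculation is disabled */
    bool ssb_force_disabled; /* sticky: once set, cannot be undone */
};

static int spec_ctrl_set(struct task_spec_state *t, enum spec_cmd cmd)
{
    switch (cmd) {
    case SPEC_ENABLE:
        /* A forced disable must not be reverted. */
        if (t->ssb_force_disabled)
            return -EPERM;
        t->ssb_disabled = false;
        return 0;
    case SPEC_DISABLE:
        t->ssb_disabled = true;
        return 0;
    case SPEC_FORCE_DISABLE:
        t->ssb_disabled = true;
        t->ssb_force_disabled = true;
        return 0;
    }
    return -ERANGE;
}
```

A plain disable can still be undone with a later enable; after a forced
disable, the enable request fails and the mitigation stays on.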

OraBug: 28041771
CVE: CVE-2018-3639

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
(cherry picked from commit 356e4bfff2c5489e016fdb925adbf12a1e3950ee)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Conflicts:
arch/x86/kernel/cpu/bugs.c

[File is called bugs_64.c]
include/linux/sched.h

Signed-off-by: Brian Maly <brian.maly@oracle.com>

seccomp: Enable speculation flaw mitigations

When speculation flaw mitigations are opt-in (via prctl), using seccomp
will automatically opt-in to these protections, since using seccomp
indicates at least some level of sandboxing is desired.

OraBug: 28041771
CVE: CVE-2018-3639

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
(cherry picked from commit 5c3070890d06ff82eecb808d02d2ca39169533ef)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Conflicts:
kernel/seccomp.c
[The include file is called nospec-branch.h instead of nospec.h]

Signed-off-by: Brian Maly <brian.maly@oracle.com>

proc: Provide details on speculation flaw mitigations

As done with seccomp and no_new_privs, also show speculation flaw
mitigation state in /proc/$pid/status.

OraBug: 28041771
CVE: CVE-2018-3639

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
(cherry picked from commit fae1fa0fc6cca8beee3ab8ed71d54f9a78fa3f64)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Conflicts:
fs/proc/array.c
[Missing seq_putc]

Signed-off-by: Brian Maly <brian.maly@oracle.com>

nospec: Allow getting/setting on non-current task

Adjust arch_prctl_get/set_spec_ctrl() to operate on tasks other than
current.

This is needed both for /proc/$pid/status queries and for seccomp (since
thread-syncing can trigger seccomp in non-current threads).

OraBug: 28041771
CVE: CVE-2018-3639

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
(cherry picked from commit 7bbf1373e228840bb0295a2ca26d548ef37f448e)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Conflicts:
arch/x86/kernel/cpu/bugs.c
include/linux/nospec.h
kernel/sys.c

Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/bugs/IBRS: Disable SSB (RDS) if IBRS is selected for spectre_v2.

If =userspace is selected we want to frob the SPEC_CTRL MSR on every
userspace entry (disable memory disambiguation), and also on every
kernel entry (enable memory disambiguation). However we have
to be careful as having the MSR frobbed and retpoline being enabled
slows the machine even further.

Therefore if possible swap over to using SPEC_CTRL MSR (IBRS) on
every kernel entry instead of using retpoline.

Naturally this heuristic is controlled by various knobs.

To summarize, if "spectre_v2=retpoline spec_store_bypass_disable=userspace"
is set then we will switch the spectre_v2 to IBRS.

This table may explain this better:
effect    | spectre_v2  | spec_store_bypass_disable | remark
==========+=============+===========================+======
IBRS      | ibrs        | userspace                 |
IBRS      | auto        | userspace                 | *1 *2
IBRS      | retpoline   | userspace                 | *1
IBRS      | ibrs        | boot                      |
retpoline | auto        | boot                      |
retpoline | retpoline   | boot                      |
retpoline | auto        | boot                      |
retpoline | auto        | auto                      |

*1: If spectre_v2_heuristic=off or spectre_v2_heuristic=rds=off
is selected then the spec_store_bypass_disable=userspace parameter
is not followed and the effect is both retpoline and IBRS enabled
in the kernel.

*2: If we run under Skylake+ the 'spec_store_bypass_disable=auto'
will disable retpoline and enable IBRS. If not on Skylake+, then
retpoline and IBRS are both enabled.
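
The table rows and footnote *1 can be read as a small decision function. This is a sketch only: all enum and function names are hypothetical, the Skylake+ special case from footnote *2 is deliberately not modelled, and treating spectre_v2=ibrs as always selecting IBRS is an assumption for combinations the table does not list.

```c
#include <stdbool.h>

enum v2_opt  { V2_AUTO, V2_RETPOLINE, V2_IBRS };
enum ssb_opt { SSB_AUTO, SSB_BOOT, SSB_USERSPACE };
enum effect  { EFFECT_RETPOLINE, EFFECT_IBRS, EFFECT_RETPOLINE_AND_IBRS };

/*
 * heuristic_on is false when spectre_v2_heuristic=off (or =rds=off)
 * was given on the command line, see footnote *1.
 */
static enum effect select_effect(enum v2_opt v2, enum ssb_opt ssb,
                                 bool heuristic_on)
{
    if (v2 == V2_IBRS)
        return EFFECT_IBRS;
    if (ssb == SSB_USERSPACE)
        /* rows marked *1: switch to IBRS unless the heuristic is off */
        return heuristic_on ? EFFECT_IBRS : EFFECT_RETPOLINE_AND_IBRS;
    return EFFECT_RETPOLINE;
}
```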

OraBug: 28041771
CVE: CVE-2018-3639

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/speculation: Add prctl for Speculative Store Bypass mitigation

Add prctl based control for Speculative Store Bypass mitigation and make it
the default mitigation for Intel and AMD.

Andi Kleen provided the following rationale (slightly redacted):

There are multiple levels of impact of Speculative Store Bypass:

1) JITed sandbox.
    It cannot invoke system calls, but can do PRIME+PROBE and may have call
    interfaces to other code

2) Native code process.
    No protection inside the process at this level.

3) Kernel.

4) Between processes.

The prctl tries to protect against case (1) doing attacks.

If the untrusted code can do random system calls then control is already
lost in a much worse way. So there needs to be system call protection in
some way (using a JIT not allowing them or seccomp). Or rather if the
process can subvert its environment somehow to do the prctl it can already
execute arbitrary code, which is much worse than SSB.

To put it differently, the point of the prctl is to not allow JITed code
to read data it shouldn't read from its JITed sandbox. If it already has
escaped its sandbox then it can already read everything it wants in its
address space, and do much worse.

The ability to control Speculative Store Bypass allows the protection to be
enabled selectively without affecting overall system performance.

Based on an initial patch from Tim Chen. Completely rewritten.

OraBug: 28041771
CVE: CVE-2018-3639

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit a73ec77ee17ec556fe7f165d00314cb7c047b1ac)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Conflicts:
Documentation/admin-guide/kernel-parameters.txt
arch/x86/kernel/cpu/bugs.c

Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86: thread_info.h: move RDS from index 5 to 23

In UEK4, the thread flags field is split in two parts:
- lower bits of the word which are used usually for "pending work-to-be-done"
- upper bits of the word

There is a comment in arch/x86/include/asm/thread_info.h:88 where it says that
the lower bits are hard-coded in entry_64.S. In entry_64.S a mask of 0x0000ffff
is used to check the state of the thread and determine if it would go to
userspace or not. Because we used bit "5", which was in the lower bits part,
one of the checked condition was always true and the program never returned
from kernel.

We moved RDS to bit 23 which was free to solve the issue.
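
The effect of the 0x0000ffff mask on the two bit positions can be checked with plain bit arithmetic. This is a user-space sketch; WORK_MASK and flag_in_work_mask() are hypothetical names, and only the mask value and the bit indices come from the text above.

```c
#include <stdbool.h>
#include <stdint.h>

#define WORK_MASK UINT32_C(0x0000ffff) /* mask quoted for entry_64.S */

/* True if the flag at bit index 'bit' (0..31) is covered by the mask. */
static bool flag_in_work_mask(unsigned int bit)
{
    return ((UINT32_C(1) << bit) & WORK_MASK) != 0;
}
```

Bit 5 falls inside the mask and is therefore seen as pending work on every check, while bit 23 lies outside it.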

OraBug: 28041771
CVE: CVE-2018-3639

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/process: Allow runtime control of Speculative Store Bypass

The Speculative Store Bypass vulnerability can be mitigated with the
Reduced Data Speculation (RDS) feature. To allow finer grained control of
this potentially expensive mitigation a per task mitigation control is
required.

Add a new TIF_RDS flag and put it into the group of TIF flags which are
evaluated for mismatch in switch_to(). If these bits differ in the previous
and the next task, then the slow path function __switch_to_xtra() is
invoked. Implement the TIF_RDS dependent mitigation control in the slow
path.

If the prctl for controlling Speculative Store Bypass is disabled or no
task uses the prctl then there is no overhead in the switch_to() fast
path.
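
The mismatch check and the slow-path translation can be sketched in user-space C. The names below are hypothetical stand-ins and not the kernel's own helpers; the bit index 23 is the one this backport settles on, and SPEC_CTRL bit 2 is the RDS bit named elsewhere in this log.

```c
#include <stdbool.h>
#include <stdint.h>

#define TIF_RDS_BIT    23                            /* index used in this backport */
#define TIF_RDS_MASK   (UINT32_C(1) << TIF_RDS_BIT)
#define SPEC_CTRL_RDS  (UINT32_C(1) << 2)            /* SPEC_CTRL MSR bit 2 */

/* The slow path is only needed when the watched bit differs between tasks. */
static bool needs_slow_path(uint32_t prev_flags, uint32_t next_flags)
{
    return ((prev_flags ^ next_flags) & TIF_RDS_MASK) != 0;
}

/* Shift the next task's TIF flag down to the SPEC_CTRL RDS bit position. */
static uint32_t tif_to_spec_ctrl_rds(uint32_t tif_flags)
{
    return (tif_flags & TIF_RDS_MASK) >> (TIF_RDS_BIT - 2);
}
```

When no task sets the flag, the XOR is always zero and the fast path is never left, which matches the "no overhead" claim above.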

Update the KVM related speculation control functions to take TIF_RDS into
account as well.

Based on a patch from Tim Chen. Completely rewritten.

OraBug: 28041771
CVE: CVE-2018-3639

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 885f82bfbc6fefb6664ea27965c3ab9ac4194b8c)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Conflicts:
arch/x86/include/asm/msr-index.h
arch/x86/include/asm/thread_info.h
arch/x86/kernel/cpu/bugs.c
arch/x86/kernel/process.c
[u64->u32]

Signed-off-by: Brian Maly <brian.maly@oracle.com>

prctl: Add speculation control prctls

Add two new prctls to control aspects of speculation related vulnerabilities
and their mitigations to provide finer grained control over performance
impacting mitigations.

PR_GET_SPECULATION_CTRL returns the state of the speculation misfeature
which is selected with arg2 of prctl(2). The return value uses bit 0-2 with
the following meaning:

Bit  Define           Description
0    PR_SPEC_PRCTL    Mitigation can be controlled per task by
                      PR_SET_SPECULATION_CTRL
1    PR_SPEC_ENABLE   The speculation feature is enabled, mitigation is
                      disabled
2    PR_SPEC_DISABLE  The speculation feature is disabled, mitigation is
                      enabled

If all bits are 0 the CPU is not affected by the speculation misfeature.

If PR_SPEC_PRCTL is set, then the per task control of the mitigation is
available. If not set, prctl(PR_SET_SPECULATION_CTRL) for the speculation
misfeature will fail.
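
A caller might decode the PR_GET_SPECULATION_CTRL return value roughly as follows. This is a sketch: struct spec_state and decode_spec_ctrl() are hypothetical, the bit values are taken from the table above, and the fallback defines only apply when the system headers do not already provide them.

```c
#include <stdbool.h>

/* Bit positions 0, 1 and 2 as listed in the table above. */
#ifndef PR_SPEC_PRCTL
#define PR_SPEC_PRCTL   (1 << 0)
#endif
#ifndef PR_SPEC_ENABLE
#define PR_SPEC_ENABLE  (1 << 1)
#endif
#ifndef PR_SPEC_DISABLE
#define PR_SPEC_DISABLE (1 << 2)
#endif

struct spec_state {
    bool affected;      /* false when all bits are 0 */
    bool per_task_ctrl; /* PR_SPEC_PRCTL is set */
    bool mitigated;     /* PR_SPEC_DISABLE is set */
};

/* 'ret' is a non-negative return value of the PR_GET_SPECULATION_CTRL call. */
static struct spec_state decode_spec_ctrl(int ret)
{
    struct spec_state s;

    s.affected = ret != 0;
    s.per_task_ctrl = (ret & PR_SPEC_PRCTL) != 0;
    s.mitigated = (ret & PR_SPEC_DISABLE) != 0;
    return s;
}
```

The input would come from a call such as prctl(PR_GET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS, 0, 0, 0); a negative result means an error and should be handled before decoding.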

PR_SET_SPECULATION_CTRL controls the speculation misfeature selected by arg2
of prctl(2) on a per task basis. arg3 is used to hand in the control value,
i.e. either PR_SPEC_ENABLE or PR_SPEC_DISABLE.

The common return values are:

EINVAL  prctl is not implemented by the architecture or the unused prctl()
        arguments are not 0
ENODEV  arg2 selects an unsupported speculation misfeature

PR_SET_SPECULATION_CTRL has these additional return values:

ERANGE  arg3 is incorrect, i.e. it is neither PR_SPEC_ENABLE nor PR_SPEC_DISABLE
ENXIO   prctl control of the selected speculation misfeature is disabled

The first supported controllable speculation misfeature is
PR_SPEC_STORE_BYPASS. Add the define so this can be shared between
architectures.

Based on an initial patch from Tim Chen and mostly rewritten.

OraBug: 28041771
CVE: CVE-2018-3639

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit b617cfc858161140d69cc0b5cc211996b557a1c7)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Conflicts:
Documentation/userspace-api/index.rst
include/linux/nospec.h
include/uapi/linux/prctl.h
kernel/sys.c

Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/speculation: Create spec-ctrl.h to avoid include hell

Having everything in nospec-branch.h creates a hell of dependencies when
adding the prctl based switching mechanism. Move everything which is not
required in nospec-branch.h to spec-ctrl.h and fix up the includes in the
relevant files.

OraBug: 28041771
CVE: CVE-2018-3639

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 28a2775217b17208811fa43a9e96bd1fdf417b86)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Conflicts:
arch/x86/kernel/cpu/bugs.c
arch/x86/kvm/svm.c
arch/x86/kvm/vmx.c

Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/KVM/VMX: Expose SPEC_CTRL Bit(2) to the guest

Expose the CPUID.7.EDX[31] bit to the guest, and also guard against various
combinations of SPEC_CTRL MSR values.

The handling of the MSR (to take into account the host value of SPEC_CTRL
Bit(2)) is taken care of in patch:

KVM/SVM/VMX/x86/spectre_v2: Support the combination of guest and host IBRS

OraBug: 28041771
CVE: CVE-2018-3639

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit da39556f66f5cfe8f9c989206974f1cb16ca5d7c)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Conflicts:
arch/x86/kvm/cpuid.c
arch/x86/kvm/vmx.c

Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/bugs/AMD: Add support to disable RDS on Fam[15,16,17]h if requested

AMD does not need the Speculative Store Bypass mitigation to be enabled.

The parameters for this are already available and can be done via MSR
C001_1020. Each family uses a different bit in that MSR for this.

[ tglx: Expose the bit mask via a variable and move the actual MSR fiddling
into the bugs code as that's the right thing to do and also required
to prepare for dynamic enable/disable ]

OraBug: 28041771
CVE: CVE-2018-3639

Suggested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 764f3c21588a059cd783c6ba0734d4db2d72822d)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Conflicts:
arch/x86/include/asm/cpufeatures.h
arch/x86/kernel/cpu/amd.c
arch/x86/kernel/cpu/bugs.c
arch/x86/kernel/cpu/common.c

Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/bugs: Whitelist allowed SPEC_CTRL MSR values

Intel and AMD SPEC_CTRL (0x48) MSR semantics may differ in the
future (or in fact use different MSRs for the same functionality).

As such a run-time mechanism is required to whitelist the appropriate MSR
values.
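
One way to picture such a run-time whitelist is a mask of forbidden bits that is relaxed as features are detected. This is a sketch under assumptions: the variable and function names are hypothetical, and starting with only the IBRS bit (bit 0) permitted is an assumption of this example.

```c
#include <stdbool.h>
#include <stdint.h>

#define SPEC_CTRL_IBRS (UINT64_C(1) << 0)
#define SPEC_CTRL_RDS  (UINT64_C(1) << 2)

/* Bits set here must not be written; initially everything except IBRS. */
static uint64_t forbidden_mask = ~SPEC_CTRL_IBRS;

/* Called once a CPU feature is detected, e.g. RDS support. */
static void allow_bits(uint64_t bits)
{
    forbidden_mask &= ~bits;
}

/* A value is acceptable only if it touches no forbidden bit. */
static bool spec_ctrl_value_allowed(uint64_t val)
{
    return (val & forbidden_mask) == 0;
}
```

Vendor-specific setup code would call allow_bits() during boot, after which the mask is treated as read-only.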

[ tglx: Made the variable __ro_after_init ]

OraBug: 28041771
CVE: CVE-2018-3639

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 1115a859f33276fe8afb31c60cf9d8e657872558)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Conflicts:
arch/x86/kernel/cpu/bugs.c
[It is called bugs_64.c]

Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/bugs/intel: Set proper CPU features and setup RDS

Intel CPUs expose methods to:

- Detect whether RDS capability is available via CPUID.7.0.EDX[31],

- The SPEC_CTRL MSR(0x48), bit 2 set to enable RDS.

- MSR_IA32_ARCH_CAPABILITIES, Bit(4) no need to enable RDS.

With that in mind, if spec_store_bypass_disable=[auto,on] is selected, set the
SPEC_CTRL MSR at boot time to enable RDS if the platform requires it.

Note that this does not fix the KVM case where the SPEC_CTRL is exposed to
guests which can muck with it, see patch titled:
KVM/SVM/VMX/x86/spectre_v2: Support the combination of guest and host IBRS.

And for the firmware (IBRS to be set), see patch titled:
x86/spectre_v2: Read SPEC_CTRL MSR during boot and re-use reserved bits

[ tglx: Disentangled it from the intel implementation and kept the call order ]

OraBug: 28041771
CVE: CVE-2018-3639

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 772439717dbf703b39990be58d8d4e3e4ad0598a)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Conflicts:
arch/x86/include/asm/msr-index.h

[Different file names]
arch/x86/kernel/cpu/bugs.c
arch/x86/kernel/cpu/common.c
arch/x86/kernel/cpu/cpu.h
arch/x86/kernel/cpu/intel.c

Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/bugs: Provide boot parameters for the spec_store_bypass_disable mitigation

Contemporary high performance processors use a common industry-wide
optimization known as "Speculative Store Bypass" in which loads from
addresses to which a recent store has occurred may (speculatively) see an
older value. Intel refers to this feature as "Memory Disambiguation" which
is part of their "Smart Memory Access" capability.

Memory Disambiguation can expose a cache side-channel attack against such
speculatively read values. An attacker can create exploit code that allows
them to read memory outside of a sandbox environment (for example,
malicious JavaScript in a web page), or to perform more complex attacks
against code running within the same privilege level, e.g. via the stack.

As a first step to mitigate against such attacks, provide two boot command
line control knobs:

nospec_store_bypass_disable
spec_store_bypass_disable=[off,auto,on]

By default affected x86 processors will power on with Speculative
Store Bypass enabled. Hence the provided kernel parameters are written
from the point of view of whether to enable a mitigation or not.
The parameters are as follows:

- auto - Kernel detects whether your CPU model contains an implementation
  of Speculative Store Bypass and picks the most appropriate
  mitigation.

- on   - disable Speculative Store Bypass
- off  - enable Speculative Store Bypass
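
The mapping from the option value to a mitigation command could look like the sketch below. The enum and function names are hypothetical, and falling back to "auto" for a missing or unrecognised value is an assumption of this example, not something the text above states.

```c
#include <string.h>

enum ssb_mitigation_cmd { SSB_CMD_NONE, SSB_CMD_AUTO, SSB_CMD_ON };

/* Map the value given to spec_store_bypass_disable= to a command. */
static enum ssb_mitigation_cmd parse_ssb_arg(const char *arg)
{
    if (arg == NULL)
        return SSB_CMD_AUTO;
    if (strcmp(arg, "on") == 0)
        return SSB_CMD_ON;   /* disable Speculative Store Bypass */
    if (strcmp(arg, "off") == 0)
        return SSB_CMD_NONE; /* leave Speculative Store Bypass enabled */
    return SSB_CMD_AUTO;     /* "auto" and, by assumption, anything else */
}
```

In this picture nospec_store_bypass_disable would simply select SSB_CMD_NONE before any value is parsed.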

[ tglx: Reordered the checks so that the whole evaluation is not done
   when the CPU does not support RDS ]

OraBug: 28041771
CVE: CVE-2018-3639

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 24f7fc83b9204d20f878c57cb77d261ae825e033)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Conflicts:
Documentation/admin-guide/kernel-parameters.txt

[It is called Documentation/kernel-parameters.txt]

arch/x86/include/asm/cpufeatures.h

[It is called cpufeature.h]

arch/x86/kernel/cpu/bugs.c

[And it is bugs_64.c]

Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/cpufeatures: Add X86_FEATURE_RDS

Add the CPU feature bit CPUID.7.0.EDX[31] which indicates whether the CPU
supports Reduced Data Speculation.
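
Testing that feature bit is a single shift and mask. This is a minimal sketch with a hypothetical helper name; the caller is assumed to have obtained EDX from CPUID leaf 7, subleaf 0.

```c
#include <stdbool.h>
#include <stdint.h>

/* 'edx' is the EDX output of CPUID leaf 7, subleaf 0. */
static bool cpuid7_edx_has_rds(uint32_t edx)
{
    return ((edx >> 31) & 1u) != 0;
}
```

In user space with GCC or Clang the register value can usually be read with __get_cpuid_count() from <cpuid.h>; the kernel instead records the bit in its own feature word.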

[ tglx: Split it out from a later patch ]

OraBug: 28041771
CVE: CVE-2018-3639

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 0cc5fa00b0a88dad140b4e5c2cead9951ad36822)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Conflicts:
arch/x86/include/asm/cpufeatures.h
[It is called cpufeature.h]
[We also need to use the scattered function to set the flag similar
to how the rest of CPUID.7.0.EDX[31] are done]

Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/bugs: Expose /sys/../spec_store_bypass

Add the sysfs file for the new vulnerability. It does not do much except
show the word 'Vulnerable' for recent x86 cores.

Intel cores prior to family 6 are known not to be vulnerable, and so are
some Atoms and some Xeon Phi.

It assumes that older Cyrix, Centaur, etc. cores are immune.

OraBug: 28041771
CVE: CVE-2018-3639

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit c456442cd3a59eeb1d60293c26cbe2ff2c4e42cf)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Conflicts:
arch/x86/include/asm/cpufeatures.h
[It is a different file - cpufeature.h]

arch/x86/kernel/cpu/bugs.c
[As well, called bugs_64.c]

arch/x86/kernel/cpu/common.c
[Location of cpu_set_bug_bits is different and also had to drop the __initconst]

Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/cpu/intel: Add Knights Mill to Intel family

Add CPUID of Knights Mill (KNM) processor to Intel family list.

OraBug: 28041771
CVE: CVE-2018-3639

Signed-off-by: Piotr Luc <piotr.luc@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20161012180520.30976-1-piotr.luc@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 0047f59834e5947d45f34f5f12eb330d158f700b)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/cpu: Rename Merrifield2 to Moorefield

Merrifield2 is actually Moorefield.

Rename it accordingly and drop tail digit from Merrifield1.

OraBug: 28041771
CVE: CVE-2018-3639

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20160906184254.94440-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit f5fbf848303c8704d0e1a1e7cabd08fd0a49552f)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Conflicts:
arch/x86/platform/atom/punit_atom_debug.c
[Does not exist]
drivers/pci/pci-mid.c
drivers/powercap/intel_rapl.c
[We never added the support for it]

Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/bugs, KVM: Support the combination of guest and host IBRS

A guest may modify the SPEC_CTRL MSR from the value used by the
kernel. Since the kernel doesn't use IBRS, this means a value of zero is
what is needed in the host.

But the 336996-Speculative-Execution-Side-Channel-Mitigations.pdf refers to
the other bits as reserved so the kernel should respect the boot time
SPEC_CTRL value and use that.

This allows dealing with future extensions to the SPEC_CTRL interface, if
any at all.
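
The guest/host handling can be modelled as "write the MSR only when the two values differ". This is a simulation sketch: every name is hypothetical, the MSR is a plain variable, and write_spec_ctrl() stands in for the real MSR write.

```c
#include <stdint.h>

/* Stand-in for the SPEC_CTRL value read at boot and kept by the host. */
static uint64_t host_spec_ctrl_base;

/* Simulated MSR and a counter for the (expensive) writes to it. */
static uint64_t msr_value;
static unsigned int msr_writes;

static void write_spec_ctrl(uint64_t v)
{
    msr_value = v;
    msr_writes++;
}

/* Before entering the guest: load its value if it differs from the host's. */
static void enter_guest(uint64_t guest_val)
{
    if (guest_val != host_spec_ctrl_base)
        write_spec_ctrl(guest_val);
}

/* After leaving the guest: restore the host value if the guest changed it. */
static void leave_guest(uint64_t guest_val)
{
    if (guest_val != host_spec_ctrl_base)
        write_spec_ctrl(host_spec_ctrl_base);
}
```

Because the host value is the boot-time one rather than a hard-coded zero, reserved bits set by firmware survive a guest round trip.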

Note: This uses wrmsrl() instead of native_wrmsrl(). It does not make any
difference as paravirt will over-write the callq *0xfff.. with the wrmsrl
assembler code.

OraBug: 28041771
CVE: CVE-2018-3639

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 5cf687548705412da47c9cec342fd952d71ed3d5)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Conflicts:
arch/x86/kvm/svm.c
arch/x86/kvm/vmx.c
[We need to preserve the check for ibrs_inuse - which we can do now in the
functions]

Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/bugs/IBRS: Warn if IBRS is enabled during boot.

It should never be. But in case it is, let's warn and clear it.

OraBug: 28041771
CVE: CVE-2018-3639

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/bugs/IBRS: Use variable instead of defines for enabling IBRS

This follows what "x86/bugs: Read SPEC_CTRL MSR during boot and re-use
reserved bits" patch does - that is respect the other bits of the
SPEC CTRL MSR (if any at all).

This necessitates converting all the assembler macros, and all
the various uses of the SPEC CTRL guarded by 'use_ibrs'.

Note the not so obvious change in the assembler macro from 'cmp' to
'test' to verify the right bit being set.
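
The difference between the two instructions can be shown in C. This sketch reflects one plausible reading of the note above, namely that the checked value may now carry bits besides IBRS; the helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define SPEC_CTRL_IBRS (UINT32_C(1) << 0)
#define SPEC_CTRL_RDS  (UINT32_C(1) << 2)

/* 'cmp'-style check: true only when the value is exactly the IBRS bit. */
static bool ibrs_set_cmp(uint32_t v)
{
    return v == SPEC_CTRL_IBRS;
}

/* 'test'-style check: true whenever the IBRS bit is set, whatever else is. */
static bool ibrs_set_test(uint32_t v)
{
    return (v & SPEC_CTRL_IBRS) != 0;
}
```

With IBRS and RDS both set, the compare-style check reports "not set" while the test-style check still sees the IBRS bit.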

And to make sure it works with the IBRS support we need to
recognize it in x86_spec_ctrl_set.

This is not upstreamed. It builds on top of IBRS backport.

OraBug: 28041771
CVE: CVE-2018-3639

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/kernel/cpu/bugs_64.c

Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/bugs: Read SPEC_CTRL MSR during boot and re-use reserved bits

The 336996-Speculative-Execution-Side-Channel-Mitigations.pdf refers to all
the other bits as reserved. The Intel SDM glossary defines reserved as
implementation specific - aka unknown.

As such at bootup this must be taken into account and proper masking for
the bits in use applied.

A copy of this document is available at
https://bugzilla.kernel.org/show_bug.cgi?id=199511

[ tglx: Made x86_spec_ctrl_base __ro_after_init ]

OraBug: 28041771
CVE: CVE-2018-3639

Suggested-by: Jon Masters <jcm@redhat.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 1b86883ccb8d5d9506529d42dbe1a5257cb30b18)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Conflicts:
arch/x86/include/asm/nospec-branch.h
[As we don't have the firmware_restrict_branch_speculation_start and
firmware_restrict_branch_speculation_end and end up with a different
name. See commit 473ad76ea8d76f34555d764a3d5820bc1b33cabf
"x86/speculation: Use IBRS if available before calling into firmware"]

arch/x86/kernel/cpu/bugs.c
[File is called bugs_64.c in UEK4]

[Also the backport needs nospec-branch.h in different files, and we can't
use __ro_after_init]

Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/bugs: Concentrate bug reporting into a separate function

Those SysFS functions have a similar preamble, as such make common
code to handle them.

OraBug: 28041771
CVE: CVE-2018-3639

Suggested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit d1059518b4789cabe34bb4b714d07e6089c82ca1)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Conflicts:
arch/x86/kernel/cpu/bugs.c

[As it does not exist in UEK4. It is called bugs_64.c]

Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/bugs: Concentrate bug detection into a separate function

Combine the various logic which goes through all those
x86_cpu_id matching structures in one function.

OraBug: 28041771
CVE: CVE-2018-3639

Suggested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 4a28bfe3267b68e22c663ac26185aa16c9b879ef)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Conflicts:
arch/x86/kernel/cpu/common.c

Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/bugs/IBRS: Turn on IBRS in spectre_v2_select_mitigation

instead of during early bootup. This makes the bootup much
faster as we may get an NMI (watchdog) during booting before we
make it to spectre_v2_select_mitigation - which means we would
be running with IBRS enabled.

OraBug: 28041771
CVE: CVE-2018-3639

Fixes: XYZ ("x86/bugs/IBRS: Use variable instead of defines for enabling IBRS")
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/msr: Add SPEC_CTRL_IBRS..

Instead of using the defines we have, to ease backporting.

OraBug: 28041771
CVE: CVE-2018-3639

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

scsi: libfc: Revisit kref handling

The kref handling in fc_rport is a mess. This patch updates
the kref handling according to the following rules:

- Take a reference whenever scheduling a workqueue
- Take a reference whenever an ELS command is sent
- Drop the reference at the end of the workqueue function
- Drop the reference at the end of handling ELS replies
- Take a reference when allocating an rport
- Drop the reference when removing an rport
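
The pairing rules above can be illustrated with a toy reference counter. This is a single-threaded sketch with hypothetical names; the real code uses struct kref with atomic operations and a release callback.

```c
#include <assert.h>
#include <stdbool.h>

struct rport_sketch {
    int refcount;
    bool released; /* set when the last reference is dropped */
};

/* Allocation takes the initial reference. */
static void rport_init(struct rport_sketch *r)
{
    r->refcount = 1;
    r->released = false;
}

/* Taken when scheduling work or sending an ELS command. */
static void rport_get(struct rport_sketch *r)
{
    assert(r->refcount > 0); /* never revive a released object */
    r->refcount++;
}

/* Dropped at the end of the work function, reply handler, or on removal. */
static void rport_put(struct rport_sketch *r)
{
    assert(r->refcount > 0);
    if (--r->refcount == 0)
        r->released = true;
}
```

Every get in the list has exactly one matching put, so the object is released only after removal and after all outstanding work and ELS replies have finished.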

Signed-off-by: Hannes Reinecke <hare@suse.com>
Acked-by: Johannes Thumshirn <jth@kernel.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
OraBug: 27363267
(cherry picked from commit 4d2095cc42a2d8062590891f929d9d694cbd927f)
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Fred Herard <fred.herard@oracle.com>
Signed-off-by: Rajan Shanmugavelu <rajan.shanmugavelu@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

scsi: libfc: reset exchange manager during LOGO handling

FC-LS mandates that we should invalidate all sequences before sending a
LOGO. And we should set the event to RPORT_EV_STOP when a LOGO request
has been received to signal that all exchanges are terminated.

Signed-off-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Chad Dupuis <chad.dupuis@qlogic.com>
Tested-by: Chad Dupuis <chad.dupuis@qlogic.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
OraBug: 27363267
(cherry picked from commit 649eb8693857e9b9fca009fba4eb7e80f9f3a326)
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Fred Herard <fred.herard@oracle.com>
Signed-off-by: Rajan Shanmugavelu <rajan.shanmugavelu@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

scsi: libfc: send LOGO for PLOGI failure

When running in point-to-multipoint mode PLOGI is done after FLOGI
completed. So when the PLOGI fails we should be sending a LOGO to the
remote port.

[mkp: Applied by hand]

Signed-off-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Chad Dupuis <chad.dupuis@qlogic.com>
Tested-by: Chad Dupuis <chad.dupuis@qlogic.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
OraBug: 27363267
(cherry picked from commit d391966a03846176a78ef8d53898de8b4302a2be)
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Fred Herard <fred.herard@oracle.com>
Signed-off-by: Rajan Shanmugavelu <rajan.shanmugavelu@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

scsi: libfc: Issue PRLI after a PRLO has been received

When receiving a PRLO it just means that the operating parameters have
changed, it does _not_ mean that the port doesn't want to communicate
with us. So instead of implicitly logging out we should be issuing a
PRLI to figure out the new operating parameters. We can always recover
once PRLI fails.

Signed-off-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Chad Dupuis <chad.dupuis@qlogic.com>
Tested-by: Chad Dupuis <chad.dupuis@qlogic.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
OraBug: 27363267
(cherry picked from commit 166f310b629c046b7f5ca846adf978cda47b06c2)
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Fred Herard <fred.herard@oracle.com>
Signed-off-by: Rajan Shanmugavelu <rajan.shanmugavelu@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

libfc: Update rport reference counting

Originally libfc would just be initializing the refcount to '1', and
using the disc_mutex to synchronize if and when the final put should be
happening.  This has a race condition as the mutex might be delayed,
causing other threads to access an invalid structure.  This patch
updates the rport reference counting to increase the reference every
time 'rport_lookup' is called, and decreases the reference
correspondingly.  This removes the need to hold 'disc_mutex' when
removing the structure, and avoids the above race condition.

Signed-off-by: Hannes Reinecke <hare@suse.com>
Acked-by: Vasu Dev <vasu.dev@intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
OraBug: 27363267
(cherry picked from commit baa6719f902af9c03e528b08dfb847de295b5137)
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Fred Herard <fred.herard@oracle.com>
Signed-off-by: Rajan Shanmugavelu <rajan.shanmugavelu@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

amd/kvm: do not intercept new MSRs for spectre v2 mitigation

Do not intercept MSR_IA32_SPEC_CTRL and MSR_IA32_PRED_CMD on AMD
for Spectre v2 mitigation.
As IBRS is not used on AMD, an attempt to intercept MSR_IA32_SPEC_CTRL
will crash the guest with an injected GP fault.
Also change the comment about field 'always' in svm_direct_access_msrs structure
for clarity.

OraBug: 27370258

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

RDS: null pointer dereference in rds_atomic_free_op

Orabug: 27422832
CVE: CVE-2018-5333

Set rm->atomic.op_active to 0 when rds_pin_pages() fails
or the user-supplied address is invalid;
this prevents a NULL pointer dereference in rds_atomic_free_op().

Signed-off-by: Mohamed Ghannam <simo.ghannam@gmail.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 7d11f77f84b27cef452cee332f4e469503084737)

Reviewed-by: Shan Hai <shan.hai@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

ACPI: sbshc: remove raw pointer from printk() message

Orabug: 27501257
CVE: CVE-2018-5750

There's no need to be printing a raw kernel pointer to the kernel log at
every boot. So just remove it, and change the whole message to use the
correct dev_info() call at the same time.

Reported-by: Wang Qize <wang_qize@venustech.com.cn>
Cc: All applicable <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit 43cdd1b716b26f6af16da4e145b6578f98798bf6)

Reviewed-by: Shan Hai <shan.hai@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

futex: Prevent overflow by strengthen input validation

Orabug: 27539548
CVE: CVE-2018-6927

UBSAN reports signed integer overflow in kernel/futex.c:

UBSAN: Undefined behaviour in kernel/futex.c:2041:18
signed integer overflow:
0 - -2147483648 cannot be represented in type 'int'

Add a sanity check to catch negative values of nr_wake and nr_requeue.
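
For illustration only: a minimal userspace sketch of such a sanity
check. The function names are hypothetical and the subtraction is just a
representative example of arithmetic that is safe once both counts are
known to be non-negative; it is not a copy of the futex code.

```c
#include <errno.h>
#include <limits.h>

/* Reject negative counts up front, before any arithmetic uses them. */
int requeue_check_args(int nr_wake, int nr_requeue)
{
    if (nr_wake < 0 || nr_requeue < 0)
        return -EINVAL;
    return 0;
}

/*
 * Hypothetical follow-up computation: how many tasks were handled beyond
 * the woken ones. With task_count and nr_wake both in [0, INT_MAX] the
 * subtraction cannot overflow a signed int.
 */
int count_beyond_wake(int task_count, int nr_wake, int *out)
{
    if (task_count < 0 || requeue_check_args(nr_wake, 0) != 0)
        return -EINVAL;
    *out = task_count - nr_wake;
    return 0;
}
```

Without the check, a call such as count_beyond_wake(0, INT_MIN, ...)
would evaluate 0 - INT_MIN, which is the overflow UBSAN reported.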

Signed-off-by: Li Jinyue <lijinyue@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: peterz@infradead.org
Cc: dvhart@infradead.org
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/1513242294-31786-1-git-send-email-lijinyue@huawei.com
(cherry picked from commit fbe0e839d1e22d88810f3ee3e2f1479be4c0aa4a)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
kernel/futex.c

Reviewed-by: Shan Hai <shan.hai@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

net: ipv4: add support for ECMP hash policy choice

This patch adds support for ECMP hash policy choice via a new sysctl
called fib_multipath_hash_policy and also adds support for L4 hashes.
The current values for fib_multipath_hash_policy are:
0 - layer 3 (default)
1 - layer 4
If there's an skb hash already set and it matches the chosen policy then it
will be used instead of being calculated (currently only for L4).
In L3 mode we always calculate the hash due to the ICMP error special
case, the flow dissector's field consistentification should handle the
address order thus we can remove the address reversals.
If the skb is provided we always use it for the hash calculation,
otherwise we fallback to fl4, that is if skb is NULL fl4 has to be set.
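
For illustration only: a minimal userspace sketch of how the two policy
values change which fields feed the hash. The struct, the enum names and
the FNV-style mixing step are assumptions made for this example; the
kernel uses its own flow dissector and hash functions.

```c
#include <stdint.h>

/* Values follow the sysctl description: 0 = layer 3, 1 = layer 4. */
enum hash_policy { HASH_POLICY_L3 = 0, HASH_POLICY_L4 = 1 };

/* Hypothetical flow description; not a kernel structure. */
struct flow_keys {
    uint32_t saddr, daddr;
    uint16_t sport, dport;
    uint8_t  proto;
};

/* FNV-1a style mixing step, used here only as a simple stand-in hash. */
static uint32_t mix(uint32_t h, uint32_t v)
{
    return (h ^ v) * 16777619u;
}

uint32_t multipath_hash(int policy, const struct flow_keys *fk)
{
    uint32_t h = 2166136261u;

    /* Both policies hash the addresses. */
    h = mix(h, fk->saddr);
    h = mix(h, fk->daddr);

    /* Only the layer 4 policy adds ports and protocol. */
    if (policy == HASH_POLICY_L4) {
        h = mix(h, fk->sport);
        h = mix(h, fk->dport);
        h = mix(h, fk->proto);
    }
    return h;
}
```

With policy 0, two connections between the same pair of hosts always
hash alike; with policy 1 they can be spread over different paths
because their source ports differ.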

Orabug: 27547114

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit bf4e0a3db97eb882368fd82980b3b1fa0b5b9778)

Reviewed-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
include/net/ip_fib.h
include/net/netns/ipv4.h
include/net/route.h
net/ipv4/fib_semantics.c
net/ipv4/icmp.c
net/ipv4/route.c
net/ipv4/sysctl_net_ipv4.c

Signed-off-by: Brian Maly <brian.maly@oracle.com>

net: ipv4: Consider failed nexthops in multipath routes

Multipath route lookups should consider knowledge about next hops and not
select a hop that is known to be failed.

Example:

                     [h2]                   [h3]   15.0.0.5
                      |                      |
                     3|                     3|
                    [SP1]                  [SP2]--+
                     1  2                   1     2
                     |  |     /-------------+     |
                     |   \   /                    |
                     |     X                      |
                     |    / \                     |
                     |   /   \---------------\    |
                     1  2                     1   2
         12.0.0.2  [TOR1] 3-----------------3 [TOR2] 12.0.0.3
                     4                         4
                      \                       /
                        \                    /
                         \                  /
                          -------|   |-----/
                                 1   2
                                [TOR3]
                                  3|
                                   |
                                  [h1]  12.0.0.1

host h1 with IP 12.0.0.1 has 2 paths to host h3 at 15.0.0.5:

    root@h1:~# ip ro ls
    ...
    12.0.0.0/24 dev swp1  proto kernel  scope link  src 12.0.0.1
    15.0.0.0/16
            nexthop via 12.0.0.2  dev swp1 weight 1
            nexthop via 12.0.0.3  dev swp1 weight 1
    ...

If the link between tor3 and tor1 and the link between tor1 and tor2
are both down, then tor1 is effectively cut off from h1. Yet the route lookups
in h1 are alternating between the 2 routes: ping 15.0.0.5 gets one and
ssh 15.0.0.5 gets the other. Connections that attempt to use the
12.0.0.2 nexthop fail since that neighbor is not reachable:

    root@h1:~# ip neigh show
    ...
    12.0.0.3 dev swp1 lladdr 00:02:00:00:00:1b REACHABLE
    12.0.0.2 dev swp1  FAILED
    ...

The failed path can be avoided by considering known neighbor information
when selecting next hops. If the neighbor lookup fails we have no
knowledge about the nexthop, so give it a shot. If there is an entry
then only select the nexthop if the state is sane. This is similar to
what fib_detect_death does.

To maintain backward compatibility use of the neighbor information is
based on a new sysctl, fib_multipath_use_neigh.
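
For illustration only: a minimal userspace sketch of the selection rule
described above. The types and names are hypothetical, and treating only
FAILED as unusable is a simplification of the kernel's neighbor state
checks. The use_neigh argument plays the role of the new sysctl.

```c
#include <stddef.h>

/* Simplified neighbor states; NEIGH_UNKNOWN means "no entry found". */
enum neigh_state { NEIGH_UNKNOWN, NEIGH_REACHABLE, NEIGH_STALE, NEIGH_FAILED };

struct nexthop {
    const char *gw;
    enum neigh_state state;
};

/* No entry means no knowledge, so the hop still gets a shot. */
int nexthop_usable(const struct nexthop *nh)
{
    return nh->state != NEIGH_FAILED;
}

/* Returns the index of the chosen nexthop; n must be non-zero. */
size_t select_nexthop(const struct nexthop *nh, size_t n,
                      unsigned int hash, int use_neigh)
{
    size_t first = hash % n;

    if (!use_neigh)
        return first;            /* old behaviour: ignore neighbor state */

    for (size_t i = 0; i < n; i++) {
        size_t idx = (first + i) % n;
        if (nexthop_usable(&nh[idx]))
            return idx;
    }
    return first;                /* every hop looks dead: keep hashed choice */
}
```

Applied to the example above, a flow that hashes to the 12.0.0.2 nexthop
is moved to 12.0.0.3 when use_neigh is set, and stays on the failed hop
when it is not.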

Orabug: 27547114

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Reviewed-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit a6db4494d218c2e559173661ee972e048dc04fdd)

Reviewed-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
net/ipv4/fib_semantics.c
net/ipv4/sysctl_net_ipv4.c

Signed-off-by: Brian Maly <brian.maly@oracle.com>

ipv4: L3 hash-based multipath

Replaces the per-packet multipath with a hash-based multipath using
source and destination address.
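
For illustration only: a minimal userspace sketch of mapping a hash onto
weighted nexthops. Taking the hash modulo the total weight is an
assumption made for this example and is not necessarily how the kernel
partitions the hash space; the function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Pick the nexthop whose cumulative weight range contains the hash.
 * The sum of all weights must be non-zero.
 */
size_t pick_weighted(const unsigned int *weights, size_t n, uint32_t hash)
{
    unsigned int total = 0;

    for (size_t i = 0; i < n; i++)
        total += weights[i];

    unsigned int slot = hash % total;

    for (size_t i = 0; i < n; i++) {
        if (slot < weights[i])
            return i;
        slot -= weights[i];
    }
    return n - 1;   /* not reached when total is non-zero */
}
```

Because the hash depends only on source and destination address, all
packets of a given address pair land in the same slot and therefore take
the same nexthop, unlike per-packet selection.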

Orabug: 27547114

Signed-off-by: Peter Nørlund <pch@ordbogen.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 0e884c78ee19e902f300ed147083c28a0c6302f0)

Reviewed-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
include/net/ip_fib.h
net/ipv4/fib_semantics.c

Signed-off-by: Brian Maly <brian.maly@oracle.com>

dm: fix race between dm_get_from_kobject() and __dm_destroy()

Orabug: 27677556
CVE: CVE-2017-18203

The following BUG_ON was hit when testing repeat creation and removal of
DM devices:

    kernel BUG at drivers/md/dm.c:2919!
    CPU: 7 PID: 750 Comm: systemd-udevd Not tainted 4.1.44
    Call Trace:
     [<ffffffff81649e8b>] dm_get_from_kobject+0x34/0x3a
     [<ffffffff81650ef1>] dm_attr_show+0x2b/0x5e
     [<ffffffff817b46d1>] ? mutex_lock+0x26/0x44
     [<ffffffff811df7f5>] sysfs_kf_seq_show+0x83/0xcf
     [<ffffffff811de257>] kernfs_seq_show+0x23/0x25
     [<ffffffff81199118>] seq_read+0x16f/0x325
     [<ffffffff811de994>] kernfs_fop_read+0x3a/0x13f
     [<ffffffff8117b625>] __vfs_read+0x26/0x9d
     [<ffffffff8130eb59>] ? security_file_permission+0x3c/0x44
     [<ffffffff8117bdb8>] ? rw_verify_area+0x83/0xd9
     [<ffffffff8117be9d>] vfs_read+0x8f/0xcf
     [<ffffffff81193e34>] ? __fdget_pos+0x12/0x41
     [<ffffffff8117c686>] SyS_read+0x4b/0x76
     [<ffffffff817b606e>] system_call_fastpath+0x12/0x71

The bug can be easily triggered, if an extra delay (e.g. 10ms) is added
between the test of DMF_FREEING & DMF_DELETING and dm_get() in
dm_get_from_kobject().

To fix it, we need to ensure the test of DMF_FREEING & DMF_DELETING and
dm_get() are done in an atomic way, so _minor_lock is used.

The other callers of dm_get() have also been checked to be OK: some
callers invoke dm_get() under _minor_lock, some callers invoke it under
_hash_lock, and dm_start_request() invokes it after increasing
md->open_count.
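
For illustration only: a minimal userspace sketch of doing the flag test
and the reference grab under one lock. A pthread mutex stands in for the
kernel's _minor_lock, and the struct and function names are
hypothetical.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for the mapped device. */
struct mapped_dev {
    int freeing;    /* stand-in for DMF_FREEING */
    int deleting;   /* stand-in for DMF_DELETING */
    int holders;    /* stand-in for the count raised by dm_get() */
};

static pthread_mutex_t minor_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Test the flags and take the reference inside one critical section,
 * so teardown cannot slip in between the two steps.
 */
struct mapped_dev *dev_get_if_live(struct mapped_dev *md)
{
    struct mapped_dev *ret = NULL;

    pthread_mutex_lock(&minor_lock);
    if (!md->freeing && !md->deleting) {
        md->holders++;
        ret = md;
    }
    pthread_mutex_unlock(&minor_lock);
    return ret;
}

/* Teardown sets its flag under the same lock. */
void dev_mark_freeing(struct mapped_dev *md)
{
    pthread_mutex_lock(&minor_lock);
    md->freeing = 1;
    pthread_mutex_unlock(&minor_lock);
}
```

Once dev_mark_freeing() has run, dev_get_if_live() returns NULL and
leaves the holder count untouched, which is the property the unlocked
test-then-get sequence could not guarantee.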

Cc: stable@vger.kernel.org
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
(cherry picked from commit b9a41d21dceadf8104812626ef85dc56ee8a60ed)

Reviewed-by: Shan Hai <shan.hai@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

NFS: only invalidate dentrys that are clearly invalid.

Since commit bafc9b754f75 ("vfs: More precise tests in d_invalidate")
in v3.18, a return of '0' from ->d_revalidate() will cause the dentry
to be invalidated even if it has filesystems mounted on it or on a
descendant.  The mounted filesystem is unmounted.

This means we need to be careful not to return 0 unless the directory
referred to truly is invalid.  So -ESTALE or -ENOENT should invalidate
the directory.  Other errors such as -EPERM or -ERESTARTSYS should be
returned from ->d_revalidate() so they are propagated to the caller.

A particular problem can be demonstrated by:

1/ mount an NFS filesystem using NFSv3 on /mnt
2/ mount any other filesystem on /mnt/foo
3/ ls /mnt/foo
4/ turn off network, or otherwise make the server unable to respond
5/ ls /mnt/foo &
6/ cat /proc/$!/stack # note that nfs_lookup_revalidate is in the call stack
7/ kill -9 $! # this results in -ERESTARTSYS being returned
8/ observe that /mnt/foo has been unmounted.

This patch changes nfs_lookup_revalidate() to only treat
  -ESTALE from nfs_lookup_verify_inode() and
  -ESTALE or -ENOENT from ->lookup()
as indicating an invalid inode.  Other errors are returned.
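
For illustration only: a minimal userspace sketch of that error mapping.
The function names are hypothetical, and returning 1 for any successful
step skips the further checks a real revalidation performs.
ERESTARTSYS is a kernel-internal errno that userspace headers do not
export, so it is defined here with the kernel's value.

```c
#include <errno.h>

#ifndef ERESTARTSYS
#define ERESTARTSYS 512   /* kernel-internal; absent from userspace errno.h */
#endif

/*
 * Return convention of ->d_revalidate():
 *   1  dentry is valid
 *   0  dentry is invalid and will be dropped
 *  <0  error, propagated to the caller
 */

/* Result after the inode verification step: only -ESTALE invalidates. */
int revalidate_result_from_verify(int err)
{
    if (err == 0)
        return 1;
    if (err == -ESTALE)
        return 0;
    return err;
}

/* Result after the lookup step: -ESTALE and -ENOENT invalidate. */
int revalidate_result_from_lookup(int err)
{
    if (err == 0)
        return 1;
    if (err == -ESTALE || err == -ENOENT)
        return 0;
    return err;
}
```

In the reproduction steps above, the killed 'ls' yields -ERESTARTSYS,
which this mapping hands back to the caller instead of turning it into
0, so the mount on /mnt/foo stays in place.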

Also nfs_check_inode_attributes() is changed to return -ESTALE rather
than -EIO.  This is consistent with the error returned in similar
circumstances from nfs_update_inode().

As this bug allows any user to unmount a filesystem mounted on an NFS
filesystem, this fix is suitable for stable kernels.

Fixes: bafc9b754f75 ("vfs: More precise tests in d_invalidate")
Cc: stable@vger.kernel.org (v3.18+)
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Orabug: 27870824

(cherry picked from commit cc89684c9a265828ce061037f1f79f4a68ccd3f7)
Signed-off-by: Calum Mackay <calum.mackay@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

net: Improve handling of failures on link and route dumps

In general, rtnetlink dumps do not anticipate failure to dump a single
object (e.g., link or route) on a single pass. As both route and link
objects have grown via more attributes, that is no longer a given.

netlink dumps can handle a failure if the dump function returns an
error; specifically, netlink_dump adds the return code to the response
if it is <= 0 so userspace is notified of the failure. The missing
piece is the rtnetlink dump functions returning the error.

Fix route and link dump functions to return the errors if no object is
added to an skb (detected by skb->len != 0). IPv6 route dumps
(rt6_dump_route) already return the error; this patch updates IPv4 and
link dumps. Other dump functions may need to be adjusted as well.
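
For illustration only: a minimal userspace sketch of a dump pass that
reports the fill error when not even one object fits. The buffer type
and function names are hypothetical and do not mirror the rtnetlink
code.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the skb: bytes used and total capacity. */
struct dump_buf {
    size_t len;
    size_t cap;
};

/* Append one object, or fail with -EMSGSIZE if it does not fit. */
int fill_object(struct dump_buf *b, size_t obj_size)
{
    if (b->cap - b->len < obj_size)
        return -EMSGSIZE;
    b->len += obj_size;
    return 0;
}

/*
 * One dump pass starting at *next. Returns the bytes written, or the
 * fill error if nothing at all was added to the buffer.
 */
int dump_objects(struct dump_buf *b, const size_t *sizes, size_t n,
                 size_t *next)
{
    int err = 0;
    size_t i;

    for (i = *next; i < n; i++) {
        err = fill_object(b, sizes[i]);
        if (err < 0)
            break;
    }
    *next = i;               /* resume here on the following pass */

    if (err < 0 && b->len == 0)
        return err;          /* nothing fit: tell the caller why */
    return (int)b->len;
}
```

A partially filled buffer is still returned as success so the dump
continues on the next pass; only an object that cannot fit into an empty
buffer surfaces as an error.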

Reported-by: Jan Moskyto Matejka <mq@ucw.cz>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Orabug: 27959177
(cherry picked from commit f6c5775ff0bfa62b072face6bf1d40f659f194b2)
Add missing err variable for the cherry-pick in rtnetlink.c
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Dan Duval <dan.duval@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

mm/mempolicy: fix use after free when calling get_mempolicy

I hit a use after free issue when executing trinity and reproduced it
with KASAN enabled.  The related call trace is as follows.

  BUG: KASan: use after free in SyS_get_mempolicy+0x3c8/0x960 at addr ffff8801f582d766
  Read of size 2 by task syz-executor1/798

  INFO: Allocated in mpol_new.part.2+0x74/0x160 age=3 cpu=1 pid=799
     __slab_alloc+0x768/0x970
     kmem_cache_alloc+0x2e7/0x450
     mpol_new.part.2+0x74/0x160
     mpol_new+0x66/0x80
     SyS_mbind+0x267/0x9f0
     system_call_fastpath+0x16/0x1b
  INFO: Freed in __mpol_put+0x2b/0x40 age=4 cpu=1 pid=799
     __slab_free+0x495/0x8e0
     kmem_cache_free+0x2f3/0x4c0
     __mpol_put+0x2b/0x40
     SyS_mbind+0x383/0x9f0
     system_call_fastpath+0x16/0x1b
  INFO: Slab 0xffffea0009cb8dc0 objects=23 used=8 fp=0xffff8801f582de40 flags=0x200000000004080
  INFO: Object 0xffff8801f582d760 @offset=5984 fp=0xffff8801f582d600

  Bytes b4 ffff8801f582d750: ae 01 ff ff 00 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a  ........ZZZZZZZZ
  Object ffff8801f582d760: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
  Object ffff8801f582d770: 6b 6b 6b 6b 6b 6b 6b a5                          kkkkkkk.
  Redzone ffff8801f582d778: bb bb bb bb bb bb bb bb                          ........
  Padding ffff8801f582d8b8: 5a 5a 5a 5a 5a 5a 5a 5a                          ZZZZZZZZ
  Memory state around the buggy address:
  ffff8801f582d600: fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc fc
  ffff8801f582d680: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
  >ffff8801f582d700: fc fc fc fc fc fc fc fc fc fc fc fc fb fb fb fc

A !shared memory policy is not protected against parallel removal by
another thread; it is normally protected by the mmap_sem.
do_get_mempolicy, however, drops the lock midway while we can still
access the policy later.

The early, premature up_read is a historical artifact from the time
when put_user was called in this path (see
https://lwn.net/Articles/124754/), but that call has been gone since
8bccd85ffbaf ("[PATCH] Implement sys_* do_* layering in the memory
policy layer.").  With the current mempolicy ref count model, the early
release is what introduces the issue.

Fix the issue by removing the premature release.

Link: http://lkml.kernel.org/r/1502950924-27521-1-git-send-email-zhongjiang@huawei.com
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: <stable@vger.kernel.org> [2.6+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Orabug: 27963519
CVE: CVE-2018-10675
(cherry picked from commit 73223e4e2e3867ebf033a5a8eb2e5df0158ccc99)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Shan Hai <shan.hai@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

drm: udl: Properly check framebuffer mmap offsets

Orabug: 27963530
CVE: CVE-2018-8781

The memmap options sent to the udl framebuffer driver were not being
checked for all sets of possible crazy values. Fix this up by properly
bounding the allowed values.
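
For illustration only: a minimal userspace sketch of an overflow-safe
range check of this kind. The function name and the choice of 64-bit
parameters are assumptions for the example, not the driver's interface.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Accept a mapping of 'size' bytes at 'offset' only if it lies fully
 * inside a framebuffer of 'fb_len' bytes. The checks are ordered so
 * that no intermediate value can wrap around.
 */
int check_mmap_range(uint64_t offset, uint64_t size, uint64_t fb_len)
{
    if (size > fb_len)
        return -EINVAL;
    if (offset > fb_len - size)   /* safe: size <= fb_len here */
        return -EINVAL;
    return 0;
}
```

A naive 'offset + size > fb_len' test would wrap for a huge offset and
wrongly accept it; comparing against 'fb_len - size' avoids the
addition altogether.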

Reported-by: Eyal Itkin <eyalit@checkpoint.com>
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Link: https://patchwork.freedesktop.org/patch/msgid/20180321154553.GA18454@kroah.com
(cherry picked from commit 3b82a4db8eaccce735dffd50b4d4e1578099b8e8)

Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

xfs: set format back to extents if xfs_bmap_extents_to_btree

If xfs_bmap_extents_to_btree fails in a mode where we call
xfs_iroot_realloc(-1) to de-allocate the root, set the
format back to extents.

Otherwise we can assume we can dereference ifp->if_broot
based on the XFS_DINODE_FMT_BTREE format, and crash.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199423
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
(cherry picked from commit 2c4306f719b083d17df2963bc761777576b8ad1b)

Orabug: 27963576
CVE: CVE-2018-10323
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
fs/xfs/libxfs/xfs_bmap.c
Drop a part of the patch which is inapplicable to the current kernel.
The WARN_ON_ONCE in the dropped part is introduced in the upstream
commit 2fcc319d2467 as a debugging facility since v4.11 kernel.

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

Revert "mlx4: change the ICM table allocations to lowest needed size"

This reverts UEK4-QU7 commit:
e7567cf1d53bddbe233dabea5c632658671dae9e
("mlx4: change the ICM table allocations to lowest needed size")

Orabug: 27980030

Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Chuck Anderson <chuck.anderson@oracle.com>
Signed-off-by: Qing Huang <qing.huang@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

Bluetooth: Prevent stack info leak from the EFS element.

Orabug: 28030514
CVE: CVE-2017-1000410

In the function l2cap_parse_conf_rsp and in the function
l2cap_parse_conf_req the following variable is declared without
initialization:

    struct l2cap_conf_efs efs;

In addition, when parsing input configuration parameters in both of
these functions, the switch case for handling EFS elements may skip the
memcpy call that will write to the efs variable:

    ...
    case L2CAP_CONF_EFS:
        if (olen == sizeof(efs))
            memcpy(&efs, (void *)val, olen);
    ...

The olen in the above if is attacker controlled, and regardless of that
if, in both of these functions the efs variable would eventually be
added to the outgoing configuration request that is being built:

    l2cap_add_conf_opt(&ptr, L2CAP_CONF_EFS, sizeof(efs), (unsigned long) &efs);

So by sending a configuration request, or response, that contains an
L2CAP_CONF_EFS element, but with an element length that is not
sizeof(efs) - the memcpy to the uninitialized efs variable can be
avoided, and the uninitialized variable would be returned to the
attacker (16 bytes).
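
For illustration only: a minimal userspace sketch of one way to close
such a leak, by zeroing the option before parsing and echoing it only
when the length matched. The struct layout and function names are
hypothetical and this is not necessarily how the actual patch does it.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical 16-byte option layout, standing in for the EFS element. */
struct efs_opt {
    uint8_t  id;
    uint8_t  stype;
    uint16_t msdu;
    uint32_t sdu_itime;
    uint32_t acc_lat;
    uint32_t flush_to;
};

/*
 * Zero the destination first, then copy only when the peer-supplied
 * length is exactly right. Returns 1 if the option was accepted.
 */
int parse_efs(const uint8_t *val, size_t olen, struct efs_opt *out)
{
    memset(out, 0, sizeof(*out));
    if (olen != sizeof(*out))
        return 0;
    memcpy(out, val, sizeof(*out));
    return 1;
}

/* Echo the option into the response only if it was really parsed. */
size_t build_rsp(const uint8_t *val, size_t olen,
                 uint8_t *rsp, size_t rsp_cap)
{
    struct efs_opt efs;

    if (!parse_efs(val, olen, &efs) || rsp_cap < sizeof(efs))
        return 0;
    memcpy(rsp, &efs, sizeof(efs));
    return sizeof(efs);
}
```

With a wrong length nothing is appended to the response, and even if a
caller did use the struct it would contain zeros rather than stale stack
contents.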

This issue has been assigned CVE-2017-1000410

Cc: Marcel Holtmann <marcel@holtmann.org>
Cc: Gustavo Padovan <gustavo@padovan.org>
Cc: Johan Hedberg <johan.hedberg@gmail.com>
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Ben Seri <ben@armis.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 06e7e776ca4d36547e503279aeff996cbb292c16)
Signed-off-by: Brian Maly <brian.maly@oracle.com>

netfilter: nfnetlink_cthelper: Add missing permission checks

The capability check in nfnetlink_rcv() verifies that the caller
has CAP_NET_ADMIN in the namespace that "owns" the netlink socket.
However, nfnl_cthelper_list is shared by all net namespaces on the
system.  An unprivileged user can create user and net namespaces
in which he holds CAP_NET_ADMIN to bypass the netlink_net_capable()
check:

    $ nfct helper list
    nfct v1.4.4: netlink error: Operation not permitted
    $ vpnns -- nfct helper list
    {
            .name = ftp,
            .queuenum = 0,
            .l3protonum = 2,
            .l4protonum = 6,
            .priv_data_len = 24,
            .status = enabled,
    };

Add capable() checks in nfnetlink_cthelper, as this is cleaner than
trying to generalize the solution.

Signed-off-by: Kevin Cernekee <cernekee@chromium.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
(cherry picked from commit 4b380c42f7d00a395feede754f0bc2292eebe6e5)

Orabug: 27260771
CVE: CVE-2017-17448

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

netlink: Add netns check on taps

Currently, a nlmon link inside a child namespace can observe systemwide
netlink activity.  Filter the traffic so that nlmon can only sniff
netlink messages from its own netns.

Test case:

    vpnns -- bash -c "ip link add nlmon0 type nlmon; \
                      ip link set nlmon0 up; \
                      tcpdump -i nlmon0 -q -w /tmp/nlmon.pcap -U" &
    sudo ip xfrm state add src 10.1.1.1 dst 10.1.1.2 proto esp \
        spi 0x1 mode transport \
        auth sha1 0x6162633132330000000000000000000000000000 \
        enc aes 0x00000000000000000000000000000000
    grep --binary abc123 /tmp/nlmon.pcap

Signed-off-by: Kevin Cernekee <cernekee@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 93c647643b48f0131f02e45da3bd367d80443291)

Orabug: 27260799
CVE: CVE-2017-17449

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

KVM: Fix stack-out-of-bounds read in write_mmio

commit e39d200fa5bf5b94a0948db0dae44c1b73b84a56 upstream.

Reported by syzkaller:

  BUG: KASAN: stack-out-of-bounds in write_mmio+0x11e/0x270 [kvm]
  Read of size 8 at addr ffff8803259df7f8 by task syz-executor/32298

  CPU: 6 PID: 32298 Comm: syz-executor Tainted: G           OE    4.15.0-rc2+ #18
  Hardware name: LENOVO ThinkCentre M8500t-N000/SHARKBAY, BIOS FBKTC1AUS 02/16/2016
  Call Trace:
   dump_stack+0xab/0xe1
   print_address_description+0x6b/0x290
   kasan_report+0x28a/0x370
   write_mmio+0x11e/0x270 [kvm]
   emulator_read_write_onepage+0x311/0x600 [kvm]
   emulator_read_write+0xef/0x240 [kvm]
   emulator_fix_hypercall+0x105/0x150 [kvm]
   em_hypercall+0x2b/0x80 [kvm]
   x86_emulate_insn+0x2b1/0x1640 [kvm]
   x86_emulate_instruction+0x39a/0xb90 [kvm]
   handle_exception+0x1b4/0x4d0 [kvm_intel]
   vcpu_enter_guest+0x15a0/0x2640 [kvm]
   kvm_arch_vcpu_ioctl_run+0x549/0x7d0 [kvm]
   kvm_vcpu_ioctl+0x479/0x880 [kvm]
   do_vfs_ioctl+0x142/0x9a0
   SyS_ioctl+0x74/0x80
   entry_SYSCALL_64_fastpath+0x23/0x9a

The path of patched vmmcall will patch 3 bytes opcode 0F 01 C1(vmcall)
to the guest memory, however, write_mmio tracepoint always prints 8 bytes
through *(u64 *)val since kvm splits the mmio access into 8 bytes. This
leaks 5 bytes from the kernel stack (CVE-2017-17741).  This patch fixes
it by just accessing the bytes which we operate on.

Before patch:

syz-executor-5567  [007] .... 51370.561696: kvm_mmio: mmio write len 3 gpa 0x10 val 0x1ffff10077c1010f

After patch:

syz-executor-13416 [002] .... 51302.299573: kvm_mmio: mmio write len 3 gpa 0x10 val 0xc1010f
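
For illustration only: a minimal userspace sketch of building the traced
value from just the bytes that were written. The function name is
hypothetical, and assembling the value byte by byte is a choice made for
this example rather than the method used in the patch.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Build the 64-bit trace value from at most 'len' bytes of 'data'.
 * Bytes beyond 'len' are never read, so nothing adjacent can leak.
 */
uint64_t mmio_trace_val(const uint8_t *data, size_t len)
{
    uint64_t val = 0;

    if (len > 8)
        len = 8;
    for (size_t i = 0; i < len; i++)
        val |= (uint64_t)data[i] << (8 * i);   /* little-endian order, as on x86 */
    return val;
}
```

For the three opcode bytes 0F 01 C1 this yields 0xc1010f, matching the
"After patch" trace line, whereas reading all eight bytes reproduces the
"Before patch" value with five extra bytes in the upper part.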

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Tested-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry-picked from commit 653c41ac4729261cb356ee1aff0f3f4f342be1eb)

Orabug: 27290606
CVE: CVE-2017-17741

Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

xprtrdma: Detect unreachable NFS/RDMA servers more reliably

Current NFS clients rely on connection loss to determine when to
retransmit. In particular, for protocols like NFSv4, clients no
longer rely on RPC timeouts to drive retransmission: NFSv4 servers
are required to terminate a connection when they need a client to
retransmit pending RPCs.

When a server is no longer reachable, either because it has crashed
or because the network path has broken, the server cannot actively
terminate a connection. Thus NFS clients depend on transport-level
keepalive to determine when a connection must be replaced and
pending RPCs retransmitted.

However, RDMA RC connections do not have a native keepalive
mechanism. If an NFS/RDMA server crashes after a client has sent
RPCs successfully (an RC ACK has been received for all OTW RDMA
requests), there is no way for the client to know the connection is
moribund.

In addition, new RDMA requests are subject to the RPC-over-RDMA
credit limit. If the client has consumed all granted credits with
NFS traffic, it is not allowed to send another RDMA request until
the server replies. Thus it has no way to send a true keepalive when
the workload has already consumed all credits with pending RPCs.

To address this, forcibly disconnect a transport when an RPC times
out. This prevents moribund connections from stopping the
detection of failover or other configuration changes on the server.

Note that even if the connection is still good, retransmitting
any RPC will trigger a disconnect thanks to this logic in
xprt_rdma_send_request:

    /* Must suppress retransmit to maintain credits */
    if (req->rl_connect_cookie == xprt->connect_cookie)
        goto drop_connection;
    req->rl_connect_cookie = xprt->connect_cookie;

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Orabug: 27587008

(cherry picked from commit 33849792cbcdae2b04819cfb09fe3dca0a84a11e)
Signed-off-by: Calum Mackay <calum.mackay@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

sunrpc: Export xprt_force_disconnect()

xprt_force_disconnect() is already invoked from the socket
transport. I want to invoke xprt_force_disconnect() from the
RPC-over-RDMA transport, which is a separate module from sunrpc.ko.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Orabug: 27587008

(cherry picked from commit e2a4f4fbefc5e5b7b4435f73711b7be94f780584)
Signed-off-by: Calum Mackay <calum.mackay@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

sunrpc: Allow xprt->ops->timer method to sleep

The transport lock is needed to protect the xprt_adjust_cwnd() call
in xs_udp_timer, but it is not necessary for accessing the
rq_reply_bytes_recvd or tk_status fields. It is correct to sublimate
the lock into UDP's xs_udp_timer method, where it is required.

The ->timer method has to take the transport lock if needed, but it
can now sleep safely, or even call back into the RPC scheduler.

This is more a clean-up than a fix, but the "issue" was introduced
by my transport switch patches back in 2005.

Fixes: 46c0ee8bc4ad ("RPC: separate xprt_timer implementations")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Orabug: 27587008

(cherry picked from commit b977b644ccf821ab1269582f7efe1d0d85faa1f6)
Signed-off-by: Calum Mackay <calum.mackay@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

KVM: nVMX: fix guest CR4 loading when emulating L2 to L1 exit

When KVM emulates an exit from L2 to L1, it loads L1 CR4 into the
guest CR4. Before this CR4 loading, the guest CR4 refers to L2
CR4. Because these two CR4's are in different levels of guest, we
should vmx_set_cr4() rather than kvm_set_cr4() here. The latter, which
is used to handle guest writes to its CR4, checks the guest change to
CR4 and may fail if the change is invalid.

The failure may cause trouble. Consider we start
  a L1 guest with non-zero L1 PCID in use,
     (i.e. L1 CR4.PCIDE == 1 && L1 CR3.PCID != 0)
and
  a L2 guest with L2 PCID disabled,
     (i.e. L2 CR4.PCIDE == 0)
and following events may happen:

1. If kvm_set_cr4() is used in load_vmcs12_host_state() to load L1 CR4
   into guest CR4 (in VMCS01) for L2 to L1 exit, it will fail because
   of PCID check. As a result, the guest CR4 recorded in L0 KVM (i.e.
   vcpu->arch.cr4) is left to the value of L2 CR4.

2. Later, if L1 attempts to change its CR4, e.g., clearing VMXE bit,
   kvm_set_cr4() in L0 KVM will think L1 also wants to enable PCID,
   because the wrong L2 CR4 is used by L0 KVM as L1 CR4. As L1
   CR3.PCID != 0, L0 KVM will inject GP to L1 guest.
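
For illustration only: a minimal userspace sketch that replays the two
events above. The helpers are hypothetical and model just the PCID part
of the guest-write check; they are not the KVM functions.

```c
#include <stdint.h>

#define CR4_VMXE      (UINT64_C(1) << 13)
#define CR4_PCIDE     (UINT64_C(1) << 17)
#define CR3_PCID_MASK UINT64_C(0xfff)

/*
 * Simplified guest-write check: turning PCIDE on while CR3.PCID != 0
 * is rejected. Returns 1 on rejection (a GP would be injected),
 * 0 on success.
 */
int checked_cr4_write(uint64_t *cur_cr4, uint64_t new_cr4, uint64_t cr3)
{
    if ((new_cr4 & CR4_PCIDE) && !(*cur_cr4 & CR4_PCIDE) &&
        (cr3 & CR3_PCID_MASK))
        return 1;
    *cur_cr4 = new_cr4;
    return 0;
}

/* Unchecked load, for switching between guest levels. */
void forced_cr4_load(uint64_t *cur_cr4, uint64_t new_cr4)
{
    *cur_cr4 = new_cr4;
}
```

If the tracked CR4 still holds the L2 value (PCIDE clear), a later L1
write that merely clears VMXE looks like an attempt to enable PCID and
is rejected; after a forced load of the L1 value the same write
succeeds.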

Fixes: 4704d0befb072 ("KVM: nVMX: Exiting from L2 to L1")
Cc: qemu-stable@nongnu.org
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry-picked from commit 8eb3f87d903168bdbd1222776a6b1e281f50513e)
Orabug: 27720128
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/microcode: probe CPU features on microcode update

Probe for updated CPUID features each time the microcode is
loaded. Specifically this means when the sysfs cpu/microcode
nodes are created (which is when the microcode is first loaded)
or from a user trigger via the sysfs microcode/reload interface.

Orabug: 27878230

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
(cherry-picked from commit c97a2bf2aa93390b23fbf9adb943e494fee18a18)

conflicts:
arch/x86/kernel/cpu/microcode/core.c
arch/x86/include/asm/processor.h

[Backport: call init_scattered_cpuid_features() instead of
get_cpu_cap().]

Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/microcode: microcode_write() should not reference boot_cpu_data

microcode_write() internally calls the AMD or Intel microcode
update logic, both of which update the cpu_data(cpu)->microcode
value. For probing speculation features however, we call
init_scattered_cpuid_features() with boot_cpu_data which is stale
and might have an old value of microcode version.

Fix this by using cpu_data() instead.

Orabug: 27878230

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Reviewed-by: John Haxby <john.haxby@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry-picked from commit ea35f733dca011782ef2a3647e7f3ac284dbed2e)

Conflict:
arch/x86/kernel/cpu/microcode/core.c

[Backport: modify arch/x86/kernel/microcode_core.c instead.]

Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/cpufeatures: use cpu_data in init_scattered_cpuid_flags()

Post SMP init, cpu_data() contains the current cpuinfo state with
the boot_cpu_data potentially going stale.

Switch all boot_cpu_data references to cpu_data() in
init_scattered_cpuid_flags().

Orabug: 27878230

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Reviewed-by: John Haxby <john.haxby@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry-picked from commit 6258d6d6ce2b9cf41830e8616ffc7841b3fddca9)

Conflict:
arch/x86/kernel/cpu/common.c

[Backport: Modify init_scattered_cpuid_flags() instead of
init_speculation_control().]

Signed-off-by: Brian Maly <brian.maly@oracle.com>

mm/pagewalk.c: report holes in hugetlb ranges

This matters at least for the mincore syscall, which will otherwise copy
uninitialized memory from the page allocator to userspace. It is
probably also a correctness error for /proc/$pid/pagemap, but I haven't
tested that.

Removing the `walk->hugetlb_entry` condition in walk_hugetlb_range() has
no effect because the caller already checks for that.

This only reports holes in hugetlb ranges to callers who have specified
a hugetlb_entry callback.
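
As an illustration only (this is not the kernel code), the behaviour
described above can be sketched in userspace C. The names walk_ops,
walk_range(), record_present() and record_hole() are hypothetical
stand-ins, and a byte array plays the role of the page table. The point
is that every slot in the range reaches some callback, so a
mincore-style output buffer is fully written, while a NULL pte_hole is
tolerated:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel's page-walk types. */
struct walk_ops {
    int (*hugetlb_entry)(size_t idx, void *priv); /* called for mapped slots */
    int (*pte_hole)(size_t idx, void *priv);      /* called for holes; may be NULL */
    void *priv;
};

/*
 * Walk slots [start, end). present[i] != 0 means slot i is mapped.
 * Returns 0 on success or the first non-zero callback result.
 */
int walk_range(const unsigned char *present, size_t start, size_t end,
               const struct walk_ops *ops)
{
    /* The caller is assumed to have checked this, as in the commit text. */
    assert(ops->hugetlb_entry != NULL);

    for (size_t i = start; i < end; i++) {
        int err = 0;

        if (present[i])
            err = ops->hugetlb_entry(i, ops->priv);
        else if (ops->pte_hole) /* do not crash when no hole callback is set */
            err = ops->pte_hole(i, ops->priv);
        if (err)
            return err;
    }
    return 0;
}

/* mincore-like callbacks: write one residency byte per slot. */
int record_present(size_t idx, void *priv)
{
    ((unsigned char *)priv)[idx] = 1;
    return 0;
}

int record_hole(size_t idx, void *priv)
{
    ((unsigned char *)priv)[idx] = 0;
    return 0;
}
```

Without the hole branch, the output bytes for unmapped slots would keep
whatever the buffer held before the walk, which is the kind of
uninitialized data the fix is about.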

This issue was found using an AFL-based fuzzer.

v2:
- don't crash on ->pte_hole==NULL (Andrew Morton)
- add Cc stable (Andrew Morton)

Fixes: 1e25a271c8ac ("mincore: apply page table walker on do_mincore()")
Signed-off-by: Jann Horn <jannh@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Orabug: 27913118
CVE: CVE-2017-16994

(cherry picked from commit 373c4557d2aa362702c4c2d41288fb1e54990b7c)
Conflict: one line adjust due to huge_pte_offset() interface diff
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Shan Hai <shan.hai@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

KEYS: don't let add_key() update an uninstantiated key

Currently, when passed a key that already exists, add_key() will call the
key's ->update() method if such exists.  But this is heavily broken in the
case where the key is uninstantiated because it doesn't call
__key_instantiate_and_link().  Consequently, it doesn't do most of the
things that are supposed to happen when the key is instantiated, such as
setting the instantiation state, clearing KEY_FLAG_USER_CONSTRUCT and
awakening tasks waiting on it, and incrementing key->user->nikeys.

It also never takes key_construction_mutex, which means that
->instantiate() can run concurrently with ->update() on the same key.  In
the case of the "user" and "logon" key types this causes a memory leak, at
best.  Maybe even worse, the ->update() methods of the "encrypted" and
"trusted" key types actually just dereference a NULL pointer when passed an
uninstantiated key.

Change key_create_or_update() to wait interruptibly for the key to finish
construction before continuing.

This patch only affects *uninstantiated* keys.  For now we still allow a
negatively instantiated key to be updated (thereby positively
instantiating it), although that's broken too (the next patch fixes it)
and I'm not sure that anyone actually uses that functionality either.

Here is a simple reproducer for the bug using the "encrypted" key type
(requires CONFIG_ENCRYPTED_KEYS=y), though as noted above the bug
pertained to more than just the "encrypted" key type:

    #include <stdlib.h>
    #include <unistd.h>
    #include <keyutils.h>

    int main(void)
    {
        int ringid = keyctl_join_session_keyring(NULL);

        if (fork()) {
            for (;;) {
                const char payload[] = "update user:foo 32";

                usleep(rand() % 10000);
                add_key("encrypted", "desc", payload, sizeof(payload), ringid);
                keyctl_clear(ringid);
            }
        } else {
            for (;;)
                request_key("encrypted", "desc", "callout_info", ringid);
        }
    }

It causes:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
    IP: encrypted_update+0xb0/0x170
    PGD 7a178067 P4D 7a178067 PUD 77269067 PMD 0
    PREEMPT SMP
    CPU: 0 PID: 340 Comm: reproduce Tainted: G      D         4.14.0-rc1-00025-g428490e38b2e #796
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    task: ffff8a467a39a340 task.stack: ffffb15c40770000
    RIP: 0010:encrypted_update+0xb0/0x170
    RSP: 0018:ffffb15c40773de8 EFLAGS: 00010246
    RAX: 0000000000000000 RBX: ffff8a467a275b00 RCX: 0000000000000000
    RDX: 0000000000000005 RSI: ffff8a467a275b14 RDI: ffffffffb742f303
    RBP: ffffb15c40773e20 R08: 0000000000000000 R09: ffff8a467a275b17
    R10: 0000000000000020 R11: 0000000000000000 R12: 0000000000000000
    R13: 0000000000000000 R14: ffff8a4677057180 R15: ffff8a467a275b0f
    FS:  00007f5d7fb08700(0000) GS:ffff8a467f200000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000018 CR3: 0000000077262005 CR4: 00000000001606f0
    Call Trace:
     key_create_or_update+0x2bc/0x460
     SyS_add_key+0x10c/0x1d0
     entry_SYSCALL_64_fastpath+0x1f/0xbe
    RIP: 0033:0x7f5d7f211259
    RSP: 002b:00007ffed03904c8 EFLAGS: 00000246 ORIG_RAX: 00000000000000f8
    RAX: ffffffffffffffda RBX: 000000003b2a7955 RCX: 00007f5d7f211259
    RDX: 00000000004009e4 RSI: 00000000004009ff RDI: 0000000000400a04
    RBP: 0000000068db8bad R08: 000000003b2a7955 R09: 0000000000000004
    R10: 000000000000001a R11: 0000000000000246 R12: 0000000000400868
    R13: 00007ffed03905d0 R14: 0000000000000000 R15: 0000000000000000
    Code: 77 28 e8 64 34 1f 00 45 31 c0 31 c9 48 8d 55 c8 48 89 df 48 8d 75 d0 e8 ff f9 ff ff 85 c0 41 89 c4 0f 88 84 00 00 00 4c 8b 7d c8 <49> 8b 75 18 4c 89 ff e8 24 f8 ff ff 85 c0 41 89 c4 78 6d 49 8b
    RIP: encrypted_update+0xb0/0x170 RSP: ffffb15c40773de8
    CR2: 0000000000000018

Cc: <stable@vger.kernel.org> # v2.6.12+
Reported-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Eric Biggers <ebiggers@google.com>

Orabug: 27913330
CVE: CVE-2017-15299

(cherry picked from commit 60ff5b2f547af3828aebafd54daded44cfb0807a)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Shan Hai <shan.hai@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

drm/vmwgfx: NULL pointer dereference in vmw_surface_define_ioctl()

Before allocating memory, vmw_surface_define_ioctl() checks the
upper bound of a user-supplied size, but does not check whether the
supplied size is 0.

Add a check to avoid NULL pointer dereferences.
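
A minimal userspace sketch of this kind of validation, assuming a
request that carries a per-face mip level count. The function name
validate_num_sizes() and the limits MAX_FACES and MAX_MIP_LEVELS are
hypothetical and chosen for illustration; this is not the driver code.
The total is rejected both when it exceeds the upper bound and when it
is zero, so later code never works with an empty allocation:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_FACES      6   /* hypothetical limit, for illustration */
#define MAX_MIP_LEVELS 24  /* hypothetical limit, for illustration */

/*
 * Sum the per-face mip level counts supplied by the caller and validate
 * the total. Returns 0 and stores the total on success, -EINVAL otherwise.
 */
int validate_num_sizes(const uint32_t *mip_levels, uint32_t *num_sizes_out)
{
    uint32_t num_sizes = 0;

    assert(mip_levels != NULL && num_sizes_out != NULL);

    for (int i = 0; i < MAX_FACES; i++) {
        if (mip_levels[i] > MAX_MIP_LEVELS)
            return -EINVAL;
        num_sizes += mip_levels[i];
    }

    /* Upper bound check plus the zero check that was missing. */
    if (num_sizes > MAX_FACES * MAX_MIP_LEVELS || num_sizes == 0)
        return -EINVAL;

    *num_sizes_out = num_sizes;
    return 0;
}
```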

Cc: <stable@vger.kernel.org>
Signed-off-by: Murray McAllister <murray.mcallister@insomniasec.com>
Reviewed-by: Sinclair Yeh <syeh@vmware.com>
Orabug: 27913367
CVE: CVE-2017-7294

(cherry picked from commit 36274ab8c596f1240c606bb514da329add2a1bcd)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Shan Hai <shan.hai@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

vmscan: Support multiple kswapd threads per node

Page replacement is handled in the Linux kernel in one of two ways:

1) Asynchronously via kswapd
2) Synchronously, via direct reclaim

At page allocation time the allocating task is immediately given a page
from the zone free list, allowing it to go right back to work doing
whatever it was doing, most likely executing business logic, directly
or indirectly.

Just prior to satisfying the allocation, the free page count is checked
to see if it has reached the zone low watermark and if so, kswapd is
awakened.
Kswapd will start scanning pages looking for inactive pages to evict to
make room for new page allocations. The work of kswapd allows tasks to
continue allocating memory from their respective zone free list without
incurring any delay.

When the demand for free pages exceeds the rate at which kswapd tasks
can supply them, page allocation works differently. Once the allocating task
finds that the number of free pages is at or below the zone min
watermark, the task will no longer pull pages from the free list.
Instead, the task will run the same CPU-bound routines as kswapd to
satisfy its own allocation by scanning and evicting pages. This is
called a direct reclaim.
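
The two watermark checks described above can be summarized in a small
sketch. This is a simplification for illustration, not the allocator
code; the names alloc_path, zone_marks and choose_alloc_path() are
hypothetical:

```c
#include <assert.h>

/* Which path an allocation takes, per the description above. */
enum alloc_path {
    ALLOC_FAST,           /* page comes straight from the free list */
    ALLOC_WAKE_KSWAPD,    /* page comes from the free list, kswapd is woken */
    ALLOC_DIRECT_RECLAIM  /* task must scan and evict pages itself */
};

/* Hypothetical per-zone watermarks, in pages. */
struct zone_marks {
    unsigned long min;
    unsigned long low;
};

enum alloc_path choose_alloc_path(unsigned long free_pages,
                                  const struct zone_marks *wm)
{
    assert(wm->min <= wm->low);

    if (free_pages <= wm->min)
        return ALLOC_DIRECT_RECLAIM; /* at or below the min watermark */
    if (free_pages <= wm->low)
        return ALLOC_WAKE_KSWAPD;    /* low watermark reached */
    return ALLOC_FAST;
}
```

As long as kswapd keeps the free page count above the min watermark,
allocating tasks stay on the first two paths and never pay the reclaim
cost themselves.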

The time spent performing a direct reclaim can be substantial, ranging
from tens to hundreds of milliseconds for small order-0 allocations to
half a second or more for order-9 huge-page allocations. In fact, kswapd
is not actually required on a Linux system. It exists for the sole
purpose of optimizing performance by preventing direct reclaims.

When memory shortfall is sufficient to trigger direct reclaims, they can
occur in any task that is running on the system. A single aggressive
memory allocating task can set the stage for collateral damage to occur
in small tasks that rarely allocate additional memory. Consider the
impact of injecting an additional 100ms of latency when nscd allocates
memory to facilitate caching of a DNS query.

The presence of direct reclaims 10 years ago was a fairly reliable
indicator that too much was being asked of a Linux system. Kswapd was
likely wasting time scanning pages that were ineligible for eviction.
Adding RAM or reducing the working set size would usually make the
problem go away. Since then hardware has evolved to bring a new struggle
for kswapd. Storage speeds have increased by orders of magnitude while
CPU clock speeds have actually slowed down. This presents a throughput
problem for a single threaded kswapd that will get worse with each
generation of new hardware.

------------
Test Details
------------

The tests below were designed with the assumption that a kswapd
bottleneck is best demonstrated using filesystem reads. This way, the
inactive list will be full of clean pages, simplifying the analysis and
allowing kswapd to achieve the highest possible steal rate. Maximum
steal rates for kswapd are likely to be the same or lower for any other
mix of page types on the system.

Tests were run on a 2U Oracle X7-2L with 52 Intel Xeon Skylake 2GHz
cores, 756GB of RAM and 8 x 3.6 TB NVMe Solid State Disk drives. Each
drive has an XFS filesystem mounted separately as /d0 through /d7. NVMe
drives require multiple concurrent streams to show their potential, so I
created eleven 250GB zero-filled files on each drive so that I could test
with parallel reads.

The test script runs in multiple stages. At each stage, the number of dd
tasks run concurrently is increased by 2. For brevity, I did not include
all of the test output.

During each stage dd tasks are launched to read from each drive in a
round robin fashion until the specified number of tasks for the stage
has been reached. Then iostat, vmstat and top are started in the
background with 10 second intervals. After five minutes, all of the dd
tasks are killed and the iostat, vmstat and top output is parsed in
order to report the following:

CPU consumption
- sy: aggregate kernel mode CPU consumption from vmstat output. The
  value doesn't tend to fluctuate much so I just grab the highest value.
  Each sample is averaged over 10 seconds
- dd_cpu: for all of the dd tasks averaged across the top samples since
  there is a lot of variation.

Throughput
- reported in KB/s
- collected with: iostat -x -d 10 -g total

This first test performs reads using O_DIRECT in order to show the peak
throughput that can be obtained using these drives. It also demonstrates
how rapidly throughput scales as the number of dd tasks is increased.

The dd command for this test looks like this:

Command Used: dd iflag=direct if=/d${i}/$n of=/dev/null bs=4M

dd sy dd_cpu throughput
6  1  4.52   14966994.50
10 1  4.94   21503269.37
16 1  4.70   25791251.00
22 1  5.02   26139553.00
28 1  4.85   26242989.00
34 2  4.53   26253264.20
40 2  3.82   26265978.60
46 2  3.39   26256091.80
52 2  3.06   26256913.60
58 2  2.74   26256988.40
64 2  2.50   26256534.20
70 2  2.27   26255088.00
76 2  2.12   26247909.00
80 2  1.99   26251164.80

Throughput peaked with 40 dd tasks at 26265978.60 KB/s. Very little
system CPU was consumed, as expected, because the drives DMA directly
into the user address space when using O_DIRECT.

The remaining tests do not use O_DIRECT. We drop the page cache before
testing and stop the test as soon as kswapd wakes up.

dd sy dd_cpu throughput
6  2  30.34  5245348.50
10 3  32.53  7735288.00
16 5  32.78  11059243.20
22 6  30.77  13371912.80
28 8  31.52  16092092.00
34 10 30.12  18000076.80
40 11 29.34  19368494.40
46 11 26.45  20450313.60
52 13 25.47  21249290.40
58 13 23.75  22008188.80
64 13 21.38  22298248.80
70 15 21.39  22442940.80
76 14 18.82  22876260.80
80 15 19.45  23143716.80

Each read has to pause after the buffer in kernel space is populated
while that data is added to the pagecache and copied into the user
address space. For this reason, more parallel streams are required to
achieve peak throughput. The copy operation consumes substantially more
CPU than direct IO as expected.

The next test measures throughput after kswapd starts running. This is
the same test, except that we wait for kswapd to wake up before we
start collecting metrics.

The script actually keeps track of a few things that were not mentioned
earlier. It tracks direct reclaims and page scans by watching the
metrics in /proc/vmstat. CPU consumption for kswapd is tracked the same
way it is tracked for dd.

Since the test is 100% reads, you can assume that the page steal rate
for kswapd and direct reclaims is almost identical to the scan rate.

1 kswapd thread per node
dd sy dd_cpu kswapd0 kswapd1 throughput  dr's  pgscan_kswapd pgscan_direct
10 4  31.56  24.86   23.56   7828015.60  0     460668253     0
16 7  36.10  65.94   71.74   11149401.80 0     900894848     0
22 10 37.79  91.94   94.25   14271445.20 179   1149779893    4236610
28 14 46.04  86.71   84.39   14633873.80 14624 829782742     346638611
34 16 41.23  85.76   84.04   16058195.40 16303 927594668     386370834
40 20 47.65  69.78   68.79   15381517.80 28538 566746661     676222823
46 22 45.76  64.39   64.50   15941522.40 32567 510483237     771659039
52 25 47.40  60.56   63.15   15189850.20 34504 422051924     816932307
58 29 48.21  53.44   57.19   15191931.40 38313 330596133     907319630
64 32 49.88  51.08   51.10   15073485.20 41939 233133908     993009680
70 36 50.71  51.32   51.54   15265733.00 43348 209193894     1026357511
76 40 51.78  51.39   54.38   15091290.20 43804 192167798     1037072462
80 44 52.95  48.91   55.95   15009935.60 44218 177718893     1046802606

Look closely at the scan statistics and the CPU consumption numbers and
it should be clear that the bulk of the CPU consumption is occurring in
the context of the dd tasks due to direct reclaims, not kswapd.

Same test, more kswapd tasks:

6 kswapd threads per node
dd sy dd_cpu kswapd0 kswapd1 throughput  dr's  pgscan_kswapd pgscan_direct
10 4  33.19  6.98    6.44    8184050.97  0     460877355     0
16 10 41.61  27.99   28.26   11533556.80 0     941456735     0
22 12 39.00  28.12   29.67   14303265.20 10    1170356251    237431
28 15 37.53  38.01   40.42   16449001.40 30    1355387318    711292
34 19 38.87  49.81   51.33   18094928.20 0     1495622630    0
40 22 37.62  56.93   59.27   19562580.80 0     1618461307    0
46 25 36.51  64.00   66.34   20800868.60 0     1715162179    0
52 28 36.89  70.68   74.60   21650189.60 0     1787311285    0
58 34 37.44  80.59   81.43   22395273.00 1190  1794721827    28089474
64 44 50.22  67.36   76.96   21848111.20 18150 1105060289    429342513
70 46 55.57  56.59   64.22   18766118.20 27918 724659653     660301547
76 50 42.37  67.79   75.83   23688889.40 18603 1171174674    440088674
80 49 40.14  72.05   79.60   22350506.00 15680 1310470634    370843890

With 58 dd tasks, throughput is roughly the same as what we saw without
memory pressure. Ten additional kswapd tasks (5 per node) resulted in a
17% increase in aggregate kernel mode CPU consumption.
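
The 17% figure can be checked against the 58 dd task rows of the two
tables above, where sy is 29 with one kswapd thread per node and 34
with six. The helper below is only a worked example of that
arithmetic; percent_change() is a hypothetical name:

```c
#include <assert.h>

/* Relative change from 'before' to 'after', in percent. */
double percent_change(double before, double after)
{
    assert(before != 0.0);
    return (after - before) / before * 100.0;
}
```

percent_change(29, 34) comes to about 17.2. Applying the same helper to
the 58 dd task throughput values, 22395273.00 KB/s with six kswapd
threads per node is within about 2% of the 22008188.80 KB/s measured
without memory pressure.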

NOTE: The kswapd tasks were originally tracked with an array of task
structs in each pg_data_t structure. Sadly, any change to pg_data_t
resulted in KABI breakage. Look for the following definition that was
used as a workaround:

static struct task_struct *kswapd_list[MAX_NUMNODES][MAX_KSWAPD_THREADS];

Orabug: 27913411

Signed-off-by: Buddy Lumpkin <buddy.lumpkin@oracle.com>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: John Sobecki <john.sobecki@oracle.com>
Reviewed-by: Henry Willard <henry.willard@oracle.com>

Signed-off-by: Brian Maly <brian.maly@oracle.com>

tcp: don't use F-RTO on non-recurring timeouts

Currently F-RTO may repeatedly send new data packets on non-recurring
timeouts in CA_Loss mode. This is a bug because F-RTO (RFC5682)
should only be used on either new recovery or recurring timeouts.

This slows recovery progress during frequent timeout &
repair, because we prioritize sending new data packets instead of
repairing the holes when the bandwidth is already scarce.

Fix it by correcting the test of a new recovery episode.

Orabug: 27901860

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit f82b681a511f4d61069e9586a9cf97bdef371ef3)

Reviewed-by: Hakon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

net/rds: ib: Release correct number of frags

Commit c682e8474bd4 ("net/rds: reduce memory footprint during
ib_post_recv in IB transport") introduces an SG list instead of a
single contiguous fragment. When rebuilding the caches, it attempts
to release the number of fragments used by the new connection,
independent of the actual number of fragments used by the cache. This
leads to a kernel crash. Instead, release the correct number of
fragments.

Orabug: 27924161

Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Reviewed-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

crypto: rng - Remove old low-level rng interface

Orabug: 27926676
CVE: CVE-2017-15116

Now that all rng implementations have switched over to the new
interface, we can remove the old low-level interface.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit 94f1bb15bed84ad6c893916b7e7b9db6f1d7eec6)
Signed-off-by: Brian Maly <brian.maly@oracle.com>

crypto: drbg - Convert to new rng interface

Orabug: 27926676
CVE: CVE-2017-15116

This patch converts the DRBG implementation to the new low-level
rng interface.

This allows us to get rid of struct drbg_gen by using the new RNG
API instead.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Stephan Mueller <smueller@chronox.de>
(cherry picked from commit 8fded5925d0a733c46f8d0b5edd1c9b315882b1d)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
crypto/drbg.c

Signed-off-by: Brian Maly <brian.maly@oracle.com>

crypto: ansi_cprng - Convert to new rng interface

Orabug: 27926676
CVE: CVE-2017-15116

This patch converts the ANSI CPRNG implementation to the new
low-level rng interface.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
(cherry picked from commit e7c2422a839bfc6876a2f7a9b283bb2963f0287b)
Signed-off-by: Brian Maly <brian.maly@oracle.com>

crypto: krng - Convert to new rng interface

Orabug: 27926676
CVE: CVE-2017-15116

This patch converts the KRNG implementation to the new low-level
rng interface.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit e33cf2c5aab7d0012e7890089e89ae2466c2449c)
Signed-off-by: Brian Maly <brian.maly@oracle.com>