www.infradead.org Git - users/jedix/linux-maple.git/log

x86/microcode: Synchronize late microcode loading

Original idea by Ashok, completely rewritten by Borislav.

Before you read any further: the early loading method is still the
preferred one and you should always do that. The following patch is
improving the late loading mechanism for long running jobs and cloud use
cases.

Gather all cores and serialize the microcode update on them by doing it
one-by-one to make the late update process as reliable as possible and
avoid potential issues caused by the microcode update.

[ Borislav: Rewrite completely. ]

Co-developed-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Tested-by: Ashok Raj <ashok.raj@intel.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Arjan Van De Ven <arjan.van.de.ven@intel.com>
Link: https://lkml.kernel.org/r/20180228102846.13447-8-bp@alien8.de
(cherry picked from commit a5321aec6412b20b5ad15db2d6b916c05349dbff)

Orabug: 29754165

Conflicts --- quite a few. Notable ones:
* We don't have microcode cache and so call request_microcode_fw() for
  each CPU
* No need to get/put_online_cpus() --- they are part of stop_machine()
* No stop_machine_cpuslocked() in uek4 but uek4's version of
  stop_machine() prevents CPU hotplug.
* uek4's has fewer result codes for microcode operations and thus error
  handling is slightly different.

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

x86/speculation: Support 'mitigations=' cmdline option

commit d68be4c4d31295ff6ae34a8ddfaa4c1a8ff42812 upstream

Configure x86 runtime CPU speculation bug mitigations in accordance with
the 'mitigations=' cmdline option. This affects Meltdown, Spectre v2,
Speculative Store Bypass, and L1TF.

The default behavior is unchanged.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jiri Kosina <jkosina@suse.cz> (on x86)
Reviewed-by: Jiri Kosina <jkosina@suse.cz>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Jon Masters <jcm@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: linux-s390@vger.kernel.org
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-arch@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Tyler Hicks <tyhicks@canonical.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Phil Auld <pauld@redhat.com>
Link: https://lkml.kernel.org/r/6616d0ae169308516cfdf5216bedd169f8a8291b.1555085500.git.jpoimboe@redhat.com
(cherry picked from commit aaa95f2f1112dd4ec31ae13c4cf877dc7c7fcbc8)

Orabug: 29526900
CVE: CVE-2018-12126
CVE: CVE-2018-12130
CVE: CVE-2018-12127

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
Documentation/admin-guide/kernel-parameters.txt
arch/x86/kernel/cpu/bugs.c
arch/x86/mm/pti.c
Documentation/admin-guide/kernel-parameters.txt: different location
arch/x86/kernel/cpu/bugs.c: different name (bugs_64.c). Also we have different logic in nospectre_v2.
arch/x86/mm/pti.c: different location for mitigation arch/x86/mm/kaiser.c.

cpu/speculation: Add 'mitigations=' cmdline option

commit 98af8452945c55652de68536afdde3b520fec429 upstream

Keeping track of the number of mitigations for all the CPU speculation
bugs has become overwhelming for many users.  It's getting more and more
complicated to decide which mitigations are needed for a given
architecture.  Complicating matters is the fact that each arch tends to
have its own custom way to mitigate the same vulnerability.

Most users fall into a few basic categories:

a) they want all mitigations off;

b) they want all reasonable mitigations on, with SMT enabled even if
   it's vulnerable; or

c) they want all reasonable mitigations on, with SMT disabled if
   vulnerable.

Define a set of curated, arch-independent options, each of which is an
aggregation of existing options:

- mitigations=off: Disable all mitigations.

- mitigations=auto: [default] Enable all the default mitigations, but
  leave SMT enabled, even if it's vulnerable.

- mitigations=auto,nosmt: Enable all the default mitigations, disabling
  SMT if needed by a mitigation.

Currently, these options are placeholders which don't actually do
anything.  They will be fleshed out in upcoming patches.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jiri Kosina <jkosina@suse.cz> (on x86)
Reviewed-by: Jiri Kosina <jkosina@suse.cz>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Jon Masters <jcm@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: linux-s390@vger.kernel.org
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-arch@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Tyler Hicks <tyhicks@canonical.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Phil Auld <pauld@redhat.com>
Link: https://lkml.kernel.org/r/b07a8ef9b7c5055c3a4637c87d07c296d5016fe0.1555085500.git.jpoimboe@redhat.com
(cherry picked from commit 6cbbaa933b325234d6ffc93836d6b7c06dea7a56)

Orabug: 29526900
CVE: CVE-2018-12126
CVE: CVE-2018-12130
CVE: CVE-2018-12127

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
Documentation/admin-guide/kernel-parameters.txt
Different location: Documentation/kernel-parameters.txt

x86/speculation/mds: Print SMT vulnerable on MSBDS with mitigations off

commit e2c3c94788b08891dcf3dbe608f9880523ecd71b upstream

This code is only for CPUs which are affected by MSBDS, but are *not*
affected by the other two MDS issues.

For such CPUs, enabling the mds_idle_clear mitigation is enough to
mitigate SMT.

However if user boots with 'mds=off' and still has SMT enabled, we should
not report that SMT is mitigated:

$cat /sys//devices/system/cpu/vulnerabilities/mds
Vulnerable; SMT mitigated

But rather:
Vulnerable; SMT vulnerable

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Tyler Hicks <tyhicks@canonical.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lkml.kernel.org/r/20190412215118.294906495@localhost.localdomain
(cherry picked from commit 31203de125c7e160bcb42927a5db0bf01de98f6a)

Orabug: 29526900
CVE: CVE-2018-12126
CVE: CVE-2018-12130
CVE: CVE-2018-12127

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/kernel/cpu/bugs.c
bugs.c vs bug_64.c in UEK4

x86/speculation/mds: Fix comment

commit cae5ec342645746d617dd420d206e1588d47768a upstream

s/L1TF/MDS/

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Tyler Hicks <tyhicks@canonical.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
(cherry picked from commit 9a5f1b636984bcf0a8495208e0fc9f30e2ec8359)

Orabug: 29526900
CVE: CVE-2018-12126
CVE: CVE-2018-12130
CVE: CVE-2018-12127

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/kernel/cpu/bugs.c
bugs.c vs bugs_64.c on UEK4

x86/speculation/mds: update mds_mitigation to reflect debugfs configuration

If we enable mds_user_clear, we set mds_mitigation to MDS_MITIGATION_FULL or
MDS_MITIGATION_VMWERV. When we disable mds_user_clear, we set mds_mitigation to
MDS_MITIGATION_IDLE if mds_idle_clear, otherwise MDS_MITIGATION_OFF. If we
enable mds_idle_clear, we set mds_mitigation to MDS_MITIGATION_IDLE only if
mds_user_clear is disabled. When we disable mds_idle_clear, we set
mds_mitigation to MDS_MITIGATION_OFF if mds_user_clear is disabled.

Orabug: 29526900
CVE: CVE-2018-12126
CVE: CVE-2018-12130
CVE: CVE-2018-12127

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>

x86/speculation/mds: fix microcode late loading

In the microcode late loading case we have to:
- clear the CPU bugs related to MDS to be re-evaluated
- add proper evaluation of the MDS state and enable mitigation if necessary.

If the user has enforced off or idle mitigation, we keep it. Also if the
microcode fixes the MDS bug, mitigation will be turned off.

Orabug: 29526900
CVE: CVE-2018-12126
CVE: CVE-2018-12130
CVE: CVE-2018-12127

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>

x86/speculation/mds: Add boot option to enable MDS protection only while in idle

Such protection may be useful when we only care about virtualization and
no core sharing between guests is allowed.

Orabug: 29526900
CVE: CVE-2018-12126
CVE: CVE-2018-12130
CVE: CVE-2018-12127

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>

x86/speculation/mds: Improve coverage for MDS vulnerability

We seem to be missing a bunch of cases when we don't clear fill/store
buffers for MDS vulnerability during return to userspace.

Since we always call DISABLE_IBRS in those cases let's define a new
macro SPEC_RETURN_TO_USER than will both disable IBRS and flush the
buffers.

Orabug: 29526900
CVE: CVE-2018-12126
CVE: CVE-2018-12130
CVE: CVE-2018-12127

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>

x86/speculation/mds: Add SMT warning message

commit 39226ef02bfb43248b7db12a4fdccb39d95318e3 upstream

MDS is vulnerable with SMT. Make that clear with a one-time printk
whenever SMT first gets enabled.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Tyler Hicks <tyhicks@canonical.com>
Acked-by: Jiri Kosina <jkosina@suse.cz>
(cherry picked from commit 2dad2dcf7f7eab0e6d10d560e22fddd0a08b37b3)

Orabug: 29526900
CVE: CVE-2018-12126
CVE: CVE-2018-12130
CVE: CVE-2018-12127

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/kernel/cpu/bugs.c
bugs.c vs bugs_64.c: we do not have arch_smt_update. Squash the check in mds_select_mitigation.

x86/speculation/mds: Add mds=full,nosmt cmdline option

commit d71eb0ce109a124b0fa714832823b9452f2762cf upstream

Add the mds=full,nosmt cmdline option. This is like mds=full, but with
SMT disabled if the CPU is vulnerable.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Tyler Hicks <tyhicks@canonical.com>
Acked-by: Jiri Kosina <jkosina@suse.cz>
(cherry picked from commit 623b724d5e50c15d160799446956ba0d23d4f978)

Orabug: 29526900
CVE: CVE-2018-12126
CVE: CVE-2018-12130
CVE: CVE-2018-12127

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
Documentation/admin-guide/kernel-parameters.txt
arch/x86/kernel/cpu/bugs.c
bugs.64 vs bugs_64.c: different boot command line parsing code
Documentation/admin-guide/kernel-parameters.txt vs Documentation/kernel-parameters.txt

Documentation: Add MDS vulnerability documentation

commit 5999bbe7a6ea3c62029532ec84dc06003a1fa258 upstream

Add the initial MDS vulnerability documentation.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jon Masters <jcm@redhat.com>
(cherry picked from commit 3b5d5994ef174daf5f77ba3f101cd879cf0c5fd6)

Orabug: 29526900
CVE: CVE-2018-12126
CVE: CVE-2018-12130
CVE: CVE-2018-12127

Signed-off-by: Kanth Ghatraju <kanth.ghatraju@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>

Documentation: Move L1TF to separate directory

commit 65fd4cb65b2dad97feb8330b6690445910b56d6a upstream

Move L1TF to a separate directory so the MDS stuff can be added at the
side. Otherwise the all hardware vulnerabilites have their own top level
entry. Should have done that right away.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jon Masters <jcm@redhat.com>
(cherry picked from commit 19049ee5b2543696dd3cb164a4c4e566984f9615)

Orabug: 29526900
CVE: CVE-2018-12126
CVE: CVE-2018-12130
CVE: CVE-2018-12127

Signed-off-by: Kanth Ghatraju <kanth.ghatraju@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>

x86/speculation/mds: Add mitigation mode VMWERV

commit 22dd8365088b6403630b82423cf906491859b65e upstream

In virtualized environments it can happen that the host has the microcode
update which utilizes the VERW instruction to clear CPU buffers, but the
hypervisor is not yet updated to expose the X86_FEATURE_MD_CLEAR CPUID bit
to guests.

Introduce an internal mitigation mode VWWERV which enables the invocation
of the CPU buffer clearing even if X86_FEATURE_MD_CLEAR is not set. If the
system has no updated microcode this results in a pointless execution of
the VERW instruction wasting a few CPU cycles. If the microcode is updated,
but not exposed to a guest then the CPU buffers will be cleared.

That said: Virtual Machines Will Eventually Receive Vaccine

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jon Masters <jcm@redhat.com>
Tested-by: Jon Masters <jcm@redhat.com>
(cherry picked from commit a2227b3f734fad5ace7c99103d6a7bc020c193fd)

Orabug: 29526900
CVE: CVE-2018-12126
CVE: CVE-2018-12130
CVE: CVE-2018-12127

Signed-off-by: Kanth Ghatraju <kanth.ghatraju@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
The changes to arch/x86/kernel/cpu/bugs.c instead need to be made to
arch/x86/kernel/cpu/bugs_64.c.

x86/speculation/mds: Add debugfs for controlling MDS

Add debugfs entries for controlling mds_user_clear and mds_idle_clear static
keys which would enable the mitigation in the respective paths.

Orabug: 29526900
CVE: CVE-2018-12126
CVE: CVE-2018-12130
CVE: CVE-2018-12127

Signed-off-by: Kanth Ghatraju <kanth.ghatraju@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>

x86/speculation/mds: Add sysfs reporting for MDS

commit 8a4b06d391b0a42a373808979b5028f5c84d9c6a upstream

Add the sysfs reporting file for MDS. It exposes the vulnerability and
mitigation state similar to the existing files for the other speculative
hardware vulnerabilities.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jon Masters <jcm@redhat.com>
Tested-by: Jon Masters <jcm@redhat.com>
(cherry picked from commit db366061fff1f76407cb5d1b0975fcc381400cc3)

Orabug: 29526900
CVE: CVE-2018-12126
CVE: CVE-2018-12130
CVE: CVE-2018-12127

Signed-off-by: Kanth Ghatraju <kanth.ghatraju@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
The changes to arch/x86/kernel/cpu/bugs.c instead need to be made to
arch/x86/kernel/cpu/bugs_64.c.
X86_HYPER_NATIVE doesn't exist so just leave that change out.
sched_smt_active() does not exist, instead use cpu_smp_control.
hypervisor_is_type replaced with cpu_has_hypervisor

x86/speculation/mds: Add mitigation control for MDS

commit bc1241700acd82ec69fde98c5763ce51086269f8 upstream

Now that the mitigations are in place, add a command line parameter to
control the mitigation, a mitigation selector function and a SMT update
mechanism.

This is the minimal straight forward initial implementation which just
provides an always on/off mode. The command line parameter is:

mds=[full|off]

This is consistent with the existing mitigations for other speculative
hardware vulnerabilities.

The idle invocation is dynamically updated according to the SMT state of
the system similar to the dynamic update of the STIBP mitigation. The idle
mitigation is limited to CPUs which are only affected by MSBDS and not any
other variant, because the other variants cannot be mitigated on SMT
enabled systems.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jon Masters <jcm@redhat.com>
Tested-by: Jon Masters <jcm@redhat.com>
(cherry picked from commit 4cad86e4abd472f637038e0ad70a70d0d7333f83)

Orabug: 29526900
CVE: CVE-2018-12126
CVE: CVE-2018-12130
CVE: CVE-2018-12127

Signed-off-by: Kanth Ghatraju <kanth.ghatraju@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
The changes to arch/x86/kernel/cpu/bugs.c instead need to be made to
arch/x86/kernel/cpu/bugs_64.c.

x86/speculation/mds: Conditionally clear CPU buffers on idle entry

commit 07f07f55a29cb705e221eda7894dd67ab81ef343 upstream

Add a static key which controls the invocation of the CPU buffer clear
mechanism on idle entry. This is independent of other MDS mitigations
because the idle entry invocation to mitigate the potential leakage due to
store buffer repartitioning is only necessary on SMT systems.

Add the actual invocations to the different halt/mwait variants which
covers all usage sites. mwaitx is not patched as it's not available on
Intel CPUs.

The buffer clear is only invoked before entering the C-State to prevent
that stale data from the idling CPU is spilled to the Hyper-Thread sibling
after the Store buffer got repartitioned and all entries are available to
the non idle sibling.

When coming out of idle the store buffer is partitioned again so each
sibling has half of it available. Now CPU which returned from idle could be
speculatively exposed to contents of the sibling, but the buffers are
flushed either on exit to user space or on VMENTER.

When later on conditional buffer clearing is implemented on top of this,
then there is no action required either because before returning to user
space the context switch will set the condition flag which causes a flush
on the return to user path.

Note, that the buffer clearing on idle is only sensible on CPUs which are
solely affected by MSBDS and not any other variant of MDS because the other
MDS variants cannot be mitigated when SMT is enabled, so the buffer
clearing on idle would be a window dressing exercise.

This intentionally does not handle the case in the acpi/processor_idle
driver which uses the legacy IO port interface for C-State transitions for
two reasons:

- The acpi/processor_idle driver was replaced by the intel_idle driver
   almost a decade ago. Anything Nehalem upwards supports it and defaults
   to that new driver.

- The legacy IO port interface is likely to be used on older and therefore
   unaffected CPUs or on systems which do not receive microcode updates
   anymore, so there is no point in adding that.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Jon Masters <jcm@redhat.com>
Tested-by: Jon Masters <jcm@redhat.com>
(cherry picked from commit 7af0d7a8d40fa9dbf7700ed8e78f93c90b9a9e8b)

Orabug: 29526900
CVE: CVE-2018-12126
CVE: CVE-2018-12130
CVE: CVE-2018-12127

Signed-off-by: Kanth Ghatraju <kanth.ghatraju@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Conflicts:
Ignore added comments in the nonexistent function in mwait.h
Port the changes in bugs.c to bugs_64.c.
Move #include <cpufeature.h> in alternative.h to resolve implicit definition error.

x86/kvm/vmx: Add MDS protection when L1D Flush is not active

commit 650b68a0622f933444a6d66936abb3103029413b upstream

CPUs which are affected by L1TF and MDS mitigate MDS with the L1D Flush on
VMENTER when updated microcode is installed.

If a CPU is not affected by L1TF or if the L1D Flush is not in use, then
MDS mitigation needs to be invoked explicitly.

For these cases, follow the host mitigation state and invoke the MDS
mitigation before VMENTER.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jon Masters <jcm@redhat.com>
Tested-by: Jon Masters <jcm@redhat.com>
(cherry picked from commit f0da457fb20b9f321adb8acce0c287eff6f850d1)

Orabug: 29526900
CVE: CVE-2018-12126
CVE: CVE-2018-12130
CVE: CVE-2018-12127

Signed-off-by: Kanth Ghatraju <kanth.ghatraju@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
Changes from bugs.c imported to bugs_64.c

x86/speculation/mds: Clear CPU buffers on exit to user

commit 04dcbdb8057827b043b3c71aa397c4c63e67d086 upstream

Add a static key which controls the invocation of the CPU buffer clear
mechanism on exit to user space and add the call into
prepare_exit_to_usermode() and do_nmi() right before actually returning.

Add documentation which kernel to user space transition this covers and
explain why some corner cases are not mitigated.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Jon Masters <jcm@redhat.com>
Tested-by: Jon Masters <jcm@redhat.com>
(cherry picked from commit 62ba379c5925f285e3ab9362761c18823e5a049e)

Orabug: 29526900
CVE: CVE-2018-12126
CVE: CVE-2018-12130
CVE: CVE-2018-12127

Signed-off-by: Kanth Ghatraju <kanth.ghatraju@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
UEK4 uses bugs_64.c instead of bugs.c arch/x86/entry/common.c doesn't exist.
Make the corresponding changes to arch/x86/kernel/entry_64.S.

x86/speculation/mds: Add mds_clear_cpu_buffers()

commit 6a9e529272517755904b7afa639f6db59ddb793e upstream

The Microarchitectural Data Sampling (MDS) vulernabilities are mitigated by
clearing the affected CPU buffers. The mechanism for clearing the buffers
uses the unused and obsolete VERW instruction in combination with a
microcode update which triggers a CPU buffer clear when VERW is executed.

Provide a inline function with the assembly magic. The argument of the VERW
instruction must be a memory operand as documented:

  "MD_CLEAR enumerates that the memory-operand variant of VERW (for
   example, VERW m16) has been extended to also overwrite buffers affected
   by MDS. This buffer overwriting functionality is not guaranteed for the
   register operand variant of VERW."

Documentation also recommends to use a writable data segment selector:

  "The buffer overwriting occurs regardless of the result of the VERW
   permission check, as well as when the selector is null or causes a
   descriptor load segment violation. However, for lowest latency we
   recommend using a selector that indicates a valid writable data
   segment."

Add x86 specific documentation about MDS and the internal workings of the
mitigation.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Jon Masters <jcm@redhat.com>
Tested-by: Jon Masters <jcm@redhat.com>
(cherry picked from commit 2f7a0abf72199b1c9cbaae99270343993b104921)

Orabug: 29526900
CVE: CVE-2018-12126
CVE: CVE-2018-12130
CVE: CVE-2018-12127

Signed-off-by: Kanth Ghatraju <kanth.ghatraju@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
Added an assembly version of mds_clear_cpu_buffers

x86/kvm: Expose X86_FEATURE_MD_CLEAR to guests

commit 6c4dbbd14730c43f4ed808a9c42ca41625925c22 upstream

X86_FEATURE_MD_CLEAR is a new CPUID bit which is set when microcode
provides the mechanism to invoke a flush of various exploitable CPU buffers
by invoking the VERW instruction.

Hand it through to guests so they can adjust their mitigations.

This also requires corresponding qemu changes, which are available
separately.

[ tglx: Massaged changelog ]

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Jon Masters <jcm@redhat.com>
Tested-by: Jon Masters <jcm@redhat.com>
(cherry picked from commit 0908473b20312b30f2600e4b16027d6c7facef4a)

Orabug: 29526900
CVE: CVE-2018-12126
CVE: CVE-2018-12130
CVE: CVE-2018-12127

Signed-off-by: Kanth Ghatraju <kanth.ghatraju@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/kvm/cpuid.c
Different initial content of cpuid bits.

x86/speculation/mds: Add BUG_MSBDS_ONLY

commit e261f209c3666e842fd645a1e31f001c3a26def9 upstream

This bug bit is set on CPUs which are only affected by Microarchitectural
Store Buffer Data Sampling (MSBDS) and not by any other MDS variant.

This is important because the Store Buffers are partitioned between
Hyper-Threads so cross thread forwarding is not possible. But if a thread
enters or exits a sleep state the store buffer is repartitioned which can
expose data from one thread to the other. This transition can be mitigated.

That means that for CPUs which are only affected by MSBDS SMT can be
enabled, if the CPU is not affected by other SMT sensitive vulnerabilities,
e.g. L1TF. The XEON PHI variants fall into that category. Also the
Silvermont/Airmont ATOMs, but for them it's not really relevant as they do
not support SMT, but mark them for completeness sake.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Jon Masters <jcm@redhat.com>
Tested-by: Jon Masters <jcm@redhat.com>
(cherry picked from commit 24a8b2c167461f94542ecce1b0e48b30a9ef1a3b)

Orabug: 29526900
CVE: CVE-2018-12126
CVE: CVE-2018-12130
CVE: CVE-2018-12127

Signed-off-by: Kanth Ghatraju <kanth.ghatraju@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>

x86/speculation/mds: Add basic bug infrastructure for MDS

commit ed5194c2732c8084af9fd159c146ea92bf137128 upstream

Microarchitectural Data Sampling (MDS), is a class of side channel attacks
on internal buffers in Intel CPUs. The variants are:

- Microarchitectural Store Buffer Data Sampling (MSBDS) (CVE-2018-12126)
- Microarchitectural Fill Buffer Data Sampling (MFBDS) (CVE-2018-12130)
- Microarchitectural Load Port Data Sampling (MLPDS) (CVE-2018-12127)

MSBDS leaks Store Buffer Entries which can be speculatively forwarded to a
dependent load (store-to-load forwarding) as an optimization. The forward
can also happen to a faulting or assisting load operation for a different
memory address, which can be exploited under certain conditions. Store
buffers are partitioned between Hyper-Threads so cross thread forwarding is
not possible. But if a thread enters or exits a sleep state the store
buffer is repartitioned which can expose data from one thread to the other.

MFBDS leaks Fill Buffer Entries. Fill buffers are used internally to manage
L1 miss situations and to hold data which is returned or sent in response
to a memory or I/O operation. Fill buffers can forward data to a load
operation and also write data to the cache. When the fill buffer is
deallocated it can retain the stale data of the preceding operations which
can then be forwarded to a faulting or assisting load operation, which can
be exploited under certain conditions. Fill buffers are shared between
Hyper-Threads so cross thread leakage is possible.

MLDPS leaks Load Port Data. Load ports are used to perform load operations
from memory or I/O. The received data is then forwarded to the register
file or a subsequent operation. In some implementations the Load Port can
contain stale data from a previous operation which can be forwarded to
faulting or assisting loads under certain conditions, which again can be
exploited eventually. Load ports are shared between Hyper-Threads so cross
thread leakage is possible.

All variants have the same mitigation for single CPU thread case (SMT off),
so the kernel can treat them as one MDS issue.

Add the basic infrastructure to detect if the current CPU is affected by
MDS.

[ tglx: Rewrote changelog ]

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 928a58d368271748e2d8a4bdb7ba1412c5f1348f)

Orabug: 29526900
CVE: CVE-2018-12126
CVE: CVE-2018-12130
CVE: CVE-2018-12127

Signed-off-by: Kanth Ghatraju <kanth.ghatraju@oracle.com>
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>

x86/speculation: Consolidate CPU whitelists

commit 36ad35131adacc29b328b9c8b6277a8bf0d6fd5d upstream

The CPU vulnerability whitelists have some overlap and there are more
whitelists coming along.

Use the driver_data field in the x86_cpu_id struct to denote the
whitelisted vulnerabilities and combine all whitelists into one.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jon Masters <jcm@redhat.com>
Tested-by: Jon Masters <jcm@redhat.com>
(cherry picked from commit dbcafb8714cdee996b72fc47e9df8eeca394ba70)

Orabug: 29526900
CVE: CVE-2018-12126
CVE: CVE-2018-12130
CVE: CVE-2018-12127

Signed-off-by: Kanth Ghatraju <kanth.ghatraju@oracle.com>
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>

x86/msr-index: Cleanup bit defines

commit d8eabc37310a92df40d07c5a8afc53cebf996716 upstream

Greg pointed out that speculation related bit defines are using (1 << N)
format instead of BIT(N). Aside of that (1 << N) is wrong as it should use
1UL at least.

Clean it up.

[ Josh Poimboeuf: Fix tools build ]

Reported-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Jon Masters <jcm@redhat.com>
Tested-by: Jon Masters <jcm@redhat.com>
(cherry picked from commit 4b2fc235844db92090c944b92174afb341395c97)

Orabug: 29526900
CVE: CVE-2018-12126
CVE: CVE-2018-12130
CVE: CVE-2018-12127

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>

Documentation/l1tf: Fix small spelling typo

commit 60ca05c3b44566b70d64fbb8e87a6e0c67725468 upstream

Fix small typo (wiil -> will) in the "3.4. Nested virtual machines"
section.

Fixes: 5b76a3cff011 ("KVM: VMX: Tell the nested hypervisor to skip L1D flush on vmentry")
Cc: linux-kernel@vger.kernel.org
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-doc@vger.kernel.org
Cc: trivial@kernel.org
Signed-off-by: Salvatore Bonaccorso <carnil@debian.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
(cherry picked from commit 118c7a0fb6157b93bc6998b09f6bd15ccded5550)

Orabug: 29526900
CVE: CVE-2018-12126
CVE: CVE-2018-12130
CVE: CVE-2018-12127

Signed-off-by: Kanth Ghatraju <kanth.ghatraju@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>

x86/speculation: Simplify the CPU bug detection logic

commit 8ecc4979b1bd9c94168e6fc92960033b7a951336 upstream

Only CPUs which speculate can speculate. Therefore, it seems prudent
to test for cpu_no_speculation first and only then determine whether
a specific speculating CPU is susceptible to store bypass speculation.
This is underlined by all CPUs currently listed in cpu_no_speculation
were present in cpu_no_spec_store_bypass as well.

Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: bp@suse.de
Cc: konrad.wilk@oracle.com
Link: https://lkml.kernel.org/r/20180522090539.GA24668@light.dominikbrodowski.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
(cherry picked from commit d24249216c7ddfc290843d9708e33adec921700b)

Orabug: 29526900
CVE: CVE-2018-12126
CVE: CVE-2018-12130
CVE: CVE-2018-12127

Signed-off-by: Kanth Ghatraju <kanth.ghatraju@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>

x86/apic: Make arch_setup_hwirq NUMA node aware

This is a second attempt to fix bug 29292411. The first attempt ran
afoul of special behavior in cluster_vector_allocation_domain() explained
below. This was discovered by QA immediately before QU7 was to be release,
so the original was reverted.

In a xen VM with vNUMA enabled, irq affinity for a device on node 1
may become stuck on CPU 0. /proc/irq/nnn/smp_affinity_list may show
affinity for all the CPUs on node 1, but this is wrong. All interrupts
are on the first CPU of node 0 which is usually CPU 0.

The problem is caused when __assign_irq_vector() is called by
arch_setup_hwirq() with a mask of all online CPUs, and then called later
with a mask including only the node 1 CPUs. The first call assigns affinity
to CPU 0, and the second tries to move affinity to the first online node 1
CPU. In the reported case this is always CPU 2. For some reason, the
CPU 0 affinity is never cleaned up, and all interrupts remain with CPU 0.
Since an incomplete move appears to be in progress, all attempts to
reassign affinity for the irq fail. Because of a quirk in how affinity is
displayed in /proc/irq/nnn/smp_affinity_list, changes may appear to work
temporarily.

It was not reproducible on baremetal on the machine I had available for
testing, but it is possible that it was observed on other machines. It
does not appear in UEK5. The APIC and IRQ code is very different in UEK5,
and the code changed here doesn't exist in UEK5. Upstream has completely
abandoned the UEK4 approach to IRQ management. It is unknown whether KVM
guests might see the same problem with UEK4.

Making arch_setup_hwirq() NUMA sensitive eliminates the problem by
using the correct cpumask for the node for the initial assignment. The
second assignment becomes a noop. After initialization is complete,
affinity can be moved to any CPU on any node and back without a problem.

However, cluster_vector_allocation_domain() contains a hack designed to
reduce vector pressure in cluster x2apic. Specifically, it relies on
the address of the cpumask passed to it to determine if this allocation
is for the default device bringup case or explicit migration. If the
address of the cpumask does not match what is returned by
apic->target_cpus(), it assumes it is the explicit migration case and goes
into cluster mode which uses up vectors on multiple CPUs. Since the
original patch modifies arch_setup_hwirq() to pass a cpumask with only
local CPUs in it, cluster_vector_allocation_domain() allocates for the
entire cluster rather than a single CPU. This can cause vector allocation
failures when there are a very large number of devices such as can be the
case when there are a large number of VFs (see bug 29534769).

Orabug: 29534769
Signed-off-by: Henry Willard <henry.willard@oracle.com>
Reviewed-by: Shan Hai <shan.hai@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

KEYS: encrypted: fix buffer overread in valid_master_desc()

With the 'encrypted' key type it was possible for userspace to provide a
data blob ending with a master key description shorter than expected,
e.g. 'keyctl add encrypted desc "new x" @s'. When validating such a
master key description, validate_master_desc() could read beyond the end
of the buffer. Fix this by using strncmp() instead of memcmp(). [Also
clean up the code to deduplicate some logic.]

Cc: Mimi Zohar <zohar@linux.vnet.ibm.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: James Morris <james.l.morris@oracle.com>
(cherry picked from commit 794b4bc292f5d31739d89c0202c54e7dc9bc3add)

Orabug: 29591025
CVE: CVE-2017-13305

Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

scsi: target: remove hardcoded T10 Vendor ID in INQUIRY response

Use the value stored in t10_wwn.vendor, which defaults to "LIO-ORG",
but can be reconfigured via the vendor_id ConfigFS attribute.

This commit was backported by hand.

Orabug: 29344862

Signed-off-by: Alan Adamson <alan.adamson@oracle.com>
Reviewed-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

scsi: target: add device vendor id, product id and revision configfs attributes

The vendor_id, product_id and revision attributes will allow for the
modification of the T10 Model and Revision strings returned in
inquiry responses. Its value can be viewed and modified via the
ConfigFS path at:

target/core/$backstore/$name/wwn/vendor_id
target/core/$backstore/$name/wwn/product_id
target/core/$backstore/$name/wwn/revision

Orabug: 29344862

Signed-off-by: Alan Adamson <alan.adamson@oracle.com>
Reviewed-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

scsi: target: consistently null-terminate t10_wwn strings

In preparation for supporting user provided vendor strings, add an extra
byte to the vendor, model and revision arrays in struct t10_wwn. This
ensures that the full INQUIRY data can be carried in the arrays along with
a null-terminator.

Change a number of array readers and writers so that they account for
explicit null-termination:

- The pscsi_set_inquiry_info() and emulate_model_alias_store() codepaths
  don't currently explicitly null-terminate; fix this.

- Existing t10_wwn field dumps use for-loops which step over
  null-terminators for right-padding.
  + Use printf with width specifiers instead.

This commit was backported by hand.

Orabug: 29344862

Signed-off-by: Alan Adamson <alan.adamson@oracle.com>
Reviewed-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

scsi: target: use consistent left-aligned ASCII INQUIRY data

spc5r17.pdf specifies:

  4.3.1 ASCII data field requirements
  ASCII data fields shall contain only ASCII printable characters (i.e.,
  code values 20h to 7Eh) and may be terminated with one or more ASCII null
  (00h) characters.  ASCII data fields described as being left-aligned
  shall have any unused bytes at the end of the field (i.e., highest
  offset) and the unused bytes shall be filled with ASCII space characters
  (20h).

LIO currently space-pads the T10 VENDOR IDENTIFICATION and PRODUCT
IDENTIFICATION fields in the standard INQUIRY data. However, the PRODUCT
REVISION LEVEL field in the standard INQUIRY data as well as the T10 VENDOR
IDENTIFICATION field in the INQUIRY Device Identification VPD Page are
zero-terminated/zero-padded.

Fix this inconsistency by using space-padding for all of the above fields.

Signed-off-by: David Disseldorp <ddiss@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bryant G. Ly <bly@catalogicsoftware.com>
Reviewed-by: Lee Duncan <lduncan@suse.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Roman Bolshakov <r.bolshakov@yadro.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 0de263577de5d5e052be5f4f93334e63cc8a7f0b)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/target/target_core_spc.c

Orabug: 29344862

Signed-off-by: Alan Adamson <alan.adamson@oracle.com>
Reviewed-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

ext4: fix data corruption caused by unaligned direct AIO

commit 372a03e01853f860560eade508794dd274e9b390 upstream.

Ext4 needs to serialize unaligned direct AIO because the zeroing of
partial blocks of two competing unaligned AIOs can result in data
corruption.

However it decides not to serialize if the potentially unaligned aio is
past i_size with the rationale that no pending writes are possible past
i_size. Unfortunately if the i_size is not block aligned and the second
unaligned write lands past i_size, but still into the same block, it has
the potential of corrupting the previous unaligned write to the same
block.

This is (very simplified) reproducer from Frank

    // 41472 = (10 * 4096) + 512
    // 37376 = 41472 - 4096

    ftruncate(fd, 41472);
    io_prep_pwrite(iocbs[0], fd, buf[0], 4096, 37376);
    io_prep_pwrite(iocbs[1], fd, buf[1], 4096, 41472);

    io_submit(io_ctx, 1, &iocbs[1]);
    io_submit(io_ctx, 1, &iocbs[2]);

    io_getevents(io_ctx, 2, 2, events, NULL);

Without this patch the 512B range from 40960 up to the start of the
second unaligned write (41472) is going to be zeroed overwriting the data
written by the first write. This is a data corruption.

00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
*
00009200  30 30 30 30 30 30 30 30  30 30 30 30 30 30 30 30
*
0000a000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
*
0000a200  31 31 31 31 31 31 31 31  31 31 31 31 31 31 31 31

With this patch the data corruption is avoided because we will recognize
the unaligned_aio and wait for the unwritten extent conversion.

00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
*
00009200  30 30 30 30 30 30 30 30  30 30 30 30 30 30 30 30
*
0000a200  31 31 31 31 31 31 31 31  31 31 31 31 31 31 31 31
*
0000b200

Reported-by: Frank Sorenson <fsorenso@redhat.com>
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Fixes: e9e3bcecf44c ("ext4: serialize unaligned asynchronous DIO")
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 2ebfb9ae0047e900a932e6219a1a8ee830cde940)
Orabug: 29553371
Signed-off-by: Calum Mackay <calum.mackay@oracle.com>
Reviewed-by: John Donnelly <John.p.donnelly@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

swiotlb: checking whether swiotlb buffer is full with io_tlb_used

This patch uses io_tlb_used to help check whether swiotlb buffer is full.
io_tlb_used is no longer used for only debugfs. It is also used to help
optimize swiotlb_tbl_map_single().

Suggested-by: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Orabug: 29582587
(cherry picked from commit 60513ed06a41049768a6875229b025b6e726e148)

Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

swiotlb: add debugfs to track swiotlb buffer usage

The device driver will not be able to do dma operations once swiotlb buffer
is full, either because the driver is using so many IO TLB blocks inflight,
or because there is memory leak issue in device driver. To export the
swiotlb buffer usage via debugfs would help the user estimate the size of
swiotlb buffer to pre-allocate or analyze device driver memory leak issue.

Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Orabug: 29582587
(cherry picked from commit 71602fe6d4e9291af105adfef8e893b57c735906)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
debugfs_create_ulong() is not available. debugfs_create_file() is used instead.

Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

swiotlb: fix comment on swiotlb_bounce()

Fix the comment as swiotlb_bounce() is used to copy from original dma
location to swiotlb buffer during swiotlb_tbl_map_single(), while to
copy from swiotlb buffer to original dma location during
swiotlb_tbl_unmap_single().

Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Orabug: 29582587
(cherry picked from commit 6442ca2abf882d9d838fb844d852ba6acd1db7f4)

Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

NFSv4.1: nfs4_fl_prepare_ds must be careful about reporting success.

Various places assume that if nfs4_fl_prepare_ds() turns a non-NULL 'ds',
then ds->ds_clp will also be non-NULL.

This is not necessasrily true in the case when the process received a fatal signal
while nfs4_pnfs_ds_connect is waiting in nfs4_wait_ds_connect().
In that case ->ds_clp may not be set, and the devid may not recently have been marked
unavailable.

So add a test for ds_clp == NULL and return NULL in that case.

Fixes: c23266d532b4 ("NFS4.1 Fix data server connection race")
Signed-off-by: NeilBrown <neilb@suse.com>
Acked-by: Olga Kornievskaia <aglo@umich.edu>
Acked-by: Adamson, Andy <William.Adamson@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
(cherry picked from commit cfd278c280f997cf2fe4662e0acab0fe465f637b)
Orabug: 29617508
Signed-off-by: Calum Mackay <calum.mackay@oracle.com>
Reviewed-by: John Sobecki <john.sobecki@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

ib_core: initialize shpd field when allocating 'struct ib_pd'

The shared pd feature in Oracle Linux was added by
commit a1911c2c180d ("IB/Shared PD support from Oracle").
It adds a field named shpd to 'struct ib_pd' but fails to
initialize it in all the places it is allocated.

This results its uninitialized content being referenced
in mlx4_ib_dealloc_pd() and actions taken based on it
which eventually leads to resource leaks even when shared
pd feature is not being used.

This fix here initializes it to NULL in ib_alloc_pd() where
the ib_core module allocates the data structure.

Orabug: 29384815

Signed-off-by: Mukesh Kacker <mukesh.kacker@oracle.com>
Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

Revert "x86/apic: Make arch_setup_hwirq NUMA node aware"

Orabug: 29542185

This reverts commit ff3efcf39b34994ab1ba9390bd3a79c1a985c895.

qlcnic: fix Tx descriptor corruption on 82xx devices

In regular NIC transmission flow, driver always configures MAC using
Tx queue zero descriptor as a part of MAC learning flow.
But with multi Tx queue supported NIC, regular transmission can occur on
any non-zero Tx queue and from that context it uses
Tx queue zero descriptor to configure MAC, at the same time TX queue
zero could be used by another CPU for regular transmission
which could lead to Tx queue zero descriptor corruption and cause FW
abort.

This patch fixes this in such a way that driver always configures
learned MAC address from the same Tx queue which is used for
regular transmission.

Fixes: 7e2cf4feba05 ("qlcnic: change driver hardware interface mechanism")
Signed-off-by: Shahed Shaikh <shahed.shaikh@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Orabug: 27708787

(cherry picked from commit c333fa0c4f220f8f7ea5acd6b0ebf3bf13fd684d)
Signed-off-by: aru kolappan <aru.kolappan@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

block: Fix a race between blk_cleanup_queue() and timeout handling

Make sure that if the timeout timer fires after a queue has been
marked "dying" that the affected requests are finished.

Reported-by: chenxiang (M) <chenxiang66@hisilicon.com>
Fixes: commit 287922eb0b18 ("block: defer timeouts to a workqueue")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Tested-by: chenxiang (M) <chenxiang66@hisilicon.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
(backport from upstream commit 4e9b6f20828ac880dbc1fa2fdbafae779473d1af)

Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
block/blk-core.c
block/blk-timeout.c

Orabug: 29158186

Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

can: gw: ensure DLC boundaries after CAN frame modification

Muyu Yu provided a POC where user root with CAP_NET_ADMIN can create a CAN
frame modification rule that makes the data length code a higher value than
the available CAN frame data size. In combination with a configured checksum
calculation where the result is stored relatively to the end of the data
(e.g. cgw_csum_xor_rel) the tail of the skb (e.g. frag_list pointer in
skb_shared_info) can be rewritten which finally can cause a system crash.

Michael Kubecek suggested to drop frames that have a DLC exceeding the
available space after the modification process and provided a patch that can
handle CAN FD frames too. Within this patch we also limit the length for the
checksum calculations to the maximum of Classic CAN data length (8).

CAN frames that are dropped by these additional checks are counted with the
CGW_DELETED counter which indicates misconfigurations in can-gw rules.

This fixes CVE-2019-3701.

Reported-by: Muyu Yu <ieatmuttonchuan@gmail.com>
Reported-by: Marcus Meissner <meissner@suse.de>
Suggested-by: Michal Kubecek <mkubecek@suse.cz>
Tested-by: Muyu Yu <ieatmuttonchuan@gmail.com>
Tested-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.2
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Orabug: 29215299
CVE: CVE-2019-3701
(cherry picked from commit 0aaa81377c5a01f686bcdb8c7a6929a7bf330c68)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

CIFS: Enable encryption during session setup phase

In order to allow encryption on SMB connection we need to exchange
a session key and generate encryption and decryption keys.

Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
(cherry picked from commit cabfb3680f78981d26c078a26e5c748531257ebb)

Orabug: 29338239
CVE: CVE-2018-1066

Signed-off-by: John Donnelly <John.P.Donnelly@oracle.com>
Reviewed-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
fs/cifs/smb2pdu.c

Signed-off-by: Brian Maly <brian.maly@oracle.com>

ext4: clear i_data in ext4_inode_info when removing inline data

commit 6e8ab72a812396996035a37e5ca4b3b99b5d214b upstream.

When converting from an inode from storing the data in-line to a data
block, ext4_destroy_inline_data_nolock() was only clearing the on-disk
copy of the i_blocks[] array. It was not clearing copy of the
i_blocks[] in ext4_inode_info, in i_data[], which is the copy actually
used by ext4_map_blocks().

This didn't matter much if we are using extents, since the extents
header would be invalid and thus the extents could would re-initialize
the extents tree. But if we are using indirect blocks, the previous
contents of the i_blocks array will be treated as block numbers, with
potentially catastrophic results to the file system integrity and/or
user data.

This gets worse if the file system is using a 1k block size and
s_first_data is zero, but even without this, the file system can get
quite badly corrupted.

This addresses CVE-2018-10881.

https://bugzilla.kernel.org/show_bug.cgi?id=200015

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit deb465ec750b80776cc4ac5b92b72c0a71fd4f0b)

Orabug: 29540709
CVE: CVE-2018-10881

Signed-off-by: John Donnelly <John.P.Donnelly@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

ext4: add more inode number paranoia checks

commit c37e9e013469521d9adb932d17a1795c139b36db upstream.

If there is a directory entry pointing to a system inode (such as a
journal inode), complain and declare the file system to be corrupted.

Also, if the superblock's first inode number field is too small,
refuse to mount the file system.

This addresses CVE-2018-10882.

https://bugzilla.kernel.org/show_bug.cgi?id=200069

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit ff6c96461be35381399466ad58f02b8d78ab480a)

Orabug: 29545566
CVE: CVE-2018-10882

Signed-off-by: John Donnelly <John.P.Donnelly@oracle.com>
Reviewed-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
fs/ext4/inode.c

Signed-off-by: Brian Maly <brian.maly@oracle.com>

Revert "KVM: nVMX: Eliminate vmcs02 pool"

This reverts commit 22d8dc569898dec08b3c1fdc8a5b8b0e48ab8986.

Revert due to performance regression.

Orabug: 29542029

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

Revert "KVM: VMX: introduce alloc_loaded_vmcs"

This reverts commit 8dd66ca98dfc03f7921ce5bf5926ce2d95507d84.

Revert due to performance regression.

Orabug: 29542029

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

Revert "KVM: VMX: make MSR bitmaps per-VCPU"

This reverts commit b3bd557498bff191933d0481b609fdff0caa9ead.

Revert due to performance regression.

Orabug: 29542029

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

Revert "KVM: x86: pass host_initiated to functions that read MSRs"

This reverts commit fe09396e31ddc452c8ef2e0a5474c28d5a501e4e.

Revert due to performance regression.

Orabug: 29542029

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

Revert "KVM/x86: Add IBPB support"

This reverts commit e5455ef7dbbab7ee5bc901ffdc7666e61fc41e11.

Revert due to performance regression.

Orabug: 29542029

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

Revert "KVM/VMX: Allow direct access to MSR_IA32_SPEC_CTRL - reloaded"

This reverts commit eaaa119c0f931596b7da2bb70e33d8cd1fcea576.

Revert due to performance regression.

Orabug: 29542029

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

Revert "KVM/SVM: Allow direct access to MSR_IA32_SPEC_CTRL"

This reverts commit 53da198cf7a918baa7ac267af1d89b71bc117ce9.

Revert due to performance regression.

Orabug: 29542029

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

Revert "KVM: SVM: Add MSR-based feature support for serializing LFENCE"

This reverts commit 351ff7a273f566e1efed1be14f01fc9510ad9002.

Revert due to performance regression.

Orabug: 29542029

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

Revert "x86/cpufeatures: rename X86_FEATURE_AMD_SSBD to X86_FEATURE_LS_CFG_SSBD"

This reverts commit a1a02e26cd913a7cfb5086b3f3cb1561c04447b6.

Revert due to performance regression.

Orabug: 29542029

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

Revert "x86/bugs: Add AMD's SPEC_CTRL MSR usage"

This reverts commit 6ed384dcb12d6b7f2b8a448810e56206b9db4fb7.

Revert due to performance regression.

Orabug: 29542029

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

Revert "x86/bugs: Fix the AMD SSBD usage of the SPEC_CTRL MSR"

This reverts commit ceea9d71d84a402818f9b070d1e932a05fa46c8d.

Revert due to performance regression.

Orabug: 29542029

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/kernel/cpu/bugs_64.c

arch: x86: remove unsued SET_IBPB from spec_ctrl.h

Remove unsued SET_IBPB macro from spec_ctrl.h

Orabug: 29336760

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86: cpu: microcode: fix late loading SpectreV2 bugs eval

On microcode reloading we have to update the status of SpectreV2 mitigations if
they were not present at init time: we run the logic of selecting the default
mitigation in auto mode.

It was not possible to use the same functions as most of the logic is using
boot_command_line which is in init data and dropped after booting. Also we had
to drop some of the __init clauses on some functions we use in order to not
duplicate them.

This patch is not addressing alternative instructions related to SpectreV2. The
only one that we found so far is STUFF_RSB macro, and it will be addressed
later.

Orabug: 29336760

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86: cpu: microcode: fix late loading SSBD and L1TF bugs eval

On microcode reloading we have to update the status of SpectreV2 mitigations if
they were not present at init time: we have opted for the default mitigation:
seccomp or prctl (so per process).

For L1TF we do not have to do anything as host mitigation does not depend on
any CPU bits and from hypervisor perspective we just call vmentry_l1d_flush_set
to re-assess the mitigation. vmentry_l1d_flush_ops is exposed through this structure:
static const struct kernel_param_ops vmentry_l1d_flush_ops = {
>-------.set = vmentry_l1d_flush_set,
>-------.get = vmentry_l1d_flush_get,
};

And can be set/get using this sysfs entry:
/sys/module/kvm_intel/parameters/vmentry_l1d_flush

It was not possible to use the same functions as most of the logic is using
boot_command_line which is in init data and dropped after booting.

Orabug: 29336760

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86: cpu: microcode: Re-evaluate bugs in a CPU after microcode loading

After doing a new microcode load, we have to re-evaluate bugs in the CPU in
order to see if the microcode brought changes. We clear caps for
X86_BUG_SPEC_STORE_BYPASS, X86_BUG_CPU_MELTDOWN and X86_BUG_L1TF as these can
be modified by new microcode loading.

Orabug: 29336760

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86: cpu: microcode: update flags for all cpus

After a new microcode is loaded, we need to update cpu features and flags for
all CPUs not just for CPU 0. As microcode_late_eval_cpuid is setting some of
the boot_cpu_data we are doing CPU 0 last.

Orabug: 29336760

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/apic: Make arch_setup_hwirq NUMA node aware

In a xen VM with vNUMA enabled, irq affinity for a device on node 1
may become stuck on CPU 0. /proc/irq/nnn/smp_affinity_list may show
affinity for all the CPUs on node 1, but this is wrong. All interrupts
are on the first CPU of node 0 which is usually CPU 0.

The problem is caused when __assign_irq_vector() is called by
arch_setup_hwirq() with a mask of all online CPUs, and then called later
with a mask including only the node 1 CPUs. The first call assigns affinity
to CPU 0, and the second tries to move affinity to the first online node 1
CPU. In the reported case this is always CPU 2. For some reason, the
CPU 0 affinity is never cleaned up, and all interrupts remain with CPU 0.
Since an incomplete move appears to be in progress, all attempts to
reassign affinity for the irq fail. Because of a quirk in how affinity is
displayed in /proc/irq/nnn/smp_affinity_list, changes may appear to work
temporarily.

It was not reproducible on baremetal on the machine I had available for
testing, but it is possible that it was observed on other machines. It
does not appear in UEK5. The APIC and IRQ code is very different in UEK5,
and the code changed here doesn't exist there. It is unknown whether KVM
guests might see the same problem with UEK4.

Making arch_setup_hwirq() NUMA sensitive eliminates the problem by
using the correct cpumask for the node for the initial assignment. The
second assignment becomes a noop. After initialization is complete,
affinity can be moved to any CPU on any node and back without a problem.

Orabug: 29292411
Signed-off-by: Henry Willard <henry.willard@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

scsi: scsi_transport_iscsi: modify detected conn err to KERN_ERR

Fix for Orabug 29469714 is insufficient.  The goal of fix for
Orabug 29469714 is to print "detected conn error" message
produced by scsi_transport_iscsi module onto console which is
particularly useful when iscsi boot device is used and there
are connection issues.  The fix for Orabug 29469714 falls short
because currently, the default console log level in OL is 4
which means that only log levels less than 4 (KERN_WARNING) will
be printed to console.  This commit modifies the corresponding
printk log level to KERN_ERR (log level 3).

Orabug: 29487790
Signed-off-by: Fred Herard <fred.herard@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

xen/blkfront: avoid NULL blkfront_info dereference on device removal

If a block device is hot-added when we are out of grants,
gnttab_grant_foreign_access fails with -ENOSPC (log message "28
granting access to ring page") in this code path:

  talk_to_blkback ->
setup_blkring ->
xenbus_grant_ring ->
gnttab_grant_foreign_access

and the failing path in talk_to_blkback sets the driver_data to NULL:

destroy_blkring:
        blkif_free(info, 0);

        mutex_lock(&blkfront_mutex);
        free_info(info);
        mutex_unlock(&blkfront_mutex);

        dev_set_drvdata(&dev->dev, NULL);

This results in a NULL pointer BUG when blkfront_remove and blkif_free
try to access the failing device's NULL struct blkfront_info.

Cc: stable@vger.kernel.org # 4.5 and later
Signed-off-by: Vasilis Liaskovitis <vliaskovitis@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Orabug: 29469740

(cherry picked from commit f92898e7f32e3533bfd95be174044bc349d416ca)

Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Reviewed-by: Bhavesh Davda <bhavesh.davda@oracle.com>
Reviewed-by: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

bnxt_en: Fix race conditions in .ndo_get_stats64().

.ndo_get_stats64() may not be protected by RTNL and can race with
.ndo_stop() or other ethtool operations that can free the statistics
memory. Fix it by setting a new flag BNXT_STATE_READ_STATS and then
proceeding to read statistics memory only if the state is OPEN. The
close path that frees the memory clears the OPEN state and then waits
for the BNXT_STATE_READ_STATS to clear before proceeding to free the
statistics memory.

Fixes: c0c050c58d84 ("bnxt_en: New Broadcom ethernet driver.")
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit f9b76ebd49f97458857568918c305a17fa7c6567)
Signed-off-by: aru kolappan <aru.kolappan@oracle.com>
Orabug: 29129625

Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: aru kolappan <aru.kolappan@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

ext4: always verify the magic number in xattr blocks

commit 513f86d73855ce556ea9522b6bfd79f87356dc3a upstream.

If there an inode points to a block which is also some other type of
metadata block (such as a block allocation bitmap), the
buffer_verified flag can be set when it was validated as that other
metadata block type; however, it would make a really terrible external
attribute block. The reason why we use the verified flag is to avoid
constantly reverifying the block. However, it doesn't take much
overhead to make sure the magic number of the xattr block is correct,
and this will avoid potential crashes.

This addresses CVE-2018-10879.

https://bugzilla.kernel.org/show_bug.cgi?id=200001

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 3150e8913b957d71398511a0580606e181153d10)

Orabug: 29437127
CVE: CVE-2018-10879

Signed-off-by: John Donnelly <John.P.Donnelly@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
fs/ext4/xattr.c

Signed-off-by: Brian Maly <brian.maly@oracle.com>

ext4: add corruption check in ext4_xattr_set_entry()

commit 5369a762c882c0b6e9599e4ebbb3a9ba9eee7e2d upstream.

In theory this should have been caught earlier when the xattr list was
verified, but in case it got missed, it's simple enough to add check
to make sure we don't overrun the xattr buffer.

This addresses CVE-2018-10879.

https://bugzilla.kernel.org/show_bug.cgi?id=200001

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 0dc148230f3827f49b1e2a72d14cbad0c5c63e86)

Orabug: 29437127
CVE: CVE-2018-10879

Signed-off-by: John Donnelly <John.P.Donnelly@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
fs/ext4/xattr.c

Signed-off-by: Brian Maly <brian.maly@oracle.com>

net: add netif_is_lag_port helper

Orabug: 29495360

Some code does not mind if a device is bond slave or team port and treats
them the same, as generic LAG ports.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit e0ba1414f310c83bf425fe26fa2cd5f1befcd6dc)
Signed-off-by: Brian Maly <brian.maly@oracle.com>

net: add netif_is_lag_master helper

Orabug: 29495360

Some code does not mind if the master is bond or team and treats them
the same, as generic LAG.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 7be61833042e7757745345eedc7b0efee240c189)
Signed-off-by: Brian Maly <brian.maly@oracle.com>

net: add netif_is_team_port helper

Orabug: 29495360

Similar to other helpers, caller can use this to find out if device is
team port.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit f7f019ee6d117de5007d0b10e7960696bbf111eb)
Signed-off-by: Brian Maly <brian.maly@oracle.com>

net: add netif_is_team_master helper

Orabug: 29495360

Similar to other helpers, caller can use this to find out if device is
team master.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit c981e4213e9d2d4ec79501bd607722ec712742a2)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/net/team/team.c
include/linux/netdevice.h

scsi: scsi_transport_iscsi: redirect conn error to console

This commit changes "detected conn error" printk
log level from KERN_INFO to KERN_WARNING. This change
is made with the assumption that KERN_WARNING messages
are configured to be redirected to console. It is
particularly useful to have detected connection errors
redirected to console when using iscsi boot device as it
may give clues as to why the system appears to be hung.

Orabug: 29469714

Signed-off-by: Fred Herard <fred.herard@oracle.com>
Reviewed-by: Allen Pais <allen.pais@oracle.com>
Reviewed-by: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

Revert x86/apic/x2apic: set affinity of a single interrupt to one cpu

The commit 092aa78c11f0
("x86/apic/x2apic: set affinity of a single interrupt to one cpu")
was causing performance regression on block storage server on X5.

On OCI X5 server, they were not binding irqs to CPUs 1:1,
irq to cpu affinity was set to multiple cpus
(/proc/$irq/smp_affinity: 00,003ffff0,0003ffff, cpu0-17 and 36-53).
This is not the default behavior of bnxt_en. From bnxt_en,
driver, when NIC link is up, it sets irq affinity, OCI assumed that
most of the bnxt_en interrupts will go to cpu3.

After the patch "x86/apic/x2apic:
set affinity of a single interrupt to one cpu",
if we set irq to cpu 1:1, it works fine, but if we set irq affinity
to multiple cpus, it only sets irq_cfg->domain/cpumask to the first
online cpu which is on the cpu affinity list. With the current setting
which caused the perf issue, although /proc/$irq/smp_affinity is set
to multiple cpus, irq_cfg->domain cpumask only has cpu 0, this lead all
ens4f0-TxRx interrupts to route to cpu0, also iscsi target application
was being run on CPU0 during the testing which led to the performace issue.
The issue is no longer seen after the patch was reverted.

Orabug: 29449976
Signed-off-by: Mridula Shastry <mridula.c.shastry@oracle.com>
Reviewed-by: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

jiffies: use jiffies64_to_nsecs() to fix 100% steal usage for xen vcpu hotplug

[ Not relevant upstream, therefore no upstream commit. ]

To fix, use jiffies64_to_nsecs() directly instead of deriving the result
according to jiffies_to_usecs().

As the return type of jiffies_to_usecs() is 'unsigned int', when the return
value is more than the size of 'unsigned int', the leading 32 bits would be
discarded.

Suppose USEC_PER_SEC=1000000L and HZ=1000, below are the expected and
actual incorrect result of jiffies_to_usecs(0x7770ef70):

- expected  : jiffies_to_usecs(0x7770ef70) = 0x000001d291274d80
- incorrect : jiffies_to_usecs(0x7770ef70) = 0x0000000091274d80

The leading 0x000001d200000000 is discarded.

After xen vcpu hotplug and when the new vcpu steal clock is calculated for
the first time, the result of this_rq()->prev_steal_time in
steal_account_process_tick() would be far smaller than the expected
value, due to that jiffies_to_usecs() discards the leading 32 bits.

As a result, the diff between current steal and this_rq()->prev_steal_time
is always very large. Steal usage would become 100% when the initial steal
clock obtained from xen hypervisor is very large during xen vcpu hotplug,
that is, when the guest is already up for a long time.

The bug can be detected by doing the following:

* Boot xen guest with vcpus=2 and maxvcpus=4
* Leave the guest running for a month so that the initial steal clock for
  the new vcpu would be very large
* Hotplug 2 extra vcpus
* The steal time of new vcpus in /proc/stat would increase abnormally and
  sometimes steal usage in top can become 100%

This was incidentally fixed in the patch set starting by
commit 93825f2ec736 ("jiffies: Reuse TICK_NSEC instead of NSEC_PER_JIFFY")
and ended with
commit b672592f0221 ("sched/cputime: Remove generic asm headers").

Orabug: 28806208

Link: https://lkml.org/lkml/2019/2/28/1373
Suggested-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Reviewed-by: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

time: Introduce jiffies64_to_nsecs()

This will be needed for the cputime_t to nsec conversion.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-2-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 07e5f5e353aaa61696c8353d87050994a0c4648a)

Orabug: 28806208

This backport makes jiffies64_to_nsecs() available for the next patch.

Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Reviewed-by: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

net_failover: delay taking over primary device to accommodate udevd renaming

There is an inherent race with udev rename in userspace due to the exposure
of two lower slave devices while kernel attempts to manage the creation
for failover bonding itself automatically. The existing userspace naming
logic in udevd was not specifically written for this in-kernel
automagic.

The clean fix for the problem is either to update the udevd to not try
rename the 3-netdev (ideally rename the device in a coordinated manner),
or to fix the kernel to hide the 2 lower devices which does not have to
be shown to userspace unless needed (1-netdev model).

However, our pursuance of 1-netdev model had not been acknowledged by
upstream, and there's no motivation in the systemd/udevd community at
this point to refactor the rename logic and make it work well with
3-netdev.

Hyper-V's netvsc mitigated this by postponing the VF's dev_open() to
allow a userspace thread to rename the device within a 100ms worth of
window. For the interim, we follow the same as done by netvsc to avoid
the renaming failure, until we move to the point where a clean solution
is available in upstream.

OraBug: 29281273

Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
Acked-by: John Haxby <john.haxby@oracle.com>
Reviewed-by: Vijay Balakrishna <vijay.balakrishna@oracle.com>
Tested-by: Vijay Balakrishna <vijay.balakrishna@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

fs/dcache.c: add cond_resched() in shrink_dentry_list()

Orabug: 29412146

As previously reported (https://patchwork.kernel.org/patch/8642031/)
it's possible to call shrink_dentry_list with a large number of dentries
(> 10000).  This, in turn, could trigger the softlockup detector and
possibly trigger a panic.  In addition to the unmount path being
vulnerable to this scenario, at SuSE we've observed similar situation
happening during process exit on processes that touch a lot of dentries.
Here is an excerpt from a crash dump.  The number after the colon are
the number of dentries on the list passed to shrink_dentry_list:

PID 99760: 10722
PID 107530: 215
PID 108809: 24134
PID 108877: 21331
PID 141708: 16487

So we want to kill between 15k-25k dentries without yielding.

And one possible call stack looks like:

4 [ffff8839ece41db0] _raw_spin_lock at ffffffff8152a5f8
5 [ffff8839ece41db0] evict at ffffffff811c3026
6 [ffff8839ece41dd0] __dentry_kill at ffffffff811bf258
7 [ffff8839ece41df0] shrink_dentry_list at ffffffff811bf593
8 [ffff8839ece41e18] shrink_dcache_parent at ffffffff811bf830
9 [ffff8839ece41e50] proc_flush_task at ffffffff8120dd61
10 [ffff8839ece41ec0] release_task at ffffffff81059ebd
11 [ffff8839ece41f08] do_exit at ffffffff8105b8ce
12 [ffff8839ece41f78] sys_exit at ffffffff8105bd53
13 [ffff8839ece41f80] system_call_fastpath at ffffffff81532909

While some of the callers of shrink_dentry_list do use cond_resched,
this is not sufficient to prevent softlockups.  So just move
cond_resched into shrink_dentry_list from its callers.

David said: I've found hundreds of occurrences of warnings that we emit
when need_resched stays set for a prolonged period of time with the
stack trace that is included in the change log.

Link: http://lkml.kernel.org/r/1521718946-31521-1-git-send-email-nborisov@suse.com
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 32785c0539b7e96f77a14a4f4ab225712665a5a4)
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Conflicts:
fs/dcache.c
Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

NFS: commit direct writes even if they fail partially

If some of the WRITE calls making up an O_DIRECT write syscall fail,
we neglect to commit, even if some of the WRITEs succeed.

We also depend on the commit code to free the reference count on the
nfs_page taken in the "if (request_commit)" case at the end of
nfs_direct_write_completion(). The problem was originally noticed
because ENOSPC's encountered partway through a write would result in a
closed file being sillyrenamed when it should have been unlinked.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
(cherry picked from commit 1b8d97b0a837beaf48a8449955b52c650a7114b4)

Orabug: 28212440
Signed-off-by: Manjunath Patil <manjunath.b.patil@oracle.com>
Reviewed-by: Calum Mackay <calum.mackay@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

rds: update correct congestion map for loopback transport

Loopback transport delivers data directly to the destination
socket since destination is always local to the machine.
The connection data structure passed to an internal
initializer API rds_inc_init() is from send side (unlike the
usual call to APIs from receive path when receive side
connection data structure is passed).

This inconsistency causes an update of the incorrect
congestion map when marking destination port congested when
loopback transport is used (which is when one or both end(s)
of a RDS connection has an IP loopback address).

The fix it to ensure correct map is updated, that of the
destination IP regardless of delivery coming from send side
of loopback transport or receive side of other transports.

Orabug: 29175685

Signed-off-by: Mukesh Kacker <mukesh.kacker@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

ext4: only look at the bg_flags field if it is valid

The bg_flags field in the block group descripts is only valid if the
uninit_bg or metadata_csum feature is enabled. We were not
consistently looking at this field; fix this.

Also block group #0 must never have uninitialized allocation bitmaps,
or need to be zeroed, since that's where the root inode, and other
special inodes are set up. Check for these conditions and mark the
file system as corrupted if they are detected.

This addresses CVE-2018-10876.

https://bugzilla.kernel.org/show_bug.cgi?id=199403

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
(cherry picked from commit 8844618d8aa7a9973e7b527d038a2a589665002c)

Orabug: 29316684
CVE: CVE-2018-10876.

Signed-off-by: John Donnelly <John.P.Donnelly@oracle.com>
Reviewed-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
fs/ext4/balloc.c
fs/ext4/ialloc.c

Signed-off-by: Brian Maly <brian.maly@oracle.com>

uek-rpm: Add kernel-uek version to kernel-ueknano provides

Orabug: 29357643

kernel-ueknano package provides kernel-uek with no version in it. So it
can match any kernel-uek version installed. This commit adds the version
that the rpm is built for in kernel-uek provides also.

Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

net: Set sk_prot_creator when cloning sockets to the right proto

sk->sk_prot and sk->sk_prot_creator can differ when the app uses
IPV6_ADDRFORM (transforming an IPv6-socket to an IPv4-one).
Which is why sk_prot_creator is there to make sure that sk_prot_free()
does the kmem_cache_free() on the right kmem_cache slab.

Now, if such a socket gets transformed back to a listening socket (using
connect() with AF_UNSPEC) we will allocate an IPv4 tcp_sock through
sk_clone_lock() when a new connection comes in. But sk_prot_creator will
still point to the IPv6 kmem_cache (as everything got copied in
sk_clone_lock()). When freeing, we will thus put this
memory back into the IPv6 kmem_cache although it was allocated in the
IPv4 cache. I have seen memory corruption happening because of this.

With slub-debugging and MEMCG_KMEM enabled this gives the warning
"cache_from_obj: Wrong slab cache. TCPv6 but object is from TCP"

A C-program to trigger this:

void main(void)
{
        int fd = socket(AF_INET6, SOCK_STREAM, IPPROTO_TCP);
        int new_fd, newest_fd, client_fd;
        struct sockaddr_in6 bind_addr;
        struct sockaddr_in bind_addr4, client_addr1, client_addr2;
        struct sockaddr unsp;
        int val;

        memset(&bind_addr, 0, sizeof(bind_addr));
        bind_addr.sin6_family = AF_INET6;
        bind_addr.sin6_port = ntohs(42424);

        memset(&client_addr1, 0, sizeof(client_addr1));
        client_addr1.sin_family = AF_INET;
        client_addr1.sin_port = ntohs(42424);
        client_addr1.sin_addr.s_addr = inet_addr("127.0.0.1");

        memset(&client_addr2, 0, sizeof(client_addr2));
        client_addr2.sin_family = AF_INET;
        client_addr2.sin_port = ntohs(42421);
        client_addr2.sin_addr.s_addr = inet_addr("127.0.0.1");

        memset(&unsp, 0, sizeof(unsp));
        unsp.sa_family = AF_UNSPEC;

        bind(fd, (struct sockaddr *)&bind_addr, sizeof(bind_addr));

        listen(fd, 5);

        client_fd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
        connect(client_fd, (struct sockaddr *)&client_addr1, sizeof(client_addr1));
        new_fd = accept(fd, NULL, NULL);
        close(fd);

        val = AF_INET;
        setsockopt(new_fd, SOL_IPV6, IPV6_ADDRFORM, &val, sizeof(val));

        connect(new_fd, &unsp, sizeof(unsp));

        memset(&bind_addr4, 0, sizeof(bind_addr4));
        bind_addr4.sin_family = AF_INET;
        bind_addr4.sin_port = ntohs(42421);
        bind(new_fd, (struct sockaddr *)&bind_addr4, sizeof(bind_addr4));

        listen(new_fd, 5);

        client_fd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
        connect(client_fd, (struct sockaddr *)&client_addr2, sizeof(client_addr2));

        newest_fd = accept(new_fd, NULL, NULL);
        close(new_fd);

        close(client_fd);
        close(new_fd);
}

As far as I can see, this bug has been there since the beginning of the
git-days.

Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 9d538fa60bad4f7b23193c89e843797a1cf71ef3)

Orabug: 29422739
CVE: CVE-2018-9568

Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

ext4: always check block group bounds in ext4_init_block_bitmap()

commit 819b23f1c501b17b9694325471789e6b5cc2d0d2 upstream.

Regardless of whether the flex_bg feature is set, we should always
check to make sure the bits we are setting in the block bitmap are
within the block group bounds.

https://bugzilla.kernel.org/show_bug.cgi?id=199865

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit ac48bb9bc0a32f5a4432be1645b57607f8c46aa7)

Orabug: 29428607
CVE: CVE-2018-10878

Signed-off-by: John Donnelly <John.P.Donnelly@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

ext4: make sure bitmaps and the inode table don't overlap with bg descriptors

commit 77260807d1170a8cf35dbb06e07461a655f67eee upstream.

It's really bad when the allocation bitmaps and the inode table
overlap with the block group descriptors, since it causes random
corruption of the bg descriptors. So we really want to head those off
at the pass.

https://bugzilla.kernel.org/show_bug.cgi?id=199865

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit ac93c718365ac6ea9d7631641c8dec867d623491)

Orabug: 29428607
CVE: CVE-2018-10878

Signed-off-by: John Donnelly <John.P.Donnelly@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

vfs: Add sb_rdonly(sb) to query the MS_RDONLY flag on s_flags

Add an sb_rdonly() function to query the MS_RDONLY flag on sb->s_flags
preparatory to providing an SB_RDONLY flag.

Signed-off-by: David Howells <dhowells@redhat.com>
(cherry picked from commit 94e92e7ac90d06e1e839e112d3ae80b2457dbdd7)

Orabug: 29428607
CVE: CVE-2018-10878

Signed-off-by: John Donnelly <John.P.Donnelly@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

iscsi: Capture iscsi debug messages using tracepoints

This commit enhances iscsi initiator modules to capture iscsi debug messages
using linux kernel tracepoint facility:

https://www.kernel.org/doc/Documentation/trace/tracepoints.txt

The following tracepoint events have been created under the iscsi tracepoint
event group:

iscsi_dbg_conn - to capture connection debug messages (libiscsi module)
iscsi_dbg_session - to capture session debug messages (libiscsi module)
iscsi_dbg_eh - to capture error handling debug messages (libiscsi module)
iscsi_dbg_tcp - to capture iscsi tcp debug messages (libiscsi_tcp module)
iscsi_dbg_sw_tcp - to capture iscsi sw tcp debug messages (iscsi_tcp module)
iscsi_dbg_trans_session - to cpature iscsi trasnsport sess debug messages
(scsi_transport_iscsi module)
iscsi_dbg_trans_conn - to capture iscsi tansport conn debug messages
(scsi_transport_iscsi module)

Orabug: 29429855

Signed-off-by: Fred Herard <fred.herard@oracle.com>
Reviewed-by: Rajan Shanmugavelu <rajan.shanmugavelu@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

KEYS: add missing permission check for request_key() destination

When the request_key() syscall is not passed a destination keyring, it
links the requested key (if constructed) into the "default" request-key
keyring.  This should require Write permission to the keyring.  However,
there is actually no permission check.

This can be abused to add keys to any keyring to which only Search
permission is granted.  This is because Search permission allows joining
the keyring.  keyctl_set_reqkey_keyring(KEY_REQKEY_DEFL_SESSION_KEYRING)
then will set the default request-key keyring to the session keyring.
Then, request_key() can be used to add keys to the keyring.

Both negatively and positively instantiated keys can be added using this
method.  Adding negative keys is trivial.  Adding a positive key is a
bit trickier.  It requires that either /sbin/request-key positively
instantiates the key, or that another thread adds the key to the process
keyring at just the right time, such that request_key() misses it
initially but then finds it in construct_alloc_key().

Fix this bug by checking for Write permission to the keyring in
construct_get_dest_keyring() when the default keyring is being used.

We don't do the permission check for non-default keyrings because that
was already done by the earlier call to lookup_user_key().  Also,
request_key_and_link() is currently passed a 'struct key *' rather than
a key_ref_t, so the "possessed" bit is unavailable.

We also don't do the permission check for the "requestor keyring", to
continue to support the use case described by commit 8bbf4976b59f
("KEYS: Alter use of key instantiation link-to-keyring argument") where
/sbin/request-key recursively calls request_key() to add keys to the
original requestor's destination keyring.  (I don't know of any users
who actually do that, though...)

Fixes: 3e30148c3d52 ("[PATCH] Keys: Make request-key create an authorisation key")
Cc: <stable@vger.kernel.org> # v2.6.13+
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
(cherry picked from commit 4dca6ea1d9432052afb06baf2e3ae78188a4410b)

Orabug: 29304551
CVE: CVE-2017-17807

Reviewed-by: John Donnelly <John.p.donnelly@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
security/keys/request_key.c

Signed-off-by: Brian Maly <brian.maly@oracle.com>

KEYS: Don't permit request_key() to construct a new keyring

If request_key() is used to find a keyring, only do the search part - don't
do the construction part if the keyring was not found by the search. We
don't really want keyrings in the negative instantiated state since the
rejected/negative instantiation error value in the payload is unioned with
keyring metadata.

Now the kernel gives an error:

request_key("keyring", "#selinux,bdekeyring", "keyring", KEY_SPEC_USER_SESSION_KEYRING) = -1 EPERM (Operation not permitted)

Signed-off-by: David Howells <dhowells@redhat.com>
(cherry picked from commit 911b79cde95c7da0ec02f48105358a36636b7a71)

Orabug: 29304551
CVE: CVE-2017-17807

Reviewed-by: John Donnelly <John.p.donnelly@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

mlx4_ib: Distribute completion vectors when zero is supplied

MAD packet sending/receiving is not properly virtualized in
CX-3. Hence, these are proxied through the PF driver. The proxying
uses UD QPs. The associated CQs are created with completion vector
zero, in anticipation that zero will return the least-used vector, as
per commit 6ba1eb776461 ("IB/mlx4: Scatter CQs to different EQs").

However, this does not happen, and we see that only the first EQ is
used for these proxy QPs.

This leads to great imbalance in CPU processing, in particular during
fail-over and fail-back, when a large number of RDMA CM
requests/responses are proxied through the PF driver.

Orabug: 29318191

Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Sudhakar Dindukurti <sudhakar.dindukurti@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

bnxt_en: Fix TX timeout during netpoll.

Orabug: 29357977

The current netpoll implementation in the bnxt_en driver has problems
that may miss TX completion events. bnxt_poll_work() in effect is
only handling at most 1 TX packet before exiting. In addition,
there may be in flight TX completions that ->poll() may miss even
after we fix bnxt_poll_work() to handle all visible TX completions.
netpoll may not call ->poll() again and HW may not generate IRQ
because the driver does not ARM the IRQ when the budget (0 for netpoll)
is reached.

We fix it by handling all TX completions and to always ARM the IRQ
when we exit ->poll() with 0 budget.

Also, the logic to ACK the completion ring in case it is almost filled
with TX completions need to be adjusted to take care of the 0 budget
case, as discussed with Eric Dumazet <edumazet@google.com>

Reported-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Song Liu <songliubraving@fb.com>
Tested-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 73f21c653f930f438d53eed29b5e4c65c8a0f906)
Signed-off-by: Brian Maly <brian.maly@oracle.com>

bnxt_en: Fix for system hang if request_irq fails

Orabug: 29357977

Fix bug in the error code path when bnxt_request_irq() returns failure.
bnxt_disable_napi() should not be called in this error path because
NAPI has not been enabled yet.

Fixes: c0c050c58d84 ("bnxt_en: New Broadcom ethernet driver.")
Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit c58387ab1614f6d7fb9e244f214b61e7631421fc)
Signed-off-by: Brian Maly <brian.maly@oracle.com>

bnxt_en: Fix firmware message delay loop regression.

Orabug: 29357977

A recent change to reduce delay granularity waiting for firmware
reponse has caused a regression.  With a tighter delay loop,
the driver may see the beginning part of the response faster.
The original 5 usec delay to wait for the rest of the message
is not long enough and some messages are detected as invalid.

Increase the maximum wait time from 5 usec to 20 usec.  Also, fix
the debug message that shows the total delay time for the response
when the message times out.  With the new logic, the delay time
is not fixed per iteration of the loop, so we define a macro to
show the total delay time.

Fixes: 9751e8e71487 ("bnxt_en: reduce timeout on initial HWRM calls")
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit cc559c1ac250a6025bd4a9528e424b8da250655b)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/net/ethernet/broadcom/bnxt/bnxt.c

bnxt_en: reduce timeout on initial HWRM calls

Orabug: 29357977

Testing with DIM enabled on older kernels indicated that firmware calls
were slower than expected. More detailed analysis indicated that the
default 25us delay was higher than necessary. Reducing the time spend in
usleep_range() for the first several calls would reduce the overall
latency of firmware calls on newer Intel processors.

Signed-off-by: Andy Gospodarek <gospo@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 9751e8e714872aa650b030e52a9fafbb694a3714)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/net/ethernet/broadcom/bnxt/bnxt.c

bnxt_en: Fix NULL pointer dereference at bnxt_free_irq().

Orabug: 29357977

When open fails during ethtool -L ring change, for example, the driver
may crash at bnxt_free_irq() because bp->bnapi is NULL.

If we fail to allocate all the new rings, bnxt_open_nic() will free
all the memory including bp->bnapi. Subsequent call to bnxt_close_nic()
will try to dereference bp->bnapi in bnxt_free_irq().

Fix it by checking for !bp->bnapi in bnxt_free_irq().

Fixes: e5811b8c09df ("bnxt_en: Add IRQ remapping logic.")
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit cb98526bf9b985866d648dbb9c983ba9eb59daba)
Signed-off-by: Brian Maly <brian.maly@oracle.com>

bnxt_en: Check valid VNIC ID in bnxt_hwrm_vnic_set_tpa().

Orabug: 29357977

During initialization, if we encounter errors, there is a code path that
calls bnxt_hwrm_vnic_set_tpa() with invalid VNIC ID. This may cause a
warning in firmware logs.

Fixes: c0c050c58d84 ("bnxt_en: New Broadcom ethernet driver.")
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 3c4fe80b32c685bdc02b280814d0cfe80d441c72)
Signed-off-by: Brian Maly <brian.maly@oracle.com>

bnxt_en: Do not modify max IRQ count after RDMA driver requests/frees IRQs.

Orabug: 29357977

Calling bnxt_set_max_func_irqs() to modify the max IRQ count requested or
freed by the RDMA driver is flawed.  The max IRQ count is checked when
re-initializing the IRQ vectors and this can happen multiple times
during ifup or ethtool -L.  If the max IRQ is reduced and the RDMA
driver is operational, we may not initailize IRQs correctly.  This
problem shows up on VFs with very small number of MSIX.

There is no other logic that relies on the IRQ count excluding the ones
used by RDMA.  So we fix it by just removing the call to subtract or
add the IRQs used by RDMA.

Fixes: a588e4580a7e ("bnxt_en: Add interface to support RDMA driver.")
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 30f529473ec962102e8bcd33a6a04f1e1b490ae2)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/net/ethernet/broadcom/bnxt/bnxt.h
drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.c

mm: cleancache: fix corruption on missed inode invalidation

commit 6ff38bd40230af35e446239396e5fc8ebd6a5248 upstream.

If all pages are deleted from the mapping by memory reclaim and also
moved to the cleancache:

__delete_from_page_cache
  (no shadow case)
  unaccount_page_cache_page
    cleancache_put_page
  page_cache_delete
    mapping->nrpages -= nr
    (nrpages becomes 0)

We don't clean the cleancache for an inode after final file truncation
(removal).

truncate_inode_pages_final
  check (nrpages || nrexceptional) is false
    no truncate_inode_pages
      no cleancache_invalidate_inode(mapping)

These way when reading the new file created with same inode we may get
these trash leftover pages from cleancache and see wrong data instead of
the contents of the new file.

Fix it by always doing truncate_inode_pages which is already ready for
nrpages == 0 && nrexceptional == 0 case and just invalidates inode.

[akpm@linux-foundation.org: add comment, per Jan]
Link: http://lkml.kernel.org/r/20181112095734.17979-1-ptikhomirov@virtuozzo.com
Fixes: commit 91b0abe36a7b ("mm + fs: store shadow entries in page cache")
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Reviewed-by: Vasily Averin <vvs@virtuozzo.com>
Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Orabug: 29364670
CVE: CVE-2018-16862

Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

l2tp: fix reading optional fields of L2TPv3

Use pskb_may_pull() to make sure the optional fields are in skb linear
parts, so we can safely read them later.

It's easy to reproduce the issue with a net driver that supports paged
skb data. Just create a L2TPv3 over IP tunnel and then generates some
network traffic.
Once reproduced, rx err in /sys/kernel/debug/l2tp/tunnels will increase.

Changes in v4:
1. s/l2tp_v3_pull_opt/l2tp_v3_ensure_opt_in_linear/
2. s/tunnel->version != L2TP_HDR_VER_2/tunnel->version == L2TP_HDR_VER_3/
3. Add 'Fixes' in commit messages.

Changes in v3:
1. To keep consistency, move the code out of l2tp_recv_common.
2. Use "net" instead of "net-next", since this is a bug fix.

Changes in v2:
1. Only fix L2TPv3 to make code simple.
   To fix both L2TPv3 and L2TPv2, we'd better refactor l2tp_recv_common.
   It's complicated to do so.
2. Reloading pointers after pskb_may_pull

Fixes: f7faffa3ff8e ("l2tp: Add L2TPv3 protocol support")
Fixes: 0d76751fad77 ("l2tp: Add L2TPv3 IP encapsulation (no UDP) support")
Fixes: a32e0eec7042 ("l2tp: introduce L2TPv3 IP encapsulation support for IPv6")
Signed-off-by: Jacob Wen <jian.w.wen@oracle.com>
Acked-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 4522a70db7aa5e77526a4079628578599821b193)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
net/l2tp/l2tp_core.c
net/l2tp/l2tp_ip.c
net/l2tp/l2tp_ip6.c
Commit 2b139e6b1ec8 ("l2tp: remove ->recv_payload_hook") is not in UEK5.

l2tp_core.h:
    s/l2tp_get_l2specific_len(session)/session->l2specific_len/ due to
    62e7b6a57c7b ("l2tp: remove l2specific_len dependency in l2tp_core")
    is not in UEK4.

Orabug: 29368048

Signed-off-by: Jacob Wen <jian.w.wen@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>