Let kaiser_flush_tlb_on_return_to_user() do the X86_FEATURE_PCID
check, instead of each caller doing it inline first: nobody needs
to optimize for the noPCID case; it's clearer this way, and better
suits later changes. Replace those no-op X86_CR3_PCID_KERN_FLUSH lines
by a BUILD_BUG_ON() in load_new_mm_cr3(), in case something changes.
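Roughly, where that leaves things (a sketch only; names as used
elsewhere in this series):
	void kaiser_flush_tlb_on_return_to_user(void)
	{
		/* noPCID: nothing to note, SWITCH_USER_CR3 flushes anyway */
		if (this_cpu_has(X86_FEATURE_PCID))
			this_cpu_write(x86_cr3_pcid_user,
				X86_CR3_PCID_USER_FLUSH | KAISER_SHADOW_PGD_OFFSET);
	}
	/* and in load_new_mm_cr3(), replacing the no-op OR lines: */
	BUILD_BUG_ON(X86_CR3_PCID_KERN_FLUSH != 0);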
Hugh Dickins [Sun, 5 Nov 2017 01:23:24 +0000 (18:23 -0700)]
kaiser: asm/tlbflush.h handle noPGE at lower level
I found asm/tlbflush.h too twisty, and think it safer not to avoid
__native_flush_tlb_global_irq_disabled() in the kaiser_enabled case,
but instead let it handle kaiser_enabled along with cr3: it can just
use __native_flush_tlb() for that, no harm in re-disabling preemption.
(This is not the same change as Kirill and Dave have suggested for
upstream, flipping PGE in cr4: that's neat, but needs a cpu_has_pge
check; cr3 is enough for kaiser, and thought to be cheaper than cr4.)
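A sketch of the lower-level handling (details assumed from the
4.4-era asm/tlbflush.h):
	static inline void __native_flush_tlb_global_irq_disabled(void)
	{
		unsigned long cr4 = native_read_cr4();
		if (cr4 & X86_CR4_PGE) {
			/* clear PGE: flushes everything, globals included */
			native_write_cr4(cr4 & ~X86_CR4_PGE);
			/* restore PGE as it was before */
			native_write_cr4(cr4);
		} else {
			/* noPGE, as when kaiser_enabled: cr3 is enough, and
			 * __native_flush_tlb() copes with preemption itself */
			__native_flush_tlb();
		}
	}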
Also delete the X86_FEATURE_INVPCID invpcid_flush_all_nonglobals()
preference from __native_flush_tlb(): unlike the invpcid_flush_all()
preference in __native_flush_tlb_global(), it's not seen in upstream
4.14, and was recently reported to be surprisingly slow.
Hugh Dickins [Sun, 29 Oct 2017 18:36:19 +0000 (11:36 -0700)]
kaiser: drop is_atomic arg to kaiser_pagetable_walk()
I have not observed a might_sleep() warning from setup_fixmap_gdt()'s
use of kaiser_add_mapping() in our tree (why not?), but like upstream
we have not provided a way for that to pass is_atomic true down to
kaiser_pagetable_walk(), and at startup it's far from a likely source
of trouble: so just delete the walk's is_atomic arg and might_sleep().
Hugh Dickins [Wed, 4 Oct 2017 03:49:04 +0000 (20:49 -0700)]
kaiser: use ALTERNATIVE instead of x86_cr3_pcid_noflush
Now that we're playing the ALTERNATIVE game, use that more efficient
method: instead of user-mapping an extra page, and reading an extra
cacheline each time for x86_cr3_pcid_noflush.
Neel has found that __stringify(bts $X86_CR3_PCID_NOFLUSH_BIT, %rax)
is a working substitute for the "bts $63, %rax" in these ALTERNATIVEs;
but the one line with $63 in it looks clearer, so let's stick with that.
Worried about what happens with an ALTERNATIVE between the jump and
jump label in another ALTERNATIVE? I was, but have checked the
combinations in SWITCH_KERNEL_CR3_NO_STACK at entry_SYSCALL_64,
and it does a good job.
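For illustration, the macro shape this ends up with (a sketch; mask
names as used elsewhere in this series):
	.macro _SWITCH_TO_KERNEL_CR3 reg
	movq	%cr3, \reg
	andq	$(~(X86_CR3_PCID_ASID_MASK | KAISER_SHADOW_PGD_OFFSET)), \reg
	/* if PCID enabled, set the NOFLUSH bit 63: no user-mapped page,
	 * no extra cacheline read for x86_cr3_pcid_noflush */
	ALTERNATIVE "", "bts $63, \reg", X86_FEATURE_PCID
	movq	\reg, %cr3
	.endm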
Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 2dff99eb0335f9e0817410696a180dba25ca7371)
Orabug: 27333760
CVE: CVE-2017-5754 Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Conflicts:
arch/x86/entry/entry_64.S (not in this tree)
arch/x86/kernel/entry_64.S (patched instead of that)
kaiser: add "nokaiser" boot option, using ALTERNATIVE
Added "nokaiser" boot option: an early param like "noinvpcid".
Most places now check int kaiser_enabled (#defined 0 when not
CONFIG_KAISER) instead of #ifdef CONFIG_KAISER; but entry_64.S
and entry_64_compat.S are using the ALTERNATIVE technique, which
patches in the preferred instructions at runtime. That technique
is tied to x86 cpu features, so X86_FEATURE_KAISER is fabricated.
Prior to "nokaiser", Kaiser #defined _PAGE_GLOBAL 0: revert that,
but be careful with both _PAGE_GLOBAL and CR4.PGE: setting them when
nokaiser like when !CONFIG_KAISER, but not setting either when kaiser -
neither matters on its own, but it's hard to be sure that _PAGE_GLOBAL
won't get set in some obscure corner, or something add PGE into CR4.
By omitting _PAGE_GLOBAL from __supported_pte_mask when kaiser_enabled,
all page table setup which uses pte_pfn() masks it out of the ptes.
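A sketch of that arrangement (call site assumed to be the boot-time
feature probing):
	if (cpu_has_pge && !kaiser_enabled) {
		cr4_set_bits_and_update_boot(X86_CR4_PGE);
		__supported_pte_mask |= _PAGE_GLOBAL;
	} else {
		/* every pte built via pte_pfn() is masked against
		 * __supported_pte_mask: _PAGE_GLOBAL cannot leak in */
		__supported_pte_mask &= ~_PAGE_GLOBAL;
	}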
It's slightly shameful that the same declaration versus definition of
kaiser_enabled appears in not one, not two, but in three header files
(asm/kaiser.h, asm/pgtable.h, asm/tlbflush.h). I felt safer that way,
than with #including any of those in any of the others; and did not
feel it worth an asm/kaiser_enabled.h - kernel/cpu/common.c includes
them all, so we shall hear about it if they get out of synch.
Cleanups while in the area: removed the silly #ifdef CONFIG_KAISER
from kaiser.c; removed the unused native_get_normal_pgd(); removed
the spurious reg clutter from SWITCH_*_CR3 macro stubs; corrected some
comments. But more interestingly, set CR4.PSE in secondary_startup_64:
the manual is clear that it does not matter whether it's 0 or 1 when
4-level page tables are enabled, but I was distracted to find cr4 different on
BSP and auxiliaries - BSP alone was adding PSE, in probe_page_size_mask().
Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit e345dcc9481543edf4a0a5df4c4c2f9597b0a997)
Orabug: 27333760
CVE: CVE-2017-5754 Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Conflicts:
arch/x86/entry/entry_64.S (not in this tree)
arch/x86/kernel/entry_64.S (patched instead of that)
Hugh Dickins [Tue, 5 Dec 2017 04:13:35 +0000 (20:13 -0800)]
kaiser: fix unlikely error in alloc_ldt_struct()
An error from kaiser_add_mapping() here is not at all likely, but
Eric Biggers rightly points out that __free_ldt_struct() relies on
new_ldt->size being initialized: move that up.
Hugh Dickins [Fri, 13 Oct 2017 19:10:00 +0000 (12:10 -0700)]
kaiser: _pgd_alloc() without __GFP_REPEAT to avoid stalls
Synthetic filesystem mempressure testing has shown softlockups, with
hour-long page allocation stalls, and pgd_alloc() trying for order:1
with __GFP_REPEAT in one of the backtraces each time.
That's _pgd_alloc() going for a Kaiser double-pgd, using the __GFP_REPEAT
common to all page table allocations, but actually having no effect on
order:0 (see should_alloc_oom() and should_continue_reclaim() in this
tree, but beware that ports to another tree might behave differently).
Order:1 stack allocation has been working satisfactorily without
__GFP_REPEAT forever, and page table allocation only asks __GFP_REPEAT
for awkward occasions in a long-running process: it's not appropriate
at fork or exec time, and seems to be doing much more harm than good:
getting those contiguous pages under very heavy mempressure can be
hard (though even without it, Kaiser does generate more mempressure).
Mask out that __GFP_REPEAT inside _pgd_alloc(). Why not take it out
of the PGALLOC_GFP altogether, as v4.7 commit a3a9a59d2067 ("x86: get
rid of superfluous __GFP_REPEAT") did? Because I think that might
make a difference to our page table memcg charging, which I'd prefer
not to interfere with at this time.
hughd adds: __alloc_pages_slowpath() in the 4.4.89-stable tree handles
__GFP_REPEAT a little differently than in prod kernel or 3.18.72-stable,
so it may not always be exactly a no-op on order:0 pages, as said above;
but I think still appropriate to omit it from Kaiser or non-Kaiser pgd.
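The masking itself is one line (a sketch; PGD_ALLOCATION_ORDER
assumed to be 1 when Kaiser doubles the pgd, 0 otherwise):
	static pgd_t *_pgd_alloc(void)
	{
		/* __GFP_REPEAT only bites for order > 0, where it can
		 * stall under mempressure: mask it out of this request
		 * rather than removing it from PGALLOC_GFP */
		return (pgd_t *)__get_free_pages(PGALLOC_GFP & ~__GFP_REPEAT,
						 PGD_ALLOCATION_ORDER);
	}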
kaiser: paranoid_entry pass cr3 need to paranoid_exit
Neel Natu points out that paranoid_entry() was wrong to assume that
an entry that did not need swapgs would not need SWITCH_KERNEL_CR3:
paranoid_entry (used for debug breakpoint, int3, double fault or MCE;
though I think it's only the MCE case that is cause for concern here)
can break in at an awkward time, between cr3 switch and swapgs, but
its handling always needs kernel gs and kernel cr3.
Easy to fix in itself, but paranoid_entry() also needs to convey to
paranoid_exit() (and my reading of macro idtentry says paranoid_entry
and paranoid_exit are always paired) how to restore the prior state.
The swapgs state is already conveyed by %ebx (0 or 1), so extend that
also to convey when SWITCH_USER_CR3 will be needed (2 or 3).
(Yes, I'd much prefer that 0 meant no swapgs, whereas it's the other
way round: and a convention shared with error_entry() and error_exit(),
which I don't want to touch. Perhaps I should have inverted the bit
for switch cr3 too, but did not.)
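Spelled out, the %ebx convention between paranoid_entry() and
paranoid_exit() (bit assignments inferred from the 0/1 and 2/3 above):
	/*
	 * ebx bit 0 set: kernel gs was already active - skip SWAPGS on exit
	 * ebx bit 1 set: kernel cr3 was installed by paranoid_entry - do
	 *                SWITCH_USER_CR3 before returning
	 */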
paranoid_exit() would be straightforward, except for TRACE_IRQS: it
did TRACE_IRQS_IRETQ when doing swapgs, but TRACE_IRQS_IRETQ_DEBUG
when not: which is it supposed to use when SWITCH_USER_CR3 is split
apart from that? As best as I can determine, commit 5963e317b1e9
("ftrace/x86: Do not change stacks in DEBUG when calling lockdep")
missed the swapgs case, and should have used TRACE_IRQS_IRETQ_DEBUG
there too (the discrepancy has nothing to do with the liberal use
of _NO_STACK and _UNSAFE_STACK hereabouts: TRACE_IRQS_OFF_DEBUG has
just been used in all cases); discrepancy lovingly preserved across
several paranoid_exit() cleanups, but I'm now removing it.
Neel further indicates that to use SWITCH_USER_CR3_NO_STACK there in
paranoid_exit() is now not only unnecessary but unsafe: might corrupt
syscall entry's unsafe_stack_register_backup of %rax. Just use
SWITCH_USER_CR3: and delete SWITCH_USER_CR3_NO_STACK altogether,
before we make the mistake of using it again.
hughd adds: this commit fixes an issue in the Kaiser-without-PCIDs
part of the series, and ought to be moved earlier, if you decided
to make a release of Kaiser-without-PCIDs.
Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit fc8334e6b3e5d28afd4eec8a74493933f73b2784)
Orabug: 27333760
CVE: CVE-2017-5754 Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Conflicts:
arch/x86/entry/entry_64.S (not in this tree)
arch/x86/kernel/entry_64.S (patched instead of that)
Hugh Dickins [Sun, 27 Aug 2017 23:24:27 +0000 (16:24 -0700)]
kaiser: x86_cr3_pcid_noflush and x86_cr3_pcid_user
Mostly this commit is just unshouting X86_CR3_PCID_KERN_VAR and
X86_CR3_PCID_USER_VAR: we usually name variables in lower-case.
But why does x86_cr3_pcid_noflush need to be __aligned(PAGE_SIZE)?
Ah, it's a leftover from when kaiser_add_user_map() once complained
about mapping the same page twice. Make it __read_mostly instead.
(I'm a little uneasy about all the unrelated data which shares its
page getting user-mapped too, but that was so before, and not a big
deal: though we call it user-mapped, it's not mapped with _PAGE_USER.)
And there is a little change around the two calls to do_nmi().
Previously they set the NOFLUSH bit (if PCID supported) when
forcing to kernel context before do_nmi(); now they also have the
NOFLUSH bit set (if PCID supported) when restoring context after:
nothing done in do_nmi() should require a TLB to be flushed here.
Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 20268a10ffecd9fcc04880b21fc99a9192394599)
Orabug: 27333760
CVE: CVE-2017-5754 Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Conflicts:
arch/x86/entry/entry_64.S (not in this tree)
arch/x86/kernel/entry_64.S (patched instead of that)
Why was 4 chosen for kernel PCID and 6 for user PCID?
No good reason in a backport where PCIDs are only used for Kaiser.
If we continue with those, then we shall need to add Andy Lutomirski's
4.13 commit 6c690ee1039b ("x86/mm: Split read_cr3() into read_cr3_pa()
and __read_cr3()"), which deals with the problem of read_cr3() callers
finding stray bits in the cr3 that they expected to be page-aligned;
and for hibernation, his 4.14 commit f34902c5c6c0 ("x86/hibernate/64:
Mask off CR3's PCID bits in the saved CR3").
But if 0 is used for kernel PCID, then there's no need to add in those
commits - whenever the kernel looks, it sees 0 in the lower bits; and
0 for kernel seems an obvious choice.
And I naughtily propose 128 for user PCID. Because there's a place
in _SWITCH_TO_USER_CR3 where it takes note of the need for TLB FLUSH,
but needs to reset that to NOFLUSH for the next occasion. Currently
it does so with a "movb $(0x80)" into the high byte of the per-cpu
quadword, but that will cause a machine without PCID support to crash.
Now, if %al just happened to have 0x80 in it at that point, on a
machine with PCID support, but 0 on a machine without PCID support...
(That will go badly wrong once the pgd can be at a physical address
above 2^56, but even with 5-level paging, physical goes up to 2^52.)
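A sketch of the resulting reset (macro shape assumed; regb is the
low-byte alias of reg):
	.macro _SWITCH_TO_USER_CR3 reg regb
	movq	%cr3, \reg
	orq	PER_CPU_VAR(x86_cr3_pcid_user), \reg
	js	9f
	/* FLUSH this time, reset to NOFLUSH for next time: \regb holds
	 * 0x80 (user PCID 128) with PCID support, but 0x00 without, so
	 * the one register correctly updates both memory and cr3 */
	movb	\regb, PER_CPU_VAR(x86_cr3_pcid_user+7)
9:
	movq	\reg, %cr3
	.endm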
Hugh Dickins [Thu, 17 Aug 2017 22:00:37 +0000 (15:00 -0700)]
kaiser: load_new_mm_cr3() let SWITCH_USER_CR3 flush user
We have many machines (Westmere, Sandybridge, Ivybridge) supporting
PCID but not INVPCID: on these load_new_mm_cr3() simply crashed.
Flushing user context inside load_new_mm_cr3() without the use of
invpcid is difficult: momentarily switch from kernel to user context
and back to do so? I'm not sure whether that can be safely done at
all, and would risk polluting user context with kernel internals,
and kernel context with stale user externals.
Instead, follow the hint in the comment that was there: change
X86_CR3_PCID_USER_VAR to be a per-cpu variable, then load_new_mm_cr3()
can leave a note in it, for SWITCH_USER_CR3 on return to userspace to
flush user context TLB, instead of default X86_CR3_PCID_USER_NOFLUSH.
Which works well enough that there's no need to do it this way only
when invpcid is unsupported: it's a good alternative to invpcid here.
But there's a couple of inlines in asm/tlbflush.h that need to do the
same trick, so it's best to localize all this per-cpu business in
mm/kaiser.c: moving that part of the initialization from setup_pcid()
to kaiser_setup_pcid(); with kaiser_flush_tlb_on_return_to_user() the
function for noting an X86_CR3_PCID_USER_FLUSH. And let's keep a
KAISER_SHADOW_PGD_OFFSET in there, to avoid the extra OR on exit.
I did try to make the feature tests in asm/tlbflush.h more consistent
with each other: there seem to be far too many ways of performing such
tests, and I don't have a good grasp of their differences. At first
I converted them all to be static_cpu_has(): but that proved to be a
mistake, as the comment in __native_flush_tlb_single() hints; so then
I reversed and made them all this_cpu_has(). Probably all gratuitous
change, but that's the way it's working at present.
I am slightly bothered by the way non-per-cpu X86_CR3_PCID_KERN_VAR
gets re-initialized by each cpu (before and after these changes):
no problem when (as usual) all cpus on a machine have the same
features, but in principle incorrect. However, my experiment
to per-cpu-ify that one did not end well...
The kaiser update made an interesting choice, never to free any shadow
page tables. Contention on global spinlock was worrying, particularly
with it held across page table scans when freeing. Something had to be
done: I was going to add refcounting; but simply never to free them is
an appealing choice, minimizing contention without complicating the code
(the more a page table is found already, the less the spinlock is used).
But leaking pages in this way is also a worry: can we get away with it?
At the very least, we need a count to show how bad it actually gets:
in principle, one might end up wasting about 1/256 of memory that way
(1/512 for when direct-mapped pages have to be user-mapped, plus 1/512
for when they are user-mapped from the vmalloc area on another occasion
(but we don't have vmalloc'ed stacks, so only large ldts are vmalloc'ed)).
Add per-cpu stat NR_KAISERTABLE: including 256 at startup for the
shared pgd entries, and 1 for each intermediate page table added
thereafter for user-mapping - but leave out the 1 per mm, for its
shadow pgd, because that distracts from the monotonic increase.
Shown in /proc/vmstat as nr_overhead (0 if kaiser not enabled).
In practice, it doesn't look so bad so far: more like 1/12000 after
nine hours of gtests below; and movable pageblock segregation should
tend to cluster the kaiser tables into a subset of the address space
(if not, they will be bad for compaction too). But production may
tell a different story: keep an eye on this number, and bring back
lighter freeing if it gets out of control (maybe a shrinker).
We fail to see what CONFIG_KAISER_REAL_SWITCH is for: it seems to be
left over from early development, and now just obscures tricky parts
of the code. Delete it before adding PCIDs, or nokaiser boot option.
(Or if there is some good reason to keep the option, then it needs
a help text - and a "depends on KAISER", so that all those without
KAISER are not asked the question.)
Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit b9d2ccc54e17b5aa50dd0c036d3f4fb4e5248d54)
Orabug: 27333760
CVE: CVE-2017-5754 Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Conflicts:
arch/x86/entry/entry_64.S (not in this tree)
arch/x86/kernel/entry_64.S (patched instead of that)
Hugh Dickins [Tue, 22 Aug 2017 03:11:43 +0000 (20:11 -0700)]
kaiser: cleanups while trying for gold link
While trying to get our gold link to work, four cleanups:
matched the gdt_page declaration to its definition;
in fiddling unsuccessfully with PERCPU_INPUT(), lined up backslashes;
lined up the backslashes according to convention in percpu-defs.h;
deleted the unused irq_stack_pointer addition to irq_stack_union.
Sad to report that aligning backslashes does not appear to help gold
align to 8192: but while these did not help, they are worth keeping.
Hugh Dickins [Mon, 2 Oct 2017 17:57:24 +0000 (10:57 -0700)]
kaiser: kaiser_remove_mapping() move along the pgd
When removing the bogus comment from kaiser_remove_mapping(),
I really ought to have checked the extent of its bogosity: as
Neel points out, there is nothing to stop unmap_pud_range_nofree()
from continuing beyond the end of a pud (and starting in the wrong
position on the next).
Fix kaiser_remove_mapping() to constrain the extent and advance pgd
pointer correctly: use pgd_addr_end() macro as used throughout base
mm (but don't assume page-rounded start and size in this case).
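The constrained walk, roughly (unmap_pud_range_nofree() as declared
in kaiser.c):
	pgd = native_get_shadow_pgd(pgd_offset_k(start));
	for (addr = start; addr < end; pgd++, addr = next) {
		/* clamp to this pgd entry's extent, then advance */
		next = pgd_addr_end(addr, end);
		unmap_pud_range_nofree(pgd, addr, next);
	}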
But this bug was very unlikely to trigger in this backport: since
any buddy allocation is contained within a single pud extent, and
we are not using vmapped stacks (and are only mapping one page of
stack anyway): the only way to hit this bug here would be when
freeing a large modified ldt.
kaiser: tidied up kaiser_add/remove_mapping slightly
Yes, unmap_pud_range_nofree()'s declaration ought to be in a
header file really, but I'm not sure we want to use it anyway:
so for now just declare it inside kaiser_remove_mapping().
And there doesn't seem to be such a thing as unmap_p4d_range(),
even in a 5-level paging tree.
Hugh Dickins [Wed, 23 Aug 2017 21:21:14 +0000 (14:21 -0700)]
kaiser: fix perf crashes
Avoid perf crashes: place debug_store in the user-mapped per-cpu area
instead of allocating, and use page allocator plus kaiser_add_mapping()
to keep the BTS and PEBS buffers user-mapped (that is, present in the
user mapping, though visible only to kernel and hardware). The PEBS
fixup buffer does not need this treatment.
The need for a user-mapped struct debug_store showed up before doing
any conscious perf testing: in a couple of kernel paging oopses on
Westmere, implicating the debug_store offset of the per-cpu area.
pjt has observed that nmi's second (nmi_from_kernel) call to do_nmi()
adjusted the %rdi regs arg, rightly when CONFIG_KAISER, but wrongly
when not CONFIG_KAISER.
Although the minimal change is to add an #ifdef CONFIG_KAISER around
the addq line, that looks cluttered, and I prefer how the first call
to do_nmi() handled it: prepare args in %rdi and %rsi before getting
into the CONFIG_KAISER block, since it does not touch them at all.
And while we're here, place the "#ifdef CONFIG_KAISER" that follows
each, to enclose the "Unconditionally restore CR3" comment: matching
how the "Unconditionally use kernel CR3" comment above is enclosed.
Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 487f0b73d82611a2dc48d7d78409e2e9d994006a)
Orabug: 27333760
CVE: CVE-2017-5754 Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Conflicts:
arch/x86/entry/entry_64.S (not in this tree)
arch/x86/kernel/entry_64.S (patched instead of that)
It is absurd that KAISER should depend on SMP, but apparently nobody
has tried a UP build before: which breaks on implicit declaration of
function 'per_cpu_offset' in arch/x86/mm/kaiser.c.
Now, you would expect that to be trivially fixed up; but looking at
the System.map when that block is #ifdef'ed out of kaiser_init(),
I see that in a UP build __per_cpu_user_mapped_end is precisely at
__per_cpu_user_mapped_start, and the items carefully gathered into
that section for user-mapping on SMP, dispersed elsewhere on UP.
So, some other kind of section assignment will be needed on UP,
but implementing that is not a priority: just make KAISER depend
on SMP for now.
Also inserted a blank line before the option, tidied up the
brief Kconfig help message, and added an "If unsure, Y".
Include linux/kaiser.h instead of asm/kaiser.h to build ldt.c without
CONFIG_KAISER. kaiser_add_mapping() already returns an error code,
so fix the FIXME.
kaiser: stack map PAGE_SIZE at THREAD_SIZE-PAGE_SIZE
Kaiser only needs to map one page of the stack; and
kernel/fork.c did not build on powerpc (no __PAGE_KERNEL).
It's all cleaner if linux/kaiser.h provides kaiser_map_thread_stack()
and kaiser_unmap_thread_stack() wrappers around asm/kaiser.h's
kaiser_add_mapping() and kaiser_remove_mapping(). And use
linux/kaiser.h in init/main.c to avoid the #ifdefs there.
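A sketch of those wrappers (mapping only the topmost page of the
stack, as described):
	static inline int kaiser_map_thread_stack(void *stack)
	{
		/* map that page of stack on which we enter from user */
		return kaiser_add_mapping((unsigned long)stack +
			THREAD_SIZE - PAGE_SIZE, PAGE_SIZE, __PAGE_KERNEL);
	}
	static inline void kaiser_unmap_thread_stack(void *stack)
	{
		/* may be called even if kaiser_map_thread_stack() failed */
		kaiser_remove_mapping((unsigned long)stack +
			THREAD_SIZE - PAGE_SIZE, PAGE_SIZE);
	}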
native_pgd_clear() uses native_set_pgd(), so native_set_pgd() must
avoid setting the _PAGE_NX bit on an otherwise pgd_none() entry:
usually that just generated a warning on exit, but sometimes caused
more mysterious and damaging failures (our production machines
could not complete booting).
The original fix to this just avoided adding _PAGE_NX to
an empty entry; but eventually more problems surfaced with kexec,
and the EFI mapping was expected to be a problem too. So now instead
change native_set_pgd() to update shadow only if _PAGE_USER:
A few places (kernel/machine_kexec_64.c, platform/efi/efi_64.c for sure)
use set_pgd() to set up a temporary internal virtual address space, with
physical pages remapped at what Kaiser regards as userspace addresses:
Kaiser then assumes a shadow pgd follows, which it will try to corrupt.
This appears to be responsible for the recent kexec and kdump failures;
though it's unclear how those did not manifest as a problem before.
Ah, the shadow pgd will only be assumed to "follow" if the requested
pgd is on an even-numbered page: so I suppose it was going wrong 50%
of the time all along.
What we need is a flag to set_pgd(), to tell it we're dealing with
userspace. Er, isn't that what the pgd's _PAGE_USER bit is saying?
Add a test for that. But we cannot do the same for pgd_clear()
(which may be called to clear corrupted entries - set aside the
question of "corrupt in which pgd?" until later), so there just
rely on pgd_clear() not being called in the problematic cases -
with a WARN_ON_ONCE() which should fire half the time if it is.
But this is getting too big for an inline function: move it into
arch/x86/mm/kaiser.c (which then demands a boot/compressed mod);
and de-void and de-space native_get_shadow/normal_pgd() while here.
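A simplified sketch of the test (helper names assumed; the real
function also handles the pgd_none() cases discussed above):
	pgd_t kaiser_set_shadow_pgd(pgd_t *pgdp, pgd_t pgd)
	{
		if (pgd_val(pgd) & _PAGE_USER) {
			if (pgdp_maps_userspace(pgdp)) {
				/* mirror user entries into the shadow ... */
				native_get_shadow_pgd(pgdp)->pgd = pgd.pgd;
				/* ... and never let kernel cr3 run user code */
				if (__supported_pte_mask & _PAGE_NX)
					pgd.pgd |= _PAGE_NX;
			}
		}
		/* kexec/efi entries lack _PAGE_USER: shadow is untouched */
		return pgd;
	}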
Richard Fellner [Thu, 4 May 2017 12:26:50 +0000 (14:26 +0200)]
KAISER: Kernel Address Isolation
This patch introduces our implementation of KAISER (Kernel Address Isolation to
have Side-channels Efficiently Removed), a kernel isolation technique to close
hardware side channels on kernel address information.
To: <linux-kernel@vger.kernel.org>
To: <kernel-hardening@lists.openwall.com> Cc: <clementine.maurice@iaik.tugraz.at> Cc: <moritz.lipp@iaik.tugraz.at> Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at> Cc: Richard Fellner <richard.fellner@student.tugraz.at> Cc: Ingo Molnar <mingo@kernel.org> Cc: <kirill.shutemov@linux.intel.com> Cc: <anders.fogh@gdata-adan.de>
After several recent works [1,2,3] KASLR on x86_64 was basically
considered dead by many researchers. We have been working on an
efficient but effective fix for this problem and found that not mapping
the kernel space when running in user mode is the solution to this
problem [4] (the corresponding paper [5] will be presented at ESSoS17).
With this RFC patch we allow anybody to configure their kernel with the
flag CONFIG_KAISER to add our defense mechanism.
If there are any questions we would love to answer them.
We also appreciate any comments!
Cheers,
Daniel (+ the KAISER team from Graz University of Technology)
[patch based also on
https://raw.githubusercontent.com/IAIK/KAISER/master/KAISER/0001-KAISER-Kernel-Address-Isolation.patch]
Signed-off-by: Richard Fellner <richard.fellner@student.tugraz.at> Signed-off-by: Moritz Lipp <moritz.lipp@iaik.tugraz.at> Signed-off-by: Daniel Gruss <daniel.gruss@iaik.tugraz.at> Signed-off-by: Michael Schwarz <michael.schwarz@iaik.tugraz.at> Acked-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 8a43ddfb93a0c6ae1a6e1f5c25705ec5d1843c40)
Orabug: 27333760
CVE: CVE-2017-5754 Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Conflicts:
arch/x86/entry/entry_64.S (not in this tree)
arch/x86/kernel/entry_64.S (patched instead of that)
arch/x86/entry/entry_64_compat.S (not in this tree)
arch/x86/ia32/ia32entry.S (patched instead of that)
arch/x86/include/asm/hw_irq.h
arch/x86/kernel/irqinit.c
kernel/fork.c
Add a cmdline_find_option() function to look for cmdline options that
take arguments. The argument is returned in a supplied buffer and the
argument length (regardless of whether it fits in the supplied buffer)
is returned, with -1 indicating not found.
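A usage sketch (option name illustrative):
	char arg[32];
	int len = cmdline_find_option(boot_command_line, "console",
				      arg, sizeof(arg));
	if (len == -1)
		pr_debug("console= not on the command line\n");
	else if (len < sizeof(arg))
		pr_debug("console argument: %s\n", arg);  /* NUL-terminated */
	else
		pr_debug("argument truncated, full length %d\n", len);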
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Brijesh Singh <brijesh.singh@amd.com> Cc: Dave Young <dyoung@redhat.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Toshimitsu Kani <toshi.kani@hpe.com> Cc: kasan-dev@googlegroups.com Cc: kvm@vger.kernel.org Cc: linux-arch@vger.kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-efi@vger.kernel.org Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/36b5f97492a9745dce27682305f990fc20e5cf8a.1500319216.git.thomas.lendacky@amd.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 0fa147b407478e73fe7a478677ff2b12bb824014)
Orabug: 27333760
CVE: CVE-2017-5754 Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Trying to reboot via real mode fails with PCID on: long mode cannot
be exited while CR4.PCIDE is set. (No, I have no idea why, but the
SDM and actual CPUs are in agreement here.) The result is a GPF and
a hang instead of a reboot.
I didn't catch this in testing because neither my computer nor my VM
reboots this way. I can trigger it with reboot=bios, though.
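The fix, as described (call site being the real-mode reboot path):
	/* exiting long mode will fail if CR4.PCIDE is set */
	if (boot_cpu_has(X86_FEATURE_PCID))
		cr4_clear_bits(X86_CR4_PCIDE);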
Fixes: 660da7c9228f ("x86/mm: Enable CR4.PCIDE on supported systems") Reported-and-tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Borislav Petkov <bp@alien8.de> Link: https://lkml.kernel.org/r/f1e7d965998018450a7a70c2823873686a8b21c0.1507524746.git.luto@kernel.org Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 6c4db09c291a19da66512f99c4bcb378a862f9e6)
Orabug: 27333760
CVE: CVE-2017-5754 Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
32-bit kernels on new hardware will see PCID in CPUID, but PCID can
only be used in 64-bit mode. Rather than making all PCID code
conditional, just disable the feature on 32-bit builds.
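The change amounts to something like this in early CPU setup (exact
call site assumed):
	if (!IS_ENABLED(CONFIG_X86_64))
		setup_clear_cpu_cap(X86_FEATURE_PCID);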
Signed-off-by: Andy Lutomirski <luto@kernel.org> Reviewed-by: Nadav Amit <nadav.amit@gmail.com> Reviewed-by: Borislav Petkov <bp@suse.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/2e391769192a4d31b808410c383c6bf0734bc6ea.1498751203.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 78043e5b6fb2921d836b31f23e89e52925191153)
Orabug: 27333760
CVE: CVE-2017-5754 Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
The UP asm/tlbflush.h generates somewhat nicer code than the SMP version.
Aside from that, it's fallen quite a bit behind the SMP code:
- flush_tlb_mm_range() didn't flush individual pages if the range
was small.
- The lazy TLB code was much weaker. This usually wouldn't matter,
but, if a kernel thread flushed its lazy "active_mm" more than
once (due to reclaim or similar), it wouldn't be unlazied and
would instead pointlessly flush repeatedly.
- Tracepoints were missing.
Aside from that, simply having the UP code around was a maintenance
burden, since it means that any change to the TLB flush code had to
make sure not to break it.
Simplify everything by deleting the UP code.
Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bpetkov@suse.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit b2e24274d50e0ecdf560ebe06dbed0cc648ad3f9)
Orabug: 27333760
CVE: CVE-2017-5754 Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Conflicts:
arch/x86/Kconfig
flush_tlb_page() was very similar to flush_tlb_mm_range() except that
it had a couple of issues:
- It was missing an smp_mb() in the case where
current->active_mm != mm. (This is a longstanding bug reported by Nadav Amit)
- It was missing tracepoints and vm counter updates.
The only reason that I can see for keeping it as a separate
function is that it could avoid a few branches that
flush_tlb_mm_range() needs to decide to flush just one page. This
hardly seems worthwhile. If we decide we want to get rid of those
branches again, a better way would be to introduce an
__flush_tlb_mm_range() helper and make both flush_tlb_page() and
flush_tlb_mm_range() use it.
Signed-off-by: Andy Lutomirski <luto@kernel.org> Acked-by: Kees Cook <keescook@chromium.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Borislav Petkov <bpetkov@suse.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/3cc3847cf888d8907577569b8bac3f01992ef8f9.1495492063.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 3efba6062a410a2a65fc9d6f53dca63db2602e65)
Orabug: 27333760
CVE: CVE-2017-5754 Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
I'm about to rewrite the function almost completely, but first I
want to get a functional change out of the way. Currently, if
flush_tlb_mm_range() does not flush the local TLB at all, it will
never do individual page flushes on remote CPUs. This seems to be
an accident, and preserving it will be awkward. Let's change it
first so that any regressions in the rewrite will be easier to
bisect and so that the rewrite can attempt to change no visible
behavior at all.
The fix is simple: we can simply avoid short-circuiting the
calculation of base_pages_to_flush.
As a side effect, this also eliminates a potential corner case: if
tlb_single_page_flush_ceiling == TLB_FLUSH_ALL, flush_tlb_mm_range()
could have ended up flushing the entire address space one page at a
time.
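A sketch of the reordering (variable names from flush_tlb_mm_range();
the calculation now precedes any local-flush decision, so the remote
path always sees it):
	if ((end != TLB_FLUSH_ALL) && !(vmflag & VM_HUGETLB))
		base_pages_to_flush = (end - start) >> PAGE_SHIFT;
	if (base_pages_to_flush > tlb_single_page_flush_ceiling)
		base_pages_to_flush = TLB_FLUSH_ALL;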
Signed-off-by: Andy Lutomirski <luto@kernel.org> Acked-by: Dave Hansen <dave.hansen@intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/4b29b771d9975aad7154c314534fec235618175a.1492844372.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 9f4d1ba1d407e56dac833aa0b11c60f952939e1c)
Orabug: 27333760
CVE: CVE-2017-5754 Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
I was trying to figure out how flush_tlb_current_task() would
possibly work correctly if current->mm != current->active_mm, but I
realized I could spare myself the effort: it has no callers except
the unused flush_tlb() macro.
Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/e52d64c11690f85e9f1d69d7b48cc2269cd2e94b.1492844372.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 227d6f0e79f809e448d3157fbfd00eb54dcbb54e)
Orabug: 27333760
CVE: CVE-2017-5754 Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Since commit 52aec3308db8 ("x86/tlb: replace INVALIDATE_TLB_VECTOR by CALL_FUNCTION_VECTOR"),
the TLB remote shootdown is done through call function vector. That
commit didn't take care of irq_tlb_count, which a later commit:
fd0f5869724f ("x86: Distinguish TLB shootdown interrupts from other functions call interrupts")
... tried to fix.
The fix assumes every increase of irq_tlb_count has a corresponding
increase of irq_call_count. So the irq_call_count is always bigger than
irq_tlb_count and we could subtract irq_tlb_count from irq_call_count.
Unfortunately this is not true for the smp_call_function_single() case.
The IPI is only sent if the target CPU's call_single_queue is empty when
adding a csd into it in generic_exec_single. That means if two threads
are both adding flush tlb csds to the same CPU's call_single_queue, only
one IPI is sent. In other words, the irq_call_count is incremented by 1
but irq_tlb_count is incremented by 2. Over time, irq_tlb_count will be
bigger than irq_call_count and the subtraction will produce a very large
irq_call_count value due to overflow.
Considering that:
1) it's not worth sending more IPIs for the sake of accurate counting of
irq_call_count in generic_exec_single();
2) it's not easy to tell if the call function interrupt is for TLB
shootdown in __smp_call_function_single_interrupt().
Not excluding TLB shootdown from the call function count seems to be
the simplest fix, and this patch does just that.
This bug was found by LKP's cyclic performance regression tracking recently
with the vm-scalability test suite. I have bisected to commit:
This commit didn't do anything wrong but revealed the irq_call_count
problem. IIUC, the commit makes rwc->remap_one in rmap_walk_file
concurrent with multiple threads. When remap_one is try_to_unmap_one(),
then multiple threads could queue flush TLB to the same CPU but only
one IPI will be sent.
Since the commit was added in Linux v3.19, the counting problem only
shows up from v3.19 onwards.
Signed-off-by: Aaron Lu <aaron.lu@intel.com> Cc: Alex Shi <alex.shi@linaro.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tomoki Sekiyama <tomoki.sekiyama.qu@hitachi.com> Link: http://lkml.kernel.org/r/20160811074430.GA18163@aaronlu.sh.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 3b9d9ec0d8261bb9b12f858e66f0c84cd2a6a3bb)
Orabug: 27333760
CVE: CVE-2017-5754 Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
idle_task_exit() can be called with IRQs on (on x86, at least) and therefore
should use switch_mm(), not switch_mm_irqs_off().
This doesn't seem to cause any problems right now, but it will
confuse my upcoming TLB flush changes. Nonetheless, I think it
should be backported because it's trivial. There won't be any
meaningful performance impact because idle_task_exit() is only
used when offlining a CPU.
Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Fixes: f98db6013c55 ("sched/core: Add switch_mm_irqs_off() and use it in the scheduler") Link: http://lkml.kernel.org/r/ca3d1a9fa93a0b49f5a8ff729eda3640fb6abdf9.1497034141.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 18a5348d49afcfc2b95da939143c9420edd78b9e)
Orabug: 27333760
CVE: CVE-2017-5754 Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
The introduction of switch_mm_irqs_off() brought back an old bug
regarding the use of preempt_enable_no_resched:
As part of:
62b94a08da1b ("sched/preempt: Take away preempt_enable_no_resched() from modules")
the definition of preempt_enable_no_resched() is only available in
built-in code, not in loadable modules, so we can't generally use
it from header files.
However, the ARM version of finish_arch_post_lock_switch()
calls preempt_enable_no_resched() and is defined as a static
inline function in asm/mmu_context.h. This in turn means we cannot
include asm/mmu_context.h from modules.
With today's tip tree, asm/mmu_context.h gets included from
linux/mmu_context.h, which is normally the exact pattern one would
expect, but unfortunately, linux/mmu_context.h can be included from
the vhost driver that is a loadable module, now causing this compile
time error with modular configs:
In file included from ../include/linux/mmu_context.h:4:0,
from ../drivers/vhost/vhost.c:18:
../arch/arm/include/asm/mmu_context.h: In function 'finish_arch_post_lock_switch':
../arch/arm/include/asm/mmu_context.h:88:3: error: implicit declaration of function 'preempt_enable_no_resched' [-Werror=implicit-function-declaration]
preempt_enable_no_resched();
Andy already tried to fix the bug by including linux/preempt.h
from asm/mmu_context.h, but that didn't help. Arnd suggested reordering
the header files, which wasn't popular, so let's use this
workaround instead:
The finish_arch_post_lock_switch() definition is now also hidden
inside of #ifdef MODULE, so we don't see anything referencing
preempt_enable_no_resched() from a header file. I've built a
few hundred randconfig kernels with this, and did not see any
new problems.
Tested-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Russell King <rmk+kernel@arm.linux.org.uk> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Borislav Petkov <bp@suse.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russell King - ARM Linux <linux@armlinux.org.uk> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: linux-arm-kernel@lists.infradead.org Fixes: f98db6013c55 ("sched/core: Add switch_mm_irqs_off() and use it in the scheduler") Link: http://lkml.kernel.org/r/1463146234-161304-1-git-send-email-arnd@arndb.de Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit c22d4b4d1c7fcc0d9eb4d8618d86c554c48ed9c0)
Orabug: 27333760
CVE: CVE-2017-5754 Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Potential races between switch_mm() and TLB-flush or LDT-flush IPIs
could be very messy. AFAICT the code is currently okay, whether by
accident or by careful design, but enabling PCID will make it
considerably more complicated and will no longer be obviously safe.
Fix it with a big hammer: run switch_mm() with IRQs off.
To avoid a performance hit in the scheduler, we take advantage of
our knowledge that the scheduler already has IRQs disabled when it
calls switch_mm().
Signed-off-by: Andy Lutomirski <luto@kernel.org> Reviewed-by: Borislav Petkov <bp@suse.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/f19baf759693c9dcae64bbff76189db77cb13398.1461688545.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 4ead44fd2525ed97e5362a806d312a0e3b0ea445)
Orabug: 27333760
CVE: CVE-2017-5754 Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Conflicts:
arch/x86/include/asm/mmu_context.h
Currently all of the functions that live in tlb.c are inlined on
!SMP builds. One can debate whether this is a good idea (in many
respects the code in tlb.c is better than the inlined UP code).
Regardless, I want to add code that needs to be built on UP and SMP
kernels and relates to tlb flushing, so arrange for tlb.c to be
compiled unconditionally.
On my Skylake laptop, INVPCID function 2 (flush absolutely
everything) takes about 376ns, whereas saving flags, twiddling
CR4.PGE to flush global mappings, and restoring flags takes about
539ns.
Signed-off-by: Andy Lutomirski <luto@kernel.org> Reviewed-by: Borislav Petkov <bp@suse.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luis R. Rodriguez <mcgrof@suse.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Toshi Kani <toshi.kani@hp.com> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/ed0ef62581c0ea9c99b9bf6df726015e96d44743.1454096309.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 85d3700c744a11ee2989252acf50ccbbd814167a)
Orabug: 27333760
CVE: CVE-2017-5754 Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
This adds a chicken bit to turn off INVPCID in case something goes
wrong. It's an early_param() because we do TLB flushes before we
parse __setup() parameters.
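The chicken bit, roughly (following the upstream shape):
	static int __init x86_noinvpcid_setup(char *s)
	{
		/* noinvpcid doesn't accept parameters */
		if (s)
			return -EINVAL;
		/* do not emit a message if the feature is not present */
		if (!boot_cpu_has(X86_FEATURE_INVPCID))
			return 0;
		setup_clear_cpu_cap(X86_FEATURE_INVPCID);
		pr_info("noinvpcid: INVPCID feature disabled\n");
		return 0;
	}
	early_param("noinvpcid", x86_noinvpcid_setup);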
Signed-off-by: Andy Lutomirski <luto@kernel.org> Reviewed-by: Borislav Petkov <bp@suse.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luis R. Rodriguez <mcgrof@suse.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Toshi Kani <toshi.kani@hp.com> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/f586317ed1bc2b87aee652267e515b90051af385.1454096309.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 791a0f3fecdabe18cc291e5f9b7ebbdc81895975)
Orabug: 27333760
CVE: CVE-2017-5754 Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
So we want to specify the dependency on both @pcid and @addr so that the
compiler doesn't reorder accesses to them *before* the TLB flush. But
for that to work, we need to express this properly in the inline asm and
deref the whole desc array, not the pointer to it. See clwb() for an
example.
This fixes the build error on 32-bit:
arch/x86/include/asm/tlbflush.h: In function __invpcid
arch/x86/include/asm/tlbflush.h:26:18: error: memory input 0 is not directly addressable
which gcc4.7 caught but 5.x didn't. Which is strange. :-\
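The fixed inline ends up along these lines (whole desc array as the
"m" input, clwb()-style):
	static inline void __invpcid(unsigned long pcid, unsigned long addr,
				     unsigned long type)
	{
		struct { u64 d[2]; } desc = { { pcid, addr } };
		/* hex opcode is invpcid (%ecx), %eax in 32-bit mode and
		 * invpcid (%rcx), %rax in long mode; "m" (desc) makes the
		 * dependency on @pcid and @addr explicit, and the memory
		 * clobber stops reordering past the TLB flush */
		asm volatile (".byte 0x66, 0x0f, 0x38, 0x82, 0x01"
			      : : "m" (desc), "a" (type), "c" (&desc) : "memory");
	}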
Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luis R. Rodriguez <mcgrof@suse.com> Cc: Michael Matz <matz@suse.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Toshi Kani <toshi.kani@hp.com> Cc: linux-mm@kvack.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 04ec428b15f161ce8449756fb64b6f380c8d95fd)
Orabug: 27333760
CVE: CVE-2017-5754 Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
There is no business in having ibrs_dump exposed to user-space,
let alone writable by it. Reading that entry ends up
doing an IPI across all CPUs to read an MSR, and that (on large
machines) is not something user-space should be able to do.
And also remove the pr_debug statements.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Konrad Rzeszutek Wilk [Fri, 5 Jan 2018 04:29:31 +0000 (20:29 -0800)]
kABI: Revert kABI: Make the boot_cpu_data look normal
.. which was the wrong way about it. The 'struct task_struct'
embeds boot_cpu_data in it, and the increase from 13 to 14
meant that any offsets in the 'struct task_struct' are now
off. Which is definitely a kABI breakage!
This fix puts the structure back to the original size,
and moves the 'ibpb' into the 'Linux custom' word and all is good.
Elena Reshetova [Thu, 4 Jan 2018 10:38:15 +0000 (02:38 -0800)]
userns: prevent speculative execution
From: Elena Reshetova <elena.reshetova@intel.com>
Since the pos value in function m_start()
seems to be controllable by userspace and later on
conditionally (upon bound check) used to resolve
map->extent, insert an observable speculation
barrier before its usage. This should prevent
observable speculation on that branch and avoid
kernel memory leak.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Orabug: 27340445
CVE: CVE-2017-5753
Signed-off-by: Chuck Anderson <chuck.anderson@oracle.com> Reviewed-by: John Haxby <john.haxby@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Elena Reshetova [Thu, 4 Jan 2018 10:35:57 +0000 (02:35 -0800)]
udf: prevent speculative execution
Since the eahd->appAttrLocation value in function
udf_add_extendedattr() seems to be controllable by
userspace and later on conditionally (upon bound check)
used in following memmove, insert an observable speculation
barrier before its usage. This should prevent
observable speculation on that branch and avoid
kernel memory leak.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Orabug: 27340445
CVE: CVE-2017-5753
Signed-off-by: Chuck Anderson <chuck.anderson@oracle.com> Reviewed-by: John Haxby <john.haxby@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Elena Reshetova [Thu, 4 Jan 2018 10:12:38 +0000 (02:12 -0800)]
net: mpls: prevent speculative execution
Since the index value in function mpls_route_input_rcu()
seems to be controllable by userspace and later on
conditionally (upon bound check) used to resolve
platform_label, insert an observable speculation
barrier before its usage. This should prevent
observable speculation on that branch and avoid
kernel memory leak.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Orabug: 27340445
CVE: CVE-2017-5753
Signed-off-by: Chuck Anderson <chuck.anderson@oracle.com> Reviewed-by: John Haxby <john.haxby@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Elena Reshetova [Thu, 4 Jan 2018 10:10:20 +0000 (02:10 -0800)]
fs: prevent speculative execution
Since the fd value in function __fcheck_files()
seems to be controllable by userspace and later on
conditionally (upon bound check) used to resolve
fdt->fd, insert an observable speculation
barrier before its usage. This should prevent
observable speculation on that branch and avoid
kernel memory leak.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Orabug: 27340445
CVE: CVE-2017-5753
Signed-off-by: Chuck Anderson <chuck.anderson@oracle.com> Reviewed-by: John Haxby <john.haxby@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Elena Reshetova [Thu, 4 Jan 2018 10:07:33 +0000 (02:07 -0800)]
ipv6: prevent speculative execution
Since the offset value in function raw6_getfrag()
seems to be controllable by userspace and later on
conditionally (upon bound check) used in the
following memcpy, insert an observable speculation
barrier before its usage. This should prevent
observable speculation on that branch and avoid
kernel memory leak.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Orabug: 27340445
CVE: CVE-2017-5753
Signed-off-by: Chuck Anderson <chuck.anderson@oracle.com> Reviewed-by: John Haxby <john.haxby@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Elena Reshetova [Thu, 4 Jan 2018 10:03:54 +0000 (02:03 -0800)]
ipv4: prevent speculative execution
Since the offset value in function raw_getfrag()
seems to be controllable by userspace and later on
conditionally (upon bound check) used in the following
memcpy, insert an observable speculation
barrier before its usage. This should prevent
observable speculation on that branch and avoid
kernel memory leak.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Orabug: 27340445
CVE: CVE-2017-5753
Signed-off-by: Chuck Anderson <chuck.anderson@oracle.com> Reviewed-by: John Haxby <john.haxby@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Elena Reshetova [Thu, 4 Jan 2018 09:53:58 +0000 (01:53 -0800)]
Thermal/int340x: prevent speculative execution
Since the trip value in function int340x_thermal_get_trip_temp()
seems to be controllable by userspace and later on
conditionally (upon bound check) used to resolve
d->aux_trips, insert an observable speculation
barrier before its usage. This should prevent
observable speculation on that branch and avoid
kernel memory leak.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Orabug: 27340445
CVE: CVE-2017-5753
Signed-off-by: Chuck Anderson <chuck.anderson@oracle.com>
Conflicts:
patch refers to arch/x86/include/asm/msr-index.h
code base has arch/x86/include/uapi/asm/msr-index.h
Reviewed-by: John Haxby <john.haxby@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Elena Reshetova [Thu, 4 Jan 2018 09:50:27 +0000 (01:50 -0800)]
cw1200: prevent speculative execution
Since the queue value in function cw1200_conf_tx()
seems to be controllable by userspace and later on
conditionally (upon bound check) used in
WSM_TX_QUEUE_SET, insert an observable speculation
barrier before its usage. This should prevent
observable speculation on that branch and avoid
kernel memory leak.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Orabug: 27340445
CVE: CVE-2017-5753
Signed-off-by: Chuck Anderson <chuck.anderson@oracle.com>
Conflicts:
patch refers to drivers/net/wireless/st/cw1200/sta.c
code base has drivers/net/wireless/cw1200/sta.c
Reviewed-by: John Haxby <john.haxby@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Elena Reshetova [Thu, 4 Jan 2018 09:42:47 +0000 (01:42 -0800)]
qla2xxx: prevent speculative execution
Since the handle value in functions qlafx00_status_entry()
and qlafx00_multistatus_entry() seems to be controllable
by userspace and later on conditionally (upon bound check)
used to resolve req->outstanding_cmds, insert an observable
speculation barrier before its usage. This should prevent
observable speculation on that branch and avoid kernel
memory leak.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Orabug: 27340445
CVE: CVE-2017-5753
Signed-off-by: Chuck Anderson <chuck.anderson@oracle.com> Reviewed-by: John Haxby <john.haxby@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Elena Reshetova [Thu, 4 Jan 2018 09:38:52 +0000 (01:38 -0800)]
p54: prevent speculative execution
Since the queue value in function p54_conf_tx()
seems to be controllable by userspace and later on
conditionally (upon bound check) used to resolve
priv->qos_params, insert an observable speculation
barrier before its usage. This should prevent
observable speculation on that branch and avoid
kernel memory leak.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Orabug: 27340445
CVE: CVE-2017-5753
Signed-off-by: Chuck Anderson <chuck.anderson@oracle.com>
Conflicts:
patch refers to drivers/net/wireless/intersil/p54/main.c
code base has drivers/net/wireless/p54/main.c
Reviewed-by: John Haxby <john.haxby@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Elena Reshetova [Thu, 4 Jan 2018 09:31:31 +0000 (01:31 -0800)]
carl9170: prevent speculative execution
Since the queue value in function carl9170_op_conf_tx()
seems to be controllable by userspace, and is later
conditionally (after a bounds check) used to resolve
ar9170_qmap and the following ar->edcf, insert an observable
speculation barrier before its usage. This should prevent
observable speculation on that branch and avoid
leaking kernel memory.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Orabug: 27340445
CVE: CVE-2017-5753
Signed-off-by: Chuck Anderson <chuck.anderson@oracle.com> Reviewed-by: John Haxby <john.haxby@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Elena Reshetova [Thu, 4 Jan 2018 09:25:32 +0000 (01:25 -0800)]
uvcvideo: prevent speculative execution
Since the index value in function uvc_ioctl_enum_input()
seems to be controllable by userspace, and is later
conditionally (after a bounds check) used to resolve
selector->baSourceID, insert an observable speculation
barrier before its usage. This should prevent
observable speculation on that branch and avoid
leaking kernel memory.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Orabug: 27340445
CVE: CVE-2017-5753
Signed-off-by: Chuck Anderson <chuck.anderson@oracle.com> Reviewed-by: John Haxby <john.haxby@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Elena Reshetova [Thu, 4 Jan 2018 08:05:42 +0000 (00:05 -0800)]
bpf: prevent speculative execution in eBPF interpreter
This adds an observable speculation barrier before LD_IMM_DW and
LDX_MEM_B/H/W/DW eBPF instructions during eBPF program
execution, in order to prevent speculative execution on
out-of-bounds BPF_MAP array indexes. This way arbitrary kernel
memory is not exposed through side channel attacks.
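[Illustration: a hedged sketch of where the barrier lands in the
interpreter loop of kernel/bpf/core.c; the real code generates the
LDX_MEM_B/H/W/DW cases from a macro, so this is indicative only.]
    LDX_MEM_W:
        /* Barrier before a load whose address derives from a
         * possibly out-of-bounds, attacker-steered index. */
        osb();
        DST = *(u32 *)(unsigned long)(SRC + insn->off);
        CONTINUE;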
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Orabug: 27340445
CVE: CVE-2017-5753
Signed-off-by: Chuck Anderson <chuck.anderson@oracle.com>
Conflicts:
kernel/bpf/core.c code base differences
Reviewed-by: John Haxby <john.haxby@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Elena Reshetova [Thu, 4 Jan 2018 07:56:24 +0000 (23:56 -0800)]
locking/barriers: introduce new observable speculation barrier
The new observable speculation barrier, osb(), ensures
that any user-observable speculation doesn't cross the boundary.
Any user-observable speculative activity on this CPU
thread before this point either completes, reaches a
state where it can no longer cause observable activity, or
is aborted before instructions after the barrier execute.
On x86, osb() resolves to an LFENCE if X86_FEATURE_LFENCE_RDTSC
is present. Other architectures can define their own variants.
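[Illustration: a plausible shape for the definitions, hedged; the
generic fallback and the exact feature plumbing may differ from what
this backport actually carries.]
    /* x86: patch in a serializing LFENCE when the CPU has one. */
    #define osb()   alternative("", "lfence", X86_FEATURE_LFENCE_RDTSC)

    /* Assumed include/asm-generic/barrier.h fallback: a plain
     * compiler barrier where no stronger primitive exists. */
    #ifndef osb
    #define osb()   barrier()
    #endif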
Suggested-by: Arjan van de Ven <arjan@linux.intel.com> Suggested-by: Alan Cox <alan.cox@intel.com> Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Orabug: 27340445
CVE: CVE-2017-5753
Signed-off-by: Chuck Anderson <chuck.anderson@oracle.com>
Conflicts:
include/asm-generic/barrier.h code base differences
Reviewed-by: John Haxby <john.haxby@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Elena Reshetova [Thu, 4 Jan 2018 07:43:33 +0000 (23:43 -0800)]
x86/cpu/AMD: Remove now unused definition of MFENCE_RDTSC feature
With the switch to using LFENCE_RDTSC on AMD platforms there is no longer
a need for the MFENCE_RDTSC feature. Remove its usage and definition.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Orabug: 27340445
CVE: CVE-2017-5753
Signed-off-by: Chuck Anderson <chuck.anderson@oracle.com>
Conflicts:
Patch refers to arch/x86/include/asm/cpufeatures.h
Code base has arch/x86/include/asm/cpufeature.h
Patch references X86_FEATURE_MFENCE_RDTSC in arch/x86/include/asm/msr.h
Code base references it in:
arch/x86/include/asm/barrier.h
arch/x86/um/asm/barrier.h
Reviewed-by: John Haxby <john.haxby@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Elena Reshetova [Thu, 4 Jan 2018 07:19:32 +0000 (23:19 -0800)]
x86/cpu/AMD: Make the LFENCE instruction serialized
In order to reduce the impact of using MFENCE, make the execution of the
LFENCE instruction serialized. This is done by setting bit 1 of MSR
0xc0011029 (DE_CFG).
Some families that support LFENCE do not have this MSR. For these
families, the LFENCE instruction is already serialized.
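[Illustration: a hedged C sketch of the MSR write; the macro names are
illustrative, while the MSR number and bit come from the text above.]
    #define MSR_AMD64_DE_CFG                0xc0011029
    #define DE_CFG_LFENCE_SERIALIZE_BIT     1

    static void amd_lfence_serialize(void)
    {
        u64 val;

        rdmsrl(MSR_AMD64_DE_CFG, val);
        val |= 1ULL << DE_CFG_LFENCE_SERIALIZE_BIT;
        wrmsrl(MSR_AMD64_DE_CFG, val);
    }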
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Orabug: 27340445
CVE: CVE-2017-5753
Signed-off-by: Chuck Anderson <chuck.anderson@oracle.com>
Conflicts:
patch refers to arch/x86/include/asm/msr-index.h
code base has arch/x86/include/uapi/asm/msr-index.h
Reviewed-by: John Haxby <john.haxby@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Konrad Rzeszutek Wilk [Thu, 4 Jan 2018 17:26:58 +0000 (12:26 -0500)]
kABI: Make the boot_cpu_data look normal.
It is statically allocated and we only grow it, so having a
GENKSYMS guard around it is fine. This fixes aff7641cb9f37c7aa6897a7b51faa6e20b08013f
"x86/cpu/AMD: Add speculative control support for AMD" breaking the kABI.
Tom Lendacky [Thu, 30 Nov 2017 22:46:40 +0000 (16:46 -0600)]
x86/microcode/AMD: Add support for fam17h microcode loading
The size for the Microcode Patch Block (MPB) for an AMD family 17h
processor is 3200 bytes. Add a #define for fam17h so that it does
not default to 2048 bytes and fail a microcode load/update.
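[Illustration: the change is essentially a one-line size-table entry,
sketched here under the assumption that the macro is named like its
siblings in arch/x86/kernel/cpu/microcode/amd.c.]
    #define F17H_MPB_MAX_SIZE 3200  /* was defaulting to 2048 */
    ...
    case 0x17:
        max_size = F17H_MPB_MAX_SIZE;
        break;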
Dave Hansen [Wed, 20 Dec 2017 20:54:52 +0000 (12:54 -0800)]
Set IBPB when running a different VCPU
Picking up a change from:
From: Tim Chen <tim.c.chen@linux.intel.com>
Date: Thu, 30 Nov 2017 15:00:12 +0100
[RHEL7.5 PATCH 07/35] kvm: vmx: Set IBPB when running a different
VCPU
Ensure an IBPB (Indirect Branch Prediction Barrier) before every VCPU
switch.
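[Illustration, hedged: the vmx side issues the barrier via the
architectural MSR when this physical CPU last ran a different VMCS;
the field names follow upstream and may differ in this tree.]
    if (per_cpu(current_vmcs, cpu) != vmx->loaded_vmcs->vmcs) {
        per_cpu(current_vmcs, cpu) = vmx->loaded_vmcs->vmcs;
        /* MSR 0x49, bit 0: Indirect Branch Prediction Barrier */
        native_wrmsrl(MSR_IA32_PRED_CMD, PRED_CMD_IBPB);
    }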
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Jun Nakajima <jun.nakajima@intel.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: John Haxby <john.haxby@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Paolo Bonzini [Thu, 30 Nov 2017 14:00:14 +0000 (15:00 +0100)]
x86/svm: Set IBPB when running a different VCPU
[RHEL7.5 PATCH 09/35] x86/svm: Set IBPB when running a different VCPU
Set IBPB (Indirect Branch Prediction Barrier) when the current CPU is
going to run a VCPU different from what was previously run. Nested
virtualization uses the same VMCB for the second level guest, but the
L1 hypervisor should be using IBRS to protect itself.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: John Haxby <john.haxby@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Jun Nakajima <jun.nakajima@intel.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: John Haxby <john.haxby@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Tim Chen <tim.c.chen@linux.inte.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
[Backport: We don't have 39c06df4dc10a "x86/cpufeature: Cleanup get_cpu_cap()",
which adds a nice enum, nor do we have 2167ceabf3416
"x86/cpu: Add CLZERO detection". As such we just do a partial backport
of the last one and only look for one specific bit (12).]
Reviewed-by: John Haxby <john.haxby@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Konrad Rzeszutek Wilk [Thu, 4 Jan 2018 16:20:00 +0000 (11:20 -0500)]
x86/spec_ctrl: Add sysctl knobs to enable/disable SPEC_CTRL feature
There are two ways to control IBPB and IBRS:
1. At boot time
the noibrs kernel boot parameter will disable IBRS usage
the noibpb kernel boot parameter will disable IBPB usage
If the above parameters are not specified, the system
will enable IBRS and IBPB usage if the CPU supports them.
2. At run time
echo 0 > /proc/sys/kernel/ibrs_enabled will turn off IBRS
echo 1 > /proc/sys/kernel/ibrs_enabled will turn on IBRS in the kernel
echo 2 > /proc/sys/kernel/ibrs_enabled will turn on IBRS in both userspace and the kernel
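[Illustration: the runtime knob plausibly reduces to a standard sysctl
entry; the handler name and data symbol below are assumptions.]
    static struct ctl_table spec_ctrl_table[] = {
        {
            .procname     = "ibrs_enabled",
            .data         = &sysctl_ibrs_enabled,
            .maxlen       = sizeof(unsigned int),
            .mode         = 0644,
            .proc_handler = ibrs_enabled_handler,  /* accepts 0, 1, 2 */
        },
        { }
    };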
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
[Backport: This completes the scaffolding work done in the earlier
patch which had the same title]
Reviewed-by: John Haxby <john.haxby@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: John Haxby <john.haxby@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Backport: We don't have the ORC stack, which means our calling.h
has the CTF code. And that has RESTORE_EXTRA_ARGS and ZERO_EXTRA_ARGS,
so there was no need to port that in. See
commit 76f5df43cab5e765c0bd42289103e8f625813ae1
x86/asm/entry/64: Always allocate a complete "struct pt_regs" on the kernel stack
which added them.
The ZERO_EXTRA_REGS (aka CLEAR_EXTRA_REGS) is not part of it;
it ends up crashing user-space, and we are not yet sure why.
Which means this patch is pretty much useless: we don't clear
any of %r12-%r15, nor %rbp, nor %rbx at all.
In other words, we now just save more registers on the stack
and restore them.
But somewhere we depend on these and need to fix that.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: John Haxby <john.haxby@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Konrad Rzeszutek Wilk [Thu, 4 Jan 2018 16:30:05 +0000 (11:30 -0500)]
x86/mm: Only set IBPB when the new thread cannot ptrace current thread
To reduce the overhead of setting IBPB, we only do that when
the new thread cannot ptrace the current one. If the new
thread has ptrace capability on the current thread, it is safe.
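[Illustration, hedged: a sketch of the check in the mm-switch path;
the precise ptrace mode flag and direction of the access check in this
backport may differ.]
    /* Skipping the IBPB is safe only when the ptrace relationship
     * already allows the incoming task to read the outgoing one. */
    if (tsk && tsk->mm && !ptrace_may_access(tsk, PTRACE_MODE_READ))
        native_wrmsrl(MSR_IA32_PRED_CMD, PRED_CMD_IBPB);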
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
[Backport: Need more #include's than the original]
Reviewed-by: John Haxby <john.haxby@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
[Backport needs an asm/microcode.h to include the native_wrmsrl]
Reviewed-by: John Haxby <john.haxby@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
[Backport: We don't have b466bdb614823
"x86/asm/delay: Introduce an MWAITX-based delay with a configurable timer"
hence the change to delay_mwaitx is not needed]
Reviewed-by: John Haxby <john.haxby@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Andrea Arcangeli [Fri, 15 Dec 2017 00:04:25 +0000 (16:04 -0800)]
x86/spec_ctrl: save IBRS MSR value in paranoid_entry
If the NMI runs while entering the kernel, between SWAPGS and
IBRS_ENABLE, everything is fine: paranoid_entry would have
unconditionally set IBRS bit 0, and when exiting the NMI it would
have cleared bit 0 as if it were returning to userland; IBRS_ENABLE
would then have enabled bit 0 again.
If the NMI instead runs while exiting the kernel, between
IBRS_DISABLE and SWAPGS, the NMI would have turned on IBRS bit 0
and left it enabled when exiting the NMI. IBRS bit 0 would then
stay enabled in userland until the next kernel entry.
That is only a minor inefficiency, but we can eliminate it by
saving the MSR when entering the NMI in save_paranoid and
restoring it when exiting the NMI.
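[Illustration: in C terms the asm added to the NMI path does roughly
this, using MSR_IA32_SPEC_CTRL (0x48) and SPEC_CTRL_IBRS (bit 0); the
real code lives in the paranoid_entry/paranoid_exit assembly.]
    u64 saved;

    rdmsrl(MSR_IA32_SPEC_CTRL, saved);          /* NMI entry */
    wrmsrl(MSR_IA32_SPEC_CTRL, SPEC_CTRL_IBRS);
    /* ... NMI handler runs ... */
    wrmsrl(MSR_IA32_SPEC_CTRL, saved);          /* NMI exit */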
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
[*Scaffolding*: This backport lacks a lot. It is only put in so that
the later patches compile _and_ can be tested to run. It is meant to
be removed once the full set of patches is all good. Aka scaffolding.]
Reviewed-by: John Haxby <john.haxby@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
[Backport: I had to add 'asm/spec_ctrl.h' in the assembler files.
Also we should not put ENABLE_IBRS on irq_entries_start]
Reviewed-by: John Haxby <john.haxby@oracle.com> Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>