In order to be able to give better diagnostics and detect
malicious traffic, we need to have better sk->sk_drops tracking.
Fixes: 9f5afeae5152 ("tcp: use an RB tree for ooo receive queue") Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(adapted from v4.9.x commit 94623c7463f3424776408df2733012c42b52395a)
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
In case an attacker feeds tiny packets completely out of order,
tcp_collapse_ofo_queue() might scan the whole rb-tree, performing
expensive copies, but not changing socket memory usage at all.
1) Do not attempt to collapse tiny skbs.
2) Add logic to exit early when too many tiny skbs are detected.
We prefer not doing aggressive collapsing (which copies packets)
for pathological flows, and revert to tcp_prune_ofo_queue() which
will be less expensive.
In the future, we might add the possibility of terminating flows
that are proven to be malicious.
Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(adapted from v4.9.x commit a878681484a0992ee3dfbd7826439951f9f82a69)
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Right after a TCP flow is created, receiving tiny out of order
packets allways hit the condition :
if (atomic_read(&sk->sk_rmem_alloc) >= sk->sk_rcvbuf)
tcp_clamp_window(sk);
tcp_clamp_window() increases sk_rcvbuf to match sk_rmem_alloc
(guarded by tcp_rmem[2])
Calling tcp_collapse_ofo_queue() in this case is not useful,
and offers a O(N^2) surface attack to malicious peers.
Better not attempt anything before full queue capacity is reached,
forcing attacker to spend lots of resource and allow us to more
easily detect the abuse.
Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(adapted from v4.9.x commit fdf258ed5dd85b57cf0e0e66500be98d38d42d02)
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Juha-Matti Tilli reported that malicious peers could inject tiny
packets in out_of_order_queue, forcing very expensive calls
to tcp_collapse_ofo_queue() and tcp_prune_ofo_queue() for
every incoming packet. out_of_order_queue rb-tree can contain
thousands of nodes, iterating over all of them is not nice.
Before linux-4.9, we would have pruned all packets in ofo_queue
in one go, every XXXX packets. XXXX depends on sk_rcvbuf and skbs
truesize, but is about 7000 packets with tcp_rmem[2] default of 6 MB.
Since we plan to increase tcp_rmem[2] in the future to cope with
modern BDP, can not revert to the old behavior, without great pain.
Strategy taken in this patch is to purge ~12.5 % of the queue capacity.
Fixes: 36a6503fedda ("tcp: refine tcp_prune_ofo_queue() to not drop all packets") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Juha-Matti Tilli <juha-matti.tilli@iki.fi> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(adapted from v4.9.x commit 2d08921c8da26bdce3d8848ef6f32068f594d7d4)
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Yaogong Wang [Wed, 7 Sep 2016 21:49:28 +0000 (14:49 -0700)]
tcp: use an RB tree for ooo receive queue
Over the years, TCP BDP has increased by several orders of magnitude,
and some people are considering to reach the 2 Gbytes limit.
Even with current window scale limit of 14, ~1 Gbytes maps to ~740,000
MSS.
In presence of packet losses (or reorders), TCP stores incoming packets
into an out of order queue, and number of skbs sitting there waiting for
the missing packets to be received can be in the 10^5 range.
Most packets are appended to the tail of this queue, and when
packets can finally be transferred to receive queue, we scan the queue
from its head.
However, in presence of heavy losses, we might have to find an arbitrary
point in this queue, involving a linear scan for every incoming packet,
throwing away cpu caches.
This patch converts it to a RB tree, to get bounded latencies.
Yaogong wrote a preliminary patch about 2 years ago.
Eric did the rebase, added ofo_last_skb cache, polishing and tests.
Tested with network dropping between 1 and 10 % packets, with good
success (about 30 % increase of throughput in stress tests)
Next step would be to also use an RB tree for the write queue at sender
side ;)
Signed-off-by: Yaogong Wang <wygivan@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Acked-By: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
(adapted from v4.9.x commit 9f5afeae51526b3ad7b7cb21ee8b145ce6ea7a7a)
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Eric Dumazet [Wed, 17 Aug 2016 21:17:09 +0000 (14:17 -0700)]
tcp: refine tcp_prune_ofo_queue() to not drop all packets
Over the years, TCP BDP has increased a lot, and is typically
in the order of ~10 Mbytes with help of clever Congestion Control
modules.
In presence of packet losses, TCP stores incoming packets into an out of
order queue, and number of skbs sitting there waiting for the missing
packets to be received can match the BDP (~10 Mbytes)
In some cases, TCP needs to make room for incoming skbs, and current
strategy can simply remove all skbs in the out of order queue as a last
resort, incurring a huge penalty, both for receiver and sender.
Unfortunately these 'last resort events' are quite frequent, forcing
sender to send all packets again, stalling the flow and wasting a lot of
resources.
This patch cleans only a part of the out of order queue in order
to meet the memory constraints.
Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Cc: C. Stephen Gun <csg@google.com> Cc: Van Jacobson <vanj@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(adapted from v4.9.x commit 36a6503feddadbbad415fb3891e80f94c10a9b21)
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Eric Dumazet [Fri, 15 May 2015 19:39:27 +0000 (12:39 -0700)]
tcp: introduce tcp_under_memory_pressure()
Introduce an optimized version of sk_under_memory_pressure()
for TCP. Our intent is to use it in fast paths.
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(adapted from v4.9.x commit b8da51ebb1aa93908350f95efae73aecbc2e266c)
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Eric Dumazet [Fri, 1 Apr 2016 15:52:19 +0000 (08:52 -0700)]
tcp: increment sk_drops for dropped rx packets
Now ss can report sk_drops, we can instruct TCP to increment
this per socket counter when it drops an incoming frame, to refine
monitoring and debugging.
Following patch takes care of listeners drops.
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(adapted from v4.9.x commit 532182cd610782db8c18230c2747626562032205)
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
George Kennedy [Thu, 2 Aug 2018 18:51:42 +0000 (14:51 -0400)]
x86/entry/64: Ensure %ebx handling correct in xen_failsafe_callback
error_entry and error_exit communicate the user vs. kernel status of
the frame using %ebx. Make sure %ebx is set properly in
xen_failsafe_callback().
This fix is different from the one provided in the XSA to minimize
risk of a regression. This alternative fix was suggested
by Andy Lutomirski in the original patch (Commit b3681dd548d0
("x86/entry/64: Remove %ebx handling from error_entry/exit"))
Signed-off-by: George Kennedy <george.kennedy@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Andi Kleen [Fri, 24 Aug 2018 17:03:50 +0000 (10:03 -0700)]
x86/speculation/l1tf: Increase l1tf memory limit for Nehalem+
On Nehalem and newer core CPUs the CPU cache internally uses 44 bits
physical address space. The L1TF workaround is limited by this internal
cache address width, and needs to have one bit free there for the
mitigation to work.
Older client systems report only 36bit physical address space so the range
check decides that L1TF is not mitigated for a 36bit phys/32GB system with
some memory holes.
But since these actually have the larger internal cache width this warning
is bogus because it would only really be needed if the system had more than
43bits of memory.
Add a new internal x86_cache_bits field. Normally it is the same as the
physical bits field reported by CPUID, but for Nehalem and newerforce it to
be at least 44bits.
Change the L1TF memory size warning to use the new cache_bits field to
avoid bogus warnings and remove the bogus comment about memory size.
Fixes: 17dbca119312 ("x86/speculation/l1tf: Add sysfs reporting for l1tf") Reported-by: George Anchev <studio@anchev.net> Reported-by: Christopher Snowhill <kode54@gmail.com> Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: x86@kernel.org Cc: linux-kernel@vger.kernel.org Cc: Michael Hocko <mhocko@suse.com> Cc: vbabka@suse.cz Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20180824170351.34874-1-andi@firstfloor.org
(cherry picked from commit cc51e5428ea54f575d49cfcede1d4cb3a72b4ec4)
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Liam Merwick <liam.merwick@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/include/asm/processor.h
arch/x86/kernel/cpu/common.c
arch/x86/kernel/cpu/bugs.c
Contextual: different content; bugs_64.c was modified instead of bugs.c
Vlastimil Babka [Thu, 23 Aug 2018 14:21:29 +0000 (16:21 +0200)]
x86/speculation/l1tf: Suggest what to do on systems with too much RAM
Two users have reported [1] that they have an "extremely unlikely" system
with more than MAX_PA/2 memory and L1TF mitigation is not effective.
Make the warning more helpful by suggesting the proper mem=X kernel boot
parameter to make it effective and a link to the L1TF document to help
decide if the mitigation is worth the unusable RAM.
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Liam Merwick <liam.merwick@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/kernel/cpu/bugs.c
Contextual: bugs_64.c was modified instead of bugs.c
Vlastimil Babka [Thu, 23 Aug 2018 13:44:18 +0000 (15:44 +0200)]
x86/speculation/l1tf: Fix off-by-one error when warning that system has too much RAM
Two users have reported [1] that they have an "extremely unlikely" system
with more than MAX_PA/2 memory and L1TF mitigation is not effective. In
fact it's a CPU with 36bits phys limit (64GB) and 32GB memory, but due to
holes in the e820 map, the main region is almost 500MB over the 32GB limit:
Suggestions to use 'mem=32G' to enable the L1TF mitigation while losing the
500MB revealed, that there's an off-by-one error in the check in
l1tf_select_mitigation().
l1tf_pfn_limit() returns the last usable pfn (inclusive) and the range
check in the mitigation path does not take this into account.
Instead of amending the range check, make l1tf_pfn_limit() return the first
PFN which is over the limit which is less error prone. Adjust the other
users accordingly.
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Liam Merwick <liam.merwick@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Vlastimil Babka [Mon, 20 Aug 2018 09:58:35 +0000 (11:58 +0200)]
x86/speculation/l1tf: Fix overflow in l1tf_pfn_limit() on 32bit
On 32bit PAE kernels on 64bit hardware with enough physical bits,
l1tf_pfn_limit() will overflow unsigned long. This in turn affects
max_swapfile_size() and can lead to swapon returning -EINVAL. This has been
observed in a 32bit guest with 42 bits physical address size, where
max_swapfile_size() overflows exactly to 1 << 32, thus zero, and produces
the following warning to dmesg:
[ 6.396845] Truncating oversized swap area, only using 0k out of 2047996k
Fix this by using unsigned long long instead.
Fixes: 17dbca119312 ("x86/speculation/l1tf: Add sysfs reporting for l1tf") Fixes: 377eeaa8e11f ("x86/speculation/l1tf: Limit swap file size to MAX_PA/2") Reported-by: Dominique Leuenberger <dimstar@suse.de> Reported-by: Adrian Schroeter <adrian@suse.de> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Andi Kleen <ak@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20180820095835.5298-1-vbabka@suse.cz
(cherry picked from commit 9df9516940a61d29aedf4d91b483ca6597e7d480)
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Liam Merwick <liam.merwick@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Sean Christopherson [Fri, 17 Aug 2018 17:27:36 +0000 (10:27 -0700)]
x86/speculation/l1tf: Exempt zeroed PTEs from inversion
It turns out that we should *not* invert all not-present mappings,
because the all zeroes case is obviously special.
clear_page() does not undergo the XOR logic to invert the address bits,
i.e. PTE, PMD and PUD entries that have not been individually written
will have val=0 and so will trigger __pte_needs_invert(). As a result,
{pte,pmd,pud}_pfn() will return the wrong PFN value, i.e. all ones
(adjusted by the max PFN mask) instead of zero. A zeroed entry is ok
because the page at physical address 0 is reserved early in boot
specifically to mitigate L1TF, so explicitly exempt them from the
inversion when reading the PFN.
Manifested as an unexpected mprotect(..., PROT_NONE) failure when called
on a VMA that has VM_PFNMAP and was mmap'd to as something other than
PROT_NONE but never used. mprotect() sends the PROT_NONE request down
prot_none_walk(), which walks the PTEs to check the PFNs.
prot_none_pte_entry() gets the bogus PFN from pte_pfn() and returns
-EACCES because it thinks mprotect() is trying to adjust a high MMIO
address.
[ This is a very modified version of Sean's original patch, but all
credit goes to Sean for doing this and also pointing out that
sometimes the __pte_needs_invert() function only gets the protection
bits, not the full eventual pte. But zero remains special even in
just protection bits, so that's ok. - Linus ]
Fixes: f22cc87f6c1f ("x86/speculation/l1tf: Invert all not present mappings") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Acked-by: Andi Kleen <ak@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit f19f5c49bbc3ffcc9126cc245fc1b24cc29f4a37)
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Liam Merwick <liam.merwick@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Liam Merwick <liam.merwick@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/kernel/cpu/bugs.c
Contextual: bugs_64.c was modified instead of bugs.c
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Liam Merwick <liam.merwick@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Conflicts: bad_spectre_microcode was in arch/x86/kernel/cpu/scattered.c Signed-off-by: Brian Maly <brian.maly@oracle.com>
Thomas Gleixner [Sun, 12 Aug 2018 18:41:45 +0000 (20:41 +0200)]
KVM: x86: SVM: Call x86_spec_ctrl_set_guest/host() with interrupts disabled
Mikhail reported the following lockdep splat:
WARNING: possible irq lock inversion dependency detected
CPU 0/KVM/10284 just changed the state of lock: 000000000d538a88 (&st->lock){+...}, at:
speculative_store_bypass_update+0x10b/0x170
but this lock was taken by another, HARDIRQ-safe lock
in the past:
(&(&sighand->siglock)->rlock){-.-.}
and interrupts could create inverse lock ordering between them.
In svm_vcpu_run() speculative_store_bypass_update() is called with
interupts enabled via x86_virt_spec_ctrl_set_guest/host().
This is actually a false positive, because GIF=0 so interrupts are
disabled even if IF=1; however, we can easily move the invocations of
x86_virt_spec_ctrl_set_guest/host() into the interrupt disabled region to
cure it, and it's a good idea to keep the GIF=0/IF=1 area as small
and self-contained as possible.
Fixes: 1f50ddb4f418 ("x86/speculation: Handle HT correctly on AMD") Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Borislav Petkov <bp@suse.de> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: kvm@vger.kernel.org Cc: x86@kernel.org Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit 024d83cadc6b2af027e473720f3c3da97496c318)
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Liam Merwick <liam.merwick@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Conflicts: contextual Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Liam Merwick <liam.merwick@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Ashok Raj [Wed, 28 Feb 2018 10:28:43 +0000 (11:28 +0100)]
x86/microcode: Do not upload microcode if CPUs are offline
Avoid loading microcode if any of the CPUs are offline, and issue a
warning. Having different microcode revisions on the system at any time
is outright dangerous.
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Liam Merwick <liam.merwick@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/kernel/cpu/microcode/core.c
Contextual
in for(),if((optlen > 0) && (optptr[1] == 0)), enter infinite loop.
Test: receive a packet which the ip length > 20 and the first byte of ip option is 0, produce this issue
Signed-off-by: yujuan.qi <yujuan.qi@mediatek.com> Acked-by: Paul Moore <paul@paul-moore.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 40413955ee265a5e42f710940ec78f5450d49149) Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Brian Maly <brian.maly@oracle.com Signed-off-by: Brian Maly <brian.maly@oracle.com>
[ 58.138831] Buffer I/O error on dev dm-0, logical block 2621424, async page read
[ 58.151233] BTRFS error (device sdf): bdev /dev/mapper/error-test errs: wr 1, rd 0, flush 0, corrupt 0, gen
[ 58.152403] list_add corruption. prev->next should be next (ffff88005e6775d8), but was ffffc9000189be88. (pr
[ 58.153518] ------------[ cut here ]------------
[ 58.153892] WARNING: CPU: 1 PID: 1287 at lib/list_debug.c:31 __list_add_valid+0x169/0x1f0
...
[ 58.157379] RIP: 0010:__list_add_valid+0x169/0x1f0
...
[ 58.161956] Call Trace:
[ 58.162264] btrfs_log_inode_parent+0x5bd/0xfb0 [btrfs]
[ 58.163583] btrfs_log_dentry_safe+0x60/0x80 [btrfs]
[ 58.164003] btrfs_sync_file+0x4c2/0x6f0 [btrfs]
[ 58.164393] vfs_fsync_range+0x5f/0xd0
[ 58.164898] do_fsync+0x5a/0x90
[ 58.165170] SyS_fsync+0x10/0x20
[ 58.165395] entry_SYSCALL_64_fastpath+0x1f/0xbe
...
It turns out that we could record btrfs_log_ctx:io_err in
log_one_extents when IO fails, but make log_one_extents() return '0'
instead of -EIO, so the IO error is not acknowledged by the callers,
i.e. btrfs_log_inode_parent(), which would remove btrfs_log_ctx:list
from list head 'root->log_ctxs'. Since btrfs_log_ctx is allocated
from stack memory, it'd get freed with a object alive on the
list. then a future list_add will throw the above warning.
This returns the correct error in the above case.
Jeff also reported this while testing against his fsync error
patch set[1].
[1]: https://www.spinics.net/lists/linux-btrfs/msg65308.html
"btrfs list corruption and soft lockups while testing writeback error handling"
Fixes: 8407f553268a4611f254 ("Btrfs: fix data corruption after fast fsync and writeback error") Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
(cherrypicked from commit ebb70442cdd4872260c2415929c456be3562da82)
Conflicts: fs/btrfs/file.c Signed-off-by: Ashish Samant <ashish.samant@oracle.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Peter Zijlstra [Fri, 3 Aug 2018 14:41:39 +0000 (16:41 +0200)]
x86/paravirt: Fix spectre-v2 mitigations for paravirt guests
Nadav reported that on guests we're failing to rewrite the indirect
calls to CALLEE_SAVE paravirt functions. In particular the
pv_queued_spin_unlock() call is left unpatched and that is all over the
place. This obviously wrecks Spectre-v2 mitigation (for paravirt
guests) which relies on not actually having indirect calls around.
The reason is an incorrect clobber test in paravirt_patch_call(); this
function rewrites an indirect call with a direct call to the _SAME_
function, there is no possible way the clobbers can be different
because of this.
Therefore remove this clobber check. Also put WARNs on the other patch
failure case (not enough room for the instruction) which I've not seen
trigger in my (limited) testing.
Three live kernel image disassemblies for lock_sock_nested (as a small
function that illustrates the problem nicely). PRE is the current
situation for guests, POST is with this patch applied and NATIVE is with
or without the patch for !guests.
Signed-off-by: George Kennedy <george.kennedy@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Suggested-by: Matthew Wilcox <matthew.wilcox@oracle.com> Signed-off-by: George Kennedy <george.kennedy@oracle.com> Reviewed-by: Mark Kanda <mark.kanda@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
close_sync() needs to set conf->next_resync to a large, but safe value
below MaxSector and use it to determine whether or not to set
start_next_window in wait_barrier()
Solution suggested by Neil Brown.
Reported-by: Nate Dailey <nate.dailey@stratus.com> Tested-by: Xiao Ni <xni@redhat.com> Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com>
(cherry picked from commit e8ff8bf09ff49733534ff3cee91bde030186055f)
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com> Reviewed-by: Ashish Samant <ashish.samant@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Jan Kara [Thu, 11 Aug 2016 16:38:55 +0000 (12:38 -0400)]
ext4: avoid deadlock when expanding inode size
When we need to move xattrs into external xattr block, we call
ext4_xattr_block_set() from ext4_expand_extra_isize_ea(). That may end
up calling ext4_mark_inode_dirty() again which will recurse back into
the inode expansion code leading to deadlocks.
Protect from recursion using EXT4_STATE_NO_EXPAND inode flag and move
its management into ext4_expand_extra_isize_ea() since its manipulation
is safe there (due to xattr_sem) from possible races with
ext4_xattr_set_handle() which plays with it as well.
CC: stable@vger.kernel.org # 4.4.x Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
(cherry picked from commit 2e81a4eeedcaa66e35f58b81e0755b87057ce392)
Signed-off-by: Shuning Zhang <sunny.s.zhang@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Jan Kara [Thu, 11 Aug 2016 16:00:01 +0000 (12:00 -0400)]
ext4: properly align shifted xattrs when expanding inodes
We did not count with the padding of xattr value when computing desired
shift of xattrs in the inode when expanding i_extra_isize. As a result
we could create unaligned start of inline xattrs. Account for alignment
properly.
Signed-off-by: Shuning Zhang <sunny.s.zhang@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Jan Kara [Thu, 11 Aug 2016 15:58:32 +0000 (11:58 -0400)]
ext4: fix xattr shifting when expanding inodes part 2
When multiple xattrs need to be moved out of inode, we did not properly
recompute total size of xattr headers in the inode and the new header
position. Thus when moving the second and further xattr we asked
ext4_xattr_shift_entries() to move too much and from the wrong place,
resulting in possible xattr value corruption or general memory
corruption.
CC: stable@vger.kernel.org # 4.4.x Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
(cherry picked from commit 418c12d08dc64a45107c467ec1ba29b5e69b0715)
Signed-off-by: Shuning Zhang <sunny.s.zhang@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Jan Kara [Thu, 11 Aug 2016 15:50:30 +0000 (11:50 -0400)]
ext4: fix xattr shifting when expanding inodes
The code in ext4_expand_extra_isize_ea() treated new_extra_isize
argument sometimes as the desired target i_extra_isize and sometimes as
the amount by which we need to grow current i_extra_isize. These happen
to coincide when i_extra_isize is 0 which used to be the common case and
so nobody noticed this until recently when we added i_projid to the
inode and so i_extra_isize now needs to grow from 28 to 32 bytes.
The result of these bugs was that we sometimes unnecessarily decided to
move xattrs out of inode even if there was enough space and we often
ended up corrupting in-inode xattrs because arguments to
ext4_xattr_shift_entries() were just wrong. This could demonstrate
itself as BUG_ON in ext4_xattr_shift_entries() triggering.
Fix the problem by introducing new isize_diff variable and use it where
appropriate.
CC: stable@vger.kernel.org # 4.4.x Reported-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
(cherry picked from commit d0141191a20289f8955c1e03dad08e42e6f71ca9)
Signed-off-by: Shuning Zhang <sunny.s.zhang@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Victor Erminpour [Wed, 25 Jul 2018 16:38:51 +0000 (09:38 -0700)]
uek-rpm: Enable perf stripped binary
Due to a bug in the find-debuginfo.sh file provided by
RPM, the perf binary was not properly stripped in OL7.
This commit provides a new patch, and cleans up previous
patches to find-debuginfo.sh.
Signed-off-by: Victor Erminpour <victor.erminpour@oracle.com> Reviewed-by: Dan Duval <dan.duval@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
J. Bruce Fields [Tue, 19 Sep 2017 23:25:41 +0000 (19:25 -0400)]
nfsd: give out fewer session slots as limit approaches
Instead of granting client's full requests until we hit our DRC size
limit and then failing CREATE_SESSIONs (and hence mounts) completely,
start granting clients smaller slot tables as we approach the limit.
J. Bruce Fields [Wed, 20 Sep 2017 00:51:31 +0000 (20:51 -0400)]
nfsd: increase DRC cache limit
An NFSv4.1+ client negotiates the size of its duplicate reply cache size
in the initial CREATE_SESSION request. The server preallocates the
memory for the duplicate reply cache to ensure that we'll never fail to
record the response to a nonidempotent operation.
To prevent a few CREATE_SESSIONs from consuming all of memory we set an
upper limit based on nr_free_buffer_pages(). 1/2^10 has been too
limiting in practice; 1/2^7 is still less than one percent.
Knut Omang [Fri, 23 Mar 2018 14:49:26 +0000 (15:49 +0100)]
uek-rpm: config-debug: Turn off torture testing by default
The OL7 debug kernels are configured with multiple torture tests
built into the kernel, and consequently are all of them
trying to run by default during boot. This is a waste
of time and kWs.
This can be observed by the following
messages in the system log during boot:
kernel: spin_lock-torture:--- Start of test [debug]: nwriters_stress=4 nreaders_stress=0 stat_interval=60 verbose=1 shuffle_interval=3 stutter=5 shutdown_secs=0
kernel: spin_lock-torture: Creating torture_shuffle task
kernel: spin_lock-torture: Creating torture_stutter task
kernel: spin_lock-torture: torture_shuffle task started
kernel: spin_lock-torture: Creating lock_torture_writer task
kernel: spin_lock-torture: torture_stutter task started
kernel: spin_lock-torture: Creating lock_torture_writer task
kernel: spin_lock-torture: lock_torture_writer task started
kernel: spin_lock-torture: Creating lock_torture_writer task
kernel: spin_lock-torture: lock_torture_writer task started
kernel: spin_lock-torture: Creating lock_torture_writer task
kernel: spin_lock-torture: lock_torture_writer task started
kernel: spin_lock-torture: Creating lock_torture_stats task
kernel: spin_lock-torture: lock_torture_writer task started
kernel: spin_lock-torture: lock_torture_stats task started
kernel: torture_init_begin: Refusing rcu init: spin_lock running.
kernel: torture_init_begin: One torture test at a time!
Better to compile these as modules to make them readily
available for testing if need be, than to run them all
at boot which does not seem to make sense at all judging
from the above log.
Junichi Nomura [Fri, 10 Jun 2016 04:31:52 +0000 (04:31 +0000)]
ipmi: Remove smi_msg from waiting_rcv_msgs list before handle_one_recv_msg()
Commit 7ea0ed2b5be8 ("ipmi: Make the message handler easier to use for
SMI interfaces") changed handle_new_recv_msgs() to call handle_one_recv_msg()
for a smi_msg while the smi_msg is still connected to waiting_rcv_msgs list.
That could lead to following list corruption problems:
1) low-level function treats smi_msg as not connected to list
handle_one_recv_msg() could end up calling smi_send(), which
assumes the msg is not connected to list.
For example, the following sequence could corrupt list by
doing list_add_tail() for the entry still connected to other list.
Fixes: 7ea0ed2b5be8 ("ipmi: Make the message handler easier to use for SMI interfaces") Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
[Added a comment to describe why this works.] Signed-off-by: Corey Minyard <cminyard@mvista.com> Cc: stable@vger.kernel.org # 3.19 Tested-by: Ye Feng <yefeng.yl@alibaba-inc.com>
(cherry picked from commit ae4ea9a2460c7fee2ae8feeb4dfe96f5f6c3e562)
Yazen Ghannam [Thu, 30 Mar 2017 11:17:14 +0000 (13:17 +0200)]
x86/mce/AMD: Give a name to MCA bank 3 when accessed with legacy MSRs
MCA bank 3 is reserved on systems pre-Fam17h, so it didn't have a name.
However, MCA bank 3 is defined on Fam17h systems and can be accessed
using legacy MSRs. Without a name we get a stack trace on Fam17h systems
when trying to register sysfs files for bank 3 on kernels that don't
recognize Scalable MCA.
Call MCA bank 3 "decode_unit" since this is what it represents on
Fam17h. This will allow kernels without SMCA support to see this bank on
Fam17h+ and prevent the stack trace. This will not affect older systems
since this bank is reserved on them, i.e. it'll be ignored.
Tested on AMD Fam15h and Fam17h systems.
WARNING: CPU: 26 PID: 1 at lib/kobject.c:210 kobject_add_internal
kobject: (ffff88085bb256c0): attempted to be registered with empty name!
...
Call Trace:
kobject_add_internal
kobject_add
kobject_create_and_add
threshold_create_device
threshold_init_device
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
sgid directories have special semantics, making newly created files in
the directory belong to the group of the directory, and newly created
subdirectories will also become sgid. This is historically used for
group-shared directories.
But group directories writable by non-group members should not imply
that such non-group members can magically join the group, so make sure
to clear the sgid bit on non-directories for non-members (but remember
that sgid without group execute means "mandatory locking", just to
confuse things even more).
Reported-by: Jann Horn <jannh@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 0fa3ecd87848c9c93c2c828ef4c3a8ca36ce46c7)
Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Jason Yan [Thu, 8 Mar 2018 02:34:53 +0000 (10:34 +0800)]
scsi: libsas: defer ata device eh commands to libata
When ata device doing EH, some commands still attached with tasks are
not passed to libata when abort failed or recover failed, so libata did
not handle these commands. After these commands done, sas task is freed,
but ata qc is not freed. This will cause ata qc leak and trigger a
warning like below:
If ata qc leaked too many, ata tag allocation will fail and io blocked
for ever.
As suggested by Dan Williams, defer ata device commands to libata and
merge sas_eh_finish_cmd() with sas_eh_defer_cmd(). libata will handle
ata qcs correctly after this.
Signed-off-by: Jason Yan <yanaijie@huawei.com> CC: Xiaofei Tan <tanxiaofei@huawei.com> CC: John Garry <john.garry@huawei.com> CC: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 318aaf34f1179b39fa9c30fa0f3288b645beee39)
Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Bjorn Helgaas [Tue, 7 Aug 2018 00:51:33 +0000 (20:51 -0400)]
PCI: Allocate ATS struct during enumeration
Previously, we allocated pci_ats structures when an IOMMU driver called
pci_enable_ats(). An SR-IOV VF shares the STU setting with its PF, so when
enabling ATS on the VF, we allocated a pci_ats struct for the PF if it
didn't already have one. We held the sriov->lock to serialize threads
concurrently enabling ATS on several VFS so only one would allocate the PF
pci_ats.
There's no reason to delay allocating the pci_ats struct, and if we
allocate it for each device at enumeration-time, there's no need for
locking in pci_enable_ats().
Allocate pci_ats struct during enumeration, when we initialize other
capabilities.
Note that this implementation requires ATS to be enabled on the PF first,
before on any of the VFs because the PF controls the STU for all the VFs.
Signed-off-by: Kanth Ghatraju <kanth.ghatraju@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Rajan Shanmugavelu <rajan.shanmugavelu@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
qla2xxx: Utilize complete local DMA buffer for DIF PI inforamtion.
When any of the DIF PI SGEs received from upper layer is lesser than the
local DMA SGE length, then local DMA buffer was getting partially filled
causing DIF errors.
Signed-off-by: Rajan Shanmugavelu <rajan.shanmugavelu@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Rajan Shanmugavelu <rajan.shanmugavelu@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
This is introduced by the 490515384dc4 (scsi: megaraid_sas: Use
synchronize_irq to wait for IRQs to complete). The introduced
pci_irq_vector() compatibility wrapper did not take all interrupt
delivery methods into account. We should use the instance->msixentry[]
to get the irq number for the case of instance->msix_vectors > 0.
Fixes: 490515384dc4 (scsi: megaraid_sas: Use synchronize_irq to wait for IRQs to complete) Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com> Cc: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Cc: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
The ALSA sequencer ioctls have no protection against racy calls while
the concurrent operations may lead to interfere with each other. As
reported recently, for example, the concurrent calls of setting client
pool with a combination of write calls may lead to either the
unkillable dead-lock or UAF.
As a slightly big hammer solution, this patch introduces the mutex to
make each ioctl exclusive. Although this may reduce performance via
parallel ioctl calls, usually it's not demanded for sequencer usages,
hence it should be negligible.
Reported-by: Luo Quan <a4651386@163.com> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit c3162384aed4cfe3f1a1f40041f3ba8cd7704d88) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
sound/core/seq/seq_clientmgr.c
Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Takashi Iwai [Mon, 12 Feb 2018 14:20:51 +0000 (15:20 +0100)]
ALSA: seq: Fix racy pool initializations
ALSA sequencer core initializes the event pool on demand by invoking
snd_seq_pool_init() when the first write happens and the pool is
empty. Meanwhile user can reset the pool size manually via ioctl
concurrently, and this may lead to UAF or out-of-bound accesses since
the function tries to vmalloc / vfree the buffer.
A simple fix is to just wrap the snd_seq_pool_init() call with the
recently introduced client->ioctl_mutex; as the calls for
snd_seq_pool_init() from other side are always protected with this
mutex, we can avoid the race.
Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
As part of the ongoing changes to the block layer, the semantics of
bip_for_each_vec() changed from walking the entire bip vec (as originally
designed) to only walking the residual. This led to a memory leak as page
cache pages were not released during I/O completion.
Manually walk the bip vec in oracleasm instead of relying on the block
layer helper.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
If a user runs an old ASMLIB without integrity support, the bip
attached to a bio is owned by the block layer. We should not attempt
to unmap the pages in question.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Using gcc7 (gcc version 7.2.1) on aarch64, the following error is produced:
drivers/block/oracleasm/driver.c:2385:55: error: positional initialization of
field in 'struct' declared with 'designated_init'
attribute [-Werror=designated-init]
static struct file_operations asmfs_dir_operations = {0, };
^
drivers/block/oracleasm/driver.c:2385:55: note: (
near initialization for 'asmfs_dir_operations')
cc1: some warnings being treated as errors
Initialization of asmfs_dir_operations happens in
init_asmfs_dir_operations, where asmfs_dir_operations is
initialized to simple_dir_operations.
Fix the above error, by removing superfluous static initialization line.
Signed-off-by: Tom Saeger <tom.saeger@oracle.com> Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
We currently check current frags memory usage only when
a new frag queue is created. This allows attackers to first
consume the memory budget (default : 4 MB) creating thousands
of frag queues, then sending tiny skbs to exceed high_thresh
limit by 2 to 3 order of magnitude.
Note that before commit 648700f76b03 ("inet: frags: use rhashtables
for reassembly units"), work queue could be starved under DOS,
getting no cpu cycles.
After commit 648700f76b03, only the per frag queue timer can eventually
remove an incomplete frag queue and its skbs.
Fixes: b13d3cbfb8e8 ("inet: frag: move eviction of queues to work queue") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Jann Horn <jannh@google.com> Cc: Florian Westphal <fw@strlen.de> Cc: Peter Oskolkov <posk@google.com> Cc: Paolo Abeni <pabeni@redhat.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit df30bfccc463082cfc2a5b164e5590403f16af94)
Mihai Carabas [Mon, 13 Aug 2018 18:49:53 +0000 (21:49 +0300)]
x86/pgtable.h: fix PMD/PUD mask
Fix commit 4b9ccc49725729d5026d764cd17c9d3e33de296a (x86/speculation/l1tf:
Protect PROT_NONE PTEs against speculation). On UEK4 we did not have
pmd_pfn_mask/pud_pfn_mask and assumed PTE_PFN_MASK.
x86/asm: Add pud/pmd mask interfaces to handle large PAT bit
The PAT bit gets relocated to bit 12 when PUD and PMD mappings are
used. This bit 12, however, is not covered by PTE_FLAGS_MASK, which
is used for masking pfn and flags for all levels.
Add pud/pmd mask interfaces to handle pfn and flags properly by using
P?D_PAGE_MASK when PUD/PMD mappings are used, i.e. PSE bit is set.
Suggested-by: Juergen Gross <jgross@suse.com> Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Juergen Gross <jgross@suse.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Konrad Wilk <konrad.wilk@oracle.com> Cc: Robert Elliot <elliott@hpe.com> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/1442514264-12475-4-git-send-email-toshi.kani@hpe.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
(cherry picked from commit 4be4c1fb9a754b100466ebaec50f825be0b2050b)
Boris Ostrovsky [Mon, 13 Aug 2018 00:45:49 +0000 (20:45 -0400)]
x86/speculation: parse l1tf boot parameter early
early_param() is too late for us to use for parsing "l1tf" boot parameter,
we should do it similarly to how spectre_v2_select_mitigation() and
ssbd_ibrs_selected() do it.
timer_create() specifies via sigevent->sigev_notify the signal delivery for
the new timer. The valid modes are SIGEV_NONE, SIGEV_SIGNAL, SIGEV_THREAD
and (SIGEV_SIGNAL | SIGEV_THREAD_ID).
The sanity check in good_sigevent() is only checking the valid combination
for the SIGEV_THREAD_ID bit, i.e. SIGEV_SIGNAL, but if SIGEV_THREAD_ID is
not set it accepts any random value.
This has no real effects on the posix timer and signal delivery code, but
it affects show_timer() which handles the output of /proc/$PID/timers. That
function uses a string array to pretty print sigev_notify. The access to
that array has no bound checks, so random sigev_notify cause access beyond
the array bounds.
Add proper checks for the valid notify modes and remove the SIGEV_THREAD_ID
masking from various code pathes as SIGEV_NONE can never be set in
combination with SIGEV_THREAD_ID.
Reported-by: Eric Biggers <ebiggers3@gmail.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Reported-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: John Stultz <john.stultz@linaro.org> Cc: stable@vger.kernel.org Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
(cherry picked from commit 16cd05f25489459d10035ffab9cb7391512f1437) Signed-off-by: Todd Vierling <todd.vierling@oracle.com>
Andi Kleen [Tue, 7 Aug 2018 22:09:38 +0000 (15:09 -0700)]
x86/mm/kmmio: Make the tracer robust against L1TF
The mmio tracer sets io mapping PTEs and PMDs to non present when enabled
without inverting the address bits, which makes the PTE entry vulnerable
for L1TF.
Make it use the right low level macros to actually invert the address bits
to protect against L1TF.
In principle this could be avoided because MMIO tracing is not likely to be
enabled on production machines, but the fix is straigt forward and for
consistency sake it's better to get rid of the open coded PTE manipulation.
Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
(cherry picked from commit 1063711b57393c1999248cccb57bebfaf16739e7)
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Andi Kleen [Tue, 7 Aug 2018 22:09:39 +0000 (15:09 -0700)]
x86/mm/pat: Make set_memory_np() L1TF safe
set_memory_np() is used to mark kernel mappings not present, but it has
it's own open coded mechanism which does not have the L1TF protection of
inverting the address bits.
Replace the open coded PTE manipulation with the L1TF protecting low level
PTE routines.
Passes the CPA self test.
Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
(cherry picked from commit 958f79b9ee55dfaf00c8106ed1c22a2919e0028b)
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/mm/pageattr.c
We do not have complete pud infrastructure. Backporting all those interfaces,
will need some surgery. Manually make the desired operations for l1tf
inversions.
Matt Fleming [Fri, 27 Nov 2015 21:09:31 +0000 (21:09 +0000)]
x86/mm/pat: Ensure cpa->pfn only contains page frame numbers
The x86 pageattr code is confused about the data that is stored
in cpa->pfn, sometimes it's treated as a page frame number,
sometimes it's treated as an unshifted physical address, and in
one place it's treated as a pte.
The result of this is that the mapping functions do not map the
intended physical address.
This isn't a problem in practice because most of the addresses
we're mapping in the EFI code paths are already mapped in
'trampoline_pgd' and so the pageattr mapping functions don't
actually do anything in this case. But when we move to using a
separate page table for the EFI runtime this will be an issue.
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk> Reviewed-by: Borislav Petkov <bp@suse.de> Acked-by: Borislav Petkov <bp@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Toshi Kani <toshi.kani@hp.com> Cc: linux-efi@vger.kernel.org Link: http://lkml.kernel.org/r/1448658575-17029-3-git-send-email-matt@codeblueprint.co.uk Signed-off-by: Ingo Molnar <mingo@kernel.org>
Orabug: 28220674
CVE: CVE-2018-3620
Andi Kleen [Tue, 7 Aug 2018 22:09:37 +0000 (15:09 -0700)]
x86/speculation/l1tf: Make pmd/pud_mknotpresent() invert
Some cases in THP like:
- MADV_FREE
- mprotect
- split
mark the PMD non present for temporarily to prevent races. The window for
an L1TF attack in these contexts is very small, but it wants to be fixed
for correctness sake.
Use the proper low level functions for pmd/pud_mknotpresent() to address
this.
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/include/asm/pgtable.h
pgtable.h: different content
Andi Kleen [Tue, 7 Aug 2018 22:09:36 +0000 (15:09 -0700)]
x86/speculation/l1tf: Invert all not present mappings
For kernel mappings PAGE_PROTNONE is not necessarily set for a non present
mapping, but the inversion logic explicitely checks for !PRESENT and
PROT_NONE.
Remove the PROT_NONE check and make the inversion unconditional for all not
present mappings.
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Thomas Gleixner [Tue, 7 Aug 2018 06:19:57 +0000 (08:19 +0200)]
cpu/hotplug: Fix SMT supported evaluation
Josh reported that the late SMT evaluation in cpu_smt_state_init() sets
cpu_smt_control to CPU_SMT_NOT_SUPPORTED in case that 'nosmt' was supplied
on the kernel command line as it cannot differentiate between SMT disabled
by BIOS and SMT soft disable via 'nosmt'. That wreckages the state and
makes the sysfs interface unusable.
Rework this so that during bringup of the non boot CPUs the availability of
SMT is determined in cpu_smt_allowed(). If a newly booted CPU is not a
'primary' thread then set the local cpu_smt_available marker and evaluate
this explicitely right after the initial SMP bringup has finished.
SMT evaulation on x86 is a trainwreck as the firmware has all the
information _before_ booting the kernel, but there is no interface to query
it.
Fixes: 73d5e2b47264 ("cpu/hotplug: detect SMT disabled by BIOS") Reported-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Orabug: 28220674
CVE: CVE-2018-3620
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/kernel/cpu/bugs.c
kernel/cpu.c
kernel/smp.c
bugs.c: different filename
cpu.c and smp.c: contextual
Paolo Bonzini [Sun, 5 Aug 2018 14:07:47 +0000 (16:07 +0200)]
KVM: VMX: Tell the nested hypervisor to skip L1D flush on vmentry
When nested virtualization is in use, VMENTER operations from the nested
hypervisor into the nested guest will always be processed by the bare metal
hypervisor, and KVM's "conditional cache flushes" mode in particular does a
flush on nested vmentry. Therefore, include the "skip L1D flush on
vmentry" bit in KVM's suggested ARCH_CAPABILITIES setting.
Add the relevant Documentation.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Orabug: 28220674
CVE: CVE-2018-3646
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/kvm/vmx.c
Contextual: we did not have "if(boot_cpu_has(X86_FEATURE_ARCH_CAPABILITIES))"
Paolo Bonzini [Sun, 5 Aug 2018 14:07:46 +0000 (16:07 +0200)]
x86/speculation: Use ARCH_CAPABILITIES to skip L1D flush on vmentry
Bit 3 of ARCH_CAPABILITIES tells a hypervisor that L1D flush on vmentry is
not needed. Add a new value to enum vmx_l1d_flush_state, which is used
either if there is no L1TF bug at all, or if bit 3 is set in ARCH_CAPABILITIES.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Orabug: 28220674
CVE: CVE-2018-3620
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/include/asm/msr-index.h
arch/x86/kernel/cpu/bugs.c
msr-index.h: different location
bugs.c: different filename
KarimAllah Ahmed [Thu, 1 Feb 2018 21:59:44 +0000 (22:59 +0100)]
KVM/VMX: Emulate MSR_IA32_ARCH_CAPABILITIES
Intel processors use MSR_IA32_ARCH_CAPABILITIES MSR to indicate RDCL_NO
(bit 0) and IBRS_ALL (bit 1). This is a read-only MSR. By default the
contents will come directly from the hardware, but user-space can still
override it.
[dwmw2: The bit in kvm_cpuid_7_0_edx_x86_features can be unconditional]
Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Jim Mattson <jmattson@google.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Jun Nakajima <jun.nakajima@intel.com> Cc: kvm@vger.kernel.org Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Asit Mallick <asit.k.mallick@intel.com> Cc: Arjan Van De Ven <arjan.van.de.ven@intel.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Ashok Raj <ashok.raj@intel.com> Link: https://lkml.kernel.org/r/1517522386-18410-4-git-send-email-karahmed@amazon.de
Orabug: 28220674
CVE: CVE-2018-3646
We need this commit for the next one "x86/speculation: Use ARCH_CAPABILITIES to skip L1D flush on vmentry".
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/kvm/cpuid.c
arch/x86/kvm/vmx.c
arch/x86/kvm/x86.c
cpuid.c: ARCH_CAPABILITIES is scattered. Define it for KVM
x86.c: contextual
vmx.c: we do not have msr_info structure. Drop the checks.
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/kernel/cpu/bugs.c
Different filename: bugs_64.c
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/kvm/vmx.c
Contextual: different content
x86/irq: Let interrupt handlers set kvm_cpu_l1tf_flush_l1d
The last missing piece to having vmx_l1d_flush() take interrupts after
VMEXIT into account is to set the kvm_cpu_l1tf_flush_l1d per-cpu flag on
irq entry.
Issue calls to kvm_set_cpu_l1tf_flush_l1d() from entering_irq(),
ipi_entering_ack_irq(), smp_reschedule_interrupt() and
uv_bau_message_interrupt().
Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Nicolai Stange <nstange@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Orabug: 28220674
CVE: CVE-2018-3646
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/include/asm/apic.h
arch/x86/kernel/smp.c
Contextual: different content
This causes compilation errors because of the header guards becoming
effective in the second inclusion: symbols/macros that had been defined
before wouldn't be available to intermediate headers in the #include chain
anymore.
A possible workaround would be to move the definition of irq_cpustat_t
into its own header and include that from both, asm/hardirq.h and
asm/apic.h.
However, this wouldn't solve the real problem, namely asm/harirq.h
unnecessarily pulling in all the linux/irq.h cruft: nothing in
asm/hardirq.h itself requires it. Also, note that there are some other
archs, like e.g. arm64, which don't have that #include in their
asm/hardirq.h.
Remove the linux/irq.h #include from x86' asm/hardirq.h.
Fix resulting compilation errors by adding appropriate #includes to *.c
files as needed.
Note that some of these *.c files could be cleaned up a bit wrt. to their
set of #includes, but that should better be done from separate patches, if
at all.
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/kernel/apic/vector.c
drivers/gpu/drm/i915/i915_pmu.c
drivers/pci/controller/pci-hyperv.c
arch/x86/kernel/apic/apic.c
arch/x86/kernel/apic/htirq.c
arch/x86/kernel/fpu/core.c
arch/x86/kernel/idt.c
arch/x86/kernel/kprobes/core.c
arch/x86/mm/fault.c
arch/x86/mm/pti.c
arch/x86/platform/intel-mid/device_libs/platform_mrfld_wdt.c
drivers/gpu/drm/i915/intel_lpe_audio.c
drivers/pci/host/pci-hyperv.c
Contextual: different content
In these files we included "linux/irq.h" just to preserve KABI:
arch/x86/kernel/ftrace.c
drivers/misc/ioc4.c
drivers/pci/pci.c
drivers/pci/pcie/aer/aerdrv_core.c
drivers/pci/search.c
drivers/pnp/resource.c
fs/file_table.c
mm/vmstat.c
x86/KVM/VMX: Introduce per-host-cpu analogue of l1tf_flush_l1d
Part of the L1TF mitigation for vmx includes flushing the L1D cache upon
VMENTRY.
L1D flushes are costly and two modes of operations are provided to users:
"always" and the more selective "conditional" mode.
If operating in the latter, the cache would get flushed only if a host side
code path considered unconfined had been traversed. "Unconfined" in this
context means that it might have pulled in sensitive data like user data
or kernel crypto keys.
The need for L1D flushes is tracked by means of the per-vcpu flag
l1tf_flush_l1d. KVM exit handlers considered unconfined set it. A
vmx_l1d_flush() subsequently invoked before the next VMENTER will conduct a
L1d flush based on its value and reset that flag again.
Currently, interrupts delivered "normally" while in root operation between
VMEXIT and VMENTER are not taken into account. Part of the reason is that
these don't leave any traces and thus, the vmx code is unable to tell if
any such has happened.
As proposed by Paolo Bonzini, prepare for tracking all interrupts by
introducing a new per-cpu flag, "kvm_cpu_l1tf_flush_l1d". It will be in
strong analogy to the per-vcpu ->l1tf_flush_l1d.
A later patch will make interrupt handlers set it.
For the sake of cache locality, group kvm_cpu_l1tf_flush_l1d into x86'
per-cpu irq_cpustat_t as suggested by Peter Zijlstra.
Provide the helpers kvm_set_cpu_l1tf_flush_l1d(),
kvm_clear_cpu_l1tf_flush_l1d() and kvm_get_cpu_l1tf_flush_l1d(). Make them
trivial resp. non-existent for !CONFIG_KVM_INTEL as appropriate.
Let vmx_l1d_flush() handle kvm_cpu_l1tf_flush_l1d in the same way as
l1tf_flush_l1d.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Nicolai Stange <nstange@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Orabug: 28220625
CVE: CVE-2018-3646
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/kvm/vmx.c
Contextual: different content caused by not having all static key features
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/kvm/vmx.c
Contextual: different content caused by not having all static key features
x86/KVM/VMX: Replace 'vmx_l1d_flush_always' with 'vmx_l1d_flush_cond'
The vmx_l1d_flush_always static key is only ever evaluated if
vmx_l1d_should_flush is enabled. In that case however, there are only two
L1d flushing modes possible: "always" and "conditional".
The "conditional" mode's implementation tends to require more sophisticated
logic than the "always" mode.
Avoid inverted logic by replacing the 'vmx_l1d_flush_always' static key
with a 'vmx_l1d_flush_cond' one.
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/kvm/vmx.c
Contextual: different content caused by not having all static key features
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/kvm/vmx.c
Contextual: different content caused by not having all static key features.
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/kvm/x86.c
Contextual: different content
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Tom Lendacky [Wed, 21 Feb 2018 19:39:51 +0000 (13:39 -0600)]
KVM: x86: Add a framework for supporting MSR-based features
Provide a new KVM capability that allows bits within MSRs to be recognized
as features. Two new ioctls are added to the /dev/kvm ioctl routine to
retrieve the list of these MSRs and then retrieve their values. A kvm_x86_ops
callback is used to determine support for the listed MSR-based features.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[Tweaked documentation. - Radim] Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Orabug: 28220674
CVE: CVE-2018-3646
If SMT is disabled in BIOS, the CPU code doesn't properly detect it.
The /sys/devices/system/cpu/smt/control file shows 'on', and the 'l1tf'
vulnerabilities file shows SMT as vulnerable.
Fix it by forcing 'cpu_smt_control' to CPU_SMT_NOT_SUPPORTED in such a
case. Unfortunately the detection can only be done after bringing all
the CPUs online, so we have to overwrite any previous writes to the
variable.
Reported-by: Joe Mario <jmario@redhat.com> Tested-by: Jiri Kosina <jkosina@suse.cz> Fixes: f048c399e0f7 ("x86/topology: Provide topology_smt_supported()") Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Orabug: 28220674
CVE: CVE-2018-3620
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
x86/KVM/VMX: Initialize the vmx_l1d_flush_pages' content
The slow path in vmx_l1d_flush() reads from vmx_l1d_flush_pages in order
to evict the L1d cache.
However, these pages are never cleared and, in theory, their data could be
leaked.
More importantly, KSM could merge a nested hypervisor's vmx_l1d_flush_pages
to fewer than 1 << L1D_CACHE_ORDER host physical pages and this would break
the L1d flushing algorithm: L1D on x86_64 is tagged by physical addresses.
Fix this by initializing the individual vmx_l1d_flush_pages with a
different pattern each.
Rename the "empty_zp" asm constraint identifier in vmx_l1d_flush() to
"flush_pages" to reflect this change.
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
pfn_modify_allowed() and arch_has_pfn_modify_check() are outside of the
!__ASSEMBLY__ section in include/asm-generic/pgtable.h, which confuses
assembler on archs that don't have __HAVE_ARCH_PFN_MODIFY_ALLOWED (e.g.
ia64) and breaks build:
include/asm-generic/pgtable.h: Assembler messages:
include/asm-generic/pgtable.h:538: Error: Unknown opcode `static inline bool pfn_modify_allowed(unsigned long pfn,pgprot_t prot)'
include/asm-generic/pgtable.h:540: Error: Unknown opcode `return true'
include/asm-generic/pgtable.h:543: Error: Unknown opcode `static inline bool arch_has_pfn_modify_check(void)'
include/asm-generic/pgtable.h:545: Error: Unknown opcode `return false'
arch/ia64/kernel/entry.S:69: Error: `mov' does not fit into bundle
Move those two static inlines into the !__ASSEMBLY__ section so that they
don't confuse the asm build pass.
Fixes: 42e4089c7890 ("x86/speculation/l1tf: Disallow non privileged high MMIO PROT_NONE mappings") Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Orabug: 28220674
CVE: CVE-2018-3620
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
include/asm-generic/pgtable.h
Contextual: different content
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
Documentation/admin-guide/index.rst
Documentation/admin-guide/l1tf.rst
Contextual: The admin-guide is introduced later by 9d85025b "docs-rst: create
an user's manual book" so l1tf is added in Documentation folder.
x86/bugs, kvm: Introduce boot-time control of L1TF mitigations
Introduce the 'l1tf=' kernel command line option to allow for boot-time
switching of mitigation that is used on processors affected by L1TF.
The possible values are:
full
Provides all available mitigations for the L1TF vulnerability. Disables
SMT and enables all mitigations in the hypervisors. SMT control via
/sys/devices/system/cpu/smt/control is still possible after boot.
Hypervisors will issue a warning when the first VM is started in
a potentially insecure configuration, i.e. SMT enabled or L1D flush
disabled.
full,force
Same as 'full', but disables SMT control. Implies the 'nosmt=force'
command line option. sysfs control of SMT and the hypervisor flush
control is disabled.
flush
Leaves SMT enabled and enables the conditional hypervisor mitigation.
Hypervisors will issue a warning when the first VM is started in a
potentially insecure configuration, i.e. SMT enabled or L1D flush
disabled.
flush,nosmt
Disables SMT and enables the conditional hypervisor mitigation. SMT
control via /sys/devices/system/cpu/smt/control is still possible
after boot. If SMT is reenabled or flushing disabled at runtime
hypervisors will issue a warning.
flush,nowarn
Same as 'flush', but hypervisors will not warn when
a VM is started in a potentially insecure configuration.
off
Disables hypervisor mitigations and doesn't emit any warnings.
Default is 'flush'.
Let KVM adhere to these semantics, which means:
- 'lt1f=full,force' : Performe L1D flushes. No runtime control
possible.
- 'l1tf=full'
- 'l1tf-flush'
- 'l1tf=flush,nosmt' : Perform L1D flushes and warn on VM start if
SMT has been runtime enabled or L1D flushing
has been run-time enabled
- 'l1tf=flush,nowarn' : Perform L1D flushes and no warnings are emitted.
- 'l1tf=off' : L1D flushes are not performed and no warnings
are emitted.
KVM can always override the L1D flushing behavior using its 'vmentry_l1d_flush'
module parameter except when lt1f=full,force is set.
This makes KVM's private 'nosmt' option redundant, and as it is a bit
non-systematic anyway (this is something to control globally, not on
hypervisor level), remove that option.
Add the missing Documentation entry for the l1tf vulnerability sysfs file
while at it.
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/include/asm/processor.h
Documentation/admin-guide/kernel-parameters.txt
arch/x86/kernel/cpu/bugs.c
arch/x86/kvm/vmx.c
Contextual: different content; modified bugs_64.c instead of bugs.c and
Documentation/kernel-parameters.txt instead of
Documentation/admin-guide/kernel-parameters.txt.
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
kernel/cpu.c
Contextual: different content
Thomas Gleixner [Fri, 13 Jul 2018 14:23:22 +0000 (16:23 +0200)]
x86/kvm: Allow runtime control of L1D flush
All mitigation modes can be switched at run time with a static key now:
- Use sysfs_streq() instead of strcmp() to handle the trailing new line
from sysfs writes correctly.
- Make the static key management handle multiple invocations properly.
- Set the module parameter file to RW
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/kernel/cpu/bugs.c
arch/x86/kvm/vmx.c
Contextual: different content; bugs_64.cwas modified instead of bugs.c
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/kvm/vmx.c
Contextual: different content
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/kvm/vmx.c
Contextual: different content; Use unlikely(static_key_enabled) instead
of static_branch_unlikely because 11276d530 "locking/static_keys: Add a
new static_key interface" is not present in this version of kernel.
Thomas Gleixner [Fri, 13 Jul 2018 14:23:19 +0000 (16:23 +0200)]
x86/kvm: Move l1tf setup function
In preparation of allowing run time control for L1D flushing, move the
setup code to the module parameter handler.
In case of pre module init parsing, just store the value and let vmx_init()
do the actual setup after running kvm_init() so that enable_ept is having
the correct state.
During run-time invoke it directly from the parameter setter to prepare for
run-time control.
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Conflicts:
arch/x86/kvm/vmx.c
Contextual: different content
Thomas Gleixner [Fri, 13 Jul 2018 14:23:18 +0000 (16:23 +0200)]
x86/l1tf: Handle EPT disabled state proper
If Extended Page Tables (EPT) are disabled or not supported, no L1D
flushing is required. The setup function can just avoid setting up the L1D
flush for the EPT=n case.
Invoke it after the hardware setup has be done and enable_ept has the
correct state and expose the EPT disabled state in the mitigation status as
well.
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Conflicts:
arch/x86/kernel/cpu/bugs_64.c
arch/x86/kvm/vmx.c
Contextual: different content; Modified arch/x86/kernel/cpu/bugs_64.c
instead of arch/x86/kernel/cpu/bugs.c.
Thomas Gleixner [Fri, 13 Jul 2018 14:23:17 +0000 (16:23 +0200)]
x86/kvm: Drop L1TF MSR list approach
The VMX module parameter to control the L1D flush should become
writeable.
The MSR list is set up at VM init per guest VCPU, but the run time
switching is based on a static key which is global. Toggling the MSR list
at run time might be feasible, but for now drop this optimization and use
the regular MSR write to make run-time switching possible.
The default mitigation is the conditional flush anyway, so for extra
paranoid setups this will add some small overhead, but the extra code
executed is in the noise compared to the flush itself.
Aside of that the EPT disabled case is not handled correctly at the moment
and the MSR list magic is in the way for fixing that as well.
If it's really providing a significant advantage, then this needs to be
revisited after the code is correct and the control is writable.
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/kvm/vmx.c
Contextual: different content
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/kvm/vmx.c
arch/x86/kernel/cpu/bugs.c
Contextual: different content; Changes were done in arch/x86/kernel/cpu/bugs_64.c
instead of arch/x86/kernel/cpu/bugs.c.
__ro_after_init was replaced with __read_mostly for l1tf_vmx_mitigation to avoid
cherry-picking c74ba8b3.
Thomas Gleixner [Sat, 7 Jul 2018 09:40:18 +0000 (11:40 +0200)]
cpu/hotplug: Online siblings when SMT control is turned on
Writing 'off' to /sys/devices/system/cpu/smt/control offlines all SMT
siblings. Writing 'on' merily enables the abilify to online them, but does
not online them automatically.
Make 'on' more useful by onlining all offline siblings.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Orabug: 28220674
CVE: CVE-2018-3620
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/kvm/vmx.c
Contextual: different content
Konrad Rzeszutek Wilk [Thu, 21 Jun 2018 02:01:22 +0000 (22:01 -0400)]
x86/KVM/VMX: Extend add_atomic_switch_msr() to allow VMENTER only MSRs
The IA32_FLUSH_CMD MSR needs only to be written on VMENTER. Extend
add_atomic_switch_msr() with an entry_only parameter to allow storing the
MSR only in the guest (ENTRY) MSR array.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Orabug: 28220674
CVE: CVE-2018-3646
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>