www.infradead.org Git - users/jedix/linux-maple.git/log

net/mlx4_core: print firmware version during driver loading

When debugging firmware related issues, it's very helpful to have
firmware version info printed in the kernel log when the driver is
loaded. It's easier to match error/warning messages with different
FW versions in the log other than running a separate tool to get
the information during remote debugging.

Orabug: 28809377

Signed-off-by: Qing Huang <qing.huang@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
CC: Gerd Rausch <gerd.rausch@oracle.com>
CC: Aru Kolappan <aru.kolappan@oracle.com>
CC: Sujatha Srinivasa Gopalan <sujatha.tolstoy@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

mm: numa: Do not trap faults on shared data section pages.

Workloads consisting of a large number of processes running the same program
with a very large shared data segment may experience performance problems
when numa balancing attempts to migrate the shared cow pages. This manifests
itself with many processes or tasks in TASK_UNINTERRUPTIBLE state waiting
for the shared pages to be migrated.

The program listed below simulates the conditions with these results when
run with 288 processes on a 144 core/8 socket machine.

Average throughput Average throughput     Average throughput
with numa_balancing=0 with numa_balancing=1  with numa_balancing=1
      without the patch      with the patch
--------------------- ---------------------  ---------------------
2118782 2021534        2107979

Complex production environments show less variability and fewer poorly
performing outliers accompanied with a smaller number of processes waiting
on NUMA page migration with this patch applied. In some cases, %iowait drops
from 16%-26% to 0.

// SPDX-License-Identifier: GPL-2.0
/*
* Copyright (c) 2017 Oracle and/or its affiliates. All rights reserved.
*/

int a[1000000] = {13};

int  main(int argc, const char **argv)
{
int n = 0;
int i;
pid_t pid;
int stat;
int *count_array;
int cpu_count = 288;
long total = 0;

struct timeval t1, t2 = {(argc > 1 ? atoi(argv[1]) : 10), 0};

if (argc > 2)
cpu_count = atoi(argv[2]);

count_array = mmap(NULL, cpu_count * sizeof(int),
   (PROT_READ|PROT_WRITE),
   (MAP_SHARED|MAP_ANONYMOUS), 0, 0);

if (count_array == MAP_FAILED) {
perror("mmap:");
return 0;
}

for (i = 0; i < cpu_count; ++i) {
pid = fork();
if (pid <= 0)
break;
if ((i & 0xf) == 0)
usleep(2);
}

if (pid != 0) {
if (i == 0) {
perror("fork:");
return 0;
}

for (;;) {
pid = wait(&stat);
if (pid < 0)
break;
}

for (i = 0; i < cpu_count; ++i)
total += count_array[i];

printf("Total %ld\n", total);
munmap(count_array, cpu_count * sizeof(int));
return 0;
}

gettimeofday(&t1, 0);
timeradd(&t1, &t2, &t1);
while (timercmp(&t2, &t1, <)) {
int b = 0;
int j;

for (j = 0; j < 1000000; j++)
b += a[j];
gettimeofday(&t2, 0);
n++;
}
count_array[i] = n;
return 0;
}

This patch changes change_pte_range() to skip shared copy-on-write pages when
called from change_prot_numa().

NOTE: change_prot_numa() is nominally called from task_numa_work() and
queue_pages_test_walk(). task_numa_work() is the auto NUMA balancing path, and
queue_pages_test_walk() is part of explicit NUMA policy management. However,
queue_pages_test_walk() only calls change_prot_numa() when MPOL_MF_LAZY is
specified and currently that is not allowed, so change_prot_numa() is only
called from auto NUMA balancing.

In the case of explicit NUMA policy management, shared pages are not migrated
unless MPOL_MF_MOVE_ALL is specified, and MPOL_MF_MOVE_ALL depends on
CAP_SYS_NICE. Currently, there is no way to pass information about
MPOL_MF_MOVE_ALL to change_pte_range. This will have to be fixed if
MPOL_MF_LAZY is enabled and MPOL_MF_MOVE_ALL is to be honored in lazy
migration mode.

task_numa_work() skips the read-only VMAs of programs and shared libraries.

NOTE2: This patch was accepted upstream should appear in 4.15 or 4.16

V2:
- Combined patch and cover letter
- Added note about applicability of MPOL_MF_MOVE_ALL

Signed-off-by: Henry Willard <henry.willard@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Orabug: 28814880
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
mm/mprotect.c (Add #include "internal.h")

Backporting commit 04489748048c1bd12d2a14ffe46d7de5dcb408c5
("mm: numa: Do not trap faults on shared data section pages.") from UEK5

Similar test results are seen on UEK4 when running the above program
with and without this patch and with and without auto numa balancing enabled

(cherry picked from commit 04489748048c1bd12d2a14ffe46d7de5dcb408c5)
Signed-off-by: Gayatri Vasudevan <gayatri.vasudevan@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

hugetlbfs: dirty pages as they are added to pagecache

Some test systems were experiencing negative huge page reserve
counts and incorrect file block counts.  This was traced to
/proc/sys/vm/drop_caches removing clean pages from hugetlbfs
file pagecaches.  When non-hugetlbfs explicit code removes the
pages, the appropriate accounting is not performed.

This can be recreated as follows:
fallocate -l 2M /dev/hugepages/foo
echo 1 > /proc/sys/vm/drop_caches
fallocate -l 2M /dev/hugepages/foo
grep -i huge /proc/meminfo
   AnonHugePages:         0 kB
   ShmemHugePages:        0 kB
   HugePages_Total:    2048
   HugePages_Free:     2047
   HugePages_Rsvd:    18446744073709551615
   HugePages_Surp:        0
   Hugepagesize:       2048 kB
   Hugetlb:         4194304 kB
ls -lsh /dev/hugepages/foo
   4.0M -rw-r--r--. 1 root root 2.0M Oct 17 20:05 /dev/hugepages/foo

To address this issue, dirty pages as they are added to pagecache.
This can easily be reproduced with fallocate as shown above. Read
faulted pages will eventually end up being marked dirty.  But there
is a window where they are clean and could be impacted by code such
as drop_caches.  So, just dirty them all as they are added to the
pagecache.

Orabug: 28813968

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Larry Bassel <larry.bassel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

rds: RDS (tcp) hangs on sendto() to unresponding address

In rds_send_mprds_hash(), if the calculated hash value is non-zero and
the MPRDS connections are not yet up, it will wait. But it should not
wait if the send is non-blocking. In this case, it should just use the
base c_path for sending the message.

Orabug: 28762608

Signed-off-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com>
Reviewed-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Reviewed-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

nfs: fix a deadlock in nfs client initialization

The following deadlock can occur between a process waiting for a client
to initialize in while walking the client list during nfsv4 server trunking
detection and another process waiting for the nfs_clid_init_mutex so it
can initialize that client:

Process 1                               Process 2
---------                               ---------
spin_lock(&nn->nfs_client_lock);
list_add_tail(&CLIENTA->cl_share_link,
        &nn->nfs_client_list);
spin_unlock(&nn->nfs_client_lock);
                                        spin_lock(&nn->nfs_client_lock);
                                        list_add_tail(&CLIENTB->cl_share_link,
                                                &nn->nfs_client_list);
                                        spin_unlock(&nn->nfs_client_lock);
                                        mutex_lock(&nfs_clid_init_mutex);
                                        nfs41_walk_client_list(clp, result, cred);
                                        nfs_wait_client_init_complete(CLIENTA);
(waiting for nfs_clid_init_mutex)

Make sure nfs_match_client() only evaluates clients that have completed
initialization in order to prevent that deadlock.

This patch also fixes v4.0 trunking behavior by not marking the client
NFS_CS_READY until the clientid has been confirmed.

Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Orabug: 28486463

(cherry picked from commit c156618e15101a9cc8c815108fec0300a0ec6637)
Signed-off-by: Calum Mackay <calum.mackay@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

infiniband: fix a possible use-after-free bug

ucma_process_join() will free the new allocated "mc" struct,
if there is any error after that, especially the copy_to_user().

But in parallel, ucma_leave_multicast() could find this "mc"
through idr_find() before ucma_process_join() frees it, since it
is already published.

So "mc" could be used in ucma_leave_multicast() after it is been
allocated and freed in ucma_process_join(), since we don't refcnt
it.

Fix this by separating "publish" from ID allocation, so that we
can get an ID first and publish it later after copy_to_user().

Fixes: c8f6a362bf3e ("RDMA/cma: Add multicast communication support")
Reported-by: Noam Rathaus <noamr@beyondsecurity.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
(cherry picked from commit cb2595c1393b4a5211534e6f0a0fbad369e21ad8)

Orabug: 28774517
CVE: CVE-2018-14734

Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/bugs/IBRS: properly use the _base ops if IBRS is currently unset

... or else we can run into an idle state with IBRS enabled which lead to
low performance.

Orabug: 28782729

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

hugetlbfs: check for reserve page count going negative

When modifying hugetlb reserve page count, check for the value going
negative. This should not happen in practice. However, if it does
log a message and stack trace to aid in debug.

Orabug: 28734496

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Larry Bassel <larry.bassel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

hugetlbfs: introduce truncation/fault mutex to avoid races

The following hugetlbfs truncate/page fault race can be recreated
with programs doing something like the following.

A huegtlbfs file is mmap(MAP_SHARED) with a size of 4 pages.  At
mmap time, 4 huge pages are reserved for the file/mapping.  So,
the global reserve count is 4.  In addition, since this is a shared
mapping an entry for 4 pages is added to the file's reserve map.
The first 3 of the 4 pages are faulted into the file.  As a result,
the global reserve count is now 1.

Task A starts to fault in the last page (routines hugetlb_fault,
hugetlb_no_page).  It allocates a huge page (alloc_huge_page).
The reserve map indicates there is a reserved page, so this is
used and the global reserve count goes to 0.

Now, task B truncates the file to size 0.  It starts by setting
inode size to 0(hugetlb_vmtruncate).  It then unmaps all mapping
of the file (hugetlb_vmdelete_list).  Since task A's page table
lock is not held at the time, truncation is not blocked.  Truncation
removes the 3 pages from the file (remove_inode_hugepages).  When
cleaning up the reserved pages (hugetlb_unreserve_pages), it notices
the reserve map was for 4 pages.  However, it has only freed 3 pages.
So it assumes there is still (4 - 3) 1 reserved pages.  It then
decrements the global reserve count by 1 and it goes negative.

Task A then continues the page fault process and adds it's newly
acquired page to the page cache.  Note that the index of this page
is beyond the size of the truncated file (0).  The page fault process
then notices the file has been truncated and exits.  However, the
page is left in the cache associated with the file.

Now, if the file is immediately deleted the truncate code runs again.
It will find and free the one page associated with the file.  When
cleaning up reserves, it notices the reserve map is empty.  Yet, one
page freed.  So, the global reserve count is decremented by (0 - 1) -1.
This returns the global count to 0 as it should be.  But, it is
possible for someone else to mmap this file/range before it is deleted.
If this happens, a reserve map entry for the allocated page is created
and the reserved page is forever leaked.

To avoid all these conditions, let's simply prevent faults to a file
while it is being truncated.  Add a new truncation specific rw mutex
to hugetlbfs inode extensions.  faults take the mutex in read mode,
truncation takes in write mode.

Orabug: 28734496

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Larry Bassel <larry.bassel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

mm/hugetlb.c: don't call region_abort if region_chg fails

Changes to hugetlbfs reservation maps is a two step process.  The first
step is a call to region_chg to determine what needs to be changed, and
prepare that change.  This should be followed by a call to call to
region_add to commit the change, or region_abort to abort the change.

The error path in hugetlb_reserve_pages called region_abort after a
failed call to region_chg.  As a result, the adds_in_progress counter in
the reservation map is off by 1.  This is caught by a VM_BUG_ON in
resv_map_release when the reservation map is freed.

syzkaller fuzzer (when using an injected kmalloc failure) found this
bug, that resulted in the following:

kernel BUG at mm/hugetlb.c:742!
Call Trace:
  hugetlbfs_evict_inode+0x7b/0xa0 fs/hugetlbfs/inode.c:493
  evict+0x481/0x920 fs/inode.c:553
  iput_final fs/inode.c:1515 [inline]
  iput+0x62b/0xa20 fs/inode.c:1542
  hugetlb_file_setup+0x593/0x9f0 fs/hugetlbfs/inode.c:1306
  newseg+0x422/0xd30 ipc/shm.c:575
  ipcget_new ipc/util.c:285 [inline]
  ipcget+0x21e/0x580 ipc/util.c:639
  SYSC_shmget ipc/shm.c:673 [inline]
  SyS_shmget+0x158/0x230 ipc/shm.c:657
  entry_SYSCALL_64_fastpath+0x1f/0xc2
RIP: resv_map_release+0x265/0x330 mm/hugetlb.c:742

Link: http://lkml.kernel.org/r/1490821682-23228-1-git-send-email-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Orabug: 28734496

(cherry picked from commit ff8c0c53c47530ffea82c22a0a6df6332b56c957)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Larry Bassel <larry.bassel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

mm: hugetlb: fix hugepage memory leak caused by wrong reserve count

When dequeue_huge_page_vma() in alloc_huge_page() fails, we fall back on
alloc_buddy_huge_page() to directly create a hugepage from the buddy
allocator.

In that case, however, if alloc_buddy_huge_page() succeeds we don't
decrement h->resv_huge_pages, which means that successful
hugetlb_fault() returns without releasing the reserve count.  As a
result, subsequent hugetlb_fault() might fail despite that there are
still free hugepages.

This patch simply adds decrementing code on that code path.

I reproduced this problem when testing v4.3 kernel in the following situation:
- the test machine/VM is a NUMA system,
- hugepage overcommiting is enabled,
- most of hugepages are allocated and there's only one free hugepage
   which is on node 0 (for example),
- another program, which calls set_mempolicy(MPOL_BIND) to bind itself to
   node 1, tries to allocate a hugepage,
- the allocation should fail but the reserve count is still hold.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: <stable@vger.kernel.org> [3.16+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Orabug: 28734496

(cherry picked from commit a88c769548047b21f76fd71e04b6a3300ff17160)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Larry Bassel <larry.bassel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

bnxt_en: xdp: don't make drivers report attachment mode (partial backport)

Orabug: 27988326

Pull the upstream commit 6b8675897338 to update just the bnxt_en driver.

Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

bpf: make bnxt compatible w/ bpf_xdp_adjust_tail

w/ bpf_xdp_adjust_tail helper xdp's data_end pointer could be changed as
well (only "decrease" of pointer's location is going to be supported).
changing of this pointer will change packet's size.
for bnxt driver we will just calculate packet's length unconditionally

Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Nikita V. Shirokov <tehnerd@tehnerd.com>
Acked-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
(cherry picked from commit b968e735c79767a3c91217fbae691581aa557d8d)
Orabug: 27988326

Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

bnxt_en: add meta pointer for direct access (partial backport)

Orabug: 27988326

Pull the upstream commit de8f3a83b0a0 to update only the bnxt_en driver.

Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

bnxt_en: Fix bug in ethtool -L.

When changing channels from combined to rx/tx or vice versa, the code
uses the wrong "sh" parameter to determine if we are reserving rings
for shared or non-shared mode. It should be using the ethtool requested
"sh" parameter instead of the current "sh" parameter.

Fix it by passing the "sh" parameter to bnxt_reserve_rings(). For
ethtool, we will pass in the requested "sh" parameter.

Fixes: 391be5c27364 ("bnxt_en: Implement new scheme to reserve tx rings.")
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
[ Upstream commit 3b6b34df342553a7522561e34288f5bb803aa9aa ]
Orabug: 27988326

Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

bpf: bnxt: Report bpf_prog ID during XDP_QUERY_PROG

Add support to bnxt to report bpf_prog ID during XDP_QUERY_PROG.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Michael Chan <michael.chan@broadcom.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 8902965f8cb23bba8aa7f3be293ec2f3067b82c6)
Orabug: 27988326

Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

bnxt_en: Optimize doorbell write operations for newer chips (reapply).

Older chips require the doorbells to be written twice, but newer chips
do not. Add a new common function bnxt_db_write() to write all
doorbells appropriately depending on the chip. Eliminating the extra
doorbell on newer chips has a significant performance improvement
on pktgen.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 434c975a8fe2f70b70ac09ea5ddd008e0528adfa)
Orabug: 27988326

Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/net/ethernet/broadcom/bnxt/bnxt.c

Signed-off-by: Brian Maly <brian.maly@oracle.com>

bnxt_en: Use short TX BDs for the XDP TX ring.

No offload is performed on the XDP_TX ring so we can use the short TX
BDs. This has the effect of doubling the size of the XDP TX ring so
that it now matches the size of the rx ring by default.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 932dbf83ba18bdb871e0c03a4ffdd9785f7a9c07)
Orabug: 27988326

Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

bnxt_en: Add ethtool mac loopback self test (reapply).

The mac loopback self test operates in polling mode.  To support that,
we need to add functions to open and close the NIC half way.  The half
open mode allows the rings to operate without IRQ and NAPI.  We
use the XDP transmit function to send the loopback packet.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
[ Upstream commit f7dc1ea6c4c1f31371b7098d6fae0d49dc6cdff1 ]
Orabug: 27988326

Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

bnxt_en: Add support for XDP_TX action.

Add dedicated transmit function and transmit completion handler for
XDP.  The XDP transmit logic and completion logic are different than
regular TX ring.  The TX buffer is recycled back to the RX ring when
it completes.

v3: Improved the buffer recyling scheme for XDP_TX.

v2: Add trace_xdp_exception().
    Add dma_sync.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Tested-by: Andy Gospodarek <gospo@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
[Upstream commit 38413406277fd060f46855ad527f6f8d4cf2652d]
Orabug: 27988326

Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

bnxt_en: Add basic XDP support.

Add basic ndo_xdp support to setup and query program, configure the NIC
to run in rx page mode, and support XDP_PASS, XDP_DROP, XDP_ABORTED
actions only.

v3: Pass modified offset and length to stack for XDP_PASS.
    Remove Kconfig option.

v2: Added trace_xdp_exception()
    Added dma_syncs.
    Added XDP headroom support.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Tested-by: Andy Gospodarek <gospo@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
[Upstream commit c6d30e8391b85e00eb544e6cf047ee0160ee9938]
Orabug: 27988326

Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/ia32: Restore r8 correctly in 32bit SYSCALL instruction entry.

This commit fixes a bug in a previous commit
8e69671028ac ("x86/ia32: Adds code hygiene for 32bit SYSCALL instruction
entry.")

SAVE_EXTRA_REGS does not save the r8 register. r8 is rather saved in
pt_regs->sp before it is cleared. So, retrieve r8 from pt_regs->sp.

Orabug: 28529706

Signed-off-by: Gayatri Vasudevan <gayatri.vasudevan@oracle.com>
Reviewed-by: Bert Barbe <bert.barbe@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

net: enable RPS on vlan devices

This patch modifies the RPS processing code so that it searches
for a matching vlan interface on the packet and then uses the
RPS settings of the vlan interface. If no vlan interface
is found or the vlan interface does not have RPS enabled,
it will fall back to the RPS settings of the underlying device.

In supporting VMs where we can't control the OS being used,
we'd like to separate the VM cpu processing from the host's
cpus as a way to help mitigate the impact of the L1TF issue.
When running the VM's traffic on a vlan we can stick the Rx
processing on one set of CPUs separate from the VM's CPUs.
Yes, choosing to use this will cause a bit of throughput pain
when the packets are actually passed into the VM and have to
move from one cache to another.

Orabug: 28645929

Signed-off-by: Silviu Smarandache <silviu.smarandache@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

xen-blkback: hold write vbd-lock while swapping the vbd

All paths holding a vbd handle or dereferencing it hold the read vbd-lock.
Swapping the device takes a write-trylock on the vbd. So, we fail a swap
if, for instance, there is an active IO operation.

Orabug: 28651655

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Bhavesh Davda <bhavesh.davda@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

xen-blkback: implement swapping of active vbd

Currently we disallow any change of major:minor of the vbd once created.
The danger is in the user switching backends where the contents of the
backend device are dissimilar.
However, changing the vbd can be quite useful -- for instance by switching
from a backend which is not multi-pathed (or raid'd) to one that is.

As for blkback itself, it is used purely as a passthrough device so there
is no state outside struct vbd which would be impacted with this change.

This patch allows a vbd swap, communicating the state of the swap via the
xenbus paths:

.../vbd/<domain>/51712/oracle/active-physical-device = "<major>:<minor>"
.../vbd/<domain>/51712/oracle/physical-device-change-status = "<error>"

Orabug: 28651655

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Bhavesh Davda <bhavesh.davda@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

xen-blkback: emit active physical device to xenstore

The vbd physical-device is set by user-space writing to device's xenbus
path:
/local/domain/0/backend/vbd/1/51712/physical-device = "<major>:<minor>"

However, there is no way for the blkback driver to specify the current
state of the physical-device. Accordingly we add two new backend paths:

.../vbd/<domain>/51712/oracle/active-physical-device = "<major>:<minor>"
.../vbd/<domain>/51712/oracle/physical-device-change-status = "<error>"

which specify the active physical device and any error incurred in
trying to instantiate a new physical device.

Orabug: 28651655

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Bhavesh Davda <bhavesh.davda@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

xen-blkback: refactor backend_changed()

Add xenbus_scan_be_params() to handle vbd param parsing. This gets
called from backend_changed().

Also place all the parameters (major, minor, cdrom, readonly) in struct
blkif_params. The readonly param is an integer replacement of mode --
this allows us to avoid tracking an additional pointer.

Orabug: 28651655

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Bhavesh Davda <bhavesh.davda@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

xen-blkback: pull out blkif grant features from vbd

grant related features don't have anything to do with the vbd.
Move them to the blkif structure.

Orabug: 28651655

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Bhavesh Davda <bhavesh.davda@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

mm: get rid of vmacache_flush_all() entirely

Jann Horn points out that the vmacache_flush_all() function is not only
potentially expensive, it's buggy too.  It also happens to be entirely
unnecessary, because the sequence number overflow case can be avoided by
simply making the sequence number be 64-bit.  That doesn't even grow the
data structures in question, because the other adjacent fields are
already 64-bit.

So simplify the whole thing by just making the sequence number overflow
case go away entirely, which gets rid of all the complications and makes
the code faster too.  Win-win.

[ Oleg Nesterov points out that the VMACACHE_FULL_FLUSHES statistics
  also just goes away entirely with this ]

Reported-by: Jann Horn <jannh@google.com>
Suggested-by: Will Deacon <will.deacon@arm.com>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 7a9cdebdcc17e426fb5287e4a82db1dfe86339b2)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
include/linux/mm_types.h
include/linux/mm_types_task.h
mm/debug.c

Orabug: 28701016
CVE: CVE-2018-17182

Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

rds: crash at rds_ib_inc_copy_to_user+104 due to NULL ptr reference

The customer hit this crash few times.

PID: 31556  TASK: ffff880f823caa00  CPU: 1   COMMAND: "cellsrv"
#0 [ffff880f823db850] machine_kexec at ffffffff8105d93c
#1 [ffff880f823db8b0] crash_kexec at ffffffff811103b3
#2 [ffff880f823db980] oops_end at ffffffff8101a788
#3 [ffff880f823db9b0] no_context at ffffffff8106b9cf
#4 [ffff880f823dba20] __bad_area_nosemaphore at ffffffff8106bc9d
#5 [ffff880f823dba70] bad_area at ffffffff8106be97
#6 [ffff880f823dbaa0] __do_page_fault at ffffffff8106c71e
#7 [ffff880f823dbb00] do_page_fault at ffffffff8106c81f
#8 [ffff880f823dbb40] page_fault at ffffffff816b5a9f
    [exception RIP: rds_ib_inc_copy_to_user+104]
    RIP: ffffffffa04607b8  RSP: ffff880f823dbbf8  RFLAGS: 00010287
    RAX: 0000000000000340  RBX: 0000000000001000  RCX: 0000000000004000
    RDX: 0000000000001000  RSI: ffff88176cea2000  RDI: ffff8817d291f520
    RBP: ffff880f823dbc48   R8: 0000000000001340   R9: 0000000000001000
    R10: 0000000000001200  R11: ffff880f823dc000  R12: ffff880f823dbed0
    R13: 0000000000000000  R14: 0000000000000000  R15: 0000000000001000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#9 [ffff880f823dbc50] rds_recvmsg at ffffffffa041d837 [rds]

int rds_ib_inc_copy_to_user(struct rds_incoming *inc, struct iov_iter *to)
...
...
        ibinc = container_of(inc, struct rds_ib_incoming, ii_inc);
        frag = list_entry(ibinc->ii_frags.next, struct rds_page_frag, f_item);
        len = be32_to_cpu(inc->i_hdr.h_len);
        sg = frag->f_sg;

        while (iov_iter_count(to) && copied < len) {
                to_copy = min_t(unsigned long, iov_iter_count(to),
                                sg->length - frag_off);
                ...

sg is NULL and it crashes accessing sg->length above.

The cause looks like is due to ic->i_frag_sz returning incorrect value.
16KB when 4KB was expected.

                if (copied % ic->i_frag_sz == 0) {
                        frag = list_entry(frag->f_item.next,
                                          struct rds_page_frag, f_item);
                        frag_off = 0;
                        sg = frag->f_sg;
                }

The other end is using 4KB RDS fragsize (Solaris Super Cluster).
This end is UEK4 (4.1.12-94.8.4.el6uek.x86_64).

The message being copied arrived over 4KB RDS frag size connection.
But during the above check ic->i_frag_sz is 16KB.
This can happen during a reconnect at the connection setup phase.
We start off with ic->i_frag_sz as 16KB. Then settle down at 4KB.

Failing this check
  if (copied % ic->i_frag_sz == 0) {
can result in sg not getting set correctly.

Say, "copied" = 4KB but ic->i_frag_sz is 16KB when it should be 4KB.

During race condition with a reconnect, ic->i_frag_sz can be 16KB
even though once the connection is set up it settled down to 4KB.
It can change from 4KB to 16KB and back to 4KB during connection setup
due to reconnect.

We started seeing this crash after bug 26848749.
But prior to that the same scenario could result in data copied to user
from incorrect "sg" resulting in data corruption.

Orabug: 28506569

Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

IB/core: For multicast functions, verify that LIDs are multicast LIDs

The Infiniband spec defines "A multicast address is defined by a
MGID and a MLID" (section 10.5).  Currently the MLID value is not
validated.

Add check to verify that the MLID value is in the correct address
range.

Fixes: 0c33aeedb2cf ("[IB] Add checks to multicast attach and detach")
Cc: stable@vger.kernel.org
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dasaratharaman Chandramouli <dasaratharaman.chandramouli@intel.com>
Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
(cherry picked from commit 8561eae60ff9417a50fa1fb2b83ae950dc5c1e21)

The reason for applying this patch is a bug in ibacm, which goes
undetected without this patch. For further information, see
https://patchwork.kernel.org/patch/10008461/

Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:

   * Had to add the define IB_MULTICAST_LID_BASE in
     include/rdma/ib_verbs.h, due to missing commit b4e64397dabc
     ("IB/rdmavt: Break rdma_vt main include header file up").

Orabug: 28700490

Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
---
v2 -> v3:
   * Added Shannon's r-b

v1 -> v2:
   * Adding IB_MULTICAST_LID_BASE to ib_verbs.h
   * A little more verbose commit message

Signed-off-by: Brian Maly <brian.maly@oracle.com>

sunrpc: increase UNX_MAXNODENAME from 32 to __NEW_UTS_LEN bytes

The current limit of 32 bytes artificially limits the name string that
we end up stuffing into NFSv4.x client ID blobs. If you have multiple
hosts with long hostnames that only differ near the end, then this can
cause NFSv4 client ID collisions.

Linux nodenames are actually limited to __NEW_UTS_LEN bytes (64), so use
that as the limit instead. Also, use XDR_QUADLEN to specify the slack
length, just for clarity and in case someone in the future changes this
to something not evenly divisible by 4.

Reported-by: Michael Skralivetsky <michael.skralivetsky@primarydata.com>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Orabug: 28660177
(cherry picked from commit 24a9a9610ce3ba36fd87c1d2f2c9106de6b7e832)
Signed-off-by: Calum Mackay <calum.mackay@oracle.com>
Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Reviewed-by: John Sobecki <john.sobecki@oracle.com>
Tested-by: Joe Jin <joe.jin@oracle.com>

net: rds: Use address family to designate IPv4 or IPv6 addresses

The condition to interpret the supplied address in
rds_cancel_sent_to() was the length of the supplied user-data. We need
to tighten the API here to the extent possible, subject to SKGXP
passing in sizeof(struct sockaddr_storage) as the optlen, independent
of the actual address size.

We will 1) make sure the data passed in to the kernel is "big enough",
2) that the address is either AF_INET or AF_INET6, and 3) that the
bound address is ipv6_addr_v4mapped() in the AF_INET case and not in
the AF_INET6 case.

Orabug: 28720071

Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>

net: rds: Fix blank at eol in af_rds.c

Orabug: 28720071

Signed-off-by: Håkon Bugge <Haakon.Bugge@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>

exec: Limit arg stack to at most 75% of _STK_LIM

Orabug: 28709994
CVE: CVE-2018-14634

To avoid pathological stack usage or the need to special-case setuid
execs, just limit all arg stack usage to at most 75% of _STK_LIM (6MB).

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit da029c11e6b12f321f36dac8771e833b65cec962)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

nsfs: mark dentry with DCACHE_RCUACCESS

Andrey reported a use-after-free in __ns_get_path():

  spin_lock include/linux/spinlock.h:299 [inline]
  lockref_get_not_dead+0x19/0x80 lib/lockref.c:179
  __ns_get_path+0x197/0x860 fs/nsfs.c:66
  open_related_ns+0xda/0x200 fs/nsfs.c:143
  sock_ioctl+0x39d/0x440 net/socket.c:1001
  vfs_ioctl fs/ioctl.c:45 [inline]
  do_vfs_ioctl+0x1bf/0x1780 fs/ioctl.c:685
  SYSC_ioctl fs/ioctl.c:700 [inline]
  SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691

We are under rcu read lock protection at that point:

        rcu_read_lock();
        d = atomic_long_read(&ns->stashed);
        if (!d)
                goto slow;
        dentry = (struct dentry *)d;
        if (!lockref_get_not_dead(&dentry->d_lockref))
                goto slow;
        rcu_read_unlock();

but don't use a proper RCU API on the free path, therefore a parallel
__d_free() could free it at the same time.  We need to mark the stashed
dentry with DCACHE_RCUACCESS so that __d_free() will be called after all
readers leave RCU.

Fixes: e149ed2b805f ("take the targets of /proc/*/ns/* symlinks to separate fs")
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 073c516ff73557a8f7315066856c04b50383ac34)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
fs/nsfs.c

Orabug: 28576290
CVE: CVE-2018-5873

Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

dm crypt: add middle-endian variant of plain64 IV

The big-endian IV (plain64be) backend implementation has bugs and
ended up with a weird middle endian (left shift by 32).

Adding a new module for this.

The middle endian is this weirdness where the value is shifted
to the middle (and also wraps):

iv: 00000000000000010000000000000000
instead of:
iv: 00000000000000000000000000000001

Orabug: 28604628
Reviewed-by: Mark Kanda <mark.kanda@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

IB/ipoib: Improve filtering log message

The values of subnet_prefix and sgid are missing in log message when a
non-CM packet is filtered since they are both initialized only for ARP
packets.

Fix this by setting them anyway.

Orabug: 28655409

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

IB/ipoib: Fix wrong update of arp_blocked counter

Counter is updated after goto instruction - swap the two lines.

Orabug: 28655409

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

IB/ipoib: Update RX counters after ACL filtering

There is no reason to update RX counters if packet will be dropped in
IB-ACL filtering.

Orabug: 28655409

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

IB/ipoib: Filter RX packets before adding pseudo header

IB-ACL filtering is based on SGID (carried in the GRH).
The function skb_add_pseudo_hdr clears the SGID from the header.

Doing filtering after this call has no effect since all packets will drop
for not having SGID in.

Fix this by moving the call to skb_add_pseudo_hdr to be after IB-ACL
filtering while still adjusting skb->data as needed for the filtering.

Orabug: 28655409

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

cdrom: Fix info leak/OOB read in cdrom_ioctl_drive_status

Like d88b6d04: "cdrom: information leak in cdrom_ioctl_media_changed()"

There is another cast from unsigned long to int which causes
a bounds check to fail with specially crafted input. The value is
then used as an index in the slot array in cdrom_slot_status().

Signed-off-by: Scott Bauer <scott.bauer@intel.com>
Signed-off-by: Scott Bauer <sbauer@plzdonthack.me>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
(cherry picked from commit 8f3fafc9c2f0ece10832c25f7ffcb07c97a32ad4)

Orabug: 28664501
CVE: CVE-2018-16658

Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

ACPICA: acpi: acpica: fix acpi operand cache leak in nseval.c

I found an ACPI cache leak in ACPI early termination and boot continuing case.

When early termination occurs due to malicious ACPI table, Linux kernel
terminates ACPI function and continues to boot process. While kernel terminates
ACPI function, kmem_cache_destroy() reports Acpi-Operand cache leak.

Boot log of ACPI operand cache leak is as follows:
>[    0.464168] ACPI: Added _OSI(Module Device)
>[    0.467022] ACPI: Added _OSI(Processor Device)
>[    0.469376] ACPI: Added _OSI(3.0 _SCP Extensions)
>[    0.471647] ACPI: Added _OSI(Processor Aggregator Device)
>[    0.477997] ACPI Error: Null stack entry at ffff880215c0aad8 (20170303/exresop-174)
>[    0.482706] ACPI Exception: AE_AML_INTERNAL, While resolving operands for [opcode_name unavailable] (20170303/dswexec-461)
>[    0.487503] ACPI Error: Method parse/execution failed [\DBG] (Node ffff88021710ab40), AE_AML_INTERNAL (20170303/psparse-543)
>[    0.492136] ACPI Error: Method parse/execution failed [\_SB._INI] (Node ffff88021710a618), AE_AML_INTERNAL (20170303/psparse-543)
>[    0.497683] ACPI: Interpreter enabled
>[    0.499385] ACPI: (supports S0)
>[    0.501151] ACPI: Using IOAPIC for interrupt routing
>[    0.503342] ACPI Error: Null stack entry at ffff880215c0aad8 (20170303/exresop-174)
>[    0.506522] ACPI Exception: AE_AML_INTERNAL, While resolving operands for [opcode_name unavailable] (20170303/dswexec-461)
>[    0.510463] ACPI Error: Method parse/execution failed [\DBG] (Node ffff88021710ab40), AE_AML_INTERNAL (20170303/psparse-543)
>[    0.514477] ACPI Error: Method parse/execution failed [\_PIC] (Node ffff88021710ab18), AE_AML_INTERNAL (20170303/psparse-543)
>[    0.518867] ACPI Exception: AE_AML_INTERNAL, Evaluating _PIC (20170303/bus-991)
>[    0.522384] kmem_cache_destroy Acpi-Operand: Slab cache still has objects
>[    0.524597] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 4.12.0-rc5 #26
>[    0.526795] Hardware name: innotek gmb_h virtual_box/virtual_box, BIOS virtual_box 12/01/2006
>[    0.529668] Call Trace:
>[    0.530811]  ? dump_stack+0x5c/0x81
>[    0.532240]  ? kmem_cache_destroy+0x1aa/0x1c0
>[    0.533905]  ? acpi_os_delete_cache+0xa/0x10
>[    0.535497]  ? acpi_ut_delete_caches+0x3f/0x7b
>[    0.537237]  ? acpi_terminate+0xa/0x14
>[    0.538701]  ? acpi_init+0x2af/0x34f
>[    0.540008]  ? acpi_sleep_proc_init+0x27/0x27
>[    0.541593]  ? do_one_initcall+0x4e/0x1a0
>[    0.543008]  ? kernel_init_freeable+0x19e/0x21f
>[    0.546202]  ? rest_init+0x80/0x80
>[    0.547513]  ? kernel_init+0xa/0x100
>[    0.548817]  ? ret_from_fork+0x25/0x30
>[    0.550587] vgaarb: loaded
>[    0.551716] EDAC MC: Ver: 3.0.0
>[    0.553744] PCI: Probing PCI hardware
>[    0.555038] PCI host bridge to bus 0000:00
> ... Continue to boot and log is omitted ...

I analyzed this memory leak in detail and found acpi_ns_evaluate() function
only removes Info->return_object in AE_CTRL_RETURN_VALUE case. But, when errors
occur, the status value is not AE_CTRL_RETURN_VALUE, and Info->return_object is
also not null. Therefore, this causes acpi operand memory leak.

This cache leak causes a security threat because an old kernel (<= 4.9) shows
memory locations of kernel functions in stack dump. Some malicious users
could use this information to neutralize kernel ASLR.

I made a patch to fix ACPI operand cache leak.

Signed-off-by: Seunghun Han <kkamagui@gmail.com>
Signed-off-by: Erik Schmauss <erik.schmauss@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit 97f3c0a4b0579b646b6b10ae5a3d59f0441cc12c)

Orabug: 28664577
CVE: CVE-2017-13695

Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

uek-rpm: Disable deprecated CONFIG_ACPI_PROCFS_POWER

Disable the CONFIG_ACPI_PROCFS_POWER option which allows
deprecated power /proc/acpi/ directories.

Orabug: 28680213

Signed-off-by: Victor Erminpour <victor.erminpour@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: John Haxby <john.haxby@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

Revert "x86/spec_ctrl: Only set SPEC_CTRL_IBRS_FIRMWARE if IBRS is actually in use"

Orabug: 28610707

This reverts commit 6a9757c562b36e32798b3e69a22295cd55ef8a69.

btrfs: fix check_shared for fiemap ioctl

Only in the case of different root_id or different object_id, check_shared
identified extent as the shared. However, If a extent was referred by
different offset of same file, it should also be identified as shared.
In addition, check_shared's loop scale is at least n^3, so if a extent
has too many references, even causes soft hang up.

First, add all delayed_ref to the ref_tree and calculate the unqiue_refs,
if the unique_refs is greater than one, return BACKREF_FOUND_SHARED.
Then individually add the on-disk reference(inline/keyed) to the ref_tree
and calculate the unique_refs of the ref_tree to check if the unique_refs
is greater than one.Because once there are two references to return
SHARED, so the time complexity is close to the constant.

Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
(cherry picked from commit afce772e87c36c7f07f230a76d525025aaf09e41)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
fs/btrfs/backref.c - Modified comment so that patch applies.

Signed-off-by: Divya Indi <divya.indi@oracle.com>
Orabug: 24716710
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/pti: Don't report XenPV as vulnerable

Xen PV domain kernel is not by design affected by meltdown as it's
enforcing split CR3 itself. Let's not report such systems as "Vulnerable"
in sysfs (we're also already forcing PTI to off in X86_HYPER_XEN_PV cases);
the security of the system ultimately depends on presence of mitigation in
the Hypervisor, which can't be easily detected from DomU; let's report
that.

Reported-and-tested-by: Mike Latimer <mlatimer@suse.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Juergen Gross <jgross@suse.com>
Cc: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/nycvar.YFH.7.76.1806180959080.6203@cbobk.fhfr.pm
[ Merge the user-visible string into a single line. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 6cb2b08ff92460290979de4be91363e5d1b6cec1)

Conflicts:
arch/x86/kernel/cpu/bugs.c

In UEK4, these changes are made in arch/x86/kernel/cpu/bugs_64.c.

Context around the headers was slightly different (there were some extra
headers relative to the cherry-picked patch).

There is noX86_HYPER_XEN_PV, instead compare x86_hyper to x86_hyper_xen.

Orabug: 28476681

Signed-off-by: Patrick Colp <patrick.colp@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

xfs: give all workqueues rescuer threads

Orabug: 28518694

We're consistently hitting deadlocks here with XFS on recent kernels.
After some digging through the crash files, it looks like everyone in
the system is waiting for XFS to reclaim memory.

Something like this:

PID: 2733434 TASK: ffff8808cd242800 CPU: 19 COMMAND: "java"
#0 [ffff880019c53588] __schedule at ffffffff818c4df2
#1 [ffff880019c535d8] schedule at ffffffff818c5517
#2 [ffff880019c535f8] _xfs_log_force_lsn at ffffffff81316348
#3 [ffff880019c53688] xfs_log_force_lsn at ffffffff813164fb
#4 [ffff880019c536b8] xfs_iunpin_wait at ffffffff8130835e
#5 [ffff880019c53728] xfs_reclaim_inode at ffffffff812fd453
#6 [ffff880019c53778] xfs_reclaim_inodes_ag at ffffffff812fd8c7
#7 [ffff880019c53928] xfs_reclaim_inodes_nr at ffffffff812fe433
#8 [ffff880019c53958] xfs_fs_free_cached_objects at ffffffff8130d3b9
#9 [ffff880019c53968] super_cache_scan at ffffffff811a6f73

xfs_log_force_lsn is waiting for logs to get cleaned, which is waiting
for IO, which is waiting for workers to complete the IO which is waiting
for worker threads that don't exist yet:

PID: 2752451 TASK: ffff880bd6bdda00 CPU: 37 COMMAND: "kworker/37:1"
#0 [ffff8808d20abbb0] __schedule at ffffffff818c4df2
#1 [ffff8808d20abc00] schedule at ffffffff818c5517
#2 [ffff8808d20abc20] schedule_timeout at ffffffff818c7c6c
#3 [ffff8808d20abcc0] wait_for_completion_killable at ffffffff818c6495
#4 [ffff8808d20abd30] kthread_create_on_node at ffffffff8106ec82
#5 [ffff8808d20abdf0] create_worker at ffffffff8106752f
#6 [ffff8808d20abe40] worker_thread at ffffffff810699be
#7 [ffff8808d20abec0] kthread at ffffffff8106ef59
#8 [ffff8808d20abf50] ret_from_fork at ffffffff818c8ac8

I think we should be using WQ_MEM_RECLAIM to make sure this thread
pool makes progress when we're not able to allocate new workers.

[dchinner: make all workqueues WQ_MEM_RECLAIM]

Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
(cherry picked from commit 7a29ac474a47eb8cf212b45917683ae89d6fa13b)

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Shan Hai <shan.hai@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/spec_ctrl: Only set SPEC_CTRL_IBRS_FIRMWARE if IBRS is actually in use

Currently the SPEC_CTRL_IBRS_FIRMWARE flag always gets set as long as
IBRS is supported by the hardware. However, as best as can be determined
by the documention, if IBRS has been disabled (e.g., spectre_v2=off) then
SPEC_CTRL_IBRS_FIRMWARE should not be set:

nospectre_v2 [X86] Disable all mitigations for the Spectre variant 2
(indirect branch prediction) vulnerability. System may
allow data leaks with this option, which is equivalent
to spectre_v2=off.

and:

spectre_v2= [X86] Control mitigation of Spectre variant 2
(indirect branch speculation) vulnerability.

off - unconditionally disable

Add a check in set_ibrs_firmware() to only set SPEC_CTRL_IBRS_FIRMWARE if
ibrs_disabled is not also set.

Orabug: 28274907

Signed-off-by: Patrick Colp <patrick.colp@oracle.com>
Reviewed-by: Kanth Ghatraju <kanth.ghatraju@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

tcp: add tcp_ooo_try_coalesce() helper

commit 58152ecbbcc6a0ce7fddd5bf5f6ee535834ece0c upstream.

In case skb in out_or_order_queue is the result of
multiple skbs coalescing, we would like to get a proper gso_segs
counter tracking, so that future tcp_drop() can report an accurate
number.

I chose to not implement this tracking for skbs in receive queue,
since they are not dropped, unless socket is disconnected.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(adapted from v4.9.x commit 36ee106e844187e3fc612c9b87f12e5e23e9d8a5)

Orabug: 28639707
CVE: CVE-2018-5390

Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

tcp: call tcp_drop() from tcp_data_queue_ofo()

[ Upstream commit 8541b21e781a22dce52a74fef0b9bed00404a1cd ]

In order to be able to give better diagnostics and detect
malicious traffic, we need to have better sk->sk_drops tracking.

Fixes: 9f5afeae5152 ("tcp: use an RB tree for ooo receive queue")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(adapted from v4.9.x commit 94623c7463f3424776408df2733012c42b52395a)

Orabug: 28639707
CVE: CVE-2018-5390

Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

tcp: detect malicious patterns in tcp_collapse_ofo_queue()

[ Upstream commit 3d4bf93ac12003f9b8e1e2de37fe27983deebdcf ]

In case an attacker feeds tiny packets completely out of order,
tcp_collapse_ofo_queue() might scan the whole rb-tree, performing
expensive copies, but not changing socket memory usage at all.

1) Do not attempt to collapse tiny skbs.
2) Add logic to exit early when too many tiny skbs are detected.

We prefer not doing aggressive collapsing (which copies packets)
for pathological flows, and revert to tcp_prune_ofo_queue() which
will be less expensive.

In the future, we might add the possibility of terminating flows
that are proven to be malicious.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(adapted from v4.9.x commit a878681484a0992ee3dfbd7826439951f9f82a69)

Orabug: 28639707
CVE: CVE-2018-5390

Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

tcp: avoid collapses in tcp_prune_queue() if possible

[ Upstream commit f4a3313d8e2ca9fd8d8f45e40a2903ba782607e7 ]

Right after a TCP flow is created, receiving tiny out of order
packets allways hit the condition :

if (atomic_read(&sk->sk_rmem_alloc) >= sk->sk_rcvbuf)
tcp_clamp_window(sk);

tcp_clamp_window() increases sk_rcvbuf to match sk_rmem_alloc
(guarded by tcp_rmem[2])

Calling tcp_collapse_ofo_queue() in this case is not useful,
and offers a O(N^2) surface attack to malicious peers.

Better not attempt anything before full queue capacity is reached,
forcing attacker to spend lots of resource and allow us to more
easily detect the abuse.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(adapted from v4.9.x commit fdf258ed5dd85b57cf0e0e66500be98d38d42d02)

Orabug: 28639707
CVE: CVE-2018-5390

Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

tcp: free batches of packets in tcp_prune_ofo_queue()

[ Upstream commit 72cd43ba64fc172a443410ce01645895850844c8 ]

Juha-Matti Tilli reported that malicious peers could inject tiny
packets in out_of_order_queue, forcing very expensive calls
to tcp_collapse_ofo_queue() and tcp_prune_ofo_queue() for
every incoming packet. out_of_order_queue rb-tree can contain
thousands of nodes, iterating over all of them is not nice.

Before linux-4.9, we would have pruned all packets in ofo_queue
in one go, every XXXX packets. XXXX depends on sk_rcvbuf and skbs
truesize, but is about 7000 packets with tcp_rmem[2] default of 6 MB.

Since we plan to increase tcp_rmem[2] in the future to cope with
modern BDP, can not revert to the old behavior, without great pain.

Strategy taken in this patch is to purge ~12.5 % of the queue capacity.

Fixes: 36a6503fedda ("tcp: refine tcp_prune_ofo_queue() to not drop all packets")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Juha-Matti Tilli <juha-matti.tilli@iki.fi>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(adapted from v4.9.x commit 2d08921c8da26bdce3d8848ef6f32068f594d7d4)

Orabug: 28639707
CVE: CVE-2018-5390

Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

tcp: use an RB tree for ooo receive queue

Over the years, TCP BDP has increased by several orders of magnitude,
and some people are considering to reach the 2 Gbytes limit.

Even with current window scale limit of 14, ~1 Gbytes maps to ~740,000
MSS.

In presence of packet losses (or reorders), TCP stores incoming packets
into an out of order queue, and number of skbs sitting there waiting for
the missing packets to be received can be in the 10^5 range.

Most packets are appended to the tail of this queue, and when
packets can finally be transferred to receive queue, we scan the queue
from its head.

However, in presence of heavy losses, we might have to find an arbitrary
point in this queue, involving a linear scan for every incoming packet,
throwing away cpu caches.

This patch converts it to a RB tree, to get bounded latencies.

Yaogong wrote a preliminary patch about 2 years ago.
Eric did the rebase, added ofo_last_skb cache, polishing and tests.

Tested with network dropping between 1 and 10 % packets, with good
success (about 30 % increase of throughput in stress tests)

Next step would be to also use an RB tree for the write queue at sender
side ;)

Signed-off-by: Yaogong Wang <wygivan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Acked-By: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
(adapted from v4.9.x commit 9f5afeae51526b3ad7b7cb21ee8b145ce6ea7a7a)

Orabug: 28639707
CVE: CVE-2018-5390

Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

tcp: refine tcp_prune_ofo_queue() to not drop all packets

Over the years, TCP BDP has increased a lot, and is typically
in the order of ~10 Mbytes with help of clever Congestion Control
modules.

In presence of packet losses, TCP stores incoming packets into an out of
order queue, and number of skbs sitting there waiting for the missing
packets to be received can match the BDP (~10 Mbytes)

In some cases, TCP needs to make room for incoming skbs, and current
strategy can simply remove all skbs in the out of order queue as a last
resort, incurring a huge penalty, both for receiver and sender.

Unfortunately these 'last resort events' are quite frequent, forcing
sender to send all packets again, stalling the flow and wasting a lot of
resources.

This patch cleans only a part of the out of order queue in order
to meet the memory constraints.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: C. Stephen Gun <csg@google.com>
Cc: Van Jacobson <vanj@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(adapted from v4.9.x commit 36a6503feddadbbad415fb3891e80f94c10a9b21)

Orabug: 28639707
CVE: CVE-2018-5390

Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

tcp: introduce tcp_under_memory_pressure()

Introduce an optimized version of sk_under_memory_pressure()
for TCP. Our intent is to use it in fast paths.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(adapted from v4.9.x commit b8da51ebb1aa93908350f95efae73aecbc2e266c)

Orabug: 28639707
CVE: CVE-2018-5390

Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

tcp: increment sk_drops for dropped rx packets

Now ss can report sk_drops, we can instruct TCP to increment
this per socket counter when it drops an incoming frame, to refine
monitoring and debugging.

Following patch takes care of listeners drops.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(adapted from v4.9.x commit 532182cd610782db8c18230c2747626562032205)

Orabug: 28639707
CVE: CVE-2018-5390

Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/entry/64: Ensure %ebx handling correct in xen_failsafe_callback

error_entry and error_exit communicate the user vs. kernel status of
the frame using %ebx. Make sure %ebx is set properly in
xen_failsafe_callback().

This fix is different from the one provided in the XSA to minimize
risk of a regression. This alternative fix was suggested
by Andy Lutomirski in the original patch (Commit b3681dd548d0
("x86/entry/64: Remove %ebx handling from error_entry/exit"))

Orabug: 28402927
CVE: CVE-2018-14678

This is XSA-274

Signed-off-by: George Kennedy <george.kennedy@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/speculation/l1tf: Increase l1tf memory limit for Nehalem+

On Nehalem and newer core CPUs the CPU cache internally uses 44 bits
physical address space. The L1TF workaround is limited by this internal
cache address width, and needs to have one bit free there for the
mitigation to work.

Older client systems report only 36bit physical address space so the range
check decides that L1TF is not mitigated for a 36bit phys/32GB system with
some memory holes.

But since these actually have the larger internal cache width this warning
is bogus because it would only really be needed if the system had more than
43bits of memory.

Add a new internal x86_cache_bits field. Normally it is the same as the
physical bits field reported by CPUID, but for Nehalem and newerforce it to
be at least 44bits.

Change the L1TF memory size warning to use the new cache_bits field to
avoid bogus warnings and remove the bogus comment about memory size.

Fixes: 17dbca119312 ("x86/speculation/l1tf: Add sysfs reporting for l1tf")
Reported-by: George Anchev <studio@anchev.net>
Reported-by: Christopher Snowhill <kode54@gmail.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Michael Hocko <mhocko@suse.com>
Cc: vbabka@suse.cz
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20180824170351.34874-1-andi@firstfloor.org
(cherry picked from commit cc51e5428ea54f575d49cfcede1d4cb3a72b4ec4)

Orabug: 28488808
CVE: CVE-2018-3620

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/include/asm/processor.h
arch/x86/kernel/cpu/common.c
arch/x86/kernel/cpu/bugs.c
Contextual: different content; bugs_64.c was modified instead of bugs.c

Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/speculation/l1tf: Suggest what to do on systems with too much RAM

Two users have reported [1] that they have an "extremely unlikely" system
with more than MAX_PA/2 memory and L1TF mitigation is not effective.

Make the warning more helpful by suggesting the proper mem=X kernel boot
parameter to make it effective and a link to the L1TF document to help
decide if the mitigation is worth the unusable RAM.

[1] https://bugzilla.suse.com/show_bug.cgi?id=1105536

Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/966571f0-9d7f-43dc-92c6-a10eec7a1254@suse.cz
(cherry picked from commit 6a012288d6906fee1dbc244050ade1dafe4a9c8d)

Orabug: 28488808
CVE: CVE-2018-3620

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/kernel/cpu/bugs.c
Contextual: bugs_64.c was modified instead of bugs.c

Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/speculation/l1tf: Fix off-by-one error when warning that system has too much RAM

Two users have reported [1] that they have an "extremely unlikely" system
with more than MAX_PA/2 memory and L1TF mitigation is not effective. In
fact it's a CPU with 36bits phys limit (64GB) and 32GB memory, but due to
holes in the e820 map, the main region is almost 500MB over the 32GB limit:

[ 0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000081effffff] usable

Suggestions to use 'mem=32G' to enable the L1TF mitigation while losing the
500MB revealed, that there's an off-by-one error in the check in
l1tf_select_mitigation().

l1tf_pfn_limit() returns the last usable pfn (inclusive) and the range
check in the mitigation path does not take this into account.

Instead of amending the range check, make l1tf_pfn_limit() return the first
PFN which is over the limit which is less error prone. Adjust the other
users accordingly.

[1] https://bugzilla.suse.com/show_bug.cgi?id=1105536

Fixes: 17dbca119312 ("x86/speculation/l1tf: Add sysfs reporting for l1tf")
Reported-by: George Anchev <studio@anchev.net>
Reported-by: Christopher Snowhill <kode54@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20180823134418.17008-1-vbabka@suse.cz
(cherry picked from commit b0a182f875689647b014bc01d36b340217792852)

Orabug: 28488808
CVE: CVE-2018-3620

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/speculation/l1tf: Fix overflow in l1tf_pfn_limit() on 32bit

On 32bit PAE kernels on 64bit hardware with enough physical bits,
l1tf_pfn_limit() will overflow unsigned long. This in turn affects
max_swapfile_size() and can lead to swapon returning -EINVAL. This has been
observed in a 32bit guest with 42 bits physical address size, where
max_swapfile_size() overflows exactly to 1 << 32, thus zero, and produces
the following warning to dmesg:

[ 6.396845] Truncating oversized swap area, only using 0k out of 2047996k

Fix this by using unsigned long long instead.

Fixes: 17dbca119312 ("x86/speculation/l1tf: Add sysfs reporting for l1tf")
Fixes: 377eeaa8e11f ("x86/speculation/l1tf: Limit swap file size to MAX_PA/2")
Reported-by: Dominique Leuenberger <dimstar@suse.de>
Reported-by: Adrian Schroeter <adrian@suse.de>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20180820095835.5298-1-vbabka@suse.cz
(cherry picked from commit 9df9516940a61d29aedf4d91b483ca6597e7d480)

Orabug: 28488808
CVE: CVE-2018-3620

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/speculation/l1tf: Exempt zeroed PTEs from inversion

It turns out that we should *not* invert all not-present mappings,
because the all zeroes case is obviously special.

clear_page() does not undergo the XOR logic to invert the address bits,
i.e. PTE, PMD and PUD entries that have not been individually written
will have val=0 and so will trigger __pte_needs_invert(). As a result,
{pte,pmd,pud}_pfn() will return the wrong PFN value, i.e. all ones
(adjusted by the max PFN mask) instead of zero. A zeroed entry is ok
because the page at physical address 0 is reserved early in boot
specifically to mitigate L1TF, so explicitly exempt them from the
inversion when reading the PFN.

Manifested as an unexpected mprotect(..., PROT_NONE) failure when called
on a VMA that has VM_PFNMAP and was mmap'd to as something other than
PROT_NONE but never used. mprotect() sends the PROT_NONE request down
prot_none_walk(), which walks the PTEs to check the PFNs.
prot_none_pte_entry() gets the bogus PFN from pte_pfn() and returns
-EACCES because it thinks mprotect() is trying to adjust a high MMIO
address.

[ This is a very modified version of Sean's original patch, but all
  credit goes to Sean for doing this and also pointing out that
  sometimes the __pte_needs_invert() function only gets the protection
  bits, not the full eventual pte.  But zero remains special even in
  just protection bits, so that's ok.   - Linus ]

Fixes: f22cc87f6c1f ("x86/speculation/l1tf: Invert all not present mappings")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit f19f5c49bbc3ffcc9126cc245fc1b24cc29f4a37)

Orabug: 28488808
CVE: CVE-2018-3620

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/l1tf: Fix build error seen if CONFIG_KVM_INTEL is disabled

allmodconfig+CONFIG_INTEL_KVM=n results in the following build error.

ERROR: "l1tf_vmx_mitigation" [arch/x86/kvm/kvm.ko] undefined!

Fixes: 5b76a3cff011 ("KVM: VMX: Tell the nested hypervisor to skip L1D flush on vmentry")
Reported-by: Meelis Roos <mroos@linux.ee>
Cc: Meelis Roos <mroos@linux.ee>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 1eb46908b35dfbac0ec1848d4b1e39667e0187e9)

Orabug: 28488808
CVE: CVE-2018-3620

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/kernel/cpu/bugs.c
Contextual: bugs_64.c was modified instead of bugs.c

Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/spectre: Add missing family 6 check to microcode check

The check for Spectre microcodes does not check for family 6, only the
model numbers.

Add a family 6 check to avoid ambiguity with other families.

Fixes: a5b296636453 ("x86/cpufeature: Blacklist SPEC_CTRL/PRED_CMD on early Spectre v2 microcodes")
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20180824170351.34874-2-andi@firstfloor.org
(cherry picked from commit 1ab534e85c93945f7862378d8c8adcf408205b19)

Orabug: 28488808
CVE: CVE-2018-3620

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Conflicts: bad_spectre_microcode was in arch/x86/kernel/cpu/scattered.c
Signed-off-by: Brian Maly <brian.maly@oracle.com>

KVM: x86: SVM: Call x86_spec_ctrl_set_guest/host() with interrupts disabled

Mikhail reported the following lockdep splat:

WARNING: possible irq lock inversion dependency detected
CPU 0/KVM/10284 just changed the state of lock:
  000000000d538a88 (&st->lock){+...}, at:
  speculative_store_bypass_update+0x10b/0x170

but this lock was taken by another, HARDIRQ-safe lock
in the past:

(&(&sighand->siglock)->rlock){-.-.}

   and interrupts could create inverse lock ordering between them.

Possible interrupt unsafe locking scenario:

    CPU0                    CPU1
    ----                    ----
   lock(&st->lock);
                           local_irq_disable();
                           lock(&(&sighand->siglock)->rlock);
                           lock(&st->lock);
    <Interrupt>
     lock(&(&sighand->siglock)->rlock);
     *** DEADLOCK ***

The code path which connects those locks is:

   speculative_store_bypass_update()
   ssb_prctl_set()
   do_seccomp()
   do_syscall_64()

In svm_vcpu_run() speculative_store_bypass_update() is called with
interupts enabled via x86_virt_spec_ctrl_set_guest/host().

This is actually a false positive, because GIF=0 so interrupts are
disabled even if IF=1; however, we can easily move the invocations of
x86_virt_spec_ctrl_set_guest/host() into the interrupt disabled region to
cure it, and it's a good idea to keep the GIF=0/IF=1 area as small
and self-contained as possible.

Fixes: 1f50ddb4f418 ("x86/speculation: Handle HT correctly on AMD")
Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: kvm@vger.kernel.org
Cc: x86@kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit 024d83cadc6b2af027e473720f3c3da97496c318)

Orabug: 28488808
CVE: CVE-2018-3646

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Conflicts: contextual
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/microcode: Allow late microcode loading with SMT disabled

The kernel unnecessarily prevents late microcode loading when SMT is
disabled. It should be safe to allow it if all the primary threads are
online.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
(cherry picked from commit 07d981ad4cf1e78361c6db1c28ee5ba105f96cc1)

Orabug: 28488808
CVE: CVE-2018-3620

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/microcode: Do not upload microcode if CPUs are offline

Avoid loading microcode if any of the CPUs are offline, and issue a
warning. Having different microcode revisions on the system at any time
is outright dangerous.

[ Borislav: Massage changelog. ]

Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Tested-by: Ashok Raj <ashok.raj@intel.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Arjan Van De Ven <arjan.van.de.ven@intel.com>
Link: http://lkml.kernel.org/r/1519352533-15992-4-git-send-email-ashok.raj@intel.com
Link: https://lkml.kernel.org/r/20180228102846.13447-5-bp@alien8.de
(cherry picked from commit 30ec26da9967d0d785abc24073129a34c3211777)

Needed for the next cherry-picked commit "x86/microcode: Allow late microcode loading with SMT disabled".

Orabug: 28488808
CVE: CVE-2018-3620

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/kernel/cpu/microcode/core.c
Contextual

Signed-off-by: Brian Maly <brian.maly@oracle.com>

Cipso: cipso_v4_optptr enter infinite loop

Orabug: 28563992
CVE: CVE-2018-10938

in for(),if((optlen > 0) && (optptr[1] == 0)), enter infinite loop.

Test: receive a packet which the ip length > 20 and the first byte of ip option is 0, produce this issue

Signed-off-by: yujuan.qi <yujuan.qi@mediatek.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 40413955ee265a5e42f710940ec78f5450d49149)
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Brian Maly <brian.maly@oracle.com
Signed-off-by: Brian Maly <brian.maly@oracle.com>

Btrfs: fix list_add corruption and soft lockups in fsync

Orabug: 28119834

Xfstests btrfs/146 revealed this corruption,

[   58.138831] Buffer I/O error on dev dm-0, logical block 2621424, async page read
[   58.151233] BTRFS error (device sdf): bdev /dev/mapper/error-test errs: wr 1, rd 0, flush 0, corrupt 0, gen
[   58.152403] list_add corruption. prev->next should be next (ffff88005e6775d8), but was ffffc9000189be88. (pr
[   58.153518] ------------[ cut here ]------------
[   58.153892] WARNING: CPU: 1 PID: 1287 at lib/list_debug.c:31 __list_add_valid+0x169/0x1f0
...
[   58.157379] RIP: 0010:__list_add_valid+0x169/0x1f0
...
[   58.161956] Call Trace:
[   58.162264]  btrfs_log_inode_parent+0x5bd/0xfb0 [btrfs]
[   58.163583]  btrfs_log_dentry_safe+0x60/0x80 [btrfs]
[   58.164003]  btrfs_sync_file+0x4c2/0x6f0 [btrfs]
[   58.164393]  vfs_fsync_range+0x5f/0xd0
[   58.164898]  do_fsync+0x5a/0x90
[   58.165170]  SyS_fsync+0x10/0x20
[   58.165395]  entry_SYSCALL_64_fastpath+0x1f/0xbe
...

It turns out that we could record btrfs_log_ctx:io_err in
log_one_extents when IO fails, but make log_one_extents() return '0'
instead of -EIO, so the IO error is not acknowledged by the callers,
i.e.  btrfs_log_inode_parent(), which would remove btrfs_log_ctx:list
from list head 'root->log_ctxs'.  Since btrfs_log_ctx is allocated
from stack memory, it'd get freed with a object alive on the
list. then a future list_add will throw the above warning.

This returns the correct error in the above case.

Jeff also reported this while testing against his fsync error
patch set[1].

[1]: https://www.spinics.net/lists/linux-btrfs/msg65308.html
"btrfs list corruption and soft lockups while testing writeback error handling"

Fixes: 8407f553268a4611f254 ("Btrfs: fix data corruption after fast fsync and writeback error")
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
(cherrypicked from commit ebb70442cdd4872260c2415929c456be3562da82)
Conflicts: fs/btrfs/file.c
Signed-off-by: Ashish Samant <ashish.samant@oracle.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/paravirt: Fix spectre-v2 mitigations for paravirt guests

Nadav reported that on guests we're failing to rewrite the indirect
calls to CALLEE_SAVE paravirt functions. In particular the
pv_queued_spin_unlock() call is left unpatched and that is all over the
place. This obviously wrecks Spectre-v2 mitigation (for paravirt
guests) which relies on not actually having indirect calls around.

The reason is an incorrect clobber test in paravirt_patch_call(); this
function rewrites an indirect call with a direct call to the _SAME_
function, there is no possible way the clobbers can be different
because of this.

Therefore remove this clobber check. Also put WARNs on the other patch
failure case (not enough room for the instruction) which I've not seen
trigger in my (limited) testing.

Three live kernel image disassemblies for lock_sock_nested (as a small
function that illustrates the problem nicely). PRE is the current
situation for guests, POST is with this patch applied and NATIVE is with
or without the patch for !guests.

PRE:

(gdb) disassemble lock_sock_nested
Dump of assembler code for function lock_sock_nested:
   0xffffffff817be970 <+0>:     push   %rbp
   0xffffffff817be971 <+1>:     mov    %rdi,%rbp
   0xffffffff817be974 <+4>:     push   %rbx
   0xffffffff817be975 <+5>:     lea    0x88(%rbp),%rbx
   0xffffffff817be97c <+12>:    callq  0xffffffff819f7160 <_cond_resched>
   0xffffffff817be981 <+17>:    mov    %rbx,%rdi
   0xffffffff817be984 <+20>:    callq  0xffffffff819fbb00 <_raw_spin_lock_bh>
   0xffffffff817be989 <+25>:    mov    0x8c(%rbp),%eax
   0xffffffff817be98f <+31>:    test   %eax,%eax
   0xffffffff817be991 <+33>:    jne    0xffffffff817be9ba <lock_sock_nested+74>
   0xffffffff817be993 <+35>:    movl   $0x1,0x8c(%rbp)
   0xffffffff817be99d <+45>:    mov    %rbx,%rdi
   0xffffffff817be9a0 <+48>:    callq  *0xffffffff822299e8
   0xffffffff817be9a7 <+55>:    pop    %rbx
   0xffffffff817be9a8 <+56>:    pop    %rbp
   0xffffffff817be9a9 <+57>:    mov    $0x200,%esi
   0xffffffff817be9ae <+62>:    mov    $0xffffffff817be993,%rdi
   0xffffffff817be9b5 <+69>:    jmpq   0xffffffff81063ae0 <__local_bh_enable_ip>
   0xffffffff817be9ba <+74>:    mov    %rbp,%rdi
   0xffffffff817be9bd <+77>:    callq  0xffffffff817be8c0 <__lock_sock>
   0xffffffff817be9c2 <+82>:    jmp    0xffffffff817be993 <lock_sock_nested+35>
End of assembler dump.

POST:

(gdb) disassemble lock_sock_nested
Dump of assembler code for function lock_sock_nested:
   0xffffffff817be970 <+0>:     push   %rbp
   0xffffffff817be971 <+1>:     mov    %rdi,%rbp
   0xffffffff817be974 <+4>:     push   %rbx
   0xffffffff817be975 <+5>:     lea    0x88(%rbp),%rbx
   0xffffffff817be97c <+12>:    callq  0xffffffff819f7160 <_cond_resched>
   0xffffffff817be981 <+17>:    mov    %rbx,%rdi
   0xffffffff817be984 <+20>:    callq  0xffffffff819fbb00 <_raw_spin_lock_bh>
   0xffffffff817be989 <+25>:    mov    0x8c(%rbp),%eax
   0xffffffff817be98f <+31>:    test   %eax,%eax
   0xffffffff817be991 <+33>:    jne    0xffffffff817be9ba <lock_sock_nested+74>
   0xffffffff817be993 <+35>:    movl   $0x1,0x8c(%rbp)
   0xffffffff817be99d <+45>:    mov    %rbx,%rdi
   0xffffffff817be9a0 <+48>:    callq  0xffffffff810a0c20 <__raw_callee_save___pv_queued_spin_unlock>
   0xffffffff817be9a5 <+53>:    xchg   %ax,%ax
   0xffffffff817be9a7 <+55>:    pop    %rbx
   0xffffffff817be9a8 <+56>:    pop    %rbp
   0xffffffff817be9a9 <+57>:    mov    $0x200,%esi
   0xffffffff817be9ae <+62>:    mov    $0xffffffff817be993,%rdi
   0xffffffff817be9b5 <+69>:    jmpq   0xffffffff81063aa0 <__local_bh_enable_ip>
   0xffffffff817be9ba <+74>:    mov    %rbp,%rdi
   0xffffffff817be9bd <+77>:    callq  0xffffffff817be8c0 <__lock_sock>
   0xffffffff817be9c2 <+82>:    jmp    0xffffffff817be993 <lock_sock_nested+35>
End of assembler dump.

NATIVE:

(gdb) disassemble lock_sock_nested
Dump of assembler code for function lock_sock_nested:
   0xffffffff817be970 <+0>:     push   %rbp
   0xffffffff817be971 <+1>:     mov    %rdi,%rbp
   0xffffffff817be974 <+4>:     push   %rbx
   0xffffffff817be975 <+5>:     lea    0x88(%rbp),%rbx
   0xffffffff817be97c <+12>:    callq  0xffffffff819f7160 <_cond_resched>
   0xffffffff817be981 <+17>:    mov    %rbx,%rdi
   0xffffffff817be984 <+20>:    callq  0xffffffff819fbb00 <_raw_spin_lock_bh>
   0xffffffff817be989 <+25>:    mov    0x8c(%rbp),%eax
   0xffffffff817be98f <+31>:    test   %eax,%eax
   0xffffffff817be991 <+33>:    jne    0xffffffff817be9ba <lock_sock_nested+74>
   0xffffffff817be993 <+35>:    movl   $0x1,0x8c(%rbp)
   0xffffffff817be99d <+45>:    mov    %rbx,%rdi
   0xffffffff817be9a0 <+48>:    movb   $0x0,(%rdi)
   0xffffffff817be9a3 <+51>:    nopl   0x0(%rax)
   0xffffffff817be9a7 <+55>:    pop    %rbx
   0xffffffff817be9a8 <+56>:    pop    %rbp
   0xffffffff817be9a9 <+57>:    mov    $0x200,%esi
   0xffffffff817be9ae <+62>:    mov    $0xffffffff817be993,%rdi
   0xffffffff817be9b5 <+69>:    jmpq   0xffffffff81063ae0 <__local_bh_enable_ip>
   0xffffffff817be9ba <+74>:    mov    %rbp,%rdi
   0xffffffff817be9bd <+77>:    callq  0xffffffff817be8c0 <__lock_sock>
   0xffffffff817be9c2 <+82>:    jmp    0xffffffff817be993 <lock_sock_nested+35>
End of assembler dump.

Fixes: 63f70270ccd9 ("[PATCH] i386: PARAVIRT: add common patching machinery")
Fixes: 3010a0663fd9 ("x86/paravirt, objtool: Annotate indirect calls")
Reported-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Juergen Gross <jgross@suse.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: stable@vger.kernel.org
(cherry picked from commit 5800dc5c19f34e6e03b5adab1282535cb102fafd)

Orabug: 28474643

Signed-off-by: George Kennedy <george.kennedy@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

sym53c8xx: fix NULL pointer dereference panic in sym_int_sir() in sym_hipd.c

sym_int_sir() in sym_hipd.c does not check the command pointer for NULL
before using it in debug message prints.

Orabug: 28481893

Suggested-by: Matthew Wilcox <matthew.wilcox@oracle.com>
Signed-off-by: George Kennedy <george.kennedy@oracle.com>
Reviewed-by: Mark Kanda <mark.kanda@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

md/raid1: Avoid raid1 resync getting stuck

close_sync() needs to set conf->next_resync to a large, but safe value
below MaxSector and use it to determine whether or not to set
start_next_window in wait_barrier()

Solution suggested by Neil Brown.

Reported-by: Nate Dailey <nate.dailey@stratus.com>
Tested-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
(cherry picked from commit e8ff8bf09ff49733534ff3cee91bde030186055f)

Orabug: 28529228

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/spectrev2: Don't set mode to SPECTRE_V2_NONE when retpoline is available.

When booting with noibrs we may accidentally set mode to
SPECTRE_V2_NONE, missing the fact that we will, in fact, be
using retpoline.

Change ibrs_select()'s behavior so that it only modifies
spectre_v2_mitigation mode if it was actually able to set
SPEC_CTRL_IBRS_INUSE flag.

Orabug: 28540376

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

ext4: avoid deadlock when expanding inode size

When we need to move xattrs into external xattr block, we call
ext4_xattr_block_set() from ext4_expand_extra_isize_ea(). That may end
up calling ext4_mark_inode_dirty() again which will recurse back into
the inode expansion code leading to deadlocks.

Protect from recursion using EXT4_STATE_NO_EXPAND inode flag and move
its management into ext4_expand_extra_isize_ea() since its manipulation
is safe there (due to xattr_sem) from possible races with
ext4_xattr_set_handle() which plays with it as well.

CC: stable@vger.kernel.org # 4.4.x
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
(cherry picked from commit 2e81a4eeedcaa66e35f58b81e0755b87057ce392)

Orabug: 25718971

Signed-off-by: Shuning Zhang <sunny.s.zhang@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

ext4: properly align shifted xattrs when expanding inodes

We did not count with the padding of xattr value when computing desired
shift of xattrs in the inode when expanding i_extra_isize. As a result
we could create unaligned start of inline xattrs. Account for alignment
properly.

CC: stable@vger.kernel.org # 4.4.x-
Signed-off-by: Jan Kara <jack@suse.cz>
(cherry picked from commit 443a8c41cd49de66a3fda45b32b9860ea0292b84)

Orabug: 25718971

Signed-off-by: Shuning Zhang <sunny.s.zhang@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

ext4: fix xattr shifting when expanding inodes part 2

When multiple xattrs need to be moved out of inode, we did not properly
recompute total size of xattr headers in the inode and the new header
position. Thus when moving the second and further xattr we asked
ext4_xattr_shift_entries() to move too much and from the wrong place,
resulting in possible xattr value corruption or general memory
corruption.

CC: stable@vger.kernel.org # 4.4.x
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
(cherry picked from commit 418c12d08dc64a45107c467ec1ba29b5e69b0715)

Orabug: 25718971

Signed-off-by: Shuning Zhang <sunny.s.zhang@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

ext4: fix xattr shifting when expanding inodes

The code in ext4_expand_extra_isize_ea() treated new_extra_isize
argument sometimes as the desired target i_extra_isize and sometimes as
the amount by which we need to grow current i_extra_isize. These happen
to coincide when i_extra_isize is 0 which used to be the common case and
so nobody noticed this until recently when we added i_projid to the
inode and so i_extra_isize now needs to grow from 28 to 32 bytes.

The result of these bugs was that we sometimes unnecessarily decided to
move xattrs out of inode even if there was enough space and we often
ended up corrupting in-inode xattrs because arguments to
ext4_xattr_shift_entries() were just wrong. This could demonstrate
itself as BUG_ON in ext4_xattr_shift_entries() triggering.

Fix the problem by introducing new isize_diff variable and use it where
appropriate.

CC: stable@vger.kernel.org # 4.4.x
Reported-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
(cherry picked from commit d0141191a20289f8955c1e03dad08e42e6f71ca9)

Orabug: 25718971

Signed-off-by: Shuning Zhang <sunny.s.zhang@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

uek-rpm: Enable perf stripped binary

Due to a bug in the find-debuginfo.sh file provided by
RPM, the perf binary was not properly stripped in OL7.
This commit provides a new patch, and cleans up previous
patches to find-debuginfo.sh.

Orabug: 27801171

Signed-off-by: Victor Erminpour <victor.erminpour@oracle.com>
Reviewed-by: Dan Duval <dan.duval@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

nfsd: give out fewer session slots as limit approaches

Instead of granting client's full requests until we hit our DRC size
limit and then failing CREATE_SESSIONs (and hence mounts) completely,
start granting clients smaller slot tables as we approach the limit.

The factor chosen here is pretty much arbitrary.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
(cherry picked from commit de766e570413bd0484af0b580299b495ada625c3)

Orabug: 28023821

Signed-off-by: Manjunath Patil <manjunath.b.patil@oracle.com>
Reviewed-by: Calum Mackay <calum.mackay@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

nfsd: increase DRC cache limit

An NFSv4.1+ client negotiates the size of its duplicate reply cache size
in the initial CREATE_SESSION request. The server preallocates the
memory for the duplicate reply cache to ensure that we'll never fail to
record the response to a nonidempotent operation.

To prevent a few CREATE_SESSIONs from consuming all of memory we set an
upper limit based on nr_free_buffer_pages(). 1/2^10 has been too
limiting in practice; 1/2^7 is still less than one percent.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
(cherry picked from commit 44d8660d3bb0a1c8363ebcb906af2343ea8e15f6)

Orabug: 28023821
Signed-off-by: Manjunath Patil <manjunath.b.patil@oracle.com>
Reviewed-by: Calum Mackay <calum.mackay@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

uek-rpm: config-debug: Turn off torture testing by default

The OL7 debug kernels are configured with multiple torture tests
built into the kernel, and consequently are all of them
trying to run by default during boot. This is a waste
of time and kWs.

This can be observed by the following
messages in the system log during boot:

kernel: spin_lock-torture:--- Start of test [debug]: nwriters_stress=4 nreaders_stress=0 stat_interval=60 verbose=1 shuffle_interval=3 stutter=5 shutdown_secs=0
kernel: spin_lock-torture: Creating torture_shuffle task
kernel: spin_lock-torture: Creating torture_stutter task
kernel: spin_lock-torture: torture_shuffle task started
kernel: spin_lock-torture: Creating lock_torture_writer task
kernel: spin_lock-torture: torture_stutter task started
kernel: spin_lock-torture: Creating lock_torture_writer task
kernel: spin_lock-torture: lock_torture_writer task started
kernel: spin_lock-torture: Creating lock_torture_writer task
kernel: spin_lock-torture: lock_torture_writer task started
kernel: spin_lock-torture: Creating lock_torture_writer task
kernel: spin_lock-torture: lock_torture_writer task started
kernel: spin_lock-torture: Creating lock_torture_stats task
kernel: spin_lock-torture: lock_torture_writer task started
kernel: spin_lock-torture: lock_torture_stats task started
kernel: torture_init_begin: Refusing rcu init: spin_lock running.
kernel: torture_init_begin: One torture test at a time!

Better to compile these as modules to make them readily
available for testing if need be, than to run them all
at boot which does not seem to make sense at all judging
from the above log.

Orabug: 28261886

Signed-off-by: Knut Omang <knut.omang@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Acked-by: Victor Erminpour <victor.erminpour@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

ipmi: Remove smi_msg from waiting_rcv_msgs list before handle_one_recv_msg()

Commit 7ea0ed2b5be8 ("ipmi: Make the message handler easier to use for
SMI interfaces") changed handle_new_recv_msgs() to call handle_one_recv_msg()
for a smi_msg while the smi_msg is still connected to waiting_rcv_msgs list.
That could lead to following list corruption problems:

1) low-level function treats smi_msg as not connected to list

  handle_one_recv_msg() could end up calling smi_send(), which
  assumes the msg is not connected to list.

  For example, the following sequence could corrupt list by
  doing list_add_tail() for the entry still connected to other list.

    handle_new_recv_msgs()
      msg = list_entry(waiting_rcv_msgs)
      handle_one_recv_msg(msg)
        handle_ipmb_get_msg_cmd(msg)
          smi_send(msg)
            spin_lock(xmit_msgs_lock)
            list_add_tail(msg)
            spin_unlock(xmit_msgs_lock)

2) race between multiple handle_new_recv_msgs() instances

  handle_new_recv_msgs() once releases waiting_rcv_msgs_lock before calling
  handle_one_recv_msg() then retakes the lock and list_del() it.

  If others call handle_new_recv_msgs() during the window shown below
  list_del() will be done twice for the same smi_msg.

  handle_new_recv_msgs()
    spin_lock(waiting_rcv_msgs_lock)
    msg = list_entry(waiting_rcv_msgs)
    spin_unlock(waiting_rcv_msgs_lock)
  |
  | handle_one_recv_msg(msg)
  |
    spin_lock(waiting_rcv_msgs_lock)
    list_del(msg)
    spin_unlock(waiting_rcv_msgs_lock)

Fixes: 7ea0ed2b5be8 ("ipmi: Make the message handler easier to use for SMI interfaces")
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
[Added a comment to describe why this works.]
Signed-off-by: Corey Minyard <cminyard@mvista.com>
Cc: stable@vger.kernel.org # 3.19
Tested-by: Ye Feng <yefeng.yl@alibaba-inc.com>
(cherry picked from commit ae4ea9a2460c7fee2ae8feeb4dfe96f5f6c3e562)

Orabug:28281455

Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/mce/AMD: Give a name to MCA bank 3 when accessed with legacy MSRs

MCA bank 3 is reserved on systems pre-Fam17h, so it didn't have a name.
However, MCA bank 3 is defined on Fam17h systems and can be accessed
using legacy MSRs. Without a name we get a stack trace on Fam17h systems
when trying to register sysfs files for bank 3 on kernels that don't
recognize Scalable MCA.

Call MCA bank 3 "decode_unit" since this is what it represents on
Fam17h. This will allow kernels without SMCA support to see this bank on
Fam17h+ and prevent the stack trace. This will not affect older systems
since this bank is reserved on them, i.e. it'll be ignored.

Tested on AMD Fam15h and Fam17h systems.

  WARNING: CPU: 26 PID: 1 at lib/kobject.c:210 kobject_add_internal
  kobject: (ffff88085bb256c0): attempted to be registered with empty name!
  ...
  Call Trace:
   kobject_add_internal
   kobject_add
   kobject_create_and_add
   threshold_create_device
   threshold_init_device

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1490102285-3659-1-git-send-email-Yazen.Ghannam@amd.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
(cherry picked from commit 29f72ce3e4d18066ec75c79c857bee0618a3504b)

Orabug: 28416303

Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

Fix up non-directory creation in SGID directories

sgid directories have special semantics, making newly created files in
the directory belong to the group of the directory, and newly created
subdirectories will also become sgid. This is historically used for
group-shared directories.

But group directories writable by non-group members should not imply
that such non-group members can magically join the group, so make sure
to clear the sgid bit on non-directories for non-members (but remember
that sgid without group execute means "mandatory locking", just to
confuse things even more).

Reported-by: Jann Horn <jannh@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 0fa3ecd87848c9c93c2c828ef4c3a8ca36ce46c7)

Orabug: 28459477
CVE: CVE-2018-13405

Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

scsi: libsas: defer ata device eh commands to libata

When ata device doing EH, some commands still attached with tasks are
not passed to libata when abort failed or recover failed, so libata did
not handle these commands. After these commands done, sas task is freed,
but ata qc is not freed. This will cause ata qc leak and trigger a
warning like below:

WARNING: CPU: 0 PID: 28512 at drivers/ata/libata-eh.c:4037
ata_eh_finish+0xb4/0xcc
CPU: 0 PID: 28512 Comm: kworker/u32:2 Tainted: G W OE 4.14.0#1
......
Call trace:
[<ffff0000088b7bd0>] ata_eh_finish+0xb4/0xcc
[<ffff0000088b8420>] ata_do_eh+0xc4/0xd8
[<ffff0000088b8478>] ata_std_error_handler+0x44/0x8c
[<ffff0000088b8068>] ata_scsi_port_error_handler+0x480/0x694
[<ffff000008875fc4>] async_sas_ata_eh+0x4c/0x80
[<ffff0000080f6be8>] async_run_entry_fn+0x4c/0x170
[<ffff0000080ebd70>] process_one_work+0x144/0x390
[<ffff0000080ec100>] worker_thread+0x144/0x418
[<ffff0000080f2c98>] kthread+0x10c/0x138
[<ffff0000080855dc>] ret_from_fork+0x10/0x18

If ata qc leaked too many, ata tag allocation will fail and io blocked
for ever.

As suggested by Dan Williams, defer ata device commands to libata and
merge sas_eh_finish_cmd() with sas_eh_defer_cmd(). libata will handle
ata qcs correctly after this.

Signed-off-by: Jason Yan <yanaijie@huawei.com>
CC: Xiaofei Tan <tanxiaofei@huawei.com>
CC: John Garry <john.garry@huawei.com>
CC: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 318aaf34f1179b39fa9c30fa0f3288b645beee39)

Orabug: 28459685
CVE: CVE-2018-10021

Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

PCI: Allocate ATS struct during enumeration

Previously, we allocated pci_ats structures when an IOMMU driver called
pci_enable_ats().  An SR-IOV VF shares the STU setting with its PF, so when
enabling ATS on the VF, we allocated a pci_ats struct for the PF if it
didn't already have one.  We held the sriov->lock to serialize threads
concurrently enabling ATS on several VFS so only one would allocate the PF
pci_ats.

Gregor reported a deadlock here:

  pci_enable_sriov
    sriov_enable
      virtfn_add
        mutex_lock(dev->sriov->lock)      # acquire sriov->lock
        pci_device_add
          device_add
            BUS_NOTIFY_ADD_DEVICE notifier chain
            iommu_bus_notifier
              amd_iommu_add_device        # iommu_ops.add_device
                init_iommu_group
                  iommu_group_get_for_dev
                    iommu_group_add_device
                      __iommu_attach_device
                        amd_iommu_attach_device  # iommu_ops.attach_device
                          attach_device
                            pci_enable_ats
                              mutex_lock(dev->sriov->lock) # deadlock

There's no reason to delay allocating the pci_ats struct, and if we
allocate it for each device at enumeration-time, there's no need for
locking in pci_enable_ats().

Allocate pci_ats struct during enumeration, when we initialize other
capabilities.

Note that this implementation requires ATS to be enabled on the PF first,
before on any of the VFs because the PF controls the STU for all the VFs.

Link: http://permalink.gmane.org/gmane.linux.kernel.iommu/9433
Reported-by: Gregor Dick <gdick@solarflare.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Joerg Roedel <jroedel@suse.de>
(cherry pick from commit edc90fee916b4f0d14af9c6b5c08666747488ef8)

Orabug: 28460092

Signed-off-by: Kanth Ghatraju <kanth.ghatraju@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

qla2xxx: Update the version to 9.00.00.00.41.0-k1.

Orabug: 28172611

Signed-off-by: Rajan Shanmugavelu <rajan.shanmugavelu@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

qla2xxx: Utilize complete local DMA buffer for DIF PI inforamtion.

When any of the DIF PI SGEs received from upper layer is lesser than the
local DMA SGE length, then local DMA buffer was getting partially filled
causing DIF errors.

Orabug: 28172611

Signed-off-by: Rajan Shanmugavelu <rajan.shanmugavelu@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

qla2xxx: Correction to total data segment count when local DMA buffers used for DIF PI.

When local DMA buffers used for DIF PI information, the total data
segment count should include the local DMA SGEs used.

OraBug: 28172611

Signed-off-by: Rajan Shanmugavelu <rajan.shanmugavelu@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

scsi: megaraid_sas: fix the wrong way to get irq number

Customer's machine report following warning log:
WARNING: CPU: 7 PID: 934 at drivers/scsi/megaraid/kcompat.h:14 megasas_sync_irqs+0x7a/0x90 [megaraid_sas]()
...
CPU: 7 PID: 934 Comm: scsi_eh_0 Not tainted 4.1.12-112.14.2.el6uek.x86_64 #2
Hardware name: Dell Inc. PowerEdge M630/0R10KJ, BIOS 2.6.0 10/27/2017
Call Trace:
[<ffffffff816eb150>] dump_stack+0x63/0x83
[<ffffffff81086f25>] warn_slowpath_common+0x95/0xe0
[<ffffffff81086f8a>] warn_slowpath_null+0x1a/0x20
[<ffffffffa031a8ca>] megasas_sync_irqs+0x7a/0x90 [megaraid_sas]
[<ffffffffa031e2d0>] megasas_reset_fusion+0xd0/0x8b0 [megaraid_sas]
[<ffffffffa0314cbe>] megasas_reset_bus_host+0x8e/0x210 [megaraid_sas]
[<ffffffff814ca1e6>] scsi_try_host_reset+0x56/0x120
[<ffffffff814cd649>] scsi_eh_host_reset+0x59/0x180
[<ffffffff814cd804>] scsi_eh_ready_devs+0x94/0x100
[<ffffffff814cd97d>] scsi_unjam_host+0x10d/0x230
[<ffffffff814cdc28>] scsi_error_handler+0x188/0x210
[<ffffffff810a72de>] kthread+0xce/0xf0
[<ffffffff816f11a2>] ret_from_fork+0x42/0x70

This is introduced by the 490515384dc4 (scsi: megaraid_sas: Use
synchronize_irq to wait for IRQs to complete). The introduced
pci_irq_vector() compatibility wrapper did not take all interrupt
delivery methods into account. We should use the instance->msixentry[]
to get the irq number for the case of instance->msix_vectors > 0.

Orabug: 28436426

Fixes: 490515384dc4 (scsi: megaraid_sas: Use synchronize_irq to wait for IRQs to complete)
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Cc: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

ALSA: seq: Make ioctls race-free

commit b3defb791b26ea0683a93a4f49c77ec45ec96f10 upstream.

The ALSA sequencer ioctls have no protection against racy calls while
the concurrent operations may lead to interfere with each other. As
reported recently, for example, the concurrent calls of setting client
pool with a combination of write calls may lead to either the
unkillable dead-lock or UAF.

As a slightly big hammer solution, this patch introduces the mutex to
make each ioctl exclusive. Although this may reduce performance via
parallel ioctl calls, usually it's not demanded for sequencer usages,
hence it should be negligible.

Reported-by: Luo Quan <a4651386@163.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit c3162384aed4cfe3f1a1f40041f3ba8cd7704d88)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
sound/core/seq/seq_clientmgr.c

Orabug: 28459728
CVE: CVE-2018-7566

Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

ALSA: seq: Fix racy pool initializations

ALSA sequencer core initializes the event pool on demand by invoking
snd_seq_pool_init() when the first write happens and the pool is
empty. Meanwhile user can reset the pool size manually via ioctl
concurrently, and this may lead to UAF or out-of-bound accesses since
the function tries to vmalloc / vfree the buffer.

A simple fix is to just wrap the snd_seq_pool_init() call with the
recently introduced client->ioctl_mutex; as the calls for
snd_seq_pool_init() from other side are always protected with this
mutex, we can avoid the race.

Reported-by: 范龙飞 <long7573@126.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
(cherry picked from commit d15d662e89fc667b90cd294b0eb45694e33144da)

Orabug: 28459728
CVE: CVE-2018-7566

Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

oracleasm: Fix use after free for request processing timer

Orabug: 28506080

Update r->r_elapsed under the spinlock to avoid racing with the
completion code freeing the asm_request.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

oracleasm: Fix incorrectly set flag

Orabug: 28506080

BIP_MAPPED_INTEGRITY does not need to be shifted.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

oracleasm: Fix memory leak

Orabug: 28506080

As part of the ongoing changes to the block layer, the semantics of
bip_for_each_vec() changed from walking the entire bip vec (as originally
designed) to only walking the residual. This led to a memory leak as page
cache pages were not released during I/O completion.

Manually walk the bip vec in oracleasm instead of relying on the block
layer helper.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

oracleasm: Add ENXIO handling

Orabug: 28506080

The I/O stack may return ENXIO in some cases. Treat it like ENODEV.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

oracleasm: Add missing tracepoint

Orabug: 28506080

The tracepoint for bio mapping got dropped during the merge. Reinstate
it.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

oracleasm: Don't assume bip was allocated by oracleasm

Orabug: 28506080

If a user runs an old ASMLIB without integrity support, the bip
attached to a bio is owned by the block layer. We should not attempt
to unmap the pages in question.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>