www.infradead.org Git - users/jedix/linux-maple.git/log

HID: debug: check length before copy_to_user()

If our length is greater than the size of the buffer, we
overflow the buffer
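
As a rough userspace illustration of the kind of check described above (not the kernel code itself), the sketch below clamps a copy length to the space left in the destination before copying. The helper name copy_clamped() and its parameters are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy up to `avail` bytes from `src` into a
 * destination that holds `count` bytes in total, of which `done` bytes
 * are already filled. Returns the number of bytes actually copied.
 */
static size_t copy_clamped(char *dst, size_t count, size_t done,
                           const char *src, size_t avail)
{
    size_t len = avail;

    if (done >= count)
        return 0;               /* destination is already full */
    if (len > count - done)
        len = count - done;     /* clamp before copying, never after */
    memcpy(dst + done, src, len);
    return len;
}
```

The point of the sketch is only the ordering: the length is bounded first, and the copy uses the bounded value.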

Cc: stable@vger.kernel.org
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Reviewed-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
(cherry picked from commit 717adfdaf14704fd3ec7fa2c04520c0723247eac)
Orabug: 29128165
CVE: CVE-2018-9516
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: John.Donnelly <John.p.donnelly@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/MCE: Serialize sysfs changes

The check_interval file in

/sys/devices/system/machinecheck/machinecheck<cpu number>

directory is a global timer value for MCE polling. If it is changed by one
CPU, mce_restart() broadcasts the event to other CPUs to delete and restart
the MCE polling timer and __mcheck_cpu_init_timer() reinitializes the
mce_timer variable.

If more than one CPU writes a specific value to the check_interval file
concurrently, mce_timer is not protected from such concurrent accesses and
all kinds of explosions happen. Since only root can write to those sysfs
variables, the issue is not a big deal security-wise.

However, concurrent writes to these configuration variables are void of
reason, so the proper thing to do is to serialize the access with a mutex.

Boris:

- Make store_int_with_restart() use device_store_ulong() to filter out
negative intervals
- Limit min interval to 1 second
- Correct locking
- Massage commit message
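
The three behavioural points above (reject negative input, enforce a 1 second minimum, serialize the restart) can be modelled in userspace roughly as follows. This is a sketch under assumptions: a pthread mutex stands in for the kernel mutex, and sysfs_mutex, restart_polling() and store_check_interval() are hypothetical names, not the kernel's.

```c
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* All names below are hypothetical stand-ins for the kernel objects. */
static pthread_mutex_t sysfs_mutex = PTHREAD_MUTEX_INITIALIZER;
static unsigned long check_interval = 300; /* seconds; starting value assumed */
static int restarts;                       /* counts simulated timer restarts */

/* Stand-in for the restart path: must never run concurrently with itself. */
static void restart_polling(void)
{
    restarts++;
}

static int store_check_interval(const char *buf)
{
    char *end;
    unsigned long val;

    /* strtoul() would silently wrap "-5" to a huge value, so reject it. */
    if (strchr(buf, '-'))
        return -EINVAL;
    errno = 0;
    val = strtoul(buf, &end, 10);
    if (errno || end == buf)
        return -EINVAL;
    if (val < 1)
        val = 1;                /* minimum interval: 1 second */

    /* Only one writer at a time may update the value and restart. */
    pthread_mutex_lock(&sysfs_mutex);
    check_interval = val;
    restart_polling();
    pthread_mutex_unlock(&sysfs_mutex);
    return 0;
}
```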

Signed-off-by: Seunghun Han <kkamagui@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20180302202706.9434-1-kkamagui@gmail.com
(cherry picked from commit b3b7c4795ccab5be71f080774c45bbbcc75c2aaf)

Orabug: 29149888
CVE: CVE-2018-7995

Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

Input: i8042 - fix crash at boot time

The driver checks port->exists twice in i8042_interrupt(), first when
trying to assign temporary "serio" variable, and second time when deciding
whether it should call serio_interrupt(). The value of port->exists may
change between the 2 checks, and we may end up calling serio_interrupt()
with a NULL pointer:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
IP: [<ffffffff8150feaf>] _spin_lock_irqsave+0x1f/0x40
PGD 0
Oops: 0002 [#1] SMP
last sysfs file:
CPU 0
Modules linked in:

Pid: 1, comm: swapper Not tainted 2.6.32-358.el6.x86_64 #1 QEMU Standard PC (i440FX + PIIX, 1996)
RIP: 0010:[<ffffffff8150feaf>]  [<ffffffff8150feaf>] _spin_lock_irqsave+0x1f/0x40
RSP: 0018:ffff880028203cc0  EFLAGS: 00010082
RAX: 0000000000010000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000282 RSI: 0000000000000098 RDI: 0000000000000050
RBP: ffff880028203cc0 R08: ffff88013e79c000 R09: ffff880028203ee0
R10: 0000000000000298 R11: 0000000000000282 R12: 0000000000000050
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000098
FS:  0000000000000000(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000050 CR3: 0000000001a85000 CR4: 00000000001407f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 1, threadinfo ffff88013e79c000, task ffff88013e79b500)
Stack:
ffff880028203d00 ffffffff813de186 ffffffffffffff02 0000000000000000
<d> 0000000000000000 0000000000000000 0000000000000000 0000000000000098
<d> ffff880028203d70 ffffffff813e0162 ffff880028203d20 ffffffff8103b8ac
Call Trace:
<IRQ>
[<ffffffff813de186>] serio_interrupt+0x36/0xa0
[<ffffffff813e0162>] i8042_interrupt+0x132/0x3a0
[<ffffffff8103b8ac>] ? kvm_clock_read+0x1c/0x20
[<ffffffff8103b8b9>] ? kvm_clock_get_cycles+0x9/0x10
[<ffffffff810e1640>] handle_IRQ_event+0x60/0x170
[<ffffffff8103b154>] ? kvm_guest_apic_eoi_write+0x44/0x50
[<ffffffff810e3d8e>] handle_edge_irq+0xde/0x180
[<ffffffff8100de89>] handle_irq+0x49/0xa0
[<ffffffff81516c8c>] do_IRQ+0x6c/0xf0
[<ffffffff8100b9d3>] ret_from_intr+0x0/0x11
[<ffffffff81076f63>] ? __do_softirq+0x73/0x1e0
[<ffffffff8109b75b>] ? hrtimer_interrupt+0x14b/0x260
[<ffffffff8100c1cc>] ? call_softirq+0x1c/0x30
[<ffffffff8100de05>] ? do_softirq+0x65/0xa0
[<ffffffff81076d95>] ? irq_exit+0x85/0x90
[<ffffffff81516d80>] ? smp_apic_timer_interrupt+0x70/0x9b
[<ffffffff8100bb93>] ? apic_timer_interrupt+0x13/0x20

To avoid the issue let's change the second check to test whether serio is
NULL or not.

Also, let's take i8042_lock in i8042_start() and i8042_stop() instead of
trying to be overly smart and using memory barriers.
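
The "test the pointer you actually sampled" idea can be sketched in plain C as below. The struct layouts and the deliver() helper are simplified, hypothetical stand-ins; in the driver the equivalent code runs under i8042_lock.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified, hypothetical stand-ins for the driver's structures. */
struct serio { int id; };
struct port  { bool exists; struct serio *serio; };

/*
 * Sample the port state once into a local pointer, then decide based on
 * that local. Re-reading port->exists later could disagree with the
 * pointer obtained earlier if the port changed in between.
 */
static int deliver(const struct port *port, int *delivered_id)
{
    struct serio *serio = port->exists ? port->serio : NULL;

    if (serio == NULL)          /* second check tests serio, not exists */
        return 0;
    *delivered_id = serio->id;
    return 1;
}
```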

Signed-off-by: Chen Hong <chenhong3@huawei.com>
[dtor: take lock in i8042_start()/i8042_stop()]
Cc: stable@vger.kernel.org
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
(cherry picked from commit 340d394a789518018f834ff70f7534fc463d3226)

Orabug: 29152328
CVE: CVE-2017-18079

Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

base/memory, hotplug: fix a kernel oops in show_valid_zones()

Reading a sysfs "memoryN/valid_zones" file leads to the following oops
when the first page of a range is not backed by struct page.
show_valid_zones() assumes that 'start_pfn' is always valid for
page_zone().

BUG: unable to handle kernel paging request at ffffea017a000000
IP: show_valid_zones+0x6f/0x160

This issue may happen on x86-64 systems with 64GiB or more memory since
their memory block size is bumped up to 2GiB.  [1] An example of such
systems is described below.  0x3240000000 is only aligned by 1GiB and
this memory block starts from 0x3200000000, which is not backed by
struct page.

BIOS-e820: [mem 0x0000003240000000-0x000000603fffffff] usable

Since test_pages_in_a_zone() already checks holes, fix this issue by
extending this function to return 'valid_start' and 'valid_end' for a
given range.  show_valid_zones() then proceeds with the valid range.
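
A minimal model of "return the valid sub-range of a block that starts with a hole" is sketched below. It is an illustration only: the present[] array, the function name find_valid_range() and the omission of the same-zone test are all assumptions made to keep the example small.

```c
#include <stdbool.h>

/*
 * Hypothetical model: present[pfn] says whether a pfn is backed by a
 * struct page. Finds the first and last backed pfn in [start, end) and
 * reports them as the half-open range [*valid_start, *valid_end).
 * Returns false when nothing in the range is backed.
 */
static bool find_valid_range(const bool *present,
                             unsigned long start, unsigned long end,
                             unsigned long *valid_start,
                             unsigned long *valid_end)
{
    unsigned long pfn, first = end, last = end;

    for (pfn = start; pfn < end; pfn++) {
        if (!present[pfn])
            continue;           /* hole: skip, do not dereference */
        if (first == end)
            first = pfn;
        last = pfn + 1;
    }
    if (first == end)
        return false;
    *valid_start = first;
    *valid_end = last;
    return true;
}
```

A caller modelled on show_valid_zones() would then only look at pages inside the returned range instead of assuming the first pfn of the block is usable.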

Orabug: 29050538

[1] 'Commit bdee237c0343 ("x86: mm: Use 2GB memory block size on
    large-memory x86-64 systems")'

Link: http://lkml.kernel.org/r/20170127222149.30893-3-toshi.kani@hpe.com
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Zhang Zhen <zhenzhang.zhang@huawei.com>
Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@vger.kernel.org> [4.4+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit a96dfddbcc04336bbed50dc2b24823e45e09e80c)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Larry Bassel <larry.bassel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/base/memory.c
(retained existing show_valid_zones() code and modified
to use valid pfns)

Signed-off-by: Brian Maly <brian.maly@oracle.com>

mm/memory_hotplug.c: check start_pfn in test_pages_in_a_zone()

Patch series "fix a kernel oops when reading sysfs valid_zones", v2.

A sysfs memory file is created for each 2GiB memory block on x86-64 when
the system has 64GiB or more memory.  [1] When the start address of a
memory block is not backed by struct page, i.e.  a memory range is not
aligned by 2GiB, reading its 'valid_zones' attribute file leads to a
kernel oops.  This issue was observed on multiple x86-64 systems with
more than 64GiB of memory.  This patch-set fixes this issue.

Patch 1 first fixes an issue in test_pages_in_a_zone(), which does not
test the start section.

Patch 2 then fixes the kernel oops by extending test_pages_in_a_zone()
to return valid [start, end).

Note for stable kernels: The memory block size change was made by commit
bdee237c0343 ("x86: mm: Use 2GB memory block size on large-memory x86-64
systems"), which was accepted to 3.9.  However, this patch-set depends
on (and fixes) the change to test_pages_in_a_zone() made by commit
5f0f2887f4de ("mm/memory_hotplug.c: check for missing sections in
test_pages_in_a_zone()"), which was accepted to 4.4.

So, I recommend that we backport it up to 4.4.

[1] 'Commit bdee237c0343 ("x86: mm: Use 2GB memory block size on
    large-memory x86-64 systems")'

This patch (of 2):

test_pages_in_a_zone() does not check 'start_pfn' when it is aligned by
section since 'sec_end_pfn' is set equal to 'pfn'.  Since this function
is called for testing the range of a sysfs memory file, 'start_pfn' is
always aligned by section.

Fix it by properly setting 'sec_end_pfn' to the next section pfn.

Also make sure that this function returns 1 only when the range belongs
to a zone.
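
The alignment problem described above is easy to see with a little arithmetic. The sketch assumes 128 MiB sections with 4 KiB pages (32768 pages per section) and assumes the next section boundary is obtained by rounding up pfn + 1; the two helper names are hypothetical.

```c
/* Assumption: 128 MiB sections and 4 KiB pages, i.e. 2^15 pages each. */
#define PAGES_PER_SECTION 32768UL
#define SECTION_ALIGN_UP(pfn) \
    (((pfn) + PAGES_PER_SECTION - 1) & ~(PAGES_PER_SECTION - 1))

/* Rounding the start pfn itself: an aligned pfn maps to itself. */
static unsigned long sec_end_rounding_start(unsigned long pfn)
{
    return SECTION_ALIGN_UP(pfn);
}

/* Rounding pfn + 1: always yields the start of the next section. */
static unsigned long sec_end_next_section(unsigned long pfn)
{
    return SECTION_ALIGN_UP(pfn + 1);
}
```

With a section-aligned start such as pfn 32768, the first form gives 32768 (an empty first section, so it is never tested), while the second gives 65536.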

Orabug: 29050538

Link: http://lkml.kernel.org/r/20170127222149.30893-2-toshi.kani@hpe.com
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Andrew Banman <abanman@sgi.com>
Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
Cc: Greg KH <greg@kroah.com>
Cc: <stable@vger.kernel.org> [4.4+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit deb88a2a19e85842d79ba96b05031739ec327ff4)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Larry Bassel <larry.bassel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

drivers/base/memory.c: prohibit offlining of memory blocks with missing sections

Commit bdee237c0343 ("x86: mm: Use 2GB memory block size on large-memory
x86-64 systems") and 982792c782ef ("x86, mm: probe memory block size for
generic x86 64bit") introduced large block sizes for x86. This made it
possible to have multiple sections per memory block where previously,
there was only ever one section per block.

Since blocks consist of contiguous ranges of sections, there can be holes
in the blocks where sections are not present. If one attempts to
offline such a block, a crash occurs since the code is not designed to
deal with this.

This patch is a quick fix to guard against the crash by not allowing
blocks with non-present sections to be offlined.
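
The guard amounts to a presence check over every section of the block before offlining is attempted. A minimal sketch, with a hypothetical section_present[] array and function name, and -EINVAL chosen as an assumed error code:

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical model: refuse to offline a block if any of its `nr`
 * sections, starting at `start_sec`, is not present.
 */
static int check_block_offlinable(const bool *section_present,
                                  unsigned long start_sec,
                                  unsigned long nr)
{
    unsigned long i;

    for (i = 0; i < nr; i++) {
        if (!section_present[start_sec + i])
            return -EINVAL;     /* hole in the block: do not offline */
    }
    return 0;
}
```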

Addresses https://bugzilla.kernel.org/show_bug.cgi?id=107781

Orabug: 29050538

Signed-off-by: Seth Jennings <sjennings@variantweb.net>
Reported-by: Andrew Banman <abanman@sgi.com>
Cc: Daniel J Blueman <daniel@numascale.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Greg KH <greg@kroah.com>
Cc: Russ Anderson <rja@sgi.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 26bbe7ef6d5cdc7ec08cba6d433fca4060f258f3)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Larry Bassel <larry.bassel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

mm: Check if section present during memory block (un)registering

Tony found that, on his setup, a memory block size of 512M causes a
crash during boot.

BUG: unable to handle kernel paging request at ffffea0074000020
IP: [<ffffffff81670527>] get_nid_for_pfn+0x17/0x40
PGD 128ffcb067 PUD 128ffc9067 PMD 0
Oops: 0000 [#1] SMP
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.2.0-rc8 #1
...
Call Trace:
  [<ffffffff81453b56>] ? register_mem_sect_under_node+0x66/0xe0
  [<ffffffff81453eeb>] register_one_node+0x17b/0x240
  [<ffffffff81b1f1ed>] ? pci_iommu_alloc+0x6e/0x6e
  [<ffffffff81b1f229>] topology_init+0x3c/0x95
  [<ffffffff8100213d>] do_one_initcall+0xcd/0x1f0

The system has non-contiguous RAM address ranges:
BIOS-e820: [mem 0x0000001300000000-0x0000001cffffffff] usable
BIOS-e820: [mem 0x0000001d70000000-0x0000001ec7ffefff] usable
BIOS-e820: [mem 0x0000001f00000000-0x0000002bffffffff] usable
BIOS-e820: [mem 0x0000002c18000000-0x0000002d6fffefff] usable
BIOS-e820: [mem 0x0000002e00000000-0x00000039ffffffff] usable

So the starting sections of some memory blocks are not present.
For example:
memory block : [0x2c18000000, 0x2c20000000) 512M
The first three sections are not present.

The current register_mem_sect_under_node() assumes the first section is
present, but the memory block section number range
[start_section_nr, end_section_nr] can include sections that are not present.

For architectures that support vmemmap, we don't set up the memmap (struct
page area) for sections that are not present.

So skip the pfn ranges that belong to absent sections.

Also fix unregister_mem_sect_under_nodes(), which assumes one section per
memory block.

Orabug: 29050538

Reported-by: Tony Luck <tony.luck@intel.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Fixes: bdee237c0343 ("x86: mm: Use 2GB memory block size on large memory x86-64 systems")
Fixes: 982792c782ef ("x86, mm: probe memory block size for generic x86 64bit")
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: stable@vger.kernel.org #v3.15
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 7568fb63f57ac8672f8bf2018171255441238882)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Larry Bassel <larry.bassel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

hugetlb: take PMD sharing into account when flushing tlb/caches

When fixing an issue with PMD sharing and migration, it was discovered
via code inspection that other callers of huge_pmd_unshare potentially
have an issue with cache and tlb flushing.

Use the routine adjust_range_if_pmd_sharing_possible() to calculate
worst case ranges for mmu notifiers. Ensure that this range is flushed
if huge_pmd_unshare succeeds and unmaps a PUD_SIZE area.
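
To make "worst case range" concrete, the sketch below widens an address range to PUD_SIZE boundaries whenever a fully contained, shareable PUD-sized region overlaps it. It is a simplified model, not the kernel routine: struct vma_model, the may_share flag and the function name are hypothetical, PUD_SIZE is taken as 1GB as on x86, and a 64-bit unsigned long is assumed.

```c
#include <stdbool.h>

#define PUD_SIZE (1UL << 30)            /* 1GB, as on x86 */
#define PUD_MASK (~(PUD_SIZE - 1))

/* Hypothetical, stripped-down view of a mapping. */
struct vma_model {
    unsigned long vm_start;
    unsigned long vm_end;
    bool may_share;                     /* shared mapping, sharing possible */
};

/*
 * Widen [*start, *end) so that it covers every PUD-sized, PUD-aligned
 * region that overlaps it and lies entirely inside the mapping. Only
 * such regions can have a shared PMD page that might get unshared.
 */
static void widen_range_for_pmd_sharing(const struct vma_model *vma,
                                        unsigned long *start,
                                        unsigned long *end)
{
    unsigned long addr;

    if (!vma->may_share)
        return;

    for (addr = *start & PUD_MASK; addr < *end; addr += PUD_SIZE) {
        unsigned long a_start = addr;
        unsigned long a_end = addr + PUD_SIZE;

        if (a_start < vma->vm_start || a_end > vma->vm_end)
            continue;                   /* region not fully mapped */
        if (a_start < *start)
            *start = a_start;
        if (a_end > *end)
            *end = a_end;
    }
}
```

Notifiers and flushes issued with the widened range then cover the whole area that an unshare could tear down, at the cost of sometimes flushing more than strictly needed.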

Based on upstream dff11abe280b. Ported to UEK4.

Orabug: 28951854

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Larry Bassel <larry.bassel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

mm: migration: fix migration of huge PMD shared pages

The page migration code employs try_to_unmap() to try and unmap the
source page.  This is accomplished by using rmap_walk to find all
vmas where the page is mapped.  This search stops when page mapcount
is zero.  For shared PMD huge pages, the page map count is always 1
no matter the number of mappings.  Shared mappings are tracked via
the reference count of the PMD page.  Therefore, try_to_unmap stops
prematurely and does not completely unmap all mappings of the source
page.

This problem can result in data corruption as writes to the original
source page can happen after contents of the page are copied to the
target page.  Hence, data is lost.

This problem was originally seen as DB corruption of shared global
areas after a huge page was soft offlined due to ECC memory errors.
DB developers noticed they could reproduce the issue by (hotplug)
offlining memory used to back huge pages.  A simple testcase can
reproduce the problem by creating a shared PMD mapping (note that
this must be at least PUD_SIZE in size and PUD_SIZE aligned (1GB on
x86)), and using migrate_pages() to migrate process pages between
nodes while continually writing to the huge pages being migrated.

To fix, have the try_to_unmap_one routine check for huge PMD sharing
by calling huge_pmd_unshare for hugetlbfs huge pages.  If it is a
shared mapping it will be 'unshared' which removes the page table
entry and drops the reference on the PMD page.  After this, flush
caches and TLB.

mmu notifiers are called before locking page tables, but we can not
be sure of PMD sharing until page tables are locked.  Therefore,
check for the possibility of PMD sharing before locking so that
notifiers can prepare for the worst possible case.  The mmu notifier
calls in this commit are different than upstream.  That is because
upstream went to a different model here.  Instead of moving to the
new model, we leave existing model unchanged and only use the
mmu_*range* calls in this special case.

Based on upstream 017b1660df89.  Ported to UEK4.

Orabug: 28951854

Fixes: 39dde65c9940 ("shared page table for hugetlb page")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Larry Bassel <larry.bassel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

hugetlbfs: use truncate mutex to prevent pmd sharing race

The synchronization mechanism for hugetlbfs pagefaults/truncation and
pmd sharing ideally needs to be modified to use i_mmap_rwsem. See:
http://lkml.kernel.org/r/20181024045053.1467-1-mike.kravetz@oracle.com

In UEK, we have introduced a hugetlbfs truncate mutex in an inode
extension. By taking this mutex earlier in hugetlb_fault (before calling
huge_pte_alloc), we eliminate the most common cause of problems where
ptep can be altered by a call to huge_pmd_unshare.

Orabug: 28896255

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Larry Bassel <larry.bassel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

rds: ib: Improve tracing during failover/back

Orabug: 28860366

Signed-off-by: Håkon Bugge <Haakon.Bugge@oracle.com>
Reviewed-by: Sudhakar Dindukurti <sudhakar.dindukurti@oracle.com>
---

v1 -> v2:
* Added Sudhakar's r-b

Signed-off-by: Brian Maly <brian.maly@oracle.com>

rds: ib: Remove superfluous add of address on fail-back device

During failover, we see in the ibacm log:

acm_ipnl_handler: Link added : ib0
acm_ipnl_handler: System address removed ib0 : 192.168.200.200
acm_ipnl_handler: New system address available ib1 : 192.168.200.200
acm_ipnl_handler: System address removed ib1 : 192.168.200.200
acm_ipnl_handler: New system address available ib1 : 192.168.200.200

and everything is OK. Fail-back:

acm_ipnl_handler: Link added : ib0
acm_ipnl_handler: New system address available ib0 : 192.168.200.200
acm_ipnl_handler: System address removed ib0 : 192.168.200.200
acm_ipnl_handler: New system address available ib0 : 192.168.200.200
acm_ipnl_handler: System address removed ib1 : 192.168.200.200

The address is moved from ib1 to ib0, thereafter deleted.

This implies that ibacm loses the address when it's moved back to the
original device.

With this patch, we see:

acm_ipnl_handler: System address removed ib0 : 192.168.200.200
acm_ipnl_handler: New system address available ib1 : 192.168.200.200
acm_ipnl_handler: System address removed ib1 : 192.168.200.200
acm_ipnl_handler: New system address available ib1 : 192.168.200.200
acm_ipnl_handler: Link added : ib0
acm_ipnl_handler: System address removed ib1 : 192.168.200.200
acm_ipnl_handler: New system address available ib0 : 192.168.200.200
acm_ipnl_handler: System address removed ib0 : 192.168.200.200
acm_ipnl_handler: New system address available ib0 : 192.168.200.200

The first lines are failover, after the "Link added : ib0", it's
fail-back (which is done 10 seconds after link up).

Now we see that the fail-back address is properly restored.

Orabug: 28860366

Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Sudhakar Dindukurti <sudhakar.dindukurti@oracle.com>
---

v1 -> v2:
* Changed $Subject
* Added Sudhakar's r-b

Signed-off-by: Brian Maly <brian.maly@oracle.com>

libiscsi: Fix NULL pointer dereference in iscsi_eh_session_reset

This commit addresses NULL pointer dereference in iscsi_eh_session_reset.
Reference should not be made to session->leadconn when session->state
is set to ISCSI_STATE_TERMINATE.

Orabug: 28946207

Signed-off-by: Fred Herard <fred.herard@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 315b38414a1a6830740d0bf27eab034c989f7563)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/scsi/libiscsi.c

Signed-off-by: Brian Maly <brian.maly@oracle.com>

wil6210: missing length check in wmi_set_ie

Add a length check in wmi_set_ie to detect unsigned integer
overflow.
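
A wrap-around check of this kind can be sketched as follows. The header layout and the helper name checked_cmd_len() are hypothetical; the idea is simply that if a 16-bit total ends up smaller than one of its addends, the addition wrapped.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical 4-byte command header preceding the IE payload. */
struct cmd_hdr {
    uint8_t  frame_type;
    uint8_t  reserved;
    uint16_t ie_len;
};

/*
 * Compute header + payload length in 16 bits. If the sum is smaller
 * than the payload length, the addition wrapped and the request must
 * be rejected before any allocation or copy uses the bogus length.
 */
static int checked_cmd_len(uint16_t ie_len, uint16_t *out_len)
{
    uint16_t len = (uint16_t)(sizeof(struct cmd_hdr) + ie_len);

    if (len < ie_len)
        return -EINVAL;
    *out_len = len;
    return 0;
}
```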

Signed-off-by: Lior David <qca_liord@qca.qualcomm.com>
Signed-off-by: Maya Erez <qca_merez@qca.qualcomm.com>
Signed-off-by: Kalle Valo <kvalo@qca.qualcomm.com>
(cherry picked from commit b5a8ffcae4103a9d823ea3aa3a761f65779fbe2a)

Orabug: 28951265
CVE: CVE-2018-5848

Signed-off-by: Dan Duval <dan.duval@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Conflict:

drivers/net/wireless/ath/wil6210/wmi.c

Signed-off-by: Brian Maly <brian.maly@oracle.com>

netfilter: xt_osf: Add missing permission checks

The capability check in nfnetlink_rcv() verifies that the caller
has CAP_NET_ADMIN in the namespace that "owns" the netlink socket.
However, xt_osf_fingers is shared by all net namespaces on the
system.  An unprivileged user can create user and net namespaces
in which he holds CAP_NET_ADMIN to bypass the netlink_net_capable()
check:

    vpnns -- nfnl_osf -f /tmp/pf.os

    vpnns -- nfnl_osf -f /tmp/pf.os -d

These non-root operations successfully modify the systemwide OS
fingerprint list.  Add new capable() checks so that they can't.

Signed-off-by: Kevin Cernekee <cernekee@chromium.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
(cherry picked from commit 916a27901de01446bcf57ecca4783f6cff493309)

Orabug: 29037831
CVE: CVE-2017-17450

Signed-off-by: John Donnelly <john.p.donnelly@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/speculation: Fix bad argument to rdmsrl() in cpu_set_bug_bits()

At the beginning of cpu_set_bug_bits(), rdmsrl() is incorrectly
passed as its first argument the value of X86_FEATURE_IA32_ARCH_CAPS,
which is a CPUID feature bit and not a valid MSR value. The correct
parameter to pass in the first argument to rdmsrl() is
MSR_IA32_ARCH_CAPABILITIES (0x10a).

The value returned by rdmsrl(), specifically the RDCL_NO bit, is
later used to determine if the CPU is vulnerable to L1TF and
Meltdown exploits.

Orabug: 29044805

Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

n_tty: fix EXTPROC vs ICANON interaction with TIOCINQ (aka FIONREAD)

Orabug: 28855335

We added support for EXTPROC back in 2010 in commit 26df6d13406d ("tty:
Add EXTPROC support for LINEMODE") and the intent was to allow it to
override some (all?) ICANON behavior.  Quoting from that original commit
message:

         There is a new bit in the termios local flag word, EXTPROC.
         When this bit is set, several aspects of the terminal driver
         are disabled.  Input line editing, character echo, and mapping
         of signals are all disabled.  This allows the telnetd to turn
         off these functions when in linemode, but still keep track of
         what state the user wants the terminal to be in.

but the problem turns out that "several aspects of the terminal driver
are disabled" is a bit ambiguous, and you can really confuse the n_tty
layer by setting EXTPROC and then causing some of the ICANON invariants
to no longer be maintained.

This fixes at least one such case (TIOCINQ) becoming unhappy because of
the confusion over whether ICANON really means ICANON when EXTPROC is set.

This basically makes TIOCINQ match the case of read: if EXTPROC is set,
we ignore ICANON.  Also, make sure to reset the ICANON state if EXTPROC
changes, not just if ICANON changes.
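
The rule "if EXTPROC is set, ignore ICANON" boils down to one predicate used consistently by every path that cares about canonical mode. The sketch below models it with hypothetical flag values and helper names; the byte counts are passed in rather than computed from a real tty buffer.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits standing in for the termios local flags. */
#define FLAG_ICANON  0x1u
#define FLAG_EXTPROC 0x2u

/* Canonical processing is in effect only when EXTPROC is not set. */
static bool canon_mode_active(unsigned int lflag)
{
    return (lflag & FLAG_ICANON) && !(lflag & FLAG_EXTPROC);
}

/*
 * Model of an "input queue size" query: report complete-line bytes in
 * canonical mode, and all buffered bytes otherwise, using the same
 * predicate that the read path would use.
 */
static size_t inq_bytes(unsigned int lflag, size_t raw_bytes,
                        size_t complete_line_bytes)
{
    return canon_mode_active(lflag) ? complete_line_bytes : raw_bytes;
}
```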

Fixes: 26df6d13406d ("tty: Add EXTPROC support for LINEMODE")
Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Reported-by: syzkaller <syzkaller@googlegroups.com>
Cc: Jiri Slaby <jslaby@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 966031f340185eddd05affcf72b740549f056348)
CVE: CVE-2018-18386
Signed-off-by: Dan Duval <dan.duval@oracle.com>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

nfs: Don't take a reference on fl->fl_file for LOCK operation

I have reports of a crash that looks like __fput() was called twice for
a NFSv4.0 file. It seems possible that the state manager could try to
reclaim a lock and take a reference on the fl->fl_file at the same time the
file is being released if, during the close(), a signal interrupts the wait
for outstanding IO while removing locks which then skips the removal
of that lock.

Since 83bfff23e9ed ("nfs4: have do_vfs_lock take an inode pointer") has
removed the need to traverse fl->fl_file->f_inode in nfs4_lock_done(),
taking that reference is no longer necessary.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
(cherry picked from commit 4b09ec4b14a168bf2c687e1f598140c3c11e9222)

Orabug: 28887442
Signed-off-by: Shuning Zhang <sunny.s.zhang@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/topology: Update the 'cpu cores' field in /proc/cpuinfo correctly across CPU hotplug operations

Without this fix, /proc/cpuinfo will display an incorrect number
of CPU cores after bringing them offline and online again, as
exemplified below:

  $ cat /proc/cpuinfo | grep cores
  cpu cores : 4
  cpu cores : 8
  cpu cores : 8
  cpu cores : 20
  cpu cores : 4
  cpu cores : 3
  cpu cores : 2
  cpu cores : 2

This patch fixes this by always zeroing the booted_cores variable
upon turning off a logical CPU.

Tested-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: jgross@suse.com
Cc: luto@kernel.org
Cc: prarit@redhat.com
Cc: vkuznets@redhat.com
Link: http://lkml.kernel.org/r/20180221205036.5244-1-sneves@dei.uc.pt
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 4596749339e06dc7a424fc08a15eded850ed78b7)

Orabug: 28933009

Signed-off-by: Honglei Wang <honglei.wang@oracle.com>
Reviewed-by: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

ALSA: seq: Fix regression by incorrect ioctl_mutex usages

This is the revised backport of the upstream commit
b3defb791b26ea0683a93a4f49c77ec45ec96f10

We had another backport (e.g. 623e5c8ae32b in 4.4.115), but it applies
the new mutex also to the code paths that are invoked via faked
kernel-to-kernel ioctls. As reported recently, this leads to a
deadlock at suspend (or other scenarios triggering the kernel
sequencer client).

This patch addresses the issue by taking the mutex only in the code
paths invoked by user-space, just like the original fix patch does.

Reported-and-tested-by: Andres Bertens <abertensu@yahoo.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Orabug: 29005188
CVE: CVE-2018-1000004

(cherry picked from commit 8e8992a93d66adb640631a6778a5110f01118202)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

net: phy: mdio-bcm-unimac: fix potential NULL dereference in unimac_mdio_probe()

platform_get_resource() may fail and return NULL, so we should
check its return value to avoid a NULL pointer dereference
a bit later in the code.

This is detected by Coccinelle semantic patch.

@@
expression pdev, res, n, t, e, e1, e2;
@@

res = platform_get_resource(pdev, t, n);
+ if (!res)
+ return -EINVAL;
... when != res == NULL
e = devm_ioremap(e1, res->start, e2);

Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 297a6961ffb8ff4dc66c9fbf53b924bd1dda05d5)

Orabug: 29012346
CVE: CVE-2018-8043

Signed-off-by: John Donnelly <john.p.donnelly@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

xfs: don't call xfs_da_shrink_inode with NULL bp

xfs_attr3_leaf_create may have errored out before instantiating a buffer,
for example if the blkno is out of range. In that case there is no work
to do to remove it, and in fact xfs_da_shrink_inode will lead to an oops
if we try.

This also seems to fix a flaw where the original error from
xfs_attr3_leaf_create gets overwritten in the cleanup case, and it
removes a pointless assignment to bp which isn't used after this.
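
The error-path pattern described above (clean up only what was actually created, and report the original error) can be sketched in userspace like this. Everything here is a hypothetical model: leaf_create(), shrink() and convert() are stand-ins, and the error codes are chosen only to distinguish the two failure points.

```c
#include <errno.h>
#include <stddef.h>

struct buf { int blkno; };

static int shrink_calls; /* counts cleanup attempts, for illustration */

/* Hypothetical creator: may fail before it instantiates a buffer. */
static int leaf_create(int blkno, struct buf *storage, struct buf **bpp)
{
    *bpp = NULL;
    if (blkno < 0)
        return -EINVAL;          /* failed early: *bpp stays NULL */
    storage->blkno = blkno;
    *bpp = storage;
    if (blkno == 0)
        return -ENOSPC;          /* failed late: a buffer exists */
    return 0;
}

/* Hypothetical cleanup: would crash in real code if handed NULL. */
static int shrink(struct buf *bp)
{
    shrink_calls++;
    bp->blkno = -1;
    return 0;
}

static int convert(int blkno)
{
    struct buf storage;
    struct buf *bp = NULL;
    int error = leaf_create(blkno, &storage, &bp);

    if (error) {
        /* Undo only if something was instantiated. */
        if (bp != NULL)
            (void)shrink(bp);
        return error;            /* keep the original error code */
    }
    return 0;
}
```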

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199969
Reported-by: Xu, Wen <wen.xu@gatech.edu>
Tested-by: Xu, Wen <wen.xu@gatech.edu>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
(cherry picked from commit bb3d48dcf86a97dc25fe9fc2c11938e19cb4399a)

Orabug: 28898616
CVE: CVE-2018-13094

Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

ALSA: rawmidi: Change resized buffers atomically

The SNDRV_RAWMIDI_IOCTL_PARAMS ioctl may resize the buffers and the
current code is racy. For example, the sequencer client may write to
the buffer while it is being resized.

As a simple workaround, let's switch to the resized buffer inside the
stream runtime lock.
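
One way to picture "switch to the resized buffer inside the lock" is the allocate, swap, free sequence below. It is a userspace sketch under assumptions: a pthread mutex stands in for the stream runtime lock, and struct runtime and resize_buffer() are hypothetical names.

```c
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for the stream runtime state. */
struct runtime {
    pthread_mutex_t lock;
    char *buffer;
    size_t buffer_size;
};

static int resize_buffer(struct runtime *rt, size_t new_size)
{
    char *newbuf, *oldbuf;

    if (new_size == 0)
        return -EINVAL;
    if (new_size == rt->buffer_size)
        return 0;

    /* Allocate outside the lock; allocation may be slow. */
    newbuf = calloc(1, new_size);
    if (newbuf == NULL)
        return -ENOMEM;

    /* Pointer and size change together, so no reader sees a mix. */
    pthread_mutex_lock(&rt->lock);
    oldbuf = rt->buffer;
    rt->buffer = newbuf;
    rt->buffer_size = new_size;
    pthread_mutex_unlock(&rt->lock);

    free(oldbuf);               /* old buffer is unreachable by now */
    return 0;
}
```

Any writer that takes the same lock before touching the buffer sees either the old pointer with the old size or the new pointer with the new size, never a mismatched pair.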

Reported-by: syzbot+52f83f0ea8df16932f7f@syzkaller.appspotmail.com
Cc: <stable@vger.kernel.org>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
(cherry picked from commit 39675f7a7c7e7702f7d5341f1e0d01db746543a0)

Orabug: 28898636
CVE: CVE-2018-10902

Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

md/raid5: fix a race condition in stripe batch

We have a race condition in the scenario below. Say we have 3 contiguous
stripes, sh1, sh2 and sh3, where sh1 is the stripe_head of sh2 and sh3:

CPU1                            CPU2                            CPU3
handle_stripe(sh3)
                                stripe_add_to_batch_list(sh3)
                                -> lock(sh2, sh3)
                                -> lock batch_lock(sh1)
                                -> add sh3 to batch_list of sh1
                                -> unlock batch_lock(sh1)
                                                                clear_batch_ready(sh1)
                                                                -> lock(sh1) and batch_lock(sh1)
                                                                -> clear STRIPE_BATCH_READY for all stripes in batch_list
                                                                -> unlock(sh1) and batch_lock(sh1)
->clear_batch_ready(sh3)
-->test_and_clear_bit(STRIPE_BATCH_READY, sh3)
--->return 0 as sh->batch == NULL
                                -> sh3->batch_head = sh1
                                -> unlock (sh2, sh3)

On CPU1, handle_stripe will continue to handle sh3 even though it is in the
batch stripe list of sh1. By moving the sh3->batch_head assignment under
batch_lock, we make it impossible to clear STRIPE_BATCH_READY before
batch_head is set.

Thanks Stephane for helping debug this tricky issue.

Reported-and-tested-by: Stephane Thiell <sthiell@stanford.edu>
Cc: stable@vger.kernel.org (v4.1+)
Signed-off-by: Shaohua Li <shli@fb.com>
(cherry pick from upstream commit 3664847d95e60a9a943858b7800f8484669740fc)

Orabug: 28917012

Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

xfs: don't fail when converting shortform attr to long form during ATTR_REPLACE

Kanda Motohiro reported that expanding a tiny xattr into a large xattr
fails on XFS because we remove the tiny xattr from a shortform fork and
then try to re-add it after converting the fork to extents format having
not removed the ATTR_REPLACE flag. This fails because the attr is no
longer present, causing a fs shutdown.

This is derived from the patch in his bug report, but we really
shouldn't ignore a nonzero retval from the remove call.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199119
Reported-by: kanda.motohiro@gmail.com
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
(cherry picked from commit 7b38460dc8e4eafba06c78f8e37099d3b34d473c)

Orabug: 28924091
CVE: CVE-2018-18690

Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

certs: Add Oracle's new X509 cert into the kernel keyring

Add Oracle's new code signing X509 cert into the kernel keyring.

Orabug: 28926203
Signed-off-by: Eric Snowberg <eric.snowberg@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

block: fix bdi vs gendisk lifetime mismatch

Orabug: 28945039

Inspired by upstream commit df08c32ce3be5be138c1dbfcba203314a3a7cd6f

The kABI breakage caused by the above upstream commit cannot be fixed
by the uek_abi facilities because it extends a data structure which is
embedded inside other data structures in block and filesystem codes.

This patch fixes the breakage by moving the "owner" field from
backing_dev_info to the request_queue structure. It is safe to do this for
the reasons below:
o the purpose of the upstream commit is just to hold a reference to
  the gendisk in backing_dev_info to sync their lifetimes
o the backing_dev_info is embedded into the request_queue, which
  means their lifetimes are in sync
o so the lifetime of gendisk can be synced with any of request_queue
  or backing_dev_info
o syncing with request_queue does not break kABI
o we extended the request_queue structure previously and no third
  party binary driver breakage was reported

The reason for crafting another patch instead of cherry-picking the
upstream commit directly is that including blkdev.h in
mm/backing_dev.c broke kABI too, even though it is just an inclusion
without other changes.

The get/set of the owner field of request_queue is open coded to reduce the
impact on blkdev.h and avoid other potential kABI breakage.

v2:
  o move put owner after bdi_destroy(), following the suggestion from
    Ashish Samant <ashish.samant@oracle.com>
  o update inline comment

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

Add the following entries to 'uek-rpm/ol[67]/nano_modules.list':
kernel/drivers/net/net_failover.ko
kernel/net/core/failover.ko
Fixes: b3bc7c163fc9 ('net: Introduce generic failover module')
Orabug: 28953351
Signed-off-by: Vijay Balakrishna <vijay.balakrishna@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Reviewed-by: Si-Wei Liu <si-wei.liu@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

floppy: Do not copy a kernel pointer to user memory in FDGETPRM ioctl

The final field of a floppy_struct is the field "name", which is a pointer
to a string in kernel memory.  The kernel pointer should not be copied to
user memory.  The FDGETPRM ioctl copies a floppy_struct to user memory,
including this "name" field.  This pointer cannot be used by the user
and it will leak a kernel address to user-space, which will reveal the
location of kernel code and data and undermine KASLR protection.

Model this code after the compat ioctl which copies the returned data
to a previously cleared temporary structure on the stack (excluding the
name pointer) and copy out to userspace from there.  As we already have
an inparam union with an appropriate member and that memory is already
cleared even for read only calls make use of that as a temporary store.
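
A minimal sketch of the "copy everything except the trailing pointer into
cleared memory" idea, in plain userspace C. The struct below only mirrors the
shape described above (scalar fields followed by a final name pointer); the
toy_* names are made up for this example and this is not the driver code.

```c
#include <stddef.h>
#include <string.h>

/* Model of a parameter block whose last field is a kernel-internal pointer. */
struct toy_floppy_params {
    unsigned int size, sect, head, track, stretch;
    unsigned char gap, rate, spec1, fmt_gap;
    const char *name; /* must never reach the caller */
};

/*
 * Fill *out with the exportable part of *src. The destination is cleared
 * first, then only the bytes that precede the name field are copied, so
 * out->name stays NULL and no stale bytes leak either.
 */
void toy_export_params(const struct toy_floppy_params *src,
                       struct toy_floppy_params *out)
{
    memset(out, 0, sizeof(*out));
    memcpy(out, src, offsetof(struct toy_floppy_params, name));
}
```

Relying on offsetof() means the copy length follows the struct layout
automatically as long as the pointer stays the final member.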

Based on an initial patch by Brian Belleville.

CVE-2018-7755
Signed-off-by: Andy Whitcroft <apw@canonical.com>
Broke up long line.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
(cherry picked from commit 65eea8edc315589d6c993cf12dbb5d0e9ef1fe4e)

Orabug: 28956547
CVE: CVE-2018-7755

Signed-off-by: John Donnelly <john.p.donnelly@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

iov_iter: don't revert iov buffer if csum error

commit a6a5993243550b09f620941dea741b7421fdf79c upstream.

The patch 327868212381 (make skb_copy_datagram_msg() et.al. preserve
->msg_iter on error) reverts the iov buffer if the copy to the iter
fails. However, no datagram data is copied when there is a
skb_checksum_complete error, so there is nothing to revert in that case.

v2: Sabrina noticed that returning -EFAULT on a checksum error is not correct
here, as it would confuse the caller about the return value, so fix it.

Fixes: 327868212381 ("make skb_copy_datagram_msg() et.al. preserve ->msg_iter on error")
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
UEK4: iov_iter_revert() doesn't exist until 4.9, so stash a copy of the
iov_iter for revert purposes [Junxiao Bi]
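
The "stash a copy" approach mentioned in the UEK4 note can be sketched with a
toy iterator. This is a userspace illustration under assumed names (toy_iter,
toy_copy, toy_copy_two_parts); it is not the iov_iter code, only the
save-and-restore pattern used when no revert helper is available.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Toy output iterator: a buffer plus the current write position. */
struct toy_iter {
    char *buf;
    size_t pos;
    size_t len;
};

/* Copy n bytes into the iterator, or fail without a partial copy. */
static int toy_copy(struct toy_iter *it, const char *src, size_t n)
{
    if (n > it->len - it->pos)
        return -EFAULT;
    memcpy(it->buf + it->pos, src, n);
    it->pos += n;
    return 0;
}

/*
 * Copy two fragments. If either copy fails, restore the iterator from the
 * stashed copy so the caller sees it exactly as it was on entry.
 */
int toy_copy_two_parts(struct toy_iter *it, const char *a, size_t na,
                       const char *b, size_t nb)
{
    struct toy_iter saved = *it; /* stash for revert purposes */
    int err;

    err = toy_copy(it, a, na);
    if (!err)
        err = toy_copy(it, b, nb);
    if (err)
        *it = saved;
    return err;
}
```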

Orabug: 28960296
Signed-off-by: Todd Vierling <todd.vierling@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

crypto: salsa20 - fix blkcipher_walk API usage

When asked to encrypt or decrypt 0 bytes, both the generic and x86
implementations of Salsa20 crash in blkcipher_walk_done(), either when
doing 'kfree(walk->buffer)' or 'free_page((unsigned long)walk->page)',
because walk->buffer and walk->page have not been initialized.

The bug is that Salsa20 is calling blkcipher_walk_done() even when
nothing is in 'walk.nbytes'.  But blkcipher_walk_done() is only meant to
be called when a nonzero number of bytes have been provided.

The broken code is part of an optimization that tries to make only one
call to salsa20_encrypt_bytes() to process inputs that are not evenly
divisible by 64 bytes.  To fix the bug, just remove this "optimization"
and use the blkcipher_walk API the same way all the other users do.

Reproducer:

    #include <linux/if_alg.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main()
    {
            int algfd, reqfd;
            struct sockaddr_alg addr = {
                    .salg_type = "skcipher",
                    .salg_name = "salsa20",
            };
            char key[16] = { 0 };

            algfd = socket(AF_ALG, SOCK_SEQPACKET, 0);
            bind(algfd, (void *)&addr, sizeof(addr));
            reqfd = accept(algfd, 0, 0);
            setsockopt(algfd, SOL_ALG, ALG_SET_KEY, key, sizeof(key));
            read(reqfd, key, sizeof(key));
    }

Reported-by: syzbot <syzkaller@googlegroups.com>
Fixes: eb6f13eb9f81 ("[CRYPTO] salsa20_generic: Fix multi-page processing")
Cc: <stable@vger.kernel.org> # v2.6.25+
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit ecaaab5649781c5a0effdaf298a925063020500e)

Orabug: 28976583
CVE: CVE-2017-17805

Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

crypto: hmac - require that the underlying hash algorithm is unkeyed

Because the HMAC template didn't check that its underlying hash
algorithm is unkeyed, trying to use "hmac(hmac(sha3-512-generic))"
through AF_ALG or through KEYCTL_DH_COMPUTE resulted in the inner HMAC
being used without having been keyed, resulting in sha3_update() being
called without sha3_init(), causing a stack buffer overflow.

This is a very old bug, but it seems to have only started causing real
problems when SHA-3 support was added (requires CONFIG_CRYPTO_SHA3)
because the innermost hash's state is ->import()ed from a zeroed buffer,
and it just so happens that other hash algorithms are fine with that,
but SHA-3 is not.  However, there could be arch or hardware-dependent
hash algorithms also affected; I couldn't test everything.

Fix the bug by introducing a function crypto_shash_alg_has_setkey()
which tests whether a shash algorithm is keyed.  Then update the HMAC
template to require that its underlying hash algorithm is unkeyed.
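
Roughly, the check can be modelled as below. This is a userspace sketch with
made-up toy_* names, and it assumes that unkeyed algorithms carry a shared
default setkey stub; it is not the crypto API itself, only the shape of the
test and of the rejection in the template.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy algorithm descriptor; only the field needed for the check. */
struct toy_shash_alg {
    const char *name;
    int (*setkey)(const unsigned char *key, size_t keylen);
};

/* Default stub assumed to be installed for algorithms that take no key. */
int toy_no_setkey(const unsigned char *key, size_t keylen)
{
    (void)key;
    (void)keylen;
    return -ENOSYS;
}

/* A real setkey, as a keyed algorithm (e.g. an HMAC instance) would have. */
int toy_keyed_setkey(const unsigned char *key, size_t keylen)
{
    (void)key;
    (void)keylen;
    return 0;
}

/* An algorithm is keyed if its setkey is anything but the default stub. */
bool toy_alg_has_setkey(const struct toy_shash_alg *alg)
{
    return alg->setkey != toy_no_setkey;
}

/* Template creation refuses a keyed inner hash. */
int toy_hmac_create(const struct toy_shash_alg *inner)
{
    if (toy_alg_has_setkey(inner))
        return -EINVAL;
    return 0;
}
```

In this model, wrapping a plain hash succeeds, while wrapping something that
is itself keyed (the "hmac(hmac(...))" case) is rejected up front.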

Here is a reproducer:

    #include <linux/if_alg.h>
    #include <sys/socket.h>

    int main()
    {
        int algfd;
        struct sockaddr_alg addr = {
            .salg_type = "hash",
            .salg_name = "hmac(hmac(sha3-512-generic))",
        };
        char key[4096] = { 0 };

        algfd = socket(AF_ALG, SOCK_SEQPACKET, 0);
        bind(algfd, (const struct sockaddr *)&addr, sizeof(addr));
        setsockopt(algfd, SOL_ALG, ALG_SET_KEY, key, sizeof(key));
    }

Here was the KASAN report from syzbot:

    BUG: KASAN: stack-out-of-bounds in memcpy include/linux/string.h:341 [inline]
    BUG: KASAN: stack-out-of-bounds in sha3_update+0xdf/0x2e0 crypto/sha3_generic.c:161
    Write of size 4096 at addr ffff8801cca07c40 by task syzkaller076574/3044

    CPU: 1 PID: 3044 Comm: syzkaller076574 Not tainted 4.14.0-mm1+ #25
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
      __dump_stack lib/dump_stack.c:17 [inline]
      dump_stack+0x194/0x257 lib/dump_stack.c:53
      print_address_description+0x73/0x250 mm/kasan/report.c:252
      kasan_report_error mm/kasan/report.c:351 [inline]
      kasan_report+0x25b/0x340 mm/kasan/report.c:409
      check_memory_region_inline mm/kasan/kasan.c:260 [inline]
      check_memory_region+0x137/0x190 mm/kasan/kasan.c:267
      memcpy+0x37/0x50 mm/kasan/kasan.c:303
      memcpy include/linux/string.h:341 [inline]
      sha3_update+0xdf/0x2e0 crypto/sha3_generic.c:161
      crypto_shash_update+0xcb/0x220 crypto/shash.c:109
      shash_finup_unaligned+0x2a/0x60 crypto/shash.c:151
      crypto_shash_finup+0xc4/0x120 crypto/shash.c:165
      hmac_finup+0x182/0x330 crypto/hmac.c:152
      crypto_shash_finup+0xc4/0x120 crypto/shash.c:165
      shash_digest_unaligned+0x9e/0xd0 crypto/shash.c:172
      crypto_shash_digest+0xc4/0x120 crypto/shash.c:186
      hmac_setkey+0x36a/0x690 crypto/hmac.c:66
      crypto_shash_setkey+0xad/0x190 crypto/shash.c:64
      shash_async_setkey+0x47/0x60 crypto/shash.c:207
      crypto_ahash_setkey+0xaf/0x180 crypto/ahash.c:200
      hash_setkey+0x40/0x90 crypto/algif_hash.c:446
      alg_setkey crypto/af_alg.c:221 [inline]
      alg_setsockopt+0x2a1/0x350 crypto/af_alg.c:254
      SYSC_setsockopt net/socket.c:1851 [inline]
      SyS_setsockopt+0x189/0x360 net/socket.c:1830
      entry_SYSCALL_64_fastpath+0x1f/0x96

Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit af3ff8045bbf3e32f1a448542e73abb4c8ceb6f1)

Orabug: 28976653
CVE: CVE-2017-17806

Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

Revert commit 8bd274934987 ("block: fix bdi vs gendisk lifetime mismatch")

Orabug: 28968102

8bd274934987 adds a new element at the end of struct backing_dev_info.
struct backing_dev_info is embedded inside struct request_queue
and is usually allocated during request_queue allocation. However, some
out of tree modules also allocate this struct separately using
sizeof() or via direct inclusion in their own structs. This patch
essentially breaks KABI for such modules and can cause a mismatch in the
size of the allocated struct and lead to kernel crashes.
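
To illustrate the size mismatch, here is a toy example. The struct names and
fields are invented for this sketch and do not match the real
backing_dev_info; the only point is what happens when a struct that modules
embed by value grows at the end.

```c
#include <stddef.h>

/* Layout an out-of-tree module was compiled against (hypothetical). */
struct toy_bdi_old {
    unsigned long ra_pages;
    unsigned int min_ratio;
};

/* Same struct after a field is appended at the end (hypothetical). */
struct toy_bdi_new {
    unsigned long ra_pages;
    unsigned int min_ratio;
    void *owner;
};

/* Module-private struct that embeds the old layout by value. */
struct toy_module_state {
    struct toy_bdi_old bdi;
    int in_flight; /* lives where the new layout expects 'owner' bytes */
};

/* Bytes that code using the new layout touches beyond the old struct. */
size_t toy_bdi_overrun(void)
{
    return sizeof(struct toy_bdi_new) - sizeof(struct toy_bdi_old);
}
```

Core code built with the new layout would treat the module's embedded copy as
the larger struct, so writes to the appended field land in whatever the
module placed right after it.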

Signed-off-by: Ashish Samant <ashish.samant@oracle.com>
Reviewed-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Reviewed-by: Shan Hai <shan.hai@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

KVM/x86: Add IBPB support

commit 15d45071523d89b3fb7372e2135fbd72f6af9506

The Indirect Branch Predictor Barrier (IBPB) is an indirect branch
control mechanism. It keeps earlier branches from influencing
later ones.

Unlike IBRS and STIBP, IBPB does not define a new mode of operation.
It's a command that ensures predicted branch targets aren't used after
the barrier. Although IBRS and IBPB are enumerated by the same CPUID
enumeration, IBPB is very different.

IBPB helps mitigate against three potential attacks:

* Mitigate guests from being attacked by other guests.
  - This is addressed by issuing IBPB when we do a guest switch.

* Mitigate attacks from guest/ring3->host/ring3.
  These would require an IBPB during context switch in host, or after
  VMEXIT. The host process has two ways to mitigate:
  - Either it can be compiled with retpoline
  - If it's going through context switch, and has set !dumpable then
    there is an IBPB in that path.
    (Tim's patch: https://patchwork.kernel.org/patch/10192871)
  - The case where after a VMEXIT you return back to Qemu might make
    Qemu attackable from guest when Qemu isn't compiled with retpoline.
  There are issues reported when doing IBPB on every VMEXIT that resulted
  in some tsc calibration woes in guest.

* Mitigate guest/ring0->host/ring0 attacks.
  When host kernel is using retpoline it is safe against these attacks.
  If host kernel isn't using retpoline we might need to do an IBPB flush on
  every VMEXIT.

Even when using retpoline for indirect calls, in certain conditions 'ret'
can use the BTB on Skylake-era CPUs. There are other mitigations
available like RSB stuffing/clearing.

* IBPB is issued only for SVM during svm_free_vcpu().
  VMX has a vmclear and SVM doesn't.  Follow discussion here:
  https://lkml.org/lkml/2018/1/15/146

Please refer to the following spec for more details on the enumeration
and control.

Refer here to get documentation about mitigations.

https://software.intel.com/en-us/side-channel-security-support

[peterz: rebase and changelog rewrite]
[karahmed: - rebase
           - vmx: expose PRED_CMD if guest has it in CPUID
           - svm: only pass through IBPB if guest has it in CPUID
           - vmx: support !cpu_has_vmx_msr_bitmap()
           - vmx: support nested]
[dwmw2: Expose CPUID bit too (AMD IBPB only for now as we lack IBRS)
        PRED_CMD is a write-only MSR]

Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: kvm@vger.kernel.org
Cc: Asit Mallick <asit.k.mallick@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Arjan Van De Ven <arjan.van.de.ven@intel.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Jun Nakajima <jun.nakajima@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Link: http://lkml.kernel.org/r/1515720739-43819-6-git-send-email-ashok.raj@intel.com
Link: https://lkml.kernel.org/r/1517522386-18410-3-git-send-email-karahmed@amazon.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit d395d69de67ea95760e1f207eb0f6fdfbcb6e069)

Orabug: 28703712

Signed-off-by: George Kennedy <george.kennedy@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/kvm/cpuid.c
arch/x86/kvm/svm.c
arch/x86/kvm/vmx.c

[ manual merge - functionality is not a match with upstream nor UEK5 ]

Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/intel/spectre_v2: Remove unnecessary retp_compiler() test

... and the unneeded set of parentheses.

Orabug: 28814570

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/intel/spectre_v4: Deprecate spec_store_bypass_disable=userspace

Enforcing userspace-only spectre_v4 mitigations cannot be done performantly
when retpoline mitigations for spectre_v2 are in force. To do so we would
need to write MSR_IA32_SPEC_CTRL when entering and leaving the kernel (i.e.
system calls, interrupts, etc.). Since retpoline is the preferred method of
spectre_v2 mitigation exactly because it avoids writing this extremely slow
MSR, adding these two writes for SSBD bit management would make using
retpoline pointless.

While there may be some cases where running with speculative store bypass
enabled in the kernel only is better even in the presence of the extra writes
to MSR_IA32_SPEC_CTRL, we don't expect this to be the case in the majority of
cases. Plus, removing this mode makes the code more readable.

Orabug: 28814570

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/speculation: x86_spec_ctrl_set needs to be called unconditionally

Because on entering idle we want to clear the SSBD bit as well,
testing for ibrs_inuse is not sufficient.

We should also clear the SSBD bit in x86_spec_ctrl_base during
initialization since it's up to the kernel to manage it.

Orabug: 28814570

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/speculation: Drop unused DISABLE_IBRS_CLOBBER macro

... and x86_spec_ctrl_base declaration in ASM part of the file.

Orabug: 28814570

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/intel/spectre_v4: Keep SPEC_CTRL_SSBD when IBRS is in use

When IBRS mitigations are in use, and we are running with prctl or seccomp
SSBD mitigations, we end up not setting the SPEC_CTRL_SSBD bit in
MSR_IA32_SPEC_CTRL in DISABLE_IBRS (which is called, for example, when
returning from a syscall to userspace).

Orabug: 28814570

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

net: net_failover: fix typo in net_failover_slave_register()

Sync both unicast and multicast lists instead of unicast twice.

Fixes: cfc80d9a116 ("net: Introduce net_failover driver")
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit e5223438280d76ef782592cf643e09441140d14c)
Orabug: 28122104
Signed-off-by: Vijay Balakrishna <vijay.balakrishna@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
(cherry picked from commit feaed1611895abfa50989b6b93837aa102f6c5f7)
Signed-off-by: Vijay Balakrishna <vijay.balakrishna@oracle.com>
Reviewed-by: Si-Wei Liu <si-wei.liu@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

virtio_net: Extend virtio to use VF datapath when available

This patch enables virtio_net to switch over to a VF datapath when STANDBY
feature is enabled and a VF netdev is present with the same MAC address.
It allows live migration of a VM with a direct attached VF without the need
to set up a bond/team between a VF and virtio net device in the guest.

It uses the API that is exported by the net_failover driver to create and
destroy a master failover netdev. When STANDBY feature is enabled, an
additional netdev(failover netdev) is created that acts as a master device
and tracks the state of the 2 lower netdevs. The original virtio_net netdev
is marked as 'standby' netdev and a passthru device with the same MAC is
registered as 'primary' netdev.

The hypervisor needs to unplug the VF device from the guest on the source
host and reset the MAC filter of the VF to initiate failover of datapath
to virtio before starting the migration. After the migration is completed,
the destination hypervisor sets the MAC filter on the VF and plugs it back
to the guest to switch over to VF datapath.

This patch is based on the discussion initiated by Jesse on this thread.
https://marc.info/?l=linux-virtualization&m=151189725224231&w=2

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit ba5e4426e80e0435358c7117c339e6a4c22c34ad)
Orabug: 28122104
Signed-off-by: Vijay Balakrishna <vijay.balakrishna@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
(cherry picked from commit cf863ad81d0bcdefe6520fea20c81df26decef4f)
Signed-off-by: Vijay Balakrishna <vijay.balakrishna@oracle.com>
Reviewed-by: Si-Wei Liu <si-wei.liu@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/net/Kconfig
(merge during cherry-pick needed manual intervention to resolve)
drivers/net/virtio_net.c
(manual merge required as file differs between UEK5 and UEK4)

Signed-off-by: Brian Maly <brian.maly@oracle.com>

virtio_net: Introduce VIRTIO_NET_F_STANDBY feature bit

This feature bit can be used by hypervisor to indicate virtio_net device to
act as a standby for another device with the same MAC address.

VIRTIO_NET_F_STANDBY is defined as bit 62 as it is a device feature bit.
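
For illustration, testing such a high feature bit needs a 64-bit mask, since
shifting a plain int by 62 would overflow. The helpers below are hypothetical
names used only for this sketch; only the bit number 62 comes from the text
above.

```c
#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_NET_F_STANDBY 62 /* bit number as stated above */

/* Build a 64-bit mask for a feature bit number (0..63). */
uint64_t toy_feature_mask(unsigned int bit)
{
    return UINT64_C(1) << bit;
}

/* Check whether a 64-bit feature word has the given bit set. */
bool toy_has_feature(uint64_t features, unsigned int bit)
{
    return (features & toy_feature_mask(bit)) != 0;
}
```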

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 9805069d14c1b0b66b1600ea60cfc08f94841bd8)
Orabug: 28122104
Signed-off-by: Vijay Balakrishna <vijay.balakrishna@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
(cherry picked from commit c4b1cdd0459d953eddac306c3a1fc88c8d631e17)
Signed-off-by: Vijay Balakrishna <vijay.balakrishna@oracle.com>
Reviewed-by: Si-Wei Liu <si-wei.liu@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/net/virtio_net.c
(UEK4 doesn't have VIRTNET_FEATURES defined)
include/uapi/linux/virtio_net.h
(there is an additional feature bitmap in UEK5 for virtio net which
cherry-pick inserted during merge)

Signed-off-by: Brian Maly <brian.maly@oracle.com>

net: Introduce net_failover driver

The net_failover driver provides an automated failover mechanism via APIs
to create and destroy a failover master netdev and manages primary and
standby slave netdevs that get registered via the generic failover
infrastructure.

The failover netdev acts as a master device and controls 2 slave devices. The
original paravirtual interface gets registered as 'standby' slave netdev and
a passthru/vf device with the same MAC gets registered as 'primary' slave
netdev. Both 'standby' and 'failover' netdevs are associated with the same
'pci' device. The user accesses the network interface via 'failover' netdev.
The 'failover' netdev chooses 'primary' netdev as default for transmits when
it is available with link up and running.
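
The transmit selection described above can be sketched as follows. This is a
userspace model with invented toy_* names, and the fallback to the standby
device when the primary is not usable is an assumption of this sketch; it is
not the driver's xmit path.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy netdev: just the two state bits the selection depends on. */
struct toy_netdev {
    const char *name;
    bool running;    /* interface is up */
    bool carrier_ok; /* link is up */
};

/* A device can transmit only if it exists, is running and has link. */
static bool toy_xmit_ready(const struct toy_netdev *dev)
{
    return dev != NULL && dev->running && dev->carrier_ok;
}

/*
 * Prefer the primary (VF) device; fall back to the standby (paravirtual)
 * device; return NULL if neither can transmit.
 */
const struct toy_netdev *toy_pick_xmit_dev(const struct toy_netdev *primary,
                                           const struct toy_netdev *standby)
{
    if (toy_xmit_ready(primary))
        return primary;
    if (toy_xmit_ready(standby))
        return standby;
    return NULL;
}
```

Passing NULL for the primary models the VF having been unplugged, in which
case traffic moves to the standby device.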

This can be used by paravirtual drivers to enable an alternate low latency
datapath. It also enables hypervisor controlled live migration of a VM with
direct attached VF by failing over to the paravirtual datapath when the VF
is unplugged.

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit cfc80d9a11635404a40199a1c9471c96890f3f74)
Orabug: 28122104
Signed-off-by: Vijay Balakrishna <vijay.balakrishna@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/net/Kconfig
(a missing config in UEK5 needed manual conflict merge)

drivers/net/Makefile
(not all upstream network device driver objects present in UEK5,
cherry-pick fails to merge, resolved manually)

(cherry picked from commit 0a251fd907bd530b28076cb272110faa4b1f3103)
Signed-off-by: Vijay Balakrishna <vijay.balakrishna@oracle.com>
Reviewed-by: Si-Wei Liu <si-wei.liu@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
drivers/net/net_failover.c
- new ETHTOOL_xLINKSETTINGS API are not present in UEK4,
continue to use ethtool get_settings
- centralized net_device min/max MTU checking not present in UEK4,
lines referencing min/max MTU are removed

Conflicts:
drivers/net/Makefile
(insignificant manual merge after cherry-pick)

Signed-off-by: Brian Maly <brian.maly@oracle.com>

net: Introduce generic failover module

The failover module provides a generic interface for paravirtual drivers
to register a netdev and a set of ops with a failover instance. The ops
are used as event handlers that get called to handle netdev register/
unregister/link change/name change events on slave pci ethernet devices
with the same mac address as the failover netdev.

This enables paravirtual drivers to use a VF as an accelerated low latency
datapath. It also allows migration of VMs with direct attached VFs by
failing over to the paravirtual datapath when the VF is unplugged.

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 30c8bd5aa8b2c78546c3e52337101b9c85879320)
Orabug: 28122104
Signed-off-by: Vijay Balakrishna <vijay.balakrishna@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Made a change to resolve build error as #arguments differs in UEK5
for routine netdev_master_upper_dev_link().

Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
include/linux/netdevice.h
(enum net_device_priv_flags list in UEK5 is a subset of upstream list,
merged conflicts manually)

net/Kconfig
(a config absent in UEK5, cherry-pick failed to resolve)

(cherry picked from commit 2c6fa9893c5d8b5e71103af40986d3fd952a28d7)
Signed-off-by: Vijay Balakrishna <vijay.balakrishna@oracle.com>
Reviewed-by: Si-Wei Liu <si-wei.liu@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Conflicts:
MAINTAINERS
(the surrounding list where new MAINTAINER for failover added by cherry-pick
differs between UEK5 and UEK4)
include/linux/netdevice.h
(there is additional code only in UEK5 which cherry-pick inserted into UEK4,
only retained relevant code)
net/core/Makefile
(there are additional objects only in UEK5 which cherry-pick inserted into UEK4
makefile, only relevant object retained)

Signed-off-by: Brian Maly <brian.maly@oracle.com>

net: introduce lower state changed info structure for LAG lowers

This is shared info structure for bonding and team. Serves to pass down
info about link state and port activity to notification listeners.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit fb1b2e3ce53aef80b3cef71f3885d584cdbdc6b8)
Orabug: 28122104
Signed-off-by: Vijay Balakrishna <vijay.balakrishna@oracle.com>
Reviewed-by: Si-Wei Liu <si-wei.liu@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

net: introduce change lower state notifier

When a lower device like a bonding slave, team/bridge port, etc changes its
state, it is useful for others to notice this change. Currently this is
implemented specifically for bonding as NETDEV_BONDING_INFO notifier. This
patch aims to replace this specific usage and make this more generic to
be used for all upper-lower devices.

Introduce NETDEV_CHANGELOWERSTATE netdev notifier type and
netdev_lower_state_changed() helper.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 04d482660a07039fc4e9a42bb3517db236d98f96)
Orabug: 28122104
Signed-off-by: Vijay Balakrishna <vijay.balakrishna@oracle.com>
Reviewed-by: Si-Wei Liu <si-wei.liu@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
include/linux/netdevice.h
(a #define not present in UEK4)

Signed-off-by: Brian Maly <brian.maly@oracle.com>

net: add info struct for LAG changeupper

This struct will be shared by bonding and team to pass internal
information to notifier listeners.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 764f5e544118508add420724789f46e04dba91eb)
Orabug: 28122104
Signed-off-by: Vijay Balakrishna <vijay.balakrishna@oracle.com>
Reviewed-by: Si-Wei Liu <si-wei.liu@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
include/linux/netdevice.h
(cherry-pick merge included unrelated lines immediately before
related code)

Signed-off-by: Brian Maly <brian.maly@oracle.com>

net: add possibility to pass information about upper device via notifier

Sometimes the drivers and other code would find it handy to know some
internal information about upper device being changed. So allow upper-code
to pass information down to notifier listeners during linking.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 29bf24afb29042f568fa67b1b0eee46796725ed2)
Orabug: 28122104
Signed-off-by: Vijay Balakrishna <vijay.balakrishna@oracle.com>
Reviewed-by: Si-Wei Liu <si-wei.liu@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/net/bonding/bond_main.c
drivers/net/team/team.c
drivers/net/vrf.c
(vrf.c - file not present in uek4)
include/linux/netdevice.h
net/batman-adv/hard-interface.c
net/bridge/br_if.c
net/core/dev.c
net/openvswitch/vport-netdev.c
(conflicts are related to #of arguments to netdev_master_upper_dev_link(),
which is retained to maintain kABI. To allow upper-code to pass
information down to notifier during linking we have modified
number of arguments to netdev_master_upper_dev_link_private())

Signed-off-by: Brian Maly <brian.maly@oracle.com>

net: Check CHANGEUPPER notifier return value

switchdev drivers reflect the newly requested topology to hardware when
CHANGEUPPER is received, after software links were already formed.
However, the operation can fail and user will not be notified, as the
return value of the notifier is not checked.

Add this check and roll back software links if necessary.
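
The check-and-rollback pattern looks roughly like this. It is a toy model
with invented names (toy_dev, toy_link_upper and the two sample callbacks),
not the netdev linking code; the callback stands in for a driver reacting to
CHANGEUPPER.

```c
#include <errno.h>
#include <stddef.h>

/* Toy device with a single upper link. */
struct toy_dev {
    const char *name;
    struct toy_dev *upper;
};

/* Stand-in for a notifier that lets a driver accept or veto the change. */
typedef int (*toy_changeupper_fn)(struct toy_dev *lower, struct toy_dev *upper);

/* Sample callback: hardware accepts the new topology. */
int toy_hw_accept(struct toy_dev *lower, struct toy_dev *upper)
{
    (void)lower;
    (void)upper;
    return 0;
}

/* Sample callback: hardware cannot reflect the new topology. */
int toy_hw_reject(struct toy_dev *lower, struct toy_dev *upper)
{
    (void)lower;
    (void)upper;
    return -EOPNOTSUPP;
}

/*
 * Form the software link, notify, and undo the link if the notifier
 * reports an error, so the caller learns about the failure.
 */
int toy_link_upper(struct toy_dev *lower, struct toy_dev *upper,
                   toy_changeupper_fn notify)
{
    int err;

    lower->upper = upper; /* software link is formed first */
    err = notify(lower, upper);
    if (err) {
        lower->upper = NULL; /* rollback */
        return err;
    }
    return 0;
}
```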

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit b03804e7c3ad41c265c0ca21ddb306b252b4f99f)
Orabug: 28122104
Signed-off-by: Vijay Balakrishna <vijay.balakrishna@oracle.com>
Reviewed-by: Si-Wei Liu <si-wei.liu@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

net: introduce change upper device notifier change info

Add info that is passed along with NETDEV_CHANGEUPPER event.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 0e4ead9d7b3655d76371604abb9b0dcc4e79bb7d)
Orabug: 28122104
Signed-off-by: Vijay Balakrishna <vijay.balakrishna@oracle.com>
Reviewed-by: Si-Wei Liu <si-wei.liu@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/bugs: rework x86_spec_ctrl_set to make its changes explicit

x86_spec_ctrl_set is difficult to understand because its argument may
not be explicit or complete about the SPEC_CTRL MSR bits the function
changes.

For example, the call x86_spec_ctrl_set(x86_spec_ctrl_base &
x86_spec_ctrl_mask) is made to enable the SSBD bit at boot, and
x86_spec_ctrl_set(SPEC_CTRL_FEATURE_DISABLE_IBRS) may also turn off
SSBD.

To make the function easier to understand, rework it to take the context
it's called in instead of a subset of the MSR bits to be changed.
Explain the bits modified for each context.

No functional change. In particular, x86_spec_ctrl_set continues to
clear the SSBD MSR bit in the ssbd_userspace_selected() case when the
kernel becomes idle, in accordance with this section from
336996-Speculative-Execution-Side-Channel-Mitigations.pdf[1]:

"On Intel® Core™ and Intel® Xeon® processors that enable Intel®
Hyper-Threading Technology and do not support enhanced IBRS, setting
SSBD on a logical processor may impact the performance of a sibling
logical processor on the same core. Intel recommends that the SSBD MSR
bit be cleared when in an idle state on such processors."

[1] https://bugzilla.kernel.org/show_bug.cgi?id=199511

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
(cherry picked from commit 44651cc127944465e6595a9362248e3bdf9c6d1c)

Orabug: 28271063

Signed-off-by: Anjali Kulkarni <anjali.k.kulkarni@oracle.com>
Reviewed-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/bugs: rename ssbd_ibrs_selected to ssbd_userspace_selected

The name of ssbd_ibrs_selected referred to how CPUs vulnerable to
Speculative Store Bypass automatically enable the SSB mitigation in
userspace, provided that the CPU is also using IBRS.[*]  However, the
function doesn't test for IBRS, so the name may be confusing, and there
are unusual combinations of options where the function returns true
without IBRS being enabled (say spectre_v2=off and
spec_store_bypass_disable=userspace).

Rename the function, as brought up by Boris.  The new name reflects what
it's testing.  While at it, make it static to match its declaration
because it's not used outside the file.  No functional change.

[*] The rationale is that setting the SPEC_CTRL MSR's SSBD bit in
    addition to the IBRS bit is free.

Suggested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
(cherry picked from commit 5342c8c0879f2ce6318139baaa3358c4d16f482e)

Orabug: 28271063

Signed-off-by: Anjali Kulkarni <anjali.k.kulkarni@oracle.com>
Reviewed-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/bugs: always use x86_spec_ctrl_base or _priv when setting spec ctrl MSR

x86_spec_ctrl_base and x86_spec_ctrl_priv contain reserved bits from the
first read of the spec ctrl MSR, but in one case neither of these is
used when updating the MSR in x86_spec_ctrl_set.

This does not seem to cause problems now, but add it in for consistency.
In this case, we need to use x86_spec_ctrl_base because IBRS isn't being
enabled.

Fixes: edcba197bb44 ("x86/bugs/IBRS: Use variable instead of defines for enabling IBRS")
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
(cherry picked from commit a5429dd0a22342bf3e03649af25e5ad3dd6e01e7)

Orabug: 28271063

Signed-off-by: Anjali Kulkarni <anjali.k.kulkarni@oracle.com>
Reviewed-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

xen-blkfront: fix kernel panic with negotiate_mq error path

info->nr_rings isn't adjusted in case of ENOMEM error from
negotiate_mq(). This leads to kernel panic in error path.

Typical call stack involving panic -
#8 page_fault at ffffffff8175936f
[exception RIP: blkif_free_ring+33]
RIP: ffffffffa0149491 RSP: ffff8804f7673c08 RFLAGS: 00010292
...
#9 blkif_free at ffffffffa0149aaa [xen_blkfront]
#10 talk_to_blkback at ffffffffa014c8cd [xen_blkfront]
#11 blkback_changed at ffffffffa014ea8b [xen_blkfront]
#12 xenbus_otherend_changed at ffffffff81424670
#13 backend_changed at ffffffff81426dc3
#14 xenwatch_thread at ffffffff81422f29
#15 kthread at ffffffff810abe6a
#16 ret_from_fork at ffffffff81754078

Cc: stable@vger.kernel.org
Fixes: 7ed8ce1c5fc7 ("xen-blkfront: move negotiate_mq to cover all cases of new VBDs")
Signed-off-by: Manjunath Patil <manjunath.b.patil@oracle.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
(cherry picked from commit 6cc4a0863c9709c512280c64e698d68443ac8053)

Orabug: 28798861
Signed-off-by: Manjunath Patil <manjunath.b.patil@oracle.com>
Reviewed-by: Patrick Colp <patrick.colp@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

scsi: lpfc: Correct MDS diag and nvmet configuration

Orabug: 28855939

A recent change added some MDS processing in the lpfc_drain_txq routine
that relies on the fcp_wq being allocated. For nvmet operation the fcp_wq
is not allocated because it can only be an nvme-target. When the original
MDS support was added, LS_MDS_LOOPBACK was wrongly defined as 0x16; it should
have been 0x10 (a decimal value was used for the hex setting). This incorrect value
allowed MDS_LOOPBACK to be set simultaneously with LS_NPIV_FAB_SUPPORTED,
causing the driver to crash when it accesses the non-existent fcp_wq.

Correct the bad value setting for LS_MDS_LOOPBACK.
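
The overlap can be checked with plain bit arithmetic. In the sketch below
only 0x16 and 0x10 come from the text above; the value 0x02 for
LS_NPIV_FAB_SUPPORTED and all TOY_ names are assumptions made for
illustration.

```c
/* Only 0x16 and 0x10 are taken from the commit text. */
#define TOY_LS_NPIV_FAB_SUPPORTED	0x02u	/* assumed value */
#define TOY_LS_MDS_LOOPBACK_BAD		0x16u	/* decimal 16 typed as hex */
#define TOY_LS_MDS_LOOPBACK_GOOD	0x10u

/* Returns the bits that two flag masks have in common (0 if none). */
unsigned int toy_flags_overlap(unsigned int a, unsigned int b)
{
	return a & b;
}
```

0x16 is binary 10110, so under the assumed values it shares bit 0x02 with the
NPIV flag, while 0x10 shares none.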

Fixes: ae9e28f36a6c ("lpfc: Add MDS Diagnostic support.")
Cc: <stable@vger.kernel.org> # v4.12+
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Tested-by: Ewan D. Milne <emilne@redhat.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 53e13ee087a80e8d4fc95436318436e5c2c1f8c2)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

scsi: virtio_scsi: let host do exception handling

virtio_scsi tries to do exception handling after the default 30 seconds
timeout expires.  However, it's better to let the host control the
timeout, otherwise with a heavy I/O load it is likely that an abort will
also timeout.  This leads to fatal errors like filesystems going
offline.

Disable the 'sd' timeout and allow the host to do exception handling,
following the precedent of the storvsc driver.

Hannes has a proposal to introduce timeouts in virtio, but this provides
an immediate solution for stable kernels too.

[mkp: fixed typo]

Reported-by: Douglas Miller <dougmill@linux.vnet.ibm.com>
Cc: "James E.J. Bottomley" <jejb@linux.vnet.ibm.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: linux-scsi@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Orabug: 28856913

(cherry picked from commit e72c9a2a67a6400c8ef3d01d4c461dbbbfa0e1f0)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
- The function above virtscsi_host_template_single() is
  virtscsi_target_destroy() (in uek4), but not virtscsi_map_queues() (in
  upstream)
- virtscsi_host_template_multi.slave_alloc is implemented in upstream, but
  not uek4

Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Reviewed-by: Patrick Colp <patrick.colp@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

net/rds: Fix endless RNR situation

Working with the following SRs:

Exadata SR# 3-15640329311
Linux SR# 3-15675579325

it was discovered that inserting IB_SEND_SOLICITED at regular
intervals removed the endless RNR Retry situation. The test was made
by inserting IB_SEND_SOLICITED at the same interval as
IB_SEND_SIGNALED was inserted, that is, by default for every 17th
fragment.

This commit introduces the sysctl variable
net.rds.ib.max_unsolicited_wr. A value of zero disables the
functionality of inserting IB_SEND_SOLICITED. A value of N will insert
IB_SEND_SOLICITED for every Nth fragment.
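
A minimal sketch of the "every Nth fragment" rule, assuming a 1-based
fragment counter. toy_want_solicited() is a hypothetical helper, not the
function used in the RDS code.

```c
#include <stdbool.h>

/*
 * Decide whether fragment number `frag` (1-based) should carry the
 * solicited flag. A setting of zero disables the feature.
 */
bool toy_want_solicited(unsigned int frag, unsigned int max_unsolicited_wr)
{
	if (max_unsolicited_wr == 0)
		return false;
	return frag % max_unsolicited_wr == 0;
}
```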

net.rds.ib.max_unsolicited_wr is by default 16, in order to avoid
customization when this fix is applied at the customer site.

This fix also has the nice side-effect that it improves IOPS for 1Q,
1D, 1T cases:

-q 1M -a 256:

Without fix:

tsks   tx/s   rx/s  tx+rx K/s    mbi K/s    mbo K/s tx us/c   rtt us cpu %
   1   1161      0 1189243.20       0.00       0.00  203.52   857.34 -1.00
(average)

With fix (with default net.rds.ib.max_unsolicited_wr = 16):

tsks   tx/s   rx/s  tx+rx K/s    mbi K/s    mbo K/s tx us/c   rtt us cpu %
   1   1323      0 1355849.36       0.00       0.00  203.76   751.50 -1.00
(average)

-q $[32*1024+256] -a 256:

With fix (net.rds.ib.max_unsolicited_wr = 0, i.e. disabled):

tsks   tx/s   rx/s  tx+rx K/s    mbi K/s    mbo K/s tx us/c   rtt us cpu %
   1  15243      0  492547.75       0.00       0.00   10.58    62.01 -1.00
(average)

Ditto with net.rds.ib.max_unsolicited_wr = 4 (two SEND_SOLICITED per ~32K):

tsks   tx/s   rx/s  tx+rx K/s    mbi K/s    mbo K/s tx us/c   rtt us cpu %
   1  16422      0  530641.03       0.00       0.00   10.28    57.25 -1.00
(average)

Orabug: 28857027

Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

scsi: sg: allocate with __GFP_ZERO in sg_build_indirect()

This shall help avoid copying uninitialized memory to the userspace when
calling ioctl(fd, SG_IO) with an empty command.
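
As a userspace analogy only: calloc() plays the role that __GFP_ZERO plays
for a kernel allocation, so a buffer that is never written still holds
zeroes when it is handed back. The toy_* names are hypothetical.

```c
#include <stdlib.h>

/* Allocate a reply buffer that is guaranteed to start out zeroed. */
unsigned char *toy_alloc_reply(size_t len)
{
	return calloc(len, 1);
}

/* Returns 1 if every byte of buf is zero, 0 otherwise. */
int toy_all_zero(const unsigned char *buf, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		if (buf[i] != 0)
			return 0;
	return 1;
}
```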

Reported-by: syzbot+7d26fc1eea198488deab@syzkaller.appspotmail.com
Cc: stable@vger.kernel.org
Signed-off-by: Alexander Potapenko <glider@google.com>
Acked-by: Douglas Gilbert <dgilbert@interlog.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit a45b599ad808c3c982fdcdc12b0b8611c2f92824)

Orabug: 28892656
CVE: CVE-2018-1000204

Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

cdrom: fix improper type cast, which can lead to information leak.

There is another cast from unsigned long to int which causes
a bounds check to fail with specially crafted input. The value is
then used as an index in the slot array in cdrom_slot_status().
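
The effect of the cast can be shown outside the kernel. The two helpers
below are hypothetical and only contrast a bounds check done after a cast to
int with one done on the unsigned value; the narrowing conversion is
implementation-defined in C, and the sketch assumes the usual wrap-around
behaviour of common compilers.

```c
#include <stdbool.h>

/* Check after casting: a huge value becomes negative and passes. */
bool toy_check_cast(unsigned long arg, int capacity)
{
	return (int)arg < capacity;
}

/* Check in the unsigned domain: a huge value is rejected. */
bool toy_check_unsigned(unsigned long arg, int capacity)
{
	return arg < (unsigned long)capacity;
}
```

With arg set to the largest unsigned long and a capacity of 8, the first
check accepts the value and the second rejects it.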

This issue is similar to CVE-2018-16658 and CVE-2018-10940.

Signed-off-by: Young_X <YangX92@hotmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
(cherry picked from commit e4f3aa2e1e67bb48dfbaaf1cad59013d5a5bc276)

Orabug: 28929767
CVE: CVE-2018-18710

Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

oracleasm: Honor ASM_IFLAG_FORMAT_NOCHECK flag

If ASMLib supports the QUERY HANDLE operation, it will set the
ASM_IFLAG_FORMAT_NOCHECK flag on the ioc. This signals to the kernel
driver that the integrity information does not depend on the contents
of the it_format field and the ASM disk handle.

Orabug: 28650922

Reviewed-by: Sajid Zia <sajid.zia@oracle.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Rajan Shanmugavelu <rajan.shanmugavelu@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

oracleasm: Implement support for QUERY HANDLE operation

ASMLib previously relied on tagging the disk handle pointer to store
the integrity format. This had the advantage that a simple masking
operation was all that was required to get from a handle to the
integrity information.

However, we have seen a few cases where it appears the disk handle has
been corrupted in userland post discovery. Consequently, it has proven
necessary to be able to query disk properties without relying on the
disk pointer tag.

The oracleasm driver currently only supports looking up disks by their
label string via the query_disk operation. Implement a query_handle
operation that is similar to query_disk in that it returns all known
disk properties. However, the lookup is done by disk handle instead of
device node. Otherwise the two queries are identical.

Adding the new operation to oracleasm does not prevent older versions
of the library from working correctly. The old tagging mechanism is
still in place, the use of query_handle is entirely optional. Later
ASMLib versions will use the new mode of operation if the query_handle
transaction file appears in /dev/oracleasm.

Orabug: 28650922

Reviewed-by: Sajid Zia <sajid.zia@oracle.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Rajan Shanmugavelu <rajan.shanmugavelu@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

KVM: MTRR: remove MSR 0x2f8

MSR 0x2f8 accessed the 124th Variable Range MTRR ever since MTRR support
was introduced by 9ba075a664df ("KVM: MTRR support").

0x2f8 became harmful when 910a6aae4e2e ("KVM: MTRR: exactly define the
size of variable MTRRs") shrank the array of VR MTRRs from 256 to 8,
which made access to index 124 out of bounds. The surrounding code only
WARNs in this situation, thus the guest gained a limited read/write
access to struct kvm_arch_vcpu.

0x2f8 is not a valid VR MTRR MSR, because KVM has/advertises only 16 VR
MTRR MSRs, 0x200-0x20f. Every VR MTRR is set up using two MSRs, 0x2f8
was treated as a PHYSBASE and 0x2f9 would be its PHYSMASK, but 0x2f9 was
not implemented in KVM, therefore 0x2f8 could never do anything useful
and getting rid of it is safe.
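
The index arithmetic behind "124th" can be written out as follows, assuming
two MSRs per variable range starting at 0x200 as stated above.
toy_vr_mtrr_index() and TOY_NR_VAR_RANGES are names made up for this sketch.

```c
/* Size of the variable-range array after the shrink described above. */
#define TOY_NR_VAR_RANGES	8u

/* Index of the variable range addressed by an MSR, two MSRs per range. */
unsigned int toy_vr_mtrr_index(unsigned int msr)
{
	return (msr - 0x200u) / 2;
}
```

(0x2f8 - 0x200) / 2 = 248 / 2 = 124, which is well past the last valid
index of an 8-entry array.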

This fixes CVE-2016-3713.

Fixes: 910a6aae4e2e ("KVM: MTRR: exactly define the size of variable MTRRs")
Cc: stable@vger.kernel.org
Reported-by: David Matlack <dmatlack@google.com>
Signed-off-by: Andy Honig <ahonig@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/kvm/mtrr.c

Though the commit 910a6aae4e2e is not present in this stream and as
per the upstream commit 9842df62004f, 0x2f8 is not a valid VR MTRR MSR,
getting rid of it is safe.

Orabug: 23276795
CVE: CVE-2016-3713

(cherry picked from commit 9842df62004f366b9fed2423e24df10542ee0dc5)
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/cpu/amd: Limit cpu_core_id fixup to families older than F17h

Current cpu_core_id fixup causes downcored F17h configurations to be
incorrect:

  NODE: 0
  processor  0 core id : 0
  processor  1 core id : 1
  processor  2 core id : 2
  processor  3 core id : 4
  processor  4 core id : 5
  processor  5 core id : 0

  NODE: 1
  processor  6 core id : 2
  processor  7 core id : 3
  processor  8 core id : 4
  processor  9 core id : 0
  processor 10 core id : 1
  processor 11 core id : 2

Code that relies on the cpu_core_id, like match_smt(), for example,
which builds the thread siblings masks used by the scheduler, is
misled.

So, limit the fixup to pre-F17h machines. The new value for cpu_core_id
for F17h and later will represent the CPUID_Fn8000001E_EBX[CoreId],
which is guaranteed to be unique for each core within a socket.

This way we have:

  NODE: 0
  processor  0 core id : 0
  processor  1 core id : 1
  processor  2 core id : 2
  processor  3 core id : 4
  processor  4 core id : 5
  processor  5 core id : 6

  NODE: 1
  processor  6 core id : 8
  processor  7 core id : 9
  processor  8 core id : 10
  processor  9 core id : 12
  processor 10 core id : 13
  processor 11 core id : 14

Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
[ Heavily massaged. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yazen Ghannam <Yazen.Ghannam@amd.com>
Link: http://lkml.kernel.org/r/20170731085159.9455-2-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit b89b41d0b8414690ec0030c134b8bde209e6d06c)

Orabug: 28783929

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/CPU/AMD: Fix Bulldozer topology

The following commit:

8196dab4fc15 ("x86/cpu: Get rid of compute_unit_id")

... broke the initial strategy for Bulldozer-based cores' topology,
where we consider each thread of a compute unit a standalone core
and not a HT or SMT thread.

Revert to the firmware-supplied core_id numbering and do not make
them thread siblings as we don't consider them for such even if they
technically are, more or less.

Reported-and-tested-by: Brice Goglin <Brice.Goglin@inria.fr>
Tested-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org> # v4.6+
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 8196dab4fc15 ("x86/cpu: Get rid of compute_unit_id")
Link: http://lkml.kernel.org/r/20170105092638.5247-1-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit a33d331761bc5dd330499ca5ceceb67f0640a8e6)

Orabug: 28783929

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/kernel/cpu/amd.c
amd.c: contextual

Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/cpu/AMD: Clean up cpu_llc_id assignment per topology feature

These changes do not affect current hw - just a cleanup:

Currently, we assume that a system has a single Last Level Cache (LLC)
per node, and that the cpu_llc_id is thus equal to the node_id. This no
longer applies since Fam17h can have multiple last level caches within a
node.

So group the cpu_llc_id assignment by topology feature and family in
order to make the computation of cpu_llc_id on the different families
more clear.

Here is how the LLC ID is being computed on the different families:

The NODEID_MSR feature only applies to Fam10h in which case the LLC is
at the node level.

The TOPOEXT feature is used on families 15h, 16h and 17h. So far we only
see multiple last level caches if L3 caches are available. Otherwise,
the cpu_llc_id will default to be the phys_proc_id.

We have L3 caches only on families 15h and 17h:

- on Fam15h, the LLC is at the node level.

- on Fam17h, the LLC is at the core complex level and can be found by
   right shifting the APIC ID. Also, keep the family checks explicit so that
   new families will fall back to the default, which will be node_id for
   TOPOEXT systems.
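
A sketch of the right-shift computation for Fam17h. The shift width of 3
(eight APIC IDs per core complex) is an assumption made for illustration, as
the text above does not state it; toy_llc_id() is a hypothetical name.

```c
/* Assumed for this sketch: 8 APIC IDs per core complex. */
#define TOY_CCX_SHIFT	3

/* Derive an LLC identifier by dropping the low APIC ID bits. */
unsigned int toy_llc_id(unsigned int apicid)
{
	return apicid >> TOY_CCX_SHIFT;
}
```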

Single node systems in families 10h and 15h will have a Node ID of 0
which will be the same as the phys_proc_id, so we don't need to check
for multiple nodes before using the node_id.

Tested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com>
[ Rewrote the commit message. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Aravind Gopalakrishnan <aravindksg.lkml@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20161108153054.bs3sajbyevq6a6uu@pd.tnic
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit b6a50cddbcbda7105355898ead18f1a647c22520)

Orabug: 28783929

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/cpu: Get rid of compute_unit_id

It is cpu_core_id anyway.

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1458917557-8757-3-git-send-email-bp@alien8.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
(cherry picked from commit 8196dab4fc159943df6baaac04973bb1accb7100)

Orabug: 28783929

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/kernel/cpu/amd.c
Removed smp_num_siblings calculation as we have it in get_topology_early.
Add UEK_KABI_DEPRECATE for compute_unit_id to cpuinfo to preserve kABI.

Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/topology: Fix AMD core count

It turns out AMD gets x86_max_cores wrong when there are compute
units.

The issue is that Linux assumes:

nr_logical_cpus = nr_cores * nr_siblings

But AMD reports its CU unit as 2 cores, but then sets num_smp_siblings
to 2 as well.
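
The double counting can be illustrated with one compute unit of two
threads. The correction shown, dividing the reported core count by the
sibling count, is an assumption for this sketch and the toy_* names are
hypothetical.

```c
/* The relation Linux assumes, as quoted above. */
unsigned int toy_nr_logical_cpus(unsigned int nr_cores, unsigned int nr_siblings)
{
	return nr_cores * nr_siblings;
}

/* Assumed correction: count each compute unit as a single core. */
unsigned int toy_fixed_cores(unsigned int reported_cores, unsigned int nr_siblings)
{
	return reported_cores / nr_siblings;
}
```

With 2 reported cores and 2 siblings the product is 4 although only 2
logical CPUs exist; after the correction the product is 2.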

Boris: fixup ras/mce_amd_inj.c too, to compute the Node Base Core
properly, according to the new nomenclature.

Fixes: 1f12e32f4cd5 ("x86/topology: Create logical package id")
Reported-by: Xiong Zhou <jencce.kernel@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andreas Herrmann <aherrmann@suse.com>
Cc: Andy Lutomirski <luto@kernel.org>
Link: http://lkml.kernel.org/r/20160317095220.GO6344@twins.programming.kicks-ass.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
(cherry picked from commit ee6825c80e870fff1a370c718ec77022ade0889b)

Orabug: 28783929

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/kernel/cpu/amd.c
arch/x86/ras/mce_amd_inj.c
amd.c: contextual
mce_amd_inj.c: does not exist

Signed-off-by: Brian Maly <brian.maly@oracle.com>

perf/x86/amd: Move nodes_per_socket into bsp_init_amd()

nodes_per_socket is static and it needn't be initialized many
times during every CPU core init. So move its initialization into
bsp_init_amd().

Signed-off-by: Huang Rui <ray.huang@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andreas Herrmann <herrmann.der.user@googlemail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Frédéric Weisbecker <fweisbec@gmail.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hector Marco-Gisbert <hecmargi@upv.es>
Cc: Jacob Shin <jacob.w.shin@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Robert Richter <rric@kernel.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: spg_linux_kernel@amd.com
Link: http://lkml.kernel.org/r/1452739808-11871-2-git-send-email-ray.huang@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 8dfeae0d73bf803be1a533e147b3b0ea69375596)

Orabug: 28783929

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/cpu/amd: Give access to the number of nodes in a physical package

Stash the number of nodes in a physical processor package
locally and add an accessor to be called by interested parties.
The first user is the MCE injection module which uses it to find
the node base core in a package for injecting a certain type of
errors.

Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com>
[ Rewrote the commit message, merged it with the accessor patch and unified naming. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jacob Shin <jacob.w.shin@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: mchehab@osg.samsung.com
Link: http://lkml.kernel.org/r/1433868317-18417-2-git-send-email-Aravind.Gopalakrishnan@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit cc2749e4095cbbcb35518fb2db5e926b85c3f25f)

Orabug: 28783929

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

ocfs2: should wait dio before inode lock in ocfs2_setattr()

We should wait for dio requests to finish before taking the inode lock in
ocfs2_setattr(); otherwise the following deadlock will happen:

process 1                  process 2                    process 3
truncate file 'A'          end_io of writing file 'A'   receiving the bast messages
ocfs2_setattr
ocfs2_inode_lock_tracker
  ocfs2_inode_lock_full
inode_dio_wait
  __inode_dio_wait
  -->waiting for all dio
  requests finish
                                                        dlm_proxy_ast_handler
                                                         dlm_do_local_bast
                                                          ocfs2_blocking_ast
                                                           ocfs2_generic_handle_bast
                                                            set OCFS2_LOCK_BLOCKED flag
                        dio_end_io
                         dio_bio_end_aio
                          dio_complete
                           ocfs2_dio_end_io
                            ocfs2_dio_end_io_write
                             ocfs2_inode_lock
                              __ocfs2_cluster_lock
                               ocfs2_wait_for_mask
                               -->waiting for OCFS2_LOCK_BLOCKED
                               flag to be cleared, that is waiting
                               for 'process 1' unlocking the inode lock
                           inode_dio_end
                           -->here dec the i_dio_count, but will never
                           be called, so a deadlock happened.

Link: http://lkml.kernel.org/r/59F81636.70508@huawei.com
Signed-off-by: Alex Chen <alex.chen@huawei.com>
Reviewed-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Acked-by: Changwei Ge <ge.changwei@h3c.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 28f5a8a7c033cbf3e32277f4cc9c6afd74f05300)

Orabug: 28852806
CVE: CVE-2017-18204

Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

Update dracut version requirement within the kernel

dracut uses the script /usr/share/dracut/modules.d/90kernel-modules/installkernel
to determine whether to add a block driver to the initramfs. On earlier UEK2
kernels, blk_init_queue was still used by xen-blkfront.c, but after the change
to MQ support (commit 907c3eb18e0) that call was replaced.
Updating the required version of dracut (>= 004-347) fixes the issue.

Orabug: 28873097

Signed-off-by: Jie Li <jie.l.li@oracle.com>
Reviewed-by: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

secureboot: update UEFI public keys in kernel rpms

Orabug: 28901191

Signed-off-by: Brian Maly <brian.maly@oracle.com>

hugetlbfs: fix kernel BUG at fs/hugetlbfs/inode.c:447!

This BUG is in the routine remove_inode_hugepages() as follows:
	/*
	 * If page is mapped, it was faulted in after being
	 * unmapped in caller.  Unmap (again) now after taking
	 * the fault mutex.  The mutex will prevent faults
	 * until we finish removing the page.
	 *
	 * This race can only happen in the hole punch case.
	 * Getting here in a truncate operation is a bug.
	 */
	if (unlikely(page_mapped(page))) {
		BUG_ON(truncate_op);

In this case, the elevated map count is not the result of a race.
Rather it was incorrectly incremented as the result of a bug in the
huge pmd sharing code.  Consider the following:
- Process A maps a hugetlbfs file of sufficient size and alignment
  (PUD_SIZE) that a pmd page could be shared.
- Process B maps the same hugetlbfs file with the same size and alignment
  such that a pmd page is shared.
- Process B then calls mprotect() to change protections for the mapping
  with the shared pmd.  As a result, the pmd is 'unshared'.
- Process B then calls mprotect() again to change protections for the
  mapping back to their original value.  pmd remains unshared.
- Process B then forks and process C is created.  During the fork process,
  we do dup_mm -> dup_mmap -> copy_page_range to copy page tables.  Copying
  page tables for hugetlb mappings is done in the routine
  copy_hugetlb_page_range.

In copy_hugetlb_page_range(), the destination pte is obtained by:
dst_pte = huge_pte_alloc(dst, addr, sz);
If pmd sharing is possible, the returned pointer will be to a pte in
an existing page table.  In the situation above, process C could share
with either process A or process B.  Since process A is first in the
list, the returned pte is a pointer to a pte in process A's page table.

However, the following check for pmd sharing is in copy_hugetlb_page_range.
	/* If the pagetables are shared don't copy or take references */
	if (dst_pte == src_pte)
		continue;

Since process C is sharing with process A instead of process B, the above
test fails.  The code in copy_hugetlb_page_range which follows assumes
dst_pte points to a huge_pte_none pte.  It copies the pte entry from
src_pte to dst_pte and increments this map count of the associated page.
This is how we end up with an elevated map count.

To solve, check the dst_pte entry for huge_pte_none.  If !none, this
implies PMD sharing so do not copy.
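
A toy model of the two checks, for illustration only. struct toy_pte and
toy_should_copy() are hypothetical; the real code works on hugetlb page
table entries and uses huge_pte_none().

```c
#include <stdbool.h>

/* Hypothetical page table entry: only tracks whether it is populated. */
struct toy_pte {
	bool present;
};

/* Returns true when the copy loop should copy src into dst. */
bool toy_should_copy(const struct toy_pte *src, const struct toy_pte *dst)
{
	if (dst == src)		/* existing check: same shared entry */
		return false;
	if (dst->present)	/* added check: dst already populated, so shared */
		return false;
	return true;
}
```

In the scenario above, dst points into process A's table while src belongs
to process B, so the pointers differ; only the second check catches it.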

Orabug: 28839992

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Larry Bassel <larry.bassel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

scsi: libsas: fix memory leak in sas_smp_get_phy_events()

We've got a memory leak with the following reproducer:

while true;
do cat /sys/class/sas_phy/phy-1:0:12/invalid_dword_count >/dev/null;
done

The buffer req is allocated and not freed after we return. Fix it.
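
The pattern of the fix, freeing the request buffer on every return path, can
be sketched in userspace. The allocation counter and all toy_* names are
hypothetical and exist only to make the leak observable.

```c
#include <stdlib.h>

static int toy_live_allocs;	/* number of buffers not yet freed */

static void *toy_alloc(size_t n)
{
	void *p = malloc(n);

	if (p)
		toy_live_allocs++;
	return p;
}

static void toy_free(void *p)
{
	if (p)
		toy_live_allocs--;
	free(p);
}

/* Allocates a request and a response buffer and releases both. */
int toy_get_phy_events(void)
{
	unsigned char *req = toy_alloc(8);
	unsigned char *resp;

	if (!req)
		return -1;
	resp = toy_alloc(32);
	if (!resp) {
		toy_free(req);
		return -1;
	}
	/* ... the request would be issued here ... */
	toy_free(req);		/* the release that was missing */
	toy_free(resp);
	return 0;
}

/* Accessor so callers can check for leaked buffers. */
int toy_live(void)
{
	return toy_live_allocs;
}
```

Calling toy_get_phy_events() in a loop, as the shell loop above does through
sysfs, leaves toy_live() at zero.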

Fixes: 2908d778ab3e ("[SCSI] aic94xx: new driver")
Signed-off-by: Jason Yan <yanaijie@huawei.com>
CC: John Garry <john.garry@huawei.com>
CC: chenqilin <chenqilin2@huawei.com>
CC: chenxiang <chenxiang66@hisilicon.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 4a491b1ab11ca0556d2fda1ff1301e862a2d44c4)
Orabug: 27927687
CVE: CVE-2018-7757
Signed-off-by: Chuck Anderson <chuck.anderson@oracle.com>
(cherry picked from commit 2a0a021e9d96ba54719f977b798e3bdd928a6c53)
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

KVM: vmx: shadow more fields that are read/written on every vmexits

Compared to when VMCS shadowing was added to KVM, we are reading/writing
a few more fields: the PML index, the interrupt status and the preemption
timer value. The first two are because we are exposing more features
to nested guests. Adding them to the shadow VMCS field lists reduces
the cost of a vmexit by about 1000 clock cycles for each field that exists
on bare metal.

On the other hand, the guest BNDCFGS and TSC offset are not written on
fast paths, so remove them.

Suggested-by: Jim Mattson <jmattson@google.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
(cherry picked from commit c5d167b27e00026711ad19a33a23d5d3d562148a)

Orabug: 28581045

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/kvm/vmx.c
Different context and dropped VMX_PREEMPTION_TIMER_VALUE shadow as it requires
a lot more dependencies.

Signed-off-by: Brian Maly <brian.maly@oracle.com>

vhost/scsi: Use common handling code in request queue handler

NOTE: copy_from_iter_full() is not available so use copy_from_iter().

Orabug: 28775573

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

vhost/scsi: Extract common handling code from control queue handler

Prepare to change the request queue handler to use common handling routines.

NOTE: copy_from_iter_full() is not available so use copy_from_iter().

Orabug: 28775573

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

vhost/scsi: Respond to control queue operations

The vhost-scsi driver currently does not handle any control queue
operations. In particular, vhost_scsi_ctl_handle_kick merely prints out
a debug message but does nothing else. This can cause guest VMs to hang.

As part of SCSI recovery from an error, e.g., an I/O timeout, the SCSI
midlayer attempts to abort the failed operation. The SCSI virtio driver
translates the abort to a SCSI TMF request that gets put on the control
queue (virtscsi_abort -> virtscsi_tmf). The SCSI virtio driver then
waits indefinitely for this request to be completed, but it never will
because vhost-scsi never responds to that request.

To avoid a hang, always respond to control queue operations; explicitly
reject TMF requests, and return a no-op response to event requests.
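
The policy above can be sketched in user-space Python as follows. This is
only an illustration, not the driver code: handle_ctl_request() is a
hypothetical helper, and the numeric constants are the virtio-scsi values as
I recall them from the virtio specification, so they should be checked
against the spec before being relied on.

```python
# Control queue request types (assumed values, see note above).
VIRTIO_SCSI_T_TMF = 0
VIRTIO_SCSI_T_AN_QUERY = 1
VIRTIO_SCSI_T_AN_SUBSCRIBE = 2

# Response codes (assumed values, see note above).
VIRTIO_SCSI_S_OK = 0
VIRTIO_SCSI_S_FUNCTION_REJECTED = 11


def handle_ctl_request(req_type):
    """Hypothetical dispatcher: build a response for a control request."""
    if req_type == VIRTIO_SCSI_T_TMF:
        # Task management is not implemented: reject explicitly so the
        # guest does not wait forever for a completion.
        return {"response": VIRTIO_SCSI_S_FUNCTION_REJECTED}
    if req_type in (VIRTIO_SCSI_T_AN_QUERY, VIRTIO_SCSI_T_AN_SUBSCRIBE):
        # Event requests get a no-op answer: success, no events reported.
        return {"response": VIRTIO_SCSI_S_OK, "event_actual": 0}
    # Other request types are outside what this message describes.
    raise ValueError("unknown control request type: %r" % (req_type,))
```

The point of the sketch is that every recognised request type produces a
response, which is what lets the guest's SCSI error handling make progress.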

NOTE: copy_from_iter_full() is not available so use copy_from_iter().

Orabug: 28775573

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

scsi: lpfc: devloss timeout race condition caused null pointer reference

Orabug: 27994179

A race between the devloss timeout handler and I/O completion caused the
devloss timeout handler to dereference a pointer that had already been
released.

Add a check for LPFC_IO_ON_TXCMPLQ in lpfc_sli_validate_fcp_iocb() to catch
the race between I/O completion and the devloss timeout handler's attempt
to abort the I/O. Also check the lpfc_cmd->rdata pointer before
dereferencing lpfc_cmd->rdata->pnode.

In addition, make lpfc_sli_abort_iocb() check whether a driver-initiated
FCP I/O flush is already under way before proceeding to abort I/Os.

Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit b0e830125b669570d8096b8ba22eb00f659fc05e)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
Reviewed-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

scsi: qla2xxx: Fix race condition between iocb timeout and initialisation

qla2x00_init_timer() calls add_timer() on the iocb timeout timer, which
means the timeout function pointer and any data that the function depends on
must be initialised beforehand.

Move this initialisation before each call to qla2x00_init_timer(). In some
cases qla2x00_init_timer() initialises a completion structure needed by the
timeout function, so move the call to add_timer() after that.

Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Acked-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit e74e7d95878d7993cf56c801d55d78f16ea58d1d)

Orabug: 28013813

Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Chuck Anderson <chuck.anderson@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/scsi/qla2xxx/qla_gs.c
drivers/scsi/qla2xxx/qla_init.c
drivers/scsi/qla2xxx/qla_inline.h
drivers/scsi/qla2xxx/qla_iocb.c
drivers/scsi/qla2xxx/qla_mid.c
drivers/scsi/qla2xxx/qla_mr.c

Signed-off-by: Brian Maly <brian.maly@oracle.com>

i40e: Add programming descriptors to cleaned_count

This patch updates the i40e driver to include programming descriptors in
the cleaned_count. Without this change it becomes possible for us to leak
memory as we don't trigger a large enough allocation when the time comes to
allocate new buffers and we end up overwriting a number of rx_buffers equal
to the number of programming descriptors we encountered.

Fixes: 0e626ff7ccbf ("i40e: Fix support for flow director programming status")
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Anders K. Pedersen <akp@cohaesio.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
(cherry picked from commit 62b4c6694dfd3821bd5ea5bed48238bbabd5fe8b)

Orabug: 28228724

Signed-off-by: Honglei Wang <honglei.wang@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

i40e: Fix memory leak related filter programming status

It looks like we weren't correctly placing the pages from buffers that had
been used to return a filter programming status back on the ring. As a
result they were being overwritten and tracking of the pages was lost.

This change works to correct that by incorporating part of
i40e_put_rx_buffer into the programming status handler code. As a result we
should now be correctly placing the pages for those buffers on the
re-allocation list instead of letting them stay in place.

Fixes: 0e626ff7ccbf ("i40e: Fix support for flow director programming status")
Reported-by: Anders K. Pedersen <akp@cohaesio.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Anders K Pedersen <akp@cohaesio.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
(cherry picked from commit 2b9478ffc550f17c6cd8c69057234e91150f5972)

Orabug: 28228724

Signed-off-by: Honglei Wang <honglei.wang@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/net/ethernet/intel/i40e/i40e_txrx.c

Signed-off-by: Brian Maly <brian.maly@oracle.com>

xen-swiotlb: use actually allocated size on check physical continuous

xen_swiotlb_{alloc,free}_coherent() allocate/free memory based on the
order of the pages and not the size argument (bytes). This is inconsistent
with range_straddles_page_boundary() and memset(), which use the 'size' value,
which may lead to not exchanging memory with Xen (range_straddles_page_boundary()
returned true). The later call to xen_swiotlb_free_coherent() would then
actually try to exchange the memory with Xen, leading to the kernel
hitting a BUG (as the hypercall returned an error).

This patch fixes it by making the 'size' variable be of the same size
as the amount of memory allocated.
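
To make the mismatch concrete, here is a small Python calculation of the
requested size versus the size implied by the page order. It is a sketch
only: 4 KiB pages are an assumption, and get_order() is a plain
re-implementation of the usual "smallest order that fits" rule, not the
kernel helper itself.

```python
PAGE_SHIFT = 12                 # assumption: 4 KiB pages
PAGE_SIZE = 1 << PAGE_SHIFT


def get_order(size):
    """Smallest order such that (PAGE_SIZE << order) >= size."""
    order = 0
    while (PAGE_SIZE << order) < size:
        order += 1
    return order


def allocated_size(size):
    """Bytes actually covered by an order-based allocation for `size`."""
    return PAGE_SIZE << get_order(size)


requested = 3 * PAGE_SIZE             # caller asks for 12 KiB
actual = allocated_size(requested)    # order 2, i.e. 16 KiB are allocated
```

A boundary check performed over `requested` bytes can therefore disagree with
one performed over `actual` bytes, which is the inconsistency the patch
removes by using the allocated size throughout.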

CC: stable@vger.kernel.org
Signed-off-by: Joe Jin <joe.jin@oracle.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Christoph Helwig <hch@lst.de>
Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: John Sobecki <john.sobecki@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Orabug: 28258102

(cherry picked from commit 7250f422da0480d8512b756640f131b9b893ccda)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/xen/swiotlb-xen.c

Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

Revert "Revert "xen-swiotlb: fix the check condition for xen_swiotlb_free_coherent""

This reverts commit 4dbc2ddc8d51dd616b95d868b0223d102428b995.

The root cause of panic in commit 7fc30809bfa8 ("xen-swiotlb: fix the check
condition for xen_swiotlb_free_coherent") is identified. Enable this patch
again as the fix is already available.

The Reviewed-by for the revert is only to sync the uek4 code with upstream.
It is not clear at the moment whether upstream code is correct.

Orabug: 28258102

Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

net/mlx4_en: fix potential use-after-free with dma_unmap_page

[ Not relevant upstream, therefore no upstream commit. ]

To fix, unmap the page as soon as possible.

When swiotlb is in use, calling dma_unmap_page means that
the original page mapped with dma_map_page must still be valid,
as swiotlb will copy data from its internal cache back to the
originally requested DMA location.

When GRO is enabled, before this patch all references to the
original frag may be put and the page freed before dma_unmap_page
in mlx4_en_free_frag is called.

It is possible there is a path where the use-after-free occurs
even with GRO disabled, but this has not been observed so far.

The bug can be trivially detected by doing the following:

* Compile the kernel with DEBUG_PAGEALLOC
* Run the kernel as a Xen Dom0
* Leave GRO enabled on the interface
* Run a 10 second or more test with iperf over the interface.

This bug was likely introduced in
commit 4cce66cdd14a ("mlx4_en: map entire pages to increase throughput"),
first part of v3.6.

It was incidentally fixed in
commit 34db548bfb95 ("mlx4: add page recycling in receive path"),
first part of v4.12.

This version applies to the v4.9 series.

Signed-off-by: Sarah Newman <srn@prgmr.com>
Tested-by: Sarah Newman <srn@prgmr.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
Cc: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 5d70bd5c98d0e655bde2aae2b5251bdd44df5e71)

Orabug: 28376051

Signed-off-by: Jacob Wen <jian.w.wen@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/net/ethernet/mellanox/mlx4/en_rx.c
[ Lack of frag_info->dma_dir ]

Signed-off-by: Brian Maly <brian.maly@oracle.com>

ocfs2: fix ocfs2 read block panic

Orabug: 28580543

While reading a block, an I/O error may be returned due to an underlying
storage issue. In that case, BH_NeedsValidate is left set in the buffer head.
The next time the same block is read, if it has already been linked into the
journal, the following panic is triggered.

[203748.702517] kernel BUG at fs/ocfs2/buffer_head_io.c:342!
[203748.702533] invalid opcode: 0000 [#1] SMP
[203748.702561] Modules linked in: ocfs2 ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs sunrpc dm_switch dm_queue_length dm_multipath bonding be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i iw_cxgb4 cxgb4 cxgb3i libcxgbi iw_cxgb3 cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ipmi_devintf iTCO_wdt iTCO_vendor_support dcdbas ipmi_ssif i2c_core ipmi_si ipmi_msghandler acpi_pad pcspkr sb_edac edac_core lpc_ich mfd_core shpchp sg tg3 ptp pps_core ext4 jbd2 mbcache2 sr_mod cdrom sd_mod ahci libahci megaraid_sas wmi dm_mirror dm_region_hash dm_log dm_mod
[203748.703024] CPU: 7 PID: 38369 Comm: touch Not tainted 4.1.12-124.18.6.el6uek.x86_64 #2
[203748.703045] Hardware name: Dell Inc. PowerEdge R620/0PXXHP, BIOS 2.5.2 01/28/2015
[203748.703067] task: ffff880768139c00 ti: ffff88006ff48000 task.ti: ffff88006ff48000
[203748.703088] RIP: 0010:[<ffffffffa05e9f09>]  [<ffffffffa05e9f09>] ocfs2_read_blocks+0x669/0x7f0 [ocfs2]
[203748.703130] RSP: 0018:ffff88006ff4b818  EFLAGS: 00010206
[203748.703389] RAX: 0000000008620029 RBX: ffff88006ff4b910 RCX: 0000000000000000
[203748.703885] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 00000000023079fe
[203748.704382] RBP: ffff88006ff4b8d8 R08: 0000000000000000 R09: ffff8807578c25b0
[203748.704877] R10: 000000000f637376 R11: 000000003030322e R12: 0000000000000000
[203748.705373] R13: ffff88006ff4b910 R14: ffff880732fe38f0 R15: 0000000000000000
[203748.705871] FS:  00007f401992c700(0000) GS:ffff880bfebc0000(0000) knlGS:0000000000000000
[203748.706370] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[203748.706627] CR2: 00007f4019252440 CR3: 00000000a621e000 CR4: 0000000000060670
[203748.707124] Stack:
[203748.707371]  ffff88006ff4b828 ffffffffa0609f52 ffff88006ff4b838 0000000000000001
[203748.707885]  0000000000000000 0000000000000000 ffff880bf67c3800 ffffffffa05eca00
[203748.708399]  00000000023079ff ffffffff81c58b80 0000000000000000 0000000000000000
[203748.708915] Call Trace:
[203748.709175]  [<ffffffffa0609f52>] ? ocfs2_inode_cache_io_unlock+0x12/0x20 [ocfs2]
[203748.709680]  [<ffffffffa05eca00>] ? ocfs2_empty_dir_filldir+0x80/0x80 [ocfs2]
[203748.710185]  [<ffffffffa05ec0cb>] ocfs2_read_dir_block_direct+0x3b/0x200 [ocfs2]
[203748.710691]  [<ffffffffa05f0fbf>] ocfs2_prepare_dx_dir_for_insert.isra.57+0x19f/0xf60 [ocfs2]
[203748.711204]  [<ffffffffa065660f>] ? ocfs2_metadata_cache_io_unlock+0x1f/0x30 [ocfs2]
[203748.711716]  [<ffffffffa05f4f3a>] ocfs2_prepare_dir_for_insert+0x13a/0x890 [ocfs2]
[203748.712227]  [<ffffffffa05f442e>] ? ocfs2_check_dir_for_entry+0x8e/0x140 [ocfs2]
[203748.712737]  [<ffffffffa061b2f2>] ocfs2_mknod+0x4b2/0x1370 [ocfs2]
[203748.713003]  [<ffffffffa061c385>] ocfs2_create+0x65/0x170 [ocfs2]
[203748.713263]  [<ffffffff8121714b>] vfs_create+0xdb/0x150
[203748.713518]  [<ffffffff8121b225>] do_last+0x815/0x1210
[203748.713772]  [<ffffffff812192e9>] ? path_init+0xb9/0x450
[203748.714123]  [<ffffffff8121bca0>] path_openat+0x80/0x600
[203748.714378]  [<ffffffff811bcd45>] ? handle_pte_fault+0xd15/0x1620
[203748.714634]  [<ffffffff8121d7ba>] do_filp_open+0x3a/0xb0
[203748.714888]  [<ffffffff8122a767>] ? __alloc_fd+0xa7/0x130
[203748.715143]  [<ffffffff81209ffc>] do_sys_open+0x12c/0x220
[203748.715403]  [<ffffffff81026ddb>] ? syscall_trace_enter_phase1+0x11b/0x180
[203748.715668]  [<ffffffff816f0c9f>] ? system_call_after_swapgs+0xe9/0x190
[203748.715928]  [<ffffffff8120a10e>] SyS_open+0x1e/0x20
[203748.716184]  [<ffffffff816f0d5e>] system_call_fastpath+0x18/0xd7
[203748.716440] Code: 00 00 48 8b 7b 08 48 83 c3 10 45 89 f8 44 89 e1 44 89 f2 4c 89 ee e8 07 06 11 e1 48 8b 03 48 85 c0 75 df 8b 5d c8 e9 4d fa ff ff <0f> 0b 48 8b 7d a0 e8 dc c6 06 00 48 b8 00 00 00 00 00 00 00 10
[203748.717505] RIP  [<ffffffffa05e9f09>] ocfs2_read_blocks+0x669/0x7f0 [ocfs2]
[203748.717775]  RSP <ffff88006ff4b818>

Joseph reported a similar panic earlier.
Link: https://oss.oracle.com/pipermail/ocfs2-devel/2013-May/008931.html
Link: http://lkml.kernel.org/r/20180912063207.29484-1-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Changwei Ge <ge.changwei@h3c.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 234b69e3e089d850a98e7b3145bd00e9b52b1111)
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

block: fix bdi vs gendisk lifetime mismatch

Orabug: 28645416

The name for a bdi of a gendisk is derived from the gendisk's devt.
However, since the gendisk is destroyed before the bdi it leaves a
window where a new gendisk could dynamically reuse the same devt
while a bdi with the same name is still live. Arrange for the bdi to hold a
reference against its "owner" disk device while it is registered.
Otherwise we can hit sysfs duplicate name collisions like the
following:

WARNING: CPU: 10 PID: 2078 at fs/sysfs/dir.c:31
sysfs_warn_dup+0x64/0x80
sysfs: cannot create duplicate filename
'/devices/virtual/bdi/259:1'

Hardware name: HP ProLiant DL580 Gen8, BIOS P79 05/06/2015
0000000000000286 0000000002c04ad5 ffff88006f24f970 ffffffff8134caec
ffff88006f24f9c0 0000000000000000 ffff88006f24f9b0 ffffffff8108c351
0000001f0000000c ffff88105d236000 ffff88105d1031e0 ffff8800357427f8
Call Trace:
[<ffffffff8134caec>] dump_stack+0x63/0x87
[<ffffffff8108c351>] __warn+0xd1/0xf0
[<ffffffff8108c3cf>] warn_slowpath_fmt+0x5f/0x80
[<ffffffff812a0d34>] sysfs_warn_dup+0x64/0x80
[<ffffffff812a0e1e>] sysfs_create_dir_ns+0x7e/0x90
[<ffffffff8134faaa>] kobject_add_internal+0xaa/0x320
[<ffffffff81358d4e>] ? vsnprintf+0x34e/0x4d0
[<ffffffff8134ff55>] kobject_add+0x75/0xd0
[<ffffffff816e66b2>] ? mutex_lock+0x12/0x2f
[<ffffffff8148b0a5>] device_add+0x125/0x610
[<ffffffff8148b788>] device_create_groups_vargs+0xd8/0x100
[<ffffffff8148b7cc>] device_create_vargs+0x1c/0x20
[<ffffffff811b775c>] bdi_register+0x8c/0x180
[<ffffffff811b7877>] bdi_register_dev+0x27/0x30
[<ffffffff813317f5>] add_disk+0x175/0x4a0

Cc: <stable@vger.kernel.org>
Reported-by: Yi Zhang <yizhan@redhat.com>
Tested-by: Yi Zhang <yizhan@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Fixed up missing 0 return in bdi_register_owner().

Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit df08c32ce3be5be138c1dbfcba203314a3a7cd6f)
Conflicts: mm/backing-dev.c
include/linux/backing-dev.h

The patch breaks KABI as-is because of the introduction of a new "owner"
field in struct backing_dev_info. To work around that, use the
UEK_KABI_EXTEND() macro to wrap the "owner" field with ifdef GENKSYMS.

Signed-off-by: Ashish Samant <ashish.samant@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

e1000e: Fix link check race condition

Orabug: 28716958

Alex reported the following race condition:

	/* link goes up... interrupt... schedule watchdog */
	\ e1000_watchdog_task
		\ e1000e_has_link
			\ hw->mac.ops.check_for_link() === e1000e_check_for_copper_link
				\ e1000e_phy_has_link_generic(..., &link)
					link = true

					 /* link goes down... interrupt */
					 \ e1000_msix_other
						 hw->mac.get_link_status = true

				/* link is up */
				mac->get_link_status = false

		link_active = true
		/* link_active is true, wrongly, and stays so because
		 * get_link_status is false */

Avoid this problem by making sure that we don't set get_link_status = false
after having checked the link.
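
The interleaving can be replayed deterministically in a small Python model.
This is a sketch, not the driver: Mac, check_link_old() and check_link_new()
are hypothetical names, the interrupt is injected as a callback at the point
where the trace above shows it, and the "new" ordering is just one way of
never clearing the flag after the PHY has been read.

```python
class Mac:
    def __init__(self):
        # True means "the link state must be re-checked".
        self.get_link_status = True


def link_down_interrupt(mac):
    # Models what e1000_msix_other does in the trace above.
    mac.get_link_status = True


def check_link_old(mac, phy_has_link, interrupt):
    link = phy_has_link()          # PHY reports link up
    interrupt(mac)                 # link-down interrupt lands here
    if not link:
        return
    mac.get_link_status = False    # overwrites the interrupt's request


def check_link_new(mac, phy_has_link, interrupt):
    mac.get_link_status = False    # cleared before the PHY is read
    link = phy_has_link()
    interrupt(mac)                 # link-down interrupt lands here
    if not link:
        mac.get_link_status = True


old_mac, new_mac = Mac(), Mac()
check_link_old(old_mac, lambda: True, link_down_interrupt)
check_link_new(new_mac, lambda: True, link_down_interrupt)
```

In the old ordering the flag ends up false, so the stale "link up" result
sticks; in the new ordering the interrupt's request for a re-check survives.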

It seems this problem has been present since the introduction of e1000e.

Link: https://lkml.org/lkml/2018/1/29/338
Reported-by: Alexander Duyck <alexander.duyck@gmail.com>
Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
(cherry picked from commit e2710dbf0dc1e37d85368e2404049dadda848d5a)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

Revert "e1000e: Separate signaling for link check/link up"

Orabug: 28716958

This reverts commit 19110cfbb34d4af0cdfe14cd243f3b09dc95b013.
This reverts commit 4110e02eb45ea447ec6f5459c9934de0a273fb91.
This reverts commit d3604515c9eda464a92e8e67aae82dfe07fe3c98.

Commit 19110cfbb34d ("e1000e: Separate signaling for link check/link up")
changed what happens to the link status when there is an error which
happens after "get_link_status = false" in the copper check_for_link
callbacks. Previously, such an error would be ignored and the link
considered up. After that commit, any error implies that the link is down.

Revert commit 19110cfbb34d ("e1000e: Separate signaling for link check/link
up") and its followups. After reverting, the race condition described in
the log of commit 19110cfbb34d is reintroduced. It may still be triggered
by LSC events but this should keep the link down in case the link is
electrically unstable, as discussed. The race may no longer be
triggered by RXO events because commit 4aea7a5c5e94 ("e1000e: Avoid
receiver overrun interrupt bursts") restored reading icr in the Other
handler.

Link: https://lkml.org/lkml/2018/3/1/789
Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
(cherry picked from commit 3016e0a0c91246e55418825ba9aae271be267522)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

e1000e: Avoid missed interrupts following ICR read

Orabug: 28716958

The 82574 specification update errata 12 states that interrupts may be
missed if ICR is read while INT_ASSERTED is not set. Avoid that problem by
setting all bits related to events that can trigger the Other interrupt in
IMS.

The Other interrupt is raised for such events regardless of whether or not
they are set in IMS. However, only when they are set is the INT_ASSERTED
bit also set in ICR.

By doing this, we ensure that INT_ASSERTED is always set when we read ICR
in e1000_msix_other() and steer clear of the errata. This also ensures that
ICR will automatically be cleared on read, therefore we no longer need to
clear bits explicitly.

Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
(cherry picked from commit 116f4a640b3197401bc93b8adc6c35040308ceff)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

e1000e: Fix queue interrupt re-raising in Other interrupt

Orabug: 28716958

Restores the ICS write for Rx/Tx queue interrupts which was present before
commit 16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt", v4.5-rc1)
but was not restored in commit 4aea7a5c5e94
("e1000e: Avoid receiver overrun interrupt bursts", v4.15-rc1).

This re-raises the queue interrupts in case the txq or rxq bits were set in
ICR and the Other interrupt handler read and cleared ICR before the queue
interrupt was raised.

Fixes: 4aea7a5c5e94 ("e1000e: Avoid receiver overrun interrupt bursts")
Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
(cherry picked from commit 361a954e6a7215de11a6179ad9bdc07d7e394b04)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

Partial revert "e1000e: Avoid receiver overrun interrupt bursts"

Orabug: 28716958

This partially reverts commit 4aea7a5c5e940c1723add439f4088844cd26196d.

We keep the fix for the first part of the problem (1) described in the log
of that commit, that is to read ICR in the other interrupt handler. We
remove the fix for the second part of the problem (2), Other interrupt
throttling.

Bursts of "Other" interrupts may once again occur during rxo (receive
overflow) traffic conditions. This is deemed acceptable in the interest of
avoiding unforeseen fallout from changes that are not strictly necessary.
As discussed, the e1000e driver should be in "maintenance mode".

Link: https://www.spinics.net/lists/netdev/msg480675.html
Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
(cherry picked from commit 1f0ea19722ef9dfa229a9540f70b8d1c34a98a6a)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

e1000e: Remove Other from EIAC

Orabug: 28716958

It was reported that emulated e1000e devices in vmware esxi 6.5 Build
7526125 do not link up after commit 4aea7a5c5e94 ("e1000e: Avoid receiver
overrun interrupt bursts", v4.15-rc1). Some tracing shows that after
e1000e_trigger_lsc() is called, ICR reads out as 0x0 in e1000_msix_other()
on emulated e1000e devices. In comparison, on real e1000e 82574 hardware,
icr=0x80000004 (_INT_ASSERTED | _LSC) in the same situation.

Some experimentation showed that this flaw in vmware e1000e emulation can
be worked around by not setting Other in EIAC. This is how it was before
16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt", v4.5-rc1).

Fixes: 4aea7a5c5e94 ("e1000e: Avoid receiver overrun interrupt bursts")
Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
(cherry picked from commit 745d0bd3af99ccc8c5f5822f808cd133eadad6ac)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

Fix error code in nfs_lookup_verify_inode()

Return -ESTALE to force a lookup when the file has no more links

Signed-off-by: Lance Shelton <lance.shelton@hammerspace.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Orabug: 28789030

(cherry picked from commit a61246c96195fc5f7500f6842e883b9eb1567d8d)
Signed-off-by: Calum Mackay <calum.mackay@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Tested-by: alfredo.ramirez@oracle.com
Signed-off-by: Brian Maly <brian.maly@oracle.com>

workqueue: Allow modifying low level unbound workqueue cpumask

Allow modifying the low-level unbound workqueue cpumask through
sysfs. This is performed by traversing the entire workqueue list
and calling apply_wqattrs_prepare() on the unbound workqueues
with the new low level mask. Only after all the preparations are done
do we commit them all together.

Ordered workqueues are not affected by the low level unbound workqueue
cpumask for now; they will be handled in the near future.

All the (default & per-node) pwqs are mandatorily controlled by
the low level cpumask. If the user configured cpumask doesn't overlap
with the low level cpumask, the low level cpumask will be used for the
wq instead.
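
The rule in the paragraph above can be written down as a short Python
sketch. Representing cpumasks as sets of CPU numbers and the name
effective_cpumask() are assumptions made for illustration; the kernel uses
struct cpumask and its own helpers.

```python
def effective_cpumask(user_mask, low_level_mask):
    """Return the cpumask a workqueue actually runs with.

    Both arguments are sets of CPU numbers (hypothetical representation).
    """
    effective = user_mask & low_level_mask
    # No overlap: fall back to the low level mask, as described above.
    return effective if effective else set(low_level_mask)
```

For example, a user mask of CPUs 0-3 combined with a low level mask of CPUs
2-5 leaves CPUs 2 and 3, while a user mask of CPUs 0-1 has no overlap and
falls back to the whole low level mask.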

The comment of wq_calc_node_cpumask() is updated and explicitly
requires that its first argument should be the attrs of the default
pwq.

The default wq_unbound_cpumask is cpu_possible_mask.  The workqueue
subsystem doesn't know its best default value, so let the system manager
or another subsystem set it when needed.

Changed from V8:
  merge the calculating code for the attrs of the default pwq together.
  minor change the code&comments for saving the user configured attrs.
  remove unnecessary list_del().
  minor update the comment of wq_calc_node_cpumask().
  update the comment of workqueue_set_unbound_cpumask();

Cc: Christoph Lameter <cl@linux.com>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Mike Galbraith <bitbucket@online.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Original-patch-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
(cherry picked from commit 042f7df15a4fff8eec42873f755aea848dcdedd1)

Orabug: 28813166

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

workqueue: Create low-level unbound workqueues cpumask

Create a cpumask that limits the affinity of all unbound workqueues.
This cpumask is controlled through a file at the root of the workqueue
sysfs directory.

It works on a lower level than the per WQ_SYSFS workqueues cpumask files
such that the effective cpumask applied for a given unbound workqueue is
the intersection of /sys/devices/virtual/workqueue/$WORKQUEUE/cpumask and
the new /sys/devices/virtual/workqueue/cpumask file.

This patch implements the basic infrastructure and the read interface.
wq_unbound_cpumask is initially set to cpu_possible_mask.

Cc: Christoph Lameter <cl@linux.com>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Mike Galbraith <bitbucket@online.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
(cherry picked from commit b05a79280b346eb24ddb73b39988398015291075)

Orabug: 28813166

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

scsi: sg: mitigate read/write abuse

As Al Viro noted in commit 128394eff343 ("sg_write()/bsg_write() is not fit
to be called under KERNEL_DS"), sg improperly accesses userspace memory
outside the provided buffer, permitting kernel memory corruption via
splice(). But it doesn't just do it on ->write(), also on ->read().

As a band-aid, make sure that the ->read() and ->write() handlers can not
be called in weird contexts (kernel context or credentials different from
file opener), like for ib_safe_file_access().

If someone needs to use these interfaces from different security contexts,
a new interface should be written that goes through the ->ioctl() handler.

I've mostly copypasted ib_safe_file_access() over as sg_safe_file_access()
because I couldn't find a good common header - please tell me if you know a
better way.

[mkp: s/_safe_/_check_/]

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: <stable@vger.kernel.org>
Signed-off-by: Jann Horn <jannh@google.com>
Acked-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Orabug: 28824718
CVE: CVE-2017-13168

Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
(cherry picked from commit 26b5b874aff5659a7e26e5b1997e3df2c41fa7fd)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/scsi/sg.c

Signed-off-by: Brian Maly <brian.maly@oracle.com>

Revert "rds: RDS (tcp) hangs on sendto() to unresponding address"

Orabug: 28837953

This reverts commit 4d8376fac652927fcc94ca88db134ce7822c03c1.

Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/speculation: Retpoline should always be available on Skylake

Now that we can dynamically toggle retpoline on or off, retpoline
should always be available even on Skylake platforms, without
having to explicitly select retpoline at boot time.

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Jamie Iles <jamie.iles@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
(cherry picked from UEK5 commit 7bbc586ac477df24012552712df10ab9ae300919)

Orabug: 28801831

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Jamie Iles <jamie.iles@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/speculation: Add sysfs entry to enable/disable retpoline

Add /sys/kernel/debug/x86/retpoline_enabled to enable/disable retpoline.
Enabling retpoline will also enable IBRS for the firmware.

Note that IBRS and retpoline can't be enabled together. Enabling retpoline
while IBRS is already enabled will automatically disable IBRS. Similarly,
enabling IBRS while retpoline is already enabled will automatically disable
retpoline.

On Skylake, retpoline is not provided and can't be enabled unless the system
has been explicitly booted with retpoline (using spectre_v2=retpoline or
spectre_v2_heuristics=skylake=off).

Also fix the behavior when retpoline is not available (!CONFIG_RETPOLINE):
now we will try using IBRS (if it is available) instead of not using any
mitigation.
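
The mutual exclusion and the fallback described above can be sketched as a
small state model. This is an illustration only, not the kernel
implementation: the struct, its fields, and the model_* functions are
hypothetical names, the model ignores the separate firmware IBRS setting and
the non-retpoline module warning, and "available" folds together
!CONFIG_RETPOLINE and the Skylake restriction.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the kernel-side Spectre v2 mitigation state. */
struct spec_model {
	bool retpoline_available; /* false e.g. without CONFIG_RETPOLINE */
	bool ibrs_available;      /* false without hardware/microcode support */
	bool retpoline_enabled;
	bool ibrs_enabled;
};

/* Enable or disable retpoline; enabling it switches IBRS off. */
bool model_set_retpoline(struct spec_model *m, bool on)
{
	if (on) {
		if (!m->retpoline_available)
			return false;  /* cannot be enabled on this system */
		m->retpoline_enabled = true;
		m->ibrs_enabled = false;
	} else {
		m->retpoline_enabled = false;
	}
	/* Invariant from the commit message: never both at once. */
	assert(!(m->retpoline_enabled && m->ibrs_enabled));
	return true;
}

/* Enable or disable IBRS; enabling it switches retpoline off. */
bool model_set_ibrs(struct spec_model *m, bool on)
{
	if (on) {
		if (!m->ibrs_available)
			return false;
		m->ibrs_enabled = true;
		m->retpoline_enabled = false;
	} else {
		m->ibrs_enabled = false;
	}
	assert(!(m->retpoline_enabled && m->ibrs_enabled));
	return true;
}

/*
 * Simplified default selection: prefer retpoline, and fall back to IBRS
 * (if available) rather than leaving the system with no mitigation.
 */
void model_select_default(struct spec_model *m)
{
	if (!model_set_retpoline(m, true))
		(void)model_set_ibrs(m, true);
}
```

With this model, writing 1 to the hypothetical retpoline toggle while IBRS
is on leaves only retpoline enabled, and a system without retpoline ends up
on IBRS when it is available.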

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
(cherry picked from UEK5 commit d75554157882d9b4df91f0b2bbc4907e2731781e)

[Backport: a large part of d75554157882d9b4df91f0b2bbc4907e2731781e
was already ported in a previous commit ("x86/speculation: switch to IBRS
when loading a non-retpoline module"). This ports the remaining part,
which effectively adds the retpoline_enabled sysfs entry.

Also we issue a warning when enabling retpoline and a non-retpoline
module is loaded.]

Orabug: 28607548

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>