www.infradead.org Git - users/jedix/linux-maple.git/log

vhost/scsi: Respond to control queue operations

The vhost-scsi driver currently does not handle any control queue
operations. In particular, vhost_scsi_ctl_handle_kick, merely prints out
a debug message but does nothing else. This can cause guest VMs to hang.

As part of SCSI recovery from an error, e.g., an I/O timeout, the SCSI
midlayer attempts to abort the failed operation. The SCSI virtio driver
translates the abort to a SCSI TMF request that gets put on the control
queue (virtscsi_abort -> virtscsi_tmf). The SCSI virtio driver then
waits indefinitely for this request to be completed, but it never will
because vhost-scsi never responds to that request.

To avoid a hang, always respond to control queue operations; explicitly
reject TMF requests, and return a no-op response to event requests.

NOTE: copy_from_iter_full() is not available so use copy_from_iter().

Orabug: 28775573

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

scsi: lpfc: devloss timeout race condition caused null pointer reference

Orabug: 27994179

A race condition between the context of devloss timeout handler and I/O
completion caused devloss timeout handler de-referencing pointer that had
been released.

Added the check in lpfc_sli_validate_fcp_iocb() on LPFC_IO_ON_TXCMPLQ to
capture the race condition of I/O completion and devloss timeout handler
attemption for aborting the I/O. Also, added check on lpfc_cmd->rdata
pointer before de-referenceing lpfc_cmd->rdata->pnode.

Also, added protection in lpfc_sli_abort_iocb() routine on driver performed
FCP I/O FLUSHING already under way before proceeding to aborting I/Os.

Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit b0e830125b669570d8096b8ba22eb00f659fc05e)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
Reviewed-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

scsi: qla2xxx: Fix race condition between iocb timeout and initialisation

qla2x00_init_timer() calls add_timer() on the iocb timeout timer, which
means the timeout function pointer and any data that the function depends on
must be initialised beforehand.

Move this initialisation before each call to qla2x00_init_timer(). In some
cases qla2x00_init_timer() initialises a completion structure needed by the
timeout function, so move the call to add_timer() after that.

Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Acked-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit e74e7d95878d7993cf56c801d55d78f16ea58d1d)

Orabug: 28013813

Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Chuck Anderson <chuck.anderson@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/scsi/qla2xxx/qla_gs.c
drivers/scsi/qla2xxx/qla_init.c
drivers/scsi/qla2xxx/qla_inline.h
drivers/scsi/qla2xxx/qla_iocb.c
drivers/scsi/qla2xxx/qla_mid.c
drivers/scsi/qla2xxx/qla_mr.c

Signed-off-by: Brian Maly <brian.maly@oracle.com>

i40e: Add programming descriptors to cleaned_count

This patch updates the i40e driver to include programming descriptors in
the cleaned_count. Without this change it becomes possible for us to leak
memory as we don't trigger a large enough allocation when the time comes to
allocate new buffers and we end up overwriting a number of rx_buffers equal
to the number of programming descriptors we encountered.

Fixes: 0e626ff7ccbf ("i40e: Fix support for flow director programming status")
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Anders K. Pedersen <akp@cohaesio.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
(cherry picked from commit 62b4c6694dfd3821bd5ea5bed48238bbabd5fe8b)

Orabug: 28228724

Signed-off-by: Honglei Wang <honglei.wang@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

i40e: Fix memory leak related filter programming status

It looks like we weren't correctly placing the pages from buffers that had
been used to return a filter programming status back on the ring. As a
result they were being overwritten and tracking of the pages was lost.

This change works to correct that by incorporating part of
i40e_put_rx_buffer into the programming status handler code. As a result we
should now be correctly placing the pages for those buffers on the
re-allocation list instead of letting them stay in place.

Fixes: 0e626ff7ccbf ("i40e: Fix support for flow director programming status")
Reported-by: Anders K. Pedersen <akp@cohaesio.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Anders K Pedersen <akp@cohaesio.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
(cherry picked from commit 2b9478ffc550f17c6cd8c69057234e91150f5972)

Orabug: 28228724

Signed-off-by: Honglei Wang <honglei.wang@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/net/ethernet/intel/i40e/i40e_txrx.c

Signed-off-by: Brian Maly <brian.maly@oracle.com>

xen-swiotlb: use actually allocated size on check physical continuous

xen_swiotlb_{alloc,free}_coherent() allocate/free memory based on the
order of the pages and not size argument (bytes). This is inconsistent with
range_straddles_page_boundary and memset which use the 'size' value,
which may lead to not exchanging memory with Xen (range_straddles_page_boundary()
returned true). And then the call to xen_swiotlb_free_coherent() would
actually try to exchange the memory with Xen, leading to the kernel
hitting an BUG (as the hypercall returned an error).

This patch fixes it by making the 'size' variable be of the same size
as the amount of memory allocated.

CC: stable@vger.kernel.org
Signed-off-by: Joe Jin <joe.jin@oracle.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Christoph Helwig <hch@lst.de>
Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: John Sobecki <john.sobecki@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Orabug: 28258102

(cherry picked from commit 7250f422da0480d8512b756640f131b9b893ccda)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/xen/swiotlb-xen.c

Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

Revert "Revert "xen-swiotlb: fix the check condition for xen_swiotlb_free_coherent""

This reverts commit 4dbc2ddc8d51dd616b95d868b0223d102428b995.

The root cause of panic in commit 7fc30809bfa8 ("xen-swiotlb: fix the check
condition for xen_swiotlb_free_coherent") is identified. Enable this patch
again as the fix is already available.

The Reviewed-by by for the revert is only to sync the uek4 code with upstream.
It is not clear at the moment whether upstream code is correct.

Orabug: 28258102

Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

net/mlx4_en: fix potential use-after-free with dma_unmap_page

[ Not relevant upstream, therefore no upstream commit. ]

To fix, unmap the page as soon as possible.

When swiotlb is in use, calling dma_unmap_page means that
the original page mapped with dma_map_page must still be valid,
as swiotlb will copy data from its internal cache back to the
originally requested DMA location.

When GRO is enabled, before this patch all references to the
original frag may be put and the page freed before dma_unmap_page
in mlx4_en_free_frag is called.

It is possible there is a path where the use-after-free occurs
even with GRO disabled, but this has not been observed so far.

The bug can be trivially detected by doing the following:

* Compile the kernel with DEBUG_PAGEALLOC
* Run the kernel as a Xen Dom0
* Leave GRO enabled on the interface
* Run a 10 second or more test with iperf over the interface.

This bug was likely introduced in
commit 4cce66cdd14a ("mlx4_en: map entire pages to increase throughput"),
first part of u3.6.

It was incidentally fixed in
commit 34db548bfb95 ("mlx4: add page recycling in receive path"),
first part of v4.12.

This version applies to the v4.9 series.

Signed-off-by: Sarah Newman <srn@prgmr.com>
Tested-by: Sarah Newman <srn@prgmr.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
Cc: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 5d70bd5c98d0e655bde2aae2b5251bdd44df5e71)

Orabug: 28376051

Signed-off-by: Jacob Wen <jian.w.wen@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/net/ethernet/mellanox/mlx4/en_rx.c
[ Lack of frag_info->dma_dir ]

Signed-off-by: Brian Maly <brian.maly@oracle.com>

ocfs2: fix ocfs2 read block panic

Orabug: 28580543

While reading block, it is possible that io error return due to underlying
storage issue, in this case, BH_NeedsValidate was left in the buffer head.
Then when reading the very block next time, if it was already linked into
journal, that will trigger the following panic.

[203748.702517] kernel BUG at fs/ocfs2/buffer_head_io.c:342!
[203748.702533] invalid opcode: 0000 [#1] SMP
[203748.702561] Modules linked in: ocfs2 ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs sunrpc dm_switch dm_queue_length dm_multipath bonding be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i iw_cxgb4 cxgb4 cxgb3i libcxgbi iw_cxgb3 cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ipmi_devintf iTCO_wdt iTCO_vendor_support dcdbas ipmi_ssif i2c_core ipmi_si ipmi_msghandler acpi_pad pcspkr sb_edac edac_core lpc_ich mfd_core shpchp sg tg3 ptp pps_core ext4 jbd2 mbcache2 sr_mod cdrom sd_mod ahci libahci megaraid_sas wmi dm_mirror dm_region_hash dm_log dm_mod
[203748.703024] CPU: 7 PID: 38369 Comm: touch Not tainted 4.1.12-124.18.6.el6uek.x86_64 #2
[203748.703045] Hardware name: Dell Inc. PowerEdge R620/0PXXHP, BIOS 2.5.2 01/28/2015
[203748.703067] task: ffff880768139c00 ti: ffff88006ff48000 task.ti: ffff88006ff48000
[203748.703088] RIP: 0010:[<ffffffffa05e9f09>]  [<ffffffffa05e9f09>] ocfs2_read_blocks+0x669/0x7f0 [ocfs2]
[203748.703130] RSP: 0018:ffff88006ff4b818  EFLAGS: 00010206
[203748.703389] RAX: 0000000008620029 RBX: ffff88006ff4b910 RCX: 0000000000000000
[203748.703885] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 00000000023079fe
[203748.704382] RBP: ffff88006ff4b8d8 R08: 0000000000000000 R09: ffff8807578c25b0
[203748.704877] R10: 000000000f637376 R11: 000000003030322e R12: 0000000000000000
[203748.705373] R13: ffff88006ff4b910 R14: ffff880732fe38f0 R15: 0000000000000000
[203748.705871] FS:  00007f401992c700(0000) GS:ffff880bfebc0000(0000) knlGS:0000000000000000
[203748.706370] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[203748.706627] CR2: 00007f4019252440 CR3: 00000000a621e000 CR4: 0000000000060670
[203748.707124] Stack:
[203748.707371]  ffff88006ff4b828 ffffffffa0609f52 ffff88006ff4b838 0000000000000001
[203748.707885]  0000000000000000 0000000000000000 ffff880bf67c3800 ffffffffa05eca00
[203748.708399]  00000000023079ff ffffffff81c58b80 0000000000000000 0000000000000000
[203748.708915] Call Trace:
[203748.709175]  [<ffffffffa0609f52>] ? ocfs2_inode_cache_io_unlock+0x12/0x20 [ocfs2]
[203748.709680]  [<ffffffffa05eca00>] ? ocfs2_empty_dir_filldir+0x80/0x80 [ocfs2]
[203748.710185]  [<ffffffffa05ec0cb>] ocfs2_read_dir_block_direct+0x3b/0x200 [ocfs2]
[203748.710691]  [<ffffffffa05f0fbf>] ocfs2_prepare_dx_dir_for_insert.isra.57+0x19f/0xf60 [ocfs2]
[203748.711204]  [<ffffffffa065660f>] ? ocfs2_metadata_cache_io_unlock+0x1f/0x30 [ocfs2]
[203748.711716]  [<ffffffffa05f4f3a>] ocfs2_prepare_dir_for_insert+0x13a/0x890 [ocfs2]
[203748.712227]  [<ffffffffa05f442e>] ? ocfs2_check_dir_for_entry+0x8e/0x140 [ocfs2]
[203748.712737]  [<ffffffffa061b2f2>] ocfs2_mknod+0x4b2/0x1370 [ocfs2]
[203748.713003]  [<ffffffffa061c385>] ocfs2_create+0x65/0x170 [ocfs2]
[203748.713263]  [<ffffffff8121714b>] vfs_create+0xdb/0x150
[203748.713518]  [<ffffffff8121b225>] do_last+0x815/0x1210
[203748.713772]  [<ffffffff812192e9>] ? path_init+0xb9/0x450
[203748.714123]  [<ffffffff8121bca0>] path_openat+0x80/0x600
[203748.714378]  [<ffffffff811bcd45>] ? handle_pte_fault+0xd15/0x1620
[203748.714634]  [<ffffffff8121d7ba>] do_filp_open+0x3a/0xb0
[203748.714888]  [<ffffffff8122a767>] ? __alloc_fd+0xa7/0x130
[203748.715143]  [<ffffffff81209ffc>] do_sys_open+0x12c/0x220
[203748.715403]  [<ffffffff81026ddb>] ? syscall_trace_enter_phase1+0x11b/0x180
[203748.715668]  [<ffffffff816f0c9f>] ? system_call_after_swapgs+0xe9/0x190
[203748.715928]  [<ffffffff8120a10e>] SyS_open+0x1e/0x20
[203748.716184]  [<ffffffff816f0d5e>] system_call_fastpath+0x18/0xd7
[203748.716440] Code: 00 00 48 8b 7b 08 48 83 c3 10 45 89 f8 44 89 e1 44 89 f2 4c 89 ee e8 07 06 11 e1 48 8b 03 48 85 c0 75 df 8b 5d c8 e9 4d fa ff ff <0f> 0b 48 8b 7d a0 e8 dc c6 06 00 48 b8 00 00 00 00 00 00 00 10
[203748.717505] RIP  [<ffffffffa05e9f09>] ocfs2_read_blocks+0x669/0x7f0 [ocfs2]
[203748.717775]  RSP <ffff88006ff4b818>

Joesph ever reported a similar panic.
Link: https://oss.oracle.com/pipermail/ocfs2-devel/2013-May/008931.html
Link: http://lkml.kernel.org/r/20180912063207.29484-1-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Changwei Ge <ge.changwei@h3c.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 234b69e3e089d850a98e7b3145bd00e9b52b1111)
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

block: fix bdi vs gendisk lifetime mismatch

Orabug: 28645416

The name for a bdi of a gendisk is derived from the gendisk's devt.
However, since the gendisk is destroyed before the bdi it leaves a
window where a new gendisk could dynamically reuse the same devt
while a bdi with the same name is still live. Arrange for the bdi to hold a
reference against its "owner" disk device while it is registered.
Otherwise we can hit sysfs duplicate name collisions like the
following:

WARNING: CPU: 10 PID: 2078 at fs/sysfs/dir.c:31
sysfs_warn_dup+0x64/0x80
sysfs: cannot create duplicate filename
'/devices/virtual/bdi/259:1'

Hardware name: HP ProLiant DL580 Gen8, BIOS P79 05/06/2015
0000000000000286 0000000002c04ad5 ffff88006f24f970 ffffffff8134caec
ffff88006f24f9c0 0000000000000000 ffff88006f24f9b0 ffffffff8108c351
0000001f0000000c ffff88105d236000 ffff88105d1031e0 ffff8800357427f8
Call Trace:
[<ffffffff8134caec>] dump_stack+0x63/0x87
[<ffffffff8108c351>] __warn+0xd1/0xf0
[<ffffffff8108c3cf>] warn_slowpath_fmt+0x5f/0x80
[<ffffffff812a0d34>] sysfs_warn_dup+0x64/0x80
[<ffffffff812a0e1e>] sysfs_create_dir_ns+0x7e/0x90
[<ffffffff8134faaa>] kobject_add_internal+0xaa/0x320
[<ffffffff81358d4e>] ? vsnprintf+0x34e/0x4d0
[<ffffffff8134ff55>] kobject_add+0x75/0xd0
[<ffffffff816e66b2>] ? mutex_lock+0x12/0x2f
[<ffffffff8148b0a5>] device_add+0x125/0x610
[<ffffffff8148b788>] device_create_groups_vargs+0xd8/0x100
[<ffffffff8148b7cc>] device_create_vargs+0x1c/0x20
[<ffffffff811b775c>] bdi_register+0x8c/0x180
[<ffffffff811b7877>] bdi_register_dev+0x27/0x30
[<ffffffff813317f5>] add_disk+0x175/0x4a0

Cc: <stable@vger.kernel.org>
Reported-by: Yi Zhang <yizhan@redhat.com>
Tested-by: Yi Zhang <yizhan@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Fixed up missing 0 return in bdi_register_owner().

Signed-off-by: Jens Axboe <axboe@fb.com>
(cherrypicked from commit df08c32ce3be5be138c1dbfcba203314a3a7cd6f)
Conflicts: mm/backing-dev.c
include/linux/backing-dev.h

The patch breaks KABI as-is because of the introduction of a new "owner"
field in struct backing_dev_info. To work around that, use the
UEK_KABI_EXTEND() macro to wrap the "owner" filed with ifdef GENKSYMS.

Signed-off-by: Ashish Samant <ashish.samant@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

e1000e: Fix link check race condition

Orabug: 28716958

Alex reported the following race condition:

/* link goes up... interrupt... schedule watchdog */
\ e1000_watchdog_task
\ e1000e_has_link
\ hw->mac.ops.check_for_link() === e1000e_check_for_copper_link
\ e1000e_phy_has_link_generic(..., &link)
link = true

/* link goes down... interrupt */
\ e1000_msix_other
hw->mac.get_link_status = true

/* link is up */
mac->get_link_status = false

link_active = true
/* link_active is true, wrongly, and stays so because
* get_link_status is false */

Avoid this problem by making sure that we don't set get_link_status = false
after having checked the link.

It seems this problem has been present since the introduction of e1000e.

Link: https://lkml.org/lkml/2018/1/29/338
Reported-by: Alexander Duyck <alexander.duyck@gmail.com>
Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
(cherry picked from commit e2710dbf0dc1e37d85368e2404049dadda848d5a)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

Revert "e1000e: Separate signaling for link check/link up"

Orabug: 28716958

This reverts commit 19110cfbb34d4af0cdfe14cd243f3b09dc95b013.
This reverts commit 4110e02eb45ea447ec6f5459c9934de0a273fb91.
This reverts commit d3604515c9eda464a92e8e67aae82dfe07fe3c98.

Commit 19110cfbb34d ("e1000e: Separate signaling for link check/link up")
changed what happens to the link status when there is an error which
happens after "get_link_status = false" in the copper check_for_link
callbacks. Previously, such an error would be ignored and the link
considered up. After that commit, any error implies that the link is down.

Revert commit 19110cfbb34d ("e1000e: Separate signaling for link check/link
up") and its followups. After reverting, the race condition described in
the log of commit 19110cfbb34d is reintroduced. It may still be triggered
by LSC events but this should keep the link down in case the link is
electrically unstable, as discussed. The race may no longer be
triggered by RXO events because commit 4aea7a5c5e94 ("e1000e: Avoid
receiver overrun interrupt bursts") restored reading icr in the Other
handler.

Link: https://lkml.org/lkml/2018/3/1/789
Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
(cherry picked from commit 3016e0a0c91246e55418825ba9aae271be267522)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

e1000e: Avoid missed interrupts following ICR read

Orabug: 28716958

The 82574 specification update errata 12 states that interrupts may be
missed if ICR is read while INT_ASSERTED is not set. Avoid that problem by
setting all bits related to events that can trigger the Other interrupt in
IMS.

The Other interrupt is raised for such events regardless of whether or not
they are set in IMS. However, only when they are set is the INT_ASSERTED
bit also set in ICR.

By doing this, we ensure that INT_ASSERTED is always set when we read ICR
in e1000_msix_other() and steer clear of the errata. This also ensures that
ICR will automatically be cleared on read, therefore we no longer need to
clear bits explicitly.

Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
(cherry picked from commit 116f4a640b3197401bc93b8adc6c35040308ceff)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

e1000e: Fix queue interrupt re-raising in Other interrupt

Orabug: 28716958

Restores the ICS write for Rx/Tx queue interrupts which was present before
commit 16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt", v4.5-rc1)
but was not restored in commit 4aea7a5c5e94
("e1000e: Avoid receiver overrun interrupt bursts", v4.15-rc1).

This re-raises the queue interrupts in case the txq or rxq bits were set in
ICR and the Other interrupt handler read and cleared ICR before the queue
interrupt was raised.

Fixes: 4aea7a5c5e94 ("e1000e: Avoid receiver overrun interrupt bursts")
Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
(cherry picked from commit 361a954e6a7215de11a6179ad9bdc07d7e394b04)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

Partial revert "e1000e: Avoid receiver overrun interrupt bursts"

Orabug: 28716958

This partially reverts commit 4aea7a5c5e940c1723add439f4088844cd26196d.

We keep the fix for the first part of the problem (1) described in the log
of that commit, that is to read ICR in the other interrupt handler. We
remove the fix for the second part of the problem (2), Other interrupt
throttling.

Bursts of "Other" interrupts may once again occur during rxo (receive
overflow) traffic conditions. This is deemed acceptable in the interest of
avoiding unforeseen fallout from changes that are not strictly necessary.
As discussed, the e1000e driver should be in "maintenance mode".

Link: https://www.spinics.net/lists/netdev/msg480675.html
Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
(cherry picked from commit 1f0ea19722ef9dfa229a9540f70b8d1c34a98a6a)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

e1000e: Remove Other from EIAC

Orabug: 28716958

It was reported that emulated e1000e devices in vmware esxi 6.5 Build
7526125 do not link up after commit 4aea7a5c5e94 ("e1000e: Avoid receiver
overrun interrupt bursts", v4.15-rc1). Some tracing shows that after
e1000e_trigger_lsc() is called, ICR reads out as 0x0 in e1000_msix_other()
on emulated e1000e devices. In comparison, on real e1000e 82574 hardware,
icr=0x80000004 (_INT_ASSERTED | _LSC) in the same situation.

Some experimentation showed that this flaw in vmware e1000e emulation can
be worked around by not setting Other in EIAC. This is how it was before
16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt", v4.5-rc1).

Fixes: 4aea7a5c5e94 ("e1000e: Avoid receiver overrun interrupt bursts")
Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
(cherry picked from commit 745d0bd3af99ccc8c5f5822f808cd133eadad6ac)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

Fix error code in nfs_lookup_verify_inode()

Return -ESTALE to force a lookup when the file has no more links

Signed-off-by: Lance Shelton <lance.shelton@hammerspace.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Orabug: 28789030

(cherry picked from commit a61246c96195fc5f7500f6842e883b9eb1567d8d)
Signed-off-by: Calum Mackay <calum.mackay@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Tested-by: alfredo.ramirez@oracle.com
Signed-off-by: Brian Maly <brian.maly@oracle.com>

workqueue: Allow modifying low level unbound workqueue cpumask

Allow to modify the low-level unbound workqueues cpumask through
sysfs. This is performed by traversing the entire workqueue list
and calling apply_wqattrs_prepare() on the unbound workqueues
with the new low level mask. Only after all the preparation are done,
we commit them all together.

Ordered workqueues are ignored from the low level unbound workqueue
cpumask, it will be handled in near future.

All the (default & per-node) pwqs are mandatorily controlled by
the low level cpumask. If the user configured cpumask doesn't overlap
with the low level cpumask, the low level cpumask will be used for the
wq instead.

The comment of wq_calc_node_cpumask() is updated and explicitly
requires that its first argument should be the attrs of the default
pwq.

The default wq_unbound_cpumask is cpu_possible_mask.  The workqueue
subsystem doesn't know its best default value, let the system manager
or the other subsystem set it when needed.

Changed from V8:
  merge the calculating code for the attrs of the default pwq together.
  minor change the code&comments for saving the user configured attrs.
  remove unnecessary list_del().
  minor update the comment of wq_calc_node_cpumask().
  update the comment of workqueue_set_unbound_cpumask();

Cc: Christoph Lameter <cl@linux.com>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Mike Galbraith <bitbucket@online.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Original-patch-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
(cherry picked from commit 042f7df15a4fff8eec42873f755aea848dcdedd1)

Orabug: 28813166

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

workqueue: Create low-level unbound workqueues cpumask

Create a cpumask that limits the affinity of all unbound workqueues.
This cpumask is controlled through a file at the root of the workqueue
sysfs directory.

It works on a lower-level than the per WQ_SYSFS workqueues cpumask files
such that the effective cpumask applied for a given unbound workqueue is
the intersection of /sys/devices/virtual/workqueue/$WORKQUEUE/cpumask and
the new /sys/devices/virtual/workqueue/cpumask file.

This patch implements the basic infrastructure and the read interface.
wq_unbound_cpumask is initially set to cpu_possible_mask.

Cc: Christoph Lameter <cl@linux.com>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Mike Galbraith <bitbucket@online.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
(cherry picked from commit b05a79280b346eb24ddb73b39988398015291075)

Orabug: 28813166

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

scsi: sg: mitigate read/write abuse

As Al Viro noted in commit 128394eff343 ("sg_write()/bsg_write() is not fit
to be called under KERNEL_DS"), sg improperly accesses userspace memory
outside the provided buffer, permitting kernel memory corruption via
splice(). But it doesn't just do it on ->write(), also on ->read().

As a band-aid, make sure that the ->read() and ->write() handlers can not
be called in weird contexts (kernel context or credentials different from
file opener), like for ib_safe_file_access().

If someone needs to use these interfaces from different security contexts,
a new interface should be written that goes through the ->ioctl() handler.

I've mostly copypasted ib_safe_file_access() over as sg_safe_file_access()
because I couldn't find a good common header - please tell me if you know a
better way.

[mkp: s/_safe_/_check_/]

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: <stable@vger.kernel.org>
Signed-off-by: Jann Horn <jannh@google.com>
Acked-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Orabug: 28824718
CVE: CVE-2017-13168

Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
(cherry picked from commit 26b5b874aff5659a7e26e5b1997e3df2c41fa7fd)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/scsi/sg.c

Signed-off-by: Brian Maly <brian.maly@oracle.com>

Revert "rds: RDS (tcp) hangs on sendto() to unresponding address"

Orabug: 28837953

This reverts commit 4d8376fac652927fcc94ca88db134ce7822c03c1.

Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/speculation: Retpoline should always be available on Skylake

Now that we can dynamically toggle retpoline on or off, retpoline
should always be available even on Skylake platforms, without
having to explicitly select retpoline at boot time.

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Jamie Iles <jamie.iles@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
(cherry picked from UEK5 commit 7bbc586ac477df24012552712df10ab9ae300919)

Orabug: 28801831

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Jamie Iles <jamie.iles@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/speculation: Add sysfs entry to enable/disable retpoline

Add /sys/kernel/debug/x86/retpoline_enabled to enable/disable retpoline.
Enabling retpoline will also enable IBRS for the firmware.

Note that IBRS and retpoline can't be enabled together. Enabling retpoline
while IBRS is already enabled will automatically disable IBRS. Similarly,
enabling IBRS while retpoline is already enabled will automatically disable
retpoline.

On Skylake, retpoline is not provided and can't be enabled unless the system
has been explicitly booted with retpoline (using spectre_v2=retpoline or
spectre_v2_heuristics=skylake=off).

Also fix the behavior when retpoline is not available (!CONFIG_RETPOLINE):
now we will try using IBRS (if it is available) instead of not using any
mitigation.

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
(cherry picked from UEK5 commit d75554157882d9b4df91f0b2bbc4907e2731781e)

[Backport: a large part of d75554157882d9b4df91f0b2bbc4907e2731781e
was already ported in previous commit ("x86/speculation: switch to IBRS
when loading a non-retpoline module"). This ports the remaining part
which effectively adds the retpoline_enabled sysfs entry.

Also we issue a warning when enabling retpoline and a non-retpoline
module is loaded.]

Orabug: 28607548

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/speculation: Switch to IBRS when loading a non-retpoline module

When retpoline is used and a non-retpoline module gets loaded, IBRS
is enabled in addition of retpoline. Now that we can dynamically
disable retpoline, do it and use IBRS only. That way, IBRS and
retpoline will never be enabled together.

Changes are located in kernel/module.c. Additional changes are from
parts of UEK5 commit d75554157882d9b4df91f0b2bbc4907e2731781e
("x86/speculation: Add sysfs entry to enable/disable retpoline")
to provide the mechanism for switching between retpoline and IBRS.

Orabug: 28607548

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/speculation: Remove unnecessary retpoline alternatives

Now that the X86_FEATURE_RETPOLINE is always set, some assembly
alternatives can be simplied or even removed. Also add early
check of retpoline_enabled_key in CALL_NOSPEC to avoid an extra
call into thunk when retpoline is disabled.

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
(cherry picked from UEK5 commit eb88d822befdc73952ae7c00cfcbce9ff5aad574)

Orabug: 28607548

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/speculation: Use static key to enable/disable retpoline

Change the way retpoline is enabled.

Up to now, the retpoline feature was enabled using alternatives
depending whether the X86_FEATURE_RETPOLINE cpu capability was set
or not. The usage of alternatives prevented enabling or disabling
after the boot sequence.

Now, retpoline is enabled using jump labels controlled by the
retpoline_enabled_key static key. This enables retpoline to be
selectively enabled or disabled even after the boot sequence.

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
(cherry picked from UEK5 commit 3d01785866e2a73119ed4e35d5cb6b02edf5957a)

Orabug: 28607548

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

locking/static_keys: Provide DECLARE and well as DEFINE macros

We will need to provide declarations of static keys in header
files. Provide DECLARE_STATIC_KEY_{TRUE,FALSE} macros.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Acked-by: Borislav Petkov <bp@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/816881cf85bd3cf13385d212882618f38a3b5d33.1472754711.git.tony.luck@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
(cherry picked from commit b8fb03785d4de097507d0cf45873525e0ac4d2b2)

Orabug: 28607548

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

jump_label: remove bug.h, atomic.h dependencies for HAVE_JUMP_LABEL

The current jump_label.h includes bug.h for things such as WARN_ON().
This makes the header problematic for inclusion by kernel.h or any
headers that kernel.h includes, since bug.h includes kernel.h (circular
dependency). The inclusion of atomic.h is similarly problematic. Thus,
this should make jump_label.h 'includable' from most places.

Link: http://lkml.kernel.org/r/7060ce35ddd0d20b33bf170685e6b0fab816bdf2.1467837322.git.jbaron@akamai.com
Signed-off-by: Jason Baron <jbaron@akamai.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Joe Perches <joe@perches.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 1f69bf9c6137602cd028c96b4f8329121ec89231)

Orabug: 28607548

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

locking/static_key: Fix concurrent static_key_slow_inc()

The following scenario is possible:

    CPU 1                                   CPU 2
    static_key_slow_inc()
     atomic_inc_not_zero()
      -> key.enabled == 0, no increment
     jump_label_lock()
     atomic_inc_return()
      -> key.enabled == 1 now
                                            static_key_slow_inc()
                                             atomic_inc_not_zero()
                                              -> key.enabled == 1, inc to 2
                                             return
                                            ** static key is wrong!
     jump_label_update()
     jump_label_unlock()

Testing the static key at the point marked by (**) will follow the
wrong path for jumps that have not been patched yet.  This can
actually happen when creating many KVM virtual machines with userspace
LAPIC emulation; just run several copies of the following program:

    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    int main(void)
    {
        for (;;) {
            int kvmfd = open("/dev/kvm", O_RDONLY);
            int vmfd = ioctl(kvmfd, KVM_CREATE_VM, 0);
            close(ioctl(vmfd, KVM_CREATE_VCPU, 1));
            close(vmfd);
            close(kvmfd);
        }
        return 0;
    }

Every KVM_CREATE_VCPU ioctl will attempt a static_key_slow_inc() call.
The static key's purpose is to skip NULL pointer checks and indeed one
of the processes eventually dereferences NULL.

As explained in the commit that introduced the bug:

  706249c222f6 ("locking/static_keys: Rework update logic")

jump_label_update() needs key.enabled to be true.  The solution adopted
here is to temporarily make key.enabled == -1, and use go down the
slow path when key.enabled <= 0.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org> # v4.3+
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 706249c222f6 ("locking/static_keys: Rework update logic")
Link: http://lkml.kernel.org/r/1466527937-69798-1-git-send-email-pbonzini@redhat.com
[ Small stylistic edits to the changelog and the code. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from 4c5ea0a9cd02d6aa8adc86e100b2a4cff8d614ff)

Orabug: 28607548

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

jump_label: make static_key_enabled() work on static_key_true/false types too

static_key_enabled() can be used on struct static_key but not on its
wrapper types static_key_true and static_key_false. The function is
useful for debugging and management of static keys. Update it so that
it can be used for the wrapper types too.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from fa128fd735bd236b6b04d3fedfed7a784137c185)

Orabug: 28607548

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

locking/static_keys: Fix up the static keys documentation

Fix a few small mistakes in the static key documentation and
delete an unneeded sentence.

Suggested-by: Jason Baron <jbaron@akamai.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20150914171105.511e1e21@lwn.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 1975dbc276c6ab62230cf4f9df5ddc9ff0e0e473)

Orabug: 28607548

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

locking/static_keys: Fix a silly typo

Commit:

412758cb2670 ("jump label, locking/static_keys: Update docs")

introduced a typo that might as well get fixed.

Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20150907131803.54c027e1@lwn.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from edcd591c77a48da753456f92daf8bb50fe9bac93)

Orabug: 28607548

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

jump label, locking/static_keys: Update docs

Signed-off-by: Jason Baron <jbaron@akamai.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: benh@kernel.crashing.org
Cc: bp@alien8.de
Cc: davem@davemloft.net
Cc: ddaney@caviumnetworks.com
Cc: heiko.carstens@de.ibm.com
Cc: linux-kernel@vger.kernel.org
Cc: liuj97@gmail.com
Cc: luto@amacapital.net
Cc: michael@ellerman.id.au
Cc: rabin@rab.in
Cc: ralf@linux-mips.org
Cc: rostedt@goodmis.org
Cc: vbabka@suse.cz
Cc: will.deacon@arm.com
Link: http://lkml.kernel.org/r/6b50f2f6423a2244f37f4b1d2d6c211b9dcdf4f8.1438227999.git.jbaron@akamai.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 412758cb26704e5087ca2976ec3b28fb2bdbfad4)

Orabug: 28607548

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/asm: Add asm macros for static keys/jump labels

Unfortunately, we can only do this if HAVE_JUMP_LABEL. In
principle, we could do some serious surgery on the core jump
label infrastructure to keep the patch infrastructure available
on x86 on all builds, but that's probably not worth it.

Implementing the macros using a conditional branch as a fallback
seems like a bad idea: we'd have to clobber flags.

This limitation can't cause silent failures -- trying to include
asm/jump_label.h at all on a non-HAVE_JUMP_LABEL kernel will
error out. The macro's users are responsible for handling this
issue themselves.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/63aa45c4b692e8469e1876d6ccbb5da707972990.1447361906.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 2671c3e4fe2a34bd9bf2eecdf5d1149d4b55dbdf)

Orabug: 28607548

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/asm: Error out if asm/jump_label.h is included inappropriately

Rather than potentially generating incorrect code on a
non-HAVE_JUMP_LABEL kernel if someone includes asm/jump_label.h,
error out.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/99407f0ac7fa3ab03a3d31ce076d47b5c2f44795.1447361906.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit c28454332fe0b65e22c3a2717e5bf05b5b47ca20)

Orabug: 28607548

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

jump_label/x86: Work around asm build bug on older/backported GCCs

Boris reported that gcc version 4.4.4 20100503 (Red Hat
4.4.4-2) fails to build linux-next kernels that have
this fresh commit via the locking tree:

11276d5306b8 ("locking/static_keys: Add a new static_key interface")

The problem appears to be that even though @key and @branch are
compile time constants, it doesn't see the following expression
as an immediate value:

&((char *)key)[branch]

More recent GCCs don't appear to have this problem.

In particular, Red Hat backported the 'asm goto' feature into 4.4,
'normal' 4.4 compilers will not have this feature and thus not
run into this asm.

The workaround is to supply both values to the asm as immediates
and do the addition in asm.

Suggested-by: H. Peter Anvin <hpa@zytor.com>
Reported-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from d420acd816c07c7be31bd19d09cbcb16e5572fa6)

Orabug: 28607548

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

locking/static_keys: Add a new static_key interface

There are various problems and short-comings with the current
static_key interface:

- static_key_{true,false}() read like a branch depending on the key
   value, instead of the actual likely/unlikely branch depending on
   init value.

- static_key_{true,false}() are, as stated above, tied to the
   static_key init values STATIC_KEY_INIT_{TRUE,FALSE}.

- we're limited to the 2 (out of 4) possible options that compile to
   a default NOP because that's what our arch_static_branch() assembly
   emits.

So provide a new static_key interface:

  DEFINE_STATIC_KEY_TRUE(name);
  DEFINE_STATIC_KEY_FALSE(name);

Which define a key of different types with an initial true/false
value.

Then allow:

   static_branch_likely()
   static_branch_unlikely()

to take a key of either type and emit the right instruction for the
case.

This means adding a second arch_static_branch_jump() assembly helper
which emits a JMP per default.

In order to determine the right instruction for the right state,
encode the branch type in the LSB of jump_entry::key.

This is the final step in removing the naming confusion that has led to
a stream of avoidable bugs such as:

  a833581e372a ("x86, perf: Fix static_key bug in load_mm_cr4()")

... but it also allows new static key combinations that will give us
performance enhancements in the subsequent patches.

Tested-by: Rabin Vincent <rabin@rab.in> # arm
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> # ppc
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> # s390
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 11276d5306b8e5b438a36bbff855fe792d7eaa61)

Orabug: 28607548

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

locking/static_keys: Rework update logic

Instead of spreading the branch_default logic all over the place,
concentrate it into the one jump_label_type() function.

This does mean we need to actually increment/decrement the enabled
count _before_ calling the update path, otherwise jump_label_type()
will not see the right state.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from 706249c222f68471b6f8e9e8e9b77665c404b226)

Orabug: 28607548

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

jump_label: Add jump_entry_key() helper

Avoid some casting with a helper, also prepares the way for
overloading the LSB of jump_entry::key.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 7dcfd915bae51571bcc339d8e3dda027287880e5)

Orabug: 28607548

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

jump_label, locking/static_keys: Rename JUMP_LABEL_TYPE_* and related helpers to the static_key* pattern

Rename the JUMP_LABEL_TYPE_* macros to be JUMP_TYPE_* and move the
inline helpers into kernel/jump_label.c, since that's the only place
they're ever used.

Also rename the helpers where it's all about static keys.

This is the second step in removing the naming confusion that has led to
a stream of avoidable bugs such as:

a833581e372a ("x86, perf: Fix static_key bug in load_mm_cr4()")

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from a1efb01feca597b2abbc89873b40ef8ec6690168)

Orabug: 28607548

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

jump_label: Rename JUMP_LABEL_{EN,DIS}ABLE to JUMP_LABEL_{JMP,NOP}

Since we've already stepped away from ENABLE is a JMP and DISABLE is a
NOP with the branch_default bits, and are going to make it even worse,
rename it to make it all clearer.

This way we don't mix multiple levels of logic attributes, but have a
plain 'physical' name for what the current instruction patching status
of a jump label is.

This is a first step in removing the naming confusion that has led to
a stream of avoidable bugs such as:

a833581e372a ("x86, perf: Fix static_key bug in load_mm_cr4()")

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
[ Beefed up the changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 76b235c6bcb16062d663e2ee96db0b69f2e6bc14)

Orabug: 28607548

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

module, jump_label: Fix module locking

As per the module core lockdep annotations in the coming patch:

[   18.034047] ---[ end trace 9294429076a9c673 ]---
[   18.047760] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.02.0002.122320131210 12/23/2013
[   18.059228]  ffffffff817d8676 ffff880036683c38 ffffffff8157e98b 0000000000000001
[   18.067541]  0000000000000000 ffff880036683c78 ffffffff8105fbc7 ffff880036683c68
[   18.075851]  ffffffffa0046b08 0000000000000000 ffffffffa0046d00 ffffffffa0046cc8
[   18.084173] Call Trace:
[   18.086906]  [<ffffffff8157e98b>] dump_stack+0x4f/0x7b
[   18.092649]  [<ffffffff8105fbc7>] warn_slowpath_common+0x97/0xe0
[   18.099361]  [<ffffffff8105fc2a>] warn_slowpath_null+0x1a/0x20
[   18.105880]  [<ffffffff810ee502>] __module_address+0x1d2/0x1e0
[   18.112400]  [<ffffffff81161153>] jump_label_module_notify+0x143/0x1e0
[   18.119710]  [<ffffffff810814bf>] notifier_call_chain+0x4f/0x70
[   18.126326]  [<ffffffff8108160e>] __blocking_notifier_call_chain+0x5e/0x90
[   18.134009]  [<ffffffff81081656>] blocking_notifier_call_chain+0x16/0x20
[   18.141490]  [<ffffffff810f0f00>] load_module+0x1b50/0x2660
[   18.147720]  [<ffffffff810f1ade>] SyS_init_module+0xce/0x100
[   18.154045]  [<ffffffff81587429>] system_call_fastpath+0x12/0x17
[   18.160748] ---[ end trace 9294429076a9c674 ]---

Jump labels is not doing it right; fix this.

Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jason Baron <jbaron@akamai.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(cherry picked from commit bed831f9a251968272dae10a83b512c7db256ef0)

Orabug: 28607548

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/speculation: Protect against userspace-userspace spectreRSB

The article "Spectre Returns! Speculation Attacks using the Return Stack
Buffer" [1] describes two new (sub-)variants of spectrev2-like attacks,
making use solely of the RSB contents even on CPUs that don't fallback to
BTB on RSB underflow (Skylake+).

Mitigate userspace-userspace attacks by always unconditionally filling RSB on
context switch when the generic spectrev2 mitigation has been enabled.

[1] https://arxiv.org/pdf/1807.07940.pdf

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/nycvar.YFH.7.76.1807261308190.997@cbobk.fhfr.pm
(cherry picked from commit fdf82a7856b32d905c39afc85e34364491e46346)

Orabug: 28631590
CVE: CVE-2018-15572

Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Reviewed-by: Mark Kanda <mark.kanda@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/kernel/cpu/bugs.c
(UEK4 has the relevant code in arch/x86/kernel/cpu/bugs_64.c.
Also, the upstream patch removes the function is_skylake_era(),
but this patch does not since it is still used in the UEK code)

Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/spectre_v2: Remove remaining references to lfence mitigation

The LFENCE mitigation alone was deemed to be insufficient and the
related code was mostly removed by commit cf3b86bc07619
(Revert "x86/spec_ctrl: Add 'nolfence' knob to disable fallback for
spectre_v2 mitigation")

This patch cleans up two additional references that still exist.

Orabug: 28631590
CVE: CVE-2018-15572

Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Reviewed-by: Mark Kanda <mark.kanda@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

Revert "md: allow a partially recovered device to be hot-added to an array."

This reverts commit 7eb418851f3278de67126ea0c427641ab4792c57.

This commit is poorly justified, I can find not discusison in email,
and it clearly causes a problem.

If a device which is being recovered fails and is subsequently
re-added to an array, there could easily have been changes to the
array *before* the point where the recovery was up to.  So the
recovery must start again from the beginning.

If a spare is being recovered and fails, then when it is re-added we
really should do a bitmap-based recovery up to the recovery-offset,
and then a full recovery from there.  Before this reversion, we only
did the "full recovery from there" which is not corect.  After this
reversion with will do a full recovery from the start, which is safer
but not ideal.

It will be left to a future patch to arrange the two different styles
of recovery.

Reported-and-tested-by: Nate Dailey <nate.dailey@stratus.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Cc: stable@vger.kernel.org (3.14+)
Fixes: 7eb418851f32 ("md: allow a partially recovered device to be hot-added to an array.")
(cherry pick from upstream commit d01552a76d71f9879af448e9142389ee9be6e95b)

Orabug: 28702623

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/bugs: ssbd_ibrs_selected called prematurely

spectre_v2_select_mitigation considers the SSB mitigation that's been
selected in its own decision; specifically, it calls ssbd_ibrs_selected
when deciding whether to use IBRS.  The problem is, the SSB mitigation
hasn't been chosen yet, so ssbd_ibrs_selected will always return the
same result, possibly the wrong one.

Address this by splitting SSB initialization into two phases.  The
first, ssb_select_mitigation, picks the SSB mitigation mode, getting far
enough for spectre_v2_select_mitigation's needs, and the second,
ssb_init, sets up the necessary SSB state based on the mode.  The second
phase is necessary because it relies on the outcome of
spectre_v2_select_mitigation.

Fixes: 648702f587df ("x86/bugs/IBRS: Disable SSB (RDS) if IBRS is selected for spectre_v2.")
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
(cherry picked from commit 86a51ae1ce42d9f76e09e3c87c20c3625c6a1a6c)

Orabug: 28788839

Signed-off-by: Anjali Kulkarni <anjali.k.kulkarni@oracle.com>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

net/mlx4_core: print firmware version during driver loading

When debugging firmware related issues, it's very helpful to have
firmware version info printed in the kernel log when the driver is
loaded. It's easier to match error/warning messages with different
FW versions in the log other than running a separate tool to get
the information during remote debugging.

Orabug: 28809377

Signed-off-by: Qing Huang <qing.huang@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
CC: Gerd Rausch <gerd.rausch@oracle.com>
CC: Aru Kolappan <aru.kolappan@oracle.com>
CC: Sujatha Srinivasa Gopalan <sujatha.tolstoy@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

mm: numa: Do not trap faults on shared data section pages.

Workloads consisting of a large number of processes running the same program
with a very large shared data segment may experience performance problems
when numa balancing attempts to migrate the shared cow pages. This manifests
itself with many processes or tasks in TASK_UNINTERRUPTIBLE state waiting
for the shared pages to be migrated.

The program listed below simulates the conditions with these results when
run with 288 processes on a 144 core/8 socket machine.

Average throughput Average throughput     Average throughput
with numa_balancing=0 with numa_balancing=1  with numa_balancing=1
      without the patch      with the patch
--------------------- ---------------------  ---------------------
2118782 2021534        2107979

Complex production environments show less variability and fewer poorly
performing outliers accompanied with a smaller number of processes waiting
on NUMA page migration with this patch applied. In some cases, %iowait drops
from 16%-26% to 0.

// SPDX-License-Identifier: GPL-2.0
/*
* Copyright (c) 2017 Oracle and/or its affiliates. All rights reserved.
*/

int a[1000000] = {13};

int  main(int argc, const char **argv)
{
int n = 0;
int i;
pid_t pid;
int stat;
int *count_array;
int cpu_count = 288;
long total = 0;

struct timeval t1, t2 = {(argc > 1 ? atoi(argv[1]) : 10), 0};

if (argc > 2)
cpu_count = atoi(argv[2]);

count_array = mmap(NULL, cpu_count * sizeof(int),
   (PROT_READ|PROT_WRITE),
   (MAP_SHARED|MAP_ANONYMOUS), 0, 0);

if (count_array == MAP_FAILED) {
perror("mmap:");
return 0;
}

for (i = 0; i < cpu_count; ++i) {
pid = fork();
if (pid <= 0)
break;
if ((i & 0xf) == 0)
usleep(2);
}

if (pid != 0) {
if (i == 0) {
perror("fork:");
return 0;
}

for (;;) {
pid = wait(&stat);
if (pid < 0)
break;
}

for (i = 0; i < cpu_count; ++i)
total += count_array[i];

printf("Total %ld\n", total);
munmap(count_array, cpu_count * sizeof(int));
return 0;
}

gettimeofday(&t1, 0);
timeradd(&t1, &t2, &t1);
while (timercmp(&t2, &t1, <)) {
int b = 0;
int j;

for (j = 0; j < 1000000; j++)
b += a[j];
gettimeofday(&t2, 0);
n++;
}
count_array[i] = n;
return 0;
}

This patch changes change_pte_range() to skip shared copy-on-write pages when
called from change_prot_numa().

NOTE: change_prot_numa() is nominally called from task_numa_work() and
queue_pages_test_walk(). task_numa_work() is the auto NUMA balancing path, and
queue_pages_test_walk() is part of explicit NUMA policy management. However,
queue_pages_test_walk() only calls change_prot_numa() when MPOL_MF_LAZY is
specified and currently that is not allowed, so change_prot_numa() is only
called from auto NUMA balancing.

In the case of explicit NUMA policy management, shared pages are not migrated
unless MPOL_MF_MOVE_ALL is specified, and MPOL_MF_MOVE_ALL depends on
CAP_SYS_NICE. Currently, there is no way to pass information about
MPOL_MF_MOVE_ALL to change_pte_range. This will have to be fixed if
MPOL_MF_LAZY is enabled and MPOL_MF_MOVE_ALL is to be honored in lazy
migration mode.

task_numa_work() skips the read-only VMAs of programs and shared libraries.

NOTE2: This patch was accepted upstream should appear in 4.15 or 4.16

V2:
- Combined patch and cover letter
- Added note about applicability of MPOL_MF_MOVE_ALL

Signed-off-by: Henry Willard <henry.willard@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Orabug: 28814880
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
mm/mprotect.c (Add #include "internal.h")

Backporting commit 04489748048c1bd12d2a14ffe46d7de5dcb408c5
("mm: numa: Do not trap faults on shared data section pages.") from UEK5

Similar test results are seen on UEK4 when running the above program
with and without this patch and with and without auto numa balancing enabled

(cherry picked from commit 04489748048c1bd12d2a14ffe46d7de5dcb408c5)
Signed-off-by: Gayatri Vasudevan <gayatri.vasudevan@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

hugetlbfs: dirty pages as they are added to pagecache

Some test systems were experiencing negative huge page reserve
counts and incorrect file block counts.  This was traced to
/proc/sys/vm/drop_caches removing clean pages from hugetlbfs
file pagecaches.  When non-hugetlbfs explicit code removes the
pages, the appropriate accounting is not performed.

This can be recreated as follows:
fallocate -l 2M /dev/hugepages/foo
echo 1 > /proc/sys/vm/drop_caches
fallocate -l 2M /dev/hugepages/foo
grep -i huge /proc/meminfo
   AnonHugePages:         0 kB
   ShmemHugePages:        0 kB
   HugePages_Total:    2048
   HugePages_Free:     2047
   HugePages_Rsvd:    18446744073709551615
   HugePages_Surp:        0
   Hugepagesize:       2048 kB
   Hugetlb:         4194304 kB
ls -lsh /dev/hugepages/foo
   4.0M -rw-r--r--. 1 root root 2.0M Oct 17 20:05 /dev/hugepages/foo

To address this issue, dirty pages as they are added to pagecache.
This can easily be reproduced with fallocate as shown above. Read
faulted pages will eventually end up being marked dirty.  But there
is a window where they are clean and could be impacted by code such
as drop_caches.  So, just dirty them all as they are added to the
pagecache.

Orabug: 28813968

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Larry Bassel <larry.bassel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

rds: RDS (tcp) hangs on sendto() to unresponding address

In rds_send_mprds_hash(), if the calculated hash value is non-zero and
the MPRDS connections are not yet up, it will wait. But it should not
wait if the send is non-blocking. In this case, it should just use the
base c_path for sending the message.

Orabug: 28762608

Signed-off-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com>
Reviewed-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Reviewed-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

nfs: fix a deadlock in nfs client initialization

The following deadlock can occur between a process waiting for a client
to initialize in while walking the client list during nfsv4 server trunking
detection and another process waiting for the nfs_clid_init_mutex so it
can initialize that client:

Process 1                               Process 2
---------                               ---------
spin_lock(&nn->nfs_client_lock);
list_add_tail(&CLIENTA->cl_share_link,
        &nn->nfs_client_list);
spin_unlock(&nn->nfs_client_lock);
                                        spin_lock(&nn->nfs_client_lock);
                                        list_add_tail(&CLIENTB->cl_share_link,
                                                &nn->nfs_client_list);
                                        spin_unlock(&nn->nfs_client_lock);
                                        mutex_lock(&nfs_clid_init_mutex);
                                        nfs41_walk_client_list(clp, result, cred);
                                        nfs_wait_client_init_complete(CLIENTA);
(waiting for nfs_clid_init_mutex)

Make sure nfs_match_client() only evaluates clients that have completed
initialization in order to prevent that deadlock.

This patch also fixes v4.0 trunking behavior by not marking the client
NFS_CS_READY until the clientid has been confirmed.

Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Orabug: 28486463

(cherry picked from commit c156618e15101a9cc8c815108fec0300a0ec6637)
Signed-off-by: Calum Mackay <calum.mackay@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

infiniband: fix a possible use-after-free bug

ucma_process_join() will free the new allocated "mc" struct,
if there is any error after that, especially the copy_to_user().

But in parallel, ucma_leave_multicast() could find this "mc"
through idr_find() before ucma_process_join() frees it, since it
is already published.

So "mc" could be used in ucma_leave_multicast() after it is been
allocated and freed in ucma_process_join(), since we don't refcnt
it.

Fix this by separating "publish" from ID allocation, so that we
can get an ID first and publish it later after copy_to_user().

Fixes: c8f6a362bf3e ("RDMA/cma: Add multicast communication support")
Reported-by: Noam Rathaus <noamr@beyondsecurity.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
(cherry picked from commit cb2595c1393b4a5211534e6f0a0fbad369e21ad8)

Orabug: 28774517
CVE: CVE-2018-14734

Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/bugs/IBRS: properly use the _base ops if IBRS is currently unset

... or else we can run into an idle state with IBRS enabled which lead to
low performance.

Orabug: 28782729

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

hugetlbfs: check for reserve page count going negative

When modifying hugetlb reserve page count, check for the value going
negative. This should not happen in practice. However, if it does
log a message and stack trace to aid in debug.

Orabug: 28734496

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Larry Bassel <larry.bassel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

hugetlbfs: introduce truncation/fault mutex to avoid races

The following hugetlbfs truncate/page fault race can be recreated
with programs doing something like the following.

A huegtlbfs file is mmap(MAP_SHARED) with a size of 4 pages.  At
mmap time, 4 huge pages are reserved for the file/mapping.  So,
the global reserve count is 4.  In addition, since this is a shared
mapping an entry for 4 pages is added to the file's reserve map.
The first 3 of the 4 pages are faulted into the file.  As a result,
the global reserve count is now 1.

Task A starts to fault in the last page (routines hugetlb_fault,
hugetlb_no_page).  It allocates a huge page (alloc_huge_page).
The reserve map indicates there is a reserved page, so this is
used and the global reserve count goes to 0.

Now, task B truncates the file to size 0.  It starts by setting
inode size to 0(hugetlb_vmtruncate).  It then unmaps all mapping
of the file (hugetlb_vmdelete_list).  Since task A's page table
lock is not held at the time, truncation is not blocked.  Truncation
removes the 3 pages from the file (remove_inode_hugepages).  When
cleaning up the reserved pages (hugetlb_unreserve_pages), it notices
the reserve map was for 4 pages.  However, it has only freed 3 pages.
So it assumes there is still (4 - 3) 1 reserved pages.  It then
decrements the global reserve count by 1 and it goes negative.

Task A then continues the page fault process and adds it's newly
acquired page to the page cache.  Note that the index of this page
is beyond the size of the truncated file (0).  The page fault process
then notices the file has been truncated and exits.  However, the
page is left in the cache associated with the file.

Now, if the file is immediately deleted the truncate code runs again.
It will find and free the one page associated with the file.  When
cleaning up reserves, it notices the reserve map is empty.  Yet, one
page freed.  So, the global reserve count is decremented by (0 - 1) -1.
This returns the global count to 0 as it should be.  But, it is
possible for someone else to mmap this file/range before it is deleted.
If this happens, a reserve map entry for the allocated page is created
and the reserved page is forever leaked.

To avoid all these conditions, let's simply prevent faults to a file
while it is being truncated.  Add a new truncation specific rw mutex
to hugetlbfs inode extensions.  faults take the mutex in read mode,
truncation takes in write mode.

Orabug: 28734496

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Larry Bassel <larry.bassel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

mm/hugetlb.c: don't call region_abort if region_chg fails

Changes to hugetlbfs reservation maps is a two step process.  The first
step is a call to region_chg to determine what needs to be changed, and
prepare that change.  This should be followed by a call to call to
region_add to commit the change, or region_abort to abort the change.

The error path in hugetlb_reserve_pages called region_abort after a
failed call to region_chg.  As a result, the adds_in_progress counter in
the reservation map is off by 1.  This is caught by a VM_BUG_ON in
resv_map_release when the reservation map is freed.

syzkaller fuzzer (when using an injected kmalloc failure) found this
bug, that resulted in the following:

kernel BUG at mm/hugetlb.c:742!
Call Trace:
  hugetlbfs_evict_inode+0x7b/0xa0 fs/hugetlbfs/inode.c:493
  evict+0x481/0x920 fs/inode.c:553
  iput_final fs/inode.c:1515 [inline]
  iput+0x62b/0xa20 fs/inode.c:1542
  hugetlb_file_setup+0x593/0x9f0 fs/hugetlbfs/inode.c:1306
  newseg+0x422/0xd30 ipc/shm.c:575
  ipcget_new ipc/util.c:285 [inline]
  ipcget+0x21e/0x580 ipc/util.c:639
  SYSC_shmget ipc/shm.c:673 [inline]
  SyS_shmget+0x158/0x230 ipc/shm.c:657
  entry_SYSCALL_64_fastpath+0x1f/0xc2
RIP: resv_map_release+0x265/0x330 mm/hugetlb.c:742

Link: http://lkml.kernel.org/r/1490821682-23228-1-git-send-email-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Orabug: 28734496

(cherry picked from commit ff8c0c53c47530ffea82c22a0a6df6332b56c957)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Larry Bassel <larry.bassel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

mm: hugetlb: fix hugepage memory leak caused by wrong reserve count

When dequeue_huge_page_vma() in alloc_huge_page() fails, we fall back on
alloc_buddy_huge_page() to directly create a hugepage from the buddy
allocator.

In that case, however, if alloc_buddy_huge_page() succeeds we don't
decrement h->resv_huge_pages, which means that successful
hugetlb_fault() returns without releasing the reserve count.  As a
result, subsequent hugetlb_fault() might fail despite that there are
still free hugepages.

This patch simply adds decrementing code on that code path.

I reproduced this problem when testing v4.3 kernel in the following situation:
- the test machine/VM is a NUMA system,
- hugepage overcommiting is enabled,
- most of hugepages are allocated and there's only one free hugepage
   which is on node 0 (for example),
- another program, which calls set_mempolicy(MPOL_BIND) to bind itself to
   node 1, tries to allocate a hugepage,
- the allocation should fail but the reserve count is still hold.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: <stable@vger.kernel.org> [3.16+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Orabug: 28734496

(cherry picked from commit a88c769548047b21f76fd71e04b6a3300ff17160)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Larry Bassel <larry.bassel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

bnxt_en: xdp: don't make drivers report attachment mode (partial backport)

Orabug: 27988326

Pull the upstream commit 6b8675897338 to update just the bnxt_en driver.

Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

bpf: make bnxt compatible w/ bpf_xdp_adjust_tail

w/ bpf_xdp_adjust_tail helper xdp's data_end pointer could be changed as
well (only "decrease" of pointer's location is going to be supported).
changing of this pointer will change packet's size.
for bnxt driver we will just calculate packet's length unconditionally

Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Nikita V. Shirokov <tehnerd@tehnerd.com>
Acked-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
(cherry picked from commit b968e735c79767a3c91217fbae691581aa557d8d)
Orabug: 27988326

Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

bnxt_en: add meta pointer for direct access (partial backport)

Orabug: 27988326

Pull the upstream commit de8f3a83b0a0 to update only the bnxt_en driver.

Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

bnxt_en: Fix bug in ethtool -L.

When changing channels from combined to rx/tx or vice versa, the code
uses the wrong "sh" parameter to determine if we are reserving rings
for shared or non-shared mode. It should be using the ethtool requested
"sh" parameter instead of the current "sh" parameter.

Fix it by passing the "sh" parameter to bnxt_reserve_rings(). For
ethtool, we will pass in the requested "sh" parameter.

Fixes: 391be5c27364 ("bnxt_en: Implement new scheme to reserve tx rings.")
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
[ Upstream commit 3b6b34df342553a7522561e34288f5bb803aa9aa ]
Orabug: 27988326

Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

bpf: bnxt: Report bpf_prog ID during XDP_QUERY_PROG

Add support to bnxt to report bpf_prog ID during XDP_QUERY_PROG.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Michael Chan <michael.chan@broadcom.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 8902965f8cb23bba8aa7f3be293ec2f3067b82c6)
Orabug: 27988326

Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

bnxt_en: Optimize doorbell write operations for newer chips (reapply).

Older chips require the doorbells to be written twice, but newer chips
do not. Add a new common function bnxt_db_write() to write all
doorbells appropriately depending on the chip. Eliminating the extra
doorbell on newer chips has a significant performance improvement
on pktgen.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 434c975a8fe2f70b70ac09ea5ddd008e0528adfa)
Orabug: 27988326

Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/net/ethernet/broadcom/bnxt/bnxt.c

Signed-off-by: Brian Maly <brian.maly@oracle.com>

bnxt_en: Use short TX BDs for the XDP TX ring.

No offload is performed on the XDP_TX ring so we can use the short TX
BDs. This has the effect of doubling the size of the XDP TX ring so
that it now matches the size of the rx ring by default.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 932dbf83ba18bdb871e0c03a4ffdd9785f7a9c07)
Orabug: 27988326

Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

bnxt_en: Add ethtool mac loopback self test (reapply).

The mac loopback self test operates in polling mode.  To support that,
we need to add functions to open and close the NIC half way.  The half
open mode allows the rings to operate without IRQ and NAPI.  We
use the XDP transmit function to send the loopback packet.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
[ Upstream commit f7dc1ea6c4c1f31371b7098d6fae0d49dc6cdff1 ]
Orabug: 27988326

Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

bnxt_en: Add support for XDP_TX action.

Add dedicated transmit function and transmit completion handler for
XDP.  The XDP transmit logic and completion logic are different than
regular TX ring.  The TX buffer is recycled back to the RX ring when
it completes.

v3: Improved the buffer recyling scheme for XDP_TX.

v2: Add trace_xdp_exception().
    Add dma_sync.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Tested-by: Andy Gospodarek <gospo@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
[Upstream commit 38413406277fd060f46855ad527f6f8d4cf2652d]
Orabug: 27988326

Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

bnxt_en: Add basic XDP support.

Add basic ndo_xdp support to setup and query program, configure the NIC
to run in rx page mode, and support XDP_PASS, XDP_DROP, XDP_ABORTED
actions only.

v3: Pass modified offset and length to stack for XDP_PASS.
    Remove Kconfig option.

v2: Added trace_xdp_exception()
    Added dma_syncs.
    Added XDP headroom support.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Tested-by: Andy Gospodarek <gospo@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
[Upstream commit c6d30e8391b85e00eb544e6cf047ee0160ee9938]
Orabug: 27988326

Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/ia32: Restore r8 correctly in 32bit SYSCALL instruction entry.

This commit fixes a bug in a previous commit
8e69671028ac ("x86/ia32: Adds code hygiene for 32bit SYSCALL instruction
entry.")

SAVE_EXTRA_REGS does not save the r8 register. r8 is rather saved in
pt_regs->sp before it is cleared. So, retrieve r8 from pt_regs->sp.

Orabug: 28529706

Signed-off-by: Gayatri Vasudevan <gayatri.vasudevan@oracle.com>
Reviewed-by: Bert Barbe <bert.barbe@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

net: enable RPS on vlan devices

This patch modifies the RPS processing code so that it searches
for a matching vlan interface on the packet and then uses the
RPS settings of the vlan interface. If no vlan interface
is found or the vlan interface does not have RPS enabled,
it will fall back to the RPS settings of the underlying device.

In supporting VMs where we can't control the OS being used,
we'd like to separate the VM cpu processing from the host's
cpus as a way to help mitigate the impact of the L1TF issue.
When running the VM's traffic on a vlan we can stick the Rx
processing on one set of CPUs separate from the VM's CPUs.
Yes, choosing to use this will cause a bit of throughput pain
when the packets are actually passed into the VM and have to
move from one cache to another.

Orabug: 28645929

Signed-off-by: Silviu Smarandache <silviu.smarandache@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

xen-blkback: hold write vbd-lock while swapping the vbd

All paths holding a vbd handle or dereferencing it hold the read vbd-lock.
Swapping the device takes a write-trylock on the vbd. So, we fail a swap
if, for instance, there is an active IO operation.

Orabug: 28651655

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Bhavesh Davda <bhavesh.davda@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

xen-blkback: implement swapping of active vbd

Currently we disallow any change of major:minor of the vbd once created.
The danger is in the user switching backends where the contents of the
backend device are dissimilar.
However, changing the vbd can be quite useful -- for instance by switching
from a backend which is not multi-pathed (or raid'd) to one that is.

As for blkback itself, it is used purely as a passthrough device so there
is no state outside struct vbd which would be impacted with this change.

This patch allows a vbd swap, communicating the state of the swap via the
xenbus paths:

.../vbd/<domain>/51712/oracle/active-physical-device = "<major>:<minor>"
.../vbd/<domain>/51712/oracle/physical-device-change-status = "<error>"

Orabug: 28651655

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Bhavesh Davda <bhavesh.davda@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

xen-blkback: emit active physical device to xenstore

The vbd physical-device is set by user-space writing to device's xenbus
path:
/local/domain/0/backend/vbd/1/51712/physical-device = "<major>:<minor>"

However, there is no way for the blkback driver to specify the current
state of the physical-device. Accordingly we add two new backend paths:

.../vbd/<domain>/51712/oracle/active-physical-device = "<major>:<minor>"
.../vbd/<domain>/51712/oracle/physical-device-change-status = "<error>"

which specify the active physical device and any error incurred in
trying to instantiate a new physical device.

Orabug: 28651655

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Bhavesh Davda <bhavesh.davda@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

xen-blkback: refactor backend_changed()

Add xenbus_scan_be_params() to handle vbd param parsing. This gets
called from backend_changed().

Also place all the parameters (major, minor, cdrom, readonly) in struct
blkif_params. The readonly param is an integer replacement of mode --
this allows us to avoid tracking an additional pointer.

Orabug: 28651655

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Bhavesh Davda <bhavesh.davda@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

xen-blkback: pull out blkif grant features from vbd

grant related features don't have anything to do with the vbd.
Move them to the blkif structure.

Orabug: 28651655

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Bhavesh Davda <bhavesh.davda@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

mm: get rid of vmacache_flush_all() entirely

Jann Horn points out that the vmacache_flush_all() function is not only
potentially expensive, it's buggy too.  It also happens to be entirely
unnecessary, because the sequence number overflow case can be avoided by
simply making the sequence number be 64-bit.  That doesn't even grow the
data structures in question, because the other adjacent fields are
already 64-bit.

So simplify the whole thing by just making the sequence number overflow
case go away entirely, which gets rid of all the complications and makes
the code faster too.  Win-win.

[ Oleg Nesterov points out that the VMACACHE_FULL_FLUSHES statistics
  also just goes away entirely with this ]

Reported-by: Jann Horn <jannh@google.com>
Suggested-by: Will Deacon <will.deacon@arm.com>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 7a9cdebdcc17e426fb5287e4a82db1dfe86339b2)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
include/linux/mm_types.h
include/linux/mm_types_task.h
mm/debug.c

Orabug: 28701016
CVE: CVE-2018-17182

Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

rds: crash at rds_ib_inc_copy_to_user+104 due to NULL ptr reference

The customer hit this crash few times.

PID: 31556  TASK: ffff880f823caa00  CPU: 1   COMMAND: "cellsrv"
#0 [ffff880f823db850] machine_kexec at ffffffff8105d93c
#1 [ffff880f823db8b0] crash_kexec at ffffffff811103b3
#2 [ffff880f823db980] oops_end at ffffffff8101a788
#3 [ffff880f823db9b0] no_context at ffffffff8106b9cf
#4 [ffff880f823dba20] __bad_area_nosemaphore at ffffffff8106bc9d
#5 [ffff880f823dba70] bad_area at ffffffff8106be97
#6 [ffff880f823dbaa0] __do_page_fault at ffffffff8106c71e
#7 [ffff880f823dbb00] do_page_fault at ffffffff8106c81f
#8 [ffff880f823dbb40] page_fault at ffffffff816b5a9f
    [exception RIP: rds_ib_inc_copy_to_user+104]
    RIP: ffffffffa04607b8  RSP: ffff880f823dbbf8  RFLAGS: 00010287
    RAX: 0000000000000340  RBX: 0000000000001000  RCX: 0000000000004000
    RDX: 0000000000001000  RSI: ffff88176cea2000  RDI: ffff8817d291f520
    RBP: ffff880f823dbc48   R8: 0000000000001340   R9: 0000000000001000
    R10: 0000000000001200  R11: ffff880f823dc000  R12: ffff880f823dbed0
    R13: 0000000000000000  R14: 0000000000000000  R15: 0000000000001000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#9 [ffff880f823dbc50] rds_recvmsg at ffffffffa041d837 [rds]

int rds_ib_inc_copy_to_user(struct rds_incoming *inc, struct iov_iter *to)
...
...
        ibinc = container_of(inc, struct rds_ib_incoming, ii_inc);
        frag = list_entry(ibinc->ii_frags.next, struct rds_page_frag, f_item);
        len = be32_to_cpu(inc->i_hdr.h_len);
        sg = frag->f_sg;

        while (iov_iter_count(to) && copied < len) {
                to_copy = min_t(unsigned long, iov_iter_count(to),
                                sg->length - frag_off);
                ...

sg is NULL and it crashes accessing sg->length above.

The cause looks like is due to ic->i_frag_sz returning incorrect value.
16KB when 4KB was expected.

                if (copied % ic->i_frag_sz == 0) {
                        frag = list_entry(frag->f_item.next,
                                          struct rds_page_frag, f_item);
                        frag_off = 0;
                        sg = frag->f_sg;
                }

The other end is using 4KB RDS fragsize (Solaris Super Cluster).
This end is UEK4 (4.1.12-94.8.4.el6uek.x86_64).

The message being copied arrived over 4KB RDS frag size connection.
But during the above check ic->i_frag_sz is 16KB.
This can happen during a reconnect at the connection setup phase.
We start off with ic->i_frag_sz as 16KB. Then settle down at 4KB.

Failing this check
  if (copied % ic->i_frag_sz == 0) {
can result in sg not getting set correctly.

Say, "copied" = 4KB but ic->i_frag_sz is 16KB when it should be 4KB.

During race condition with a reconnect, ic->i_frag_sz can be 16KB
even though once the connection is set up it settled down to 4KB.
It can change from 4KB to 16KB and back to 4KB during connection setup
due to reconnect.

We started seeing this crash after bug 26848749.
But prior to that the same scenario could result in data copied to user
from incorrect "sg" resulting in data corruption.

Orabug: 28506569

Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

IB/core: For multicast functions, verify that LIDs are multicast LIDs

The Infiniband spec defines "A multicast address is defined by a
MGID and a MLID" (section 10.5).  Currently the MLID value is not
validated.

Add check to verify that the MLID value is in the correct address
range.

Fixes: 0c33aeedb2cf ("[IB] Add checks to multicast attach and detach")
Cc: stable@vger.kernel.org
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dasaratharaman Chandramouli <dasaratharaman.chandramouli@intel.com>
Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
(cherry picked from commit 8561eae60ff9417a50fa1fb2b83ae950dc5c1e21)

The reason for applying this patch is a bug in ibacm, which goes
undetected without this patch. For further information, see
https://patchwork.kernel.org/patch/10008461/

Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:

   * Had to add the define IB_MULTICAST_LID_BASE in
     include/rdma/ib_verbs.h, due to missing commit b4e64397dabc
     ("IB/rdmavt: Break rdma_vt main include header file up").

Orabug: 28700490

Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
---
v2 -> v3:
   * Added Shannon's r-b

v1 -> v2:
   * Adding IB_MULTICAST_LID_BASE to ib_verbs.h
   * A little more verbose commit message

Signed-off-by: Brian Maly <brian.maly@oracle.com>

sunrpc: increase UNX_MAXNODENAME from 32 to __NEW_UTS_LEN bytes

The current limit of 32 bytes artificially limits the name string that
we end up stuffing into NFSv4.x client ID blobs. If you have multiple
hosts with long hostnames that only differ near the end, then this can
cause NFSv4 client ID collisions.

Linux nodenames are actually limited to __NEW_UTS_LEN bytes (64), so use
that as the limit instead. Also, use XDR_QUADLEN to specify the slack
length, just for clarity and in case someone in the future changes this
to something not evenly divisible by 4.

Reported-by: Michael Skralivetsky <michael.skralivetsky@primarydata.com>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Orabug: 28660177
(cherry picked from commit 24a9a9610ce3ba36fd87c1d2f2c9106de6b7e832)
Signed-off-by: Calum Mackay <calum.mackay@oracle.com>
Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Reviewed-by: John Sobecki <john.sobecki@oracle.com>
Tested-by: Joe Jin <joe.jin@oracle.com>

net: rds: Use address family to designate IPv4 or IPv6 addresses

The condition to interpret the supplied address in
rds_cancel_sent_to() was the length of the supplied user-data. We need
to tighten the API here to the extent possible, subject to SKGXP
passing in sizeof(struct sockaddr_storage) as the optlen, independent
of the actual address size.

We will 1) make sure the data passed in to the kernel is "big enough",
2) that the address is either AF_INET or AF_INET6, and 3) that the
bound address is ipv6_addr_v4mapped() in the AF_INET case and not in
the AF_INET6 case.

Orabug: 28720071

Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>

net: rds: Fix blank at eol in af_rds.c

Orabug: 28720071

Signed-off-by: Håkon Bugge <Haakon.Bugge@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>

exec: Limit arg stack to at most 75% of _STK_LIM

Orabug: 28709994
CVE: CVE-2018-14634

To avoid pathological stack usage or the need to special-case setuid
execs, just limit all arg stack usage to at most 75% of _STK_LIM (6MB).

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit da029c11e6b12f321f36dac8771e833b65cec962)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

nsfs: mark dentry with DCACHE_RCUACCESS

Andrey reported a use-after-free in __ns_get_path():

  spin_lock include/linux/spinlock.h:299 [inline]
  lockref_get_not_dead+0x19/0x80 lib/lockref.c:179
  __ns_get_path+0x197/0x860 fs/nsfs.c:66
  open_related_ns+0xda/0x200 fs/nsfs.c:143
  sock_ioctl+0x39d/0x440 net/socket.c:1001
  vfs_ioctl fs/ioctl.c:45 [inline]
  do_vfs_ioctl+0x1bf/0x1780 fs/ioctl.c:685
  SYSC_ioctl fs/ioctl.c:700 [inline]
  SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691

We are under rcu read lock protection at that point:

        rcu_read_lock();
        d = atomic_long_read(&ns->stashed);
        if (!d)
                goto slow;
        dentry = (struct dentry *)d;
        if (!lockref_get_not_dead(&dentry->d_lockref))
                goto slow;
        rcu_read_unlock();

but don't use a proper RCU API on the free path, therefore a parallel
__d_free() could free it at the same time.  We need to mark the stashed
dentry with DCACHE_RCUACCESS so that __d_free() will be called after all
readers leave RCU.

Fixes: e149ed2b805f ("take the targets of /proc/*/ns/* symlinks to separate fs")
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 073c516ff73557a8f7315066856c04b50383ac34)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
fs/nsfs.c

Orabug: 28576290
CVE: CVE-2018-5873

Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

dm crypt: add middle-endian variant of plain64 IV

The big-endian IV (plain64be) backend implementation has bugs and
ended up with a weird middle endian (left shift by 32).

Adding a new module for this.

The middle endian is this weirdness where the value is shifted
to the middle (and also wraps):

iv: 00000000000000010000000000000000
instead of:
iv: 00000000000000000000000000000001

Orabug: 28604628
Reviewed-by: Mark Kanda <mark.kanda@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

IB/ipoib: Improve filtering log message

The values of subnet_prefix and sgid are missing in log message when a
non-CM packet is filtered since they are both initialized only for ARP
packets.

Fix this by setting them anyway.

Orabug: 28655409

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

IB/ipoib: Fix wrong update of arp_blocked counter

Counter is updated after goto instruction - swap the two lines.

Orabug: 28655409

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

IB/ipoib: Update RX counters after ACL filtering

There is no reason to update RX counters if packet will be dropped in
IB-ACL filtering.

Orabug: 28655409

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

IB/ipoib: Filter RX packets before adding pseudo header

IB-ACL filtering is based on SGID (carried in the GRH).
The function skb_add_pseudo_hdr clears the SGID from the header.

Doing filtering after this call has no effect since all packets will drop
for not having SGID in.

Fix this by moving the call to skb_add_pseudo_hdr to be after IB-ACL
filtering while still adjusting skb->data as needed for the filtering.

Orabug: 28655409

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

cdrom: Fix info leak/OOB read in cdrom_ioctl_drive_status

Like d88b6d04: "cdrom: information leak in cdrom_ioctl_media_changed()"

There is another cast from unsigned long to int which causes
a bounds check to fail with specially crafted input. The value is
then used as an index in the slot array in cdrom_slot_status().

Signed-off-by: Scott Bauer <scott.bauer@intel.com>
Signed-off-by: Scott Bauer <sbauer@plzdonthack.me>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
(cherry picked from commit 8f3fafc9c2f0ece10832c25f7ffcb07c97a32ad4)

Orabug: 28664501
CVE: CVE-2018-16658

Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

ACPICA: acpi: acpica: fix acpi operand cache leak in nseval.c

I found an ACPI cache leak in ACPI early termination and boot continuing case.

When early termination occurs due to malicious ACPI table, Linux kernel
terminates ACPI function and continues to boot process. While kernel terminates
ACPI function, kmem_cache_destroy() reports Acpi-Operand cache leak.

Boot log of ACPI operand cache leak is as follows:
>[    0.464168] ACPI: Added _OSI(Module Device)
>[    0.467022] ACPI: Added _OSI(Processor Device)
>[    0.469376] ACPI: Added _OSI(3.0 _SCP Extensions)
>[    0.471647] ACPI: Added _OSI(Processor Aggregator Device)
>[    0.477997] ACPI Error: Null stack entry at ffff880215c0aad8 (20170303/exresop-174)
>[    0.482706] ACPI Exception: AE_AML_INTERNAL, While resolving operands for [opcode_name unavailable] (20170303/dswexec-461)
>[    0.487503] ACPI Error: Method parse/execution failed [\DBG] (Node ffff88021710ab40), AE_AML_INTERNAL (20170303/psparse-543)
>[    0.492136] ACPI Error: Method parse/execution failed [\_SB._INI] (Node ffff88021710a618), AE_AML_INTERNAL (20170303/psparse-543)
>[    0.497683] ACPI: Interpreter enabled
>[    0.499385] ACPI: (supports S0)
>[    0.501151] ACPI: Using IOAPIC for interrupt routing
>[    0.503342] ACPI Error: Null stack entry at ffff880215c0aad8 (20170303/exresop-174)
>[    0.506522] ACPI Exception: AE_AML_INTERNAL, While resolving operands for [opcode_name unavailable] (20170303/dswexec-461)
>[    0.510463] ACPI Error: Method parse/execution failed [\DBG] (Node ffff88021710ab40), AE_AML_INTERNAL (20170303/psparse-543)
>[    0.514477] ACPI Error: Method parse/execution failed [\_PIC] (Node ffff88021710ab18), AE_AML_INTERNAL (20170303/psparse-543)
>[    0.518867] ACPI Exception: AE_AML_INTERNAL, Evaluating _PIC (20170303/bus-991)
>[    0.522384] kmem_cache_destroy Acpi-Operand: Slab cache still has objects
>[    0.524597] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 4.12.0-rc5 #26
>[    0.526795] Hardware name: innotek gmb_h virtual_box/virtual_box, BIOS virtual_box 12/01/2006
>[    0.529668] Call Trace:
>[    0.530811]  ? dump_stack+0x5c/0x81
>[    0.532240]  ? kmem_cache_destroy+0x1aa/0x1c0
>[    0.533905]  ? acpi_os_delete_cache+0xa/0x10
>[    0.535497]  ? acpi_ut_delete_caches+0x3f/0x7b
>[    0.537237]  ? acpi_terminate+0xa/0x14
>[    0.538701]  ? acpi_init+0x2af/0x34f
>[    0.540008]  ? acpi_sleep_proc_init+0x27/0x27
>[    0.541593]  ? do_one_initcall+0x4e/0x1a0
>[    0.543008]  ? kernel_init_freeable+0x19e/0x21f
>[    0.546202]  ? rest_init+0x80/0x80
>[    0.547513]  ? kernel_init+0xa/0x100
>[    0.548817]  ? ret_from_fork+0x25/0x30
>[    0.550587] vgaarb: loaded
>[    0.551716] EDAC MC: Ver: 3.0.0
>[    0.553744] PCI: Probing PCI hardware
>[    0.555038] PCI host bridge to bus 0000:00
> ... Continue to boot and log is omitted ...

I analyzed this memory leak in detail and found acpi_ns_evaluate() function
only removes Info->return_object in AE_CTRL_RETURN_VALUE case. But, when errors
occur, the status value is not AE_CTRL_RETURN_VALUE, and Info->return_object is
also not null. Therefore, this causes acpi operand memory leak.

This cache leak causes a security threat because an old kernel (<= 4.9) shows
memory locations of kernel functions in stack dump. Some malicious users
could use this information to neutralize kernel ASLR.

I made a patch to fix ACPI operand cache leak.

Signed-off-by: Seunghun Han <kkamagui@gmail.com>
Signed-off-by: Erik Schmauss <erik.schmauss@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit 97f3c0a4b0579b646b6b10ae5a3d59f0441cc12c)

Orabug: 28664577
CVE: CVE-2017-13695

Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

uek-rpm: Disable deprecated CONFIG_ACPI_PROCFS_POWER

Disable the CONFIG_ACPI_PROCFS_POWER option which allows
deprecated power /proc/acpi/ directories.

Orabug: 28680213

Signed-off-by: Victor Erminpour <victor.erminpour@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: John Haxby <john.haxby@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

Revert "x86/spec_ctrl: Only set SPEC_CTRL_IBRS_FIRMWARE if IBRS is actually in use"

Orabug: 28610707

This reverts commit 6a9757c562b36e32798b3e69a22295cd55ef8a69.

btrfs: fix check_shared for fiemap ioctl

Only in the case of different root_id or different object_id, check_shared
identified extent as the shared. However, If a extent was referred by
different offset of same file, it should also be identified as shared.
In addition, check_shared's loop scale is at least n^3, so if a extent
has too many references, even causes soft hang up.

First, add all delayed_ref to the ref_tree and calculate the unqiue_refs,
if the unique_refs is greater than one, return BACKREF_FOUND_SHARED.
Then individually add the on-disk reference(inline/keyed) to the ref_tree
and calculate the unique_refs of the ref_tree to check if the unique_refs
is greater than one.Because once there are two references to return
SHARED, so the time complexity is close to the constant.

Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
(cherry picked from commit afce772e87c36c7f07f230a76d525025aaf09e41)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
fs/btrfs/backref.c - Modified comment so that patch applies.

Signed-off-by: Divya Indi <divya.indi@oracle.com>
Orabug: 24716710
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/pti: Don't report XenPV as vulnerable

Xen PV domain kernel is not by design affected by meltdown as it's
enforcing split CR3 itself. Let's not report such systems as "Vulnerable"
in sysfs (we're also already forcing PTI to off in X86_HYPER_XEN_PV cases);
the security of the system ultimately depends on presence of mitigation in
the Hypervisor, which can't be easily detected from DomU; let's report
that.

Reported-and-tested-by: Mike Latimer <mlatimer@suse.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Juergen Gross <jgross@suse.com>
Cc: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/nycvar.YFH.7.76.1806180959080.6203@cbobk.fhfr.pm
[ Merge the user-visible string into a single line. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 6cb2b08ff92460290979de4be91363e5d1b6cec1)

Conflicts:
arch/x86/kernel/cpu/bugs.c

In UEK4, these changes are made in arch/x86/kernel/cpu/bugs_64.c.

Context around the headers was slightly different (there were some extra
headers relative to the cherry-picked patch).

There is noX86_HYPER_XEN_PV, instead compare x86_hyper to x86_hyper_xen.

Orabug: 28476681

Signed-off-by: Patrick Colp <patrick.colp@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

xfs: give all workqueues rescuer threads

Orabug: 28518694

We're consistently hitting deadlocks here with XFS on recent kernels.
After some digging through the crash files, it looks like everyone in
the system is waiting for XFS to reclaim memory.

Something like this:

PID: 2733434 TASK: ffff8808cd242800 CPU: 19 COMMAND: "java"
#0 [ffff880019c53588] __schedule at ffffffff818c4df2
#1 [ffff880019c535d8] schedule at ffffffff818c5517
#2 [ffff880019c535f8] _xfs_log_force_lsn at ffffffff81316348
#3 [ffff880019c53688] xfs_log_force_lsn at ffffffff813164fb
#4 [ffff880019c536b8] xfs_iunpin_wait at ffffffff8130835e
#5 [ffff880019c53728] xfs_reclaim_inode at ffffffff812fd453
#6 [ffff880019c53778] xfs_reclaim_inodes_ag at ffffffff812fd8c7
#7 [ffff880019c53928] xfs_reclaim_inodes_nr at ffffffff812fe433
#8 [ffff880019c53958] xfs_fs_free_cached_objects at ffffffff8130d3b9
#9 [ffff880019c53968] super_cache_scan at ffffffff811a6f73

xfs_log_force_lsn is waiting for logs to get cleaned, which is waiting
for IO, which is waiting for workers to complete the IO which is waiting
for worker threads that don't exist yet:

PID: 2752451 TASK: ffff880bd6bdda00 CPU: 37 COMMAND: "kworker/37:1"
#0 [ffff8808d20abbb0] __schedule at ffffffff818c4df2
#1 [ffff8808d20abc00] schedule at ffffffff818c5517
#2 [ffff8808d20abc20] schedule_timeout at ffffffff818c7c6c
#3 [ffff8808d20abcc0] wait_for_completion_killable at ffffffff818c6495
#4 [ffff8808d20abd30] kthread_create_on_node at ffffffff8106ec82
#5 [ffff8808d20abdf0] create_worker at ffffffff8106752f
#6 [ffff8808d20abe40] worker_thread at ffffffff810699be
#7 [ffff8808d20abec0] kthread at ffffffff8106ef59
#8 [ffff8808d20abf50] ret_from_fork at ffffffff818c8ac8

I think we should be using WQ_MEM_RECLAIM to make sure this thread
pool makes progress when we're not able to allocate new workers.

[dchinner: make all workqueues WQ_MEM_RECLAIM]

Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
(cherry picked from commit 7a29ac474a47eb8cf212b45917683ae89d6fa13b)

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Shan Hai <shan.hai@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/spec_ctrl: Only set SPEC_CTRL_IBRS_FIRMWARE if IBRS is actually in use

Currently the SPEC_CTRL_IBRS_FIRMWARE flag always gets set as long as
IBRS is supported by the hardware. However, as best as can be determined
by the documention, if IBRS has been disabled (e.g., spectre_v2=off) then
SPEC_CTRL_IBRS_FIRMWARE should not be set:

nospectre_v2 [X86] Disable all mitigations for the Spectre variant 2
(indirect branch prediction) vulnerability. System may
allow data leaks with this option, which is equivalent
to spectre_v2=off.

and:

spectre_v2= [X86] Control mitigation of Spectre variant 2
(indirect branch speculation) vulnerability.

off - unconditionally disable

Add a check in set_ibrs_firmware() to only set SPEC_CTRL_IBRS_FIRMWARE if
ibrs_disabled is not also set.

Orabug: 28274907

Signed-off-by: Patrick Colp <patrick.colp@oracle.com>
Reviewed-by: Kanth Ghatraju <kanth.ghatraju@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

tcp: add tcp_ooo_try_coalesce() helper

commit 58152ecbbcc6a0ce7fddd5bf5f6ee535834ece0c upstream.

In case skb in out_or_order_queue is the result of
multiple skbs coalescing, we would like to get a proper gso_segs
counter tracking, so that future tcp_drop() can report an accurate
number.

I chose to not implement this tracking for skbs in receive queue,
since they are not dropped, unless socket is disconnected.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(adapted from v4.9.x commit 36ee106e844187e3fc612c9b87f12e5e23e9d8a5)

Orabug: 28639707
CVE: CVE-2018-5390

Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

tcp: call tcp_drop() from tcp_data_queue_ofo()

[ Upstream commit 8541b21e781a22dce52a74fef0b9bed00404a1cd ]

In order to be able to give better diagnostics and detect
malicious traffic, we need to have better sk->sk_drops tracking.

Fixes: 9f5afeae5152 ("tcp: use an RB tree for ooo receive queue")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(adapted from v4.9.x commit 94623c7463f3424776408df2733012c42b52395a)

Orabug: 28639707
CVE: CVE-2018-5390

Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

tcp: detect malicious patterns in tcp_collapse_ofo_queue()

[ Upstream commit 3d4bf93ac12003f9b8e1e2de37fe27983deebdcf ]

In case an attacker feeds tiny packets completely out of order,
tcp_collapse_ofo_queue() might scan the whole rb-tree, performing
expensive copies, but not changing socket memory usage at all.

1) Do not attempt to collapse tiny skbs.
2) Add logic to exit early when too many tiny skbs are detected.

We prefer not doing aggressive collapsing (which copies packets)
for pathological flows, and revert to tcp_prune_ofo_queue() which
will be less expensive.

In the future, we might add the possibility of terminating flows
that are proven to be malicious.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(adapted from v4.9.x commit a878681484a0992ee3dfbd7826439951f9f82a69)

Orabug: 28639707
CVE: CVE-2018-5390

Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

tcp: avoid collapses in tcp_prune_queue() if possible

[ Upstream commit f4a3313d8e2ca9fd8d8f45e40a2903ba782607e7 ]

Right after a TCP flow is created, receiving tiny out of order
packets allways hit the condition :

if (atomic_read(&sk->sk_rmem_alloc) >= sk->sk_rcvbuf)
tcp_clamp_window(sk);

tcp_clamp_window() increases sk_rcvbuf to match sk_rmem_alloc
(guarded by tcp_rmem[2])

Calling tcp_collapse_ofo_queue() in this case is not useful,
and offers a O(N^2) surface attack to malicious peers.

Better not attempt anything before full queue capacity is reached,
forcing attacker to spend lots of resource and allow us to more
easily detect the abuse.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(adapted from v4.9.x commit fdf258ed5dd85b57cf0e0e66500be98d38d42d02)

Orabug: 28639707
CVE: CVE-2018-5390

Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>