www.infradead.org Git - users/jedix/linux-maple.git/log

x86/spec_ctrl: save IBRS MSR value in paranoid_entry

If the NMI runs while entering kernel between SWAPGS and IBRS_ENABLE
everything is fine, paranoid_entry would have unconditionally set
IBRS bit 0 and when exiting the NMI it would have cleared bit 0 like
if it was returning to userland. IBRS_ENABLE would have then enabled
bit 0 again.

If NMI instead runs when exiting kernel between IBRS_DISABLE and
SWAPGS, the NMI would have turned on IBRS bit 0 and then it would have
left enabled when exiting the NMI. IBRS bit 0 would then be left
enabled in userland until the next enter kernel.

That is a minor inefficiency only, but we can eliminate it by saving
the MSR when entering the NMI in save_paranoid and restoring it when
exiting the NMI.

Orabug: 27344012
CVE: CVE-2017-5715

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: John Haxby <john.haxby@oracle.com>
Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>

*Scaffolding* x86/spec_ctrl: Add sysctl knobs to enable/disable SPEC_CTRL feature

1. At boot time
noibrs kernel boot parameter will disable IBRS usage
noibpb kernel boot parameter will disable IBPB usage

Otherwise if the above parameters are not specified, the system
will enable ibrs and ibpb usage if the cpu supports it.

Orabug: 27344012
CVE: CVE-2017-5715

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
[*Scaffolding*:This backport lacks a lot. It is only put it on so that the
later patches compiled _and_ can be tested to run. It is meant to be removed
once the full set of patches are all good. Aka scaffolding.]

Reviewed-by: John Haxby <john.haxby@oracle.com>
Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>

x86/enter: Use IBRS on syscall and interrupts

Set IBRS upon kernel entrance via syscall and interrupts. Clear it
upon exit.

Orabug: 27344012
CVE: CVE-2017-5715

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
[Backport: I had to add 'asm/spec_ctrl.h' in the assembler files]
Also we should not put ENABLE_IBRS on irq_entries_start]

Reviewed-by: John Haxby <john.haxby@oracle.com>
Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>

x86: Add macro that does not save rax, rcx, rdx on stack to disable IBRS

Orabug: 27344012
CVE: CVE-2017-5715

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: John Haxby <john.haxby@oracle.com>
Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>

x86/enter: MACROS to set/clear IBRS and set IBP

Setup macros to control IBRS and IBPB

Orabug: 27344012
CVE: CVE-2017-5715

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
[Backport: In UEK4 it is 'cpufeature.h', not 'cpufeatures.h']

Reviewed-by: John Haxby <john.haxby@oracle.com>
Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>

x86/feature: Report presence of IBPB and IBRS control

Report presence of IBPB and IBRS.

Orabug: 27344012
CVE: CVE-2017-5715

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: John Haxby <john.haxby@oracle.com>
Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>

x86: Add STIBP feature enumeration

Enumerate single thread indirect branch predictors (STIBP) feature. It
provides means to prevent indirect branch predictions from being
controlled by sibling HW thread.

Orabug: 27344012
CVE: CVE-2017-5715

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: John Haxby <john.haxby@oracle.com>
Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>

x86/cpufeature: Add X86_FEATURE_IA32_ARCH_CAPS and X86_FEATURE_IBRS_ATT

Enumerate future CPU that implements IBRS all the time in its architecture.

Orabug: 27344012
CVE: CVE-2017-5715

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: John Haxby <john.haxby@oracle.com>
Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>

x86/feature: Enable the x86 feature to control

cpuid ax=0x7, return rdx bit 26 to indicate presence of this feature
IA32_SPEC_CTRL (0x48) and IA32_PRED_CMD (0x49)
IA32_SPEC_CTRL, bit0 – Indirect Branch Restricted Speculation (IBRS)
IA32_PRED_CMD, bit0 – Indirect Branch Prediction Barrier (IBPB)

Orabug: 27344012
CVE: CVE-2017-5715

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: John Haxby <john.haxby@oracle.com>
Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>

dccp: CVE-2017-8824: use-after-free in DCCP code

Whenever the sock object is in DCCP_CLOSED state,
dccp_disconnect() must free dccps_hc_tx_ccid and
dccps_hc_rx_ccid and set to NULL.

Signed-off-by: Mohamed Ghannam <simo.ghannam@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 69c64866ce072dea1d1e59a0d61e0f66c0dffb76)

Orabug: 27290292
CVE: CVE-2017-8824

Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>

negotiate_mq should happen in all cases of a new VBD being discovered by
xen-blkfront, whether called through _probe() or a hot-attached new VBD
from dom-0 via xenstore. Otherwise, hot-attached new VBDs are left
configured without multi-queue.

Orabug: 27180421

Signed-off-by: Bhavesh Davda <bhavesh.davda@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Patrick Colp <patrick.colp@oracle.com>

e1000: avoid null pointer dereference on invalid stat type

Currently if the stat type is invalid then data[i] is being set
either by dereferencing a null pointer p, or it is reading from
an incorrect previous location if we had a valid stat type
previously. Fix this by skipping over the read of p on an invalid
stat type.

Detected by CoverityScan, CID#113385 ("Explicit null dereferenced")

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 27069012
(cherry picked from commit 5983587c8c5ef00d6886477544ad67d495bc5479)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>

e1000: fix race condition between e1000_down() and e1000_watchdog

This patch fixes a race condition that can result into the interface being
up and carrier on, but with transmits disabled in the hardware.
The bug may show up by repeatedly IFF_DOWN+IFF_UP the interface, which
allows e1000_watchdog() interleave with e1000_down().

    CPU x                           CPU y
    --------------------------------------------------------------------
    e1000_down():
        netif_carrier_off()
                                    e1000_watchdog():
                                        if (carrier == off) {
                                            netif_carrier_on();
                                            enable_hw_transmit();
                                        }
        disable_hw_transmit();
                                    e1000_watchdog():
                                        /* carrier on, do nothing */

Signed-off-by: Vincenzo Maffione <v.maffione@gmail.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 27069012
(cherry picked from commit 44c445c3d1b4eacff23141fa7977c3b2ec3a45c9)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>

e1000e: Be drop monitor friendly

e1000e_put_txbuf() can be called from normal reclamation path as well as
when a DMA mapping failure, so we need to differentiate these two cases
when freeing SKBs to be drop monitor friendly. e1000e_tx_hwtstamp_work()
and e1000_remove() are processing TX timestamped SKBs and those should
not be accounted as drops either.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 27069012
(cherry picked from commit 377b62736c01f14309141c69caa6d84363c12e12)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>

e1000e: apply burst mode settings only on default

Devices that support FLAG2_DMA_BURST have different default values
for RDTR and RADV. Apply burst mode default settings only when no
explicit value was passed at module load.

The RDTR default is zero. If the module is loaded for low latency
operation with RxIntDelay=0, do not override this value with a burst
default of 32.

Move the decision to apply burst values earlier, where explicitly
initialized module variables can be distinguished from defaults.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 27069012
(cherry picked from commit 48072ae1ec7a1c778771cad8c1b8dd803c4992ab)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>

e1000e: fix buffer overrun while the I219 is processing DMA transactions

Intel® 100/200 Series Chipset platforms reduced the round-trip
latency for the LAN Controller DMA accesses, causing in some high
performance cases a buffer overrun while the I219 LAN Connected
Device is processing the DMA transactions. I219LM and I219V devices
can fall into unrecovered Tx hang under very stressfully UDP traffic
and multiple reconnection of Ethernet cable. This Tx hang of the LAN
Controller is only recovered if the system is rebooted. Slightly slow
down DMA access by reducing the number of outstanding requests.
This workaround could have an impact on TCP traffic performance
on the platform. Disabling TSO eliminates performance loss for TCP
traffic without a noticeable impact on CPU performance.

Please, refer to I218/I219 specification update:
https://www.intel.com/content/www/us/en/embedded/products/networking/
ethernet-connection-i218-family-documentation.html

Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Reviewed-by: Dima Ruinskiy <dima.ruinskiy@intel.com>
Reviewed-by: Raanan Avargil <raanan.avargil@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 27069012
(cherry picked from commit b10effb92e272051dd1ec0d7be56bf9ca85ab927)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>

e1000e: Avoid receiver overrun interrupt bursts

When e1000e_poll() is not fast enough to keep up with incoming traffic, the
adapter (when operating in msix mode) raises the Other interrupt to signal
Receiver Overrun.

This is a double problem because 1) at the moment e1000_msix_other()
assumes that it is only called in case of Link Status Change and 2) if the
condition persists, the interrupt is repeatedly raised again in quick
succession.

Ideally we would configure the Other interrupt to not be raised in case of
receiver overrun but this doesn't seem possible on this adapter. Instead,
we handle the first part of the problem by reverting to the practice of
reading ICR in the other interrupt handler, like before commit 16ecba59bc33
("e1000e: Do not read ICR in Other interrupt"). Thanks to commit
0a8047ac68e5 ("e1000e: Fix msi-x interrupt automask") which cleared IAME
from CTRL_EXT, reading ICR doesn't interfere with RxQ0, TxQ0 interrupts
anymore. We handle the second part of the problem by not re-enabling the
Other interrupt right away when there is overrun. Instead, we wait until
traffic subsides, napi polling mode is exited and interrupts are
re-enabled.

Reported-by: Lennart Sorensen <lsorense@csclub.uwaterloo.ca>
Fixes: 16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt")
Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 27069012
(cherry picked from commit 4aea7a5c5e940c1723add439f4088844cd26196d)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>

e1000e: Separate signaling for link check/link up

Lennart reported the following race condition:

\ e1000_watchdog_task
    \ e1000e_has_link
        \ hw->mac.ops.check_for_link() === e1000e_check_for_copper_link
            /* link is up */
            mac->get_link_status = false;

                            /* interrupt */
                            \ e1000_msix_other
                                hw->mac.get_link_status = true;

        link_active = !hw->mac.get_link_status
        /* link_active is false, wrongly */

This problem arises because the single flag get_link_status is used to
signal two different states: link status needs checking and link status is
down.

Avoid the problem by using the return value of .check_for_link to signal
the link status to e1000e_has_link().

Reported-by: Lennart Sorensen <lsorense@csclub.uwaterloo.ca>
Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 27069012
(cherry picked from commit 19110cfbb34d4af0cdfe14cd243f3b09dc95b013)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>

e1000e: Fix return value test

All the helpers return -E1000_ERR_PHY.

Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 27069012
(cherry picked from commit d3509f8bc7b0560044c15f0e3ecfde1d9af757a6)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>

e1000e: Fix wrong comment related to link detection

Reading e1000e_check_for_copper_link() shows that get_link_status is set to
false after link has been detected. Therefore, it stays TRUE until then.

Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 27069012
(cherry picked from commit 65a29da1f5fd20fdebef3b959bef9b3660807b20)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>

e1000e: Fix error path in link detection

In case of error from e1e_rphy(), the loop will exit early and "success"
will be set to true erroneously.

Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 27069012
(cherry picked from commit c4c40e51f9c32c6dd8adf606624c930a1c4d9bbb)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>

drivers: net: e1000e: use setup_timer() helper.

Use setup_timer function instead of initializing timer with the
function and data fields.

Signed-off-by: Allen Pais <allen.lkml@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Orabug: 27069012
(cherry picked from commit 4a9c07ed71c2b8d755ee585264f80dd2d82a8066)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>

e1000e: Initial Support for IceLake

i219 (8) and i219 (9) are the next LOM generations that will be available
on the next Intel Client platform (IceLake).
This patch provides the initial support for these devices

Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Reviewed-by: Raanan Avargil <raanan.avargil@intel.com>
Reviewed-by: Dima Ruinskiy <dima.ruinskiy@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 27069012
(cherry picked from commit 48f76b68f9fca4e1d5bbb1755d14e8e8e09bdd5b)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>

e1000e: add check on e1e_wphy() return value

Check return value from call to e1e_wphy(). This value is being
checked during previous calls to function e1e_wphy() and it seems
a check was missing here.

Addresses-Coverity-ID: 1226905
Signed-off-by: Gustavo A R Silva <garsilva@embeddedor.com>
Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 27069012
(cherry picked from commit d75372a2daf5dc48207ee9e5592917e893cddb87)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>

e1000e: Undo e1000e_pm_freeze if __e1000_shutdown fails

An error during suspend (e100e_pm_suspend),

[  429.994338] ACPI : EC: event blocked
[  429.994633] e1000e: EEE TX LPI TIMER: 00000011
[  430.955451] pci_pm_suspend(): e1000e_pm_suspend+0x0/0x30 [e1000e] returns -2
[  430.955454] dpm_run_callback(): pci_pm_suspend+0x0/0x140 returns -2
[  430.955458] PM: Device 0000:00:19.0 failed to suspend async: error -2
[  430.955581] PM: Some devices failed to suspend, or early wake event detected
[  430.957709] ACPI : EC: event unblocked

lead to complete failure:

[  432.585002] ------------[ cut here ]------------
[  432.585013] WARNING: CPU: 3 PID: 8372 at kernel/irq/manage.c:1478 __free_irq+0x9f/0x280
[  432.585015] Trying to free already-free IRQ 20
[  432.585016] Modules linked in: cdc_ncm usbnet x86_pkg_temp_thermal intel_powerclamp coretemp mii crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep lpc_ich snd_hda_core snd_pcm mei_me mei sdhci_pci sdhci i915 mmc_core e1000e ptp pps_core prime_numbers
[  432.585042] CPU: 3 PID: 8372 Comm: kworker/u16:40 Tainted: G     U          4.10.0-rc8-CI-Patchwork_3870+ #1
[  432.585044] Hardware name: LENOVO 2356GCG/2356GCG, BIOS G7ET31WW (1.13 ) 07/02/2012
[  432.585050] Workqueue: events_unbound async_run_entry_fn
[  432.585051] Call Trace:
[  432.585058]  dump_stack+0x67/0x92
[  432.585062]  __warn+0xc6/0xe0
[  432.585065]  warn_slowpath_fmt+0x4a/0x50
[  432.585070]  ? _raw_spin_lock_irqsave+0x49/0x60
[  432.585072]  __free_irq+0x9f/0x280
[  432.585075]  free_irq+0x34/0x80
[  432.585089]  e1000_free_irq+0x65/0x70 [e1000e]
[  432.585098]  e1000e_pm_freeze+0x7a/0xb0 [e1000e]
[  432.585106]  e1000e_pm_suspend+0x21/0x30 [e1000e]
[  432.585113]  pci_pm_suspend+0x71/0x140
[  432.585118]  dpm_run_callback+0x6f/0x330
[  432.585122]  ? pci_pm_freeze+0xe0/0xe0
[  432.585125]  __device_suspend+0xea/0x330
[  432.585128]  async_suspend+0x1a/0x90
[  432.585132]  async_run_entry_fn+0x34/0x160
[  432.585137]  process_one_work+0x1f4/0x6d0
[  432.585140]  ? process_one_work+0x16e/0x6d0
[  432.585143]  worker_thread+0x49/0x4a0
[  432.585145]  kthread+0x107/0x140
[  432.585148]  ? process_one_work+0x6d0/0x6d0
[  432.585150]  ? kthread_create_on_node+0x40/0x40
[  432.585154]  ret_from_fork+0x2e/0x40
[  432.585156] ---[ end trace 6712df7f8c4b9124 ]---

The unwind failures stems from commit 2800209994f8 ("e1000e: Refactor PM
flows"), but it may be a later patch that introduced the non-recoverable
behaviour.

Fixes: 2800209994f8 ("e1000e: Refactor PM flows")
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=99847
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 27069012
(cherry picked from commit 833521ebc65b1c3092e5c0d8a97092f98eec595d)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>

qla2xxx: Fix system crash in qlt_plogi_ack_unref

Orabug: 27235104

Fix system crash due to NULL pointer access.

qlt_plogi_ack_t and fc_port structures were not properly
bound before calling qlt_plogi_ack_unref().

RIP: 0010:qlt_plogi_ack_unref+0xa1/0x150 [qla2xxx]
Call Trace:
qla24xx_create_new_sess+0xb1/0x320 [qla2xxx]
qla2x00_do_work+0x123/0x260 [qla2xxx]
qla2x00_iocb_work_fn+0x30/0x40 [qla2xxx]
process_one_work+0x1f3/0x530
worker_thread+0x4e/0x480
kthread+0x10c/0x140

Fixes: 726b85487067d ("qla2xxx: Add framework for async fabric discovery")
Cc: <stable@vger.kernel.org> # 4.10+
Signed-off-by: Quinn Tran <quinn.tran@cavium.com>
Signed-off-by: Giridhar Malavali <giridhar.malavali@cavium.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
[ Upstream commit 19759033e0d0beed70421ab9258f5ede79e070ae ]
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

qla2xxx: Remove aborting ELS IOCB call issued as part of timeout.

Orabug: 27235104

This fix the spinlock recursion issue seen while unloading the driver.

14 [ffff9f2e21e03db8] native_queued_spin_lock_slowpath at ffffffffad0d8802
15 [ffff9f2e21e03dc0] do_raw_spin_lock at ffffffffad0d99e4
16 [ffff9f2e21e03dd8] _raw_spin_lock_irqsave at ffffffffad652471
17 [ffff9f2e21e03e00] qla2x00_els_dcmd_iocb_timeout at ffffffffc070cd63
18 [ffff9f2e21e03e40] qla2x00_sp_timeout at ffffffffc06f06d3 [qla2xxx]
19 [ffff9f2e21e03e68] call_timer_fn at ffffffffad0f97d8
20 [ffff9f2e21e03ed8] run_timer_softirq at ffffffffad0faf47
21 [ffff9f2e21e03f68] __softirqentry_text_start at ffffffffad655f32

Fixes: 6eb54715b54bb ("qla2xxx: Added interface to send explicit LOGO.")
Cc: <stable@vger.kernel.org> # 4.10+
Signed-off-by: Giridhar Malavali <giridhar.malavali@cavium.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
[ Upstream commit bf07ef86e882013522876f7c834c8eea085f35b4 ]
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

qla2xxx: Defer processing of GS IOCB calls

Orabug: 27235104

This patch defers processing of GS IOCB calls from interrupt
context to avoid hardware spinlock recursion.

Following stack trace is seen

? mod_timer+0x193/0x330
? ql_dbg+0xa7/0xf0 [qla2xxx]
_raw_spin_lock_irqsave+0x31/0x40
qla2x00_start_sp+0x3b/0x250 [qla2xxx]
qla24xx_async_gnl+0x1d3/0x240 [qla2xxx]
qla24xx_fcport_handle_login+0x285/0x290 [qla2xxx]
? vprintk_func+0x20/0x50

Fixes: 726b85487067d ("qla2xxx: Add framework for async fabric discovery")
Cc: <stable@vger.kernel.org> # 4.10+
Signed-off-by: Giridhar Malavali <giridhar.malavali@cavium.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
[ Upstream commit 5d3300a9b8b122b4743aed5a178bf12c87e2b8c9 ]
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

qla2xxx: Clear loop id after delete

Orabug: 27235104

clear loop id after delete to prevent session invalidation
of stale session.

Fixes: 726b85487067d ("qla2xxx: Add framework for async fabric discovery")
Cc: <stable@vger.kernel.org> # 4.10+
Signed-off-by: Quinn Tran <quinn.tran@cavium.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
[ Upstream commit ba743f9148e951abe1c94f89c174ec8e44fb145b ]
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

qla2xxx: Fix scan state field for fcport

Orabug: 27235104

Add correct value of scan_state field indicating state
of the FC port

Fixes: 726b85487067d ("qla2xxx: Add framework for async fabric discovery")
Cc: <stable@vger.kernel.org> # 4.10+
Signed-off-by: Quinn Tran <quinn.tran@cavium.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
[ Upstream commit 76f9a2dd4c60183879a1898bcd56a1dbab19a85d ]
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

qla2xxx: Replace fcport alloc with qla2x00_alloc_fcport

Orabug: 27235104

Current code manually allocate an fcport structure that
is not properly initialize. Replace kzalloc with
qla2x00_alloc_fcport, so that all fields are initialized.

Cc: <stable@vger.kernel.org>
Signed-off-by: Quinn Tran <quinn.tran@cavium.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
[ Upstream commit 063b36d6b0ad74c748d536f5cb47bac2f850a0fa ]
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

qla2xxx: Fix abort command deadlock due to spinlock

Orabug: 27235104

Original code acquires hardware_lock to add Abort IOCB
onto driver request queue for processing. However,
abort_command() will also acquire hardware lock to look up
sp pointer before issuing abort IOCB command resulting
into a deadlock. This patch safely removes the possible
deadlock scenario by removing extra spinlock.

Fixes: 6eb54715b54bb ("qla2xxx: Added interface to send explicit LOGO.")
Cc: <stable@vger.kernel.org> # 4.10+
Signed-off-by: Quinn Tran <quinn.tran@cavium.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
[ Upstream commit b0dcce746b32ac573343ad39cb3dc485030de95e ]
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

qla2xxx: Fix PRLI state check

Orabug: 27235104

Get Port Database MBX cmd is to validate current Login state upon
PRLI completion. Current code looks at the last login state for
re-validation which was incorrect. This patch removed incorrect
state check.

Fixes: 15f30a5752287 ("qla2xxx: Use IOCB interface to submit non-critical MBX.")
Cc: <stable@vger.kernel.org> # 4.10+
Signed-off-by: Quinn Tran <quinn.tran@cavium.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
[ Upstream commit 23c645595dab7b414f23639d0a428a07515807df ]
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

qla2xxx: Clear send ELS LOGO flag after target re-login

Orabug: 27235104

This patch fixes clearing out els_send_logo flag at the
time of session deletion.

Fixes: 3515832cc614 ("scsi: qla2xxx: Reset the logo flag, after target re-login.")
Cc: <stable@vger.kernel.org> # 4.10+
Signed-off-by: Quinn Tran <quinn.tran@cavium.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

qla2xxx: Fix Relogin being triggered too fast

Orabug: 27235104

Current driver design schedules relogin process via DPC thread
every 1 second. In a large fabric, this DPC thread tries to
schedule too many jobs and might get overloaded. As a result of
this processing of DPC thread, it can schedule relogin earlier
than 1 second.

Fixes: 726b85487067d ("qla2xxx: Add framework for async fabric discovery")
Cc: <stable@vger.kernel.org> # 4.10+
Signed-off-by: Quinn Tran <quinn.tran@cavium.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
[ Upstream commit 4005a995668b8fd58f4cf1460dd4cf63efa18363 ]
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

qla2xxx: Relogin to target port on a cable swap

Orabug: 27235104

If user swaps one target port for another target port for same
switch port, the new target port is not being recognized by the
driver. Current code assumes that old Target port has recovered
from link down. The fix will ask switch what is the WWPN of a
specific NportID (GPNID) rather than assuming it's the same Target
port which has came back.

Fixes: 726b85487067d ("qla2xxx: Add framework for async fabric discovery")
Cc: <stable@vger.kernel.org> # 4.10+
Signed-off-by: Quinn Tran <quinn.tran@cavium.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>'
[ Upstream commit 5ef696aa9f3ccf999552d924c4e21a348f2bbea9 ]
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

qla2xxx: Recheck session state after RSCN.

Orabug: 27235104

when RSCN is delivered for specific remote port,
the switch say the remote port is still up and
current state of the remote port/session is good,
don't trust the state. Instead use ADISC to very
the session is still valid or not.

Signed-off-by: Quinn Tran <quinn.tran@cavium.com>
Signed-off-by: Sawan Chandak <sawan.chandak@cavium.com>
[ Upstream commit a07fc0a42e9ae76e93235f59b089986dc1b751c8 ]
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

qla2xxx: Fix login state machine stuck at GPDB

Orabug: 27235104

This patch returns discovery state machine back to
Login Complete.

Fixes: 726b85487067d ("qla2xxx: Add framework for async fabric discovery")
Cc: <stable@vger.kernel.org> # 4.10+
Signed-off-by: Quinn Tran <quinn.tran@cavium.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
[ Upstream commit 414d9ff3f8039f85d23f619dcbbd1ba2628a1a67 ]
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

qla2xxx: Serialize GPNID for multiple RSCN

Orabug: 27235104

GPNID is triggered by RSCN. For multiple RSCNs of the same
affected NPORT ID, serialize the GPNID to prevent confusion.

Fixes: 726b85487067d ("qla2xxx: Add framework for async fabric discovery")
Cc: <stable@vger.kernel.org> # 4.10+
Signed-off-by: Quinn Tran <quinn.tran@cavium.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
[ Upstream commit 2d73ac6102d943c4be4945735a338005359c6abc ]
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

qla2xxx: fix stale memory access.

Orabug: 27235104

Name pointer describing each command is assigned with stack frame's memory.
The stack frame could eventually be re-use, where name pointer
access can get get garbage. To fix the problem, use designated
static memory for name pointer.

Signed-off-by: Quinn Tran <quinn.tran@cavium.com>
Signed-off-by: Sawan Chandak <sawan.chandak@cavium.com>
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

qla2xxx: Retry switch command on time out

Orabug: 27235104

Retry GID_PN & GPN_ID switch commands for time out case.

Fixes: 726b85487067d ("qla2xxx: Add framework for async fabric discovery")
Cc: <stable@vger.kernel.org> # 4.10+
Signed-off-by: Quinn Tran <quinn.tran@cavium.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
[ Upstream commit 25ad76b703d9ad536f3411b15b1070aeb059ab55 ]
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

qla2xxx: Fix system crash for Notify ack timeout handling

Orabug: 27235104

Fix NULL pointer crash due to missing timeout handling callback
for Notify Ack IOCB.

Fixes: 726b85487067d ("qla2xxx: Add framework for async fabric discovery")
Cc: <stable@vger.kernel.org> # 4.10+
Signed-off-by: Quinn Tran <quinn.tran@cavium.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
[ Upstream commit 2e01d0ba868ec1d4d55ddcba519339e072b0bf4d ]
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

qla2xxx: Fix re-login for Nport Handle in use

Orabug: 27235104

When NPort Handle is in use, driver needs to mark the handle
as used and pick another. Instead, the code clears the handle
and re-pick the same handle.

Fixes: 726b85487067d ("qla2xxx: Add framework for async fabric discovery")
Cc: <stable@vger.kernel.org> # 4.10+
Signed-off-by: Quinn Tran <quinn.tran@cavium.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
[ Upstream commit a084fd68e1d26174c4cc1a13fbb0112f468ff7f4 ]
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

scsi: qla2xxx: Cleanup debug message IDs

Orabug: 27235104

Assign unique id to all traces and logs for debug purpose.

Signed-off-by: Quinn Tran <quinn.tran@cavium.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
[ Upstream commit 83548fe2fcbb78a233e8156feff4e167f1d0831e ]
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

scsi: qla2xxx: Fix name server relogin

Orabug: 27235104

Name server login is normally handle by FW. In some rare case where one
of the switches is being updated, name server login could get
affected. Trigger relogin to name server when driver detects this
condition.

Signed-off-by: Quinn Tran <quinn.tran@cavium.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
[ Upstream commit b98ae0d748dbc80016c5cc2e926f33648d83353d ]
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

scsi: qla2xxx: Fix path recovery

Orabug: 27235104

If the port is moved/changed, current code would trigger
a deletion. If the port is already deleted, then do relogin.

Signed-off-by: Quinn Tran <quinn.tran@cavium.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
[ Upstream commit 66ee0fece9a0b91f59a7eb8dd39808a0ef5db02b ]
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

scsi: qla2xxx: Fix an integer overflow in sysfs code

Orabug: 27235104

The value of "size" comes from the user.  When we add "start + size" it
could lead to an integer overflow bug.

It means we vmalloc() a lot more memory than we had intended.  I believe
that on 64 bit systems vmalloc() can succeed even if we ask it to
allocate huge 4GB buffers.  So we would get memory corruption and likely
a crash when we call ha->isp_ops->write_optrom() and ->read_optrom().

Only root can trigger this bug.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=194061
Cc: <stable@vger.kernel.org>
Fixes: b7cc176c9eb3 ("[SCSI] qla2xxx: Allow region-based flash-part accesses.")
Reported-by: shqking <shqking@gmail.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
[ Upstream commit e6f77540c067b48dee10f1e33678415bfcc89017 ]
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

scsi: qla2xxx: don't disable a not previously enabled PCI device

Orabug: 27235104

commit ddff7ed45edce4a4c92949d3c61cd25d229c4a14 upstream.

When pci_enable_device() or pci_enable_device_mem() fail in
qla2x00_probe_one() we bail out but do a call to
pci_disable_device(). This causes the dev_WARN_ON() in
pci_disable_device() to trigger, as the device wasn't enabled
previously.

So instead of taking the 'probe_out' error path we can directly return
*iff* one of the pci_enable_device() calls fails.

Additionally rename the 'probe_out' goto label's name to the more
descriptive 'disable_device'.

Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Fixes: e315cd28b9ef ("[SCSI] qla2xxx: Code changes for qla data structure refactoring")
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Giridhar Malavali <giridhar.malavali@cavium.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit ddff7ed45edce4a4c92949d3c61cd25d229c4a14 ]
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

ALSA: pcm: prevent UAF in snd_pcm_info

When the device descriptor is closed, the `substream->runtime` pointer
is freed. But another thread may be in the ioctl handler, case
SNDRV_CTL_IOCTL_PCM_INFO. This case calls snd_pcm_info_user() which
calls snd_pcm_info() which accesses the now freed `substream->runtime`.

Note: this fixes CVE-2017-0861

Signed-off-by: Robb Glasser <rglasser@google.com>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
(cherry picked from commit 362bca57f5d78220f8b5907b875961af9436e229)

Orabug: 27344839
CVE: CVE-2017-0861

Signed-off-by: Tim Tianyang Chen <tianyang.chen@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

kernel-uek.spec: update linux-firmware and linux-nano-firmware dependency

Orabug: 27185358

Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@oracle.com>

x86, kasan: Fix build failure on KASAN=y && KMEMCHECK=y kernels

Declaration of memcpy() is hidden under #ifndef CONFIG_KMEMCHECK.
In asm/efi.h under #ifdef CONFIG_KASAN we #undef memcpy(), due to
which the following happens:

  In file included from arch/x86/kernel/setup.c:96:0:
  ./arch/x86/include/asm/desc.h: In function â\80\98native_write_idt_entryâ\80\99:
  ./arch/x86/include/asm/desc.h:122:2: error: implicit declaration of function â\80\98memcpyâ\80\99 [-Werror=implicit-function-declaration]   memcpy(&idt[entry], gate, sizeof(*gate));
    ^
    cc1: some warnings being treated as errors
    make[2]: *** [arch/x86/kernel/setup.o] Error 1

We will get rid of that #undef in asm/efi.h eventually.
But in the meanwhile move memcpy() declaration out of #ifdefs
to fix the build.

Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1444994933-28328-1-git-send-email-aryabinin@virtuozzo.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit a75ca545e8d57473da47ece828ad98a10727ec6f)

Orabug: 27132235

Signed-off-by: Gayatri Vasudevan <gayatri.vasudevan@oracle.com>
Reviewed-by: John Haxby <john.haxby@oracle.com>
Acked-by: HÃ¥kon Bugge <haakon.bugge@oracle.com>
Acked-by: Girish Moodalbail <girish.moodalbail@oracle.com>

x86, efi, kasan: Fix build failure on !KASAN && KMEMCHECK=y kernels

With KMEMCHECK=y, KASAN=n we get this build failure:

  arch/x86/platform/efi/efi.c:673:3:    error: implicit declaration of function â\80\98memcpyâ\80\99 [-Werror=implicit-function-declaration]
  arch/x86/platform/efi/efi_64.c:139:2: error: implicit declaration of function â\80\98memcpyâ\80\99 [-Werror=implicit-function-declaration]
  arch/x86/include/asm/desc.h:121:2:    error: implicit declaration of function â\80\98memcpyâ\80\99 [-Werror=implicit-function-declaration]

Don't #undef memcpy if KASAN=n.

Reported-by: Ingo Molnar <mingo@kernel.org>
Reported-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt.fleming@intel.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 769a8089c1fd ("x86, efi, kasan: #undef memset/memcpy/memmove per arch")
Link: http://lkml.kernel.org/r/1443544814-20122-1-git-send-email-ryabinin.a.a@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 4ac86a6dcec1c3878de9747bf5a2aa4455be69e3)

Orabug: 27132235

Signed-off-by: Gayatri Vasudevan <gayatri.vasudevan@oracle.com>
Reviewed-by: John Haxby <john.haxby@oracle.com>
Acked-by: HÃ¥kon Bugge <haakon.bugge@oracle.com>
Acked-by: Girish Moodalbail <girish.moodalbail@oracle.com>

x86, efi, kasan: #undef memset/memcpy/memmove per arch

In not-instrumented code KASAN replaces instrumented memset/memcpy/memmove
with not-instrumented analogues __memset/__memcpy/__memove.

However, on x86 the EFI stub is not linked with the kernel. It uses
not-instrumented mem*() functions from arch/x86/boot/compressed/string.c

So we don't replace them with __mem*() variants in EFI stub.

On ARM64 the EFI stub is linked with the kernel, so we should replace
mem*() functions with __mem*(), because the EFI stub runs before KASAN
sets up early shadow.

So let's move these #undef mem* into arch's asm/efi.h which is also
included by the EFI stub.

Also, this will fix the warning in 32-bit build reported by kbuild test
robot:

efi-stub-helper.c:599:2: warning: implicit declaration of function 'memcpy'

[akpm@linux-foundation.org: use 80 cols in comment]
Signed-off-by: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Reported-by: Fengguang Wu <fengguang.wu@gmail.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Matt Fleming <matt.fleming@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 769a8089c1fd2fe94c13e66fe6e03d7820953ee3)

Orabug: 27132235

Signed-off-by: Gayatri Vasudevan <gayatri.vasudevan@oracle.com>
Reviewed-by: John Haxby <john.haxby@oracle.com>
Acked-by: HÃ¥kon Bugge <haakon.bugge@oracle.com>
Acked-by: Girish Moodalbail <girish.moodalbail@oracle.com>

Revert "Makefile: Build with -Werror=date-time if the compiler supports it"

This reverts commit fe7c36c7bde1
("Makefile: Build with -Werror=date-time if the compiler supports it")

Revert this change because UEK 4.1 still uses __DATE__ and __TIME__ macro
and building UEK with GCC 4.9 or newer should not cause compilation errors.

Orabug: 27132235

Signed-off-by: Gayatri Vasudevan <gayatri.vasudevan@oracle.com>
Reviewed-by: John Haxby <john.haxby@oracle.com>
Acked-by: HÃ¥kon Bugge <haakon.bugge@oracle.com>
Acked-by: Girish Moodalbail <girish.moodalbail@oracle.com>

x86/efi: Initialize and display UEFI secure boot state a bit later during init

Otherwise Xen dom0 does not display "Secure boot enabled" message if it runs
on secure boot enabled platform. This happens because boot_params.secure_boot
is initialized too late. However, despite lack of message all features depending
on UEFI secure boot are enabled properly.

Orabug: 27258204

Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>

bnxt_en: Fix possible corrupted NVRAM parameters from firmware response.

In bnxt_find_nvram_item(), it is copying firmware response data after
releasing the mutex. This can cause the firmware response data
to be corrupted if the next firmware response overwrites the response
buffer. The rare problem shows up when running ethtool -i repeatedly.

Fix it by calling the new variant _hwrm_send_message_silent() that requires
the caller to take the mutex and to release it after the response data has
been copied.

Fixes: 3ebf6f0a09a2 ("bnxt_en: Add installed-package version reporting via Ethtool GDRVINFO")
Reported-by: Sarveswara Rao Mygapula <sarveswararao.mygapula@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit cc72f3b1feb4fd38d33ab7a013d5ab95041cb8ba)

Orabug: 27285190

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Brian Maly <brian.maly@oracle.com>

dtrace: do not use copy_from_user when accessing kernel stack

The implementation of sdt_getarg() for x86_64 uses a copy_from_user
variant while reading from kernel stack which is obviously wrong.
This commit corrects that.

Orabug: 25949088
Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com>
Reviewed-by: Tomas Jedlicka <tomas.jedlicka@oracle.com>

dtrace: fix arg5 and up retrieval for FBT entry probes on x86

When tracing function entry using FBT entry probes, access to all the
function arguments should be supported.  The existing code supported
access up to the 5th argument, but would yield incorrect results for
any argument beyond that.  On some kernels it could result in a crash
when the 6th (or higher) argument was being retrieved.

The reason for the problem lies in the fact that the first 5 arguments
could be read directly from the register set, whereas the 6th argument
and beyond needs to be retrieved from the stack.  The generic code
implementing retrieval of arguments turns out to be incorrect for FBT
entry probes.

Thi commit introduces a FBT-specific function to access function
arguments.

Orabug: 25949088
Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com>
Reviewed-by: Tomas Jedlicka <tomas.jedlicka@oracle.com>

x86/espfix: Init espfix on the boot CPU side

As we alloc pages with GFP_KERNEL in init_espfix_ap() which is
called before we enable local irqs, so the lockdep sub-system
would (correctly) trigger a warning about the potentially
blocking API.

So we allocate them on the boot CPU side when the secondary CPU is
brought up by the boot CPU, and hand them over to the secondary
CPU.

And we use alloc_pages_node() with the secondary CPU's node, to
make sure the espfix stack is NUMA-local to the CPU that is
going to use it.

Signed-off-by: Zhu Guihua <zhugh.fnst@cn.fujitsu.com>
Cc: <bp@alien8.de>
Cc: <luto@amacapital.net>
Cc: <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/c97add2670e9abebb90095369f0cfc172373ac94.1435824469.git.zhugh.fnst@cn.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Orabug: 26523661
(cherry picked from commit 20d5e4a9cd453991e2590a4c25230a99b42dee0c)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Tim Tianyang Chen <tianyang.chen@oracle.com>

x86/espfix: Add 'cpu' parameter to init_espfix_ap()

Add a CPU index parameter to init_espfix_ap(), so that the
parameter could be propagated to the function for espfix
page allocation.

Signed-off-by: Zhu Guihua <zhugh.fnst@cn.fujitsu.com>
Cc: <bp@alien8.de>
Cc: <luto@amacapital.net>
Cc: <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/cde3fcf1b3211f3f03feb1a995bce3fee850f0fc.1435824469.git.zhugh.fnst@cn.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Orabug: 26523661
(cherry picked from commit 1db875631f8d5bbf497f67e47f254eece888d51d)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Tim Tianyang Chen <tianyang.chen@oracle.com>

xen: Make PV Dom0 Linux kernel NUMA aware

Issues Xen hypercall subop XENMEM_get_vnumainfo and sets the
NUMA topology, otherwise sets dummy NUMA node and prevents
numa_init from calling other numa initializators.
Enables vNUMA for dom0 if numa kernel boot option does not
disable it.
It also requires Xen to have patches that support Dom0 NUMA
and xen boot option dom0_vcpus_pin=numa.

Dom0 NUMA topology with this patch applied and Xen booted with
"dom0_mem=max:6144M dom0_vcpus_pin=numa dom0_max_vcpus=20":

[root@localhost ~]# numactl --ha
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9
node 0 size: 2966 MB
node 0 free: 2175 MB
node 1 cpus: 10 11 12 13 14 15 16 17 18 19
node 1 size: 2880 MB
node 1 free: 2143 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

And lstopo output:

Machine (5847MB)
  NUMANode L#0 (P#0 2967MB)
    Socket L#0
      L3 L#0 (25MB) + L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
      L3 L#1 (25MB) + L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#1)
      L3 L#2 (25MB) + L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#2)
      L3 L#3 (25MB) + L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#3)
      L3 L#4 (25MB) + L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 (P#4)
      L3 L#5 (25MB) + L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 (P#5)
      L3 L#6 (25MB) + L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 (P#6)
      L3 L#7 (25MB) + L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 (P#7)
      L3 L#8 (25MB) + L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 (P#8)
      L3 L#9 (25MB) + L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 (P#9)
    HostBridge L#0
      PCIBridge
        PCI 1000:005d
          Block L#0 "sda"
          Block L#1 "sdb"
      PCIBridge
        PCI 8086:1528
          Net L#2 "eth0"
        PCI 8086:1528
          Net L#3 "eth1"
      PCIBridge
        PCI 102b:0522
      PCI 8086:8d02
  NUMANode L#1 (P#1 2880MB)
    Socket L#1
      L3 L#10 (25MB) + L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 + PU L#10 (P#10)
      L3 L#11 (25MB) + L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 + PU L#11 (P#11)
      L3 L#12 (25MB) + L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12 + PU L#12 (P#12)
      L3 L#13 (25MB) + L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13 + PU L#13 (P#13)
      L3 L#14 (25MB) + L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14 + PU L#14 (P#14)
      L3 L#15 (25MB) + L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15 + PU L#15 (P#15)
      L3 L#16 (25MB) + L2 L#16 (256KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16 + PU L#16 (P#16)
      L3 L#17 (25MB) + L2 L#17 (256KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17 + PU L#17 (P#17)
      L3 L#18 (25MB) + L2 L#18 (256KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18 + PU L#18 (P#18)
      L3 L#19 (25MB) + L2 L#19 (256KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19 + PU L#19 (P#19)
    HostBridge L#4
      PCIBridge
        PCI 8086:1528
          Net L#4 "eth2"
        PCI 8086:1528
          Net L#5 "eth3"
      PCIBridge
        PCI 8086:2701

OraBug: 26189217

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

ext4: fix off-by-one on max nr_pages in ext4_find_unwritten_pgoff()

Orabug: 27255674

ext4_find_unwritten_pgoff() is used to search for offset of hole or
data in page range [index, end] (both inclusive), and the max number
of pages to search should be at least one, if end == index.
Otherwise the only page is missed and no hole or data is found,
which is not correct.

When block size is smaller than page size, this can be demonstrated
by preallocating a file with size smaller than page size and writing
data to the last block. E.g. run this xfs_io command on a 1k block
size ext4 on x86_64 host.

  # xfs_io -fc "falloc 0 3k" -c "pwrite 2k 1k" \
       -c "seek -d 0" /mnt/ext4/testfile
  wrote 1024/1024 bytes at offset 2048
  1 KiB, 1 ops; 0.0000 sec (42.459 MiB/sec and 43478.2609 ops/sec)
  Whence  Result
  DATA    EOF

Data at offset 2k was missed, and lseek(2) returned ENXIO.

This is unconvered by generic/285 subtest 07 and 08 on ppc64 host,
where pagesize is 64k. Because a recent change to generic/285
reduced the preallocated file size to smaller than 64k.

Signed-off-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
(cherry picked from commit 624327f8794704c5066b11a52f9da6a09dce7f9a)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Reviewed-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>

DTrace: IO wait probes b_flags can contain incorrect operation

In the synchronous IO path, the xfs buffer flag value can change,
causing the IO provider io:::wait-start and io:::wait-done to report
an incorrect operation through the bufinfo_t b_flags field.

Orabug: 27193447

Signed-off-by: Nicolas Droux <nicolas.droux@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

KVM: x86: pvclock: Handle first-time write to pvclock-page contains random junk

When guest passes KVM it's pvclock-page GPA via WRMSR to
MSR_KVM_SYSTEM_TIME / MSR_KVM_SYSTEM_TIME_NEW, KVM don't initialize
pvclock-page to some start-values. It just requests a clock-update which
will happen before entering to guest.

The clock-update logic will call kvm_setup_pvclock_page() to update the
pvclock-page with info. However, kvm_setup_pvclock_page() *wrongly*
assumes that the version-field is initialized to an even number. This is
wrong because at first-time write, field could be any-value.

Fix simply makes sure that if first-time version-field is odd, increment
it once more to make it even and only then start standard logic.
This follows same logic as done in other pvclock shared-pages (See
kvm_write_wall_clock() and record_steal_time()).

Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
(cherry picked from commit 51c4b8bba674cfd2260d173602c4dac08e4c3a99)

Orabug: 27146591

Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

KVM: x86: always fill in vcpu->arch.hv_clock

We will use it in the next patches for KVM_GET_CLOCK and as a basis for the
contents of the Hyper-V TSC page. Get the values from the Linux
timekeeper even if kvmclock is not enabled.

Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit 0d6dd2ff8206dc1da3428d5b1611f6304d481dab)

Orabug: 27146591

Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

KVM: nVMX: Fix vmx_check_nested_events() return value in case an event was reinjected to L2

vmx_check_nested_events() should return -EBUSY only in case there is a
pending L1 event which requires a VMExit from L2 to L1 but such a
VMExit is currently blocked. Such VMExits are blocked either
because nested_run_pending=1 or an event was reinjected to L2.
vmx_check_nested_events() should return 0 in case there are no
pending L1 events which requires a VMExit from L2 to L1 or if
a VMExit from L2 to L1 was done internally.

However, upstream commit which introduced blocking in case an event was
reinjected to L2 (commit acc9ab601327 ("KVM: nVMX: Fix pending events
injection")) contains a bug: It returns -EBUSY even if there are no
pending L1 events which requires VMExit from L2 to L1.

This commit fix this issue.

Fixes: acc9ab601327 ("KVM: nVMX: Fix pending events injection")
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
(cherry picked from commit 917dc6068bc12a2dafffcf0e9d405ddb1b8780cb)
Orabug: 27200329
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Acked-by: Liran Alon <liran.alon@oracle.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

KVM: VMX: use kvm_event_needs_reinjection

Use kvm_event_needs_reinjection() encapsulation.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit 274bba52a01d6de01f03cfb1b80af2d35772e62e)
Orabug: 27200329
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Acked-by: Liran Alon <liran.alon@oracle.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

KVM: nVMX: Fix pending events injection

L2 fails to boot on a non-APICv box dues to 'commit 0ad3bed6c5ec
("kvm: nVMX: move nested events check to kvm_vcpu_running")'

KVM internal error. Suberror: 3
extra data[0]: 800000ef
extra data[1]: 1
RAX=0000000000000000 RBX=ffffffff81f36140 RCX=0000000000000000 RDX=0000000000000000
RSI=0000000000000000 RDI=0000000000000000 RBP=ffff88007c92fe90 RSP=ffff88007c92fe90
R8 =ffff88007fccdca0 R9 =0000000000000000 R10=00000000fffedb3d R11=0000000000000000
R12=0000000000000003 R13=0000000000000000 R14=0000000000000000 R15=ffff88007c92c000
RIP=ffffffff810645e6 RFL=00000246 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0000 0000000000000000 ffffffff 00c00000
CS =0010 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
SS =0000 0000000000000000 ffffffff 00c00000
DS =0000 0000000000000000 ffffffff 00c00000
FS =0000 0000000000000000 ffffffff 00c00000
GS =0000 ffff88007fcc0000 ffffffff 00c00000
LDT=0000 0000000000000000 ffffffff 00c00000
TR =0040 ffff88007fcd4200 00002087 00008b00 DPL=0 TSS64-busy
GDT= ffff88007fcc9000 0000007f
IDT= ffffffffff578000 00000fff
CR0=80050033 CR2=00000000ffffffff CR3=0000000001e0a000 CR4=003406e0
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000fffe0ff0 DR7=0000000000000400
EFER=0000000000000d01

We should try to reinject previous events if any before trying to inject
new event if pending. If vmexit is triggered by L2 guest and L0 interested
in, we should reinject IDT-vectoring info to L2 through vmcs02 if any,
otherwise, we can consider new IRQs/NMIs which can be injected and call
nested events callback to switch from L2 to L1 if needed and inject the
proper vmexit events. However, 'commit 0ad3bed6c5ec ("kvm: nVMX: move
nested events check to kvm_vcpu_running")' results in the handle events
order reversely on non-APICv box. This patch fixes it by bailing out for
pending events and not consider new events in this scenario.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Fixes: 0ad3bed6c5ec ("kvm: nVMX: move nested events check to kvm_vcpu_running")
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
(cherry picked from commit acc9ab60132739e0c5ab40644444cd7b22d12886)
Orabug: 27200329
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Acked-by: Liran Alon <liran.alon@oracle.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/time: do not decrease steal time after live migration on xen

After guest live migration on xen, steal time in /proc/stat
(cpustat[CPUTIME_STEAL]) might decrease because steal returned by
xen_steal_lock() might be less than this_rq()->prev_steal_time which is
derived from previous return value of xen_steal_clock().

For instance, steal time of each vcpu is 335 before live migration.

cpu  198 0 368 200064 1962 0 0 1340 0 0
cpu0 38 0 81 50063 492 0 0 335 0 0
cpu1 65 0 97 49763 634 0 0 335 0 0
cpu2 38 0 81 50098 462 0 0 335 0 0
cpu3 56 0 107 50138 374 0 0 335 0 0

After live migration, steal time is reduced to 312.

cpu  200 0 370 200330 1971 0 0 1248 0 0
cpu0 38 0 82 50123 500 0 0 312 0 0
cpu1 65 0 97 49832 634 0 0 312 0 0
cpu2 39 0 82 50167 462 0 0 312 0 0
cpu3 56 0 107 50207 374 0 0 312 0 0

Since runstate times are cumulative and cleared during xen live migration
by xen hypervisor, the idea of this patch is to accumulate runstate times
to global percpu variables before live migration suspend. Once guest VM is
resumed, xen_get_runstate_snapshot_cpu() would always return the sum of new
runstate times and previously accumulated times stored in global percpu
variables.

Comment above HYPERVISOR_suspend() has been removed as it is inaccurate:
the call can return an error code (e.g., possibly -EPERM in the future).

Similar and more severe issue would impact prior linux 4.8-4.10 as
discussed by Michael Las at
https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest,
which would overflow steal time and lead to 100% st usage in top command
for linux 4.8-4.10. A backport of this patch would fix that issue.

[boris: added linux/slab.h to driver/xen/time.c, slightly reformatted
        commit message]

References: https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Orabug: 27181243

(cherry picked from commit 5e25f5db6abb96ca8ee2aaedcb863daa6dfcc07a)
The original purpose of this patch is to not decrease steal time after live
migration on xen. We backport this patch to avoid xen steal clock overflow
issue which would lead to 100% st usage in top command as mentioned in
above commit message. UEK3 or UEK2 would not hit this issue.

Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>

bnx2x: fix slowpath null crash

Backport: upstream 442866ff9743d51957685cecaa722a7fd47b02e2

When "NETDEV WATCHDOG: em4 (bnx2x): transmit queue 2 timed out" occurs,
BNX2X_SP_RTNL_TX_TIMEOUT is set. In the function bnx2x_sp_rtnl_task,
bnx2x_nic_unload and bnx2x_nic_load are executed to shutdown and open
NIC. In the function bnx2x_nic_load, bnx2x_alloc_mem fails to allocate
dma memory. The message "bnx2x: [bnx2x_alloc_mem:8399(em4)]Can't
allocate memory" pops out. The variable slowpath is set to NULL.
When shutdown the NIC, the function bnx2x_nic_unload is called. In
the function bnx2x_nic_unload, the following functions are executed.
bnx2x_chip_cleanup
    bnx2x_set_storm_rx_mode
        bnx2x_set_q_rx_mode
            bnx2x_set_q_rx_mode
                bnx2x_config_rx_mode
                    bnx2x_set_rx_mode_e2
In the function bnx2x_set_rx_mode_e2, the variable slowpath is operated.
Then the crash occurs.
To fix this crash, the variable slowpath is checked. And in the function
bnx2x_sp_rtnl_task, after dma memory allocation fails, another shutdown
and open NIC is executed.

Orabug: 27041078

[Correct the typos in the short log and comments]

CC: Joe Jin <joe.jin@oracle.com>
CC: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Acked-by: Ariel Elior <aelior@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>

Replace max_t() with sub_positive() in dequeue_entity_load_avg()

Orabug: 27026563

This patch replaces max_t() with sub_positive() to prevent intermediate
values getting stored in cfs_rq->runnable_load_avg/runnable_load_sum.
This patch has been partially cherry picked from commit c7b50216818e
("sched/fair: Remove se->load.weight from se->avg.load_sum")

Signed-off-by: Gayatri Vasudevan <gayatri.vasudevan@oracle.com>
Reviewed-by: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>

sched/fair: Fix cfs_rq avg tracking underflow

As per commit:

  b7fa30c9cc48 ("sched/fair: Fix post_init_entity_util_avg() serialization")

> the code generated from update_cfs_rq_load_avg():
>
> if (atomic_long_read(&cfs_rq->removed_load_avg)) {
> s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
> sa->load_avg = max_t(long, sa->load_avg - r, 0);
> sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
> removed_load = 1;
> }
>
> turns into:
>
> ffffffff81087064:       49 8b 85 98 00 00 00    mov    0x98(%r13),%rax
> ffffffff8108706b:       48 85 c0                test   %rax,%rax
> ffffffff8108706e:       74 40                   je     ffffffff810870b0 <update_blocked_averages+0xc0>
> ffffffff81087070:       4c 89 f8                mov    %r15,%rax
> ffffffff81087073:       49 87 85 98 00 00 00    xchg   %rax,0x98(%r13)
> ffffffff8108707a:       49 29 45 70             sub    %rax,0x70(%r13)
> ffffffff8108707e:       4c 89 f9                mov    %r15,%rcx
> ffffffff81087081:       bb 01 00 00 00          mov    $0x1,%ebx
> ffffffff81087086:       49 83 7d 70 00          cmpq   $0x0,0x70(%r13)
> ffffffff8108708b:       49 0f 49 4d 70          cmovns 0x70(%r13),%rcx
>
> Which you'll note ends up with sa->load_avg -= r in memory at
> ffffffff8108707a.

So I _should_ have looked at other unserialized users of ->load_avg,
but alas. Luckily nikbor reported a similar /0 from task_h_load() which
instantly triggered recollection of this here problem.

Aside from the intermediate value hitting memory and causing problems,
there's another problem: the underflow detection relies on the signed
bit. This reduces the effective width of the variables, IOW its
effectively the same as having these variables be of signed type.

This patch changes to a different means of unsigned underflow
detection to not rely on the signed bit. This allows the variables to
use the 'full' unsigned range. And it does so with explicit LOAD -
STORE to ensure any intermediate value will never be visible in
memory, allowing these unserialized loads.

Note: GCC generates crap code for this, might warrant a look later.

Note2: I say 'full' above, if we end up at U*_MAX we'll still explode;
       maybe we should do clamping on add too.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yuyang Du <yuyang.du@intel.com>
Cc: bsegall@google.com
Cc: kernel@kyup.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: steve.muckle@linaro.org
Fixes: 9d89c257dfb9 ("sched/fair: Rewrite runnable load and utilization average tracking")
Link: http://lkml.kernel.org/r/20160617091948.GJ30927@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 8974189222159154c55f24ddad33e3613960521a)

Conflicts:

kernel/sched/fair.c

    Orabug: 27026563

Signed-off-by: Gayatri Vasudevan <gayatri.vasudevan@oracle.com>
Reviewed-by: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>

rds: System panic if RDS netfilter is enabled and RDS/TCP is used

If a transport does not support netfilter, its inc_to_skb is NULL.
The code needs to make sure that it is not NULL before calling it.

Orabug: 26950401

Signed-off-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>

fuse: Call end_queued_requests() after releasing fc->lock in fuse_dev_release()

Orabug: 27215268

During fs teardown,the following self deadlock can happen.

[<ffffffff8173891a>] _raw_spin_lock+0x2a/0x60
[<ffffffffa08a7ef1>] ? fuse_abort_conn+0x31/0x270 [fuse]
[<ffffffffa08763c0>] ? cuse_read_iter+0x70/0x70 [cuse]
[<ffffffffa0876414>] cuse_process_init_reply+0x54/0x490 [cuse]
[<ffffffffa08763c0>] ? cuse_read_iter+0x70/0x70 [cuse]
[<ffffffffa08a5bbf>] request_end+0xbf/0x170 [fuse]
[<ffffffffa08a7d16>] end_queued_requests.isra.19+0x86/0x160 [fuse]
[<ffffffffa08a7e8f>] fuse_dev_release+0x9f/0xd0 [fuse]
[<ffffffffa087611a>] cuse_channel_release+0x8a/0xa0 [cuse]
[<ffffffff81214cf4>] __fput+0xe4/0x220
[<ffffffff81214e7e>] ____fput+0xe/0x10
[<ffffffff810a5557>] task_work_run+0xb7/0xf0
[<ffffffff81017c6d>] do_notify_resume+0x8d/0xa0
[<ffffffff81738dbc>] int_signal+0x12/0x17

The deadlock happens when an attempt is made to take fc->lock in
fuse_abort_conn(). The same lock has already been taken in
fuse_dev_release() in the stack.

This flow is initiated from cuse_process_init_reply() in case of an error
and so, it is a very rare scenario, but it can lockup the fuse fs.

Fix this by releasing the spin lock before calling end_queued_requests()
instead of after, since it does not make a difference anyway.

Signed-off-by: Ashish Samant <ashish.samant@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

rds: IB active bonding IPv6 changes

For IPv6, each interface can be associated with many addresses.  This
is handled by having an array of struct rds_ib_port_addr in each
struct rds_ib_port.  This new struct is only used for storing an IPv6
address to minimize code changes and is separate from struct
rds_ib_alias.  Note that address alias is obsolete and does not apply
to IPv6 address.  All the failover event handling code stays the same
except that IPv6 addresses are also moved.

Orabug: 25410192

Signed-off-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>

{IB/{core,ipoib},net/rds}: IPv6 support for ACL

The IB ACL components are extended to support IPv6 address.  Some
of the ACL ioctls use a 32 bit integer to represent an IP address.  To
support IPv6, struct in6_addr needs to be used. To ensure backward
compatibility, a new ioctl will be introduced for each of those ioctls
that take a 32 bit integer as address. The original ioctls are kept
and can still be used. The new ioctls can take IPv4 mapped IPv6
address so new apps only need to use the new ioctls.

The IPOIBACLNGET and IPOIBACLNADD commands re-use the same
ipoib_ioctl_req_data, except that the ips field should actually be a
pointer to a list of struct in6_addr.  Here we assume that the pointer
size to an u32 and in6_addr are the same.

Orabug: 25410192

Signed-off-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>

rds: Enable RDS IPv6 support

This patch enables RDS to use IPv6 addresses. There are many data
structures (RDS socket options) used by RDS apps which use a 32 bit
integer to store IP address. To support IPv6, struct in6_addr needs to
be used. To ensure backward compatibility, a new data structure is
introduced for each of those data structures which use a 32 bit
integer to represent an IP address. And new socket options are
introduced to use those new structures. This means that existing apps
should work without a problem with the new RDS module. For apps which
want to use IPv6, those new data structures and socket options can be
used. IPv4 mapped address is used to represent IPv4 address in the new
data structures.

RDS/RDMA/IB uses a private data (struct rds_ib_connect_private)
exchange between endpoints at RDS connection establishment time to
support RDMA. This private data exchange uses a 32 bit integer to
represent an IP address. This needs to be changed in order to support
IPv6. A new private data struct rds6_ib_connect_private is introduced
to handle this. To ensure backward compatibility, an IPv6 capable RDS
stack uses another RDMA listener port (RDS_CM_PORT which is 16385,
the same value as the RDS/TCP listener port number) to accept IPv6
connection. And it continues to use the original RDS_PORT for IPv4
RDS connections. When it needs to communicate with an IPv6 peer, it
uses the RDS_CM_PORT to send the connection set up request.

Orabug: 25410192

Signed-off-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>

rds: Changed IP address internal representation to struct in6_addr

This patch changed the internal representation of an IP address to use
struct in6_addr.  IPv4 address is stored as an IPv4 mapped address.
All the functions which take an IP address as argument are also
changed to use struct in6_addr.  But RDS socket layer is not modified
such that it still does not accept IPv6 address from an application.
And RDS layer does not accept nor initiate IPv6 connections.

The RDS netfilter header is changed.  User of RDS netfilter will need
to be re-compiled.

Orabug: 25410192

Signed-off-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>

IB/ipoib: Remove ACL sysfs debug files

Currently, there are device attributes like add_acl which are writable and
can be used to add ACL in addition to using ioctl. The kernel needs to
parse a string from user space. While parsing IPv4 address is simple,
parsing IPv6 address is not. And it is error prone and can be a security
issue if not done right. Since those attributes are only used in a debug
kernel, it is advisable to remove those attributes instead of adding IPv6
support to them. The following attributes, add_acl, delete_acl, acl,
acl_instance, add_acl_instance and delete_acl_instance are removed. The
last four do not handle IP address. But to be consistent, they are also
removed.

Orabug: 25410192

Signed-off-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>

rds: C-style nits

Replace the use of NIPQUAD() with the proper format string and fix
checkpatch warnings.

Fix Copyright format and year.

Orabug: 25410192

Signed-off-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>

rds: ib: Fix NULL pointer dereference in debug code

rds_ib_recv_refill() is a function that refills an IB receive
queue. It can be called from both the CQE handler (tasklet) and a
worker thread.

Just after the call to ib_post_recv(), a debug message is printed with
rdsdebug():

            ret = ib_post_recv(ic->i_cm_id->qp, &recv->r_wr, &failed_wr);
            rdsdebug("recv %p ibinc %p page %p addr %lu ret %d\n", recv,
                     recv->r_ibinc, sg_page(&recv->r_frag->f_sg),
                     (long) ib_sg_dma_address(
                            ic->i_cm_id->device,
                            &recv->r_frag->f_sg),
                    ret);

Now consider an invocation of rds_ib_recv_refill() from the worker
thread, which is preemptible. Further, assume that the worker thread
is preempted between the ib_post_recv() and rdsdebug() statements.

Then, if the preemption is due to a receive CQE event, the
rds_ib_recv_cqe_handler() will be invoked. This function processes
receive completions, including freeing up data structures, such as the
recv->r_frag.

In this scenario, rds_ib_recv_cqe_handler() will process the receive
WR posted above. That implies, that the recv->r_frag has been freed
before the above rdsdebug() statement has been executed. When it is
later executed, we will have a NULL pointer dereference:

[ 4088.068008] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
[ 4088.076754] IP: rds_ib_recv_refill+0x87/0x620 [rds_rdma]
[ 4088.082686] PGD 0 P4D 0
[ 4088.085515] Oops: 0000 [#1] SMP
[ 4088.089015] Modules linked in: rds_rdma(OE) rds(OE) rpcsec_gss_krb5(E) nfsv4(E) dns_resolver(E) nfs(E) fscache(E) mlx4_ib(E) ib_ipoib(E) rdma_ucm(E) ib_ucm(E) ib_uverbs(E) ib_umad(E) rdma_cm(E) ib_cm(E) iw_cm(E) ib_core(E) binfmt_misc(E) sb_edac(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) pcbc(E) aesni_intel(E) crypto_simd(E) iTCO_wdt(E) glue_helper(E) iTCO_vendor_support(E) sg(E) cryptd(E) pcspkr(E) ipmi_si(E) ipmi_devintf(E) ipmi_msghandler(E) shpchp(E) ioatdma(E) i2c_i801(E) wmi(E) lpc_ich(E) mei_me(E) mei(E) mfd_core(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) sunrpc(E) ip_tables(E) ext4(E) mbcache(E) jbd2(E) fscrypto(E) mgag200(E) i2c_algo_bit(E) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) sysimgblt(E)
[ 4088.168486]  fb_sys_fops(E) ahci(E) ixgbe(E) libahci(E) ttm(E) mdio(E) ptp(E) pps_core(E) drm(E) sd_mod(E) libata(E) crc32c_intel(E) mlx4_core(E) i2c_core(E) dca(E) megaraid_sas(E) dm_mirror(E) dm_region_hash(E) dm_log(E) dm_mod(E) [last unloaded: rds]
[ 4088.193442] CPU: 20 PID: 1244 Comm: kworker/20:2 Tainted: G           OE   4.14.0-rc7.master.20171105.ol7.x86_64 #1
[ 4088.205097] Hardware name: Oracle Corporation ORACLE SERVER X5-2L/ASM,MOBO TRAY,2U, BIOS 31110000 03/03/2017
[ 4088.216074] Workqueue: ib_cm cm_work_handler [ib_cm]
[ 4088.221614] task: ffff885fa11d0000 task.stack: ffffc9000e598000
[ 4088.228224] RIP: 0010:rds_ib_recv_refill+0x87/0x620 [rds_rdma]
[ 4088.234736] RSP: 0018:ffffc9000e59bb68 EFLAGS: 00010286
[ 4088.240568] RAX: 0000000000000000 RBX: ffffc9002115d050 RCX: ffffc9002115d050
[ 4088.248535] RDX: ffffffffa0521380 RSI: ffffffffa0522158 RDI: ffffffffa0525580
[ 4088.256498] RBP: ffffc9000e59bbf8 R08: 0000000000000005 R09: 0000000000000000
[ 4088.264465] R10: 0000000000000339 R11: 0000000000000001 R12: 0000000000000000
[ 4088.272433] R13: ffff885f8c9d8000 R14: ffffffff81a0a060 R15: ffff884676268000
[ 4088.280397] FS:  0000000000000000(0000) GS:ffff885fbec80000(0000) knlGS:0000000000000000
[ 4088.289434] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4088.295846] CR2: 0000000000000020 CR3: 0000000001e09005 CR4: 00000000001606e0
[ 4088.303816] Call Trace:
[ 4088.306557]  rds_ib_cm_connect_complete+0xe0/0x220 [rds_rdma]
[ 4088.312982]  ? __dynamic_pr_debug+0x8c/0xb0
[ 4088.317664]  ? __queue_work+0x142/0x3c0
[ 4088.321944]  rds_rdma_cm_event_handler+0x19e/0x250 [rds_rdma]
[ 4088.328370]  cma_ib_handler+0xcd/0x280 [rdma_cm]
[ 4088.333522]  cm_process_work+0x25/0x120 [ib_cm]
[ 4088.338580]  cm_work_handler+0xd6b/0x17aa [ib_cm]
[ 4088.343832]  process_one_work+0x149/0x360
[ 4088.348307]  worker_thread+0x4d/0x3e0
[ 4088.352397]  kthread+0x109/0x140
[ 4088.355996]  ? rescuer_thread+0x380/0x380
[ 4088.360467]  ? kthread_park+0x60/0x60
[ 4088.364563]  ret_from_fork+0x25/0x30
[ 4088.368548] Code: 48 89 45 90 48 89 45 98 eb 4d 0f 1f 44 00 00 48 8b 43 08 48 89 d9 48 c7 c2 80 13 52 a0 48 c7 c6 58 21 52 a0 48 c7 c7 80 55 52 a0 <4c> 8b 48 20 44 89 64 24 08 48 8b 40 30 49 83 e1 fc 48 89 04 24
[ 4088.389612] RIP: rds_ib_recv_refill+0x87/0x620 [rds_rdma] RSP: ffffc9000e59bb68
[ 4088.397772] CR2: 0000000000000020
[ 4088.401505] ---[ end trace fe922e6ccf004431 ]---

This bug was provoked by compiling rds out-of-tree with
EXTRA_CFLAGS="-DRDS_DEBUG -DDEBUG" and inserting an artificial delay
between the rdsdebug() and ib_ib_port_recv() statements:

           /* XXX when can this fail? */
       ret = ib_post_recv(ic->i_cm_id->qp, &recv->r_wr, &failed_wr);
+ if (can_wait)
+ usleep_range(1000, 5000);
       rdsdebug("recv %p ibinc %p page %p addr %lu ret %d\n", recv,
recv->r_ibinc, sg_page(&recv->r_frag->f_sg),
(long) ib_sg_dma_address(

The fix is simply to move the rdsdebug() statement up before the
ib_post_recv() and remove the printing of ret, which is taken care of
anyway by the non-debug code.

Orabug: 24303333

Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Knut Omang <knut.omang@oracle.com>
Reviewed-by: Wei Lin Guay <wei.lin.guay@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry-picked from upstream 1cb483a5cc84b497afb51a6c5dfb5a38a0b67086)

Conflicts:
net/rds/ib_recv.c

Reviewed-by: Avinash Repaka <avinash.repaka@oracle.com>
Conflicts:
net/rds/ib_recv.c

Rebased on v4.1.12-117.0.20171115_0000. Conflict with 5f58d7e81c2f
("net/rds: reduce memory footprint during ib_post_recv in IB
transport")

USB: serial: console: fix use-after-free after failed setup

Make sure to reset the USB-console port pointer when console setup fails
in order to avoid having the struct usb_serial be prematurely freed by
the console code when the device is later disconnected.

Fixes: 73e487fdb75f ("[PATCH] USB console: fix disconnection issues")
Cc: stable <stable@vger.kernel.org> # 2.6.18
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Johan Hovold <johan@kernel.org>
(cherry picked from commit 299d7572e46f98534033a9e65973f13ad1ce9047)

Orabug: 27206824
CVE: CVE-2017-16525

This bug is partly caused by a clean-up patch that is not merged in uek4,
so it only needs 1 commit to fix it.

Signed-off-by: Tim Tianyang Chen <tianyang.chen@oracle.com>
Reviewed-by: Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

uwb: properly check kthread_run return value

uwbd_start() calls kthread_run() and checks that the return value is
not NULL. But the return value is not NULL in case kthread_run() fails,
it takes the form of ERR_PTR(-EINTR).

Use IS_ERR() instead.

Also add a check to uwbd_stop().

Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit bbf26183b7a6236ba602f4d6a2f7cade35bba043)

Orabug: 27206874
CVE: CVE-2017-16526

Signed-off-by: Tim Tianyang Chen <tianyang.chen@oracle.com>
Reviewed-by: Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

ALSA: usb-audio: Check out-of-bounds access by corrupted buffer descriptor

When a USB-audio device receives a maliciously adjusted or corrupted
buffer descriptor, the USB-audio driver may access an out-of-bounce
value at its parser.  This was detected by syzkaller, something like:

  BUG: KASAN: slab-out-of-bounds in usb_audio_probe+0x27b2/0x2ab0
  Read of size 1 at addr ffff88006b83a9e8 by task kworker/0:1/24
  CPU: 0 PID: 24 Comm: kworker/0:1 Not tainted 4.14.0-rc1-42251-gebb2c2437d80 #224
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
  Workqueue: usb_hub_wq hub_event
  Call Trace:
   __dump_stack lib/dump_stack.c:16
   dump_stack+0x292/0x395 lib/dump_stack.c:52
   print_address_description+0x78/0x280 mm/kasan/report.c:252
   kasan_report_error mm/kasan/report.c:351
   kasan_report+0x22f/0x340 mm/kasan/report.c:409
   __asan_report_load1_noabort+0x19/0x20 mm/kasan/report.c:427
   snd_usb_create_streams sound/usb/card.c:248
   usb_audio_probe+0x27b2/0x2ab0 sound/usb/card.c:605
   usb_probe_interface+0x35d/0x8e0 drivers/usb/core/driver.c:361
   really_probe drivers/base/dd.c:413
   driver_probe_device+0x610/0xa00 drivers/base/dd.c:557
   __device_attach_driver+0x230/0x290 drivers/base/dd.c:653
   bus_for_each_drv+0x161/0x210 drivers/base/bus.c:463
   __device_attach+0x26e/0x3d0 drivers/base/dd.c:710
   device_initial_probe+0x1f/0x30 drivers/base/dd.c:757
   bus_probe_device+0x1eb/0x290 drivers/base/bus.c:523
   device_add+0xd0b/0x1660 drivers/base/core.c:1835
   usb_set_configuration+0x104e/0x1870 drivers/usb/core/message.c:1932
   generic_probe+0x73/0xe0 drivers/usb/core/generic.c:174
   usb_probe_device+0xaf/0xe0 drivers/usb/core/driver.c:266
   really_probe drivers/base/dd.c:413
   driver_probe_device+0x610/0xa00 drivers/base/dd.c:557
   __device_attach_driver+0x230/0x290 drivers/base/dd.c:653
   bus_for_each_drv+0x161/0x210 drivers/base/bus.c:463
   __device_attach+0x26e/0x3d0 drivers/base/dd.c:710
   device_initial_probe+0x1f/0x30 drivers/base/dd.c:757
   bus_probe_device+0x1eb/0x290 drivers/base/bus.c:523
   device_add+0xd0b/0x1660 drivers/base/core.c:1835
   usb_new_device+0x7b8/0x1020 drivers/usb/core/hub.c:2457
   hub_port_connect drivers/usb/core/hub.c:4903
   hub_port_connect_change drivers/usb/core/hub.c:5009
   port_event drivers/usb/core/hub.c:5115
   hub_event+0x194d/0x3740 drivers/usb/core/hub.c:5195
   process_one_work+0xc7f/0x1db0 kernel/workqueue.c:2119
   worker_thread+0x221/0x1850 kernel/workqueue.c:2253
   kthread+0x3a1/0x470 kernel/kthread.c:231
   ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:431

This patch adds the checks of out-of-bounce accesses at appropriate
places and bails out when it goes out of the given buffer.

Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
(cherry picked from commit bfc81a8bc18e3c4ba0cbaa7666ff76be2f998991)

Orabug: 27206916
CVE: CVE-2017-16529

Signed-off-by: Tim Tianyang Chen <tianyang.chen@oracle.com>
Reviewed-by: Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

USB: uas: fix bug in handling of alternate settings

The uas driver has a subtle bug in the way it handles alternate
settings.  The uas_find_uas_alt_setting() routine returns an
altsetting value (the bAlternateSetting number in the descriptor), but
uas_use_uas_driver() then treats that value as an index to the
intf->altsetting array, which it isn't.

Normally this doesn't cause any problems because the various
alternate settings have bAlternateSetting values 0, 1, 2, ..., so the
value is equal to the index in the array.  But this is not guaranteed,
and Andrey Konovalov used the syzkaller fuzzer with KASAN to get a
slab-out-of-bounds error by violating this assumption.

This patch fixes the bug by making uas_find_uas_alt_setting() return a
pointer to the altsetting entry rather than either the value or the
index.  Pointers are less subject to misinterpretation.

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
CC: Oliver Neukum <oneukum@suse.com>
CC: <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 786de92b3cb26012d3d0f00ee37adf14527f35c4)

Orabug: 27206993
CVE: CVE-2017-16530

Signed-off-by: Tim Tianyang Chen <tianyang.chen@oracle.com>
Reviewed-by: Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

USB: fix out-of-bounds in usb_set_configuration

Andrey Konovalov reported a possible out-of-bounds problem for a USB interface
association descriptor. He writes:
It seems there's no proper size check of a USB_DT_INTERFACE_ASSOCIATION
descriptor. It's only checked that the size is >= 2 in
usb_parse_configuration(), so find_iad() might do out-of-bounds access
to intf_assoc->bInterfaceCount.

And he's right, we don't check for crazy descriptors of this type very well, so
resolve this problem. Yet another issue found by syzkaller...

Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit bd7a3fe770ebd8391d1c7d072ff88e9e76d063eb)

Orabug: 27207211
CVE: CVE-2017-16531

Signed-off-by: Tim Tianyang Chen <tianyang.chen@oracle.com>
Reviewed-by: Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

cgroup: make sure a parent css isn't offlined before its children

There are three subsystem callbacks in css shutdown path -
css_offline(), css_released() and css_free().  Except for
css_released(), cgroup core didn't guarantee the order of invocation.
css_offline() or css_free() could be called on a parent css before its
children.  This behavior is unexpected and led to bugs in cpu and
memory controller.

This patch updates offline path so that a parent css is never offlined
before its children.  Each css keeps online_cnt which reaches zero iff
itself and all its children are offline and offline_css() is invoked
only after online_cnt reaches zero.

This fixes the memory controller bug and allows the fix for cpu
controller.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reported-by: Brian Christiansen <brian.o.christiansen@gmail.com>
Link: http://lkml.kernel.org/g/5698A023.9070703@de.ibm.com
Link: http://lkml.kernel.org/g/CAKB58ikDkzc8REt31WBkD99+hxNzjK4+FBmhkgS+NVrC9vjMSg@mail.gmail.com
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: stable@vger.kernel.org
Orabug: 27045648

(backport upstream commit aa226ff4a1ce79f229c6b7a4c0a14e17fececd01)

Conflict:
include/linux/cgroup-defs.h

Fix a kABI breakage

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>

HID: usbhid: fix out-of-bounds bug

The hid descriptor identifies the length and type of subordinate
descriptors for a device. If the received hid descriptor is smaller than
the size of the struct hid_descriptor, it is possible to cause
out-of-bounds.

In addition, if bNumDescriptors of the hid descriptor have an incorrect
value, this can also cause out-of-bounds while approaching hdesc->desc[n].

So check the size of hid descriptor and bNumDescriptors.

BUG: KASAN: slab-out-of-bounds in usbhid_parse+0x9b1/0xa20
Read of size 1 at addr ffff88006c5f8edf by task kworker/1:2/1261

CPU: 1 PID: 1261 Comm: kworker/1:2 Not tainted
4.14.0-rc1-42251-gebb2c2437d80 #169
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Workqueue: usb_hub_wq hub_event
Call Trace:
__dump_stack lib/dump_stack.c:16
dump_stack+0x292/0x395 lib/dump_stack.c:52
print_address_description+0x78/0x280 mm/kasan/report.c:252
kasan_report_error mm/kasan/report.c:351
kasan_report+0x22f/0x340 mm/kasan/report.c:409
__asan_report_load1_noabort+0x19/0x20 mm/kasan/report.c:427
usbhid_parse+0x9b1/0xa20 drivers/hid/usbhid/hid-core.c:1004
hid_add_device+0x16b/0xb30 drivers/hid/hid-core.c:2944
usbhid_probe+0xc28/0x1100 drivers/hid/usbhid/hid-core.c:1369
usb_probe_interface+0x35d/0x8e0 drivers/usb/core/driver.c:361
really_probe drivers/base/dd.c:413
driver_probe_device+0x610/0xa00 drivers/base/dd.c:557
__device_attach_driver+0x230/0x290 drivers/base/dd.c:653
bus_for_each_drv+0x161/0x210 drivers/base/bus.c:463
__device_attach+0x26e/0x3d0 drivers/base/dd.c:710
device_initial_probe+0x1f/0x30 drivers/base/dd.c:757
bus_probe_device+0x1eb/0x290 drivers/base/bus.c:523
device_add+0xd0b/0x1660 drivers/base/core.c:1835
usb_set_configuration+0x104e/0x1870 drivers/usb/core/message.c:1932
generic_probe+0x73/0xe0 drivers/usb/core/generic.c:174
usb_probe_device+0xaf/0xe0 drivers/usb/core/driver.c:266
really_probe drivers/base/dd.c:413
driver_probe_device+0x610/0xa00 drivers/base/dd.c:557
__device_attach_driver+0x230/0x290 drivers/base/dd.c:653
bus_for_each_drv+0x161/0x210 drivers/base/bus.c:463
__device_attach+0x26e/0x3d0 drivers/base/dd.c:710
device_initial_probe+0x1f/0x30 drivers/base/dd.c:757
bus_probe_device+0x1eb/0x290 drivers/base/bus.c:523
device_add+0xd0b/0x1660 drivers/base/core.c:1835
usb_new_device+0x7b8/0x1020 drivers/usb/core/hub.c:2457
hub_port_connect drivers/usb/core/hub.c:4903
hub_port_connect_change drivers/usb/core/hub.c:5009
port_event drivers/usb/core/hub.c:5115
hub_event+0x194d/0x3740 drivers/usb/core/hub.c:5195
process_one_work+0xc7f/0x1db0 kernel/workqueue.c:2119
worker_thread+0x221/0x1850 kernel/workqueue.c:2253
kthread+0x3a1/0x470 kernel/kthread.c:231
ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:431

Cc: stable@vger.kernel.org
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Jaejoong Kim <climbbb.kim@gmail.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
(cherry picked from commit f043bfc98c193c284e2cd768fefabe18ac2fed9b)

Orabug: 27207901
CVE: CVE-2017-16533

Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

USB: core: fix out-of-bounds access bug in usb_get_bos_descriptor()

Andrey used the syzkaller fuzzer to find an out-of-bounds memory
access in usb_get_bos_descriptor(). The code wasn't checking that the
next usb_dev_cap_header structure could fit into the remaining buffer
space.

This patch fixes the error and also reduces the bNumDeviceCaps field
in the header to match the actual number of capabilities found, in
cases where there are fewer than expected.

Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
CC: <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 1c0edc3633b56000e18d82fc241e3995ca18a69e)

Orabug: 27207955
CVE: CVE-2017-16535

Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

net: qmi_wwan: fix divide by 0 on bad descriptors

Orabug: 27215213
CVE: CVE-2017-16650

A CDC Ethernet functional descriptor with wMaxSegmentSize = 0 will
cause a divide error in usbnet_probe:

divide error: 0000 [#1] PREEMPT SMP KASAN
Modules linked in:
CPU: 0 PID: 24 Comm: kworker/0:1 Not tainted 4.14.0-rc8-44453-g1fdc1a82c34f #56
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Workqueue: usb_hub_wq hub_event
task: ffff88006bef5c00 task.stack: ffff88006bf60000
RIP: 0010:usbnet_update_max_qlen+0x24d/0x390 drivers/net/usb/usbnet.c:355
RSP: 0018:ffff88006bf67508 EFLAGS: 00010246
RAX: 00000000000163c8 RBX: ffff8800621fce40 RCX: ffff8800621fcf34
RDX: 0000000000000000 RSI: ffffffff837ecb7a RDI: ffff8800621fcf34
RBP: ffff88006bf67520 R08: ffff88006bef5c00 R09: ffffed000c43f881
R10: ffffed000c43f880 R11: ffff8800621fc406 R12: 0000000000000003
R13: ffffffff85c71de0 R14: 0000000000000000 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff88006ca00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffe9c0d6dac CR3: 00000000614f4000 CR4: 00000000000006f0
Call Trace:
usbnet_probe+0x18b5/0x2790 drivers/net/usb/usbnet.c:1783
qmi_wwan_probe+0x133/0x220 drivers/net/usb/qmi_wwan.c:1338
usb_probe_interface+0x324/0x940 drivers/usb/core/driver.c:361
really_probe drivers/base/dd.c:413
driver_probe_device+0x522/0x740 drivers/base/dd.c:557

Fix by simply ignoring the bogus descriptor, as it is optional
for QMI devices anyway.

Fixes: 423ce8caab7e ("net: usb: qmi_wwan: New driver for Huawei QMI based WWAN devices")
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

[media] cx231xx-cards: fix NULL-deref on missing association descriptor

Make sure to check that we actually have an Interface Association
Descriptor before dereferencing it during probe to avoid dereferencing a
NULL-pointer.

Fixes: e0d3bafd0258 ("V4L/DVB (10954): Add cx231xx USB driver")
Cc: stable <stable@vger.kernel.org> # 2.6.30
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Johan Hovold <johan@kernel.org>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Hans Verkuil <hans.verkuil@cisco.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
(cherry picked from commit 6c3b047fa2d2286d5e438bcb470c7b1a49f415f6)

Orabug: 27208030
CVE: CVE-2017-16536

Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

Merge branch 'uek/uek-4.1-next' of git://ca-git.us.oracle.com/linux-uek into to-merge/nalcock

ctf: fix thinko preventing linking of out-of-tree modules when CTF is off

The CTF decoupling commit dropped a bunch of variable initializations
for the (degenerate) external-module no-CTF case because we would need
the same initializations for other cases, moving those initializations
further up and overriding them only when needed (in the CTF-enabled,
out-of-tree-module case).

Unfortunately I then forgot to move the containing ifdef CONFIG_CTF
down, leading to these variables being entirely unset in the out-
of-tree module case. This causes linking of out-of-tree modules to fail
when CTF is off.

Thanks to Iain Barker and releng for tracking this down and reporting
it.

Signed-off-by: Nick Alcock <nick.alcock@oracle.com>
Reviewed-by: Chuck Anderson <chuck.anderson@oracle.com>
Orabug: 27215305

ctf: allow dwarf2ctf to run as root but produce no output

This leads to no CTF file writes (and thus no writes) at all,
which is as safe as we can get as root: this is treated by the
makefiles as an empty CTF file. The resulting module will
appear to DTrace to have no types.

The warning is still emitted (rephrased slightly).

Signed-off-by: Nick Alcock <nick.alcock@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Orabug: 27205676

mlx4: Subscribe to PXM notifier

With patch titled: "xen/pci: Add PXM node notifier for PXM (NUMA) changes."
there is a notifier which can be used to obtain information
about which NUMA node the device is on.

Usually this kind of information is available prior to the driver
loading, but thanks to the braindead way piix4 emulation works
in QEMU there is no good way of making this work.

Upstream is working on ditching this, but it is a year off or so.

Note that this should not be needed if QEMU q35 platform
is exposed with PCIe bridges in them. The PCIe bridges would contain
the proper PXM information and the hotplugged PCIe device would
snuggly be attached there.

But in the meantime this hack is in place to expose the PXM data.

OraBug: 27200813
Acked-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/pci: Add PXM node notifier for PXM (NUMA) changes.

commit c011624b48ca8713135fee9af20f65cc9610be7b
"xen-pcifront/hvm: Slurp up "pxm" entry and set NUMA node on PCIe device. (V5)"
added the functionality to update the PXM (ProXiMity) information
on the PCI devices.

In today's pre-Q35 type guests the PCI bus topology is pretty
flat, where there is only one bridge - and that does not square
very well with the real way one is suppose to expose PXM information.
That is have a PCI bridge and under the bridge plug in devices.

But since we can't do that, we are providing the information
via XenBus - and Xen pcifront listens to this and updates
the 'struct device' numa_node.

While that works.. there is a race. Thanks to the seperation
of pieces, Xend has no clue what the PCI topology is in QEMU.
Meaning it has no idea what the bus:slot:function should be
present in there - until QEMU gets told to plug the device
(ACPI hotplug) andthen it presents the correct "vdevfn" to Xend -
and Xend then creates the XenBus entries which the Xen pcifront
inside of the guest enumerates.

This is a race.

This is what happens at the guest side:

pci 0000:00:04.0: reg 0x18: [mem 0x00000000-0x07ffffff 64bit pref]
pci 0000:00:02.0: BAR 6: [??? 0x00000000 flags 0x2] has bogus ali..
pci 0000:00:04.0: BAR 2: assigned [mem 0x3c10000000-0x3c17ffffff ..
mlx4_core: Mellanox ConnectX core driver v2.2-1 (Nov 30 2017)
mlx4_core: Initializing 0000:00:04.0
mlx4_core 0000:00:04.0: enabling device (0000 -> 0002)

(mlx loaded, it sets the numa_node based on pdev->dev (which is -1).
Keep in mind that numa_node is on 'struct mlx_dev' not on 'struct device')

pcifront_hvm pci-0: PCI device 0000:00:04.0 (PXM=0)

(XenBus entries are created, pdev->dev is changed from -1 to 0.
The notification are being sent out when driver is finished loading
BUS_NOTIFY_BOUND_DRIVER)

But pdev->dev is NOT the same as 'struct mlx_dev' which has its own
numa_node attribute (which is exposed via SysFS)!

To make this work we need notifier that the mlx4 driver can be
notified when PXM is updated. And with this patch (and
another follow-up):

mlx4_core 0000:00:04.0: Updating PXM-1 to 0

(mlx4 updates it PXM data)

The sensible ways would be for Xen hypervisor to keep track of the guest
PCI topology instead of delegating that to QEMU, and in fact that work is
being done upstream. With that one can query the hypevisor for the
next available slot to hotplug the device instead of asking QEMU.
But that is ways off, and we don't have that much time.

Instead we add a per-'struct pci_dev' notifier that various drivers
can subscribe to and update their 'numa_node' when there are changes.

It is not perfect but it does the job. It is also a NOP if run on
anything that is not Xen HVM domains.

OraBug: 27200813
Acked-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/pcifront: Walk the PCI bus after XenStore notification

Prior to this patch we would only update the 'struct pci_dev'
if a PCI hotplug (new device) event would happen.

This works fine if the device is part of the guest config, that is:

pci=["0a:11.1"]

or such.

Unfortunatly if you are doing PCI hotplug:

xm pci-attach vnuma 0a:11.1

There are two events we need to stamp the 'struct pci_dev'
with the PXM node information:
- ACPI hotplug to inject the PCI device.
- XenBus entries being created

And those are created in the wrong order - first is the
ACPI hotplug event, followed by the XenBus update.

That means 'pcifront_hvm_notifier' gets called when
the ACPI hotplug even has started, but XenBus hasn't been
yet created so the scan from there returns NULL.

Later on we get an XenBus notification (pcifront_hvm_xenbus_probe),
update our internal list, and then the PCI notifier would find it.

The one way to make this work is to listen to an
extra event: BUS_NOTIFY_BOUND_DRIVER

That is signaled once the driver has loaded itself and is
ready. When the 'pcifront_hvm_notifier' gets that information
it can walk over the PCI bus and stamp all the 'struct pci_dev'
with the updated PXM data. In other words the operations
are now:

- ACPI hotplug to inject the PCI device.
- udev is called, modprobe <XYZ> is invoked
- XenBus entries being created
- <XYZ> driver is finished
- pcifront_hvm_notifier is called, figures out which
'struct pci_dev' needs its PXM update and updates.

OraBug: 27200813

Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

mm, thp: Do not make page table dirty unconditionally in follow_trans_huge_pmd()

Currently, we unconditionally make page table dirty in
follow_trans_huge_pmd(). However, it should be set only in case of
write access.

This is port of upstream commit: a8f97366452ed491d13cf1e44241bc0b5740b1f0

Orabug: 27165913

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
CVE: CVE-2017-1000405
Signed-off-by: Nitin Gupta <nitin.m.gupta@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>

mlx4: Subscribe to PXM notifier

With patch titled: "xen/pci: Add PXM node notifier for PXM (NUMA) changes."
there is a notifier which can be used to obtain information
about which NUMA node the device is on.

Usually this kind of information is available prior to the driver
loading, but thanks to the braindead way piix4 emulation works
in QEMU there is no good way of making this work.

Upstream is working on ditching this, but it is a year off or so.

Note that this should not be needed if QEMU q35 platform
is exposed with PCIe bridges in them. The PCIe bridges would contain
the proper PXM information and the hotplugged PCIe device would
snuggly be attached there.

But in the meantime this hack is in place to expose the PXM data.

OraBug: 27200813
Acked-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/pci: Add PXM node notifier for PXM (NUMA) changes.

commit c011624b48ca8713135fee9af20f65cc9610be7b
"xen-pcifront/hvm: Slurp up "pxm" entry and set NUMA node on PCIe device. (V5)"
added the functionality to update the PXM (ProXiMity) information
on the PCI devices.

In today's pre-Q35 type guests the PCI bus topology is pretty
flat, where there is only one bridge - and that does not square
very well with the real way one is suppose to expose PXM information.
That is have a PCI bridge and under the bridge plug in devices.

But since we can't do that, we are providing the information
via XenBus - and Xen pcifront listens to this and updates
the 'struct device' numa_node.

While that works.. there is a race. Thanks to the seperation
of pieces, Xend has no clue what the PCI topology is in QEMU.
Meaning it has no idea what the bus:slot:function should be
present in there - until QEMU gets told to plug the device
(ACPI hotplug) andthen it presents the correct "vdevfn" to Xend -
and Xend then creates the XenBus entries which the Xen pcifront
inside of the guest enumerates.

This is a race.

This is what happens at the guest side:

pci 0000:00:04.0: reg 0x18: [mem 0x00000000-0x07ffffff 64bit pref]
pci 0000:00:02.0: BAR 6: [??? 0x00000000 flags 0x2] has bogus ali..
pci 0000:00:04.0: BAR 2: assigned [mem 0x3c10000000-0x3c17ffffff ..
mlx4_core: Mellanox ConnectX core driver v2.2-1 (Nov 30 2017)
mlx4_core: Initializing 0000:00:04.0
mlx4_core 0000:00:04.0: enabling device (0000 -> 0002)

(mlx loaded, it sets the numa_node based on pdev->dev (which is -1).
Keep in mind that numa_node is on 'struct mlx_dev' not on 'struct device')

pcifront_hvm pci-0: PCI device 0000:00:04.0 (PXM=0)

(XenBus entries are created, pdev->dev is changed from -1 to 0.
The notification are being sent out when driver is finished loading
BUS_NOTIFY_BOUND_DRIVER)

But pdev->dev is NOT the same as 'struct mlx_dev' which has its own
numa_node attribute (which is exposed via SysFS)!

To make this work we need notifier that the mlx4 driver can be
notified when PXM is updated. And with this patch (and
another follow-up):

mlx4_core 0000:00:04.0: Updating PXM-1 to 0

(mlx4 updates it PXM data)

The sensible ways would be for Xen hypervisor to keep track of the guest
PCI topology instead of delegating that to QEMU, and in fact that work is
being done upstream. With that one can query the hypevisor for the
next available slot to hotplug the device instead of asking QEMU.
But that is ways off, and we don't have that much time.

Instead we add a per-'struct pci_dev' notifier that various drivers
can subscribe to and update their 'numa_node' when there are changes.

It is not perfect but it does the job. It is also a NOP if run on
anything that is not Xen HVM domains.

OraBug: 27200813
Acked-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>