www.infradead.org Git - users/jedix/linux-maple.git/log

sparc/PCI: Use correct offset for bus address to resource

After we added 64bit mmio parsing, we got some "no compatible bridge window"
warning on anther new model that support 64bit resource.

It turns out that we can not use mem_space.start as 64bit mem space
offset, aka there is mem_space.start != offset.

Use child_phys_addr to calculate exact offset and record offset in
pbm.

After patch we get correct offset.

/pci@305: PCI IO [io  0x2007e00000000-0x2007e0fffffff] offset 2007e00000000
/pci@305: PCI MEM [mem 0x2000000100000-0x200007effffff] offset 2000000000000
/pci@305: PCI MEM64 [mem 0x2000100000000-0x2000dffffffff] offset 2000000000000
...
pci_sun4v f02ae7f8: PCI host bridge to bus 0000:00
pci_bus 0000:00: root bus resource [io  0x2007e00000000-0x2007e0fffffff] (bus address [0x0000-0xfffffff])
pci_bus 0000:00: root bus resource [mem 0x2000000100000-0x200007effffff] (bus address [0x00100000-0x7effffff])
pci_bus 0000:00: root bus resource [mem 0x2000100000000-0x2000dffffffff] (bus address [0x100000000-0xdffffffff])

-v3: put back mem64_offset, as we found T4 has mem_offset != mem64_offset
     check overlapping between mem64_space and mem_space.
-v7: after new pci_mmap_page_range patches.
-v8: remove change in pci_resource_to_user()

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Tested-by: Khalid Aziz <khalid.aziz@oracle.com>
Cc: sparclinux@vger.kernel.org
Orabug: 22855133

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit f296714da83b75783997f8dcfe2a9021ef8fedde)
Signed-off-by: Allen Pais <allen.pais@oracle.com>

PCI: Remove __pci_mmap_make_offset()

After
PCI: Let pci_mmap_page_range() take resource address
No user for __pci_mmap_make_offset in those arch.

Remove them.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: sparclinux@vger.kernel.org
Cc: linux-xtensa@linux-xtensa.org
Orabug: 22855133

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit c34a8c61ea58d516fb20c6ec4fdf338f96fcfeef)
Signed-off-by: Allen Pais <allen.pais@oracle.com>

PCI: Let pci_mmap_page_range() take resource address

Original pci_mmap_page_range() is taking PCI BAR value aka usr_address.

Bjorn found out that it would be much simple to pass resource address
directly and avoid extra those __pci_mmap_make_offset.

In this patch:
1. in proc path: proc_bus_pci_mmap, try convert back to resource
   before calling pci_mmap_page_range
2. in sysfs path: pci_mmap_resource will just offset with resource start.
3. all pci_mmap_page_range will have vma->vm_pgoff with in resource
   range instead of BAR value.
4. skip calling __pci_mmap_make_offset, as the checking is done
   in pci_mmap_fits().

-v2: add pci_user_to_resource and remove __pci_mmap_make_offset
-v3: pass resource pointer with pci_mmap_page_range()
-v4: put __pci_mmap_make_offset() removing to following patch
     seperate /sys io access alignment checking to another patch
     updated after Bjorn's pci_resource_to_user() changes.
-v5: update after fix for pci_mmap with proc path accoring to
     Bjorn.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: sparclinux@vger.kernel.org
Cc: linux-xtensa@linux-xtensa.org
Orabug: 22855133

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit d6ccd78899c966eec1bac726f96f6cbed5b9960e)
Signed-off-by: Allen Pais <allen.pais@oracle.com>

PCI: Fix proc mmap on sparc

In 8c05cd08a7 ("PCI: fix offset check for sysfs mmapped files"), try
to check exposed value with resource start/end in proc mmap path.

|        start = vma->vm_pgoff;
|        size = ((pci_resource_len(pdev, resno) - 1) >> PAGE_SHIFT) + 1;
|        pci_start = (mmap_api == PCI_MMAP_PROCFS) ?
|                        pci_resource_start(pdev, resno) >> PAGE_SHIFT : 0;
|        if (start >= pci_start && start < pci_start + size &&
|                        start + nr <= pci_start + size)

vma->vm_pgoff is above code segment is user/BAR value >> PAGE_SHIFT.
pci_start is resource->start >> PAGE_SHIFT.

For sparc, resource start is different from BAR start aka pci bus address.
pci bus address need to add offset to be the resource start.

So that commit breaks all arch that exposed value is BAR/user value,
and need to be offseted to resource address.

test code using: ./test_mmap_proc /proc/bus/pci/0000:00/04.0 0x2000000
test code segment:
  fd = open(argv[1], O_RDONLY);
  ...
  sscanf(argv[2], "0x%lx", &offset);
  left = offset & (PAGE_SIZE - 1);
  offset &= PAGE_MASK;
  ioctl(fd, PCIIOC_MMAP_IS_MEM);
  addr = mmap(NULL, PAGE_SIZE, PROT_READ, MAP_SHARED, fd, offset);
  for (i = 0; i < 8; i++)
    printf("%x ", addr[i + left]);
  munmap(addr, PAGE_SIZE);
  close(fd);

Fixes: 8c05cd08a7 ("PCI: fix offset check for sysfs mmapped files")
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Orabug: 22855133

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit 7bd8ad7855e6ecdac36c6dc7946f5b6beff13ae2)
Signed-off-by: Allen Pais <allen.pais@oracle.com>

PCI: Supply CPU physical address (not bus address) to iomem_is_exclusive()

iomem_is_exclusive() requires a CPU physical address, but on some arches we
supplied a PCI bus address instead.

On most arches, pci_resource_to_user(res) returns "res->start", which is a
CPU physical address. But on microblaze, mips, powerpc, and sparc, it
returns the PCI bus address corresponding to "res->start".

The result is that pci_mmap_resource() may fail when it shouldn't (if the
bus address happens to match an existing resource), or it may succeed when
it should fail (if the resource is exclusive but the bus address doesn't
match it).

Call iomem_is_exclusive() with "res->start", which is always a CPU physical
address, not the result of pci_resource_to_user().

Fixes: e8de1481fd71 ("resource: allow MMIO exclusivity for device drivers")
Suggested-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
CC: Arjan van de Ven <arjan@linux.intel.com>
Orabug: 22855133
(cherry picked from commit ca620723d4ff9ea7ed484eab46264c3af871b9ae)
Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit aee594b88bfcb58f51e0070de06d3879b3ea4609)
Signed-off-by: Allen Pais <allen.pais@oracle.com>

Revert "sparc/PCI: Use correct bus address to resource offset"

This reverts commit dae731c1974171f6c85a7efe018625b72edfa82a. This
commit causes hotplug to break as documented in Orabug 22855133.

Orabug: 22855133

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit 2dc2632b0b252f8f50c276efa9cc4f00c8e067b5)
Signed-off-by: Allen Pais <allen.pais@oracle.com>

Revert "sparc/PCI: Unify pci_register_region()"

This reverts commit 51c0ecdf02fec6728bb1ecc674ee1b7fe90c9d69. This
commit causes hotplug to break as documented in Orabug 22855133.

Orabug: 22855133

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit 934fffe0fdf1bf81c90be604827f19c4ede839fb)
Signed-off-by: Allen Pais <allen.pais@oracle.com>

Revert "sparc/PCI: Reserve legacy mmio after PCI mmio"

This reverts commit 0c1736c25a4e157cc31f4bed645fded815b4db97. This
commit causes hotplug to break as documented in Orabug 22855133.

Orabug: 22855133

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit 59860a19cf9b2acd24a7eaec6d539dbc984a90b9)
Signed-off-by: Allen Pais <allen.pais@oracle.com>

Revert "sparc/PCI: Add IORESOURCE_MEM_64 for 64-bit resource in OF parsing"

This reverts commit 9594d268aa40997f3a47e5a8c64463f32db226ff. This
commit causes hotplug to break as documented in Orabug 22855133.

Orabug: 22855133

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit b2fd930d3e95db05979aefc42291be1f6ba2c625)
Signed-off-by: Allen Pais <allen.pais@oracle.com>

Revert "sparc/PCI: Keep resource idx order with bridge register number"

This reverts commit ed426d2bf0fa5244e6cfd531e25c50098df6b3cf. This
commit causes hotplug to break as documented in Orabug 22855133.

Orabug: 22855133

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit 09dd990a13d7ea2a257f5be189f109fa80259599)
Signed-off-by: Allen Pais <allen.pais@oracle.com>

Revert "PCI: kill wrong quirk about M7101"

This reverts commit 2bd1643612ce5560c80856f369ae81fc80e0331e. This
commit causes hotplug to break as documented in Orabug 22855133.

Orabug: 22855133

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit de113e58a8cc67f7daa128c33ac49a33935dc767)
Signed-off-by: Allen Pais <allen.pais@oracle.com>

Revert "OF/PCI: Add IORESOURCE_MEM_64 for 64-bit resource"

This reverts commit 59072574902b6e44574c34a9989ec6ec6c494f99. This
commit causes hotplug to break as documented in Orabug 22855133.

Orabug: 22855133

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit c0691adf4a28993c77371969ff7800901ff1a67b)
Signed-off-by: Allen Pais <allen.pais@oracle.com>

Revert "PCI: Check pref compatible bit for mem64 resource of PCIe device"

This reverts commit 40990901003330e03db71f9f7b8966a6b5c6e5ba. This
commit causes hotplug to break as documented in Orabug 22855133.

Orabug: 22855133

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit bb9160790d346bcbfc0b5ca701e7e6f6d5d56f87)
Signed-off-by: Allen Pais <allen.pais@oracle.com>

Revert "PCI: Only treat non-pref mmio64 as pref if all bridges have MEM_64"

This reverts commit 18d135012bbc70a30df2a664c4470ce94430c9ea. This
commit causes hotplug to break as documented in Orabug 22855133.

Orabug: 22855133

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit 2998a347e93c41d8677e73510a139aa54339c88f)
Signed-off-by: Allen Pais <allen.pais@oracle.com>

Revert "PCI: Add has_mem64 for struct host_bridge"

This reverts commit 9bb848b4f153ca43e05ad976274c8281240d5adc. This
commit causes hotplug to break as documented in Orabug 22855133.

Orabug: 22855133

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit 4e49b00da550d255a9470d4ffcffb6dea2227570)
Signed-off-by: Allen Pais <allen.pais@oracle.com>

Revert "PCI: Only treat non-pref mmio64 as pref if host bridge has mmio64"

This reverts commit d39428713e56e17b1558a1b69828e514b29b18e5. This
commit causes hotplug to break as documented in Orabug 22855133.

Orabug: 22855133

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit 0c060ada1d50dd5f423eed0e2ba4f3e15c4579b0)
Signed-off-by: Allen Pais <allen.pais@oracle.com>

Revert "PCI: Restore pref MMIO allocation logic for host bridge without mmio64"

This reverts commit 88f97208c7a78187924924f741f864d6a3ccc62e. This
commit causes hotplug to break as documented in Orabug 22855133.

Orabug: 22855133

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit 5b6a6cf1e2785b6bf9153c726e09435c05921040)
Signed-off-by: Allen Pais <allen.pais@oracle.com>

Revert "sparc: Accommodate mem64_offset != mem_offset in pbm configuration"

This reverts commit 6037ec961e10e6e162c564d5306b2c2fddfb5b77. This
commit causes hotplug to break as documented in Orabug 22855133.

Orabug: 22855133

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit 7d551cd315fb544b9b815566737de6e146fd0cd8)
Signed-off-by: Allen Pais <allen.pais@oracle.com>

PCI: Prevent VPD access for QLogic ISP2722

QLogic ISP2722-based 16/32Gb Fibre Channel to PCIe Adapter has the VPD
access issue too, while read the common pci-sysfs access interface shown as

/sys/devices/pci0000:00/0000:00:03.2/0000:0b:00.0/vpd

with simple 'cat' could cause system hang and panic:

  Kernel panic - not syncing: An NMI occurred. Depending on your system the reason for the NMI is logged in any one of the following resources:
  1. Integrated Management Log (IML)
  2. OA Syslog
  3. OA Forward Progress Log
  4. iLO Event Log
  CPU: 0 PID: 15070 Comm: udevadm Not tainted 4.1.12
  Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 12/27/2015
   0000000000000086 000000007f0cdf51 ffff880c4fa05d58 ffffffff817193de
   ffffffffa00b42d8 0000000000000075 ffff880c4fa05dd8 ffffffff81714072
   0000000000000008 ffff880c4fa05de8 ffff880c4fa05d88 000000007f0cdf51
  Call Trace:
   <NMI>  [<ffffffff817193de>] dump_stack+0x63/0x81
   [<ffffffff81714072>] panic+0xd0/0x20e
   [<ffffffffa00b390d>] hpwdt_pretimeout+0xdd/0xe0 [hpwdt]
   [<ffffffff81021fc9>] ? sched_clock+0x9/0x10
   [<ffffffff8101c101>] nmi_handle+0x91/0x170
   [<ffffffff8101c10c>] ? nmi_handle+0x9c/0x170
   [<ffffffff8101c5fe>] io_check_error+0x1e/0xa0
   [<ffffffff8101c719>] default_do_nmi+0x99/0x140
   [<ffffffff8101c8b4>] do_nmi+0xf4/0x170
   [<ffffffff817232c5>] end_repeat_nmi+0x1a/0x1e
   [<ffffffff815d724b>] ? pci_conf1_read+0xeb/0x120
   [<ffffffff815d724b>] ? pci_conf1_read+0xeb/0x120
   [<ffffffff815d724b>] ? pci_conf1_read+0xeb/0x120
   <<EOE>>  [<ffffffff815db4b3>] raw_pci_read+0x23/0x40
   [<ffffffff815db4fc>] pci_read+0x2c/0x30
   [<ffffffff8136f612>] pci_user_read_config_word+0x72/0x110
   [<ffffffff8136f746>] pci_vpd_pci22_wait+0x96/0x130
   [<ffffffff8136ff9b>] pci_vpd_pci22_read+0xdb/0x1a0
   [<ffffffff8136ea30>] pci_read_vpd+0x20/0x30
   [<ffffffff8137d590>] read_vpd_attr+0x30/0x40
   [<ffffffff8128e037>] sysfs_kf_bin_read+0x47/0x70
   [<ffffffff8128d24e>] kernfs_fop_read+0xae/0x180
   [<ffffffff8120dd97>] __vfs_read+0x37/0x100
   [<ffffffff812ba7e4>] ? security_file_permission+0x84/0xa0
   [<ffffffff8120e366>] ? rw_verify_area+0x56/0xe0
   [<ffffffff8120e476>] vfs_read+0x86/0x140
   [<ffffffff8120f3f5>] SyS_read+0x55/0xd0
   [<ffffffff81720f2e>] system_call_fastpath+0x12/0x71
  Shutting down cpus with NMI
  Kernel Offset: disabled
  drm_kms_helper: panic occurred, switching back to text console

So blacklist the access to its VPD.

Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
CC: stable@vger.kernel.org # v4.6+
(cherry picked from commit 0d5370d1d85251e5893ab7c90a429464de2e140b)

Orabug: 25975482

Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>

PCI: Prevent VPD access for buggy devices

On some devices, reading or writing VPD causes a system panic.
This can be easily reproduced by running "lspci -vvv" or
"cat /sys/bus/devices/XX../vpd".

Blacklist these devices so we don't access VPD data at all.

[bhelgaas: changelog, comment, drop pci/access.c changes]
Link: https://bugzilla.kernel.org/show_bug.cgi?id=110681
Tested-by: Shane Seymour <shane.seymour@hpe.com>
Tested-by: Babu Moger <babu.moger@oracle.com>
Signed-off-by: Babu Moger <babu.moger@oracle.com>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
(cherry picked from commit 7c20078a8197389eead62399419fdc4f8ac4a8a3)

Orabug: 25975482

Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Conflicts:
drivers/pci/quirks.c

target: consolidate backend attribute implementations

Orabug: 25791789

Provide a common sets of dev_attrib attributes for all devices using the
generic SPC/SBC parsers, and a second one with the minimal required read-only
attributes for passthrough devices. The later is only used by pscsi for now,
but will be wired up for the full-passthrough TCMU use case as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
(cherry picked from commit 5873c4d157400ade4052e9d7b6259fa592e1ddbf)
Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Signed-off-by: Shan Hai <shan.hai@oracle.com>

target: simplify backend driver registration

Orabug: 25791789

Rewrite the backend driver registration based on what we did to the fabric
drivers:  introduce a read-only struct target_bakckend_ops that the driver
registers, which is then instanciate as a struct target_backend by the
core.  This allows the ops vector to be smaller and allows us to mark it
const.  At the same time the registration function can set up the
configfs attributes, avoiding the need to add additional boilerplate code
for that to the drivers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
(cherry picked from commit 0a06d4309dc168dfa70cec3cf0cd9eb7fc15a2fd)
Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Signed-off-by: Shan Hai <shan.hai@oracle.com>

x86/tsc: Enumerate SKL cpu_khz and tsc_khz via CPUID

Skylake CPU base-frequency and TSC frequency may differ
by up to 2%.

Enumerate CPU and TSC frequencies separately, allowing
cpu_khz and tsc_khz to differ.

The existing CPU frequency calibration mechanism is unchanged.
However, CPUID extensions are preferred, when available.

CPUID.0x16 is preferred over MSR and timer calibration
for CPU frequency discovery.

CPUID.0x15 takes precedence over CPU-frequency
for TSC frequency discovery.

Signed-off-by: Len Brown <len.brown@intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/b27ec289fd005833b27d694d9c2dbb716c5cdff7.1466138954.git.len.brown@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit aa297292d708e89773b3b2cdcaf33f01bfa095d8)

Orabug: 25948913

Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
Reviewed-by: Shan Hai <shan.hai@oracle.com>

x86/tsc: Save an indentation level in recalibrate_cpu_khz()

... by flipping the check.

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1459837795-2588-5-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit eff4677e9fb9b680d1d5f6ba079116548d072b7e)

Orabug: 25948913

Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
Reviewed-by: Shan Hai <shan.hai@oracle.com>
Conflicts:
arch/x86/kernel/tsc.c

x86/tsc_msr: Remove irqoff around MSR-based TSC enumeration

Remove the irqoff/irqon around MSR-based TSC enumeration,
as it is not necessary.

Also rename: try_msr_calibrate_tsc() to cpu_khz_from_msr(),
as that better describes what the routine does.

Signed-off-by: Len Brown <len.brown@intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/a6b5c3ecd3b068175d2309599ab28163fc34215e.1466138954.git.len.brown@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 02c0cd2dcf7fdc47d054b855b148ea8b82dbb7eb)

Orabug: 25948913

Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
Reviewed-by: Shan Hai <shan.hai@oracle.com>

perf/x86: Fix time_shift in perf_event_mmap_page

Commit:

b20112edeadf ("perf/x86: Improve accuracy of perf/sched clock")

allowed the time_shift value in perf_event_mmap_page to be as much
as 32. Unfortunately the documented algorithms for using time_shift
have it shifting an integer, whereas to work correctly with the value
32, the type must be u64.

In the case of perf tools, Intel PT decodes correctly but the timestamps
that are output (for example by perf script) have lost 32-bits of
granularity so they look like they are not changing at all.

Fix by limiting the shift to 31 and adjusting the multiplier accordingly.

Also update the documentation of perf_event_mmap_page so that new code
based on it will be more future-proof.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Fixes: b20112edeadf ("perf/x86: Improve accuracy of perf/sched clock")
Link: http://lkml.kernel.org/r/1445001845-13688-2-git-send-email-adrian.hunter@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit b9511cd761faafca7a1acc059e792c1399f9d7c6)

Orabug: 25948913

Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
Reviewed-by: Shan Hai <shan.hai@oracle.com>

perf/x86: Improve accuracy of perf/sched clock

When TSC is stable perf/sched clock is based on it.
However the conversion from cycles to nanoseconds
is not as accurate as it could be. Because
CYC2NS_SCALE_FACTOR is 10, the accuracy is +/- 1/2048

The change is to calculate the maximum shift that
results in a multiplier that is still a 32-bit number.
For example all frequencies over 1 GHz will have
a shift of 32, making the accuracy of the conversion
+/- 1/(2^33). That is achieved by using the
'clocks_calc_mult_shift()' function.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/1440147918-22250-1-git-send-email-adrian.hunter@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit b20112edeadf0b8a1416de061caa4beb11539902)

Orabug: 25948913

Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
Reviewed-by: Shan Hai <shan.hai@oracle.com>

x86/apic: Handle zero vector gracefully in clear_vector_irq()

If x86_vector_alloc_irq() fails x86_vector_free_irqs() is invoked to cleanup
the already allocated vectors. This subsequently calls clear_vector_irq().

The failed irq has no vector assigned, which triggers the BUG_ON(!vector) in
clear_vector_irq().

We cannot suppress the call to x86_vector_free_irqs() for the failed
interrupt, because the other data related to this irq must be cleaned up as
well. So calling clear_vector_irq() with vector == 0 is legitimate.

Remove the BUG_ON and return if vector is zero,

[ tglx: Massaged changelog ]

Fixes: b5dc8e6c21e7 "x86/irq: Use hierarchical irqdomain to manage CPU interrupt vectors"
Signed-off-by: Keith Busch <keith.busch@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
(cherry picked from commit 1bdb8970392a68489b469c3a330a1adb5ef61beb)

Orabug: 24515998,25975565

Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Conflicts:
arch/x86/kernel/apic/vector.c

HID: hid-cypress: validate length of report

Make sure we have enough of a report structure to validate before
looking at it.

Orabug: 25795985
CVE: CVE-2017-7273

Reported-by: Benoit Camredon <benoit.camredon@airbus.com>
Tested-by: Benoit Camredon <benoit.camredon@airbus.com>
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
(cherry picked from commit 1ebb71143758f45dc0fa76e2f48429e13b16d110)
Signed-off-by: Aniket Alshi <aniket.alshi@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

Re-enable SDP for uek-nano kernel

SDP is enabled in the UEK config, but not in the uek-nano config.
Sockets Direct Protocol (SDP) is a network protocol which provides
an RDMA accelerated alternative to TCP over infiniband networks.

Orabug: 25999937

Signed-off-by: Ashok Vairavan <ashok.vairavan@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

qla2xxx: Fix NULL pointer deref in QLA interrupt

Orabug: 25908317

In qla24xx_process_response_queue() rsp->msix->cpuid may trigger NULL
pointer dereference when rsp->msix is NULL:

[    5.622457] NULL pointer dereference at 0000000000000050
[    5.622457] IP: [<ffffffff8155e614>] qla24xx_process_response_queue+0x44/0x4b0
[    5.622457] PGD 0
[    5.622457] Oops: 0000 [#1] SMP
[    5.622457] Modules linked in:
[    5.622457] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.6.3-x86_64 #1
[    5.622457] Hardware name: HP ProLiant DL360 G5, BIOS P58 05/02/2011
[    5.622457] task: ffff8801a88f3740 ti: ffff8801a8954000 task.ti: ffff8801a8954000
[    5.622457] RIP: 0010:[<ffffffff8155e614>]  [<ffffffff8155e614>] qla24xx_process_response_queue+0x44/0x4b0
[    5.622457] RSP: 0000:ffff8801afb03de8  EFLAGS: 00010002
[    5.622457] RAX: 0000000000000000 RBX: 0000000000000032 RCX: 00000000ffffffff
[    5.622457] RDX: 0000000000000002 RSI: ffff8801a79bf8c8 RDI: ffff8800c8f7e7c0
[    5.622457] RBP: ffff8801afb03e68 R08: 0000000000000000 R09: 0000000000000000
[    5.622457] R10: 00000000ffff8c47 R11: 0000000000000002 R12: ffff8801a79bf8c8
[    5.622457] R13: ffff8800c8f7e7c0 R14: ffff8800c8f60000 R15: 0000000000018013
[    5.622457] FS:  0000000000000000(0000) GS:ffff8801afb00000(0000) knlGS:0000000000000000
[    5.622457] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    5.622457] CR2: 0000000000000050 CR3: 0000000001e07000 CR4: 00000000000006e0
[    5.622457] Stack:
[    5.622457]  ffff8801afb03e30 ffffffff810c0f2d 0000000000000086 0000000000000002
[    5.622457]  ffff8801afb03e28 ffffffff816570e1 ffff8800c8994628 0000000000000002
[    5.622457]  ffff8801afb03e60 ffffffff816772d4 b47c472ad6955e68 0000000000000032
[    5.622457] Call Trace:
[    5.622457]  <IRQ>
[    5.622457]  [<ffffffff810c0f2d>] ? __wake_up_common+0x4d/0x80
[    5.622457]  [<ffffffff816570e1>] ? usb_hcd_resume_root_hub+0x51/0x60
[    5.622457]  [<ffffffff816772d4>] ? uhci_hub_status_data+0x64/0x240
[    5.622457]  [<ffffffff81560d00>] qla24xx_intr_handler+0xf0/0x2e0
[    5.622457]  [<ffffffff810d569e>] ? get_next_timer_interrupt+0xce/0x200
[    5.622457]  [<ffffffff810c89b4>] handle_irq_event_percpu+0x64/0x100
[    5.622457]  [<ffffffff810c8a77>] handle_irq_event+0x27/0x50
[    5.622457]  [<ffffffff810cb965>] handle_edge_irq+0x65/0x140
[    5.622457]  [<ffffffff8101a498>] handle_irq+0x18/0x30
[    5.622457]  [<ffffffff8101a276>] do_IRQ+0x46/0xd0
[    5.622457]  [<ffffffff817f8fff>] common_interrupt+0x7f/0x7f
[    5.622457]  <EOI>
[    5.622457]  [<ffffffff81020d38>] ? mwait_idle+0x68/0x80
[    5.622457]  [<ffffffff8102114a>] arch_cpu_idle+0xa/0x10
[    5.622457]  [<ffffffff810c1b97>] default_idle_call+0x27/0x30
[    5.622457]  [<ffffffff810c1d3b>] cpu_startup_entry+0x19b/0x230
[    5.622457]  [<ffffffff810324c6>] start_secondary+0x136/0x140
[    5.622457] Code: 00 00 65 48 8b 04 25 28 00 00 00 48 89 45 d0 31 c0 48 8b 47 58 a8 02 0f 84 c5 00 00 00 48 8b 46 50 49 89 f4 65 8b 15 34 bb aa 7e <39> 50 50 74 11 89 50 50 48 8b 46 50 8b 40 50 41 89 86 60 8b 00
[    5.622457] RIP  [<ffffffff8155e614>] qla24xx_process_response_queue+0x44/0x4b0
[    5.622457]  RSP <ffff8801afb03de8>
[    5.622457] CR2: 0000000000000050
[    5.622457] ---[ end trace fa2b19c25106d42b ]---
[    5.622457] Kernel panic - not syncing: Fatal exception in interrupt

The affected code was introduced by commit cdb898c52d1dfad4b4800b83a58b3fe5d352edde
(qla2xxx: Add irq affinity notification).

Only dereference rsp->msix when it has been set so the machine can boot
fine. Possibly rsp->msix is unset because:
[    3.479679] qla2xxx [0000:00:00.0]-0005: : QLogic Fibre Channel HBA Driver: 8.07.00.33-k.
[    3.481839] qla2xxx [0000:13:00.0]-001d: : Found an ISP2432 irq 17 iobase 0xffffc90000038000.
[    3.484081] qla2xxx [0000:13:00.0]-0035:0: MSI-X; Unsupported ISP2432 (0x2, 0x3).
[    3.485804] qla2xxx [0000:13:00.0]-0037:0: Falling back-to MSI mode -258.
[    3.890145] scsi host0: qla2xxx
[    3.891956] qla2xxx [0000:13:00.0]-00fb:0: QLogic QLE2460 - PCI-Express Single Channel 4Gb Fibre Channel HBA.
[    3.894207] qla2xxx [0000:13:00.0]-00fc:0: ISP2432: PCIe (2.5GT/s x4) @ 0000:13:00.0 hdma+ host#=0 fw=7.03.00 (9496).
[    5.714774] qla2xxx [0000:13:00.0]-500a:0: LOOP UP detected (4 Gbps).

Signed-off-by: Bruno Prémont <bonbons@linux-vserver.org>
Acked-by: Quinn Tran <quinn.tran@qlogic.com>
CC: <stable@vger.kernel.org> # 4.5+
Fixes: cdb898c52d1dfad4b4800b83a58b3fe5d352edde
Signed-off-by: James Bottomley <jejb@linux.vnet.ibm.com>
(cherry picked from commit 262e2bfd7d1e1f1ee48b870e5dfabb87c06b975e)
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>

Revert "be2net: fix MAC addr setting on privileged BE3 VFs"

Orabug: 25870303

This reverts commit 08046feba4560c904950e6ad1a7a2c8f908a5100.

Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Brian Maly <brian.maly@oracle.com>

Revert "be2net: fix initial MAC setting"

Orabug: 25802842

This reverts commit e5678ed5e2ff0b38786598c97292ac1d8a6d9943.

Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Brian Maly <brian.maly@oracle.com>

xfs: track and serialize in-flight async buffers against unmount

Orabug: 25550712

Newly allocated XFS metadata buffers are added to the LRU once the hold
count is released, which typically occurs after I/O completion. There is
no other mechanism at current that tracks the existence or I/O state of
a new buffer. Further, readahead I/O tends to be submitted
asynchronously by nature, which means the I/O can remain in flight and
actually complete long after the calling context is gone. This means
that file descriptors or any other holds on the filesystem can be
released, allowing the filesystem to be unmounted while I/O is still in
flight. When I/O completion occurs, core data structures may have been
freed, causing completion to run into invalid memory accesses and likely
to panic.

This problem is reproduced on XFS via directory readahead. A filesystem
is mounted, a directory is opened/closed and the filesystem immediately
unmounted. The open/close cycle triggers a directory readahead that if
delayed long enough, runs buffer I/O completion after the unmount has
completed.

To address this problem, add a mechanism to track all in-flight,
asynchronous buffers using per-cpu counters in the buftarg. The buffer
is accounted on the first I/O submission after the current reference is
acquired and unaccounted once the buffer is returned to the LRU or
freed. Update xfs_wait_buftarg() to wait on all in-flight I/O before
walking the LRU list. Once in-flight I/O has completed and the workqueue
has drained, all new buffers should have been released onto the LRU.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
(cherry picked from commit 9c7504aa72b6e2104ba6dcef518c15672ec51175)
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>

xfs: exclude never-released buffers from buftarg I/O accounting

Orabug: 25550712

The upcoming buftarg I/O accounting mechanism maintains a count of
all buffers that have undergone I/O in the current hold-release
cycle. Certain buffers associated with core infrastructure (e.g.,
the xfs_mount superblock buffer, log buffers) are never released,
however. This means that accounting I/O submission on such buffers
elevates the buftarg count indefinitely and could lead to lockup on
unmount.

Define a new buffer flag to explicitly exclude buffers from buftarg
I/O accounting. Set the flag on the superblock and associated log
buffers.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
(cherry picked from commit c891c30a4dd1a236bb98630b35fc2769c5ce0d40)
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>

dm era: save spacemap metadata root after the pre-commit

Orabug: 25547820

When committing era metadata to disk, it doesn't always save the latest
spacemap metadata root in superblock. Due to this, metadata is getting
corrupted sometimes when reopening the device. The correct order of update
should be, pre-commit (shadows spacemap root), save the spacemap root
(newly shadowed block) to in-core superblock and then the final commit.

Cc: stable@vger.kernel.org
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
(cherry picked from commit 117aceb030307dcd431fdcff87ce988d3016c34a)
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

Btrfs: incremental send, do not issue invalid rmdir operations

Orabug: 26000657

When both the parent and send snapshots have a directory inode with the
same number but different generations (therefore they are different
inodes) and both have an entry with the same name, an incremental send
stream will contain an invalid rmdir operation that refers to the
orphanized name of the inode from the parent snapshot.

The following example scenario shows how this happens.

Parent snapshot:

.
|---- d259_old/               (ino 259, gen 9)
|         |---- d1/           (ino 258, gen 9)
|
|---- f                       (ino 257, gen 9)

Send snapshot:

.
|---- d258/                   (ino 258, gen 7)
|---- d259/                   (ino 259, gen 7)
         |---- d1/             (ino 257, gen 7)

When the kernel is processing inode 258 it notices that in both snapshots
there is an inode numbered 259 that is a parent of an inode 258. However
it ignores the fact that the inodes numbered 259 have different generations
in both snapshots, which means they are effectively different inodes.
Then it checks that both inodes 259 have a dentry named "d1" and because
of that it issues a rmdir operation with orphanized name of the inode 258
from the parent snapshot. This happens at send.c:process_record_refs(),
which calls send.c:did_overwrite_first_ref() that returns true and because
of that later on at process_recorded_refs() such rmdir operation is issued
because the inode being currently processed (258) is a directory and it
was deleted in the send snapshot (and replaced with another inode that has
the same number and is a directory too).
Fix this issue by comparing the generations of parent directory inodes
that have the same number and make send.c:did_overwrite_first_ref() when
the generations are different.

The following steps reproduce the problem.

$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ touch /mnt/f
$ mkdir /mnt/d1
$ mkdir /mnt/d259_old
$ mv /mnt/d1 /mnt/d259_old/d1
$ btrfs subvolume snapshot -r /mnt /mnt/snap1
$ btrfs send /mnt/snap1 -f /tmp/1.snap
$ umount /mnt

$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
$ mkdir /mnt/d1
$ mkdir /mnt/dir258
$ mkdir /mnt/dir259
$ mv /mnt/d1 /mnt/dir259/d1
$ btrfs subvolume snapshot -r /mnt /mnt/snap2
$ btrfs receive /mnt/ -f /tmp/1.snap
# Take note that once the filesystem is created, its current
# generation has value 7 so the inodes from the second snapshot all have
# a generation value of 7. And after receiving the first snapshot
# the filesystem is at a generation value of 10, because the call to
# create the second snapshot bumps the generation to 8 (the snapshot
# creation ioctl does a transaction commit), the receive command calls
# the snapshot creation ioctl to create the first snapshot, which bumps
# the filesystem's generation to 9, and finally when the receive
# operation finishes it calls an ioctl to transition the first snapshot
# (snap1) from RW mode to RO mode, which does another transaction commit
# and bumps the filesystem's generation to 10. This means all the inodes
# in the first snapshot (snap1) have a generation value of 9.
$ rm -f /tmp/1.snap
$ btrfs send /mnt/snap1 -f /tmp/1.snap
$ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.snap
$ umount /mnt

$ mkfs.btrfs -f /dev/sdd
$ mount /dev/sdd /mnt
$ btrfs receive /mnt -f /tmp/1.snap
$ btrfs receive -vv /mnt -f /tmp/2.snap
receiving snapshot mysnap2 uuid=9c03962f-f620-0047-9f98-32e5a87116d9, ctransid=7 parent_uuid=d17a6e3f-14e5-df4f-be39-a7951a5399aa, parent_ctransid=9
utimes
unlink f
mkdir o257-7-0
mkdir o259-7-0
rename o257-7-0 -> o259-7-0/d1
chown o259-7-0/d1 - uid=0, gid=0
chmod o259-7-0/d1 - mode=0755
utimes o259-7-0/d1
rmdir o258-9-0
ERROR: rmdir o258-9-0 failed: No such file or directory

Signed-off-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
[Rewrote changelog to be more precise and clear]
Signed-off-by: Filipe Manana <fdmanana@suse.com>
(cherry picked from commit 0191410158840d6176de3267daa4604ad6a773fb)
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

x86/platform/uv/BAU: Remove __ro_after_init declaration

Orabug: 25920237

The upstream commit 8e3b21b6dbf0 (x86/platform/uv/BAU: Cleanup
bau_operations declaration and instances) has declared the
bau_operations ops instance with __ro_after_init. It can be safely
removed now as "ops" is an already existing variable and also the code that
supports this "read-only after init" feature is not pulled in yet.

The upstream commit c74ba8b3480d (arch: Introduce post-init read-only
memory) and several of its associated commits introduce this feature
from 4.6 kernel. When this feature is pulled in, this commit can be
reverted to make "bau_operations ops" read-only.

Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

x86/platform: Remove warning message for duplicate NMI handlers

Orabug: 25920237

Remove the WARNING message associated with multiple NMI handlers as
there are at least two that are legitimate. These are the KGDB and the
UV handlers and both want to be called if the NMI has not been claimed
by any other NMI handler.

Use of the UNKNOWN NMI call chain dramatically lowers the NMI call rate
when high frequency NMI tools are in use, notably the perf tools. It is
required on systems that cannot sustain a high NMI call rate without
adversely affecting the system operation.

Signed-off-by: Mike Travis <mike.travis@hpe.com>
Reviewed-by: Dimitri Sivanich <dimitri.sivanich@hpe.com>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Russ Anderson <russ.anderson@hpe.com>
Cc: Frank Ramsay <frank.ramsay@hpe.com>
Cc: Tony Ernst <tony.ernst@hpe.com>
Link: http://lkml.kernel.org/r/20170307210841.730959611@asylum.americas.sgi.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
(cherry picked from commit 0d443b70cc92d741cbc1dcbf1079897b3d8bc3cc)
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

x86/platform/uv/BAU: Implement uv4_wait_completion with read_status

Orabug: 25920237

UV4 does not employ a software-timeout as in previous generations so a new
wait_completion routine without this logic is required. Certain completion
statuses require the AUX status bit in addition to ERROR and BUSY.

Add the read_status routine to construct the full completion status. Use
read_status in the uv4_wait_completion routine to handle all possible
completion statuses.

Signed-off-by: Andrew Banman <abanman@hpe.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Mike Travis <mike.travis@hpe.com>
Cc: sivanich@hpe.com
Cc: rja@hpe.com
Cc: akpm@linux-foundation.org
Link: http://lkml.kernel.org/r/1489077734-111753-7-git-send-email-abanman@hpe.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
(cherry picked from commit 2f2a033fb5819c393d65da9b6233e095f3690f15)
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

x86/platform/uv/BAU: Add wait_completion to bau_operations

Orabug: 25920237

Remove the present wait_completion routine and add a function pointer by
the same name to the bau_operations struct. Rather than switching on the
UV hub version during message processing, set the architecture-specific
uv*_wait_completion during initialization.

The uv123_bau_ops struct must be split into uv1 and uv2_3 versions to
accommodate the corresponding wait_completion routines.

Signed-off-by: Andrew Banman <abanman@hpe.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Mike Travis <mike.travis@hpe.com>
Cc: sivanich@hpe.com
Cc: rja@hpe.com
Cc: akpm@linux-foundation.org
Link: http://lkml.kernel.org/r/1489077734-111753-6-git-send-email-abanman@hpe.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
(cherry picked from commit 2620bbbf1f4f187952fb35861f4473860c432728)
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

x86/platform/uv/BAU: Add status mmr location fields to bau_control

Orabug: 25920237

The location of the ERROR and BUSY status bits depends on the descriptor
index, i.e. the CPU, of the message. Since this index does not change,
there is no need to calculate the mmr and index location during message
processing. The less work we do in the hot path the better.

Add status_mmr and status_index fields to bau_control and compute their
values during initialization. Add kerneldoc descriptions for the new
fields. Update uv*_wait_completion to use these fields rather than
receiving the information as parameters.

Signed-off-by: Andrew Banman <abanman@hpe.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Mike Travis <mike.travis@hpe.com>
Cc: sivanich@hpe.com
Cc: rja@hpe.com
Cc: akpm@linux-foundation.org
Link: http://lkml.kernel.org/r/1489077734-111753-5-git-send-email-abanman@hpe.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
(cherry picked from commit dfeb28f068ff9cc4f714c7d1edaf61597ea1768b)
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

x86/platform/uv/BAU: Cleanup bau_operations declaration and instances

Orabug: 25920237

Move the bau_operations declaration after bau struct declarations so the
bau structs can be referenced when adding new functions to
bau_operations. That way we avoid forward declarations of the bau
structs.

Likewise, move uv*_bau_ops structs down to avoid forward declarations of
new functions defined in the same file. Declare these structs __initconst
since they are only used during initialization. Similarly, declare the
bau_operations ops instance __ro_after_init as it is read-only after
initialization.

This is a preparatory patch for adding wait_completion to bau_operations.

Signed-off-by: Andrew Banman <abanman@hpe.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Mike Travis <mike.travis@hpe.com>
Cc: sivanich@hpe.com
Cc: rja@hpe.com
Cc: akpm@linux-foundation.org
Link: http://lkml.kernel.org/r/1489077734-111753-4-git-send-email-abanman@hpe.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
(cherry picked from commit 8e3b21b6dbf0318d5b3a598572acc23f07189c40)
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

x86/platform/uv/BAU: Add payload descriptor qualifier

Orabug: 25920237

On UV4, the destination agent verifies each message by checking the
descriptor qualifier field of the message payload. Messages without this
field set to 0x534749 will cause a hub error to assert. Split
bau_message_payload into uv1_2_3 and uv4 versions to account for the
different payload formats.

Enforce the size of each field by using the appropriate u** integer type.
Replace extraneous comments with KernelDoc comment.

Signed-off-by: Andrew Banman <abanman@hpe.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Mike Travis <mike.travis@hpe.com>
Cc: sivanich@hpe.com
Cc: rja@hpe.com
Cc: akpm@linux-foundation.org
Link: http://lkml.kernel.org/r/1489077734-111753-3-git-send-email-abanman@hpe.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
(cherry picked from commit e9be36443cecda1be20b2cc3b891676ff2af9dff)
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

x86/platform/uv/BAU: Add uv_bau_version enumerated constants

Orabug: 25920237

Define enumerated constants for each UV hub version and replace magic
numbers with the appropriate constant.

Signed-off-by: Andrew Banman <abanman@hpe.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: sivanich@hpe.com
Cc: rja@hpe.com
Cc: mike.travis@hpe.com
Cc: akpm@linux-foundation.org
Link: http://lkml.kernel.org/r/1489077734-111753-2-git-send-email-abanman@hpe.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
(cherry picked from commit 491bd88cdb256cdabd25362b923d94ab80cf72c9)
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

x86/platform/uv/BAU: Fix HUB errors by remove initial write to sw-ack register

Orabug: 25920237

Writing to the software acknowledge clear register when there are no
pending messages causes a HUB error to assert. The original intent of this
write was to clear the pending bits before start of operation, but this is
an incorrect method and has been determined to be unnecessary.

Signed-off-by: Andrew Banman <abanman@hpe.com>
Acked-by: Mike Travis <mike.travis@hpe.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: akpm@linux-foundation.org
Cc: rja@hpe.com
Cc: sivanich@hpe.com
Link: http://lkml.kernel.org/r/1487351269-181133-1-git-send-email-abanman@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 1b17c6df852851b40c3c27c66b8fa2fd99cf25d8)
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

fnic: Fixing sc abts status and flags assignment.

Orabug: 25638880

Signed-off-by: Satish Kharat <satishkh@cisco.com>
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Brian Maly <brian.maly@oracle.com>

fnic: Adding debug IO, Abort latency counter and check condition count to fnic stats

Orabug: 25638880

The IO and Abort latency counter count the time taken to complete
the IO and abort command into broad buckets. It is not intend for
performance measurement, just a debug stat.
current_max_io_time tries to keep tarck of the maximun time an IO
has taken to complete if it > 30sec, it just another debug stat.

Just a simple counter of number of check conditions encountered on
that host.

Signed-off-by: Satish Kharat <satishkh@cisco.com>
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Brian Maly <brian.maly@oracle.com>

fnic: Avoid false out-of-order detection for aborted command

Orabug: 25638880

If SCSI-ML has already issued abort on a command i.e
FNIC_IOREQ_ABTS_PENDING is set and we get a IO completion avoid
this being flagged as out-of-order completion by setting the
FNIC_IO_DONE flag in fnic_fcpio_icmnd_cmpl_handler

Signed-off-by: Satish Kharat <satishkh@cisco.com>
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Brian Maly <brian.maly@oracle.com>

scsi: fnic: Correcting rport check location in fnic_queuecommand_lck

Orabug: 25638880

In fnic_queuecommand_lck the cleaning up the rport check and
handeling when it is null

Signed-off-by: Satish Kharat <satishkh@cisco.com>
[ Upstream commit 6008e96b8107a4e30a97de947bd0fac239836b58 ]
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Brian Maly <brian.maly@oracle.com>

fnic: minor white space changes

Orabug: 25638880

Signed-off-by: Satish Kharat <satishkh@cisco.com>
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Brian Maly <brian.maly@oracle.com>

scsi: fnic: Avoid sending reset to firmware when another reset is in progress

Orabug: 25638880

The fix avoids calling an internal firmware reset when a fw reset
is already in progress.

Signed-off-by: Satish Kharat <satishkh@cisco.com>
[ Upstream commit 9698b6f473555a722bf81a3371998427d5d27bde ]
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Brian Maly <brian.maly@oracle.com>

ovl: Do d_type check only if work dir creation was successful

d_type check requires successful creation of workdir as iterates
through work dir and expects work dir to be present in it. If that's
not the case, this check will always return d_type not supported even
if underlying filesystem might be supporting it.

So don't do this check if work dir creation failed in previous step.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
(cherry picked from commit 21765194cecf2e4514ad75244df459f188140a0f)
Signed-off-by: Divya Indi <divya.indi@oracle.com>
Orabug: 25802620

ovl: Ensure upper filesystem supports d_type

In some instances xfs has been created with ftype=0 and there if a file
on lower fs is removed, overlay leaves a whiteout in upper fs but that
whiteout does not get filtered out and is visible to overlayfs users.

And reason it does not get filtered out because upper filesystem does
not report file type of whiteout as DT_CHR during iterate_dir().

So it seems to be a requirement that upper filesystem support d_type for
overlayfs to work properly. Do this check during mount and fail if d_type
is not supported.

Suggested-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
(cherry picked from commit 45aebeaf4f67468f76bedf62923a576a519a9b68)
Signed-off-by: Divya Indi <divya.indi@oracle.com>
Orabug: 25802620

sparc64: Add hardware capabilities for M8

This commit adds definitions for hardware
capabilities from both the Machine Descriptor and the
Compatibility Feature Register for M8 devices.

Orabug: 25555746

Signed-off-by: Dave Aldridge <david.j.aldridge@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sparc64: Stop performance counter before updating

This commit takes the fix as implemented in commit
b36dd4d8040c and applies it to M8 devices. Original commit
message:

In order to reliably clear the PCRx.ov bit when updating a
performance counter value, we need to stop it counting first.
If we do not do this, then we can miss performance counter
overflow events.

Orabug: 25441707

Signed-off-by: Dave Aldridge <david.j.aldridge@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sparc64: Fix a race condition when stopping performance counters

This commit takes the fix as implemented in commit
e5b7619e1de2 and applies it to M8 devices. Original commit
message:

When stopping a performance counter that is close to overflowing,
there is a race condition that can occur between writing to the
PCRx register to stop the counter (and also clearing the PCRx.ov
bit at the same time) vs the performance counter overflowing and
setting the PCRx.ov bit in the PCRx register.
The result of this race condition is that we occasionally miss
a performance counter overflow interrupt, which in turn leads
to incorrect event counting.
This race condition has been observed when counting cpu cycles.
To fix this issue when stopping a performance counter,
we simply allow it to continue counting and overflow before
stopping it. This allows the performance counter overflow
interrupt to be generated and acted upon.

Orabug: 25441707

Signed-off-by: Dave Aldridge <david.j.aldridge@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

arch/sparc: Use new misaligned load instructions for memcpy and copy_from_user

Use the new instructions for Load Misaligned Integer and Load Misaligned
Integer Alternate space for M8 architecture.

Decide when to use FP or ldm based on the following condition.
In case of FP load/alignaddr logic, there is a fixed overhead of
FP save/restore regardless of memcpy length. But the overhead due to the
ldm instruction grows with the size of memcpy. With our tests noticed
that up to length about 4096, the ldm instructions performs significanty
better than the FP alignaddr/load logic. With that into consideration,
use the new ldm instructions for length of 4096 or less. For lengths
above 4096, we will continue to use FP alignaddr/load logic.

Added the fix noticed crypto key corruption while running AES crypto tests.
This is the same problem reported in NG4memcpy. The commit f4da3628dc7c
("sparc64: Fix FPU register corruption with AES crypto offload.") fixes
the problem. Ported these changes to M8memcpy and verified the fix.

TODO: Encoded the ldmx and ldmxa instruction for now. Our build servers
are not updated with latest M8 instruction set yet. We need to decode it
back to assembly mnemonics when these instructions are available.

Orabug: 25381567

Signed-off-by: Babu Moger <babu.moger@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

arch/sparc: Add a separate kernel memcpy functions for M8

Add a dedicated kernel memory copy functions for M8 architecture.
Use M7memcpy functions from M7 architecture and update affected functions
to take advantage of new Load Misaligned load/store instructions.

Following functions are going to be affected.
memcpy, copy_from_user and copy_to_user.

Following functions will not change. Will remain same as M7.
clear_page, clear_user_page, memset and bzero.

Orabug: 25381567

Signed-off-by: Babu Moger <babu.moger@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sparc64: perf: make sure we do not set the 'picnht' bit in the PCR

Testing with the perf_fuzzer tool revealed an issue when the 'picnht' bit
was set in the Performance Control Register. Setting this bit means that
privileged software cannot access the PIC counter and we eventually end up
with messages of the form shown below, being reported.

NMI watchdog: BUG: soft lockup - CPU#788 stuck for 23s! [perf_fuzzer:336548]

This commit fixes the issue my making sure that the 'picnht' bit
is always reset when writing to the PCR register.

Orabug: 24926097

Signed-off-by: Dave Aldridge <david.j.aldridge@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sparc64: perf: move M7 pmu event definitions to seperate file

Originally the M7 pmu event definitions were in perf_event.c.
This commit moves them to a seperate header file just like the
M8 pmu event definitions.

Orabug: 23333572

Signed-off-by: Dave Aldridge <david.j.aldridge@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sparc64: perf: add perf support for M8 devices

This commit adds perf support for hardware events,
hardware cache events and kernel pmu events for M8
devices.

Orabug: 23333572

Signed-off-by: Dave Aldridge <david.j.aldridge@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sparc64: perf: Fix the mapping between perf events and perf counters

Up until now, perf event[0] has been counted on perf counter[0],
perf event[1] on perf counter[1] etc. However, there is no reason why
this has to be the case. Infact, on SPARC-M8 some events can only be
counted on a sub set of the available perf counters, rather than all
the available perf counters. Whilst adding these perf counter
constraints for the SPARC-M8 it became apparent that some confusion
had crept into the existing code.

Some functions were making the assumption that the perf event index was
equivalent to the perf counter index. This commit fixes these assumptions,
and allows for the case where perf event[0] is not counted on
perf counter[0] etc.

Orabug: 23333572

Signed-off-by: Dave Aldridge <david.j.aldridge@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

SPARC64: Enable IOMMU bypass for IB

This change enables IOMMU bypass for Infiniband devices to workaround
the rds-stress performance issue. The performance slowdown is only
seen with rds-stress test. This test generates rigorous amount of dma
map/unmap calls. Investigation shows that somewhere IOMMU/ATU code
doesn't scale and appears to be a bottleneck.

This is for uek4 only and not for upstream. When IOMMU/ATU issue gets
resolved, this workaround will be reverted.

Orabug: 25573557

Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

SPARC64: Introduce IOMMU BYPASS method

This change adds IOMMU BYPASS HV API that will allow PCIe devices to
bypass IOMMU during DMA map/unmap.

Orabug: 25573557

Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

PCI: Add PCI IDs for Infiniband

This change adds PCI IDs for Infiniband.

Orabug: 25573557

Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sched/fair: Disable the task group load_avg update for the root_task_group

Currently, the update_tg_load_avg() function attempts to update the
tg's load_avg value whenever the load changes even for root_task_group
where the load_avg value will never be used. This patch will disable
the load_avg update when the given task group is the root_task_group.

Running a Java benchmark with noautogroup and a 4.3 kernel on a
16-socket IvyBridge-EX system, the amount of CPU time (as reported by
perf) consumed by task_tick_fair() which includes update_tg_load_avg()
decreased from 0.71% to 0.22%, a more than 3X reduction. The Max-jOPs
results also increased slightly from 983015 to 986449.

Orabug: 25544560

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Cc: Douglas Hatch <doug.hatch@hpe.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott J Norton <scott.norton@hpe.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yuyang Du <yuyang.du@intel.com>
Link: http://lkml.kernel.org/r/1449081710-20185-4-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit aa0b7ae06387d40a988ce16a189082dee6e570bc)

Signed-off-by: Atish Patra <atish.patra@oracle.com>
Acked-by: Vijay Kumar <vijay.ac.kumar@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sched/fair: Move the cache-hot 'load_avg' variable into its own cacheline

If a system with large number of sockets was driven to full
utilization, it was found that the clock tick handling occupied a
rather significant proportion of CPU time when fair group scheduling
and autogroup were enabled.

Running a java benchmark on a 16-socket IvyBridge-EX system, the perf
profile looked like:

  10.52%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
   9.66%   0.05%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
   8.65%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
   8.56%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
   8.07%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
   6.91%   1.78%  java   [kernel.vmlinux]  [k] task_tick_fair
   5.24%   5.04%  java   [kernel.vmlinux]  [k] update_cfs_shares

In particular, the high CPU time consumed by update_cfs_shares()
was mostly due to contention on the cacheline that contained the
task_group's load_avg statistical counter. This cacheline may also
contains variables like shares, cfs_rq & se which are accessed rather
frequently during clock tick processing.

This patch moves the load_avg variable into another cacheline
separated from the other frequently accessed variables. It also
creates a cacheline aligned kmemcache for task_group to make sure
that all the allocated task_group's are cacheline aligned.

By doing so, the perf profile became:

   9.44%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
   8.74%   0.01%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
   7.83%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
   7.74%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
   7.27%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
   5.94%   1.74%  java   [kernel.vmlinux]  [k] task_tick_fair
   4.15%   3.92%  java   [kernel.vmlinux]  [k] update_cfs_shares

The %cpu time is still pretty high, but it is better than before. The
benchmark results before and after the patch was as follows:

  Before patch - Max-jOPs: 907533    Critical-jOps: 134877
  After patch  - Max-jOPs: 916011    Critical-jOps: 142366

Orabug: 25544560

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Douglas Hatch <doug.hatch@hpe.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott J Norton <scott.norton@hpe.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yuyang Du <yuyang.du@intel.com>
Link: http://lkml.kernel.org/r/1449081710-20185-3-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit b0367629acf62a78404c467cd09df447c2fea804)

Conflicts:

kernel/sched/sched.h
Signed-off-by: Atish Patra <atish.patra@oracle.com>
Acked-by: Vijay Kumar <vijay.ac.kumar@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sched/fair: Avoid redundant idle_cpu() call in update_sg_lb_stats()

Part of the responsibility of the update_sg_lb_stats() function is to
update the idle_cpus statistical counter in struct sg_lb_stats. This
check is done by calling idle_cpu(). The idle_cpu() function, in
turn, checks a number of fields within the run queue structure such
as rq->curr and rq->nr_running.

With the current layout of the run queue structure, rq->curr and
rq->nr_running are in separate cachelines. The rq->curr variable is
checked first followed by nr_running. As nr_running is also accessed
by update_sg_lb_stats() earlier, it makes no sense to load another
cacheline when nr_running is not 0 as idle_cpu() will always return
false in this case.

This patch eliminates this redundant cacheline load by checking the
cached nr_running before calling idle_cpu().

Orabug: 25544560

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Douglas Hatch <doug.hatch@hpe.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott J Norton <scott.norton@hpe.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1448478580-26467-2-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit a426f99c91d1036767a7819aaaba6bd3191b7f06)

Signed-off-by: Atish Patra <atish.patra@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sched/fair: Clean up load average references

For cfs_rq, we have load.weight, runnable_load_avg, and load_avg.
Clean up how they are used:

  - First, as group sched_entity already largely uses load_avg, we now expand
    to use load_avg in all cases.

  - Second, for CPU-wide load balancing, we choose to use runnable_load_avg
    in all cases, which is the same as before this series.

[Atish]:
Here is the conflict in this patch.
<<<<<<< HEAD
        unsigned long nr_running = ACCESS_ONCE(rq->cfs.h_nr_running);
        unsigned long load_avg = rq->cfs.avg.load_avg;
=======
        unsigned long nr_running = READ_ONCE(rq->cfs.h_nr_running);
        unsigned long load_avg = weighted_cpuload(cpu);
>>>>>>> 7ea241a... sched/fair: Clean up load average references

ACCESS_ONCE is replaced with READ_ONCE by the upstream patch  316c1608d15c7.
As this patch is not in UEK4, ACCESS_ONCE is kept as it is.

Orabug: 25544560

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: arjan@linux.intel.com
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: fengguang.wu@intel.com
Cc: len.brown@intel.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: rafael.j.wysocki@intel.com
Cc: umgwanakikbuti@gmail.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1436918682-4971-8-git-send-email-yuyang.du@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 7ea241afbf4924c58d41078599f7a32ba49fb985)

Conflicts:

kernel/sched/fair.c

Signed-off-by: Atish Patra <atish.patra@oracle.com>
Acked-by: Vijay Kumar <vijay.ac.kumar@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sched/fair: Provide runnable_load_avg back to cfs_rq

The cfs_rq's load_avg is composed of runnable_load_avg and blocked_load_avg.
Before this series, sometimes the runnable_load_avg is used, and sometimes
the load_avg is used. Completely replacing all uses of runnable_load_avg
with load_avg may be too big a leap, i.e., the blocked_load_avg is concerned
to result in overrated load. Therefore, we get runnable_load_avg back.

The new cfs_rq's runnable_load_avg is improved to be updated with all of the
runnable sched_eneities at the same time, so the one sched_entity updated and
the others stale problem is solved.

Orabug: 25544560

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: arjan@linux.intel.com
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: fengguang.wu@intel.com
Cc: len.brown@intel.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: rafael.j.wysocki@intel.com
Cc: umgwanakikbuti@gmail.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1436918682-4971-7-git-send-email-yuyang.du@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 139622343ef31941effc6de6a5a9320371a00e62)

Signed-off-by: Atish Patra <atish.patra@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sched/fair: Remove task and group entity load when they are dead

When task exits or group is destroyed, the entity's load should be
removed from its parent cfs_rq's load. Otherwise, it will take time
for the parent cfs_rq to decay the dead entity's load to 0, which
is not desired.

Orabug: 25544560

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: arjan@linux.intel.com
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: fengguang.wu@intel.com
Cc: len.brown@intel.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: rafael.j.wysocki@intel.com
Cc: umgwanakikbuti@gmail.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1436918682-4971-6-git-send-email-yuyang.du@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 1269557889b477e3e43ab99a21035ddf8f7cea4d)

Signed-off-by: Atish Patra <atish.patra@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sched/fair: Init cfs_rq's sched_entity load average

The runnable load and utilization averages of cfs_rq's sched_entity
were not initiated. Like done to a task, give new cfs_rq' sched_entity
start values to heavy its load in infant time.

Orabug: 25544560

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: arjan@linux.intel.com
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: fengguang.wu@intel.com
Cc: len.brown@intel.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: rafael.j.wysocki@intel.com
Cc: umgwanakikbuti@gmail.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1436918682-4971-5-git-send-email-yuyang.du@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 540247fb5ddf6d2364f90387fa1f8f428d15e683)

Signed-off-by: Atish Patra <atish.patra@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sched/fair: Implement update_blocked_averages() for CONFIG_FAIR_GROUP_SCHED=n

The load and the utilization of idle CPUs must be updated periodically in
order to decay the blocked part.

If CONFIG_FAIR_GROUP_SCHED is not set, the load and util of idle cpus
are not decayed and stay at the values set before becoming idle.

Orabug: 25544560

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: arjan@linux.intel.com
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: fengguang.wu@intel.com
Cc: len.brown@intel.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: rafael.j.wysocki@intel.com
Cc: umgwanakikbuti@gmail.com
Link: http://lkml.kernel.org/r/1436918682-4971-4-git-send-email-yuyang.du@intel.com
[ Fixed up the SOB chain. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 6c1d47c0827304949e0eb9479f4d587f226fac8b)

Signed-off-by: Atish Patra <atish.patra@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sched/fair: Rewrite runnable load and utilization average tracking

The idea of runnable load average (let runnable time contribute to weight)
was proposed by Paul Turner and Ben Segall, and it is still followed by
this rewrite. This rewrite aims to solve the following issues:

1. cfs_rq's load average (namely runnable_load_avg and blocked_load_avg) is
   updated at the granularity of an entity at a time, which results in the
   cfs_rq's load average is stale or partially updated: at any time, only
   one entity is up to date, all other entities are effectively lagging
   behind. This is undesirable.

   To illustrate, if we have n runnable entities in the cfs_rq, as time
   elapses, they certainly become outdated:

     t0: cfs_rq { e1_old, e2_old, ..., en_old }

   and when we update:

     t1: update e1, then we have cfs_rq { e1_new, e2_old, ..., en_old }

     t2: update e2, then we have cfs_rq { e1_old, e2_new, ..., en_old }

     ...

   We solve this by combining all runnable entities' load averages together
   in cfs_rq's avg, and update the cfs_rq's avg as a whole. This is based
   on the fact that if we regard the update as a function, then:

   w * update(e) = update(w * e) and

   update(e1) + update(e2) = update(e1 + e2), then

   w1 * update(e1) + w2 * update(e2) = update(w1 * e1 + w2 * e2)

   therefore, by this rewrite, we have an entirely updated cfs_rq at the
   time we update it:

     t1: update cfs_rq { e1_new, e2_new, ..., en_new }

     t2: update cfs_rq { e1_new, e2_new, ..., en_new }

     ...

2. cfs_rq's load average is different between top rq->cfs_rq and other
   task_group's per CPU cfs_rqs in whether or not blocked_load_average
   contributes to the load.

   The basic idea behind runnable load average (the same for utilization)
   is that the blocked state is taken into account as opposed to only
   accounting for the currently runnable state. Therefore, the average
   should include both the runnable/running and blocked load averages.
   This rewrite does that.

   In addition, we also combine runnable/running and blocked averages
   of all entities into the cfs_rq's average, and update it together at
   once. This is based on the fact that:

     update(runnable) + update(blocked) = update(runnable + blocked)

   This significantly reduces the code as we don't need to separately
   maintain/update runnable/running load and blocked load.

3. How task_group entities' share is calculated is complex and imprecise.

   We reduce the complexity in this rewrite to allow a very simple rule:
   the task_group's load_avg is aggregated from its per CPU cfs_rqs's
   load_avgs. Then group entity's weight is simply proportional to its
   own cfs_rq's load_avg / task_group's load_avg. To illustrate,

   if a task_group has { cfs_rq1, cfs_rq2, ..., cfs_rqn }, then,

   task_group_avg = cfs_rq1_avg + cfs_rq2_avg + ... + cfs_rqn_avg, then

   cfs_rqx's entity's share = cfs_rqx_avg / task_group_avg * task_group's share

To sum up, this rewrite in principle is equivalent to the current one, but
fixes the issues described above. Turns out, it significantly reduces the
code complexity and hence increases clarity and efficiency. In addition,
the new averages are more smooth/continuous (no spurious spikes and valleys)
and updated more consistently and quickly to reflect the load dynamics.

As a result, we have less load tracking overhead, better performance,
and especially better power efficiency due to more balanced load.

[Atish:]
This patch moved some of the conflicting functions from proc.c to
fair.c and removed the corresponding declarations from sched.h.
The upstream patch 3289bdb42988 moved these functions from proc.c to
fair.c in addition to lot of other code restructure. Porting 3289bdb42988
breaks KABI for UEK4. That's why I have to hack around it.

Orabug: 25544560

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: arjan@linux.intel.com
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: fengguang.wu@intel.com
Cc: len.brown@intel.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: rafael.j.wysocki@intel.com
Cc: umgwanakikbuti@gmail.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1436918682-4971-3-git-send-email-yuyang.du@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 9d89c257dfb9c51a532d69397f6eed75e5168c35)

Conflicts:

kernel/sched/fair.c

Signed-off-by: Atish Patra <atish.patra@oracle.com>
Acked-by: Vijay Kumar <vijay.ac.kumar@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sched/fair: Remove rq's runnable avg

The current rq->avg is not used at all since its merge into the kernel,
and the code is in the scheduler's hot path, so remove it.

Orabug: 25544560

Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: arjan@linux.intel.com
Cc: bsegall@google.com
Cc: fengguang.wu@intel.com
Cc: len.brown@intel.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: rafael.j.wysocki@intel.com
Cc: umgwanakikbuti@gmail.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1436918682-4971-2-git-send-email-yuyang.du@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit cd126afe838d7ea9b971cdea087fd498a7293c7f)

Signed-off-by: Atish Patra <atish.patra@oracle.com>
Acked-by: Vijay Kumar <vijay.ac.kumar@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sparc64: Allow enabling ADI on hugepages only

Since kernel provides only limted support for ADI on regular pages,
mprotect() should not allow enabling ADI on any pages other than
hugepages until rest of the support is complete.

Orabug: 25969377
Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sparc64: Save ADI tags on ADI enabled platforms only

When a page is ready to be swapped out or migrated, kernel saves the
ADI tags associated with it. To determine if ADI tags need to be saved
for a page, it checks if TTE.MCD bit is set for the page. M7 and newer
processors chose to reuse the TTE.CV bit from older processors for
TTE.MCD. When the kernel is running on non-ADI capable processors,
TTE.MCD bit is actually TTE.CV and it is set on older processors by
default for older processors by kernel. This causes kernel to
attempt to save the ADI tags for the page using MCD ASIs which are
not available on older processors. This causes a data access exception.
This patch adds a conditional to save ADI tags only on ADI capable
platforms.

Orabug: 25961592

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sparc64: increase FORCE_MAX_ZONEORDER to 16

Orabug: 25448108

This change enables large tsb sizes of up to 256MB. The performance
improvement is substantial for those programs with an enormous tsb rss.

Signed-off-by: Bob Picco <bob.picco@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sparc64: tsb size expansion

Orabug: 25448108

Some larger applications require a TSB size of magnitude not required by a
vast majority of processes. This commit enables the TSB to be expanded
to a size up to MAX_ORDER - 1, limited by TSB size order encoding and
finally limited by MMU hardware. The large TSB page order allocations are
not included in a kmem cache like current TSB sizes. The improvement is
done for tlb_type hypervisor and limited to recent sun4v. This patch should
not impact performance of other sparc64 core chip types.

Signed-off-by: Bob Picco <bob.picco@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sparc64: make tsb pointer computation symbolic

Orabug: 25448108

Define some symbolic names for tsb/tlb miss trap tsb pointer computation.

Signed-off-by: Bob Picco <bob.picco@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sparc64: fix intermittent LDom hang waiting for vdc_port_up

When LDom boots, sunvdc probes the disk using the LDC channel.
If the channel was previously configured, we need to wait until
we have received a packet from the other end to ensure the
the LDC reset was fully completed; otherwise, the primary domain
will send out NACK when vdc tries to connect. The NACK causes
ldc_reset and flushes the receive queue which may contain the
version info. The unexpected loss of version info causes
the vdc handshake to wait forever.

This patch is to workaround the ldc_reset not handling
reconnection after receiving a NACK. This patch can be
reverted after ldc_reset is properly implemented.

orabug: 25814698

Signed-off-by: Thomas Tai <thomas.tai@oracle.com>
Reviewed-by: Tom Saeger <tom.saeger@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Reviewed-by: Babu Moger <babu.moger@oracle.com>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sparc64:block/sunvdc: Renamed bio variable name from req to bio

Changed bio variable name from req to bio for better readability.
This is just a cosmetic change.

Signed-off-by: Vijay Kumar <vijay.ac.kumar@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Orabug: 25128265
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sparc64:block/sunvdc: Added io stats accounting for bio based vdisk

As vdisk now bypass block layer and directly submits IO, iostats does
not get accounted (which happens at block). Added IO accounting at bio
layer for vdisk block device.

Signed-off-by: Vijay Kumar <vijay.ac.kumar@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Orabug: 25128265
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sparc64: Remove node restriction from PRIQ MSI assignments

The current design of PRIQ only allows MSI routing to CPUs within the same
I/O node. At best this prevents certain types of CPU add/removals as well
as various affinity assignments and at worst can result in dropped
interrupts with CPU DR.

This prior design used a per-node hash table with collision chain (and RCU
synchronization) shared among the root complexes. While it would be
possible to support moving these records between the node hash tables this
is needlessly complicated when a single global table could be used instead
assuming the hashing mechanism was modified to limit the potential for
performance degradation through potentially longer collision chains and
more lookups.

The new design for MSI uses the bottom N bits of the device handle as a
hash versus hashing device handle and msidata together as before. This
exploits the existing numbering scheme to result in zero collisions (even
in an exactly sized hash table). If the numbering scheme is changed on a
future system, that system will still be functional though presumably
with some performance loss. The MSI data is then used as an index into a
array as has been done for prior linux sun4v interrupt model support. Given
the very small number of INTX interrupts, their handling remains
essentially unchanged (now uses a single global hash table, versus per I/O
node as before).

MSI Interrupts from Hypervisor data to ISR:

                                     struct pci_pbm_info *pbms[MAX_PBMS];
                                             +-----------+
                                             |           |
  From Hypervisor:                           |           |
                                             +-----------+
  - root complex devhandle  +---/HASH/------>|           |
  - msidata +--+                             |           |-------+
               |                             +-----------+       |
               |                             |           |       |
               |                             |           |       |
               |                             +-----------+       |
               |                             |           |       |
               |                             |           |       |
               |                             +-----------+       |
               |                                                 |
               |           struct pci_pbm_info                   |
               |              +-----------+<---------------------+
               |              |           |
               |              |    PBM    |
               |              |           |
               |              |           |---+
               |              |           |   |
               |              |           |   |
+--------------+              +-----------+   |
|                                             |
| struct priq_irq **msi_priq                  |
|    +-------------+<-------------------------+
|    |             |
|    |             |
+--->+-------------+       struct priq_irq
     |             |-------->+---------+
     |             |         |         |
     +-------------+         |         |
     |             |         +---------+
     |             |         |   irq   | generic_handle_irq(irq)
     +-------------+         +---------+
     |             |
     |             |
     +-------------+

Signed-off-by: Chris Hyser <chris.hyser@oracle.com>
Reviewed-by Thomas Tai <thomas.tai@oracle.com>

Orabug: 25110748
Signed-off-by: Allen Pais <allen.pais@oracle.com>

blk-mq: Clean up all_q_list on request_queue deletion

Entries were not being removed from all_q_list when corresponding request
queue structs were freed with subsequent embedded list pointers being
trashed when the queue memory was re-allocated. Additionally, due to use of
cached allocation pool (kmem_cache_alloc_node), the same data structure
would be reinserted into the same list twice which cannot work given the
kernel linked list implementation.

Signed-off-by: Chris Hyser <chris.hyser@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Orabug: 25569331
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sparc64: kern_addr_valid regression

Orabug: 25860542

I encountered this bug when using /proc/kcore to examine the kernel. Plus a
coworker inquired about debugging tools. We computed pa but did
not use it during the maximum physical address bits test. Instead we used
the identity mapped virtual address which will always fail this test.

I believe the defect came in here:
[bpicco@zareason linus.git]$ git describe --contains bb4e6e85daa52
v3.18-rc1~87^2~4
.

Signed-off-by: Bob Picco <bob.picco@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

Merge branch topic/uek-4.1/sparc of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

* topic/uek-4.1/sparc: (32 commits)
  sparc64: Detect DAX ra+pgsz when hvapi minor doesn't indicate it
  sparc64: DAX memory will use RA+PGSZ feature in HV
  sparc64: Disable DAX flow control
  sparc64: Add DAX hypervisor services
  sparc64: DAX memory needs persistent mappings
  sparc64: Fix incorrect error print in DAX driver when validating ccb
  sparc64: DAX request for non 4MB memory should return with unique errno
  Revert "sparc64: DAX request for non 4MB memory should return with unique errno"
  sparc64: DAX request to mmap non 4MB memory should fail with a debug print
  sparc64: DAX request for non 4MB memory should return with unique errno
  sparc64: Incorrect print by DAX driver when old driver API is used
  sparc64: DAX request to dequeue half of a long CCB should not succeed
  sparc64: dax_overflow_check reports incorrect data
  sparc64: Ignored DAX ref count causes lockup
  sparc64: disable dax page range checking on RA
  sparc64: Oracle Data Analytics Accelerator (DAX) driver
  sparc64: fix an issue when trying to bring hotplug cpus online
  sparc64: Fix memory corruption when THP is enabled
  sparc64: Fix address range for page table free Orabug: 25704426
  sparc64: Add support for 2G hugepages
  ...

sparc64: Detect DAX ra+pgsz when hvapi minor doesn't indicate it

Orabug: 25911008

The RA+PGSZ HV API feature is detected via a controlled experiment.
The experiment constructs a small DAX request and places the output
buffer at the very end of an 8k page. Then it checks to see if the 8k
page boundary was honored, and if so, then we've got the ability to pass
the page size along with a real address. Once the HV API minor number
is bumped to 1, this will be the primary method of detection.

Signed-off-by: Rob Gardner <rob.gardner@oracle.com>
Reviewed-by: Jonathan Helman <jonathan.helman@oracle.com>
(cherry picked from commit 013d5b9909e804817dcd939f50f242f09237feac)
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sparc64: DAX memory will use RA+PGSZ feature in HV

Orabug: 25911008
Orabug: 25931417

The reported kernel panics and "other oddities" are caused by
corruption of kernel memory by DAX output. This is happening due to an
apparent change between UEK2 and UEK4, whereby the underlying h/w page
size for memory acquired via kmalloc has changed. UEK2 used 4mb pages,
which the dax driver used to limit the output from the coprocessor,
who refuses to cross "page boundaries". But in UEK4 it appears that a
more intelligent approach to memory is used, and kernel memory may be
backed by a variety of huge h/w page sizes, ie, 256mb and 2gb. This
now allows DAX to produce output up to this much larger page size,
thus going beyond the actual allocation. We do not have any way to
kmalloc memory with a certain backing page size, and we cannot feed
DAX a virtual address if we are not certain of its page size.

Recent hypervisor f/w has provided a powerful new feature: the ability
to convey page size bits along with a real address (RA). This gives us
the opportunity to avoid using the TLB/TSB as a parameter passing
mechanism and we can use this to avoid using virtual addresses at all
in a DAX ccb. We now use this mechanism to set the page size for
dax_alloc memory to 4mb even if the underlying h/w page size for the
memory is much larger. Memory allocated by the application via
conventional methods is not affected.

This HV feature is available on M7 f/w with minor number 1, so this is
used to determine if the driver can provide the memory allocation
service. If the feature is not available, DAX will still work, but all
the responsibility for memory allocation falls back to the
application.

The sudden ENOACCESS errors are a result of another hypervisor change.
Newest HV firmware has begun to enforce the privileged flag (bit 14)
in arg2 of the ccb_submit hypercall. This flag is described in the API
wiki as "CCB virtual addresses are treated as privileged" and in the
VM spec as "Virtual addresses within CCBs are translated in privileged
context". The explanation given later in the VM spec is that if a CCB
contains any virtual address whose TTE has the priv bit set
(_PAGE_P_4V), then the priv flag in the ccb_submit api must also be
set, or else the hypervisor will refuse to perform the translation,
and an ENOACCESS error will be thrown. Since privileged virtual
addresses are no longer used as a result of this very commit, this
problem simply disappears.

Signed-off-by: Rob Gardner <rob.gardner@oracle.com>
Reviewed-by: Jonathan Helman <jonathan.helman@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sparc64: Disable DAX flow control

Orabug: 25997202

The flow control feature of DAX only works for output buffers
and it doesn't appear to be possible to calculate the exact
size in bytes of an input buffer. So we have no way to bound
the amount of data that DAX reads as input. This has correctness
and security implications, so until we figure out something better
to do, the temporary workaround is to disable flow control completely
and fall back to 4Mb virtual page backed dax_alloc memory, which
will allow page boundaries to limit the size of all buffers. A new
module parameter "flow_enable" is provided to allow this decision
to be reverted at module load time if needed.

Signed-off-by: Rob Gardner <rob.gardner@oracle.com>
Reviewed-by: Jonathan Helman <jonathan.helman@oracle.com>
Reviewed-by: Sanath Kumar <sanath.s.kumar@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sparc64: Add DAX hypervisor services

This provides the HV API for coprocessor services which
is needed to support a device driver for the DAX.

Orabug: 25996411

Reviewed-by: Bob Picco <bob.picco@oracle.com>
Signed-off-by: Sanath Kumar <sanath.s.kumar@oracle.com>
Signed-off-by: Rob Gardner <rob.gardner@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

Merge branch topic/uek-4.1/upstream-cherry-picks of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

* topic/uek-4.1/upstream-cherry-picks:
KVM: VMX: fix vmwrite to invalid VMCS

KVM: VMX: fix vmwrite to invalid VMCS

fpu_activate is called outside of vcpu_load(), which means it should not
touch VMCS, but fpu_activate needs to.  Avoid the call by moving it to a
point where we know that the guest needs eager FPU and VMCS is loaded.

This will get rid of the following trace

vmwrite error: reg 6800 value 0 (err 1)
  [<ffffffff8162035b>] dump_stack+0x19/0x1b
  [<ffffffffa046c701>] vmwrite_error+0x2c/0x2e [kvm_intel]
  [<ffffffffa045f26f>] vmcs_writel+0x1f/0x30 [kvm_intel]
  [<ffffffffa04617e5>] vmx_fpu_activate.part.61+0x45/0xb0 [kvm_intel]
  [<ffffffffa0461865>] vmx_fpu_activate+0x15/0x20 [kvm_intel]
  [<ffffffffa0560b91>] kvm_arch_vcpu_create+0x51/0x70 [kvm]
  [<ffffffffa0548011>] kvm_vm_ioctl+0x1c1/0x760 [kvm]
  [<ffffffff8118b55a>] ? handle_mm_fault+0x49a/0xec0
  [<ffffffff811e47d5>] do_vfs_ioctl+0x2e5/0x4c0
  [<ffffffff8127abbe>] ? file_has_perm+0xae/0xc0
  [<ffffffff811e4a51>] SyS_ioctl+0xa1/0xc0
  [<ffffffff81630949>] system_call_fastpath+0x16/0x1b

(Note: we also unconditionally activate FPU in vmx_vcpu_reset(), so the
removed code added nothing.)

Fixes: c447e76b4cab ("kvm/fpu: Enable eager restore kvm FPU for MPX")
Cc: <stable@vger.kernel.org>
Reported-by: Vlastimil Holer <vlastimil.holer@gmail.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit 370777daab3f024f1645177039955088e2e9ae73)
OraBug: 25956028
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

Merge branch topic/uek-4.1/dtrace of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

* topic/uek-4.1/dtrace:
dtrace: fix handling of save_stack_trace sentinel (x86 only)
dtrace: DTrace walltime lock-free implementation

Merge branch 'topic/uek-4.1/drivers' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

* topic/uek-4.1/drivers:
Revert "i40e: enable VSI broadcast promiscuous mode instead of adding broadcast filter"

Revert "i40e: enable VSI broadcast promiscuous mode instead of adding broadcast filter"

Orabug: 25877447

This reverts commit 690e746888bf9a307845d00fbac02bcfd2157fa2.
This fixes dhcp on OL7 under KVM.

Signed-off-by: Brian Maly <brian.maly@oracle.com>

sparc64: DAX memory needs persistent mappings

Orabug: 25888596

Memory allocated on behalf of dax_alloc() is mapped by two
distinct MMU translations, one set for userland, which is backed by
regular 8k virtual pages, and another set for feeding to the
hypervisor, and these are kernel virtual addresses backed
by 4Mb pages. These latter translations are only used when
the memory is allocated/deallocated and when a dax transaction
is submitted. So the translations are unlikely to be in the TLB,
and eventually may be evicted from the kernel TSB after some time.
This leads to ENOMAP errors reported by the hypervisor. In order
to avoid this, we "touch" such pages just before dax_submit
which will fault translations into the tlb/tsb if necessary.

A future performance optimization is to take advantage of the "real
address has pagesize" feature which is available in very recent
versions of hypervisor f/w. In this case we can convert the address
in the ccb to a real address with the correct bits to specify
a 4Mb pagesize, and change the address type to real. Then the
memory touch would be unnecessary and the HV would not need to
probe the tlb/tsb at all.

Signed-off-by: Rob Gardner <rob.gardner@oracle.com>
Reviewed-by: HÃ¥kon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Jonathan Helman <jonathan.helman@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sparc64: Fix incorrect error print in DAX driver when validating ccb

Orabug: 25835254

This fixes an incorrect stringification in a macro that prints
invalid address type in a CCB.

Reviewed-by: Rob Gardner <rob.gardner@oracle.com>
Signed-off-by: Sanath Kumar <sanath.s.kumar@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sparc64: DAX request for non 4MB memory should return with unique errno

Orabug:25852910

With this change libdax can detect that mmap failed due to lack of flow
control in DAX HW and it can proceed with dax_subpage allocation.

Reviewed-by: Rob Gardner <rob.gardner@oracle.com>
Signed-off-by: Sanath Kumar <sanath.s.kumar@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>