libnvdimm: support for legacy (non-aliasing) nvdimms

Orabug: 22913653

The libnvdimm region driver is an intermediary driver that translates
non-volatile "region"s into "namespace" sub-devices that are surfaced by
persistent memory block-device drivers (PMEM and BLK).

ACPI 6 introduces the concept that a given nvdimm may simultaneously
offer multiple access modes to its media through direct PMEM load/store
access, or windowed BLK mode. Existing nvdimms mostly implement a PMEM
interface, some offer a BLK-like mode, but never both as ACPI 6 defines.
If an nvdimm is single interfaced, then there is no need for dimm
metadata labels. For these devices we can take the region boundaries
directly to create a child namespace device (nd_namespace_io).

Acked-by: Christoph Hellwig <hch@lst.de>
Tested-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 3d88002e4a7bd40f355550284c6cd140e6fe29dc)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

libnvdimm, nfit: regions (block-data-window, persistent memory, volatile memory)

Orabug: 22913653

A "region" device represents the maximum capacity of a BLK range (mmio
block-data-window(s)), or a PMEM range (DAX-capable persistent memory or
volatile memory), without regard for aliasing.  Aliasing, in the
dimm-local address space (DPA), is resolved by metadata on a dimm to
designate which exclusive interface will access the aliased DPA ranges.
Support for the per-dimm metadata/label arrives in a subsequent
patch.

The name format of "region" devices is "regionN" where, like dimms, N is
a global ida index assigned at discovery time.  This id is not reliable
across reboots nor in the presence of hotplug.  Look to attributes of
the region or static id-data of the sub-namespace to generate a
persistent name.  However, if the platform configuration does not change
it is reasonable to expect the same region id to be assigned at the next
boot.

"region"s have 2 generic attributes "size", and "mapping"s where:
- size: the BLK accessible capacity or the span of the
  system physical address range in the case of PMEM.

- mappingN: a tuple describing a dimm's contribution to the region's
  capacity in the format (<nmemX>,<dpa>,<size>).  For a PMEM-region
  there will be at least one mapping per dimm in the interleave set.  For
  a BLK-region there is only "mapping0" listing the starting DPA of the
  BLK-region and the available DPA capacity of that space (matches "size"
  above).

The max number of mappings per "region" is hard coded per the
constraints of sysfs attribute groups.  That said the number of mappings
per region should never exceed the maximum number of possible dimms in
the system.  If the current number turns out to not be enough then the
"mappings" attribute clarifies how many there are supposed to be. "32
should be enough for anybody...".
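
As an illustration of the "mappingN" tuple described above, the following is
a minimal userspace sketch that splits one tuple into its three fields. The
struct, the function name, and the exact text encoding (comma-separated, no
parentheses, optional trailing newline, base-0 numbers) are assumptions made
for this example and are not taken from the patch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical holder for one "mappingN" tuple: <nmemX>,<dpa>,<size>. */
struct demo_region_mapping {
	char dimm[16];			/* dimm device name, e.g. "nmem0" */
	unsigned long long dpa;		/* starting dimm-physical address */
	unsigned long long size;	/* bytes contributed to the region */
};

/*
 * Parse "nmemX,dpa,size". Returns 0 on success, -1 on malformed input.
 * Numbers are read with base 0 so decimal and 0x-prefixed hex both work;
 * that leniency is an assumption of this sketch.
 */
static int demo_parse_mapping(const char *s, struct demo_region_mapping *m)
{
	const char *c1, *c2;
	char *end;
	size_t len;

	assert(s && m);
	c1 = strchr(s, ',');
	c2 = c1 ? strchr(c1 + 1, ',') : NULL;
	if (!c1 || !c2)
		return -1;

	/* Field 1: dimm name, must be non-empty and fit the buffer. */
	len = (size_t)(c1 - s);
	if (len == 0 || len >= sizeof(m->dimm))
		return -1;
	memcpy(m->dimm, s, len);
	m->dimm[len] = '\0';

	/* Field 2: dpa, must end exactly at the second comma. */
	m->dpa = strtoull(c1 + 1, &end, 0);
	if (end == c1 + 1 || end != c2)
		return -1;

	/* Field 3: size, must run to end of string or a newline. */
	m->size = strtoull(c2 + 1, &end, 0);
	if (end == c2 + 1 || (*end != '\0' && *end != '\n'))
		return -1;
	return 0;
}
```

For a BLK-region, the "size" field parsed from "mapping0" would be expected
to match the region's own "size" attribute, per the description above.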

Cc: Neil Brown <neilb@suse.de>
Cc: <linux-acpi@vger.kernel.org>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Robert Moore <robert.moore@intel.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 1f7df6f88b9245a7f2d0f8ecbc97dc88c8d0d8e1)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

libnvdimm, nvdimm: dimm driver and base libnvdimm device-driver infrastructure

Orabug: 22913653

* Implement the device-model infrastructure for loading modules and
  attaching drivers to nvdimm devices.  This is a simple association of a
  nd-device-type number with a driver that has a bitmask of supported
  device types.  To facilitate userspace bind/unbind operations 'modalias'
  and 'devtype', that also appear in the uevent, are added as generic
  sysfs attributes for all nvdimm devices.  The reason for the device-type
  number is to support sub-types within a given parent devtype, be it a
  vendor-specific sub-type or otherwise.

* The first consumer of this infrastructure is the driver
  for dimm devices.  It simply uses control messages to retrieve and
  store the configuration-data image (label set) from each dimm.

Note: nd_device_register() arranges for asynchronous registration of
      nvdimm bus devices by default.
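
The type-number/bitmask association from the first bullet can be sketched
in plain C as follows. This is only a model of the idea: the type numbers,
the struct, and the function name are hypothetical and do not reproduce the
kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical device-type numbers, chosen for this example only. */
enum {
	DEMO_TYPE_DIMM = 1,
	DEMO_TYPE_REGION_PMEM = 2,
	DEMO_TYPE_REGION_BLK = 3,
};

struct demo_driver {
	const char *name;
	unsigned long type_mask;	/* bit N set => handles device type N */
};

/* A driver matches a device when the bit for the device's type is set. */
static bool demo_driver_matches(const struct demo_driver *drv, int dev_type)
{
	assert(drv);
	assert(dev_type >= 0 && dev_type < (int)(8 * sizeof(unsigned long)));
	return (drv->type_mask >> dev_type) & 1UL;
}
```

A single driver can claim several sub-types by setting more than one bit in
its mask, which is the flexibility the bullet above describes.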

Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Neil Brown <neilb@suse.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Tested-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 4d88a97aa9e8cfa6460aab119c5da60ad2267423)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

libnvdimm: control (ioctl) messages for nvdimm_bus and nvdimm devices

Orabug: 22913653

Most discovery/configuration of the nvdimm-subsystem is done via sysfs
attributes.  However, some nvdimm_bus instances, particularly the
ACPI.NFIT bus, define a small set of messages that can be passed to the
platform.  For convenience we derive the initial libnvdimm-ioctl command
formats directly from the NFIT DSM Interface Example formats.

    ND_CMD_SMART: media health and diagnostics
    ND_CMD_GET_CONFIG_SIZE: size of the label space
    ND_CMD_GET_CONFIG_DATA: read label space
    ND_CMD_SET_CONFIG_DATA: write label space
    ND_CMD_VENDOR: vendor-specific command passthrough
    ND_CMD_ARS_CAP: report address-range-scrubbing capabilities
    ND_CMD_ARS_START: initiate scrubbing
    ND_CMD_ARS_STATUS: report on scrubbing state
    ND_CMD_SMART_THRESHOLD: configure alarm thresholds for smart events

If a platform later defines different commands than this set it is
straightforward to extend support to those formats.

Most of the commands target a specific dimm.  However, the
address-range-scrubbing commands target the bus.  The 'commands'
attribute in sysfs of an nvdimm_bus, or nvdimm, enumerates the supported
commands for that object.
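
A small sketch of the bus-versus-dimm split described above, keyed purely on
the command names listed in this message. The helper name is hypothetical,
and classifying by name prefix is a shortcut for illustration, not how the
kernel dispatches commands.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Returns true for the address-range-scrubbing commands (ND_CMD_ARS_*),
 * which target the bus. Every other command in the list above targets a
 * specific dimm.
 */
static bool demo_nd_cmd_targets_bus(const char *cmd_name)
{
	static const char ars_prefix[] = "ND_CMD_ARS_";

	assert(cmd_name);
	return strncmp(cmd_name, ars_prefix, sizeof(ars_prefix) - 1) == 0;
}
```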

Cc: <linux-acpi@vger.kernel.org>
Cc: Robert Moore <robert.moore@intel.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reported-by: Nicholas Moulin <nicholas.w.moulin@linux.intel.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 62232e45f4a265abb43f0acf16e58f5d0b6e1ec9)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

libnvdimm, nfit: dimm/memory-devices

Orabug: 22913653

Enable nvdimm devices to be registered on a nvdimm_bus. The kernel
assigned device id for nvdimm devices is dynamic. If userspace needs a
more static identifier it should consult a provider-specific attribute.
In the case where NFIT is the provider, the 'nmemX/nfit/handle' or
'nmemX/nfit/serial' attributes may be used for this purpose.

Cc: Neil Brown <neilb@suse.de>
Cc: <linux-acpi@vger.kernel.org>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Robert Moore <robert.moore@intel.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit e6dfb2de47768efe8cc37c9a1863d2aff81440fb)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

libnvdimm: control character device and nvdimm_bus sysfs attributes

Orabug: 22913653

The control device for a nvdimm_bus is registered as an "nd" class
device.  The expectation is that there will usually only be one "nd" bus
registered under /sys/class/nd.  However, we allow for the possibility
of multiple buses and they will be listed in discovery order as
ndctl0...ndctlN.  This character device hosts the ioctl for passing
control messages.  The initial command set has a 1:1 correlation with
the commands listed by the "NFIT DSM Example" document [1], but
this scheme is extensible to future command sets.

Note, nd_ioctl() and the backing ->ndctl() implementation are defined in
a subsequent patch.  This is simply the initial registrations and sysfs
attributes.

[1]: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf

Cc: Neil Brown <neilb@suse.de>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: <linux-acpi@vger.kernel.org>
Cc: Robert Moore <robert.moore@intel.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 45def22c1fab85764646746ce38d45b2f3281fa5)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

libnvdimm, nfit: initial libnvdimm infrastructure and NFIT support

Orabug: 22913653

A struct nvdimm_bus is the anchor device for registering nvdimm
resources and interfaces, for example, a character control device,
nvdimm devices, and I/O region devices. The ACPI NFIT (NVDIMM Firmware
Interface Table) is one possible platform description for such
non-volatile memory resources in a system. The nfit.ko driver attaches
to the "ACPI0012" device that indicates the presence of the NFIT and
parses the table to register a struct nvdimm_bus instance.

Cc: <linux-acpi@vger.kernel.org>
Cc: Lv Zheng <lv.zheng@intel.com>
Cc: Robert Moore <robert.moore@intel.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit b94d5230d06eb930be82e67fb1a9a58271e78297)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

e820, efi: add ACPI 6.0 persistent memory types

Orabug: 22913653

ACPI 6.0 formalizes e820-type-7 and efi-type-14 as persistent memory.
Mark it "reserved" and allow it to be claimed by a persistent memory
device driver.

This definition is in addition to the Linux kernel's existing type-12
definition that was recently added in support of shipping platforms with
NVDIMM support that predate ACPI 6.0 (which now classifies type-12 as
OEM reserved).

Note, /proc/iomem can be consulted for differentiating legacy
"Persistent Memory (legacy)" E820_PRAM vs standard "Persistent Memory"
E820_PMEM.
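
The relationship between the two type numbers and the /proc/iomem names
mentioned above can be summarized in a short sketch. The DEMO_ constants
and the function are hypothetical stand-ins; only the numeric values (7 and
12) and the two name strings come from this message.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for E820_PMEM (type 7) and E820_PRAM (type 12). */
enum {
	DEMO_E820_PMEM = 7,	/* ACPI 6.0 persistent memory */
	DEMO_E820_PRAM = 12,	/* pre-ACPI-6.0 legacy persistent memory */
};

/* Name expected in /proc/iomem for a type, or NULL if not persistent. */
static const char *demo_pmem_iomem_name(int e820_type)
{
	assert(e820_type >= 0);
	switch (e820_type) {
	case DEMO_E820_PMEM:
		return "Persistent Memory";
	case DEMO_E820_PRAM:
		return "Persistent Memory (legacy)";
	default:
		return NULL;
	}
}
```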

Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Andy Lutomirski <luto@amacapital.net>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Tested-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit ad5fb870c486d932a1749d7853dd70f436a7e03f)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

ACPICA: Fix for ill-formed GUID strings for NFIT tables.

Orabug: 22913653

ACPICA commit 60052949ba2aa7377106870da69b237193d10dc1

Error in transcription from the ACPI spec.

Link: https://github.com/acpica/acpica/commit/60052949
Signed-off-by: Bob Moore <robert.moore@intel.com>
Signed-off-by: Lv Zheng <lv.zheng@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit f3b6ced236259a87829b829e8e542ff53bfb9a4f)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

ACPICA: acpihelp: Update for new NFIT table GUIDs.

Orabug: 22913653

ACPICA commit 83727bed8f715685a63a9f668e73c60496a06054

Add original UUIDs/GUIDs to the acuuid.h file.
Cleanup acpihelp output for UUIDs/GUIDs.

Link: https://github.com/acpica/acpica/commit/83727bed
Signed-off-by: Bob Moore <robert.moore@intel.com>
Signed-off-by: Lv Zheng <lv.zheng@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit 6c0d14680e247849cdb870c995a332781bdb93f2)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

ACPICA: ACPI 6.0: Add support for NFIT table.

Orabug: 22913653

ACPICA commit e4e17ca361373e9b81494bb4ca697a12cef3cba6

NVDIMM Firmware Interface Table.

Link: https://github.com/acpica/acpica/commit/e4e17ca3
Signed-off-by: Bob Moore <robert.moore@intel.com>
Signed-off-by: Lv Zheng <lv.zheng@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit 04f8e38497b02cd5596ff9af278e62cd057fff68)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

drivers/block/pmem: Map NVDIMM in Write-Through mode

Orabug: 22913653

The pmem driver maps NVDIMM uncacheable so that we don't lose
data which hasn't reached non-volatile storage in the case of a
crash. Change this to Write-Through mode which provides uncached
writes but cached reads, thus improving read performance.

Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Elliott@hp.com
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: arnd@arndb.de
Cc: hch@lst.de
Cc: hmh@hmh.eng.br
Cc: jgross@suse.com
Cc: konrad.wilk@oracle.com
Cc: linux-mm <linux-mm@kvack.org>
Cc: linux-nvdimm@lists.01.org
Cc: stefan.bader@canonical.com
Cc: yigal@plexistor.com
Link: http://lkml.kernel.org/r/1433436928-31903-14-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 957561ec0fa8a701f60ca6a0f40cc46f5c554920)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

x86/mm, asm-generic: Add ioremap_wt() for creating Write-Through mappings

Orabug: 22913653

Add ioremap_wt() for creating Write-Through mappings on x86. It
follows the same model as ioremap_wc() for multi-arch support.
Define ARCH_HAS_IOREMAP_WT in the x86 version of io.h to
indicate that ioremap_wt() is implemented on x86.

Also update the PAT documentation file to cover ioremap_wt().

Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Elliott@hp.com
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: arnd@arndb.de
Cc: hch@lst.de
Cc: hmh@hmh.eng.br
Cc: jgross@suse.com
Cc: konrad.wilk@oracle.com
Cc: linux-mm <linux-mm@kvack.org>
Cc: linux-nvdimm@lists.01.org
Cc: stefan.bader@canonical.com
Cc: yigal@plexistor.com
Link: http://lkml.kernel.org/r/1433436928-31903-8-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit d838270e2516db11084bed4e294017eb7b646a75)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

Merge branch topic/uek-4.1/upstream-cherry-picks of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

perf/x86/intel/cqm: Do not access cpu_data() from CPU_UP_PREPARE handler

Tony reports that booting his 144-cpu machine with maxcpus=10 triggers
the following WARN_ON():

[   21.045727] WARNING: CPU: 8 PID: 647 at arch/x86/kernel/cpu/perf_event_intel_cqm.c:1267 intel_cqm_cpu_prepare+0x75/0x90()
[   21.045744] CPU: 8 PID: 647 Comm: systemd-udevd Not tainted 4.2.0-rc4 #1
[   21.045745] Hardware name: Intel Corporation BRICKLAND/BRICKLAND, BIOS BRHSXSD1.86B.0066.R00.1506021730 06/02/2015
[   21.045747]  0000000000000000 0000000082771b09 ffff880856333ba8 ffffffff81669b67
[   21.045748]  0000000000000000 0000000000000000 ffff880856333be8 ffffffff8107b02a
[   21.045750]  ffff88085b789800 ffff88085f68a020 ffffffff819e2470 000000000000000a
[   21.045750] Call Trace:
[   21.045757]  [<ffffffff81669b67>] dump_stack+0x45/0x57
[   21.045759]  [<ffffffff8107b02a>] warn_slowpath_common+0x8a/0xc0
[   21.045761]  [<ffffffff8107b15a>] warn_slowpath_null+0x1a/0x20
[   21.045762]  [<ffffffff81036725>] intel_cqm_cpu_prepare+0x75/0x90
[   21.045764]  [<ffffffff81036872>] intel_cqm_cpu_notifier+0x42/0x160
[   21.045767]  [<ffffffff8109a33d>] notifier_call_chain+0x4d/0x80
[   21.045769]  [<ffffffff8109a44e>] __raw_notifier_call_chain+0xe/0x10
[   21.045770]  [<ffffffff8107b538>] _cpu_up+0xe8/0x190
[   21.045771]  [<ffffffff8107b65a>] cpu_up+0x7a/0xa0
[   21.045774]  [<ffffffff8165e920>] cpu_subsys_online+0x40/0x90
[   21.045777]  [<ffffffff81433b37>] device_online+0x67/0x90
[   21.045778]  [<ffffffff81433bea>] online_store+0x8a/0xa0
[   21.045782]  [<ffffffff81430e78>] dev_attr_store+0x18/0x30
[   21.045785]  [<ffffffff8126b6ba>] sysfs_kf_write+0x3a/0x50
[   21.045786]  [<ffffffff8126ad40>] kernfs_fop_write+0x120/0x170
[   21.045789]  [<ffffffff811f0b77>] __vfs_write+0x37/0x100
[   21.045791]  [<ffffffff811f38b8>] ? __sb_start_write+0x58/0x110
[   21.045795]  [<ffffffff81296d2d>] ? security_file_permission+0x3d/0xc0
[   21.045796]  [<ffffffff811f1279>] vfs_write+0xa9/0x190
[   21.045797]  [<ffffffff811f2075>] SyS_write+0x55/0xc0
[   21.045800]  [<ffffffff81067300>] ? do_page_fault+0x30/0x80
[   21.045804]  [<ffffffff816709ae>] entry_SYSCALL_64_fastpath+0x12/0x71
[   21.045805] ---[ end trace fe228b836d8af405 ]---

The root cause is that CPU_UP_PREPARE is completely the wrong notifier
action from which to access cpu_data(), because smp_store_cpu_info()
won't have been executed by the target CPU at that point, which in turn
means that ->x86_cache_max_rmid and ->x86_cache_occ_scale haven't been
filled out.

Instead let's invoke our handler from CPU_STARTING and rename it
appropriately.

Reported-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Kanaka Juvva <kanaka.d.juvva@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vikas Shivappa <vikas.shivappa@intel.com>
Link: http://lkml.kernel.org/r/1438863163-14083-1-git-send-email-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Orabug: 24745516
(cherry picked from commit d7a702f0b1033cf402fef65bd6395072738f0844)
Acked-by: Chuck Anderson <chuck.anderson@oracle.com>

Merge branch topic/uek-4.1/kernel-generic of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

Merge branch 'topic/uek-4.1/drivers' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

Allow mce to reset instead of panic on UE

The intent of this patch is to ensure that the mce stack is not
put in the panic stack trace when the kernel reboots due to an
Uncorrectable Error (UE). The mce stack in the panic trace confuses
the administrator and falsely implicates the mce module as the culprit.
Hence a synchronization flag is added so that the system is restarted
via machine restart when it experiences an uncorrectable error.

Orabug: 24745271

Signed-off-by: Ashok Vairavan <ashok.vairavan@oracle.com>
Reviewed-by: Dan Duval <dan.duval@oracle.com>

mptsas: add TUR with retries to ensure LUNs complete initialization

Orabug: 24745062

Earlier versions of the mptsas driver included a mechanism for
executing, and if necessary retrying, SCSI TEST UNIT READY commands
to ensure that devices complete their initialization during device
discovery. This functionality, present in UEK2, was never sent
upstream, and was lost when UEK4 was initiated.

We have been seeing flash devices returning errors, or simply
disappearing, during alter cell validate configuration operations
on Exadata systems. Giving the flash disks time to initialize
after (re-) discovery appears to resolve this issue.
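
The retry idea can be sketched generically as below. This is not the mptsas
code: the callback type, the toy device, and the attempt limit are all
hypothetical, and a real driver would also sleep or back off between polls
and inspect sense data.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical callback: issue one TEST UNIT READY, true if ready. */
typedef bool (*demo_tur_fn)(void *dev);

/*
 * Poll the device until it reports ready or max_tries is exhausted.
 * Returns the attempt number that succeeded, or -1 on giving up.
 */
static int demo_wait_unit_ready(demo_tur_fn tur, void *dev, int max_tries)
{
	assert(tur && max_tries > 0);
	for (int attempt = 1; attempt <= max_tries; attempt++) {
		if (tur(dev))
			return attempt;
		/* A real driver would delay here before retrying. */
	}
	return -1;
}

/* Toy device for demonstration: not ready for 'busy_polls' polls. */
struct demo_dev {
	int busy_polls;
};

static bool demo_fake_tur(void *dev)
{
	struct demo_dev *d = dev;

	if (d->busy_polls > 0) {
		d->busy_polls--;
		return false;
	}
	return true;
}
```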

This commit simply restores the missing functionality.

Signed-off-by: Dan Duval <dan.duval@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>

Merge branch topic/uek-4.1/sparc of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

Merge branch 'topic/uek-4.1/drivers' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

nvme: refactor nvme_queue_rq

This "backports" the structure I've used for the fabrics driver. It
mostly started out as a cleanup so that I could actually understand
the code, but I think it also qualifies as a micro-optimization due
to the reduced time we hold q_lock and disable interrupts.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Orabug: 24691685
mainline commit ba1ca37ea4e320c108c356eb8c91ac652afc57dd
Conflicts:
Adding GFP_ATOMIC to nvme_setup_prps and replacing
REQ_TYPE_DRV_PRIV with REQ_TYPE_SPECIAL

Signed-off-by: Ashok Vairavan <ashok.vairavan@oracle.com>

Merge branch topic/uek-4.1/rpm-build of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

Merge branch topic/uek-4.1/kernel-generic of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

Merge branch topic/uek-4.1/upstream-cherry-picks of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

uek-rpm nano: fix permissions on mod-sign.sh and find-provides

Orabug: 24691953

uek-rpm/ol6-nano/mod-sign.sh and uek-rpm/ol6-nano/find-provides need to be 0755.

Signed-off-by: Ashok Vairavan <ashok.vairavan@oracle.com>

uek-rpm nano: modify uek-rpm/ol6-nano/ files for ueknano builds v1

Orabug: 24691953

Modify uek-rpm/ol6-nano/ files for ueknano builds to minimize the size
of the UEK kernel RPM.

Signed-off-by: Ashok Vairavan <ashok.vairavan@oracle.com>

uek-rpm framework for ol6-nano builds.

Orabug: 24691953

ueknano is a stripped-down version of uek-4.1. It has only the
modules needed for Exadata systems. The reason to spin off
nano kernels is to reduce the size of Exadata kernels.

Create files in uek-rpm/ol6-nano/

Signed-off-by: Ashok Vairavan <ashok.vairavan@oracle.com>

mm, hugetlb: fix huge_pte_alloc BUG_ON

Zhong Jiang has reported a BUG_ON from huge_pte_alloc hitting when he
runs his database load with memory online and offline running in
parallel.  The reason is that huge_pmd_share might detect a shared pmd
which is currently migrated and so it has migration pte which is
!pte_huge.

There doesn't seem to be any easy way to prevent the race and in
fact seeing the migration swap entry is not harmful.  Both callers of
huge_pte_alloc are prepared to handle them.  copy_hugetlb_page_range
will copy the swap entry and make it COW if needed.  hugetlb_fault will
back off and so the page fault is retried if the page is still under
migration and waits for its completion in hugetlb_fault.

That means that the BUG_ON is wrong and we should update it.  Let's
simply check that all present ptes are pte_huge instead.

Link: http://lkml.kernel.org/r/20160721074340.GA26398@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: zhongjiang <zhongjiang@huawei.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Orabug: 24691289
(cherry picked from commit 4e666314d286765a9e61818b488c7372326654ec)
Acked-by: Chuck Anderson <chuck.anderson@oracle.com>

mm: fix the page_swap_info BUG_ON check

'commit 62c230bc1790 ("mm: add support for a filesystem to activate swap
files and use direct_IO for writing swap pages")' replaced swap_aops
dirty hook from __set_page_dirty_no_writeback() to swap_set_page_dirty().
As such for normal cases without these special SWP flags
code path falls back to __set_page_dirty_no_writeback()
so behaviour is expected to be same as before.

But swap_set_page_dirty() makes use of helper page_swap_info() to
get sis(swap_info_struct) to check for the flags like SWP_FILE,
SWP_BLKDEV etc as desired for those features. This helper has
BUG_ON(!PageSwapCache(page)) which is racy and safe only for
set_page_dirty_lock() path. For set_page_dirty() path which is
often needed for cases to be called from irq context, kswapd()
can toggle the flag behind the back while the call is
getting executed when system is low on memory and heavy
swapping is ongoing.

This ends up with undesired kernel panic. Patch just moves
the check outside the helper to its users appropriately
to fix kernel panic for the described path. Couple
of users of helpers already take care of SwapCache
condition so I skipped them.

Thanks to Wengang for extensive debug using vm cores
and Avinash for his thoughts about the issue.

Orabug: 24661696

Reviewed-by: Avinash Repaka <avinash.repaka@oracle.com>
Reviewed-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

sparc64: Fix PMD check during page table walk

Currently check for PMD_HUGE during page table
walk uses incorrect instruction sequence:

be,pt %xcc, 700f;
andcc REG1, REG2, %g0;

This sequence is incorrect since branch decision is
made *before* 'andcc' in the delay slot is executed.

Orabug: 24353511

Signed-off-by: Nitin Gupta <nitin.m.gupta@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

vldc driver: provide kernel driver interfaces1

Orabug: 24601126

Forward port 22804422 to UEK4-QU3 - VLDC driver should expose
services...

Signed-off-by: Aaron Young <aaron.young@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sparc64: Fix sentinel page table entry for 16G

Currently no page table trimming is done for 16G pages
so _PAGE_PMD_HUGE must not be set for 16G. Also, for
this size, trimming would be done at PUD level, so
this flag should not be set anyways.

Orabug: 24353511

Signed-off-by: Nitin Gupta <nitin.m.gupta@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sparc64: Trim page tables for 2G pages

Currently mapping a 2G page requires 256*1024 PTE entries.
This results in large amounts of RAM to be used just for
storing page tables. We now use 256 PMD entries to map a
2G page which is much more space efficient.
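
The entry counts quoted above follow from simple division, as this sketch
shows. The 8K base page size is inferred from the 256*1024 figure and the
8M span per PMD entry from the 256 figure; the names are hypothetical.

```c
#include <assert.h>

#define DEMO_BASE_PAGE_SIZE	(8ULL << 10)	/* 8K mapped per PTE */
#define DEMO_PMD_SPAN		(8ULL << 20)	/* 8M mapped per PMD entry */
#define DEMO_HUGE_2G		(2ULL << 30)

/* Number of table entries needed to map 'page_size' in units of 'span'. */
static unsigned long long demo_entries_needed(unsigned long long page_size,
					      unsigned long long span)
{
	assert(span != 0 && page_size % span == 0);
	return page_size / span;
}
```

With these values, demo_entries_needed(DEMO_HUGE_2G, DEMO_BASE_PAGE_SIZE)
gives 262144 (256*1024) and demo_entries_needed(DEMO_HUGE_2G, DEMO_PMD_SPAN)
gives 256, a 1024-fold reduction in entries.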

Orabug: 23109070

Signed-off-by: Nitin Gupta <nitin.m.gupta@oracle.com>
(cherry picked from commit d3c88b8f27645c14cbb220570e5945abb0989d19)
(cherry picked from commit 768096d7916fefc497f397b0675455a754ee8a5b)
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sparc64: Trim page tables at PMD for hugepages

For PMD aligned (8M) hugepages, we currently allocate
all four page table levels which is wasteful. We now
allocate till PMD level only which saves memory usage
from page tables.

Orabug: 22630259

Signed-off-by: Nitin Gupta <nitin.m.gupta@oracle.com>
(cherry picked from commit 5d2c7930a4d3bf3ca560048052d638d7efa67e36)
(cherry picked from commit abefebd73e204979661a818ac31cf455d110a672)
Signed-off-by: Allen Pais <allen.pais@oracle.com>

vcc driver fixes

Orabug:24319080 - hang on a mutex out of vcc_open()
Orabug:24326005 - UEK4 kernel panic tty_ldisc_flush vcc_close

Signed-off-by: Aaron Young <aaron.young@oracle.com>
Reviewed-By: Bijan Mottahedeh <Bijan.Mottahedeh@oracle.com>
Reviewed-By: Liam Merwick <Liam.Merwick@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

LDOMS DOMAIN SERVICES UPDATE 5

Orabug: 24601099

Signed-off-by: Aaron Young <aaron.young@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sparc64: Support reserving memory with memmap=xxx$yyy

The kernel commandline parameter memmap= was supported
on several other architectures but not on SPARC (it was
being ignored on SPARC).

Add support for the memmap=xxx$yyy commandline parameter
(sparc64/UEK4 only). The patch is based on the existing
code for the "tile" architecture.

There are other types of memmap= commandlines which
are only supported on x86 that are e820-specific.
These were not implemented.
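
For reference, the memmap=xxx$yyy form reserves xxx bytes starting at
address yyy. The following userspace sketch parses that form; it is not the
patch's code, the function names are hypothetical, and the K/M/G suffix
handling is a simplified imitation of the kernel's memparse().

```c
#include <assert.h>
#include <stdlib.h>

/* Parse a number with an optional K/M/G suffix; *end is left at s if
 * no digits were found. */
static unsigned long long demo_memparse(const char *s, char **end)
{
	unsigned long long v = strtoull(s, end, 0);

	if (*end == s)
		return 0;
	switch (**end) {
	case 'G': case 'g':
		v <<= 30; (*end)++; break;
	case 'M': case 'm':
		v <<= 20; (*end)++; break;
	case 'K': case 'k':
		v <<= 10; (*end)++; break;
	default:
		break;
	}
	return v;
}

/* Parse "size$start". Returns 0 on success, -1 for any other form. */
static int demo_parse_memmap_reserve(const char *arg,
				     unsigned long long *size,
				     unsigned long long *start)
{
	const char *p;
	char *end;

	assert(arg && size && start);
	*size = demo_memparse(arg, &end);
	if (end == arg || *end != '$')
		return -1;
	p = end + 1;
	*start = demo_memparse(p, &end);
	if (end == p || *end != '\0')
		return -1;
	return 0;
}
```

Forms using a separator other than '$' are rejected here, mirroring the
note above that the e820-specific variants were not implemented.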

Orabug: 22662762

Signed-off-by: Larry Bassel <larry.bassel@oracle.com>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sparc: Harden signal return frame checks.

    Orabug: 23303740

    [ Upstream commit d11c2a0de2824395656cf8ed15811580c9dd38aa ]

    All signal frames must be at least 16-byte aligned, because that is
    the alignment we explicitly create when we build signal return stack
    frames.

    All stack pointers must be at least 8-byte aligned.
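
    The two alignment rules above amount to mask checks like the ones in
    this sketch. The helper names are hypothetical, and the sketch ignores
    details such as the sparc64 stack bias, so it illustrates the rule only
    and is not the checks the patch adds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if addr is a multiple of align (align must be a power of two). */
static bool demo_is_aligned(uintptr_t addr, uintptr_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (addr & (align - 1)) == 0;
}

/* Signal frames: at least 16-byte aligned. */
static bool demo_signal_frame_ok(uintptr_t frame)
{
	return demo_is_aligned(frame, 16);
}

/* Stack pointers: at least 8-byte aligned. */
static bool demo_stack_pointer_ok(uintptr_t sp)
{
	return demo_is_aligned(sp, 8);
}
```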

Signed-off-by: David S. Miller <davem@davemloft.net>
    Conflicts:

    arch/sparc/kernel/signal32.c - modified patch context so that it would apply

Signed-off-by: Larry Bassel <larry.bassel@oracle.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sparc64:Support User Probes for Sparc

Orabug: 23523685 Support User Probes in OLS / uek4.1

Signed-off-by: Eric Saint-Etienne <eric.saint.etienne@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sparc64: Use HW supported number of context ID bits

Orabug: 24449941

Number of context IDs supported by the hardware
is reported via machine descriptor for sun4v
systems. For systems > T3, 16 bits are used
to represent context ID in the HW. For these
systems the context ID wrap around happens if
there are more than 65536 processes running
simultaneously. For systems older than that
13 bits are used and the context ID wraps around
if there are 8192 processes running simultaneously.
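
The wrap-around points quoted above are just two to the power of the number
of context ID bits, as in this sketch (the function name is hypothetical):

```c
#include <assert.h>

/* Number of distinct context IDs representable in ctx_bits bits. */
static unsigned long demo_nr_contexts(unsigned int ctx_bits)
{
	assert(ctx_bits < 8 * sizeof(unsigned long));
	return 1UL << ctx_bits;
}
```

demo_nr_contexts(16) yields 65536 and demo_nr_contexts(13) yields 8192,
matching the two cases described.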

Reviewed-by: Babu Moger <babu.moger@oracle.com>
Acked-by: Rob Gardner <rob.gardner@oracle.com>
Signed-off-by: Sanath Kumar <sanath.s.kumar@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sparc64: Fix return from trap window fill crashes.

We must handle data access exception as well as memory address unaligned
exceptions from return from trap window fill faults, not just normal
TLB misses.

Otherwise we can get an OOPS that looks like this:

ld-linux.so.2(36808): Kernel bad sw trap 5 [#1]
CPU: 1 PID: 36808 Comm: ld-linux.so.2 Not tainted 4.6.0 #34
task: fff8000303be5c60 ti: fff8000301344000 task.ti: fff8000301344000
TSTATE: 0000004410001601 TPC: 0000000000a1a784 TNPC: 0000000000a1a788 Y: 00000002    Not tainted
TPC: <do_sparc64_fault+0x5c4/0x700>
g0: fff8000024fc8248 g1: 0000000000db04dc g2: 0000000000000000 g3: 0000000000000001
g4: fff8000303be5c60 g5: fff800030e672000 g6: fff8000301344000 g7: 0000000000000001
o0: 0000000000b95ee8 o1: 000000000000012b o2: 0000000000000000 o3: 0000000200b9b358
o4: 0000000000000000 o5: fff8000301344040 sp: fff80003013475c1 ret_pc: 0000000000a1a77c
RPC: <do_sparc64_fault+0x5bc/0x700>
l0: 00000000000007ff l1: 0000000000000000 l2: 000000000000005f l3: 0000000000000000
l4: fff8000301347e98 l5: fff8000024ff3060 l6: 0000000000000000 l7: 0000000000000000
i0: fff8000301347f60 i1: 0000000000102400 i2: 0000000000000000 i3: 0000000000000000
i4: 0000000000000000 i5: 0000000000000000 i6: fff80003013476a1 i7: 0000000000404d4c
I7: <user_rtt_fill_fixup+0x6c/0x7c>
Call Trace:
[0000000000404d4c] user_rtt_fill_fixup+0x6c/0x7c

The window trap handlers are slightly clever, the trap table entries for them are
composed of two pieces of code.  First comes the code that actually performs
the window fill or spill trap handling, and then there are three instructions at
the end which are for exception processing.

The userland register window fill handler is:

add %sp, STACK_BIAS + 0x00, %g1; \
ldxa [%g1 + %g0] ASI, %l0; \
mov 0x08, %g2; \
mov 0x10, %g3; \
ldxa [%g1 + %g2] ASI, %l1; \
mov 0x18, %g5; \
ldxa [%g1 + %g3] ASI, %l2; \
ldxa [%g1 + %g5] ASI, %l3; \
add %g1, 0x20, %g1; \
ldxa [%g1 + %g0] ASI, %l4; \
ldxa [%g1 + %g2] ASI, %l5; \
ldxa [%g1 + %g3] ASI, %l6; \
ldxa [%g1 + %g5] ASI, %l7; \
add %g1, 0x20, %g1; \
ldxa [%g1 + %g0] ASI, %i0; \
ldxa [%g1 + %g2] ASI, %i1; \
ldxa [%g1 + %g3] ASI, %i2; \
ldxa [%g1 + %g5] ASI, %i3; \
add %g1, 0x20, %g1; \
ldxa [%g1 + %g0] ASI, %i4; \
ldxa [%g1 + %g2] ASI, %i5; \
ldxa [%g1 + %g3] ASI, %i6; \
ldxa [%g1 + %g5] ASI, %i7; \
restored; \
retry; nop; nop; nop; nop; \
b,a,pt %xcc, fill_fixup_dax; \
b,a,pt %xcc, fill_fixup_mna; \
b,a,pt %xcc, fill_fixup;

And the way this works is that if any of those memory accesses
generate an exception, the exception handler can revector to one of
those final three branch instructions depending upon which kind of
exception the memory access took.  In this way, the fault handler
doesn't have to know if it was a spill or a fill that it's handling
the fault for.  It just always branches to the last instruction in
the parent trap's handler.

For example, for a regular fault, the code goes:

winfix_trampoline:
rdpr %tpc, %g3
or %g3, 0x7c, %g3
wrpr %g3, %tnpc
done

All window trap handlers are 0x80 aligned, so if we "or" 0x7c into the
trap time program counter, we'll get that final instruction in the
trap handler.
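
The address arithmetic can be checked with a small userspace sketch. The
helper name and the macro names are illustrative assumptions, not kernel
code; only the 0x80 alignment and the 0x7c value come from the text above.

```c
#include <assert.h>
#include <stdint.h>

/* Each window trap handler occupies one 0x80-byte aligned slot. */
#define WINDOW_HANDLER_SIZE 0x80ULL
/* Offset of the last 4-byte instruction in the slot (0x80 - 4). */
#define LAST_INSN_OFFSET    0x7cULL

/*
 * Hypothetical helper mirroring "or %g3, 0x7c, %g3": given the trap-time
 * PC of a faulting access inside a window handler, return the address of
 * the final branch instruction of that same handler.
 */
static uint64_t winfix_target(uint64_t tpc)
{
	uint64_t base = tpc & ~(WINDOW_HANDLER_SIZE - 1);
	uint64_t target = tpc | LAST_INSN_OFFSET;

	/* SPARC instructions are 4-byte aligned, so the low two bits are 0. */
	assert((tpc & 0x3) == 0);
	/* OR-ing works only because the slot base is 0x80 aligned. */
	assert(target == base + WINDOW_HANDLER_SIZE - 4);
	return target;
}
```

Any faulting instruction inside the same 0x80-byte slot maps to the same
final instruction, which is why the trampoline does not need to know which
load or store took the exception.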

On return from trap, we have to pull the register window in but we do
this by hand instead of just executing a "restore" instruction for
several reasons.  The largest being that from Niagara and onward we
simply don't have enough levels in the trap stack to fully resolve all
possible exception cases of a window fault when we are already at
trap level 1 (which we enter to get ready to return from the original
trap).

This is executed inline via the FILL_*_RTRAP handlers.  rtrap_64.S's
code branches directly to these to do the window fill by hand if
necessary.  Now if we look at them, we'll see at the end:

    ba,a,pt    %xcc, user_rtt_fill_fixup;
    ba,a,pt    %xcc, user_rtt_fill_fixup;
    ba,a,pt    %xcc, user_rtt_fill_fixup;

And oops, all three cases are handled like a fault.

This doesn't work because each of these trap types (data access
exception, memory address unaligned, and faults) stores its auxiliary
info in different registers to pass on to the C handler which does the
real work.

So in the case where the stack was unaligned, the unaligned trap
handler sets up the arg registers one way, and then we branch to
the fault handler which expects them set up another way.

So the FAULT_TYPE_* value ends up basically being garbage, and
would randomly generate the backtrace seen above.

Orabug: 24671126

Reported-by: Nick Alcock <nix@esperi.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sparc64: Take ctx_alloc_lock properly in hugetlb_setup().

On cheetahplus chips we take the ctx_alloc_lock in order to
modify the TLB lookup parameters for the indexed TLBs, which
are stored in the context register.

This is called with interrupts disabled, however ctx_alloc_lock
is an IRQ safe lock, therefore we must acquire/release it
properly with spin_{lock,unlock}_irq().

Orabug: 24671126

Reported-by: Meelis Roos <mroos@linux.ee>
Tested-by: Meelis Roos <mroos@linux.ee>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sparc64: Fix sparc64_set_context stack handling.

Like a signal return, we should use synchronize_user_stack() rather
than flush_user_windows().

Orabug: 24671126

Reported-by: Ilya Malakhov <ilmalakhovthefirst@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sparc64: vds kernel BUG at fs/buffer.c:1269!

Orabug: 24376791

Interrupts must be enabled before the fini call.

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sparc64: Virtual disk IO should handle VDS module removal and reinsertion

Orabug: 24319792

Virtual disk IO should handle module removal and reinsertion while IO
is active between clients and the server.

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sparc64: support for identifying Sonoma 2 systems

Needed for Sonoma 2 software support

Orabug: 22960812
Signed-off-by: Joe Moriarty <joe.moriarty@oracle.com>
Acked-by: Allen Pais <allen.pais@oracle.com>
(cherry picked from commit 8a9b7d9b25a3ad54bc41294f93ce814038f01c70)
(cherry picked from commit a8ce3853635573a42e49df9c8b7e87bf35656561)
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sonoma:correctly recognize sonoma cpu type

Orabug: 23041920

Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Joe Moriarty <joe.moriarty@oracle.com>
(cherry picked from commit 72eaed0f66615fe000a63feb7350ba51bf040e06)
(cherry picked from commit 70a67f6bfc281c92e9422b1672d0cae30da178df)

sparc64: Set VDS workqueue max_active argument to 0

Orabug: 23565322

Based on

https://www.kernel.org/doc/Documentation/workqueue.txt

The recommended value for max_active is 0:

max_active:

max_active determines the maximum number of execution contexts per
CPU which can be assigned to the work items of a wq.  For example,
with @max_active of 16, at most 16 work items of the wq can be
executing at the same time per CPU.

Currently, for a bound wq, the maximum limit for @max_active is 512
and the default value used when 0 is specified is 256.  For an unbound
wq, the limit is higher of 512 and 4 * num_possible_cpus().  These
values are chosen sufficiently high such that they are not the
limiting factor while providing protection in runaway cases.

The number of active work items of a wq is usually regulated by the
users of the wq, more specifically, by how many work items the users
may queue at the same time.  Unless there is a specific need for
throttling the number of active work items, specifying '0' is
recommended.
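
As a rough illustration of the documentation excerpt above, the following
sketch models how a requested max_active might be interpreted for a bound
workqueue. The helper and macro names are hypothetical; the 256 default and
the 512 limit are the values quoted above, and clamping larger requests to
the limit is an assumption of this sketch.

```c
#include <assert.h>

/* Values quoted in the workqueue documentation excerpt (bound wq). */
#define SKETCH_WQ_MAX_ACTIVE 512
#define SKETCH_WQ_DFL_ACTIVE 256

/* Hypothetical helper: resolve the effective per-CPU concurrency limit. */
static int effective_max_active(int requested)
{
	assert(requested >= 0);

	/* 0 means "no specific throttling wanted": use the default. */
	if (requested == 0)
		return SKETCH_WQ_DFL_ACTIVE;
	/* Assumed behaviour: requests above the limit are clamped. */
	if (requested > SKETCH_WQ_MAX_ACTIVE)
		return SKETCH_WQ_MAX_ACTIVE;
	return requested;
}
```

Passing 0 therefore does not mean "no work items may run"; it selects the
default limit.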

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Reviewed-by: Liam Merwick <Liam.Merwick@oracle.com>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
(cherry picked from commit b584786e611e8e8a28830386e8b3db8874d794c5)
(cherry picked from commit f2559a96b70562267f01d5bb62ef44aa9f0c0cd8)
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sparc64: Reduce TLB flushes during hugepte changes

During hugepage map/unmap, TSB and TLB flushes are currently
issued at every PAGE_SIZE'd boundary which is unnecessary.
We now issue the flush at REAL_HPAGE_SIZE boundaries only.

Without this patch workloads which unmap a large hugepage
backed VMA region get CPU lockups due to excessive TLB
flush calls.
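
To get a feel for the reduction, the sketch below counts flush calls for a
range at two strides. The 8 KB page size and the 4 MB REAL_HPAGE_SIZE are
assumptions made for illustration, and the helper is hypothetical rather
than the kernel's flush path.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed sizes for illustration only. */
#define SKETCH_PAGE_SIZE       (8ULL * 1024)
#define SKETCH_REAL_HPAGE_SIZE (4ULL * 1024 * 1024)

/* Count how many flushes are issued when stepping [start, end) by stride. */
static uint64_t count_flushes(uint64_t start, uint64_t end, uint64_t stride)
{
	uint64_t n = 0;

	assert(stride != 0 && start % stride == 0 && end % stride == 0);
	for (uint64_t addr = start; addr < end; addr += stride)
		n++;
	return n;
}
```

Under these assumed sizes, unmapping 8 MB takes 1024 flushes at page
granularity but only 2 at REAL_HPAGE_SIZE granularity.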

Orabug: 23071722

Signed-off-by: Nitin Gupta <nitin.m.gupta@oracle.com>
(cherry picked from commit b42a694198cca38e8cdb3f601266bf591ba3291d)
(cherry picked from commit fdc7f39ae632a9ec0114c59090131d2db7dd7682)
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sparc64: kernel panic -- vds_bh_reset

Orabug: 23199936

The panic is an assertion failure in fs/buffer.c:1269

static inline void check_irqs_on(void)
{
         *** BUG_ON(irqs_disabled()); ***
}

The vds reset path calls the backend fini routine which eventually calls
the file close interface:

         vds_vio_lock(vio, flags);
         vds_be_fini(port);

vds_vio_lock() grabs a spin lock and disables local irqs and thus
the eventual assertion failure.

The fix is to add a new r/w mutex to protect backend state and move the
vds_vio_lock() call after vds_be_fini().

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Reviewed-by: Liam Merwick <Liam.Merwick@oracle.com>
(cherry picked from commit 6e33112afcdd654ada7c9414a1c4d83278533911)
(cherry picked from commit e62908110662f009f2449df5faae496ac43a1d65)
Signed-off-by: Allen Pais <allen.pais@oracle.com>

vds_blk_rw() should check bio_alloc() NULL return value

Orabug: 22934031

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
(cherry picked from commit fec4ca1085a268c38a4a12c6119322aaf2f87698)
(cherry picked from commit d590a3711158228194ea31da0f1fea612bd13c05)
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sunvdc: don't dereference port->disk before disk probe finishes

If the backing file for a vdisk is not present in the service domain an
ldc reset can occur during the initial port/disk probing. The ldc reset
logic was dereferencing port->disk, which may not have been setup yet.
Guard against this case.

Orabug: 20362258

Signed-off-by: Chris Hyser <chris.hyser@oracle.com>
Signed-off-by: Dwight Engen <dwight.engen@oracle.com>
(cherry picked from commit cd6d3705da958b5db625272eb8733ab79a045f87)
(cherry picked from commit bee156ac9cad00f6a39417217c454085645c3d62)
(cherry picked from commit 476306db27c9a6bcd2e8012047ba06a0af16b734)
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sparc64: This patch adds PRIQ support.

This patch supports INT_A through INT_D interrupts as described
by the Open Firmware device tree as well as MSI vectors registered
by PCIe drivers. pci=nomsi may not work though frankly that makes no
sense on a SPARC machine.

The command line parameter priq=off reverts to prior MSIEQ interrupt
mechanism.

OraBug: 22748924

Signed-off-by: Bob Picco <bob.picco@oracle.com>
Signed-off-by: Chris Hyser <chris.hyser@oracle.com>
(cherry picked from commit d4c668861f91dfe6f5fa1a809218a8c46dc76c9b)
(cherry picked from commit c47c2d2a53856b25843e07c78f42f45a17661d2c)
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sparc64: Enable aggressive setting of PCIe MPS settings

This patch connects SPARC PCIe into the generic PCIe framework enabling
MPS and MRRS to be set aggressively subject to the standard command line
flags. To enable, put "pci=pcie_bus_perf" on the kernel command line.

Orabug: 21149334

Signed-off-by: Chris Hyser <chris.hyser@oracle.com>
(cherry picked from commit 5e5b08ede2c5b6cbf39e20f91097ca2435ea286e)
(cherry picked from commit 8b9a1855f68978d437605b0267ba448399303511)
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sparc64: Allow redirection of MSI/MSI-X IRQs

Allows redirection of MSI/MSI-X IRQs by finding appropriate MSIEQ and
re-routing its IRQ. Also handles driver IRQs sharing the same MSIEQ.
Affinity masks for all such shared interrupts as well as MSIQ IRQ
are modified. Note, based on the HW sharing this patch can change
related driver IRQs in an invisible manner. While confusing and not
desirable, this is an artifact of the HW design.

Orabug: 22749960

Signed-off-by: Chris Hyser <chris.hyser@oracle.com>
(cherry picked from commit 914901044c1a028185326eb1a3c8821cab8845be)
(cherry picked from commit 98069630a58903f6ec29aaf784f24b44c27a0db0)
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sparc64: use COMMAND_LINE_SIZE for boot string

Orabug: 19722011

original patch by Bob Picco

Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Cc: Bob Picco <bob.picco@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
(cherry picked from commit 1a9bf6b57dbfcc4ec8e8d98bd20b7975f4b4934f)
(cherry picked from commit 017214c5742ee92e3270024c4cce1cecb793ac1c)

sparc64: crypto camellia opcode error fix

Orabug: 23128525

Signed-off-by: Allen Pais <allen.pais@oracle.com>
(cherry picked from commit fc1b755de4250245961b226da41d10f066467926)
(cherry picked from commit c6cb169240529eb974a5846e55e622001973b79f)
(cherry picked from commit 8d140eb4166d127dc0f64595ec17e441beb4b47c)

sparc64: node_random needs attention

Seriously node_random will have to be hooked into sparc.

Orabug: 23128525

Signed-off-by: Bob Picco <bob.picco@oracle.com>
(cherry picked from commit 9c8ab6e8096ddf1814df8503cdd10ee83b4ddf9a)
(cherry picked from commit 640de4e021c24037420ddb4c52cc91b002d72ad7)

Conflicts:
kernel/fork.c
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
(cherry picked from commit 07bb1228d250a4e3003ccf317da22c06310b8607)
(cherry picked from commit cdeddef5edf0de95cff6ca8c8b7efe94322492d7)

sparc64: nr_cpus and nodes_shift

This is being done for M5 and the like.

To go beyond the NR_CPUS limit of 4064, the issue in cpu mondo -
init_cpu_send_mondo_info - needs to be addressed and appears possible.

Orabug: 23128525

Signed-off-by: Bob Picco <bob.picco@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
(cherry picked from commit d8ce9cc00181bfa65865c6d1624f0dcb3d048a7b)
(cherry picked from commit ce753fe5d10682939912242c9880935472f1e195)

Conflicts:
arch/sparc/Kconfig

Signed-off-by: Allen Pais <allen.pais@oracle.com>
(cherry picked from commit 48df0cc664ddfb2f03dc1afbb3ff01a3192c9723)
(cherry picked from commit fdb632ef6a250d24887003cc35b72744183c8642)

sparc64: struct adi_caps should use __u64, not u64

struct adi_caps uses u64 as the type for its field which is not
defined for include/uapi. Change it to __u64.

Orabug: 22713162

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit 4b47a697322066fcd6cf0f4637dece26da3525fc)

Conflicts:
include/uapi/linux/prctl.h

Signed-off-by: Allen Pais <allen.pais@oracle.com>
(cherry picked from commit 858864aea91eb7a1337cedd70a01ffb3fb5d898a)
(cherry picked from commit acf5580a66da8aae303a42af49de27d7500651cc)

IPMI: Driver for Sparc T4/T5/T7 Platforms

Functional IPMI interface driver for Sparc T4/T5/T7. This will
probably also work for other platforms that use an iLOM channel
for IPMI services, including older and future ones, though these
have not been tested.

This driver provides the transport between the IPMI message layer
and the Sparc platform IPMI endpoint in iLOM. The Virtual Logical
Domain Channel (VLDC) driver claims the host endpoint, and we call
it to move data to/from iLOM. So there is an unusual dependency
on another loadable module which requires several compromises
until we work out a plan to restructure the VLDC driver to provide
a cleaner interface:

* An artificial symbolic dependency on vldc is created so that
   "modprobe ipmi_si" will ensure that vldc is loaded also.

* ipmi_vldc uses filp_open/kernel_read/kernel_write on device
   files provided by vldc, ie, /sys/class/vldc/ipmi/mode and
   /dev/vldc/ipmi.

Bug 22804422 has been created to deal with these issues.

Sending this driver upstream is on hold until we work out these
issues. Also, the vldc driver itself has not yet been sent upstream
and that is obviously a prerequisite.

Orabug: 22658348

Signed-off-by: Rob Gardner <rob.gardner@oracle.com>
(cherry picked from commit 6083e586b068ae159c8335adc2d210e7b7f66d27)
(cherry picked from commit 9944e6442b962c2945f2a59ef7c6ff81d0e95172)
Signed-off-by: Allen Pais <allen.pais@oracle.com>
(cherry picked from commit e76d315b514e424a8623c51eaa526a6d2ac52a89)
(cherry picked from commit dfcab0a3eef7ebd4cc2fda9865f42ff114b46459)

Merge branch topic/uek-4.1/xen of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

xen-blkback: don't get ref for each queue

xen_blkif_get() for each queue is useless, and introduces a bug.

If there is I/O in flight, xen_blkif_disconnect() will return busy and
xen_blkif_put() will not be called.
Then, even once the I/O has completed, xen_blkif_put() can't free all resources.

Orabug: 24661443
Signed-off-by: Bob Liu <bob.liu@oracle.com>

Merge branch 'topic/uek-4.1/ofed' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

RDS: IB: set default frag size to 16K

For systems that want a lower fragment setting because of
a smaller memory footprint, the module parameter 'rds_ib_max_frag'
can be used to set a lower value such as 4K or 8K.

Orabug: 24656820

Reviewed-by: Mukesh Kacker <mukesh.kacker@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

rds: avoid call to flush_mrs() in specific condition

This is to reduce process spawn time.
When the user provides 0 values for cookie and flags in the rds_free_mr() call,
avoid calling flush_mrs().

skgxp uses cookie 0 and flag 0 combination for checking whether
transport is RDMA capable or not.

This is a short-term hack for a customer escalation.
The customer has other processes which call flush_mrs(), and
that causes mutex contention.
The skgxp change is fairly significant, and we want to make a minimal
change in the customer environment.

The risk factor here is that if there is any other use of the cookie 0 and
flag 0 combination (like freeing up unused MRs), then it will be impacted.
Code inspection by Leo/Avneesh at skgxp and skgnfs suggests that this
combination is not used anywhere.

The long-term solution requires changes in RDS as well as in the skgxp
application, which should be done in the next UEK release.
The required RDS changes are present in UEK4; however, the skgxp changes are
still outstanding. Since this was an escalation from a major customer, we
require this hack in UEK4.

Orabug: 24656750

Tested-by: Sujatha Tolstoy <sujatha.tolstoy@oracle.com>
Signed-off-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Reviewed-by: Leo Tominna <leo.tominna@oracle.com>
Reviewed-by: Avneesh Pant <avneesh.pant@oracle.com>
Reviewed-by: John Sobecki <john.sobecki@oracle.com>
Signed-off-by: Guangyu Sun <guangyu.sun@oracle.com>

mlx4_core: allow unprivileged VFs read physical port counters

For compatibility with guest OSes running older releases, we allow
VFs to read physical port counters.

Orabug: 24656803

Signed-off-by: Mukesh Kacker <mukesh.kacker@oracle.com>
Reviewed-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

sif: Lift sif_verbs up to be independent of sif internal headers

The sif_verbs.h file needs to be independent of
other header files to be includable from other kernel code.
This is necessary to avoid duplicate definition of
the API elements. For Oracle Linux this file now moves from
drivers/infiniband/hw/sif/ to include/rdma/ to make it
available for the RDS and uvNIC drivers.

This is a temporary but necessary measure while we wait
for proper generic interfaces to be defined at the common
verbs layer.

Orabug: 24524698

Signed-off-by: Knut Omang <knut.omang@oracle.com>

sif: ireg: Use the firmware release version as sysfs fw_ver

Report the official release version as reported by
ibv_query_device etc. instead of the previously used
internal firmware build version.

Orabug: 24533579

Signed-off-by: Knut Omang <knut.omang@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>

sif: Remove dummy implementation of get_protocol_stats

We don't really implement it and the entry point was silently
removed in upstream commit v4.6-rc5-317-gb40f475

Signed-off-by: Knut Omang <knut.omang@oracle.com>
Reviewed-by: Åsmund Østvold <asmund.ostvold@oracle.com>

sif: ipd: Fix incorrect calculation of ipd from static rate

Orabug: 24449061

The ipd is calculated wrongly because it compares the active speed enum
with the value returned from ib_rate_to_mult. Thus, this patch converts the
PSIF active speed enum to a multiple of the base rate of SDR (2.5 Gbps).

Signed-off-by: Wei Lin Guay <wei.lin.guay@oracle.com>
Reviewed-by: Knut Omang <knut.omang@oracle.com>

sif: Fix recently introduced checkpatch issues

It appears the commit check in checkpatch does not capture
all errors. Fix the new ones in the driver code to
allow us to enable a regression test for it.

Orabug: 24570578

Signed-off-by: Knut Omang <knut.omang@oracle.com>
Reviewed-by: Åsmund Østvold <asmund.ostvold@oracle.com>
Signed-off-by: Knut Omang <knut.omang@oracle.com>

sif: sqflush: Handle duplicate completions in poll_cq

Orabug: 23759723

During the QP transition from RTS -> ERR, the HW might generate
a duplicate FLUSHED-IN-ERR completion. The SIF driver inverts the
sq_seq in a dedicated completion entry and sets the
CQ_POLLING_IGNORED_SEQ bit in the cq_sw flags. However, this bit
is cleared once a duplicate FLUSHED-IN-ERR completion is detected in
poll_cq.

The above-mentioned method cannot handle a scenario where the HW generates
multiple duplicate completions. Thus, this patch moves the detection
of the duplicate completions to translate_wr_id. The SIF driver
will then only return non-duplicate completions to the user.

Signed-off-by: Wei Lin Guay <wei.lin.guay@oracle.com>
Reviewed-by: Knut Omang <knut.omang@oracle.com>

Merge branch 'topic/uek-4.1/drivers' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

Revert "ixgbe: make a workaround to tx hang issue under dom"

Orabug: 24574722

This reverts commit 885bb302d5bb06d7f26427133a3b8afb2115a53a.

Need to revert this commit as it causes a tx hang on dom0 for OVM

Signed-off-by: Brian Maly <brian.maly@oracle.com>

Merge branch 'topic/uek-4.1/drivers' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

nvme: don't overwrite req->cmd_flags on sync cmd

In __nvme_submit_sync_cmd, the request direction is overwritten when
the REQ_FAILFAST_DRIVER flag is set.
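
The class of bug described here is a plain assignment where a bitwise OR is
needed. The sketch below illustrates that pattern only; it is not the
actual diff, and the flag names and bit positions are hypothetical
stand-ins for the block layer's real definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit positions; the real values live in the block layer. */
#define SKETCH_REQ_WRITE           (1u << 0)  /* request direction */
#define SKETCH_REQ_FAILFAST_DRIVER (1u << 3)

/* Buggy pattern: assignment discards whatever was already in cmd_flags. */
static uint32_t mark_failfast_overwrite(uint32_t cmd_flags)
{
	(void)cmd_flags;
	return SKETCH_REQ_FAILFAST_DRIVER;
}

/* Fixed pattern: OR the new flag in and keep the existing bits. */
static uint32_t mark_failfast_preserve(uint32_t cmd_flags)
{
	return cmd_flags | SKETCH_REQ_FAILFAST_DRIVER;
}

/* Small self-check showing the direction bit surviving only with OR. */
static void failfast_demo(void)
{
	uint32_t flags = SKETCH_REQ_WRITE;

	assert((mark_failfast_overwrite(flags) & SKETCH_REQ_WRITE) == 0);
	assert((mark_failfast_preserve(flags) & SKETCH_REQ_WRITE) != 0);
}
```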

Signed-off-by: Matias Bjørling <m@bjorling.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Fixes: 75619bfa904d0 ("NVMe: End sync requests immediately on failure")
Signed-off-by: Jens Axboe <axboe@fb.com>
Orabug: 24561038
Mainline Commit: e112af0dc9f55099b948e55077504a44b4162c79

Signed-off-by: Ashok Vairavan <ashok.vairavan@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>

NVMe: End sync requests immediately on failure

Do not retry failed sync commands so the original status may be seen
without issuing unnecessary retries.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Orabug: 24561038
Mainline Commit: 75619bfa904d0f2840b4274eb92ce47b2e1c472e

Signed-off-by: Ashok Vairavan <ashok.vairavan@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>

Merge branch 'topic/uek-4.1/ofed' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

Merge branch 'topic/uek-4.1/ocfs2' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

Merge branch 'topic/uek-4.1/drivers' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

ib_core: make wait_event uninterruptible in ib_flush_fmr_pool()

Replace wait_event_interruptible() with wait_event() in
ib_flush_fmr_pool() to avoid deallocating pd before fmr_cleanup_thread
tears down pool of fmrs.

Orabug: 24533036

Signed-off-by: Avinash Repaka <avinash.repaka@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

ocfs2: Fix start offset to ocfs2_zero_range_for_truncate()

If we punch a hole on a reflink such that following conditions are met:

1. start offset is on a cluster boundary
2. end offset is not on a cluster boundary
3. (end offset is somewhere in another extent) or
(hole range > MAX_CONTIG_BYTES(1MB)),

we don't COW the first cluster starting at the start offset. But in this
case, we were wrongly passing this cluster to
ocfs2_zero_range_for_truncate() to zero out. This will modify the cluster
in place and zero it in the source too.

Fix this by skipping this cluster in such a scenario.

Orabug: 24516161

Reported-by: Saar Maoz <saar.maoz@oracle.com>
Signed-off-by: Ashish Samant <ashish.samant@oracle.com>
Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>

NVMe: Fix obtaining command result

Replaces the use of req->sense_len, which is not owned by the LLD, with
req->special to hold the command result for driver-created commands,
and sets the result unconditionally on completion.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jens Axboe <axboe@fb.com>
Fixes: d29ec8241c10 ("nvme: submit internal commands through the block layer")
Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit a0a931d6a2c1fbc5d5966ebf0e7a043748692c22 and
added missing pieces from d29ec8241c10eacf59c23b3828a88dbae06e7e3f
backport)
Orabug: 24532912
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

Merge branch topic/uek-4.1/xen of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

Merge branch 'topic/uek-4.1/ofed' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

x86/xen: Add x86_platform.is_untracked_pat_range quirk to ignore ISA regions.

On x86 whenever VMAs are setup, the 'is_ISA_range quirk' (which this
patch re-implements) is used to figure whether to ignore the
requested PAT type and always use WB (see 'reserve_memtype').
Specifically it forces the WB type for any region in the ISA space.

From the Intel SDM, the combination of MTRR (UC, which is setup by
the BIOS) and PAT (UC or WB) for the ISA region ends up with the same
value - UC.

However on Xen, due to XSA 154, we enforce that _ANY_ pagetable entry
mapping an MMIO range MUST have the same cachability
mapping - and in this case we enforce UC.

Which means that with XSA 154 (and without this patch) any application
that maps /dev/mem to get SMBIOS information (like mcelog), and pokes
in the ISA region will not have a PTE set. That is due to
reserve_pfn_range returning -EINVAL which results in the PTE not being set.

[These are debug entries added in 'reserve_pfn_range']
mcelog:2471 0xf0000->0xf1000, req_type=write-back new_type=write-back
mcelog:2471 0xeb000->0xed000, req_type=write-back new_type=write-back

.. above are successful ones, but:
mcelog:2471 0xeb000->0xed000, req_type=uncached new_type=uncached
[again, a debug one:]
mcelog:2471 want=uncached got=write-back strict 0x000eb000-0x000ecfff
mcelog:2471 map pfn expected mapping type uncached for [mem 0x000eb000-0x000ecfff], got write-back
------------[ cut here ]------------

[<ffffffff816c66f0>] dump_stack+0x63/0x83
[<ffffffff81084745>] warn_slowpath_common+0x95/0xe0
[<ffffffff810847aa>] warn_slowpath_null+0x1a/0x20
[<ffffffff810725f3>] untrack_pfn+0x93/0xc0
[<ffffffff811b90f9>] unmap_single_vma+0xa9/0x100
[<ffffffff811b9644>] unmap_vmas+0x54/0xa0
[<ffffffff811bf0da>] exit_mmap+0x9a/0x150
[<ffffffff810825d3>] mmput+0x73/0x110
[<ffffffff81082775>] dup_mm+0x105/0x110
[<ffffffff81083b1d>] copy_process+0x11ed/0x1240
[<ffffffff81084009>] do_fork+0x79/0x280
[<ffffffff810259d3>] ? syscall_trace_enter_phase1+0x153/0x180
[<ffffffff81084226>] SyS_clone+0x16/0x20
[<ffffffff816cb3ee>] system_call_fastpath+0x12/0x71

results in that splat.

The effective result of the function below is for 'reserve_memtype'
to ignore the result from 'x86_platform.is_untracked_pat_range' quirk.
Which means that the splat above does not happen.

Orabug: 24491985
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

Merge branch topic/uek-4.1/upstream-cherry-picks of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

blk-mq: avoid setting hctx->tags->cpumask before allocation

Orabug: 24465370,24464170,24300199

When unmapped hw queue is remapped after CPU topology is changed,
hctx->tags->cpumask has to be set after hctx->tags is setup in
blk_mq_map_swqueue(), otherwise it causes null pointer dereference.

Fixes: f26cdc8536 ("blk-mq: Shared tag enhancements")
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Ming Lei <tom.leiming@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 1356aae08338f1c19ce1c67bf8c543a267688fc3)
Signed-off-by: Bob Liu <bob.liu@oracle.com>

Merge branch topic/uek-4.1/upstream-cherry-picks of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

blk-mq: mark request queue as mq asap

Orabug: 24454933

Currently q->mq_ops is used widely to decide if the queue
is mq or not, so we should set the 'flag' asap so that both
block core and drivers can get the correct mq info.

For example, commit 868f2f0b720(blk-mq: dynamic h/w context count)
moves the hctx's initialization before setting q->mq_ops in
blk_mq_init_allocated_queue(), which causes blk_alloc_flush_queue()
to think the queue is non-mq and not allocate command size
for the per-hctx flush rq.

This patch should fix the problem reported by Sasha.

Cc: Keith Busch <keith.busch@intel.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Fixes: 868f2f0b720 ("blk-mq: dynamic h/w context count")
Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 66841672161efb9e3be4a1dbd9755020bb1d86b7)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

sif: vlink connect is now enabled by default

This fix makes default link failover behaviour compatible with existing
mellanox CX3. Internal link status (PortState) will now follow external
link status (PortState) by default.

Driver feature mask SIFF_vlink_disconnect may be used to set default
behaviour to "vlink connect"=disabled.

Orabug: 24445370

Signed-off-by: Harald Høeg <harald.hoeg@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Knut Omang <knut.omang@oracle.com>

sif: sif_hwmon: add hwmon interface to export psif chip temperatures

This commit adds support to export psif chip temperatures via hwmon
interface

Orabug: 24432362

Signed-off-by: Francisco Triviño <francisco.trivino@oracle.com>
Reviewed-by: Knut Omang <knut.omang@oracle.com>

sif: EPSC_API_VERSION(2,10) - EPSC_DIAG_COUNTERS

Adding a new EPSC command EPSC_DIAG_COUNTERS
to get Diag counter values via mailbox.

Orabug: 24374612

Modified-by: Knut Omang <knut.omang@oracle.com>
Signed-off-by: Knut Omang <knut.omang@oracle.com>

sif: base: Scale default desc.array size values based on #of available CBs

With default values for #of QPs and MRs set high by default,
33 instances of the driver would consume a lot
of memory just to initialize basic tables, since each of these
instances has its own 1M QP space and in effect allocates
the same amount of resources that a bare metal, single instance
driver would.

The number of collect buffers assigned to the PCIe function tells us
what fraction of the hardware resources we got, and a small
fraction of the 16K CB space indicates that the function competes with
other functions on resources, and that it is unlikely that the same
huge number of QPs etc can be deployed with high performance
anyway.

This commit introduces tracking of module parameter settings
compared to default values, and if compiled in defaults are used,
we scale down the number of QPs etc with a factor corresponding
to the fraction of CBs we got.

This yields eg. 32K QPs per function in a 32 VF enabled system
and significantly reduces system wide memory usage in a
virtualized environment (whether Xen based or not)

Users can still override settings using the module parameters,
which will not be subject to scaling if they deviate from the
compiled in defaults.
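
A minimal sketch of the scaling rule described above, assuming a simple
linear scale by the function's share of the CB space. The helper and macro
names are hypothetical, and the 512-CB share is inferred from the
"32K QPs out of a 1M default" figure rather than stated in the text.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_TOTAL_CBS   (16u * 1024)  /* 16K CB space, per the text */
#define SKETCH_DEFAULT_QPS (1u << 20)    /* 1M QP default, per the text */

/*
 * Hypothetical helper: if the configured value still equals the compiled-in
 * default, scale it by the fraction of CBs this function was given;
 * otherwise the user overrode it, so leave it untouched.
 */
static uint32_t scaled_table_size(uint32_t configured,
				  uint32_t compiled_default,
				  uint32_t cbs_assigned)
{
	assert(cbs_assigned > 0 && cbs_assigned <= SKETCH_TOTAL_CBS);

	if (configured != compiled_default)
		return configured;
	/* 64-bit intermediate avoids overflow in the multiplication. */
	return (uint32_t)((uint64_t)compiled_default * cbs_assigned /
			  SKETCH_TOTAL_CBS);
}
```

With 512 of the 16K CBs (a 1/32 share), the 1M default scales to 32K QPs,
matching the figure quoted for a 32 VF enabled system.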

Orabug: 24424521

Signed-off-by: Knut Omang <knut.omang@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>

sif: cb: Improve algorithm for allocating and using CBs from driver

Instead of allocating bandwidth collect buffers (CBs)
as a fallback for latency CBs, and spamming the kernel log
with failure messages, multiplex use across
the actual allocated number of latency CBs and just report
the failure to allocate once, with values to improve debugging.

Improves behaviour for scenarios where available CB resources
are spread across many VFs but VF drivers still see a lot
of (virtual) CPUs, which will easily be the case with the
default VF settings for Xen dom0.

Also, the low latency property is most critical for req.notify PQP
requests. Use high bandwidth CBs also for PQP operations other than
the REARM request, which is the performance critical req. for
req_notify_cq. This should improve performance for event based
applications under high load.

Orabug: 24424521

Signed-off-by: Knut Omang <knut.omang@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>

sif: epsc: For Xen dom0 configure resources for all 32 VFs at driver load

As of EPSC API version 2.9 firmware can distribute resources based on
the number of PCI functions the PF driver requests support for.
Older firmware will just ignore the value.

This commit enforces no VFs configured as the default setting
but enables all 32 VFs if a Xen PV domain is detected.

To allow overriding this behaviour we add a new module parameter
vf_max which can be used to override the number of VFs configured
for instance for use with other virtualization engines than Xen
and for debugging/tuning purposes. The vf_max parameter takes the
following values:

-2:  Use NVRAM configured firmware defaults (backward compat mode)
-1  (now default) : Exadata mode as described above
0-32:  Configure explicitly for that many VFs (only selected values
     are supported by firmware)
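
The value table above can be read as the following decision sketch. All
names are hypothetical, the Xen PV check is reduced to a boolean argument,
and the sentinel return value for "leave it to firmware" is an assumption
of this sketch, not the driver's interface.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_VF_MAX_FW_DEFAULTS (-2)  /* NVRAM configured firmware defaults */
#define SKETCH_VF_MAX_AUTO        (-1)  /* default: decide based on Xen PV */
#define SKETCH_VF_LIMIT           32
#define SKETCH_LEAVE_TO_FIRMWARE  (-1)  /* sentinel returned by this sketch */

/* Hypothetical helper: how many VFs should the PF driver request? */
static int resolve_vf_count(int vf_max, bool xen_pv_domain)
{
	assert(vf_max >= SKETCH_VF_MAX_FW_DEFAULTS && vf_max <= SKETCH_VF_LIMIT);

	if (vf_max == SKETCH_VF_MAX_FW_DEFAULTS)
		return SKETCH_LEAVE_TO_FIRMWARE;
	if (vf_max == SKETCH_VF_MAX_AUTO)
		return xen_pv_domain ? SKETCH_VF_LIMIT : 0;
	/* Explicit request; firmware may only support selected values. */
	return vf_max;
}
```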

Orabug: 24424521

Signed-off-by: Knut Omang <knut.omang@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>

sif: Reintroduce function name prefixes in log statements

A lot of the available messages don't make enough sense without the
information in the function name, so just reintroduce the function
name prefixes.

Signed-off-by: Knut Omang <knut.omang@oracle.com>
Orabug: 24437547
Pre-check: Knut Omang <knut.omang@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>

sif: fmr: invalidate keys before TLB bulk invalidates

This commit reorders and serializes the cleanup phase when
bulk invalidates are used. The previous order was to post the TLB flushing
operation to the EPSC, then invalidate keys (potentially in parallel with the
ongoing flushing) before finally waiting for the TLB flushing to complete.
This ordering is not considered safe in general, as an incoming access to a key
can cause an invalidated PTE or PTW to be cached again and later cause
sif to read or write to a no longer valid location.

This commit makes sure that all keys are invalidated before
the TLB flushing is triggered.

Orabug: 24438867

Signed-off-by: Knut Omang <knut.omang@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>