chris hyser [Thu, 19 May 2016 20:05:47 +0000 (13:05 -0700)]
sparc64: Enable aggressive setting of PCIe MPS settings
This patch connects SPARC PCIe into the generic PCIe framework enabling
MPS and MRRS to be set aggressively subject to the standard command line
flags. To enable, put "pci=pcie_bus_perf" on the kernel command line.
chris hyser [Tue, 17 May 2016 16:39:25 +0000 (09:39 -0700)]
sparc64: Allow redirection of MSI/MSI-X IRQs
Allows redirection of MSI/MSI-X IRQs by finding appropriate MSIEQ and
re-routing its IRQ. Also handles driver IRQs sharing the same MSIEQ.
Affinity masks for all such shared interrupts as well as the MSIQ IRQ
are modified. Note that, because of the HW sharing, this patch can change
related driver IRQs in an invisible manner. While confusing and not
desirable, this is an artifact of the HW design.
Rob Gardner [Tue, 9 Feb 2016 22:38:05 +0000 (15:38 -0700)]
IPMI: Driver for Sparc T4/T5/T7 Platforms
Functional IPMI interface driver for Sparc T4/T5/T7. This will
probably also work for other platforms that use an iLOM channel
for IPMI services, including older and future ones, though these
have not been tested.
This driver provides the transport between the IPMI message layer
and the Sparc platform IPMI endpoint in iLOM. The Virtual Logical
Domain Channel (VLDC) driver claims the host endpoint, and we call
it to move data to/from iLOM. So there is an unusual dependency
on another loadable module which requires several compromises
until we work out a plan to restructure the VLDC driver to provide
a cleaner interface:
* An artificial symbolic dependency on vldc is created so that
"modprobe ipmi_si" will ensure that vldc is loaded also.
* ipmi_vldc uses filp_open/kernel_read/kernel_write on device
files provided by vldc, i.e., /sys/class/vldc/ipmi/mode and
/dev/vldc/ipmi.
Bug 22804422 has been created to deal with these issues.
Sending this driver upstream is on hold until we work out these
issues. Also, the vldc driver itself has not yet been sent upstream
and that is obviously a prerequisite.
Commit: 5075a47f3765e778b45367ba4873c1bd08b21d0e
fix-up code base for v4.1.12-46 merge
should not have removed "#include <linux/hugetlb.h>"
Add it back in after applying adfc71b605:
fix-up - add back include of linux/dtrace_os.h
so that it will merge with master.
Signed-off-by: Chuck Anderson <chuck.anderson@oracle.com>
Commit: bd52d0fd57c96146f8d1838588753ab9dabcd2fe:
sparc64: Log warning for invalid hugepages boot param
removes "#include <linux/dtrace_os.h>" from arch/sparc/mm/fault_64.c.
That header file is needed by dtrace. Add it back in.
Signed-off-by: Chuck Anderson <chuck.anderson@oracle.com>
Aaron Young [Tue, 1 Mar 2016 15:47:02 +0000 (07:47 -0800)]
SPARC64: UEK4 LDOMS DOMAIN SERVICES UPDATE 3
This update provides the following fix for LDom domain services on UEK4.
1. Add an event to the vlds driver which is used to signal
process(es) using libds that the vlds /dev devices have been updated.
When it receives this event, libds will refresh/update its internal
list of vlds devices, allowing the list to stay immediately up-to-date
when vlds devices have changed. This event fixes some DR related libds
problems found during regression testing due to the libds internal vlds
device list becoming stale.
Signed-off-by: Aaron Young <aaron.young@oracle.com> Reviewed-by: Alexandre Chartre <Alexandre.Chartre@oracle.com>
Orabug: 22853109
Alexandre Chartre [Thu, 17 Mar 2016 10:31:24 +0000 (03:31 -0700)]
Interface to mark SR-IOV device ready for use by LDoms guest
Add an iov_ready file to all PCI devices (/sys/bus/pci/devices/*/iov_ready).
The iov_ready file is write only, and mapped to the pci_iov_dev_ready
hypervisor call, which is used to indicate that a PCI device is ready
or no longer ready to be shared with other domains.
Write "1" to the file to indicate that the PCI device is ready.
For example:
Vijay Kumar [Wed, 9 Mar 2016 19:48:38 +0000 (11:48 -0800)]
sparc64: Log warning for invalid hugepages boot param
When an invalid hugepage param is mentioned in kernel boot param,
appropriate warning should be logged to indicate if it's not
a) software supported
b) MMU support for xl_hugepagesz
c) xl_hugepagesz not in use
Signed-off-by: Bob Picco <bob.picco@oracle.com> Signed-off-by: Vijay Kumar <vijay.ac.kumar@oracle.com> Acked-by: Nitin Gupta <nitin.m.gupta@oracle.com>
Orabug: 22729791 Signed-off-by: Allen Pais <allen.pais@oracle.com>
Note: Resending this patch. There is no change in this patch since v1.
Jalapeno was verified repaired.
Now to find performance issues.
One performance issue is subordinate page table state (SPTS). The SPTS will
be tricky because of protection changes for COW and other cases. For example,
a 2GB hugepage will have 1UL << (31-23) PMD entries. Do we want 256 IPIs
for a hugepage TTE (pte) change?
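The entry count quoted above follows directly from the shift expression. A minimal sketch of the arithmetic, assuming (as the expression implies) a hugepage of 2^31 bytes and a PMD span of 2^23 bytes; the constant and function names are illustrative only and do not come from the patch:

```python
# Assumed shifts, taken from the "1UL << (31-23)" expression in the text.
HUGEPAGE_SHIFT = 31  # 2^31 bytes = 2 GB hugepage
PMD_SHIFT = 23       # 2^23 bytes = 8 MB covered by one PMD entry


def pmd_entries_per_hugepage(hugepage_shift=HUGEPAGE_SHIFT, pmd_shift=PMD_SHIFT):
    """Number of PMD entries that back a single hugepage."""
    return 1 << (hugepage_shift - pmd_shift)


if __name__ == "__main__":
    # One IPI per PMD entry would mean this many IPIs per hugepage change.
    print(pmd_entries_per_hugepage())
```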
chris hyser [Fri, 1 Apr 2016 19:44:22 +0000 (12:44 -0700)]
sparc64: Fix I/O NUMA parsing and sysfs display code.
I/O NUMA node parsing has been broken since T5 and did not work on
T7. The code also did not correctly handle PCIe root complexes
crossbar connected to multiple memory/cpu NUMA nodes. Additionally,
the numa_node attributes displayed in sysfs were incorrect.
Example: T7-4 showing round-robin spread of multiply connected root
complexes.
[ 3723.288247] /pci@305: On NUMA node 0
[ 3723.363398] /pci@304: On NUMA node 2
[ 3723.437486] /pci@307: On NUMA node 0
[ 3723.510510] /pci@306: On NUMA node 2
[ 3723.582582] /pci@313: On NUMA node 0
[ 3723.655276] /pci@308: On NUMA node 2
[ 3723.728077] /pci@302: On NUMA node 0
[ 3723.800774] /pci@30a: On NUMA node 2
[ 3723.874895] /pci@309: On NUMA node 0
[ 3723.947089] /pci@301: On NUMA node 2
[ 3724.020218] /pci@30b: On NUMA node 1
[ 3724.092902] /pci@300: On NUMA node 3
[ 3724.167630] /pci@303: On NUMA node 1
[ 3724.240287] /pci@30c: On NUMA node 3
[ 3724.312245] /pci@312: On NUMA node 1
[ 3724.384857] /pci@30e: On NUMA node 3
[ 3724.457482] /pci@30d: On NUMA node 1
[ 3724.531679] /pci@310: On NUMA node 3
[ 3724.603621] /pci@30f: On NUMA node 1
[ 3724.675695] /pci@311: On NUMA node 3
chris hyser [Thu, 7 Apr 2016 20:55:56 +0000 (13:55 -0700)]
sparc64: Set up core sibling list correctly for T7.
The important definition of core sibling is that some level of cache is shared.
The prior SPARC notion of socket was defined as highest level of shared cache.
On T7 platforms, the MD record now describes the CPUs that share the physical
socket and this is no longer tied to shared cache. This patch correctly
separates these two concepts.
chris hyser [Thu, 7 Apr 2016 19:12:05 +0000 (12:12 -0700)]
sparc64: Fix CPU package information in /sys
CPU package information in
/sys/bus/cpu/devices/cpu*/topology/physical_package_id
is inconsistent with the use by tools such as irqbalance. This patch
uses the socket ID to be consistent and useful.
chris hyser [Thu, 7 Apr 2016 19:32:48 +0000 (12:32 -0700)]
sparc64: Add 3rd level cache info to /sys
This patch pulls line size and cache size info from the machine description and
adds L3 cache files to /sys/bus/cpu/devices/cpu* directories. It also
structures the information in the same directory hierarchy as x86 so that user
programs like irqbalance can find the needed information to work correctly.
Rob Gardner [Sun, 27 Mar 2016 22:39:13 +0000 (16:39 -0600)]
sparc64: Add lightweight syscall mechanism for lwp_info
This patch introduces a new "light weight" system call
mechanism which has the ability to retrieve small bits
of information and/or perform minor computations without
the need for a full blown save/switch/restore context.
Solaris provides _lwp_info(), which returns basically the
same information as getrusage(RUSAGE_THREAD) but much faster.
This is used extensively by the database code, and returns
the utime and stime for the calling thread.
(This patch also provides a fast getcpu function just as
a demonstration of how additional calls might be added.
Unlike x86, there is no unprivileged instruction to do this,
and so it is a fairly expensive system call.)
Allen Pais [Tue, 29 Mar 2016 08:50:33 +0000 (14:20 +0530)]
sparc64: piggyback program generates a.out header with incorrect section sizes
piggyback in uek for SPARC generates an a.out that has section sizes that are
too large. This causes problems when booting with OpenBoot because OpenBoot
uses those sizes to map and copy the image to its specified VA and runs into
unmapped memory during the copies.
Signed-off-by: Jose Marchesi <jose.marchesi@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
(cherry picked from commit bd99ee7ceffb1a472ccd8841dd7011d15e7fa258)
wim.coekaerts@oracle.com [Fri, 29 Jan 2016 17:39:38 +0000 (09:39 -0800)]
Add sun4v_wdt watchdog driver
This driver adds sparc hypervisor watchdog support. The default
timeout is 60 seconds and the range is between 1 and 31536000 seconds. Both watchdog-resolution and
watchdog-max-timeout MD properties settings are supported.
Signed-off-by: Wim Coekaerts <wim.coekaerts@oracle.com> Reviewed-by: Julian Calaby <julian.calaby@gmail.com> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Allen Pais <allen.pais@oracle.com>
(cherry picked from commit eccc96426978c0fa963f8712077ecb6247f0e57e)
SR-IOV code looks for arch-specific data while enabling
VFs. When a VF device is added, the driver probe function makes a set
of calls to initialize the pci device. Because the VF device is
added in a different way than the normal PF device (which happens via
of_create_pci_dev for sparc), some of the arch-specific initialization
does not happen for the VF device. That causes a panic when archdata is
accessed.
To fix this, I have used the already defined weak function
pcibios_setup_device to copy archdata from PF to VF.
Also verified the fix.
Signed-off-by: Babu Moger <babu.moger@oracle.com> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
(cherry picked from commit be81c7e3cc48d3ff8b26021be3fd49e997743cbc)
chris hyser [Thu, 4 Feb 2016 21:14:43 +0000 (13:14 -0800)]
sparc64: enable "relaxed ordering" in IOMMU mappings
Enable relaxed ordering for memory writes in IOMMU TSB entry from
dma_4v_map_page() and dma_4v_map_sg() when dma_attrs
DMA_ATTR_WEAK_ORDERING is set. This requires vPCI version 2.0 API.
chris hyser [Thu, 4 Feb 2016 20:07:03 +0000 (12:07 -0800)]
sparc64: Enable PCI IOMMU version 2 API
Enable Version 2 of the PCI IOMMU API needed for advanced features
such as PCI Relaxed Ordering and greater than 2 GB DMA address
space per root complex.
Sowmini Varadhan [Tue, 2 Feb 2016 18:41:56 +0000 (10:41 -0800)]
sunvnet: perf tracepoint invocations to trace LDC state machine
Use sunvnet perf trace macros to monitor LDC message exchange state.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 5fa4282fdb6d30937abcf1b1a9d367aaf472178a)
Sowmini Varadhan [Tue, 2 Feb 2016 18:41:55 +0000 (10:41 -0800)]
sunvnet: Add support for perf LDC event tracing
Add perf event macros for support of tracing and instrumentation
of LDC state machine
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 61cf74d322a9d8ef172251e32c3008cf60964b70)
- Disable cpu timer only for hot-remove and not for hot-add
- Update interrupt affinities before interrupt redistribution
- Default to simple round-robin interrupt redistribution for ldoms
Tushar Dave [Thu, 7 Jan 2016 23:24:26 +0000 (15:24 -0800)]
sparc64: bypass iommu to use 64bit address space
This patch is internal only not for UPSTREAM. This is a temporary
workaround based on UEK2 commit c1a12ed1d125
("sparc64: enable iommu bypass workaround for IB. This is temporary.")
The current design of the sparc iommu is based on iommu V1 APIs, which can
have at most 2G/8K DMA addresses. Due to this, a kernel entity (e.g. i40e,
PSIF) requesting more than 2G/8K DMA addresses does not work at all.
This patch adds a temporary workaround to remedy this issue by bypassing
the iommu.
When 64bit iommu implementation is complete, this workaround will be
reverted.
Dave Kleikamp [Thu, 4 Feb 2016 16:43:48 +0000 (10:43 -0600)]
sparc64: call crash_kexec() directly from die_if_kernel()
A direct call to crash_kexec() here allows the crashing register state
to be saved to the PT_NOTE. When called from panic(), a new register
state is created which is less useful.
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Occasionally, the crash kernel will fail to configure a virtual disk
because the hypervisor leaves an old request in the rx queue even after
it is reconfigured in ldc_bind(). Fix this with a call to ldc_rx_reset().
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Dave Kleikamp [Fri, 20 Mar 2015 21:38:42 +0000 (16:38 -0500)]
sparc64: restore prom_cif_stack
Commit ef3e035c stopped using the firmware stack and thus stopped saving
its location in p1275buf. However, kexec wants to be using the firmware
stack when launching the new kernel.
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
If sparc_perf_event_update() is called between performance counter
overflow interrupts then everything is fine and the total event
count calculation is correct. If, however,
sparc_perf_event_update() is only called when the performance counter
overflows, we do not take the counter wrap into consideration.
This leaves us with an incorrect value for the total event count.
This patch fixes this issue by taking the counter overflow situation
into consideration.
Signed-off-by: Dave Aldridge <david.j.aldridge@oracle.com>
(cherry picked from commit 6c89361408f964ad2c2c29200987aece3a7c222d) Signed-off-by: Allen Pais <allen.pais@oracle.com>
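The counter wrap described above can be illustrated with modular arithmetic. This is only a sketch of the idea and not the kernel code: it assumes a 32-bit hardware counter, and `counter_delta` is a hypothetical helper name.

```python
COUNTER_BITS = 32  # assumed counter width, for illustration only


def counter_delta(prev, now, bits=COUNTER_BITS):
    """Events elapsed between two raw counter reads, allowing one wrap."""
    mask = (1 << bits) - 1
    # Subtracting modulo 2^bits yields the right delta even when the
    # counter overflowed and 'now' is numerically smaller than 'prev'.
    return (now - prev) & mask


if __name__ == "__main__":
    print(counter_delta(100, 250))          # no wrap between the reads
    print(counter_delta(0xFFFFFFF0, 0x10))  # counter wrapped between the reads
```

A plain `now - prev` would give a large negative number in the second case, which is the kind of incorrect total the commit message refers to.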
Dave Aldridge [Thu, 21 Jan 2016 14:17:11 +0000 (06:17 -0800)]
sparc64: Fix for perf event counts sometimes reported as negative numbers
Use an unsigned number to prevent sign extension in the calculation
to work out the difference between the previous and the current
count obtained from the performance instrumentation counters.
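To show why the signedness matters, the sketch below emulates reading a 32-bit counter value as a signed versus an unsigned quantity before taking the difference. The 32-bit width and the helper names are assumptions made for illustration; the actual kernel types are not stated in the commit message.

```python
def as_signed32(value):
    """Interpret the low 32 bits of value as a signed integer (sign-extends)."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def delta_signed(prev, now):
    # Emulates the problem: a raw count with bit 31 set becomes negative.
    return as_signed32(now) - as_signed32(prev)


def delta_unsigned(prev, now):
    # Emulates the remedy: both raw counts are treated as unsigned.
    return (now & 0xFFFFFFFF) - (prev & 0xFFFFFFFF)


if __name__ == "__main__":
    prev, now = 0x7FFFFFF0, 0x80000010  # counter crosses bit 31
    print(delta_signed(prev, now), delta_unsigned(prev, now))
```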
Currently, NUMA node distance matrix is initialized only
when a machine descriptor (MD) exists. However, sun4u
machines (e.g. Sun Blade 2500) do not have an MD and thus
distance values were left uninitialized. The initialization
is now moved such that it happens on both sun4u and sun4v.
Signed-off-by: Nitin Gupta <nitin.m.gupta@oracle.com> Tested-by: Mikael Pettersson <mikpelinux@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 36beca6571c941b28b0798667608239731f9bc3a) Signed-off-by: Allen Pais <allen.pais@oracle.com>
Dmitry V. Levin [Sat, 26 Dec 2015 23:13:27 +0000 (02:13 +0300)]
sparc64: fix incorrect sign extension in sys_sparc64_personality
The value returned by sys_personality has type "long int".
It is saved to a variable of type "int", which is not a problem
yet because the type of task_struct->personality is "unsigned int".
The problem is the sign extension from "int" to "long int"
that happens on return from sys_sparc64_personality.
For example, a userspace call personality((unsigned) -EINVAL) will
result in any subsequent personality call, including an absolutely
harmless read-only personality(0xffffffff) call, failing with
errno set to EINVAL.
Signed-off-by: Dmitry V. Levin <ldv@altlinux.org> Cc: <stable@vger.kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 525fd5a94e1be0776fa652df5c687697db508c91) Signed-off-by: Allen Pais <allen.pais@oracle.com>
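The failure mode can be reproduced outside the kernel by emulating the integer conversions. The sketch below is an illustration under stated assumptions, not the kernel fix: `buggy_return` and `fixed_return` are hypothetical names, and the error test uses the usual Linux convention that return values from -4095 to -1 are treated as errno codes.

```python
EINVAL = 22
MAX_ERRNO = 4095  # Linux convention for the syscall error range


def to_int32(value):
    """Truncate to 32 bits and reinterpret as signed, like a C 'int'."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value >= (1 << 31) else value


def buggy_return(old_personality):
    # long -> int -> long: the widening back to long sign-extends.
    return to_int32(old_personality)


def fixed_return(old_personality):
    # Hypothetical remedy: keep the old value zero-extended.
    return old_personality & 0xFFFFFFFF


def looks_like_error(ret):
    return -MAX_ERRNO <= ret < 0


if __name__ == "__main__":
    old = (-EINVAL) & 0xFFFFFFFF  # what personality((unsigned) -EINVAL) stored
    print(hex(old), buggy_return(old), looks_like_error(buggy_return(old)))
```

With the old personality set to 0xffffffea, the sign-extended return value equals -EINVAL, so the C library reports an error even for a read-only query.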
Khalid Aziz [Thu, 17 Dec 2015 17:33:50 +0000 (10:33 -0700)]
sparc64: Add ADI capability to cpu capabilities
Add ADI (Application Data Integrity) capability to cpu capabilities list.
ADI capability allows virtual addresses to be encoded with a tag in
bits 63-60. This tag serves as an access control key for the regions
of virtual address with ADI enabled and a key set on them. Hypervisor
encodes this capability as "adp" in "hwcap-list" property in machine
description.
Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 82924e542f20e645bc7de86e2889fe3fb0858566) Signed-off-by: Allen Pais <allen.pais@oracle.com>
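As an illustration of a tag carried in bits 63-60 of a virtual address, the sketch below packs and unpacks a 4-bit value. It only demonstrates the bit layout described above; the helper names are hypothetical and nothing here models how the hardware checks tags.

```python
ADI_TAG_SHIFT = 60                      # tag occupies bits 63-60
ADI_TAG_MASK = 0xF << ADI_TAG_SHIFT
MASK64 = (1 << 64) - 1


def set_tag(addr, tag):
    """Return addr with its top four bits replaced by tag."""
    if not 0 <= tag <= 0xF:
        raise ValueError("tag must fit in 4 bits")
    return ((addr & ~ADI_TAG_MASK) & MASK64) | (tag << ADI_TAG_SHIFT)


def get_tag(addr):
    """Extract the 4-bit tag from bits 63-60."""
    return (addr >> ADI_TAG_SHIFT) & 0xF


def strip_tag(addr):
    """Return the address with the tag bits cleared."""
    return (addr & ~ADI_TAG_MASK) & MASK64


if __name__ == "__main__":
    tagged = set_tag(0x0000123456789000, 0xA)
    print(hex(tagged), get_tag(tagged), hex(strip_tag(tagged)))
```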
Sowmini Varadhan [Mon, 18 Jan 2016 21:12:09 +0000 (16:12 -0500)]
sunvnet: Initialize network_header and transport_header in vnet_rx_one()
vnet_fullcsum() accesses ip_hdr() and transport header to compute
the checksum for IPv4 packets, so these need to be initialized in
skb created in vnet_rx_one().
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Wim Coekaerts [Sun, 24 Jan 2016 23:59:22 +0000 (15:59 -0800)]
Add sun4v_wdt watchdog driver
This driver adds sparc hypervisor watchdog support. Timeout is set in
milliseconds since that is the granularity supported and it honors
the settings of both the watchdog-resolution and watchdog-max-timeout
MD properties.
Note that most watchdog drivers use timeout in seconds. This driver
requires timeout_ms as a module parameter and time in ms.
In this driver, the default is 60000ms or 60 seconds.
This driver also modifies hvcalls.S and changes sun4v_mach0_set_watchdog
such that it allows for NULL to be passed as 2nd parameter. This removes
the need to pass &time_remaining which is not useful.
This patch works around a problem where Solaris drops packets bound for
physical NICs (i.e., off host) that are using LSO and do not have the VIO v7
descriptor flags VNET_PKT_HASH, VNET_PKT_HCK_IPV4_HDRCKSUM,
VNET_PKT_HCK_FULLCKSUM set along with VNET_PKT_IPV4_LSO.
This patch can't go upstream because it doesn't actually support output
hashing (all packets will hash to '0'). The full and IPv4 header checksum
computations caused by the flags are unnecessary for Linux, but only affect
destinations through the vswitch.
(cherry picked from commit 3839694e54df457997025775894f954ea3185aff)
Rob Gardner [Wed, 23 Dec 2015 06:24:49 +0000 (23:24 -0700)]
sparc64: fix FP corruption in user copy functions
Short story: Exception handlers used by some copy_to_user() and
copy_from_user() functions do not diligently clean up floating point
register usage, and this can result in a user process seeing invalid
values in floating point registers. This sometimes makes the process
fail.
Long story: Several cpu-specific (NG4, NG2, U1, U3) memcpy functions
use floating point registers and VIS alignaddr/faligndata to
accelerate data copying when source and dest addresses don't align
well. Linux uses a lazy scheme for saving floating point registers; it
is not done upon entering the kernel since it's a very expensive
operation. Rather, it is done only when needed. If the kernel ends up
not using FP regs during the course of some trap or system call, then
it can return to user space without saving or restoring them.
The various memcpy functions begin their FP code with VISEntry (or a
variation thereof), which saves the FP regs. They conclude their FP
code with VISExit (or a variation) which essentially marks the FP regs
"clean", i.e., they contain no unsaved values. fprs.FPRS_FEF is turned
off so that a lazy restore will be triggered when/if the user process
accesses floating point regs again.
The bug is that the user copy variants of memcpy, copy_from_user() and
copy_to_user(), employ an exception handling mechanism to detect faults
when accessing user space addresses, and when this handler is invoked,
an immediate return from the function is forced, and VISExit is not
executed, thus leaving the fprs register in an indeterminate state,
but often with fprs.FPRS_FEF set and one or more dirty bits. This
results in a return to user space with invalid values in the FP regs,
and since fprs.FPRS_FEF is on, no lazy restore occurs.
This bug affects copy_to_user() and copy_from_user() for NG4, NG2,
U3, and U1. All are fixed by using a new exception handler for those
loads and stores that are done during the time between VISEntry and
VISExit.
n.b. In NG4memcpy, the problematic code can be triggered by a copy
size greater than 128 bytes and an unaligned source address. This bug
is known to be the cause of random user process memory corruptions
while perf is running with the callgraph option (i.e., perf record -g).
This occurs because perf uses copy_from_user() to read user stacks,
and may fault when it follows a stack frame pointer off to an
invalid page. Validation checks on the stack address just obscure
the underlying problem.
Signed-off-by: Rob Gardner <rob.gardner@oracle.com> Signed-off-by: Dave Aldridge <david.j.aldridge@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit a7c5724b5c17775ca8ea2fd9906d8a7e37337cce) Signed-off-by: Allen Pais <allen.pais@oracle.com>
Rob Gardner [Tue, 22 Dec 2015 04:48:03 +0000 (21:48 -0700)]
sparc64: Don't set %pil in rtrap_nmi too early
Commit 28a1f53 delays setting %pil to avoid potential
hardirq stack overflow in the common rtrap_irq path.
Setting %pil also needs to be delayed in the rtrap_nmi
path for the same reason.
Signed-off-by: Rob Gardner <rob.gardner@oracle.com> Signed-off-by: Dave Aldridge <david.j.aldridge@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 1ca04a4ce0d5131471c5a1fac76899dc2d9d3f36)
Vijay Kumar [Wed, 23 Dec 2015 10:28:30 +0000 (02:28 -0800)]
sparc64: 'NULL' char after break when sysrq enabled
When sysrq is triggered from the console, the serial driver for the SUN hypervisor
console receives a console break and enables sysrq. It expects a valid
sysrq char following the break. Meanwhile, if the driver receives a 'NULL'
ASCII char, it disables sysrq and the sysrq handler will never be invoked.
This fix skips calling the uart sysrq handler when 'NULL' is received while sysrq
is enabled.
Signed-off-by: Nitin Gupta <nitin.m.gupta@oracle.com> Reviewed-by: Chris Hyser <chris.hyser@oracle.com> Reviewed-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 52708d690b8be132ba9d294464625dbbdb9fa5df)
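The behaviour after the fix can be modelled as a small state machine. This is a simplified sketch, not the driver code: the class and method names are hypothetical, and ordinary (non-sysrq) character handling is left out.

```python
class ConsoleSysrq:
    """Toy model of break-then-key sysrq handling on a console."""

    def __init__(self):
        self.sysrq_armed = False  # set by a console break
        self.handled = []         # sysrq keys passed to the handler

    def on_break(self):
        self.sysrq_armed = True

    def on_char(self, ch):
        if not self.sysrq_armed:
            return  # normal input path, not modelled here
        if ch == "\0":
            # The fix: ignore the NUL and stay armed instead of
            # letting it cancel the pending sysrq.
            return
        self.sysrq_armed = False
        self.handled.append(ch)


if __name__ == "__main__":
    con = ConsoleSysrq()
    con.on_break()
    con.on_char("\0")  # stray NUL right after the break
    con.on_char("t")   # the real sysrq key
    print(con.handled)
```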
Dave Aldridge [Wed, 16 Dec 2015 17:04:25 +0000 (09:04 -0800)]
sparc64: Fix segfaults and incorrect data collection by perf
There are two perf problems addressed by this commit:
1) Perf doesn't produce totally accurate call graphs for the user
space processes being analyzed. This is because the kernel's
space identifier is used to access the user stack when a
privileged context is interrupted by irq15 (perf counter
interrupt).
2) Spurious segfaults and bus errors occur in random processes,
including perf itself. This is caused by the same situation as
above, but if the perf counter interrupt arrives in the middle of
handling a fault (i.e., TLB miss, page fault, etc), the running
thread's fault address and/or fault code (in thread_info) can be
inadvertently modified by a nested fault, which makes the fault
unresolvable, and the process is killed with a signal.
This inadvertent modification happens because perf interrupt
processing can (and will) incur a fault itself while walking
the user stack, and this can overwrite the fault information
for the handler that was interrupted.
Signed-off-by: Dave Aldridge <david.j.aldridge@oracle.com> Signed-off-by: Rob Gardner <rob.gardner@oracle.com>
(cherry picked from commit 21c8eb7e6a89f6be2a90ab8044ff64ceea8c2b36)
Tushar Dave [Thu, 10 Dec 2015 01:16:25 +0000 (17:16 -0800)]
i40e: Temporary workaround for DMA map issue
This is a quick temporary workaround for Bug 22107931.
iommu DMA map failure occurs when i40e saturates the iommu with a large number
of DMA map requests. E.g., a system running 128 CPUs can have at most 256K-1
entries in the iommu table, considering 8K page size and a 32-bit iommu (i.e.
2^31/PAGE_SIZE). On this system, the i40e driver by default has 128 Queue Pairs
(QP) per interface. For each Rx queue, i40e by default allocates 512 Rx
buffers, which generates 64K DMA map requests per interface. Four i40e
interfaces will generate a total of 256K DMA map requests. That is beyond what
the iommu can accommodate and therefore results in DMA map failure.
The correct fix would be for the i40e driver not to saturate iommu
resources and to bail out gracefully when a DMA map failure occurs.
However, due to the severity of the issue and the complexity involved in
implementing the correct resolution, this patch provides a quick temporary
workaround by just limiting the number of QP so that it does not exceed 32.
For the record, QP equal to 32 was chosen because QP has to be a power of 2 and
we can't have QP equal to 64, because in that case the number of DMA map
requests for Rx and Tx will be 256K and the iommu can only accommodate 256K-1.
i.e.
64 RX queues * 512 RX buffers = 32K , for 4 interfaces = 128K
64 TX queues * 512 TX buffers = 32K , for 4 interfaces = 128K
When an appropriate fix (as mentioned above) is ready, this quick temporary
workaround will be removed.
Note: this temporary workaround can have a negative impact on i40e network
performance.
Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
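The numbers in this message can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted above (8K pages, a 2^31-byte iommu space, 512 buffers per queue, four interfaces); the function names are illustrative and not part of the driver.

```python
PAGE_SIZE = 8 * 1024                 # 8K pages
IOMMU_SPACE = 1 << 31                # 32-bit iommu: 2^31 bytes
IOMMU_ENTRIES = IOMMU_SPACE // PAGE_SIZE - 1  # the "256K-1" limit


def dma_map_requests(queue_pairs, buffers_per_queue=512, interfaces=4):
    """Total Rx plus Tx DMA map requests across all interfaces."""
    per_direction = queue_pairs * buffers_per_queue * interfaces
    return 2 * per_direction


def fits_in_iommu(queue_pairs):
    return dma_map_requests(queue_pairs) <= IOMMU_ENTRIES


if __name__ == "__main__":
    for qp in (32, 64, 128):
        print(qp, dma_map_requests(qp), fits_in_iommu(qp))
```

With 64 queue pairs the total is exactly 256K, one more than the table can hold, which is why 32 is the largest power of two that fits.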
Aaron Young [Thu, 10 Dec 2015 14:39:54 +0000 (06:39 -0800)]
SPARC64: UEK4 LDOMS DOMAIN SERVICES UPDATE 2
This update provides fixes for LDom domain services on UEK4.
Including:
1. Add control vlds device for non-device specific operations.
2. Allocate larger LDC rx/tx queues based on MTU - same
algorithm as Solaris.
3. Fix default MTU for all ds devices to 4096 bytes.
If the macaddr is not from Open Firmware or IDPROM (i.e., the default
macaddr was used) then do not call i40e_macaddr_init again, else
you will get a driver init failure like this:
Babu Moger [Thu, 3 Dec 2015 18:24:38 +0000 (10:24 -0800)]
drivers/pci: Update the quirks for megaraid_sas adapter
This megaraid_sas adapter does have a valid pci vpd information.
Earlier commit 5bf1badcd02f ("pci: Limit VPD length for megaraid_sas
adapter") changed the vpd length to 0x80. This change fixed the panic.
However, we found that some lspci options do not work very well if
it cannot find a valid vpd tag (example command: "lspci -s 10:00.0 -vv").
It displays the error message and exits right away. Setting the length
to 0 fixes the problem.
Srikar Dronamraju [Wed, 24 Jun 2015 11:10:04 +0000 (16:40 +0530)]
perf bench numa: Fix to show proper convergence stats
With commit: e1e455f4f4d3 (perf tools: Work around lack of sched_getcpu
in glibc < 2.6), perf_bench numa mem with -c or -m option is not able to
correctly calculate convergence.
With the above commit, sched_getcpu always seems to return -1. The
intention of commit e1e455f was to add a sched_getcpu in glibc < 2.6.
Hence keep the sched_getcpu definition under an ifdef.
This regression occurred between v4.0 and v4.1.
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Vinson Lee <vlee@twitter.com> Fixes: e1e455f4f4d3 ("perf tools: Work around lack of sched_getcpu in glibc < 2.6") Link: http://lkml.kernel.org/r/20150624111004.GA5220@linux.vnet.ibm.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
(cherry picked from commit 2b42b09b88c831ba4da2d669581dde371c38c2af)
Dan Williams [Thu, 12 Nov 2015 20:13:57 +0000 (12:13 -0800)]
ALSA: pci: depend on ZONE_DMA
There are several sound drivers that 'select ZONE_DMA'. This is
backwards as ZONE_DMA is an architecture capability exported to drivers.
Switch the polarity of the dependency to disable these drivers when the
architecture does not support ZONE_DMA. This was discovered in the
context of testing/enabling devm_memremap_pages() which depends on
ZONE_DEVICE. ZONE_DEVICE in turn depends on !ZONE_DMA.
Reported-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Takashi Iwai <tiwai@suse.de>
(cherry picked from commit 2db1a57986d37653583e67ccbf13082aadc8f25d)
Babu Moger [Fri, 6 Nov 2015 22:11:01 +0000 (17:11 -0500)]
pci: Limit VPD length for megaraid_sas adapter
Reading or writing PCI VPD data causes a system panic.
We first saw this problem when running "lspci -vvv".
However, this can be easily reproduced by running
cat /sys/bus/devices/XX../vpd
VPD length has been set as 32768 by default. Accessing vpd
will trigger read/write of 32k. This causes problems as we
could read data beyond the VPD end tag. Behaviour is
unpredictable when this happens. I see some other adapters using
similar quirks (commit bffadffd43d4 ("PCI: fix VPD limit quirk
for Broadcom 5708S")).
I see there is an attempt to fix this the right way:
https://patchwork.ozlabs.org/patch/534843/ or
https://lkml.org/lkml/2015/10/23/97
I tried to fix it this way, but the problem is I don't see the proper
start/end tags (at least for this adapter) at all. The data is
mostly junk or zeros. This patch fixes the issue by setting the
vpd length to 0x80.
Signed-off-by: Babu Moger <babu.moger@oracle.com> Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Orabug: 22104511
Changes since v2 -> v3
Changed the vpd length from 0 to 0x80 which leaves the
option open for someone to read first few bytes.
Changes since v1 -> v2
Removed the changes in pci_id.h. Kept all the vendor
ids in quirks.c
(cherry picked from commit 8b95aa6b57dd64c4f36fa74026afe62de8c3afbb)
Aaron Young [Wed, 11 Nov 2015 14:24:54 +0000 (06:24 -0800)]
SPARC64: UEK4 LDOMS DOMAIN SERVICES UPDATE 1
This update provides several fixes for LDoms on UEK4
(i.e. fixes/updates for the ldoms UEK4 port done previously
- See BUG 21644721) and updates for ldoms to work with Zeus.
It also provides support for libpri which requires a HV call
to retrieve the PRI from the SP and enhancements to the vlds
driver to allow multiple processes to register/use a service
simultaneously - so multiple processes can use libpri at the
same time.
Sowmini Varadhan [Wed, 4 Nov 2015 19:39:56 +0000 (14:39 -0500)]
i40e: Look up MAC address in Open Firmware or IDPROM
This is the i40e equivalent of commit c762dff24c06 ("ixgbe: Look up MAC
address in Open Firmware or IDPROM").
As with that fix, attempt to look up the MAC address in Open Firmware
on systems that support it, and use IDPROM on SPARC if no OF address
is found.
In the case of the i40e there is an assumption that the default mac
address has already been set up as the primary mac filter on probe,
so if this filter is obtained from the Open Firmware or IDPROM, an
explicit write is needed via i40e_aq_mac_address_write() and
i40e_aq_add_macvlan() invocation.
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
(cherry picked from commit c7a3fd4e5d009b6b5bc90ee373aac232a7089068)
Nick Alcock [Mon, 2 Nov 2015 21:44:27 +0000 (21:44 +0000)]
sparc64, vdso: update the CLOCK_MONOTONIC_COARSE clock
In the significant rewrite involved in porting the SPARC vDSO code to
v4.1, the CLOCK_MONOTONIC_COARSE clock moved from being computed from
the CLOCK_REALTIME variable in the vvar page to being tracked by its own
vvar (mirroring a similar change done for x86): this vvar is maintained
in the same way, but is derived only once per jiffy tick rather than
every time clock_gettime() is called, which is likely faster under
sufficiently insane clock_gettime() than the way we did it in v3.0.
Unfortunately, the code to maintain this variable in
arch/sparc/kernel/vsyscall_gtod.c was never implemented, so
clock_gettime(CLOCK_MONOTONIC_COARSE) always returns zero. This is a
little coarser than the user would probably like.
Easily fixed by updating the relevant vvar in the same way as is done on
x86.
Reported-by: Wim Coekaerts <wim.coekaerts@oracle.com> Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-by: Dave Kleikamp <dave.kleikamp@oracle.com> Tested-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Orabug: 22137842
Allen Pais [Tue, 20 Oct 2015 18:42:59 +0000 (00:12 +0530)]
sparc: Accommodate mem64_offset != mem_offset in pbm configuration
PCI specs do not require mem_offset to be the same as mem64_offset.
This patch adds code to handle the cases where they are not the same
instead of panicking the kernel.
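As a rough illustration, the sketch below keeps one offset per memory
window and picks the right one when translating an address. The
struct pbm_model type and cpu_to_bus() are hypothetical, and treating
each offset as "CPU address minus bus address" is an assumption made
for the example, not something this changelog states.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the two per-window offsets. */
struct pbm_model {
    uint64_t mem_offset;   /* assumed: CPU address - bus address, 32-bit window */
    uint64_t mem64_offset; /* assumed: CPU address - bus address, 64-bit window */
};

/* Translate a CPU address to a bus address using the offset of the
 * window the resource lives in, rather than assuming one shared offset. */
uint64_t cpu_to_bus(const struct pbm_model *pbm, uint64_t cpu_addr,
                    bool is_mem64)
{
    uint64_t offset = is_mem64 ? pbm->mem64_offset : pbm->mem_offset;

    return cpu_addr - offset;
}
```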
Yinghai Lu [Thu, 8 Oct 2015 21:38:34 +0000 (14:38 -0700)]
PCI: Restore pref MMIO allocation logic for host bridge without mmio64
Commit 5b2854155 (PCI: Restrict 64-bit prefetchable bridge windows to
64-bit resources) changed the logic for pref mmio allocation:
When the bridge pref window supports mmio64, only children pref
resources that support mmio64 are put into it, and children pref
mmio32 resources are put into the bridge's non-pref mmio32 window.
That can leave the bridge pref BAR unused when that pref BAR is mmio64
and the children resources are only mmio32.
It can also cause allocation failures when the non-pref mmio32 window
is not big enough for those children pref mmio32 resources.
That is not rational when the host bridge does not have 64-bit mmio
above 4G at all.
This patch restores the old logic:
when the host bridge does not have has_mem64 set, put children pref
mmio64 and pref mmio32 all under the bridge's pref BARs.
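The placement rule can be written down as a small decision function.
This is a sketch of the rule as described in this message, not the
allocator code: the RES_* flag values, enum bridge_window and
pick_window() are hypothetical names used only here.

```c
#include <stdbool.h>

/* Local stand-ins for the resource flag bits; values are arbitrary. */
#define RES_PREFETCH 0x1u
#define RES_MEM_64   0x2u

enum bridge_window { WINDOW_NON_PREF, WINDOW_PREF };

/*
 * Decide which bridge window a child memory resource should go under.
 * The 64-bit-only restriction on the pref window applies only when the
 * host bridge really has mmio64 above 4G.
 */
enum bridge_window pick_window(unsigned int child_flags,
                               bool bridge_pref_is_64bit,
                               bool host_has_mem64)
{
    if (!(child_flags & RES_PREFETCH))
        return WINDOW_NON_PREF;

    if (host_has_mem64 && bridge_pref_is_64bit &&
        !(child_flags & RES_MEM_64))
        return WINDOW_NON_PREF; /* pref mmio32 kept out of a 64-bit pref window */

    return WINDOW_PREF; /* old logic: all pref resources share the pref window */
}
```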
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Orabug: 21826746
Yinghai Lu [Thu, 8 Oct 2015 21:38:32 +0000 (14:38 -0700)]
PCI: Add has_mem64 for struct host_bridge
Add has_mem64 to struct host_bridge. It is not set on a root bus that
does not support mmio64 above 4G.
The next two patches use that info to:
1. Not treat non-pref mmio64 as pref mmio, so it is not put under the
bridge's pref range when rescanning devices.
2. Keep pref mmio64 and pref mmio32 under the bridge pref BAR.
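One way such a flag could be derived is sketched below, under the
assumption that "supports mmio64 above 4G" means at least one root
bus memory window ends above 0xffffffff. struct mem_window and
root_has_mem64() are hypothetical; the patch itself may compute the
flag differently.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one root bus memory window. */
struct mem_window {
    uint64_t start;
    uint64_t end; /* inclusive */
};

/* Report whether any window reaches above the 4G boundary. */
bool root_has_mem64(const struct mem_window *windows, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (windows[i].end > 0xffffffffULL)
            return true;
    }
    return false;
}
```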
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Orabug: 21826746
Yinghai Lu [Thu, 8 Oct 2015 21:38:31 +0000 (14:38 -0700)]
PCI: Only treat non-pref mmio64 as pref if all bridges have MEM_64
If any bridge on the path up to the root has only 32-bit pref mmio, we
don't need to treat a device's non-pref mmio64 as pref mmio64.
We need to call pci_bridge_check_ranges() earlier:
a parent bridge's pref mmio BAR may not have been allocated by the
BIOS, so its res flags are still 0, and they must be set correctly
before we check them for child device resources.
-v2: check all bus resources instead of just res[15].
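The check can be pictured as a walk over every bridge between the
device and the root, looking at all of each bridge's resources rather
than one fixed index. The sketch below is a userspace model only:
struct bridge_model, the RES_* values and path_is_all_pref_mem64() are
hypothetical and do not mirror the kernel's data structures.

```c
#include <stdbool.h>
#include <stddef.h>

/* Local stand-ins for resource flag bits; values are arbitrary. */
#define RES_MEM      0x1u
#define RES_PREFETCH 0x2u
#define RES_MEM_64   0x4u
#define MAX_BRIDGE_RES 4

struct bridge_model {
    const struct bridge_model *upstream; /* NULL above the topmost bridge */
    unsigned int res_flags[MAX_BRIDGE_RES];
};

/* Look at every resource slot, not just one fixed index. */
static bool bridge_has_pref_mem64(const struct bridge_model *br)
{
    const unsigned int want = RES_MEM | RES_PREFETCH | RES_MEM_64;

    for (size_t i = 0; i < MAX_BRIDGE_RES; i++) {
        if ((br->res_flags[i] & want) == want)
            return true;
    }
    return false;
}

/* True only if every bridge from br up to the root has a 64-bit pref
 * window; one 32-bit-only bridge on the path makes the answer false. */
bool path_is_all_pref_mem64(const struct bridge_model *br)
{
    for (; br; br = br->upstream) {
        if (!bridge_has_pref_mem64(br))
            return false;
    }
    return true;
}
```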
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Orabug: 21826746
When all the bridges' 64-bit resources have the pref bit set but the
device resource does not, we cannot find a parent for the device
resource, as we cannot put non-pref mem under pref mem.
According to the PCIe spec errata
https://www.pcisig.com/specifications/pciexpress/base2/PCIe_Base_r2.1_Errata_08Jun10.pdf
page 13, in some cases it is OK to mark such a resource as pref.
Mark the device if the entire path from the host to the adapter is over
PCI Express. Then set the pref compatible bit for claim/sizing/assign
for 64-bit mem resources on that PCIe device.
-v2: set pref for mmio 64 when whole path is PCI Express, according to David Miller.
-v3: don't set pref directly, change to UNDER_PREF, and set PREF before
sizing and assigning resources, and clear PREF afterwards. Requested by BenH.
-v4: use on_all_pcie_path device flag instead.
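One reading of the notes above is that the device flag gates a
temporary PREF bit: it is added before claim/sizing/assign and removed
afterwards, so the resource's real flags are unchanged in the end.
The sketch below models that reading only. struct dev_model,
begin_pref_compat() and end_pref_compat() are hypothetical, and the
RES_* values are arbitrary local stand-ins.

```c
#include <stdbool.h>

/* Local stand-ins for resource flag bits; values are arbitrary. */
#define RES_PREFETCH 0x1u
#define RES_MEM_64   0x2u

struct dev_model {
    bool on_all_pcie_path;  /* whole path from host to device is PCIe */
    unsigned int res_flags; /* flags of one 64-bit memory resource */
};

/* Temporarily treat a non-pref mmio64 resource as pref.
 * Returns true if the bit was added and must be cleared later. */
bool begin_pref_compat(struct dev_model *dev)
{
    if (!dev->on_all_pcie_path)
        return false;
    if (!(dev->res_flags & RES_MEM_64))
        return false; /* only 64-bit resources qualify */
    if (dev->res_flags & RES_PREFETCH)
        return false; /* already pref, nothing to undo */

    dev->res_flags |= RES_PREFETCH;
    return true;
}

/* Undo begin_pref_compat() once sizing and assignment are done. */
void end_pref_compat(struct dev_model *dev, bool was_added)
{
    if (was_added)
        dev->res_flags &= ~RES_PREFETCH;
}
```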
Yinghai Lu [Thu, 8 Oct 2015 21:38:29 +0000 (14:38 -0700)]
OF/PCI: Add IORESOURCE_MEM_64 for 64-bit resource
To set the PREF bit on device resources under a bridge's 64-bit pref
resource, we need to make sure PREF is set only for 64-bit resources,
so set IORESOURCE_MEM_64 for 64-bit resources during OF device resource
flags parsing.
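For illustration, the sketch below decodes the first cell (phys.hi) of
an Open Firmware PCI address into resource flags, using the address
space code in bits 24-25 and the prefetchable bit 30 from the PCI bus
binding. The RES_* constants are local stand-ins rather than the
kernel's IORESOURCE_* values, and of_pci_flags() is a hypothetical
name.

```c
#include <stdint.h>

/* Local stand-ins for the kernel's resource flag bits. */
#define RES_IO       0x1u
#define RES_MEM      0x2u
#define RES_PREFETCH 0x4u
#define RES_MEM_64   0x8u

/* Decode the phys.hi cell of an OF PCI address into resource flags. */
unsigned int of_pci_flags(uint32_t phys_hi)
{
    unsigned int flags = 0;

    switch ((phys_hi >> 24) & 0x03) {
    case 0x01: /* I/O space */
        flags |= RES_IO;
        break;
    case 0x02: /* 32-bit memory space */
        flags |= RES_MEM;
        break;
    case 0x03: /* 64-bit memory space: also record the width */
        flags |= RES_MEM | RES_MEM_64;
        break;
    default:   /* configuration space: no resource flags */
        break;
    }
    if (phys_hi & 0x40000000u) /* prefetchable bit */
        flags |= RES_PREFETCH;
    return flags;
}
```

With the width recorded at parse time, later code can tell a 64-bit
memory resource from a 32-bit one before deciding whether PREF may be
applied.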
Yinghai Lu [Thu, 8 Oct 2015 21:38:26 +0000 (14:38 -0700)]
PCI: kill wrong quirk about M7101
Meelis reported that the qla2000 driver does not get loaded on one sparc system.
schizo f00732d0: PCI host bridge to bus 0001:00
pci_bus 0001:00: root bus resource [io 0x7fe01000000-0x7fe01ffffff] (bus address [0x0000-0xffffff])
pci 0001:00:06.0: quirk: [io 0x7fe01000800-0x7fe0100083f] claimed by ali7101 ACPI
pci 0001:00:06.0: quirk: [io 0x7fe01000600-0x7fe0100061f] claimed by ali7101 SMB
pci 0001:00:07.0: can't claim BAR 0 [io 0x7fe01000000-0x7fe0100ffff]: address conflict with 0001:00:06.0 [io 0x7fe01000600-0x7fe0100061f]
So the quirk for M7101 claims the io range early.
According to the spec for M7101 in the M1543, pages 103/104,
http://www.versalogic.com/Support/Downloads/pdf/ali1543.pdf
0xe0 and 0xe2 do not include address info for ACPI/SMB.
and we already had pref_compat support that adds an extra pref bit for
device resources.
It turns out that pci_resource_compatible()/pci_up_path_over_pref_mem64()
only check the resource against the bridge pref mmio register at idx 15,
and of_scan_pci_bridge() has put the resource at mmio register idx 14,
as the bridge does not have an mmio resource.
We already fixed pci_up_path_over_pref_mem64() to check all bus resources.
At the same time, this patch makes the resource order consistent with
other architectures and with pci_read_bridge_bases(), even if the
non-pref mmio is missing or firmware reports the ranges out of order.
So hold i = 1 for non-pref mmio, and i = 2 for pref mmio.
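The fixed-slot idea can be sketched as follows: each range goes to a
slot chosen by its type, regardless of the order in which firmware
reports it, and a missing type leaves its slot empty. This is a
userspace model with hypothetical names (struct range_model,
slot_for_range(), place_ranges()). Slot 0 for I/O is an assumption
made for the example; the text above only fixes slots 1 and 2.

```c
#include <stddef.h>

/* Local stand-ins for resource flag bits; values are arbitrary. */
#define RES_IO       0x1u
#define RES_MEM      0x2u
#define RES_PREFETCH 0x4u
#define NUM_BRIDGE_SLOTS 3

struct range_model {
    unsigned int flags;
    unsigned long long start;
    unsigned long long size;
};

/* Slot by type: 0 = I/O (assumed), 1 = non-pref mmio, 2 = pref mmio. */
int slot_for_range(unsigned int flags)
{
    if (flags & RES_IO)
        return 0;
    if (flags & RES_MEM)
        return (flags & RES_PREFETCH) ? 2 : 1;
    return -1;
}

/* Fill slots from ranges in any order; unused slots stay zeroed. */
void place_ranges(const struct range_model *ranges, size_t count,
                  struct range_model slots[NUM_BRIDGE_SLOTS])
{
    for (size_t i = 0; i < NUM_BRIDGE_SLOTS; i++)
        slots[i] = (struct range_model){0};

    for (size_t i = 0; i < count; i++) {
        int slot = slot_for_range(ranges[i].flags);

        if (slot >= 0)
            slots[slot] = ranges[i];
    }
}
```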
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Orabug: 21826746
Yinghai Lu [Thu, 8 Oct 2015 21:38:24 +0000 (14:38 -0700)]
sparc/PCI: Add IORESOURCE_MEM_64 for 64-bit resource in OF parsing
To set the PREF bit on device resources under a bridge's 64-bit pref
resource, we need to make sure PREF is set only for 64-bit resources,
so set IORESOURCE_MEM_64 for 64-bit resources during OF device resource
flags parsing.