The intent of this patch is to ensure that the mce stack is not
put in the panic stack trace when the kernel reboots due to an
Uncorrectable Error. The mce stack in the panic trace confuses
the administrator and falsely implicates the mce module as the culprit.
Hence a synchronization flag is added to machine-restart the system
when it experiences an uncorrectable error.
Earlier versions of the mptsas driver included a mechanism for
executing, and if necessary retrying, SCSI TEST UNIT READY commands
to ensure that devices complete their initialization during device
discovery. This functionality, present in UEK2, was never sent
upstream, and was lost when UEK4 was initiated.
We have been seeing flash devices returning errors, or simply
disappearing, during alter cell validate configuration operations
on Exadata systems. Giving the flash disks time to initialize
after (re-) discovery appears to resolve this issue.
This commit simply restores the missing functionality.
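For illustration, a minimal sketch of the kind of helper being restored; the
function name, timeout and retry count are hypothetical, and the actual UEK2
code may differ:

	#include <linux/delay.h>
	#include <linux/dma-mapping.h>
	#include <scsi/scsi.h>
	#include <scsi/scsi_device.h>
	#include <scsi/scsi_eh.h>

	/* Hypothetical helper: poll a freshly (re-)discovered device with TEST
	 * UNIT READY until it reports ready, giving flash devices time to
	 * finish initializing.
	 */
	static int mptsas_wait_unit_ready(struct scsi_device *sdev, int max_retries)
	{
		unsigned char cmd[6] = { TEST_UNIT_READY, 0, 0, 0, 0, 0 };
		struct scsi_sense_hdr sshdr;
		int i, ret = -ENODEV;

		for (i = 0; i < max_retries; i++) {
			ret = scsi_execute_req(sdev, cmd, DMA_NONE, NULL, 0,
					       &sshdr, 10 * HZ, 1, NULL);
			if (ret == 0)
				return 0;	/* device is ready */
			msleep(1000);		/* not ready yet; wait and retry */
		}
		return ret;
	}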
Signed-off-by: Dan Duval <dan.duval@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Christoph Hellwig [Fri, 16 Oct 2015 05:58:38 +0000 (07:58 +0200)]
nvme: refactor nvme_queue_rq
This "backports" the structure I've used for the fabrics driver. It
mostly started out as a cleanup so that I could actually understand
the code, but I think it also qualifies as a micro-optimization due
to the reduced time we hold q_lock and disable interrupts.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
Orabug: 24691685
mainline commit ba1ca37ea4e320c108c356eb8c91ac652afc57dd
Conflicts:
Adding GFP_ATOMIC to nvme_setup_prps and replacing
REQ_TYPE_DRV_PRIV with REQ_TYPE_SPECIAL
ueknano is a stripped-down version of uek-4.1. It contains only the
modules needed for Exadata systems. The reason for spinning off
nano kernels is to reduce the size of Exadata kernels.
Michal Hocko [Tue, 2 Aug 2016 21:02:34 +0000 (14:02 -0700)]
mm, hugetlb: fix huge_pte_alloc BUG_ON
Zhong Jiang has reported a BUG_ON from huge_pte_alloc hitting when he
runs his database load with memory online and offline running in
parallel. The reason is that huge_pmd_share might detect a shared pmd
which is currently migrated and so it has migration pte which is
!pte_huge.
There doesn't seem to be any easy way to prevent the race, and in
fact seeing the migration swap entry is not harmful. Both callers of
huge_pte_alloc are prepared to handle it. copy_hugetlb_page_range
will copy the swap entry and make it COW if needed. hugetlb_fault will
back off, so the page fault is retried if the page is still under
migration, and hugetlb_fault waits for the migration to complete.
That means that the BUG_ON is wrong and we should update it. Let's
simply check that all present ptes are pte_huge instead.
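The resulting check is roughly the following (a sketch based on the
description above; pte is the value just returned by pmd_alloc() in
huge_pte_alloc()):

	/* Only present ptes must be huge; a migration (swap) entry left
	 * behind by a shared pmd under migration is tolerated, since both
	 * callers of huge_pte_alloc() can handle it.
	 */
	BUG_ON(pte && pte_present(*pte) && !pte_huge(*pte));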
Link: http://lkml.kernel.org/r/20160721074340.GA26398@dhcp22.suse.cz Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: zhongjiang <zhongjiang@huawei.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Orabug: 24691289
(cherry picked from commit 4e666314d286765a9e61818b488c7372326654ec) Acked-by: Chuck Anderson <chuck.anderson@oracle.com>
'commit 62c230bc1790 ("mm: add support for a filesystem to activate swap
files and use direct_IO for writing swap pages")' replaced the swap_aops
dirty hook __set_page_dirty_no_writeback() with swap_set_page_dirty().
For normal cases without these special SWP flags, the
code path falls back to __set_page_dirty_no_writeback(),
so behaviour is expected to be the same as before.
But swap_set_page_dirty() uses the helper page_swap_info() to
get the sis (swap_info_struct) and check for flags like SWP_FILE,
SWP_BLKDEV etc. as required by those features. This helper has
BUG_ON(!PageSwapCache(page)), which is racy and safe only for the
set_page_dirty_lock() path. For the set_page_dirty() path, which is
often needed for cases called from irq context, kswapd
can toggle the flag behind our back while the call is
executing when the system is low on memory and heavy
swapping is ongoing.
This ends up in an undesired kernel panic. The patch just moves
the check outside the helper to its users as appropriate
to fix the kernel panic for the described path. A couple
of users of the helper already take care of the SwapCache
condition, so I skipped them.
Thanks to Wengang for extensive debugging using vm cores
and to Avinash for his thoughts about the issue.
Nitin Gupta [Thu, 25 Aug 2016 18:33:27 +0000 (11:33 -0700)]
sparc64: Fix sentinel page table entry for 16G
Currently no page table trimming is done for 16G pages,
so _PAGE_PMD_HUGE must not be set for 16G. Also, for
this size, trimming would be done at the PUD level, so
this flag should not be set anyway.
Nitin Gupta [Thu, 2 Jun 2016 22:14:42 +0000 (15:14 -0700)]
sparc64: Trim page tables for 2G pages
Currently, mapping a 2G page requires 256*1024 PTE entries.
This results in large amounts of RAM being used just for
storing page tables. We now use 256 PMD entries to map a
2G page, which is much more space efficient.
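The numbers above follow directly from the sparc64 page geometry (8K base
pages, 8M mapped per PMD entry); a quick userspace check of the arithmetic:

	#include <stdio.h>

	int main(void)
	{
		unsigned long map = 2UL << 30;		/* 2G mapping */

		printf("PTE entries: %lu\n", map / (8UL << 10));  /* 262144 = 256*1024 */
		printf("PMD entries: %lu\n", map / (8UL << 20));  /* 256 */
		return 0;
	}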
Nitin Gupta [Fri, 27 May 2016 21:58:13 +0000 (14:58 -0700)]
sparc64: Trim page tables at PMD for hugepages
For PMD-aligned (8M) hugepages, we currently allocate
all four page table levels, which is wasteful. We now
allocate only down to the PMD level, which reduces the
memory used for page tables.
Signed-off-by: Larry Bassel <larry.bassel@oracle.com> Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
All signal frames must be at least 16-byte aligned, because that is
the alignment we explicitly create when we build signal return stack
frames.
All stack pointers must be at least 8-byte aligned.
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
arch/sparc/kernel/signal32.c - modified patch context so that it would apply
Signed-off-by: Larry Bassel <larry.bassel@oracle.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
The number of context IDs supported by the hardware
is reported via the machine descriptor for sun4v
systems. For systems newer than T3, 16 bits are used
to represent the context ID in the HW. On these
systems the context ID wrap-around happens if
there are more than 65536 processes running
simultaneously. On older systems,
13 bits are used and the context ID wraps around
if there are more than 8192 processes running simultaneously.
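As an illustration only, the wrap-around thresholds quoted above follow
directly from the number of context-ID bits reported by the machine
descriptor:

	/* 16 bits -> 65536 contexts, 13 bits -> 8192 contexts */
	static unsigned long max_ctx_ids(unsigned int ctx_bits)
	{
		return 1UL << ctx_bits;
	}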
Reviewed-by: Babu Moger <babu.moger@oracle.com> Acked-by: Rob Gardner <rob.gardner@oracle.com> Signed-off-by: Sanath Kumar <sanath.s.kumar@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
David S. Miller [Sun, 29 May 2016 03:41:12 +0000 (20:41 -0700)]
sparc64: Fix return from trap window fill crashes.
We must handle data access exception as well as memory address unaligned
exceptions from return from trap window fill faults, not just normal
TLB misses.
Otherwise we can get an OOPS that looks like this:
The window trap handlers are slightly clever; the trap table entries for them are
composed of two pieces of code. First comes the code that actually performs
the window fill or spill trap handling, and then there are three instructions at
the end which are for exception processing.
And the way this works is that if any of those memory accesses
generate an exception, the exception handler can revector to one of
those final three branch instructions depending upon which kind of
exception the memory access took. In this way, the fault handler
doesn't have to know if it was a spill or a fill that it's handling
the fault for. It just always branches to the last instruction in
the parent trap's handler.
All window trap handlers are 0x80 aligned, so if we "or" 0x7c into the
trap time program counter, we'll get that final instruction in the
trap handler.
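In code form, the computation described above is simply (illustrative only):

	/* Window trap handlers are 0x80-byte aligned; the final (exception
	 * processing) instruction lives at offset 0x7c within the handler.
	 */
	static unsigned long fault_branch_pc(unsigned long tpc)
	{
		return tpc | 0x7c;
	}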
On return from trap, we have to pull the register window in but we do
this by hand instead of just executing a "restore" instruction for
several reasons. The largest being that from Niagara and onward we
simply don't have enough levels in the trap stack to fully resolve all
possible exception cases of a window fault when we are already at
trap level 1 (which we enter to get ready to return from the original
trap).
This is executed inline via the FILL_*_RTRAP handlers. rtrap_64.S's
code branches directly to these to do the window fill by hand if
necessary. Now if you look at them, we'll see at the end:
And oops, all three cases are handled like a fault.
This doesn't work because each of these trap types (data access
exception, memory address unaligned, and faults) store their auxiliary
info in different registers to pass on to the C handler which does the
real work.
So in the case where the stack was unaligned, the unaligned trap
handler sets up the arg registers one way, and then we branched to
the fault handler which expects them setup another way.
So the FAULT_TYPE_* value ends up basically being garbage, and
randomly would generate the backtrace seen above.
David S. Miller [Wed, 25 May 2016 19:51:20 +0000 (12:51 -0700)]
sparc64: Take ctx_alloc_lock properly in hugetlb_setup().
On cheetahplus chips we take the ctx_alloc_lock in order to
modify the TLB lookup parameters for the indexed TLBs, which
are stored in the context register.
This is called with interrupts disabled; however, ctx_alloc_lock
is an IRQ-safe lock, therefore we must acquire/release it
properly with spin_{lock,unlock}_irq().
Reported-by: Meelis Roos <mroos@linux.ee> Tested-by: Meelis Roos <mroos@linux.ee> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Allen Pais <allen.pais@oracle.com>
Reported-by: Ilya Malakhov <ilmalakhovthefirst@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Allen Pais <allen.pais@oracle.com>
max_active determines the maximum number of execution contexts per
CPU which can be assigned to the work items of a wq. For example,
with @max_active of 16, at most 16 work items of the wq can be
executing at the same time per CPU.
Currently, for a bound wq, the maximum limit for @max_active is 512
and the default value used when 0 is specified is 256. For an unbound
wq, the limit is the higher of 512 and 4 * num_possible_cpus(). These
values are chosen sufficiently high such that they are not the
limiting factor while providing protection in runaway cases.
The number of active work items of a wq is usually regulated by the
users of the wq, more specifically, by how many work items the users
may queue at the same time. Unless there is a specific need for
throttling the number of active work items, specifying '0' is
recommended.
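For illustration, a minimal module following the recommendation above (the
workqueue name and module are hypothetical):

	#include <linux/module.h>
	#include <linux/workqueue.h>

	static struct workqueue_struct *example_wq;	/* hypothetical workqueue */

	static int __init example_init(void)
	{
		/* max_active = 0: take the default (256 for a bound wq), as
		 * recommended unless throttling is explicitly required.
		 */
		example_wq = alloc_workqueue("example_wq", 0, 0);
		return example_wq ? 0 : -ENOMEM;
	}

	static void __exit example_exit(void)
	{
		destroy_workqueue(example_wq);
	}

	module_init(example_init);
	module_exit(example_exit);
	MODULE_LICENSE("GPL");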
Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> Reviewed-by: Liam Merwick <Liam.Merwick@oracle.com> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
(cherry picked from commit b584786e611e8e8a28830386e8b3db8874d794c5)
(cherry picked from commit f2559a96b70562267f01d5bb62ef44aa9f0c0cd8) Signed-off-by: Allen Pais <allen.pais@oracle.com>
Nitin Gupta [Thu, 26 May 2016 21:56:19 +0000 (14:56 -0700)]
sparc64: Reduce TLB flushes during hugepte changes
During hugepage map/unmap, TSB and TLB flushes are currently
issued at every PAGE_SIZE'd boundary which is unnecessary.
We now issue the flush at REAL_HPAGE_SIZE boundaries only.
Without this patch workloads which unmap a large hugepage
backed VMA region get CPU lockups due to excessive TLB
flush calls.
Dwight Engen [Fri, 16 Jan 2015 22:19:39 +0000 (17:19 -0500)]
sunvdc: don't dereference port->disk before disk probe finishes
If the backing file for a vdisk is not present in the service domain an
ldc reset can occur during the initial port/disk probing. The ldc reset
logic was dereferencing port->disk, which may not have been setup yet.
Guard against this case.
chris hyser [Tue, 17 May 2016 20:05:35 +0000 (13:05 -0700)]
sparc64: This patch adds PRIQ support.
This patch supports INT_A through INT_D interrupts as described
by the Open Firmware device tree as well as MSI vectors registered
by PCIe drivers. pci=nomsi may not work though frankly that makes no
sense on a SPARC machine.
The command line parameter priq=off reverts to prior MSIEQ interrupt
mechanism.
chris hyser [Thu, 19 May 2016 20:05:47 +0000 (13:05 -0700)]
sparc64: Enable aggressive setting of PCIe MPS settings
This patch connects SPARC PCIe into the generic PCIe framework enabling
MPS and MRRS to be set aggressively subject to the standard command line
flags. To enable put "pci=pcie_bus_perf" on command line.
chris hyser [Tue, 17 May 2016 16:39:25 +0000 (09:39 -0700)]
sparc64: Allow redirection of MSI/MSI-X IRQs
Allows redirection of MSI/MSI-X IRQs by finding appropriate MSIEQ and
re-routing its IRQ. Also handles driver IRQs sharing the same MSIEQ.
Affinity masks for all such shared interrupts as well as the MSIQ IRQ
are modified. Note that, based on the HW sharing, this patch can change
related driver IRQs in an invisible manner. While confusing and not
desirable, this is an artifact of the HW design.
Rob Gardner [Tue, 9 Feb 2016 22:38:05 +0000 (15:38 -0700)]
IPMI: Driver for Sparc T4/T5/T7 Platforms
Functional IPMI interface driver for Sparc T4/T5/T7. This will
probably also work for other platforms that use an iLOM channel
for IPMI services, including older and future ones, though these
have not been tested.
This driver provides the transport between the IPMI message layer
and the Sparc platform IPMI endpoint in iLOM. The Virtual Logical
Domain Channel (VLDC) driver claims the host endpoint, and we call
it to move data to/from iLOM. So there is an unusual dependency
on another loadable module which requires several compromises
until we work out a plan to restructure the VLDC driver to provide
a cleaner interface:
* An artificial symbolic dependency on vldc is created so that
"modprobe ipmi_si" will ensure that vldc is loaded also.
* ipmi_vldc uses filp_open/kernel_read/kernel_write on device
files provided by vldc, i.e., /sys/class/vldc/ipmi/mode and
/dev/vldc/ipmi (see the sketch below).
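A hedged sketch of that compromise; the helper names are hypothetical, and
the kernel_read/kernel_write signatures are those of v4.1-era kernels, so
the actual ipmi_vldc code may differ:

	#include <linux/fs.h>

	static struct file *vldc_filp;	/* hypothetical handle to /dev/vldc/ipmi */

	static int ipmi_vldc_open(void)
	{
		vldc_filp = filp_open("/dev/vldc/ipmi", O_RDWR, 0);
		return IS_ERR(vldc_filp) ? PTR_ERR(vldc_filp) : 0;
	}

	static ssize_t ipmi_vldc_send(const char *buf, size_t len)
	{
		/* kernel_write(file, buf, count, pos) in kernels of this era */
		return kernel_write(vldc_filp, buf, len, 0);
	}

	static ssize_t ipmi_vldc_recv(char *buf, size_t len)
	{
		/* kernel_read(file, offset, addr, count) in kernels of this era */
		return kernel_read(vldc_filp, 0, buf, len);
	}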
Bug 22804422 has been created to deal with these issues.
Sending this driver upstream is on hold until we work out these
issues. Also, the vldc driver itself has not yet been sent upstream
and that is obviously a prerequisite.
Bob Liu [Fri, 9 Sep 2016 19:44:08 +0000 (15:44 -0400)]
xen-blkback: don't get ref for each queue
xen_blkif_get() for each queue is useless and introduces a bug.
If there is I/O in flight, xen_blkif_disconnect() will return busy and
xen_blkif_put() will not be called.
Then, even after the I/O completes, xen_blkif_put() can't free all resources.
Orabug: 24661443 Signed-off-by: Bob Liu <bob.liu@oracle.com>
For systems which want a lower fragment setting because of
smaller memory footprints, the module parameter 'rds_ib_max_frag'
can be used to set a lower value like 4K or 8K.
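A minimal sketch of how such a knob is exposed (the default value shown is
illustrative, not necessarily the driver's actual default):

	#include <linux/module.h>

	static unsigned int rds_ib_max_frag = 16384;	/* illustrative default */
	module_param(rds_ib_max_frag, uint, 0444);
	MODULE_PARM_DESC(rds_ib_max_frag,
			 "Max RDS IB fragment size in bytes (e.g. 4096 or 8192 for smaller footprints)");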
rds: avoid call to flush_mrs() in specific condition
This is to reduce process spawn time.
When the user provides 0 values for cookie and flags in the rds_free_mr() call,
avoid calling flush_mrs().
skgxp uses the cookie 0 and flags 0 combination to check whether the
transport is RDMA capable or not.
This is a short-term hack for a customer escalation.
The customer has other processes which are calling flush_mrs(), and
that is causing mutex contention.
The skgxp change is fairly significant, and we want to provide a minimal
change in the customer environment.
The risk factor here is that if there is any other use of the cookie 0 and
flags 0 combination (like freeing up unused MRs), then that use will be impacted.
Code inspection of skgxp and skgnfs by Leo/Avneesh suggests that this
combination is not used anywhere.
The long-term solution for this requires changes in RDS as well as in the
skgxp application, which should be done in the next UEK release.
The required RDS changes are present in UEK4; however, the skgxp changes are
still outstanding. Since this was an escalation from a major customer, we
require this hack in UEK4.
sif: Lift sif_verbs up to be independent of sif internal headers
The sif_verbs.h file needs to be independent of
other header files so that it can be included from other kernel
code. This is necessary to avoid duplicate definitions of
the API elements. For Oracle Linux this file now moves from
drivers/infiniband/hw/sif/ to include/rdma/ to make it
available for the RDS and uvNIC drivers.
This is a temporary but necessary measure while we wait
for proper generic interfaces to be defined at the common
verbs layer.
The ipd is calculated wrongly because it compares the active speed enum
with the value returned from ib_rate_to_mult(). Thus, this patch converts the
PSIF active speed enum to a multiple of the base rate of SDR (2.5 Gbps).
Signed-off-by: Wei Lin Guay <wei.lin.guay@oracle.com> Reviewed-by: Knut Omang <knut.omang@oracle.com>
Knut Omang [Wed, 31 Aug 2016 07:38:31 +0000 (09:38 +0200)]
sif: Fix recently introduced checkpatch issues
It appears the commit check in checkpatch does not capture
all errors. Fix the new ones in the driver code to
allow us to enable a regression test for it.
During the QP transition from RTS -> ERR, the HW might generate a
duplicate FLUSHED-IN-ERR completion. The SIF driver inverts the
sq_seq in a dedicated completion entry and sets the
CQ_POLLING_IGNORED_SEQ bit in the cq_sw flags. Nevertheless, this bit
is cleared once a duplicate FLUSHED-IN-ERR completion is detected in
poll_cq.
The above-mentioned method cannot handle a scenario where the HW generates
multiple duplicate completions. Thus, this patch moves the detection
of duplicate completions to translate_wr_id. The SIF driver
will then only return non-duplicate completions to the user.
Signed-off-by: Wei Lin Guay <wei.lin.guay@oracle.com> Reviewed-by: Knut Omang <knut.omang@oracle.com>
ib_core: make wait_event uninterruptible in ib_flush_fmr_pool()
Replace wait_event_interruptible() with wait_event() in
ib_flush_fmr_pool() to avoid deallocating the pd before fmr_cleanup_thread
tears down the pool of fmrs.
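The tail of ib_flush_fmr_pool() after the change looks roughly like this
(field names as in drivers/infiniband/core/fmr_pool.c; surrounding context
abbreviated):

	serial = atomic_inc_return(&pool->req_ser);
	wake_up_process(pool->thread);

	/* Previously wait_event_interruptible(): a signal could make us
	 * return -EINTR and let the caller deallocate the pd while
	 * fmr_cleanup_thread was still tearing down the pool of fmrs.
	 */
	wait_event(pool->force_wait,
		   atomic_read(&pool->flush_ser) - serial >= 0);

	return 0;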
Ashish Samant [Wed, 3 Aug 2016 02:26:30 +0000 (19:26 -0700)]
ocfs2: Fix start offset to ocfs2_zero_range_for_truncate()
If we punch a hole on a reflink such that the following conditions are met:
1. start offset is on a cluster boundary
2. end offset is not on a cluster boundary
3. (end offset is somewhere in another extent) or
(hole range > MAX_CONTIG_BYTES(1MB)),
we don't COW the first cluster starting at the start offset. But in this
case, we were wrongly passing this cluster to
ocfs2_zero_range_for_truncate() to zero out. This will modify the cluster
in place and zero it in the source too.
Fix this by skipping this cluster in such a scenario.
Keith Busch [Fri, 22 May 2015 18:28:31 +0000 (12:28 -0600)]
NVMe: Fix obtaining command result
Replaces the use of req->sense_len, which is not owned by the LLD, with
req->special to contain the command result for driver-created commands,
and sets the result unconditionally on completion.
Signed-off-by: Keith Busch <keith.busch@intel.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Jens Axboe <axboe@fb.com> Fixes: d29ec8241c10 ("nvme: submit internal commands through the block layer") Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit a0a931d6a2c1fbc5d5966ebf0e7a043748692c22 and
added missing pieces from d29ec8241c10eacf59c23b3828a88dbae06e7e3f
backport)
Orabug: 24532912 Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Konrad Rzeszutek Wilk [Fri, 19 Aug 2016 15:06:44 +0000 (11:06 -0400)]
x86/xen: Add x86_platform.is_untracked_pat_range quirk to ignore ISA regions.
On x86 whenever VMAs are setup, the 'is_ISA_range quirk' (which this
patch re-implements) is used to figure whether to ignore the
requested PAT type and always use WB (see 'reserve_memtype').
Specifically it forces the WB type for any region in the ISA space.
From the Intel SDM, the combination of MTRR (UC, which is setup by
the BIOS) and PAT (UC or WB) for the ISA region ends up with the same
value - UC.
However on Xen, due to XSA 154, we enforce that _ANY_ pagetable entry
mapping an MMIO range MUST have the same cachability
mapping - and in this case we enforce UC.
Which means that with XSA 154 (and without this patch) any application
that maps /dev/mem to get SMBIOS information (like mcelog) and pokes
in the ISA region will not have a PTE set. That is due to
reserve_pfn_range returning -EINVAL, which results in the PTE not being set.
[These are debug entries added in 'reserve_pfn_range']
mcelog:2471 0xf0000->0xf1000, req_type=write-back new_type=write-back
mcelog:2471 0xeb000->0xed000, req_type=write-back new_type=write-back
.. above are successful ones, but:
mcelog:2471 0xeb000->0xed000, req_type=uncached new_type=uncached
[again, a debug one:]
mcelog:2471 want=uncached got=write-back strict 0x000eb000-0x000ecfff
mcelog:2471 map pfn expected mapping type uncached for [mem 0x000eb000-0x000ecfff], got write-back
------------[ cut here ]------------
The effective result of the function below is for 'reserve_memtype'
to ignore the result from the 'x86_platform.is_untracked_pat_range' quirk,
which means that the splat above does not happen.
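A hedged sketch of an override that produces the behaviour described above;
the function names are illustrative, while x86_platform.is_untracked_pat_range
is the existing hook whose default is the is_ISA_range() check:

	/* Never report a range as "untracked": reserve_memtype() then tracks
	 * the ISA hole like any other range and honours the requested UC type,
	 * so the mcelog splat above does not happen.
	 */
	static bool xen_is_untracked_pat_range(u64 start, u64 end)
	{
		return false;
	}

	static void __init xen_pat_quirk_init(void)
	{
		x86_platform.is_untracked_pat_range = xen_is_untracked_pat_range;
	}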
Orabug: 24491985 Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
When an unmapped hw queue is remapped after the CPU topology is changed,
hctx->tags->cpumask has to be set after hctx->tags is set up in
blk_mq_map_swqueue(); otherwise it causes a null pointer dereference.
Fixes: f26cdc8536 ("blk-mq: Shared tag enhancements") Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Keith Busch <keith.busch@intel.com> Cc: Ming Lei <tom.leiming@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 1356aae08338f1c19ce1c67bf8c543a267688fc3) Signed-off-by: Bob Liu <bob.liu@oracle.com>
Currently q->mq_ops is used widely to decide if the queue
is mq or not, so we should set the 'flag' asap so that both
block core and drivers can get the correct mq info.
For example, commit 868f2f0b720 (blk-mq: dynamic h/w context count)
moves the hctx's initialization before setting q->mq_ops in
blk_mq_init_allocated_queue(), which causes blk_alloc_flush_queue()
to think the queue is non-mq and not allocate the command size
for the per-hctx flush rq.
This patch should fix the problem reported by Sasha.
Cc: Keith Busch <keith.busch@intel.com> Reported-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Fixes: 868f2f0b720 ("blk-mq: dynamic h/w context count") Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 66841672161efb9e3be4a1dbd9755020bb1d86b7) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Harald Høeg [Tue, 5 Jul 2016 16:41:04 +0000 (18:41 +0200)]
sif: vlink connect is now enabled by default
This fix makes the default link failover behaviour compatible with existing
Mellanox CX3. Internal link status (PortState) will now follow external
link status (PortState) by default.
Driver feature mask SIFF_vlink_disconnect may be used to set default
behaviour to "vlink connect"=disabled.
Knut Omang [Tue, 9 Aug 2016 14:10:39 +0000 (16:10 +0200)]
sif: base: Scale default desc.array size values based on #of available CBs
With the default values for the number of QPs and MRs set high,
33 instances of the driver would consume a lot
of memory just to initialize basic tables, since each of these
instances has its own 1M QP space and in effect allocates
the same amount of resources that a bare-metal, single-instance
driver would.
The number of collect buffers assigned to the PCIe function tells us
what fraction of the hardware resources we got, and a small
fraction of the 16K CB space indicates that the function competes with
other functions on resources, and that it is unlikely that the same
huge number of QPs etc can be deployed with high performance
anyway.
This commit introduces tracking of module parameter settings
compared to the default values, and if the compiled-in defaults are used,
we scale down the number of QPs etc. by a factor corresponding
to the fraction of CBs we got.
This yields e.g. 32K QPs per function in a 32-VF-enabled system
and significantly reduces system-wide memory usage in a
virtualized environment (whether Xen based or not).
Users can still override settings using the module parameters,
which will not be subject to scaling if they deviate from the
compiled-in defaults.
Knut Omang [Tue, 9 Aug 2016 09:07:23 +0000 (11:07 +0200)]
sif: cb: Improve algorithm for allocating and using CBs from driver
Instead of allocating bandwidth collect buffers (CBs)
as a fallback for latency CBs, and spamming the kernel log
with failure messages, multiplex use across
the actually allocated number of latency CBs and just report
the failure to allocate once, with values to improve debugging.
Improves behaviour for scenarios where available CB resources
are spread across many VFs but VF drivers still see a lot
of (virtual) CPUs, which will easily be the case with the
default VF settings for Xen dom0.
Also, the low latency property is most critical for req.notify PQP
requests. Use high bandwidth CBs also for PQP operations other than
the REARM request, which is the performance critical req. for
req_notify_cq. This should improve performance for event based
applications under high load.
sif: epsc: For Xen dom0 configure resources for all 32 VFs at driver load
As of EPSC API version 2.9 firmware can distribute resources based on
the number of PCI functions the PF driver requests support for.
Older firmware will just ignore the value.
This commit enforces no VFs configured as the default setting
but enables all 32 VFs if a Xen PV domain is detected.
To allow overriding this behaviour we add a new module parameter
vf_max which can be used to override the number of VFs configured
for instance for use with other virtualization engines than Xen
and for debugging/tuning purposes. The vf_max parameter takes the
following values:
-2: Use NVRAM configured firmware defaults (backward compat mode)
-1 (now default) : Exadata mode as described above
0-32: Configure explicitly for that many VFs (only selected values
are supported by firmware)
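A sketch of the parameter declaration (illustrative; the variable name inside
the driver may differ):

	#include <linux/module.h>

	static int vf_max = -1;	/* -1: Exadata mode (enable 32 VFs if Xen dom0 detected) */
	module_param(vf_max, int, 0444);
	MODULE_PARM_DESC(vf_max,
			 "VFs to configure: -2 firmware default, -1 auto (default), 0-32 explicit");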
Knut Omang [Wed, 10 Aug 2016 11:07:55 +0000 (13:07 +0200)]
sif: fmr: invalidate keys before TLB bulk invalidates
This commit reorders and sequentializes the cleanup phase when
bulk invalidates are used. The order was to post the TLB flushing
operation to the EPSC, then invalidate keys (potentially in parallel with the
ongoing flushing) before finally waiting for the TLB flushing to complete.
This way is not considered safe in general, as an incoming access to a key
can cause an invalidated PTE or PTW to be cached again and later cause
sif to read or write to a no longer valid location.
This commit makes sure that all keys are invalidated before
the TLB flushing is triggered.
The SIF driver needs to generate FLUSHED-IN-ERR completions using the pqp
during the QP tear-down phase. Nevertheless, a faulty application or an
application that does not rely on the completion (e.g. ibv_*pingpong)
might cause the pqp to generate a completion to a full CQ. Consequently, the
pqp transitions to the ERR state, and this will eventually cause the
system to crash. This patch checks for this scenario to prevent
a system crash.
Due to a SIF HW bug where SIF might generate duplicate completions,
the QP state must be transitioned into shadowed ERR state (HW state is
in RESET). In this case, modify_qp(ERR) will cause the QP state
transitions(HW/SW): from ERR (ERR) to RESET (ERR).
As a result, this means that SIF driver needs to generate
FLUSHED-IN-ERR completions when IB user performs post_send while the
QP is in shadowed ERR state. HW will not generate them as the HW QP
state is already in RESET state. The SIF driver generates
FLUSHED-IN-ERR if the last_set_state is in ERR state. last_set_state
is a "best effort" tracked state because QP mutex cannot be held in a
non-sleep context (post_send).
The issue happens in a multi-threaded scenario where one thread is
constantly performing post_send whereas another thread is performing
modify_qp (ERR). During the QP state transition from ERR (ERR) to
RESET (ERR), both the HW and the SIF driver generate FLUSHED-IN-ERR completions,
eventually causing a duplicate completion. This patch adds a
test in post_wa4074 to mask out this condition.
Signed-off-by: Wei Lin Guay <wei.lin.guay@oracle.com> Reviewed-by: Knut Omang <knut.omang@oracle.com>
Knut Omang [Mon, 8 Aug 2016 04:43:04 +0000 (06:43 +0200)]
sif: mmu/fmr: Fix check for page table reusability
The FMR mapping logic attempts to reuse the page table if memory layout is
sufficiently similar. Currently this optimization is for simplicity
limited to very similar memory layouts and does not handle changes
in the page table level (base page size).
The test for this only considered page sizes going from small to larger
and not the opposite.
This scenario is triggered by NFSoRDMA if the previous use was in
huge page mappable memory and the current use is in more
fragmented memory.
Robert Schmidt [Thu, 11 Aug 2016 14:14:12 +0000 (16:14 +0200)]
sif: PSC_API_VERSION(2,9): add num_ufs to psif_epsc_csr_config
The new member num_ufs can be used by the PF driver to request a
number of UFs FW shall support.
0: use default value stored on card
1: PF (UF 0) only
...
33: fully virtualized
>33: capped by FW to 33 i.e. fully virtualized
-1: alternative PF only config not for official use
There is a race between the rds connection destruction (rds_ib_conn_shutdown) path
and the IRQ path (rds_ib_cq_comp_handler_recv). The IRQ path can schedule the
tasklet (i_rtasklet) again (to receive data) between the removal of the
tasklet from the list and the destruction of the connection in the destruction
path. When the tasklet runs, it would then access stale (destroyed) data.
One observed case was the tasklet accessing ic->i_rcq, which is set to NULL by
the destruction path.
Fix:
We add a flag to the rds_ib_connection structure indicating, when set, that the
connection is being destroyed. The flag is set after we reap on the receive CQ i_rcq
and before we start to destroy the CQ in rds_ib_conn_shutdown(). We also flush the
rds_ib_rx running in the rds_aux_wq worker thread before starting the destroy, so
that all existing runs of rds_ib_rx (in the tasklet path and the worker thread path)
won't access the destroyed receive CQ. And a newly queued job (tasklet or worker) will
exit on seeing the flag set, before accessing the (maybe destroyed) receive CQ.
The flag is unset on new connection completions to allow access to the re-created
receive CQ. This patch also takes care of rds_ib_cq_comp_handler_send (the IRQ
handler for send). And we do a final reap after destroying the QP to take care
of the flushing errors and to release resources.
Backport of upstream commit 3bb549ae4c51 ("RDS: TCP:
rds_tcp_accept_one() should transition socket from RESETTING to UP")
The state of the rds_connection after rds_tcp_reset_callbacks() would
be RDS_CONN_RESETTING and this is the value that should be passed by
rds_tcp_accept_one() to rds_connect_path_complete() to transition the
socket to RDS_CONN_UP.
Fixes: b5c21c0947c1 ("RDS: TCP: fix race windows in send-path
quiescence by rds_tcp_accept_one()") Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Backport of upstream commit 9c79440e2c5e ("RDS: TCP: fix race windows
in send-path quiescence by rds_tcp_accept_one()")
The send path needs to be quiesced before resetting callbacks from
rds_tcp_accept_one(), and commit eb192840266f ("RDS:TCP: Synchronize
rds_tcp_accept_one with rds_send_xmit when resetting t_sock") achieves
this using the c_state and RDS_IN_XMIT bit following the pattern
used by rds_conn_shutdown(). However this leaves the possibility
of a race window as shown in the sequence below
take t_conn_lock in rds_tcp_conn_connect
send outgoing syn to peer
drop t_conn_lock in rds_tcp_conn_connect
incoming from peer triggers rds_tcp_accept_one, conn is
marked CONNECTING
wait for RDS_IN_XMIT to quiesce any rds_send_xmit threads
call rds_tcp_reset_callbacks
[.. race-window where incoming syn-ack can cause the conn
to be marked UP from rds_tcp_state_change ..]
lock_sock called from rds_tcp_reset_callbacks, and we set
t_sock to null
As soon as the conn is marked UP in the race-window above, rds_send_xmit()
threads will proceed to rds_tcp_xmit and may encounter a null-pointer
deref on the t_sock.
Given that rds_tcp_state_change() is invoked in softirq context, whereas
rds_tcp_reset_callbacks() is in workq context, and testing for RDS_IN_XMIT
after lock_sock could result in a deadlock with tcp_sendmsg, this
commit fixes the race by using a new c_state, RDS_TCP_RESETTING, which
will prevent a transition to RDS_CONN_UP from rds_tcp_state_change().
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Backport of upstream commit 0b6f760cff04 ("RDS: TCP: Retransmit half-sent
datagrams when switching sockets in rds_tcp_reset_callbacks")
When we switch a connection's sockets in rds_tcp_reset_callbacks,
any partially sent datagram must be retransmitted on the new
socket so that the receiver can correctly reassemble the RDS
datagram. Use rds_send_reset(), which is designed for this purpose.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Backport of upstream commit 335b48d980f6 ("RDS: TCP: Add/use
rds_tcp_reset_callbacks to reset tcp socket safely")
When rds_tcp_accept_one() has to replace the existing tcp socket
with a newer tcp socket (duelling-syn resolution), it must lock_sock()
to suppress the rds_tcp_data_recv() path while callbacks are being
changed. Also, existing RDS datagram reassembly state must be reset,
so that the next datagram on the new socket does not have corrupted
state. Similarly, when resetting the newly accepted socket, appropriate
locks and synchronization are needed.
This commit ensures correct synchronization by invoking
kernel_sock_shutdown to reset a newly accepted sock, and by taking
appropriate lock_sock()s (for old and new sockets) when resetting
existing callbacks.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Backport of upstream commit c948bb5c2cc4 ("RDS: TCP: Avoid rds connection
churn from rogue SYNs")
When a rogue SYN is received after the connection arbitration
algorithm has converged, the incoming SYN should not needlessly
quiesce the transmit path, and it should not result in needless
TCP connection resets due to re-execution of the connection
arbitration logic.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Backport of upstream commit 37e14f4fe299 ("RDS: TCP: rds_tcp_accept_worker()
must exit gracefully when terminating rds-tcp")
There are two instances where we want to terminate RDS-TCP: when
exiting the netns or during module unload. In either case, the
termination sequence is to stop the listen socket, mark the
rtn->rds_tcp_listen_sock as null, and flush any accept workqs.
Thus any workqs that get flushed at this point will encounter a
null rds_tcp_listen_sock, and must exit gracefully to allow
the RDS-TCP termination to complete successfully.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>