Knut Omang [Sun, 23 Oct 2016 07:31:26 +0000 (09:31 +0200)]
sif: Refactor PQP state out of sif_dev.
Prepare PQP handling for the additional complexity
of resurrecting PQPs upon errors and querying for
undetected error conditions, by moving some of the state
that currently lives directly in sif_dev into a new
sif_pqp_info struct.
The already complex PQP handling logic is going to
grow further, so we need to consolidate it
to be able to keep clean and easily maintainable
interfaces.
Sometimes a uVNIC removal on OFOS won't trigger an actual removal
of the Vstar interface; in that case the uVNIC driver has to send a NACK
code so that XCM will start cleaning its database.
When path->users becomes zero the uVNIC driver starts
cleaning up the forwarding table entries.
In some corner cases that call is invoked from the transmit
function, which runs in interrupt context, and that results
in a hard LOCKUP.
With the new changes path->users is decremented in the transmit
function, allowing the cleanup to happen from another thread.
Proper care is taken to avoid races between these
two contexts.
shamir rabinovitch [Wed, 26 Oct 2016 13:16:50 +0000 (06:16 -0700)]
RDS: rds debug messages are enabled by default
RDS uses a Kconfig option called "RDS_DEBUG" to enable RDS debug messages.
This option causes the RDS Makefile to add -DDEBUG to the rds gcc command
line.
When CONFIG_DYNAMIC_DEBUG is enabled, the "DEBUG" macro is used by
include/linux/dynamic_debug.h to decide whether dynamic debug prints should
be sent to the kernel log by default.
rds should not enable this macro for production builds.
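For reference, a minimal sketch of the mechanism in include/linux/dynamic_debug.h
(as best understood; the rds pr_debug wrappers are not shown): when a translation
unit is built with -DDEBUG, dynamic debug call sites default to printing.

    #if defined(DEBUG)
    #define _DPRINTK_FLAGS_DEFAULT _DPRINTK_FLAGS_PRINT  /* debug prints on by default */
    #else
    #define _DPRINTK_FLAGS_DEFAULT 0                     /* off until enabled via debugfs */
    #endif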
David Ahern [Mon, 4 May 2015 15:51:38 +0000 (11:51 -0400)]
net/rds: Fix new sparse warning
c0adf54a109 introduced new sparse warnings:
CHECK /home/dahern/kernels/linux.git/net/rds/ib_cm.c
net/rds/ib_cm.c:191:34: warning: incorrect type in initializer (different base types)
net/rds/ib_cm.c:191:34: expected unsigned long long [unsigned] [usertype] dp_ack_seq
net/rds/ib_cm.c:191:34: got restricted __be64 <noident>
net/rds/ib_cm.c:194:51: warning: cast to restricted __be64
The temporary variable for sequence number should have been declared as __be64
rather than u64. Make it so.
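A minimal sketch of the fix, assuming the surrounding code in
rds_ib_cm_connect_complete() (helper names taken from the warning context,
not reproduced verbatim):

    __be64 dp_ack_seq;  /* was u64, which made sparse flag the big-endian assignment */

    dp_ack_seq = get_unaligned(&dp->dp_ack_seq);
    if (dp_ack_seq)
        rds_send_drop_acked(conn, be64_to_cpu(dp_ack_seq), NULL);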
Signed-off-by: David Ahern <david.ahern@oracle.com>
Cc: shamir rabinovitch <shamir.rabinovitch@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit e2783717a71e9babfdd7c36c7e35b790d2c01022)
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
shamir rabinovitch [Fri, 1 May 2015 00:58:07 +0000 (20:58 -0400)]
net/rds: fix unaligned memory access
rdma_conn_param private data is copied using memcpy after headers such
as cma_hdr (see cma_resolve_ib_udp as an example), so the start of the
private data is aligned to the end of the structure that comes before it. If
that structure ends with a u32, the start of the private
data will only be 4-byte aligned. Structures that use u8/u16/u32/u64 are
naturally aligned, but if the structure itself does not start on an 8-byte
boundary, all u64 members of that structure will be unaligned. To solve this
issue we must use the special macros that allow unaligned access to those
unaligned members.
Addresses the following kernel log seen when attempting to use RDMA:
Kernel unaligned access at TPC[10507a88] rds_ib_cm_connect_complete+0x1bc/0x1e0 [rds_rdma]
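As an illustration, a hedged sketch of the pattern using the helpers from
<asm/unaligned.h> (struct and field names assume the RDS IB connection-private
data involved in the trace above):

    /* the private data lands right after cma_hdr, so this struct may start on a
     * 4-byte boundary; a plain load of a u64 member traps on sparc64 */
    const struct rds_ib_connect_private *dp = event->param.conn.private_data;
    u64 ack_seq = get_unaligned(&dp->dp_ack_seq);  /* safe unaligned 8-byte load */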
Acked-by: Chien Yen <chien.yen@oracle.com>
Signed-off-by: shamir rabinovitch <shamir.rabinovitch@oracle.com>
[Minor tweaks for top of tree by:]
Signed-off-by: David Ahern <david.ahern@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit c0adf54a10903b59037a4c5fcb933dfeeb7b2624)
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Mukesh Kacker [Thu, 27 Oct 2016 13:07:16 +0000 (06:07 -0700)]
mlx4_ib: remove WARN_ON() based on incorrect assumptions
A WARN_ON() was inserted when user data was introduced to
ibv_cmd_alloc_shpd() by another InfiniBand provider, to make
sure that no user data is sent to older providers such as this
one which do not expect it.
It assumed that when no user data is sent, udata->inlen is zero.
The user-kernel API however always sends at least 8 octets
(which may not be initialized in the case of provider libraries that
do not use user data).
We remove the WARN_ON and rely on providers not to touch that
field if their companion library is not expected to initialize it.
Santosh Shilimkar [Fri, 14 Oct 2016 23:47:49 +0000 (16:47 -0700)]
mlx4_core/ib: set the IB port MTU to 2K
'commit 096335b3f983 ("mlx4_core: Allow dynamic MTU configuration for IB
ports")' overwrote the default port MTU and set it to 4K. Since this
directly impacts the HW VLs supported, and Oracle workloads heavily use
all 8 supported VLs for traffic classification, the 2K default needs to
be kept as is.
We initialise it to the default 2K so that the feature (dynamic MTU
configuration) is still available for non-DB users to set the desired MTU
value using sysctl.
Also, for CX2 cards, commit 596c5ff4b7b3 ("net/mlx4: adjust initial
value of vl_cap in mlx4_SET_PORT") broke vl_cap, which limited the
supported VLs to 4 irrespective of MTU size.
Knut Omang [Fri, 21 Oct 2016 06:23:39 +0000 (08:23 +0200)]
sif: cq: transfer headroom attribute to user mode
This commit makes sure old libsif versions work
with the driver while providing a forward-compatible
way of making additional changes to the extra
headroom in the CQs.
We anticipate being able to trim
down the extra entries once we have PQP errors
handled transparently. This commit thus ensures that
the headroom is only set in one place, on the
driver side, and that user mode can simply
pick up the configured headroom from the kernel.
This is done by providing the used headroom
in a formerly reserved 32-bit field, so no change
to the packet size is necessary.
Nevertheless we increment the abi version from
3.6 to 3.7 to allow libsif to detect whether
the headroom field can be trusted.
Knut Omang [Thu, 13 Oct 2016 10:07:33 +0000 (12:07 +0200)]
sif: Add vendor flag to support testing without oversized CQs
After the introduction of extra CQ entries to reduce the risk of
having duplicate completions overflow a CQ, we can no longer
trigger various CQ overflow scenarios without running a lot of
requests. We need to be able to test with a minimal set of operations
to allow co-sim based tests for further analysis.
Introduce a new vendor_flag no_x_cqe = 0x80 to turn off
the allocation of extra CQEs.
This patch fixes an incomplete patch in commit "cq: Add
additional SIF visible cqes to CQ". The max_cqe
capability reported by query_device is incorrect because
it includes the SIF visible cqes.
Signed-off-by: Wei Lin Guay <wei.lin.guay@oracle.com>
Reviewed-by: Knut Omang <knut.omang@oracle.com>
Fix an issue where modify_qp_hw from SQE to RTS returns
EPSC_MODIFY_CANNOT_CHANGE_QP_ATTR. In sif, the QP transition
from SQE to RTS must explicitly set req_access_error
to 0.
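A minimal sketch of the fix (the modify-QP command structure name is illustrative;
only the IB states and the req_access_error field follow the text above):

    if (cur_state == IB_QPS_SQE && new_state == IB_QPS_RTS)
        mct.req_access_error = 0;  /* must be cleared explicitly, or the EPSC rejects
                                    * the transition with EPSC_MODIFY_CANNOT_CHANGE_QP_ATTR */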
Signed-off-by: Wei Lin Guay <wei.lin.guay@oracle.com>
Reviewed-by: Knut Omang <knut.omang@oracle.com>
Shared pd is not an IBTA-defined feature, but an Oracle
Linux extension. Even though PSIF can share a pd easily,
it must comply with the Oracle ib_core implementation, which
requires a new pd "object" when reusing a pd (via the share_pd
verb).
Without a new pd "object", a NULL pointer dereference occurs
during the pd clean-up phase. Thus, this patch creates a new pd
"object" when reusing a pd, and this pd "object" points
to the original pd index.
Signed-off-by: Wei Lin Guay <wei.lin.guay@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Francisco Triviño [Fri, 30 Sep 2016 08:35:14 +0000 (10:35 +0200)]
sif: eq: Add timeout to the threaded interrupt handler
This commit implements a timeout that prevents soft lockup issues when
the threaded interrupt function (sif_intr_worker) keeps processing
events for a long period. If the timeout is reached, the threaded
handler returns IRQ_HANDLED even if there are more events to be
processed. In such a case, the coalescing mechanism will generate
an IRQ for the last event.
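A hedged sketch of the timeout logic in the threaded handler (the event-queue
helpers and the timeout constant are placeholders):

    static irqreturn_t sif_intr_worker(int irq, void *d)
    {
        unsigned long deadline = jiffies + SIF_INTR_WORKER_TIMEOUT;  /* name illustrative */

        while (eq_has_events(d)) {
            eq_handle_event(d);
            if (time_after(jiffies, deadline))
                break;  /* coalescing raises a new IRQ for the last, unprocessed event */
        }
        return IRQ_HANDLED;
    }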
RDS: IB: fix panic with handlers running post teardown
The shutdown CQE reaping loop takes care of emptying the
CQs before they are destroyed. And once the tasklets are
killed, the handlers are not expected to run.
But because of a core tasklet BUG, a tasklet handler could
still run after tasklet_kill, which can lead to a kernel
panic. A fix for the core tasklet code was proposed and accepted
upstream, but it comes with the baggage of fixing quite a
few bad users of it. Also for receive, we have an additional
kthread to take care of.
The BUG fix done as part of Orabug 2446085 had an additional
assumption that the reaping code won't reap all the CQEs after
the QP has moved to the error state, which was not correct. The QP is
moved to the error state as part of rdma_disconnect(), and
all the CQEs are reaped by the loop properly.
Any handler running after this and trying to access the
qp/cq resources gets exposed to race conditions. This patch
fixes the race by making sure that the handlers return
without taking any action post teardown.
RDS: add reconnect retry scheme for stalled connections
RDS IB connections get stalled at times, and we let such connections
take their sweet time to reconnect. On the passive side, we wait 15 seconds
for such stalled connections, which is too slow given application
IO timeouts. IB connections are established in milliseconds, so we are better
off dropping these stuck connections early and retrying.
The retry timeout is kept tunable via the reconnect_retry_ms sysctl. The
upper bound for retries is tunable via rds_sysctl_reconnect_max_retries.
The lower-IP and exponential back-off scheme was added to save
SM queries in the face of races, but it doesn't do what it was intended to do.
The exponential back-off scheme alone does a good job of backing off
for races, so the code just falls back to the original scheme.
RDS: IB: suppress log prints for FLUSH_ERR/RETRY_EXC
Flush errors are normal while draining the QP, and retry-exceeded
errors are normal for RC connections with a finite transport retry count.
There is no need to flood the log file; the RDS in-memory trace already logs them.
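A hedged sketch of the kind of filter involved (the surrounding rds code is not
reproduced; the wc status codes are the standard ib_verbs ones):

    if (wc->status != IB_WC_SUCCESS &&
        wc->status != IB_WC_WR_FLUSH_ERR &&
        wc->status != IB_WC_RETRY_EXC_ERR)
        pr_warn("rds_ib: completion error %d on conn %p\n", wc->status, conn);
    /* flush and retry-exceeded completions are only recorded in the
     * RDS in-memory trace, not printed to the kernel log */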
Santosh Shilimkar [Sat, 27 Aug 2016 02:32:57 +0000 (19:32 -0700)]
RDS: use c_wq for all activities on a connection
RDS connection work, send work, recv work etc. have been
serialised by use of a single-threaded work queue. For loopback
connections, we created a separate thread, but only the connection
work was moved to it. This under-utilises the thread
and creates unnecessary contention between send/recv work and
connection work for loopback connections.
We move the remaining loopback work onto rds_local_wq as well,
which guarantees serialisation and decouples the loopback
and non-loopback work.
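An illustrative sketch of the result (the rds_connection work items exist in
net/rds/rds.h; c_wq pointing at rds_local_wq for loopback connections is the
arrangement described above):

    /* every per-connection activity goes through conn->c_wq, which points at
     * rds_local_wq for loopback connections and at rds_wq otherwise */
    queue_delayed_work(conn->c_wq, &conn->c_conn_w, 0);
    queue_delayed_work(conn->c_wq, &conn->c_send_w, 0);
    queue_delayed_work(conn->c_wq, &conn->c_recv_w, 0);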
Santosh Shilimkar [Fri, 5 Aug 2016 22:02:55 +0000 (15:02 -0700)]
RDS: make the rds_{local_}wq part of rds_connection
Instead of sprinkling if/else checks for loopback all over the place,
let's just add c_wq as part of rds_connection. This will prevent
missed cases like 'commit edca33be359c ("RDS: move more queing for
loopback connections to separate queue")', 'commit 8502173071b6
("rds: schedule local connection activity in proper workqueue")'
or any future ones.
Santosh Shilimkar [Fri, 5 Aug 2016 21:22:53 +0000 (14:22 -0700)]
RDS: make rds_conn_drop() take reason argument
This removes the need to modify the conn all over the place
and moves it inside rds_conn_drop(). Strictly there is almost
no need to carry the 'c_drop_source' information as part of
rds_connection, but since the shutdown thread wants to log this
info for debug, the field is left as is for now.
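A minimal sketch of the reworked helper, approximated from the upstream
rds_conn_drop() body (the reason value and the c_drop_source handling follow the
text above; exact contents are an assumption):

    void rds_conn_drop(struct rds_connection *conn, int reason)
    {
        conn->c_drop_source = reason;  /* kept only so the shutdown worker can log it */
        atomic_set(&conn->c_state, RDS_CONN_ERROR);
        queue_work(conn->c_wq, &conn->c_down_w);
    }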
Santosh Shilimkar [Thu, 26 May 2016 23:22:29 +0000 (16:22 -0700)]
RDS: IB: use address change event for failover/failback
The RDS active bonding code had a few fundamental bugs which
have now been addressed, so the workaround(s) added for them can be removed.
These workarounds were creating problems and races in the connection
management code, leading to occasional connection stalls.
Removing them makes the code behaviour predictable and clean.
A few notable fixes to mention here:
- Taking care of all layers of events for ports before
marking them up/down.
- ARP cache related fixes.
- Local loopback connection hang fix
This patch makes the RDS active bonding failover/failback path work
almost as intended from the beginning,
i.e. on failover/failback, the address change events need to
be sent, which then trigger all the necessary events to
re-establish the connection(s).
Santosh Shilimkar [Tue, 10 May 2016 05:51:11 +0000 (22:51 -0700)]
RDS: IB: drop workaround for loopback connection hangs
There is no need to modify the ARP cache directly under the
assumption that it helps speed up failover/failback for
loopback connections. The ARP cache is properly updated by the core
IPv4 code, and one of the issues with ARP in the RDS active bonding code
has been addressed as part of
'commit 42a7becc725f ("RDS: IB: Make use of ARPOP_REQUEST instead
of ARPOP_REPLY in bonding code")'. Remove the workaround added as
part of bug 16979994 with the patch "RDS: Local address resolution may
be delayed after IP has moved".
1) Enable the EoIB QP property for uVNICs created on a Titan
card, using the is_eoib variable.
2) Use the qkey provided by OFOS for the EoIB QP; this QKEY comes
as part of the VNIC Install message.
3) If the Titan card has TSO enabled, then enable TSO using the
NETIF_F_TSO flag and notify the upstream network stack about it.
4) If the Titan card has the checksum offload feature, then notify the
network stack about it and set the IB_SEND_IP_CSUM bit on each WR.
5) Added some debug flags.
6) Print the qkey information and EoIB information in proc stats.
In case of heartbeat loss due to Saturn or multicast issues, the
uVNIC driver needs to send XVE_NOTIFY_HBEAT_LOST as part of an
Operational Request to OFOS. This enables OFOS to perform the
appropriate actions.
Hans Westgaard Ry [Fri, 30 Sep 2016 08:36:49 +0000 (10:36 +0200)]
sif: Retest last allocated entry with roundrobin allocation
The current codebase tests whether the next element is unused (freed) when
allocating round-robin. In certain cases where an element is allocated and
then deallocated immediately, it is convenient to reuse that element.
This commit introduces a new state variable 'in_error'
in the CQ state. When a CQ error event is detected
cq->in_error is set.
This state is then checked before
* posting a req_notify_cq
* invoking any of the completion generating
workarounds.
Also set qp->last_set_state to ERR if a fatal QP event
is detected to reduce further posting on that QP.
This commit reduces the risk of getting into a privileged
QP error scenario for some common cases of misbehaved
applications, and also enables the driver to terminate more
quickly if a privileged QP error has occurred.
The CQ full condition is calculated from the last posted sequence number
(last_seq) minus the last completed sequence number (head_seq). Both last_seq
and head_seq are defined as u16. However, in the calculation that
verifies that the CQ is not full, an incorrect cast is performed.
This causes a false negative on the CQ full check in the wrapped case.
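A sketch of the corrected, wrap-safe check (the cq_sw/cq field names are
illustrative):

    u16 last_seq = cq_sw->last_seq;  /* last posted completion */
    u16 head_seq = cq_sw->head_seq;  /* last completion polled by the user */

    /* keep the subtraction in u16 so the result is correct modulo 2^16;
     * promoting the operands to int first yields a false "not full"
     * answer once last_seq has wrapped around below head_seq */
    bool cq_full = (u16)(last_seq - head_seq) >= cq->entries;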
Signed-off-by: Wei Lin Guay <wei.lin.guay@oracle.com>
Reviewed-by: Knut Omang <knut.omang@oracle.com>
When a QP is created, the HW state is zeroed, then certain fields are
initialized. After modify_qp(), other fields are set. When a QP is
handed over to HW, HW will potentially modify parts of the QP
state. The QP will eventually be transitioned into the RESET state.
From RESET, it is legitimate to resurrect the QP.
It is imperative that, when transitioned to RESET, the QP state equals the
state it had when it was created, in case it will be resurrected.
For any state to RESET transition, the IBTA specification states: "QP
attributes are reset to the same values after the QP was created."
This commit re-factors create_qp() and reset_qp() so the driver
adheres to the specification.
Due to two HW bugs, SIF needs to add additional cqes apart from
the requested N cqes. The first issue is that the CQ cannot be full
when it is being invalidated. Hence, we need 1 extra entry.
The second issue is that HW might generate duplicate completions.
Thus, SIF needs 768 extra entries to cater for these duplicate
completions, and 1 additional entry for the fence completion.
The SIF driver then rounds up to the nearest 2^N. The number of cqes
available to the ulp/user will be the above 2^N - (768 + 1 + 1).
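The resulting sizing arithmetic as a hedged sketch (variable names are
illustrative; roundup_pow_of_two() is from <linux/log2.h>):

    /* 1 entry so the CQ is never completely full when invalidated,
     * 768 entries for worst-case duplicate completions,
     * 1 entry for the fence completion */
    u32 alloc_cqe = roundup_pow_of_two(entries + 1 + 768 + 1);
    u32 user_cqe  = alloc_cqe - (768 + 1 + 1);  /* what the ulp/user is told it got */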
Signed-off-by: Wei Lin Guay <wei.lin.guay@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
sif: qp: Clear the QP state cq_int_err bit upon reset
During reset of a QP, the cq_int_err bit was not reset.
This would cause a subsequent transition to RTR to fail
and hardware to set the QP back in error again.
sif: qp: Collapsed two log statements + removed incorrect port number print
With debug level bit 0x2000 (SIF_QP) set, the driver logs QP creation
info. Two log statements logging the same information are
collapsed. Also, incorrect logging of the port number is removed for all
QPs except QP0/1, which are the only QPs that have a valid port number
at their creation time.
sif: Avoid using SIFMT_2M for allocation of any tables in no_huge_page mode
The feature mask no_huge_pages, enabled for Xen due to
DMA address alignment issues with huge pages, did not apply
to allocation of CQs, RQs, and SQs, only to the tableworks.
This causes allocation of queues of these types
larger than 4M in total size to fail on Xen PV domains
such as dom0.
Mark the duplicate completion status as PSIF_WC_STATUS_DUPL_COMPL_ERR if
the additional/duplicate completions are detected by walk_and_update_cqes.
Then, the translate_wr_id can identify the duplicate completion during
sif_poll_cq.
Signed-off-by: Wei Lin Guay <wei.lin.guay@oracle.com>
Reviewed-by: Knut Omang <knut.omang@oracle.com>
The sif_fixup_cqes function, which updates the wr_id from the SQ handle, is
moved to reset_qp to cover the scenario where IB user reuses a QP after
performing ib_modify_qp(RESET). This patch also handles a scenario in
sif_fixup_cqes where a QP has been reset multiple times but the IB user has
not polled the associated CQ completely.
Signed-off-by: Wei Lin Guay <wei.lin.guay@oracle.com>
Reviewed-by: Knut Omang <knut.omang@oracle.com>
Francisco Triviño [Tue, 13 Sep 2016 15:09:40 +0000 (17:09 +0200)]
sif: eq: Implement threaded interrupt handler
The handler function (sif_intr) for a single interrupt mostly
processes completion notification events (CNEs) as long as there
are events in the queue. Sometimes the CNEs are received at a higher
rate than the handler is able to process them; it then keeps
processing events indefinitely until the queue fills up, which
leads to a fatal error, or the watchdog triggers a kernel panic,
as shown in orabug 24657844.
This commit replaces request_irq by request_threaded_irq, which
allows the driver to specify a threaded handler (sif_intr_worker)
in addition. The original handler function (sif_intr) is called
in hard interrupt context and can return IRQ_HANDLED if the timeout
SIF_IRQ_HANDLER_TIMEOUT is not exceeded or IRQ_WAKE_THREAD otherwise.
The flag IRQF_ONESHOT is used to ensure that the interrupt is
disabled when IRQ_WAKE_THREAD is returned.
The feature check_all_eqs_on_intr is no longer needed. This commit
removes this driver feature and hence simplifies the interrupt
handler implementation.
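A hedged sketch of the resulting split; only sif_intr, sif_intr_worker,
SIF_IRQ_HANDLER_TIMEOUT and the request_threaded_irq()/IRQF_ONESHOT usage follow
the text above, the event-queue helpers are placeholders:

    static irqreturn_t sif_intr(int irq, void *d)  /* hard interrupt context */
    {
        unsigned long deadline = jiffies + SIF_IRQ_HANDLER_TIMEOUT;

        while (eq_has_events(d)) {
            eq_handle_event(d);
            if (time_after(jiffies, deadline))
                return IRQ_WAKE_THREAD;  /* hand the backlog to sif_intr_worker */
        }
        return IRQ_HANDLED;
    }

    /* IRQF_ONESHOT keeps the line masked until sif_intr_worker has returned */
    err = request_threaded_irq(irq, sif_intr, sif_intr_worker,
                               IRQF_ONESHOT, "sif-eq", eq);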
For systems that want a lower fragment setting because of
smaller memory footprints, the module parameter 'rds_ib_max_frag'
can be used to set a lower value such as 4K or 8K.
rds: avoid call to flush_mrs() in specific condition
This is to reduce process spawn time.
When the user provides 0 values for cookie and flags in the rds_free_mr()
call, avoid calling flush_mrs().
skgxp uses the cookie 0 and flags 0 combination for checking whether the
transport is RDMA capable or not.
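A hedged sketch of the short-circuit (the argument fields match the RDS uapi
struct rds_free_mr_args; the exact placement in rds_free_mr() is an assumption):

    /* cookie == 0 && flags == 0 is only used by skgxp to probe whether the
     * transport is RDMA capable, so skip the expensive flush_mrs() path */
    if (args.cookie == 0 && args.flags == 0)
        return 0;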
This is a short-term hack for a customer escalation.
The customer has other processes which are calling flush_mrs(), and
that is causing mutex contention.
The skgxp change is fairly significant, and we want to provide a minimal
change in the customer environment.
The risk here is that if there is any other use of the cookie 0 and flags 0
combination (like freeing up unused MRs), then that will be impacted.
Code inspection of skgxp and skgnfs by Leo/Avneesh suggests that this
combination is not used anywhere.
The long-term solution for this requires changes in RDS as well as the skgxp
application, which should be done in the next UEK release.
The required RDS changes are present in UEK4; however, the skgxp changes are
still outstanding. Since this was an escalation from a major customer, we
need this hack in UEK4.
sif: Lift sif_verbs up to be independent of sif internal headers
The sif_verbs.h file needs to be independent of
other header files to be includable from other kernel code.
This is necessary to avoid duplicate definitions of
the API elements. For Oracle Linux this file now moves from
drivers/infiniband/hw/sif/ to include/rdma/ to make it
available for the RDS and uvNIC drivers.
This is a temporary but necessary measure while we wait
for proper generic interfaces to be defined at the common
verbs layer.
The ipd is calculated incorrectly because the code compares the active speed
enum with the value returned from ib_rate_to_mult(). Thus, this patch converts
the PSIF active speed enum to a multiple of the base rate of SDR (2.5 Gbps).
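A hedged sketch of the corrected calculation (the PSIF speed conversion helper is
a placeholder; ib_rate_to_mult() returns the rate as a multiple of SDR, i.e. 2.5
Gbps):

    int path_mult = ib_rate_to_mult(path_rate);        /* e.g. IB_RATE_10_GBPS -> 4 */
    int link_mult = psif_speed_to_mult(active_speed);  /* placeholder: speed enum -> SDR multiple */
    u8 ipd = 0;

    if (path_mult > 0 && link_mult > path_mult)
        ipd = link_mult / path_mult - 1;               /* inter-packet delay vs. link rate */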
Signed-off-by: Wei Lin Guay <wei.lin.guay@oracle.com>
Reviewed-by: Knut Omang <knut.omang@oracle.com>
Knut Omang [Wed, 31 Aug 2016 07:38:31 +0000 (09:38 +0200)]
sif: Fix recently introduced checkpatch issues
It appears the commit check in checkpatch does not capture
all errors. Fix the new ones in the driver code to
allow us to enable a regression test for it.
During the QP transition from RTS -> ERR, the HW might generate
duplicate FLUSHED-IN-ERR completions. The SIF driver inverses the
sq_seq in a dedicated completion entry and sets the
CQ_POLLING_IGNORED_SEQ bit in the cq_sw flags. Nevertheless, this bit
is cleared once a duplicate FLUSHED-IN-ERR completion is detected in
poll_cq.
The above mentioned method cannot handle a scenario where HW generates
multiple duplicate completions. Thus, this patch moves the detection
of the duplicate completions to translate_wr_id. Then, the SIF driver
will only return non-duplicate completions to the user.
Signed-off-by: Wei Lin Guay <wei.lin.guay@oracle.com>
Reviewed-by: Knut Omang <knut.omang@oracle.com>
ib_core: make wait_event uninterruptible in ib_flush_fmr_pool()
Replace wait_event_interruptible() with wait_event() in
ib_flush_fmr_pool() to avoid deallocating the pd before fmr_cleanup_thread
tears down the pool of FMRs.
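A sketch of the change, assuming the wait in the upstream ib_flush_fmr_pool()
(field names recalled from drivers/infiniband/core/fmr_pool.c, so treat them as
approximate):

    /* an interruptible wait could return early on a signal and let the caller
     * free the PD while fmr_cleanup_thread is still walking the pool */
    wait_event(pool->force_wait,
               atomic_read(&pool->flush_ser) - serial >= 0);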
Harald Høeg [Tue, 5 Jul 2016 16:41:04 +0000 (18:41 +0200)]
sif: vlink connect is now enabled by default
This fix makes the default link failover behaviour compatible with existing
Mellanox CX3. The internal link status (PortState) will now follow the external
link status (PortState) by default.
Driver feature mask SIFF_vlink_disconnect may be used to set default
behaviour to "vlink connect"=disabled.
Knut Omang [Tue, 9 Aug 2016 14:10:39 +0000 (16:10 +0200)]
sif: base: Scale default desc.array size values based on #of available CBs
With the default values for the number of QPs and MRs set high,
33 instances of the driver would consume a lot
of memory just to initialize basic tables, since each of these
instances has its own 1M QP space and in effect allocates
the same amount of resources that a bare-metal, single-instance
driver would.
The number of collect buffers assigned to the PCIe function tells us
what fraction of the hardware resources we got, and a small
fraction of the 16K CB space indicates that the function competes with
other functions on resources, and that it is unlikely that the same
huge number of QPs etc can be deployed with high performance
anyway.
This commit introduces tracking of module parameter settings
compared to the default values, and if the compiled-in defaults are used,
we scale down the number of QPs etc. by a factor corresponding
to the fraction of CBs we got.
This yields e.g. 32K QPs per function in a 32-VF enabled system
and significantly reduces system-wide memory usage in a
virtualized environment (whether Xen based or not).
Users can still override the settings using the module parameters,
which will not be subject to scaling if they deviate from the
compiled-in defaults.
Knut Omang [Tue, 9 Aug 2016 09:07:23 +0000 (11:07 +0200)]
sif: cb: Improve algorithm for allocating and using CBs from driver
Instead of allocating bandwidth collect buffers (CBs)
as a fallback for latency CBs, and spamming the kernel log
with failure messages, multiplex use across
the actually allocated number of latency CBs and just report
the failure to allocate once, with values to improve debugging.
Improves behaviour for scenarios where available CB resources
are spread across many VFs but VF drivers still see a lot
of (virtual) CPUs, which will easily be the case with the
default VF settings for Xen dom0.
Also, the low-latency property is most critical for req_notify PQP
requests. Use high-bandwidth CBs for PQP operations other than
the REARM request, which is the performance-critical request for
req_notify_cq. This should improve performance for event-based
applications under high load.
sif: epsc: For Xen dom0 configure resources for all 32 VFs at driver load
As of EPSC API version 2.9 firmware can distribute resources based on
the number of PCI functions the PF driver requests support for.
Older firmware will just ignore the value.
This commit enforces no VFs configured as the default setting,
but enables all 32 VFs if a Xen PV domain is detected.
To allow overriding this behaviour we add a new module parameter
vf_max which can be used to override the number of VFs configured
for instance for use with other virtualization engines than Xen
and for debugging/tuning purposes. The vf_max parameter takes the
following values:
-2: Use NVRAM configured firmware defaults (backward compat mode)
-1 (now default): Exadata mode as described above
0-32: Configure explicitly for that many VFs (only selected values
are supported by firmware)
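A hedged sketch of how such a parameter is typically wired up (only the name
vf_max and its value range come from the text above; the rest is illustrative):

    static int vf_max = -1;  /* -2: NVRAM/firmware default, -1: Exadata mode, 0-32: explicit */
    module_param(vf_max, int, 0444);
    MODULE_PARM_DESC(vf_max, "Number of VFs to configure resources for (-2 fw default, -1 auto, 0-32 explicit)");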
Knut Omang [Wed, 10 Aug 2016 11:07:55 +0000 (13:07 +0200)]
sif: fmr: invalidate keys before TLB bulk invalidates
This commit reorders and sequentializes the cleanup phase when
bulk invalidates are used. The order was to post the TLB flushing
operation to the EPSC, then invalidate keys (potentially in parallel with the
ongoing flushing) before finally waiting for the TLB flushing to complete.
This way is not considered safe in general, as an incoming access to a key
can cause an invalidated PTE or PTW to be cached again and later cause
sif to read or write to a no longer valid location.
This commit makes sure that all keys are invalidated before
the TLB flushing is triggered.
The SIF driver needs to generate FLUSHED-IN-ERR completions using the pqp
during the QP tear-down phase. Nevertheless, a faulty application or an
application that does not rely on the completion (e.g. ibv_*pingpong)
might cause the pqp to generate a completion to a full CQ. Consequently, the
pqp transitions to the ERR state, and this will eventually cause the
system to crash. This patch checks for this scenario to prevent the
system crash.
Due to a SIF HW bug where SIF might generate duplicate completions,
the QP state must be transitioned into shadowed ERR state (HW state is
in RESET). In this case, modify_qp(ERR) will cause the QP state
transitions (HW/SW): from ERR (ERR) to RESET (ERR).
As a result, the SIF driver needs to generate
FLUSHED-IN-ERR completions when the IB user performs post_send while the
QP is in the shadowed ERR state. HW will not generate them as the HW QP
state is already RESET. The SIF driver generates
FLUSHED-IN-ERR if the last_set_state is ERR. last_set_state
is a "best effort" tracked state because QP mutex cannot be held in a
non-sleep context (post_send).
The issue happens in a multi-threaded scenario where one thread is
constantly performing post_send whereas another thread is performing
modify_qp (ERR). During the QP state transition from ERR (ERR) to
RESET (ERR), both HW and the SIF driver generate FLUSHED-IN-ERR completions,
eventually causing duplicates. This patch adds a
test in post_wa4074 to mask out this condition.
Signed-off-by: Wei Lin Guay <wei.lin.guay@oracle.com>
Reviewed-by: Knut Omang <knut.omang@oracle.com>
Knut Omang [Mon, 8 Aug 2016 04:43:04 +0000 (06:43 +0200)]
sif: mmu/fmr: Fix check for page table reusability
The FMR mapping logic attempts to reuse the page table if memory layout is
sufficiently similar. Currently this optimization is for simplicity
limited to very similar memory layouts and does not handle changes
in the page table level (base page size).
The test for this only considered page sizes going from small to larger
and not the opposite.
This scenario is triggered by NFSoRDMA if the previous use was in
huge page mappable memory and the current use is in more
fragmented memory.
Robert Schmidt [Thu, 11 Aug 2016 14:14:12 +0000 (16:14 +0200)]
sif: PSC_API_VERSION(2,9): add num_ufs to psif_epsc_csr_config
The new member num_ufs can be used by the PF driver to request a
number of UFs FW shall support.
0: use default value stored on card
1: PF (UF 0) only
...
33: fully virtualized
>33: capped by FW to 33 i.e. fully virtualized
-1: alternative PF only config not for official use
There is a race between the rds connection destruction (rds_ib_conn_shutdown) path
and the IRQ path (rds_ib_cq_comp_handler_recv). The IRQ path can schedule the
tasklet (i_rtasklet) again (to receive data) in between the removal of the
tasklet from the list and the destruction of the connection in the destruction
path. When the tasklet runs, it would then access stale (destroyed) data.
One observed case is it accessing ic->i_rcq, which is set to NULL by the
destruction path.
Fix:
We add a flag to the rds_ib_connection structure indicating, when set, that the
connection is being destroyed. The flag is set after we reap the receive CQ i_rcq
and before we start to destroy the CQ in rds_ib_conn_shutdown(). We also flush the
rds_ib_rx running in the rds_aux_wq worker thread before starting the destroy, so
that all existing runs of rds_ib_rx (in the tasklet path and the worker thread path)
won't access the destroyed receive CQ. And a newly queued job (tasklet or worker)
will exit on seeing the flag set, before accessing the (maybe destroyed) receive CQ.
The flag is unset on new connection completion to allow access to the re-created
receive CQ. This patch also takes care of rds_ib_cq_comp_handler_send (the IRQ
handler for send). And we do a final reap after destroying the QP to take care
of the flush errors and to release resources.
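A hedged sketch of the check in the receive completion handler (the flag bit name
and its placement are illustrative; only the handler and field names mentioned
above come from the text):

    static void rds_ib_cq_comp_handler_recv(struct ib_cq *cq, void *context)
    {
        struct rds_ib_connection *ic = context;

        /* rds_ib_conn_shutdown() sets the flag after the final reap and before
         * destroying i_rcq; bail out instead of touching soon-to-be-freed state */
        if (test_bit(RDS_IB_CONN_DESTROY_PENDING, &ic->i_flags))  /* bit name illustrative */
            return;

        tasklet_schedule(&ic->i_rtasklet);
    }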
Backport of upstream commit 3bb549ae4c51 ("RDS: TCP:
rds_tcp_accept_one() should transition socket from RESETTING to UP")
The state of the rds_connection after rds_tcp_reset_callbacks() would
be RDS_CONN_RESETTING and this is the value that should be passed by
rds_tcp_accept_one() to rds_connect_path_complete() to transition the
socket to RDS_CONN_UP.
Fixes: b5c21c0947c1 ("RDS: TCP: fix race windows in send-path
quiescence by rds_tcp_accept_one()")
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Backport of upstream commit 9c79440e2c5e ("RDS: TCP: fix race windows
in send-path quiescence by rds_tcp_accept_one()")
The send path needs to be quiesced before resetting callbacks from
rds_tcp_accept_one(), and commit eb192840266f ("RDS:TCP: Synchronize
rds_tcp_accept_one with rds_send_xmit when resetting t_sock") achieves
this using the c_state and RDS_IN_XMIT bit following the pattern
used by rds_conn_shutdown(). However this leaves the possibility
of a race window as shown in the sequence below
take t_conn_lock in rds_tcp_conn_connect
send outgoing syn to peer
drop t_conn_lock in rds_tcp_conn_connect
incoming from peer triggers rds_tcp_accept_one, conn is
marked CONNECTING
wait for RDS_IN_XMIT to quiesce any rds_send_xmit threads
call rds_tcp_reset_callbacks
[.. race-window where incoming syn-ack can cause the conn
to be marked UP from rds_tcp_state_change ..]
lock_sock called from rds_tcp_reset_callbacks, and we set
t_sock to null
As soon as the conn is marked UP in the race-window above, rds_send_xmit()
threads will proceed to rds_tcp_xmit and may encounter a null-pointer
deref on the t_sock.
Given that rds_tcp_state_change() is invoked in softirq context, whereas
rds_tcp_reset_callbacks() is in workq context, and testing for RDS_IN_XMIT
after lock_sock could result in a deadlock with tcp_sendmsg, this
commit fixes the race by using a new c_state, RDS_TCP_RESETTING, which
will prevent a transition to RDS_CONN_UP from rds_tcp_state_change().
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Backport of upstream commit 0b6f760cff04 ("RDS: TCP: Retransmit half-sent
datagrams when switching sockets in rds_tcp_reset_callbacks")
When we switch a connection's sockets in rds_tcp_reset_callbacks,
any partially sent datagram must be retransmitted on the new
socket so that the receiver can correctly reassemble the RDS
datagram. Use rds_send_reset(), which is designed for this purpose.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Backport of upstream commit 335b48d980f6 ("RDS: TCP: Add/use
rds_tcp_reset_callbacks to reset tcp socket safely")
When rds_tcp_accept_one() has to replace the existing tcp socket
with a newer tcp socket (duelling-syn resolution), it must lock_sock()
to suppress the rds_tcp_data_recv() path while callbacks are being
changed. Also, existing RDS datagram reassembly state must be reset,
so that the next datagram on the new socket does not have corrupted
state. Similarly when resetting the newly accepted socket, appropriate
locks and synchronization is needed.
This commit ensures correct synchronization by invoking
kernel_sock_shutdown to reset a newly accepted sock, and by taking
appropriate lock_sock()s (for old and new sockets) when resetting
existing callbacks.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Backport of upstream commit c948bb5c2cc4 ("RDS: TCP: Avoid rds connection
churn from rogue SYNs")
When a rogue SYN is received after the connection arbitration
algorithm has converged, the incoming SYN should not needlessly
quiesce the transmit path, and it should not result in needless
TCP connection resets due to re-execution of the connection
arbitration logic.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Backport of upstream commit 37e14f4fe299 ("RDS: TCP: rds_tcp_accept_worker()
must exit gracefully when terminating rds-tcp")
There are two instances where we want to terminate RDS-TCP: when
exiting the netns or during module unload. In either case, the
termination sequence is to stop the listen socket, mark the
rtn->rds_tcp_listen_sock as null, and flush any accept workqs.
Thus any workqs that get flushed at this point will encounter a
null rds_tcp_listen_sock, and must exit gracefully to allow
the RDS-TCP termination to complete successfully.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Santosh Shilimkar [Wed, 10 Aug 2016 19:30:36 +0000 (12:30 -0700)]
RDS: IB: Add MOS note details to link local(HAIP) address print
Update the log to include MOS note details and also make the
banner more prominent. This makes it consistent with the application
flagging a similar error with MOS note details.
Qing Huang [Wed, 10 Aug 2016 17:14:25 +0000 (10:14 -0700)]
ib/mlx4: Initialize multiple Mellanox HCAs in parallel
This is a rework of UEK2 commit a8962313e121 ("OFED: Load multiple ...").
The goal of this patch is to reduce the total amount of system boot/kernel
startup time when there are multiple Mellanox HCAs present in the system.
Typically each HCA/PF would require 6~7s to initialize, plus extra time for
a certain number of VFs created by each PF. By default, multiple HCAs have
to be probed one by one in a serialized fashion.
The new scheme is to create a work request for the current pci probe/mlx4 init
task and then return -EPROBE_DEFER immediately to the probe caller, while
a system thread starts to execute the work request in the background.
The main pci probe thread doesn't have to wait for the current probe
task to finish. The background init task's progress and return error code
will be saved by the sys worker thread and processed from the deferred
queue.
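A hedged sketch of the deferred-probe pattern described above (the work wrapper
and init function names are illustrative; only the -EPROBE_DEFER return and the
background execution follow the text):

    struct mlx4_probe_work {            /* illustrative wrapper */
        struct work_struct work;
        struct pci_dev *pdev;
    };

    static int mlx4_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
    {
        struct mlx4_probe_work *w = kzalloc(sizeof(*w), GFP_KERNEL);

        if (!w)
            return -ENOMEM;
        w->pdev = pdev;
        INIT_WORK(&w->work, mlx4_init_one_deferred);  /* runs the real HCA init */
        queue_work(system_unbound_wq, &w->work);

        /* let the PCI core move on to the next HCA; this device is revisited
         * from the deferred probe list once the background init has finished */
        return -EPROBE_DEFER;
    }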