www.infradead.org Git - users/jedix/linux-maple.git/log

IB/ipoib: sysfs interface to manage ACL tables

Expose sysfs interface for ACL to be used for debug.

Orabug: 23222944

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

IB/{cm,ipoib}: Filter traffic using ACL

Implement two packet filtering points, one at ib_ipoib driver when
processing ARP packets and second in ib_cm when processing connection
requests.

Orabug: 23222944

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

IB/{cm,ipoib}: Manage ACL tables

Add support for ACL tables in the ib_ipoib and ib_cm drivers.
The ib_cm driver exposes functions to register and unregister tables and to
manage table content.
In the ib_ipoib driver, add an ACL object for each network device.

Orabug: 23222944

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

offload ib subnet manager port and node get info query handling.

This change offloads ib subnet manager port and node get info query
handling to the HCA firmware. Responses to these queries are time
bound, and responses from the host-based SMA software handler can be
delayed by busy CPUs (RT workloads, interrupt handlers, etc). Delayed
responses can lead to the SM taking the node out of the fabric, which
is not desirable. The port/node INFO query offload lets these specific
SM queries be handled by the HCA firmware in a timely manner,
irrespective of how busy the CPUs are at that moment.

Orabug: 23750258

Signed-off-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com>
Reviewed-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

IB/ipoib: Adjust queue sizes

Current UEK4 uses 128 as default send queue size, and 256 as default
receive queue size.
UEK2 uses 2048 for send and receive queue size as default.

This patch adjusts queue sizes to avoid potential reports regarding
performance bottlenecks on UEK4.

Orabug: 23302017

Signed-off-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>

IB/ipoib: Change send workqueue size for CM mode

The idea here is that one misbehaving connection should not become a
single point of failure.

priv->tx_outstanding is shared by all QPs, and when it reaches
sendq_size, the network interface queue is stopped.

In connected mode, the TX QP size of every connection is sendq_size.
So if one QP starts misbehaving and we don't receive send
completions in time, priv->tx_outstanding can reach the limit
at which the network interface queue must be stopped.
This can bring down the entire cluster, because even ping will not go
through from that point onwards.

With this patch, when creating CM QP for send operations, we limit size:
+int ipoib_cm_sendq_size __read_mostly = ipoib_sendq_size / 8;

Based on Yuval's suggestion, added a module parameter to dictate how many
bad connections we want to allow (the 8 above is configurable).

If the outstanding completions for a particular connection reach
ipoib_cm_sendq_size, we halt sending data on that connection
until we receive at least one completion.

In summary, this will require multiple QPs to misbehave (instead of 1)
in order to bring down the entire cluster.

As clarification, this patch does not try to recover or change the behavior
of a connection which may have gone bad; it only reduces the impact of a bad
connection.
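
The per-connection limit described above can be sketched in user-space C.
This is a hypothetical model and not the driver code: struct cm_tx and the
helper names are assumptions for illustration; only the "sendq_size / 8"
default comes from the commit message.

```c
#include <stdbool.h>

/* Hypothetical model of one connected-mode TX connection. */
struct cm_tx {
    unsigned int outstanding; /* sends posted but not yet completed */
};

/* Per-connection limit: the shared send queue size divided by the number
 * of misbehaving connections to tolerate (8 in the commit message). */
static unsigned int cm_sendq_size(unsigned int sendq_size, unsigned int divisor)
{
    return sendq_size / divisor;
}

/* A connection may post another send only while below its own limit. */
static bool cm_tx_may_send(const struct cm_tx *tx, unsigned int limit)
{
    return tx->outstanding < limit;
}

/* Account for one posted send. */
static void cm_tx_post(struct cm_tx *tx)
{
    tx->outstanding++;
}

/* Account for one send completion; this is what un-halts the connection. */
static void cm_tx_complete(struct cm_tx *tx)
{
    if (tx->outstanding)
        tx->outstanding--;
}
```

With a send queue size of 128 and a divisor of 8, a single stuck connection
can hold at most 16 outstanding sends in this model, so it takes several
stuck connections to exhaust the shared budget.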

Orabug: 23254764

Signed-off-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>

mlx4_core: use higher log_rdmarc_per_qp when scale_profile is set

Another parameter, log_rdmarc_per_qp, is scaled up to a higher
value (128) when scale_profile is set, based on a new requirement.

The commit
58f318ea1272 "net/mlx4_core: Modify default value of
              log_rdmarc_per_qp to be consistent with HW capability"
was modifying the Mellanox defaults to accomplish the same but
this change uses scale_profile to be consistent with all the
other changes from Mellanox defaults done for HCA parameters.

This also (indirectly) fixes a code merge issue
with the following commits, where a change to the default value
of log_rdmarc_per_qp got inadvertently reverted because
two independent changes that interacted were done
in the same merge window (albeit this fix does it with a
slightly different implementation):

58f318ea1272 "net/mlx4_core: Modify default value of
              log_rdmarc_per_qp to be consistent with HW capability"
3480399bdf6d "mlx4_core: scale_profile should work without params
              set to 0"

Orabug: 23725942

Signed-off-by: Mukesh Kacker <mukesh.kacker@oracle.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

RDS: IB: change rds_ib_active_bonding_excl_ips to only RFC3927 space

Currently rds_ib_active_bonding_excl_ips excludes both
169.254.0.0/16 and 172.10.0.0/16 address ranges from use with RDS
bonding. This parameter was meant to default to the range used by
link-local addresses (169.254.0.0/16, RFC3927) as those do not play
nicely with InfiniBand.

"172.10/16" was probably a mistaken typing of "172.16/12", which is
one of the private use -- but not link-local -- ranges defined by
RFC1918. 172.10.0.0/16 is in active use on the global Internet
(part of the block 172.0.0.0/12 as of this writing); it doesn't
belong here.

Change the parameter default to only "169.254/16" per the original
change's intent.
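
The address arithmetic behind this reasoning can be checked with a small
prefix-match sketch. The helper names ipv4() and in_prefix() are hypothetical
and are not part of the RDS code; addresses are kept in host byte order for
simplicity.

```c
#include <stdbool.h>
#include <stdint.h>

/* Build a host-order IPv4 address from its four dotted-quad octets. */
static uint32_t ipv4(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    return ((uint32_t)a << 24) | ((uint32_t)b << 16) | ((uint32_t)c << 8) | d;
}

/* True if addr falls inside net/prefix_len (prefix_len in 0..32). */
static bool in_prefix(uint32_t addr, uint32_t net, unsigned int prefix_len)
{
    /* A zero-length prefix matches everything; avoid shifting by 32. */
    uint32_t mask = prefix_len ? ~UINT32_C(0) << (32 - prefix_len) : 0;

    return (addr & mask) == (net & mask);
}
```

In this sketch, 172.10.0.1 does not match 172.16.0.0/12 (which spans
172.16.0.0 through 172.31.255.255) but does match 172.0.0.0/12, while any
169.254.x.y address matches 169.254.0.0/16.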

Orabug: 23712042

Signed-off-by: Todd Vierling <todd.vierling@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

RDS: avoid large pages for sg allocation for TCP transport

To reduce SGEs, commit 23f90cc ("RDS: fix the sg allocation based
on actual message size") used the buddy allocator to allocate large
pages based on message size.

This change, though, seems to create an issue for the TCP transport, most
likely triggering a memory leak somewhere in the RDS TCP driver path.
The same core code with large pages seems to work just fine with the
IB transport.

This patch avoids the hugepage allocation for RDS TCP sockets.

Orabug: 23635336

Reviewed-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

IB/ipoib v2: Add readout of statistics using ethtool

IPoIB collects statistics of traffic including number of packets
sent/received, number of bytes transferred, and certain errors. This
patch makes these statistics available to be queried by ethtool.

Orabug: 23105464
Signed-off-by: Hans Westgaard Ry <hans.westgaard.ry@oracle.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Tested-by: Yuval Shaia <yuval.shaia@oracle.com>
Change-Id: I654587da2fd1628e0977346c87b4a3a2f08a4bdc

IB/core: Add encode/decode IB_RATE_25_GBPS

The case for IB_RATE_25_GBPS, EDR signalling speed, was missing in
ib_rate_to_mult and mult_to_ib_rate giving wrong return values
when drivers are converting static rate to/from inter-packet-delay.
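
A minimal sketch of such an encode/decode pair follows. The enum and function
names are hypothetical stand-ins, not the ib_core definitions, and only a few
rates are shown. It assumes the multiplier is expressed in units of 2.5 Gb/s,
so 25 Gb/s maps to 10.

```c
/* Hypothetical subset of rate codes; names are illustrative only. */
enum rate {
    RATE_PORT_CURRENT = 0, /* fallback when the multiplier is unknown */
    RATE_2_5_GBPS,
    RATE_10_GBPS,
    RATE_25_GBPS
};

/* Encode: rate code -> multiple of 2.5 Gb/s, or -1 if unknown. */
static int rate_to_mult(enum rate r)
{
    switch (r) {
    case RATE_2_5_GBPS: return 1;
    case RATE_10_GBPS:  return 4;
    case RATE_25_GBPS:  return 10; /* the kind of case that was missing */
    default:            return -1;
    }
}

/* Decode: multiple of 2.5 Gb/s -> rate code, or the fallback if unknown. */
static enum rate mult_to_rate(int mult)
{
    switch (mult) {
    case 1:  return RATE_2_5_GBPS;
    case 4:  return RATE_10_GBPS;
    case 10: return RATE_25_GBPS;
    default: return RATE_PORT_CURRENT;
    }
}
```

Without the 25 Gb/s case in both switches, the value falls through to the
default branch, which is how a conversion can silently yield a wrong result.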

Orabug: 23084916

Signed-off-by: Hans Westgaard Ry <hans.westgaard.ry@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>

sif: Support for EPSC_API_VERSION(2,5)

Up-to-date header files for protocol up to and including EPSC API v.2.5:
* a new query, EPSC_QUERY_HW_REVISION to query HW revision through mailbox
   (from v2.5)
* Enable Jumbo frame query support (from v2.4)
* add DEGRADE_CAUSE_FLAG_MCAST_LACK_OF_CREDIT
   adding new cause for degraded mode (from v2.3)
* adding external portinfo query:
   Adding a query for some portinfo attributes on the external port.
   Only a draft until PSIFFW implementation is done.
   (from v2.2)
* API for mailbox for BER (BER = Bit Error Rate) support
   (from v2.1)

Signed-off-by: Knut Omang <knut.omang@oracle.com>

sif: Be more memory conservative for kdump and xen pv

- Enable using the Xen PV memory usage settings for kdump as well
- Tune these settings down by a factor of 2 to alleviate
  Orabug: 23523713

Orabug: 23729807

Signed-off-by: Knut Omang <knut.omang@oracle.com>

sif: rq: Use a workqueue to handle sif_flush_rq

Orabug: 23491094

In sif_flush_rq, one of the required steps is to acquire
the qp mutex for the qp state transition. Thus, this commit
moves sif_flush_rq into a separate single-threaded
workqueue to ensure that sif_flush_rq is safe to call
from any context, including interrupt context.

Signed-off-by: Wei Lin Guay <wei.lin.guay@oracle.com>
Reviewed-by: Knut Omang <knut.omang@oracle.com>

sif: rq: Added synchronization between sif_flush_rq and sif_post_recv

Orabug: 23491094

sif_flush_rq retrieves rq_sw->last_seq without acquiring
the rq lock. This commit adds the lock in sif_flush_rq to ensure that
the FLUSH-IN-ERR completion (rq_sw->last_seq) is only generated
after post_recv (rq_sw->last_seq) has completed.

Signed-off-by: Wei Lin Guay <wei.lin.guay@oracle.com>
Reviewed-by: Knut Omang <knut.omang@oracle.com>

sif: qp: added persistent_state in sif_qp struct

Orabug: 23491094

The QP state needs to be read in certain contexts which may not sleep.
However, the state is not guaranteed there, as a mutex cannot be used. Thus,
this commit adds a new atomic persistent_state to determine the QP
state in non-sleeping context.

This commit also removes the unused flush_sq_done_wa4074 variable and adds
a mutex for sif_query_qp due to WA #3714 and WA #662. In SIF, there is an
intermediate QP state in RTS->ERR and RTS->RESET. Thus, without the
mutex, sif_query_qp might get the intermediate QP state.

Signed-off-by: Wei Lin Guay <wei.lin.guay@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Knut Omang <knut.omang@oracle.com>

sif: qp: Increase inline data for TSO QPs to accommodate larger L3/L4-headers

Signed-off-by: Hans Westgaard Ry <hans.westgaard.ry@oracle.com>
Reviewed-by: Knut Omang <knut.omang@oracle.com>
Tested-by: Knut Omang <knut.omang@oracle.com>

sif: WA#3714: Set flush_retry_qp transport timer to infinite

The flush_retry_qp is configured with a minimum timeout value of 6
(262.144 usec). This, in combination with bug#4146 (duplicate send requests
not Acked if target RQ is empty), seems to be the reason the
driver runs into timeouts after applying WA#3714 (waiting
for the completion of the zero post send).

This commit sets the flush_retry_qp transport timer to infinite (value 0).
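
The 262.144 usec figure is consistent with the InfiniBand timeout encoding of
4.096 usec * 2^value, where a value of 0 disables the timer. The sketch below
assumes that encoding applies here; the function and macro names are
hypothetical, and integer nanoseconds are used to avoid floating point.

```c
#include <stdint.h>

/* Sentinel for "no timeout" in this sketch. */
#define TIMEOUT_INFINITE UINT64_MAX

/* Convert a 5-bit timeout field (0..31) to nanoseconds.
 * Assumes the encoding 4.096 usec * 2^value, with 0 meaning infinite. */
static uint64_t timeout_ns(unsigned int value)
{
    if (value == 0)
        return TIMEOUT_INFINITE;
    return UINT64_C(4096) << value; /* 4096 ns = 4.096 usec */
}
```

Under that assumption, a value of 6 gives 262144 ns, which is 262.144 usec.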

Signed-off-by: Triviño <francisco.trivino@oracle.com>
Reviewed-by: Knut Omang <knut.omang@oracle.com>

sif: Add a feature mask to allow internal vlink state to follow ext.links

SIF implements an internal IB switch for each port - all HCA vPorts are
connected to this switch, which again has a single external port
associated with the actual state of the physical port.

Some of the current management software assumes that a failed external port
of an HCA can be observed by looking at the local port, which is not
the case with SIF, where the local virtual port will not go down
if the external link goes down.

Firmware implements a mode to logically "wire" the vPort to the
corresponding physical port to mimic the legacy behaviour.

This mode can be enabled by OR'ing in 0x10000 in the module parameter
feature_mask. This is a temporary fix until management software can handle
this topology better.
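
As an illustration of OR'ing the bit into an existing mask, here is a small
sketch. The macro and function names are hypothetical; only the value 0x10000
comes from the text above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit value taken from the commit message; the macro name is hypothetical. */
#define FEATURE_VLINK_FOLLOWS_EXT UINT32_C(0x10000)

/* Return feature_mask with the vlink-follows-external-link bit set,
 * leaving all other feature bits untouched. */
static uint32_t enable_vlink_follow(uint32_t feature_mask)
{
    return feature_mask | FEATURE_VLINK_FOLLOWS_EXT;
}

/* True if the bit is set in feature_mask. */
static bool vlink_follow_enabled(uint32_t feature_mask)
{
    return (feature_mask & FEATURE_VLINK_FOLLOWS_EXT) != 0;
}
```

For example, an existing mask of 0x3 becomes 0x10003 once the bit is OR'ed in.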

Orabug: 23509653

Signed-off-by: Knut Omang <knut.omang@oracle.com>

sif: Redefine IB_QP_CREATE_ flags

Redefine sif-specific IB_QP_CREATE_ flags to avoid conflict with flags defined in ib_verbs.h.
Flags are moved to the range defined by IB_QP_CREATE_RESERVED_START and IB_QP_CREATE_RESERVED_END.

Note that we define more flags than fit the range and that some are defined below _RESERVED_START.

Signed-off-by: Hans Westgaard Ry <hans.westgaard.ry@oracle.com>
Reviewed-by: Knut Omang <knut.omang@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>

sif: SQ: Adding synchronization between wa4074 and post_send

Orabug: 23607042

In pre_process_wa4074, the SQ lock must be held before corrupting
the checksum in the SQ entry. Also, invert the checksum
value rather than setting it to 0.

Another missing case of acquiring the SQ lock is before generating
the completion. The SQ lock is only held to access sq_sw->last_seq,
to avoid generating a completion before the post send has completed. If this
happens, the completion might be generated using the old
wc_id.

Signed-off-by: Wei Lin Guay <wei.lin.guay@oracle.com>
Reviewed-by: Knut Omang <knut.omang@oracle.com>

sif: BZ4074: clean up the workaround

Orabug: 23607042

This patch cleans up workaround 4074, based on a review of
the code while working on Orabug: 23607042.

1) This patch replaces epsc_query_qp with reading
   the QPS from memory directly.
   As the QP is already in RESET state, accessing
   the QPS info from EPSC might
   return unexpected data.
2) Remove the unused function sq_flush_wa4074.
3) For readability, use the correct PSIF enum to
   mask PSIF specific WC code.

Signed-off-by: Wei Lin Guay <wei.lin.guay@oracle.com>
Reviewed-by: Knut Omang <knut.omang@oracle.com>

sif: BZ 4150: Flush retry reset at 1 when QP is modified to ERROR

Orabug: 23607042

The workaround is to disallow the polling of CQ before
the QP is modified to ERROR. By doing so, the CQ will be updated to
the correct sq_seq during post_wa4074.

Signed-off-by: Wei Lin Guay <wei.lin.guay@oracle.com>
Reviewed-by: Knut Omang <knut.omang@oracle.com>

sif: Automatically generate module version from new define TITAN_RELEASE

Change-Id: Ie9e262f12f53c0cdcd27ba9f7fa387be0ef4d884
Signed-off-by: Knut Omang <knut.omang@oracle.com>

sif: Enable debugging via trace_printk again

See Orabug: 23510486

Change-Id: I6353820356c9cf9286a1ba72ce883da507736c5f
Signed-off-by: Knut Omang <knut.omang@oracle.com>

sif: Remove ib_query_mr - it has been removed upstream

Signed-off-by: Knut Omang <knut.omang@oracle.com>

sif: Use kernel function printk_ratelimit() instead of home brew

Removed sif_log_cq, sif_log_cq and the perf_sampling_threshold
kernel module which was added for debugging purposes.
Also adjust down a few log levels of some messages.

Signed-off-by: Knut Omang <knut.omang@oracle.com>

sif: sif_qp: implement additional flush_retry_qp for port 2 (WA#3714)

Current implementation of WA#3714 uses an RC flush_retry_qp associated to
port 1, and configured with port_lid from port 1. Under the scenario where
port 1 is not used (port 1 is not connected, or in INIT state), but port
2 is up and running, the WA will use an invalid flush_retry_qp (with
port_lid = 0).

This commit improves the WA#3714 implementation by creating an additional
flush_retry_qp that is associated with port 2. The proper flush_retry_qp
is selected depending on the port of the target QP on which the WA will be applied.

Signed-off-by: Triviño <francisco.trivino@oracle.com>
Reviewed-by: Knut Omang <knut.omang@oracle.com>

sif: Build for kernel v.4.5.6

Signed-off-by: Knut Omang <knut.omang@oracle.com>

sif: rq: Added synchronization during freeing rq

Added synchronization between free_rq and flush_rq to
ensure rq can only be freed up after flush_rq has
completed.

Signed-off-by: Wei Lin Guay <wei.lin.guay@oracle.com>
Reviewed-by: Knut Omang <knut.omang@oracle.com>

sif: cq: Do not invalidate the CQ until completion of events

The CQ completion event might run when the CQ
has already been invalidated. Also add a check in
sif_req_notify_cq to avoid rearming the CQ if the CQ
has been invalidated.

This fixes a scenario reported in Orabug: #23491094.

Signed-off-by: Wei Lin Guay <wei.lin.guay@oracle.com>
Reviewed-by: Knut Omang <knut.omang@oracle.com>

sif: BZ 4138: Fix a NULL pointer dereference in RDS during tear-down

The issue is in the rqflush where the hardware view cannot be
trusted and SIF driver needs to rely on the software view. In
this case, software must wait for 1s to ensure that all the
completions are back. If the software counter is different
than the hardware view, the software counter will be used.

This issue is observed in Orabug: 23490618.

Signed-off-by: Wei Lin Guay <wei.lin.guay@oracle.com>
Reviewed-by: Knut Omang <knut.omang@oracle.com>

sif: LSO, test/adjust attr in create_qp,test stencil-size in send

We need to test that we can add an sge for the LSO stencil
subtracting 1 from max number of supported sge. The
attr.max_sge is incremented to get correct allocation of sge
with the extra entry for the LSO stencil.

We assume that the size of LSO headers/stencils is <= 64
adjusting max_inlinesize. The actual size is tested when
work-requests are posted.

Signed-off-by: Hans Westgaard Ry <hans.westgaard.ry@oracle.com>
Reviewed-by: Knut Omang <knut.omang@oracle.com>

sif: qp: Remove function name in debug printout to avoid confusion

Signed-off-by: Hakon Bugge <Haakon.Bugge@oracle.com>
Reviewed-by: Knut Omang <knut.omang@oracle.com>

sif: sif_qp: remove flush_sq_done_wa4074 condition from WA#3714

Commit ab5d21b added flush_sq_done_wa4074 condition to the WA#3714 check.
As a result, WA#3714 is never called. This commit removes the condition.

Signed-off-by: Triviño <francisco.trivino@oracle.com>
Reviewed-by: Knut Omang <knut.omang@oracle.com>

sif: Add debugfs for workaround usage statistics

This commit also adds:

* WA#3714 usage counters and dump info
* Rename 3713 (bug ticket) to 3714 (WA ticket)

Signed-off-by: Triviño <francisco.trivino@oracle.com>
Reviewed-by: Knut Omang <knut.omang@oracle.com>

sif: ARMv8 (aarch64) portability changes.

Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
Reviewed-by: Knut Omang <knut.omang@oracle.com>

sif: Fixed typo

Signed-off-by: Hans Westgaard Ry <hans.westgaard.ry@oracle.com>
Reviewed-by: Knut Omang <knut.omang@oracle.com>

sif: query: Make headroom for TSO stencil used by IPoIB datagram mode

When query_device is called from IPoIB in datagram-mode it will
return max_sge = (SIF_HW_MAX_SEND_SGE-1) as opposed to SIF_HW_MAX_SEND_SGE
for other cases.

Signed-off-by: Hans Westgaard Ry <hans.westgaard.ry@oracle.com>
Reviewed-by: Knut Omang <knut.omang@oracle.com>

sif: using FW release version in device attributes

Signed-off-by: Andre Wuttke <andre.wuttke@oracle.com>
Reviewed-by: Åsmund Østvold <asmund.ostvold@oracle.com>
Pre-check: Åsmund Østvold <asmund.ostvold@oracle.com>

sif: Make driver more silent at startup

- Set debug_mask to 0x1 for upstream
- Move most initialization messages to INIT(0x2) level
- Return number of VFs enabled instead of 0 from sif_vf_enable
  This also eliminates a warning from the kernel framework

Signed-off-by: Knut Omang <knut.omang@oracle.com>

sif: sif_r3: fix sif_r3_recreate_flush_qp soft lockup.

This commit fixes Orabug: #23540257. It prevents the situation where
the flush_retry_qp is used before the sdev->flush_lock has been
initialized. This occurs when IB_EVENT_LID_CHANGE event is received
before the flush_retry_qp is created by the driver.

Signed-off-by: Triviño <francisco.trivino@oracle.com>
Reviewed-by: Knut Omang <knut.omang@oracle.com>

sif: ah: Fixed incorrect ipd setting

Use cached copies of active speed and width in order to fulfill
ib_core locking rules. That is, create_ah() cannot sleep.

Signed-off-by: Hakon Bugge <Haakon.Bugge@oracle.com>
Reviewed-by: Knut Omang <knut.omang@oracle.com>

sif: qp/ah: Added XRC QPs & IPD(AH) to debugfs output

Signed-off-by: Vinay Shaw <vinay.shaw@oracle.com>
Reviewed-by: Knut Omang <knut.omang@oracle.com>

sif: epsc: Fix keepalive timeouts

* If a keepalive is posted, do not reset the general timeout interval.
  This effectively caused EPSC requests to never time out.
* Remove a superfluous timeout reset in sif_eps_poll_cqe.
  The timeout was already set correctly during post.
* Also avoid sending keepalives if the sender has given up.

Signed-off-by: Knut Omang <knut.omang@oracle.com>

sif: Compile with kernel 4.4.10

* Remove obsolete (dummy) unimplemented fast_reg call impl
  This has changed significantly in 4.4.x and was not
  implemented for older kernels anyway.

* Add ifdefs for new wr struct layout -
  no longer uses union, instead we have to upcast
  to the right type to find the qp/request
  type specific fields

* undefine the mtrr code as the mtrr_del function seems not to be
  made available anymore. The functionality is not used anyway atm.

* Some regressions wrt checkpatch

Signed-off-by: Knut Omang <knut.omang@oracle.com>

sif: DNE QPs were created even with limited mode

This could potentially lead to situations where it is not
possible to upgrade firmware.

Also make all QP creation fail in limited mode, otherwise
someone might create one and try to run traffic on it.
In particular any use of PQPs will lead to kernel null pointer
exceptions as they have not been initialized.

Signed-off-by: Knut Omang <knut.omang@oracle.com>

sif: eq: Avoid sending COMM_EST event to ULPs (UD, RAW & GSI QPs)

From IB spec, o11-5.1.1:
For UD and Raw service types, generation of the Communication
Established Affiliated Asynchronous Event is allowed, but is
strongly discouraged.
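
The filtering rule can be sketched as a simple predicate on the QP type. The
enum and function below are hypothetical and are not the sif or ib_core
definitions; the set of suppressed types (UD, RAW and GSI) is taken from the
commit title.

```c
#include <stdbool.h>

/* Hypothetical QP service types for illustration. */
enum qp_type { QP_RC, QP_UC, QP_UD, QP_RAW, QP_GSI };

/* Decide whether a Communication Established event should be
 * forwarded to the upper-layer protocol for this QP type. */
static bool forward_comm_est(enum qp_type type)
{
    switch (type) {
    case QP_UD:
    case QP_RAW:
    case QP_GSI:
        return false; /* generation is strongly discouraged for these */
    default:
        return true;  /* connected types still get the event */
    }
}
```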

Signed-off-by: Vinay Shaw <vinay.shaw@oracle.com>
Reviewed-by: Knut Omang <knut.omang@oracle.com>

sif: XRC: XRC support and PSIF 2.1 limitation #3521

This commit addresses the issue of XRC support (Orabug: 23044600).

Changes include:
* XRCTGT QP not to allocate SQ
* Introduced get_sq/rq function to check XRC cases
  (XRC INI/TGT QP has no RQ, XRC TGT QP has no SQ & RQ).
* Overload "ib_qp_attr" attributes for modify XRC QP (RTS state)
  requirement of PSIF
* Rearranged/moved all QP helper functions to be in sif_qp.c/.h files

Note about user space support for XRC:
Since a XRCSRQ can be targeted by multiple XRCTGTQPs with same
XRC domain, simply getting a QP# in completion doesn't help.
MLX-hw overloads the "src_qp" with XRCSRQ# for completions.

For now, we limit the XRC association (not related to kernel context)
    one user-context <--> one XRCTGTQP/XSRQ

Signed-off-by: Vinay Shaw <vinay.shaw@oracle.com>
Reviewed-by: Knut Omang <knut.omang@oracle.com>

sif: cq: tear-down sequence in cleaning up the SendCQ

This commit ensures that the sif_fixup_cqes for a sendCQ
can only be executed after post_process_wa4074. As the CQE
in a sendCQ cannot be trusted, walk_and_update CQ must
be performed first.

In a scenario where the post_process_wa4074 and sif_fixup_cqes
are performed concurrently, the post_process_wa4074 is given
priority where no polling of the SendCQ is allowed in
sif_fixup_cqes. Then, post_process_wa4074 will generate
the remaining FLUSH-IN ERR for a Send queue.

Signed-off-by: Wei Lin Guay <wei.lin.guay@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Knut Omang <knut.omang@oracle.com>

sif: Fix regressions in supporting fw from release 0.1.0.4 and earlier

Orabug: 23497496

This commit fixes two separate regressions in handling the old fw:

1) Teardown of the dne_qp happens only with older FWs, because newer
   firmware implements the dne_qp handling in fw, so the
   driver does not invoke the teardown code. This teardown code
   uses generic calls that have been implicitly amended
   by the WA for Bug #4074, which also assumed that all QPs
   subject to that call have a valid send queue. The DNE QPs don't,
   and this causes a null pointer exception, which is triggered
   both during driver unload and as a side effect of lid changes.
2) EPSC support for SL to TSL mapping was introduced in EPSC API v.0.56
   but was broken - this causes the driver to set wrong values,
   which leads to modify_qp errors. The fix is just to avoid putting
   the map to use unless the epsc version is >= 0.57.
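
The version gate in item 2 can be sketched as a major/minor comparison. The
struct and function names are hypothetical and do not come from the sif
driver; only the 0.57 threshold is taken from the text above.

```c
#include <stdbool.h>

/* Hypothetical representation of an EPSC API version. */
struct api_version {
    unsigned int major;
    unsigned int minor;
};

/* True if v is at least major.minor. */
static bool version_at_least(struct api_version v,
                             unsigned int major, unsigned int minor)
{
    return v.major > major || (v.major == major && v.minor >= minor);
}

/* Only use the SL to TSL map from the first version where it works. */
static bool use_sl_to_tsl_map(struct api_version v)
{
    return version_at_least(v, 0, 57);
}
```

In this sketch, firmware reporting v.0.56 has the mapping available but is
deliberately treated as if it were not.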

Signed-off-by: Knut Omang <knut.omang@oracle.com>

{IBCM/IPoIB/MLX4/RDS}: Temporary backout Exasecure change

The ExaSecure changeset seems to impact the Exadata data integrity
checker. We back out all the ExaSecure changes for now until
the issue gets addressed.

Orabug: 23634771

Revert "IB/mlx4: Generate alias GUID for slaves"
Revert "RDS: Fix the rds_conn_destroy panic due to pending messages"
Revert "RDS: add handshaking for ACL violation detection at passive"
Revert "RDS: IB: enforce IP anti-spoofing for UUID context"
Revert "RDS: IB: invoke connection destruction in worker"
Revert "RDS: message filtering based on UUID"
Revert "RDS: Add UUID socket option"
Revert "RDS: Add reset all conns for a source address to CONN_RESET"
Revert "IB/ipoib: ioctl interface to manage ACL tables"
Revert "IB/ipoib: sysfs interface to manage ACL tables"
Revert "IB/{cm,ipoib}: Filter traffic using ACL"
Revert "IB/{cm,ipoib}: Manage ACL tables"

Tested-by: Rene Kundersma <rene.kundersma@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

RDS/IB: Fix crash in SRQ initialization

SRQ initialization causes a crash when the IC connection is not available.

Orabug: 23523586

This is a regression fix for commit 0f0f08915.
More work is required to have SRQ working with variable fragment size.
For now, we fix the crash in SRQ initialization.

This also adds a warning when SRQ is enabled.
The SRQ feature is experimental and disabled by default.
When a user enables it, we should give a warning.

Signed-off-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Tested-by: jenny x.xu <jenny.x.xu@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

RDS: Remove the link-local restriction as a stop gap measure

A fresh CRS install seems to depend on the RDS IB link-local
connection going through. Setting the cluster_interconnect
parameter to a non-link-local address doesn't cover the fresh
install use cases.

So as a stop-gap measure, we just warn the user but let the connection
through until we come up with a solution to re-introduce the change.

Orabug: 2360905

Tested-by: Maria Yip <maria.yip@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

RDS: IB: restore the vector spreading for the CQs

Since the IB_CQ_LEAST_LOADED vector support is not there on newer
kernels (post OFED 1.5), we had #if 0 code for it which got removed
as part of commit 3f1db626594e ("RDS: IB: drop discontinued IB
CQ_VECTOR support"). On UEK2, the drivers had an implementation
for this IB verb. UEK4, which is based on a newer kernel, obviously
doesn't support it.

RDS had an alternate fallback scheme which can be used in the absence
of the dropped verb. On UEK2, we didn't use it, but UEK4 RDS code was
silently using it until the code got removed. This patch restores
that code with a bit more clarity on what it is actually doing.

Orabug: 23550561

Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

IB/mlx4: Generate alias GUID for slaves

Generate alias GUID by changing the fourth byte to be the GUID index in the
port GUID table.
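
A minimal sketch of this derivation follows. It is not the mlx4 code: the
function name is hypothetical, and it assumes that "fourth byte" means index 3
of the 8-byte GUID in array order, which the commit message does not spell
out.

```c
#include <stdint.h>
#include <string.h>

#define GUID_LEN 8

/* Derive an alias GUID from the port GUID by overwriting one byte with
 * the GUID table index.
 * Assumption: "fourth byte" is taken to be array index 3. */
static void make_alias_guid(const uint8_t port_guid[GUID_LEN],
                            uint8_t guid_index,
                            uint8_t alias[GUID_LEN])
{
    memcpy(alias, port_guid, GUID_LEN);
    alias[3] = guid_index;
}
```

Because only one byte varies, aliases generated this way stay unique per
table index as long as the index fits in 8 bits.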

This is a port of work done in UEK2 for Oracle purposes only.

Orabug: 23292164

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>

RDS: Fix the rds_conn_destroy panic due to pending messages

In corner cases, there could be pending messages on a connection which
needs to be destroyed. Make sure those messages are purged before
the connection is torn down.

Orabug: 23222944

Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

RDS: add handshaking for ACL violation detection at passive

Offending connections with ACL violations should be cleaned up as
early as possible. When the active side detects an ACL violation and sends
a reject, it fills in the private_data field. The passive side checks
private_data whenever it receives a reject, and in case of an ACL violation
it destroys the connection.

Orabug: 23222944

Signed-off-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

RDS: IB: enforce IP anti-spoofing for UUID context

A connection is established only after verifying that the IP requesting the
connection is legitimate and part of the ACL group. Invalid connection
request(s) are rejected and destroyed.

Ajay moved the connection destroy to the point where the ACL check fails while
initiating the connection, to avoid unnecessary packet transfer on the wire.

Orabug: 23222944

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Bang Ngyen <bang.nguyen@oracle.com>
Signed-off-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

RDS: IB: invoke connection destruction in worker

This is to avoid a deadlock with the c_cm_lock mutex.
In the InfiniBand event handling path, whenever connection destruction is
required, we should invoke a worker in order to avoid deadlocking on the mutex.

Orabug: 23222944

Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>

RDS: message filtering based on UUID

Egress message is filtered based on UUID and for ingress, the
unique UUID is CMS'ed to application to take further action.

SYSCTL 'uuid_tx_no_drop' to override UUID based packet filtering

Orabug: 23222944

Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

RDS: Add UUID socket option

UUID is opaque user application data to be stored
per socket connection.

IB transport makes use of it for the ACL based
message filtering.

Orabug: 23222944

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Bang Ngyen <bang.nguyen@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

RDS: Add reset all conns for a source address to CONN_RESET

RDS_CONN_RESET SO gets enhanced to support resetting all
connections associated with a local address.

$rds-stress -r <SRC_IP> -s 0 --reset

Orabug: 23222944

Reported-by: Bang Ngyen <bang.nguyen@oracle.com>
Acked-by: Bang Ngyen <bang.nguyen@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

IB/ipoib: ioctl interface to manage ACL tables

Expose ioctl to manage ACL content by application layer.

Orabug: 18679884

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

IB/ipoib: sysfs interface to manage ACL tables

Expose sysfs interface for ACL to be used for debug.

Orabug: 18679884

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

IB/{cm,ipoib}: Filter traffic using ACL

Implement two packet filtering points, one at ib_ipoib driver when
processing ARP packets and second in ib_cm when processing connection
requests.

Orabug: 18679884

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

sif driver initial commit part 1

sif_ah.c:        Implementation of IB address handles for SIF
sif_ah.h:        Interface to internal IB address handle logic for SIF
sif_base.c:      Basic hardware setup of SIF
sif_base.h:      Basic hardware setup of SIF
sif_checksum.c:  Utilities for SIF specific 32 bit checksums
sif_checksum.h:  Utilities for SIF specific 32 bit checksums
sif_cq.c:        Implementation of completion queue logic for SIF
sif_cq.h:        Internal interface to psif completion queue logic
sif_debug.c:     Use of debugfs for dumping internal data structure info
sif_debug.h:     Use of debugfs for dumping internal data structure info
sif_defs.c:      IB-to-SIF Mapper.
sif_defs.h:      Div. utility definitions and auxiliary data structures
sif_dev.h:       Driver specific data structure definitions
sif_dma.c:       DMA memory mapping
sif_dma.h:       DMA memory mapping
sif_drvapi.h:    Device specific operations available via the FWA access path
sif_elog.c:      Log over PCIe support for firmware
sif_elog.h:      Misc device for capturing log from the EPSC
sif_enl.h:       Protocol definitions for the netlink protocol for EPSC access from
sif_epsc.c:      Implementation of API for communication with the EPSC
sif_epsc.h:      API for communication with the EPSC (and EPS-A's)
sif_eq.c:        Setup of event queues and interrupt handling
sif_eq.h:        Event queues and interrupt handling
sif_fmr.c:       Implementation of fast memory registration for SIF
sif_fmr.h:       Interface to internal IB Fast Memory Registration (FMR)

Credits:
The sif driver supports Oracle’s new Dual Port EDR and QDR
IB Adapters and the integrated IB devices on the new SPARC SoC.

The driver is placed under drivers/infiniband/hw/sif

This patch set is the result of direct or indirect contribution by
several people:

Code contributors:
  Knut Omang, Vinay Shaw, Haakon Bugge, Wei Lin Guay,
  Lars Paul Huse, Francisco Trivino-Garcia.

Minor patch/bug fix contributors:
  Hans Westgaard Ry, Jesus Escudero, Robert Schmidt, Dag Moxnes,
  Andre Wuttke, Predrag Hodoba, Roy Arntsen

Initial architecture adaptations:
  Khalid Aziz (sparc64), Gerd Rausch (arm64)

Testing, Test development, Continuous integration, Bug hunting, Code
review:
  Knut Omang, Hakon Bugge, Åsmund Østvold, Francisco Trivino-Garcia,
  Wei Lin Guay, Vinay Shaw, Hans Westgaard Ry,
  + numerous other people within Oracle.

Simulator development:
  Andrew Manison, Hans Westgaard Ry, Knut Omang, Vinay Shaw

Orabug: 22529577

Reviewed-by: Hakon Bugge <Haakon.Bugge@oracle.com>
Signed-off-by: Knut Omang <knut.omang@oracle.com>

MAINTAINERS: Add Knut Omang as maintainer for sif, Oracle's Infiniband HCA driver

Signed-off-by: Knut Omang <knut.omang@oracle.com>

ib: Enable building the sif driver in the infiniband stack

The sif driver is the driver for Oracle's new Infiniband HCA series.

Signed-off-by: Knut Omang <knut.omang@oracle.com>

sif driver initial commit part 7

Hardware struct print functions, Makefile and version info

Signed-off-by: Knut Omang <knut.omang@oracle.com>

sif driver initial commit part 6

Hardware register access functions, data types and macro definitions, part 2

Signed-off-by: Knut Omang <knut.omang@oracle.com>

sif driver initial commit part 5

Hardware register macro definitions and data types, part 1

Signed-off-by: Knut Omang <knut.omang@oracle.com>

sif driver initial commit part 4

sif_tqp.h:       Implementation of EPSA tunnelling QP for SIF
sif_user.h:      This file defines sif specific verbs extension request/response.
sif_verbs.c:     IB verbs API extensions specific to PSIF
sif_verbs.h:     IB verbs API extensions specific to PSIF
sif_vf.c:        SR/IOV support functions
sif_vf.h:        SR/IOV support functions
sif_xmmu.c:      Implementation of special MMU mappings.
sif_xmmu.h:      Implementation of special MMU mappings.
sif_xrc.c:       Implementation of XRC related functions
sif_xrc.h:       XRC related functions
version.h:       Detailed version info data structure

Signed-off-by: Knut Omang <knut.omang@oracle.com>

sif driver initial commit part 3

sif_pt.c:        SIF (private) page table management
sif_pt.h:        SIF (private) page table management.
sif_qp.c:        Implementation of IB queue pair logic for sif
sif_qp.h:        Interface to internal IB queue pair logic for sif
sif_query.c:     SIF implementation of some of IB query APIs
sif_query.h:     SIF implementation of some of IB query APIs
sif_r3.c:        Special handling specific for psif revision 3 and earlier
sif_r3.h:        Special handling specific for psif revision 3 and earlier
sif_rq.c:        Implementation of sif receive queues
sif_rq.h:        Interface to sif receive queues
sif_sndrcv.c:    Implementation of post send/recv logic for SIF
sif_sndrcv.h:    Interface to IB send/receive, MAD packet recv and
sif_spt.c:       Experimental implementation of shared use of the OS's page tables.
sif_spt.h:       Experimental (still unsafe)
sif_sq.c:        Implementation of the send queue side of an IB queue pair
sif_sq.h:        Implementation of the send queue side of an IB queue pair
sif_srq.c:       Interface to shared receive queues for SIF
sif_srq.h:       Interface to internal Shared receive queue logic for SIF
sif_tqp.c:       Implementation of EPSA tunneling QP for SIF

Signed-off-by: Knut Omang <knut.omang@oracle.com>

sif driver initial commit part 2

sif_fwa.c:       Firmware access API (netlink based out-of-band comm)
sif_fwa.h:       Low level access to a SIF device
sif_hwi.c:       Hardware init for SIF - combines the various init steps for psif
sif_hwi.h:       Hardware init for SIF
sif_ibcq.h:      External interface to IB completion queue logic for SIF
sif_ibpd.h:      External interface to (IB) protection domains for SIF
sif_ibqp.h:      External interface to IB queue pair logic for sif
sif_idr.c:       Synchronized ID ref allocation
sif_idr.h:       simple id allocation and deallocation for SIF
sif_int_user.h:  This file defines special internal data structures used
sif_ireg.c:      Utilities and entry points needed for Infiniband registration
sif_ireg.h:      support functions used in setup of sif as an IB HCA
sif_main.c:      main entry points and initialization
sif_mem.c:       SIF table memory and page table management
sif_mem.h:       A common interface for all memory used by
sif_mmu.c:       main entry points and initialization
sif_mmu.h:       API for management of sif's on-chip mmu.
sif_mr.c:        Implementation of memory regions support for SIF
sif_mr.h:        Interface to internal IB memory registration logic for SIF
sif_mw.c:        Implementation of memory windows for SIF
sif_mw.h:        Interface to internal IB memory window logic for SIF
sif_pd.c:        Implementation of IB protection domains for SIF
sif_pd.h:        Internal interface to protection domains
sif_pqp.c:       Privileged QP handling
sif_pqp.h:       Privileged QP handling

Signed-off-by: Knut Omang <knut.omang@oracle.com>

IB/{cm,ipoib}: Manage ACL tables

Add support for ACL tables for ib_ipoib and ib_cm drivers.
The ib_cm driver exposes functions to register and unregister tables and
to manage table contents.
In the ib_ipoib driver, add an ACL object for each network device.

Orabug: 18679884

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

RDS: Drop stale iWARP support

RDS iWARP support was mostly academic and was never complete or fully
functional. It is being removed for good. Thanks to Ajay for adding a
couple of missed hunks.

Orabug: 23027670

Tested-by: Michael Nowak <michael.nowak@oracle.com>
Tested-by: Rose Wang <rose.wang@oracle.com>
Tested-by: Rafael Alejandro Peralez <rafael.peralez@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

RDS: IB: drop discontinued IB CQ_VECTOR support

IB_CQ_VECTOR_LEAST_ATTACHED was an OFED 1.5 feature which Mellanox later
dropped. According to them, 'least attached' was not a way to distribute
the load, and it actually fooled the user into thinking it did. The fact
that some CPU had the 'least amount of attached CQs' has nothing to do
with that CPU's actual load. This is why the feature never made it upstream.

On UEK4, the code is already under #if 0 because the feature isn't
available. Time to clean up the dead code, considering it has already been
dropped from upstream as well as from OFED 2.0 onwards.

Orabug: 23027670

Tested-by: Michael Nowak <michael.nowak@oracle.com>
Tested-by: Rose Wang <rose.wang@oracle.com>
Tested-by: Rafael Alejandro Peralez <rafael.peralez@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

RDS: IB: Drop unused and broken APM support

APM support in RDS has been broken and hence is not being
used in production. We kept the code around, but it's time
to remove it and reduce the complexity in the RDS
failover code paths.

Orabug: 23027670

Tested-by: Michael Nowak <michael.nowak@oracle.com>
Tested-by: Rose Wang <rose.wang@oracle.com>
Tested-by: Rafael Alejandro Peralez <rafael.peralez@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

RDS: IB: Make use of ARPOP_REQUEST instead of ARPOP_REPLY in bonding code

Even though the IPv4 ARP RFC allows using either REQUEST or REPLY
for gratuitous ARP, upstream code from 3.14 onwards has moved on to
use only REQUEST.

Relevant commit:
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=56022a8fdd874c56bb61d8c82559e43044d1aa06

The RDS active bonding gratuitous ARP code needs to adapt to this change
to take advantage of the neighbor updates on UEK4. The current code
makes use of ARPOP_REPLY, which needs to be changed to ARPOP_REQUEST.
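
As an illustration of the distinction above, here is a minimal user-space
sketch of a gratuitous ARP, where the sender and target IP are the same
address and only the opcode differs. The struct layout and the names
garp_ipv4, garp_build and GARP_OP_* are hypothetical and simplified; this
is not the RDS bonding code, and hardware addresses are left out.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/* ARP opcodes as defined by RFC 826 (hypothetical local names). */
enum { GARP_OP_REQUEST = 1, GARP_OP_REPLY = 2 };

/* Simplified IPv4 part of an ARP payload; not an on-wire layout. */
struct garp_ipv4 {
    uint16_t op;        /* network byte order */
    uint32_t sender_ip; /* network byte order */
    uint32_t target_ip; /* network byte order */
};

/* Fill in a gratuitous ARP for ip: sender and target IP are identical.
 * Returns 0 on success, -1 if ip is not a valid dotted-quad address. */
static int garp_build(struct garp_ipv4 *g, const char *ip, int opcode)
{
    struct in_addr addr;

    if (inet_pton(AF_INET, ip, &addr) != 1)
        return -1;
    memset(g, 0, sizeof(*g));
    g->op = htons((uint16_t)opcode);
    g->sender_ip = addr.s_addr;
    g->target_ip = addr.s_addr;
    return 0;
}
```

Calling garp_build(&g, "192.0.2.10", GARP_OP_REQUEST) yields the REQUEST
form that this commit switches to.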

Orabug: 23094704

Tested-by: Michael Nowak <michael.nowak@oracle.com>
Tested-by: Rose Wang <rose.wang@oracle.com>
Tested-by: Rafael Alejandro Peralez <rafael.peralez@oracle.com>
Acked-by: Mukesh Kacker <mukesh.kacker@oracle.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Reported-by: Mukesh Kacker <mukesh.kacker@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

RDS: IB: don't use the link-local address for ib transport

Link-local addresses can't be used for IB failover and don't work
with the IB stack. Even though the DB RDS usage guidance recommends not
using these addresses, we keep hitting issues from accidental use,
caused by missing application config or by admin scripts blindly
running rds-ping on every local address.

For RDS TCP, which doesn't support active-active, there might be a
use case, so the current fix is limited to the IB transport for now.

Example traces:
$ rds-ping -I 169.254.221.37 169.254.221.38
bind() failed, errno: 99 (Cannot assign requested address)

console:
RDS/IB: Link local address 169.254.221.37 NOT SUPPORTED
RDS: rds_bind() could not find a transport for 169.254.221.37, load rds_tcp or rds_rdma?
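
The kind of check involved can be sketched in user space as a test for
the IPv4 link-local range 169.254.0.0/16. The function name
is_ipv4_link_local is hypothetical; the actual RDS/IB check is not shown
in this message and may be written differently.

```c
#include <arpa/inet.h>
#include <stdint.h>

/* Return 1 if ip is in 169.254.0.0/16, 0 if it is not,
 * and -1 if ip is not a valid dotted-quad IPv4 address. */
static int is_ipv4_link_local(const char *ip)
{
    struct in_addr addr;

    if (inet_pton(AF_INET, ip, &addr) != 1)
        return -1;
    /* 169.254.x.x is 0xa9fe0000 in host byte order. */
    return (ntohl(addr.s_addr) & 0xffff0000u) == 0xa9fe0000u;
}
```

With this sketch, the address from the trace above, 169.254.221.37, is
classified as link-local, while an address such as 192.168.1.1 is not.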

Orabug: 23027670

Tested-by: Michael Nowak <michael.nowak@oracle.com>
Tested-by: Rose Wang <rose.wang@oracle.com>
Tested-by: Rafael Alejandro Peralez <rafael.peralez@oracle.com>
Acked-by: Mukesh Kacker <mukesh.kacker@oracle.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

RDS: IB: rebuild receive caches when needed

The RDS IB cache code leaks memory. The leak has been there since the
inception of the cache code, but we didn't notice it because caches
are not torn down in normal operation paths. But now, to support
features like variable fragment or connection destroy for ACL,
caches need to be destroyed and rebuilt if needed.

While freeing the caches is just fine, leaking memory while
doing so is a bug and needs to be addressed. Thanks to Wengang
for spotting this stone-age leak. Also, the cache rebuild needs
to be done only when desired, so the patch optimises that part as
well.

Tested-by: Michael Nowak <michael.nowak@oracle.com>
Tested-by: Maria Rodriguez <maria.r.rodriguez@oracle.com>
Tested-by: Hong Liu <hong.x.liu@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

rdma_cm: use cma_info() instead of cma_dbg()

We want to have selected prints going into messages file when debug
is enabled.

Orabug: 22381123

Signed-off-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Acked-by: Wengang Wang <wen.gang.wang@oracle.com>

OFED: indicate consistent vendor error

Vendor error prints should be consistent across protocols to avoid
any confusion.
Currently, the value is printed in decimal in some places and in hex
in others.
This patch corrects that.

Orabug: 22381117

Suggested-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Reviewed-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Acked-by: Wengang Wang <wen.gang.wang@oracle.com>

RDS: Change number based conn-drop reasons to enum

This patch converts the number based connection-drop reasons to enums,
making it easy to grep the reasons and to develop new patches based on
these reasons.
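
A minimal sketch of the idea, assuming made-up reason names: the
enumerators DROP_NONE, DROP_HB_TIMEOUT and DROP_CQ_ERROR and the helper
drop_reason_str are hypothetical and are not the identifiers used by
the patch.

```c
#include <stddef.h>

/* Hypothetical connection-drop reasons; named values are greppable. */
enum drop_reason {
    DROP_NONE = 0,
    DROP_HB_TIMEOUT,
    DROP_CQ_ERROR,
    DROP_REASON_MAX   /* number of reasons, keep last */
};

/* Name table indexed by the enum, for log messages. */
static const char *const drop_reason_names[DROP_REASON_MAX] = {
    [DROP_NONE]       = "DROP_NONE",
    [DROP_HB_TIMEOUT] = "DROP_HB_TIMEOUT",
    [DROP_CQ_ERROR]   = "DROP_CQ_ERROR",
};

/* Map a reason to its name; out-of-range values map to "UNKNOWN". */
static const char *drop_reason_str(enum drop_reason r)
{
    if ((unsigned int)r >= DROP_REASON_MAX)
        return "UNKNOWN";
    return drop_reason_names[r];
}
```

Compared with a bare number, a call site that passes DROP_CQ_ERROR can
be found with a plain text search, and a new reason is added in one place.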

Orabug: 23294707

Signed-off-by: Avinash Repaka <avinash.repaka@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

RDS: Move rds_rtd definitions from rds_rt_debug files to common files

This patch moves rds_rtd definitions from rds_rtd_debug.h to rds.h and
rds_rt_debug_bitmap modparam definition from rds_rt_debug.c to af_rds.c.
The patch removes rds_rt_debug files since there isn't much content
in these files to be held separately.

Commit 'ib/rds: runtime debuggability enhancement' originally defined
rds_rtd definitions.

Orabug: 23294707

Signed-off-by: Avinash Repaka <avinash.repaka@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

RDS: Change the default value of rds_rt_debug_bitmap modparam to 0x488B

This patch changes the default value of rds_rt_debug_bitmap module
parameter to 0x488B to enable RDS_RTD_ERR, RDS_RTD_ERR_EXT, RDS_RTD_CM,
RDS_RTD_ACT_BND, RDS_RTD_RCV, RDS_RTD_SND flags of rds_rtd.
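
The value 0x488B has six bits set, matching the six flags listed above.
The helper below, with the hypothetical name set_bits, lists the set bit
positions of a mask. Which flag corresponds to which bit is not stated
in this message, so the sketch does not assign names to bits.

```c
#include <stdint.h>

/* Store the positions of the set bits of mask in pos[], lowest first.
 * Returns the number of set bits found. */
static int set_bits(uint32_t mask, int pos[32])
{
    int n = 0;

    for (int bit = 0; bit < 32; bit++)
        if (mask & (UINT32_C(1) << bit))
            pos[n++] = bit;
    return n;
}
```

For 0x488B this reports bits 0, 1, 3, 7, 11 and 14.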

Orabug: 23294707

Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Signed-off-by: Avinash Repaka <avinash.repaka@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

RDS: Replace rds_rtd printk with trace_printk

Forward rds_rtd prints to ftrace buffer by replacing
printk with trace_printk.

Orabug: 23294707

Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Signed-off-by: Avinash Repaka <avinash.repaka@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

RDS: IB: Print vendor error in recv completion error message

This patch prints the vendor error along with the work
completion status in the recv completion error message.

Orabug: 23294707

Signed-off-by: Avinash Repaka <avinash.repaka@oracle.com>
Acked-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

IB/mlx4: Fix unaligned access in send_reply_to_slave

The problem is that the function 'send_reply_to_slave' gets the
'req_sa_mad' as a pointer whose address is only aligned to 4 bytes
but is 8 bytes in size. This can result in unaligned access faults
on certain architectures.

Sowmini Varadhan pointed to this reply from Dave Miller that says
that memcpy should not be used to solve alignment issues:
https://lkml.org/lkml/2015/10/21/352

Optimization of memcpy to an 'ldx' instruction can only happen if the
compiler knows that the size of the data we are copying is 8 bytes
and it assumes it is aligned to 8 bytes. If the compiler knows the
type is not aligned to 8 bytes, it must not optimize the 8-byte copy.
Defining the data type as aligned to 4 forces the compiler to treat
all accesses as though they aren't aligned and avoids the 'ldx'
optimization.
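
A user-space sketch of the technique, assuming GCC or Clang: a typedef
lowers the required alignment of an 8-byte integer to 4, so the compiler
may no longer assume 8-byte alignment when accessing it. The names
u64_align4, demo_mad and demo_read are hypothetical; this is not the
actual mlx4 diff.

```c
#include <stddef.h>
#include <stdint.h>

/* 8-byte integer whose required alignment is lowered to 4 bytes
 * (GCC/Clang attribute; on a typedef it may reduce alignment). */
typedef uint64_t u64_align4 __attribute__((aligned(4)));

/* Hypothetical message layout: a 4-byte header followed by an 8-byte
 * field, so the field sits at offset 4 rather than at offset 8. */
struct demo_mad {
    uint32_t hdr;
    u64_align4 data;
};

_Static_assert(offsetof(struct demo_mad, data) == 4,
               "data is expected to be only 4-byte aligned");

/* The access goes through the 4-byte-aligned type, so the compiler
 * has to emit code that is safe for a 4-byte-aligned address. */
static uint64_t demo_read(const struct demo_mad *m)
{
    return m->data;
}
```

The same pattern appears elsewhere in the kernel for 64-bit fields in
structures that are only guaranteed 4-byte alignment.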

Full credit for the idea goes to Jason Gunthorpe
<jgunthorpe@obsidianresearch.com>.

Orabug: 23311415

Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>

rds: schedule local connection activity in proper workqueue

During reconnect, the local connection is scheduled on rds_wq, whereas
it should have been scheduled on rds_local_wq.
This patch corrects that.

Orabug: 23223537

Tested-by: Michael Nowak <michael.nowak@oracle.com>
Signed-off-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Acked-by: Mukesh Kacker <mukesh.kacker@oracle.com>

IB/security: Restrict use of the write() interface

The drivers/infiniband stack uses write() as a replacement for
bi-directional ioctl(). This is not safe. There are ways to
trigger write calls that result in the return structure that
is normally written to user space being shunted off to user
specified kernel memory instead.

For the immediate repair, detect and deny suspicious accesses to
the write API.

For long term, update the user space libraries and the kernel API
to something that doesn't present the same security vulnerabilities
(likely a structured ioctl() interface).

The impacted uAPI interfaces are generally only available if
hardware from drivers/infiniband is installed in the system.

Reported-by: Jann Horn <jann@thejh.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
[ Expanded check to all known write() entry points ]
Cc: stable@vger.kernel.org
Signed-off-by: Doug Ledford <dledford@redhat.com>
CVE-2016-4565
Orabug: 23276449

(cherry-pick from e6bd18f57aad1a2d1ef40e646d03ed0f2515c9e3)
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

net/rds: Use max_mr from HCA caps than max_fmr

All HCA drivers seem to populate the max_mr cap, and a few of them
populate both max_mr and max_fmr.
Hence update the RDS code to make use of max_mr.

Orabug: 23223564

Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>

RDS: IB: disable ib_cache purging to avoid memory leak in reconnect path

RDS IB caches don't work in the reconnect path and, if used, can lead to
memory leaks. These leaks have been there for a long time, but we didn't
hit them since caches are not torn down in the reconnect path. For
different frag rolling upgrade/downgrade support, this needs to work in
the reconnect path, but that needs additional fixes.

Since the leak is blocking the rest of the testing, cache purging is
temporarily disabled. It will be added back once fully fixed.

Orabug: 23275911

The change doesn't impact any of the existing RDS functionality.

Tested-by: Hong Liu <hong.x.liu@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

RDS: IB: avoid bit fields for i_frag_pages

i_frag_pages may need to store a value larger than 1 page for
higher fragment sizes, so a bit field won't help.

Let's fix that.
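
To illustrate why a narrow bit field is a problem, the sketch below
assumes a 1-bit field, which is a guess; the message does not state the
original field width. The struct and function names are hypothetical.

```c
/* Hypothetical "before": a 1-bit field can only hold 0 or 1. */
struct frag_bitfield {
    unsigned int frag_pages : 1;
};

/* Hypothetical "after": a plain integer holds any page count. */
struct frag_plain {
    unsigned int frag_pages;
};

/* Store pages in the bit field and read it back; for an unsigned
 * bit field the value is reduced modulo 2^width. */
static unsigned int store_bitfield(unsigned int pages)
{
    struct frag_bitfield f = { 0 };

    f.frag_pages = pages;
    return f.frag_pages;
}

/* Store pages in the plain field and read it back unchanged. */
static unsigned int store_plain(unsigned int pages)
{
    struct frag_plain f = { 0 };

    f.frag_pages = pages;
    return f.frag_pages;
}
```

Under that assumption, a fragment spanning 4 pages reads back as 0 from
the bit field but as 4 from the plain field.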

Orabug: 23275911

Tested-by: Hong Liu <hong.x.liu@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>

RDS: TCP: Synchronize accept() and connect() paths on t_conn_lock.

Orabug 23228077

Backport of upstream commit bd7c5f983f31 ("RDS: TCP: Synchronize accept()
and connect() paths on t_conn_lock.")

An arbitration scheme for duelling SYNs is implemented as part of
commit 241b271952eb ("RDS-TCP: Reset tcp callbacks if re-using an
outgoing socket in rds_tcp_accept_one()") which ensures that both nodes
involved will arrive at the same arbitration decision. However, this
needs to be synchronized with an outgoing SYN to be generated by
rds_tcp_conn_connect(). This commit achieves the synchronization
through the t_conn_lock mutex in struct rds_tcp_connection.

The rds_conn_state is checked in rds_tcp_conn_connect() after acquiring
the t_conn_lock mutex. A SYN is sent out only if the RDS connection is
not already UP (an UP would indicate that rds_tcp_accept_one() has
completed 3WH, so no SYN needs to be generated).

Similarly, the rds_conn_state is checked in rds_tcp_accept_one() after
acquiring the t_conn_lock mutex. The only acceptable states (to
allow continuation of the arbitration logic) are UP (i.e., outgoing SYN
was SYN-ACKed by peer after it sent us the SYN) or CONNECTING (we sent
outgoing SYN before we saw incoming SYN).
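
The locking pattern described above can be modelled in user space as
follows. This is only a sketch: a pthread mutex stands in for
t_conn_lock, the enum stands in for the RDS connection state, and the
names tcp_conn, conn_should_send_syn and conn_accept_may_continue are
hypothetical.

```c
#include <pthread.h>

/* Simplified stand-in for the RDS connection states. */
enum conn_state { CONN_DOWN, CONN_CONNECTING, CONN_UP };

struct tcp_conn {
    pthread_mutex_t lock;   /* models t_conn_lock */
    enum conn_state state;  /* models the RDS connection state */
};

/* Connect path: check the state only after taking the lock.
 * Returns 1 if a SYN should be sent, 0 if the connection is already UP.
 * Real code would send the SYN while still holding the lock. */
static int conn_should_send_syn(struct tcp_conn *c)
{
    int send_syn;

    pthread_mutex_lock(&c->lock);
    send_syn = (c->state != CONN_UP);
    pthread_mutex_unlock(&c->lock);
    return send_syn;
}

/* Accept path: arbitration may continue only in UP or CONNECTING. */
static int conn_accept_may_continue(struct tcp_conn *c)
{
    int ok;

    pthread_mutex_lock(&c->lock);
    ok = (c->state == CONN_UP || c->state == CONN_CONNECTING);
    pthread_mutex_unlock(&c->lock);
    return ok;
}
```

Because both paths read the state under the same lock, neither can act
on a state that the other is in the middle of changing.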

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

RDS:TCP: Synchronize rds_tcp_accept_one with rds_send_xmit when resetting t_sock

Orabug 23228077

Backport of upstream commit eb192840266f ("RDS:TCP: Synchronize
rds_tcp_accept_one with rds_send_xmit when resetting t_sock")

There is a race condition between rds_send_xmit -> rds_tcp_xmit
and the code that deals with resolution of duelling syns added
by commit 241b271952eb ("RDS-TCP: Reset tcp callbacks if re-using an
outgoing socket in rds_tcp_accept_one()").

Specifically, we may end up dereferencing a null pointer in rds_send_xmit
if we have the interleaving sequence:
         rds_tcp_accept_one                  rds_send_xmit

                                           conn is RDS_CONN_UP, so
                                           invoke rds_tcp_xmit

                                           tc = conn->c_transport_data
      rds_tcp_restore_callbacks
          /* reset t_sock */
                                           null ptr deref from tc->t_sock

The race condition can be avoided without adding the overhead of
additional locking in the xmit path: have rds_tcp_accept_one wait
for rds_tcp_xmit threads to complete before resetting callbacks.
The synchronization can be done in the same manner as rds_conn_shutdown().
First set the rds_conn_state to something other than RDS_CONN_UP
(so that new threads cannot get into rds_tcp_xmit()), then wait for
RDS_IN_XMIT to be cleared in the conn->c_flags indicating that any
threads in rds_tcp_xmit are done.

Fixes: 241b271952eb ("RDS-TCP: Reset tcp callbacks if re-using an
outgoing socket in rds_tcp_accept_one()")
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

skbuff: Add pskb_extract() helper function

Orabug 23180876

Upstream commit 6fa01ccd8830 ("skbuff: Add pskb_extract() helper function")

A pattern of skb usage seen in modules such as RDS-TCP is to
extract `to_copy' bytes from the received TCP segment, starting
at some offset `off' into a new skb `clone'. This is done in
the ->data_ready callback, where the clone skb is queued up for rx on
the PF_RDS socket, while the parent TCP segment is returned unchanged
back to the TCP engine.

The existing code uses the sequence
        clone = skb_clone(..);
        pskb_pull(clone, off, ..);
        pskb_trim(clone, to_copy, ..);
with the intention of discarding the first `off' bytes. However,
skb_clone() + pskb_pull() implies pskb_expand_head(), which ends
up doing a redundant memcpy of bytes that will then get discarded
in __pskb_pull_tail().

To avoid this inefficiency, this commit adds pskb_extract() that
creates the clone, and memcpy's only the relevant header/frag/frag_list
to the start of `clone'. pskb_trim() is then invoked to trim clone
down to the requested to_copy bytes.
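
As a loose user-space analogy on flat byte buffers (not skbs, and not
the kernel implementation), the saving comes from copying only the
bytes in [off, off + to_copy) instead of copying everything and then
discarding the front. The function name extract_range is hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Copy only bytes [off, off + to_copy) of src into a new buffer.
 * Returns NULL if the range is out of bounds or allocation fails.
 * The caller owns the returned buffer and must free() it. */
static unsigned char *extract_range(const unsigned char *src, size_t len,
                                    size_t off, size_t to_copy)
{
    unsigned char *out;

    if (off > len || to_copy > len - off)
        return NULL;
    out = malloc(to_copy ? to_copy : 1);
    if (!out)
        return NULL;
    memcpy(out, src + off, to_copy);
    return out;
}
```

The bytes before off are never copied, which is the redundant work the
skb_clone() + pskb_pull() sequence was doing.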

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

RDS: support individual receive trace reporting

If an application wants to get an individual trace point, it's easy
to support with the existing infrastructure.

No change is needed in the API.

Orabug: 23215779

Tested-by: Namrata Jampani <namrata.jampani@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

IB/ipoib: Add readout of statistics using ethtool

Orabug: 21498734

IPoIB collects statistics of traffic including number of packets
sent/received, number of bytes transferred, and certain errors. This
patch makes these statistics available to be queried by ethtool.

Change-Id: Ic159815fe0cc08770cd4111ec1df117b7349c154
Signed-off-by: Hans Westgaard Ry <hans.westgaard.ry@oracle.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Tested-by: Yuval Shaia <yuval.shaia@oracle.com>
Acked-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>