]> www.infradead.org Git - users/jedix/linux-maple.git/log
users/jedix/linux-maple.git
9 years agosif: cq: tear-down sequence in cleaning up the SendCQ
Wei Lin Guay [Mon, 30 May 2016 07:54:34 +0000 (09:54 +0200)]
sif: cq: tear-down sequence in cleaning up the SendCQ

This commit ensures that the sif_fixup_cqes for a sendCQ
can only be executed after post_process_wa4074. As the CQE
in a sendCQ cannot be trusted,  walk_and_update CQ must
be performed first.

In a scenario where the post_process_wa4074 and sif_fixup_cqes
are performed concurrently, the post_process_wa4074 is given
priority where no polling of the SendCQ is allowed in
sif_fixup_cqes. Then, post_process_wa4074 will generate
the remaining FLUSH-IN ERR for a Send queue.

Signed-off-by: Wei Lin Guay <wei.lin.guay@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Knut Omang <knut.omang@oracle.com>
9 years agosif: Fix regressions in supporting fw from release 0.1.0.4 and earlier
Knut Omang [Mon, 30 May 2016 12:12:39 +0000 (14:12 +0200)]
sif: Fix regressions in supporting fw from release 0.1.0.4 and earlier

Orabug: 23497496

This commit fixes two separate regressions in handling the old fw:

1) Teardown of the dne_qp happens only with older FWs because the newer
   firmwares implements the dne_qp handling in fw, so
   driver does not invoke the teardown code. This teardown code
   uses generic calls that has been implicitly amended
   to by the WA for Bug #4074, which also assumed that all QPs
   subject to that call has a valid send queue. The DNE QPs don't
   and this causes a null pointer exception, which is triggered
   both during driver unload and as a side effect of lid changes.
2) EPSC support for SL to TSL mapping was introduced in EPSC API v.0.56
   but was broken - this causes the driver to set wrong values
   which leads to modify_qp errors. The fix is just to avoid putting
   the map to use unless epsc version is >= 0.57.

Signed-off-by: Knut Omang <knut.omang@oracle.com>
9 years ago{IBCM/IPoIB/MLX4/RDS}: Temporary backout Exasecure change
Santosh Shilimkar [Sun, 26 Jun 2016 20:46:33 +0000 (13:46 -0700)]
{IBCM/IPoIB/MLX4/RDS}: Temporary backout Exasecure change

ExaSecure changeset seems to impact Exadata data integrity
checker. We back out all the Exasecure changes for now till
the issue gets addressed.

Orabug: 23634771

Revert "IB/mlx4: Generate alias GUID for slaves"
Revert "RDS: Fix the rds_conn_destroy panic due to pending messages"
Revert "RDS: add handshaking for ACL violation detection at passive"
Revert "RDS: IB: enforce IP anti-spoofing for UUID context"
Revert "RDS: IB: invoke connection destruction in worker"
Revert "RDS: message filtering based on UUID"
Revert "RDS: Add UUID socket option"
Revert "RDS: Add reset all conns for a source address to CONN_RESET"
Revert "IB/ipoib: ioctl interface to manage ACL tables"
Revert "IB/ipoib: sysfs interface to manage ACL tables"
Revert "IB/{cm,ipoib}: Filter traffic using ACL"
Revert "IB/{cm,ipoib}: Manage ACL tables"

Tested-by: Rene Kundersma <rene.kundersma@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agoRDS/IB: Fix crash in SRQ initialization
Ajaykumar Hotchandani [Thu, 16 Jun 2016 19:15:21 +0000 (12:15 -0700)]
RDS/IB: Fix crash in SRQ initialization

SRQ initialization causes crash when IC connection is not available.

Orabug: 23523586

This is regression fix for commit 0f0f08915.
We require more work to have SRQ working with variable fragment size.
For now, we fix crash in SRQ initialization.

This also adds warning when SRQ is enabled.
SRQ feature is experimental and disabled by default.
When any user enables it, we should give warning.

Signed-off-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Tested-by: jenny x.xu <jenny.x.xu@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agoRDS: Remove the link-local restriction as a stop gap measure
Santosh Shilimkar [Sat, 18 Jun 2016 17:48:07 +0000 (10:48 -0700)]
RDS: Remove the link-local restriction as a stop gap measure

Fresh CRS install seems to have a dependency with RDS IB link-local
connection going through. Setting the cluster_interconnect
parameter to non-link local address isn't covering the fresh
install usecases.

So as a stop gap measure, we just warn the user but let the connection
through till we come up with a solution to re-introduce the change.

Orabug: 2360905

Tested-by: Maria Yip <maria.yip@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agoRDS: IB: restore the vector spreading for the CQs
Santosh Shilimkar [Fri, 10 Jun 2016 04:05:54 +0000 (21:05 -0700)]
RDS: IB: restore the vector spreading for the CQs

Since the IB_CQ_LEAST_LOADED vector support is not their on newer
kernels(post OFED 1.5), we had #if 0 code for it which got removed
as part of 'commit 3f1db626594e ("RDS: IB: drop discontinued IB
CQ_VECTOR support")'. On UEK2, the drivers had implementation
for this IB verb. UEK4 which is based on newer kernel obviously
doesn't support it.

RDS had an alternate fallback scheme which can be used in absence
of the dropped verb. On UEK2, we didn't use it but UEK4 RDS code was
silently using that till the code got removed. The patch restores
that code with bit more clarity on what it is actually doing.

Orabug: 23550561

Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agoIB/mlx4: Generate alias GUID for slaves
Yuval Shaia [Thu, 26 May 2016 16:25:19 +0000 (09:25 -0700)]
IB/mlx4: Generate alias GUID for slaves

Generate alias GUID by changing the fourth byte to be the GUID index in the
port GUID table.

This is porting of a work done in uek2 for Oracle purpose only.

Orabug: 23292164

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
9 years agoRDS: Fix the rds_conn_destroy panic due to pending messages
Bang Nguyen [Sat, 16 Apr 2016 18:47:05 +0000 (11:47 -0700)]
RDS: Fix the rds_conn_destroy panic due to pending messages

In corner cases, there could be pending messages on connection which
needs to be detsroyed. Make sure those messages are purged before
the connection is torned down.

Orabug: 23222944

Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agoRDS: add handshaking for ACL violation detection at passive
Ajaykumar Hotchandani [Thu, 14 Apr 2016 21:20:08 +0000 (14:20 -0700)]
RDS: add handshaking for ACL violation detection at passive

Offending connections with ACL violations should be cleaned up as
early as possible. When active detects ACL violation and sends reject;
it fills up private_data field. Passive checks for private_data
whenever it receives reject; and in case of ACL violation it destroys
connection.

Orabug: 23222944

Signed-off-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agoRDS: IB: enforce IP anti-spoofing for UUID context
Santosh Shilimkar [Tue, 15 Mar 2016 12:32:09 +0000 (05:32 -0700)]
RDS: IB: enforce IP anti-spoofing for UUID context

Connection is established only after the IP requesting the connection
is legitimate and part of the ACL group. Invalid connection request(s)
are rejected and destroyed.

Ajay moved destroy connection when ACL check fails while initiating
connection to avoid unnecessary packet transfer on wire.

Orabug: 23222944

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Bang Ngyen <bang.nguyen@oracle.com>
Signed-off-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agoRDS: IB: invoke connection destruction in worker
Ajaykumar Hotchandani [Thu, 14 Apr 2016 20:58:46 +0000 (13:58 -0700)]
RDS: IB: invoke connection destruction in worker

This is to avoid deadlock with c_cm_lock mutex.
In event handling path of Infiniband, whenever connection destruction is
required; we should invoke worker in order to avoid deadlock with mutex.

Orabug: 23222944

Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
9 years agoRDS: message filtering based on UUID
Bang Nguyen [Wed, 3 Feb 2016 16:13:25 +0000 (08:13 -0800)]
RDS: message filtering based on UUID

Egress message is filtered based on UUID and for ingress, the
unique UUID is CMS'ed to application to take further action.

SYSCTL 'uuid_tx_no_drop' to override UUID based packet filtering

Orabug: 23222944

Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agoRDS: Add UUID socket option
Santosh Shilimkar [Wed, 3 Feb 2016 16:13:25 +0000 (08:13 -0800)]
RDS: Add UUID socket option

UUID is opaque user application data to be stored
per socket connection.

IB transport makes use of it for the ACL based
message filtering.

Orabug: 23222944

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Bang Ngyen <bang.nguyen@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agoRDS: Add reset all conns for a source address to CONN_RESET
Santosh Shilimkar [Wed, 23 Mar 2016 04:51:49 +0000 (21:51 -0700)]
RDS: Add reset all conns for a source address to CONN_RESET

RDS_CONN_RESET SO gets enhanced to support reseting all
connections associated with a local address.

$rds-stress -r <SRC_IP> -s 0 --reset

Orabug: 23222944

Reported-by: Bang Ngyen <bang.nguyen@oracle.com>
Acked-by: Bang Ngyen <bang.nguyen@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agoIB/ipoib: ioctl interface to manage ACL tables
Yuval Shaia [Mon, 4 Apr 2016 13:40:56 +0000 (16:40 +0300)]
IB/ipoib: ioctl interface to manage ACL tables

Expose ioctl to manage ACL content by application layer.

Orabug: 18679884

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agoIB/ipoib: sysfs interface to manage ACL tables
Yuval Shaia [Mon, 4 Apr 2016 13:29:12 +0000 (16:29 +0300)]
IB/ipoib: sysfs interface to manage ACL tables

Expose sysfs interface for ACL to be used for debug.

Orabug: 18679884

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agoIB/{cm,ipoib}: Filter traffic using ACL
Yuval Shaia [Mon, 4 Apr 2016 12:39:49 +0000 (15:39 +0300)]
IB/{cm,ipoib}: Filter traffic using ACL

Implement two packet filtering points, one at ib_ipoib driver when
processing ARP packets and second in ib_cm when processing connection
requests.

Orabug: 18679884

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agosif driver initial commit part 1
Knut Omang [Wed, 25 May 2016 09:01:10 +0000 (11:01 +0200)]
sif driver initial commit part 1

sif_ah.c:        Implementation of IB address handles for SIF
sif_ah.h:        Interface to internal IB address handle logic for SIF
sif_base.c:      Basic hardware setup of SIF
sif_base.h:      Basic hardware setup of SIF
sif_checksum.c:  Utilities for SIF specific 32 bit checksums
sif_checksum.h:  Utilities for SIF specific 32 bit checksums
sif_cq.c:        Implementation of completion queue logic for SIF
sif_cq.h:        Internal interface to psif completion queue logic
sif_debug.c:     Use of debugfs for dumping internal data structure info
sif_debug.h:     Use of debugfs for dumping internal data structure info
sif_defs.c:      IB-to-SIF Mapper.
sif_defs.h:      Div. utility definitions and auxiliary data structures
sif_dev.h:       Driver specific data structure definitions
sif_dma.c:       DMA memory mapping
sif_dma.h:       DMA memory mapping
sif_drvapi.h:    Device specific operations available via the FWA access path
sif_elog.c:      Log over PCIe support for firmware
sif_elog.h:      Misc device for capturing log from the EPSC
sif_enl.h:       Protocol definitions for the netlink protocol for EPSC access from
sif_epsc.c:      Implementation of API for communication with the EPSC
sif_epsc.h:      API for communication with the EPSC (and EPS-A's)
sif_eq.c:        Setup of event queues and interrupt handling
sif_eq.h:        Event queues and interrupt handling
sif_fmr.c:       Implementation of fast memory registration for SIF
sif_fmr.h:       Interface to internal IB Fast Memory Registration (FMR)

Credits:
The sif driver supports Oracle’s new Dual Port EDR and QDR
IB Adapters and the integrated IB devices on the new SPARC SoC.

The driver is placed under drivers/infiniband/hw/sif

This patch set is the result of direct or indirect contribution by
several people:

Code contributors:
  Knut Omang, Vinay Shaw, Haakon Bugge, Wei Lin Guay,
  Lars Paul Huse, Francisco Trivino-Garcia.

Minor patch/bug fix contributors:
  Hans Westgaard Ry, Jesus Escudero, Robert Schmidt, Dag Moxnes,
  Andre Wuttke, Predrag Hodoba, Roy Arntsen

Initial architecture adaptations:
  Khalid Aziz (sparc64), Gerd Rausch (arm64)

Testing, Test development, Continuous integration, Bug haunting, Code
review:
  Knut Omang, Hakon Bugge, Åsmund Østvold, Francisco Trivino-Garcia,
  Wei Lin Guay, Vinay Shaw, Hans Westgaard Ry,
  + numerous other people within Oracle.

Simulator development:
  Andrew Manison, Hans Westgaard Ry, Knut Omang, Vinay Shaw

Orabug: 22529577

Reviewed-by: Hakon Bugge <Haakon.Bugge@oracle.com>
Signed-off-by: Knut Omang <knut.omang@oracle.com>
9 years agoMAINTAINERS: Add Knut Omang as maintainer for sif, Oracles's Infiniband HCA driver
Knut Omang [Sat, 14 May 2016 09:01:36 +0000 (11:01 +0200)]
MAINTAINERS: Add Knut Omang as maintainer for sif, Oracles's Infiniband HCA driver

Signed-off-by: Knut Omang <knut.omang@oracle.com>
9 years agoib: Enable building the sif driver in the infiniband stack
Knut Omang [Thu, 12 May 2016 06:04:54 +0000 (08:04 +0200)]
ib: Enable building the sif driver in the infiniband stack

The sif driver is the driver for Oracles new Infiniband HCA series.

Signed-off-by: Knut Omang <knut.omang@oracle.com>
9 years agosif driver initial commit part 7
Knut Omang [Wed, 25 May 2016 09:01:12 +0000 (11:01 +0200)]
sif driver initial commit part 7

Hardware struct print functions, Makefile and version info

Signed-off-by: Knut Omang <knut.omang@oracle.com>
9 years agosif driver initial commit part 6
Knut Omang [Wed, 25 May 2016 09:01:12 +0000 (11:01 +0200)]
sif driver initial commit part 6

Hardware register access functions, data types and macro definitions, part 2

Signed-off-by: Knut Omang <knut.omang@oracle.com>
9 years agosif driver initial commit part 5
Knut Omang [Wed, 25 May 2016 09:01:11 +0000 (11:01 +0200)]
sif driver initial commit part 5

Hardware register macro definitions and data types, part 1

Signed-off-by: Knut Omang <knut.omang@oracle.com>
9 years agosif driver initial commit part 4
Wei Lin Guay [Tue, 24 May 2016 07:37:22 +0000 (09:37 +0200)]
sif driver initial commit part 4

sif_tqp.h:       Implementation of EPSA tunnelling QP for SIF
sif_user.h:      This file defines sif specific verbs extension request/response.
sif_verbs.c:     IB verbs API extensions specific to PSIF
sif_verbs.h:     IB verbs API extensions specific to PSIF
sif_vf.c:        SR/IOV support functions
sif_vf.h:        SR/IOV support functions
sif_xmmu.c:      Implementation of special MMU mappings.
sif_xmmu.h:      Implementation of special MMU mappings.
sif_xrc.c:       Implementation of XRC related functions
sif_xrc.h:       XRC related functions
version.h:       Detailed version info data structure

Signed-off-by: Knut Omang <knut.omang@oracle.com>
9 years agosif driver initial commit part 3
Knut Omang [Wed, 25 May 2016 09:01:11 +0000 (11:01 +0200)]
sif driver initial commit part 3

sif_pt.c:        SIF (private) page table management
sif_pt.h:        SIF (private) page table management.
sif_qp.c:        Implementation of IB queue pair logic for sif
sif_qp.h:        Interface to internal IB queue pair logic for sif
sif_query.c:     SIF implementation of some of IB query APIs
sif_query.h:     SIF implementation of some of IB query APIs
sif_r3.c:        Special handling specific for psif revision 3 and earlier
sif_r3.h:        Special handling specific for psif revision 3 and earlier
sif_rq.c:        Implementation of sif receive queues
sif_rq.h:        Interface to sif receive queues
sif_sndrcv.c:    Implementation of post send/recv logic for SIF
sif_sndrcv.h:    Interface to IB send/receive, MAD packet recv and
sif_spt.c:       Experimental implementation of shared use of the OS's page tables.
sif_spt.h:       Experimental (still unsafe)
sif_sq.c:        Implementation of the send queue side of an IB queue pair
sif_sq.h:        Implementation of the send queue side of an IB queue pair
sif_srq.c:       Interface to shared receive queues for SIF
sif_srq.h:       Interface to internal Shared receive queue logic for SIF
sif_tqp.c:       Implementation of EPSA tunneling QP for SIF

Signed-off-by: Knut Omang <knut.omang@oracle.com>
9 years agosif driver initial commit part 2
Knut Omang [Wed, 25 May 2016 09:01:11 +0000 (11:01 +0200)]
sif driver initial commit part 2

sif_fwa.c:       Firmware access API (netlink based out-of-band comm)
sif_fwa.h:       Low level access to a SIF device
sif_hwi.c:       Hardware init for SIF - combines the various init steps for psif
sif_hwi.h:       Hardware init for SIF
sif_ibcq.h:      External interface to IB completion queue logic for SIF
sif_ibpd.h:      External interface to (IB) protection domains for SIF
sif_ibqp.h:      External interface to IB queue pair logic for sif
sif_idr.c:       Synchronized ID ref allocation
sif_idr.h:       simple id allocation and deallocation for SIF
sif_int_user.h:  This file defines special internal data structures used
sif_ireg.c:      Utilities and entry points needed for Infiniband registration
sif_ireg.h:      support functions used in setup of sif as an IB HCA
sif_main.c:      main entry points and initialization
sif_mem.c:       SIF table memory and page table management
sif_mem.h:       A common interface for all memory used by
sif_mmu.c:       main entry points and initialization
sif_mmu.h:       API for management of sif's on-chip mmu.
sif_mr.c:        Implementation of memory regions support for SIF
sif_mr.h:        Interface to internal IB memory registration logic for SIF
sif_mw.c:        Implementation of memory windows for SIF
sif_mw.h:        Interface to internal IB memory window logic for SIF
sif_pd.c:        Implementation of IB protection domains for SIF
sif_pd.h:        Internal interface to protection domains
sif_pqp.c:       Privileged QP handling
sif_pqp.h:       Privileged QP handling

Signed-off-by: Knut Omang <knut.omang@oracle.com>
9 years agoIB/{cm,ipoib}: Manage ACL tables
Yuval Shaia [Mon, 4 Apr 2016 10:56:50 +0000 (13:56 +0300)]
IB/{cm,ipoib}: Manage ACL tables

Add support for ACL tables for ib_ipoib and ib_cm drivers.
ib_cm driver exposes functions register and unregister tables and to manage
tables content.
In ib_ipoib driver add ACL object for each network device.

Orabug: 18679884

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agoRDS: Drop stale iWARP support
Santosh Shilimkar [Wed, 10 Feb 2016 18:49:16 +0000 (10:49 -0800)]
RDS: Drop stale iWARP support

RDS iWARP support was more of academic and was never complete neither
fully functional. Its getting for good. Thanks to Ajay for adding
couple of missed hunks.

Orabug: 23027670

Tested-by: Michael Nowak <michael.nowak@oracle.com>
Tested-by: Rose Wang <rose.wang@oracle.com>
Tested-by: Rafael Alejandro Peralez <rafael.peralez@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agoRDS: IB: drop discontinued IB CQ_VECTOR support
Santosh Shilimkar [Wed, 17 Feb 2016 18:10:14 +0000 (10:10 -0800)]
RDS: IB: drop discontinued IB CQ_VECTOR support

IB_CQ_VECTOR_LEAST_ATTACHED was OFED 1.5 feature which later Mellanox
dropped. As per them 'least attached' was not a way to distribute the load
and that actually fools the user to think it does so. The fact that some
cpu had 'least amount of attached cqs' has nothing to do with this cpu
actual load. This is why the feature never made to upstream.

On UEK4, the code is already under #if 0 because feature isn't
available. Time to clean up the dead code considering its already
dropped from upstream as well as OFED2.0+ onwards.

Orabug: 23027670

Tested-by: Michael Nowak <michael.nowak@oracle.com>
Tested-by: Rose Wang <rose.wang@oracle.com>
Tested-by: Rafael Alejandro Peralez <rafael.peralez@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agoRDS: IB: Drop unused and broken APM support
Santosh Shilimkar [Wed, 10 Feb 2016 17:08:43 +0000 (09:08 -0800)]
RDS: IB: Drop unused and broken APM support

APM support in RDS has been broken and hence not being
used in production. We kept the code around but its time
to remove it and reduce the complexity in the RDS
failover code paths.

Orabug: 23027670

Tested-by: Michael Nowak <michael.nowak@oracle.com>
Tested-by: Rose Wang <rose.wang@oracle.com>
Tested-by: Rafael Alejandro Peralez <rafael.peralez@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agoRDS: IB: Make use of ARPOP_REQUEST instead of ARPOP_REPLY in bonding code
Santosh Shilimkar [Mon, 18 Apr 2016 14:09:39 +0000 (07:09 -0700)]
RDS: IB: Make use of ARPOP_REQUEST instead of ARPOP_REPLY in bonding code

Even though IPv4 ARP RFC allows for using either REQUEST or REPLY
for grat. arp, upstream code from 3.14 onwards have moved on to
use only REQUEST.

Relevant commit:
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=56022a8fdd874c56bb61d8c82559e43044d1aa06

RDS active bonding gratuitous arp code needs to adopt to this change
to take advantage of the neighbor updates on UEK4. The current code
makes use  ARPOP_REPLY which needs to be changed to ARPOP_REQUEST.

Orabug: 23094704

Tested-by: Michael Nowak <michael.nowak@oracle.com>
Tested-by: Rose Wang <rose.wang@oracle.com>
Tested-by: Rafael Alejandro Peralez <rafael.peralez@oracle.com>
Acked-by: Mukesh Kacker <mukesh.kacker@oracle.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Reported-by: Mukesh Kacker <mukesh.kacker@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agoRDS: IB: don't use the link-local address for ib transport
Santosh Shilimkar [Fri, 29 Apr 2016 21:50:55 +0000 (14:50 -0700)]
RDS: IB: don't use the link-local address for ib transport

Link-local address can't be used for IB failover and don't work
with IB stack. Even though the DB RDS usage has recommnded to not
use these addresses, we keep hitting issue because of accidental
usage of it because of missing application config or admin scripts
blindly doing rds-ping for each local address(s).

RDS TCP which doesn't support acitive active, there might be an
usecase so the current fix it limited for IB transport atm.

Example traces:
$ rds-ping -I 169.254.221.37 169.254.221.38
bind() failed, errno: 99 (Cannot assign requested address)

cosnole:
 RDS/IB: Link local address 169.254.221.37 NOT SUPPORTED
 RDS: rds_bind() could not find a transport for 169.254.221.37, load rds_tcp or rds_rdma?

Orabug: 23027670

Tested-by: Michael Nowak <michael.nowak@oracle.com>
Tested-by: Rose Wang <rose.wang@oracle.com>
Tested-by: Rafael Alejandro Peralez <rafael.peralez@oracle.com>
Acked-by: Mukesh Kacker <mukesh.kacker@oracle.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agoRDS: IB: rebuild receive caches when needed
Santosh Shilimkar [Wed, 18 May 2016 17:44:56 +0000 (10:44 -0700)]
RDS: IB: rebuild receive caches when needed

RDS IB caches code leaks memory & it have been there from the
inception of cache code but we didn't noticed them since caches
are not teardown in normal operation paths. But now to support
features like variable fragment or connection destroy for ACL,
caches needs to be destroyed and rebuild if needed.

While freeing the caches is just fine, leaking memory while
doing that is bug and needs to be addressed. Thanks to Wengang
for spotting this stone age leak. Also the cache rebuild needs
to be done only when desired so patch optimises that part as
well.

Tested-by: Michael Nowak <michael.nowak@oracle.com>
Tested-by: Maria Rodriguez <maria.r.rodriguez@oracle.com>
Tested-by: Hong Liu <hong.x.liu@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agordma_cm: use cma_info() instead of cma_dbg()
Ajaykumar Hotchandani [Tue, 10 May 2016 23:58:02 +0000 (16:58 -0700)]
rdma_cm: use cma_info() instead of cma_dbg()

We want to have selected prints going into messages file when debug
is enabled.

Orabug: 22381123

Signed-off-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Acked-by: Wengang Wang <wen.gang.wang@oracle.com>
9 years agoOFED: indicate consistent vendor error
Ajaykumar Hotchandani [Tue, 10 May 2016 22:43:48 +0000 (15:43 -0700)]
OFED: indicate consistent vendor error

vendor error print should be consistent across protocols to avoid
any confusion.
Currently, it's decimal at some places and hex at some places.
This patch corrects that.

Orabug: 22381117

Suggested-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Reviewed-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Acked-by: Wengang Wang <wen.gang.wang@oracle.com>
9 years agoRDS: Change number based conn-drop reasons to enum
Avinash Repaka [Tue, 17 May 2016 21:42:19 +0000 (14:42 -0700)]
RDS: Change number based conn-drop reasons to enum

This patch converts the number based connection-drop reasons to enums,
making it easy to grep the reasons and to develop new patches based on
these reasons.

Orabug: 23294707

Signed-off-by: Avinash Repaka <avinash.repaka@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agoRDS: Move rds_rtd definitions from rds_rt_debug files to common files
Avinash Repaka [Wed, 18 May 2016 22:09:05 +0000 (15:09 -0700)]
RDS: Move rds_rtd definitions from rds_rt_debug files to common files

This patch moves rds_rtd definitions from rds_rtd_debug.h to rds.h and
rds_rt_debug_bitmap modparam definition from rds_rt_debug.c to af_rds.c.
The patch removes rds_rt_debug files since there isn't much content
in these files to be held separately.

Commit 'ib/rds: runtime debuggability enhancement' originally defined
rds_rtd definitions.

Orabug: 23294707

Signed-off-by: Avinash Repaka <avinash.repaka@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agoRDS: Change the default value of rds_rt_debug_bitmap modparam to 0x488B
Avinash Repaka [Wed, 13 Apr 2016 00:48:42 +0000 (17:48 -0700)]
RDS: Change the default value of rds_rt_debug_bitmap modparam to 0x488B

This patch changes the default value of rds_rt_debug_bitmap module
parameter to 0x488B to enable RDS_RTD_ERR, RDS_RTD_ERR_EXT, RDS_RTD_CM,
RDS_RTD_ACT_BND, RDS_RTD_RCV, RDS_RTD_SND flags of rds_rtd.

Orabug: 23294707

Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Signed-off-by: Avinash Repaka <avinash.repaka@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agoRDS: Replace rds_rtd printk with trace_printk
Avinash Repaka [Fri, 25 Mar 2016 19:58:37 +0000 (12:58 -0700)]
RDS: Replace rds_rtd printk with trace_printk

Forward rds_rtd prints to ftrace buffer by replacing
printk with trace_printk.

Orabug: 23294707

Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Signed-off-by: Avinash Repaka <avinash.repaka@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agoRDS: IB: Print vendor error in recv completion error message
Avinash Repaka [Fri, 11 Mar 2016 22:09:08 +0000 (14:09 -0800)]
RDS: IB: Print vendor error in recv completion error message

This patch when applied, prints vendor error along with work
completion status in recv completion error message.

Orabug: 23294707

Signed-off-by: Avinash Repaka <avinash.repaka@oracle.com>
Acked-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agoIB/mlx4: Fix unaligned access in send_reply_to_slave
shamir rabinovitch [Wed, 18 May 2016 10:18:10 +0000 (06:18 -0400)]
IB/mlx4: Fix unaligned access in send_reply_to_slave

The problem is that the function 'send_reply_to_slave' gets the
'req_sa_mad' as a pointer whose address is only aliged to 4 bytes
but is 8 bytes in size.  This can result in unaligned access faults
on certain architectures.

Sowmini Varadhan pointed to this reply from Dave Miller that say
that memcpy should not be used to solve alignment issues:
https://lkml.org/lkml/2015/10/21/352

Optimization of memcpy to 'ldx' instruction can only happen if the
compiler knows that the size of the data we are copying is 8 bytes
and it assumes it is aligned to 8 bytes. If the compiler know the
type is not aligned to 8 it must not optimize the 8 byte copy.
Defining the data type as aligned to 4 forces the compiler to treat
all accesses as though they aren't aligned and avoids the 'ldx'
optimization.

Full credit for the idea goes to Jason Gunthorpe
<jgunthorpe@obsidianresearch.com>.

Orabug: 23311415

Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
9 years agords: schedule local connection activity in proper workqueue
Ajaykumar Hotchandani [Mon, 18 Apr 2016 22:59:26 +0000 (15:59 -0700)]
rds: schedule local connection activity in proper workqueue

While reconnect, local connection is scheduled on rds_wq; while it it
should have been scheduled rds_local_wq.
This patch corrects that.

Orabug: 23223537

Tested-by: Michael Nowak <michael.nowak@oracle.com>
Signed-off-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Acked-by: Mukesh Kacker <mukesh.kacker@oracle.com>
9 years agoIB/security: Restrict use of the write() interface
Jason Gunthorpe [Mon, 11 Apr 2016 01:13:13 +0000 (19:13 -0600)]
IB/security: Restrict use of the write() interface

The drivers/infiniband stack uses write() as a replacement for
bi-directional ioctl().  This is not safe. There are ways to
trigger write calls that result in the return structure that
is normally written to user space being shunted off to user
specified kernel memory instead.

For the immediate repair, detect and deny suspicious accesses to
the write API.

For long term, update the user space libraries and the kernel API
to something that doesn't present the same security vulnerabilities
(likely a structured ioctl() interface).

The impacted uAPI interfaces are generally only available if
hardware from drivers/infiniband is installed in the system.

Reported-by: Jann Horn <jann@thejh.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
[ Expanded check to all known write() entry points ]
Cc: stable@vger.kernel.org
Signed-off-by: Doug Ledford <dledford@redhat.com>
CVE-2016-4565
Orabug: 23276449

(cherry-pick from e6bd18f57aad1a2d1ef40e646d03ed0f2515c9e3)
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agonet/rds: Use max_mr from HCA caps than max_fmr
Yuval Shaia [Thu, 5 May 2016 07:47:48 +0000 (00:47 -0700)]
net/rds: Use max_mr from HCA caps than max_fmr

All HCA drivers seems to populate max_mr caps and few of them do both
max_mr and max_fmr.
Hence update RDS code to make use of max_mr.

Orabug: 23223564

Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
9 years agoRDS: IB: disable ib_cache purging to avoid memory leak in reconnect path
Santosh Shilimkar [Tue, 17 May 2016 21:46:18 +0000 (14:46 -0700)]
RDS: IB: disable ib_cache purging to avoid memory leak in reconnect path

RDS IB caches don't work in reconnect path and if used can lead to
memory leaks. These leaks have been there for long time but we didn't
hit them since caches are not teardown in reconnect path. For different
frag rolling upgrade/downgrade support, its needed to work in reconnect
path but needs additional fixes.

Since the leak is blocking rest of the testing, temporary the cache
purging is disabled. It will be added back once fully fixed.

Orabug: 23275911

The change doesn't impact any of the existing RDS functionality.

Tested-by: Hong Liu <hong.x.liu@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agoRDS: IB: avoid bit fields for i_frag_pages
Wengang Wang [Tue, 17 May 2016 16:21:20 +0000 (09:21 -0700)]
RDS: IB: avoid bit fields for i_frag_pages

i_frag_pages may need to store more than 1 page value for
higher fragments so bit field won't help.

Lets fix that.

Orabug: 23275911

Tested-by: Hong Liu <hong.x.liu@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
9 years agoRDS: TCP: Synchronize accept() and connect() paths on t_conn_lock.
Sowmini Varadhan [Tue, 3 May 2016 20:14:42 +0000 (13:14 -0700)]
RDS: TCP: Synchronize accept() and connect() paths on t_conn_lock.

Orabug 23228077

Backport of upstream commit bd7c5f983f31 ("RDS: TCP: Synchronize accept()
and connect() paths on t_conn_lock.")

An arbitration scheme for duelling SYNs is implemented as part of
commit 241b271952eb ("RDS-TCP: Reset tcp callbacks if re-using an
outgoing socket in rds_tcp_accept_one()") which ensures that both nodes
involved will arrive at the same arbitration decision. However, this
needs to be synchronized with an outgoing SYN to be generated by
rds_tcp_conn_connect(). This commit achieves the synchronization
through the t_conn_lock mutex in struct rds_tcp_connection.

The rds_conn_state is checked in rds_tcp_conn_connect() after acquiring
the t_conn_lock mutex.  A SYN is sent out only if the RDS connection is
not already UP (an UP would indicate that rds_tcp_accept_one() has
completed 3WH, so no SYN needs to be generated).

Similarly, the rds_conn_state is checked in rds_tcp_accept_one() after
acquiring the t_conn_lock mutex. The only acceptable states (to
allow continuation of the arbitration logic) are UP (i.e., outgoing SYN
was SYN-ACKed by peer after it sent us the SYN) or CONNECTING (we sent
outgoing SYN before we saw incoming SYN).

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agoRDS:TCP: Synchronize rds_tcp_accept_one with rds_send_xmit when resetting t_sock
Sowmini Varadhan [Tue, 3 May 2016 18:55:08 +0000 (11:55 -0700)]
RDS:TCP: Synchronize rds_tcp_accept_one with rds_send_xmit when resetting t_sock

Orabug 23228077

Backport of upstream commit eb192840266f ("RDS:TCP: Synchronize
rds_tcp_accept_one with rds_send_xmit when resetting t_sock")

There is a race condition between rds_send_xmit -> rds_tcp_xmit
and the code that deals with resolution of duelling syns added
by commit 241b271952eb ("RDS-TCP: Reset tcp callbacks if re-using an
outgoing socket in rds_tcp_accept_one()").

Specifically, we may end up derefencing a null pointer in rds_send_xmit
if we have the interleaving sequence:
         rds_tcp_accept_one                  rds_send_xmit

                                           conn is RDS_CONN_UP, so
      invoke rds_tcp_xmit

                                           tc = conn->c_transport_data
      rds_tcp_restore_callbacks
          /* reset t_sock */
      null ptr deref from tc->t_sock

The race condition can be avoided without adding the overhead of
additional locking in the xmit path: have rds_tcp_accept_one wait
for rds_tcp_xmit threads to complete before resetting callbacks.
The synchronization can be done in the same manner as rds_conn_shutdown().
First set the rds_conn_state to something other than RDS_CONN_UP
(so that new threads cannot get into rds_tcp_xmit()), then wait for
RDS_IN_XMIT to be cleared in the conn->c_flags indicating that any
threads in rds_tcp_xmit are done.

Fixes: 241b271952eb ("RDS-TCP: Reset tcp callbacks if re-using an
outgoing socket in rds_tcp_accept_one()")
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agoskbuff: Add pskb_extract() helper function
Sowmini Varadhan [Tue, 26 Apr 2016 10:51:46 +0000 (03:51 -0700)]
skbuff: Add pskb_extract() helper function

Orabug 23180876

Upstream commit 6fa01ccd8830 ("skbuff: Add pskb_extract() helper function")

A pattern of skb usage seen in modules such as RDS-TCP is to
extract `to_copy' bytes from the received TCP segment, starting
at some offset `off' into a new skb `clone'. This is done in
the ->data_ready callback, where the clone skb is queued up for rx on
the PF_RDS socket, while the parent TCP segment is returned unchanged
back to the TCP engine.

The existing code uses the sequence
        clone = skb_clone(..);
        pskb_pull(clone, off, ..);
        pskb_trim(clone, to_copy, ..);
with the intention of discarding the first `off' bytes. However,
skb_clone() + pskb_pull() implies pksb_expand_head(), which ends
up doing a redundant memcpy of bytes that will then get discarded
in __pskb_pull_tail().

To avoid this inefficiency, this commit adds pskb_extract() that
creates the clone, and memcpy's only the relevant header/frag/frag_list
to the start of `clone'. pskb_trim() is then invoked to trim clone
down to the requested to_copy bytes.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agoRDS: support individual receive trace reporting
Santosh Shilimkar [Thu, 21 Apr 2016 23:18:47 +0000 (16:18 -0700)]
RDS: support individual receive trace reporting

If application wants to get indvidual trace point, its easy
to support with existing infrastructure.

No change needed in API

Orabug: 23215779

Tested-by: Namrata Jampani <namrata.jampani@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agoIB/ipoib: Add readout of statistics using ethtool
Hans Westgaard Ry [Wed, 13 Apr 2016 13:23:20 +0000 (15:23 +0200)]
IB/ipoib: Add readout of statistics using ethtool

Orabug: 21498734

IPoIB collects statistics of traffic including number of packets
sent/received, number of bytes transferred, and certain errors. This
patch makes these statistics available to be queried by ethtool.

Change-Id: Ic159815fe0cc08770cd4111ec1df117b7349c154
Signed-off-by: Hans Westgaard Ry <hans.westgaard.ry@oracle.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Tested-by: Yuval Shaia <yuval.shaia@oracle.com>
Acked-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
9 years agoIB/ipoib: Add handling for sending of skb with many frags
Hans Westgaard Ry [Wed, 16 Mar 2016 13:01:02 +0000 (14:01 +0100)]
IB/ipoib: Add handling for sending of skb with many frags

Orabug: 21498734

IPoIB puts skb-fragments in SGEs adding 1 extra SGE when SG is enabled.
Current codepath assumes that the max number of SGEs a device supports
is at least MAX_SKB_FRAGS+1, there is no interaction with upper layers
to limit number of fragments in an skb if a device suports fewer SGEs.
The assumptions also lead to requesting a fixed number of SGEs when
IPoIB creates queue-pairs with SG enabled.

A fallback/slowpath is implemented using skb_linearize to
handle cases where the conversion would result in more sges than supported.

Change-Id: Ia81e69d7231987208ac298300fc5b9734f193a2d
Signed-off-by: Hans Westgaard Ry <hans.westgaard.ry@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Wei Lin Guay <wei.lin.guay@oracle.com>
9 years agoRevert "RDS: Make message size limit compliant with spec"
Chuck Anderson [Wed, 4 May 2016 17:18:17 +0000 (10:18 -0700)]
Revert "RDS: Make message size limit compliant with spec"

This reverts commit ef278157f938011ee08fb683311e5e31ffd29fdf.

Orabug: 23217242

From Avinash Repaka <avinash.repaka@oracle.com>:
This patch reverts the fix for Orabug: 22661521, since the fix assumes
that the memory region is always aligned on a page boundary, causing an
EMSGSIZE error when trying to register 1MB region that isn't 4KB aligned.
These issues were observed on kernel 4.1.12-39.el6uek tag.

Signed-off-by: Avinash Repaka <avinash.repaka@oracle.com>
Acked-by: Chuck Anderson <chuck.anderson@oracle.com>
9 years agoRDS: TCP: Remove unused constant
Sowmini Varadhan [Thu, 24 Mar 2016 19:13:57 +0000 (12:13 -0700)]
RDS: TCP: Remove unused constant

Orabug: 22993275

Upstream commit a3382e408b64 ("RDS: TCP: Remove unused constant")

RDS_TCP_DEFAULT_BUFSIZE has been unused since commit 1edd6a14d24f
("RDS-TCP: Do not bloat sndbuf/rcvbuf in rds_tcp_tune").

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agoRDS: TCP: Add sysctl tunables for sndbuf/rcvbuf on rds-tcp socket
Sowmini Varadhan [Wed, 23 Mar 2016 23:37:26 +0000 (16:37 -0700)]
RDS: TCP: Add sysctl tunables for sndbuf/rcvbuf on rds-tcp socket

Orabug: 22993275

Upstream commit c6a58ffed536 ("RDS: TCP: Add sysctl tunables for
sndbuf/rcvbuf on rds-tcp socket")

Add per-net sysctl tunables to set the size of sndbuf and
rcvbuf on the kernel tcp socket.

The tunables are added at /proc/sys/net/rds/tcp/rds_tcp_sndbuf
and /proc/sys/net/rds/tcp/rds_tcp_rcvbuf.

These values must be set before accept() or connect(),
and there may be an arbitrary number of existing rds-tcp
sockets when the tunable is modified. To make sure that all
connections in the netns pick up the same value for the tunable,
we reset existing rds-tcp connections in the netns, so that
they can reconnect with the new parameters.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Reviewed-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agoRDS: Make message size limit compliant with spec
Avinash Repaka [Mon, 29 Feb 2016 23:30:57 +0000 (15:30 -0800)]
RDS: Make message size limit compliant with spec

This patch limits RDS messages(both RDMA & non-RDMA) and RDS MR size
to RDS_MAX_MSG_SIZE(1MB) irrespective of underlying transport layer.

Orabug: 22661521

Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Avinash Repaka <avinash.repaka@oracle.com>
Acked-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
9 years agoRDS: add flow control info to rds_info_rdma_connection
Wei Lin Guay [Wed, 27 Jan 2016 12:18:08 +0000 (13:18 +0100)]
RDS: add flow control info to rds_info_rdma_connection

Added per connection flow_ctl_post_credit and
flow_ctl_send_credit to rds-info. These info help
in debugging RDS IB flow control. The newly added
attributes are placed at the bottom of the data
structure to ensure backward compatibility.

Orabug: 22306628

Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Wei Lin Guay <wei.lin.guay@oracle.com>
9 years agoRDS: update IB flow control algorithm
Wei Lin Guay [Thu, 17 Dec 2015 08:34:33 +0000 (09:34 +0100)]
RDS: update IB flow control algorithm

The current algorithm that uses 16 as a hard-coded value
in rds_ib_advertise_credits() doesn't serve the purpose, as
post_recvs() are performed in bulk. Thus, the test
condition will always be true.

This patch moves rds_ib_advertise_credits() in to the
post_recvs() loop. Instead of updating the post_recv credits
after all the post_recvs() have completed, the post_recv
credit is being updated in log2 incremental manner.
The proposed exponential quadrupling algorithm serves as a
good compromise between early start of the peer and at the
same time reducing the amount of explicit ACKs. The credit
update explicit ACKs will be generated starting from 16,
256, 4096...etc.

The performance number below shows that this new flow
control algorithm has minimal impact performance even though
it requires additional explicit ACKs.

4 QPs, 32t, 16d -q 8448 -a 256 (rds-parameter)

HCAs flow_ctl no flow_ctl
Mellanox CX3 744K 742K
Oracle QDR M4 819K 831K

Orabug: 22306628

Suggested-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Wei Lin Guay <wei.lin.guay@oracle.com>
9 years agoRDS: Add flow control in runtime debugging
Wei Lin Guay [Fri, 4 Dec 2015 15:50:13 +0000 (16:50 +0100)]
RDS: Add flow control in runtime debugging

Add RDS_RTD_FLOW_CNTRL feature flag in
rds_rt_debug_bitmap.

Orabug: 22306628

Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Wei Lin Guay <wei.lin.guay@oracle.com>
9 years agoRDS: fix IB transport flow control
Wei Lin Guay [Thu, 3 Dec 2015 08:23:59 +0000 (09:23 +0100)]
RDS: fix IB transport flow control

Orabug: 22306628

IB flow control is always disabled regardless of
rds_ib_sysctl_flow_control flag.
The issue is that the initial credit advertisement
annouces zero credits, because ib_recv_refill() has
not yet been called. An initial credit offering
of zero effectively disables flow control.

IB flow control is only enabled if both active and
passive connections have set the rds_ib_sysctl
flow_control flag. E.g,

Conn. A (on),  Conn. B (on)  = enable
Conn. A (off), Conn. B (on)  = disable
Conn. A (on),  Conn. B (off) = disable
Conn. A (off), Conn. B (off) = disable

Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Knut Omang <knut.omang@oracle.com>
Reviewed-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Acked-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Signed-off-by: Wei Lin Guay <wei.lin.guay@oracle.com>
9 years agomlx4_core: scale_profile should work without params set to 0
Mukesh Kacker [Wed, 13 Apr 2016 01:53:37 +0000 (18:53 -0700)]
mlx4_core: scale_profile should work without params set to 0

The "scale_profile" parameter is to be used to do scaling of
hca params without requiring specific tuning setting for each
one of them individually and yet allowing manual setting of
variables.

In UEK2, the module params were default zero and initialized
later from a "driver default" or a scale_profile dictated value.

In UEK4 (derived from Mellanox OFED 2.4) module params were
pre-initialized to default values and are not zero and have
to be forced to 0 for dynamic scaling to be activated.
This defeats the purpose of having a single parameter to achieve
scaling and not requiring setting individual parameters (while
retaining ability to revert to driver defaults).

The changes here (re)introduce a separate static instance
containing default parameters separate from module parameters
which are pre-initialized to zero for parameters that can scale
dynamically and to default values for others. The zero module
parameters are later initialized to either a scale_profile
governed value or driver defaults.

Orabug: 23078816

Signed-off-by: Mukesh Kacker <mukesh.kacker@oracle.com>
Reviewed-by: Wengang Wang <wen.gang.wang@oracle.com>
Reviewed-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agomlx4_core: bump default scaled value of num of cqs and srqs
Mukesh Kacker [Tue, 29 Mar 2016 02:12:50 +0000 (19:12 -0700)]
mlx4_core: bump default scaled value of num of cqs and srqs

Bump up the number of cqs and srqs when scale_profile parameter
is set or log_num_cq/log_num_srq are zero implying a scaled
default profile.

This allows higher scalability needed for cloud based platforms
running higher number of processes per hca.

If the individual parameters log_num_srq and/or log_num_cq are
set, they take precedence and scale_profile is ignored.

(Fixes for style problems with quoted strings split across
lines and and block comment format are also piggybacked
in this commit!)

Orabug: 23078966

Signed-off-by: Mukesh Kacker <mukesh.kacker@oracle.com>
Reviewed-by: Wengang Wang <wen.gang.wang@oracle.com>
Reviewed-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years ago[PATCH 2/2] Avoid redundant call to rds_bind_lookup() in recv path.
Sowmini Varadhan [Tue, 12 Apr 2016 01:49:12 +0000 (18:49 -0700)]
[PATCH 2/2] Avoid redundant call to rds_bind_lookup() in recv path.

Orabug 20930687

When RAC tries to scale RDS-TCP, they are hitting bottlenecks
due to inefficiencies in rds_bind_lookup. Each call to
rds_bind_lookup results in an irqsave/irqrestore sequence, and
when the list of RDS sockets is large, we end up having IRQs
suppressed for long intervals. This trigger flow-control assertions
and causes TX queue watchdog hangs in the sender. The current
implementation makes this even worse, by superfluously calling
rds_bind_lookup(). This patch set takes the first step to solving
this problem by avoiding one of the redundant calls to rds_bind_lookup.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agoRDS: TOS fixes in failure paths when RDS-TCP and RDS-RDMA are run together
Sowmini Varadhan [Tue, 12 Apr 2016 00:17:37 +0000 (17:17 -0700)]
RDS: TOS fixes in failure paths when RDS-TCP and RDS-RDMA are run together

Orabug 20930687

When errors such as connection hangs or failures are encountered
over RDS-TCP,  the sending RDS, in an attempt at HA, will try to
reconnect, and trip up on all sorts of data structures intended
for ToS support. The ToS feature is currently only supported for
RDS-IB, and  unplanned/untested usage of these data
structures by RDS-TCP causes deadlocks and panics.

Until we properly design, support, and test the ToS feature for
RDS-TCP, such paths should not be wandered into. Thus this patchset
adds defensive checks to ignore rs_tos settings in rds_sendmsg() for
TCP transports, and prevents the sending of ToS heartbeat pings

Until we properly design, support, and test the ToS feature for
RDS-TCP, such paths should not be wandered into. Thus this patchset
adds defensive checks to ignore rs_tos settings in rds_sendmsg() for
TCP transports, and prevents the sending of ToS heartbeat pings
in rds_send_hb() for TCP transport.

For reference, the deadlock that can be encountered in the
hb ping path is:

 [<ffffffff814846f4>] do_tcp_setsockopt+0x244/0x710  <-- wants sock_lock
 [<ffffffff81119ead>] ? __alloc_pages_nodemask+0x12d/0x230
 [<ffffffff8125cf4d>] ? rb_insert_color+0x9d/0x160
 [<ffffffff8101cd43>] ? native_sched_clock+0x13/0x80
 [<ffffffff8101c369>] ? sched_clock+0x9/0x10
 [<ffffffff810e25a9>] ? trace_clock_local+0x9/0x10
 [<ffffffff810e8d97>] ? rb_reserve_next_event+0x67/0x480
 [<ffffffff81292a1f>] ? __alloc_and_insert_iova_range+0x17f/0x1f0
 [<ffffffff81117f52>] ? get_page_from_freelist+0x1e2/0x550
 [<ffffffff81484c1a>] tcp_setsockopt+0x2a/0x30
 [<ffffffff8142e724>] sock_common_setsockopt+0x14/0x20
 [<ffffffffa00de5cd>] rds_tcp_xmit_prepare+0x5d/0x70 [rds_tcp]
 [<ffffffffa0749b35>] rds_send_xmit+0xe5/0x860 [rds]
 [<ffffffffa074a37d>] rds_send_hb+0xcd/0x130 [rds]
 [<ffffffffa0747a1b>] rds_recv_local+0x20b/0x330 [rds]
 [<ffffffffa074851d>] rds_recv_incoming+0x7d/0x290 [rds]
 [<ffffffff8101cd43>] ? native_sched_clock+0x13/0x80
 [<ffffffff8101c369>] ? sched_clock+0x9/0x10
 [<ffffffffa00de2c6>] rds_tcp_data_recv+0x316/0x440 [rds_tcp]
 [<ffffffff8148642a>] tcp_read_sock+0xda/0x230
 [<ffffffffa00ddfb0>] ? rds_tcp_recv+0x60/0x60 [rds_tcp]
 [<ffffffff81430fb0>] ? sock_get_timestamp+0xc0/0xc0
 [<ffffffffa00dde6d>] rds_tcp_read_sock+0x4d/0x60 [rds_tcp]
 [<ffffffffa00ddefa>] rds_tcp_data_ready+0x7a/0xd0 [rds_tcp]
 [<ffffffff8148df2e>] tcp_data_queue+0x2fe/0xac0
 [<ffffffff814916a9>] tcp_rcv_established+0x349/0x740
 [<ffffffff81499bb5>] tcp_v4_do_rcv+0x125/0x1f0
 [<ffffffff8149b3e7>] tcp_v4_rcv+0x597/0x830   <---- holds sock_lock
 [<ffffffff8148b56e>] ? __tcp_ack_snd_check+0x5e/0xa0
 [<ffffffff814916cd>] ? tcp_rcv_established+0x36d/0x740
 [<ffffffff814774ed>] ip_local_deliver_finish+0xdd/0x2a0
 [<ffffffff81477730>] ip_local_deliver+0x80/0x90
 [<ffffffff81476d75>] ip_rcv_finish+0x105/0x370

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agords: rds-stress show all zeros after few minutes
shamir rabinovitch [Wed, 16 Mar 2016 13:57:19 +0000 (09:57 -0400)]
rds: rds-stress show all zeros after few minutes

Issue can be seen on platforms that use 8K and above page size
while rds fragment size is 4K. On those platforms single page is
shared between 2 or more rds fragments. Each fragment has it's own
offeset and rds cong map code need to take this offset to account.
Not taking this offset to account lead to reading the data fragment
as congestion map fragment and hang of the rds transmit due to far
cong map corruption.

Orabug: 23045970

Reviewed-by: Wengang Wang <wen.gang.wang@oracle.com>
Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Tested-by: Anand Bibhuti <anand.bibhuti@oracle.com>
Signed-off-by: shamir rabinovitch <shamir.rabinovitch@oracle.com>
9 years agoRDS: Fix the atomicity for congestion map update
Wengang Wang [Thu, 7 Apr 2016 07:29:43 +0000 (15:29 +0800)]
RDS: Fix the atomicity for congestion map update

Orabug: 23022620 (UEK4)
QuickRef: 22118109 (UEK2)

Two different threads with different rds sockets may be in
rds_recv_rcvbuf_delta() via receive path. If their ports
both map to the same word in the congestion map, then
using non-atomic ops to update it could cause the map to
be incorrect. Lets use atomics to avoid such an issue.

Full credit to Wengang <wen.gang.wang@oracle.com> for
finding the issue, analysing it and also pointing out
to offending code with spin lock based fix.

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agoRDS: IB: Run rds_fmr_flush WQ closer to ib_device
Wei Lin Guay [Wed, 18 Nov 2015 11:23:08 +0000 (12:23 +0100)]
RDS: IB: Run rds_fmr_flush WQ closer to ib_device

Orabug 22269408

rds_fmr_flush workqueue is calling ib_unmap_fmr
to invalidate a list of FMRs. Today, this workqueue
can be scheduled at any CPUs. In a NUMA-aware system,
schedule this workqueue to run on a CPU core closer to
ib_device can improve performance. As for now, we use
"sequential-low" policy. This policy selects two lower
cpu cores closer to HCA. In a non-NUMA aware system,
schedule rds_fmr_flush workqueue in a fixed cpu core
improves performance.

The mapping of cpu to the rds_fmr_flush workqueue
can be enabled/disabled via  sysctl and it is enable
by default. To disable the feature, use below sysctl.

rds_ib_sysctl_disable_unmap_fmr_cpu = 1

Putting down some of the rds-stress performance number
comparing default and sequential-low policy in a NUMA
system with Oracle M4 QDR and Mellanox CX3.

rds-stress 4 conns, 32 threads, 16 depths, RDMA write
and unidirectional (higher is better).

(Oracle M4 QDR)
default : 645591 IOPS
sequential-low : 806196 IOPS

(Mellanox CX3)
default : 473836 IOPS
sequential-low : 544187 IOPS

Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Knut Omang <knut.omang@oracle.com>
Reviewed-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Wei Lin Guay <wei.lin.guay@oracle.com>
9 years agoRDS: IB: support larger frag size up to 16KB
Santosh Shilimkar [Wed, 21 Oct 2015 23:47:28 +0000 (16:47 -0700)]
RDS: IB: support larger frag size up to 16KB

Infiniband (IB) transport supports larger message size
than RDS_FRAG_SIZE, which is usually in 4KB PAGE_SIZE.
Nevertheless, RDS always fragments each payload into
RDS_FRAG_SIZE before hands it over to the underlying
IB HCA.

One of the important message size required for database
is 8448 (8K + 256B control message) for BCOPY. This RDS
message, even with IB transport, will generate three
IB work requests (WR)  with each having its own RDS header.
This series of patches improve RDS performance by allowing
IB transport to send/receive RDS message with a larger
RDS_FRAG_SIZE (Ideally, using a single WR).

In order to maintain the backward compatibility and
interoperability between various RDS versions, and at
the same time to support various FRAG_SIZE, the IB
fragment size is negotiated per connection.
Although IB is capable of supporting 4GB of message size,
currently we limit the IB RDS_FRAG_SIZE up to 16KB due to
two reasons:-
 1. This is needed for current 8448 RDS message size usecase.
 2. Minizing the size for each receive queue entry in order
    to optimal memory usage.

In term of implementation, The 'dp_reserved2' field of
'struct rds_ib_connect_private' now carries information about
supported IB fragment size. Since we are just
using the IB connection private data and a reserved field,
the protocol version is not bumped up. Furthermore, the feature
is enabled only for RDS_PROTOCOL_v4.1 and above (future).

To keep thing simpler for user, a module parameter
'rds_ib_max_frag' is provided. Without module parameter,
the default PAGE_SIZE frag will be used. During the connection
establishment, the smallest fragment size will be
chosen. If the fragment size is 0, it means RDS module
doesn't support large fragment size and the default
RDS_FRAG_SIZE will be used.

Upto ~10+ % improvement seen with Orion and ~9+ % with RDBMS
update queries.

Orabug: 21894138

Reviwed-by: Wei Lin Guay <wei.lin.guay@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agoRDS: IB: add frag size to per connection info
Santosh Shilimkar [Sat, 21 Nov 2015 22:16:48 +0000 (14:16 -0800)]
RDS: IB: add frag size to per connection info

rds-toools (rds-info) needs tobe updated to display per IB connection
fragment info. Have a patch for it and will get that merged accordingly.

Orabug: 21894138
Reviwed-by: Wei Lin Guay <wei.lin.guay@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agoRDS: IB: log the endpoint rds connection role
Santosh Shilimkar [Fri, 20 Nov 2015 21:25:40 +0000 (13:25 -0800)]
RDS: IB: log the endpoint rds connection role

Useful to know the active and passive end points in a
RDS conection.

Orabug: 21894138
Reviewed-by: Wei Lin Guay <wei.lin.guay@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agoRDS: IB: purge receive frag cache on connection shutdown
Santosh Shilimkar [Mon, 21 Mar 2016 06:24:32 +0000 (23:24 -0700)]
RDS: IB: purge receive frag cache on connection shutdown

RDS IB connections can be formed with different fragment size across
reconnect and hence the current frag cache needs to be purged to
avoid stale frag usages.

Orabug: 21894138
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agoRDS: IB: use i_frag_sz for cache stat updates
Santosh Shilimkar [Mon, 21 Mar 2016 10:49:30 +0000 (03:49 -0700)]
RDS: IB: use i_frag_sz for cache stat updates

Updates the receive cache statistics based on ic->i_frag_sz

Orabug: 21894138
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agoRDS: IB: scale rds_ib_allocation based on fragment size
Santosh Shilimkar [Wed, 4 Nov 2015 21:42:39 +0000 (13:42 -0800)]
RDS: IB: scale rds_ib_allocation based on fragment size

The 'rds_ib_sysctl_max_recv_allocation' allocation is used to manage
and allocate the size of IB receive queue entry (RQE) for each IB
connection. However, it relies on the hardcoded RDS_FRAG_SIZE.

Lets make it scalable based on supported fragment sizes for different
IB connection. Each connection can allocate different RQE size
depending on the per connection fragment_size.

Orabug: 21894138
Reviewed-by: Wei Lin Guay <wei.lin.guay@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agoRDS: IB: make fragment size (RDS_FRAG_SIZE) dynamic
Santosh Shilimkar [Wed, 21 Oct 2015 23:47:28 +0000 (16:47 -0700)]
RDS: IB: make fragment size (RDS_FRAG_SIZE) dynamic

IB fabric is capable of fragment 4GB of data payload into
send_first, send_middle and send_last. Nevertheless,
RDS fragments each payload into PAGE_SIZE, which is usually
4KB. This patch makes the RDS_FRAG_SIZE for RDS IB transport
dynamic.

In the preperation for subsequent patch(es), this patch
adds per connection peer negotiation to determine the
supported fragment size for IB transport.

Orabug: 21894138
Reviewed-by: Wei Lin Guay <wei.lin.guay@oracle.com>
Signed-off-by: Wei Lin Guay <wei.lin.guay@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agoRDS: log the IP address as well on bind failure
Santosh Shilimkar [Wed, 4 Nov 2015 21:42:39 +0000 (13:42 -0800)]
RDS: log the IP address as well on bind failure

It's useful to know the IP address when RDS fails to bind a
connection. Thus, adding it to the error message.

Orabug: 21894138
Reviewed-by: Wei Lin Guay <wei.lin.guay@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agoRDS: fix the sg allocation based on actual message size
Wei Lin Guay [Fri, 20 Nov 2015 23:14:48 +0000 (15:14 -0800)]
RDS: fix the sg allocation based on actual message size

Fix an issue where only PAGE_SIZE bytes are allocated per
scatter-gather entry (SGE) regardless of the actual message
size: Furthermore, use buddy memory allocation technique to
allocate/free memmory (if possible) to reduce SGE.

Orabug: 21894138
Signed-off-by: Wei Lin Guay <wei.lin.guay@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agoRDS: make congestion code independent of PAGE_SIZE
Santosh Shilimkar [Mon, 16 Nov 2015 21:28:11 +0000 (13:28 -0800)]
RDS: make congestion code independent of PAGE_SIZE

RDS congestion map code is designed with base assumption of
4K page size. The map update as well transport code assumes
it that way. Ofcourse it breaks when transport like IB starts
supporting larger fragments than 4K.

To overcome this limitation without too many changes to the core
congestion map update logic, define indepedent RDS_CONG_PAGE_SIZE
and use it.

While at it we also move rds_message_map_pages() whose sole
purpose it to map congestion pages to congestion code.

Orabug: 21894138
Reviwed-by: Wei Lin Guay <wei.lin.guay@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agoRDS: Back out OoO send status fix since it causes the regression
Santosh Shilimkar [Tue, 16 Feb 2016 18:24:31 +0000 (10:24 -0800)]
RDS: Back out OoO send status fix since it causes the regression

With the DB build, the crash was observed which was boiled down to
this change on UEK2. Proactively we back this out on UEK4 as well
till the issue gets addresssed.

<0>------------[ cut here ]------------
<2>kernel BUG at net/rds/send.c:511!
<0>invalid opcode: 0000 [#1] SMP
<4>CPU 197
<4>Modules linked in: oracleacfs(P)(U) oracleadvm(P)(U) oracleoks(P)(U)
ipmi_poweroff ipmi_devintf ipmi_si ipmi_msghandler bonding acpi_cpufreq
freq_table mperf rds_rdma rds ib_sdp ib_ipoib rdma_ucm ib_ucm ib_uverbs
ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 mlx4_ib ib_sa ib_mad ib_core
mlx4_core ext3 jbd fuse ghes hed wmi i2c_i801 iTCO_wdt
iTCO_vendor_support
igb i2c_algo_bit i2c_core ixgbe hwmon dca sg ext4 mbcache jbd2 sd_mod
crc_t10dif megaraid_sas dm_mirror dm_region_hash dm_log dm_mod [last
unloaded: scsi_wait_scan]
<4>
<4>Pid: 85252, comm: oracle_85252_tp Tainted: P

[...]

<0>Call Trace:
<4> [<ffffffffa0340cab>] rds_send_drop_to+0xcb/0x470 [rds]
<4> [<ffffffffa033a19e>] rds_release+0x8e/0x110 [rds]
<4> [<ffffffff8142a369>] sock_release+0x29/0x90
<4> [<ffffffff8142a3e7>] sock_close+0x17/0x30
<4> [<ffffffff8116f2fe>] __fput+0xbe/0x240
<4> [<ffffffff8116f4a5>] fput+0x25/0x30
<4> [<ffffffff8116b3e3>] filp_close+0x63/0x90
<4> [<ffffffff8116b4c7>] sys_close+0xb7/0x120
<4> [<ffffffff81516762>] system_call_fastpath+0x16/0x1b
<0>Code: 00 75 1f 65 8b 14 25 58 c3 00 00 48 63 d2 48 c7 c0 80 26 01 00
48 03
04 d5 40 0e 99 81 48 83 40 68 01 5b 41 5c c9 c3 0f 0b eb fe <0f> 0b eb
fe 0f
1f 40 00 55 48 89 e5 48 83 ec 40 48 89 5d d8 4c
<1>RIP  [<ffffffffa033e5e8>] rds_send_sndbuf_remove+0x68/0x70 [rds]
<4> RSP <ffff8825b0577dd8>

Revert "RDS: Fix out-of-order RDS_CMSG_RDMA_SEND_STATUS"

This reverts commit 5631f1a303104d41f6ded0d603011d6c172b8644.

Orabug: 21894138
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agonet/mlx4_core: Modify default value of log_rdmarc_per_qp to be consistent with HW...
Yuval Shaia [Thu, 17 Sep 2015 09:52:44 +0000 (02:52 -0700)]
net/mlx4_core: Modify default value of log_rdmarc_per_qp to be consistent with HW capability

This value is used to calculate max_qp_dest_rdma.
Default value of 4 brings us to 16 while HW supports 128
(max_requester_per_qp)
Although this value can be changed by module param it is best that default
will be optimized

Orabag: 19883194

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Wengang Wang <wen.gang.wang@oracle.com>
9 years agoFixed vnic issue after saturn reset
Pradeep Gopanapalli [Tue, 1 Mar 2016 01:45:59 +0000 (01:45 +0000)]
Fixed vnic issue after saturn reset

After saturn reboots uVnic will not recover
The issue happens When a host connected to NM3 via infiniband switch.
The fact here is xve driver sets OPER_DOWN when it losses HEART_BEAT and
there is no way for it set this state back .
Fixing this by disjoining Multicast group once HeartBeat is lost and
joining back once saturn comes back

Orabug: 22862488

Reported-by: Ye Jin <ye.jin@oracle.com>
Signed-off-by: Pradeep Gopanapalli <pradeep.gopanapalli@oracle.com>
Reviewed-by: sajid zia <szia@oracle.com>
Signed-off-by: Qing Huang <qing.huang@oracle.com>
9 years agouvnic issues
Pradeep Gopanapalli [Thu, 11 Feb 2016 23:29:39 +0000 (23:29 +0000)]
uvnic issues

When there is Heartbeat loss because of Link event on  EDR fabric, uvnic
recovery fails(some times).
This is due to the fact that XVE_OPER_UP state would have cleared and
xve driver has to check this once multicast group membership is retained

While running MAXq test we are running into soft crash around
skb_try_coalesce .Since xve driver allocates PAGE_SIZE it has to use
full PAGE_SIZE for truesize of skb.

xve driver doesn't allocate enough tailroom in skbs, so IP/TCP
stacks need to reallocate skb head to pull IP/TCP headers.
this .

xve allocates some resources which are not needed by uVnic
functionality.
Fix this by using cm_supported flag.

Orabug: 22862488

Reported-by: ye jin <ye.jin@oracle.com>
Signed-off-by: Pradeep Gopanapalli <pradeep.gopanapalli@oracle.com>
Reviewed-by: sajid zia <szia@oracle.com>
Signed-off-by: Qing Huang <qing.huang@oracle.com>
9 years agoFixed wrongly checked return type Added Debug print
Pradeep Gopanapalli [Tue, 9 Feb 2016 23:15:19 +0000 (23:15 +0000)]
Fixed wrongly checked return type Added Debug print

In xs_post_recv instead of checking for -ENOMEM xscore driver is checking
 ENOMEM, fixed this by checking proper return type

Orabug: 22862488

Reported-by: Haakon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Pradeep Gopanapalli <pradeep.gopanapalli@oracle.com>
Reviewed-by: sajid zia <szia@oracle.com>
Signed-off-by: Qing Huang <qing.huang@oracle.com>
9 years agoRDS/IB: VRPC DELAY / OSS RECONNECT CAUSES 5 MINUTE STALL ON PORT FAILURE
Venkat Venkatsubra [Tue, 1 Mar 2016 22:27:28 +0000 (14:27 -0800)]
RDS/IB: VRPC DELAY / OSS RECONNECT CAUSES 5 MINUTE STALL ON PORT FAILURE

This problem occurs when the user gets notified of a successful
rdma write + bcopy message completion but the peer application
does not receive the bcopy message. This happens during a port down/up test.

What seems to happen is the rdma write succeeds but the bcopy message fails.

RDS should not be returning successful completion status to the user
in this case.

When RDS does a rdma followed by a bcopy message the user notification is
supposed to be implemented by method #3 below.

/* If the user asked for a completion notification on this
 * message, we can implement three different semantics:
 *  1.  Notify when we received the ACK on the RDS message
 *      that was queued with the RDMA. This provides reliable
 *      notification of RDMA status at the expense of a one-way
 *      packet delay.
 *  2.  Notify when the IB stack gives us the completion event for
 *      the RDMA operation.
 *  3.  Notify when the IB stack gives us the completion event for
 *      the accompanying RDS messages.
 * Here, we implement approach #3. To implement approach #2,
 * we would need to take an event for the rdma WR. To implement #1,
 * don't call rds_rdma_send_complete at all, and fall back to the notify
 * handling in the ACK processing code.

But unfortunately the user gets notified earlier to knowing the bcopy
send status. Right after rdma write completes the user gets notified
even though the subsequent bcopy eventually fails.

The fix is to delay signaling completions of rdma op till the
bcopy send completes.

Orabug: 22847528

Acked-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com>
Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
9 years agords: add infrastructure to find more details for reconnect failure
Ajaykumar Hotchandani [Fri, 4 Mar 2016 03:23:05 +0000 (19:23 -0800)]
rds: add infrastructure to find more details for reconnect failure

This patch adds run-time support to debug scenarios where reconnect is
not successful for certain time.
We add two sysctl variables for start time and end time. These are
number of seconds after reconnect was initiated.

Orabug: 22631108

Signed-off-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Acked-by: Wengang Wang <wen.gang.wang@oracle.com>
Reviewed-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
9 years agords: find connection drop reason
Ajaykumar Hotchandani [Fri, 4 Mar 2016 03:18:28 +0000 (19:18 -0800)]
rds: find connection drop reason

This patch attempts to find connection drop details.

Rational for adding this type of patch is, there are too many
places from where connection can get dropped.
And, in some cases, we don't have any idea of the source of
connection drop. This is especially painful for issues which
are reproducible in customer environment only.

Idea here is, we have tracker variable which keeps latest value
of connection drop source.
We can fetch that tracker variable as per our need.

Orabug: 22631108

Signed-off-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Reviewed-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Acked-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
9 years agoRDS: Add interface for receive MSG latency trace
Santosh Shilimkar [Fri, 11 Dec 2015 20:01:56 +0000 (12:01 -0800)]
RDS: Add interface for receive MSG latency trace

Socket option to tap receive path latency.
SO_RDS: SO_RDS_MSG_RXPATH_LATENCY
with parameter,
struct rds_rx_trace_so {
u8 rx_traces;
        u8 rx_trace_pos[RDS_MSG_RX_DGRAM_TRACE_MAX];
}

CMSG:
RDS_CMSG_RXPATH_LATENCY(recvmsg)
Returns rds message latencies in various stages of receive
path in nS. Its set per socket using SO_RDS_MSG_RXPATH_LATENCY
socket option. Legitimate points are defined in
enum rds_message_rxpath_latency. More points can be added in
future.

CSMG format:
struct rds_cmsg_rx_trace {
        u8 rx_traces;
        u8 rx_trace_pos[RDS_MSG_RX_DGRAM_TRACE_MAX];
        u64 rx_trace[RDS_MSG_RX_DGRAM_TRACE_MAX];
}

Receive MSG trace points: RDS message Receive Path Latency points
enum rds_message_rxpath_latency {
RDS_MSG_RX_HDR_TO_DGRAM_START = 0,
RDS_MSG_RX_DGRAM_REASSEMBLE,
RDS_MSG_RX_DGRAM_DELIVERED,
RDS_MSG_RX_DGRAM_TRACE_MAX
}

Tested-by: Namrata Jampani <namrata.jampani@oracle.com>
Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Reviewed-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Orabug: 22630180
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agoIB/mlx4: Replace kfree with kvfree in mlx4_ib_destroy_srq
Wengang Wang [Thu, 17 Dec 2015 02:54:15 +0000 (10:54 +0800)]
IB/mlx4: Replace kfree with kvfree in mlx4_ib_destroy_srq

Upstream commit 0ef2f05c7e02ff99c0b5b583d7dee2cd12b053f2 uses vmalloc for
WR buffers when needed and uses kvfree to free the buffers. It missed
changing kfree to kvfree in mlx4_ib_destroy_srq().

Reported-by: Matthew Finaly <matt@Mellanox.com>
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Backport from upstream: df4176677fdf7ac0c5748083eb7a5b269fb1e156

Orabug: 22487409
Reviewed-by: Wengang Wang <wen.gang.wang@oracle.com>
Acked-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Acked-by: Yuval Shaia <yuval.shaia@oracle.com>
9 years agoib_core: Add udata argument to alloc_shpd()
Mukesh Kacker [Tue, 22 Sep 2015 08:31:31 +0000 (01:31 -0700)]
ib_core: Add udata argument to alloc_shpd()

ib_core: Add udata argument to alloc_shpd()

Add udata argument to shared pd interface alloc_shpd()
consistent with evolution of other similar ib_core
interfaces so providers that wish to support it can use it.

For providers (like current Mellanox driver code) that
do not expect user user data, we assert a warning.

Orabug: 21884873

Signed-off-by: Mukesh Kacker <mukesh.kacker@oracle.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
9 years agoRDS: establish connection for legitimate remote RDMA message
Santosh Shilimkar [Thu, 29 Oct 2015 16:24:46 +0000 (09:24 -0700)]
RDS: establish connection for legitimate remote RDMA message

The first message to a remote node should prompt a new
connection even if it is RDMA operation via CMSG. So that
means before CMSG parsing, the connection needs to be
established. Commit 3d6e0fed8edc ("rds_rdma: rds_sendmsg
should return EAGAIN if connection not setup")' tried to
address that issue as part of bug 20232581.

But it inadvertently broke the QoS policy evaluation. Basically
QoS has opposite requirement where it needs information from
CMSG to evaluate if the message is legitimate to be sent over
the wire. It basically needs to know how the total payload
which should include the actual payload and additional rdma
bytes. It then evaluates total payload with the systems QoS
thresholds to determine if the message is legitimate to be
sent.

Patch addresses these two opposite requirement by fetching
only the rdma bytes information for QoS evaluation and let
the full CMSG parsing happen after the connection is
initiated.  Since the connection establishment is asynchronous,
we make sure the map failure because of unavailable
connection reach to the user by appropriate error code.

Orabug: 22139696

Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agords: remove the _reuse_ rds ib pool statistics
Wengang Wang [Tue, 10 Nov 2015 02:43:38 +0000 (10:43 +0800)]
rds: remove the _reuse_ rds ib pool statistics

Orabug: 22124214
fix a regress introduced by ceb99ba579a769f4e02375a3d52e36f44ae5f27f

The above commit introduced the two new statistics to rds_ib_statistics.

       uint64_t        s_ib_rdma_mr_1m_pool_reuse;
       uint64_t        s_ib_rdma_mr_8k_pool_reuse;

But didn't have rds-info changed accordingly thus rds-info gets shifted stats.

this removes these two stats.

Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Acked-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
9 years agoRDS: Add support for per socket SO_TIMESTAMP for incoming messages
Santosh Shilimkar [Sat, 5 Sep 2015 00:01:06 +0000 (17:01 -0700)]
RDS: Add support for per socket SO_TIMESTAMP for incoming messages

The SO_TIMESTAMP generates time stamp for each incoming RDS message
using the wall time. Result is returned via recv_msg() in a control
message as timeval (usec resolution).

User app can enable it by using SO_TIMESTAMP setsocketopt() at
SOL_SOCKET level. CMSG data of cmsg type SO_TIMESTAMP contains the
time stamp in struct timeval format.

Orabug: 22190837

Acked-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Reviewed-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
Tested-by: Namrata Jampani <namrata.jampani@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agoRDS: Fix out-of-order RDS_CMSG_RDMA_SEND_STATUS
Wei Lin Guay [Wed, 11 Nov 2015 17:31:14 +0000 (09:31 -0800)]
RDS: Fix out-of-order RDS_CMSG_RDMA_SEND_STATUS

Orabug: 22126982

If the RDS user application requests notification of RDMA send
completions, there is a possibility that RDS_CMSG_RDMA_SEND_STATUS
will be delivered out-of-order.

This can happen if RDS drops sending ACK after it received an explicit
ACK. In this case, the rds message ended up in reverse order in the
list.

Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Knut Omang <knut.omang@oracle.com>
Reviewed-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Signed-off-by: Wei Lin Guay <wei.lin.guay@oracle.com>
9 years agoMerge branch 'topic/uek-4.1/ofed' of git://ca-git.us.oracle.com/linux-uek-ofed into...
Santosh Shilimkar [Tue, 10 Nov 2015 02:17:26 +0000 (18:17 -0800)]
Merge branch 'topic/uek-4.1/ofed' of git://ca-git.us.oracle.com/linux-uek-ofed into topic/uek-4.1/ofed

9 years agoMerge branch 'topic/uek-4.1/ofed.ovn' into topic/uek-4.1/ofed
Qing Huang [Tue, 10 Nov 2015 00:23:23 +0000 (16:23 -0800)]
Merge branch 'topic/uek-4.1/ofed.ovn' into topic/uek-4.1/ofed

* topic/uek-4.1/ofed.ovn:
  Integrate Uvnic functionality into uek-4.1 Revision 8008
  1) S_IRWXU causing kernel soft crash changing to 0644 WARNING: CPU: 0 PID: 20907 at fs/sysfs/group.c:61 create_files+0x171/0x180() Oct 12 21:43:14 ovn87-180 kernel: [252606.588541] Attribute vhba_default_scsi_timeout: Invalid permissions 0700 [Rev 8008]
  1) Support vnic for EDR based platform(uVnic) 2) Supported Types now Type 0 - XSMP_XCM_OVN - Xsigo VP780/OSDN standalone Chassis, (add pvi) Type 1 - XSMP_XCM_NOUPLINK - EDR Without uplink (add public-network) Type 2 - XSMP_XCM_UPLINK -EDR with uplink (add public-network <with -if> 3) Intelligence in driver to support all the modes 4) Added Code for printing Multicast LID [Revision 8008] 5) removed style errors

9 years agoIntegrate Uvnic functionality into uek-4.1 Revision 8008
Pradeep Gopanapalli [Sat, 7 Nov 2015 02:14:12 +0000 (18:14 -0800)]
Integrate Uvnic functionality into uek-4.1 Revision 8008

Reviewed-by: Sajid Zia <sajid.zia@oracle.com>
Signed-off-by: Pradeep Gopanapalli <pradeep.gopanapalli@oracle.com>
Signed-off-by: Qing Huang <qing.huang@oracle.com>
9 years ago1) S_IRWXU causing kernel soft crash changing to 0644 WARNING: CPU: 0 PID: 20907...
Pradeep Gopanapalli [Sat, 7 Nov 2015 02:11:33 +0000 (18:11 -0800)]
1) S_IRWXU causing kernel soft crash changing to 0644 WARNING: CPU: 0 PID: 20907 at fs/sysfs/group.c:61 create_files+0x171/0x180() Oct 12 21:43:14 ovn87-180 kernel: [252606.588541] Attribute vhba_default_scsi_timeout: Invalid permissions 0700 [Rev 8008]

Reviewed-by: Sajid Zia <sajid.zia@oracle.com>
Signed-off-by: Pradeep Gopanapalli <pradeep.gopanapalli@oracle.com>
Signed-off-by: Qing Huang <qing.huang@oracle.com>
9 years ago1) Support vnic for EDR based platform(uVnic) 2) Supported Types now Type 0 - XSMP_XC...
Pradeep Gopanapalli [Thu, 5 Nov 2015 02:58:15 +0000 (18:58 -0800)]
1) Support vnic for EDR based platform(uVnic) 2) Supported Types now Type 0 - XSMP_XCM_OVN - Xsigo VP780/OSDN standalone Chassis, (add pvi) Type 1 - XSMP_XCM_NOUPLINK - EDR Without uplink (add public-network) Type 2 - XSMP_XCM_UPLINK -EDR with uplink (add public-network <with -if> 3) Intelligence in driver to support all the modes 4) Added Code for printing Multicast LID [Revision 8008] 5) removed style errors

Reviewed-by: Sajid Zia <sajid.zia@oracle.com>
Signed-off-by: Pradeep Gopanapalli <pradeep.gopanapalli@oracle.com>
Signed-off-by: Qing Huang <qing.huang@oracle.com>
9 years agoIB/mlx4: Use vmalloc for WR buffers when needed
Wengang Wang [Thu, 29 Oct 2015 06:32:45 +0000 (14:32 +0800)]
IB/mlx4: Use vmalloc for WR buffers when needed

Orabug: 22025570

There are several hits that WR buffer allocation(kmalloc) failed.
It failed at order 3 and/or 4 contigous pages allocation. At the same time
there are actually 100MB+ free memory but well fragmented.
So try vmalloc when kmalloc failed.

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Acked-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agonet/rds: start rdma listening after ib/iw initialization is done
Qing Huang [Tue, 6 Oct 2015 22:32:22 +0000 (15:32 -0700)]
net/rds: start rdma listening after ib/iw initialization is done

This prevents RDS from handling incoming rdma packets before RDS
completes initializing its recv/send components.

We don't need to call rds_rdma_listen_stop() if rds_rdma_listen_init()
didn't succeed.

We only need to call rds_ib_exit() if rds_ib_init() succeeds but
other parts fail. The same applies to rds_iw_init()/rds_iw_exit().
So we need to change error handling sequence accordingly.

Jump to ib/iw error handling path when we get an err code from
rds_rdma_listen_init().

Orabug: 21684447

Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Signed-off-by: Qing Huang <qing.huang@oracle.com>
Signed-off-by: Mukesh Kacker <mukesh.kacker@oracle.com>
9 years agoMerge branch 'topic/uek-4.1/ofed.rds-p2' into topic/uek-4.1/ofed
Mukesh Kacker [Thu, 22 Oct 2015 09:44:31 +0000 (02:44 -0700)]
Merge branch 'topic/uek-4.1/ofed.rds-p2' into topic/uek-4.1/ofed

* topic/uek-4.1/ofed.rds-p2:
  net/rds: start rdma listening after ib/iw initialization is done