www.infradead.org Git - users/jedix/linux-maple.git/log

rds_rdma: rds_sendmsg should return EAGAIN if connection not setup

In rds_sendmsg(), in the case RDS_CMSG_RDMA_MAP is requested and
rds_cmsg_send() is called, a "struct rds_mr" needs to be created.
For creating the "struct rds_mr", the connection needs to be
established (ready) for rds_ib_transport. Otherwise, __rds_rdma_map()
would fail because it can't find the right rds_ib_device (which is
associated with the ip address matching rds_sock's bound ip address).
The ip address is set at the completion of the rdma connection.

But in the code, the connection is triggered after the call to
rds_cmsg_send(), so rds_cmsg_send() would fail with -ENODEV.

The fix is to move the trigger of connection before calling
rds_cmsg_send() and return -EAGAIN in case connection is not
ready yet when we are calling rds_cmsg_send().
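
The reordering can be pictured with a small user-space model. This is only
a sketch: struct conn_model, cmsg_send_model() and sendmsg_model() are
hypothetical stand-ins, not the kernel's data structures or functions.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the connection state this commit cares about. */
struct conn_model {
    bool up;                /* connection is established (ready) */
    bool connect_requested; /* a connect attempt has been kicked off */
};

/* Models the RDMA-map step: it needs an established connection. */
static int cmsg_send_model(const struct conn_model *c)
{
    return c->up ? 0 : -ENODEV;
}

/*
 * Models the fixed ordering: trigger the connection first, then return
 * -EAGAIN rather than letting the mapping step fail with -ENODEV.
 */
int sendmsg_model(struct conn_model *c, bool wants_rdma_map)
{
    assert(c);

    if (!c->up)
        c->connect_requested = true; /* connect is triggered up front */

    if (!wants_rdma_map)
        return 0;
    if (!c->up)
        return -EAGAIN;              /* caller retries once connected */
    return cmsg_send_model(c);
}
```

In this model a caller that sees -EAGAIN simply retries later, by which
time the connect request issued on the first attempt may have completed.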

Orabug: 21551474

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Reviewed-by: Chien-Hua Yen <chien.yen@oracle.com>

rds_rdma: allocate FMR according to max_item_soft

When max_item on the pool is very large (say 170000), a large number of
alloc_fmr requests would fail with -EAGAIN. This can greatly hurt
performance.

Actually, whether it is going to allocate or wait for pool flush
should be according to "max_item_soft" rather than "max_item" because
we are resizing the pool by changing "max_item_soft" not max_item.
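
A minimal sketch of that decision, together with the gradual growth of the
soft limit that this commit also calls for. The struct, the enum and the
"step" parameter are hypothetical; the real pool code is not shown in this
log, so treat this as one possible reading rather than the implementation.

```c
#include <assert.h>

enum pool_action { POOL_ALLOCATE, POOL_WAIT_FOR_FLUSH };

/* Hypothetical model of an FMR pool; field names follow the commit text. */
struct fmr_pool_model {
    unsigned int item_count;    /* FMRs currently allocated */
    unsigned int max_item_soft; /* current soft limit */
    unsigned int max_item;      /* hard upper bound */
};

/* Decide against the soft limit, not the hard one. */
enum pool_action pool_decide(const struct fmr_pool_model *p)
{
    assert(p);
    return p->item_count < p->max_item_soft ? POOL_ALLOCATE
                                            : POOL_WAIT_FOR_FLUSH;
}

/* After a successful allocation, grow the soft limit by a small step. */
void pool_alloc_succeeded(struct fmr_pool_model *p, unsigned int step)
{
    assert(p && p->max_item_soft <= p->max_item);
    p->item_count++;
    if (p->max_item - p->max_item_soft < step)
        p->max_item_soft = p->max_item; /* never exceed the hard bound */
    else
        p->max_item_soft += step;
}
```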

Also, after successfully allocating a lower-layer FMR, we should increase
"max_item_soft" incrementally, not set it immediately to "max_item".

Orabug: 21551548

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Reviewed-by: Chien-Hua Yen <chien.yen@oracle.com>

rds_rdma: do not dealloc fmrs in the pool under use

In rds_ib_alloc_fmr, when it needs to flush pools, it de-allocates FMRs.
That is time-consuming and serves no functional purpose.

The fix is to not de-allocate the ones in a pool that is under use.

Orabug: 21551548

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Reviewed-by: Chien-Hua Yen <chien.yen@oracle.com>

rds: set fmr pool dirty_count correctly

In rds_ib_flush_mr_pool() the setting of dirty_count is wrong in case
"free_all" is true -- "clean" ones are also counted as dirty. dirty_count
being negative is seen in vmcore.
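
The arithmetic behind the negative value can be illustrated as follows.
Both functions are hypothetical models of the accounting, not the code in
rds_ib_flush_mr_pool() itself.

```c
#include <assert.h>

/* Buggy accounting when free_all is set: clean MRs are subtracted too. */
int dirty_count_buggy(int dirty_count, int freed_dirty, int freed_clean)
{
    return dirty_count - (freed_dirty + freed_clean);
}

/* Corrected accounting: only MRs that were dirty reduce dirty_count. */
int dirty_count_fixed(int dirty_count, int freed_dirty, int freed_clean)
{
    (void)freed_clean; /* clean MRs never contributed to dirty_count */
    assert(freed_dirty <= dirty_count);
    return dirty_count - freed_dirty;
}
```

With 3 dirty and 2 clean MRs freed in one flush, the buggy variant yields
-2 while the corrected one yields 0.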

Orabug: 21551548

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Reviewed-by: Chien-Hua Yen <chien.yen@oracle.com>

Add getsockopt support for SO_RDS_TRANSPORT

The currently attached transport for a PF_RDS socket may be obtained
from user space by invoking getsockopt(2) using the SO_RDS_TRANSPORT
option at the SOL_RDS level. The integer optval returned will be one
of the RDS_TRANS_* constants defined in linux/rds.h.

Orabug: 21061146

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Mukesh Kacker <mukesh.kacker@oracle.com>

Add setsockopt support for SO_RDS_TRANSPORT

An application may deterministically attach the underlying transport for
a PF_RDS socket by invoking setsockopt(2) with the SO_RDS_TRANSPORT
option at the SOL_RDS level. The integer argument to setsockopt must be
one of the RDS_TRANS_* transport types, e.g., RDS_TRANS_TCP. The option
must be specified before invoking bind(2) on the socket, and may only
be used once on the socket. An attempt to set the option on a bound
socket, or to invoke the option after a successful SO_RDS_TRANSPORT
attachment, will return EOPNOTSUPP.
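
From user space, setting and reading back the option might look like the
sketch below. The wrapper names are hypothetical. The fallback values for
SOL_RDS, SO_RDS_TRANSPORT and RDS_TRANS_TCP are believed to mirror upstream
<linux/rds.h>, but that is an assumption; prefer the installed header.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>

/* Fallbacks for older headers (assumed values, see note above). */
#ifndef SOL_RDS
#define SOL_RDS 276
#endif
#ifndef SO_RDS_TRANSPORT
#define SO_RDS_TRANSPORT 8
#endif
#ifndef RDS_TRANS_TCP
#define RDS_TRANS_TCP 2
#endif

/* Attach a transport to an unbound PF_RDS socket; returns 0 or -errno. */
int rds_attach_transport(int fd, int transport)
{
    if (setsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT,
                   &transport, sizeof(transport)) < 0)
        return -errno;
    return 0;
}

/* Read the currently attached transport; returns 0 or -errno. */
int rds_query_transport(int fd, int *transport)
{
    socklen_t len = sizeof(*transport);

    assert(transport);
    if (getsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, transport, &len) < 0)
        return -errno;
    return 0;
}
```

A caller would create the socket with socket(PF_RDS, SOCK_SEQPACKET, 0),
call rds_attach_transport(fd, RDS_TRANS_TCP) before bind(2), and expect
-EOPNOTSUPP if the socket is already bound or already attached.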

Orabug: 21061146

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Mukesh Kacker <mukesh.kacker@oracle.com>

Declare SO_RDS_TRANSPORT and RDS_TRANS_* constants in uapi/linux/rds.h

User space applications that desire to explicitly select the
underlying transport for a PF_RDS socket may do so by using the
SO_RDS_TRANSPORT socket option at the SOL_RDS level before bind().
The integer argument provided to the socket option would be one
of the RDS_TRANS_* values, e.g., RDS_TRANS_TCP. This commit exports
the constant values needed by such applications via <linux/rds.h>.

Orabug: 21061146

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Mukesh Kacker <mukesh.kacker@oracle.com>

RDS-TCP: only initiate reconnect attempt on outgoing TCP socket.

When the peer of an RDS-TCP connection restarts, a reconnect
attempt should only be made from the active side of the TCP
connection, i.e. the side that has a transient TCP port
number. Do not add the passive side of the TCP connection
to the c_hash_node and thus avoid triggering rds_queue_reconnect()
for passive rds connections.

Orabug: 20930687
Upstream commit-id: c82ac7e69efe6dbe370d6ba84e2666d7692ef1c2

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Mukesh Kacker <mukesh.kacker@oracle.com>

RDS-TCP: Always create a new rds_sock for an incoming connection.

When running RDS over TCP, the active (client) side connects to the
listening ("passive") side at the RDS_TCP_PORT. After the connection
is established, if the client side reboots (potentially without even
sending a FIN) the server still has a TCP socket in the established
state. If the server-side now gets a new SYN from the client
with a different client port, TCP will create a new socket-pair, but
the RDS layer will incorrectly pull up the old rds_connection (which
is still associated with the stale t_sock and RDS socket state).

This patch corrects this behavior by having rds_tcp_accept_one()
always create a new connection for an incoming TCP SYN.
The rds and tcp state associated with the old socket-pair is cleaned
up via the rds_tcp_state_change() callback which would typically be
invoked in most cases when the client-TCP sends a FIN on TCP restart,
triggering a transition to CLOSE_WAIT state. In the rarer event of client
death without a FIN, TCP_KEEPALIVE probes on the socket will detect
the stale socket, and the TCP transition to CLOSE state will trigger
the RDS state cleanup.

Orabug: 20930687
Upstream commit-id: f711a6ae062caeee46067b2f2f12ffda319ae73c

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Mukesh Kacker <mukesh.kacker@oracle.com>

rds: directly include header for vmalloc/vfree in ib_recv.c

Directly include <linux/vmalloc.h> in file ib_recv.c

Without it, the build can fail on non-x86 platforms, where the header
may not get included indirectly. The failure looks like this:

CC [M] net/rds/ib_recv.o
net/rds/ib_recv.c: In function ‘rds_ib_srq_init’:
net/rds/ib_recv.c:1530: error: implicit declaration of function ‘vmalloc’
net/rds/ib_recv.c:1531: warning: assignment makes pointer from integer without \
a cast
net/rds/ib_recv.c: In function ‘rds_ib_srq_exit’:
net/rds/ib_recv.c:1590: error: implicit declaration of function ‘vfree’
make[2]: *** [net/rds/ib_recv.o] Error 1
make[1]: *** [net/rds] Error 2
make: *** [net] Error 2

Orabug: 21059667

Signed-off-by: Mukesh Kacker <mukesh.kacker@oracle.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>

rds: return EMSGSIZE for oversize requests before processing/queueing

rds_send_queue_rm() allows for the "current datagram" being queued
to exceed SO_SNDBUF thresholds by checking bytes queued without
counting in length of current datagram. (Since sk_sndbuf is set
to twice requested SO_SNDBUF value as a kernel heuristic this
is usually fine!)

If this "current datagram" squeezing past the threshold is itself
many times the size of the sk_sndbuf threshold itself then even
twice the SO_SNDBUF does not save us and it gets queued but
cannot be transmitted. Threads block and deadlock and device
becomes unusable. The check for this datagram not exceeding
SNDBUF thresholds (EMSGSIZE) is not done on this datagram as
that check is only done if queueing attempt fails.
(Datagrams that follow this datagram fail queueing attempts, go
through the check and eventually trip EMSGSIZE error but zero
length datagrams silently fail!)

This fix moves the check for datagrams exceeding SNDBUF limits
before any processing or queueing is attempted and returns EMSGSIZE
early in the rds_sendmsg() code. This change also ensures that all
datagrams get checked for exceeding SNDBUF/sk_sndbuf size limits
and the large datagrams that exceed those limits do not get to
rds_send_queue_rm() code for processing.
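
The effect of moving the check can be shown with a toy model. The names
sndq_model, queue_rm_model() and send_model() are hypothetical, and the
-EAGAIN returned for a full queue is an assumption of this sketch only.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical send queue: bytes queued so far and the size threshold. */
struct sndq_model {
    size_t queued;
    size_t limit;
};

/*
 * Models the queueing check described above: it looks only at bytes
 * already queued and does not count the current datagram.
 */
static int queue_rm_model(struct sndq_model *q, size_t len)
{
    if (q->queued >= q->limit)
        return 0;
    q->queued += len;
    return 1;
}

/* Fixed ordering: reject oversize datagrams before any queueing. */
int send_model(struct sndq_model *q, size_t len)
{
    assert(q);
    if (len > q->limit)
        return -EMSGSIZE;
    if (!queue_rm_model(q, len))
        return -EAGAIN;
    return 0;
}
```

In this model an oversize datagram never reaches the queue, so it cannot
sit there blocking the datagrams that follow it.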

Orabug: 20971222

Signed-off-by: Mukesh Kacker <mukesh.kacker@oracle.com>
Reviewed-by: Wengang Wang <wen.gang.wang@oracle.com>
Acked-by: Chien Yen <chien.yen@oracle.com>

net: rds: use correct size for max unacked packets and bytes

Max unacked packets/bytes is an int while sizeof(long) was used in the
sysctl table.

This means that when they were getting read we'd also leak kernel memory
to userspace along with the timeout values.
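
The size mismatch can be quantified with a one-line helper. This is a
sketch that assumes the sysctl handler walks the full declared length of
the entry; the helper name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Bytes a reader would touch beyond the variable when the table entry
 * declares maxlen bytes but the variable is only var_size bytes long.
 */
size_t bytes_read_past(size_t var_size, size_t maxlen)
{
    assert(var_size > 0);
    return maxlen > var_size ? maxlen - var_size : 0;
}
```

On a platform where long is wider than int,
bytes_read_past(sizeof(int), sizeof(long)) is non-zero, while using
sizeof(int) for both gives 0.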

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit db27ebb111e9f69efece08e4cb6a34ff980f8896)

Orabug: 20585918

Signed-off-by: Mukesh Kacker <mukesh.kacker@oracle.com>
Signed-off-by: Guangyu Sun <guangyu.sun@oracle.com>

RDS/IP: RDS takes 10 seconds to plumb the second IP back

RDS netdev event handler code effectively waits
for 10 seconds when an event indicates an interface
coming back up from all-interfaces-were-down condition
OR initial state for the first interface to be revived.
(It is checking for address to be plumbed, something
that is done after this wait in case where RDS active
bonding code is doing the address plumbing!)

That code is based on assumptions about delays in
plumbing the interface IP address, which were
likely present due to bugs in legacy code.

With the current code, there is no need for
such checks before failing back interfaces
when an interface is brought back up.

Orabug: 20231857

Signed-off-by: Mukesh Kacker <mukesh.kacker@oracle.com>
Acked-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>

RDS/IB: Tune failover-on-reboot scheduling

In certain platforms (e.g. X5-2) the IB devices are slow
to come up and the start of failover-on-reboot (in RDS active
bonding) is changed to accommodate them. Otherwise all
interfaces get de-activated and then re-activated on
almost every reboot.

We also make sure when all interfaces do get de-activated
by failover-on-reboot, it is not affected by delayed startup
of devices from all-ports-down which is present for other
situations.

The startup interval of first scheduling of failover-on-reboot
on module load is also turned into a module parameter.

Orabug: 20063740

Signed-off-by: Mukesh Kacker <mukesh.kacker@oracle.com>
Acked-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>

RDS: mark netdev UP for intfs added post module load

When interfaces are brought up after module load,
a NETDEV_UP event for which no matching port exists
triggers re-initialization of active bonding data
structures. The initialization however missed marking
the layer as up in the flags tracking which layers are up.
The fix here marks that layer as UP in the flags since
the initialization is triggered by the NETDEV_UP event
processing!

Orabug: 20130536

Signed-off-by: Mukesh Kacker <mukesh.kacker@oracle.com>
Acked-by: Chien Yen <chien.yen@oracle.com>

RDS: Enable use of user named pkey devices

RDS code currently derives pkey associated with
a device by parsing its name. With user-named
pkey devices which do not have pkey as part of
the name, that is not possible and ipoib module
exports api to query the pkey from netdev.
Here we switch to use that api instead.

Orabug: 19064704

Ported from parts of UEK2 commit
a101f6037e882b1c12143416d48345fe7ea62979

Signed-off-by: Mukesh Kacker <mukesh.kacker@oracle.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Chien-Hua Yen <chien.yen@oracle.com>
Signed-off-by: Guangyu Sun <guangyu.sun@oracle.com>

rds: fix list corruption and tx hang when netfilter is used

When rds netfilter is enabled, the issues below can happen; this
patch solves them:

1. rds socket rds-incoming list corruption. This issue happens when
the below code path is executed:
* rds_recv_incoming -> NF_HOOK { nf hook decide to send packet
to local and to origin } ->
* { send to local } -> rds_recv_local -> list_add_tail { rds-incoming
is queued on the rds socket rs_recv_queue list }
* { send to origin } -> rds_recv_route -> rds_recv_forward ->
{ rds_send_internal failed! } -> rds_recv_local -> list_add_tail
{ rds-incoming is queued on the rds socket rs_recv_queue list }
2. rds rx tasklet is forced to tx the incoming packet and so is unable
to process acks from remote and free space in the rds sendbuf:
* rds_recv_incoming -> NF_HOOK { nf hook decide to send packet
to origin } -> rds_recv_route -> rds_recv_forward -> rds_send_internal ->
rds_send_xmit { rx tasklet now do the tx! }

Orabug: 18963548

Signed-off-by: shamir rabinovitch <shamir.rabinovitch@oracle.com>
Tested-by: jun yang <jun.yang@oracle.com>
Tested-by: denise iguma <denise.iguma@oracle.com>
Acked-by: chien yen <chien.yen@oracle.com>
Signed-off-by: Guangyu Sun <guangyu.sun@oracle.com>
(cherry picked from commit 61b5557dde8366245a58402f76d8e2d9f8a18b7d)

RDS: move more queueing for loopback connections to separate queue

All processing of connect/re-connect/disconnect/reject
events in RDS for the (few) local loopback connections
should use the separate workqueue dedicated to them, to
reduce latency, so they do not get stuck behind the
processing of a large number of remote connections.
However, a few instances of such processing do not.
With this fix those are changed to use the workqueue
dedicated to local connections.

Orabug: 18977932

Signed-off-by: Mukesh Kacker <mukesh.kacker@oracle.com>
Acked-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Signed-off-by: Guangyu Sun <guangyu.sun@oracle.com>
(cherry picked from commit a0d407f186e0e4841bf1a24541dbcc0f0973726e)

RDS: add module parameter to allow module unload or not

orabug: 19665303

This patch adds the following feature to ib_ipoib, rds_rdma, ib_core and
mlx4_core.

Adds a module parameter "module_unload_allowed". If the parameter is 1 (the
default value), modules can be unloaded (same behavior as before); otherwise,
if it's 0, the module is not allowed to be unloaded. The parameter can't
be changed while the module is loaded, until the module is unloaded (if it can be).

default values:
ib_ipoib: 1 for YES
rds_rdma: 0 for NO
ib_core: 1 for YES
mlx4_core: 0 for NO

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Acked-by: Joe Jin <joe.jin@oracle.com>
Acked-by: Todd Vierling <todd.vierling@oracle.com>
Acked-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Guangyu Sun <guangyu.sun@oracle.com>
(cherry picked from commit cf1a00039e6fea116e9ea7c82f55ee3ee5319cec)

Conflicts:
drivers/infiniband/core/device.c
drivers/infiniband/ulp/ipoib/ipoib_main.c
drivers/net/ethernet/mellanox/mlx4/main.c

rds: fix NULL pointer dereference panic during rds module unload

This issue reported happens during an unload of rds module with rds
reconnect timeout worker scheduled to execute in the future, and rds
module unloaded earlier than that. rds reconnect timeout worker was
introduced by 8991a87c6c3fc8b17383a140bd6f15a958e31298 (RDS: SA query
optimization) commit. Fix is to flush/cancel any reconnect timeout
workers while performing rds connections destroy which is done during
module unload.

Orabug: 18952475

Signed-off-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com>
Acked-by: Chien Yen <chien.yen@oracle.com>
Signed-off-by: Guangyu Sun <guangyu.sun@oracle.com>
(cherry picked from commit 26c0879e51915b9ba0526d9a3630e08d2cc51a2b)

RDS:active bonding: disable failover across HCAs(failover groups)

Disable experimental code in RDS active bonding which performs
failovers across "failover groups" (HCAs). It causes
instabilities for some applications.

Orabug: 19430773

Signed-off-by: Mukesh Kacker <mukesh.kacker@oracle.com>
Acked-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com>
Signed-off-by: Guangyu Sun <guangyu.sun@oracle.com>
(cherry picked from commit 38d6d011e4757e38781830cc4eebd8d7a8b690fc)

RDS/IB: active bonding - failover down interfaces on reboot.

RDS active bonding detects port down transitions from
active ports notified by events (and performs failover
when notified) but does not detect ports which are down
at boot time.

Changes involve tracking hw port, link layer, and netdev
layer up/down status separately and the aggregate port UP
status is deduced when ALL layers are UP and DOWN is deduced
when any one layer goes down.
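
The aggregation rule can be sketched with a per-port bitmask. The flag and
function names here are hypothetical and chosen for illustration; the
actual field layout in the active bonding code is not shown in this log.

```c
#include <assert.h>
#include <stdbool.h>

/* One bit per tracked layer (hypothetical names). */
#define LAYER_HW     (1u << 0)
#define LAYER_LINK   (1u << 1)
#define LAYER_NETDEV (1u << 2)
#define LAYER_ALL    (LAYER_HW | LAYER_LINK | LAYER_NETDEV)

struct ab_port_state {
    unsigned int up_layers; /* bitmask of layers currently up */
};

/* Record an up/down transition for a single layer. */
void port_layer_set(struct ab_port_state *p, unsigned int layer, bool up)
{
    assert(p);
    if (up)
        p->up_layers |= layer;
    else
        p->up_layers &= ~layer;
}

/* The port is UP only when ALL layers are up; any layer down means DOWN. */
bool port_is_up(const struct ab_port_state *p)
{
    assert(p);
    return (p->up_layers & LAYER_ALL) == LAYER_ALL;
}
```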

A delayed task is scheduled to run at module load time after
a max delay OR on a sysctl trigger from init script after all
devices that are active are brought up. The ports found to be
DOWN are failed over.

Orabug: 18697678

Signed-off-by: Mukesh Kacker <mukesh.kacker@oracle.com>
Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com>
Signed-off-by: Guangyu Sun <guangyu.sun@oracle.com>
(cherry picked from commit 6160b5d17577b0bd4185bdb1b7eda76a5d6fc556)

RDS/IB: Remove dangling rcu_read_unlock() and other cleanups

Delete dangling rcu_read_unlock() which was left behind
when matching rcu_read_lock() and enclosed code was
removed in commit 538f5d0dfa704f4dcb4afa80a1d01b1317b9cd65

All compiler warnings are also fixed.

Orabug: 18995395

Signed-off-by: Mukesh Kacker <mukesh.kacker@oracle.com>
Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com>
Signed-off-by: Guangyu Sun <guangyu.sun@oracle.com>
(cherry picked from commit ba3dcd2f88fbf554f77bb312727c88193decfb75)

rds: new extension header: rdma bytes

Introduce a new extension header type RDSV3_EXTHDR_RDMA_BYTES for
an RDMA initiator to exchange rdma byte counts to its target.
Add new flag to RDS header: RDS_FLAG_EXTHDR_EXTENSION
Add new extension to RDS header: rds_ext_header_rdma_bytes

Please note:
Linux RDS and Solaris RDS have a mismatch in header flags. Solaris
RDS assigned flag 0x08 to RDS_FLAG_EXTHDR_EXTENSION.
Linux already uses 0x08 for flag RDS_FLAG_HB_PING.
This patch requires the below fix from the Solaris side:
BUG 19065367 - unified RDSV3_EXTHDR_RDMA_BYTES with Linux

Orabug: 18468180

Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
Acked-by: Sherman Pun <sherman.pun@oracle.com>
Signed-off-by: Guangyu Sun <guangyu.sun@oracle.com>
(cherry picked from commit 5f4f74d028a0d9f8e0391cc6114314e780e65583)

RDS: Ensure non-zero SL uses correct path before lane 0 connection is dropped

There is an issue with the following scenario:
  * if non-zero lane is going down first with send completion error 12
  * Before lane 0 connection goes down, the peer initiates connection request with
    the non-zero lane
  * This non-zero lane connection request may be using old ARP entries of lane 0

This also fixes race condition between connection establishment and drop for
following scenario:
  * non-zero lane connection dropped
  * non-zero connection is initiated and this time it finds proper route and
    connection request goes through.
  * before non-zero lane connection is established at RDS layer,
    zero lane connection is getting dropped.
  * now this zero-lane connection will drop non-zero lane connection as well
    (with the assumption that non-zero lane did not find proper route).
  * when non-zero lane connection establishment event is received (REP packet),
    we have a race between connection establishment event on one CPU and
    connection drop on other CPU.

Orabug: 19133664

Signed-off-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Signed-off-by: Chien-Hua Yen <chien.yen@oracle.com>
Reviewed-by: Bang Nguyen <bang.nguyen@oracle.com>
Signed-off-by: Guangyu Sun <guangyu.sun@oracle.com>
(cherry picked from commit 47d8e78f82872bbb9d709a0743ea2bdb2e9f6cbb)

rds: Lost locking in loop connection freeing

upstream commit: 58c490babd4b425310363cbd1f406d7e508f77a5

rds: Lost locking in loop connection freeing

The conn is removed from list in there and this requires
proper lock protection.

Orabug: 19265200

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Guangyu Sun <guangyu.sun@oracle.com>
(cherry picked from commit 92ad60f3efabfcd9eb40e983a131c730829c0d90)

RDS: active bonding - failover/failback only to matching pkey

The active bonding code does not take the pkey match into
account for failover/failback of ports.

Support is added so failover/failback happen only to ports
with matching pkeys.
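
A minimal sketch of pkey-aware target selection, assuming a flat array of
ports. struct ab_port and pick_failover_port() are hypothetical names, and
the "first match wins" policy is an assumption of this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-port record: partition key and aggregate UP state. */
struct ab_port {
    uint16_t pkey;
    bool up;
};

/*
 * Return the index of the first UP port whose pkey matches that of the
 * failed port, or -1 when no such port exists.
 */
int pick_failover_port(const struct ab_port *ports, size_t n, size_t failed)
{
    assert(ports && failed < n);
    for (size_t i = 0; i < n; i++) {
        if (i == failed || !ports[i].up)
            continue;
        if (ports[i].pkey == ports[failed].pkey)
            return (int)i;
    }
    return -1;
}
```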

Orabug: 18681364

Signed-off-by: Mukesh Kacker <mukesh.kacker@oracle.com>
Signed-off-by: Chien-Hua Yen <chien.yen@oracle.com>
(cherry picked from commit fa3518c7642293f243e7135336c6c8b25a406b03)

RDS: active bonding - ports may not failback if all ports go down

When active bonding is enabled, ports failover and failback if
ports are enabled/disabled. If ALL ports go down, then no
failover happens since there is no available port to failover to.
In that case, some ports not hosting any migrated interfaces are
not resurrected.

The fix resurrects ports when they don't have an IP address set and
are failing back and RDS active bonding has state on them. Some
debug/log messages are also improved, such as to indicate when
a failover fails to happen.

Orabug: 18875563

Signed-off-by: Mukesh Kacker <mukesh.kacker@oracle.com>
Signed-off-by: Chien-Hua Yen <chien.yen@oracle.com>
(cherry picked from commit 65e125ed4457a25bebe9d69c267a3ce06bf75961)

RDS: Use rds_local_wq for loopback connections in rds_conn_connect_if_down()

Orabug: 18892380

This patch extends commit 0715fe8 "RDS: add workqueue for local
loopback connections" to rds_conn_connect_if_down() and
rds_connect_complete().

Signed-off-by: Chien-Hua Yen <chien.yen@oracle.com>
Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
Signed-off-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
(cherry picked from commit 3209dc535f742c59530d5a5dfdd003b501967e88)

RDS: add workqueue for local loopback connections

Switch reboot takes too long to recover

Orabug: 18892366

Signed-off-by: Chien-Hua Yen <chien.yen@oracle.com>
Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
Signed-off-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
(cherry picked from commit 38a7f3de7dfbae339ff82149277f7794cd925711)

RDS: SA query optimization

SA query optimization
The fact is all QoS lanes share the same physical path
b/w an IP pair. The only difference is the service level
that affects the quality of service for each lane. With
that, we have the following optimization:

1. Lane 0 to issue SA query request to the SM. All other
lanes will wait for lane 0 to finish route resolution,
then copy in the resolved path and fill in its service
level.

2. One-side reconnect to reduce reconnect racing, thus
further reducing the number of SA queries to the SM.
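
Item 1 above amounts to copying lane 0's resolved path and overriding only
the service level. The struct below is a hypothetical, heavily reduced
path record used purely for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, reduced path record: just enough to show the idea. */
struct path_model {
    uint16_t dlid; /* destination LID from the resolved route */
    uint16_t slid; /* source LID from the resolved route */
    uint8_t  sl;   /* service level, differs per QoS lane */
};

/* A non-zero lane reuses lane 0's route and fills in its own SL. */
void lane_inherit_path(struct path_model *lane,
                       const struct path_model *lane0, uint8_t lane_sl)
{
    assert(lane && lane0);
    *lane = *lane0;     /* same physical path between the IP pair */
    lane->sl = lane_sl; /* only the service level changes */
}
```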

Reducing brownout for non-zero lanes
In some case, RDMA CM is delaying the disconnect event
after switch/node failure and this is causing extra
brownout for RDS reconnection. The workaround is to have
lane 0 probe other lanes by sending a HB msg. If the lane
is down, this will cause a send completion error and an
immediate reconnect.

Orabug: 18801977

Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
Signed-off-by: Mukesh Kacker <mukesh.kacker@oracle.com>
(cherry picked from commit 8f84b1ff46e449e99c5fcf4d4f94dc2e8ea82cd7)
Signed-off-by: Jerry Snitselaar <jerry.snitselaar@oracle.com>
(cherry picked from commit 8991a87c6c3fc8b17383a140bd6f15a958e31298)

RDS: Remove cond_resched() in RX tasklet

Re-install the base fix 17829338 and replace
spin_lock_irqsave(rx_lock)/spin_unlock_irqrestore(rx_lock) with
spin_lock_bh(rx_lock)/spin_unlock_bh(rx_lock) to resolve bugs 18413711
and 18461816. rx_lock is used to prevent concurrent reaping b/w the
RX tasklet and worker.

Orabug: 18801937

Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
Signed-off-by: Chien-Hua Yen <chien.yen@oracle.com>
Tested-by: Arvind Shukla <arvind.shukla@oracle.com>
(cherry picked from commit 409138bae9be49ee9782eed244a20774d61d6208)
Signed-off-by: Jerry Snitselaar <jerry.snitselaar@oracle.com>
(cherry picked from commit cb2cb09bc520f2915a7c9c2eb1072d936a7b64b6)

RDS: Replace queue_work() by cond_resched() in the tasklet to breakup RX stream

Orabug: 18801931

Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
Signed-off-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
(cherry picked from commit 74723dd3283d1bf2b352e5f71fe27340283716ed)
Signed-off-by: Jerry Snitselaar <jerry.snitselaar@oracle.com>
(cherry picked from commit 37f5f84933725ad41644c0507b822a2d6b4e6cf0)

RDS: looping to reap cq recv queue in rds_conn_shutdown

Orabug: 18501034

Signed-off-by: Chien-Hua Yen <chien.yen@oracle.com>
Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
(cherry picked from commit e1bab7c0e9af8e56e973c1c65f4f3f7474979c66)
Signed-off-by: Jerry Snitselaar <jerry.snitselaar@oracle.com>
(cherry picked from commit a148e65e2d97a3e4327103b1e49c7cf74533f46d)

rds: Fix regression in dynamic active bonding configuration

OraBug: 18481645

Commit 6cf7cc30 "rds: dynamic active bonding configuration" introduced
a regression. Late joining IB interfaces were not configured
correctly. When active bonding configuration shows both interfaces
down, IP failover was not happening. This patch fixes this issue.

Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
Signed-off-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
(cherry picked from commit 12755dbf7b4adc8ea2a935900b81c384731f6fff)
Signed-off-by: Jerry Snitselaar <jerry.snitselaar@oracle.com>
(cherry picked from commit 12eebc4b9e28c3899089277ff9725bdcff1829aa)

rds/rdma_cm: send RDMA_CM_EVENT_ADDR_CHANGE event for active bonding

Orabug: 18421516

This patch is forward ported from ofa-2.6.32-400.1.1.el5.x86_64-1.5.5-4.1.15

Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
Signed-off-by: Avneesh Pant <avneesh.pant@oracle.com>
Signed-off-by: Chien Yen <chien.yen@oracle.com>
Signed-off-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
(cherry picked from commit b90f280baeedf4a56fae0c248d108ae118bb94ab)
Signed-off-by: Jerry Snitselaar <jerry.snitselaar@oracle.com>
Conflicts:
drivers/infiniband/core/cma.c

RDS: Idle QoS connections during remote peer reboot causing application brownout

This fix addresses the issue with the idled QoS connection not getting
disconnect event when the remote peer reboots. This is causing delayed
reconnect, hence application brownout when the peer comes online. The fix was
to proactively drop and reconnect them when the base lane is going through
the reconnect to the reboot peer, in effect forcing all the lanes to go
through the reconnect at the same time.

Orabug: 18443194

Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
Signed-off-by: Chien-Hua Yen <chien.yen@oracle.com>
(cherry picked from commit f51ccefb3a0b9485da5cc5f66bb1e311f61bd70b)

rds: dynamic active bonding configuration

OraBug: 18404635

This patch allows late joining interfaces to participate in IP
failover/failback operations.

Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
Signed-off-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Tested-by: Liwen Huang <liwen.huang@oracle.com>
(cherry picked from commit eb1b61e17b5cde30f798b4a8e5b4f60c62997da9)

RDS: Fix slowdown when doing massively parallel workload

In shutdown, reap the Completion Queue Entry (CQE)
periodically while waiting for RX ring to quiesce

Reject new send if rds-info is pending to avoid
rcu stall

Break RX stream into work units of 10k messages each
and schedule them sequentially to avoid rcu stall

Orabug: 18362838

Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
Signed-off-by: Mukesh Kacker <mukesh.kacker@oracle.com>
Acked-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com>
Tested-by: Arvind Shukla <arvind.shukla@oracle.com>
(cherry picked from commit dac771f1e55713b8a42bdffa059e1894e1ecdf17)

RDS: active bonding needs to set brcast and mask for its primary interface

Only IP address is set in the current implementation. This patch
adds the setting for broadcast address and netmask of failback
interface.

Orabug: 18479088

Signed-off-by: Chien-Hua Yen <chien.yen@oracle.com>
Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
(cherry picked from commit 5cad478d7148ac4b6fc2d6eb78d6bf5a576d69e1)

RDS: bind hash table size increase, add per-bucket rw lock

The single global rw lock in bind hash table is replaced with
a per-bucket lock. The size of the hash table is increased
from 1024 to 8192.

Orabug: 18071861

Tested-by: Michael Nowak <michael.nowak@oracle.com>
Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
Signed-off-by: Mukesh Kacker <mukesh.kacker@oracle.com>
Acked-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
(cherry picked from commit 4fc6d8f3870b5ac39415f475b2a5817a5a89d48e)

RDMA CM: Add reason code for IB_CM_REJ_CONSUMER_DEFINED

RDS: Support rolling downgrade from version 4.1 to 3.1

Orabug: 17484682

Signed-off-by: Giri Adari <giri.adari@oracle.com>
Signed-off-by: Richard Frank <richard.frank@oracle.com>
Signed-off-by: Chien-Hua Yen <chien.yen@oracle.com>
(cherry picked from commit 7b66ddd7f6a5b023191d74949fab41af245775a3)
Signed-off-by: Jerry Snitselaar <jerry.snitselaar@oracle.com>
Conflicts:
net/rds/rds.h
(cherry picked from commit 0373566ba0d74f655ae83e09748f7cc8d553f351)

RDS: protocol negotiation fails during reconnect

RDS fails to realize the negotiated version and keeps rejecting the peer's
connect request, causing the connection to stall during reconnect.

Orabug: 17375389

Signed-off-by: Giri Adari <giri.adari@oracle.com>
(cherry picked from commit 34a2edd27ec65cbeb003a3d2e22cb6da57e90798)

RDS: double free rdma_cm_id

RDS currently offloads rdma_destroy_id() to an aux thread as part of the
connection shutdown. This was to workaround a bug in which rdma_destroy_id()
could block and cause RDS reconnect to hang. By queuing the rdma_destroy_id()
work, we unfortunately open up a timing window in which the pending
CMA_ADDR_QUERY request might not get canceled right away and race with
rdma_destroy_id().

In this case, rdma_destroy_id() gets called and frees the cm id. Then,
CMA_ADDR_QUERY completes and calls RDS event handler which calls
rds_resolve_route on the destroyed cm id. The event handler returns failure
which causes RDMA CM to call rdma_destroy_id() again on the same cm id!
Hence the problem.

Since the rdma_destroy_id() bug has been fixed by MLX to offload the blocking
operation to the worker thread, RDS no longer needs to queue up
rdma_destroy_id(). This closes up the window above and fixes the problem.

Orabug: 17192816

Signed-off-by: Richard Frank <richard.frank@oracle.com>
(cherry picked from commit 3fec98717bf926d869d049e17baad849d1ba7d78)

RDS: ActiveBonding IP exclusion filter

Define a set of Exclusion IPs that RDS ActiveBond should not manage.
This is required to ensure that RDS AB can coexist with other HA IP
daemons.

New parameter rds_ib_active_bonding_excl_ips was added to filter out IPs
not managed by ActiveBonding.

Syntax:
rds_ib_active_bonding_excl_ips=[<IP>/<prefix>][,<IP>/<prefix>]*

Default:
rds_ib_active_bonding_excl_ips=169.254/16,172.10/16
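
How such a list could be matched against an address is sketched below.
This is not the module's parser: the function names are hypothetical, and
the sketch assumes that missing trailing octets (as in "169.254/16") are
treated as zero. Addresses are IPv4 values in host byte order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Parse one "<IP>/<prefix>" entry; stops at ',' or end of string. */
static bool parse_entry(const char *s, uint32_t *net, uint32_t *mask)
{
    uint32_t addr = 0;
    unsigned long prefix;
    int octets = 0;
    char *end;

    while (octets < 4) {
        unsigned long v = strtoul(s, &end, 10);

        if (end == s || v > 255)
            return false;
        addr |= (uint32_t)v << (8 * (3 - octets));
        octets++;
        s = end;
        if (*s != '.')
            break;
        s++;
    }
    if (*s != '/')
        return false;
    s++;
    prefix = strtoul(s, &end, 10);
    if (end == s || prefix > 32)
        return false;
    *mask = prefix ? 0xffffffffu << (32 - prefix) : 0;
    *net = addr & *mask;
    return true;
}

/* True when ip falls inside any entry of the comma-separated list. */
bool ip_is_excluded(const char *list, uint32_t ip)
{
    assert(list);
    while (*list) {
        uint32_t net, mask;

        if (parse_entry(list, &net, &mask) && (ip & mask) == net)
            return true;
        list = strchr(list, ',');
        if (!list)
            break;
        list++;
    }
    return false;
}
```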

Orabug: 17075950

Signed-off-by: Richard Frank <richard.frank@oracle.com>
(cherry picked from commit 477e03cf1c378d3a37ec9fa586912d69397b35be)

RDS: Reconnect stalls for 15s

On Switch reboot, both end nodes would try to reconnect at the same time.
This can cause a race b/w the gratuitous ARP and the IP resolution,
resulting in a path to a down port. The CONNECT request sent on this path
is stuck until the 15s timeout at which time the connection is dropped
and re-established.

The fix was to introduce a reconnect delay b/w the ARP and the reconnect
to minimize the race and avoid the 15s timeout.

Orabug: 17277974

Signed-off-by: Richard Frank <richard.frank@oracle.com>
(cherry picked from commit 72faa77d3d695cbbdddc9d99be6abdfce1187bc6)

RDS: Reconnect causes panic at completion phase

The connection can be shut down while it is processing the
RDMA_CM_EVENT_ESTABLISHED event, and this can lead to panic if the cm_id
has been destroyed.

The fix was to drop the connection if the cm_id has been destroyed.

Orabug: 17213597

Signed-off-by: Richard Frank <richard.frank@oracle.com>
(cherry picked from commit 000fdbea7eab93fc55c45de7302b6560fd41b7f1)

RDS: added stats to track and display receive side memory usage

Added these stats:
1. per-connection stat for number of receive buffers in cache
2. global stat for the same across all connections
3. number of bytes in socket receive buffer
Since stats are implemented using per-cpu variables and RDS currently
does unsigned arithmetic to add them up, separate counters (one for
addition and one for subtraction) are used for (2) and (3).
In the future we might change it to signed computation.
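
A minimal userspace sketch of the two-counter scheme described above. The
names (sketch_percpu_stat, sketch_stat_add, sketch_stat_sub,
sketch_stat_read) and the fixed CPU count are hypothetical and do not come
from the RDS sources. The idea is that a buffer may be accounted on one CPU
and released on another, so each CPU only ever increments its own slots and
the difference is taken after summing.

```c
#include <stdint.h>

#define SKETCH_NR_CPUS 4	/* hypothetical CPU count for the sketch */

/* Hypothetical per-CPU slots; both arrays only ever grow. */
struct sketch_percpu_stat {
	uint64_t added[SKETCH_NR_CPUS];
	uint64_t removed[SKETCH_NR_CPUS];
};

/* Account n units on the given CPU. */
void sketch_stat_add(struct sketch_percpu_stat *s, int cpu, uint64_t n)
{
	s->added[cpu] += n;
}

/* Release n units on the given CPU (may differ from the adding CPU). */
void sketch_stat_sub(struct sketch_percpu_stat *s, int cpu, uint64_t n)
{
	s->removed[cpu] += n;
}

/* Sum each counter across CPUs, then subtract the totals once. */
uint64_t sketch_stat_read(const struct sketch_percpu_stat *s)
{
	uint64_t add = 0, sub = 0;
	int cpu;

	for (cpu = 0; cpu < SKETCH_NR_CPUS; cpu++) {
		add += s->added[cpu];
		sub += s->removed[cpu];
	}
	return add - sub;
}
```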

Orabug: 17045536

Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
(cherry picked from commit 4631300fcf86d459d5dbb09791ff9198c51feab1)

RDS: RDS reconnect stalls

After successfully negotiating the version at lower protocol, RDS incorrectly
set the proposed version to the higher protocol, causing the subsequent
reconnect to stall.

The fix was not to change the proposed version after the initial connection
setup.

Orabug: 1731355

Signed-off-by: Richard Frank <richard.frank@oracle.com>
(cherry picked from commit 1a14dda0d3d3195306f3a141227eb003e895fb58)

RDS: disable IP failover if device removed

IP failover after the device has been removed can lead to panic.

The fix is to disable IP failover if the underlying device has been removed.

Orabug: 17206167

Signed-off-by: zheng.li <zheng.x.li@oracle.com>
(cherry picked from commit 84fc44b9e9fa00354892ef491d09d5eb727943b7)

RDS: Fix a bug in QoS protocol negotiation

Fix bug that may cause the connection to downgrade to lower
protocol. Also, don't negotiate protocol on reconnect.

Orabug: 17079972

Signed-off-by: Giri adari <giri.adari@oracle.com>
(cherry picked from commit 4962a6def99ce1b80212198ebc96700a51dee694)

RDS: alias failover is not working properly

This can lead to crashes or duplicate addresses. Alias will be
failed over in the following form:

e.g., ib0:<alias> -> ib1:P**:<alias>

Orabug: 17177994

Signed-off-by: zheng.li <zheng.x.li@oracle.com>
(cherry picked from commit 049a5ec115391ef1ad171825c4b7630550ae3328)

add NETFILTER support

Orabug: 17082619
Adds the ability for the RDS code to support the NETFILTER kernel interfaces.
This allows for packet inspection, modification, and potential redirection as
the packets flow through the lower layers of the RDS code.

Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
(Ported from UEK2 commit 1913973db561fd6db2e495d3b95e6f8c78b3ba23)

Signed-off-by: Mukesh Kacker <mukesh.kacker@oracle.com>

RDS: Local address resolution may be delayed after IP has moved. RDS to update local ARP cache directly to speed it up.

Orabug: 16979994

Signed-off-by: zheng.li <zheng.x.li@oracle.com>
Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
(cherry picked from commit e95af0c38f586f88521fe81432cf705748d366f9)

RDS: restore two-sided reconnect with the lower IP node having a constant 100 ms backoff.

Orabug: 16710287

Signed-off-by: Richard Frank <richard.frank@oracle.com>
(cherry picked from commit 1e165f6511abd1d57e4be79f1a3a430c98a7576e)

rds: set correct msg_namelen

commit 06b6a1cf6e776426766298d055bb3991957d90a7 upstream.

CVE-2012-3430

Jay Fenlason (fenlason@redhat.com) found a bug,
that recvfrom() on an RDS socket can return the contents of random kernel
memory to userspace if it was called with an address length larger than
sizeof(struct sockaddr_in).
rds_recvmsg() also fails to set the addr_len parameter properly before
returning, but that's just a bug.
There are also a number of cases where recvfrom() can return an entirely bogus
address. Anything in rds_recvmsg() that returns a non-negative value but does
not go through the "sin = (struct sockaddr_in *)msg->msg_name;" code path
at the end of the while(1) loop will return up to 128 bytes of kernel memory
to userspace.

And I write two test programs to reproduce this bug, you will see that in
rds_server, fromAddr will be overwritten and the following sock_fd will be
destroyed.
Yes, it is the programmer's fault to set msg_namelen incorrectly, but it is
better to make the kernel copy the real length of address to user space in
such case.
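
The intended behaviour can be modelled in userspace roughly as follows. This
is only a sketch, not the kernel change itself: sketch_fill_sender_name() is
a hypothetical helper, and rejecting a too-small buffer by reporting a zero
length is an assumption of this model. The point it shows is that the
receive path writes exactly sizeof(struct sockaddr_in) bytes and reports
that length, however large the caller claimed its buffer to be.

```c
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/*
 * Hypothetical model of the receive path: copy the sender address and
 * set msg_namelen to the number of bytes actually written.
 */
void sketch_fill_sender_name(struct msghdr *msg,
			     const struct sockaddr_in *from)
{
	/* Assumption: no usable buffer means no address is reported. */
	if (!msg->msg_name || msg->msg_namelen < sizeof(*from)) {
		msg->msg_namelen = 0;
		return;
	}
	/* Never write more than the real address size. */
	memcpy(msg->msg_name, from, sizeof(*from));
	msg->msg_namelen = sizeof(*from);
}
```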

How to run the test programs ?
I test them on 32bit x86 system, 3.5.0-rc7.

1 compile
gcc -o rds_client rds_client.c
gcc -o rds_server rds_server.c

2 run ./rds_server on one console

3 run ./rds_client on another console

4 you will see something like:
server is waiting to receive data...
old socket fd=3
server received data from client:data from client
msg.msg_namelen=32
new socket fd=-1067277685
sendmsg()
: Bad file descriptor

/***************** rds_client.c ********************/

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void)
{
	int sock_fd;
	struct sockaddr_in serverAddr;
	struct sockaddr_in toAddr;
	char recvBuffer[128] = "data from client";
	struct msghdr msg;
	struct iovec iov;

	sock_fd = socket(AF_RDS, SOCK_SEQPACKET, 0);
	if (sock_fd < 0) {
		perror("create socket error\n");
		exit(1);
	}

	memset(&serverAddr, 0, sizeof(serverAddr));
	serverAddr.sin_family = AF_INET;
	serverAddr.sin_addr.s_addr = inet_addr("127.0.0.1");
	serverAddr.sin_port = htons(4001);

	if (bind(sock_fd, (struct sockaddr*)&serverAddr, sizeof(serverAddr)) < 0) {
		perror("bind() error\n");
		close(sock_fd);
		exit(1);
	}

	memset(&toAddr, 0, sizeof(toAddr));
	toAddr.sin_family = AF_INET;
	toAddr.sin_addr.s_addr = inet_addr("127.0.0.1");
	toAddr.sin_port = htons(4000);
	msg.msg_name = &toAddr;
	msg.msg_namelen = sizeof(toAddr);
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_iov->iov_base = recvBuffer;
	msg.msg_iov->iov_len = strlen(recvBuffer) + 1;
	msg.msg_control = 0;
	msg.msg_controllen = 0;
	msg.msg_flags = 0;

	if (sendmsg(sock_fd, &msg, 0) == -1) {
		perror("sendto() error\n");
		close(sock_fd);
		exit(1);
	}

	printf("client send data:%s\n", recvBuffer);

	memset(recvBuffer, '\0', 128);

	msg.msg_name = &toAddr;
	msg.msg_namelen = sizeof(toAddr);
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_iov->iov_base = recvBuffer;
	msg.msg_iov->iov_len = 128;
	msg.msg_control = 0;
	msg.msg_controllen = 0;
	msg.msg_flags = 0;
	if (recvmsg(sock_fd, &msg, 0) == -1) {
		perror("recvmsg() error\n");
		close(sock_fd);
		exit(1);
	}

	printf("receive data from server:%s\n", recvBuffer);

	close(sock_fd);

	return 0;
}

/***************** rds_server.c ********************/

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void)
{
	struct sockaddr_in fromAddr;
	int sock_fd;
	struct sockaddr_in serverAddr;
	unsigned int addrLen;
	char recvBuffer[128];
	struct msghdr msg;
	struct iovec iov;

	sock_fd = socket(AF_RDS, SOCK_SEQPACKET, 0);
	if(sock_fd < 0) {
		perror("create socket error\n");
		exit(0);
	}

	memset(&serverAddr, 0, sizeof(serverAddr));
	serverAddr.sin_family = AF_INET;
	serverAddr.sin_addr.s_addr = inet_addr("127.0.0.1");
	serverAddr.sin_port = htons(4000);
	if (bind(sock_fd, (struct sockaddr*)&serverAddr, sizeof(serverAddr)) < 0) {
		perror("bind error\n");
		close(sock_fd);
		exit(1);
	}

	printf("server is waiting to receive data...\n");
	msg.msg_name = &fromAddr;

	/*
	 * I add 16 to sizeof(fromAddr), ie 32,
	 * and pay attention to the definition of fromAddr,
	 * recvmsg() will overwrite sock_fd,
	 * since kernel will copy 32 bytes to userspace.
	 *
	 * If you just use sizeof(fromAddr), it works fine.
	 * */
	msg.msg_namelen = sizeof(fromAddr) + 16;
	/* msg.msg_namelen = sizeof(fromAddr); */
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_iov->iov_base = recvBuffer;
	msg.msg_iov->iov_len = 128;
	msg.msg_control = 0;
	msg.msg_controllen = 0;
	msg.msg_flags = 0;

	while (1) {
		printf("old socket fd=%d\n", sock_fd);
		if (recvmsg(sock_fd, &msg, 0) == -1) {
			perror("recvmsg() error\n");
			close(sock_fd);
			exit(1);
		}
		printf("server received data from client:%s\n", recvBuffer);
		printf("msg.msg_namelen=%d\n", msg.msg_namelen);
		printf("new socket fd=%d\n", sock_fd);
		strcat(recvBuffer, "--data from server");
		if (sendmsg(sock_fd, &msg, 0) == -1) {
			perror("sendmsg()\n");
			close(sock_fd);
			exit(1);
		}
	}

	close(sock_fd);
	return 0;
}

Signed-off-by: Weiping Pan <wpan@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Guangyu Sun <guangyu.sun@oracle.com>
(cherry picked from commit eb3ccc4c696e5c4a10d324886fd061ea88bab6c4)

RDS: IP config needs to be updated when network/rdma service restarted.

Orabug: 16963884

Signed-off-by: zheng.li <zheng.x.li@oracle.com>
(cherry picked from commit 00e79a242561efec173dab4640a3eaad50b1f4b3)

RDS: check for valid rdma id before initiating connection

Connection could have been dropped while the route is being resolved
so check for valid rdma id before initiating the connection.

Orabug: 16857341

Signed-off-by: zheng.li <zheng.x.li@oracle.com>
(cherry picked from commit 5528367d56539f817182faa1f0ea35779ccac14e)

RDS: reduce slab memory usage

Both rds_ib_incoming and rds_ib_frag slab objects are incorrectly
aligned, causing significant increase in slab memory usage by RDS.
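
To illustrate how alignment alone can inflate per-object memory, here is a
small sketch of the usual round-up computation. The helper name and the
sizes used in the note below are hypothetical; the commit does not state the
actual object sizes or alignments involved.

```c
#include <stddef.h>

/*
 * Round size up to a multiple of align. align must be a power of two.
 * Hypothetical helper, not taken from the RDS or slab code.
 */
size_t sketch_aligned_size(size_t size, size_t align)
{
	return (size + align - 1) & ~(align - 1);
}
```

For example, a 72-byte object rounded up to a 64-byte boundary occupies 128
bytes, whereas with 8-byte alignment it stays at 72 bytes.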

Orabug: 16935507

Signed-off-by: Richard Frank <richard.frank@oracle.com>
(cherry picked from commit a7cf83092e6ad5c2d842c34b17436d4aafd00b54)

RDS: Move connection along with IP when failing over/back.

Orabug: 16916648

Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
Acked-by: Zheng Li <zheng.x.li@oracle.com>
Signed-off-by: Jerry Snitselaar <jerry.snitselaar@oracle.com>
(cherry picked from commit 78b7d86911046c3a10ffa52d90f4f1a4523d7ac3)

RDS: Rename HAIP parameters to Active Bonding

Orabug: 16810395
Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
(cherry picked from commit 5fc4ef482f653e8510875e5fe0fd6936b5133d15)

rds shouldn't release fmr when ib_device was already released.

Orabug: 16605377

When rds_ib_remove_one() returns, the driver's mlx4_ib_removeone
function destroys the ib_device, so we must set rds_ibdev->dev
to NULL. Otherwise, when the rds connection is released later,
rds_ib_dev_free() uses the ib_device (i.e. rds_ibdev->dev) to
release the mr and fmr, and reusing the already released
ib_device causes a crash.

Signed-off-by: zheng.x.li@oracle.com
Signed-off-by: bang.nguyen@oracle.com

rds remove dev race.

Orabug: 16605377

RDS: make sure rds_ib_remove_one() returns only after the device is freed.

This is to avoid possible race condition in which rds_ib_remove_one() returns
prematurely and IB removes the underlying device. RDS later tries to free the
device and trips over.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
Signed-off-by: bang.nguyen@oracle.com
(cherry picked from commit 62dab719ea687129dc52df2c2eec3b730d628b7a)

reinit ip_config when service rdma restart.

Orabug: 16605377

Reinitialize the rds ip_config when net_device REGISTER and UNREGISTER
events happen; this reassigns new values to ip_config's members dev and
rds_ibdev.

Signed-off-by: zheng.x.li@oracle.com
Signed-off-by: bang.nguyen@oracle.com
(cherry picked from commit 864b4ee41637414ae7916f740441cfa6509bc8dc)

rds: limit the size allocated by rds_message_alloc()

Orabug: 16837486
[ Upstream commit ece6b0a2b25652d684a7ced4ae680a863af041e0 ]

Dave Jones reported the following bug:

"When fed mangled socket data, rds will trust what userspace gives it,
and tries to allocate enormous amounts of memory larger than what
kmalloc can satisfy."

WARNING: at mm/page_alloc.c:2393 __alloc_pages_nodemask+0xa0d/0xbe0()
Hardware name: GA-MA78GM-S2H
Modules linked in: vmw_vsock_vmci_transport vmw_vmci vsock fuse bnep dlci bridge 8021q garp stp mrp binfmt_misc l2tp_ppp l2tp_core rfcomm s
Pid: 24652, comm: trinity-child2 Not tainted 3.8.0+ #65
Call Trace:
[<ffffffff81044155>] warn_slowpath_common+0x75/0xa0
[<ffffffff8104419a>] warn_slowpath_null+0x1a/0x20
[<ffffffff811444ad>] __alloc_pages_nodemask+0xa0d/0xbe0
[<ffffffff8100a196>] ? native_sched_clock+0x26/0x90
[<ffffffff810b2128>] ? trace_hardirqs_off_caller+0x28/0xc0
[<ffffffff810b21cd>] ? trace_hardirqs_off+0xd/0x10
[<ffffffff811861f8>] alloc_pages_current+0xb8/0x180
[<ffffffff8113eaaa>] __get_free_pages+0x2a/0x80
[<ffffffff811934fe>] kmalloc_order_trace+0x3e/0x1a0
[<ffffffff81193955>] __kmalloc+0x2f5/0x3a0
[<ffffffff8104df0c>] ? local_bh_enable_ip+0x7c/0xf0
[<ffffffffa0401ab3>] rds_message_alloc+0x23/0xb0 [rds]
[<ffffffffa04043a1>] rds_sendmsg+0x2b1/0x990 [rds]
[<ffffffff810b21cd>] ? trace_hardirqs_off+0xd/0x10
[<ffffffff81564620>] sock_sendmsg+0xb0/0xe0
[<ffffffff810b2052>] ? get_lock_stats+0x22/0x70
[<ffffffff810b24be>] ? put_lock_stats.isra.23+0xe/0x40
[<ffffffff81567f30>] sys_sendto+0x130/0x180
[<ffffffff810b872d>] ? trace_hardirqs_on+0xd/0x10
[<ffffffff816c547b>] ? _raw_spin_unlock_irq+0x3b/0x60
[<ffffffff816cd767>] ? sysret_check+0x1b/0x56
[<ffffffff810b8695>] ? trace_hardirqs_on_caller+0x115/0x1a0
[<ffffffff81341d8e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[<ffffffff816cd742>] system_call_fastpath+0x16/0x1b
---[ end trace eed6ae990d018c8b ]---
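
The kind of bound check that avoids such a request can be sketched in
userspace as follows. The cap (SKETCH_MAX_ALLOC), the structure, and the
function name are hypothetical stand-ins, not the kernel's definitions. The
comparison is written so that adding the header size to a huge
caller-supplied length cannot overflow.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical cap standing in for the allocator's maximum size. */
#define SKETCH_MAX_ALLOC ((size_t)4 << 20)

/* Hypothetical stand-in for the message header. */
struct sketch_message {
	uint32_t flags;
	size_t extra_len;
};

/* Return NULL instead of attempting an oversized allocation. */
struct sketch_message *sketch_message_alloc(size_t extra_len)
{
	struct sketch_message *m;

	/* Reject before computing sizeof(*m) + extra_len. */
	if (extra_len > SKETCH_MAX_ALLOC - sizeof(*m))
		return NULL;

	m = calloc(1, sizeof(*m) + extra_len);
	if (m)
		m->extra_len = extra_len;
	return m;
}
```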

Reported-by: Dave Jones <davej@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
Signed-off-by: Cong Wang <amwang@redhat.com>
Acked-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Guangyu Sun <guangyu.sun@oracle.com>
(cherry picked from commit 1524f0a4e3e23b3c8b4235eb7d9932129cc0006b)

RDS: Fixes to improve throughput performance

This fixes race conditions and adds other enhancements to improve
throughput.

Ported from UEK2 patch dbe1629e3387d8c68009e1da51d1a1ca778f2501

(Changes related to LAP in the original patch in
drivers/infiniband/core/cma.c are NOT ported because we
do not have APM support in rdma_cm)

Orabug: 16571410
Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>

RDS: fix rds-ping spinlock recursion

This is the revised patch for fixing rds-ping spinlock recursion
according to Venkat's suggestions.

The RDS ping/pong over TCP feature has been broken for years (2.6.39 to
3.6.0) because we have to set TCP cork and call kernel_sendmsg() between
ping and pong, and both need to lock "struct sock *sk". However, this
lock is already held before the rds_tcp_data_ready() callback is
triggered. As a result, we always hit spinlock recursion, which
results in a system panic.

Given that RDS ping is only used to test the connectivity and not for
serious performance measurements, we can queue the pong transmit to
rds_wq as a delayed response.
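
The deferral can be pictured with a toy userspace model. Everything here is
hypothetical (sketch_conn and the sketch_* functions); a boolean stands in
for the socket lock and a flag stands in for the queued work item. The
receive callback runs with the lock held and only records that a pong is
due, and a later worker pass sends it once the lock is free.

```c
#include <stdbool.h>

/* Hypothetical connection state for the model. */
struct sketch_conn {
	bool sock_locked;	/* models the held "struct sock" lock */
	bool pong_pending;	/* models a work item queued for later */
	int pongs_sent;
};

/* Sending needs the lock; fail if it is already held (would recurse). */
bool sketch_send_pong(struct sketch_conn *c)
{
	if (c->sock_locked)
		return false;
	c->sock_locked = true;
	c->pongs_sent++;
	c->sock_locked = false;
	return true;
}

/* Receive callback: called with the lock held, so only queue the pong. */
void sketch_data_ready(struct sketch_conn *c)
{
	c->pong_pending = true;
}

/* Worker: runs later, outside the callback, and sends the queued pong. */
void sketch_worker(struct sketch_conn *c)
{
	if (c->pong_pending && sketch_send_pong(c))
		c->pong_pending = false;
}
```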

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
CC: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
CC: David S. Miller <davem@davemloft.net>
CC: James Morris <james.l.morris@oracle.com>
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 5175a5e76bbdf20a614fb47ce7a38f0f39e70226)

Signed-off-by: Jerry Snitselaar <jerry.snitselaar@oracle.com>
Conflicts:
net/rds/send.c

Orabug: 16223050

Signed-off-by: Jerry Snitselaar <dev@snitselaar.org>
(cherry picked from commit 3badb20f7c232c2f72758453d01cb890ab686def)

Signed-off-by: Jerry Snitselaar <jerry.snitselaar@oracle.com>
(cherry picked from commit bd05c6c016b911bb7d9a16f2998389b4219bb2cf)

rds: Congestion flag does not get cleared causing the connection to hang

Orabug: 16424692

Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
(cherry picked from commit 456165e342b25b010735d84985f2895ab7f379a9)

Add SIOCRDSGETTOS to get the current TOS for the socket

Orabug: 16397197
Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
(cherry picked from commit 201e2362694aab25b4ef6b11e5bd62b75b2a0e17)

Changes to connect/TOS interface

Orabug: 16397197
Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
(cherry picked from commit b0aa0bd4342a38bdb994b7af301ea07a4f4b5ad6)

rds: this resolved crash while removing rds_rdma module. orabug: 16268201

Signed-off-by: Bang Nguyen <band.nguyen@oracle.com>
(cherry picked from commit 0ee85d26682603e53b3e022ec70a55dfa98710f9)

rds: scheduling while atomic on failover orabug: 16275095

Signed-off-by: Bang Nguyen <band.nguyen@oracle.com>
(cherry picked from commit a1b048d2106086119400fccbf37129414edf3f3a)

rds: unregister IB event handler on shutdown

Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
(cherry picked from commit 888623e08e35272913838f83e6a601b65683ab27)

rds: HAIP support child interface

Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
(cherry picked from commit 538f5d0dfa704f4dcb4afa80a1d01b1317b9cd65)

RDS HAIP misc fixes

Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
(cherry picked from commit e0cac8d762cbfeee7f5b34d722a43e61a326d970)

Ignore failover groups if HAIP is disabled

Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
(cherry picked from commit 47bf625103193bf59f75a7c43c42411a04b55712)

RDS: RDS rolling upgrade

Changes to support rolling upgrade from RDS protocol version 3.1 to 4.1

Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
(cherry picked from commit 6788b32aeb00a1ac4b3815680c029911c431031a)

RDS: Fixes warning while rds-info. spin_lock_irqsave() is changed to spin_lock_bh().

Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
Reviewed-by: Bang Nguyen <bang.nguyen@oracle.com>
(cherry picked from commit 237b028186dd2523fbb81d47463ea8ce4e9a202d)

rds: UNDO reverts done for rebase code to compile with Linux 4.1 APIs

Commit 163377dd82f2d81809aabe736a2e0ea515055a69 does reverts
to common ancestor of upstream and UEK2 to rebase UEK2 patches
for net/rds. This commit undoes reverts needed to compile to
Linux 4.0 APIs.

UNDO Revert "net: Replace get_cpu_var through this_cpu_ptr" for net/rds
This commit does UNDO of revert of commit 903ceff7ca7b4d80c083a80ee5163b74e9fa359f for net/rds.

UNDO Revert "rds: switch ->inc_copy_to_user() to passing iov_iter"
This commit does UNDO of revert of commit c310e72c89926e06138e4881f21e4c8da3e7ef18

UNDO Revert of "rds: switch rds_message_copy_from_user() to iov_iter"
This commit does UNDO of revert of commit 083735f4b01b703184c0e11c2e384b2c60a8aea4.

UNDO Revert "put iov_iter into msghdr" for net/rds
This commit does UNDO of revert of commit c0371da6047abd261bc483c744dbc7d81a116172 for net/rds

UNDO Revert "net: introduce helper macro for_each_cmsghdr" for net/rds
This commit does UNDO of revert of commit f95b414edb18de59940dcebbefb49cf25c6d505c for net/rds

UNDO Revert "rds: Fix min() warning in rds_message_inc_copy_to_user()"
This commit does UNDO of revert of commit 6ff4a8ad4b6eae5171754fb60418bc81834aa09b.

UNDO Revert "rds: Make rds_message_copy_from_user() return 0 on success."
This commit does UNDO of revert of commit d0a47d32724bf0765b8768086ef1a7a6d074a7a0.

UNDO Revert "net: Remove iocb argument from sendmsg and recvmsg" for net/rds
This commit does UNDO of revert of commit 1b784140474e4fc94281a49e96c67d29df0efbde for net/rds.

These commits were reverted earlier to rebase unmodified UEK2 RDS code
(UNDO needed to compile to new Linux 4.1 kernel APIs - changed *after* Linux 3.18)

Signed-off-by: Mukesh Kacker <mukesh.kacker@oracle.com>

rds: port to UEK4, Linux-3.18*

Signed-off-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>

rds: disable APM support

The APM(Alternate Path Migration) feature is not used and its
code is being disabled. (It can be re-enabled if/when APM support
is enabled in rdma_cm.)

Signed-off-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>

rds: disable cq balance

This should be enabled back after IB_CQ_VECTOR_LEAST_ATTACHED is added.

Signed-off-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>

rds: move linux/rds.h to uapi/linux/rds.h

to be compatible with 3.18*

Signed-off-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>

RDS: Kconfig and Makefile changes

Signed-off-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>

RDS merge for UEK2

Orabug: 15997083

This is merged code of Mellanox OFED R2, 0080 release; and ofa 4.1

Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
(cherry picked from commit 26add53cf20e08dfa331ec22d307dab40f0c4d74)

rds: Misc Async Send fixes

Async send fixes to support new rds-stress option "--async"

Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>

rds: call unregister_netdevice_notifier for rds_ib_nb in rds_ib_exit

in commit 58f6b52b114d3400fea7daffb0440ca611e45c1c

    rds: Misc HAIP fixes

netdevice_notifier rds_ib_nb is never unregistered.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

rds: flush and destroy workqueue rds_aux_wq and fix creation order.

in commit f05d77d46d172127d3f96538a62764a2a589a61b

    rds: Add Automatic Path Migration support

    RDS APM supports automatic connection failover in case of path
    failure, and connection failback when the path recovers.

    RDS APM is enabled by module parameter rds_ib_enable_apm (disabled by
    default).

workqueue rds_aux_wq is not destroyed and it should be created prior to
rds_trans_register since rds_trans_register callbacks can use rds_aux_wq.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

rds : fix compilation warning

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

rds: port the code to uek2

Signed-off-by: Dotan Barak <dotanb@dev.mellanox.co.il>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

rds: CQ balance

This patch provides load-balancing for RDS CQs across available interrupt vectors.
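
One simple way to spread CQs over vectors is to attach each new CQ to the
vector that currently has the fewest. The sketch below shows that policy
only as an illustration; the commit message does not describe the actual
selection logic, and sketch_pick_vector() is a hypothetical name.

```c
/*
 * Pick the vector with the fewest attached CQs and account for the new
 * one. load[i] is the number of CQs on vector i; nvec must be >= 1.
 * Ties go to the lowest index.
 */
int sketch_pick_vector(unsigned int *load, int nvec)
{
	int best = 0;
	int i;

	for (i = 1; i < nvec; i++)
		if (load[i] < load[best])
			best = i;
	load[best]++;
	return best;
}
```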

Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>

rds: HAIP across HCAs

This patch extends HAIP support to failover/failback IPs across HCAs.

Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>

rds: Misc HAIP fixes

Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>

rds: off by one fixes

Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>

rds: Add Automatic Path Migration support

RDS APM supports automatic connection failover in case of path
failure, and connection failback when the path recovers.

RDS APM is enabled by module parameter rds_ib_enable_apm (disabled by
default).

Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>

rds: fix error flow handling

In case of an error flow, uninitialized memory was used and this caused a
kernel oops.

Signed-off-by: Dotan Barak <dotanb@dev.mellanox.co.il>
Reviewed-by: Bang Nguyen <bang.nguyen@oracle.com>

net/rds: prevent memory leak in case of error flow

Signed-off-by: Dotan Barak <dotanb@dev.mellanox.co.il>

rds: prepare support to kernel 2.6.39-200.1.1.el5uek: add the macro NIPQUAD_*

Add the macro:
NIPQUAD
NIPQUAD_FMT

Signed-off-by: Dotan Barak <dotanb@dev.mellanox.co.il>

rds: fixed wrong condition in case of error

Need to use IS_ERR and not compare with NULL
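
The difference matters because functions that return an error encoded in the
pointer never return NULL on failure. The userspace sketch below re-creates
the convention with hypothetical sketch_* helpers (they imitate, but are
not, the kernel's ERR_PTR/IS_ERR/PTR_ERR); the 4095 limit mirrors the
kernel's MAX_ERRNO.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MAX_ERRNO 4095

/* Encode a negative errno value in a pointer. */
void *sketch_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

/* Decode the errno value from an error pointer. */
long sketch_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

/* True if p falls in the top SKETCH_MAX_ERRNO addresses. */
bool sketch_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

/* Hypothetical constructor that reports failure via an error pointer. */
void *sketch_create_id(bool fail)
{
	static int dummy;

	return fail ? sketch_err_ptr(-ENOMEM) : (void *)&dummy;
}
```

A caller that only tests the result against NULL would treat the failed
result as valid, since the encoded error is a non-NULL value.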

Signed-off-by: Dotan Barak <dotanb@dev.mellanox.co.il>
Reviewed-by: Bang Nguyen <bang.nguyen@oracle.com>

rds: fixed kernel oops in case of error flow

If creating an rdma_cm handler fails, don't try to free it. This
prevents the following kernel oops:

BUG: unable to handle kernel NULL pointer dereference at 00000000000001fc
IP: [<ffffffff814ef21f>] _spin_lock_irqsave+0x1f/0x40
PGD 175b80067 PUD 176a0b067 PMD 0
Oops: 0002 [#1] SMP
last sysfs file: /sys/module/rds/initstate
CPU 0
Modules linked in: rds_rdma(+)(U) rds(U) ib_ucm(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_srp(U) scsi_transport_srp scsi_tgt ib_ipoib(U) ib_cm(U) ib_sa(U) ib_uverbs(U) ib_umad(U) mlx4_ib(U) ib_mad(U) ib_core(U) mlx4_core(U) memtrack(U) netconsole configfs nfs fscache nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc ipv6 microcode virtio_balloon virtio_net snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk pata_acpi ata_generic ata_piix virtio_pci virtio_ring virtio dm_mirror dm_region_hash dm_log dm_mod [last unloaded: memtrack]

Pid: 22908, comm: modprobe Not tainted 2.6.32-220.el6.x86_64 #1 Red Hat KVM
RIP: 0010:[<ffffffff814ef21f>]  [<ffffffff814ef21f>] _spin_lock_irqsave+0x1f/0x40
RSP: 0018:ffff880125f07e68  EFLAGS: 00010086
RAX: 0000000000010000 RBX: fffffffffffffff4 RCX: 0000000000000000
RDX: 0000000000000286 RSI: 000000000000000a RDI: 00000000000001fc
RBP: ffff880125f07e68 R08: 0000000000000000 R09: ffff880176db4020
R10: ffff880125f07988 R11: 0000000000000002 R12: 00000000000001fc
R13: 0000000000d8e4c0 R14: 000000000000000a R15: 0000000000000000
FS:  00007f66c90cc700(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000000001fc CR3: 000000011e71b000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process modprobe (pid: 22908, threadinfo ffff880125f06000, task ffff880174732b40)
Stack:
ffff880125f07e98 ffffffffa03dd795 fffffffffffffff4 fffffffffffffff4
<0> 0000000000d8e4c0 0000000000000000 ffff880125f07ed8 ffffffffa03dee11
<0> ffff880125f07ee8 00000000fffffff4 00000000fffffff4 fffffffffffffff4
Call Trace:
[<ffffffffa03dd795>] cma_exch+0x35/0x70 [rdma_cm]
[<ffffffffa03dee11>] rdma_destroy_id+0x21/0x310 [rdma_cm]
[<ffffffffa042a0be>] init_module+0xbe/0x118 [rds_rdma]
[<ffffffff81096e75>] ? __blocking_notifier_call_chain+0x65/0x80
[<ffffffffa042a000>] ? init_module+0x0/0x118 [rds_rdma]
[<ffffffff8100204c>] do_one_initcall+0x3c/0x1d0
[<ffffffff810af641>] sys_init_module+0xe1/0x250
[<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
Code: c9 c3 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 0f 1f 44 00 00 9c 58 0f 1f 44 00 00 48 89 c2 fa 66 0f 1f 44 00 00 b8 00 00 01 00 <f0> 0f c1 07 0f b7 c8 c1 e8 10 39 c1 74 0e f3 90 0f 1f 44 00 00
RIP  [<ffffffff814ef21f>] _spin_lock_irqsave+0x1f/0x40
RSP <ffff880125f07e68>
CR2: 00000000000001fc
---[ end trace 8db2f942777f29d0 ]---

Signed-off-by: Dotan Barak <dotanb@dev.mellanox.co.il>