This patch provides the ability to dynamically turn on or off various
types of debug/diag prints inside the RDS module.
The run-time debug prints are controlled by a rds module parameter,
rds_rt_debug_bitmap.
Here are the definitions of the different bits. We have implemented
feature-related bits for areas such as Connection Management, Active Bonding,
Error prints, Send, and Recv,
in net/rds/rds_rt_debug.h:
...
enum {
    /* bit 0 ~ 19 are feature related bits */
    RDS_RTD_ERR     = 1 << 0, /* 0x1 */
    RDS_RTD_ERR_EXT = 1 << 1, /* 0x2 */
In general, the *_EXT bits give you extra information but can
possibly flood the log with prints as well. Every bit can be controlled by
users, so users can decide how much information they want to see/collect. The
printk level currently embedded in this patch is KERN_INFO, so most
likely all the messages will go only to /var/log/messages, without showing up
on the console, if we use the default settings for /proc/sys/kernel/printk and
/etc/rsyslog.conf in an OL6 environment.
For example, to turn on the RDS_RTD_ERR and RDS_RTD_CM bits, OR their bit
values together and write the result to the module parameter, as sketched below.
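A sketch, assuming the parameter is writable at runtime through sysfs, and
assuming (hypothetically, since the excerpt above does not show its value)
that RDS_RTD_CM is 0x4:
    # at module load time (0x1 for RDS_RTD_ERR | assumed 0x4 for RDS_RTD_CM)
    modprobe rds rds_rt_debug_bitmap=0x5
    # or at runtime, if the parameter permissions allow it
    echo 0x5 > /sys/module/rds/parameters/rds_rt_debug_bitmap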
An RDS/IB connection can hang with the log message
"CQ access violation on CQN ..."
followed by
"RDS: unknown event 15!".
Event 15 is RDMA_CM_EVENT_TIMEWAIT_EXIT, which RDS was not handling.
With this fix, RDS will now attempt to reconnect on getting this event.
The fix contains two changes:
1) an RDS change to handle the RDMA_CM_EVENT_TIMEWAIT_EXIT event (a sketch
follows below), and
2) displaying the diagnostic data "syndrome" and "vendor_error_syndrome" in
mlx4_core when a CQ access violation occurs.
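A minimal sketch of change 1), assuming the event is dispatched in RDS's
rdma_cm event handler and that rds_conn_drop() is the existing helper that
queues a shutdown followed by a reconnect; this is illustrative, not the
literal patch:
    case RDMA_CM_EVENT_TIMEWAIT_EXIT:
        /* Treat TIMEWAIT exit like a connection error: drop the
         * connection so the normal reconnect machinery runs instead
         * of logging "unknown event 15" and hanging. */
        if (conn)
            rds_conn_drop(conn);
        break;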
Wengang Wang [Mon, 7 Sep 2015 08:42:44 +0000 (16:42 +0800)]
rds: add busy_list only when fmr allocated successfully
The rdma layer ibmr is always added to the busy list of the pool right after
its memory is allocated. In case the lower-layer fmr allocation fails,
it should be removed from the busy list before the memory is freed, but
it wasn't. Thus the freed ibmr is left on the busy list, and the busy list
gets into an unstable state.
The fix is to add the ibmr to the busy list only when the fmr is allocated
successfully, as sketched below.
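A sketch of the corrected ordering, assuming the pool keeps its in-use MRs on
fields named busy_list/busy_lock (illustrative names, not copied from the patch):
    ibmr = kzalloc(sizeof(*ibmr), GFP_KERNEL);
    if (!ibmr)
        return ERR_PTR(-ENOMEM);

    ibmr->fmr = ib_alloc_fmr(rds_ibdev->pd, access_flags, &fmr_attr);
    if (IS_ERR(ibmr->fmr)) {
        err = PTR_ERR(ibmr->fmr);
        kfree(ibmr);            /* never published, safe to free */
        return ERR_PTR(err);
    }

    /* Only after the low-level FMR exists does the ibmr go onto the
     * busy list, so a failed allocation can no longer leave a freed
     * object linked there. */
    spin_lock_bh(&pool->busy_lock);
    list_add(&ibmr->busy_list, &pool->busy_list);
    spin_unlock_bh(&pool->busy_lock);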
Wengang Wang [Mon, 7 Sep 2015 07:35:29 +0000 (15:35 +0800)]
rds: free ib_device related resource
There is a (rare) case in which an ib_device gets removed (driver unload) while
the upper layer (RDS) still holds references to the resources allocated
from this ib_device.
The result is either a memory leak or a crash when accessing
the freed memory.
The resources are mainly rds_ib_mr objects; in-use rds_ib_mr (rds_mr)
objects are stored in rds_sock.rs_rdma_keys.
The fix is to:
1) link up all in-use rds_ib_mr objects to the pool,
2) link the rds_sock to the rds_ib_mr, and
3) have the destroy of the rds_ib_mr_pool take care of freeing the rds_ib_mrs
by calling rds_rdma_drop_keys()
Wengang Wang [Mon, 7 Sep 2015 06:12:42 +0000 (14:12 +0800)]
rds: srq initialization and cleanup
RDS has the following two problems related to shared receive queues:
1) srq initialization:
When a new IB device is registered to device_list, the .add methods
of the clients on client_list are called to do some initialization work.
For RDS, rds_ib_add_one() is called. srq-related state should be
fully initialized here, since this is the last chance before the srq is used.
However, the code only allocates memory and apparently expects
rds_ib_srqs_init() to initialize it later. But in fact, rds_ib_srqs_init()
is not called unless the call path is insmod of rds_rdma.
2) srq cleanup:
When removing the rds_rdma module, the srqs of every rds_ib_device should
be cleaned up. However, the code only frees the rds_ib_device.srq memory
and does not clean up the memory pointed to by the pointers embedded inside it.
This leads to a resource leak.
Rama Nichanamatlu [Thu, 11 Jun 2015 17:43:54 +0000 (10:43 -0700)]
IB/rds_rdma: unloading of ofed stack causes page fault panic
This issue surfaced at the tail end of the OFED functional automated test
suite, while unloading the ofed modules, resulting in the following stack trace:
BUG: unable to handle kernel paging request at ffffffffa0abd1a0
IP: [<ffffffffa0abd1a0>] 0xffffffffa0abd1a0
Sowmini Varadhan [Fri, 28 Aug 2015 14:09:04 +0000 (10:09 -0400)]
RDS-TCP: Support multiple RDS-TCP listen endpoints, one per netns.
Register pernet subsys init/stop functions that will set up
and tear down per-net RDS-TCP listen endpoints. Unregister the
pernet subsys functions on 'modprobe -r' to clean up these
endpoints.
Enable keepalive on both accept and connect socket endpoints.
The keepalive timer expiration will ensure that client socket
endpoints will be removed as appropriate from the netns when
an interface is removed from a namespace.
Register a device notifier callback that will clean up all
sockets (and thus avoid the need to wait for keepalive timeout)
when the loopback device is unregistered from the netns indicating
that the netns is getting deleted.
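A sketch of the pernet registration pattern described above, using the
conventional struct pernet_operations shape (hook and struct names follow the
description but are written from memory, not copied from the patch):
    static struct pernet_operations rds_tcp_net_ops = {
        .init = rds_tcp_init_net,   /* create the per-netns listen endpoint */
        .exit = rds_tcp_exit_net,   /* tear it down when the netns dies */
        .id   = &rds_tcp_netid,
        .size = sizeof(struct rds_tcp_net),
    };

    /* at module init */
    ret = register_pernet_subsys(&rds_tcp_net_ops);

    /* at module exit ('modprobe -r') */
    unregister_pernet_subsys(&rds_tcp_net_ops);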
Sowmini Varadhan [Fri, 28 Aug 2015 11:16:01 +0000 (07:16 -0400)]
RDS-TCP: Make RDS-TCP work correctly when it is set up in a netns other than init_net
Open the sockets by calling sock_create_kern() with the correct struct net
pointer, and use that struct net pointer when verifying the
address passed to rds_bind(), as in the sketch below.
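For illustration, assuming a kernel where sock_create_kern() takes the
struct net pointer as its first argument and a helper (here called
rds_conn_net(), an assumed name) returns the netns owning the connection:
    /* create the TCP socket in the connection's own namespace,
     * not unconditionally in init_net */
    ret = sock_create_kern(rds_conn_net(conn), PF_INET, SOCK_STREAM,
                           IPPROTO_TCP, &sock);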
The kernel complains that the total sent length is bigger than what we want
to send: rds_ib_xmit() returns a wrong value on the second and later calls
for the same rds_message.
The sg and off passed by rds_send_xmit() to rds_ib_xmit() are based on
scatterlist.offset/length, but the action of rds_ib_xmit() is based on
scatterlist.dma_address/dma_length. In case dma_length is larger than length,
there is a problem: for the 2nd and later calls of rds_ib_xmit() for the same
rds_message, at least one of the following two is wrong:
1) the scatterlist entry to start with (the chosen one can be far beyond the
correct one);
2) the offset to start with within that scatterlist entry.
The fix is to add op_dmasg and op_dmaoff fields to the rm_data_op structure,
indicating respectively the scatterlist entry and the offset within it at
which rds_ib_xmit() should start. The op_dmasg and op_dmaoff fields are
initialized to zero when the message is DMA-mapped for the first time and are
updated as send slots are filled, as sketched below.
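A sketch of the new fields and of how they might advance while send slots are
filled (surrounding code and exact placement are assumptions):
    struct rm_data_op {
        ...
        unsigned int    op_dmasg;   /* sg entry to resume from */
        unsigned int    op_dmaoff;  /* offset within its dma_length */
    };

    /* while filling a send slot in rds_ib_xmit(): */
    scat = &rm->data.op_sg[rm->data.op_dmasg];
    len = min(RDS_FRAG_SIZE,
              ib_sg_dma_len(dev, scat) - rm->data.op_dmaoff);
    ...
    rm->data.op_dmaoff += len;
    if (rm->data.op_dmaoff == ib_sg_dma_len(dev, scat)) {
        rm->data.op_dmasg++;    /* entry consumed, move to the next */
        rm->data.op_dmaoff = 0;
    }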
Wengang Wang [Tue, 4 Aug 2015 05:39:51 +0000 (13:39 +0800)]
rds: rds_ib_device.refcount overflow
Fixes: 3e0249f9c05c ("RDS/IB: add refcount tracking to struct rds_ib_device")
There is a missing drop of the refcount on rds_ib_device.refcount
in the case where rds_ib_alloc_fmr() fails (mr pool running out). This leads
to the refcount overflowing.
A BUG_ON at line 117 (see following) is triggered.
From the vmcore:
s_ib_rdma_mr_pool_depleted is 2147485544
and rds_ibdev->refcount is -2147475448.
That is evidence that the mr pool was used up, so rds_ib_alloc_fmr() was
very likely returning ERR_PTR(-EAGAIN).
Wengang Wang [Tue, 4 Aug 2015 08:00:55 +0000 (16:00 +0800)]
rds_rdma: rds_sendmsg should return EAGAIN if connection not setup
In rds_sendmsg(), in the case where RDS_CMSG_RDMA_MAP is requested and
rds_cmsg_send() is called, a "struct rds_mr" needs to be created.
For creating the "struct rds_mr", the connection needs to be
established (ready) for rds_ib_transport; otherwise __rds_rdma_map()
fails because it can't find the right rds_ib_device (the one
associated with the IP address matching the rds_sock's bound IP address).
That IP address is set at the completion of the rdma connection.
But in the current code, the connection attempt is triggered after the call
to rds_cmsg_send(), so rds_cmsg_send() would fail with -ENODEV.
The fix is to move the trigger of the connection before the call to
rds_cmsg_send(), and to return -EAGAIN in case the connection is not
ready yet when rds_cmsg_send() is called, as sketched below.
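Illustrative ordering inside rds_sendmsg() (rds_conn_connect_if_down() stands
for whatever helper kicks off the connection attempt; the name is an assumption):
    /* trigger the connection attempt *before* any cmsg processing
     * that needs an established connection */
    rds_conn_connect_if_down(conn);

    ret = rds_cmsg_send(rs, rm, msg, &allocated_mr);
    if (ret == -EAGAIN)     /* connection not ready yet */
        goto out;           /* the caller sees -EAGAIN and can retry */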
Wengang Wang [Tue, 4 Aug 2015 08:27:08 +0000 (16:27 +0800)]
rds_rdma: allocate FMR according to max_item_soft
When max_item on the pool is very large (say 170000), a large number of
alloc_fmr requests would fail with -EAGAIN. This can greatly hurt
performance.
Actually, whether to allocate or to wait for a pool flush
should be decided according to "max_item_soft" rather than "max_item", because
we resize the pool by changing "max_item_soft", not "max_item".
Also, when a lower-layer fmr is successfully allocated, we should increase
"max_item_soft" incrementally, not set it immediately to "max_item".
Wengang Wang [Tue, 4 Aug 2015 08:16:01 +0000 (16:16 +0800)]
rds: set fmr pool dirty_count correctly
In rds_ib_flush_mr_pool() the setting of dirty_count is wrong in the case
where "free_all" is true: "clean" ones are also counted as dirty. A negative
dirty_count is seen in the vmcore.
Sowmini Varadhan [Tue, 2 Jun 2015 00:27:42 +0000 (20:27 -0400)]
Add getsockopt support for SO_RDS_TRANSPORT
The currently attached transport for a PF_RDS socket may be obtained
from user space by invoking getsockopt(2) using the SO_RDS_TRANSPORT
option at the SOL_RDS level. The integer optval returned will be one
of the RDS_TRANS_* constants defined in linux/rds.h.
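A minimal userspace sketch of the query (error handling trimmed; the fallback
constant definitions are assumptions for older headers):
    #include <stdio.h>
    #include <sys/socket.h>
    #include <linux/rds.h>

    #ifndef PF_RDS
    #define PF_RDS  21      /* AF_RDS/PF_RDS protocol family number */
    #endif
    #ifndef SOL_RDS
    #define SOL_RDS 276     /* RDS socket-option level */
    #endif

    int main(void)
    {
        int fd = socket(PF_RDS, SOCK_SEQPACKET, 0);
        int trans;
        socklen_t len = sizeof(trans);

        if (getsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, &trans, &len) == 0)
            printf("attached transport: %d\n", trans); /* RDS_TRANS_* */
        return 0;
    }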
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Mukesh Kacker <mukesh.kacker@oracle.com>
Sowmini Varadhan [Tue, 2 Jun 2015 00:26:29 +0000 (20:26 -0400)]
Add setsockopt support for SO_RDS_TRANSPORT
An application may deterministically attach the underlying transport for
a PF_RDS socket by invoking setsockopt(2) with the SO_RDS_TRANSPORT
option at the SOL_RDS level. The integer argument to setsockopt must be
one of the RDS_TRANS_* transport types, e.g., RDS_TRANS_TCP. The option
must be specified before invoking bind(2) on the socket, and may only
be used once on the socket. An attempt to set the option on a bound
socket, or to invoke the option after a successful SO_RDS_TRANSPORT
attachment, will return EOPNOTSUPP.
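Correspondingly, a short sketch of attaching a transport before bind() (same
constants as in the previous sketch):
    int t = RDS_TRANS_TCP;

    /* must happen before bind(); a second attempt, or an attempt on a
     * bound socket, fails with EOPNOTSUPP */
    if (setsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, &t, sizeof(t)) < 0)
        perror("SO_RDS_TRANSPORT");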
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Mukesh Kacker <mukesh.kacker@oracle.com>
Sowmini Varadhan [Tue, 2 Jun 2015 00:24:10 +0000 (20:24 -0400)]
Declare SO_RDS_TRANSPORT and RDS_TRANS_* constants in uapi/linux/rds.h
User space applications that want to explicitly select the
underlying transport for a PF_RDS socket may do so by using the
SO_RDS_TRANSPORT socket option at the SOL_RDS level before bind().
The integer argument provided to the socket option is one
of the RDS_TRANS_* values, e.g., RDS_TRANS_TCP. This commit exports
the constant values needed by such applications via <linux/rds.h>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Mukesh Kacker <mukesh.kacker@oracle.com>
Sowmini Varadhan [Mon, 11 May 2015 13:39:04 +0000 (09:39 -0400)]
RDS-TCP: only initiate reconnect attempt on outgoing TCP socket.
When the peer of an RDS-TCP connection restarts, a reconnect
attempt should only be made from the active side of the TCP
connection, i.e. the side that has a transient TCP port
number. Do not add the passive side of the TCP connection
to the c_hash_node and thus avoid triggering rds_queue_reconnect()
for passive rds connections.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Mukesh Kacker <mukesh.kacker@oracle.com>
Sowmini Varadhan [Mon, 11 May 2015 13:32:51 +0000 (09:32 -0400)]
RDS-TCP: Always create a new rds_sock for an incoming connection.
When running RDS over TCP, the active (client) side connects to the
listening ("passive") side at RDS_TCP_PORT. After the connection
is established, if the client side reboots (potentially without even
sending a FIN), the server still has a TCP socket in the established
state. If the server side now gets a new SYN from the client
with a different client port, TCP will create a new socket pair, but
the RDS layer will incorrectly pull up the old rds_connection (which
is still associated with the stale t_sock and RDS socket state).
This patch corrects this behavior by having rds_tcp_accept_one()
always create a new connection for an incoming TCP SYN.
The rds and tcp state associated with the old socket-pair is cleaned
up via the rds_tcp_state_change() callback which would typically be
invoked in most cases when the client-TCP sends a FIN on TCP restart,
triggering a transition to CLOSE_WAIT state. In the rarer event of client
death without a FIN, TCP_KEEPALIVE probes on the socket will detect
the stale socket, and the TCP transition to CLOSE state will trigger
the RDS state cleanup.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Mukesh Kacker <mukesh.kacker@oracle.com>
Mukesh Kacker [Sat, 9 May 2015 02:47:56 +0000 (19:47 -0700)]
rds: directly include header for vmalloc/vfree in ib_recv.c
Directly include <linux/vmalloc.h> in ib_recv.c.
Without it, the build can fail on non-x86 platforms where the header
is not pulled in indirectly:
CC [M] net/rds/ib_recv.o
net/rds/ib_recv.c: In function ‘rds_ib_srq_init’:
net/rds/ib_recv.c:1530: error: implicit declaration of function ‘vmalloc’
net/rds/ib_recv.c:1531: warning: assignment makes pointer from integer without a cast
net/rds/ib_recv.c: In function ‘rds_ib_srq_exit’:
net/rds/ib_recv.c:1590: error: implicit declaration of function ‘vfree’
make[2]: *** [net/rds/ib_recv.o] Error 1
make[1]: *** [net/rds] Error 2
make: *** [net] Error 2
Mukesh Kacker [Mon, 4 May 2015 00:33:59 +0000 (17:33 -0700)]
rds: return EMSGSIZE for oversize requests before processing/queueing
rds_send_queue_rm() allows for the "current datagram" being queued
to exceed SO_SNDBUF thresholds by checking bytes queued without
counting in length of current datagram. (Since sk_sndbuf is set
to twice requested SO_SNDBUF value as a kernel heuristic this
is usually fine!)
If this "current datagram" squeezing past the threshold is itself
many times the size of the sk_sndbuf threshold itself then even
twice the SO_SNDBUF does not save us and it gets queued but
cannot be transmitted. Threads block and deadlock and device
becomes unusable. The check for this datagram not exceeding
SNDBUF thresholds (EMSGSIZE) is not done on this datagram as
that check is only done if queueing attempt fails.
(Datagrams that follow this datagram fail queueing attempts, go
through the check and eventually trip EMSGSIZE error but zero
length datagrams silently fail!)
This fix moves the check for datagrams exceeding SNDBUF limits
before any processing or queueing is attempted and returns EMSGSIZE
early in the rds_sndmsg() code. This change also ensures that all
datagrams get checked for exceeding SNDBUF/sk_sndbuf size limits
and the large datagrams that exceed those limits do not get to
rds_send_queue_rm() code for processing.
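A sketch of the early check (rds_sk_sndbuf() stands for a helper returning
the effective per-socket send-buffer limit; the name is an assumption):
    /* early in rds_sendmsg(), before any allocation or queueing */
    if (payload_len > rds_sk_sndbuf(rs)) {
        ret = -EMSGSIZE;
        goto out;
    }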
Sasha Levin [Tue, 3 Feb 2015 13:55:58 +0000 (08:55 -0500)]
net: rds: use correct size for max unacked packets and bytes
Max unacked packets/bytes is an int while sizeof(long) was used in the
sysctl table.
This means that when they were getting read we'd also leak kernel memory
to userspace along with the timeout values.
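The class of fix, sketched on a hypothetical sysctl table entry (the real
table lives in net/rds/sysctl.c):
    {
        .procname     = "max_unacked_packets",
        .data         = &rds_sysctl_max_unacked_packets,
        .maxlen       = sizeof(int),  /* was sizeof(unsigned long);
                                       * reads returned extra bytes of
                                       * adjacent kernel memory */
        .mode         = 0644,
        .proc_handler = proc_dointvec,
    },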
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit db27ebb111e9f69efece08e4cb6a34ff980f8896)
Mukesh Kacker [Mon, 12 Jan 2015 01:23:10 +0000 (17:23 -0800)]
RDS/IP: RDS takes 10 seconds to plumb the second IP back
The RDS netdev event handler code effectively waits
for 10 seconds when an event indicates an interface
coming back up from the all-interfaces-were-down condition,
OR from the initial state when the first interface is revived.
(It is checking for the address to be plumbed, something
that is done after this wait in the case where the RDS active
bonding code is doing the address plumbing!)
That code is based on assumptions about delays in
plumbing the interface IP address that were
likely present due to bugs in legacy code.
With the current code, there is no need for
such checks before failing back interfaces
when an interface is brought back up.
Mukesh Kacker [Thu, 20 Nov 2014 18:26:13 +0000 (10:26 -0800)]
RDS/IB: Tune failover-on-reboot scheduling
On certain platforms (e.g. X5-2) the IB devices are slow
to come up, so the start of failover-on-reboot (in RDS active
bonding) is changed to accommodate them. Otherwise all
interfaces get de-activated and then re-activated on
almost every reboot.
We also make sure that when all interfaces do get de-activated
by failover-on-reboot, it is not affected by the delayed startup
of devices from all-ports-down that is present for other
situations.
The interval for the first scheduling of failover-on-reboot
on module load is also turned into a module parameter.
Mukesh Kacker [Tue, 2 Dec 2014 20:17:26 +0000 (12:17 -0800)]
RDS: mark netdev UP for intfs added post module load
When interfaces are brought up after module load,
a NETDEV_UP event for which no matching port exists
triggers re-initialization of the active bonding data
structures. The initialization, however, missed marking
the layer as up in the flags that track which layers are up.
The fix marks that layer as UP in the flags, since
the initialization is triggered by NETDEV_UP event
processing!
RDS code currently derives the pkey associated with
a device by parsing its name. With user-named
pkey devices, which do not have the pkey as part of
the name, that is not possible; the ipoib module
exports an API to query the pkey from the netdev,
and here we switch to using that API instead.
rds: fix list corruption and tx hang when netfilter is used
When RDS netfilter is enabled, the issues below can happen; this
patch solves them:
1. rds socket rds-incoming list corruption. This issue happens when
the following code path is executed:
* rds_recv_incoming -> NF_HOOK { nf hook decides to send the packet
both to local and to origin } ->
* { send to local } -> rds_recv_local -> list_add_tail { the rds-incoming
is queued on the rds socket rs_recv_queue list }
* { send to origin } -> rds_recv_route -> rds_recv_forward ->
{ rds_send_internal failed! } -> rds_recv_local -> list_add_tail
{ the same rds-incoming is queued on the rds socket rs_recv_queue list again }
2. The rds rx tasklet is forced to tx the incoming packet and so is unable
to process acks from the remote and free space in the rds sendbuf:
* rds_recv_incoming -> NF_HOOK { nf hook decides to send the packet
to origin } -> rds_recv_route -> rds_recv_forward -> rds_send_internal ->
rds_send_xmit { the rx tasklet now does the tx! }
Signed-off-by: shamir rabinovitch <shamir.rabinovitch@oracle.com>
Tested-by: jun yang <jun.yang@oracle.com>
Tested-by: denise iguma <denise.iguma@oracle.com>
Acked-by: chien yen <chien.yen@oracle.com>
Signed-off-by: Guangyu Sun <guangyu.sun@oracle.com>
(cherry picked from commit 61b5557dde8366245a58402f76d8e2d9f8a18b7d)
Mukesh Kacker [Wed, 8 Oct 2014 20:11:14 +0000 (13:11 -0700)]
RDS: move more queueing for loopback connections to separate queue
All processing of connect/re-connect/disconnect/reject
instances in RDS should use the separate workqueue
dedicated to the (few) local loopback connections, to
reduce latency, so that this processing does not get stuck
behind the processing of a large number of remote
connections. However, a few instances of such processing
do not. With this fix, those are changed to use the
workqueue dedicated to local connections.
This patch adds the following feature to ib_ipoib, rds_rdma, ib_core and
mlx4_core.
It adds a module parameter "module_unload_allowed". If the parameter is 1 (the
default value), modules can be unloaded (the same behavior as before);
otherwise, if it is 0, the module is not allowed to be unloaded. The parameter
can't be changed while the module is loaded, until the module is unloaded
(if it can be). A sketch follows the defaults below.
default values:
ib_ipoib: 1 for YES
rds_rdma: 0 for NO
ib_core: 1 for YES
mlx4_core: 0 for NO
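One common way to implement such a knob (a sketch, not necessarily this
patch's exact mechanism): register a read-only bool parameter and pin the
module with an extra self-reference when unloading is disallowed:
    static bool module_unload_allowed = true;  /* default differs per module */
    module_param(module_unload_allowed, bool, 0444);  /* not writable at runtime */
    MODULE_PARM_DESC(module_unload_allowed, "Allow unloading this module");

    static int __init example_init(void)
    {
        /* the extra self-reference is never dropped, so the module
         * always counts as in use and rmmod refuses to unload it */
        if (!module_unload_allowed)
            __module_get(THIS_MODULE);
        return 0;
    }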
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Acked-by: Joe Jin <joe.jin@oracle.com>
Acked-by: Todd Vierling <todd.vierling@oracle.com>
Acked-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Guangyu Sun <guangyu.sun@oracle.com>
(cherry picked from commit cf1a00039e6fea116e9ea7c82f55ee3ee5319cec)
rds: fix NULL pointer dereference panic during rds module unload
The reported issue happens during an unload of the rds module with an rds
reconnect timeout worker scheduled to execute in the future, the rds
module being unloaded earlier than that. The rds reconnect timeout worker was
introduced by commit 8991a87c6c3fc8b17383a140bd6f15a958e31298 (RDS: SA query
optimization). The fix is to flush/cancel any reconnect timeout
workers while performing the destroy of rds connections, which is done during
module unload.
Mukesh Kacker [Wed, 13 Aug 2014 20:02:06 +0000 (13:02 -0700)]
RDS: active bonding: disable failover across HCAs (failover groups)
Disable experimental code in RDS active bonding which performs
failovers across "failover groups" (HCAs). It causes
instabilities for some applications.
RDS/IB: active bonding - failover down interfaces on reboot.
RDS active bonding detects port-down transitions on
active ports, notified by events (and performs failover
when notified), but does not detect ports which are already
down at boot time.
The changes involve tracking the hw port, link layer, and netdev
layer up/down status separately; the aggregate port UP
status is deduced when ALL layers are UP, and DOWN is deduced
when any one layer goes down.
A delayed task is scheduled to run at module load time after
a maximum delay, OR on a sysctl trigger from an init script after all
devices that are active have been brought up. The ports found to be
DOWN are failed over.
Mukesh Kacker [Sat, 21 Jun 2014 01:42:23 +0000 (18:42 -0700)]
RDS/IB: Remove dangling rcu_read_unlock() and other cleanups
Delete a dangling rcu_read_unlock() which was left behind
when the matching rcu_read_lock() and the enclosed code were
removed in commit 538f5d0dfa704f4dcb4afa80a1d01b1317b9cd65
Shamir Rabinovitch [Sat, 28 Jun 2014 23:25:16 +0000 (16:25 -0700)]
rds: new extension header: rdma bytes
Introduce a new extension header type RDSV3_EXTHDR_RDMA_BYTES for
an RDMA initiator to exchange rdma byte counts with its target.
Add new flag to RDS header: RDS_FLAG_EXTHDR_EXTENSION
Add new extension to RDS header: rds_ext_header_rdma_bytes
Please note:
Linux RDS and Solaris RDS have a mismatch in header flags. Solaris
RDS assigned flag 0x08 to RDS_FLAG_EXTHDR_EXTENSION, but
Linux already uses 0x08 for the RDS_FLAG_HB_PING flag.
This patch requires the fix below on the Solaris side:
BUG 19065367 - unified RDSV3_EXTHDR_RDMA_BYTES with Linux
RDS: Ensure non-zero SL uses correct path before lane 0 connection is dropped
There is an issue with the following scenario:
* A non-zero lane goes down first with send completion error 12.
* Before the lane 0 connection goes down, the peer initiates a connection
request with the non-zero lane.
* This non-zero lane connection request may be using old ARP entries of lane 0.
This also fixes a race condition between connection establishment and drop in
the following scenario:
* The non-zero lane connection is dropped.
* The non-zero lane connection is initiated, and this time it finds the proper
route and the connection request goes through.
* Before the non-zero lane connection is established at the RDS layer,
the zero lane connection is getting dropped.
* Now this zero-lane connection will drop the non-zero lane connection as well
(with the assumption that the non-zero lane did not find a proper route).
* When the non-zero lane connection establishment event is received (REP packet),
we have a race between the connection establishment event on one CPU and the
connection drop on another CPU.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Guangyu Sun <guangyu.sun@oracle.com>
(cherry picked from commit 92ad60f3efabfcd9eb40e983a131c730829c0d90)
Mukesh Kacker [Sat, 31 May 2014 00:44:16 +0000 (17:44 -0700)]
RDS: active bonding - ports may not failback if all ports go down
When active bonding is enabled, ports fail over and fail back as
ports are enabled/disabled. If ALL ports go down, then no
failover happens, since there is no available port to fail over to.
In that case, some ports not hosting any migrated interfaces are
not resurrected.
The fix resurrects ports when they don't have an IP address set,
are failing back, and RDS active bonding has state on them. Some
debug/log messages are also improved, e.g., to indicate when
a failover fails to happen.
Bang Nguyen [Wed, 16 Apr 2014 20:56:02 +0000 (13:56 -0700)]
RDS: SA query optimization
The fact is that all QoS lanes share the same physical path
between an IP pair; the only difference is the service level,
which affects the quality of service of each lane. With
that, we have the following optimizations:
1. Lane 0 issues the SA query request to the SM. All other
lanes wait for lane 0 to finish route resolution,
then copy in the resolved path and fill in their own service
level.
2. One-sided reconnect, to reduce reconnect racing and thus
further reduce the number of SA queries to the SM.
Reducing brownout for non-zero lanes:
In some cases, RDMA CM delays the disconnect event
after a switch/node failure, and this causes extra
brownout for the RDS reconnection. The workaround is to have
lane 0 probe the other lanes by sending an HB msg. If a lane
is down, this causes a send completion error and an
immediate reconnect.
Bang Nguyen [Wed, 7 May 2014 21:48:51 +0000 (14:48 -0700)]
RDS: Remove cond_resched() in RX tasklet
Re-install the base fix 17829338 and replace
spin_lock_irqsave(rx_lock)/spin_unlock_irqrestore(rx_lock) with
spin_lock_bh(rx_lock)/spin_unlock_bh(rx_lock) to resolve bugs 18413711
and 18461816. rx_lock is used to prevent concurrent reaping between the
RX tasklet and the worker.
Commit 6cf7cc30 "rds: dynamic active bonding configuration" introduced
a regression: late-joining IB interfaces were not configured
correctly. When the active bonding configuration showed both interfaces
down, IP failover was not happening. This patch fixes this issue.
Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
Signed-off-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
(cherry picked from commit 12755dbf7b4adc8ea2a935900b81c384731f6fff)
Signed-off-by: Jerry Snitselaar <jerry.snitselaar@oracle.com>
(cherry picked from commit 12eebc4b9e28c3899089277ff9725bdcff1829aa)
This fix addresses the issue of an idled QoS connection not getting a
disconnect event when the remote peer reboots. This causes a delayed
reconnect, and hence application brownout when the peer comes back online.
The fix is to proactively drop and reconnect the idled lanes when the base
lane is going through a reconnect to the rebooted peer, in effect forcing
all the lanes to go through the reconnect at the same time.
Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
Signed-off-by: Chien-Hua Yen <chien.yen@oracle.com>
(cherry picked from commit f51ccefb3a0b9485da5cc5f66bb1e311f61bd70b)
Signed-off-by: Chien-Hua Yen <chien.yen@oracle.com>
Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
(cherry picked from commit 5cad478d7148ac4b6fc2d6eb78d6bf5a576d69e1)
Signed-off-by: Giri Adari <giri.adari@oracle.com>
Signed-off-by: Richard Frank <richard.frank@oracle.com>
Signed-off-by: Chien-Hua Yen <chien.yen@oracle.com>
(cherry picked from commit 7b66ddd7f6a5b023191d74949fab41af245775a3)
Signed-off-by: Jerry Snitselaar <jerry.snitselaar@oracle.com>
Conflicts:
net/rds/rds.h
(cherry picked from commit 0373566ba0d74f655ae83e09748f7cc8d553f351)
Bang Nguyen [Tue, 20 Aug 2013 14:27:21 +0000 (07:27 -0700)]
RDS: double free rdma_cm_id
RDS currently offloads rdma_destroy_id() to an aux thread as part of the
connection shutdown. This was to work around a bug in which rdma_destroy_id()
could block and cause the RDS reconnect to hang. By queuing the
rdma_destroy_id() work, we unfortunately open up a timing window in which a
pending CMA_ADDR_QUERY request might not get canceled right away and can race
with rdma_destroy_id().
In this case, rdma_destroy_id() gets called and frees the cm id. Then
CMA_ADDR_QUERY completes and calls the RDS event handler, which calls
rds_resolve_route() on the destroyed cm id. The event handler returns failure,
which causes RDMA CM to call rdma_destroy_id() again on the same cm id!
Hence the double free.
Since the rdma_destroy_id() bug has been fixed by MLX by offloading the
blocking operation to a worker thread, RDS no longer needs to queue up
rdma_destroy_id(). This closes the window above and fixes the problem.
Bang Nguyen [Sat, 17 Aug 2013 04:41:25 +0000 (21:41 -0700)]
RDS: Reconnect stalls for 15s
On a switch reboot, both end nodes try to reconnect at the same time.
This can cause a race between the gratuitous ARP and the IP resolution,
resulting in a path to a down port. A CONNECT request sent on this path
is stuck until the 15s timeout, at which time the connection is dropped
and re-established.
The fix is to introduce a reconnect delay between the ARP and the reconnect
to minimize the race and avoid the 15s timeout.
Venkat Venkatsubra [Thu, 8 Aug 2013 05:15:05 +0000 (22:15 -0700)]
RDS: added stats to track and display receive side memory usage
Added these stats:
1. per-connection stat for number of receive buffers in cache
2. global stat for the same across all connections
3. number of bytes in socket receive buffer
Since stats are implemented using per-cpu variables and RDS currently
does unsigned arithmetic to add them up, separate counters (one for
addition and one for subtraction) are used for (2) and (3).
In the future we might change it to signed computation.
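The add/sub pair works as in the sketch below (counter names are
hypothetical): because the per-cpu sums are combined with unsigned arithmetic,
a single counter that is incremented on one CPU and decremented on another
cannot be summed reliably, so two monotonically increasing counters are kept
and usage is reported as their difference:
    rds_stats_add(s_recv_bytes_added_to_sock, delta);      /* on enqueue */
    rds_stats_add(s_recv_bytes_removed_from_sock, delta);  /* on dequeue */
    /* reported usage = added - removed */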
Bang Nguyen [Thu, 15 Aug 2013 02:10:00 +0000 (19:10 -0700)]
RDS: RDS reconnect stalls
After successfully negotiating the version at the lower protocol level, RDS
incorrectly set the proposed version to the higher protocol level, causing
the subsequent reconnect to stall.
The fix is not to change the proposed version after the initial connection
setup.
Ahmed Abbas [Thu, 18 Jul 2013 23:59:59 +0000 (16:59 -0700)]
add NETFILTER support
Orabug: 17082619
Adds the ability for the RDS code to support the NETFILTER kernel interfaces.
This allows for packet inspection, modification, and potential redirection as
the packets flow through the lower layers of the RDS code.
Jay Fenlason (fenlason@redhat.com) found a bug:
recvfrom() on an RDS socket can return the contents of random kernel
memory to userspace if it was called with an address length larger than
sizeof(struct sockaddr_in).
rds_recvmsg() also fails to set the addr_len parameter properly before
returning, but that's just a bug.
There are also a number of cases where recvfrom() can return an entirely bogus
address. Anything in rds_recvmsg() that returns a non-negative value but does
not go through the "sin = (struct sockaddr_in *)msg->msg_name;" code path
at the end of the while(1) loop will return up to 128 bytes of kernel memory
to userspace.
I wrote two test programs to reproduce this bug; you will see that in
rds_server, fromAddr gets overwritten and the subsequent sock_fd is
destroyed.
Yes, it is the programmer's fault to set msg_namelen incorrectly, but it is
better for the kernel to copy the real length of the address to user space in
such a case.
How do you run the test programs?
I tested them on a 32-bit x86 system, 3.5.0-rc7.
You will see something like:
server is waiting to receive data...
old socket fd=3
server received data from client:data from client
msg.msg_namelen=32
new socket fd=-1067277685
sendmsg()
: Bad file descriptor
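The listing below begins mid-function in the original message. The preamble
here is a reconstruction (hypothetical, for readability) of the includes and
declarations the fragment implies; note that fromAddr is assumed to sit right
before sock_fd so the 32-byte copy into msg_name can clobber the descriptor:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>

#ifndef PF_RDS
#define PF_RDS 21	/* RDS protocol family (assumption for old headers) */
#endif

int main(void)
{
	/* Declared adjacently on purpose: the kernel copying
	 * sizeof(fromAddr) + 16 = 32 bytes into msg_name can run
	 * past fromAddr and clobber sock_fd. */
	struct sockaddr_in fromAddr;
	int sock_fd;
	char recvBuffer[128];
	struct msghdr msg;
	struct iovec iov;

	sock_fd = socket(PF_RDS, SOCK_SEQPACKET, 0);
	/* bind() to the local RDS address elided in the original fragment */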
printf("server is waiting to receive data...\n");
msg.msg_name = &fromAddr;
/*
* I add 16 to sizeof(fromAddr), ie 32,
* and pay attention to the definition of fromAddr,
* recvmsg() will overwrite sock_fd,
* since kernel will copy 32 bytes to userspace.
*
* If you just use sizeof(fromAddr), it works fine.
* */
msg.msg_namelen = sizeof(fromAddr) + 16;
/* msg.msg_namelen = sizeof(fromAddr); */
msg.msg_iov = &iov;
msg.msg_iovlen = 1;
msg.msg_iov->iov_base = recvBuffer;
msg.msg_iov->iov_len = 128;
msg.msg_control = 0;
msg.msg_controllen = 0;
msg.msg_flags = 0;
while (1) {
printf("old socket fd=%d\n", sock_fd);
if (recvmsg(sock_fd, &msg, 0) == -1) {
perror("recvmsg() error\n");
close(sock_fd);
exit(1);
}
printf("server received data from client:%s\n", recvBuffer);
printf("msg.msg_namelen=%d\n", msg.msg_namelen);
printf("new socket fd=%d\n", sock_fd);
strcat(recvBuffer, "--data from server");
if (sendmsg(sock_fd, &msg, 0) == -1) {
perror("sendmsg()\n");
close(sock_fd);
exit(1);
}
}
close(sock_fd);
return 0;
}
Signed-off-by: Weiping Pan <wpan@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Guangyu Sun <guangyu.sun@oracle.com>
(cherry picked from commit eb3ccc4c696e5c4a10d324886fd061ea88bab6c4)
Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
Acked-by: Zheng Li <zheng.x.li@oracle.com>
Signed-off-by: Jerry Snitselaar <jerry.snitselaar@oracle.com>
(cherry picked from commit 78b7d86911046c3a10ffa52d90f4f1a4523d7ac3)
When rds_ib_remove_one() returns, the driver's mlx4_ib_remove_one()
function destroys the ib_device, so we must clear rds_ibdev->dev
to NULL; otherwise a crash occurs when the rds connection is released,
because at that moment rds_ib_dev_free() goes through the ib_device
(i.e. rds_ibdev->dev) to release the mr and fmr, and reusing the
released ib_device causes the crash.
RDS: make sure rds_ib_remove_one() returns only after the device is freed.
This is to avoid a possible race condition in which rds_ib_remove_one()
returns prematurely and IB removes the underlying device. RDS later tries to
free the device and trips over it.
"When fed mangled socket data, rds will trust what userspace gives it,
and tries to allocate enormous amounts of memory larger than what
kmalloc can satisfy."
Reported-by: Dave Jones <davej@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
Signed-off-by: Cong Wang <amwang@redhat.com>
Acked-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Guangyu Sun <guangyu.sun@oracle.com>
(cherry picked from commit 1524f0a4e3e23b3c8b4235eb7d9932129cc0006b)
jeff.liu [Mon, 8 Oct 2012 18:57:27 +0000 (18:57 +0000)]
RDS: fix rds-ping spinlock recursion
This is the revised patch for fixing the rds-ping spinlock recursion,
following Venkat's suggestions.
The RDS ping/pong-over-TCP feature has been broken for years (2.6.39 to
3.6.0), since we have to set the TCP cork and call kernel_sendmsg() between
the ping and the pong, both of which need to lock "struct sock *sk". However,
this lock has already been held before the rds_tcp_data_ready() callback is
triggered. As a result, we always face spinlock recursion, which
results in a system panic.
Given that RDS ping is only used to test connectivity and not for
serious performance measurements, we can queue the pong transmit to
rds_wq as a delayed response.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
CC: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
CC: David S. Miller <davem@davemloft.net>
CC: James Morris <james.l.morris@oracle.com>
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 5175a5e76bbdf20a614fb47ce7a38f0f39e70226)
Signed-off-by: Jerry Snitselaar <jerry.snitselaar@oracle.com>
Conflicts:
net/rds/send.c
rds: UNDO reverts done for rebase code to compile with Linux 4.1 APIs
Commit 163377dd82f2d81809aabe736a2e0ea515055a69 reverts
to the common ancestor of upstream and UEK2, to rebase the UEK2 patches
for net/rds. This commit undoes the reverts that are needed for the code
to compile against the Linux 4.1 APIs.
UNDO Revert "net: Replace get_cpu_var through this_cpu_ptr" for net/rds
This commit does UNDO of revert of commit 903ceff7ca7b4d80c083a80ee5163b74e9fa359f for net/rds.
UNDO Revert "net: introduce helper macro for_each_cmsghdr" for net/rds
This commit does UNDO of revert of commit f95b414edb18de59940dcebbefb49cf25c6d505c for net/rds
UNDO Revert "net: Remove iocb argument from sendmsg and recvmsg" for net/rds
This commit does UNDO of revert of commit 1b784140474e4fc94281a49e96c67d29df0efbde for net/rds.
These commits were reverted earlier to rebase the unmodified UEK2 RDS code
(the UNDO is needed to compile against the new Linux 4.1 kernel APIs, which changed *after* Linux 3.18)