Venkat Venkatsubra [Thu, 8 Aug 2013 05:15:05 +0000 (22:15 -0700)]
RDS: added stats to track and display receive side memory usage
Added these stats:
1. per-connection stat for number of receive buffers in cache
2. global stat for the same across all connections
3. number of bytes in socket receive buffer
Since stats are implemented using per-cpu variables and RDS currently
does unsigned arithmetic to add them up, separate counters (one for
addition and one for subtraction) are used for (2) and (3).
In the future we might change it to signed computation.
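The split-counter idea described above can be sketched in userspace C as follows. This is a minimal illustration, not the RDS implementation: the `split_stat` type, the helper names, and the fixed `NR_CPUS` value are all assumptions made for the example. Each per-CPU slot only ever grows, so no slot has to represent a negative value, and the subtraction happens once after both directions have been summed.

```c
#include <stdint.h>

#define NR_CPUS 4 /* hypothetical CPU count for the sketch */

/* One slot per CPU and per direction; every slot only ever grows. */
struct split_stat {
	uint64_t added[NR_CPUS];
	uint64_t removed[NR_CPUS];
};

/* Record n units added on the given CPU. */
static void stat_add(struct split_stat *s, int cpu, uint64_t n)
{
	s->added[cpu] += n;
}

/* Record n units removed on the given CPU (possibly a different one). */
static void stat_sub(struct split_stat *s, int cpu, uint64_t n)
{
	s->removed[cpu] += n;
}

/* Sum each direction separately, then subtract once at the end. */
static uint64_t stat_current(const struct split_stat *s)
{
	uint64_t add = 0, sub = 0;
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		add += s->added[cpu];
		sub += s->removed[cpu];
	}
	return add - sub;
}
```

A buffer counted on one CPU and released on another then shows up as one increment in each array rather than as a negative value in a single slot.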
Bang Nguyen [Thu, 15 Aug 2013 02:10:00 +0000 (19:10 -0700)]
RDS: RDS reconnect stalls
After successfully negotiating the version at the lower protocol, RDS incorrectly
set the proposed version to the higher protocol, causing the subsequent
reconnect to stall.
The fix is not to change the proposed version after the initial connection
setup.
Ahmed Abbas [Thu, 18 Jul 2013 23:59:59 +0000 (16:59 -0700)]
add NETFILTER support
Orabug: 17082619
Adds the ability for the RDS code to support the NETFILTER kernel interfaces.
This allows for packet inspection, modification, and potential redirection as
the packets flow through the lower layers of the RDS code.
Jay Fenlason (fenlason@redhat.com) found a bug,
that recvfrom() on an RDS socket can return the contents of random kernel
memory to userspace if it was called with an address length larger than
sizeof(struct sockaddr_in).
rds_recvmsg() also fails to set the addr_len parameter properly before
returning, but that's just a bug.
There are also a number of cases where recvfrom() can return an entirely bogus
address. Anything in rds_recvmsg() that returns a non-negative value but does
not go through the "sin = (struct sockaddr_in *)msg->msg_name;" code path
at the end of the while(1) loop will return up to 128 bytes of kernel memory
to userspace.
I wrote two test programs to reproduce this bug. In rds_server, the copy
into fromAddr runs past its end and overwrites the sock_fd that follows it.
Yes, it is the programmer's fault for setting msg_namelen incorrectly, but it
is better for the kernel to copy only the real length of the address to user
space in such a case.
How to run the test programs?
I tested them on a 32-bit x86 system running 3.5.0-rc7.
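The behaviour argued for above (report the real address length regardless of what the caller passed in) can be sketched in userspace C. The helper `report_sender()` is hypothetical and is not the kernel's rds_recvmsg(); the extra check that the caller's buffer is large enough is also an assumption of this sketch.

```c
#include <string.h>
#include <netinet/in.h>
#include <sys/socket.h>

/*
 * Hypothetical helper, not the kernel function: copy the sender address
 * into the caller's buffer and report how many bytes are valid.
 */
static void report_sender(struct msghdr *msg, const struct sockaddr_in *from)
{
	socklen_t avail = msg->msg_namelen;

	/* Start from "no address" so early returns never claim stale bytes. */
	msg->msg_namelen = 0;
	if (msg->msg_name == NULL || from == NULL || avail < sizeof(*from))
		return;

	/* Copy exactly one sockaddr_in, never the caller-supplied length. */
	memcpy(msg->msg_name, from, sizeof(*from));
	msg->msg_namelen = sizeof(*from);
}
```

With this shape, a caller that passes sizeof(fromAddr) + 16 still gets back msg_namelen equal to sizeof(struct sockaddr_in), and the bytes after the address are left untouched.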
4 you will see something like:
server is waiting to receive data...
old socket fd=3
server received data from client:data from client
msg.msg_namelen=32
new socket fd=-1067277685
sendmsg()
: Bad file descriptor
	printf("server is waiting to receive data...\n");
	msg.msg_name = &fromAddr;

	/*
	 * I add 16 to sizeof(fromAddr), ie 32,
	 * and pay attention to the definition of fromAddr,
	 * recvmsg() will overwrite sock_fd,
	 * since kernel will copy 32 bytes to userspace.
	 *
	 * If you just use sizeof(fromAddr), it works fine.
	 * */
	msg.msg_namelen = sizeof(fromAddr) + 16;
	/* msg.msg_namelen = sizeof(fromAddr); */
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_iov->iov_base = recvBuffer;
	msg.msg_iov->iov_len = 128;
	msg.msg_control = 0;
	msg.msg_controllen = 0;
	msg.msg_flags = 0;

	while (1) {
		printf("old socket fd=%d\n", sock_fd);
		if (recvmsg(sock_fd, &msg, 0) == -1) {
			perror("recvmsg() error\n");
			close(sock_fd);
			exit(1);
		}
		printf("server received data from client:%s\n", recvBuffer);
		printf("msg.msg_namelen=%d\n", msg.msg_namelen);
		printf("new socket fd=%d\n", sock_fd);
		strcat(recvBuffer, "--data from server");
		if (sendmsg(sock_fd, &msg, 0) == -1) {
			perror("sendmsg()\n");
			close(sock_fd);
			exit(1);
		}
	}

	close(sock_fd);
	return 0;
}
Signed-off-by: Weiping Pan <wpan@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Guangyu Sun <guangyu.sun@oracle.com>
(cherry picked from commit eb3ccc4c696e5c4a10d324886fd061ea88bab6c4)
Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com> Acked-by: Zheng Li <zheng.x.li@oracle.com> Signed-off-by: Jerry Snitselaar <jerry.snitselaar@oracle.com>
(cherry picked from commit 78b7d86911046c3a10ffa52d90f4f1a4523d7ac3)
When rds_ib_remove_one() returns, the driver's mlx4_ib_removeone
function destroys the ib_device, so we must set rds_ibdev->dev
to NULL. Otherwise a crash occurs when the RDS connection is released:
at that point rds_ib_dev_free() releases the MR and FMR through the
ib_device, i.e. rds_ibdev->dev, and reusing the already released
ib_device causes the crash.
RDS: make sure rds_ib_remove_one() returns only after the device is freed.
This is to avoid possible race condition in which rds_ib_remove_one() returns
prematurely and IB removes the underlying device. RDS later tries to free the
device and trips over.
"When fed mangled socket data, rds will trust what userspace gives it,
and tries to allocate enormous amounts of memory larger than what
kmalloc can satisfy."
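One common way to guard against this kind of report is to bound the caller-supplied length before it reaches the allocator. The sketch below is a userspace illustration only: `sketch_msg`, `sketch_msg_alloc()` and the `SKETCH_MAX_ALLOC` cap are hypothetical names and values, and the check shown is not claimed to be the one the fix uses.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_MAX_ALLOC ((size_t)4 * 1024 * 1024) /* hypothetical cap */

struct sketch_msg {
	size_t extra_len;
	unsigned char extra[]; /* caller-sized trailing data */
};

/* Returns NULL when the caller-supplied length is too large to trust. */
static struct sketch_msg *sketch_msg_alloc(size_t extra_len)
{
	struct sketch_msg *m;

	/* Compare before adding so the size computation cannot overflow. */
	if (extra_len > SKETCH_MAX_ALLOC - sizeof(*m))
		return NULL;

	m = calloc(1, sizeof(*m) + extra_len);
	if (m)
		m->extra_len = extra_len;
	return m;
}
```

The comparison is written as `extra_len > cap - header` rather than `header + extra_len > cap` so that a huge `extra_len` cannot wrap around and slip past the check.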
Reported-by: Dave Jones <davej@redhat.com> Cc: Dave Jones <davej@redhat.com> Cc: David S. Miller <davem@davemloft.net> Cc: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> Signed-off-by: Cong Wang <amwang@redhat.com> Acked-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Guangyu Sun <guangyu.sun@oracle.com>
(cherry picked from commit 1524f0a4e3e23b3c8b4235eb7d9932129cc0006b)
jeff.liu [Mon, 8 Oct 2012 18:57:27 +0000 (18:57 +0000)]
RDS: fix rds-ping spinlock recursion
This is the revised patch for fixing rds-ping spinlock recursion
according to Venkat's suggestions.
The RDS ping/pong over TCP feature has been broken for years (2.6.39 to
3.6.0) since we have to set TCP cork and call kernel_sendmsg() between
ping and pong, which both need to lock "struct sock *sk". However, this
lock is already held before the rds_tcp_data_ready() callback is
triggered. As a result, we always hit spinlock recursion, which
results in a system panic.
Given that RDS ping is only used to test the connectivity and not for
serious performance measurements, we can queue the pong transmit to
rds_wq as a delayed response.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com> CC: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> CC: David S. Miller <davem@davemloft.net> CC: James Morris <james.l.morris@oracle.com> Signed-off-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 5175a5e76bbdf20a614fb47ce7a38f0f39e70226)
Signed-off-by: Jerry Snitselaar <jerry.snitselaar@oracle.com>
Conflicts:
net/rds/send.c
rds: UNDO reverts done for rebase code to compile with Linux 4.1 APIs
Commit 163377dd82f2d81809aabe736a2e0ea515055a69 reverts net/rds
to the common ancestor of upstream and UEK2 so that the UEK2 patches
can be rebased. This commit undoes the reverts that are needed to
compile against the Linux 4.1 APIs.
UNDO Revert "net: Replace get_cpu_var through this_cpu_ptr" for net/rds
This commit does UNDO of revert of commit 903ceff7ca7b4d80c083a80ee5163b74e9fa359f for net/rds.
UNDO Revert "net: introduce helper macro for_each_cmsghdr" for net/rds
This commit does UNDO of revert of commit f95b414edb18de59940dcebbefb49cf25c6d505c for net/rds.
UNDO Revert "net: Remove iocb argument from sendmsg and recvmsg" for net/rds
This commit does UNDO of revert of commit 1b784140474e4fc94281a49e96c67d29df0efbde for net/rds.
These commits were reverted earlier to rebase unmodified UEK2 RDS code
(UNDO needed to compile to new Linux 4.1 kernel APIs - changed *after* Linux 3.18)
Dotan Barak [Thu, 7 Jun 2012 05:56:34 +0000 (08:56 +0300)]
RDS: fixed compilation warnings
Fixed the following compilation warnings:
net/rds/send.c: In function 'rds_send_xmit':
net/rds/send.c:299: warning: suggest parentheses around && within ||
net/rds/rdma.c: In function 'rds_cmsg_rdma_dest':
net/rds/rdma.c:697: warning: format '%Lx' expects type 'long long unsigned int', but argument 2 has type 'u32'
net/rds/ib_recv.c: In function 'rds_ib_srqs_init':
net/rds/ib_recv.c:1570: warning: 'return' with no value, in function returning non-void
Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com> Signed-off-by: Dotan Barak <dotanb@dev.mellanox.co.il>
Bang Nguyen [Sun, 19 Feb 2012 20:19:57 +0000 (12:19 -0800)]
RDS Asynchronous Send support
1. Same behavior as RDMA send, i.e., generate notification on IB completion.
2. On error handling, connection is closed for traffic, i.e., new sends are
rejected and client retries
3. To guarantee ordering, all pending async (RDMA/bcopy) sends after the
failed send will also be aborted, and in the order that they were submitted.
4. Re-open connection for traffic after all the failed notifications have
been reaped by the client.
Dotan Barak [Wed, 15 Feb 2012 16:00:50 +0000 (18:00 +0200)]
rds: fix compilation warnings
net/rds/ib_recv.c: In function 'rds_ib_srq_event':
net/rds/ib_recv.c:1490: warning: too many arguments for format
net/rds/ib_recv.c:1484: warning: unused variable 'srq_attr'
net/rds/ib_recv.c: In function 'rds_ib_srq_init':
net/rds/ib_recv.c:1524: warning: passing argument 1 of 'ERR_PTR' makes
integer from pointer without a cast
include/linux/err.h:20: note: expected 'long int' but argument is of
type 'struct ib_srq *'
net/rds/ib_recv.c:1524: warning: format '%d' expects type 'int', but
argument 2 has type 'void *'
Bang Nguyen [Fri, 3 Feb 2012 16:10:06 +0000 (11:10 -0500)]
RDS Quality Of Service
RDS QoS is an extension of IB QoS to provide clients the ability to
segregate traffic flows and define policy to regulate them.
Internally, each traffic flow is represented by a connection with all of its
independent resources like that of a normal connection, and is
differentiated by service type. In other words, there can be multiple
connections between an IP pair and each supports a unique service type.
Service type (TOS) is user-defined and can be configured to satisfy certain
traffic requirements. For example, one service type may be configured for
high-priority low-latency traffic, another for low-priority high-bandwidth
traffic, and so on.
TOS is socket based. A client can set the TOS on a socket via an IOCTL and must
do so before initiating any traffic. Once set, the TOS cannot be changed.
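The set-once rule described above can be sketched as a small state check. Everything here is hypothetical: the `sketch_sock` structure, its field names, and the choice of -EINVAL as the error are assumptions made for illustration, not the RDS IOCTL implementation.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-socket state; field names are illustrative only. */
struct sketch_sock {
	uint8_t tos;
	bool tos_set;
	bool traffic_started;
};

/* Returns 0 on success or a negative errno value when the change is refused. */
static int sketch_set_tos(struct sketch_sock *sk, uint8_t tos)
{
	/* Refuse once a TOS has been chosen or any traffic has been sent. */
	if (sk->tos_set || sk->traffic_started)
		return -EINVAL;

	sk->tos = tos;
	sk->tos_set = true;
	return 0;
}
```

Because the TOS selects which connection between an IP pair carries the socket's traffic, fixing it before the first send keeps each socket tied to a single flow.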
Chris Mason [Fri, 3 Feb 2012 16:09:49 +0000 (11:09 -0500)]
RDS: make sure rds_send_xmit doesn't loop forever
rds_send_xmit can get stuck doing work on behalf of other senders. This
breaks out if we've been working too long. The work queue will get kicked
to finish off any other requests if our current process gives up.
Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
Chris Mason [Fri, 3 Feb 2012 16:09:36 +0000 (11:09 -0500)]
RDS: don't test ring_empty or ring_low without locks held
The math in the ring functions can't be trusted unless you're either the only
person adding to the ring or the only person freeing from it. If there are no
locks held at all you can end up hitting bogus assertions around the ring counters.
This changes the rds_ib_recv_refill code and the recv tasklet code to make sure
proper locks are held before we use rds_ib_ring_empty or rds_ib_ring_low.
Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
Venkat Venkatsubra [Fri, 3 Feb 2012 16:09:07 +0000 (11:09 -0500)]
RDS: avoid double destroy of cm_id when rdma_resolve_route fails
It crashes in rds_ib_conn_shutdown because it was using a freed cm_id. The
cm_id had got freed quite a while back actually (more than 15 secs back) during
an earlier connect attempt.
This was the sequence of the earlier connect attempt: rds_ib_conn_connect calls
rdma_resolve_addr. The synchronous part of rdma_resolve_addr succeeds. But the
asynchronous part fails at some point. RDMA Connection Manager returns the
event RDMA_CM_EVENT_ADDR_RESOLVED. This part succeeds. Next, RDS calls
rdma_resolve_route from the rds_rdma_cm_event_handler. This fails. We return
this error back to the RDMA CM addr_handler which destroys the cm_id as
follows: addr_handler (cma.c):
Later, when a new connect req comes in from the remote side, we shut down this cm_id
and try to reconnect:
	/*
	 * after 15 seconds, give up on existing connection
	 * attempts and make them try again. At this point
	 * it's no longer a race but something has gone
	 * horribly wrong
	 */
	if (now > conn->c_connection_start &&
	    now - conn->c_connection_start > 15) {
		printk(KERN_CRIT "rds connection racing for 15s, forcing reset "
			"connection %u.%u.%u.%u->%u.%u.%u.%u\n",
			NIPQUAD(conn->c_laddr), NIPQUAD(conn->c_faddr));
		rds_conn_drop(conn);
		....
We crash during the shutdown.
Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
Chris Mason [Fri, 3 Feb 2012 16:09:07 +0000 (11:09 -0500)]
RDS: make sure rds_send_drop_to properly takes the m_rs_lock
rds_send_drop_to is used during socket tear down to find all the
messages on the socket and clean them up. It can race with the
acking code unless it takes the m_rs_lock on each and every message.
This plugs a hole where we didn't take m_rs_lock on any message that
didn't have the RDS_MSG_ON_CONN set. Taking m_rs_lock avoids
double frees and other memory corruptions as the ack code trusts
the message m_rs pointer on a socket that had actually been freed.
Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
Chris Mason [Fri, 3 Feb 2012 16:09:07 +0000 (11:09 -0500)]
RDS: kick krdsd to send congestion map updates
We can get into a deadlock on the recv spinlock because
congestion map updates can be sent in the recv path. This
pushes the work off to krdsd instead.
Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
Chris Mason [Fri, 3 Feb 2012 16:09:07 +0000 (11:09 -0500)]
RDS: add debugging code around sock_hold and sock_put.
RDS had a recent series of memory corruptions because of
a use-after-free and double-free of rds sockets. This adds
some debugging code around sock_put and sock_hold to
catch any similar bugs and spit out useful debugging info.
This is a temporary commit while customers try out our fix.
Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
Chris Mason [Fri, 3 Feb 2012 16:07:54 +0000 (11:07 -0500)]
RDS: don't trust the LL_SEND_FULL bit
We are seeing connections stuck with the LL_SEND_FULL bit getting
set and never cleared. This changes RDS to stop trusting the
LL_SEND_FULL bit and kick krdsd after any time we
see -ENOMEM from the ring allocation code.
Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
Chris Mason [Fri, 3 Feb 2012 16:07:54 +0000 (11:07 -0500)]
RDS: give up on half formed connections after 15s
RDS relies on events to transition connections through a few
different states, but sometimes we get stuck and end up with
a half formed connection that is never able to finish.
The other end has either wandered off or there are bugs in
other layers, and we end up with any future attempts from
the other end rejected because we're already working on a
connection attempt.
This patch changes things to give up on half formed connections
after 15 seconds.
Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
Chris Mason [Fri, 3 Feb 2012 16:07:41 +0000 (11:07 -0500)]
RDS: make sure not to loop forever inside rds_send_xmit
If a determined set of concurrent senders keep the send queue full,
we can loop forever inside rds_send_xmit. This fix has two parts.
First we are dropping out of the while(1) loop after we've processed a
large batch of messages.
Second we add a generation number that gets bumped each time the
xmit bit lock is acquired. If someone else has jumped in and
made progress in the queue, we skip our goto restart.
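The two parts of this fix can be sketched in userspace C. This is a simplified, single-threaded illustration under stated assumptions: `sketch_queue`, the helper names, and the `SKETCH_BATCH` value are hypothetical, and the real code uses a bit lock and list handling that are not modelled here.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_BATCH 1024 /* hypothetical batch size */

struct sketch_queue {
	size_t pending;         /* messages waiting to be sent */
	unsigned long send_gen; /* bumped each time the xmit lock is taken */
};

/*
 * Part one: take the "lock" (modelled as bumping the generation), send at
 * most one batch, and return the generation we ran under.
 */
static unsigned long sketch_xmit_batch(struct sketch_queue *q)
{
	unsigned long my_gen = ++q->send_gen;
	size_t sent = 0;

	while (q->pending > 0 && sent < SKETCH_BATCH) {
		q->pending--;
		sent++;
	}
	return my_gen;
}

/*
 * Part two: after dropping the lock, restart only if work remains and
 * nobody else has taken the lock since we did.
 */
static bool sketch_should_restart(const struct sketch_queue *q,
				  unsigned long my_gen)
{
	return q->pending > 0 && q->send_gen == my_gen;
}
```

If another sender has bumped the generation in the meantime, that sender is responsible for the remaining messages, so the current caller can return instead of looping.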
Signed-off-by: Chris Mason <chris.mason@oracle.c.om> Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
Andy Grover [Thu, 13 Jan 2011 19:40:31 +0000 (11:40 -0800)]
rds: check for excessive looping in rds_send_xmit
Signed-off-by: Andy Grover <andy.grover@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
Andy Grover [Fri, 24 Sep 2010 17:16:37 +0000 (10:16 -0700)]
change ib default retry to 1
Signed-off-by: Andy Grover <andy.grover@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
Andy Grover [Fri, 3 Feb 2012 16:07:40 +0000 (11:07 -0500)]
This patch adds the modparam to rds.ko.
Signed-off-by: Andy Grover <andy.grover@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
Zach Brown [Fri, 3 Feb 2012 16:07:40 +0000 (11:07 -0500)]
RDS: only use passive connections when addresses match
Passive connections were added for the case where one loopback IB
connection between identical addresses needs another connection to store
the second QP. Unfortunately, they were also created in the case where
the addresses differ and we already have both QPs.
This led to a message reordering bug.
- two different IB interfaces and addresses on a machine: A B
- traffic is sent from A to B
- connection from A-B is created, connect request sent
- listening accepts connect request, B-A is created
- traffic flows, next_rx is incremented
- unacked messages exist on the retrans list
- connection A-B is shut down, new connect request sent
- listen sees existing loopback B-A, creates new passive B-A
- retrans messages are sent and delivered because of 0 next_rx
The problem is that the second connection request saw the previously
existing parent connection. Instead of using it, and using the existing
next_rx_seq state for the traffic between those IPs, it mistakenly
thought that it had to create a passive connection.
We fix this by only using passive connections in the special case where
laddr and faddr match. In this case we'll only ever have one parent
sending connection requests and one passive connection created as the
listening path sees the existing parent connection which initiated the
request.
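The rule stated above reduces to a single predicate. The sketch below is an illustration only: `sketch_use_passive()` and its parameters are hypothetical names, and the addresses are assumed to be IPv4 values compared as plain integers.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Decide whether an incoming connect request should get a passive
 * connection. Names and parameters are illustrative, not the RDS API.
 */
static bool sketch_use_passive(uint32_t laddr, uint32_t faddr,
			       bool parent_exists)
{
	/* Only a loopback pair (laddr == faddr) needs a second conn for its QP. */
	return parent_exists && laddr == faddr;
}
```

When the addresses differ, the existing parent connection and its next_rx_seq state are reused, which is what avoids the reordering sequence listed earlier.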
Signed-off-by: Zach Brown <zach.brown@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
Zach Brown [Fri, 3 Feb 2012 16:07:40 +0000 (11:07 -0500)]
RDS/IB: always free recv frag as we free its ring entry
We were still seeing rare occurrences of the WARN_ON() that indicates
that the recv refill path was finding allocated frags in ring entries
that were marked free. These were usually followed by oom crashes.
They only seem to be occurring in the presence of interesting completion
errors and connection resets.
There are error paths in rds_ib_recv_cqe_handler() that could leave a
recv frag sitting in the ring. This patch ensures that we free the frag
as we mark the ring entry free. This should stop the refill path from
finding allocated frags in ring entries that were marked free.
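The invariant this patch enforces (an entry marked free never still owns a frag) can be sketched by doing both steps in one helper. The `sketch_recv` structure and `sketch_recv_free()` are hypothetical userspace stand-ins, with malloc/free in place of the kernel's frag allocation.

```c
#include <stdlib.h>

struct sketch_recv {
	void *frag; /* receive fragment owned by this ring entry, or NULL */
	int in_use; /* nonzero while the entry is posted */
};

/* Mark the entry free and release its fragment in the same step. */
static void sketch_recv_free(struct sketch_recv *r)
{
	free(r->frag); /* free(NULL) is a no-op */
	r->frag = NULL;
	r->in_use = 0;
}
```

Routing every path, including the error paths, through one helper like this means the refill code can rely on free entries having a NULL frag.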
Signed-off-by: Zach Brown <zach.brown@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
Andy Grover [Tue, 7 Sep 2010 17:59:44 +0000 (10:59 -0700)]
RDS/IB: Quiet warnings when leaking frags
We have a race where sometimes we leak frags, and it hits
the WARN_ON. Unfortunately, the stream of WARN_ONs makes
the machine unusable. This patch changes it to WARN_ON_ONCE
so we do not hose the box, and we still get notified that
the bug has occurred.
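The idea behind warning only once can be imitated in userspace C as follows. This is a sketch, not the kernel macro: `SKETCH_WARN_ON_ONCE`, `sketch_warn_count` and `sketch_check_frag()` are hypothetical, and unlike the kernel's WARN_ON_ONCE this version does not evaluate to the condition's value.

```c
#include <stdio.h>

static int sketch_warn_count; /* counts emitted warnings, for the sketch only */

/*
 * Evaluate the condition every time, but report it only the first time
 * it is true. Each use site gets its own static flag.
 */
#define SKETCH_WARN_ON_ONCE(cond) \
	do { \
		static int warned_; \
		if ((cond) && !warned_) { \
			warned_ = 1; \
			sketch_warn_count++; \
			fprintf(stderr, "warning: %s at %s:%d\n", \
				#cond, __FILE__, __LINE__); \
		} \
	} while (0)

/* Example use site: complain once if a supposedly free entry has a frag. */
static void sketch_check_frag(const void *frag)
{
	SKETCH_WARN_ON_ONCE(frag != NULL);
}
```

Repeated leaks at the same site then produce a single message instead of a flood, while the first occurrence is still recorded.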
Signed-off-by: Andy Grover <andy.grover@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
Zach Brown [Fri, 23 Jul 2010 17:37:33 +0000 (10:37 -0700)]
RDS: cancel connection work structs as we shut down
Nothing was canceling the send and receive work that might have been
queued as a conn was being destroyed.
Signed-off-by: Zach Brown <zach.brown@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
Zach Brown [Fri, 23 Jul 2010 17:36:58 +0000 (10:36 -0700)]
RDS: don't call rds_conn_shutdown() from rds_conn_destroy()
rds_conn_shutdown() can return before the connection is shut down when
it encounters an existing state that it doesn't understand. This lets
rds_conn_destroy() then start tearing down the conn from under paths
that are still using it.
It's more reliable to queue the shutdown work and wait for krdsd to complete the
shutdown callback. This stopped some hangs I was seeing where krdsd was
trying to shut down a freed conn.
Signed-off-by: Zach Brown <zach.brown@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
Zach Brown [Fri, 23 Jul 2010 17:32:31 +0000 (10:32 -0700)]
RDS: have sockets get transport module references
Right now there's nothing to stop the various paths that use
rs->rs_transport from racing with rmmod and executing freed transport
code. The simple fix is to have binding to a transport also hold a
reference to the transport's module, removing this class of races.
We already had an unused t_owner field which was set for the modular
transports and which wasn't set for the built-in loop transport.
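The reference-on-bind pattern described above can be sketched in userspace C. All types and helpers here are hypothetical stand-ins: `sketch_module_get()` only imitates the general shape of taking a module reference, with a NULL owner standing for a built-in transport that needs no reference.

```c
#include <stdbool.h>
#include <stddef.h>

struct sketch_module {
	int refs;
	bool unloading;
};

struct sketch_transport {
	struct sketch_module *owner; /* NULL for a built-in transport */
};

struct sketch_bound_sock {
	struct sketch_transport *trans;
};

/* Take a reference unless the module is already on its way out. */
static bool sketch_module_get(struct sketch_module *m)
{
	if (m == NULL)
		return true; /* built in: nothing to pin */
	if (m->unloading)
		return false;
	m->refs++;
	return true;
}

static void sketch_module_put(struct sketch_module *m)
{
	if (m)
		m->refs--;
}

/* Binding pins the transport's module for the lifetime of the binding. */
static int sketch_bind(struct sketch_bound_sock *sk, struct sketch_transport *t)
{
	if (!sketch_module_get(t->owner))
		return -1;
	sk->trans = t;
	return 0;
}

/* Releasing the socket drops the reference taken at bind time. */
static void sketch_release(struct sketch_bound_sock *sk)
{
	if (sk->trans) {
		sketch_module_put(sk->trans->owner);
		sk->trans = NULL;
	}
}
```

As long as a bound socket exists, the reference count stays above zero, so the module cannot be removed underneath code that dereferences the transport.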
Signed-off-by: Zach Brown <zach.brown@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
Zach Brown [Wed, 21 Jul 2010 22:13:25 +0000 (15:13 -0700)]
RDS: remove old rs_transport comment
rs_transport is now also used by the rdma paths once the socket is
bound. We don't need this stale comment to tell us what cscope can.
Signed-off-by: Zach Brown <zach.brown@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>