Merge branch 'topic/uek-4.1/ofed.rds-p1' into topic/uek-4.1/ofed
* topic/uek-4.1/ofed.rds-p1: (126 commits)
rds: UNDO reverts done for rebase code to compile with Linux 4.1 APIs
rds: port to UEK4, Linux-3.18*
rds: disable APM support
rds: disable cq balance
rds: move linux/rds.h to uapi/linux/rds.h
RDS: Kconfig and Makefile changes
RDS merge for UEK2
rds: Misc Async Send fixes
rds: call unregister_netdevice_notifier for rds_ib_nb in rds_ib_exit
rds: flush and destroy workqueue rds_aux_wq and fix creation order.
rds : fix compilation warning
rds: port the code to uek2
rds: CQ balance
rds: HAIP across HCAs
rds: Misc HAIP fixes
rds: off by one fixes
rds: Add Automatic Path Migration support
rds: fix error flow handling
net/rds: prevent memory leak in case of error flow
rds: prepare support to kernel 2.6.39-200.1.1.el5uek: add the macro NIPQUAD_*
...
Merge branch 'topic/uek-4.1/ofed.mlnx2.4-p3.orclFixes' into topic/uek-4.1/ofed
* topic/uek-4.1/ofed.mlnx2.4-p3.orclFixes:
IB/ipoib: CSUM support in connected mode
IB/ipoib: Scatter-Gather support in connected mode
ib_uverbs: Support for kernel implementation of XRC calls from user space
ib_{uverbs/core}: add new ib_create_qp_ex with udata arg
ib_uverbs: Avoid vendor specific masking of attributes in query_qp
ib_uverbs: Add padding to end align ib_uverbs_reg_mr_resp
ib: Add udata argument to create_ah
ib_umem: Add a new, more generic ib_umem_get_attrs
ib_mad: incoming sminfo SMPs gets discarded if no process_mad function is registered
mlx4_core: More support for automatically scaling profile parameters
ipoib: rfe- enable pkey and device name decoupling
ib_sdp: adding sdp socket support to rdma_cm
mlx4_vnic: set mod param "lro_num" default value to 0 to disable LRO feature
mlx4_vnic: Add correct typecasting to pointers in vnic_get_frag_header()
rdma_cm: CMA_QUERY_HANDLER: BAD STATUS -110 and -22
RDMA CM: Avoid possible SEGV during connection shutdown
rdma_cm: extend debug for remote mapping
Merge branch 'topic/uek-4.1/ofed.mlnx2.4-p2' into topic/uek-4.1/ofed
* topic/uek-4.1/ofed.mlnx2.4-p2: (93 commits)
mlx4_core: supporting 64b counters
ib_core: supporting 64b counters using PMA_COUNTERS_EXT mad
net/mlx4: When issuing commands use rwsem insteam of rw spinlocks
mlx4_ib: Make sure that PSN does not overflow.
ib_core: Make sure that PSN does not overflow.
IB/CMA: Make sure that PSN is not over max allowed
IB/mlx4: Mark user mr as writable if actual virtual memory is writable
mlx4_ib: Report proper BDF for IB MSI-X vectors
IB/core: Fix memory leak in cm_req_handler error flows
mlx4_core: enable msi_x module parameter for SRIOV VFs to limit number MSI-X interrupts per VF
mlx4_ib: Fix endianness in blueflame post_send.
net/mlx4: Switching between sending commands via polling and events may results in hung tasks
IB/mlx4: Put non zero value in max_ah
IB/core: Add debugging prints to ib_uverbs_write
IB/core: add debugging prints to explain -EINVAL in ib_uverbs_reg_mr
fix warning about bitwise or between u32 and size_t
IB/mlx4: Don't update QP1 for native functions
IB/ipoib: Check gso size prior to ib_send
mlx4_vnic: fix may be used uninitialized compilation warnings
mlx4_vnic: fix potential data corruption in sprintf
...
Merge branch 'topic/uek-4.1/ofed.mlnx2.4-p1' into topic/uek-4.1/ofed
* topic/uek-4.1/ofed.mlnx2.4-p1: (23 commits)
mlx4_vnic: Kconfig and Makefile changes
mlx4_vnic: add mlx4_vnic
mlx4_ib: add blue flame support for kernel consumers
net/mlx4_core: add sanity check when creating bitmap structure
net/mlx4_core: unmap clear register in case of error flow
ib_core: fix NULL pointer dereference
mlx4_ib: contig support for control objects
mlx4_core: fix wrong comment about the reason of subtract one from the max_cqes
IB/core - Don't modify outgoing DR SMP if first part is LID routed
net/mlx4: adjust initial value of vl_cap in mlx4_SET_PORT
mlx4_core: Error message on mtt allocation failure
IB/core: Control number of retries for SA to leave an MCG
mlx4: reducing wait during SW reset for 500 msecs
mlx4_ib: Do not enable blueflame sends if write combining is not available
IB/core: Fix create_qp issue relates to qp group type
mlx4_core: log_num_mtt handling
mlx4_ib: Fix the SQ size of an RC QP to support masked atomic operation
mlx4_ib: Use optimal numbers of MTT entries.
mlx4_ib: set write-combining flag for userspace blueflame pages
mlx4_core: limit min profile numbers
...
rds: UNDO reverts done for rebase code to compile with Linux 4.1 APIs
Commit 163377dd82f2d81809aabe736a2e0ea515055a69 reverts net/rds
to the common ancestor of upstream and UEK2 so that the UEK2 patches
for net/rds can be rebased. This commit undoes the reverts needed to
compile against Linux 4.0 APIs.
UNDO Revert "net: Replace get_cpu_var through this_cpu_ptr" for net/rds
This commit does UNDO of revert of commit 903ceff7ca7b4d80c083a80ee5163b74e9fa359f for net/rds.
UNDO Revert "net: introduce helper macro for_each_cmsghdr" for net/rds
This commit does UNDO of revert of commit f95b414edb18de59940dcebbefb49cf25c6d505c for net/rds
UNDO Revert "net: Remove iocb argument from sendmsg and recvmsg" for net/rds
This commit does UNDO of revert of commit 1b784140474e4fc94281a49e96c67d29df0efbde for net/rds.
These commits were reverted earlier to rebase unmodified UEK2 RDS code
(UNDO needed to compile to new Linux 4.1 kernel APIs - changed *after* Linux 3.18)
Yuval Shaia [Tue, 16 Jun 2015 07:32:36 +0000 (00:32 -0700)]
IB/ipoib: CSUM support in connected mode
This enhancement suggests using the IB CRC instead of CSUM in IPoIB CM.
IPoIB CM uses RC (Reliable Connection) which guarantees the corruption free
delivery of the packet.
InfiniBand uses 32b CRC which provides stronger data integrity protection
compared to 16b IP Checksum. So, there is no added value that IP/TCP Checksum
provides in the IB world.
The proposal is to tell network stack that IPoIB-CM supports IP Checksum
offload. This enables the kernel to save the time of checksum calculation
of IPoIB CM packets. Network sends the IP packet without adding the IP
Checksum to the header. On the receive side, IPoIB driver again tells the
network stack that IP Checksum is good for the incoming packets and network
stack avoids the IP Checksum calculations.
During connection establishment the driver determines whether the peer supports
IB CRC as checksum. This is done so the driver will be able to calculate the
checksum before transmitting the packet in case the peer does not support
this feature.
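The transmit-side fallback described above can be sketched in userspace C. This is an illustration only, not the driver's code: `tx_checksum` and its `peer_supports_ib_crc` argument are hypothetical names, and the checksum routine is the standard 16-bit one's-complement Internet checksum (RFC 1071).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Standard 16-bit one's-complement Internet checksum (RFC 1071). */
static uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    assert(data != NULL || len == 0);
    for (i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)((data[i] << 8) | data[i + 1]);
    if (len & 1)                    /* odd trailing byte, padded with zero */
        sum += (uint32_t)(data[len - 1] << 8);
    while (sum >> 16)               /* fold carries back into the low 16 bits */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/*
 * Hypothetical transmit helper: returns 1 and stores the checksum when
 * software must compute it (the peer lacks the feature); returns 0 and
 * leaves *csum_out untouched when the IB CRC is relied on instead.
 */
static int tx_checksum(int peer_supports_ib_crc, const uint8_t *pkt,
                       size_t len, uint16_t *csum_out)
{
    if (peer_supports_ib_crc)
        return 0;
    *csum_out = inet_checksum(pkt, len);
    return 1;
}
```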
IB/ipoib: Scatter-Gather support in connected mode
By default, IPoIB-CM driver uses 64k MTU. Larger MTU gives better performance.
This MTU plus overhead puts the memory allocation for IP based packets at 32
4k pages (order 5), which have to be contiguous.
When system memory is under pressure, it was observed that allocating 128k of
contiguous physical memory is difficult and causes serious errors (such as
the system becoming unusable).
This enhancement resolves the issue by removing the physically contiguous memory
requirement using the Scatter/Gather feature that exists in the Linux stack.
With this fix Scatter-Gather will also be supported in connected mode.
This change also reverts the change made in commit e112373
("IPoIB/cm: Reduce connected mode TX object size").
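The page arithmetic behind this change can be checked with a small userspace sketch. The helper names are hypothetical and the functions only mirror the rounding involved; they are not kernel APIs.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest order such that (page_size << order) >= size. */
static unsigned int alloc_order(size_t size, size_t page_size)
{
    unsigned int order = 0;
    size_t span = page_size;

    assert(page_size > 0 && size > 0);
    while (span < size) {
        span <<= 1;
        order++;
    }
    return order;
}

/* Number of individual pages needed when they need not be contiguous. */
static size_t pages_needed(size_t size, size_t page_size)
{
    assert(page_size > 0);
    return (size + page_size - 1) / page_size;
}
```

With 4k pages, a 64k payload plus any nonzero overhead no longer fits in order 4, so a contiguous buffer rounds up to order 5 (32 pages, 128k). With scatter/gather, and assuming the overhead is below one page, the same packet needs only 17 separate pages.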
ib_uverbs: Support for kernel implementation of XRC calls from user space
Extends the kernel/user space interface for work requests to also provide
the XRC shared receive queue number. Necessary to support
kernel level implementation of user verbs for XRC.
Requires a corresponding libibverbs change to support XRC.
ib_{uverbs/core}: add new ib_create_qp_ex with udata arg
Necessary to get device specific arguments through to XRC QPs.
Added new local header file to serve as support interface
between ib_core and ib_uverbs.
Right now there is a lot of duplicate setup code in uverbs_cmd.c
on the ib_uverbs side and verbs.c on the ib_core side. This commit
is a quick fix to have XRC support working, but similar calls
can be added to consolidate the code for other parts of the API.
ib_uverbs: Avoid vendor specific masking of attributes in query_qp
This commit moves the implementation and use of the modify_qp_mask
helper function out of the generic OFED implementation and into individual
device drivers.
Like with use of the ib_modify_qp_is_ok function it should be up to
each device driver how to handle bits set in the attribute masks.
With the modify_qp_mask function applied in the generic code,
drivers would not see the bits that the user process actually sets.
The restrictions imposed by the filter are also beyond what
is imposed by the Infiniband standard, and would also limit
future drivers or hardware from checking for unsupported or
invalid settings.
ib_uverbs: Add padding to end align ib_uverbs_reg_mr_resp
The ib_uverbs_reg_mr_resp structure was not 64 bit end aligned
as required by the protocol. This causes alignment issues
if a device specific driver needs to transfer extra response
arguments.
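The alignment problem can be illustrated with two stand-in structures. The layouts and field names below are assumptions chosen for illustration, not the actual ib_uverbs_reg_mr_resp definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical response with three 32-bit fields: 12 bytes, not 64-bit end aligned. */
struct resp_unpadded {
    uint32_t handle;
    uint32_t lkey;
    uint32_t rkey;
};

/* Same fields plus explicit padding so the structure ends on a 64-bit boundary. */
struct resp_padded {
    uint32_t handle;
    uint32_t lkey;
    uint32_t rkey;
    uint32_t reserved;
};

/* Nonzero when data appended after the response would start 64-bit aligned. */
static int is_end_aligned(size_t resp_size)
{
    return resp_size % 8 == 0;
}

/* Sanity check of the two layouts above. */
static void check_layouts(void)
{
    assert(!is_end_aligned(sizeof(struct resp_unpadded)));
    assert(is_end_aligned(sizeof(struct resp_padded)));
}
```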
ib: Add udata argument to create_ah
Most of the ib device driver entry points support optional
device specific parameter transfer between user space and kernel space
via the udata argument - add a similar argument for ib_create_ah.
Update all infiniband drivers to include this argument
in their driver entry point implementation.
ib_umem: Add a new, more generic ib_umem_get_attrs
This call allows a full range of DMA attributes and also
DMA direction to be supplied and is just a refactor of the old ib_umem_get.
Reimplement ib_umem_get using the new generic call,
now a trivial implementation.
Dag Moxnes [Tue, 21 Apr 2015 10:20:02 +0000 (12:20 +0200)]
ib_mad: incoming sminfo SMPs gets discarded if no process_mad function is registered
The process_mad function is an optional IB driver entry point that
allows a driver to intercept or modify MAD traffic.
This fix allows MAD traffic to flow down to the device also
when MAD traffic is completely handled by the device and
no process_mad function is provided.
Mukesh Kacker [Tue, 17 Mar 2015 01:11:27 +0000 (18:11 -0700)]
mlx4_core: More support for automatically scaling profile parameters
Add a new module parameter, "scale_profile",
which allows dynamic scaling of parameters. When it is not set,
the Mellanox default behavior will prevail.
The dynamically configured parameters are typically set to 0 in
configuration - but if they are set to a specific value, a warning
is printed that they are not being dynamically scaled. (This allows
for making exceptions and experimenting with different values).
The original dynamic scaling of profile parameter num_mtt_segs
(governed by log_num_mtt) is retained. In addition scaling is
also introduced for parameter num_qp (governed by log_num_qp).
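The "0 means scale, anything else is kept with a warning" rule can be sketched as follows. `resolve_param` is a hypothetical helper, and the scaled value is passed in rather than derived, since the scaling formula itself is not given here.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: a user value of 0 means "scale automatically";
 * any other value is kept as-is and a warning is printed.
 * Returns the value to use; *warned is set to 1 when the warning fired.
 */
static unsigned int resolve_param(const char *name, unsigned int user_value,
                                  unsigned int scaled_value, int *warned)
{
    assert(name != NULL && warned != NULL);
    *warned = 0;
    if (user_value == 0)
        return scaled_value;
    fprintf(stderr, "%s set to %u, not dynamically scaled\n", name, user_value);
    *warned = 1;
    return user_value;
}
```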
This is not a direct port but similar in spirit to fixes done
in UEK2 with following commits:
52ac96 OFED: Automatically size MTT in mlx4_core
47678c mlx4_core: increase default number of qps in mlx4_core driver
218561 mlx_core: Change log_num_mtt scaling range
497dd4 mlx4_core: change default for mlx4_scale_profile
An error message improvement is borrowed from Mellanox OFED 2.4 commit
17465c net/mlx4: add explicit message if user ask too few QPs
(Code for this commit is already upstream but the error message is less
explicit upstream!)
Mukesh Kacker [Thu, 22 Jan 2015 19:14:02 +0000 (11:14 -0800)]
ipoib: rfe- enable pkey and device name decoupling
The sysfs "create_child" interface creates
pkey based child interface but derives the
name from parent device name and pkey value.
This makes administration difficult where pkey
values can change but policies encoded with
device names do not.
We add ability to create a child interface with
a user specified name and a specified pkey
with a new sysfs "create_named_child" interface
(and also add a corresponding "delete_named_child"
interface).
We also add a new module api interface to query
pkey from a netdevice so any kernel users of
pkey based child interfaces can query it - since
with device name decoupled from pkey, it can no
longer be deduced from parsing the device name
by other kernel users.
Qing Huang [Mon, 26 Jan 2015 06:17:09 +0000 (22:17 -0800)]
ib_sdp: adding sdp socket support to rdma_cm
SDP related code was completely removed from upstream after
these two commits:
fbaa1a6, Sean Hefty, RDMA/cma: Merge cma_get/save_net_info
01602f1, Sean Hefty, RDMA/cma: Remove unused SDP related code
When adding the SDP support code back, to better organize
changes, we created the following separate new files for the
code: cma_priv.h, cma_sdp.c and cma_sdp_priv.h
Dotan Barak [Thu, 7 Jun 2012 05:56:34 +0000 (08:56 +0300)]
RDS: fixed compilation warnings
Fixed the following compilation warnings:
net/rds/send.c: In function 'rds_send_xmit':
net/rds/send.c:299: warning: suggest parentheses around && within ||
net/rds/rdma.c: In function 'rds_cmsg_rdma_dest':
net/rds/rdma.c:697: warning: format '%Lx' expects type 'long long unsigned int', but argument 2 has type 'u32'
net/rds/ib_recv.c: In function 'rds_ib_srqs_init':
net/rds/ib_recv.c:1570: warning: 'return' with no value, in function returning non-void
Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com> Signed-off-by: Dotan Barak <dotanb@dev.mellanox.co.il>
Bang Nguyen [Sun, 19 Feb 2012 20:19:57 +0000 (12:19 -0800)]
RDS Asynchronous Send support
1. Same behavior as RDMA send, i.e., generate notification on IB completion.
2. On error handling, connection is closed for traffic, i.e., new sends are
rejected and client retries
3. To guarantee ordering, all pending async (RDMA/bcopy) sends after the
failed send will also be aborted, and in the order that they were submitted.
4. Re-open connection for traffic after all the failed notifications have
been reaped by the client.
Dotan Barak [Wed, 15 Feb 2012 16:00:50 +0000 (18:00 +0200)]
rds: fix compilation warnings
net/rds/ib_recv.c: In function 'rds_ib_srq_event':
net/rds/ib_recv.c:1490: warning: too many arguments for format
net/rds/ib_recv.c:1484: warning: unused variable 'srq_attr'
net/rds/ib_recv.c: In function 'rds_ib_srq_init':
net/rds/ib_recv.c:1524: warning: passing argument 1 of 'ERR_PTR' makes
integer from pointer without a cast
include/linux/err.h:20: note: expected 'long int' but argument is of
type 'struct ib_srq *'
net/rds/ib_recv.c:1524: warning: format '%d' expects type 'int', but
argument 2 has type 'void *'
Bang Nguyen [Fri, 3 Feb 2012 16:10:06 +0000 (11:10 -0500)]
RDS Quality Of Service
RDS QoS is an extension of IB QoS to provide clients the ability to
segregate traffic flows and define policy to regulate them.
Internally, each traffic flow is represented by a connection with all of its
independent resources like that of a normal connection, and is
differentiated by service type. In other words, there can be multiple
connections between an IP pair and each supports a unique service type.
Service type (TOS) is user-defined and can be configured to satisfy certain
traffic requirements. For example, one service type may be configured for
high-priority low-latency traffic, another for low-priority high-bandwidth
traffic, and so on.
TOS is socket based. Client can set TOS on a socket via an IOCTL and must
do so before initiating any traffic. Once set, the TOS can not be changed.
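The set-once rule for TOS can be sketched as a small state check. The structure, the function name and the -EINVAL error code are assumptions for illustration; this is not the RDS socket code or its actual IOCTL.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-socket state; not the RDS socket structure. */
struct tos_sock {
    uint8_t tos;
    int tos_set;          /* TOS has already been configured */
    int traffic_started;  /* at least one message was sent or received */
};

/* Set the service type once, before any traffic; otherwise fail. */
static int tos_sock_set(struct tos_sock *s, uint8_t tos)
{
    assert(s != NULL);
    if (s->tos_set || s->traffic_started)
        return -EINVAL;   /* assumed error code */
    s->tos = tos;
    s->tos_set = 1;
    return 0;
}
```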
Chris Mason [Fri, 3 Feb 2012 16:09:49 +0000 (11:09 -0500)]
RDS: make sure rds_send_xmit doesn't loop forever
rds_send_xmit can get stuck doing work on behalf of other senders. This
breaks out if we've been working too long. The work queue will get kicked
to finish off any other requests if our current process gives up.
Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
Chris Mason [Fri, 3 Feb 2012 16:09:36 +0000 (11:09 -0500)]
RDS: don't test ring_empty or ring_low without locks held
The math in the ring functions can't be trusted unless you're either the only
person adding to the ring or the only person freeing from it. If there are no
locks held at all you can end up hitting bogus assertions around the ring counters.
This changes the rds_ib_recv_refill code and the recv tasklet code to make sure
proper locks are held before we use rds_ib_ring_empty or rds_ib_ring_low.
Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
Venkat Venkatsubra [Fri, 3 Feb 2012 16:09:07 +0000 (11:09 -0500)]
RDS: avoid double destory of cm_id when rdms_resolve_route fails
It crashes in rds_ib_conn_shutdown because it was using a freed cm_id. The
cm_id had got freed quite a while back actually (more than 15 secs back) during
an earlier connect attempt.
This was the sequence of the earlier connect attempt: rds_ib_conn_connect calls
rdma_resolve_addr. The synchronous part of rdma_resolve_addr succeeds. But the
asynchronous part fails at some point. RDMA Connection Manager returns the
event RDMA_CM_EVENT_ADDR_RESOLVED. This part succeeds. Next, RDS calls
rdma_resolve_route from the rds_rdma_cm_event_handler. This fails. We return
this error back to the RDMA CM addr_handler which destroys the cm_id as
follows: addr_handler (cma.c):
Later when a new connect req comes in from the remote side, we shut down this cm_id
and try to reconnect:
/*
* after 15 seconds, give up on existing connection
* attempts and make them try again. At this point
* it's no longer a race but something has gone
* horribly wrong
*/
if (now > conn->c_connection_start &&
now - conn->c_connection_start > 15) {
printk(KERN_CRIT "rds connection racing for 15s, forcing reset "
"connection %u.%u.%u.%u->%u.%u.%u.%u\n",
NIPQUAD(conn->c_laddr), NIPQUAD(conn->c_faddr));
rds_conn_drop(conn);
....
We crash during the shutdown.
Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
Chris Mason [Fri, 3 Feb 2012 16:09:07 +0000 (11:09 -0500)]
RDS: make sure rds_send_drop_to properly takes the m_rs_lock
rds_send_drop_to is used during socket tear down to find all the
messages on the socket and clean them up. It can race with the
acking code unless it takes the m_rs_lock on each and every message.
This plugs a hole where we didn't take m_rs_lock on any message that
didn't have the RDS_MSG_ON_CONN set. Taking m_rs_lock avoids
double frees and other memory corruptions as the ack code trusts
the message m_rs pointer on a socket that had actually been freed.
Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
Chris Mason [Fri, 3 Feb 2012 16:09:07 +0000 (11:09 -0500)]
RDS: kick krdsd to send congestion map updates
We can get into a deadlock on the recv spinlock because
congestion map updates can be sent in the recv path. This
pushes the work off to krdsd instead.
Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
Chris Mason [Fri, 3 Feb 2012 16:09:07 +0000 (11:09 -0500)]
RDS: add debuging code around sock_hold and sock_put.
RDS had a recent series of memory corruptions because of
a use-after-free and double-free of rds sockets. This adds
some debugging code around sock_put and sock_hold to
catch any similar bugs and spit out useful debugging info.
This is a temporary commit while customers try out our fix.
Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
Chris Mason [Fri, 3 Feb 2012 16:07:54 +0000 (11:07 -0500)]
RDS: don't trust the LL_SEND_FULL bit
We are seeing connections stuck with the LL_SEND_FULL bit getting
set and never cleared. This changes RDS to stop trusting the
LL_SEND_FULL bit and kick krdsd any time we
see -ENOMEM from the ring allocation code.
Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
Chris Mason [Fri, 3 Feb 2012 16:07:54 +0000 (11:07 -0500)]
RDS: give up on half formed connections after 15s
RDS relies on events to transition connections through a few
different states, but sometimes we get stuck and end up with
a half formed connection that is never able to finish.
The other end has either wandered off or there are bugs in
other layers, and we end up with any future attempts from
the other end rejected because we're already working on a
connection attempt.
This patch changes things to give up on half formed connections
after 15 seconds.
Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
Chris Mason [Fri, 3 Feb 2012 16:07:41 +0000 (11:07 -0500)]
RDS: make sure not to loop forever inside rds_send_xmit
If a determined set of concurrent senders keep the send queue full,
we can loop forever inside rds_send_xmit. This fix has two parts.
First we are dropping out of the while(1) loop after we've processed a
large batch of messages.
Second we add a generation number that gets bumped each time the
xmit bit lock is acquired. If someone else has jumped in and
made progress in the queue, we skip our goto restart.
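The two parts of the fix can be sketched in single-threaded userspace C. All names are hypothetical, the batch limit of 1024 is an assumed value, and handing leftover work to a worker (XMIT_DEFER) is an assumption about what happens after a full batch; the real function also involves locking that is omitted here.

```c
#include <assert.h>
#include <stddef.h>

#define BATCH_LIMIT 1024u   /* assumed value, for illustration only */

struct xmit_conn {
    unsigned long send_gen;  /* bumped each time the xmit lock is taken */
    unsigned int queued;     /* messages waiting to be sent */
    unsigned int sent;       /* messages sent so far */
};

enum xmit_next { XMIT_DONE, XMIT_RESTART, XMIT_DEFER };

/* Part one: send at most BATCH_LIMIT messages, then stop. */
static unsigned int xmit_batch(struct xmit_conn *c, unsigned long *my_gen)
{
    unsigned int batch = 0;

    assert(c != NULL && my_gen != NULL);
    *my_gen = ++c->send_gen;          /* stands in for taking the xmit lock */
    while (c->queued > 0 && batch < BATCH_LIMIT) {
        c->queued--;
        c->sent++;
        batch++;
    }
    return batch;
}

/* Part two: restart only if work remains and nobody else took the lock since. */
static enum xmit_next xmit_decide(unsigned int queued, unsigned int batch,
                                  unsigned long my_gen, unsigned long cur_gen)
{
    if (queued == 0)
        return XMIT_DONE;             /* nothing left to send */
    if (cur_gen != my_gen)
        return XMIT_DONE;             /* another sender is making progress */
    if (batch >= BATCH_LIMIT)
        return XMIT_DEFER;            /* worked long enough; leave it to a worker */
    return XMIT_RESTART;
}
```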
Signed-off-by: Chris Mason <chris.mason@oracle.c.om> Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
Andy Grover [Thu, 13 Jan 2011 19:40:31 +0000 (11:40 -0800)]
rds: check for excessive looping in rds_send_xmit
Signed-off-by: Andy Grover <andy.grover@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
Andy Grover [Fri, 24 Sep 2010 17:16:37 +0000 (10:16 -0700)]
change ib default retry to 1
Signed-off-by: Andy Grover <andy.grover@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
Andy Grover [Fri, 3 Feb 2012 16:07:40 +0000 (11:07 -0500)]
This patch adds the modparam to rds.ko.
Signed-off-by: Andy Grover <andy.grover@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
Zach Brown [Fri, 3 Feb 2012 16:07:40 +0000 (11:07 -0500)]
RDS: only use passive connections when addresses match
Passive connections were added for the case where one loopback IB
connection between identical addresses needs another connection to store
the second QP. Unfortunately, they were also created in the case where
the addresses differ and we already have both QPs.
This led to a message reordering bug.
- two different IB interfaces and addresses on a machine: A B
- traffic is sent from A to B
- connection from A-B is created, connect request sent
- listening accepts connect request, B-A is created
- traffic flows, next_rx is incremented
- unacked messages exist on the retrans list
- connection A-B is shut down, new connect request sent
- listen sees existing loopback B-A, creates new passive B-A
- retrans messages are sent and delivered because of 0 next_rx
The problem is that the second connection request saw the previously
existing parent connection. Instead of using it, and using the existing
next_rx_seq state for the traffic between those IPs, it mistakenly
thought that it had to create a passive connection.
We fix this by only using passive connections in the special case where
laddr and faddr match. In this case we'll only ever have one parent
sending connection requests and one passive connection created as the
listening path sees the existing parent connection which initiated the
request.
Signed-off-by: Zach Brown <zach.brown@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
Zach Brown [Fri, 3 Feb 2012 16:07:40 +0000 (11:07 -0500)]
RDS/IB: always free recv frag as we free its ring entry
We were still seeing rare occurrences of the WARN_ON() that indicates
that the recv refill path was finding allocated frags in ring entries
that were marked free. These were usually followed by oom crashes.
They only seem to be occurring in the presence of interesting completion
errors and connection resets.
There are error paths in rds_ib_recv_cqe_handler() that could leave a
recv frag sitting in the ring. This patch ensures that we free the frag
as we mark the ring entry free. This should stop the refill path from
finding allocated frags in ring entries that were marked free.
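The invariant this patch restores, that a free ring entry never still holds a frag, can be sketched in userspace C. The structure and function names are hypothetical and use malloc/free in place of the driver's allocators.

```c
#include <assert.h>
#include <stdlib.h>

struct ring_entry {
    void *frag;   /* receive fragment owned by this entry, or NULL */
    int in_use;   /* nonzero while the entry is posted */
};

/* Mark an entry free and release its fragment in the same step. */
static void ring_entry_release(struct ring_entry *e)
{
    assert(e != NULL);
    free(e->frag);          /* free(NULL) is a no-op */
    e->frag = NULL;
    e->in_use = 0;
}

/* Refill path: a free entry must never still hold a fragment. */
static int ring_entry_refill(struct ring_entry *e, size_t frag_size)
{
    assert(e != NULL && !e->in_use);
    if (e->frag != NULL)
        return -1;          /* the leaked-frag condition the warning reported */
    e->frag = malloc(frag_size);
    if (e->frag == NULL)
        return -1;          /* allocation failure */
    e->in_use = 1;
    return 0;
}
```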
Signed-off-by: Zach Brown <zach.brown@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
Andy Grover [Tue, 7 Sep 2010 17:59:44 +0000 (10:59 -0700)]
RDS/IB: Quiet warnings when leaking frags
We have a race where sometimes we leak frags, and it hits
the WARN_ON. Unfortunately, the stream of WARN_ONs makes
the machine unusable. This patch changes to WARN_ON_ONCE
so we do not hose the box, and we can still get notification
that the bug has occurred.
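The effect of warning only once can be shown with a userspace stand-in. This is not the kernel's WARN_ON_ONCE macro; `warn_once` and its caller-supplied flag are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

static int warn_count;   /* number of warnings actually printed */

/*
 * Print a warning the first time cond is true for a given flag, then stay
 * quiet. Returns cond so the caller can still branch on it every time.
 */
static int warn_once(int cond, int *warned, const char *what)
{
    assert(warned != NULL && what != NULL);
    if (cond && !*warned) {
        *warned = 1;
        warn_count++;
        fprintf(stderr, "WARNING: %s\n", what);
    }
    return cond;
}
```

The condition is still reported to the caller on every hit, so error handling is unchanged; only the log output is limited to the first occurrence.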
Signed-off-by: Andy Grover <andy.grover@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
Zach Brown [Fri, 23 Jul 2010 17:37:33 +0000 (10:37 -0700)]
RDS: cancel connection work structs as we shut down
Nothing was canceling the send and receive work that might have been
queued as a conn was being destroyed.
Signed-off-by: Zach Brown <zach.brown@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
Zach Brown [Fri, 23 Jul 2010 17:36:58 +0000 (10:36 -0700)]
RDS: don't call rds_conn_shutdown() from rds_conn_destroy()
rds_conn_shutdown() can return before the connection is shut down when
it encounters an existing state that it doesn't understand. This lets
rds_conn_destroy() then start tearing down the conn from under paths
that are still using it.
It's more reliable to queue the shutdown work and wait for krdsd to complete the
shutdown callback. This stopped some hangs I was seeing where krdsd was
trying to shut down a freed conn.
Signed-off-by: Zach Brown <zach.brown@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
Zach Brown [Fri, 23 Jul 2010 17:32:31 +0000 (10:32 -0700)]
RDS: have sockets get transport module references
Right now there's nothing to stop the various paths that use
rs->rs_transport from racing with rmmod and executing freed transport
code. The simple fix is to have binding to a transport also hold a
reference to the transport's module, removing this class of races.
We already had an unused t_owner field which was set for the modular
transports and which wasn't set for the built-in loop transport.
Signed-off-by: Zach Brown <zach.brown@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
Zach Brown [Wed, 21 Jul 2010 22:13:25 +0000 (15:13 -0700)]
RDS: remove old rs_transport comment
rs_transport is now also used by the rdma paths once the socket is
bound. We don't need this stale comment to tell us what cscope can.
Signed-off-by: Zach Brown <zach.brown@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
Zach Brown [Fri, 23 Jul 2010 17:30:45 +0000 (10:30 -0700)]
RDS: lock rds_conn_count decrement in rds_conn_destroy()
rds_conn_destroy() can race with all other modifications of the
rds_conn_count but it was modifying the count without locking.
Signed-off-by: Zach Brown <zach.brown@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
Andy Grover [Tue, 20 Jul 2010 00:15:57 +0000 (17:15 -0700)]
Use CQ_NEXT_COMP for recv completions
We want to get interrupts for incoming data with no delay.
Splitting the CQs lets us have different policies here and
for send, where we don't want an event for each send completion.
Signed-off-by: Andy Grover <andy.grover@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
Zach Brown [Thu, 15 Jul 2010 19:34:33 +0000 (12:34 -0700)]
RDS/IB: protect the list of IB devices
The RDS IB device list wasn't protected by any locking. Traversal in
both the get_mr and FMR flushing paths could race with addition and
removal.
List manipulation is done with RCU primitives and is protected by the
write side of a rwsem. The list traversal in the get_mr fast path is
protected by a rcu read critical section. The FMR list traversal is
more problematic because it can block while traversing the list. We
protect this with the read side of the rwsem.
Signed-off-by: Zach Brown <zach.brown@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
Zach Brown [Wed, 14 Jul 2010 21:01:21 +0000 (14:01 -0700)]
RDS/IB: print IB event strings as well as their number
It's nice to not have to go digging in the code to see which event
occurred. It's easy to throw together a quick array that maps the ib
event enums to their strings. I didn't see anything in the stack that
does this translation for us, but I also didn't look very hard.
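A minimal sketch of such a lookup array follows. The enum is a made-up subset for illustration; the real IB event enums, their values and their names are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical subset of events; not the real IB event enum. */
enum sketch_event { EV_CQ_ERR, EV_QP_FATAL, EV_PORT_ACTIVE, EV_PORT_ERR };

static const char *const sketch_event_names[] = {
    [EV_CQ_ERR]      = "CQ_ERR",
    [EV_QP_FATAL]    = "QP_FATAL",
    [EV_PORT_ACTIVE] = "PORT_ACTIVE",
    [EV_PORT_ERR]    = "PORT_ERR",
};

/* Map an event number to its name, falling back for out-of-range values. */
static const char *sketch_event_str(unsigned int ev)
{
    size_t n = sizeof(sketch_event_names) / sizeof(sketch_event_names[0]);

    assert(n > 0);
    if (ev < n && sketch_event_names[ev] != NULL)
        return sketch_event_names[ev];
    return "UNKNOWN";
}
```

The bounds check matters because the event number comes from outside the table's owner; an unknown value prints as "UNKNOWN" instead of reading past the array.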
Signed-off-by: Zach Brown <zach.brown@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
Zach Brown [Wed, 14 Jul 2010 02:23:32 +0000 (19:23 -0700)]
RDS/IB: track signaled sends
We're seeing bugs today where IB connection shutdown clears the send
ring while the tasklet is processing completed sends. Implementation
details cause this to dereference a null pointer. Shutdown needs to
wait for send completion to stop before tearing down the connection. We
can't simply wait for the ring to empty because it may contain
unsignaled sends that will never be processed.
This patch tracks the number of signaled sends that we've posted and
waits for them to complete. It also makes sure that the tasklet has
finished executing.
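The counting scheme can be sketched as below. Names are hypothetical, and the sketch is single-threaded: real code would need an atomic counter and a way to sleep until it reaches zero, neither of which is shown.

```c
#include <assert.h>
#include <stddef.h>

struct send_tracker {
    unsigned int signaled_inflight;  /* signaled sends posted but not yet completed */
};

/* Called when posting a send; only signaled sends produce a completion. */
static void track_post(struct send_tracker *t, int signaled)
{
    assert(t != NULL);
    if (signaled)
        t->signaled_inflight++;
}

/* Called from the completion handler for each signaled completion. */
static void track_complete(struct send_tracker *t)
{
    assert(t != NULL && t->signaled_inflight > 0);
    t->signaled_inflight--;
}

/* Shutdown may clear the send ring only when this returns nonzero. */
static int track_quiesced(const struct send_tracker *t)
{
    assert(t != NULL);
    return t->signaled_inflight == 0;
}
```

Unsignaled sends never touch the counter, which is why waiting on the counter terminates even when the ring still holds entries that will never complete.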
Signed-off-by: Zach Brown <zach.brown@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
Zach Brown [Sat, 10 Jul 2010 02:26:20 +0000 (19:26 -0700)]
RDS: remove __init and __exit annotation
The trivial amount of memory saved isn't worth the cost of dealing with section
mismatches.
Signed-off-by: Zach Brown <zach.brown@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
Dotan Barak [Sun, 11 Dec 2011 13:17:24 +0000 (15:17 +0200)]
rds: fix compilation warnings
Fix the following compilation warnings:
ofed_kernel/net/rds/iw_cm.c: In function rds_iw_qp_event_handler:
ofed_kernel/net/rds/iw_cm.c:162: warning: too many arguments for format
ofed_kernel/net/rds/af_rds.c:384: warning: initialization from incompatible pointer type