net/rds: Avoid stalled connection due to CM REQ retries
RDS drops a connection and destroys its cm_id once a CM REJ is sent. In a
congested fabric, there is a race where a remote node receives a CM REJ
after CM has retried another CM REQ. In this scenario, the cm_id that sends
the CM REQ is no longer exists even though the remote end might respond
with a CM REP, and wait for an incoming CM RTU. This RDS connection
establishment is stuck until the connection is destroyed after the CM
timeout. As a result, this leads to a very long brownout time. Thus, this
patch adds a mechanism to detect a rejected CM REQ and rejects all the
subsequent CM REQ that are retried by the CM.
Orabug:
28068627
Signed-off-by: Wei Lin Guay <wei.lin.guay@oracle.com>
Tested-by: Dib Chatterjee <dib.chatterjee@oracle.com>
(cherry picked from commit
c5c4f1472bc788ddc69af713f975ad92bdefe206
repo https://linux-git.us.oracle.com/UEK/linux-wguay-public)
Conflict:
net/rds/ib_cm.c
Made it checkpatch clean.
v1->v2:
Added Shannon's recommendations
Signed-off-by: HÃ¥kon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>