Temporary memory allocation failures can leave an RDS connection stuck
in endless RNR (Receiver Not Ready) retries. Around the time the
connection becomes stuck, the kernel reports recv buffer allocation
failures such as:

rcuos/10: page allocation failure: order:2, mode:0x2
Call Trace:
 <IRQ>  [<ffffffff81698cc0>] dump_stack+0x63/0x83
 [<ffffffff8118e59a>] warn_alloc_failed+0xea/0x140
 [<ffffffff810b93fa>] ? select_idle_sibling+0x2a/0x120
 [<ffffffff81191e09>] __alloc_pages_slowpath+0x409/0x760
 [<ffffffff81192411>] __alloc_pages_nodemask+0x2b1/0x2d0
 [<ffffffff810bae62>] ? check_preempt_wakeup+0x112/0x230
 [<ffffffff811dc3af>] alloc_pages_current+0xaf/0x170
 [<ffffffffa12e2090>] rds_page_remainder_alloc+0x60/0x2a4
 [<ffffffffa0b1c0ac>] rds_ib_refill_one_frag+0x13c/0x200 [rds_rdma]
 [<ffffffffa12994cd>] rds_ib_recv_refill_one+0x8d/0x220
 [<ffffffffa0b1dfbf>] rds_ib_recv_refill+0x11f/0x340 [rds_rdma]
 [<ffffffffa129989e>] rds_ib_recv_cqe_handler+0x23e/0x290
 [<ffffffffa0b19326>] poll_cq+0x66/0xe0 [rds_rdma]
 [<ffffffffa0b1945d>] rds_ib_rx+0xbd/0x210 [rds_rdma]
 [<ffffffffa0b1964a>] rds_ib_tasklet_fn_recv+0x3a/0x50 [rds_rdma]
 [<ffffffff81088361>] tasklet_action+0xb1/0xc0
 [<ffffffff8108871a>] __do_softirq+0x10a/0x350
 [<ffffffff8169f53c>] do_softirq_own_stack+0x1c/0x30
 <EOI>  [<ffffffff81088445>] do_softirq+0x55/0x60
 [<ffffffff81088528>] __local_bh_enable_ip+0x88/0x90
 [<ffffffff810e86d1>] rcu_nocb_kthread+0xf1/0x180
 [<ffffffff810e85e0>] ? print_cpu_stall+0x170/0x170
 [<ffffffff810e85e0>] ? print_cpu_stall+0x170/0x170
 [<ffffffff810a465e>] kthread+0xce/0xf0
 [<ffffffff810a4590>] ? kthread_freezable_should_stop+0x70/0x70
 [<ffffffff8169dda2>] ret_from_fork+0x42/0x70
 [<ffffffff810a4590>] ? kthread_freezable_should_stop+0x70/0x70

The recv buffer refiller is rescheduled only when these conditions hold:

    if (rds_conn_up(conn) &&
        (must_wake || (can_wait && ring_low)
        || rds_ib_ring_empty(&ic->i_recv_ring))) {
            queue_delayed_work(conn->c_wq, &conn->c_recv_w, 1);
    }

This check does not take memory allocation failures into account. The
memory pressure clears shortly afterwards, but RDS never refills the
receive buffers for that connection again. The receiver is only woken
up on the last packet of a multi-packet message, and that packet never
arrives: the recv queue runs empty and the connection ends up in
endless RNR retries.
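
The gap in the rescheduling check can be illustrated with a small
userspace model. This is only a sketch under stated assumptions, not the
code of this patch: struct refill_state, its alloc_failed field and both
helper functions are hypothetical names introduced for illustration, and
the booleans merely stand in for the kernel expressions quoted above.

```c
#include <stdbool.h>

/* Hypothetical, simplified model of what the refill path inspects. */
struct refill_state {
	bool conn_up;      /* stands in for rds_conn_up(conn) */
	bool must_wake;    /* set on the last packet of a message */
	bool can_wait;     /* caller is allowed to sleep */
	bool ring_low;     /* recv ring is running low */
	bool ring_empty;   /* stands in for rds_ib_ring_empty() */
	bool alloc_failed; /* assumption: a refill allocation just failed */
};

/* The rescheduling condition as quoted in this message. */
static bool should_requeue_current(const struct refill_state *s)
{
	return s->conn_up &&
	       (s->must_wake || (s->can_wait && s->ring_low) || s->ring_empty);
}

/* Hypothetical variant that also retries after an allocation failure. */
static bool should_requeue_with_alloc_check(const struct refill_state *s)
{
	return s->conn_up &&
	       (s->alloc_failed || should_requeue_current(s));
}
```

In one scenario consistent with the description (refill from softirq
context, so can_wait is false; ring low but not yet empty; allocation
failed), the first helper returns false and nothing requeues the
refiller, while the variant returns true.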

Orabug: 28127993
Consultation with: Haakon Bugge
Reviewed-by: Yanjun Zhu <yanjun.zhu@oracle.com>
Reviewed-by: Haakon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>