mm/vmscan: fix hard LOCKUP in function isolate_lru_folios
This fixes the following hard lockup in isolate_lru_folios() during memory
reclaim. If the LRU mostly contains ineligible folios, isolate_lru_folios()
can spend so long skipping them while holding lruvec->lru_lock that the
watchdog triggers.
watchdog: Watchdog detected hard LOCKUP on cpu 173
RIP: 0010:native_queued_spin_lock_slowpath+0x255/0x2a0
Call Trace:
_raw_spin_lock_irqsave+0x31/0x40
folio_lruvec_lock_irqsave+0x5f/0x90
folio_batch_move_lru+0x91/0x150
lru_add_drain_per_cpu+0x1c/0x40
process_one_work+0x17d/0x350
worker_thread+0x27b/0x3a0
kthread+0xe8/0x120
ret_from_fork+0x34/0x50
ret_from_fork_asm+0x1b/0x30
lruvec->lru_lock owner:
PID: 2865  TASK: ffff888139214d40  CPU: 40  COMMAND: "kswapd0"
#0 [fffffe0000945e60] crash_nmi_callback at ffffffffa567a555
#1 [fffffe0000945e68] nmi_handle at ffffffffa563b171
#2 [fffffe0000945eb0] default_do_nmi at ffffffffa6575920
#3 [fffffe0000945ed0] exc_nmi at ffffffffa6575af4
#4 [fffffe0000945ef0] end_repeat_nmi at ffffffffa6601dde
    [exception RIP: isolate_lru_folios+403]
    RIP: ffffffffa597df53  RSP: ffffc90006fb7c28  RFLAGS: 00000002
    RAX: 0000000000000001  RBX: ffffc90006fb7c60  RCX: ffffea04a2196f88
    RDX: ffffc90006fb7c60  RSI: ffffc90006fb7c60  RDI: ffffea04a2197048
    RBP: ffff88812cbd3010   R8: ffffea04a2197008   R9: 0000000000000001
    R10: 0000000000000000  R11: 0000000000000001  R12: ffffea04a2197008
    R13: ffffea04a2197048  R14: ffffc90006fb7de8  R15: 0000000003e3e937
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
<NMI exception stack>
#5 [ffffc90006fb7c28] isolate_lru_folios at ffffffffa597df53
#6 [ffffc90006fb7cf8] shrink_active_list at ffffffffa597f788
#7 [ffffc90006fb7da8] balance_pgdat at ffffffffa5986db0
#8 [ffffc90006fb7ec0] kswapd at ffffffffa5987354
#9 [ffffc90006fb7ef8] kthread at ffffffffa5748238
crash>
Scenario:
User processes request a large amount of memory and keep the pages active.
Then a module continuously requests memory from the ZONE_DMA32 area.
Memory reclaim is triggered once the ZONE_DMA32 watermark is reached.
However, the pages in the LRU (active_anon) list are mostly from
the ZONE_NORMAL area.
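The effect of this scenario on the isolation loop can be illustrated with a
small userspace sketch. It is not the kernel code and it does not reproduce
the change made by this patch: struct isolate_result, simulate_isolate() and
the max_skipped cap are hypothetical names introduced only for illustration.
The sketch assumes that skipped folios do not count towards the scan budget,
and that zone index 1 stands for ZONE_DMA32 and 2 for ZONE_NORMAL.

```c
#include <stddef.h>

/* Counters for one simulated isolation pass (hypothetical names). */
struct isolate_result {
    unsigned long nr_scanned; /* folios examined while the lock is held */
    unsigned long nr_skipped; /* folios skipped because of their zone */
    unsigned long nr_taken;   /* folios isolated for reclaim */
};

/*
 * Simulate one isolation pass over an LRU list.
 *
 * zone_of[i] is the zone index of the i-th folio on the list. A folio
 * whose zone index is above reclaim_idx is ineligible and is skipped
 * without consuming the nr_to_scan budget, so the pass only ends when
 * enough eligible folios were found or the list is exhausted.
 *
 * max_skipped is a hypothetical cap on skipped folios (0 means no cap);
 * when the cap is reached the pass stops early.
 */
struct isolate_result simulate_isolate(const int *zone_of, size_t nr_folios,
                                       int reclaim_idx,
                                       unsigned long nr_to_scan,
                                       unsigned long max_skipped)
{
    struct isolate_result res = { 0, 0, 0 };
    unsigned long scan = 0;
    size_t i = 0;

    while (scan < nr_to_scan && i < nr_folios) {
        int zone = zone_of[i++];

        res.nr_scanned++;
        if (zone > reclaim_idx) {
            res.nr_skipped++;
            if (max_skipped && res.nr_skipped >= max_skipped)
                break;
            continue;
        }
        scan++;
        res.nr_taken++;
    }
    return res;
}
```

With a list made up only of ZONE_NORMAL folios, reclaim_idx = 1 and no cap,
the pass walks the whole list and takes nothing, however small nr_to_scan is.
With a non-zero max_skipped, the same pass ends after max_skipped folios.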
Reproduce:
Terminal 1: make the Active(anon) pages grow continuously.
mkdir /tmp/memory
mount -t tmpfs -o size=1024000M tmpfs /tmp/memory
dd if=/dev/zero of=/tmp/memory/block bs=4M
tail /tmp/memory/block
Terminal 2:
vmstat -a 1
The active column keeps increasing.
procs ---memory--- ---swap-- ---io---- -system-- ---cpu--- ...
 r  b swpd       free    inact    active si so bi bo
 1  0    0 1445623076 45898836  83646008  0  0  0
 1  0    0 1445623076 43450228  86094616  0  0  0
 1  0    0 1445623076 41003480  88541364  0  0  0
 1  0    0 1445623076 38557088  90987756  0  0  0
 1  0    0 1445623076 36109688  93435156  0  0  0
 1  0    0 1445619552 33663256  95881632  0  0  0
 1  0    0 1445619804 31217140  98327792  0  0  0
 1  0    0 1445619804 28769988 100774944  0  0  0
 1  0    0 1445619804 26322348 103222584  0  0  0
 1  0    0 1445619804 23875592 105669340  0  0  0
cat /proc/meminfo | head
Active(anon) increases.
MemTotal:       1579941036 kB
MemFree:        1445618500 kB
MemAvailable:   1453013224 kB
Buffers:              6516 kB
Cached:          128653956 kB
SwapCached:              0 kB
Active:          118110812 kB
Inactive:         11436620 kB
Active(anon):    115345744 kB
Inactive(anon):     945292 kB
When Active(anon) is 115345744 kB, loading the module with insmod
triggers the ZONE_DMA32 watermark.
perf record -e vmscan:mm_vmscan_lru_isolate -aR
perf script
isolate_mode=0 classzone=1 order=1 nr_requested=32 nr_scanned=2
nr_skipped=2 nr_taken=0 lru=active_anon
isolate_mode=0 classzone=1 order=1 nr_requested=32 nr_scanned=0
nr_skipped=0 nr_taken=0 lru=active_anon
isolate_mode=0 classzone=1 order=0 nr_requested=32 nr_scanned=28835844
nr_skipped=28835844 nr_taken=0 lru=active_anon
isolate_mode=0 classzone=1 order=1 nr_requested=32 nr_scanned=28835844
nr_skipped=28835844 nr_taken=0 lru=active_anon
isolate_mode=0 classzone=1 order=0 nr_requested=32 nr_scanned=29
nr_skipped=29 nr_taken=0 lru=active_anon
isolate_mode=0 classzone=1 order=0 nr_requested=32 nr_scanned=0
nr_skipped=0 nr_taken=0 lru=active_anon
Note nr_scanned=28835844.
28835844 * 4 KB = 115343376 KB, which is approximately equal to the
Active(anon) value of 115345744 kB.
If Active(anon) is increased to 1000G before insmod triggers the
ZONE_DMA32 watermark, a hard lockup occurs.
On my device, nr_scanned was 0x0000000003e3e937 when the hard lockup occurred.
Converted to memory size: 0x0000000003e3e937 * 4 KB = 261072092 KB.
[ffffc90006fb7c28] isolate_lru_folios at ffffffffa597df53
ffffc90006fb7c30: 0000000000000020 0000000000000000
ffffc90006fb7c40: ffffc90006fb7d40 ffff88812cbd3000
ffffc90006fb7c50: ffffc90006fb7d30 0000000106fb7de8
ffffc90006fb7c60: ffffea04a2197008 ffffea0006ed4a48
ffffc90006fb7c70: 0000000000000000 0000000000000000
ffffc90006fb7c80: 0000000000000000 0000000000000000
ffffc90006fb7c90: 0000000000000000 0000000000000000
ffffc90006fb7ca0: 0000000000000000 0000000003e3e937
ffffc90006fb7cb0: 0000000000000000 0000000000000000
ffffc90006fb7cc0: 8d7c0b56b7874b00 ffff88812cbd3000
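The page-count conversions used in this message can be checked with a
minimal sketch. pages_to_kb() is a hypothetical helper, not a kernel
function, and it assumes 4 KB base pages with every scanned entry being a
single base page.

```c
#include <stdint.h>

/*
 * Convert a number of base pages to KB.
 * Assumption: 4 KB base pages, as used in the arithmetic of this message.
 */
uint64_t pages_to_kb(uint64_t nr_pages)
{
    const uint64_t page_size_kb = 4;

    return nr_pages * page_size_kb;
}
```

pages_to_kb(28835844) gives 115343376 and pages_to_kb(0x3e3e937) gives
261072092, matching the two figures quoted above.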
About the Fixes:
Why did it take eight years to be discovered?
The problem requires the following conditions to occur:
1. The device memory should be large enough.
2. Pages in the LRU(active_anon) list are mostly from the ZONE_NORMAL area.
3. The memory in ZONE_DMA32 needs to reach the watermark.
If the memory is not large enough, or if ZONE_DMA32 memory is used
reasonably so that its watermark is rarely reached, this problem is
difficult to detect.
Notes:
The problem is most likely to occur with ZONE_DMA32 and ZONE_NORMAL,
but other zone combinations may also trigger it.
Link: https://lkml.kernel.org/r/20241119060842.274072-1-liuye@kylinos.cn
Fixes: b2e18757f2c9 ("mm, vmscan: begin reclaiming pages on a per-node basis")
Signed-off-by: liuye <liuye@kylinos.cn>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Yang Shi <yang@os.amperecomputing.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>