www.infradead.org Git - users/mchehab/rasdaemon.git/commit
ras-page-isolation: do_page_offline always considers page offline was successful
author lvying <lvying6@huawei.com>
Sat, 31 Oct 2020 09:57:14 +0000 (17:57 +0800)
committer Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Wed, 23 Dec 2020 09:44:03 +0000 (10:44 +0100)
commit e4d27840e173491ab29c2d97017da9344e2c2526
tree 8d5698fe141d09bfd0d0469c31c13bffd42862fa
parent 0c3950ad2d8a5a393d06721c06d22be1531ecb79

do_page_offline always reports the page offline as successful, even
when the kernel fails to soft or hard offline the page.

Running rasdaemon with this setting in /etc/sysconfig/rasdaemon:

PAGE_CE_THRESHOLD="1"

i.e. when a single Corrected Error (CE) occurs at a page's address,
rasdaemon should trigger a soft offline of that page.

However, after adding a livepatch to the kernel's
store_soft_offline_page to observe its return value and injecting
a CE at address 0x3f7ec30000, the kernel log reports:

soft_offline: 0x3f7ec30: unknown non LRU page type ffffe0000000000 ()
[store_soft_offline_page]return from soft_offline_page: -5

while the rasdaemon log reports:

rasdaemon[73711]: cpu 00:rasdaemon: Corrected Errors at 0x3f7ec30000 exceed threshold
rasdaemon[73711]: rasdaemon: Result of offlining page at 0x3f7ec30000: offlined

Using strace to record rasdaemon's system calls shows:

strace -p 73711
openat(AT_FDCWD, "/sys/devices/system/memory/soft_offline_page",
       O_WRONLY|O_CREAT|O_TRUNC, 0666) = 28
fstat(28, {st_mode=S_IFREG|0200, st_size=4096, ...}) = 0
write(28, "0x3f7ec30000", 12)           = -1 EIO (Input/output error)
close(28)                               = 0

So the kernel actually failed to soft offline pfn 0x3f7ec30, and
store_soft_offline_page returned -EIO. However, rasdaemon always
considers the page offline to be successful.

As the strace output shows, the write syscall failed, but ferror
was unable to detect that failure.

This patch replaces the fopen-fprintf-ferror-fclose sequence with
the lower-level open-write-close sequence, which can detect such a
syscall failure.
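A minimal sketch of the open-write-close approach follows. It is an illustration under stated assumptions, not the code in ras-page-isolation.c: the function name write_sysfs_value and its return convention are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Write the string buf to path using unbuffered syscalls
 * (hypothetical helper, not part of rasdaemon).
 * Returns 0 on success, or -1 with errno describing the failure.
 */
int write_sysfs_value(const char *path, const char *buf)
{
    size_t len;
    ssize_t n;
    int fd, saved_errno;

    assert(path != NULL && buf != NULL);
    len = strlen(buf);

    fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;

    /* write() returns -1 immediately if the kernel rejects the request,
     * e.g. EIO from a sysfs store handler. */
    n = write(fd, buf, len);
    saved_errno = errno;   /* close() may overwrite errno */
    close(fd);

    if (n < 0) {
        errno = saved_errno;
        return -1;
    }
    if ((size_t)n != len) { /* treat a short write as a failure too */
        errno = EIO;
        return -1;
    }
    return 0;
}
```

A caller could pass "/sys/devices/system/memory/soft_offline_page" and the address string, and report the page as offlined only when the function returns 0.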

Signed-off-by: lvying <lvying6@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
ras-page-isolation.c