rasdaemon: Add Ice Lake and Sapphire Rapids MSCOD values
Based on mcelog commits:
ee90ff20ce6a ("mcelog: Add support for Icelake server, Icelake-D, and Snow Ridge")
391abaac9bdf ("mcelog: Add decode for MCi_MISC from 10nm memory controller")
59cb7ad4bc72 ("mcelog: i10nm: Fix mapping from bank number to functional unit")
c0acd0e6a639 ("mcelog: Add support for Sapphirerapids server.")
Shiju Jose [Tue, 9 Mar 2021 16:18:56 +0000 (16:18 +0000)]
rasdaemon: fix build error in register_ns_ev_decoder if the sqlite3 is not enabled
The assignment ns_ev_decoder->stmt_dec_record = NULL; in register_ns_ev_decoder()
should be under #ifdef HAVE_SQLITE3 to fix the compilation error
when building without the configure option --enable-sqlite3.
Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
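The guard described above can be sketched as follows. This is a minimal illustration, not the rasdaemon source: the struct and function names are hypothetical, and it assumes the member only exists in SQLite-enabled builds, which is what makes an unguarded assignment fail to compile.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for a decoder descriptor. */
struct ns_decoder_sketch {
	const char *name;
#ifdef HAVE_SQLITE3
	void *stmt_dec_record;	/* assumed to exist only with SQLite support */
#endif
};

/* Initialise a descriptor before it is registered. */
void init_decoder_sketch(struct ns_decoder_sketch *d, const char *name)
{
	d->name = name;
#ifdef HAVE_SQLITE3
	/* Guarded: the member is compiled out when HAVE_SQLITE3 is unset,
	 * so an unguarded assignment would be a build error. */
	d->stmt_dec_record = NULL;
#endif
}
```

The same source then builds both with and without -DHAVE_SQLITE3.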
Shiju Jose [Mon, 8 Mar 2021 16:57:26 +0000 (16:57 +0000)]
rasdaemon: add support for memory_failure events
Add support for logging the memory_failure kernel trace
events.
Example rasdaemon log and SQLite DB output for the
memory_failure event:
=================================================
rasdaemon: memory_failure_event store: 0x126ce8f8
rasdaemon: register inserted at db
<...>-785 [000] 0.000024: memory_failure_event: 2020-10-02 13:27:13 -0400 pfn=0x204000000 page_type=free buddy page action_result=Delayed
B. Wilson [Mon, 12 Apr 2021 15:29:58 +0000 (00:29 +0900)]
ras-record: Create RASSTATEDIR at runtime instead of install time
Package managers such as Nix and Guix force installation into an
isolated directory hierarchy. Furthermore, said hierarchy becomes
readonly after the install has completed, rendering any
<hierarchy>/var/lib/rasdaemon/ directory effectively useless.
In addition to being standard practice, creating RASSTATEDIR when
necessary at runtime fixes the above use cases.
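Creating a state directory on demand can be sketched as below. The function name is hypothetical and the sketch only creates a single directory level (it assumes the parent already exists); it is not taken from the rasdaemon sources.

```c
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Make sure a state directory exists; returns 0 on success, -1 on error. */
int ensure_state_dir_sketch(const char *path)
{
	struct stat st;

	if (mkdir(path, 0755) == 0)
		return 0;		/* freshly created */
	if (errno != EEXIST)
		return -1;		/* e.g. missing parent or read-only fs */
	/* Something already exists at this path: accept it only if it is
	 * a directory. */
	if (stat(path, &st) == 0 && S_ISDIR(st.st_mode))
		return 0;
	errno = ENOTDIR;
	return -1;
}
```

Calling this right before the database is opened means the install step never has to touch the state directory.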
Jason Tian [Thu, 4 Feb 2021 01:57:05 +0000 (09:57 +0800)]
Add code to decode Ampere specific error
All Ampere-specific errors (payload types 0/1/2/3) include 48 bytes of
OEM data, which are decoded into the error type, subtype, instance,
socket number and so on.
Signed-off-by: Jason Tian <jason@os.amperecomputing.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Josh Hunt [Fri, 8 Jan 2021 00:12:52 +0000 (19:12 -0500)]
rasdaemon: fix memory leak in parse_ras_data
parse_ras_data() is calling trace_seq_init() which allocates a buffer,
but never calls the corresponding trace_seq_destroy() to free it causing
us to leak memory.
Subhendu Saha [Tue, 12 Jan 2021 08:29:55 +0000 (03:29 -0500)]
Fix ras-mc-ctl script.
When rasdaemon is compiled without enabling aer, mce, devlink,
etc., those tables are not created in the database file. Then
ras-mc-ctl script breaks trying to query data from non-existent
tables.
Signed-off-by: Subhendu Saha <subhends@akamai.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
lvying6 [Sat, 31 Oct 2020 09:57:15 +0000 (17:57 +0800)]
ras-page-isolation: page which is PAGE_OFFLINE_FAILED can be offlined again
The OS may have failed to offline the page on a previous attempt. After
some time, the page's state may change so that the OS can offline it.
If Correctable Errors on this page then reach the threshold again,
rasdaemon should try to offline the page again.
lvying [Sat, 31 Oct 2020 09:57:14 +0000 (17:57 +0800)]
ras-page-isolation: do_page_offline always considers page offline was successful
do_page_offline always considers the page offline successful, even if
the kernel's soft/hard page offline failed.
Calling rasdaemon with:
/etc/sysconfig/rasdaemon PAGE_CE_THRESHOLD="1"
i.e. when a Corrected Error occurs at a page's address, rasdaemon should
trigger a soft offline of this page.
However, after adding a livepatch to the kernel's
store_soft_offline_page to observe this function's return value,
when injecting a CE at address 0x3f7ec30000, the kernel
log reports:
soft_offline: 0x3f7ec30: unknown non LRU page type ffffe0000000000 ()
[store_soft_offline_page]return from soft_offline_page: -5
While rasdaemon log reports:
rasdaemon[73711]: cpu 00:rasdaemon: Corrected Errors at 0x3f7ec30000 exceed threshold
rasdaemon[73711]: rasdaemon: Result of offlining page at 0x3f7ec30000: offlined
Using strace to record rasdaemon's system calls, it reports:
So the kernel actually failed to soft offline pfn 0x3f7ec30 and
store_soft_offline_page returned -EIO. However, rasdaemon always
considers the page offline to be successful.
According to the strace output, ferror was unable to detect the
failure of the write syscall.
This patch changes the fopen-fprintf-ferror-fclose sequence to
lower-level I/O, using open-write-close instead, which
can detect such a syscall failure.
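The open-write-close pattern can be sketched like this. It is an illustration only: the helper name is hypothetical and the real patch may differ in details. The point is that write() reports a kernel-side failure directly through its return value.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Write a string to a control file; returns 0 on success, -1 on failure. */
int write_string_checked_sketch(const char *path, const char *value)
{
	size_t len = strlen(value);
	ssize_t written;
	int fd;

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;

	/* write() returns -1 (with errno set, e.g. EIO) as soon as the
	 * kernel handler rejects the request. */
	written = write(fd, value, len);
	close(fd);

	return (written == (ssize_t)len) ? 0 : -1;
}
```

A caller would report the page as offlined only when this returns 0.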
Shiju Jose [Mon, 10 Aug 2020 14:42:56 +0000 (15:42 +0100)]
rasdaemon: Modify non-standard error decoding interface using linked list
Replace the current non-standard error decoding interface with the
interface based on the linked list to avoid using realloc and
to improve the interface.
Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
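A linked-list registry of this kind might look like the sketch below. All names are hypothetical and the real interface carries more fields; the sketch only shows why no realloc is needed: each decoder brings its own node and is linked at the head of the list.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical decoder node; each decoder owns its node storage. */
struct decoder_node_sketch {
	const char *sec_type;
	struct decoder_node_sketch *next;
};

static struct decoder_node_sketch *decoder_list_sketch;

/* Link a decoder at the head of the list: O(1), no allocation. */
void register_decoder_sketch(struct decoder_node_sketch *d)
{
	d->next = decoder_list_sketch;
	decoder_list_sketch = d;
}

/* Look up a decoder by its section type string, or return NULL. */
struct decoder_node_sketch *find_decoder_sketch(const char *sec_type)
{
	struct decoder_node_sketch *d;

	for (d = decoder_list_sketch; d; d = d->next)
		if (strcmp(d->sec_type, sec_type) == 0)
			return d;
	return NULL;
}
```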
Xiaofei Tan [Mon, 27 Jul 2020 07:38:39 +0000 (15:38 +0800)]
rasdaemon: add support for hisilicon common section decoder
Add a new non-standard error section, Hisilicon common section.
It is defined for the next generation SoC Kunpeng930. It also supports
Kunpeng920 and some modules of Kunpeng920 could be changed to use
this section.
We put the code in a new source file, as it supports multiple hardware
platforms. Some hip08 code can be shared, so it is moved to this new file.
Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
ras-diskerror-handler.c: In function ras_diskerror_event_handler:
ras-diskerror-handler.c:98:2:
warning: ignoring return value of asprintf, declared with attribute warn_unused_result [-Wunused-result]
asprintf(&ev.dev, "%u:%u", major(dev), minor(dev));
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Check the return value of asprintf() to avoid the warning.
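One way to check the return value is sketched below. The helper name is hypothetical; asprintf() is a GNU extension that returns -1 on failure, in which case the output pointer must not be used.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>

/* Format "major:minor" into a newly allocated string.
 * Returns NULL on failure; the caller frees the result. */
char *format_dev_sketch(unsigned int maj, unsigned int min)
{
	char *s = NULL;

	/* On failure the contents of s are undefined, so bail out. */
	if (asprintf(&s, "%u:%u", maj, min) < 0)
		return NULL;
	return s;
}
```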
dann frazier [Tue, 21 Apr 2020 21:56:04 +0000 (15:56 -0600)]
ras-mc-ctl: PCIe AER: display PCIe dev name
Storage of PCIe dev name was added in commit 8e96ca2c1c59 ("rasdaemon:
store PCIe dev name and TLP header for the aer event"). This makes
ras-mc-ctl extract and emit it like so:
RPM build errors:
bogus date in %changelog: Fri Oct 10 2019 Mauro Carvalho Chehab <mchehab+samsung@kernel.org> 0.6.4-1
Bad exit status from /var/tmp/rpm-tmp.MRqZEZ (%install)
wuyun [Sat, 20 Jun 2020 12:26:22 +0000 (20:26 +0800)]
rasdaemon: add support for memory Corrected Error predictive failure analysis
Memory Corrected Errors are corrected by hardware. These errors do not
require immediate software actions, but are still reported for
accounting and predictive failure analysis.
Based on statistical results, some actions can be taken to prevent
a Corrected Error from evolving into an Uncorrected Error.
Xiaofei Tan [Wed, 27 May 2020 08:02:33 +0000 (16:02 +0800)]
rasdaemon: fix the issue that non standard decoder can't work in pthread way
The non standard decoding functions are registered in app init process
through __attribute__((constructor)), and unregistered in app exit process
through __attribute__((destructor)). We don't need to unregister them
in any other steps. This patch removes these unnecessary unregister calls.
Fixes: 78a21c1e9770 ("rasdaemon: add closure and cleanups for the database") Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
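The constructor/destructor mechanism can be sketched as follows (GCC/Clang attribute syntax; the names are hypothetical). The registration runs before main() and the unregistration runs at process exit, so no other code path needs to unregister.

```c
/* Hypothetical flag standing in for "decoder is registered". */
static int registered_sketch;

/* Runs automatically before main(). */
__attribute__((constructor))
void register_at_init_sketch(void)
{
	registered_sketch = 1;
}

/* Runs automatically at normal process exit. */
__attribute__((destructor))
void unregister_at_exit_sketch(void)
{
	registered_sketch = 0;
}

/* Query used by the rest of the program. */
int is_registered_sketch(void)
{
	return registered_sketch;
}
```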
Xiaofei Tan [Wed, 27 May 2020 08:02:32 +0000 (16:02 +0800)]
rasdaemon: add support of l3tag and l3data in hip08 OEM format2
The two modules, l3tag and l3data, were originally reported through the "ARM
processor error section". That is not suitable, because l3tag and l3data
do not belong to any single CPU core. So we change them to use OEM format2.
Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Xiaofei Tan [Mon, 25 Nov 2019 09:33:24 +0000 (10:33 +0100)]
rasdaemon: fix the wrong declaring of 'struct ras_events' in ras-record.h
The following warning is reported by PC-Lint when running static code
analysis on the file non-standard-hisi_hip08.c:
Warning -- Declaration of symbol 'ras' hides symbol 'ras' (line 28, file ras-record.h)
This means that the local variable name 'ras' is the same as a global
variable. In fact, there is no global variable named 'ras', only a
wrong declaration in ras-record.h.
CC: Xiaofei Tan <tanxiaofei@huawei.com>, <linuxarm@huawei.com>, <shiju.jose@huawei.com>, <jonathan.cameron@huawei.com> Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Brian Woods, Yazen Ghannam [Fri, 1 Nov 2019 14:48:14 +0000 (15:48 +0100)]
rasdaemon: add support for new AMD SMCA bank types
Going forward, the Scalable Machine Check Architecture (SMCA) has some
updated and additional bank types which show up in Zen2. The differing
bank types include: CS_V2, PSP_V2, SMU_V2, MP5, NBIO, and PCIE. The V2
bank types replace the original bank types but have unique HWID/MCAtype
IDs from the originals so there are no conflicts between different
versions or other bank types. All of the differing bank types have new
MCE descriptions which have been added as well.
CC: "mchehab+samsung@kernel.org" <mchehab+samsung@kernel.org>, "Namburu, Chandu-babu" <chandu@amd.com> Signed-off-by: Brian Woods <brian.woods@amd.com> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Cc: Chandu-babu Namburu <chandu@amd.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Shiju Jose [Wed, 13 Nov 2019 16:31:12 +0000 (16:31 +0000)]
rasdaemon: fix for the ras-record.c:ras_mc_prepare_stmt() failure when new fields added to the sql table
rasdaemon fails in the ras_mc_prepare_stmt() function when new fields are
added to the table's db_fields on top of the existing sql table in the
system.
This patch adds a solution for this issue.
Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Shiju Jose [Wed, 16 Oct 2019 16:34:01 +0000 (17:34 +0100)]
rasdaemon: add signal handling for the cleanup
Presently rasdaemon does not free allocated memory or
do other cleanup when it is closed
with ctrl+c, kill, etc.
This patch adds handling of the signals SIGINT, SIGTERM, SIGHUP
and SIGQUIT and does the necessary cleanup when one of these
signals is received.
Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
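Handling those four signals could be sketched as below. This is an assumption-laden illustration, not the patch itself: the names are hypothetical, and the handler only sets a flag so that the actual cleanup can run outside signal context.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* Set by the handler; the main loop is assumed to poll it and clean up. */
static volatile sig_atomic_t stop_requested_sketch;

static void on_term_signal_sketch(int sig)
{
	(void)sig;
	stop_requested_sketch = 1;
}

/* Install one handler for SIGINT, SIGTERM, SIGHUP and SIGQUIT.
 * Returns 0 on success, -1 if any sigaction() call fails. */
int install_term_handlers_sketch(void)
{
	static const int sigs[] = { SIGINT, SIGTERM, SIGHUP, SIGQUIT };
	struct sigaction sa;
	size_t i;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_term_signal_sketch;
	sigemptyset(&sa.sa_mask);

	for (i = 0; i < sizeof(sigs) / sizeof(sigs[0]); i++)
		if (sigaction(sigs[i], &sa, NULL) != 0)
			return -1;
	return 0;
}
```

Keeping the handler down to a flag assignment avoids calling functions that are not async-signal-safe, such as free(), from signal context.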
Xiaofei Tan [Tue, 8 Oct 2019 12:38:57 +0000 (20:38 +0800)]
rasdaemon: add timestamp for hip08 OEM error records in sqlite3 DB
This patch does two things:
1. Add timestamp for hip08 OEM error records in sqlite3 DB.
2. Add suffix "_v2" for hip08 OEM event names to keep compatibility
with old sqlite3 DB.
Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Xiaofei Tan [Tue, 8 Oct 2019 12:38:54 +0000 (20:38 +0800)]
rasdaemon: optimize sqlite3 DB record of register fields for hip08
Optimize sqlite3 DB record of register fields for hip08 by combining
all register fields to one text field, which will include register name.
This will make the record easier to read.
For example, from:
INSERT INTO hip08_oem_type2_event VALUES(1,1,1,0,0,'SMMU',2,'corrected',
273058,0,-1,0,1308622858,0,0,0,0,133,0,0,NULL);
change to:
INSERT INTO hip08_oem_type2_event VALUES(1,1,1,0,0,'SMMU',2,'corrected',
'ERR_FR_0=0x42aa2 ERR_FR_1=0x0 ERR_CTRL_0=0xffffffff ERR_CTRL_1=0x0
ERR_STATUS_0=0x4e00000a ERR_STATUS_1=0x0 ERR_ADDR_0=0x0, ERR_ADDR_1=0x0
ERR_MISC0_0=0x0 ERR_MISC0_1=0x90 ERR_MISC1_0=0x0 ERR_MISC1_1=0x0');
Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
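Building such a combined text field can be sketched as follows. The struct and function names are hypothetical, and the sketch uses a uniform space separator rather than reproducing the exact punctuation of the example above.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical name/value pair for one register. */
struct reg_sketch {
	const char *name;
	uint32_t value;
};

/* Join registers as "NAME=0xVALUE NAME=0xVALUE ..." into out.
 * Returns 0 on success, -1 if the buffer is too small. */
int format_regs_sketch(char *out, size_t outlen,
		       const struct reg_sketch *regs, size_t n)
{
	size_t used = 0;
	size_t i;

	if (outlen == 0)
		return -1;
	out[0] = '\0';

	for (i = 0; i < n; i++) {
		int w = snprintf(out + used, outlen - used, "%s%s=0x%x",
				 i ? " " : "", regs[i].name,
				 (unsigned int)regs[i].value);
		if (w < 0 || (size_t)w >= outlen - used)
			return -1;	/* output would be truncated */
		used += (size_t)w;
	}
	return 0;
}
```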
Xiaofei Tan [Thu, 8 Aug 2019 02:14:30 +0000 (10:14 +0800)]
rasdaemon: fix the issue of sqlite3 integer bind parameter mismatch
Some integer fields of arm_event and mc_event are 8 bytes wide,
and sqlite3_bind_int64() should be used when storing the event to
sqlite3. But the current code uses sqlite3_bind_int(). This
leads to a wrong value in the sqlite3 DB.
This patch fixes the issue.
Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
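The effect of the mismatch can be illustrated without SQLite itself. sqlite3_bind_int() takes a C int while sqlite3_bind_int64() takes a 64-bit integer, so an 8-byte field passed through the former loses its upper half on platforms with a 32-bit int. The helpers below are hypothetical and only model that narrowing.

```c
#include <stdint.h>

/* Model what survives when a 64-bit value is squeezed into 32 bits. */
uint32_t low32_sketch(uint64_t v)
{
	return (uint32_t)v;	/* upper 32 bits are discarded */
}

/* True if the value can be bound with a 32-bit int without loss. */
int fits_in_int32_sketch(int64_t v)
{
	return v >= INT32_MIN && v <= INT32_MAX;
}
```

For example, a 40-bit physical address does not pass fits_in_int32_sketch(), so it needs the 64-bit bind variant.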
README: updated instructions about sending patches
The instructions there are a little outdated. Sergio
suggested changing just my e-mail, but let's do a better job
and use my canonical e-mail (mchehab@kernel.org), plus add the
alternative of sending patches against either github or gitlab.
fix file descriptor leak in ras-report.c:setup_report_socket()
A running instance of rasdaemon was seen to hit the limit on open file
descriptors. Most of the descriptors were AF_UNIX STREAM sockets.
At the same time the limit was hit, attempts by rasdaemon to open the
SQLite database started failing with SQLite error 14.
This patch avoids leaking a socket file descriptor each time the connect()
call fails.
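The fix pattern can be sketched as below, under the assumption that the socket is created immediately before connect(). The function name is hypothetical and this is not the code of setup_report_socket().

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Connect to a UNIX stream socket; returns the fd, or -1 on failure. */
int connect_unix_sketch(const char *path)
{
	struct sockaddr_un addr;
	int fd;

	if (strlen(path) >= sizeof(addr.sun_path))
		return -1;

	fd = socket(AF_UNIX, SOCK_STREAM, 0);
	if (fd < 0)
		return -1;

	memset(&addr, 0, sizeof(addr));
	addr.sun_family = AF_UNIX;
	strcpy(addr.sun_path, path);

	if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		/* Without this close(), every failed attempt leaks one
		 * descriptor until the process hits its open-file limit. */
		close(fd);
		return -1;
	}
	return fd;
}
```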
parse_ras_data: initialize record.cpu before pevent_print_event().
pevent_print_event() prints record.cpu; make sure it's initialized.
The cpu field from pthread_data is my best guess at a suitable value:
parse_ras_data() was already printing it separately.
parse_ras_data: flush trace buffer immediately, not on next call
parse_ras_data() was calling fflush() before, not after printf().
As a result, information about an event would not be printed
immediately but possibly much later.
Shiju Jose [Mon, 17 Jun 2019 14:28:51 +0000 (15:28 +0100)]
rasdaemon:add logging HiSilicon HIP08 H/W errors reported in the OEM format2
This patch adds logging the HiSilicon HIP08 H/W errors reported
in the non-standard OEM format2.
These errors are from the H/W modules SMMU, HHA, HLLC, PA and DDRC.
Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Shiju Jose [Mon, 17 Jun 2019 14:28:50 +0000 (15:28 +0100)]
rasdaemon:add logging HiSilicon HIP08 H/W errors reported in the OEM format1
This patch adds logging the HiSilicon HIP08 H/W errors reported
in the non-standard OEM format1.
These errors are from the H/W modules MN, PLL, SLLC, AA, SIOE,
POE, DISP, LPC, SAS and SATA.
Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
ras-mce-handler.c:344:9: warning: address of array 'e->mcgstatus_msg' will always evaluate to 'true' [-Wpointer-bool-conversion]
if (e->mcgstatus_msg)
~~ ~~~^~~~~~~~~~~~~
Ying Lv [Wed, 15 May 2019 03:15:42 +0000 (11:15 +0800)]
fix rasdaemon high CPU usage when part of CPUs offline
When some CPU cores are set offline, such as by setting the kernel cmdline
maxcpus=N (where N is less than the total number of system CPU cores),
the CPU usage of some rasdaemon threads
is very close to 100%.
This is because when some CPUs are offline, poll in the read_ras_event_all_cpus
function falls back to the pthread way.
The thread for an offlined CPU gets a negative value when reading trace_pipe_raw,
and the negative return value is converted to a positive value because of 'unsigned size'.
So the code always goes into the 'size > 0' branch, and the CPU usage is too high.
Declaring the variable size as int makes the code take the right branch.
Fixes: eff7c9e0("ras-events: Only use pthreads for collect if poll() not available") Reported-by: Zhipeng Xie <xiezhipeng1@huawei.com> Signed-off-by: Ying Lv <lvying6@huawei.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
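The signedness problem can be reproduced in isolation. The two hypothetical helpers below differ only in the type of the size variable; reading from an invalid descriptor makes read() return -1, which stands in for the failing read on trace_pipe_raw.

```c
#include <unistd.h>

/* Buggy variant: -1 from read() wraps to a huge unsigned value,
 * so the "size > 0" test is true even on error. */
int has_data_unsigned_sketch(int fd, void *buf, size_t len)
{
	unsigned size = (unsigned)read(fd, buf, len);

	return size > 0;
}

/* Fixed variant: a signed size keeps -1 negative,
 * so errors do not enter the "size > 0" branch. */
int has_data_signed_sketch(int fd, void *buf, size_t len)
{
	int size = (int)read(fd, buf, len);

	return size > 0;
}
```

With the unsigned variant, a loop of the form "while data is available, process it" never stops on a failing descriptor, which matches the busy-looping threads described above.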