www.infradead.org Git - users/mchehab/rasdaemon.git/log
users/mchehab/rasdaemon.git
3 years ago Bump version to 0.6.7 v0.6.7
Mauro Carvalho Chehab [Wed, 26 May 2021 07:41:45 +0000 (09:41 +0200)]
Bump version to 0.6.7

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agorasdaemon.spec.in: don't install _sharedstatedir
Mauro Carvalho Chehab [Wed, 26 May 2021 08:32:39 +0000 (10:32 +0200)]
rasdaemon.spec.in: don't install _sharedstatedir

%{_sharedstatedir}/rasdaemon is now created at runtime.
So, no need to install it anymore.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agoMakefile.am: fix build header rules
Mauro Carvalho Chehab [Wed, 26 May 2021 08:20:39 +0000 (10:20 +0200)]
Makefile.am: fix build header rules

non-standard-hisilicon.h was added twice;
ras-memory-failure-handler.h is missing.

Due to that, the tarball becomes incomplete, causing build
errors.

While here, also adjust .travis.yml to use --enable-all.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agorasdaemon: Add Ice Lake and Sapphire Rapids MSCOD values
Greg Edwards [Thu, 8 Apr 2021 21:03:30 +0000 (15:03 -0600)]
rasdaemon: Add Ice Lake and Sapphire Rapids MSCOD values

Based on mcelog commits:

  ee90ff20ce6a ("mcelog: Add support for Icelake server, Icelake-D, and Snow Ridge")
  391abaac9bdf ("mcelog: Add decode for MCi_MISC from 10nm memory controller")
  59cb7ad4bc72 ("mcelog: i10nm: Fix mapping from bank number to functional unit")
  c0acd0e6a639 ("mcelog: Add support for Sapphirerapids server.")

Signed-off-by: Greg Edwards <gedwards@ddn.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agorasdaemon: fix build error in register_ns_ev_decoder if the sqlite3 is not enabled
Shiju Jose [Tue, 9 Mar 2021 16:18:56 +0000 (16:18 +0000)]
rasdaemon: fix build error in register_ns_ev_decoder if the sqlite3 is not enabled

The assignment ns_ev_decoder->stmt_dec_record = NULL; in
register_ns_ev_decoder() should be under #ifdef HAVE_SQLITE3 to fix
the compilation error when building without the configure option
--enable-sqlite3.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years ago rasdaemon: Modify configure.ac for Hisilicon Kunpeng errors
Shiju Jose [Mon, 8 Mar 2021 16:57:32 +0000 (16:57 +0000)]
rasdaemon: Modify configure.ac for Hisilicon Kunpeng errors

Change the label "HIP07 SAS HW errors : $USE_HISI_NS_DECODE" to
"HISI Kunpeng errors : $USE_HISI_NS_DECODE".

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agorasdaemon: ras-mc-ctl: Add support for HiSilicon Kunpeng9xx common errors
Shiju Jose [Mon, 8 Mar 2021 16:57:31 +0000 (16:57 +0000)]
rasdaemon: ras-mc-ctl: Add support for HiSilicon Kunpeng9xx common errors

Add support for the HiSilicon Kunpeng9xx platforms common errors.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Reviewed-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agorasdaemon: ras-mc-ctl: Add support for HiSilicon Kunpeng920 errors
Shiju Jose [Mon, 8 Mar 2021 16:57:30 +0000 (16:57 +0000)]
rasdaemon: ras-mc-ctl: Add support for HiSilicon Kunpeng920 errors

Add support for the HiSilicon Kunpeng920 errors.
Supported error formats: OEM type 1, OEM type 2 and PCIe controller
error formats.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Reviewed-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agorasdaemon: ras-mc-ctl: Add support for the vendor-specific errors
Shiju Jose [Mon, 8 Mar 2021 16:57:29 +0000 (16:57 +0000)]
rasdaemon: ras-mc-ctl: Add support for the vendor-specific errors

Add commands to support logging the vendor-specific
error info in the ras-mc-ctl.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Reviewed-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agorasdaemon: ras-mc-ctl: Add memory failure events
Shiju Jose [Mon, 8 Mar 2021 16:57:28 +0000 (16:57 +0000)]
rasdaemon: ras-mc-ctl: Add memory failure events

Add support for memory failure errors (memory_failure_event)
to the ras-mc-ctl tool.

Sample log:
ras-mc-ctl --summary
...
Memory failure events summary:
        Delayed errors: 4
        Failed errors: 1
...

ras-mc-ctl --errors
...
Memory failure events:
1 2020-10-28 23:20:41 -0800 error: pfn=0x204000000, page_type=free buddy page, action_result=Delayed
2 2020-10-28 23:31:38 -0800 error: pfn=0x204000000, page_type=free buddy page, action_result=Delayed
3 2020-10-28 23:54:54 -0800 error: pfn=0x205000000, page_type=free buddy page, action_result=Delayed
4 2020-10-29 00:12:25 -0800 error: pfn=0x204000000, page_type=free buddy page, action_result=Delayed
5 2020-10-29 00:26:36 -0800 error: pfn=0x204000000, page_type=free buddy page, action_result=Failed

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agorasdaemon: ras-mc-ctl: Modify ARM processor error summary log
Shiju Jose [Mon, 8 Mar 2021 16:57:27 +0000 (16:57 +0000)]
rasdaemon: ras-mc-ctl: Modify ARM processor error summary log

Add CPU's mpidr information to the ARM processor error
summary log.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agorasdaemon: add support for memory_failure events
Shiju Jose [Mon, 8 Mar 2021 16:57:26 +0000 (16:57 +0000)]
rasdaemon: add support for memory_failure events

Add support for logging the memory_failure kernel trace
events.

Example rasdaemon log and SQLite DB output for the
memory_failure event:
=================================================
rasdaemon: memory_failure_event store: 0x126ce8f8
rasdaemon: register inserted at db
<...>-785   [000]     0.000024: memory_failure_event: 2020-10-02 13:27:13 -0400 pfn=0x204000000 page_type=free buddy page action_result=Delayed

CREATE TABLE memory_failure_event (id INTEGER PRIMARY KEY, timestamp TEXT, pfn TEXT, page_type TEXT, action_result TEXT);
INSERT INTO memory_failure_event VALUES(1,'2020-10-02 13:27:13 -0400','0x204000000','free buddy page','Delayed');
==================================================

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agoras-record: Create RASSTATEDIR at runtime instead of install time
B. Wilson [Mon, 12 Apr 2021 15:29:58 +0000 (00:29 +0900)]
ras-record: Create RASSTATEDIR at runtime instead of install time

Package managers such as Nix and Guix force installation into an
isolated directory hierarchy. Furthermore, said hierarchy becomes
readonly after the install has completed, rendering any
<hierarchy>/var/lib/rasdaemon/ directory effectively useless.

In addition to being standard practice, creating RASSTATEDIR when
necessary at runtime fixes the above use cases.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agoadd labels for A2SDi-8C+-HLN4F
Henrik Riomar [Sat, 27 Feb 2021 13:10:42 +0000 (14:10 +0100)]
add labels for A2SDi-8C+-HLN4F

Same as A2SDi-8C-HLN4F, but with a fan

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agoAdded label for ASUS PRIME X570-PRO
obiwan-r2d2 [Sat, 22 May 2021 13:35:36 +0000 (15:35 +0200)]
Added label for ASUS PRIME X570-PRO

ras-mc-ctl: mainboard: ASUSTeK COMPUTER INC. model PRIME X570-PRO

Installed DIMM_A1:
Label                   CE      UE
mc#0csrow#0channel#1    0       0
mc#0csrow#1channel#1    0       0

Installed DIMM_A2:
Label                   CE      UE
mc#0csrow#3channel#1    0       0
mc#0csrow#2channel#1    0       0

Installed DIMM_B1:
Label                   CE      UE
mc#0csrow#1channel#0    0       0
mc#0csrow#0channel#0    0       0

Installed DIMM_B2:
Label                   CE      UE
mc#0csrow#2channel#0    0       0
mc#0csrow#3channel#0    0       0

Test with 2 DIMMs:

LOCATION                            CONFIGURED LABEL     SYSFS CONTENTS
                                    DIMM_B1              0:0:0 missing
                                    DIMM_A1              0:0:1 missing
                                    DIMM_B1              0:1:0 missing
                                    DIMM_A1              0:1:1 missing
mc0 csrow 2 channel 0               DIMM_B2              DIMM_B2
mc0 csrow 2 channel 1               DIMM_A2              DIMM_A2
mc0 csrow 3 channel 0               DIMM_B2              DIMM_B2
mc0 csrow 3 channel 1               DIMM_A2              DIMM_A2

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agoAdd code to decode Ampere specific error
Jason Tian [Thu, 4 Feb 2021 01:57:05 +0000 (09:57 +0800)]
Add code to decode Ampere specific error

All Ampere-specific errors (payload types 0/1/2/3) include 48 bytes of
OEM data, which is decoded into the error type, subtype, instance,
socket number and so on.

Signed-off-by: Jason Tian <jason@os.amperecomputing.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
4 years agorasdaemon: fix memory leak in parse_ras_data
Josh Hunt [Fri, 8 Jan 2021 00:12:52 +0000 (19:12 -0500)]
rasdaemon: fix memory leak in parse_ras_data

parse_ras_data() calls trace_seq_init(), which allocates a buffer,
but never calls the corresponding trace_seq_destroy() to free it, so
the buffer is leaked.

Reported-by: Subhendu Saha <subhends@akamai.com>
Signed-off-by: Josh Hunt <johunt@akamai.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
4 years agoFix ras-mc-ctl script.
Subhendu Saha [Tue, 12 Jan 2021 08:29:55 +0000 (03:29 -0500)]
Fix ras-mc-ctl script.

When rasdaemon is compiled without enabling aer, mce, devlink,
etc., those tables are not created in the database file. The
ras-mc-ctl script then breaks when it tries to query data from
the non-existent tables.

Signed-off-by: Subhendu Saha <subhends@akamai.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
4 years agoadd Supermicro X10SRA-F and H8DGU.
Chad W Seys [Thu, 14 Jan 2021 15:07:47 +0000 (09:07 -0600)]
add Supermicro X10SRA-F and H8DGU.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
4 years agoras-page-isolation: page which is PAGE_OFFLINE_FAILED can be offlined again
lvying6 [Sat, 31 Oct 2020 09:57:15 +0000 (17:57 +0800)]
ras-page-isolation: page which is PAGE_OFFLINE_FAILED can be offlined again

The OS may have failed to offline a page on a previous attempt. Later,
the page's state may change so that the OS can offline it.
If the Correctable errors on this page reach the threshold at that
point, rasdaemon should try to offline the page again.

Signed-off-by: lvying6 <lvying6@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
4 years agoras-page-isolation: do_page_offline always considers page offline was successful
lvying [Sat, 31 Oct 2020 09:57:14 +0000 (17:57 +0800)]
ras-page-isolation: do_page_offline always considers page offline was successful

do_page_offline always considers the page offline to be successful,
even if the kernel failed to soft/hard offline the page.

Calling rasdaemon with:

/etc/sysconfig/rasdaemon PAGE_CE_THRESHOLD="1"

i.e. when a Corrected Error occurs at an address within a page,
rasdaemon should trigger a soft offline of that page.

However, after adding a livepatch into the kernel's
store_soft_offline_page to observe this function's return value,
when injecting a CE into address 0x3f7ec30000, the kernel
log reports:

soft_offline: 0x3f7ec30: unknown non LRU page type ffffe0000000000 ()
[store_soft_offline_page]return from soft_offline_page: -5

While rasdaemon log reports:

rasdaemon[73711]: cpu 00:rasdaemon: Corrected Errors at 0x3f7ec30000 exceed threshold
rasdaemon[73711]: rasdaemon: Result of offlining page at 0x3f7ec30000: offlined

Using strace to record rasdaemon's system calls shows:

strace -p 73711
openat(AT_FDCWD, "/sys/devices/system/memory/soft_offline_page",
       O_WRONLY|O_CREAT|O_TRUNC, 0666) = 28
fstat(28, {st_mode=S_IFREG|0200, st_size=4096, ...}) = 0
write(28, "0x3f7ec30000", 12)           = -1 EIO (Input/output error)
close(28)                               = 0

So the kernel actually failed to soft offline pfn 0x3f7ec30, and
store_soft_offline_page returned -EIO. However, rasdaemon always
considers the page offline to be successful.

According to the strace output, ferror was unable to detect the
failure of the write syscall.

This patch changes the fopen-fprintf-ferror-fclose sequence to the
lower-level open-write-close sequence, which can detect such a
syscall failure.

Signed-off-by: lvying <lvying6@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
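
The open-write-close approach from the commit above can be sketched in C as follows. This is a minimal illustration, not the actual rasdaemon patch: the helper name write_sysfs_value() and its negative-errno return convention are assumptions made for the example. With stdio, fprintf() usually only fills a user-space buffer, so a ferror() check made before the buffer is flushed may not see a failed write(); calling write() directly reports the error at once.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: write `value` to the file at `path` using plain
 * open/write/close. Returns 0 on success or a negative errno value, so
 * a kernel-side rejection such as EIO is visible to the caller.
 */
int write_sysfs_value(const char *path, const char *value)
{
    size_t len;
    ssize_t written;
    int fd;
    int err = 0;

    assert(path != NULL && value != NULL);
    len = strlen(value);

    fd = open(path, O_WRONLY);
    if (fd < 0)
        return -errno;

    written = write(fd, value, len);
    if (written < 0)
        err = -errno;           /* e.g. -EIO from a sysfs store handler */
    else if ((size_t)written != len)
        err = -EIO;             /* short write: treat as a failure */

    if (close(fd) < 0 && err == 0)
        err = -errno;

    return err;
}
```

A caller would then report the page as offlined only when the helper returns 0, and log the failure otherwise.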
4 years agoAdd architecture ppc64le to travis build
Debabrata Deka [Fri, 4 Dec 2020 13:13:21 +0000 (14:13 +0100)]
Add architecture ppc64le to travis build

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
4 years agoFix problem from make dist-rpm
Jason Tian [Fri, 11 Dec 2020 02:39:11 +0000 (03:39 +0100)]
Fix problem from make dist-rpm

The rasdaemon RPM package fails to build, reporting that file paths
should start with "/".

Update the Makefile by adding "" to the folder name and fix one
typo.

Signed-off-by: Jason Tian <jason@os.amperecomputing.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
4 years agorasdaemon: Add 8 channel decoding for SMCA systems
Muralidhara M K [Thu, 20 Aug 2020 15:30:57 +0000 (21:00 +0530)]
rasdaemon: Add 8 channel decoding for SMCA systems

Current Scalable Machine Check Architecture (SMCA) systems support up
to 8 UMC channels.

To find the UMC channel represented by a bank, look at the 6th nibble
in the MCA_IPID[InstanceId] field.

Signed-off-by: Muralidhara M K <muralimk@amd.com>
[ Adjust commit message. ]
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
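
The nibble lookup described in the commit above can be illustrated with a short C sketch. It rests on two assumptions that the commit message does not spell out: that InstanceId is the low 32 bits of MCA_IPID, and that nibbles are counted from 1 starting at the least significant end, so the 6th nibble is bits 23:20. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: InstanceId occupies bits 31:0 of MCA_IPID. */
uint32_t ipid_instance_id(uint64_t ipid)
{
    return (uint32_t)(ipid & 0xFFFFFFFFu);
}

/*
 * Assumption: nibbles are numbered from 1 at the least significant
 * end, so the 6th nibble of InstanceId is bits 23:20. A 4-bit value
 * is wide enough to number 8 channels.
 */
unsigned int umc_channel_from_ipid(uint64_t ipid)
{
    return (ipid_instance_id(ipid) >> 20) & 0xFu;
}
```

Under these assumptions, an MCA_IPID whose InstanceId is 0x00500000 maps to channel 5.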
4 years agorasdaemon: Fix error print
Liguang Zhang [Mon, 10 Aug 2020 03:07:43 +0000 (11:07 +0800)]
rasdaemon: Fix error print

Fix the error print in handle_ras_events.

Signed-off-by: Liguang Zhang <zhangliguang@linux.alibaba.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
4 years agoCreate SYSCONFDEFDIR configure parameter
Antonio Russo [Tue, 18 Aug 2020 04:29:51 +0000 (22:29 -0600)]
Create SYSCONFDEFDIR configure parameter

Provide downstream packagers with a tunable describing the location of
the file containing environment variables to pass to the startup script.

Defaults to the existing value, /etc/sysconfig.

Signed-off-by: Antonio Russo <antonio.e.russo@gmail.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
4 years agorasdaemon: ras-mc-ctl: Add ARM processor error information
Shiju Jose [Tue, 11 Aug 2020 12:31:46 +0000 (13:31 +0100)]
rasdaemon: ras-mc-ctl: Add ARM processor error information

Add support for ARM processor errors to the ras-mc-ctl tool.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
4 years agorasdaemon: Modify non-standard error decoding interface using linked list
Shiju Jose [Mon, 10 Aug 2020 14:42:56 +0000 (15:42 +0100)]
rasdaemon: Modify non-standard error decoding interface using linked list

Replace the current non-standard error decoding interface with one
based on a linked list, which avoids using realloc and improves
the interface.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
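
A list-based registration interface of the kind the commit above describes might look like the sketch below. The struct layout and the names ns_decoder, register_decoder() and find_decoder() are hypothetical and are not copied from rasdaemon. The point is only that linking caller-owned nodes needs no realloc, so registration cannot fail for lack of memory.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical decoder descriptor; field names are illustrative only. */
struct ns_decoder {
    const char *sec_type;                         /* section type id */
    int (*decode)(const void *data, size_t len);  /* decode callback */
    struct ns_decoder *next;                      /* intrusive list link */
};

static struct ns_decoder *decoder_list;           /* head of the list */

/* Prepend a caller-owned decoder; nothing is allocated here. */
void register_decoder(struct ns_decoder *dec)
{
    assert(dec != NULL);
    dec->next = decoder_list;
    decoder_list = dec;
}

/* Return the decoder registered for sec_type, or NULL if none matches. */
struct ns_decoder *find_decoder(const char *sec_type)
{
    struct ns_decoder *d;

    for (d = decoder_list; d != NULL; d = d->next)
        if (strcmp(d->sec_type, sec_type) == 0)
            return d;
    return NULL;
}
```

Each decoder module would define its own static descriptor and pass its address to register_decoder().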
4 years agorasdaemon: add support for hisilicon common section decoder
Xiaofei Tan [Mon, 27 Jul 2020 07:38:39 +0000 (15:38 +0800)]
rasdaemon: add support for hisilicon common section decoder

Add a new non-standard error section, the Hisilicon common section.
It is defined for the next-generation SoC Kunpeng930. It also supports
Kunpeng920, and some modules of Kunpeng920 could be changed to use
this section.

The code goes into a new source file, as it supports multiple hardware
platforms. Some hip08 code can be shared, so it is moved to this new file.

Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
4 years agorasdaemon: delete the code of non-standard error decoder for hip07
Xiaofei Tan [Mon, 27 Jul 2020 07:38:38 +0000 (15:38 +0800)]
rasdaemon: delete the code of non-standard error decoder for hip07

Delete the non-standard error decoder code for hip07, which was never
used because the corresponding code in the Linux kernel wasn't accepted.

Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
4 years agorasdaemon: delete the duplicate code about the definition of hip08 DB fields
Xiaofei Tan [Mon, 27 Jul 2020 07:38:37 +0000 (15:38 +0800)]
rasdaemon: delete the duplicate code about the definition of hip08 DB fields

Delete the duplicate definition of DB fields for hip08 OEM event
format1 and format2, because the two OEM event formats are the same.

Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
4 years agorasdaemon: Add error decoding for new SMCA Load Store bank type
Muralidhara M K [Mon, 13 Jan 2020 13:42:06 +0000 (19:12 +0530)]
rasdaemon: Add error decoding for new SMCA Load Store bank type

Future Scalable Machine Check Architecture (SMCA) systems will have a
new Load Store bank type.

Add the new type's (HWID, McaType) ID and error decoding.

Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com>
[ Adjust commit message. ]
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
4 years agorasdaemon: Fix "ignoring return value" build warning.
Yazen Ghannam [Fri, 8 May 2020 14:51:29 +0000 (14:51 +0000)]
rasdaemon: Fix "ignoring return value" build warning.

The following build warning is given:

ras-diskerror-handler.c: In function ras_diskerror_event_handler:
ras-diskerror-handler.c:98:2:
warning: ignoring return value of asprintf, declared with attribute warn_unused_result [-Wunused-result]
  asprintf(&ev.dev, "%u:%u", major(dev), minor(dev));
  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Check the return value of asprintf() to avoid the warning.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
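
Checking the asprintf() return value, as the commit above does, can be sketched like this. The wrapper name format_dev() is hypothetical; the format string and the major()/minor() calls mirror the line quoted in the warning. The sketch assumes glibc, where asprintf() needs _GNU_SOURCE and the device macros live in <sys/sysmacros.h>.

```c
#define _GNU_SOURCE             /* asprintf() is a GNU/BSD extension */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/sysmacros.h>      /* major(), minor(), makedev() on glibc */
#include <sys/types.h>

/*
 * Hypothetical wrapper: format a device number as "major:minor".
 * Returns a malloc'ed string the caller must free(), or NULL on failure.
 */
char *format_dev(dev_t dev)
{
    char *out = NULL;

    /* asprintf() returns -1 on failure; `out` must not be used then. */
    if (asprintf(&out, "%u:%u", major(dev), minor(dev)) < 0)
        return NULL;
    return out;
}
```

Returning NULL on failure lets the event handler skip the record instead of dereferencing an undefined pointer.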
4 years agoMatch rankX in ras-mc-ctl
Cong Wang [Fri, 28 Feb 2020 20:37:15 +0000 (12:37 -0800)]
Match rankX in ras-mc-ctl

According to kernel doc:
https://www.kernel.org/doc/html/v4.10/admin-guide/ras.html
mcX directory contains either dimmX or rankX directories.

Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
4 years agoFix a typo in ras-mc-ctl
Cong Wang [Fri, 28 Feb 2020 00:24:06 +0000 (16:24 -0800)]
Fix a typo in ras-mc-ctl

Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
4 years agoadded label for A2SDi-8C-HLN4F
A. Binzxxxxxx [Mon, 18 Nov 2019 18:24:59 +0000 (19:24 +0100)]
added label for A2SDi-8C-HLN4F

added label for A2SDi-8C-HLN4F

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
4 years agoras-mc-ctl: PCIe AER: display PCIe dev name
dann frazier [Tue, 21 Apr 2020 21:56:04 +0000 (15:56 -0600)]
ras-mc-ctl: PCIe AER: display PCIe dev name

Storage of PCIe dev name was added in commit 8e96ca2c1c59 ("rasdaemon:
store PCIe dev name and TLP header for the aer event"). This makes
ras-mc-ctl extract and emit it like so:

PCIe AER events:
1 2020-04-16 22:09:48 +0000 0000:0b:00.0 Corrected error: Receiver Error
2 2020-04-16 22:23:24 +0000 0000:0b:00.0 Corrected error: Receiver Error
3 2020-04-17 23:00:37 +0000 0000:d9:01.0 Corrected error: Advisory Non-Fatal, BIT15
4 2020-04-17 23:21:52 +0000 0000:d9:01.0 Corrected error: Advisory Non-Fatal
5 2020-04-18 02:04:24 +0000 0000:5e:00.0 Corrected error: Receiver Error

Signed-off-by: Dann Frazier <dann.frazier@canonical.com>
Tested-by: Shiju Jose <shiju.jose@huawei.com>
4 years agoMakefile.am: fix install of misc/rasdaemon.env v0.6.6
Mauro Carvalho Chehab [Tue, 21 Jul 2020 12:15:41 +0000 (14:15 +0200)]
Makefile.am: fix install of misc/rasdaemon.env

The logic added by the previous patch didn't work properly.

Change it to pack misc/rasdaemon.env when creating a
tarball and install it via "make install" target.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
4 years agoBump to version 0.6.6
Mauro Carvalho Chehab [Tue, 21 Jul 2020 11:32:21 +0000 (13:32 +0200)]
Bump to version 0.6.6

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
4 years agorasdaemon.spec: Fix a bogus date
Mauro Carvalho Chehab [Tue, 21 Jul 2020 11:36:09 +0000 (13:36 +0200)]
rasdaemon.spec: Fix a bogus date

RPM build errors:
    bogus date in %changelog: Fri Oct 10 2019 Mauro Carvalho Chehab <mchehab+samsung@kernel.org>  0.6.4-1
    Bad exit status from /var/tmp/rpm-tmp.MRqZEZ (%install)

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
4 years agorasdaemon: add support for memory Corrected Error predictive failure analysis
wuyun [Sat, 20 Jun 2020 12:26:22 +0000 (20:26 +0800)]
rasdaemon: add support for memory Corrected Error predictive failure analysis

Memory Corrected Errors are corrected by hardware. These errors do not
require immediate software actions, but are still reported for
accounting and predictive failure analysis.

Based on statistical results, some actions can be taken to prevent
a Corrected Error from evolving into an Uncorrected Error.

Signed-off-by: wuyun <wuyun.wu@huawei.com>
Signed-off-by: lvying6 <lvying6@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
4 years agorasdaemon: add rbtree support for page record
wuyun [Sat, 20 Jun 2020 12:26:21 +0000 (20:26 +0800)]
rasdaemon: add rbtree support for page record

The rbtree is very efficient for recording and querying fault page info.

Signed-off-by: wuyun <wuyun.wu@huawei.com>
Signed-off-by: lvying6 <lvying6@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
4 years agorasdaemon: fix the issue that non standard decoder can't work in pthread way
Xiaofei Tan [Wed, 27 May 2020 08:02:33 +0000 (16:02 +0800)]
rasdaemon: fix the issue that non standard decoder can't work in pthread way

The non standard decoding functions are registered in app init process
through __attribute__((constructor)), and unregistered in app exit process
through __attribute__((destructor)). We don't need to unregister them
in any other steps. This patch removes these unnecessary unregister calls.

Fixes: 78a21c1e9770 ("rasdaemon: add closure and cleanups for the database")
Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
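
The constructor/destructor mechanism mentioned in the commit above is a GCC/Clang extension rather than standard C. A minimal sketch, with a hypothetical counter standing in for the real decoder registry:

```c
#include <assert.h>

static int registered_decoders;     /* hypothetical registry state */

/* Runs automatically before main() starts. */
static void __attribute__((constructor)) decoder_init(void)
{
    registered_decoders++;
}

/* Runs automatically after main() returns or exit() is called. */
static void __attribute__((destructor)) decoder_exit(void)
{
    registered_decoders--;
}

/* Number of decoders currently registered. */
int decoder_count(void)
{
    return registered_decoders;
}
```

Because the destructor already runs at process exit, an extra unregister call elsewhere would remove the decoder while the program may still need it.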
4 years agorasdaemon: add support of l3tag and l3data in hip08 OEM format2
Xiaofei Tan [Wed, 27 May 2020 08:02:32 +0000 (16:02 +0800)]
rasdaemon: add support of l3tag and l3data in hip08 OEM format2

The two modules, l3tag and l3data, were originally reported through the
"ARM processor error section". That is not suitable, because l3tag and
l3data don't belong to any single CPU core. So change them to use OEM
format2.

Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
5 years agorasdaemon: fix error handling in ras_mc_event_opendb()
Aristeu Rozanski [Tue, 7 Jan 2020 19:49:19 +0000 (14:49 -0500)]
rasdaemon: fix error handling in ras_mc_event_opendb()

Found with covscan that the return value from ras_mc_prepare_stmt() and from
ras_mc_event_opendb() itself aren't checked.

Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
5 years agorasdaemon: add default branch for switch statement
Xiaofei Tan [Tue, 26 Nov 2019 13:23:06 +0000 (14:23 +0100)]
rasdaemon: add default branch for switch statement

Add a default branch to the switch statements that were missing
one.

CC: Xiaofei Tan <tanxiaofei@huawei.com>, <linuxarm@huawei.com>, <shiju.jose@huawei.com>, <jonathan.cameron@huawei.com> Date: Tue, 26 Nov 2019 20:12:36 +0800
Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
5 years agorasdaemon: fix case style issues for enum constant
Xiaofei Tan [Tue, 26 Nov 2019 13:22:21 +0000 (14:22 +0100)]
rasdaemon: fix case style issues for enum constant

Change lowercase letters of enum constant to uppercase ones.

CC: Xiaofei Tan <tanxiaofei@huawei.com>, <linuxarm@huawei.com>, <shiju.jose@huawei.com>, <jonathan.cameron@huawei.com> Date: Tue, 26 Nov 2019 20:12:35 +0800
Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
5 years agorasdaemon: replace sprintf with snprintf for hip08
Xiaofei Tan [Tue, 26 Nov 2019 13:21:11 +0000 (14:21 +0100)]
rasdaemon: replace sprintf with snprintf for hip08

Replace sprintf with snprintf for hip08 to improve reliability.
Besides, add a bounds check for the buffer pointer.

CC: Xiaofei Tan <tanxiaofei@huawei.com>, <linuxarm@huawei.com>, <shiju.jose@huawei.com>, <jonathan.cameron@huawei.com> Date: Tue, 26 Nov 2019 20:12:34 +0800
Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
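
The kind of change described in the commit above, snprintf() plus a bounds check on the write position, can be sketched as follows. The helper append_field() and its "name=0xvalue " output format are hypothetical and are not taken from the hip08 decoder.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: append "name=0xvalue " to buf at offset *pos
 * without writing past `size` bytes. Returns 0 on success, or -1 if
 * the text did not fit; buf stays NUL-terminated in both cases.
 */
int append_field(char *buf, size_t size, size_t *pos,
                 const char *name, unsigned int value)
{
    int n;

    assert(buf != NULL && pos != NULL && name != NULL);

    if (*pos >= size)
        return -1;              /* no room left, not even for '\0' */

    n = snprintf(buf + *pos, size - *pos, "%s=0x%x ", name, value);
    if (n < 0 || (size_t)n >= size - *pos) {
        buf[*pos] = '\0';       /* drop the truncated fragment */
        return -1;
    }

    *pos += (size_t)n;
    return 0;
}
```

Unlike sprintf(), the call can never write beyond the end of the buffer, and the return value tells the caller when a field was dropped.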
5 years agorasdaemon: fix magic number issues reported by static code analysis for hip08
Xiaofei Tan [Tue, 26 Nov 2019 13:20:06 +0000 (14:20 +0100)]
rasdaemon: fix magic number issues reported by static code analysis for hip08

Fix magic number issues reported by static code analysis for hip08.

CC: Xiaofei Tan <tanxiaofei@huawei.com>, <linuxarm@huawei.com>, <shiju.jose@huawei.com>, <jonathan.cameron@huawei.com> Date: Tue, 26 Nov 2019 20:12:33 +0800
Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
5 years agorasdaemon: split PCIe local table decode function to reduce length
Xiaofei Tan [Tue, 26 Nov 2019 13:19:29 +0000 (14:19 +0100)]
rasdaemon: split PCIe local table decode function to reduce length

This patch splits the function decode_hip08_pcie_local_error() to
reduce its length, moving header decoding and register dump into
separate functions.

CC: Xiaofei Tan <tanxiaofei@huawei.com>, <linuxarm@huawei.com>, <shiju.jose@huawei.com>, <jonathan.cameron@huawei.com> Date: Tue, 26 Nov 2019 20:12:32 +0800
Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
5 years agorasdaemon: split OEM type2 table decode function to reduce length
Xiaofei Tan [Tue, 26 Nov 2019 13:18:55 +0000 (14:18 +0100)]
rasdaemon: split OEM type2 table decode function to reduce length

This patch splits the function decode_hip08_oem_type2_error() to
reduce its length, moving header decoding and register dump into
separate functions.

CC: Xiaofei Tan <tanxiaofei@huawei.com>, <linuxarm@huawei.com>, <shiju.jose@huawei.com>, <jonathan.cameron@huawei.com> Date: Tue, 26 Nov 2019 20:12:31 +0800
Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
5 years agorasdaemon: split OEM type1 table decode function to reduce length
Xiaofei Tan [Tue, 26 Nov 2019 13:18:09 +0000 (14:18 +0100)]
rasdaemon: split OEM type1 table decode function to reduce length

This patch splits the function decode_hip08_oem_type1_error() to
reduce its length, moving header decoding and register dump into
separate functions.

CC: Xiaofei Tan <tanxiaofei@huawei.com>, <linuxarm@huawei.com>, <shiju.jose@huawei.com>, <jonathan.cameron@huawei.com> Date: Tue, 26 Nov 2019 20:12:30 +0800
Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
5 years agorasdaemon: fix sub module name of HHA and DDRC for hip08
Xiaofei Tan [Tue, 26 Nov 2019 13:17:34 +0000 (14:17 +0100)]
rasdaemon: fix sub module name of HHA and DDRC for hip08

Fix sub module name of HHA and DDRC for hip08, and add const to the
pointer parameter 'name' of step_vendor_data_tab().

CC: Xiaofei Tan <tanxiaofei@huawei.com>, <linuxarm@huawei.com>, <shiju.jose@huawei.com>, <jonathan.cameron@huawei.com> Date: Tue, 26 Nov 2019 20:12:29 +0800
Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
5 years agorasdaemon: decode submodule of OEM type1 for hip08
Xiaofei Tan [Tue, 26 Nov 2019 13:16:57 +0000 (14:16 +0100)]
rasdaemon: decode submodule of OEM type1 for hip08

Decode the submodule of OEM type1 for hip08, and rework the functions
for getting the OEM module name and submodule name.

CC: Xiaofei Tan <tanxiaofei@huawei.com>, <linuxarm@huawei.com>, <shiju.jose@huawei.com>, <jonathan.cameron@huawei.com> Date: Tue, 26 Nov 2019 20:12:28 +0800
Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
5 years ago rasdaemon: fix a warning reported by PC-Lint
Xiaofei Tan [Mon, 25 Nov 2019 09:38:45 +0000 (10:38 +0100)]
rasdaemon: fix a warning reported by PC-Lint

This patch fixes the following warning, with no functional change:

Warning -- Storage class specified after a type

CC: Xiaofei Tan <tanxiaofei@huawei.com>, <linuxarm@huawei.com>, <shiju.jose@huawei.com>, <jonathan.cameron@huawei.com>
Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
5 years ago rasdaemon: fix the wrong declaration of 'struct ras_events' in ras-record.h
Xiaofei Tan [Mon, 25 Nov 2019 09:33:24 +0000 (10:33 +0100)]
rasdaemon: fix the wrong declaration of 'struct ras_events' in ras-record.h

PC-Lint reports the following warning when running static code
analysis on the file non-standard-hisi_hip08.c:

Warning -- Declaration of symbol 'ras' hides symbol 'ras' (line 28, file ras-record.h)

This means that the local variable name 'ras' is the same as that of
a global variable. In fact, there is no global variable named 'ras',
only a wrong declaration in ras-record.h.

CC: Xiaofei Tan <tanxiaofei@huawei.com>, <linuxarm@huawei.com>, <shiju.jose@huawei.com>, <jonathan.cameron@huawei.com>
Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
5 years agorasdaemon: add support for new AMD SMCA bank types
Brian WoodsGhannam, Yazen [Fri, 1 Nov 2019 14:48:14 +0000 (15:48 +0100)]
rasdaemon: add support for new AMD SMCA bank types

Going forward, the Scalable Machine Check Architecture (SMCA) has some
updated and additional bank types which show up in Zen2.  The differing
bank types include: CS_V2, PSP_V2, SMU_V2, MP5, NBIO, and PCIE.  The V2
bank types replace the original bank types but have unique HWID/MCAtype
IDs from the originals so there's no conflicts between different
versions or other bank types.  All of the differing bank types have new
MCE descriptions which have been added as well.

CC: "mchehab+samsung@kernel.org" <mchehab+samsung@kernel.org>, "Namburu, Chandu-babu" <chandu@amd.com> # Thread-Topic: [PATCH 2/2] rasdaemon: add support for new AMD SMCA bank types
Signed-off-by: Brian Woods <brian.woods@amd.com>
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Cc: Chandu-babu Namburu <chandu@amd.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
5 years agorasdaemon: rename CPU_NAPLES cputype
Brian WoodsGhannam, Yazen [Fri, 1 Nov 2019 14:48:13 +0000 (15:48 +0100)]
rasdaemon: rename CPU_NAPLES cputype

Change CPU_NAPLES to CPU_AMD_SMCA to reflect that it isn't just NAPLES
that is supported, but AMD's Scalable Machine Check Architecture (SMCA).

  [ Yazen: change family check to feature check, and change CPU name. ]

CC: "mchehab+samsung@kernel.org" <mchehab+samsung@kernel.org>, "Namburu, Chandu-babu" <chandu@amd.com> # Thread-Topic: [PATCH 1/2] rasdaemon: rename CPU_NAPLES cputype
Signed-off-by: Brian Woods <brian.woods@amd.com>
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Cc: Chandu-babu Namburu <chandu@amd.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
5 years agoBump to version 0.6.5 v0.6.5
Mauro Carvalho Chehab [Wed, 20 Nov 2019 04:34:28 +0000 (05:34 +0100)]
Bump to version 0.6.5

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
5 years agorasdaemon: store PCIe dev name and TLP header for the aer event
Shiju Jose [Wed, 13 Nov 2019 16:31:13 +0000 (16:31 +0000)]
rasdaemon: store PCIe dev name and TLP header for the aer event

This patch adds logging and recording of the PCIe dev name and the
TLP header for the aer event.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
5 years agorasdaemon: fix for the ras-record.c:ras_mc_prepare_stmt() failure when new fields...
Shiju Jose [Wed, 13 Nov 2019 16:31:12 +0000 (16:31 +0000)]
rasdaemon: fix for the ras-record.c:ras_mc_prepare_stmt() failure when new fields added to the sql table

rasdaemon fails in the ras_mc_prepare_stmt() function when new fields are
added to the table's db_fields on top of the existing sql table in the
system.

This patch adds a solution for this issue.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
5 years agorasdaemon: add signal handling for the cleanup
Shiju Jose [Wed, 16 Oct 2019 16:34:01 +0000 (17:34 +0100)]
rasdaemon: add signal handling for the cleanup

Presently rasdaemon does not free allocated memory or do other
cleanup when it is closed with ctrl+c, kill, etc.
This patch adds handling of the signals SIGINT, SIGTERM, SIGHUP
and SIGQUIT, and does the necessary cleanup when one of those
signals is received.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
5 years agorasdaemon: add closure and cleanups for the database
Shiju Jose [Wed, 16 Oct 2019 16:34:00 +0000 (17:34 +0100)]
rasdaemon: add closure and cleanups for the database

This patch adds closure and cleanups for the sqlite3 database.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
5 years agorasdaemon: delete multiple definitions of ARRAY_SIZE
Shiju Jose [Wed, 16 Oct 2019 16:33:59 +0000 (17:33 +0100)]
rasdaemon: delete multiple definitions of ARRAY_SIZE

This patch deletes the multiple definitions of ARRAY_SIZE and
moves the definition to a common file.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
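
For readers unfamiliar with the macro in the commit above, the usual C idiom looks like the sketch below. The exact definition used in rasdaemon's common header is not shown in this log, so treat this as the conventional form rather than a copy; severity_names and count_severity_names() are hypothetical and exist only to exercise the macro.

```c
#include <stddef.h>

/* Number of elements in a true array. It gives a wrong answer if
 * handed a pointer, so it must only be used on array objects. */
#ifndef ARRAY_SIZE
#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
#endif

/* Hypothetical table, used only for illustration. */
static const char *const severity_names[] = {
	"corrected", "recoverable", "fatal",
};

static size_t count_severity_names(void)
{
	return ARRAY_SIZE(severity_names);
}
```

Defining the macro once in a shared header, with an #ifndef guard, avoids the redefinition clashes that appear when several source files each carry their own copy.
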
5 years agorasdaemon: fix memory leak in ras-events.c:add_event_handler()
Shiju Jose [Wed, 16 Oct 2019 16:33:58 +0000 (17:33 +0100)]
rasdaemon: fix memory leak in ras-events.c:add_event_handler()

This patch rearranges the free(page) call to prevent the
memory leak when __toggle_ras_mc_event() fails.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
5 years agorasdaemon: fix missing fclose in ras-events.c:select_tracing_timestamp()
Shiju Jose [Wed, 16 Oct 2019 16:33:57 +0000 (17:33 +0100)]
rasdaemon: fix missing fclose in ras-events.c:select_tracing_timestamp()

This patch adds the missing fclose() in select_tracing_timestamp()
on the failure path taken when /proc/uptime can't be parsed.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
5 years agorasdaemon: fix memory leak in ras-events.c:handle_ras_events()
Shiju Jose [Wed, 16 Oct 2019 16:33:56 +0000 (17:33 +0100)]
rasdaemon: fix memory leak in ras-events.c:handle_ras_events()

This patch fixes a memory leak in handle_ras_events()
when it fails to trace all supported RAS events.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
5 years agorasdaemon: fix cleanup issues in ras-events.c:read_ras_event_all_cpus()
Shiju Jose [Wed, 16 Oct 2019 16:33:55 +0000 (17:33 +0100)]
rasdaemon: fix cleanup issues in ras-events.c:read_ras_event_all_cpus()

This patch fixes memory leaks and closes the open files if the
open_trace() or read(fds[i].fd, page, pdata[i].ras->page_size)
function calls fail.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
5 years agoBump to version 0.6.4 v0.6.4
Mauro Carvalho Chehab [Thu, 10 Oct 2019 17:41:15 +0000 (14:41 -0300)]
Bump to version 0.6.4

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
5 years agorasdaemon: support three more modules for OEM type1 error for hip08
Xiaofei Tan [Tue, 8 Oct 2019 12:38:58 +0000 (20:38 +0800)]
rasdaemon: support three more modules for OEM type1 error for hip08

Support three more modules for OEM type1 error for hip08. They are
RDE, GIC and USB.

Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
5 years agorasdaemon: add timestamp for hip08 OEM error records in sqlite3 DB
Xiaofei Tan [Tue, 8 Oct 2019 12:38:57 +0000 (20:38 +0800)]
rasdaemon: add timestamp for hip08 OEM error records in sqlite3 DB

This patch does two things:
1. Add timestamp for hip08 OEM error records in sqlite3 DB.
2. Add suffix "_v2" for hip08 OEM event names to keep compatibility
   with old sqlite3 DB.

Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
5 years agorasdaemon: change submodule ID of sqlite3 DB field to text for hip08
Xiaofei Tan [Tue, 8 Oct 2019 12:38:56 +0000 (20:38 +0800)]
rasdaemon: change submodule ID of sqlite3 DB field to text for hip08

Change submodule ID of sqlite3 DB field from integer to text for hip08
to make it easier for the user to understand.

For example, from:
INSERT INTO hip08_oem_type2_event VALUES(1,1,1,0,0,'SMMU',2,
'corrected','');

change to:
INSERT INTO hip08_oem_type2_event VALUES(1,1,1,0,0,'SMMU','MGMT_SMMU',
'corrected','');

Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
5 years agorasdaemon: add underscore(_) for some logging item names for hip08
Xiaofei Tan [Tue, 8 Oct 2019 12:38:55 +0000 (20:38 +0800)]
rasdaemon: add underscore(_) for some logging item names for hip08

Add underscore(_) for some logging item names for hip08. Then we can
match and catch specific fields of the log easily if needed.

Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
5 years agorasdaemon: optimize sqlite3 DB record of register fields for hip08
Xiaofei Tan [Tue, 8 Oct 2019 12:38:54 +0000 (20:38 +0800)]
rasdaemon: optimize sqlite3 DB record of register fields for hip08

Optimize sqlite3 DB record of register fields for hip08 by combining
all register fields to one text field, which will include register name.
This will make the record easier to read.

For example, from:
INSERT INTO hip08_oem_type2_event VALUES(1,1,1,0,0,'SMMU',2,'corrected',
273058,0,-1,0,1308622858,0,0,0,0,133,0,0,NULL);

change to:
INSERT INTO hip08_oem_type2_event VALUES(1,1,1,0,0,'SMMU',2,'corrected',
'ERR_FR_0=0x42aa2 ERR_FR_1=0x0 ERR_CTRL_0=0xffffffff ERR_CTRL_1=0x0
ERR_STATUS_0=0x4e00000a ERR_STATUS_1=0x0 ERR_ADDR_0=0x0, ERR_ADDR_1=0x0
ERR_MISC0_0=0x0 ERR_MISC0_1=0x90 ERR_MISC1_0=0x0 ERR_MISC1_1=0x0');

Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
5 years agorasdaemon: fix the issue of sqlite3 integer bind parameter mismatch
Xiaofei Tan [Thu, 8 Aug 2019 02:14:30 +0000 (10:14 +0800)]
rasdaemon: fix the issue of sqlite3 integer bind parameter mismatch

Some integer fields of arm_event and mc_event are 8 bytes wide,
so sqlite3_bind_int64() should be used when storing the event in
sqlite3. But the current code uses sqlite3_bind_int(). This
leads to a wrong value in the sqlite3 DB.

This patch is to fix the issue.

Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
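
The mismatch in the commit above comes from the parameter types: sqlite3_bind_int() takes a plain int value, while sqlite3_bind_int64() takes a 64-bit sqlite3_int64. The sketch below imitates that narrowing without linking against SQLite. through_int_param() and through_int64_param() are hypothetical stand-ins, and the example assumes a platform where int is 32 bits wide.

```c
#include <stdint.h>

/* Stand-in for passing a value through an `int` parameter, which is
 * what sqlite3_bind_int() accepts. Converting an out-of-range value
 * to int is implementation-defined; common compilers keep only the
 * low 32 bits. */
static int64_t through_int_param(int64_t value)
{
	int narrowed = (int)value;

	return narrowed;
}

/* Stand-in for a 64-bit parameter, which is what
 * sqlite3_bind_int64() accepts: the value arrives unchanged. */
static int64_t through_int64_param(int64_t value)
{
	return value;
}
```

A small value such as 42 survives both paths, but an 8-byte field holding something like 0x123456789 only comes back intact from the 64-bit path, which is why the wider bind call is needed for those fields.
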
5 years agoREADME: updated instructions about sending patches
Mauro Carvalho Chehab [Wed, 4 Sep 2019 23:56:35 +0000 (20:56 -0300)]
README: updated instructions about sending patches

The instructions there are a little outdated. Sergio
suggested changing just my e-mail, but let's do a better job
and use my canonical e-mail (mchehab@kernel.org), plus add the
alternative of sending patches against either github or gitlab.

Suggested-by: Sergio Gelato <sergio.gelato@astro.su.se>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
5 years agoFix URLs to git.kernel.org repositories in README file
Sergio Gelato [Wed, 19 Sep 2018 15:10:27 +0000 (12:10 -0300)]
Fix URLs to git.kernel.org repositories in README file

Some of the URLs to repositories on git.kernel.org were out of date and
non-functional. This commit replaces them with working alternatives.

Signed-off-by: Sergio Gelato <Sergio.Gelato@astro.su.se>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
5 years agofix file descriptor leak in ras-report.c:setup_report_socket()
Sergio Gelato [Wed, 19 Sep 2018 14:59:35 +0000 (11:59 -0300)]
fix file descriptor leak in ras-report.c:setup_report_socket()

A running instance of rasdaemon was seen to hit the limit on open file
descriptors. Most of the descriptors were AF_UNIX STREAM sockets.
At the same time the limit was hit, attempts by rasdaemon to open the
SQLite database started failing with SQLite error 14.

This patch avoids leaking a socket file descriptor each time the connect()
call fails.

Signed-off-by: Sergio Gelato <Sergio.Gelato@astro.su.se>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
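
The leak described in the commit above follows a common pattern: socket() succeeds, connect() fails, and the function returns without closing the descriptor. The sketch below shows the corrected shape under stated assumptions. connect_unix_socket() is a hypothetical helper, not the body of setup_report_socket(), and the socket path is whatever the caller supplies.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Returns a connected descriptor, or -1 on failure with no
 * descriptor left open. */
static int connect_unix_socket(const char *path)
{
	struct sockaddr_un addr;
	int fd;

	if (strlen(path) >= sizeof(addr.sun_path))
		return -1;

	fd = socket(AF_UNIX, SOCK_STREAM, 0);
	if (fd < 0)
		return -1;

	memset(&addr, 0, sizeof(addr));
	addr.sun_family = AF_UNIX;
	strcpy(addr.sun_path, path);	/* length checked above */

	if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		close(fd);	/* without this, each failed attempt leaks one fd */
		return -1;
	}
	return fd;
}
```

In a long-running daemon that retries the connection for every reported event, one leaked descriptor per failure is enough to reach the per-process limit eventually, after which unrelated open() calls start failing as well.
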
5 years agoparse_ras_data: initialize record.cpu before pevent_print_event().
Sergio Gelato [Wed, 19 Sep 2018 14:58:31 +0000 (11:58 -0300)]
parse_ras_data: initialize record.cpu before pevent_print_event().

pevent_print_event() prints record.cpu; make sure it's initialized.
The cpu field from pthread_data is my best guess at a suitable value:
parse_ras_data() was already printing it separately.

Signed-off-by: Sergio Gelato <Sergio.Gelato@astro.su.se>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
5 years agoparse_ras_data: flush trace buffer immediately, not on next call
Sergio Gelato [Wed, 19 Sep 2018 14:57:42 +0000 (11:57 -0300)]
parse_ras_data: flush trace buffer immediately, not on next call

parse_ras_data() was calling fflush() before, not after printf().
As a result, information about an event would not be printed
immediately but possibly much later.

Signed-off-by: Sergio Gelato <Sergio.Gelato@astro.su.se>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
5 years agoReplace whitespaces by tabs
Mauro Carvalho Chehab [Wed, 4 Sep 2019 23:44:55 +0000 (20:44 -0300)]
Replace whitespaces by tabs

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
5 years agoras-diskerror: dev_t is in sys/types.h in musl
Henrik Riomar [Thu, 29 Aug 2019 06:54:56 +0000 (08:54 +0200)]
ras-diskerror: dev_t is in sys/types.h in musl

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
5 years agoBump to version 0.6.3 v0.6.3
Mauro Carvalho Chehab [Fri, 23 Aug 2019 11:01:39 +0000 (08:01 -0300)]
Bump to version 0.6.3

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
5 years agoconfigure.ac: add an option to enable all features
Mauro Carvalho Chehab [Fri, 23 Aug 2019 11:26:24 +0000 (08:26 -0300)]
configure.ac: add an option to enable all features

At least for build testing, an option to enable everything
can be handy.

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
5 years agoAdd newline to summary to match errors output
Geoff Winterbourne [Thu, 25 Jul 2019 20:13:50 +0000 (14:13 -0600)]
Add newline to summary to match errors output

5 years agoSwitch to kernel filters for block_rq_complete
Cong Wang [Thu, 13 Jun 2019 18:51:39 +0000 (11:51 -0700)]
Switch to kernel filters for block_rq_complete

Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
5 years agoAdd disk I/O error monitoring
Cong Wang [Wed, 12 Jun 2019 20:24:49 +0000 (13:24 -0700)]
Add disk I/O error monitoring

Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
5 years agoMake event filter type specific
Cong Wang [Wed, 12 Jun 2019 22:06:37 +0000 (15:06 -0700)]
Make event filter type specific

struct ras_events passed via context pointer is not per event,
therefore the per event filter must be specific to each event.

Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
5 years agoras-mce-handler: Add support for Hygon Dhyana family 18h processor
Pu Wen [Thu, 23 May 2019 13:00:22 +0000 (21:00 +0800)]
ras-mce-handler: Add support for Hygon Dhyana family 18h processor

The Hygon Dhyana family 18h processor is derived from AMD family 17h.
The Hygon Dhyana support to Linux is already accepted upstream[1].

Add Hygon Dhyana support to mce handler of rasdaemon in order to handle
MCE events on Hygon Dhyana platforms.

Reference:
[1] https://git.kernel.org/tip/fec98069fb72fb656304a3e52265e0c2fc9adf87

Signed-off-by: Pu Wen <puwen@hygon.cn>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
5 years agoFix Perl warnings in ras-mc-ctl
Cong Wang [Thu, 13 Jun 2019 05:26:20 +0000 (22:26 -0700)]
Fix Perl warnings in ras-mc-ctl

Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
5 years agorasdaemon:add logging HiSilicon HIP08 PCIe local errors
Shiju Jose [Mon, 17 Jun 2019 14:28:52 +0000 (15:28 +0100)]
rasdaemon:add logging HiSilicon HIP08 PCIe local errors

This patch adds logging for the HiSilicon HIP08 PCIe local errors.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
5 years agorasdaemon:add logging HiSilicon HIP08 H/W errors reported in the OEM format2
Shiju Jose [Mon, 17 Jun 2019 14:28:51 +0000 (15:28 +0100)]
rasdaemon:add logging HiSilicon HIP08 H/W errors reported in the OEM format2

This patch adds logging of the HiSilicon HIP08 H/W errors reported
in the non-standard OEM format2.
These errors are from the H/W modules SMMU, HHA, HLLC, PA and DDRC.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
5 years agorasdaemon:add logging HiSilicon HIP08 H/W errors reported in the OEM format1
Shiju Jose [Mon, 17 Jun 2019 14:28:50 +0000 (15:28 +0100)]
rasdaemon:add logging HiSilicon HIP08 H/W errors reported in the OEM format1

This patch adds logging of the HiSilicon HIP08 H/W errors reported
in the non-standard OEM format1.
These errors are from the H/W modules MN, PLL, SLLC, AA, SIOE,
POE, DISP, LPC, SAS and SATA.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
5 years agorasdaemon: update iteration logic for the non-standard error decoding functions
Shiju Jose [Mon, 17 Jun 2019 14:28:49 +0000 (15:28 +0100)]
rasdaemon: update iteration logic for the non-standard error decoding functions

This patch updates the iteration logic for the non-standard
error decoding functions.

Suggested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
5 years agorasdaemon: rearrange HiSilicon HIP07 decoding function table
Shiju Jose [Mon, 17 Jun 2019 14:28:48 +0000 (15:28 +0100)]
rasdaemon: rearrange HiSilicon HIP07 decoding function table

This patch rearranges the decoding function table for the
HiSilicon HIP07 non-standard errors.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
5 years agorasdaemon:print non-standard error data if not decoded
Shiju Jose [Mon, 17 Jun 2019 14:28:47 +0000 (15:28 +0100)]
rasdaemon:print non-standard error data if not decoded

This patch changes the code to print non-standard error data
only if it is not decoded.

Suggested-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
5 years agoras-mce-handler: fix mcgstatus message print
Mauro Carvalho Chehab [Tue, 11 Jun 2019 18:01:38 +0000 (15:01 -0300)]
ras-mce-handler: fix mcgstatus message print

As warned by clang, the test there is wrong:

ras-mce-handler.c:344:9: warning: address of array 'e->mcgstatus_msg' will always evaluate to 'true' [-Wpointer-bool-conversion]
        if (e->mcgstatus_msg)
        ~~  ~~~^~~~~~~~~~~~~

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
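
The clang warning quoted in the commit above fires because mcgstatus_msg is an array member, so its address can never be NULL. A test for "is there a message" has to look at the contents instead. The sketch below shows one way to write such a test. struct example_event and has_mcgstatus_msg() are hypothetical, the buffer size is arbitrary, and the log does not show which form the actual fix took.

```c
/* Hypothetical event structure with an embedded message buffer. */
struct example_event {
	char mcgstatus_msg[64];
};

static int has_mcgstatus_msg(const struct example_event *e)
{
	/* `if (e->mcgstatus_msg)` would test the array's address, which
	 * is always true; test the first character instead. */
	return e->mcgstatus_msg[0] != '\0';
}
```

With this check an event whose buffer is empty is reported as having no message, whereas the address test would have treated every event as having one.
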
5 years agoTravis: enable all possible features
Mauro Carvalho Chehab [Tue, 11 Jun 2019 17:58:23 +0000 (14:58 -0300)]
Travis: enable all possible features

Several of those are arm-specific, but, as the goal here is just
to compile-test, enable them all.

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
5 years agoras-events: fix a warning when built without devlink
Mauro Carvalho Chehab [Tue, 11 Jun 2019 17:56:08 +0000 (14:56 -0300)]
ras-events: fix a warning when built without devlink

ras-events.c:667:8: warning: unused variable ‘filter_str’ [-Wunused-variable]
  667 |  char *filter_str = NULL;
      |        ^~~~~~~~~~

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
5 years agofix rasdaemon high CPU usage when part of CPUs offline
Ying Lv [Wed, 15 May 2019 03:15:42 +0000 (11:15 +0800)]
fix rasdaemon high CPU usage when part of CPUs offline

When part of the CPU cores are set offline, for example by setting the kernel
cmdline maxcpus = N (where N is less than the total number of system CPU cores),
the CPU usage of some rasdaemon threads gets very close to 100%.

This is because, when some CPUs are offline, poll in read_ras_event_all_cpus()
falls back to the pthread way.
The thread for an offline CPU gets a negative return value when reading trace_pipe_raw,
and that negative value is converted to a positive one because of 'unsigned size'.
So the code always goes into the 'size > 0' branch, and the CPU usage is too high.

Declaring the variable size as int makes the code take the right branch.

Fixes: eff7c9e0("ras-events: Only use pthreads for collect if poll() not available")
Reported-by: Zhipeng Xie <xiezhipeng1@huawei.com>
Signed-off-by: Ying Lv <lvying6@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
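
The signedness problem in the commit above can be reproduced in isolation. In the sketch below, has_data_buggy() stores the result of read() in an unsigned variable, as the old code did, and has_data_fixed() stores it in an int. Both function names are hypothetical, and reading from an invalid descriptor is used here only as a convenient way to make read() return -1.

```c
#include <sys/types.h>
#include <unistd.h>

/* Old behaviour: a -1 return from read() wraps to UINT_MAX, so the
 * `size > 0` test succeeds even though nothing was read. */
static int has_data_buggy(int fd, void *buf, size_t len)
{
	unsigned size = read(fd, buf, len);

	return size > 0;
}

/* Fixed behaviour: with a signed type the error return stays
 * negative and the `size > 0` test fails as intended. */
static int has_data_fixed(int fd, void *buf, size_t len)
{
	int size = read(fd, buf, len);

	return size > 0;
}
```

Called in a loop, the buggy variant never sees the failure and keeps spinning on the "data available" branch, which matches the near-100% CPU usage reported for the threads of offline CPUs.
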