www.infradead.org Git - users/mchehab/rasdaemon.git/log
2 years agoBump version to 0.7.0 libtrace
Mauro Carvalho Chehab [Sat, 21 Jan 2023 06:52:14 +0000 (07:52 +0100)]
Bump version to 0.7.0

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years ago.gitignore: add the auto-generated "compile" file
Mauro Carvalho Chehab [Sat, 21 Jan 2023 06:55:05 +0000 (07:55 +0100)]
.gitignore: add the auto-generated "compile" file

autoreconf is producing a compile file. Ignore it on git status.

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years agoINSTALL: update from latest version of it
Mauro Carvalho Chehab [Sat, 21 Jan 2023 06:54:30 +0000 (07:54 +0100)]
INSTALL: update from latest version of it

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years agoconfigure.ac: fix bashisms
Sam James [Thu, 29 Dec 2022 17:23:47 +0000 (17:23 +0000)]
configure.ac: fix bashisms

configure scripts need to be runnable with a POSIX-compliant /bin/sh.

On many (but not all!) systems, /bin/sh is provided by Bash, so errors
like this aren't spotted. Notably Debian defaults to /bin/sh provided
by dash which doesn't tolerate such bashisms as '=='.

This retains compatibility with bash.

Signed-off-by: Sam James <sam@gentoo.org>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years agolabels/asus: add ASUS TUF GAMING B450-PLUS II
dgcampea [Mon, 19 Dec 2022 18:53:13 +0000 (18:53 +0000)]
labels/asus: add ASUS TUF GAMING B450-PLUS II

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years agorasdaemon: Add four modules supported by HiSilicon common section
Xiaofei Tan [Mon, 31 Oct 2022 10:36:26 +0000 (18:36 +0800)]
rasdaemon: Add four modules supported by HiSilicon common section

Add four modules supported by HiSilicon common error section.

Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years agorasdaemon: Fix for a memory out-of-bounds issue and optimized code to remove duplicat...
Shiju Jose [Thu, 28 Apr 2022 21:59:04 +0000 (22:59 +0100)]
rasdaemon: Fix for a memory out-of-bounds issue and optimized code to remove duplicate function.

Fixed a memory out-of-bounds issue with string pointers and
optimized code structure to remove duplicate function.

Signed-off-by: Lei Feng <fenglei47@h-partners.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years agorasdaemon: ras-mc-ctl: Updated HiSilicon platform name
Shiju Jose [Thu, 28 Apr 2022 17:58:43 +0000 (18:58 +0100)]
rasdaemon: ras-mc-ctl: Updated HiSilicon platform name

Updated the HiSilicon platform name as KunPeng9xx.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years agorasdaemon: ras-mc-ctl: Relocate reading and display Kunpeng920 errors to under Kunpeng9xx
Shiju Jose [Mon, 7 Mar 2022 12:38:45 +0000 (12:38 +0000)]
rasdaemon: ras-mc-ctl: Relocate reading and display Kunpeng920 errors to under Kunpeng9xx

Relocate reading and display Kunpeng920 errors to under Kunpeng9xx.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years agorasdaemon: ras-mc-ctl: Add support to display the HiSilicon vendor errors for a speci...
Shiju Jose [Sat, 5 Mar 2022 18:19:38 +0000 (18:19 +0000)]
rasdaemon: ras-mc-ctl: Add support to display the HiSilicon vendor errors for a specified module

Add support to display the HiSilicon vendor errors for a specified module.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years agorasdaemon: ras-mc-ctl: Add printing usage if necessary parameters are not passed...
Shiju Jose [Sat, 5 Mar 2022 17:01:35 +0000 (17:01 +0000)]
rasdaemon: ras-mc-ctl: Add printing usage if necessary parameters are not passed for the vendor-error options

Add printing usage if necessary parameters are not passed
for the vendor-errors options.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years agorasdaemon: ras-mc-ctl: Reformat error info of the HiSilicon Kunpeng920
Shiju Jose [Sat, 5 Mar 2022 16:18:55 +0000 (16:18 +0000)]
rasdaemon: ras-mc-ctl: Reformat error info of the HiSilicon Kunpeng920

Reformat the code to display the error info of HiSilicon Kunpeng920.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years agorasdaemon: ras-mc-ctl: Modify error statistics for HiSilicon KunPeng9xx common errors
Shiju Jose [Thu, 24 Feb 2022 18:02:14 +0000 (18:02 +0000)]
rasdaemon: ras-mc-ctl: Modify error statistics for HiSilicon KunPeng9xx common errors

Modify the error statistics for the HiSilicon KunPeng9xx platforms common errors
to display the statistics and error info based on the module and the error severity.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years agorasdaemon: Modify recording Hisilicon common error data
Shiju Jose [Wed, 2 Mar 2022 12:20:40 +0000 (12:20 +0000)]
rasdaemon: Modify recording Hisilicon common error data

The error statistics for the HiSilicon common errors
need to be computed per module, error severity, etc.

Modify recording Hisilicon common error data as separate fields
in the sql db table instead of the combined single field.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years agorasdaemon: Support cpu fault isolation for recoverable errors
Shengwei Luo [Wed, 23 Feb 2022 09:23:27 +0000 (17:23 +0800)]
rasdaemon: Support cpu fault isolation for recoverable errors

When a recoverable error occurs in a CPU core, try to offline
that CPU core.

Signed-off-by: Shengwei Luo <luoshengwei@huawei.com>
Signed-off-by: Junchong Pan <panjunchong@hisilicon.com>
Signed-off-by: Lei Feng <fenglei47@h-partners.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years agorasdaemon: Support cpu fault isolation for corrected errors
Shengwei Luo [Wed, 23 Feb 2022 09:21:58 +0000 (17:21 +0800)]
rasdaemon: Support cpu fault isolation for corrected errors

When the number of corrected errors exceeds the configured limit
within a cycle, try to offline the related CPU core.
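
The "limit in cycle" rule can be read as a counter that restarts whenever
a cycle has elapsed. The following is only a sketch of that idea: the
struct, the function name and the exact reset policy are assumptions, not
rasdaemon's actual accounting code.

```c
#include <time.h>

/* Hypothetical per-core bookkeeping for corrected errors (CEs). */
struct ce_window {
	time_t start;        /* when the current cycle began */
	unsigned int count;  /* CEs seen in the current cycle */
};

/*
 * Record one corrected error seen at time "now".
 * Returns 1 when the number of CEs in the current cycle exceeds
 * "threshold" (the caller would then try to offline the core),
 * 0 otherwise. A new cycle starts once "cycle" seconds have passed.
 */
static int ce_should_isolate(struct ce_window *w, time_t now,
			     time_t cycle, unsigned int threshold)
{
	if (w->count == 0 || now - w->start >= cycle) {
		w->start = now;
		w->count = 0;
	}
	w->count++;
	return w->count > threshold;
}
```

With a threshold of 2 and a 10 second cycle, the third error inside the
same cycle is the first one that requests isolation.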

Signed-off-by: Shengwei Luo <luoshengwei@huawei.com>
Signed-off-by: Junchong Pan <panjunchong@hisilicon.com>
Signed-off-by: Lei Feng <fenglei47@h-partners.com>
Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years agorasdaemon: ras-memory-failure-handler: handle localtime() failure correctly
Aristeu Rozanski [Thu, 19 Jan 2023 13:45:57 +0000 (08:45 -0500)]
rasdaemon: ras-memory-failure-handler: handle localtime() failure correctly

We could just have an empty string but keeping the format could prevent
issues if someone is actually parsing this.
Found with covscan.

v2: fixed the timestamp as pointed out by Robert Elliott
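
A minimal sketch of the idea, assuming a helper named format_event_time()
and a placeholder timestamp chosen for illustration (neither is taken from
the patch): when localtime() fails, emit a string with the usual field
layout instead of an empty one.

```c
#include <stdio.h>
#include <time.h>

/*
 * Format "now" as "YYYY-MM-DD HH:MM:SS +ZZZZ" into buf.
 * If localtime() or strftime() fails, fall back to a fixed placeholder
 * with the same layout, so anything parsing the output still sees
 * well-formed columns. The placeholder value is an assumption.
 */
static const char *format_event_time(time_t now, char *buf, size_t len)
{
	struct tm *tm;

	if (len == 0)
		return "";

	tm = localtime(&now);
	if (tm && strftime(buf, len, "%Y-%m-%d %H:%M:%S %z", tm) > 0)
		return buf;

	snprintf(buf, len, "1970-01-01 00:00:00 +0000");
	return buf;
}
```

The caller always gets a printable string back and never has to check
for NULL.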

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years agorasdaemon: mce-amd-smca: properly limit bank types
Aristeu Rozanski [Thu, 19 Jan 2023 13:45:57 +0000 (08:45 -0500)]
rasdaemon: mce-amd-smca: properly limit bank types

Found with covscan.

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years agorasdaemon: ras-report: fix possible but unlikely file descriptor leak
Aristeu Rozanski [Thu, 19 Jan 2023 13:45:57 +0000 (08:45 -0500)]
rasdaemon: ras-report: fix possible but unlikely file descriptor leak

Found with covscan.

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years agolibtrace: Use XSI version of strerror_r on non glibc systems
Khem Raj [Wed, 31 Aug 2022 02:54:35 +0000 (19:54 -0700)]
libtrace: Use XSI version of strerror_r on non glibc systems

The strerror_r() variant in use is glibc-specific, so guard it accordingly
and provide a fallback for non-glibc systems.
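
For reference, the two variants differ in return type: the GNU one
returns a char pointer (which may or may not be the caller's buffer),
while the XSI one returns an int and fills the buffer. A sketch of a
wrapper covering both follows. The name errno_to_str() and the
preprocessor test are assumptions, not the code from this patch.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Return a message for "err", using buf as storage when needed. */
static const char *errno_to_str(int err, char *buf, size_t len)
{
#if defined(__GLIBC__) && defined(_GNU_SOURCE)
	/* GNU variant: returns the message pointer directly. */
	return strerror_r(err, buf, len);
#else
	/* XSI variant: returns 0 on success and fills buf. */
	if (strerror_r(err, buf, len) != 0)
		snprintf(buf, len, "Unknown error %d", err);
	return buf;
#endif
}
```

Callers use the returned pointer rather than buf, since the GNU variant
is allowed to ignore the buffer.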

Signed-off-by: Khem Raj <raj.khem@gmail.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years agorasdaemon: use the new block_rq_error tracepoint
Yang Shi [Mon, 4 Apr 2022 23:34:05 +0000 (16:34 -0700)]
rasdaemon: use the new block_rq_error tracepoint

Since Linux 5.18-rc1, a new block tracepoint called block_rq_error is
available, dedicated to tracing disk error events. Currently
rasdaemon uses block_rq_complete, which also traces successful
completions. Since that event fires very often, it produces excessive
tracing logs and some overhead.

Use the new tracepoint for disk error reporting, and the new trace point
has the same format as block_rq_complete.
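
One way to choose between the two tracepoints is by kernel release. The
sketch below is only an illustration: the function name, the parsing of
the release string and the fallback to block_rq_complete on older or
unparsable releases are all assumptions, since the message above does not
say how the selection is made.

```c
#include <stdio.h>

/*
 * Pick the block tracepoint name for a kernel release string such as
 * the "release" field returned by uname(2), e.g. "5.18.0-rc1".
 */
static const char *disk_error_event(const char *release)
{
	int major = 0, minor = 0;

	if (sscanf(release, "%d.%d", &major, &minor) != 2)
		return "block_rq_complete";
	if (major > 5 || (major == 5 && minor >= 18))
		return "block_rq_error";
	return "block_rq_complete";
}
```

Because both events share the same format, the same handler can be
registered for whichever name is returned.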

Signed-off-by: Yang Shi <shy828301@gmail.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
3 years agoBump version to 0.6.8 v0.6.8
Mauro Carvalho Chehab [Fri, 1 Apr 2022 10:50:08 +0000 (12:50 +0200)]
Bump version to 0.6.8

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
3 years agomisc/rasdaemon.spec.in: fix some issues on it
Mauro Carvalho Chehab [Fri, 1 Apr 2022 10:34:23 +0000 (12:34 +0200)]
misc/rasdaemon.spec.in: fix some issues on it

Sort of sync this file from Fedora's upstream, addressing some
sysfsdir install bugs.

After this change, the main difference is that Fedora uses
different config settings, depending on the
architecture:

-%configure --enable-all --with-sysconfdefdir=%{_sysconfdir}/sysconfig
+%ifarch %{arm} aarch64
+%configure --enable-sqlite3 --enable-aer --enable-mce --enable-extlog --enable-devlink --enable-diskerror --enable-abrt-report --enable-non-standard --enable-arm --enable-hisi-ns-decode --with-sysconfdefdir=%{_sysconfdir}/sysconfig
+%else
+%configure --enable-sqlite3 --enable-aer --enable-mce --enable-extlog --enable-devlink --enable-diskerror --enable-abrt-report --with-sysconfdefdir=%{_sysconfdir}/sysconfig
+%endif

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
3 years agoMakefile.am: clean output from misc/*.in
Mauro Carvalho Chehab [Fri, 1 Apr 2022 09:41:39 +0000 (11:41 +0200)]
Makefile.am: clean output from misc/*.in

Cleanup files that are generated at build time from the *.in
input files.

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
3 years agorasdaemon: Add some modules supported by hisi common error section
Xiaofei Tan [Wed, 20 Oct 2021 06:33:40 +0000 (14:33 +0800)]
rasdaemon: Add some modules supported by hisi common error section

Add some modules supported by hisi common error section. Besides,
HHA is a module used only on some old platforms, and it occupies the
same slot as MATA, so remove it.

Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
3 years agorasdaemon: Fix some print format issues for hisi common error section
Xiaofei Tan [Wed, 20 Oct 2021 06:33:39 +0000 (14:33 +0800)]
rasdaemon: Fix some print format issues for hisi common error section

It is not right to use '%d' to print uint8_t and uint16_t, although
there is no functional issue. Use '%hhu' and '%hu' respectively.
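
A short illustration of the matching length modifiers. The field names
socket_id and core_id and the helper itself are made up for the example.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Print a uint8_t with %hhu and a uint16_t with %hu, so the format
 * specifiers match the argument types.
 */
static int format_ids(char *buf, size_t len, uint8_t socket_id,
		      uint16_t core_id)
{
	return snprintf(buf, len, "socket=%hhu core=%hu", socket_id, core_id);
}
```

The PRIu8 and PRIu16 macros from <inttypes.h> are an alternative way to
spell the same thing.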

Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
3 years agorasdaemon: Fix the issue of command option -r for hip08
Xiaofei Tan [Wed, 20 Oct 2021 06:33:38 +0000 (14:33 +0800)]
rasdaemon: Fix the issue of command option -r for hip08

For hip08, events are recorded even when the option -r is not provided.
That is not right, so fix it.

Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
3 years agorasdaemon: Fix the issue of sprintf data type mismatch in uuid_le()
Xiaofei Tan [Wed, 20 Oct 2021 06:33:37 +0000 (14:33 +0800)]
rasdaemon: Fix the issue of sprintf data type mismatch in uuid_le()

The argument type passed to sprintf() in uuid_le() does not match the
format. The arm64 compiler treats char as unsigned by default, so it works
normally. But if someone compiles with the option -fsigned-char, the
function does not work correctly.
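
The underlying problem is sign extension: a plain char holding 0xab is
negative when char is signed, so after promotion to int "%.2x" prints
ffffffab rather than ab. Casting each byte to unsigned char makes the
result independent of the compiler's choice. The helper below is a
generic sketch of that fix, not the uuid_le() code itself.

```c
#include <stdio.h>

/*
 * Hex-encode n bytes from "in" into "out".
 * "out" must have room for 2 * n + 1 characters.
 */
static void bytes_to_hex(const char *in, size_t n, char *out)
{
	size_t i;

	for (i = 0; i < n; i++)
		/* The cast keeps values above 0x7f from being sign-extended. */
		sprintf(out + 2 * i, "%.2x", (unsigned char)in[i]);
	out[2 * n] = '\0';
}
```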

Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
3 years agorasdaemon.service.in: comment out syslog.target
Evils [Sat, 11 Dec 2021 02:27:05 +0000 (03:27 +0100)]
rasdaemon.service.in: comment out syslog.target

syslog is only used when the daemon runs in background mode
  this service is configured to run in foreground mode

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
3 years agoadd labels for asrock x570 motherboard
Steven Johnson [Tue, 7 Dec 2021 10:57:08 +0000 (17:57 +0700)]
add labels for asrock x570 motherboard

Signed-off-by: Steven Johnson <strntydog@gmail.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
3 years agoUpdate ras-mc-ctl manpage to match current options
Justin Vreeland [Wed, 3 Nov 2021 02:51:50 +0000 (19:51 -0700)]
Update ras-mc-ctl manpage to match current options

Signed-off-by: Justin Vreeland <vreeland.justin@gmail.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
3 years agorasdaemon: ras-mc-ctl: Fix script to parse dimm sizes
Muralidhara M K [Tue, 27 Jul 2021 11:36:45 +0000 (06:36 -0500)]
rasdaemon: ras-mc-ctl: Fix script to parse dimm sizes

Removes trailing spaces at the end of a line from
file location and fixes --layout option to parse dimm nodes
to get the size of each dimm from ras-mc-ctl.

The issue is reported at https://github.com/mchehab/rasdaemon/issues/43,
where '> ras-mc-ctl --layout' reports all 0s.

With this change the layout option prints the correct dimm sizes
> sudo ras-mc-ctl --layout
          +-----------------------------------------------+
          |                      mc0                      |
          |  csrow0   |  csrow1   |  csrow2   |  csrow3   |
----------+-----------------------------------------------+
...
channel7: |  16384 MB  |     0 MB  |     0 MB  |     0 MB |
channel6: |  16384 MB  |     0 MB  |     0 MB  |     0 MB |
...
----------+-----------------------------------------------+

Signed-off-by: Muralidhara M K <muralimk@amd.com>
Signed-off-by: Naveen Krishna Chatradhi <nchatrad@amd.com>
Cc: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Link: https://lkml.kernel.org/r/20210810183855.129076-1-nchatrad@amd.com/
3 years agorasdaemon: fix compile against musl libc
Stijn Tintel [Wed, 1 Sep 2021 00:32:18 +0000 (03:32 +0300)]
rasdaemon: fix compile against musl libc

Fix the following compile errors that occur when building against musl:

ras-events.c: In function 'read_ras_event_all_cpus':
ras-events.c:366:16: error: 'PATH_MAX' undeclared (first use in this function)
  366 |  char pipe_raw[PATH_MAX];
      |                ^~~~~~~~

ras-events.c: In function 'handle_ras_events_cpu':
ras-events.c:564:16: error: 'PATH_MAX' undeclared (first use in this function)
  564 |  char pipe_raw[PATH_MAX];
      |
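
PATH_MAX is provided by <limits.h>. glibc tends to pull it in through
other headers, which is why the missing include only shows up with musl.
The sketch below includes the header explicitly. The helper name and the
fallback value are assumptions for the example, while pipe_raw-style
buffers and the per_cpu/cpuN/trace_pipe_raw layout follow the usual
tracefs naming.

```c
#include <limits.h>	/* PATH_MAX */
#include <stdio.h>

#ifndef PATH_MAX
#define PATH_MAX 4096	/* assumed fallback for this sketch only */
#endif

/*
 * Build the per-CPU raw trace pipe path below "dir".
 * Returns 0 on success, -1 if the result would not fit in "out".
 */
static int build_pipe_path(char *out, size_t len, const char *dir, int cpu)
{
	int n = snprintf(out, len, "%s/per_cpu/cpu%d/trace_pipe_raw",
			 dir, cpu);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

A caller would declare the buffer as "char pipe_raw[PATH_MAX];" and pass
sizeof(pipe_raw) as the length.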

Signed-off-by: Stijn Tintel <stijn@linux-ipv6.be>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
3 years agolabels/supermicro: added Supermicro X11SCW
DmNosachev [Fri, 23 Jul 2021 14:28:33 +0000 (17:28 +0300)]
labels/supermicro: added Supermicro X11SCW

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agolabels/supermicro: added Supermicro X10DRL, X11SPM
DmNosachev [Thu, 22 Jul 2021 07:25:38 +0000 (10:25 +0300)]
labels/supermicro: added Supermicro X10DRL, X11SPM

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agolabels/supermicro: added Supermicro X11SCA(-F)
DmNosachev [Fri, 2 Jul 2021 10:13:46 +0000 (13:13 +0300)]
labels/supermicro: added Supermicro X11SCA(-F)

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agolabels/supermicro: added Supermicro B1DRi
DmNosachev [Wed, 30 Jun 2021 13:49:18 +0000 (16:49 +0300)]
labels/supermicro: added Supermicro B1DRi

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agolabels/supermicro: added Supermicro X11DDW-NT(-L)
DmNosachev [Tue, 29 Jun 2021 11:07:54 +0000 (14:07 +0300)]
labels/supermicro: added Supermicro X11DDW-NT(-L)

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agolabels/supermicro: added Supermicro X10DRI(-T)
DmNosachev [Tue, 29 Jun 2021 10:48:55 +0000 (13:48 +0300)]
labels/supermicro: added Supermicro X10DRI(-T)

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agolabels/supermicro: supermicro db syntax
DmNosachev [Tue, 29 Jun 2021 10:37:48 +0000 (13:37 +0300)]
labels/supermicro: supermicro db syntax

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agolabels/supermicro: added x11dph-i labels
DmNosachev [Tue, 29 Jun 2021 08:33:10 +0000 (11:33 +0300)]
labels/supermicro: added x11dph-i labels

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agorasdaemon: Support MCE for AMD CPU family 19h
Muralidhara M K [Wed, 28 Jul 2021 06:52:12 +0000 (01:52 -0500)]
rasdaemon: Support MCE for AMD CPU family 19h

Add support for family 19h x86 CPUs from AMD.

Signed-off-by: Muralidhara M K <muralimk@amd.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agorasdaemon: Enumerate memory on noncpu nodes
Muralidhara M K [Mon, 12 Jul 2021 10:40:46 +0000 (05:40 -0500)]
rasdaemon: Enumerate memory on noncpu nodes

Newer heterogeneous systems from AMD have GPU nodes (with HBM2 memory
banks) connected via xGMI links to the CPUs.

The node id information is available in the InstanceHI[47:44] of
the IPID register.
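
Extracting a bit field like [47:44] is a shift and a mask. The sketch
below assumes the bit numbers refer to the full 64-bit IPID value. The
function name is made up for the example.

```c
#include <stdint.h>

/* Return bits [47:44] of the 64-bit IPID register value. */
static unsigned int ipid_node_id(uint64_t ipid)
{
	return (unsigned int)((ipid >> 44) & 0xf);
}
```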

The UMC Phys on Aldebaran nodes are enumerated as csrows.
The UMC channels connected to HBMs are enumerated as ranks.

Signed-off-by: Muralidhara M K <muralimk@amd.com>
Signed-off-by: Naveen Krishna Chatradhi <nchatrad@amd.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agorasdaemon: set SMCA maximum number of banks to 64
Muralidhara M K [Mon, 12 Jul 2021 10:18:43 +0000 (05:18 -0500)]
rasdaemon: set SMCA maximum number of banks to 64

Newer AMD systems with SMCA banks support up to 64 MCA banks per CPU.

This patch is based on the commit below upstreamed into the kernel:
a0bc32b3cacf ("x86/mce: Increase maximum number of banks to 64")

Signed-off-by: Muralidhara M K <muralimk@amd.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agorasdaemon: Add new SMCA bank types with error decoding
Naveen Krishna Chatradhi [Tue, 1 Jun 2021 05:31:17 +0000 (11:01 +0530)]
rasdaemon: Add new SMCA bank types with error decoding

Upcoming systems with Scalable Machine Check Architecture (SMCA) have
new MCA banks added.

This patch adds the (HWID, MCATYPE) tuple, name and error decoding for
those new SMCA banks.
While at it, optimize the string names in smca_bank_name[].

Signed-off-by: Muralidhara M K <muralimk@amd.com>
Signed-off-by: Naveen Krishna Chatradhi <nchatrad@amd.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agoconfigure.ac: fix SYSCONFDEFDIR default value
Matt Whitlock [Wed, 9 Jun 2021 14:25:18 +0000 (10:25 -0400)]
configure.ac: fix SYSCONFDEFDIR default value

configure.ac was using AC_ARG_WITH incorrectly, yielding a generated configure script like:

    # Check whether --with-sysconfdefdir was given.
    if test "${with_sysconfdefdir+set}" = set; then :
      withval=$with_sysconfdefdir; SYSCONFDEFDIR=$withval
    else
      "/etc/sysconfig"
    fi

This commit fixes the default case so that the SYSCONFDEFDIR variable is assigned the value "/etc/sysconfig" rather than trying to execute "/etc/sysconfig" as a command.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agoAdd error handling for Ampere-specific errors.
Jason Tian [Fri, 28 May 2021 03:35:43 +0000 (11:35 +0800)]
Add error handling for Ampere-specific errors.

Save the decoded Ampere-specific errors into the sqlite3
database and log the PCIe segment and bus/device/function
numbers into the BMC SEL.

Signed-off-by: Jason Tian <jason@os.amperecomputing.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agoAdd support for multi-arch builds
Mauro Carvalho Chehab [Wed, 26 May 2021 10:55:54 +0000 (12:55 +0200)]
Add support for multi-arch builds

Allow building rasdaemon on several architectures:
- x86_64
- arm 64
- ppc 64 LE

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agoci.yml: Fix the job for it to run on a single arch
Mauro Carvalho Chehab [Wed, 26 May 2021 10:41:27 +0000 (12:41 +0200)]
ci.yml: Fix the job for it to run on a single arch

There were some issues on the previous content. Fix them, in
order to allow it to build on a single architecture.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agoAdd a github workflow for CI automation
Mauro Carvalho Chehab [Wed, 26 May 2021 10:35:55 +0000 (12:35 +0200)]
Add a github workflow for CI automation

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agorasdaemon.spec.in: Fix the description on this example file
Mauro Carvalho Chehab [Wed, 26 May 2021 08:37:52 +0000 (10:37 +0200)]
rasdaemon.spec.in: Fix the description on this example file

While this is used just to test if building it is OK, better
to keep the logs nice ;-)

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agoBump verstion to 0.6.7 v0.6.7
Mauro Carvalho Chehab [Wed, 26 May 2021 07:41:45 +0000 (09:41 +0200)]
Bump verstion to 0.6.7

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agorasdaemon.spec.in: don't install _sharedstatedir
Mauro Carvalho Chehab [Wed, 26 May 2021 08:32:39 +0000 (10:32 +0200)]
rasdaemon.spec.in: don't install _sharedstatedir

%{_sharedstatedir}/rasdaemon is now created at runtime.
So, no need to install it anymore.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agoMakefile.am: fix build header rules
Mauro Carvalho Chehab [Wed, 26 May 2021 08:20:39 +0000 (10:20 +0200)]
Makefile.am: fix build header rules

non-standard-hisilicon.h was added twice;
ras-memory-failure-handler.h is missing.

Due to that, the tarball becomes incomplete, causing build
errors.

While here, also adjust .travis.yml to use --enable-all.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agorasdaemon: Add Ice Lake and Sapphire Rapids MSCOD values
Greg Edwards [Thu, 8 Apr 2021 21:03:30 +0000 (15:03 -0600)]
rasdaemon: Add Ice Lake and Sapphire Rapids MSCOD values

Based on mcelog commits:

  ee90ff20ce6a ("mcelog: Add support for Icelake server, Icelake-D, and Snow Ridge")
  391abaac9bdf ("mcelog: Add decode for MCi_MISC from 10nm memory controller")
  59cb7ad4bc72 ("mcelog: i10nm: Fix mapping from bank number to functional unit")
  c0acd0e6a639 ("mcelog: Add support for Sapphirerapids server.")

Signed-off-by: Greg Edwards <gedwards@ddn.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agorasdaemon: fix build error in register_ns_ev_decoder if the sqlite3 is not enabled
Shiju Jose [Tue, 9 Mar 2021 16:18:56 +0000 (16:18 +0000)]
rasdaemon: fix build error in register_ns_ev_decoder if the sqlite3 is not enabled

ns_ev_decoder->stmt_dec_record = NULL; in register_ns_ev_decoder()
should be under #ifdef HAVE_SQLITE3 to fix the compilation error
when building without the configure option --enable-sqlite3.
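
The pattern is that a struct member which only exists under
HAVE_SQLITE3 must only be touched under the same guard. A reduced sketch
follows. The struct and function names are stand-ins, not the real
rasdaemon definitions.

```c
#include <stddef.h>

/* Stand-in for a decoder entry; the real struct has more members. */
struct ns_decoder_sketch {
	const char *name;
#ifdef HAVE_SQLITE3
	void *stmt_dec_record;	/* only present in sqlite3 builds */
#endif
	struct ns_decoder_sketch *next;
};

static struct ns_decoder_sketch *decoder_list;

/* Add "dec" to the list head; returns 0 on success, -1 on bad input. */
static int register_decoder_sketch(struct ns_decoder_sketch *dec)
{
	if (!dec)
		return -1;
#ifdef HAVE_SQLITE3
	/* Guarded, because the member does not exist otherwise. */
	dec->stmt_dec_record = NULL;
#endif
	dec->next = decoder_list;
	decoder_list = dec;
	return 0;
}
```

The sketch compiles both with and without -DHAVE_SQLITE3.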

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agorasdaemon: Modify confiure.ac for Hisilicon Kunpeng errors
Shiju Jose [Mon, 8 Mar 2021 16:57:32 +0000 (16:57 +0000)]
rasdaemon: Modify confiure.ac for Hisilicon Kunpeng errors

Modify  HIP07 SAS HW errors : $USE_HISI_NS_DECODE to
HISI Kunpeng errors : $USE_HISI_NS_DECODE.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agorasdaemon: ras-mc-ctl: Add support for HiSilicon Kunpeng9xx common errors
Shiju Jose [Mon, 8 Mar 2021 16:57:31 +0000 (16:57 +0000)]
rasdaemon: ras-mc-ctl: Add support for HiSilicon Kunpeng9xx common errors

Add support for the HiSilicon Kunpeng9xx platforms common errors.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Reviewed-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agorasdaemon: ras-mc-ctl: Add support for HiSilicon Kunpeng920 errors
Shiju Jose [Mon, 8 Mar 2021 16:57:30 +0000 (16:57 +0000)]
rasdaemon: ras-mc-ctl: Add support for HiSilicon Kunpeng920 errors

Add support for the HiSilicon Kunpeng920 errors.
Supported error formats: OEM type 1, OEM type 2 and PCIe controller
error formats.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Reviewed-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agorasdaemon: ras-mc-ctl: Add support for the vendor-specific errors
Shiju Jose [Mon, 8 Mar 2021 16:57:29 +0000 (16:57 +0000)]
rasdaemon: ras-mc-ctl: Add support for the vendor-specific errors

Add commands to support logging the vendor-specific
error info in the ras-mc-ctl.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Reviewed-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agorasdaemon: ras-mc-ctl: Add memory failure events
Shiju Jose [Mon, 8 Mar 2021 16:57:28 +0000 (16:57 +0000)]
rasdaemon: ras-mc-ctl: Add memory failure events

Add support for memory failure errors (memory_failure_event)
to the ras-mc-ctl tool.

Sample log:
ras-mc-ctl --summary
...
Memory failure events summary:
        Delayed errors: 4
        Failed errors: 1
...

ras-mc-ctl --errors
...
Memory failure events:
1 2020-10-28 23:20:41 -0800 error: pfn=0x204000000, page_type=free buddy page, action_result=Delayed
2 2020-10-28 23:31:38 -0800 error: pfn=0x204000000, page_type=free buddy page, action_result=Delayed
3 2020-10-28 23:54:54 -0800 error: pfn=0x205000000, page_type=free buddy page, action_result=Delayed
4 2020-10-29 00:12:25 -0800 error: pfn=0x204000000, page_type=free buddy page, action_result=Delayed
5 2020-10-29 00:26:36 -0800 error: pfn=0x204000000, page_type=free buddy page, action_result=Failed

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agorasdaemon: ras-mc-ctl: Modify ARM processor error summary log
Shiju Jose [Mon, 8 Mar 2021 16:57:27 +0000 (16:57 +0000)]
rasdaemon: ras-mc-ctl: Modify ARM processor error summary log

Add CPU's mpidr information to the ARM processor error
summary log.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agorasdaemon: add support for memory_failure events
Shiju Jose [Mon, 8 Mar 2021 16:57:26 +0000 (16:57 +0000)]
rasdaemon: add support for memory_failure events

Add support to log the memory_failure kernel trace
events.

Example rasdaemon log and SQLite DB output for the
memory_failure event,
=================================================
rasdaemon: memory_failure_event store: 0x126ce8f8
rasdaemon: register inserted at db
<...>-785   [000]     0.000024: memory_failure_event: 2020-10-02 13:27:13 -0400 pfn=0x204000000 page_type=free buddy page action_result=Delayed

CREATE TABLE memory_failure_event (id INTEGER PRIMARY KEY, timestamp TEXT, pfn TEXT, page_type TEXT, action_result TEXT);
INSERT INTO memory_failure_event VALUES(1,'2020-10-02 13:27:13 -0400','0x204000000','free buddy page','Delayed');
==================================================

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agoras-record: Create RASSTATEDIR at runtime instead of install time
B. Wilson [Mon, 12 Apr 2021 15:29:58 +0000 (00:29 +0900)]
ras-record: Create RASSTATEDIR at runtime instead of install time

Package managers such as Nix and Guix force installation into an
isolated directory hierarchy. Furthermore, said hierarchy becomes
readonly after the install has completed, rendering any
<hierarchy>/var/lib/rasdaemon/ directory effectively useless.

In addition to being standard practice, creating RASSTATEDIR when
necessary at runtime fixes the above use cases.
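
A sketch of "create when necessary": call mkdir() and treat an existing
directory as success. The function name and the 0755 mode are assumptions
for the example, and the sketch creates a single path component only.

```c
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Make sure "path" exists as a directory.
 * Returns 0 if it was created or already exists, -1 otherwise
 * (for example when the parent is missing or read-only).
 */
static int ensure_state_dir(const char *path)
{
	struct stat st;

	if (mkdir(path, 0755) == 0)
		return 0;
	if (errno != EEXIST)
		return -1;
	/* Something is already there: accept it only if it is a directory. */
	if (stat(path, &st) != 0 || !S_ISDIR(st.st_mode))
		return -1;
	return 0;
}
```

Calling it right before opening the database keeps the install step free
of any writable-directory requirement.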

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agoadd labels for A2SDi-8C+-HLN4F
Henrik Riomar [Sat, 27 Feb 2021 13:10:42 +0000 (14:10 +0100)]
add labels for A2SDi-8C+-HLN4F

Same as A2SDi-8C-HLN4F, but with a fan

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agoAdded label for ASUS PRIME X570-PRO
obiwan-r2d2 [Sat, 22 May 2021 13:35:36 +0000 (15:35 +0200)]
Added label for ASUS PRIME X570-PRO

ras-mc-ctl: mainboard: ASUSTeK COMPUTER INC. model PRIME X570-PRO

Installed DIMM_A1:
Label                   CE      UE
mc#0csrow#0channel#1    0       0
mc#0csrow#1channel#1    0       0

Installed DIMM_A2:
Label                   CE      UE
mc#0csrow#3channel#1    0       0
mc#0csrow#2channel#1    0       0

Installed DIMM_B1:
Label                   CE      UE
mc#0csrow#1channel#0    0       0
mc#0csrow#0channel#0    0       0

Installed DIMM_B2:
Label                   CE      UE
mc#0csrow#2channel#0    0       0
mc#0csrow#3channel#0    0       0

Test with 2 DIMMs:

LOCATION                            CONFIGURED LABEL     SYSFS CONTENTS
                                    DIMM_B1              0:0:0 missing
                                    DIMM_A1              0:0:1 missing
                                    DIMM_B1              0:1:0 missing
                                    DIMM_A1              0:1:1 missing
mc0 csrow 2 channel 0               DIMM_B2              DIMM_B2
mc0 csrow 2 channel 1               DIMM_A2              DIMM_A2
mc0 csrow 3 channel 0               DIMM_B2              DIMM_B2
mc0 csrow 3 channel 1               DIMM_A2              DIMM_A2

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agoAdd code to decode Ampere specific error
Jason Tian [Thu, 4 Feb 2021 01:57:05 +0000 (09:57 +0800)]
Add code to decode Ampere specific error

All Ampere-specific errors (payload types 0/1/2/3) include 48 bytes
of OEM data, which is decoded into error type, subtype, instance,
socket number and so on.

Signed-off-by: Jason Tian <jason@os.amperecomputing.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
4 years agorasdaemon: fix memory leak in parse_ras_data
Josh Hunt [Fri, 8 Jan 2021 00:12:52 +0000 (19:12 -0500)]
rasdaemon: fix memory leak in parse_ras_data

parse_ras_data() is calling trace_seq_init() which allocates a buffer,
but never calls the corresponding trace_seq_destroy() to free it causing
us to leak memory.

Reported-by: Subhendu Saha <subhends@akamai.com>
Signed-off-by: Josh Hunt <johunt@akamai.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
4 years agoFix ras-mc-ctl script.
Subhendu Saha [Tue, 12 Jan 2021 08:29:55 +0000 (03:29 -0500)]
Fix ras-mc-ctl script.

When rasdaemon is compiled without enabling aer, mce, devlink,
etc., those tables are not created in the database file. Then
ras-mc-ctl script breaks trying to query data from non-existent
tables.

Signed-off-by: Subhendu Saha <subhends@akamai.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
4 years agoadd Supermicro X10SRA-F and H8DGU.
Chad W Seys [Thu, 14 Jan 2021 15:07:47 +0000 (09:07 -0600)]
add Supermicro X10SRA-F and H8DGU.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
4 years agoras-page-isolation: page which is PAGE_OFFLINE_FAILED can be offlined again
lvying6 [Sat, 31 Oct 2020 09:57:15 +0000 (17:57 +0800)]
ras-page-isolation: page which is PAGE_OFFLINE_FAILED can be offlined again

The OS may have failed to offline a page on a previous attempt. After some
time, the page's state may change so that the OS can offline it.
If Correctable Errors on this page then reach the threshold,
rasdaemon should try to offline the page again.

Signed-off-by: lvying6 <lvying6@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
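
The retry rule described in the commit above can be sketched roughly as
follows. This is an illustration only, not the rasdaemon source: of the names
used here only PAGE_OFFLINE_FAILED appears in the commit message; the other
enum values, the struct and the function are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical page states; only PAGE_OFFLINE_FAILED is named in the commit. */
enum page_state {
	PAGE_ONLINE,
	PAGE_OFFLINE,
	PAGE_OFFLINE_FAILED,
};

/* Hypothetical per-page record. */
struct page_record {
	enum page_state state;
	unsigned long ce_count;		/* corrected errors seen on this page */
};

/*
 * Decide whether an offline attempt should be triggered. A page whose
 * earlier attempt failed is treated like an online page, so it is retried
 * when its error count reaches the threshold.
 */
static bool should_offline(const struct page_record *pr, unsigned long threshold)
{
	assert(pr != NULL);

	if (pr->state == PAGE_OFFLINE)
		return false;		/* already offlined, nothing to do */
	return pr->ce_count >= threshold;
}
```

The only point of the sketch is that a failed attempt is not a terminal state:
the threshold check alone decides whether to try again.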
4 years agoras-page-isolation: do_page_offline always considers page offline was successful
lvying [Sat, 31 Oct 2020 09:57:14 +0000 (17:57 +0800)]
ras-page-isolation: do_page_offline always considers page offline was successful

do_page_offline always considers the page offline to have succeeded, even if
the kernel failed to soft/hard offline the page.

Calling rasdaemon with:

/etc/sysconfig/rasdaemon PAGE_CE_THRESHOLD="1"

i.e. when a Corrected Error occurs at an address in a page, rasdaemon should
trigger a soft offline of that page.

However, after adding a livepatch to the kernel's
store_soft_offline_page to observe this function's return value,
when injecting a CE into address 0x3f7ec30000, the kernel
log reports:

soft_offline: 0x3f7ec30: unknown non LRU page type ffffe0000000000 ()
[store_soft_offline_page]return from soft_offline_page: -5

While the rasdaemon log reports:

rasdaemon[73711]: cpu 00:rasdaemon: Corrected Errors at 0x3f7ec30000 exceed threshold
rasdaemon[73711]: rasdaemon: Result of offlining page at 0x3f7ec30000: offlined

Using strace to record rasdaemon's system calls, it reports:

strace -p 73711
openat(AT_FDCWD, "/sys/devices/system/memory/soft_offline_page",
       O_WRONLY|O_CREAT|O_TRUNC, 0666) = 28
fstat(28, {st_mode=S_IFREG|0200, st_size=4096, ...}) = 0
write(28, "0x3f7ec30000", 12)           = -1 EIO (Input/output error)
close(28)                               = 0

So the kernel actually failed to soft offline pfn 0x3f7ec30 and
store_soft_offline_page returned -EIO. However, rasdaemon always
considers the page offline to be successful.

According to the strace output, ferror was unable to detect the
failure of the write syscall.

This patch changes the fopen-fprintf-ferror-fclose sequence to the
lower-level open-write-close I/O calls, which can detect such a
syscall failure.

Signed-off-by: lvying <lvying6@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
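
A minimal sketch of the open-write-close approach described in the commit
above, assuming a helper named write_value() (a hypothetical name, not the
function used in rasdaemon). One likely reason the stdio version missed the
error is that fprintf() normally only fills a user-space buffer and the write
syscall is issued later, at fflush() or fclose(), so a ferror() check made
before that point cannot see the failure.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Write a string to an existing file with open/write/close and report any
 * failure. Returns 0 on success or a negative errno value.
 */
static int write_value(const char *path, const char *value)
{
	size_t len;
	ssize_t written;
	int fd, err = 0;

	assert(path != NULL && value != NULL);
	len = strlen(value);

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -errno;

	written = write(fd, value, len);
	if (written < 0)
		err = -errno;		/* e.g. -EIO, as in the strace output */
	else if ((size_t)written != len)
		err = -EIO;		/* short write: treat as a failure */

	if (close(fd) < 0 && !err)
		err = -errno;

	return err;
}
```

With this shape, a call such as
write_value("/sys/devices/system/memory/soft_offline_page", "0x3f7ec30000")
would return a negative value when the kernel rejects the request, and the
caller can log the page as not offlined.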
4 years agoAdd architecture ppc64le to travis build
Debabrata Deka [Fri, 4 Dec 2020 13:13:21 +0000 (14:13 +0100)]
Add architecture ppc64le to travis build

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
4 years agoFix problem from make dist-rpm
Jason Tian [Fri, 11 Dec 2020 02:39:11 +0000 (03:39 +0100)]
Fix problem from make dist-rpm

The rasdaemon rpm package fails to build, reporting that files should
start with "/".

Update the Makefile by adding "" to the folder name and fix one
typo.

Signed-off-by: Jason Tian <jason@os.amperecomputing.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
4 years agorasdaemon: Add 8 channel decoding for SMCA systems
Muralidhara M K [Thu, 20 Aug 2020 15:30:57 +0000 (21:00 +0530)]
rasdaemon: Add 8 channel decoding for SMCA systems

Current Scalable Machine Check Architecture (SMCA) systems support up
to 8 UMC channels.

To find the UMC channel represented by a bank, look at the 6th nibble
in the MCA_IPID[InstanceId] field.

Signed-off-by: Muralidhara M K <muralimk@amd.com>
[ Adjust commit message. ]
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
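
The nibble lookup described in the commit above could look roughly like the
sketch below. Two things are assumptions not stated in the commit message:
that InstanceId occupies the low 32 bits of MCA_IPID, and that nibbles are
counted from the least significant one starting at 1, which puts the 6th
nibble at bits 23:20. The function name is hypothetical.

```c
#include <stdint.h>

/*
 * Extract the UMC channel from an MCA_IPID value.
 * Assumed layout: InstanceId = MCA_IPID[31:0], channel = InstanceId[23:20].
 */
static unsigned int umc_channel(uint64_t ipid)
{
	uint32_t instance_id = (uint32_t)(ipid & 0xffffffffu);

	return (instance_id >> 20) & 0xf;
}
```

Under these assumptions an IPID whose low word is 0x00750000 would map to
channel 7.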
4 years agorasdaemon: Fix error print
Liguang Zhang [Mon, 10 Aug 2020 03:07:43 +0000 (11:07 +0800)]
rasdaemon: Fix error print

Fix the error print in handle_ras_events.

Signed-off-by: Liguang Zhang <zhangliguang@linux.alibaba.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
4 years agoCreate SYSCONFDEFDIR configure parameter
Antonio Russo [Tue, 18 Aug 2020 04:29:51 +0000 (22:29 -0600)]
Create SYSCONFDEFDIR configure parameter

Provide downstream packagers with a tunable describing the location of
the file containing environment variables to pass to the startup script.

Defaults to the existing value, /etc/sysconfig.

Signed-off-by: Antonio Russo <antonio.e.russo@gmail.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
4 years agorasdaemon: ras-mc-ctl: Add ARM processor error information
Shiju Jose [Tue, 11 Aug 2020 12:31:46 +0000 (13:31 +0100)]
rasdaemon: ras-mc-ctl: Add ARM processor error information

Add support for ARM processor errors to the ras-mc-ctl tool.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
4 years agorasdaemon: Modify non-standard error decoding interface using linked list
Shiju Jose [Mon, 10 Aug 2020 14:42:56 +0000 (15:42 +0100)]
rasdaemon: Modify non-standard error decoding interface using linked list

Replace the current non-standard error decoding interface with the
interface based on the linked list to avoid using realloc and
to improve the interface.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
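
As an illustration of why a linked list removes the need for realloc, here is
a minimal registry sketch. It is not the rasdaemon interface: the struct, its
fields and both function names are hypothetical. Each decoder carries its own
link pointer, so registering one only updates two pointers and never resizes
an array.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical decoder descriptor; field names are illustrative only. */
struct ns_decoder {
	const char *sec_type;			/* section identifier to match */
	int (*decode)(const void *data, size_t len);
	struct ns_decoder *next;		/* intrusive list link */
};

static struct ns_decoder *decoder_list;

/* Push a caller-owned decoder onto the list: no allocation, no realloc. */
static void register_decoder(struct ns_decoder *d)
{
	assert(d != NULL);
	d->next = decoder_list;
	decoder_list = d;
}

/* Walk the list and return the first decoder matching sec_type, or NULL. */
static struct ns_decoder *find_decoder(const char *sec_type)
{
	struct ns_decoder *d;

	for (d = decoder_list; d; d = d->next)
		if (strcmp(d->sec_type, sec_type) == 0)
			return d;
	return NULL;
}
```

The trade-off is a linear walk on lookup, which is usually acceptable for a
small, fixed set of decoders.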
4 years agorasdaemon: add support for hisilicon common section decoder
Xiaofei Tan [Mon, 27 Jul 2020 07:38:39 +0000 (15:38 +0800)]
rasdaemon: add support for hisilicon common section decoder

Add a new non-standard error section, the HiSilicon common section.
It is defined for the next-generation SoC Kunpeng930. It also supports
Kunpeng920, and some modules of Kunpeng920 could be changed to use
this section.

The code is put in a new source file, as it supports multiple hardware
platforms. Some hip08 code could be shared, so it is moved to this new file.

Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
4 years agorasdaemon: delete the code of non-standard error decoder for hip07
Xiaofei Tan [Mon, 27 Jul 2020 07:38:38 +0000 (15:38 +0800)]
rasdaemon: delete the code of non-standard error decoder for hip07

Delete the non-standard error decoder code for hip07, which was never
used because the corresponding code in the Linux kernel wasn't accepted.

Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
4 years agorasdaemon: delete the duplicate code about the definition of hip08 DB fields
Xiaofei Tan [Mon, 27 Jul 2020 07:38:37 +0000 (15:38 +0800)]
rasdaemon: delete the duplicate code about the definition of hip08 DB fields

Delete the duplicate definitions of the DB fields for hip08 OEM
event format1 and format2, because the two OEM event formats are the same.

Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
4 years agorasdaemon: Add error decoding for new SMCA Load Store bank type
Muralidhara M K [Mon, 13 Jan 2020 13:42:06 +0000 (19:12 +0530)]
rasdaemon: Add error decoding for new SMCA Load Store bank type

Future Scalable Machine Check Architecture (SMCA) systems will have a
new Load Store bank type.

Add the new type's (HWID, McaType) ID and error decoding.

Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com>
[ Adjust commit message. ]
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
4 years agorasdaemon: Fix "ignoring return value" build warning.
Yazen Ghannam [Fri, 8 May 2020 14:51:29 +0000 (14:51 +0000)]
rasdaemon: Fix "ignoring return value" build warning.

The following build warning is given:

ras-diskerror-handler.c: In function ras_diskerror_event_handler:
ras-diskerror-handler.c:98:2:
warning: ignoring return value of asprintf, declared with attribute warn_unused_result [-Wunused-result]
  asprintf(&ev.dev, "%u:%u", major(dev), minor(dev));
  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Check the return value of asprintf() to avoid the warning.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
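
One way to check the asprintf() return value, as the commit above describes,
is sketched below. The helper name format_dev() and its out-parameter are
hypothetical; only the asprintf() call with major() and minor() comes from the
warning text. The sketch assumes glibc, where asprintf() needs _GNU_SOURCE and
the device macros live in <sys/sysmacros.h>.

```c
#define _GNU_SOURCE		/* asprintf() is a GNU/BSD extension */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/sysmacros.h>	/* major(), minor(), makedev() on glibc */
#include <sys/types.h>

/*
 * Format a device number as "major:minor" into a newly allocated string.
 * Returns 0 on success and -1 on failure. On failure *out is set to NULL,
 * because asprintf() leaves the pointer undefined in that case.
 */
static int format_dev(dev_t dev, char **out)
{
	assert(out != NULL);

	if (asprintf(out, "%u:%u", major(dev), minor(dev)) < 0) {
		*out = NULL;
		return -1;
	}
	return 0;
}
```

The caller owns the returned string and releases it with free().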
4 years agoMatch rankX in ras-mc-ctl
Cong Wang [Fri, 28 Feb 2020 20:37:15 +0000 (12:37 -0800)]
Match rankX in ras-mc-ctl

According to the kernel documentation:
https://www.kernel.org/doc/html/v4.10/admin-guide/ras.html
the mcX directory contains either dimmX or rankX directories.

Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
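
The matching rule from the commit above, accepting both dimmX and rankX entry
names, can be illustrated with a small C helper. ras-mc-ctl itself is a
script, so this is not its code; the function name is hypothetical and the
sketch assumes X is a non-empty run of decimal digits.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Return true if name is "dimm" or "rank" followed by one or more digits. */
static bool is_dimm_or_rank(const char *name)
{
	const char *p;

	assert(name != NULL);

	if (strncmp(name, "dimm", 4) != 0 && strncmp(name, "rank", 4) != 0)
		return false;

	p = name + 4;
	if (*p == '\0')
		return false;		/* a number is required */
	for (; *p; p++)
		if (!isdigit((unsigned char)*p))
			return false;
	return true;
}
```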
4 years agoFix a typo in ras-mc-ctl
Cong Wang [Fri, 28 Feb 2020 00:24:06 +0000 (16:24 -0800)]
Fix a typo in ras-mc-ctl

Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
4 years agoadded label for A2SDi-8C-HLN4F
A. Binzxxxxxx [Mon, 18 Nov 2019 18:24:59 +0000 (19:24 +0100)]
added label for A2SDi-8C-HLN4F

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
4 years agoras-mc-ctl: PCIe AER: display PCIe dev name
dann frazier [Tue, 21 Apr 2020 21:56:04 +0000 (15:56 -0600)]
ras-mc-ctl: PCIe AER: display PCIe dev name

Storage of PCIe dev name was added in commit 8e96ca2c1c59 ("rasdaemon:
store PCIe dev name and TLP header for the aer event"). This makes
ras-mc-ctl extract and emit it like so:

PCIe AER events:
1 2020-04-16 22:09:48 +0000 0000:0b:00.0 Corrected error: Receiver Error
2 2020-04-16 22:23:24 +0000 0000:0b:00.0 Corrected error: Receiver Error
3 2020-04-17 23:00:37 +0000 0000:d9:01.0 Corrected error: Advisory Non-Fatal, BIT15
4 2020-04-17 23:21:52 +0000 0000:d9:01.0 Corrected error: Advisory Non-Fatal
5 2020-04-18 02:04:24 +0000 0000:5e:00.0 Corrected error: Receiver Error

Signed-off-by: Dann Frazier <dann.frazier@canonical.com>
Tested-by: Shiju Jose <shiju.jose@huawei.com>
4 years agoMakefile.am: fix install of misc/rasdaemon.env v0.6.6
Mauro Carvalho Chehab [Tue, 21 Jul 2020 12:15:41 +0000 (14:15 +0200)]
Makefile.am: fix install of misc/rasdaemon.env

The logic added by the previous patch didn't work properly.

Change it to pack misc/rasdaemon.env when creating a
tarball and install it via "make install" target.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
4 years agoBump to version 0.6.6
Mauro Carvalho Chehab [Tue, 21 Jul 2020 11:32:21 +0000 (13:32 +0200)]
Bump to version 0.6.6

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
4 years agorasdaemon.spec: Fix a bogus date
Mauro Carvalho Chehab [Tue, 21 Jul 2020 11:36:09 +0000 (13:36 +0200)]
rasdaemon.spec: Fix a bogus date

RPM build errors:
    bogus date in %changelog: Fri Oct 10 2019 Mauro Carvalho Chehab <mchehab+samsung@kernel.org>  0.6.4-1
    Bad exit status from /var/tmp/rpm-tmp.MRqZEZ (%install)

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
4 years agorasdaemon: add support for memory Corrected Error predictive failure analysis
wuyun [Sat, 20 Jun 2020 12:26:22 +0000 (20:26 +0800)]
rasdaemon: add support for memory Corrected Error predictive failure analysis

Memory Corrected Errors are corrected by hardware. These errors do not
require immediate software actions, but are still reported for
accounting and predictive failure analysis.

Based on statistical results, some actions can be taken to prevent
a Corrected Error from evolving into an Uncorrected Error.

Signed-off-by: wuyun <wuyun.wu@huawei.com>
Signed-off-by: lvying6 <lvying6@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
4 years agorasdaemon: add rbtree support for page record
wuyun [Sat, 20 Jun 2020 12:26:21 +0000 (20:26 +0800)]
rasdaemon: add rbtree support for page record

The rbtree is very efficient for recording and querying fault page info.

Signed-off-by: wuyun <wuyun.wu@huawei.com>
Signed-off-by: lvying6 <lvying6@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
4 years agorasdaemon: fix the issue that non standard decoder can't work in pthread way
Xiaofei Tan [Wed, 27 May 2020 08:02:33 +0000 (16:02 +0800)]
rasdaemon: fix the issue that non standard decoder can't work in pthread way

The non-standard decoding functions are registered during application init
through __attribute__((constructor)), and unregistered at application exit
through __attribute__((destructor)). We don't need to unregister them
in any other steps. This patch removes these unnecessary unregister calls.

Fixes: 78a21c1e9770 ("rasdaemon: add closure and cleanups for the database")
Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
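
The registration lifetime described in the commit above can be illustrated
with the GCC/Clang constructor and destructor attributes. This is a minimal
sketch with hypothetical names; the flag stands in for whatever decoder table
the real code maintains.

```c
static int decoders_registered;	/* hypothetical stand-in for a decoder table */

/* Runs before main(); a GCC/Clang extension, not standard C. */
__attribute__((constructor))
static void decoder_init(void)
{
	decoders_registered = 1;
}

/* Runs at normal process exit, after main() returns or exit() is called. */
__attribute__((destructor))
static void decoder_exit(void)
{
	decoders_registered = 0;
}

/* Any thread can rely on the registration for the whole process lifetime. */
static int decoder_available(void)
{
	return decoders_registered;
}
```

Because the pair brackets the whole process lifetime, an extra unregister call
made from one code path would remove the decoders for every other thread that
still needs them.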
4 years agorasdaemon: add support of l3tag and l3data in hip08 OEM format2
Xiaofei Tan [Wed, 27 May 2020 08:02:32 +0000 (16:02 +0800)]
rasdaemon: add support of l3tag and l3data in hip08 OEM format2

The two modules, l3tag and l3data, were originally reported through the "ARM
processor error section". That is not suitable, because l3tag and l3data
don't belong to any single CPU core. So change them to use OEM format2.

Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
5 years agorasdaemon: fix error handling in ras_mc_event_opendb()
Aristeu Rozanski [Tue, 7 Jan 2020 19:49:19 +0000 (14:49 -0500)]
rasdaemon: fix error handling in ras_mc_event_opendb()

Found with covscan that the return value from ras_mc_prepare_stmt() and from
ras_mc_event_opendb() itself aren't checked.

Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
5 years agorasdaemon: add default branch for switch statement
Xiaofei Tan [Tue, 26 Nov 2019 13:23:06 +0000 (14:23 +0100)]
rasdaemon: add default branch for switch statement

Add a default branch to the switch statements that were missing
one.

CC: Xiaofei Tan <tanxiaofei@huawei.com>, <linuxarm@huawei.com>, <shiju.jose@huawei.com>, <jonathan.cameron@huawei.com>
Date: Tue, 26 Nov 2019 20:12:36 +0800
Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
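
A short sketch of the pattern the commit above refers to: a switch that maps
an ID to a name, with a default branch so that unexpected values are handled
explicitly. The enum and function are hypothetical and not taken from
rasdaemon.

```c
/* Hypothetical module IDs, for illustration only. */
enum module_id {
	MODULE_FOO,
	MODULE_BAR,
};

/* Map a module ID to a printable name; unknown IDs hit the default branch. */
static const char *module_name(int id)
{
	switch (id) {
	case MODULE_FOO:
		return "foo";
	case MODULE_BAR:
		return "bar";
	default:
		return "unknown";
	}
}
```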
5 years agorasdaemon: fix case style issues for enum constant
Xiaofei Tan [Tue, 26 Nov 2019 13:22:21 +0000 (14:22 +0100)]
rasdaemon: fix case style issues for enum constant

Change the lowercase letters in enum constants to uppercase.

CC: Xiaofei Tan <tanxiaofei@huawei.com>, <linuxarm@huawei.com>, <shiju.jose@huawei.com>, <jonathan.cameron@huawei.com>
Date: Tue, 26 Nov 2019 20:12:35 +0800
Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
5 years agorasdaemon: replace sprintf with snprintf for hip08
Xiaofei Tan [Tue, 26 Nov 2019 13:21:11 +0000 (14:21 +0100)]
rasdaemon: replace sprintf with snprintf for hip08

Replace sprintf with snprintf for hip08 to improve reliability.
Also add a bounds check for the buffer pointer.

CC: Xiaofei Tan <tanxiaofei@huawei.com>, <linuxarm@huawei.com>, <shiju.jose@huawei.com>, <jonathan.cameron@huawei.com>
Date: Tue, 26 Nov 2019 20:12:34 +0800
Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
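
The kind of change the commit above describes, bounded formatting plus a
check on the buffer pointer, might look like the helper below. It is a sketch
under assumptions, not the hip08 code: buf_append() is a hypothetical name,
and vsnprintf() is used in place of snprintf() only so that the helper can be
variadic.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Append formatted text at p without writing at or past end, where end is
 * one past the last byte of the buffer. Returns the new write position.
 * On truncation this is end - 1, the terminating NUL of a full buffer.
 */
static char *buf_append(char *p, char *end, const char *fmt, ...)
{
	va_list ap;
	size_t room;
	int n;

	assert(p != NULL && end != NULL);
	if (p >= end)
		return p;		/* bounds check: no room left */
	room = (size_t)(end - p);

	va_start(ap, fmt);
	n = vsnprintf(p, room, fmt, ap);
	va_end(ap);

	if (n < 0)
		return p;		/* formatting error, nothing appended */
	if ((size_t)n >= room)
		return end - 1;		/* output was truncated */
	return p + n;
}
```

A caller keeps the returned pointer and passes it to the next call, so a chain
of appends can never run past the end of the buffer.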
5 years agorasdaemon: fix magic number issues reported by static code analysis for hip08
Xiaofei Tan [Tue, 26 Nov 2019 13:20:06 +0000 (14:20 +0100)]
rasdaemon: fix magic number issues reported by static code analysis for hip08

Fix magic number issues reported by static code analysis for hip08.

CC: Xiaofei Tan <tanxiaofei@huawei.com>, <linuxarm@huawei.com>, <shiju.jose@huawei.com>, <jonathan.cameron@huawei.com>
Date: Tue, 26 Nov 2019 20:12:33 +0800
Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>