www.infradead.org Git - users/mchehab/rasdaemon.git/log
18 months ago rasdaemon: Add new MA_LLC, USR_DP, and USR_CP bank types.
Muralidhara M K [Fri, 30 Jun 2023 10:36:53 +0000 (10:36 +0000)]
rasdaemon: Add new MA_LLC, USR_DP, and USR_CP bank types.

Add HWID and McaType values for new SMCA bank types
and error decoding for those new SMCA banks.

Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
18 months ago rasdaemon: Add support for post-processing MCA errors
Avadhut Naik [Mon, 22 May 2023 22:13:17 +0000 (22:13 +0000)]
rasdaemon: Add support for post-processing MCA errors

Currently, rasdaemon performs detailed error decoding of received MCA
errors only while it is running, either as a daemon or in the
foreground.

As such, error decoding cannot be undertaken for any MCA errors received
while rasdaemon wasn't running. Additionally, if error decoding modules
like edac_mce_amd have not been loaded either, error records in the
dmesg buffer might contain only the raw values of the associated MSRs,
compelling users to decode them manually. This scenario is becoming more
plausible on AMD systems with Scalable MCA (SMCA), since there are plans
to remove the SMCA Extended Error Descriptions from the edac_mce_amd
module in an effort to offload SMCA error decoding to rasdaemon.

As such, add support to post-process and decode MCA Errors received on AMD
SMCA systems from raw MSR values. Support for post-processing and decoding
of MCA Errors received on CPUs of other vendors can be added in the future,
as needed.
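
As a rough illustration of what decoding from a raw MSR value involves, the
sketch below splits an MCA status value into a few fields. The flag bit
positions follow the architectural MCi_STATUS layout, and the 6-bit extended
error code at bits 21:16 is the SMCA convention as far as I know. The struct
and function names are hypothetical and are not taken from rasdaemon.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural flag bits of the MCi_STATUS register. */
#define MCI_STATUS_VAL   (1ULL << 63) /* register holds a valid error */
#define MCI_STATUS_OVER  (1ULL << 62) /* an earlier error was overwritten */
#define MCI_STATUS_UC    (1ULL << 61) /* error was not corrected */
#define MCI_STATUS_ADDRV (1ULL << 58) /* MCi_ADDR holds a valid address */

/* Hypothetical container for the decoded fields. */
struct mca_status_fields {
	bool valid;
	bool overflow;
	bool uncorrected;
	bool addr_valid;
	unsigned int xec;        /* extended error code, bits 21:16 (assumed SMCA layout) */
	unsigned int error_code; /* MCA error code, bits 15:0 */
};

/* Split a raw status value into its fields without touching any hardware. */
static struct mca_status_fields decode_mca_status(uint64_t status)
{
	struct mca_status_fields f;

	f.valid       = (status & MCI_STATUS_VAL) != 0;
	f.overflow    = (status & MCI_STATUS_OVER) != 0;
	f.uncorrected = (status & MCI_STATUS_UC) != 0;
	f.addr_valid  = (status & MCI_STATUS_ADDRV) != 0;
	f.xec         = (unsigned int)((status >> 16) & 0x3f);
	f.error_code  = (unsigned int)(status & 0xffff);
	return f;
}
```

A caller would pass the status value copied from a log line and then use the
extended error code to pick a textual description for the bank type.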

Suggested-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
18 months ago rasdaemon: Handle reassigned bit definitions for CS SMCA
Avadhut Naik [Mon, 24 Apr 2023 20:35:56 +0000 (20:35 +0000)]
rasdaemon: Handle reassigned bit definitions for CS SMCA

Currently, on AMD systems with Scalable MCA (SMCA), each machine check
error of an SMCA bank type has an associated bit position in the bank's
control (CTL) register, used for enabling / disabling reporting of that
error. An error's bit position in the CTL register is also used during
error decoding as an offset into the corresponding bank's error
description structure. As new errors are added in newer AMD systems for
existing SMCA bank types, the underlying SMCA architecture guarantees
that the bit positions of existing errors are not altered.
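
The indexing scheme described above can be sketched as a table lookup keyed
by the error's bit position. The table contents and the function name below
are hypothetical placeholders, not rasdaemon's real descriptions. The bounds
check covers bit positions the table does not know about yet.

```c
#include <stddef.h>

#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))

/* Hypothetical descriptions for one bank type; index == CTL bit position. */
static const char * const example_bank_errors[] = {
	"Example error at CTL bit 0",
	"Example error at CTL bit 1",
	"Example error at CTL bit 2",
};

/* Return the description for a bit position, or a fallback if out of range. */
static const char *describe_error(const char * const *table, size_t len,
				  unsigned int bit)
{
	if (bit >= len)
		return "Unknown error";
	return table[bit];
}
```

Because new errors are only appended, older entries keep their index. A
reassigned bit breaks exactly this assumption, which is why a separate table
is needed in that case.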

However, on some AMD systems, viz. Genoa, some of the existing bit
definitions in the CTL register of the Coherent Slave (CS) SMCA bank type
are reassigned without defining a new HWID and McaType. Consequently, the
errors whose bit definitions have been reassigned in the CTL register
are decoded incorrectly.

As a solution, create a new software-defined SMCA bank type by utilizing
one of the hardware-reserved values for HWID. The new SMCA bank type will
only be employed for CS error decoding on affected CPU models.

Additionally, since the existing error description structure for the CS
SMCA bank type is still valid, add a new error description structure to
compensate for the reassigned bit definitions.

Signed-off-by: Avadhut Naik <avadnaik@amd.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
18 months ago rasdaemon: Update SMCA bank error descriptions
Avadhut Naik [Tue, 18 Apr 2023 18:24:21 +0000 (18:24 +0000)]
rasdaemon: Update SMCA bank error descriptions

Update and reword some existing SMCA bank type error descriptions to extend
SMCA error decoding functionality for modern AMD processors. Additionally,
add new error descriptions for missing SMCA bank types.

Signed-off-by: Avadhut Naik <avadnaik@amd.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
18 months ago add ':' before error output
weidong [Tue, 8 Aug 2023 08:59:12 +0000 (08:59 +0000)]
add ':' before error output

All error prints except the disk one are preceded by a colon.

Signed-off-by: weidong <weidongkl@sina.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
18 months ago Add label for mainboard: ASUSTeK COMPUTER INC. Model: Z9PH-D16 Series
garadar [Fri, 14 Jul 2023 17:45:28 +0000 (19:45 +0200)]
Add label for mainboard: ASUSTeK COMPUTER INC. Model: Z9PH-D16 Series

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
18 months ago Add label for mainboard: GIGABYTE model MZ62-HD0-00
alberta [Fri, 14 Jul 2023 16:19:11 +0000 (18:19 +0200)]
Add label for mainboard: GIGABYTE model MZ62-HD0-00

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
18 months ago Check CPUs online, not configured.
Zeph / Liz Loss-Cutler-Hull [Sun, 9 Jul 2023 11:57:19 +0000 (04:57 -0700)]
Check CPUs online, not configured.

When the number of CPUs detected is greater than the number of CPUs in
the system, rasdaemon will crash when it receives some events.

Looking deeper, we also fail to use the poll method for similar reasons
in this case.

All of this can be prevented by checking to see how many CPUs are
currently online (sysconf(_SC_NPROCESSORS_ONLN)) instead of how many
CPUs the current kernel was configured to support
(sysconf(_SC_NPROCESSORS_CONF)).
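
A minimal sketch of the two queries follows. The wrapper names are
hypothetical, and falling back to 1 when sysconf() fails is an assumption of
this example, not necessarily what rasdaemon does.

```c
#include <unistd.h>

/* Number of CPUs currently online; falls back to 1 if the query fails. */
static long cpus_online(void)
{
	long n = sysconf(_SC_NPROCESSORS_ONLN);

	return n > 0 ? n : 1;
}

/* Number of CPUs the system is configured for; may exceed cpus_online(). */
static long cpus_configured(void)
{
	long n = sysconf(_SC_NPROCESSORS_CONF);

	return n > 0 ? n : 1;
}
```

Sizing per-CPU file descriptor arrays and loops with cpus_online() avoids
touching per-CPU trace files for CPUs that are not online.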

For the kernel side of the discussion, see https://lore.kernel.org/lkml/CAM6Wdxft33zLeeXHhmNX5jyJtfGTLiwkQSApc=10fqf+rQh9DA@mail.gmail.com/T/
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
18 months ago rasdaemon: Add support for the CXL memory module events
Shiju Jose [Wed, 5 Apr 2023 15:16:19 +0000 (16:16 +0100)]
rasdaemon: Add support for the CXL memory module events

Add support to log and record the CXL memory module events.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
18 months ago rasdaemon: Add support for the CXL dram events
Shiju Jose [Wed, 5 Apr 2023 12:28:20 +0000 (13:28 +0100)]
rasdaemon: Add support for the CXL dram events

Add support to log and record the CXL dram events.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
18 months ago rasdaemon: Add support for the CXL general media events
Shiju Jose [Wed, 5 Apr 2023 10:54:41 +0000 (11:54 +0100)]
rasdaemon: Add support for the CXL general media events

Add support to log and record the CXL general media events.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
18 months ago rasdaemon: Add support for the CXL generic events
Shiju Jose [Tue, 4 Apr 2023 17:49:09 +0000 (18:49 +0100)]
rasdaemon: Add support for the CXL generic events

Add support to log and record the CXL generic events.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
18 months ago rasdaemon: Add support for the CXL overflow events
Shiju Jose [Tue, 4 Apr 2023 15:50:50 +0000 (16:50 +0100)]
rasdaemon: Add support for the CXL overflow events

Add support to log and record the CXL overflow events.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
18 months ago rasdaemon: Add common function to get timestamp for the event
Shiju Jose [Tue, 4 Apr 2023 15:07:21 +0000 (16:07 +0100)]
rasdaemon: Add common function to get timestamp for the event

Add common function to get the timestamp for the event
reported.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
18 months ago rasdaemon: Add common function to convert timestamp in the CXL event records to the...
Shiju Jose [Tue, 4 Apr 2023 13:40:42 +0000 (14:40 +0100)]
rasdaemon: Add common function to convert timestamp in the CXL event records to the broken-down time format

Add common function to convert the timestamp in the CXL event records
in nanoseconds to the broken-down time format.
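
A conversion of this kind can be sketched as below. It assumes the timestamp
counts nanoseconds since the Unix epoch and that UTC is wanted. The function
name is hypothetical, and the real helper may use local time instead.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000ULL

/*
 * Convert nanoseconds since the Unix epoch into broken-down UTC time.
 * Returns 0 on success, -1 if the conversion fails.
 */
static int ns_to_tm(uint64_t ns, struct tm *out)
{
	time_t secs = (time_t)(ns / NSEC_PER_SEC);

	return gmtime_r(&secs, out) ? 0 : -1;
}
```

The sub-second remainder (ns % NSEC_PER_SEC) is dropped here. A caller that
needs it can keep it alongside the struct tm.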

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
18 months ago rasdaemon: Add support for creating the vendor error tables at startup
Shiju Jose [Wed, 31 May 2023 15:24:36 +0000 (16:24 +0100)]
rasdaemon: Add support for creating the vendor error tables at startup

1. Add support for creating/opening the vendor error tables at rasdaemon startup.
2. Make the corresponding changes in the HiSilicon error handling code.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
18 months ago rasdaemon: fix issue of signed and unsigned integer comparison and remove redundant...
Xiaofei Tan [Tue, 30 May 2023 10:44:12 +0000 (11:44 +0100)]
rasdaemon: fix issue of signed and unsigned integer comparison and remove redundant header file

1. The return value of ARRAY_SIZE() is an unsigned integer. It isn't right to
compare it with a signed integer. This patch fixes such comparisons.

2. Remove a redundant header file and adjust the order of the header files.
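
The pitfall in item 1 can be shown with a small sketch. ARRAY_SIZE() is
defined here in the usual sizeof-based way, which is an assumption about the
macro. The array and function names are hypothetical.

```c
#include <stddef.h>

#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))

static const char * const names[] = { "a", "b", "c" };

/*
 * What "idx < ARRAY_SIZE(names)" effectively computes for an int idx:
 * the int is converted to size_t first, so -1 becomes a huge value.
 */
static int lt_after_implicit_conversion(int idx)
{
	return (size_t)idx < ARRAY_SIZE(names);
}

/* The comparison a reader probably expects: both sides signed. */
static int lt_signed(int idx)
{
	return idx < (int)ARRAY_SIZE(names);
}
```

For idx == -1 the two functions disagree, which is the kind of surprise the
compiler's sign-compare warning points at. Using an unsigned index type for
loops over ARRAY_SIZE() avoids the mixed comparison altogether.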

Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
18 months ago rasdaemon: fix return value type issue of read/write function from unistd.h
Xiaofei Tan [Thu, 11 May 2023 02:54:26 +0000 (10:54 +0800)]
rasdaemon: fix return value type issue of read/write function from unistd.h

The return type of the read/write functions from unistd.h is ssize_t.
It is a signed type, and the functions return -1 on error. Fix the
incorrect use in the function read_ras_event_all_cpus().

Also, move the setting of buffer_percent into a separate function.
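
A small sketch of why the type matters: if the result of read() is stored in
an unsigned variable, -1 turns into a very large positive value and the error
goes unnoticed. The helper below keeps the result in an ssize_t. Its name is
hypothetical, and retrying on EINTR is an extra chosen for this example.

```c
#include <errno.h>
#include <unistd.h>

/*
 * Read up to len bytes from fd.
 * Returns the number of bytes read, 0 on end of file, or -1 on error.
 */
static ssize_t read_retry(int fd, void *buf, size_t len)
{
	ssize_t n;

	do {
		n = read(fd, buf, len);
	} while (n < 0 && errno == EINTR); /* interrupted by a signal: retry */

	return n;
}
```

Callers can then test "n < 0" for errors and "n == 0" for end of file before
treating n as a byte count.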

Fixes: 94750bcf9309 ("rasdaemon: Fix poll() on per_cpu trace_pipe_raw blocks indefinitely")
Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
18 months ago Rasdaemon: Fix autoreconf build error
Ayush Jain [Tue, 23 May 2023 06:55:36 +0000 (12:25 +0530)]
Rasdaemon: Fix autoreconf build error

When building rasdaemon with autoreconf, on certain distros
we see the following error message.
Makefile.am: error: required file './README' not found
Autoreconf looks for README file instead of README.md
Fix this by passing 'foreign' to AM_INIT_AUTOMAKE.

Signed-off-by: Ayush Jain <ayush.jain3@amd.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
18 months ago ras-events: quit loop in read_ras_event when kbuf data is broken
hubin [Thu, 18 May 2023 08:14:41 +0000 (16:14 +0800)]
ras-events: quit loop in read_ras_event when kbuf data is broken

When kbuf data is broken, kbuffer_next_event() may move kbuf->index back to
the current kbuf->index position, causing an infinite loop.

In this situation, rasdaemon repeatedly parses an invalid event and
prints warnings like "ug! negative record size -8!", pushing CPU
utilization to 100%.

When kbuf data is broken, discard the current page and continue reading
the next kbuf page.
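
The idea of the guard can be sketched without the kbuffer API. The toy page
format and the function name below are hypothetical: each record starts with
one byte holding its total length. If parsing would not move the index
forward, the rest of the page is dropped instead of being parsed again.

```c
#include <stddef.h>

/*
 * Parse one page of length-prefixed records (toy format, see above).
 * Returns the number of records parsed before the page ended or was
 * found to be corrupt.
 */
static size_t parse_page(const unsigned char *page, size_t size)
{
	size_t idx = 0, parsed = 0;

	while (idx < size) {
		size_t next = idx + page[idx];

		/* No forward progress or overrun: corrupt page, drop the rest. */
		if (next <= idx || next > size)
			break;
		parsed++;
		idx = next;
	}
	return parsed;
}
```

Without the "next <= idx" check, a zero length byte would keep the loop on
the same record forever, which mirrors the 100% CPU symptom described above.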

Signed-off-by: hubin <hubin73@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
23 months ago rasdaemon: Add support for the CXL AER correctable errors
Shiju Jose [Fri, 17 Mar 2023 13:07:01 +0000 (13:07 +0000)]
rasdaemon: Add support for the CXL AER correctable errors

Add support to log and record the CXL AER correctable errors.

The corresponding Kernel patches are here:
https://lore.kernel.org/linux-cxl/166974401763.1608150.5424589924034481387.stgit@djiang5-desk3.ch.intel.com/T/#t
https://lore.kernel.org/linux-cxl/63e5ed38d77d9_138fbc2947a@iweiny-mobl.notmuch/T/#t

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
23 months ago rasdaemon: Add support for the CXL AER uncorrectable errors
Shiju Jose [Fri, 17 Mar 2023 12:51:02 +0000 (12:51 +0000)]
rasdaemon: Add support for the CXL AER uncorrectable errors

Add support to log and record the CXL AER uncorrectable errors.

The corresponding Kernel patches are here:
https://lore.kernel.org/linux-cxl/166974401763.1608150.5424589924034481387.stgit@djiang5-desk3.ch.intel.com/T/#t
https://lore.kernel.org/lkml/63eeb2a8c9e3f_32d612941f@dwillia2-xfh.jf.intel.com.notmuch/T/

It was found that the header log data has to be converted to big-endian
format to be stored correctly in the SQLite DB, likely because the
SQLite database seems to use big-endian storage.
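
One way to do such a conversion is sketched below. It assumes the header log
is an array of 32-bit words, and the function name is hypothetical. Writing
the bytes out with shifts gives the same result on little-endian and
big-endian hosts.

```c
#include <stddef.h>
#include <stdint.h>

/* Serialize n 32-bit words into out (4 * n bytes), most significant byte first. */
static void words_to_be_bytes(const uint32_t *words, size_t n, uint8_t *out)
{
	for (size_t i = 0; i < n; i++) {
		out[4 * i + 0] = (uint8_t)(words[i] >> 24);
		out[4 * i + 1] = (uint8_t)(words[i] >> 16);
		out[4 * i + 2] = (uint8_t)(words[i] >> 8);
		out[4 * i + 3] = (uint8_t)(words[i]);
	}
}
```

The resulting byte buffer can then be stored as a blob, so that a hex dump of
the stored value reads the same as the original words.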

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
23 months ago rasdaemon: Add support for the CXL poison events
Shiju Jose [Fri, 31 Mar 2023 12:35:13 +0000 (13:35 +0100)]
rasdaemon: Add support for the CXL poison events

Add support to log and record the CXL poison events.

The corresponding Kernel patches are here:
https://lore.kernel.org/linux-cxl/64457d30bae07_2028294ac@dwillia2-xfh.jf.intel.com.notmuch/

Presently this is for logging only. It could be extended with a
policy-based recovery action for frequent poison events, depending on the
above kernel patches.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
23 months ago rasdaemon: Move definition for BIT and BIT_ULL to a common file
Shiju Jose [Mon, 16 Jan 2023 17:13:32 +0000 (17:13 +0000)]
rasdaemon: Move definition for BIT and BIT_ULL to a common file

Move definition for BIT() and BIT_ULL() to the
common file ras-record.h
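
For readers unfamiliar with these helpers, the sketch below shows the usual
Linux-style definitions. It is an assumption that ras-record.h defines them
the same way, and flag_is_set() is a hypothetical helper added for
illustration.

```c
/* Single-bit masks; BIT_ULL() is the one to use for bits above 31. */
#define BIT(nr)     (1UL << (nr))
#define BIT_ULL(nr) (1ULL << (nr))

/* Test whether bit nr is set in a 64-bit register value. */
static int flag_is_set(unsigned long long reg, unsigned int nr)
{
	return (reg & BIT_ULL(nr)) != 0;
}
```

Since unsigned long may be only 32 bits wide on some targets, 64-bit register
values are safer to test with BIT_ULL().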

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
23 months ago ras-mc-ctl: add option to exclude old events from reports
Marcus Sundman [Thu, 20 Apr 2023 15:17:17 +0000 (18:17 +0300)]
ras-mc-ctl: add option to exclude old events from reports

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
23 months ago rasdaemon: fix table create if some cpus are offline
Shiju Jose [Sun, 5 Mar 2023 23:14:42 +0000 (23:14 +0000)]
rasdaemon: fix table create if some cpus are offline

Fix a regression in ras_mc_create_table() when some CPUs are offline
at system start.

Issue:

There is a regression in ras_mc_create_table() if some of the CPUs are
offline at system start when rasdaemon is run.

The issue is reproducible in ras_mc_create_table() when decoding and
recording non-standard events, and is sometimes reproducible in
ras_mc_create_table() for the standard events.

Also, in the multi-threaded mode, there is a memory leak in
ras_mc_event_opendb(), because struct sqlite3_priv *priv and sqlite3 *db
are allocated/initialized per thread but stored in the common
struct ras_events ras in the pthread data, which is shared across the
threads.

Reason:

When the system starts with some of the CPUs offline and rasdaemon is
then run, read_ras_event_all_cpus() exits with an error and rasdaemon
switches to the multi-threaded mode. However, read() in read_ras_event()
returns an error in the threads for each of the offline CPUs, and those
threads clean up, including calling ras_mc_event_closedb().

Since the 'struct ras_events ras' passed in the pthread_data to each of
the threads is common, the struct sqlite3_priv *priv and sqlite3 *db that
are allocated/initialized per thread and stored in the common
'struct ras_events ras' get overwritten in each ras_mc_event_opendb()
call (made from the per-CPU pthread), resulting in a memory leak.

Also, when ras_mc_event_closedb() is called in the above error case from
the threads corresponding to the offline CPUs, it closes the sqlite3 *db
and frees the sqlite3_priv *priv stored in the common
'struct ras_events ras', resulting in a regression when priv->db is
accessed in ras_mc_create_table() from another context later.

Solution:

In ras_mc_event_opendb(), allocate struct sqlite3_priv *priv, initialize
sqlite3 *db and create the tables once for all threads sharing the
'struct ras_events ras', based on a reference count, and free them in the
same way.

Also, protect the critical code in ras_mc_event_opendb() and
ras_mc_event_closedb() with a mutex in the multi-threaded case, to avoid
any regression caused by thread pre-emption.
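
The reference-counting scheme can be sketched as follows. All type and
function names here are hypothetical stand-ins, and the SQLite calls are
replaced by a plain allocation so that the example stays self-contained.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for the shared database state. */
struct db_state {
	int dummy_handle;
};

struct shared_ctx {
	pthread_mutex_t lock;
	unsigned int refcount;
	struct db_state *db;
};

/* The first caller allocates the state; later callers only take a reference. */
static int ctx_open(struct shared_ctx *ctx)
{
	int ret = 0;

	pthread_mutex_lock(&ctx->lock);
	if (ctx->refcount == 0) {
		ctx->db = calloc(1, sizeof(*ctx->db));
		if (!ctx->db)
			ret = -1;
	}
	if (ret == 0)
		ctx->refcount++;
	pthread_mutex_unlock(&ctx->lock);
	return ret;
}

/* The last caller to drop its reference frees the state. */
static void ctx_close(struct shared_ctx *ctx)
{
	pthread_mutex_lock(&ctx->lock);
	if (ctx->refcount > 0 && --ctx->refcount == 0) {
		free(ctx->db);
		ctx->db = NULL;
	}
	pthread_mutex_unlock(&ctx->lock);
}
```

With this shape, a thread that fails early and closes its reference cannot
free state that other threads are still using.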

Reported-by: Lei Feng <fenglei47@h-partners.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
23 months ago configure.ac: fix bashisms
Sam James [Sun, 19 Feb 2023 18:33:20 +0000 (18:33 +0000)]
configure.ac: fix bashisms

configure scripts need to be runnable with a POSIX-compliant /bin/sh.

On many (but not all!) systems, /bin/sh is provided by Bash, so errors
like this aren't spotted. Notably Debian defaults to /bin/sh provided
by dash which doesn't tolerate such bashisms as '=='.

This retains compatibility with bash.

Fixes configure warnings/errors like:
```
checking for libtraceevent... yes
./configure: 13430: test: x: unexpected operator
./configure: 13439: test: x: unexpected operator
```

Signed-off-by: Sam James <sam@gentoo.org>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years ago Fix create release workflow
Mauro Carvalho Chehab [Sat, 18 Feb 2023 17:26:33 +0000 (18:26 +0100)]
Fix create release workflow

make dist-bzip2 requires configure to work, which, in turn, depends
on having some tools installed.

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years ago ci.yml: fix workflow to build rasdaemon
Mauro Carvalho Chehab [Sat, 18 Feb 2023 13:04:51 +0000 (14:04 +0100)]
ci.yml: fix workflow to build rasdaemon

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years ago ChangeLog: do some minor updates
Mauro Carvalho Chehab [Sat, 18 Feb 2023 13:04:05 +0000 (14:04 +0100)]
ChangeLog: do some minor updates

It is missing an entry about new labels. Also, version is at the
wrong place.

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years ago Bump version to 0.8.0 v0.8.0
Mauro Carvalho Chehab [Sat, 18 Feb 2023 08:45:50 +0000 (09:45 +0100)]
Bump version to 0.8.0

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years ago labels/asrock: add X399D8A-2T
tictooc [Sat, 11 Feb 2023 17:40:29 +0000 (17:40 +0000)]
labels/asrock: add X399D8A-2T

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years ago Convert README to markdown format
Mauro Carvalho Chehab [Sat, 18 Feb 2023 08:45:56 +0000 (09:45 +0100)]
Convert README to markdown format

That allows git??b to better parse it.

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years ago misc/rasdaemon.spec.in: add libtraceevent requirement
Mauro Carvalho Chehab [Sat, 18 Feb 2023 08:15:07 +0000 (09:15 +0100)]
misc/rasdaemon.spec.in: add libtraceevent requirement

As we're now not bundling libtraceevent inside RASdaemon, packaging
it now requires the library.

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years ago Makefile.am: fix mock build target
Mauro Carvalho Chehab [Sat, 18 Feb 2023 08:08:08 +0000 (09:08 +0100)]
Makefile.am: fix mock build target

Mock now makes it mandatory to add the install dir; otherwise it
refuses to build. So, add it.

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years ago rasdaemon: Fix poll() on per_cpu trace_pipe_raw blocks indefinitely
Shiju Jose [Sat, 4 Feb 2023 19:15:55 +0000 (19:15 +0000)]
rasdaemon: Fix poll() on per_cpu trace_pipe_raw blocks indefinitely

Error events are not received by rasdaemon since kernel 6.1-rc6. This
issue was first detected and reported when testing the CXL error events
in rasdaemon.

Debugging showed that poll() on trace_pipe_raw in ras-events.c does not
return, and that this issue is seen after the commit
42fb0a1e84ff525ebe560e2baf9451ab69127e2b ("tracing/ring-buffer: Have
polling block on watermark").

This issue was also verified using a test application for poll()
and select() on per_cpu trace_pipe_raw.

There is also a bug report on this issue:
https://lore.kernel.org/all/31eb3b12-3350-90a4-a0d9-d1494db7cf74@oracle.com/

This issue occurs for the per_cpu case, which calls
ring_buffer_poll_wait() in kernel/trace/ring_buffer.c with
buffer_percent > 0 and then waits until that percentage of pages is
available. The default value of buffer_percent is 50, set in
kernel/trace/trace.c. However, poll() does not return even when the
percentage-of-pages condition is met.

As a fix, rasdaemon sets buffer_percent to 0 through
/sys/kernel/debug/tracing/instances/rasdaemon/buffer_percent, so that the
task wakes up as soon as data is added to any of the per-CPU buffers and
poll() on per_cpu/cpuX/trace_pipe_raw does not block indefinitely.
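
Writing the setting amounts to storing a decimal number in that file. A
minimal sketch, with a hypothetical helper name and not taken from the
rasdaemon sources:

```c
#include <stdio.h>

/* Write value as decimal text into path. Returns 0 on success, -1 on failure. */
static int write_int_to_file(const char *path, int value)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = fprintf(f, "%d", value) > 0;
	if (fclose(f) != 0) /* buffered write errors surface here */
		ok = 0;
	return ok ? 0 : -1;
}
```

It would be called as
write_int_to_file("/sys/kernel/debug/tracing/instances/rasdaemon/buffer_percent", 0),
which needs sufficient privileges and a kernel that exposes the file.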

This depends on the kernel fix commit
3e46d910d8acf94e5360126593b68bf4fee4c4a1 ("tracing: Fix poll() and select()
do not work on per_cpu trace_pipe and trace_pipe_raw").

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
2 years ago README: Update instructions about how to contribute
Mauro Carvalho Chehab [Mon, 23 Jan 2023 14:29:33 +0000 (15:29 +0100)]
README: Update instructions about how to contribute

Nowadays, we're only using github in practice for development.
Make that clearer in the documentation.

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years ago Makefile.am: enable all options on make distcheck
Mauro Carvalho Chehab [Sat, 21 Jan 2023 13:06:54 +0000 (14:06 +0100)]
Makefile.am: enable all options on make distcheck

Ensure that all modules are enabled on "make distcheck".

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years ago configure.ac: get rid of obsolete macros
Mauro Carvalho Chehab [Sat, 21 Jan 2023 13:04:19 +0000 (14:04 +0100)]
configure.ac: get rid of obsolete macros

Use autoupdate 2.71, in order to get rid of obsoleted macros.

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years ago ci.yml: add libtraceevent-dev dependency
Mauro Carvalho Chehab [Sat, 21 Jan 2023 12:41:59 +0000 (13:41 +0100)]
ci.yml: add libtraceevent-dev dependency

This is needed to build the newest version of rasdaemon.

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years ago Remove the old libtrace
Mauro Carvalho Chehab [Sat, 21 Jan 2023 08:23:57 +0000 (09:23 +0100)]
Remove the old libtrace

Now that rasdaemon is using the libtraceevent library, we
can get rid of our own fork.

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years ago Adjust indentations
Mauro Carvalho Chehab [Sat, 21 Jan 2023 08:59:57 +0000 (09:59 +0100)]
Adjust indentations

With the function rename due to the usage of libtraceevent
library, adjust some indentations.

No functional changes.

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years ago Convert to use libtraceevent
Mauro Carvalho Chehab [Sat, 21 Jan 2023 08:23:57 +0000 (09:23 +0100)]
Convert to use libtraceevent

For a long time, rasdaemon used an early version of this library,
with the code embedded directly into its own code. The rationale was
that the library had not been officially released at that time, but
this has long since changed.

So, instead, just use the library directly.

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years ago on_tag.yml: use a different approach to upload artifact v0.7.0
Mauro Carvalho Chehab [Sun, 22 Jan 2023 06:23:22 +0000 (07:23 +0100)]
on_tag.yml: use a different approach to upload artifact

Use my own upload release asset logic, as it is known to work
already on ZBar.

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years ago Add a release workflow
Mauro Carvalho Chehab [Sat, 21 Jan 2023 13:49:40 +0000 (14:49 +0100)]
Add a release workflow

It should auto-fill the release information and upload
a source distro package tarball.

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years ago Bump version to 0.7.0 libtrace
Mauro Carvalho Chehab [Sat, 21 Jan 2023 06:52:14 +0000 (07:52 +0100)]
Bump version to 0.7.0

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years ago .gitignore: add the auto-generated "compile" file
Mauro Carvalho Chehab [Sat, 21 Jan 2023 06:55:05 +0000 (07:55 +0100)]
.gitignore: add the auto-generated "compile" file

autoreconf is producing a compile file. Ignore it on git status.

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years ago INSTALL: update from latest version of it
Mauro Carvalho Chehab [Sat, 21 Jan 2023 06:54:30 +0000 (07:54 +0100)]
INSTALL: update from latest version of it

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years ago configure.ac: fix bashisms
Sam James [Thu, 29 Dec 2022 17:23:47 +0000 (17:23 +0000)]
configure.ac: fix bashisms

configure scripts need to be runnable with a POSIX-compliant /bin/sh.

On many (but not all!) systems, /bin/sh is provided by Bash, so errors
like this aren't spotted. Notably Debian defaults to /bin/sh provided
by dash which doesn't tolerate such bashisms as '=='.

This retains compatibility with bash.

Signed-off-by: Sam James <sam@gentoo.org>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years ago labels/asus: add ASUS TUF GAMING B450-PLUS II
dgcampea [Mon, 19 Dec 2022 18:53:13 +0000 (18:53 +0000)]
labels/asus: add ASUS TUF GAMING B450-PLUS II

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years ago rasdaemon: Add four modules supported by HiSilicon common section
Xiaofei Tan [Mon, 31 Oct 2022 10:36:26 +0000 (18:36 +0800)]
rasdaemon: Add four modules supported by HiSilicon common section

Add four modules supported by HiSilicon common error section.

Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years ago rasdaemon: Fix for a memory out-of-bounds issue and optimized code to remove duplicat...
Shiju Jose [Thu, 28 Apr 2022 21:59:04 +0000 (22:59 +0100)]
rasdaemon: Fix for a memory out-of-bounds issue and optimized code to remove duplicate function.

Fixed a memory out-of-bounds issue with string pointers and
optimized code structure to remove duplicate function.

Signed-off-by: Lei Feng <fenglei47@h-partners.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years ago rasdaemon: ras-mc-ctl: Updated HiSilicon platform name
Shiju Jose [Thu, 28 Apr 2022 17:58:43 +0000 (18:58 +0100)]
rasdaemon: ras-mc-ctl: Updated HiSilicon platform name

Updated the HiSilicon platform name to KunPeng9xx.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years ago rasdaemon: ras-mc-ctl: Relocate reading and display Kunpeng920 errors to under Kunpeng9xx
Shiju Jose [Mon, 7 Mar 2022 12:38:45 +0000 (12:38 +0000)]
rasdaemon: ras-mc-ctl: Relocate reading and display Kunpeng920 errors to under Kunpeng9xx

Relocate the reading and display of Kunpeng920 errors under Kunpeng9xx.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years ago rasdaemon: ras-mc-ctl: Add support to display the HiSilicon vendor errors for a speci...
Shiju Jose [Sat, 5 Mar 2022 18:19:38 +0000 (18:19 +0000)]
rasdaemon: ras-mc-ctl: Add support to display the HiSilicon vendor errors for a specified module

Add support to display the HiSilicon vendor errors for a specified module.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years ago rasdaemon: ras-mc-ctl: Add printing usage if necessary parameters are not passed...
Shiju Jose [Sat, 5 Mar 2022 17:01:35 +0000 (17:01 +0000)]
rasdaemon: ras-mc-ctl: Add printing usage if necessary parameters are not passed for the vendor-error options

Print the usage message if the necessary parameters are not passed
for the vendor-errors options.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years ago rasdaemon: ras-mc-ctl: Reformat error info of the HiSilicon Kunpeng920
Shiju Jose [Sat, 5 Mar 2022 16:18:55 +0000 (16:18 +0000)]
rasdaemon: ras-mc-ctl: Reformat error info of the HiSilicon Kunpeng920

Reformat the code to display the error info of HiSilicon Kunpeng920.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years ago rasdaemon: ras-mc-ctl: Modify error statistics for HiSilicon KunPeng9xx common errors
Shiju Jose [Thu, 24 Feb 2022 18:02:14 +0000 (18:02 +0000)]
rasdaemon: ras-mc-ctl: Modify error statistics for HiSilicon KunPeng9xx common errors

Modify the error statistics for the HiSilicon KunPeng9xx platforms common errors
to display the statistics and error info based on the module and the error severity.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years ago rasdaemon: Modify recording Hisilicon common error data
Shiju Jose [Wed, 2 Mar 2022 12:20:40 +0000 (12:20 +0000)]
rasdaemon: Modify recording Hisilicon common error data

The error statistics for the HiSilicon common
errors need to be computed per module, error severity, etc.

Modify the recording of HiSilicon common error data to use separate fields
in the SQL DB table instead of a single combined field.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years ago rasdaemon: Support cpu fault isolation for recoverable errors
Shengwei Luo [Wed, 23 Feb 2022 09:23:27 +0000 (17:23 +0800)]
rasdaemon: Support cpu fault isolation for recoverable errors

When a recoverable error occurs in a CPU core, try to offline
the related CPU core.

Signed-off-by: Shengwei Luo <luoshengwei@huawei.com>
Signed-off-by: Junchong Pan <panjunchong@hisilicon.com>
Signed-off-by: Lei Feng <fenglei47@h-partners.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years ago rasdaemon: Support cpu fault isolation for corrected errors
Shengwei Luo [Wed, 23 Feb 2022 09:21:58 +0000 (17:21 +0800)]
rasdaemon: Support cpu fault isolation for corrected errors

When the number of corrected errors exceeds the configured limit within a
cycle, try to offline the related CPU core.
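
One simple way to implement such a limit is a fixed-window counter, sketched
below. The struct, the function name and the window policy are assumptions
made for illustration. The actual rasdaemon bookkeeping may well differ.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical per-core counter state. */
struct ce_counter {
	time_t window_start;
	unsigned int count;
};

/*
 * Record one corrected error seen at time now.
 * Returns true when more than limit errors fell into the current
 * window of cycle_secs seconds.
 */
static bool ce_record(struct ce_counter *c, time_t now,
		      unsigned int limit, time_t cycle_secs)
{
	/* Start a new window on the first error or once the old one expired. */
	if (c->count == 0 || now - c->window_start >= cycle_secs) {
		c->window_start = now;
		c->count = 0;
	}
	c->count++;
	return c->count > limit;
}
```

When ce_record() returns true, the caller would attempt to offline the core,
for example by writing 0 to that CPU's "online" file in sysfs.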

Signed-off-by: Shengwei Luo <luoshengwei@huawei.com>
Signed-off-by: Junchong Pan <panjunchong@hisilicon.com>
Signed-off-by: Lei Feng <fenglei47@h-partners.com>
Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years ago rasdaemon: ras-memory-failure-handler: handle localtime() failure correctly
Aristeu Rozanski [Thu, 19 Jan 2023 13:45:57 +0000 (08:45 -0500)]
rasdaemon: ras-memory-failure-handler: handle localtime() failure correctly

We could just have an empty string but keeping the format could prevent
issues if someone is actually parsing this.
Found with covscan.

v2: fixed the timestamp as pointed out by Robert Elliott
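
The approach of keeping the format on failure can be sketched as below. The
function name, the format string and the fallback value are assumptions made
for this example.

```c
#include <stdio.h>
#include <time.h>

/* Placeholder with the same layout as a successfully formatted timestamp. */
#define FALLBACK_TS "1970-01-01 00:00:00 +0000"

/* Format when as "YYYY-MM-DD HH:MM:SS +ZZZZ", or write the placeholder on failure. */
static void format_timestamp(time_t when, char *buf, size_t len)
{
	struct tm *tm = localtime(&when);

	if (!tm || strftime(buf, len, "%Y-%m-%d %H:%M:%S %z", tm) == 0)
		snprintf(buf, len, "%s", FALLBACK_TS);
}
```

Anything that parses the timestamp column keeps seeing the same layout,
whether or not localtime() succeeded.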

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years ago rasdaemon: mce-amd-smca: properly limit bank types
Aristeu Rozanski [Thu, 19 Jan 2023 13:45:57 +0000 (08:45 -0500)]
rasdaemon: mce-amd-smca: properly limit bank types

Found with covscan.

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years ago rasdaemon: ras-report: fix possible but unlikely file descriptor leak
Aristeu Rozanski [Thu, 19 Jan 2023 13:45:57 +0000 (08:45 -0500)]
rasdaemon: ras-report: fix possible but unlikely file descriptor leak

Found with covscan.

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years ago libtrace: Use XSI version of strerror_r on non glibc systems
Khem Raj [Wed, 31 Aug 2022 02:54:35 +0000 (19:54 -0700)]
libtrace: Use XSI version of strerror_r on non glibc systems

The version used is glibc-specific, so restrict it to glibc
and provide a fallback for non-glibc systems.

Signed-off-by: Khem Raj <raj.khem@gmail.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years ago rasdaemon: use the new block_rq_error tracepoint
Yang Shi [Mon, 4 Apr 2022 23:34:05 +0000 (16:34 -0700)]
rasdaemon: use the new block_rq_error tracepoint

Since Linux 5.18-rc1, a new block tracepoint called block_rq_error is
available, dedicated to tracing disk error events.  Currently
rasdaemon uses block_rq_complete, which also traces successful completions.
That produces excessive trace logs and some overhead, since the event is
triggered quite often.

Use the new tracepoint for disk error reporting. It has the same format
as block_rq_complete.
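
A sketch of how a tool could choose between the two tracepoints, assuming
it probes the tracefs events directory. This is not necessarily how
rasdaemon decides; pick_block_event() and its events_dir parameter are
hypothetical.

```c
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: events_dir is the tracefs "events" directory.
 * Prefer block_rq_error when the kernel exposes it, otherwise fall
 * back to block_rq_complete.
 */
static const char *pick_block_event(const char *events_dir)
{
	char path[4096];

	snprintf(path, sizeof(path), "%s/block/block_rq_error", events_dir);
	if (access(path, F_OK) == 0)
		return "block_rq_error";
	return "block_rq_complete";
}
```

Because both events share one format, the same handler can be registered
for whichever name is returned.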

Signed-off-by: Yang Shi <shy828301@gmail.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
3 years agoBump version to 0.6.8 v0.6.8
Mauro Carvalho Chehab [Fri, 1 Apr 2022 10:50:08 +0000 (12:50 +0200)]
Bump version to 0.6.8

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
3 years agomisc/rasdaemon.spec.in: fix some issues on it
Mauro Carvalho Chehab [Fri, 1 Apr 2022 10:34:23 +0000 (12:34 +0200)]
misc/rasdaemon.spec.in: fix some issues on it

Sort of sync this file with Fedora's upstream, addressing some
sysfsdir install bugs.

After this change, the main difference is that Fedora uses
different config settings depending on the
architecture:

-%configure --enable-all --with-sysconfdefdir=%{_sysconfdir}/sysconfig
+%ifarch %{arm} aarch64
+%configure --enable-sqlite3 --enable-aer --enable-mce --enable-extlog --enable-devlink --enable-diskerror --enable-abrt-report --enable-non-standard --enable-arm --enable-hisi-ns-decode --with-sysconfdefdir=%{_sysconfdir}/sysconfig
+%else
+%configure --enable-sqlite3 --enable-aer --enable-mce --enable-extlog --enable-devlink --enable-diskerror --enable-abrt-report --with-sysconfdefdir=%{_sysconfdir}/sysconfig
+%endif

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
3 years agoMakefile.am: clean output from misc/*.in
Mauro Carvalho Chehab [Fri, 1 Apr 2022 09:41:39 +0000 (11:41 +0200)]
Makefile.am: clean output from misc/*.in

Cleanup files that are generated at build time from the *.in
input files.

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
3 years agorasdaemon: Add some modules supported by hisi common error section
Xiaofei Tan [Wed, 20 Oct 2021 06:33:40 +0000 (14:33 +0800)]
rasdaemon: Add some modules supported by hisi common error section

Add some modules supported by hisi common error section. Also remove
HHA: it is a module for some older platforms and occupies the same place
as MATA.

Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
3 years agorasdaemon: Fix some print format issues for hisi common error section
Xiaofei Tan [Wed, 20 Oct 2021 06:33:39 +0000 (14:33 +0800)]
rasdaemon: Fix some print format issues for hisi common error section

It is not right to use '%d' to print uint8_t and uint16_t values, although
it causes no functional issue. Use '%hhu' and '%hu' respectively instead.
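
For illustration, a small sketch of the matching length modifiers. The
function and field names are invented and do not come from the hisi
decoder.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical fields: %hhu matches uint8_t, %hu matches uint16_t. */
static int format_ids(char *buf, size_t len, uint8_t socket_id,
		      uint16_t module_id)
{
	return snprintf(buf, len, "socket=%hhu module=%hu",
			socket_id, module_id);
}
```

With matching specifiers, compilers that check format strings (for
example with -Wformat) stay quiet.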

Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
3 years agorasdaemon: Fix the issue of command option -r for hip08
Xiaofei Tan [Wed, 20 Oct 2021 06:33:38 +0000 (14:33 +0800)]
rasdaemon: Fix the issue of command option -r for hip08

For hip08, events are recorded even when the -r option is not provided.
That is not right, so fix it.

Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
3 years agorasdaemon: Fix the issue of sprintf data type mismatch in uuid_le()
Xiaofei Tan [Wed, 20 Oct 2021 06:33:37 +0000 (14:33 +0800)]
rasdaemon: Fix the issue of sprintf data type mismatch in uuid_le()

The data type passed to sprintf in the function uuid_le() is mismatched.
The arm64 compiler treats char as unsigned by default, so it works normally.
But if someone compiles it with the option -fsigned-char, the function
doesn't work correctly.
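
The effect can be sketched as follows. With a signed char, a byte such as
0xff is promoted to the int value -1 and "%.2x" prints "ffffffff"; casting
to unsigned char first keeps two hex digits per byte. hex_bytes() is a
hypothetical helper, not the uuid_le() code itself.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: write n bytes of in[] as lowercase hex into out. */
static void hex_bytes(const char *in, size_t n, char *out, size_t outlen)
{
	size_t i, used = 0;

	if (outlen)
		out[0] = '\0';
	/* Each byte needs two hex digits plus room for the trailing NUL. */
	for (i = 0; i < n && used + 3 <= outlen; i++)
		used += (size_t)snprintf(out + used, outlen - used, "%.2x",
					 (unsigned char)in[i]);
}
```

The cast makes the output independent of whether plain char is signed on
the target.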

Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
3 years agorasdaemon.service.in: comment out syslog.target
Evils [Sat, 11 Dec 2021 02:27:05 +0000 (03:27 +0100)]
rasdaemon.service.in: comment out syslog.target

syslog is only used when the daemon runs in background mode;
this service is configured to run in foreground mode.

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
3 years agoadd labels for asrock x570 motherboard
Steven Johnson [Tue, 7 Dec 2021 10:57:08 +0000 (17:57 +0700)]
add labels for asrock x570 motherboard

Signed-off-by: Steven Johnson <strntydog@gmail.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
3 years agoUpdate ras-mc-ctl manpage to match current options
Justin Vreeland [Wed, 3 Nov 2021 02:51:50 +0000 (19:51 -0700)]
Update ras-mc-ctl manpage to match current options

Signed-off-by: Justin Vreeland <vreeland.justin@gmail.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
3 years agorasdaemon: ras-mc-ctl: Fix script to parse dimm sizes
Muralidhara M K [Tue, 27 Jul 2021 11:36:45 +0000 (06:36 -0500)]
rasdaemon: ras-mc-ctl: Fix script to parse dimm sizes

Remove trailing spaces at the end of a line from the
file location and fix the --layout option to parse dimm nodes
to get the size of each dimm from ras-mc-ctl.

The issue was reported at https://github.com/mchehab/rasdaemon/issues/43,
where '> ras-mc-ctl --layout' reports all 0s.

With this change the layout option prints the correct dimm sizes:
> sudo ras-mc-ctl --layout
          +-----------------------------------------------+
          |                      mc0                      |
          |  csrow0   |  csrow1   |  csrow2   |  csrow3   |
----------+-----------------------------------------------+
...
channel7: |  16384 MB  |     0 MB  |     0 MB  |     0 MB |
channel6: |  16384 MB  |     0 MB  |     0 MB  |     0 MB |
...
----------+-----------------------------------------------+

Signed-off-by: Muralidhara M K <muralimk@amd.com>
Signed-off-by: Naveen Krishna Chatradhi <nchatrad@amd.com>
Cc: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Link: https://lkml.kernel.org/r/20210810183855.129076-1-nchatrad@amd.com/
3 years agorasdaemon: fix compile against musl libc
Stijn Tintel [Wed, 1 Sep 2021 00:32:18 +0000 (03:32 +0300)]
rasdaemon: fix compile against musl libc

Fix the following compile errors that occur when building against musl:

ras-events.c: In function 'read_ras_event_all_cpus':
ras-events.c:366:16: error: 'PATH_MAX' undeclared (first use in this function)
  366 |  char pipe_raw[PATH_MAX];
      |                ^~~~~~~~

ras-events.c: In function 'handle_ras_events_cpu':
ras-events.c:564:16: error: 'PATH_MAX' undeclared (first use in this function)
  564 |  char pipe_raw[PATH_MAX];
      |
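
POSIX declares PATH_MAX in <limits.h>, so including that header explicitly
is the usual remedy for this error. A sketch under that assumption follows;
the fallback value and the build_pipe_path() helper are illustrative and
not taken from the patch.

```c
#include <limits.h>	/* declares PATH_MAX on POSIX systems */
#include <stdio.h>

#ifndef PATH_MAX
#define PATH_MAX 4096	/* assumed fallback if the libc leaves it undefined */
#endif

/* Hypothetical helper: build a per-CPU raw trace pipe path. */
static int build_pipe_path(char *out, size_t len, const char *tracing_dir,
			   int cpu)
{
	return snprintf(out, len, "%s/per_cpu/cpu%d/trace_pipe_raw",
			tracing_dir, cpu);
}
```

Some libcs pull <limits.h> in through other headers, which may be why the
missing include goes unnoticed until the code is built elsewhere.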

Signed-off-by: Stijn Tintel <stijn@linux-ipv6.be>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
3 years agolabels/supermicro: added Supermicro X11SCW
DmNosachev [Fri, 23 Jul 2021 14:28:33 +0000 (17:28 +0300)]
labels/supermicro: added Supermicro X11SCW

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agolabels/supermicro: added Supermicro X10DRL, X11SPM
DmNosachev [Thu, 22 Jul 2021 07:25:38 +0000 (10:25 +0300)]
labels/supermicro: added Supermicro X10DRL, X11SPM

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agolabels/supermicro: added Supermicro X11SCA(-F)
DmNosachev [Fri, 2 Jul 2021 10:13:46 +0000 (13:13 +0300)]
labels/supermicro: added Supermicro X11SCA(-F)

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agolabels/supermicro: added Supermicro B1DRi
DmNosachev [Wed, 30 Jun 2021 13:49:18 +0000 (16:49 +0300)]
labels/supermicro: added Supermicro B1DRi

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agolabels/supermicro: added Supermicro X11DDW-NT(-L)
DmNosachev [Tue, 29 Jun 2021 11:07:54 +0000 (14:07 +0300)]
labels/supermicro: added Supermicro X11DDW-NT(-L)

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agolabels/supermicro: added Supermicro X10DRI(-T)
DmNosachev [Tue, 29 Jun 2021 10:48:55 +0000 (13:48 +0300)]
labels/supermicro: added Supermicro X10DRI(-T)

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agolabels/supermicro: supermicro db syntax
DmNosachev [Tue, 29 Jun 2021 10:37:48 +0000 (13:37 +0300)]
labels/supermicro: supermicro db syntax

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agolabels/supermicro: added x11dph-i labels
DmNosachev [Tue, 29 Jun 2021 08:33:10 +0000 (11:33 +0300)]
labels/supermicro: added x11dph-i labels

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agorasdaemon: Support MCE for AMD CPU family 19h
Muralidhara M K [Wed, 28 Jul 2021 06:52:12 +0000 (01:52 -0500)]
rasdaemon: Support MCE for AMD CPU family 19h

Add support for family 19h x86 CPUs from AMD.

Signed-off-by: Muralidhara M K <muralimk@amd.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agorasdaemon: Enumerate memory on noncpu nodes
Muralidhara M K [Mon, 12 Jul 2021 10:40:46 +0000 (05:40 -0500)]
rasdaemon: Enumerate memory on noncpu nodes

Newer heterogeneous systems from AMD have GPU nodes (with HBM2 memory
banks) connected via xGMI links to the CPUs.

The node id information is available in the InstanceHI[47:44] of
the IPID register.

The UMC Phys on Aldebaran nodes are enumerated as csrows.
The UMC channels connected to HBMs are enumerated as ranks.
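
Extracting the node id from bits [47:44] can be sketched as below. The
function name is hypothetical, and ipid is assumed to hold the full 64-bit
IPID register value.

```c
#include <stdint.h>

/* Hypothetical helper: return bits [47:44] of the 64-bit IPID value. */
static unsigned int ipid_node_id(uint64_t ipid)
{
	return (unsigned int)((ipid >> 44) & 0xf);
}
```

The 4-bit field allows node ids from 0 to 15.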

Signed-off-by: Muralidhara M K <muralimk@amd.com>
Signed-off-by: Naveen Krishna Chatradhi <nchatrad@amd.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agorasdaemon: set SMCA maximum number of banks to 64
Muralidhara M K [Mon, 12 Jul 2021 10:18:43 +0000 (05:18 -0500)]
rasdaemon: set SMCA maximum number of banks to 64

Newer AMD systems with SMCA banks support up to 64 MCA banks per CPU.

This patch is based on the commit below, upstreamed into the kernel:
a0bc32b3cacf ("x86/mce: Increase maximum number of banks to 64")

Signed-off-by: Muralidhara M K <muralimk@amd.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agorasdaemon: Add new SMCA bank types with error decoding
Naveen Krishna Chatradhi [Tue, 1 Jun 2021 05:31:17 +0000 (11:01 +0530)]
rasdaemon: Add new SMCA bank types with error decoding

Upcoming systems with Scalable Machine Check Architecture (SMCA) have
new MCA banks added.

This patch adds the (HWID, MCATYPE) tuple, name and error decoding for
those new SMCA banks.
While at it, optimize the string names in smca_bank_name[].
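
A lookup keyed on the (HWID, MCATYPE) tuple might look like the sketch
below. The table entries are invented placeholders, and the IPID bit
positions (HWID in [43:32], McaType in [63:48]) are an assumption modeled
on the Linux kernel's SMCA handling rather than taken from this patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Pack HWID and McaType into one 32-bit lookup key. */
#define HWID_MCATYPE(hwid, mcatype) \
	(((uint32_t)(hwid) << 16) | (uint32_t)(mcatype))

struct bank_desc {
	uint32_t key;
	const char *name;
};

/* Placeholder entries: the IDs and names are invented for illustration. */
static const struct bank_desc example_banks[] = {
	{ HWID_MCATYPE(0x001, 0x0), "EXAMPLE_A" },
	{ HWID_MCATYPE(0x123, 0x2), "EXAMPLE_B" },
};

/* Assumed IPID layout: HWID in bits [43:32], McaType in bits [63:48]. */
static const char *lookup_bank_name(uint64_t ipid)
{
	uint32_t hwid = (uint32_t)((ipid >> 32) & 0xfff);
	uint32_t mcatype = (uint32_t)((ipid >> 48) & 0xffff);
	uint32_t key = HWID_MCATYPE(hwid, mcatype);
	size_t i;

	for (i = 0; i < sizeof(example_banks) / sizeof(example_banks[0]); i++)
		if (example_banks[i].key == key)
			return example_banks[i].name;
	return "UNKNOWN";
}
```

Supporting a new bank type then amounts to adding one table row plus its
decoding strings.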

Signed-off-by: Muralidhara M K <muralimk@amd.com>
Signed-off-by: Naveen Krishna Chatradhi <nchatrad@amd.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agoconfigure.ac: fix SYSCONFDEFDIR default value
Matt Whitlock [Wed, 9 Jun 2021 14:25:18 +0000 (10:25 -0400)]
configure.ac: fix SYSCONFDEFDIR default value

configure.ac was using AC_ARG_WITH incorrectly, yielding a generated configure script like:

    # Check whether --with-sysconfdefdir was given.
    if test "${with_sysconfdefdir+set}" = set; then :
      withval=$with_sysconfdefdir; SYSCONFDEFDIR=$withval
    else
      "/etc/sysconfig"
    fi

This commit fixes the default case so that the SYSCONFDEFDIR variable is assigned the value "/etc/sysconfig" rather than trying to execute "/etc/sysconfig" as a command.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agoAdd error handling for Ampere-specific errors.
Jason Tian [Fri, 28 May 2021 03:35:43 +0000 (11:35 +0800)]
Add error handling for Ampere-specific errors.

Save the decoded Ampere-specific errors into the sqlite3
database and log the PCIe segment and bus/device/function numbers
into the BMC SEL.

Signed-off-by: Jason Tian <jason@os.amperecomputing.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agoAdd support for multi-arch builds
Mauro Carvalho Chehab [Wed, 26 May 2021 10:55:54 +0000 (12:55 +0200)]
Add support for multi-arch builds

Allow building rasdaemon on several architectures:
- x86_64
- arm64
- ppc64 LE

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agoci.yml: Fix the job for it to run on a single arch
Mauro Carvalho Chehab [Wed, 26 May 2021 10:41:27 +0000 (12:41 +0200)]
ci.yml: Fix the job for it to run on a single arch

The previous content had some issues. Fix them so that
the job builds on a single architecture.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agoAdd a github workflow for CI automation
Mauro Carvalho Chehab [Wed, 26 May 2021 10:35:55 +0000 (12:35 +0200)]
Add a github workflow for CI automation

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agorasdaemon.spec.in: Fix the description on this example file
Mauro Carvalho Chehab [Wed, 26 May 2021 08:37:52 +0000 (10:37 +0200)]
rasdaemon.spec.in: Fix the description on this example file

While this is used just to test if building it is OK, better
to keep the logs nice ;-)

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agoBump version to 0.6.7 v0.6.7
Mauro Carvalho Chehab [Wed, 26 May 2021 07:41:45 +0000 (09:41 +0200)]
Bump version to 0.6.7

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agorasdaemon.spec.in: don't install _sharedstatedir
Mauro Carvalho Chehab [Wed, 26 May 2021 08:32:39 +0000 (10:32 +0200)]
rasdaemon.spec.in: don't install _sharedstatedir

%{_sharedstatedir}/rasdaemon is now created at runtime.
So, no need to install it anymore.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agoMakefile.am: fix build header rules
Mauro Carvalho Chehab [Wed, 26 May 2021 08:20:39 +0000 (10:20 +0200)]
Makefile.am: fix build header rules

non-standard-hisilicon.h was added twice;
ras-memory-failure-handler.h is missing.

Due to that, the tarball becomes incomplete, causing build
errors.

While here, also adjust .travis.yml to use --enable-all.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
3 years agorasdaemon: Add Ice Lake and Sapphire Rapids MSCOD values
Greg Edwards [Thu, 8 Apr 2021 21:03:30 +0000 (15:03 -0600)]
rasdaemon: Add Ice Lake and Sapphire Rapids MSCOD values

Based on mcelog commits:

  ee90ff20ce6a ("mcelog: Add support for Icelake server, Icelake-D, and Snow Ridge")
  391abaac9bdf ("mcelog: Add decode for MCi_MISC from 10nm memory controller")
  59cb7ad4bc72 ("mcelog: i10nm: Fix mapping from bank number to functional unit")
  c0acd0e6a639 ("mcelog: Add support for Sapphirerapids server.")

Signed-off-by: Greg Edwards <gedwards@ddn.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>