Avadhut Naik [Mon, 22 May 2023 22:13:17 +0000 (22:13 +0000)]
rasdaemon: Add support for post-processing MCA errors
Currently, the rasdaemon performs detailed error decoding of received
MCA errors on the system only whence it is running, either as a daemon
or in the foreground.
As such, error decoding cannot be undertaken for any MCA errors received
whence the rasdaemon wasn't running. Additionally, if the error decoding
modules like edac_mce_amd too have not been loaded, error records in the
demsg buffer might correspond to raw values in associated MSRs, compelling
users to undertake decoding manually. The scenario seems more plausible on
AMD systems with Scalabale MCA (SMCA) with plans in place to remove SMCA
Extended Error Descriptions from the edac_mce_amd module in an effort to
offload SMCA Error Decoding to the rasdaemon.
As such, add support to post-process and decode MCA Errors received on AMD
SMCA systems from raw MSR values. Support for post-processing and decoding
of MCA Errors received on CPUs of other vendors can be added in the future,
as needed.
rasdaemon: Handle reassigned bit definitions for CS SMCA
Currently, on AMD systems with Scalable MCA (SMCA), each machine check
error of a SMCA bank type has an associated bit position in the bank's
control (CTL) register used for enabling / disabling reporting of the
very error. An error's bit position in the CTL register is also used
during error decoding for offsetting into the corresponding bank's error
description structure. As new errors are being added in newer AMD systems
for existing SMCA bank types, the underlying SMCA architecture guarantees
that the bit positions of existing errors are not altered.
However, on some AMD systems viz. Genoa, some of the existing bit
definitions in the CTL register of the Coherent Slave (CS) SMCA bank type
are reassigned without defining new HWID and McaType. Consequently, the
very errors whose bit definitions have been reassigned in the CTL register
are being erroneously decoded.
As a solution, create a new software defined SMCA bank type by utilizing
one of the hardware-reserved values for HWID. The new SMCA bank type will
only be employed for CS error decoding on affected CPU models.
Additionally, since the existing error description structure for the CS
SMCA bank type is still valid, add new error description structure to
compensate for the reassigned bit definitions.
Update, reword some existing SMCA bank type error descriptions to extend
SMCA error decoding functionality for modern AMD processors. Additionally,
also add new error descriptions for missing SMCA bank types.
When the number of CPUs detected is greater than the number of CPUs in
the system, rasdaemon will crash when it receives some events.
Looking deeper, we also fail to use the poll method for similar reasons
in this case.
All of this can be prevented by checking to see how many CPUs are
currently online (sysconf(_SC_NPROCESSORS_ONLN)) instead of how many
CPUs the current kernel was configured to support
(sysconf(_SC_NPROCESSORS_CONF)).
For the kernel side of the discussion, see https://lore.kernel.org/lkml/CAM6Wdxft33zLeeXHhmNX5jyJtfGTLiwkQSApc=10fqf+rQh9DA@mail.gmail.com/T/ Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Xiaofei Tan [Thu, 11 May 2023 02:54:26 +0000 (10:54 +0800)]
rasdaemon: fix return value type issue of read/write function from unistd.h
The return value type of read/write function from unistd.h is ssize_t.
It's signed normally, and return -1 on error. Fix incorrect use in the
function read_ras_event_all_cpus().
BTW, make setting buffer_percent as a separate function.
Fixes: 94750bcf9309 ("rasdaemon: Fix poll() on per_cpu trace_pipe_raw blocks indefinitely") Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com> Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Ayush Jain [Tue, 23 May 2023 06:55:36 +0000 (12:25 +0530)]
Rasdaemon: Fix autoreconf build error
When building rasdaemon with autoreconf, on certain distros
we see the following error message.
Makefile.am: error: required file './README' not found
Autoreconf looks for README file instead of README.md
Fix this by passing 'foreign' to AM_INIT_AUTOMAKE.
hubin [Thu, 18 May 2023 08:14:41 +0000 (16:14 +0800)]
ras-events: quit loop in read_ras_event when kbuf data is broken
when kbuf data is broken, kbuffer_next_event() may move kbuf->index back to
the current kbuf->index position, causing dead loop.
In this situation, rasdaemon will repeatedly parse an invalid event, and
print warning like "ug! negative record size -8!", pushing cpu utilization
rate to 100%.
when kbuf data is broken, discard current page and continue reading next page
kbuf.
Shiju Jose [Fri, 17 Mar 2023 13:07:01 +0000 (13:07 +0000)]
rasdaemon: Add support for the CXL AER correctable errors
Add support to log and record the CXL AER correctable errors.
The corresponding Kernel patches are here:
https://lore.kernel.org/linux-cxl/166974401763.1608150.5424589924034481387.stgit@djiang5-desk3.ch.intel.com/T/#t
https://lore.kernel.org/linux-cxl/63e5ed38d77d9_138fbc2947a@iweiny-mobl.notmuch/T/#t
Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Shiju Jose [Fri, 17 Mar 2023 12:51:02 +0000 (12:51 +0000)]
rasdaemon: Add support for the CXL AER uncorrectable errors
Add support to log and record the CXL AER uncorrectable errors.
The corresponding Kernel patches are here:
https://lore.kernel.org/linux-cxl/166974401763.1608150.5424589924034481387.stgit@djiang5-desk3.ch.intel.com/T/#t
https://lore.kernel.org/lkml/63eeb2a8c9e3f_32d612941f@dwillia2-xfh.jf.intel.com.notmuch/T/
It was found that the header log data to be converted to the
big-endian format to correctly store in the SQLite DB likely
because the SQLite database seems uses the big-endian storage.
Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Dave Jiang <dave.jiang@intel.com># Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Shiju Jose [Fri, 31 Mar 2023 12:35:13 +0000 (13:35 +0100)]
rasdaemon: Add support for the CXL poison events
Add support to log and record the CXL poison events.
The corresponding Kernel patches here:
https://lore.kernel.org/linux-cxl/64457d30bae07_2028294ac@dwillia2-xfh.jf.intel.com.notmuch/
Presently for logging only, could be extended for the policy
based recovery action for the frequent poison events depending on the above
kernel patches.
Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Shiju Jose [Sun, 5 Mar 2023 23:14:42 +0000 (23:14 +0000)]
rasdaemon: fix table create if some cpus are offline
Fix for regression in ras_mc_create_table() if some cpus are offline
at the system start
Issue:
Regression in the ras_mc_create_table() if some of the cpus are offline
at the system start when run the rasdaemon.
This issue is reproducible in ras_mc_create_table() with decode and
record non-standard events and reproducible sometimes with
ras_mc_create_table() for the standard events.
Also in the multi thread way, there is memory leak in ras_mc_event_opendb()
as struct sqlite3_priv *priv and sqlite3 *db allocated/initialized per
thread, but stored in the common struct ras_events ras in pthread data,
which is shared across the threads.
Reason:
when the system starts with some of the cpus offline and then run
the rasdaemon, read_ras_event_all_cpus() exit with error and switch to
the multi thread way. However read() in read_ras_event() return error in
threads for each of the offline CPUs and does clean up including calling
ras_mc_event_closedb().
Since the 'struct ras_events ras' passed in the pthread_data to each of the
threads is common, struct sqlite3_priv *priv and sqlite3 *db allocated/
initialized per thread and stored in the common 'struct ras_events ras',
are getting overwritten in each ras_mc_event_opendb()(which called from
pthread per cpu), result memory leak.
Also when ras_mc_event_closedb() is called in the above error case from
the threads corresponding to the offline cpus, close the sqlite3 *db and
free sqlite3_priv *priv stored in the common 'struct ras_events ras',
result regression when accessing priv->db in the ras_mc_create_table()
from another context later.
Solution:
In ras_mc_event_opendb(), allocate struct sqlite3_priv *priv,
init sqlite3 *db and create tables common for the threads with shared
'struct ras_events ras' based on a reference count and free them in the
same way.
Also protect critical code ras_mc_event_opendb() and ras_mc_event_closedb()
using mutex in the multi thread case from any regression caused by the
thread pre-emption.
Reported-by: Lei Feng <fenglei47@h-partners.com> Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Sam James [Sun, 19 Feb 2023 18:33:20 +0000 (18:33 +0000)]
configure.ac: fix bashisms
configure scripts need to be runnable with a POSIX-compliant /bin/sh.
On many (but not all!) systems, /bin/sh is provided by Bash, so errors
like this aren't spotted. Notably Debian defaults to /bin/sh provided
by dash which doesn't tolerate such bashisms as '=='.
Shiju Jose [Sat, 4 Feb 2023 19:15:55 +0000 (19:15 +0000)]
rasdaemon: Fix poll() on per_cpu trace_pipe_raw blocks indefinitely
The error events are not received in the rasdaemon since kernel 6.1-rc6.
This issue is firstly detected and reported, when testing the CXL error
events in the rasdaemon.
Debugging showed, poll() on trace_pipe_raw in the ras-events.c do not
return and this issue is seen after the commit 42fb0a1e84ff525ebe560e2baf9451ab69127e2b ("tracing/ring-buffer: Have
polling block on watermark").
This issue is also verified using a test application for poll()
and select() on per_cpu trace_pipe_raw.
There is also a bug reported on this issue,
https://lore.kernel.org/all/31eb3b12-3350-90a4-a0d9-d1494db7cf74@oracle.com/
This issue occurs for the per_cpu case, which calls the ring_buffer_poll_wait(),
in kernel/trace/ring_buffer.c, with the buffer_percent > 0 and then wait until
the percentage of pages are available. The default value set for the
buffer_percent is 50 in the kernel/trace/trace.c. However poll() does not return
even met the percentage of pages condition.
As a fix, rasdaemon set buffer_percent as 0 through the
/sys/kernel/debug/tracing/instances/rasdaemon/buffer_percent, then the
task will wake up as soon as data is added to any of the specific cpu
buffer and poll() on per_cpu/cpuX/trace_pipe_raw does not block
indefinitely.
Dependency on the kernel fix commit 3e46d910d8acf94e5360126593b68bf4fee4c4a1("tracing: Fix poll() and select()
do not work on per_cpu trace_pipe and trace_pipe_raw")
Rasdaemon used for a long time an early version of this library,
with the code embedded directly into its code. The rationale is
that the library was not officially released on that time, but
this has long changed.
Shiju Jose [Thu, 24 Feb 2022 18:02:14 +0000 (18:02 +0000)]
rasdaemon: ras-mc-ctl: Modify error statistics for HiSilicon KunPeng9xx common errors
Modify the error statistics for the HiSilicon KunPeng9xx platforms common errors
to display the statistics and error info based on the module and the error severity.
Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Shengwei Luo [Wed, 23 Feb 2022 09:21:58 +0000 (17:21 +0800)]
rasdaemon: Support cpu fault isolation for corrected errors
When the corrected errors exceed the set limit in cycle, try to
offline the related cpu core.
Signed-off-by: Shengwei Luo <luoshengwei@huawei.com> Signed-off-by: Junchong Pan <panjunchong@hisilicon.com> Signed-off-by: Lei Feng <fenglei47@h-partners.com> Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com> Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Yang Shi [Mon, 4 Apr 2022 23:34:05 +0000 (16:34 -0700)]
rasdaemon: use the new block_rq_error tracepoint
Since Linux 5.18-rc1 a new block tracepoint called block_rq_error is
available for tracing disk error events dedicatedly. Currently
rasdaemon is using block_rq_complete which also traces successful cases.
It incurs excessive tracing logs and somehow overhead since the event is
triggered quite often.
Use the new tracepoint for disk error reporting, and the new trace point
has the same format as block_rq_complete.
Signed-off-by: Yang Shi <shy828301@gmail.com> Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Xiaofei Tan [Wed, 20 Oct 2021 06:33:40 +0000 (14:33 +0800)]
rasdaemon: Add some modules supported by hisi common error section
Add some modules supported by hisi common error section. Besides,
HHA is the module for some old platform, and it takes the same place
of MATA, so remove it.
Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com> Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Xiaofei Tan [Wed, 20 Oct 2021 06:33:37 +0000 (14:33 +0800)]
rasdaemon: Fix the issue of sprintf data type mismatch in uuid_le()
The data type of sprintf called in the function uuid_le() is mismatch.
Arm64 compiler force it to unsigned char by default, and can work normally.
But if someone compile it with the option -fsigned-char, the function
can't work correctly.
Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com> Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Muralidhara M K [Tue, 27 Jul 2021 11:36:45 +0000 (06:36 -0500)]
rasdaemon: ras-mc-ctl: Fix script to parse dimm sizes
Removes trailing spaces at the end of a line from
file location and fixes --layout option to parse dimm nodes
to get the size of each dimm from ras-mc-ctl.
Issue is reported https://github.com/mchehab/rasdaemon/issues/43
Where '> ras-mc-ctl --layout' reports all 0s
Fix the following compile errors that occurs when building against musl:
ras-events.c: In function 'read_ras_event_all_cpus':
ras-events.c:366:16: error: 'PATH_MAX' undeclared (first use in this function)
366 | char pipe_raw[PATH_MAX];
| ^~~~~~~~
ras-events.c: In function 'handle_ras_events_cpu':
ras-events.c:564:16: error: 'PATH_MAX' undeclared (first use in this function)
564 | char pipe_raw[PATH_MAX];
|
rasdaemon: Add new SMCA bank types with error decoding
Upcoming systems with Scalable Machine Check Architecture (SMCA) have
new MCA banks added.
This patch adds the (HWID, MCATYPE) tuple, name and error decoding for
those new SMCA banks.
While at it, optimize the string names in smca_bank_name[].
Signed-off-by: Muralidhara M K <muralimk@amd.com> Signed-off-by: Naveen Krishna Chatradhi <nchatrad@amd.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Matt Whitlock [Wed, 9 Jun 2021 14:25:18 +0000 (10:25 -0400)]
configure.ac: fix SYSCONFDEFDIR default value
configure.ac was using AC_ARG_WITH incorrectly, yielding a generated configure script like:
# Check whether --with-sysconfdefdir was given.
if test "${with_sysconfdefdir+set}" = set; then :
withval=$with_sysconfdefdir; SYSCONFDEFDIR=$withval
else
"/etc/sysconfig"
fi
This commit fixes the default case so that the SYSCONFDEFDIR variable is assigned the value "/etc/sysconfig" rather than trying to execute "/etc/sysconfig" as a command.
rasdaemon: Add Ice Lake and Sapphire Rapids MSCOD values
Based on mcelog commits:
ee90ff20ce6a ("mcelog: Add support for Icelake server, Icelake-D, and Snow Ridge") 391abaac9bdf ("mcelog: Add decode for MCi_MISC from 10nm memory controller") 59cb7ad4bc72 ("mcelog: i10nm: Fix mapping from bank number to functional unit") c0acd0e6a639 ("mcelog: Add support for Sapphirerapids server.")