Xiaofei Tan [Mon, 25 Nov 2019 09:33:24 +0000 (10:33 +0100)]
rasdaemon: fix the wrong declaration of 'struct ras_events' in ras-record.h
PC-Lint reports the following warning when running static code
analysis on the file non-standard-hisi_hip08.c:
Warning -- Declaration of symbol 'ras' hides symbol 'ras' (line 28, file ras-record.h)
This means that the local variable name 'ras' is the same as that of a
global variable. In fact, there is no global variable named 'ras', only a
wrong declaration in ras-record.h.
CC: Xiaofei Tan <tanxiaofei@huawei.com>, <linuxarm@huawei.com>, <shiju.jose@huawei.com>, <jonathan.cameron@huawei.com> Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Brian Woods, Yazen Ghannam [Fri, 1 Nov 2019 14:48:14 +0000 (15:48 +0100)]
rasdaemon: add support for new AMD SMCA bank types
Going forward, the Scalable Machine Check Architecture (SMCA) has some
updated and additional bank types which show up in Zen2. The differing
bank types include: CS_V2, PSP_V2, SMU_V2, MP5, NBIO, and PCIE. The V2
bank types replace the original bank types but have unique HWID/MCAtype
IDs from the originals, so there are no conflicts between different
versions or other bank types. All of the differing bank types have new
MCE descriptions which have been added as well.
CC: "mchehab+samsung@kernel.org" <mchehab+samsung@kernel.org>, "Namburu, Chandu-babu" <chandu@amd.com> Signed-off-by: Brian Woods <brian.woods@amd.com> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Cc: Chandu-babu Namburu <chandu@amd.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Shiju Jose [Wed, 13 Nov 2019 16:31:12 +0000 (16:31 +0000)]
rasdaemon: fix for the ras-record.c:ras_mc_prepare_stmt() failure when new fields added to the sql table
rasdaemon fails in the ras_mc_prepare_stmt() function when new fields are
added to the table's db_fields on top of the existing sql table in the
system.
This patch adds a solution for this issue.
Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Shiju Jose [Wed, 16 Oct 2019 16:34:01 +0000 (17:34 +0100)]
rasdaemon: add signal handling for the cleanup
Presently rasdaemon does not free allocated memory or
do other cleanup when it is closed
with ctrl+c, kill, etc.
This patch adds handling of the signals SIGINT, SIGTERM, SIGHUP
and SIGQUIT and does the necessary cleanup when one of these
signals is received.
Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Xiaofei Tan [Tue, 8 Oct 2019 12:38:57 +0000 (20:38 +0800)]
rasdaemon: add timestamp for hip08 OEM error records in sqlite3 DB
This patch does two things:
1. Add timestamp for hip08 OEM error records in sqlite3 DB.
2. Add suffix "_v2" for hip08 OEM event names to keep compatibility
   with old sqlite3 DB.
Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Xiaofei Tan [Tue, 8 Oct 2019 12:38:54 +0000 (20:38 +0800)]
rasdaemon: optimize sqlite3 DB record of register fields for hip08
Optimize the sqlite3 DB record of register fields for hip08 by combining
all register fields into one text field that also includes the register
names. This makes the record easier to read.
For example, from:
INSERT INTO hip08_oem_type2_event VALUES(1,1,1,0,0,'SMMU',2,'corrected',
273058,0,-1,0,1308622858,0,0,0,0,133,0,0,NULL);
change to:
INSERT INTO hip08_oem_type2_event VALUES(1,1,1,0,0,'SMMU',2,'corrected',
'ERR_FR_0=0x42aa2 ERR_FR_1=0x0 ERR_CTRL_0=0xffffffff ERR_CTRL_1=0x0
ERR_STATUS_0=0x4e00000a ERR_STATUS_1=0x0 ERR_ADDR_0=0x0, ERR_ADDR_1=0x0
ERR_MISC0_0=0x0 ERR_MISC0_1=0x90 ERR_MISC1_0=0x0 ERR_MISC1_1=0x0');
Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
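The "NAME=0xVALUE" text field shown in the example above can be built with a small formatting loop. The following is only a sketch of that idea; struct reg and format_registers() are hypothetical names, not the helpers used in the hip08 decoder:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical pair of register name and value. */
struct reg {
	const char *name;
	uint32_t value;
};

/* Write "NAME=0x... NAME=0x..." into buf (capacity len).
 * Returns 0 on success, -1 if the text does not fit. */
int format_registers(char *buf, size_t len, const struct reg *regs, size_t n)
{
	size_t used = 0;
	size_t i;

	if (len == 0)
		return -1;
	buf[0] = '\0';

	for (i = 0; i < n; i++) {
		/* A single space separates entries; none before the first. */
		int w = snprintf(buf + used, len - used, "%s%s=0x%x",
				 i ? " " : "", regs[i].name,
				 (unsigned)regs[i].value);
		if (w < 0 || (size_t)w >= len - used)
			return -1;
		used += (size_t)w;
	}
	return 0;
}
```

The resulting string can then be bound to a single text column instead of one integer column per register.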
Xiaofei Tan [Thu, 8 Aug 2019 02:14:30 +0000 (10:14 +0800)]
rasdaemon: fix the issue of sqlite3 integer bind parameter mismatch
Some integer fields of arm_event and mc_event are 8 bytes wide,
so sqlite3_bind_int64() should be used when storing the event in
sqlite3. The current code uses sqlite3_bind_int() instead, which
leads to a wrong value in the sqlite3 DB.
This patch fixes the issue.
Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
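To see why the wrong bind call corrupts the value: sqlite3_bind_int() takes an int, while sqlite3_bind_int64() takes a 64-bit sqlite3_int64. The sketch below avoids linking against SQLite and only models the effect on the stored value; both functions are hypothetical stand-ins, and the 32-bit case assumes the usual behaviour of keeping the low 32 bits:

```c
#include <stdint.h>

/* Hypothetical stand-in for binding through a 32-bit parameter:
 * only the low 32 bits of the field survive. */
uint64_t stored_via_32bit_bind(uint64_t field)
{
	return field & 0xffffffffu;
}

/* Hypothetical stand-in for binding through a 64-bit parameter:
 * the value is kept whole. */
uint64_t stored_via_64bit_bind(uint64_t field)
{
	return field;
}
```

For example, a field holding 0x1234567890 would be recorded as 0x34567890 through the 32-bit path, and unchanged through the 64-bit path.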
README: updated instructions about sending patches
The instructions there are a little outdated. Sergio
suggested changing just my e-mail, but let's do a better job
and use my canonical e-mail (mchehab@kernel.org), plus add the
alternative of sending patches against either github or gitlab.
fix file descriptor leak in ras-report.c:setup_report_socket()
A running instance of rasdaemon was seen to hit the limit on open file
descriptors. Most of the descriptors were AF_UNIX STREAM sockets.
At the same time the limit was hit, attempts by rasdaemon to open the
SQLite database started failing with SQLite error 14.
This patch avoids leaking a socket file descriptor each time the connect()
call fails.
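The pattern behind this kind of fix is to close the socket on every failure path after socket() has succeeded. A minimal sketch, assuming a helper named connect_unix_socket() (hypothetical, not the actual setup_report_socket() code):

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Connect to the UNIX stream socket at `path`.
 * Returns the connected descriptor, or -1 on failure with nothing
 * left open. */
int connect_unix_socket(const char *path)
{
	struct sockaddr_un addr;
	int fd;

	if (strlen(path) >= sizeof(addr.sun_path))
		return -1;

	fd = socket(AF_UNIX, SOCK_STREAM, 0);
	if (fd < 0)
		return -1;

	memset(&addr, 0, sizeof(addr));
	addr.sun_family = AF_UNIX;
	strcpy(addr.sun_path, path);	/* length was checked above */

	if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		close(fd);	/* without this, each failed attempt leaks fd */
		return -1;
	}
	return fd;
}
```

A long-running daemon that retries the connection for every event would otherwise consume one descriptor per failed attempt until the per-process limit is reached.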
parse_ras_data: initialize record.cpu before pevent_print_event().
pevent_print_event() prints record.cpu; make sure it's initialized.
The cpu field from pthread_data is my best guess at a suitable value:
parse_ras_data() was already printing it separately.
parse_ras_data: flush trace buffer immediately, not on next call
parse_ras_data() was calling fflush() before, not after printf().
As a result, information about an event would not be printed
immediately but possibly much later.
Shiju Jose [Mon, 17 Jun 2019 14:28:51 +0000 (15:28 +0100)]
rasdaemon:add logging HiSilicon HIP08 H/W errors reported in the OEM format2
This patch adds logging the HiSilicon HIP08 H/W errors reported
in the non-standard OEM format2.
These errors are from the H/W modules SMMU, HHA, HLLC, PA and DDRC.
Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Shiju Jose [Mon, 17 Jun 2019 14:28:50 +0000 (15:28 +0100)]
rasdaemon:add logging HiSilicon HIP08 H/W errors reported in the OEM format1
This patch adds logging the HiSilicon HIP08 H/W errors reported
in the non-standard OEM format1.
These errors are from the H/W modules MN, PLL, SLLC, AA, SIOE,
POE, DISP, LPC, SAS and SATA.
Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
ras-mce-handler.c:344:9: warning: address of array 'e->mcgstatus_msg' will always evaluate to 'true' [-Wpointer-bool-conversion]
if (e->mcgstatus_msg)
~~ ~~~^~~~~~~~~~~~~
Ying Lv [Wed, 15 May 2019 03:15:42 +0000 (11:15 +0800)]
fix rasdaemon high CPU usage when part of CPUs offline
When some CPU cores are set offline, for example by setting the kernel
cmdline maxcpus=N (where N is less than the total number of system CPU
cores), the CPU usage of some rasdaemon threads is observed to be
very close to 100%.
This is because, when some CPUs are offline, poll in the
read_ras_event_all_cpus function falls back to the pthread approach.
The thread for an offlined CPU gets a negative return value when reading
trace_pipe_raw, and that negative value is converted to a positive value
because of 'unsigned size'.
So the code always enters the 'size > 0' branch, and the CPU usage is too high.
Declaring the variable size as int makes the code take the right branch.
Fixes: eff7c9e0("ras-events: Only use pthreads for collect if poll() not available") Reported-by: Zhipeng Xie <xiezhipeng1@huawei.com> Signed-off-by: Ying Lv <lvying6@huawei.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
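The signed/unsigned pitfall described above can be reproduced in a few lines. This is a sketch only; failing_read() is a hypothetical stand-in for a read() on trace_pipe_raw that returns an error:

```c
#include <sys/types.h>

/* Stand-in for a read() that fails, as on an offlined CPU. */
static ssize_t failing_read(void)
{
	return -1;
}

/* Buggy variant: the negative result is stored in an unsigned variable,
 * becomes a huge positive number, and "size > 0" is always true. */
int has_data_unsigned(void)
{
	unsigned size = (unsigned)failing_read();

	return size > 0;
}

/* Fixed variant: a signed variable keeps the error visible. */
int has_data_signed(void)
{
	int size = (int)failing_read();

	return size > 0;
}
```

With the unsigned variant, a loop that retries while "size > 0" never sees the error and spins, which matches the high CPU usage reported.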
Aristeu Rozanski [Wed, 1 Aug 2018 20:29:58 +0000 (16:29 -0400)]
rasdaemon: ras-mc-ctl: add option to show error counts
In some scenarios it might not be desirable to have a daemon running
to parse and store the errors provided by EDAC and only having the
number of CEs and UEs is enough. This patch implements this feature
as a ras-mc-ctl option.
mce-amd-k8: be sure to not go past error_msg buffer
As warned by gcc:
mce-amd-k8.c: In function ‘decode_k8_generic_errcode’:
mce-amd-k8.c:136:30: warning: ‘) ’ directive output may be truncated writing 2 bytes into a region of size between 0 and 4095 [-Wformat-truncation=]
mce_snprintf(e->error_msg, "(%s) ", tmp_buf);
^~~~~~~
ras-mce-handler.h:104:29: note: in definition of macro ‘mce_snprintf’
snprintf(buf + __n, __len, fmt, ##arg); \
^~~
ras-mce-handler.h:104:2: note: ‘snprintf’ output between 4 and 4099 bytes into a destination of size 4096
snprintf(buf + __n, __len, fmt, ##arg); \
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
mce-amd-k8.c:136:3: note: in expansion of macro ‘mce_snprintf’
mce_snprintf(e->error_msg, "(%s) ", tmp_buf);
^~~~~~~~~~~~
ras-report.c: In function ‘setup_report_socket’:
ras-report.c:36:2: warning: ‘strncpy’ output truncated before terminating nul copying 25 bytes from a string of the same length [-Wstringop-truncation]
strncpy(addr.sun_path, ABRT_SOCKET, strlen(ABRT_SOCKET));
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The strncpy logic there is wrong. Fix it and be sure to have a NUL
terminated string filled at addr.sun_path.
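The warning arises because strncpy() with a length of strlen(src) copies the characters but never the terminating NUL. One way to fill sun_path safely is sketched below; set_sun_path() is a hypothetical helper and not necessarily how the fix was written:

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

/* Copy `path` into addr->sun_path, always leaving a terminating NUL.
 * Returns 0 on success, -1 if the path does not fit. */
int set_sun_path(struct sockaddr_un *addr, const char *path)
{
	size_t len = strlen(path);

	if (len >= sizeof(addr->sun_path))
		return -1;

	memset(addr, 0, sizeof(*addr));
	addr->sun_family = AF_UNIX;
	memcpy(addr->sun_path, path, len + 1);	/* +1 copies the NUL too */
	return 0;
}
```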
mce-intel-*: fix a warning when using FIELD(<num>, NULL)
Internally, the FIELD() macro computes the size of an array by
using ARRAY_SIZE. This macro produces a meaningless division
if NULL is used, as dereferencing it yields type void, as warned:
mce-intel-dunnington.c:30:2: note: in expansion of macro ‘FIELD’
FIELD(17, NULL),
^~~~~
ras-mce-handler.h:28:33: warning: division ‘sizeof (void *) / sizeof (void)’ does not compute the number of array elements [-Wsizeof-pointer-div]
#define ARRAY_SIZE(x) (sizeof(x)/sizeof(*(x)))
^
bitfield.h:37:51: note: in expansion of macro ‘ARRAY_SIZE’
#define FIELD(start_bit, name) { start_bit, name, ARRAY_SIZE(name) }
^~~~~~~~~~
While this warning is harmless, it may prevent seeing more serious
warnings. So, add a FIELD_NULL(<num>) macro to avoid that.
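A sketch of how such a macro can sidestep ARRAY_SIZE follows. ARRAY_SIZE and FIELD are quoted from the warning output above; the layout of struct field and the body of FIELD_NULL are assumptions made for illustration:

```c
#include <stddef.h>

#define ARRAY_SIZE(x) (sizeof(x)/sizeof(*(x)))

/* Assumed layout of a bit-field descriptor, for illustration only. */
struct field {
	unsigned start_bit;
	const char **str;
	unsigned stringlen;
};

#define FIELD(start_bit, name) { start_bit, name, ARRAY_SIZE(name) }
/* No array to measure: store a length of 0 without using ARRAY_SIZE. */
#define FIELD_NULL(start_bit) { start_bit, NULL, 0 }

static const char *example_names[] = { "first", "second", "third" };

const struct field example_fields[] = {
	FIELD(0, example_names),
	FIELD_NULL(17),
};
```

Entries that have no name table then use FIELD_NULL(17) where FIELD(17, NULL) was written before, and the -Wsizeof-pointer-div warning no longer triggers.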
Thomas Tai [Mon, 14 May 2018 14:33:48 +0000 (10:33 -0400)]
rasdaemon: use separate string array for error status
The bit field descriptions for the correctable status register
and the uncorrectable status register are different. Using a
single aer_errors string array causes bit[12] to
overlap, so the wrong description is recorded.
A separate variable is needed to switch between the correctable
and uncorrectable error strings.
Signed-off-by: Thomas Tai <thomas.tai@oracle.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
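The idea can be sketched with two tables and a selector. The array and function names here are hypothetical, and only bit 12 is filled in, using illustrative descriptions of what that bit means in each register:

```c
#include <stddef.h>

/* Only bit 12 is populated; other entries stay NULL for brevity. */
static const char *aer_cor_errors[32] = {
	[12] = "Replay Timer Timeout",
};

static const char *aer_uncor_errors[32] = {
	[12] = "Poisoned TLP",
};

/* Pick the table that matches the register the status bits came from.
 * Returns NULL for out-of-range or unnamed bits. */
const char *aer_bit_name(int correctable, unsigned bit)
{
	const char **names = correctable ? aer_cor_errors : aer_uncor_errors;

	if (bit >= 32)
		return NULL;
	return names[bit];
}
```

With a single shared table, one of these two meanings of bit 12 would necessarily be lost.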
Thomas Tai [Mon, 14 May 2018 14:33:47 +0000 (10:33 -0400)]
rasdaemon: fix PCIe AER error type
The error types of PCIe AER and CPU Machine Check are
different. When handling aer_event, the PCIe AER error
type should be used. Add an enum to match the kernel
PCIe AER and use it to decode the error type.
Signed-off-by: Thomas Tai <thomas.tai@oracle.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Greg Edwards [Wed, 28 Mar 2018 22:10:46 +0000 (16:10 -0600)]
rasdaemon: Add Skylake Xeon MSCOD values
Based on mcelog commits e4aca6312aee ("Add support to decode MSCOD
values for Skylake server") and 34f03e306c36 ("mcelog: Change name of
skylake interconnect from QPI to UPI").
Aristeu Rozanski [Fri, 2 Feb 2018 15:20:48 +0000 (10:20 -0500)]
rasdaemon: ARM: fully initialize ras_arm_event
Issue found by covscan:
1. rasdaemon-0.4.1/ras-arm-handler.c:32: var_decl: Declaring variable "ev" without initializer.
16. rasdaemon-0.4.1/ras-arm-handler.c:81: uninit_use_in_call: Using uninitialized value "ev.error_count" when calling "ras_store_arm_record".
23. rasdaemon-0.4.1/ras-record.c:243:2: read_parm_fld: Reading a parameter field.
mce-intel-p4-p6: prevent build errors with -Werror=format-security
On Fedora, -Werror=format-security is now used on packages, which
causes the following build error:
mce-intel-p4-p6.c: In function 'p4_decode_model':
mce-intel-p4-p6.c:130:4: error: format not a string literal and no format arguments [-Werror=format-security]
mce_snprintf(e->error_msg, p4_model[i].str);
^~~~~~~~~~~~
cc1: some warnings being treated as errors
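The commit message does not spell out the change, but the usual remedy for this diagnostic is to pass the string as an argument to a literal "%s" format rather than as the format itself. A sketch with plain snprintf() (append_text() is a hypothetical helper, not the mce_snprintf macro):

```c
#include <stdio.h>
#include <string.h>

/* Append `text` to the NUL-terminated string in `buf` (capacity `len`),
 * treating `text` purely as data. Any '%' it contains is copied
 * verbatim instead of being interpreted as a conversion. */
void append_text(char *buf, size_t len, const char *text)
{
	size_t used = strlen(buf);

	if (used < len)
		snprintf(buf + used, len - used, "%s", text);
}
```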
configure.ac: show if Hisilicon error report are enabled
As changeset b856c89a11d7 ("rasdaemon:add support for
Hisilicon non-standard error decoder") added a new
configurable error report, show if it is enabled or not.
shiju.jose@huawei.com [Wed, 4 Oct 2017 09:11:21 +0000 (10:11 +0100)]
rasdaemon:add support for Hisilicon non-standard error decoder
1. This patch adds support to decode the non-standard
   error information for the Hisilicon HIP07 SAS HW module.
2. Add a stub decoder for the Hisilicon HIP07 HNS HW module.
Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>