Shiju Jose [Wed, 16 Oct 2019 16:34:01 +0000 (17:34 +0100)]
rasdaemon: add signal handling for the cleanup
Presently rasdaemon does not free allocated memory or
do other cleanup when it is stopped
with ctrl+c, kill, etc.
This patch adds handling of the signals SIGINT, SIGTERM, SIGHUP
and SIGQUIT, and does the necessary cleanup when one of these
signals is received.
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
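A minimal sketch of this kind of signal handling, not the code from the patch. The names stop_requested, on_terminate() and install_cleanup_handlers() are hypothetical. The handler only sets a flag, on the assumption that the main loop checks it and then does the cleanup outside of signal context:

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <signal.h>
#include <string.h>

/* Set from the handler; only async-signal-safe work happens there. */
static volatile sig_atomic_t stop_requested = 0;

static void on_terminate(int signo)
{
	(void)signo;
	stop_requested = 1;
}

/* Install the same handler for SIGINT, SIGTERM, SIGHUP and SIGQUIT.
 * Returns 0 on success, -1 if any sigaction() call fails. */
static int install_cleanup_handlers(void)
{
	static const int sigs[] = { SIGINT, SIGTERM, SIGHUP, SIGQUIT };
	struct sigaction sa;
	size_t i;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_terminate;
	sigemptyset(&sa.sa_mask);

	for (i = 0; i < sizeof(sigs) / sizeof(sigs[0]); i++)
		if (sigaction(sigs[i], &sa, NULL) < 0)
			return -1;
	return 0;
}
```

A main loop would call install_cleanup_handlers() once at startup and leave its loop, freeing memory and closing files, as soon as stop_requested becomes non-zero.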
Xiaofei Tan [Tue, 8 Oct 2019 12:38:57 +0000 (20:38 +0800)]
rasdaemon: add timestamp for hip08 OEM error records in sqlite3 DB
This patch does two things:
1. Add timestamp for hip08 OEM error records in sqlite3 DB.
2. Add suffix "_v2" for hip08 OEM event names to keep compatibility
with old sqlite3 DB.
Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Xiaofei Tan [Tue, 8 Oct 2019 12:38:54 +0000 (20:38 +0800)]
rasdaemon: optimize sqlite3 DB record of register fields for hip08
Optimize the sqlite3 DB record of register fields for hip08 by combining
all register fields into one text field that also includes the register names.
This makes the record easier to read.
For example, from:
INSERT INTO hip08_oem_type2_event VALUES(1,1,1,0,0,'SMMU',2,'corrected',
273058,0,-1,0,1308622858,0,0,0,0,133,0,0,NULL);
change to:
INSERT INTO hip08_oem_type2_event VALUES(1,1,1,0,0,'SMMU',2,'corrected',
'ERR_FR_0=0x42aa2 ERR_FR_1=0x0 ERR_CTRL_0=0xffffffff ERR_CTRL_1=0x0
ERR_STATUS_0=0x4e00000a ERR_STATUS_1=0x0 ERR_ADDR_0=0x0, ERR_ADDR_1=0x0
ERR_MISC0_0=0x0 ERR_MISC0_1=0x90 ERR_MISC1_0=0x0 ERR_MISC1_1=0x0');
Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
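Combining registers into one "NAME=0x... NAME=0x..." text field could be sketched as follows. This is an illustration under assumptions, not the code from the patch: struct reg_value and format_regs() are hypothetical names.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical pair of register name and value. */
struct reg_value {
	const char *name;
	uint32_t value;
};

/* Write "NAME=0x<hex> NAME=0x<hex> ..." into buf (capacity cap).
 * Returns 0 if everything fit, -1 if the output was truncated. */
static int format_regs(char *buf, size_t cap,
		       const struct reg_value *regs, size_t count)
{
	size_t used = 0;
	size_t i;

	if (cap == 0)
		return -1;
	buf[0] = '\0';

	for (i = 0; i < count; i++) {
		int n = snprintf(buf + used, cap - used, "%s%s=0x%x",
				 i ? " " : "", regs[i].name,
				 (unsigned)regs[i].value);
		if (n < 0 || (size_t)n >= cap - used)
			return -1;
		used += (size_t)n;
	}
	return 0;
}
```

The resulting string can then be bound to a single text column instead of one integer column per register.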
Xiaofei Tan [Thu, 8 Aug 2019 02:14:30 +0000 (10:14 +0800)]
rasdaemon: fix the issue of sqlite3 integer bind parameter mismatch
Some integer fields of arm_event and mc_event are 8 bytes wide,
and sqlite3_bind_int64() should be used when storing the event to
sqlite3. But the current code uses sqlite3_bind_int(). This
leads to a wrong value in the sqlite3 DB.
This patch fixes the issue.
Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
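The effect can be modelled without sqlite3 at all. sqlite3_bind_int() takes an int while sqlite3_bind_int64() takes a 64-bit value, so an 8-byte field passed to the former is narrowed first. The two helpers below are hypothetical stand-ins for those parameter types, and use unsigned types so that the narrowing is well defined in C:

```c
#include <stdint.h>

/* Models a 64-bit field passed through a 32-bit parameter:
 * only the low 32 bits survive. */
static uint32_t through_32bit_param(uint64_t value)
{
	return (uint32_t)value;
}

/* Models a 64-bit parameter: the value is preserved. */
static uint64_t through_64bit_param(uint64_t value)
{
	return value;
}
```

Any value that needs more than 32 bits comes out different from the first helper, which is the kind of wrong value the commit message describes.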
README: updated instructions about sending patches
The instructions there are a little outdated. Sergio
suggested changing just my e-mail, but let's do a better job
and use my canonical e-mail (mchehab@kernel.org), plus add the
alternative of sending patches against either github or gitlab.
fix file descriptor leak in ras-report.c:setup_report_socket()
A running instance of rasdaemon was seen to hit the limit on open file
descriptors. Most of the descriptors were AF_UNIX STREAM sockets.
At the same time the limit was hit, attempts by rasdaemon to open the
SQLite database started failing with SQLite error 14.
This patch avoids leaking a socket file descriptor each time the connect()
call fails.
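The pattern of the fix, sketched under assumptions (connect_unix_socket() is a hypothetical helper, not the actual setup_report_socket() code): every path that returns an error after socket() has succeeded must close the descriptor first.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Returns a connected AF_UNIX stream socket, or -1 on failure.
 * On a failed connect() the descriptor is closed before returning. */
static int connect_unix_socket(const char *path)
{
	struct sockaddr_un addr;
	int fd;

	if (strlen(path) >= sizeof(addr.sun_path))
		return -1;

	fd = socket(AF_UNIX, SOCK_STREAM, 0);
	if (fd < 0)
		return -1;

	memset(&addr, 0, sizeof(addr));
	addr.sun_family = AF_UNIX;
	strcpy(addr.sun_path, path);	/* length checked above */

	if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		close(fd);	/* without this, each failure leaks one fd */
		return -1;
	}
	return fd;
}
```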
parse_ras_data: initialize record.cpu before pevent_print_event().
pevent_print_event() prints record.cpu; make sure it's initialized.
The cpu field from pthread_data is my best guess at a suitable value:
parse_ras_data() was already printing it separately.
parse_ras_data: flush trace buffer immediately, not on next call
parse_ras_data() was calling fflush() before, not after printf().
As a result, information about an event would not be printed
immediately but possibly much later.
Shiju Jose [Mon, 17 Jun 2019 14:28:51 +0000 (15:28 +0100)]
rasdaemon: add logging of HiSilicon HIP08 H/W errors reported in the OEM format2
This patch adds logging of the HiSilicon HIP08 H/W errors reported
in the non-standard OEM format2.
These errors are from the H/W modules SMMU, HHA, HLLC, PA and DDRC.
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Shiju Jose [Mon, 17 Jun 2019 14:28:50 +0000 (15:28 +0100)]
rasdaemon: add logging of HiSilicon HIP08 H/W errors reported in the OEM format1
This patch adds logging of the HiSilicon HIP08 H/W errors reported
in the non-standard OEM format1.
These errors are from the H/W modules MN, PLL, SLLC, AA, SIOE,
POE, DISP, LPC, SAS and SATA.
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
ras-mce-handler.c:344:9: warning: address of array 'e->mcgstatus_msg' will always evaluate to 'true' [-Wpointer-bool-conversion]
if (e->mcgstatus_msg)
~~ ~~~^~~~~~~~~~~~~
Ying Lv [Wed, 15 May 2019 03:15:42 +0000 (11:15 +0800)]
fix rasdaemon high CPU usage when part of CPUs offline
When some CPU cores are offline, for example because the kernel cmdline
sets maxcpus=N (with N less than the total number of CPU cores in the system),
the CPU usage of some rasdaemon threads
gets very close to 100%.
This happens because, with some CPUs offline, poll() in read_ras_event_all_cpus()
falls back to the pthread way.
The thread of an offlined CPU gets a negative return value when reading trace_pipe_raw,
and that negative value is converted to a positive one because of 'unsigned size'.
So the code always takes the 'size > 0' branch, and the CPU usage is too high.
Declaring the variable size as int makes the code take the right branch.
Fixes: eff7c9e0("ras-events: Only use pthreads for collect if poll() not available")
Reported-by: Zhipeng Xie <xiezhipeng1@huawei.com>
Signed-off-by: Ying Lv <lvying6@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
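The signedness problem can be reproduced in isolation. The two helpers below are hypothetical and only model the 'size > 0' check with a read() result passed in as an argument:

```c
#include <sys/types.h>

/* Buggy variant: a negative read() result stored in an unsigned
 * variable becomes a huge positive number, so "size > 0" holds. */
static int has_data_unsigned(ssize_t read_result)
{
	unsigned size = (unsigned)read_result;

	return size > 0;
}

/* Fixed variant: a signed variable keeps the error negative. */
static int has_data_signed(ssize_t read_result)
{
	int size = (int)read_result;

	return size > 0;
}
```

With a result of -1, the unsigned variant reports data and the caller keeps spinning, while the signed variant takes the error branch.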
Aristeu Rozanski [Wed, 1 Aug 2018 20:29:58 +0000 (16:29 -0400)]
rasdaemon: ras-mc-ctl: add option to show error counts
In some scenarios it might not be desirable to have a daemon running
to parse and store the errors provided by EDAC and only having the
number of CEs and UEs is enough. This patch implements this feature
as an ras-mc-ctl option.
mce-amd-k8: be sure to not go past error_msg buffer
As warned by gcc:
mce-amd-k8.c: In function ‘decode_k8_generic_errcode’:
mce-amd-k8.c:136:30: warning: ‘) ’ directive output may be truncated writing 2 bytes into a region of size between 0 and 4095 [-Wformat-truncation=]
mce_snprintf(e->error_msg, "(%s) ", tmp_buf);
^~~~~~~
ras-mce-handler.h:104:29: note: in definition of macro ‘mce_snprintf’
snprintf(buf + __n, __len, fmt, ##arg); \
^~~
ras-mce-handler.h:104:2: note: ‘snprintf’ output between 4 and 4099 bytes into a destination of size 4096
snprintf(buf + __n, __len, fmt, ##arg); \
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
mce-amd-k8.c:136:3: note: in expansion of macro ‘mce_snprintf’
mce_snprintf(e->error_msg, "(%s) ", tmp_buf);
^~~~~~~~~~~~
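One way to append to a fixed-size message buffer without going past its end, as a sketch only: append_fmt() is a hypothetical helper, not the mce_snprintf() macro from ras-mce-handler.h. It computes the remaining room from the current string length and reports truncation to the caller.

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Append formatted text to buf (capacity cap, including the NUL).
 * buf must already hold a NUL-terminated string.
 * Returns 0 if everything fit, -1 if the output was truncated. */
static int append_fmt(char *buf, size_t cap, const char *fmt, ...)
{
	size_t used = strlen(buf);
	size_t room;
	va_list ap;
	int n;

	if (used >= cap)
		return -1;
	room = cap - used;	/* includes space for the NUL */

	va_start(ap, fmt);
	n = vsnprintf(buf + used, room, fmt, ap);
	va_end(ap);

	return (n < 0 || (size_t)n >= room) ? -1 : 0;
}
```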
ras-report.c: In function ‘setup_report_socket’:
ras-report.c:36:2: warning: ‘strncpy’ output truncated before terminating nul copying 25 bytes from a string of the same length [-Wstringop-truncation]
strncpy(addr.sun_path, ABRT_SOCKET, strlen(ABRT_SOCKET));
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The strncpy logic there is wrong. Fix it and make sure that
addr.sun_path always holds a NUL-terminated string.
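A sketch of a copy that always leaves addr.sun_path NUL-terminated. set_sun_path() is a hypothetical helper and may differ from what the patch actually does; it rejects paths that do not fit instead of silently truncating them.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

/* Copy path into addr->sun_path, always leaving a terminating NUL.
 * Returns 0 on success, -1 if the path does not fit. */
static int set_sun_path(struct sockaddr_un *addr, const char *path)
{
	size_t len = strlen(path);

	if (len >= sizeof(addr->sun_path))
		return -1;
	memset(addr->sun_path, 0, sizeof(addr->sun_path));
	memcpy(addr->sun_path, path, len);
	return 0;
}
```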
mce-intel-*: fix a warning when using FIELD(<num>, NULL)
Internally, the FIELD() macro checks the size of an array by
using ARRAY_SIZE. Well, this macro performs a meaningless division
(sizeof (void *) / sizeof (void)) if NULL is used, as NULL is a void pointer, as warned:
mce-intel-dunnington.c:30:2: note: in expansion of macro ‘FIELD’
FIELD(17, NULL),
^~~~~
ras-mce-handler.h:28:33: warning: division ‘sizeof (void *) / sizeof (void)’ does not compute the number of array elements [-Wsizeof-pointer-div]
#define ARRAY_SIZE(x) (sizeof(x)/sizeof(*(x)))
^
bitfield.h:37:51: note: in expansion of macro ‘ARRAY_SIZE’
#define FIELD(start_bit, name) { start_bit, name, ARRAY_SIZE(name) }
^~~~~~~~~~
While this warning is harmless, it may prevent seeing more serious
warnings. So, add a FIELD_NULL(<num>) macro to avoid that.
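A sketch of how such a macro could look. ARRAY_SIZE() and FIELD() are copied from the warning output above. struct field, the FIELD_NULL() body and the example tables are assumptions made for illustration:

```c
#include <stddef.h>

#define ARRAY_SIZE(x) (sizeof(x)/sizeof(*(x)))

/* Hypothetical stand-in for the descriptor that FIELD() initializes. */
struct field {
	unsigned start_bit;
	const char **str;
	size_t stringlen;
};

#define FIELD(start_bit, name) { start_bit, name, ARRAY_SIZE(name) }
/* No array to measure, so the length is written out as 0. */
#define FIELD_NULL(start_bit) { start_bit, NULL, 0 }

static const char *example_names[] = { "name A", "name B", "name C" };

static const struct field example_fields[] = {
	FIELD(0, example_names),
	FIELD_NULL(17),
};
```

Because FIELD_NULL() never expands ARRAY_SIZE(), the compiler has no sizeof (void) division to warn about.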
Thomas Tai [Mon, 14 May 2018 14:33:48 +0000 (10:33 -0400)]
rasdaemon: use separate string array for error status
The bit field descriptions for the correctable status register
and the uncorrectable status register are different. Using a
single aer_errors string array causes bit[12] to
overlap, so the wrong description is recorded.
A separate variable is needed to switch between the correctable
and uncorrectable error strings.
Signed-off-by: Thomas Tai <thomas.tai@oracle.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
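The idea of separate tables can be sketched as follows. The array and function names are hypothetical, and only bit 12 is filled in, using the PCIe AER names for that bit in each register (Replay Timer Timeout in the correctable status, Poisoned TLP in the uncorrectable status):

```c
#include <stddef.h>

/* Illustrative subsets only: just the bit the commit message mentions. */
static const char *aer_cor_errors[32] = {
	[12] = "Replay Timer Timeout",
};

static const char *aer_uncor_errors[32] = {
	[12] = "Poisoned TLP",
};

/* Pick the description table that matches the status register. */
static const char *aer_bit_name(int correctable, unsigned bit)
{
	const char **table = correctable ? aer_cor_errors : aer_uncor_errors;

	if (bit >= 32 || !table[bit])
		return "unknown";
	return table[bit];
}
```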
Thomas Tai [Mon, 14 May 2018 14:33:47 +0000 (10:33 -0400)]
rasdaemon: fix PCIe AER error type
The PCIe AER error types differ from the CPU Machine Check
ones. When handling aer_event, the PCIe AER error
type should be used. Add an enum that matches the kernel
PCIe AER definitions and use it to decode the error type.
Signed-off-by: Thomas Tai <thomas.tai@oracle.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
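A sketch of decoding with a dedicated enum. The numeric values follow the kernel's AER severity constants as far as known here and should be checked against the kernel headers. The enum tag and aer_severity_name() are hypothetical and may not match the names used in the patch:

```c
/* Assumed to mirror the kernel's AER severity values; verify before use. */
enum example_aer_severity {
	AER_NONFATAL = 0,
	AER_FATAL = 1,
	AER_CORRECTABLE = 2,
};

/* Map the severity reported in an aer_event to a printable type. */
static const char *aer_severity_name(int severity)
{
	switch (severity) {
	case AER_CORRECTABLE:
		return "Corrected";
	case AER_FATAL:
		return "Fatal";
	case AER_NONFATAL:
		return "Uncorrected";
	default:
		return "Unknown";
	}
}
```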
Greg Edwards [Wed, 28 Mar 2018 22:10:46 +0000 (16:10 -0600)]
rasdaemon: Add Skylake Xeon MSCOD values
Based on mcelog commits e4aca6312aee ("Add support to decode MSCOD
values for Skylake server") and 34f03e306c36 ("mcelog: Change name of
skylake interconnect from QPI to UPI").
Aristeu Rozanski [Fri, 2 Feb 2018 15:20:48 +0000 (10:20 -0500)]
rasdaemon: ARM: fully initialize ras_arm_event
Issue found by covscan:
1. rasdaemon-0.4.1/ras-arm-handler.c:32: var_decl: Declaring variable "ev" without initializer.
16. rasdaemon-0.4.1/ras-arm-handler.c:81: uninit_use_in_call: Using uninitialized value "ev.error_count" when calling "ras_store_arm_record".
23. rasdaemon-0.4.1/ras-record.c:243:2: read_parm_fld: Reading a parameter field.
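A common way to avoid this class of report is to zero the whole structure before filling it in. The sketch below uses a hypothetical, trimmed-down struct rather than the real ras_arm_event:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical, trimmed-down event record for illustration. */
struct example_arm_event {
	char timestamp[64];
	int32_t error_count;
	int64_t mpidr;
};

/* Zero every field so that fields the handler never sets
 * are not read uninitialized when the record is stored. */
static void example_event_init(struct example_arm_event *ev)
{
	memset(ev, 0, sizeof(*ev));
}
```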
mce-intel-p4-p6: prevent build errors with -Werror=format-security
On Fedora, -Werror=format-security is now used on packages, which
causes the following build error:
mce-intel-p4-p6.c: In function 'p4_decode_model':
mce-intel-p4-p6.c:130:4: error: format not a string literal and no format arguments [-Werror=format-security]
mce_snprintf(e->error_msg, p4_model[i].str);
^~~~~~~~~~~~
cc1: some warnings being treated as errors
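The usual remedy for this error is to pass the string as an argument to a literal "%s" format instead of using it as the format itself. A minimal sketch with a hypothetical helper, using plain snprintf() rather than mce_snprintf():

```c
#include <stdio.h>

/* Copy an arbitrary message into buf. The message is passed as an
 * argument to a literal "%s" format, never as the format itself,
 * so any '%' it contains is copied verbatim. */
static int copy_message(char *buf, size_t len, const char *msg)
{
	return snprintf(buf, len, "%s", msg);
}
```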
configure.ac: show if Hisilicon error report is enabled
As changeset b856c89a11d7 ("rasdaemon:add support for
Hisilicon non-standard error decoder") added a new
configurable error report, show if it is enabled or not.
shiju.jose@huawei.com [Wed, 4 Oct 2017 09:11:21 +0000 (10:11 +0100)]
rasdaemon:add support for Hisilicon non-standard error decoder
1. This patch adds support to decode the non-standard
error information for Hisilicon HIP07 SAS HW module.
2. Add stub decoder for Hisilicon HIP07 HNS HW module.
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
Seiichi Ikarashi [Wed, 17 Jun 2015 10:56:57 +0000 (07:56 -0300)]
rasdaemon: add internal errors of IA32_MC4_STATUS for Haswell
Currently rasdaemon seems to purposely omit the internal errors of
IA32_MC4_STATUS for Haswell-family processors, which are described in
Intel SDM vol3 Table 16-20. I think it's better to show these errors
because mcelog does show them.
Seiichi Ikarashi [Fri, 12 Jun 2015 09:35:37 +0000 (06:35 -0300)]
rasdaemon: use MCA error msg as error_msg
In the case of machine checks which do not have a model-specific MCA
error code but only an architectural one, mce_event.error_msg
ends up empty, so you don't know what happened.
Aristeu Rozanski [Mon, 1 Jun 2015 20:04:00 +0000 (17:04 -0300)]
rasdaemon: add support to match the machine by system's product name
In some cases the motherboard names will change but the mapping won't
across a line of products. This patch adds support for "Product:" to be
specified in the label files instead of Model:.
An example:
Vendor: Dell Inc.
Product: PowerEdge R610
DIMM_A1: 0.0.0; DIMM_A2: 0.0.1; DIMM_A3: 0.0.2;
DIMM_A4: 0.1.0; DIMM_A5: 0.1.1; DIMM_A6: 0.1.2;
Seiichi Ikarashi [Tue, 26 May 2015 14:59:39 +0000 (11:59 -0300)]
rasdaemon: make sure the error is valid before handling ranks
Fix "rank" handling according to the Bit 63 description in Intel SDM Vol.3C
Table 16-23, that says "... Use this information only after there is valid
first error info indicated by bit 62".
Also fix invalid comparisons of unsigned variables "rank0" and "rank1".
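Both points can be sketched together: test the valid bit before reading the field, and use a signed return value so that "no rank" can be expressed as -1. extract_rank() is a hypothetical helper. The position and width of the rank field are supplied by the caller and are not taken from the SDM here:

```c
#include <stdint.h>

#define RANK_NOT_VALID (-1)

/* valid_bit: position of the "info valid" flag (62 in the quoted text).
 * shift, width: position and size of the rank field (caller-supplied,
 * width must be below 31 so the result fits in an int).
 * Returns the rank, or RANK_NOT_VALID when the flag is clear. */
static int extract_rank(uint64_t reg, unsigned valid_bit,
			unsigned shift, unsigned width)
{
	if (!(reg & (1ULL << valid_bit)))
		return RANK_NOT_VALID;
	return (int)((reg >> shift) & ((1ULL << width) - 1));
}
```

With an unsigned result a check such as "rank >= 0" is always true, which is why the sentinel only works with a signed type.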