Shiju Jose [Mon, 17 Jun 2019 14:28:51 +0000 (15:28 +0100)]
rasdaemon:add logging HiSilicon HIP08 H/W errors reported in the OEM format2
This patch adds logging of the HiSilicon HIP08 H/W errors reported
in the non-standard OEM format2.
These errors are from the H/W modules SMMU, HHA, HLLC, PA and DDRC.
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Shiju Jose [Mon, 17 Jun 2019 14:28:50 +0000 (15:28 +0100)]
rasdaemon:add logging HiSilicon HIP08 H/W errors reported in the OEM format1
This patch adds logging of the HiSilicon HIP08 H/W errors reported
in the non-standard OEM format1.
These errors are from the H/W modules MN, PLL, SLLC, AA, SIOE,
POE, DISP, LPC, SAS and SATA.
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
ras-mce-handler.c:344:9: warning: address of array 'e->mcgstatus_msg' will always evaluate to 'true' [-Wpointer-bool-conversion]
if (e->mcgstatus_msg)
~~ ~~~^~~~~~~~~~~~~
Ying Lv [Wed, 15 May 2019 03:15:42 +0000 (11:15 +0800)]
fix rasdaemon high CPU usage when part of CPUs offline
When some CPU cores are offline, for example after booting with the kernel
command-line option maxcpus=N (where N is less than the total number of
system CPU cores), the CPU usage of some rasdaemon threads stays very close
to 100%.
This happens because, when some CPUs are offline, the poll()-based collection
in read_ras_event_all_cpus() falls back to the pthread method.
The thread for an offlined CPU gets a negative return value when it reads
trace_pipe_raw, and that value is converted to a positive one because 'size'
is declared unsigned.
The code therefore always takes the 'size > 0' branch, which causes the high
CPU usage.
Declaring 'size' as int makes the code take the correct branch.
Fixes: eff7c9e0("ras-events: Only use pthreads for collect if poll() not available")
Reported-by: Zhipeng Xie <xiezhipeng1@huawei.com>
Signed-off-by: Ying Lv <lvying6@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
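The signedness problem described in this commit can be illustrated with a
minimal sketch. The helper names below are hypothetical and are not taken
from rasdaemon; they only contrast storing a read() result in an unsigned
variable with storing it in a signed one.

```c
#include <stdbool.h>
#include <sys/types.h>	/* ssize_t */

/*
 * Hypothetical helpers (not rasdaemon code): decide whether a value
 * returned by read() means "data is available".
 */

/* Buggy variant: the result is stored in an unsigned variable. */
static bool has_data_unsigned(ssize_t read_result)
{
	unsigned size = (unsigned)read_result;	/* -1 becomes UINT_MAX */

	return size > 0;	/* also true for errors */
}

/* Fixed variant: a signed variable keeps the error visible. */
static bool has_data_signed(ssize_t read_result)
{
	int size = (int)read_result;

	return size > 0;	/* false for -1 and for 0 */
}
```

With a failed read (-1), the unsigned variant still reports data, so a
reader loop built on it would keep spinning, while the signed variant lets
the caller notice the error.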
Aristeu Rozanski [Wed, 1 Aug 2018 20:29:58 +0000 (16:29 -0400)]
rasdaemon: ras-mc-ctl: add option to show error counts
In some scenarios it might not be desirable to have a daemon running
to parse and store the errors provided by EDAC, and knowing only the
number of CEs and UEs is enough. This patch implements this feature
as a ras-mc-ctl option.
mce-amd-k8: be sure to not go past error_msg buffer
As warned by gcc:
mce-amd-k8.c: In function ‘decode_k8_generic_errcode’:
mce-amd-k8.c:136:30: warning: ‘) ’ directive output may be truncated writing 2 bytes into a region of size between 0 and 4095 [-Wformat-truncation=]
mce_snprintf(e->error_msg, "(%s) ", tmp_buf);
^~~~~~~
ras-mce-handler.h:104:29: note: in definition of macro ‘mce_snprintf’
snprintf(buf + __n, __len, fmt, ##arg); \
^~~
ras-mce-handler.h:104:2: note: ‘snprintf’ output between 4 and 4099 bytes into a destination of size 4096
snprintf(buf + __n, __len, fmt, ##arg); \
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
mce-amd-k8.c:136:3: note: in expansion of macro ‘mce_snprintf’
mce_snprintf(e->error_msg, "(%s) ", tmp_buf);
^~~~~~~~~~~~
ras-report.c: In function ‘setup_report_socket’:
ras-report.c:36:2: warning: ‘strncpy’ output truncated before terminating nul copying 25 bytes from a string of the same length [-Wstringop-truncation]
strncpy(addr.sun_path, ABRT_SOCKET, strlen(ABRT_SOCKET));
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The strncpy logic there is wrong. Fix it and be sure that addr.sun_path
is filled with a NUL-terminated string.
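One way to guarantee a NUL-terminated sun_path is sketched below. The
helper fill_unix_addr() is hypothetical and is not the code used in
ras-report.c; it only shows a length check followed by a copy that includes
the terminator.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

/*
 * Hypothetical helper (not the actual rasdaemon fix): fill a
 * sockaddr_un with a path and guarantee NUL termination.
 * Returns 0 on success, -1 if the path does not fit.
 */
static int fill_unix_addr(struct sockaddr_un *addr, const char *path)
{
	size_t len = strlen(path);

	/* Leave room for the terminating NUL. */
	if (len >= sizeof(addr->sun_path))
		return -1;

	memset(addr, 0, sizeof(*addr));
	addr->sun_family = AF_UNIX;
	memcpy(addr->sun_path, path, len + 1);	/* copies the NUL too */
	return 0;
}
```

Passing strlen() of the source as the strncpy() bound, as in the warned
line, copies every character except the terminator, which is what the
compiler complains about.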
mce-intel-*: fix a warning when using FIELD(<num>, NULL)
Internally, the FIELD() macro computes the size of an array by
using ARRAY_SIZE. This macro performs a meaningless division
if NULL is used, because NULL has type void * and the division becomes
sizeof (void *) / sizeof (void), as warned:
mce-intel-dunnington.c:30:2: note: in expansion of macro ‘FIELD’
FIELD(17, NULL),
^~~~~
ras-mce-handler.h:28:33: warning: division ‘sizeof (void *) / sizeof (void)’ does not compute the number of array elements [-Wsizeof-pointer-div]
#define ARRAY_SIZE(x) (sizeof(x)/sizeof(*(x)))
^
bitfield.h:37:51: note: in expansion of macro ‘ARRAY_SIZE’
#define FIELD(start_bit, name) { start_bit, name, ARRAY_SIZE(name) }
^~~~~~~~~~
While this warning is harmless, it may prevent seeing more serious
warnings. So, add a FIELD_NULL(<num>) macro to avoid that.
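A minimal sketch of such a macro follows. The FIELD() and ARRAY_SIZE()
definitions are the ones quoted in the warning; the struct layout, the
FIELD_NULL() body and the example tables are assumptions made for
illustration.

```c
#include <stddef.h>

#define ARRAY_SIZE(x) (sizeof(x) / sizeof(*(x)))

/* Layout assumed from the FIELD() initializer quoted in the warning. */
struct field {
	unsigned start_bit;
	char **str;
	unsigned stringlen;
};

#define FIELD(start_bit, name) { start_bit, name, ARRAY_SIZE(name) }
/* Assumed definition: no array is given, so the count is simply 0. */
#define FIELD_NULL(start_bit) { start_bit, NULL, 0 }

/* Hypothetical example tables. */
static char *example_names[] = { "first", "second" };

static struct field example_fields[] = {
	FIELD(0, example_names),
	FIELD_NULL(17),		/* no sizeof(*NULL) is evaluated here */
};
```

Because FIELD_NULL() never expands ARRAY_SIZE(), the
-Wsizeof-pointer-div warning no longer has anything to trigger on.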
Thomas Tai [Mon, 14 May 2018 14:33:48 +0000 (10:33 -0400)]
rasdaemon: use separate string array for error status
The bit field descriptions for the correctable status register
and the uncorrectable status register are different. Using a
single aer_errors string array causes bit[12] to
overlap, so the wrong description is recorded.
Separate string arrays are needed, selected according to whether
the error is correctable or uncorrectable.
Signed-off-by: Thomas Tai <thomas.tai@oracle.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
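The bit[12] overlap can be sketched as follows. The array names and the
aer_describe() helper are hypothetical, and only a few entries are listed;
the descriptions follow the PCIe AER status registers, where bit 12 means
"Replay Timer Timeout" in the correctable register and "Poisoned TLP" in
the uncorrectable one.

```c
#include <stddef.h>

#define BIT(n) (1U << (n))

/* Hypothetical, partial tables; unlisted entries stay NULL. */
static const char *aer_cor_errors[32] = {
	[0]  = "Receiver Error",
	[12] = "Replay Timer Timeout",
};

static const char *aer_uncor_errors[32] = {
	[4]  = "Data Link Protocol",
	[12] = "Poisoned TLP",
};

/* Hypothetical helper: describe the lowest known bit set in status. */
static const char *aer_describe(unsigned status, int correctable)
{
	const char **table = correctable ? aer_cor_errors : aer_uncor_errors;
	int i;

	for (i = 0; i < 32; i++)
		if ((status & BIT(i)) && table[i])
			return table[i];
	return "Unknown";
}
```

The same status value BIT(12) yields two different descriptions depending
on which register it came from, which a single shared array cannot express.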
Thomas Tai [Mon, 14 May 2018 14:33:47 +0000 (10:33 -0400)]
rasdaemon: fix PCIe AER error type
The error types for PCIe AER and CPU Machine Check are
different. When handling aer_event, the PCIe AER error
type should be used. Add an enum to match the kernel
PCIe AER and use it to decode the error type.
Signed-off-by: Thomas Tai <thomas.tai@oracle.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
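A sketch of such an enum is shown below. The numeric values are assumed to
mirror the kernel's AER severity constants, and both the enum name and the
aer_error_type_name() helper with its strings are hypothetical.

```c
/* Values assumed to mirror the kernel's AER severity constants. */
enum aer_error_type {
	AER_NONFATAL = 0,
	AER_FATAL = 1,
	AER_CORRECTABLE = 2,
};

/* Hypothetical helper: map an AER error type to a printable name. */
static const char *aer_error_type_name(int err_type)
{
	switch (err_type) {
	case AER_CORRECTABLE:
		return "Corrected";
	case AER_FATAL:
		return "Fatal";
	case AER_NONFATAL:
		return "Uncorrected (Non-Fatal)";
	default:
		return "Unknown";
	}
}
```

Decoding with a dedicated enum avoids reusing machine-check constants whose
numeric values do not line up with the AER ones.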
Greg Edwards [Wed, 28 Mar 2018 22:10:46 +0000 (16:10 -0600)]
rasdaemon: Add Skylake Xeon MSCOD values
Based on mcelog commits e4aca6312aee ("Add support to decode MSCOD
values for Skylake server") and 34f03e306c36 ("mcelog: Change name of
skylake interconnect from QPI to UPI").
Aristeu Rozanski [Fri, 2 Feb 2018 15:20:48 +0000 (10:20 -0500)]
rasdaemon: ARM: fully initialize ras_arm_event
Issue found by covscan:
1. rasdaemon-0.4.1/ras-arm-handler.c:32: var_decl: Declaring variable "ev" without initializer.
16. rasdaemon-0.4.1/ras-arm-handler.c:81: uninit_use_in_call: Using uninitialized value "ev.error_count" when calling "ras_store_arm_record".
23. rasdaemon-0.4.1/ras-record.c:243:2: read_parm_fld: Reading a parameter field.
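A common way to address this class of finding is to zero the whole structure
before filling it in. The sketch below uses a simplified, hypothetical
stand-in for struct ras_arm_event; only the error_count field name comes
from the report above.

```c
#include <string.h>

/* Hypothetical, simplified stand-in for struct ras_arm_event. */
struct arm_event_example {
	char timestamp[64];
	int error_count;
	long long mpidr;
};

/* Zero every field so no uninitialized value reaches later calls. */
static void arm_event_init(struct arm_event_example *ev)
{
	memset(ev, 0, sizeof(*ev));
}
```

After this call, any field the handler does not explicitly set holds zero
rather than stack garbage.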
mce-intel-p4-p6: prevent build errors with -Werror=format-security
On Fedora, -Werror=format-security is now used on packages, which
causes the following build error:
mce-intel-p4-p6.c: In function 'p4_decode_model':
mce-intel-p4-p6.c:130:4: error: format not a string literal and no format arguments [-Werror=format-security]
mce_snprintf(e->error_msg, p4_model[i].str);
^~~~~~~~~~~~
cc1: some warnings being treated as errors
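The usual remedy for this error is to pass the string as an argument to a
literal "%s" format. The sketch below uses plain snprintf() and a
hypothetical helper name instead of the mce_snprintf() macro.

```c
#include <stdio.h>

/*
 * Hypothetical helper: copy a message that is not a string literal.
 * Writing snprintf(buf, len, msg) would interpret any '%' in msg as a
 * conversion directive; "%s" copies it verbatim.
 */
static int copy_model_msg(char *buf, size_t len, const char *msg)
{
	return snprintf(buf, len, "%s", msg);
}
```

With the literal format, a message containing '%' characters is stored
unchanged and the compiler can verify the format string.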
configure.ac: show if Hisilicon error report are enabled
As changeset b856c89a11d7 ("rasdaemon:add support for
Hisilicon non-standard error decoder") added a new
configurable error report, show if it is enabled or not.
shiju.jose@huawei.com [Wed, 4 Oct 2017 09:11:21 +0000 (10:11 +0100)]
rasdaemon:add support for Hisilicon non-standard error decoder
1. This patch adds support for decoding the non-standard
error information for the Hisilicon HIP07 SAS HW module.
2. Add a stub decoder for the Hisilicon HIP07 HNS HW module.
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
Seiichi Ikarashi [Wed, 17 Jun 2015 10:56:57 +0000 (07:56 -0300)]
rasdaemon: add internal errors of IA32_MC4_STATUS for Haswell
Currently rasdaemon appears to purposely omit the internal errors of
IA32_MC4_STATUS for Haswell-family processors, which are described in
Intel SDM vol3 Table 16-20. I think it's better to show these errors
because mcelog does show them.
Seiichi Ikarashi [Fri, 12 Jun 2015 09:35:37 +0000 (06:35 -0300)]
rasdaemon: use MCA error msg as error_msg
In the case of machine checks which do not have a model-specific MCA
error code but have an architectural code only, mce_event.error_msg
becomes empty, so you cannot tell what happened.
Aristeu Rozanski [Mon, 1 Jun 2015 20:04:00 +0000 (17:04 -0300)]
rasdaemon: add support to match the machine by system's product name
In some cases the motherboard names will change across a line of products
but the mapping won't. This patch adds support for "Product:" to be
specified in the label files instead of "Model:".
An example:
Vendor: Dell Inc.
Product: PowerEdge R610
DIMM_A1: 0.0.0; DIMM_A2: 0.0.1; DIMM_A3: 0.0.2;
DIMM_A4: 0.1.0; DIMM_A5: 0.1.1; DIMM_A6: 0.1.2;
Seiichi Ikarashi [Tue, 26 May 2015 14:59:39 +0000 (11:59 -0300)]
rasdaemon: make sure the error is valid before handling ranks
Fix "rank" handling according to the Bit 63 description in Intel SDM Vol.3C
Table 16-23, that says "... Use this information only after there is valid
first error info indicated by bit 62".
Also fix invalid comparisons of unsigned variables "rank0" and "rank1".
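The validity check can be sketched as follows. Bits 62 and 63 are the ones
named in this commit message; the rank field positions (bits 46-50 and
51-55), the EXTRACT() macro and the helper names are assumptions made for
illustration. Returning a signed int with -1 as the "not valid" marker
avoids comparing an unsigned variable against a negative value.

```c
#include <stdint.h>

/* Hypothetical bit-range extraction macro (inclusive range lo..hi). */
#define EXTRACT(v, lo, hi) \
	(((v) >> (lo)) & ((1ULL << ((hi) - (lo) + 1)) - 1))

#define FIRST_ERR_VALID  (1ULL << 62)	/* from the commit text */
#define SECOND_ERR_VALID (1ULL << 63)	/* from the commit text */

/* Rank of the first error, or -1 if bit 62 is clear. */
static int first_rank(uint64_t misc)
{
	if (!(misc & FIRST_ERR_VALID))
		return -1;
	return (int)EXTRACT(misc, 46, 50);	/* assumed field position */
}

/* Rank of the second error: needs bit 63 and a valid first error. */
static int second_rank(uint64_t misc)
{
	if (!(misc & FIRST_ERR_VALID) || !(misc & SECOND_ERR_VALID))
		return -1;
	return (int)EXTRACT(misc, 51, 55);	/* assumed field position */
}
```

A caller would then print a rank only when the corresponding helper returns
a non-negative value.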
Seiichi Ikarashi [Tue, 26 May 2015 14:59:37 +0000 (11:59 -0300)]
rasdaemon: add missing semicolon in hsw_decode_model()
hsw_decode_model() tries to skip decode_bitfield() if IA32_MC4_STATUS indicates
some internal errors. Unfortunately, the code behaves contrary to that intention
because a semicolon is missing.
Uniquely identify Ivy Bridge even though the machine checks are the same
for Sandy Bridge and Ivy Bridge. This makes the output for the processor
display "Ivy Bridge".
The September 2013 edition of the Software Developer's Manual added an
entry that had been inadvertently omitted from earlier editions.
Add the 0x80 entry for "Corrected memory read error".
Edition 050 of the Intel SDM released in late February 2014
includes a new simple error code in "Table 15-8. IA32_MCi_Status
[15:0] Simple Error Code Encoding". Code 6 (0000 0000 0000 0110)
has been allocated for the reporting of cases where the BIOS SMM
code attempts to execute code outside of the protected SMRR area.
Aristeu Rozanski [Fri, 15 Aug 2014 17:50:58 +0000 (13:50 -0400)]
rasdaemon: do not assume dimmX/ directories will be present
While finding the labels, size and location, ras-mc-ctl will search /sys for
the files and calculate the location. When it uses the location trying to map
back to files to print labels or write labels, it'll just assume dimm*
directories exist which is not correct while using drivers like amd64_edac.
This patch adds two new hashes to store the location and the label file path
so it can be used later.
Luck, Tony [Mon, 4 Aug 2014 20:29:01 +0000 (13:29 -0700)]
rasdaemon: Add support for extlog trace events
Linux kernel 3.17 includes a new trace event to pick up extended
error logs produced by BIOS in the Common Platform Error Record
format described in appendix N of the UEFI standard. This patch
adds support to collect that information and log it both in
readable ASCII and into the sqlite3 database that rasdaemon
uses to store all error information. In addition ras-mc-ctl
is updated to query that database for both detailed and summary
reports.
Big thanks to Aristeu for pretty much all the sqlite3 pieces,
plus testing and fixing miscellaneous issues elsewhere.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Aristeu Rozanski [Tue, 24 Jun 2014 15:01:31 +0000 (11:01 -0400)]
rasdaemon: handle failures of snprintf()
Florian Weimer found that in bitfield_msg() the return value of
snprintf() is used to calculate length ignoring that it can return a
negative number. This patch makes bitfield_msg() stop writing in such a
case.
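The pattern can be sketched with a small append helper. The name
append_str() and its interface are hypothetical and do not reproduce
bitfield_msg(); the sketch only shows checking the snprintf() result before
using it to advance the write position.

```c
#include <stdio.h>

/*
 * Hypothetical helper: append text to buf at offset *pos.
 * Returns 0 on success and -1 when the caller should stop writing
 * (snprintf() failed, or the buffer is full).
 */
static int append_str(char *buf, size_t len, size_t *pos, const char *text)
{
	int n;

	if (*pos >= len)
		return -1;

	n = snprintf(buf + *pos, len - *pos, "%s", text);
	if (n < 0)
		return -1;		/* error: do not touch *pos */

	if ((size_t)n >= len - *pos) {
		*pos = len - 1;		/* output was truncated */
		return -1;
	}

	*pos += (size_t)n;
	return 0;
}
```

Adding a negative return value to the offset, as the original code could do,
would move the write position backwards or wrap it around.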
Luck, Tony [Mon, 7 Apr 2014 18:27:47 +0000 (11:27 -0700)]
rasdaemon: sqlite truncates some MCE fields to 32-bit
The sqlite3_bind_int() function takes an "int" as the argument value to
save to the database. But some fields are wider than 32 bits. Use
sqlite3_bind_int64() for the fields where we know values can exceed
4G.
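The loss can be illustrated without linking against sqlite3. The helpers
below are hypothetical; they only show what remains of a 64-bit value once
it is squeezed through a 32-bit parameter, which is why such fields need
sqlite3_bind_int64() rather than sqlite3_bind_int().

```c
#include <stdint.h>

/* Hypothetical helper: what survives when only 32 bits are kept. */
static uint32_t low_32_bits(uint64_t value)
{
	return (uint32_t)value;		/* upper 32 bits are dropped */
}

/* Hypothetical helper: true if the value does not fit in a 32-bit int. */
static int needs_int64(uint64_t value)
{
	return value > (uint64_t)INT32_MAX;
}
```

For example, a value of 0x100000001 keeps only 1 after truncation, so a
stored address or status register would no longer match the real one.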