mce-intel-*: fix a warning when using FIELD(<num>, NULL)
Internally, the FIELD() macro checks the size of an array by
using ARRAY_SIZE. That macro computes a meaningless division
if NULL is used, because NULL is a void pointer, not an array, as warned:
mce-intel-dunnington.c:30:2: note: in expansion of macro ‘FIELD’
FIELD(17, NULL),
^~~~~
ras-mce-handler.h:28:33: warning: division ‘sizeof (void *) / sizeof (void)’ does not compute the number of array elements [-Wsizeof-pointer-div]
#define ARRAY_SIZE(x) (sizeof(x)/sizeof(*(x)))
^
bitfield.h:37:51: note: in expansion of macro ‘ARRAY_SIZE’
#define FIELD(start_bit, name) { start_bit, name, ARRAY_SIZE(name) }
^~~~~~~~~~
While this warning is harmless, it may prevent seeing more serious
warnings. So, add a FIELD_NULL(<num>) macro to avoid that.
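A minimal sketch of how such a macro could look, shown next to the FIELD() definition quoted in the warning. The struct layout, the member names, and the example tables are assumptions made for illustration; only FIELD() and ARRAY_SIZE() are taken from the compiler output above.

```c
#include <stddef.h>

/* Layout assumed from the FIELD() expansion quoted in the warning. */
struct field {
	unsigned start_bit;
	const char **str;
	unsigned stringlen;
};

#define ARRAY_SIZE(x) (sizeof(x) / sizeof(*(x)))

/* For real string tables: the entry count is derived from the array. */
#define FIELD(start_bit, name) { start_bit, name, ARRAY_SIZE(name) }

/* For bits without a table: no sizeof arithmetic is applied to NULL. */
#define FIELD_NULL(start_bit) { start_bit, NULL, 0 }

/* Hypothetical table, used only to exercise both macros. */
static const char *example_names[] = { "no error", "parity error" };

static struct field example_fields[] = {
	FIELD(16, example_names),
	FIELD_NULL(17),
};
```

With this split, FIELD() is only ever expanded on a real array, so -Wsizeof-pointer-div has nothing to report.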
Thomas Tai [Mon, 14 May 2018 14:33:48 +0000 (10:33 -0400)]
rasdaemon: use separate string array for error status
The bit field description for correctable status register
and uncorrectable status register are different. Using a
single aer_errors string array will cause bit[12] to
overlap and thus recording the wrong description.
Using a separate variable to switch between correctable
and uncorrectable error is needed.
Signed-off-by: Thomas Tai <thomas.tai@oracle.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
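The idea can be sketched as two tables indexed by status bit, selected by severity. The array names, the helper, and the handful of entries listed are illustrative assumptions, not the patch itself; the bit names follow the PCIe AER status register definitions.

```c
/* Correctable error status: only a few bits are listed here. */
static const char *aer_cor_errors[32] = {
	[0]  = "Receiver Error",
	[6]  = "Bad TLP",
	[7]  = "Bad DLLP",
	[12] = "Replay Timer Timeout",
};

/* Uncorrectable error status: bit 12 means something else here. */
static const char *aer_uncor_errors[32] = {
	[4]  = "Data Link Protocol",
	[12] = "Poisoned TLP",
	[14] = "Completion Timeout",
	[20] = "Unsupported Request",
};

/* Hypothetical helper: pick the table by severity, then look up the bit. */
static const char *aer_bit_name(int correctable, unsigned bit)
{
	const char **table = correctable ? aer_cor_errors : aer_uncor_errors;

	if (bit >= 32 || !table[bit])
		return "Unknown";
	return table[bit];
}
```

Keeping the tables separate means bit 12 resolves to a different description depending on which status register the event came from.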
Thomas Tai [Mon, 14 May 2018 14:33:47 +0000 (10:33 -0400)]
rasdaemon: fix PCIe AER error type
The error types between PCIe AER and CPU Machine Check are
different. When handling aer_event, the PCIe AER error
type should be used. Add an enum to match the kernel
PCIe AER and use it to decode the error type.
Signed-off-by: Thomas Tai <thomas.tai@oracle.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Greg Edwards [Wed, 28 Mar 2018 22:10:46 +0000 (16:10 -0600)]
rasdaemon: Add Skylake Xeon MSCOD values
Based on mcelog commits e4aca6312aee ("Add support to decode MSCOD
values for Skylake server") and 34f03e306c36 ("mcelog: Change name of
skylake interconnect from QPI to UPI").
Aristeu Rozanski [Fri, 2 Feb 2018 15:20:48 +0000 (10:20 -0500)]
rasdaemon: ARM: fully initialize ras_arm_event
Issue found by covscan:
1. rasdaemon-0.4.1/ras-arm-handler.c:32: var_decl: Declaring variable "ev" without initializer.
16. rasdaemon-0.4.1/ras-arm-handler.c:81: uninit_use_in_call: Using uninitialized value "ev.error_count" when calling "ras_store_arm_record".
23. rasdaemon-0.4.1/ras-record.c:243:2: read_parm_fld: Reading a parameter field.
mce-intel-p4-p6: prevent build errors with -Werror=format-security
On Fedora, -Werror=format-security is now used on packages, which
causes the following build error:
mce-intel-p4-p6.c: In function 'p4_decode_model':
mce-intel-p4-p6.c:130:4: error: format not a string literal and no format arguments [-Werror=format-security]
mce_snprintf(e->error_msg, p4_model[i].str);
^~~~~~~~~~~~
cc1: some warnings being treated as errors
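The usual way to satisfy -Werror=format-security is to pass the string as data behind a literal "%s" format. The sketch below uses plain snprintf() and a hypothetical helper name instead of mce_snprintf(), whose definition is not shown here.

```c
#include <stdio.h>

/* Hypothetical helper standing in for the decoder's message formatting. */
static int format_model_msg(char *buf, size_t len, const char *model_str)
{
	/*
	 * Rejected by -Werror=format-security, because model_str would be
	 * parsed as a format string and any '%' in it would be interpreted:
	 *
	 *     return snprintf(buf, len, model_str);
	 */

	/* Accepted: the format is a literal and model_str is plain data. */
	return snprintf(buf, len, "%s", model_str);
}
```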
configure.ac: show if Hisilicon error report are enabled
As changeset b856c89a11d7 ("rasdaemon:add support for
Hisilicon non-standard error decoder") added a new
configurable error report, show if it is enabled or not.
shiju.jose@huawei.com [Wed, 4 Oct 2017 09:11:21 +0000 (10:11 +0100)]
rasdaemon:add support for Hisilicon non-standard error decoder
1. This patch adds support to decode the non-standard
error information for Hisilicon HIP07 SAS HW module.
2. Add stub decoder for Hisilicon HIP07 HNS HW module.
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
Seiichi Ikarashi [Wed, 17 Jun 2015 10:56:57 +0000 (07:56 -0300)]
rasdaemon: add internal errors of IA32_MC4_STATUS for Haswell
Currently rasdaemon appears to purposely omit internal errors of
IA32_MC4_STATUS for Haswell-family processors, which are described in
Intel SDM vol3 Table 16-20. I think it's better to show these errors
because mcelog does show them.
Seiichi Ikarashi [Fri, 12 Jun 2015 09:35:37 +0000 (06:35 -0300)]
rasdaemon: use MCA error msg as error_msg
In the case of machine-checks which do not have a model-specific MCA
error code but have an architectural code only, mce_event.error_msg
becomes empty, so you cannot tell what happened.
Aristeu Rozanski [Mon, 1 Jun 2015 20:04:00 +0000 (17:04 -0300)]
rasdaemon: add support to match the machine by system's product name
In some cases the motherboard names will change but the mapping won't
across a line of products. This patch adds support for "Product:" to be
specified in the label files instead of Model:.
An example:
Vendor: Dell Inc.
Product: PowerEdge R610
DIMM_A1: 0.0.0; DIMM_A2: 0.0.1; DIMM_A3: 0.0.2;
DIMM_A4: 0.1.0; DIMM_A5: 0.1.1; DIMM_A6: 0.1.2;
Seiichi Ikarashi [Tue, 26 May 2015 14:59:39 +0000 (11:59 -0300)]
rasdaemon: make sure the error is valid before handling ranks
Fix "rank" handling according to the Bit 63 description in Intel SDM Vol.3C
Table 16-23, that says "... Use this information only after there is valid
first error info indicated by bit 62".
Also fix invalid comparisons of unsigned variables "rank0" and "rank1".
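A sketch of both fixes, under stated assumptions: the rank field positions below are placeholders chosen for illustration and are not taken from the SDM table, and the function names are hypothetical. Only the roles of bit 62 and bit 63 come from the text above. Returning a signed value makes the "not valid" case testable, which a comparison on an unsigned variable cannot do.

```c
#include <stdint.h>

/* Extract bits [lo, hi] (inclusive) of a 64-bit value. */
#define EXTRACT(v, lo, hi) (((v) >> (lo)) & ((1ULL << ((hi) - (lo) + 1)) - 1))

/* Placeholder field positions, for illustration only. */
#define RANK0_LO 46
#define RANK0_HI 50
#define RANK1_LO 51
#define RANK1_HI 55

/* Returns the first-error rank, or -1 when bit 62 is clear. */
static int first_error_rank(uint64_t misc)
{
	if (!EXTRACT(misc, 62, 62))
		return -1;
	return (int)EXTRACT(misc, RANK0_LO, RANK0_HI);
}

/* Bit 63 is only meaningful once bit 62 reports valid first-error info. */
static int second_error_rank(uint64_t misc)
{
	if (!EXTRACT(misc, 62, 62) || !EXTRACT(misc, 63, 63))
		return -1;
	return (int)EXTRACT(misc, RANK1_LO, RANK1_HI);
}
```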
Seiichi Ikarashi [Tue, 26 May 2015 14:59:37 +0000 (11:59 -0300)]
rasdaemon: add missing semicolon in hsw_decode_model()
hsw_decode_model() tries to skip decode_bitfield() if IA32_MC4_STATUS indicates
some internal errors. Unfortunately, the code behaves opposite to the intention
because a semicolon is missing.
Uniquely identify Ivy Bridge even though the machine checks are the same
for Sandy Bridge and Ivy Bridge. This makes the output for the processor
display "Ivy Bridge".
The September 2013 edition of the Software Developer's Manual added an
entry that had been inadvertently omitted from earlier editions.
Add the 0x80 entry for "Corrected memory read error".
Edition 050 of the Intel SDM released in late February 2014
includes a new simple error code in "Table 15-8. IA32_MCi_Status
[15:0] Simple Error Code Encoding". Code 6 (0000 0000 0000 0110)
has been allocated for the reporting of cases where the BIOS SMM
code attempts to execute code outside of the protected SMRR area.
Aristeu Rozanski [Fri, 15 Aug 2014 17:50:58 +0000 (13:50 -0400)]
rasdaemon: do not assume dimmX/ directories will be present
While finding the labels, size and location, ras-mc-ctl will search /sys for
the files and calculate the location. When it uses the location trying to map
back to files to print labels or write labels, it'll just assume dimm*
directories exist which is not correct while using drivers like amd64_edac.
This patch adds two new hashes to store the location and the label file path
so it can be used later.
Luck, Tony [Mon, 4 Aug 2014 20:29:01 +0000 (13:29 -0700)]
rasdaemon: Add support for extlog trace events
Linux kernel 3.17 includes a new trace event to pick up extended
error logs produced by BIOS in the Common Platform Error Record
format described in appendix N of the UEFI standard. This patch
adds support to collect that information and log it both in
readable ASCII and into the sqlite3 database that rasdaemon
uses to store all error information. In addition ras-mc-ctl
is updated to query that database for both detailed and summary
reports.
Big thanks to Aristeu for pretty much all the sqlite3 pieces,
plus testing and fixing miscellaneous issues elsewhere.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Aristeu Rozanski [Tue, 24 Jun 2014 15:01:31 +0000 (11:01 -0400)]
rasdaemon: handle failures of snprintf()
Florian Weimer found that in bitfield_msg() the return value of
snprintf() is used to calculate length ignoring that it can return a
negative number. This patch makes bitfield_msg() stop writing in that
case.
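The pattern can be sketched as an append helper that advances the write position only after checking the return value. The helper name and signature are hypothetical; this is not the body of bitfield_msg().

```c
#include <stdio.h>

/*
 * Hypothetical helper: append src to buf at offset *pos.
 * Returns 0 on success, -1 if the caller should stop writing.
 */
static int append_str(char *buf, size_t size, size_t *pos, const char *src)
{
	int n;

	if (*pos >= size)
		return -1;

	n = snprintf(buf + *pos, size - *pos, "%s", src);
	if (n < 0)
		return -1;	/* output error: do not advance */

	if ((size_t)n >= size - *pos) {
		*pos = size - 1;	/* truncated: the buffer is now full */
		return -1;
	}

	*pos += (size_t)n;
	return 0;
}
```

Checking for a negative result before adding it to the offset keeps the position from moving backwards, and the truncation check keeps it from moving past the end of the buffer.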
Luck, Tony [Mon, 7 Apr 2014 18:27:47 +0000 (11:27 -0700)]
rasdaemon: sqlite truncates some MCE fields to 32-bit
The sqlite3_bind_int() function takes an "int" as the argument value to
save to the database. But some fields are wider than 32 bits. Use
sqlite3_bind_int64() for the fields where we know values can exceed
4G.
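The effect can be modeled without SQLite at all. In the sketch below, two hypothetical helpers stand in for a 32-bit binding (as with sqlite3_bind_int()) and a 64-bit binding (as with sqlite3_bind_int64()); they only illustrate what happens to the value and do not call the library.

```c
#include <stdint.h>

/* Models a 32-bit binding: only the low 32 bits of the field survive. */
static uint32_t store_as_32bit(uint64_t field)
{
	return (uint32_t)field;
}

/* Models a 64-bit binding: the field is kept intact. */
static uint64_t store_as_64bit(uint64_t field)
{
	return field;
}
```

For example, an address of 0x100000010 passed through the 32-bit path comes back as 0x10, while the 64-bit path preserves it.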
Jakub Filak [Wed, 2 Apr 2014 13:03:44 +0000 (15:03 +0200)]
Correct ABRT report data
Remove the '\0' byte from the 'PUT' message because it was superfluous.
Replace the 'BASENAME' item with the 'TYPE' item because the former is no
longer supported by abrtd and the latter is required. Basically, the
latter is a substitute for the former.
Remove the closing message, which is not supported by abrtd. abrtd
considers that message part of the problem report.
Remove a superfluous space from 'Backtrace'.
Signed-off-by: Jakub Filak <jfilak@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Betty Dall [Wed, 19 Mar 2014 21:54:56 +0000 (15:54 -0600)]
ras-mc-ctl: Print useful message when run without rasdaemon -r
The utility script ras-mc-ctl requires that rasdaemon --record be run
to create the mc_event table in the SQLite database. The current behaviour
is this:
[root@sa1 util]# ras-mc-ctl --errors
DBD::SQLite::db prepare failed: no such table: mc_event at
/usr/local/sbin/ras-mc-ctl line 914.
Can't call method "execute" on an undefined value at
/usr/local/sbin/ras-mc-ctl line 915.
With this change, the user sees:
[root@sa1 util]# ras-mc-ctl --errors
DBD::SQLite::db prepare failed: no such table: mc_event at
/usr/local/sbin/ras-mc-ctl line 914.
ras-mc-ctl: Error: mc_event table missing from
/usr/local/var/lib/rasdaemon/ras-mc_event.db. Run 'rasdaemon --record'.
Signed-off-by: Betty Dall <betty.dall@hp.com>
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Betty Dall [Wed, 19 Mar 2014 20:59:46 +0000 (14:59 -0600)]
rasdaemon: Make record option dependent on HAVE_SQLITE3
The record option in parse_opt() can be made a compile-time option
guarded by HAVE_SQLITE3, since that macro is already used in the
corresponding argp_option structure.
Signed-off-by: Betty Dall <betty.dall@hp.com>
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
mce-amd-k8.c: In function ‘bank_name’:
mce-amd-k8.c:250:22: warning: argument to ‘sizeof’ in ‘snprintf’ call is the same expression as the destination; did you mean to provide an explicit length? [-Wsizeof-pointer-memaccess]
snprintf(buf, sizeof(buf), "%s (bank=%d)", s, e->bank);
^
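The warning fires because buf is a pointer there, so sizeof(buf) yields the size of the pointer instead of the size of the buffer it points to. One common way to avoid it is to pass the length explicitly, as in this sketch. The helper name and signature are hypothetical and this is not necessarily how bank_name() was changed.

```c
#include <stdio.h>

/*
 * Hypothetical variant of a bank-name formatter: the caller supplies the
 * real buffer size, so no sizeof is taken on a pointer.
 */
static const char *format_bank(char *buf, size_t len, const char *s, int bank)
{
	snprintf(buf, len, "%s (bank=%d)", s, bank);
	return buf;
}
```

A caller that owns the array can then write format_bank(name, sizeof(name), s, bank), where sizeof is applied to the array itself.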