www.infradead.org Git - users/mchehab/rasdaemon.git/log
5 years agoBump to version 0.6.3 v0.6.3
Mauro Carvalho Chehab [Fri, 23 Aug 2019 11:01:39 +0000 (08:01 -0300)]
Bump to version 0.6.3

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
5 years agoconfigure.ac: add an option to enable all features
Mauro Carvalho Chehab [Fri, 23 Aug 2019 11:26:24 +0000 (08:26 -0300)]
configure.ac: add an option to enable all features

At least for build testing, an option to enable everything
can be handy.

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
5 years agoAdd newline to summary to match errors output
Geoff Winterbourne [Thu, 25 Jul 2019 20:13:50 +0000 (14:13 -0600)]
Add newline to summary to match errors output

5 years agoSwitch to kernel filters for block_rq_complete
Cong Wang [Thu, 13 Jun 2019 18:51:39 +0000 (11:51 -0700)]
Switch to kernel filters for block_rq_complete

Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
5 years agoAdd disk I/O error monitoring
Cong Wang [Wed, 12 Jun 2019 20:24:49 +0000 (13:24 -0700)]
Add disk I/O error monitoring

Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
5 years agoMake event filter type specific
Cong Wang [Wed, 12 Jun 2019 22:06:37 +0000 (15:06 -0700)]
Make event filter type specific

struct ras_events passed via context pointer is not per event,
therefore the per event filter must be specific to each event.

Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
5 years agoras-mce-handler: Add support for Hygon Dhyana family 18h processor
Pu Wen [Thu, 23 May 2019 13:00:22 +0000 (21:00 +0800)]
ras-mce-handler: Add support for Hygon Dhyana family 18h processor

The Hygon Dhyana family 18h processor is derived from AMD family 17h.
Hygon Dhyana support has already been accepted into upstream Linux[1].

Add Hygon Dhyana support to mce handler of rasdaemon in order to handle
MCE events on Hygon Dhyana platforms.

Reference:
[1] https://git.kernel.org/tip/fec98069fb72fb656304a3e52265e0c2fc9adf87

Signed-off-by: Pu Wen <puwen@hygon.cn>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
5 years agoFix Perl warnings in ras-mc-ctl
Cong Wang [Thu, 13 Jun 2019 05:26:20 +0000 (22:26 -0700)]
Fix Perl warnings in ras-mc-ctl

Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
5 years agorasdaemon:add logging HiSilicon HIP08 PCIe local errors
Shiju Jose [Mon, 17 Jun 2019 14:28:52 +0000 (15:28 +0100)]
rasdaemon:add logging HiSilicon HIP08 PCIe local errors

This patch adds logging for the HiSilicon HIP08 PCIe local errors.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
5 years agorasdaemon:add logging HiSilicon HIP08 H/W errors reported in the OEM format2
Shiju Jose [Mon, 17 Jun 2019 14:28:51 +0000 (15:28 +0100)]
rasdaemon:add logging HiSilicon HIP08 H/W errors reported in the OEM format2

This patch adds logging of the HiSilicon HIP08 H/W errors reported
in the non-standard OEM format2.
These errors are from the H/W modules SMMU, HHA, HLLC, PA and DDRC.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
5 years agorasdaemon:add logging HiSilicon HIP08 H/W errors reported in the OEM format1
Shiju Jose [Mon, 17 Jun 2019 14:28:50 +0000 (15:28 +0100)]
rasdaemon:add logging HiSilicon HIP08 H/W errors reported in the OEM format1

This patch adds logging of the HiSilicon HIP08 H/W errors reported
in the non-standard OEM format1.
These errors are from the H/W modules MN, PLL, SLLC, AA, SIOE,
POE, DISP, LPC, SAS and SATA.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
5 years agorasdaemon: update iteration logic for the non-standard error decoding functions
Shiju Jose [Mon, 17 Jun 2019 14:28:49 +0000 (15:28 +0100)]
rasdaemon: update iteration logic for the non-standard error decoding functions

This patch updates the iteration logic for the non-standard
error decoding functions.

Suggested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
5 years agorasdaemon: rearrange HiSilicon HIP07 decoding function table
Shiju Jose [Mon, 17 Jun 2019 14:28:48 +0000 (15:28 +0100)]
rasdaemon: rearrange HiSilicon HIP07 decoding function table

This patch rearranges the decoding function table for the
HiSilicon HIP07 non-standard errors.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
5 years agorasdaemon:print non-standard error data if not decoded
Shiju Jose [Mon, 17 Jun 2019 14:28:47 +0000 (15:28 +0100)]
rasdaemon:print non-standard error data if not decoded

This patch changes the code to print non-standard error data
only if it was not decoded.

Suggested-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
5 years agoras-mce-handler: fix mcgstatus message print
Mauro Carvalho Chehab [Tue, 11 Jun 2019 18:01:38 +0000 (15:01 -0300)]
ras-mce-handler: fix mcgstatus message print

As warned by clang, the test there is wrong:

ras-mce-handler.c:344:9: warning: address of array 'e->mcgstatus_msg' will always evaluate to 'true' [-Wpointer-bool-conversion]
        if (e->mcgstatus_msg)
        ~~  ~~~^~~~~~~~~~~~~

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
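
The clang warning quoted in the entry above fires because an array embedded in
a struct always has a non-NULL address. The log does not show the actual fix;
a minimal sketch of one common way to write such a test, using a hypothetical
cut-down struct (mce_event_sketch and has_mcgstatus_msg are illustrative names,
not taken from rasdaemon), could look like this:

```c
#include <stdbool.h>

/* Hypothetical, cut-down stand-in for the event structure: the message
 * is an embedded array, not a pointer. */
struct mce_event_sketch {
	char mcgstatus_msg[128];
};

/* "if (e->mcgstatus_msg)" is always true for an embedded array.
 * Testing the first byte checks for an empty string instead. */
static bool has_mcgstatus_msg(const struct mce_event_sketch *e)
{
	return e->mcgstatus_msg[0] != '\0';
}
```

With this form, an event whose message was never filled in is reported as
having no message, which is what the original test intended.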
5 years agoTravis: enable all possible features
Mauro Carvalho Chehab [Tue, 11 Jun 2019 17:58:23 +0000 (14:58 -0300)]
Travis: enable all possible features

Several of those are arm-specific, but, as the goal here is just
to compile-test, enable them all.

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
5 years agoras-events: fix a warning when built without devlink
Mauro Carvalho Chehab [Tue, 11 Jun 2019 17:56:08 +0000 (14:56 -0300)]
ras-events: fix a warning when built without devlink

ras-events.c:667:8: warning: unused variable ‘filter_str’ [-Wunused-variable]
  667 |  char *filter_str = NULL;
      |        ^~~~~~~~~~

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
5 years agofix rasdaemon high CPU usage when part of CPUs offline
Ying Lv [Wed, 15 May 2019 03:15:42 +0000 (11:15 +0800)]
fix rasdaemon high CPU usage when part of CPUs offline

When some CPU cores are offline, for example because the kernel command
line sets maxcpus=N (with N smaller than the total number of CPU cores in
the system), the CPU usage of some rasdaemon threads gets very close to 100%.

This happens because, when some CPUs are offline, the poll-based path in
read_ras_event_all_cpus() falls back to the pthread-based one.
For an offline CPU, reading trace_pipe_raw returns a negative value, which
is converted to a positive one because it is stored in an 'unsigned size'
variable. So the code always takes the 'size > 0' branch, and the CPU usage
gets too high.

Declare the size variable as int, so that the right branch is taken.

Fixes: eff7c9e0("ras-events: Only use pthreads for collect if poll() not available")
Reported-by: Zhipeng Xie <xiezhipeng1@huawei.com>
Signed-off-by: Ying Lv <lvying6@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
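
The signedness problem described in the entry above can be reproduced in
isolation. This is only a sketch: the two helpers below are hypothetical and
are not the rasdaemon code, they just contrast what a 'size > 0' test sees
when a failed read() result is stored in a signed versus an unsigned variable.

```c
#include <sys/types.h>

/* Signed variant: a failed read (-1) stays negative, so "size > 0"
 * is false and the caller can handle the error. */
static int has_data_signed(ssize_t ret)
{
	int size = (int)ret;

	return size > 0;
}

/* Buggy variant: -1 stored in an unsigned variable wraps around to a
 * large positive value, so "size > 0" is true even on error. */
static int has_data_unsigned(ssize_t ret)
{
	unsigned size = (unsigned)ret;

	return size > 0;
}
```

A thread that loops on the unsigned variant never sees the error and keeps
retrying immediately, which matches the high CPU usage reported.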
5 years agoMerge branch 'congwang-devlink'
Mauro Carvalho Chehab [Tue, 11 Jun 2019 17:53:15 +0000 (14:53 -0300)]
Merge branch 'congwang-devlink'

* congwang-devlink:
  TravisCI: add support for devlink build
  libtrace: Fix get_field_str() for dynamic strings
  Add devlink filter and net_dev_xmit_timeout
  Add devlink events

5 years agoTravisCI: add support for devlink build
Mauro Carvalho Chehab [Tue, 11 Jun 2019 17:44:22 +0000 (14:44 -0300)]
TravisCI: add support for devlink build

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
5 years agoMerge branch 'devlink' of https://github.com/congwang/rasdaemon into congwang-devlink
Mauro Carvalho Chehab [Tue, 11 Jun 2019 17:42:17 +0000 (14:42 -0300)]
Merge branch 'devlink' of https://github.com/congwang/rasdaemon into congwang-devlink

* 'devlink' of https://github.com/congwang/rasdaemon:
  libtrace: Fix get_field_str() for dynamic strings
  Add devlink filter and net_dev_xmit_timeout
  Add devlink events

5 years agoAdd support for Travis CI builds
Mauro Carvalho Chehab [Tue, 11 Jun 2019 17:32:02 +0000 (14:32 -0300)]
Add support for Travis CI builds

Let it be built with Travis CI when merged at github.

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
5 years agolibtrace: Fix get_field_str() for dynamic strings
Cong Wang [Sun, 2 Jun 2019 03:52:19 +0000 (20:52 -0700)]
libtrace: Fix get_field_str() for dynamic strings

This cherry-picks the libtraceevent commit d777f8de99b0
("tools lib traceevent: Fix get_field_str() for dynamic strings")
from Linux kernel git repo.

Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
5 years agoAdd devlink filter and net_dev_xmit_timeout
Cong Wang [Sun, 2 Jun 2019 00:16:54 +0000 (17:16 -0700)]
Add devlink filter and net_dev_xmit_timeout

Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
5 years agoAdd devlink events
Cong Wang [Thu, 25 Apr 2019 20:21:19 +0000 (13:21 -0700)]
Add devlink events

Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
6 years agoMerge branch 'cnamburu-naples-support'
Mauro Carvalho Chehab [Fri, 26 Apr 2019 12:30:55 +0000 (09:30 -0300)]
Merge branch 'cnamburu-naples-support'

* cnamburu-naples-support:
  rasdaemon: add support for AMD Scalable MCA

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
6 years agorasdaemon: add support for AMD Scalable MCA
Chandu-babu Namburu [Wed, 30 Jan 2019 15:06:45 +0000 (20:36 +0530)]
rasdaemon: add support for AMD Scalable MCA

Add logic here to decode errors from all known IP blocks for
AMD Scalable MCA supported processors

Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Chandu-babu Namburu <chandu@amd.com>
6 years agoBump to version 0.6.2 v0.6.2
Mauro Carvalho Chehab [Tue, 14 Aug 2018 17:06:06 +0000 (14:06 -0300)]
Bump to version 0.6.2

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
6 years agoINSTALL: update it from auto-generated data
Mauro Carvalho Chehab [Tue, 14 Aug 2018 16:57:44 +0000 (13:57 -0300)]
INSTALL: update it from auto-generated data

The new autotools version made some changes to the INSTALL file.
Update it to match them.

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
6 years agoChangeLog: Reorder to place new stuff at the beginning
Mauro Carvalho Chehab [Tue, 14 Aug 2018 16:56:45 +0000 (13:56 -0300)]
ChangeLog: Reorder to place new stuff at the beginning

It is easier to read a changelog from the newest to the oldest
entry.

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
6 years agorasdaemon: ras-mc-ctl: add option to show error counts
Aristeu Rozanski [Wed, 1 Aug 2018 20:29:58 +0000 (16:29 -0400)]
rasdaemon: ras-mc-ctl: add option to show error counts

In some scenarios it might not be desirable to have a daemon running
to parse and store the errors provided by EDAC, and having only the
number of CEs and UEs is enough. This patch implements this feature
as a ras-mc-ctl option.

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
6 years agomce-amd-k8: be sure to not go past error_msg buffer
Mauro Carvalho Chehab [Tue, 14 Aug 2018 16:13:54 +0000 (13:13 -0300)]
mce-amd-k8: be sure to not go past error_msg buffer

As warned by gcc:

mce-amd-k8.c: In function ‘decode_k8_generic_errcode’:
mce-amd-k8.c:136:30: warning: ‘) ’ directive output may be truncated writing 2 bytes into a region of size between 0 and 4095 [-Wformat-truncation=]
   mce_snprintf(e->error_msg, "(%s) ", tmp_buf);
                              ^~~~~~~
ras-mce-handler.h:104:29: note: in definition of macro ‘mce_snprintf’
  snprintf(buf + __n, __len, fmt,  ##arg);  \
                             ^~~
ras-mce-handler.h:104:2: note: ‘snprintf’ output between 4 and 4099 bytes into a destination of size 4096
  snprintf(buf + __n, __len, fmt,  ##arg);  \
  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
mce-amd-k8.c:136:3: note: in expansion of macro ‘mce_snprintf’
   mce_snprintf(e->error_msg, "(%s) ", tmp_buf);
   ^~~~~~~~~~~~

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
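
The gcc warning in the entry above concerns appending formatted text to a
fixed-size buffer that may already be nearly full. As a rough illustration of
a bounded append (append_fmt is a hypothetical helper written for this sketch,
not the mce_snprintf macro and not the fix that was applied):

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: appends formatted text to the NUL-terminated
 * string in buf, whose total capacity is cap bytes, and never writes
 * past the end. Returns 0 if everything fit, -1 if the output was
 * truncated or an error occurred. */
static int append_fmt(char *buf, size_t cap, const char *fmt, ...)
{
	size_t used = strlen(buf);
	va_list ap;
	int n;

	if (used >= cap)
		return -1;	/* no room left, not even for the NUL */

	va_start(ap, fmt);
	n = vsnprintf(buf + used, cap - used, fmt, ap);
	va_end(ap);

	return (n < 0 || (size_t)n >= cap - used) ? -1 : 0;
}
```

Computing the remaining space from the current string length on every call,
and reporting truncation to the caller, keeps the write inside the buffer
regardless of how much was already stored.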
6 years agoras-report: avoid copying after addr.sun_path
Mauro Carvalho Chehab [Tue, 14 Aug 2018 16:10:10 +0000 (13:10 -0300)]
ras-report: avoid copying after addr.sun_path

As warned by gcc:

ras-report.c: In function ‘setup_report_socket’:
ras-report.c:36:2: warning: ‘strncpy’ output truncated before terminating nul copying 25 bytes from a string of the same length [-Wstringop-truncation]
  strncpy(addr.sun_path, ABRT_SOCKET, strlen(ABRT_SOCKET));
  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The strncpy logic there is wrong. Fix it and be sure to have a NUL
terminated string filled at addr.sun_path.

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
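
Calling strncpy() with strlen() of the source as the bound, as in the warning
above, copies the characters but never the terminating NUL. A minimal sketch
of a copy that always leaves addr.sun_path terminated follows; fill_unix_addr
is a hypothetical helper, and the log does not show how the actual fix was
written.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

/* Hypothetical helper: fills a sockaddr_un with the given path, always
 * leaving sun_path NUL-terminated. Returns 0 on success, or -1 if the
 * path does not fit. */
static int fill_unix_addr(struct sockaddr_un *addr, const char *path)
{
	size_t len = strlen(path);

	if (len >= sizeof(addr->sun_path))
		return -1;

	memset(addr, 0, sizeof(*addr));
	addr->sun_family = AF_UNIX;
	memcpy(addr->sun_path, path, len + 1);	/* the copy includes the NUL */
	return 0;
}
```

The bound is derived from the destination size rather than from the source
length, so an over-long path is rejected instead of silently truncated.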
6 years agomce-intel-*: fix a warning when using FIELD(<num>, NULL)
Mauro Carvalho Chehab [Tue, 14 Aug 2018 16:06:27 +0000 (13:06 -0300)]
mce-intel-*: fix a warning when using FIELD(<num>, NULL)

Internally, the FIELD() macro computes the size of an array by
using ARRAY_SIZE. Well, this macro ends up dividing by sizeof(void)
if NULL is used, as NULL is a pointer to void, as warned:

mce-intel-dunnington.c:30:2: note: in expansion of macro ‘FIELD’
  FIELD(17, NULL),
  ^~~~~
ras-mce-handler.h:28:33: warning: division ‘sizeof (void *) / sizeof (void)’ does not compute the number of array elements [-Wsizeof-pointer-div]
 #define ARRAY_SIZE(x) (sizeof(x)/sizeof(*(x)))
                                 ^
bitfield.h:37:51: note: in expansion of macro ‘ARRAY_SIZE’
 #define FIELD(start_bit, name) { start_bit, name, ARRAY_SIZE(name) }
                                                   ^~~~~~~~~~

While this warning is harmless, it may prevent seeing more serious
warnings. So, add a FIELD_NULL(<num>) macro to avoid that.

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
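
To illustrate the idea behind FIELD_NULL() from the entry above: the sketch
below reuses the ARRAY_SIZE and FIELD definitions quoted in the warning, while
the struct layout, the body of FIELD_NULL and the example table are
assumptions made for illustration only.

```c
#include <stddef.h>

#define ARRAY_SIZE(x) (sizeof(x) / sizeof(*(x)))

/* Cut-down stand-in for the bit-field descriptor; the member names
 * are assumptions. */
struct field_sketch {
	unsigned start_bit;
	const char **str;
	unsigned stringlen;
};

/* For real string tables: the element count is computed at compile time. */
#define FIELD(start_bit, name) { start_bit, name, ARRAY_SIZE(name) }
/* For unnamed bit ranges: ARRAY_SIZE(NULL) is never expanded, so there
 * is no sizeof(void) division to warn about. */
#define FIELD_NULL(start_bit) { start_bit, NULL, 0 }

static const char *example_names[] = { "first", "second", "third" };

static const struct field_sketch example_fields[] = {
	FIELD(0, example_names),
	FIELD_NULL(17),
};
```

Entries that have no string table use the second macro and get an explicit
length of zero.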
6 years agorasdaemon: use separate string array for error status
Thomas Tai [Mon, 14 May 2018 14:33:48 +0000 (10:33 -0400)]
rasdaemon: use separate string array for error status

The bit field descriptions for the correctable status register
and the uncorrectable status register are different. Using a
single aer_errors string array causes bit[12] to
overlap, and thus the wrong description gets recorded.
Use separate string arrays, and a variable to switch between
the correctable and the uncorrectable one.

Signed-off-by: Thomas Tai <thomas.tai@oracle.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
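
The bit[12] overlap mentioned above can be seen with two small lookup tables.
This is a sketch only: the bit names follow the PCIe AER status register
layouts, but the tables are abbreviated, and the identifiers (aer_cor_names,
aer_uncor_names, aer_bit_name) are illustrative rather than the ones used in
rasdaemon.

```c
#include <stddef.h>

/* Abbreviated table for the correctable error status register. */
static const char *aer_cor_names[32] = {
	[0]  = "Receiver Error",
	[6]  = "Bad TLP",
	[7]  = "Bad DLLP",
	[12] = "Replay Timer Timeout",
	[13] = "Advisory Non-Fatal",
};

/* Abbreviated table for the uncorrectable error status register. */
static const char *aer_uncor_names[32] = {
	[4]  = "Data Link Protocol",
	[12] = "Poisoned TLP",
	[13] = "Flow Control Protocol",
	[14] = "Completion Timeout",
};

/* Selects the table by error class before looking up the bit. */
static const char *aer_bit_name(int correctable, unsigned bit)
{
	const char **names = correctable ? aer_cor_names : aer_uncor_names;

	if (bit >= 32 || !names[bit])
		return "Unknown";
	return names[bit];
}
```

Bit 12 decodes to a different description in each table, which is why a
single shared array cannot describe both registers.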
6 years agorasdaemon: fix PCIe AER error type
Thomas Tai [Mon, 14 May 2018 14:33:47 +0000 (10:33 -0400)]
rasdaemon: fix PCIe AER error type

The error types of PCIe AER and CPU Machine Check are
different. When handling aer_event, the PCIe AER error
type should be used. Add an enum that matches the kernel
PCIe AER error types and use it to decode the error type.

Signed-off-by: Thomas Tai <thomas.tai@oracle.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
7 years agoBump version to 0.6.1 v0.6.1
Mauro Carvalho Chehab [Wed, 25 Apr 2018 10:33:39 +0000 (07:33 -0300)]
Bump version to 0.6.1

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
7 years agorasdaemon: Update DIMM labels for 2-socket servers
Shubhrata Priya [Tue, 17 Apr 2018 20:17:58 +0000 (20:17 +0000)]
rasdaemon: Update DIMM labels for 2-socket servers

Update labels for some 2-socket DellEMC servers.

Signed-off-by: Charles Rose <charles.rose@dell.com>
Signed-off-by: Shubhrata Priya <shubhrata.priyadarsh@dell.com>
Tested-by: Shubhrata Priya <shubhrata.priyadarsh@dell.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
7 years agorasdaemon: Add Skylake Xeon MSCOD values
Greg Edwards [Wed, 28 Mar 2018 22:10:46 +0000 (16:10 -0600)]
rasdaemon: Add Skylake Xeon MSCOD values

Based on mcelog commits e4aca6312aee ("Add support to decode MSCOD
values for Skylake server") and 34f03e306c36 ("mcelog: Change name of
skylake interconnect from QPI to UPI").

Signed-off-by: Greg Edwards <gedwards@ddn.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
7 years agorasdaemon: ARM: fully initialize ras_arm_event
Aristeu Rozanski [Fri, 2 Feb 2018 15:20:48 +0000 (10:20 -0500)]
rasdaemon: ARM: fully initialize ras_arm_event

Issue found by covscan:

1. rasdaemon-0.4.1/ras-arm-handler.c:32: var_decl: Declaring variable "ev" without initializer.
16. rasdaemon-0.4.1/ras-arm-handler.c:81: uninit_use_in_call: Using uninitialized value "ev.error_count" when calling "ras_store_arm_record".
23. rasdaemon-0.4.1/ras-record.c:243:2: read_parm_fld: Reading a parameter field.

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
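
The covscan report above is about a stack variable whose fields are read
before all of them were assigned. One usual remedy, shown here as a sketch
with a hypothetical cut-down record (arm_event_sketch is not the real
ras_arm_event layout, and the log does not show the actual patch), is to zero
the whole structure before filling it in:

```c
#include <string.h>

/* Hypothetical, cut-down event record; the real structure has more
 * fields. */
struct arm_event_sketch {
	char timestamp[64];
	int error_count;
	long long mpidr;
};

/* Zero the whole record up front, so that fields the handler never
 * assigns (for example on an early-exit path) are not read
 * uninitialized later. */
static void arm_event_init(struct arm_event_sketch *ev)
{
	memset(ev, 0, sizeof(*ev));
}
```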
7 years agoUpdate my email
Mauro Carvalho Chehab [Wed, 25 Apr 2018 10:20:08 +0000 (07:20 -0300)]
Update my email

As I'll stop using mchehab@s-opensource.com (and already
stopped using @osg.samsung some time ago), update all e-mail
occurrences.

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
7 years agomce-intel-p4-p6: prevent build errors with -Werror=format-security v0.6.0
Mauro Carvalho Chehab [Sat, 14 Oct 2017 10:26:30 +0000 (07:26 -0300)]
mce-intel-p4-p6: prevent build errors with -Werror=format-security

On Fedora, -Werror=format-security is now used for packages, which
causes the following build error:

mce-intel-p4-p6.c: In function 'p4_decode_model':
mce-intel-p4-p6.c:130:4: error: format not a string literal and no format arguments [-Werror=format-security]
    mce_snprintf(e->error_msg, p4_model[i].str);
    ^~~~~~~~~~~~
cc1: some warnings being treated as errors

Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
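
The -Werror=format-security error above triggers when a non-literal string is
passed as the format argument. The usual remedy is to pass it as data to a
literal "%s" format. A minimal sketch with plain snprintf() follows; put_msg
is a hypothetical helper, not the mce_snprintf macro.

```c
#include <stdio.h>

/* Hypothetical helper: copies an externally supplied message into buf.
 * Because msg is an argument to a literal "%s" format, any '%'
 * characters inside msg are printed verbatim rather than interpreted. */
static int put_msg(char *buf, size_t cap, const char *msg)
{
	return snprintf(buf, cap, "%s", msg);
}
```

A message that contains conversion specifiers is then copied unchanged
instead of being treated as a format string.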
7 years agoBump to version 0.6.0
Mauro Carvalho Chehab [Sat, 14 Oct 2017 09:25:12 +0000 (06:25 -0300)]
Bump to version 0.6.0

Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
7 years agorasdaemon.spec: add other --enable options
Mauro Carvalho Chehab [Sat, 14 Oct 2017 09:47:53 +0000 (06:47 -0300)]
rasdaemon.spec: add other --enable options

As we use the rasdaemon.spec in order to check if everything
is ok, add the new --enable-foo options there.

Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
7 years agoMakefile: add new rasdaemon headers
Mauro Carvalho Chehab [Sat, 14 Oct 2017 09:42:17 +0000 (06:42 -0300)]
Makefile: add new rasdaemon headers

Those are needed for make "distdir" tarball creation.

Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
7 years agorasdaemon: update bugs report information
Mauro Carvalho Chehab [Sat, 14 Oct 2017 09:26:54 +0000 (06:26 -0300)]
rasdaemon: update bugs report information

I haven't worked at Red Hat since 2013. My e-mail address there
is long gone! Replace it with my kernel.org e-mail, as this
is more permanent.

Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
7 years agoconfigure.ac: display if ARM error report is enabled
Mauro Carvalho Chehab [Sat, 14 Oct 2017 09:12:45 +0000 (06:12 -0300)]
configure.ac: display if ARM error report is enabled

changeset 5662e5376adc ("rasdaemon: add support for ARM events")
added a new ./configure argument. Display if it is enabled or
not.

Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
7 years agorasdaemon: add support for ARM events
Tyler Baicar [Tue, 12 Sep 2017 20:58:25 +0000 (14:58 -0600)]
rasdaemon: add support for ARM events

Add support to handle the ARM kernel trace events
which cover RAS ARM processor errors.

[V4]: fix arm_event_tab usage

Change-Id: Ife99c97042498d5fad4d9b8e873ecfba6a47947d
Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
7 years agoconfigure.ac: show if Hisilicon error report are enabled
Mauro Carvalho Chehab [Sat, 14 Oct 2017 09:07:53 +0000 (06:07 -0300)]
configure.ac: show if Hisilicon error report are enabled

As changeset b856c89a11d7 ("rasdaemon:add support for
Hisilicon non-standard error decoder") added a new
configurable error report, show if it is enabled or not.

Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
7 years agorasdaemon:add support for Hisilicon non-standard error decoder
shiju.jose@huawei.com [Wed, 4 Oct 2017 09:11:21 +0000 (10:11 +0100)]
rasdaemon:add support for Hisilicon non-standard error decoder

1. This patch adds support to decode the non-standard
error information for the Hisilicon HIP07 SAS HW module.
2. Add a stub decoder for the Hisilicon HIP07 HNS HW module.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
7 years agorasdaemon:add support for non-standard error decoder
shiju.jose@huawei.com [Wed, 4 Oct 2017 09:11:08 +0000 (10:11 +0100)]
rasdaemon:add support for non-standard error decoder

This patch adds support to decode the non-standard
error information.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
7 years agorasdaemon: Update DIMM labels for Intel Skylake servers
Charles.Rose@dell.com [Fri, 11 Aug 2017 20:09:10 +0000 (20:09 +0000)]
rasdaemon: Update DIMM labels for Intel Skylake servers

Update labels for Intel Skylake based Dell PowerEdge servers.

Signed-off-by: Charles Rose <charles_rose@dell.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
7 years agoconfigure.ac: print if CPER non-standard logs are enabled
Mauro Carvalho Chehab [Fri, 11 Aug 2017 20:48:05 +0000 (17:48 -0300)]
configure.ac: print if CPER non-standard logs are enabled

Now that we have a parser for CPER non-standard errors,
display if such option is enabled.

Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
7 years agorasdaemon: add support for non standard CPER section events
Tyler Baicar [Mon, 12 Jun 2017 22:16:04 +0000 (16:16 -0600)]
rasdaemon: add support for non standard CPER section events

Add support to handle the non-standard CPER section kernel trace
events, which cover RAS errors whose section type is unknown.

Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
7 years agoBump to version 0.5.9 v0.5.9
Mauro Carvalho Chehab [Thu, 8 Jun 2017 09:29:49 +0000 (06:29 -0300)]
Bump to version 0.5.9

Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
7 years agorasdaemon.spec.in: update it to reflect current needs
Mauro Carvalho Chehab [Thu, 8 Jun 2017 09:46:11 +0000 (06:46 -0300)]
rasdaemon.spec.in: update it to reflect current needs

Keep it more or less in sync with the Fedora version of it,
in order to allow it to be built with the new-ver.sh script.

Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
7 years agorasdaemon: add Knights Mill model
Aristeu Rozanski [Thu, 4 May 2017 18:02:53 +0000 (14:02 -0400)]
rasdaemon: add Knights Mill model

Knights Mill is similar to Knights Landing and can use the same code.

Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
7 years agorasdaemon: Update DIMM labels for Dell Servers
Charles.Rose@dell.com [Tue, 6 Jun 2017 21:42:21 +0000 (21:42 +0000)]
rasdaemon: Update DIMM labels for Dell Servers

Updated to include Dell PowerEdge Servers that are current.
Note the use of Product field instead of Model. Tested on
multiple Dell PowerEdge servers.

Signed-off-by: Charles Rose <charles_rose@dell.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
7 years agoconfigure.ac: report enabled features
Mauro Carvalho Chehab [Thu, 8 Jun 2017 09:05:48 +0000 (06:05 -0300)]
configure.ac: report enabled features

We're starting to have too many optional features. Report
what options are enabled at the end of ./configure output.

Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
8 years agoUpdate it to point to the new repository
Mauro Carvalho Chehab [Tue, 14 Mar 2017 12:32:12 +0000 (09:32 -0300)]
Update it to point to the new repository

Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
9 years agoBump version to 0.5.8 v0.5.8
Mauro Carvalho Chehab [Fri, 15 Apr 2016 10:07:11 +0000 (07:07 -0300)]
Bump version to 0.5.8

Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
9 years agoAdd Broadwell EP/EX MSCOD values
Aristeu Rozanski [Fri, 8 Apr 2016 19:07:19 +0000 (15:07 -0400)]
Add Broadwell EP/EX MSCOD values

Based on mcelog commit id 32252e9c37e97ea5083d90d2cf194bb85a4a0cda.

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
9 years agoAdd Broadwell DE MSCOD values
Aristeu Rozanski [Fri, 8 Apr 2016 19:07:18 +0000 (15:07 -0400)]
Add Broadwell DE MSCOD values

Based on mcelog commit id 32252e9c37e97ea5083d90d2cf194bb85a4a0cda.

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
9 years agoBump version to 0.5.7 v0.5.7
Mauro Carvalho Chehab [Fri, 5 Feb 2016 17:24:42 +0000 (15:24 -0200)]
Bump version to 0.5.7

Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
9 years agomce-intel-knl: Fix CodingStyle
Mauro Carvalho Chehab [Fri, 5 Feb 2016 17:15:18 +0000 (15:15 -0200)]
mce-intel-knl: Fix CodingStyle

Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
9 years agorasdaemon: Add support for Knights Landing processor
Marcin Koss [Thu, 3 Dec 2015 14:19:47 +0000 (15:19 +0100)]
rasdaemon: Add support for Knights Landing processor

Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
9 years agorasdaemon: Add model numbers for Broadwell-EP/EX and -DE
Seiichi Ikarashi [Tue, 29 Sep 2015 01:46:23 +0000 (10:46 +0900)]
rasdaemon: Add model numbers for Broadwell-EP/EX and -DE

Based on mcelog code.

Signed-off-by: Seiichi Ikarashi <s.ikarashi@jp.fujitsu.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
9 years agorasdaemon: fix typos on ras-mc-ctl man page
Aristeu Rozanski [Mon, 10 Aug 2015 18:24:41 +0000 (14:24 -0400)]
rasdaemon: fix typos on ras-mc-ctl man page

Fixed two markers and two typos in the documentation.

Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
9 years agoBump version to 0.5.6 v0.5.6
Mauro Carvalho Chehab [Fri, 3 Jul 2015 10:35:14 +0000 (07:35 -0300)]
Bump version to 0.5.6

Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
9 years agorasdaemon: add internal errors of IA32_MC4_STATUS for Haswell
Seiichi Ikarashi [Wed, 17 Jun 2015 10:56:57 +0000 (07:56 -0300)]
rasdaemon: add internal errors of IA32_MC4_STATUS for Haswell

Currently, rasdaemon seems to purposely omit the internal errors of
IA32_MC4_STATUS for Haswell-family processors, which are described in
Intel SDM vol3 Table 16-20. I think it's better to show these errors
because mcelog does show them.

Signed-off-by: Seiichi Ikarashi <s.ikarashi@jp.fujitsu.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
9 years agorasdaemon: use MCA error msg as error_msg
Seiichi Ikarashi [Fri, 12 Jun 2015 09:35:37 +0000 (06:35 -0300)]
rasdaemon: use MCA error msg as error_msg

In the case of machine checks which do not have a model-specific MCA
error code but only an architectural one, mce_event.error_msg
becomes empty, so you don't know what happened.

(snip)
MCE records summary:
1  errors
          ^
          empty!

(snip)
MCE events:
1 2015-06-12 00:21:46 +0900 error: , mcg mcgstatus= 0, mci Corrected_error
                                  ^
                                empty!

Error_enabled, mcgcap=0x07000c16, status=0x9c0000000000017a, addr=0x204fffffff, misc=0x4004000000000080, walltime=0x557b0db2, cpu=0x00000001, cpuid=0x000306f3, apicid=0x00000002, bank=0x00000003

In such a case, let's use the content of mcastatus_msg as error_msg
instead.

(snip)
MCE records summary:
1 Generic CACHE Level-2 Eviction Error errors
(snip)
MCE events:
1 2015-06-12 02:39:04 +0900 error: Generic CACHE Level-2 Eviction Error, mcg mcgstatus= 0, mci Corrected_error Error_enabled, mcgcap=0x07000c16, status=0x9c0000000000017a, addr=0x204fffffff, misc=0x4004000000000080, walltime=0x557b1f22, cpu=0x00000001, cpuid=0x000306f3, apicid=0x00000002, bank=0x00000003

Signed-off-by: Seiichi Ikarashi <s.ikarashi@jp.fujitsu.com>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
9 years agorasdaemon: unnecessary comma for empty mc_location string
Seiichi Ikarashi [Wed, 10 Jun 2015 23:49:55 +0000 (20:49 -0300)]
rasdaemon: unnecessary comma for empty mc_location string

In /var/log/messages, rasdaemon sometimes prints an unnecessary
comma ", " between mca= and cpu_type= like below:

Jun  9 02:44:39 localhost rasdaemon: <...>-4585  [1638893312]  1031.109000: mce_record:           2015-06-08 10:07:28 +0900 bank=3, status= 9c0000000000017a, mci=Corrected_error Error_enabled, mca=Generic CACHE Level-2 Eviction Error, , cpu_type= Intel Xeon v3 (Haswell) EP/EX, cpu= 1, socketid= 0, misc= 4004000000000080, addr= 204fffffff, mcgstatus= 0, mcgcap= 7000c16, apicid= 2

That's the comma for mc_location which is printed even if mc_location is
empty due to a wrong if condition.

Signed-off-by: Seiichi Ikarashi <s.ikarashi@jp.fujitsu.com>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
9 years agorasdaemon: remove a space from mcgstatus_msg
Seiichi Ikarashi [Wed, 10 Jun 2015 10:29:03 +0000 (07:29 -0300)]
rasdaemon: remove a space from mcgstatus_msg

"ras-mc-ctl --errors" shows an unnecessary space character in the
mcgstatus string of MCE event, like below:

2 2015-04-04 19:57:22 +0900 error: MC_HA_IMC_RW_BLOCK_ACK_TIMEOUT, mcg mcgstatus= 0, mci Corrected_error, mcgcap=0x07000c16, status=0x8000000067000e0b, walltime=0x555da140, cpu=0x00000001, cpuid=0x000306f3, apicid=0x00000002, bank=0x00000004

Let's remove it.

Signed-off-by: Seiichi Ikarashi <s.ikarashi@jp.fujitsu.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
9 years agox86, rasdaemon: Add support to log Local Machine Check Exception (LMCE)
Ashok Raj [Fri, 5 Jun 2015 16:32:47 +0000 (13:32 -0300)]
x86, rasdaemon: Add support to log Local Machine Check Exception (LMCE)

Local Machine Check Exception allows certain errors to be signaled to
only the affected logical processor. This change captures them for
rasdaemon.

log:Changes to rasdaemon to support new architectural changes to MCE

Changes to rasdaemon to support new architectural extensions in Intel
CPUs.

Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
9 years agoBump version to 0.5.5 v0.5.5
Mauro Carvalho Chehab [Wed, 3 Jun 2015 13:59:55 +0000 (10:59 -0300)]
Bump version to 0.5.5

Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
9 years agoImprove INSTALL summary instructions
Mauro Carvalho Chehab [Wed, 3 Jun 2015 13:42:46 +0000 (10:42 -0300)]
Improve INSTALL summary instructions

Using && ensures that each command only runs if the previous one
succeeded. So, this is the recommended way.

Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
9 years agorasdaemon: add support to match the machine by system's product name
Aristeu Rozanski [Mon, 1 Jun 2015 20:04:00 +0000 (17:04 -0300)]
rasdaemon: add support to match the machine by system's product name

In some cases the motherboard names will change but the mapping won't
across a line of products. This patch adds support for "Product:" to be
specified in the label files instead of Model:.

An example:
Vendor: Dell Inc.
  Product: PowerEdge R610
    DIMM_A1: 0.0.0;     DIMM_A2:  0.0.1;        DIMM_A3:  0.0.2;
    DIMM_A4: 0.1.0;     DIMM_A5:  0.1.1;        DIMM_A6:  0.1.2;

    DIMM_B1: 1.0.0;     DIMM_B2:  1.0.1;        DIMM_B3:  1.0.2;
    DIMM_B4: 1.1.0;     DIMM_B5:  1.1.1;        DIMM_B6:  1.1.2;

Would match all 'PowerEdge R610' machines.

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
9 years agorasdaemon: make sure the error is valid before handling ranks
Seiichi Ikarashi [Tue, 26 May 2015 14:59:39 +0000 (11:59 -0300)]
rasdaemon: make sure the error is valid before handling ranks

Fix "rank" handling according to the Bit 63 description in Intel SDM Vol.3C
Table 16-23, that says "... Use this information only after there is valid
first error info indicated by bit 62".
Also fix invalid comparisons of unsigned variables "rank0" and "rank1".
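
A minimal sketch of the validity ordering described above, not the actual
rasdaemon code. The helper names and the rank field positions (bits 46-50 and
51-55) are assumptions made for illustration, as is treating bit 63 as the
flag for the second rank; only the rule that bit 63 information is usable
after bit 62 is set comes from the text.

```c
#include <stdbool.h>
#include <stdint.h>

/* Validity flags in the MISC value; see the assumptions in the lead-in. */
#define FIRST_ERR_VALID  (1ULL << 62)
#define SECOND_ERR_VALID (1ULL << 63)

struct ranks {
	bool has_rank0;
	bool has_rank1;
	unsigned rank0;
	unsigned rank1;
};

/* Extract bits [lo, hi], inclusive, from v. */
static uint64_t extract_bits(uint64_t v, unsigned lo, unsigned hi)
{
	return (v >> lo) & ((1ULL << (hi - lo + 1)) - 1);
}

/* Decode the ranks, consulting bit 63 only once bit 62 is set. */
static struct ranks decode_ranks(uint64_t misc)
{
	struct ranks r = { false, false, 0, 0 };

	if (misc & FIRST_ERR_VALID) {
		r.has_rank0 = true;
		r.rank0 = (unsigned)extract_bits(misc, 46, 50); /* hypothetical field */
		if (misc & SECOND_ERR_VALID) {
			r.has_rank1 = true;
			r.rank1 = (unsigned)extract_bits(misc, 51, 55); /* hypothetical field */
		}
	}
	return r;
}
```

Presence is tracked with explicit booleans rather than a negative sentinel,
because a check such as "rank0 >= 0" is always true for an unsigned variable.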

Signed-off-by: Seiichi Ikarashi <s.ikarashi@jp.fujitsu.com>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
9 years agorasdaemon: enable IMC status usage for Haswell-E
Seiichi Ikarashi [Tue, 26 May 2015 14:59:38 +0000 (11:59 -0300)]
rasdaemon: enable IMC status usage for Haswell-E

Enable IMC status bank for Haswell-E, as described in Intel SDM Vol.3C
Table 35-27.

Signed-off-by: Seiichi Ikarashi <s.ikarashi@jp.fujitsu.com>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
9 years agorasdaemon: add missing semicolon in hsw_decode_model()
Seiichi Ikarashi [Tue, 26 May 2015 14:59:37 +0000 (11:59 -0300)]
rasdaemon: add missing semicolon in hsw_decode_model()

hsw_decode_model() tries to skip decode_bitfield() if IA32_MC4_STATUS indicates
some internal errors. Unfortunately, the code behaves opposite to the intention
because a semicolon is missing.

Signed-off-by: Seiichi Ikarashi <s.ikarashi@jp.fujitsu.com>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
9 years agorasdaemon: properly print message strings in decode_bitfield()
Seiichi Ikarashi [Tue, 26 May 2015 14:59:36 +0000 (11:59 -0300)]
rasdaemon: properly print message strings in decode_bitfield()

Fix decode_bitfield() so that it does print message strings from the struct
field table.

Signed-off-by: Seiichi Ikarashi <s.ikarashi@jp.fujitsu.com>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
9 years agorasdaemon: add support for Knights Landing
Aristeu Rozanski [Mon, 18 May 2015 17:19:33 +0000 (14:19 -0300)]
rasdaemon: add support for Knights Landing

Patch based on mcelog.

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
9 years agorasdaemon: add support for Broadwell
Aristeu Rozanski [Mon, 18 May 2015 17:19:32 +0000 (14:19 -0300)]
rasdaemon: add support for Broadwell

Only basic support for now.

Based on mcelog code.

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
9 years agorasdaemon: Identify Ivy Bridge properly
Aristeu Rozanski [Mon, 18 May 2015 17:19:31 +0000 (14:19 -0300)]
rasdaemon: Identify Ivy Bridge properly

This patch is based on b29cc4d615cead87cbc163ada0645b10c5b1217d (mcelog)
mcelog: Identify Ivy Bridge properly

Uniquely identify Ivy Bridge even though the machine checks are the same
for Sandy Bridge and Ivy Bridge.  This makes the output for the processor
display "Ivy Bridge".

Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Cc: tony.luck@intel.com
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
9 years agorasdaemon: Add missing entry to Ivy Bridge memory controller decode table
Aristeu Rozanski [Mon, 18 May 2015 17:19:30 +0000 (14:19 -0300)]
rasdaemon: Add missing entry to Ivy Bridge memory controller decode table

This patch is based on 2577aeb662374cb87169ee675b2e37c06f1aed99 (mcelog)

mcelog: Add missing entry to Ivy Bridge memory controller decode table

September 2013 edition of the software developer manual added an
entry that had been inadvertently omitted from earlier editions.
Add the 0x80 entry for "Corrected memory read error".

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
9 years agorasdaemon: decode new simple error code number 6
Aristeu Rozanski [Mon, 18 May 2015 17:19:29 +0000 (14:19 -0300)]
rasdaemon: decode new simple error code number 6

This patch was based on fa313dd0144596dfa140bd66805367250d6eae9b
(mcelog)

mcelog: Decode new simple error code number 6

Edition 050 of the Intel SDM released in late February 2014
includes a new simple error code in "Table 15-8. IA32_MCi_Status
[15:0] Simple Error Code Encoding".  Code 6 (0000 0000 0000 0110)
has been allocated for the reporting of cases where the BIOS SMM
code attempts to execute code outside of the protected SMRR area.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
9 years agorasdaemon: add support for Haswell
Aristeu Rozanski [Mon, 18 May 2015 17:19:28 +0000 (14:19 -0300)]
rasdaemon: add support for Haswell

Based on mcelog code.

Acked-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
10 years agoBump version to 0.5.4 v0.5.4
Mauro Carvalho Chehab [Fri, 15 Aug 2014 22:15:47 +0000 (19:15 -0300)]
Bump version to 0.5.4

Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
10 years agorasdaemon: do not assume dimmX/ directories will be present
Aristeu Rozanski [Fri, 15 Aug 2014 17:50:58 +0000 (13:50 -0400)]
rasdaemon: do not assume dimmX/ directories will be present

While finding the labels, size and location, ras-mc-ctl searches /sys for
the files and calculates the location. When it later uses the location to map
back to files in order to print or write labels, it assumes that dimm*
directories exist, which is not true with drivers like amd64_edac.
This patch adds two new hashes to store the location and the label file path
so they can be used later.

Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
10 years agorasdaemon: enable recording by default in service file
Aristeu Rozanski [Mon, 21 Jul 2014 20:23:18 +0000 (16:23 -0400)]
rasdaemon: enable recording by default in service file

This patch changes the service file to enable the tracing events after
the daemon is started and starts the daemon recording events by default.

Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
10 years agorasdaemon: correct range while parsing top, middle and lower layers
Aristeu Rozanski [Mon, 21 Jul 2014 19:25:40 +0000 (15:25 -0400)]
rasdaemon: correct range while parsing top, middle and lower layers

{top,middle,lower}_layer are signed char, therefore will never be 255.
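
A short illustration of why such a test could never succeed, assuming the
usual 8-bit char. The function names are hypothetical and are not taken from
rasdaemon.

```c
/* Value a comparison sees when the field is a signed char. */
static int as_signed_layer(signed char layer)
{
	return layer;                /* promoted to int, range -128..127 */
}

/* Value the same byte would have if the field were an unsigned char. */
static int as_unsigned_layer(signed char layer)
{
	return (unsigned char)layer; /* range 0..255 */
}
```

A byte with all bits set therefore reads as -1 through the signed field and as
255 only through an unsigned one, so comparing the signed field against 255
is always false.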

Reference: https://bugzilla.redhat.com/show_bug.cgi?id=1035746

Tested in a GHES enabled machine using EINJ.

v2: no need to test ranges at all

Reported-by: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
10 years agoBump version to 0.5.3 v0.5.3
Mauro Carvalho Chehab [Sun, 10 Aug 2014 14:04:10 +0000 (11:04 -0300)]
Bump version to 0.5.3

Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
10 years agoAdd a target to build rasdaemon with mock
Mauro Carvalho Chehab [Sun, 10 Aug 2014 15:51:04 +0000 (12:51 -0300)]
Add a target to build rasdaemon with mock

Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
10 years agoAdd an option to build the srpm
Mauro Carvalho Chehab [Sun, 10 Aug 2014 15:47:21 +0000 (12:47 -0300)]
Add an option to build the srpm

Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
10 years agorasdaemon: Add support for extlog trace events
Luck, Tony [Mon, 4 Aug 2014 20:29:01 +0000 (13:29 -0700)]
rasdaemon: Add support for extlog trace events

Linux kernel 3.17 includes a new trace event to pick up extended
error logs produced by BIOS in the Common Platform Error Record
format described in appendix N of the UEFI standard. This patch
adds support to collect that information and log it both in
readable ASCII and into the sqlite3 database that rasdaemon
uses to store all error information.  In addition ras-mc-ctl
is updated to query that database for both detailed and summary
reports.

Big thanks to Aristeu for pretty much all the sqlite3 pieces,
plus testing and fixing miscellaneous issues elsewhere.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
10 years agorasdaemon: handle failures of snprintf()
Aristeu Rozanski [Tue, 24 Jun 2014 15:01:31 +0000 (11:01 -0400)]
rasdaemon: handle failures of snprintf()

Florian Weimer found that bitfield_msg() uses the return value of
snprintf() to calculate the length, ignoring that it can be a
negative number. This patch makes bitfield_msg() stop writing in that
case.
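
A minimal sketch of the defensive pattern, not the actual bitfield_msg()
code. The helper append_fmt() and its signature are hypothetical; it shows
one way to keep a running length valid when the formatting call fails or
truncates.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Append formatted text to buf at offset *len.
 * Returns 0 on success, -1 if nothing more should be written.
 */
static int append_fmt(char *buf, size_t size, size_t *len, const char *fmt, ...)
{
	va_list ap;
	int n;

	if (*len >= size)
		return -1;              /* no room left, not even for the NUL */

	va_start(ap, fmt);
	n = vsnprintf(buf + *len, size - *len, fmt, ap);
	va_end(ap);

	if (n < 0)
		return -1;              /* output error: leave *len untouched */

	if ((size_t)n >= size - *len) {
		*len = size - 1;        /* truncated: the buffer is now full */
		return -1;
	}

	*len += (size_t)n;
	return 0;
}
```

A caller would stop appending as soon as the helper returns -1, instead of
adding a negative or oversized count to the length.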

Reference: https://bugzilla.redhat.com/show_bug.cgi?id=1035741

Reported-by: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
10 years agorasdaemon: fix mce numfield decoded error
Xie XiuQi [Thu, 8 May 2014 12:07:19 +0000 (20:07 +0800)]
rasdaemon: fix mce numfield decoded error

Some fields are missing in mce decode information, as below:
...
rasdaemon: register inserted at db
           <...>-31568 [000]  4023.214080: mce_record:
2014-05-07 15:51:16 +0800 bank=2, status= bd000000000000c0, MEMORY
CONTROLLER MS_CHANNEL0_ERR Transaction: Memory scrubbing error %s: %Lu
 %s: %Lx
 %s: %Lx
 %s: %Lu
 %s: %Lu
 %s: %Lx
, mci=Uncorrected_error Error_enabled SRAO, n_errors=0 channel=0,
dimm=0, cpu_type= Intel Xeon 5500 series / Core i3/5/7
("Nehalem/Westmere"), cpu= 0, socketid= 0, ip= 1eadbabe (INEXACT), cs=
73, misc= 8c, addr= 62b000, mcgstatus= 5 RIPV MCIP, mcgcap= 1c09,
apicid= 0

"f->name" and "v" are missing from the print call in decode_numfield(), so fix it.

Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
10 years agorasdaemon: sqlite truncates some MCE fields to 32-bit
Luck, Tony [Mon, 7 Apr 2014 18:27:47 +0000 (11:27 -0700)]
rasdaemon: sqlite truncates some MCE fields to 32-bit

The sqlite3_bind_int() function takes an "int" as the argument value to
save to the database. But some fields are wider than 32-bits.  Use
sqlite3_bind_int64() for the fields where we know values can exceed
4G.

Before:

# ./rasdaemon/util/ras-mc-ctl --errors
 ...
MCE events:
1 2014-04-04 08:50:32 -0700 error: MEMORY CONTROLLER RD_CHANNEL0_ERR Transaction: Memory read error, mcg mcgstatus= 0, mci Corrected_error, mcgcap=0x07000c16, status=0x00010090, addr=0x35fcb9c0, misc=0x5026a686, walltime=0x5342e4f9, cpu=0x0000000e, cpuid=0x000306f1, apicid=0x00000020, socketid=0x00000001, bank=0x00000008
2 2014-04-04 08:50:35 -0700 error: MEMORY CONTROLLER RD_CHANNEL0_ERR Transaction: Memory read error, mcg mcgstatus= 0, mci Corrected_error, mcgcap=0x07000c16, status=0x00010090, addr=0x4187adc0, misc=0x4274f486, walltime=0x5342e4fc, cpu=0x0000000e, cpuid=0x000306f1, apicid=0x00000020, socketid=0x00000001, bank=0x00000007
3 2014-04-04 08:50:37 -0700 error: MEMORY CONTROLLER RD_CHANNEL0_ERR Transaction: Memory read error, mcg mcgstatus= 0, mci Corrected_error, mcgcap=0x07000c16, status=0x00010090, addr=0x52efc600, misc=0x50028286, walltime=0x5342e4fd, cpu=0x0000000e, cpuid=0x000306f1, apicid=0x00000020, socketid=0x00000001, bank=0x00000008

After:
./rasdaemon/util/ras-mc-ctl --errors
 ...
1 2014-04-04 09:00:07 -0700 error: MEMORY CONTROLLER RD_CHANNEL0_ERR Transaction: Memory read error, mcg mcgstatus= 0, mci Corrected_error, mcgcap=0x07000c16, status=0x8c00004000010090, addr=0x45340a180, misc=0x140686886, walltime=0x5342e736, cpuid=0x000306f1, bank=0x00000008
2 2014-04-04 09:00:08 -0700 error: MEMORY CONTROLLER RD_CHANNEL0_ERR Transaction: Memory read error, mcg mcgstatus= 0, mci Corrected_error, mcgcap=0x07000c16, status=0x8c00004000010090, addr=0x44d6e4780, misc=0x15060e086, walltime=0x5342e737, cpuid=0x000306f1, bank=0x00000007
3 2014-04-04 09:00:10 -0700 error: MEMORY CONTROLLER RD_CHANNEL0_ERR Transaction: Memory read error, mcg mcgstatus= 0, mci Corrected_error, mcgcap=0x07000c16, status=0x8c00004000010090, addr=0x44cb64640, misc=0x140505086, walltime=0x5342e739, cpuid=0x000306f1, bank=0x00000008
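
The difference between the status values in the two listings can be
reproduced with plain integer arithmetic. This is only a sketch of the
truncation effect; the function name is hypothetical and no sqlite3 call is
made here.

```c
#include <stdint.h>

/* What survives when a 64-bit value is stored through a 32-bit slot. */
static uint32_t low_32_bits(uint64_t value)
{
	return (uint32_t)value; /* keeps bits 0..31, drops bits 32..63 */
}
```

Applied to the status value from the "After" output, 0x8c00004000010090,
this yields 0x00010090, which is the value shown in the "Before" output.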

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
10 years agorasdaemon: fix some typos and cut/paste errors in sqlite bits
Luck, Tony [Mon, 7 Apr 2014 19:23:25 +0000 (12:23 -0700)]
rasdaemon: fix some typos and cut/paste errors in sqlite bits

The aer event has error_type as field 2 and msg as field 3, but the calls
to sqlite3_bind_text use 3 and 4.

The mce event forgot to declare "mcastatus_msg".

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
11 years agoBump version to 0.5.2 v0.5.2
Mauro Carvalho Chehab [Thu, 3 Apr 2014 11:50:45 +0000 (08:50 -0300)]
Bump version to 0.5.2

Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>