www.infradead.org Git - users/mchehab/rasdaemon.git/log
Seiichi Ikarashi [Tue, 26 May 2015 14:59:38 +0000 (11:59 -0300)]
rasdaemon: enable IMC status usage for Haswell-E
Enable IMC status bank for Haswell-E, as described in Intel SDM Vol.3C
Table 35-27.
Signed-off-by: Seiichi Ikarashi <s.ikarashi@jp.fujitsu.com>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Seiichi Ikarashi [Tue, 26 May 2015 14:59:37 +0000 (11:59 -0300)]
rasdaemon: add missing semicolon in hsw_decode_model()
hsw_decode_model() tries to skip decode_bitfield() if IA32_MC4_STATUS indicates
an internal error. Unfortunately, the code does the opposite of what was
intended because a semicolon is missing.
Signed-off-by: Seiichi Ikarashi <s.ikarashi@jp.fujitsu.com>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Seiichi Ikarashi [Tue, 26 May 2015 14:59:36 +0000 (11:59 -0300)]
rasdaemon: properly print message strings in decode_bitfield()
Fix decode_bitfield() so that it does print message strings from the struct
field table.
Signed-off-by: Seiichi Ikarashi <s.ikarashi@jp.fujitsu.com>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Aristeu Rozanski [Mon, 18 May 2015 17:19:33 +0000 (14:19 -0300)]
rasdaemon: add support for Knights Landing
Patch based on mcelog.
Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Aristeu Rozanski [Mon, 18 May 2015 17:19:32 +0000 (14:19 -0300)]
rasdaemon: add support for Broadwell
Only basic support for now.
Based on mcelog code.
Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Aristeu Rozanski [Mon, 18 May 2015 17:19:31 +0000 (14:19 -0300)]
rasdaemon: Identify Ivy Bridge properly
This patch is based on
b29cc4d615cead87cbc163ada0645b10c5b1217d (mcelog)
mcelog: Identify Ivy Bridge properly
Uniquely identify Ivy Bridge even though the machine checks are the same
for Sandy Bridge and Ivy Bridge. This makes the output for the processor
display "Ivy Bridge".
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Cc: tony.luck@intel.com
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Aristeu Rozanski [Mon, 18 May 2015 17:19:30 +0000 (14:19 -0300)]
rasdaemon: Add missing entry to Ivy Bridge memory controller decode table
This patch is based on
2577aeb662374cb87169ee675b2e37c06f1aed99 (mcelog)
mcelog: Add missing entry to Ivy Bridge memory controller decode table
September 2013 edition of the software developer manual added an
entry that had been inadvertently omitted from earlier editions.
Add the 0x80 entry for "Corrected memory read error".
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Aristeu Rozanski [Mon, 18 May 2015 17:19:29 +0000 (14:19 -0300)]
rasdaemon: decode new simple error code number 6
This patch was based on
fa313dd0144596dfa140bd66805367250d6eae9b
(mcelog)
mcelog: Decode new simple error code number 6
Edition 050 of the Intel SDM released in late February 2014
includes a new simple error code in "Table 15-8. IA32_MCi_Status
[15:0] Simple Error Code Encoding". Code 6 (0000 0000 0000 0110)
has been allocated for the reporting of cases where the BIOS SMM
code attempts to execute code outside of the protected SMRR area.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Aristeu Rozanski [Mon, 18 May 2015 17:19:28 +0000 (14:19 -0300)]
rasdaemon: add support for Haswell
Based on mcelog code.
Acked-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Mauro Carvalho Chehab [Fri, 15 Aug 2014 22:15:47 +0000 (19:15 -0300)]
Bump version to 0.5.4
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Aristeu Rozanski [Fri, 15 Aug 2014 17:50:58 +0000 (13:50 -0400)]
rasdaemon: do not assume dimmX/ directories will be present
While finding the labels, size and location, ras-mc-ctl will search /sys for
the files and calculate the location. When it uses the location trying to map
back to files to print labels or write labels, it'll just assume dimm*
directories exist which is not correct while using drivers like amd64_edac.
This patch adds two new hashes to store the location and the label file path
so it can be used later.
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Aristeu Rozanski [Mon, 21 Jul 2014 20:23:18 +0000 (16:23 -0400)]
rasdaemon: enable recording by default in service file
This patch changes the service file to enable the tracing events after
the daemon is started and starts the daemon recording events by default.
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Aristeu Rozanski [Mon, 21 Jul 2014 19:25:40 +0000 (15:25 -0400)]
rasdaemon: correct range while parsing top, middle and lower layers
{top,middle,lower}_layer are signed char, therefore will never be 255.
Reference: https://bugzilla.redhat.com/show_bug.cgi?id=1035746
Tested in a GHES enabled machine using EINJ.
v2: no need to test ranges at all
Reported-by: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
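As an illustration of why a check against 255 on these fields can never succeed, here is a minimal sketch. It is not the rasdaemon code: both function names are hypothetical, and the second helper is only one possible alternative (the commit itself ended up removing the range tests entirely).

```c
/* Hypothetical helper mimicking a check written as "layer == 255".
 * The signed char is promoted to int before the comparison, so the
 * compared value lies within the signed char range (at most 127 with
 * 8-bit chars) and can never equal 255. */
int layer_equals_255(signed char layer)
{
	int promoted = layer;

	return promoted == 255;
}

/* Hypothetical alternative: treat any negative value as "not set". */
int layer_is_unset(signed char layer)
{
	return layer < 0;
}
```

With these helpers, `layer_equals_255()` returns 0 for every possible argument, while `layer_is_unset(-1)` returns 1.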
Mauro Carvalho Chehab [Sun, 10 Aug 2014 14:04:10 +0000 (11:04 -0300)]
Bump version to 0.5.3
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Mauro Carvalho Chehab [Sun, 10 Aug 2014 15:51:04 +0000 (12:51 -0300)]
Add a target to build rasdaemon with mock
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Mauro Carvalho Chehab [Sun, 10 Aug 2014 15:47:21 +0000 (12:47 -0300)]
Add an option to build the srpm
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Luck, Tony [Mon, 4 Aug 2014 20:29:01 +0000 (13:29 -0700)]
rasdaemon: Add support for extlog trace events
Linux kernel 3.17 includes a new trace event to pick up extended
error logs produced by BIOS in the Common Platform Error Record
format described in appendix N of the UEFI standard. This patch
adds support to collect that information and log it both in
readable ASCII and into the sqlite3 database that rasdaemon
uses to store all error information. In addition ras-mc-ctl
is updated to query that database for both detailed and summary
reports.
Big thanks to Aristeu for pretty much all the sqlite3 pieces,
plus testing and fixing miscellaneous issues elsewhere.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Aristeu Rozanski [Tue, 24 Jun 2014 15:01:31 +0000 (11:01 -0400)]
rasdaemon: handle failures of snprintf()
Florian Weimer found that in bitfield_msg() the return value of
snprintf() is used to calculate length ignoring that it can return a
negative number. This patch makes bitfield_msg() stop writing in that
case.
Reference: https://bugzilla.redhat.com/show_bug.cgi?id=1035741
Reported-by: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
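The kind of check described in the commit above can be sketched as follows. This is a hedged illustration, not the actual bitfield_msg() change: the helper name append_checked() and its return convention are assumptions made for the example.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper, not the rasdaemon bitfield_msg() itself.
 * Appends "text" to buf at offset *pos. Returns 0 on success, or -1 if
 * snprintf() reported an error or the output did not fit. On failure
 * *pos is left unchanged so the caller can stop writing. */
int append_checked(char *buf, size_t len, size_t *pos, const char *text)
{
	int n;

	if (*pos >= len)
		return -1;

	n = snprintf(buf + *pos, len - *pos, "%s", text);
	if (n < 0)                      /* output error */
		return -1;
	if ((size_t)n >= len - *pos)    /* output was truncated */
		return -1;

	*pos += (size_t)n;
	return 0;
}
```

The point of the design is that the return value is validated before it is added to the running offset, so a negative result can never move the write position backwards.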
Xie XiuQi [Thu, 8 May 2014 12:07:19 +0000 (20:07 +0800)]
rasdaemon: fix mce numfield decoded error
Some fields are missing in mce decode information, as below:
...
rasdaemon: register inserted at db
<...>-31568 [000] 4023.214080: mce_record:
2014-05-07 15:51:16 +0800 bank=2, status= bd000000000000c0 , MEMORY
CONTROLLER MS_CHANNEL0_ERR Transaction: Memory scrubbing error %s: %Lu
%s: %Lx
%s: %Lx
%s: %Lu
%s: %Lu
%s: %Lx
, mci=Uncorrected_error Error_enabled SRAO, n_errors=0 channel=0,
dimm=0, cpu_type= Intel Xeon 5500 series / Core i3/5/7
("Nehalem/Westmere"), cpu= 0, socketid= 0, ip= 1eadbabe (INEXACT), cs= 73,
misc= 8c, addr= 62b000, mcgstatus= 5 RIPV MCIP, mcgcap= 1c09,
apicid= 0
"f->name" and "v" are not printed by decode_numfield(), so fix it.
Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
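The literal "%s: %Lu" fragments in the output above are what appears when a format string is emitted without its arguments. A minimal sketch of the corrected pattern, assuming a hypothetical format_numfield() helper (not the real decode_numfield() signature) and using the portable %llu/%llx conversions instead of %Lu/%Lx:

```c
#include <stdio.h>

/* Hypothetical stand-in for a numeric field formatter. Both the field
 * name and the value are passed as arguments, so the conversions in
 * the format string have something to print. */
int format_numfield(char *buf, size_t len, const char *name,
		    unsigned long long v, int hex)
{
	if (hex)
		return snprintf(buf, len, "%s: %llx", name, v);

	return snprintf(buf, len, "%s: %llu", name, v);
}
```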
Luck, Tony [Mon, 7 Apr 2014 18:27:47 +0000 (11:27 -0700)]
rasdaemon: sqlite truncates some MCE fields to 32-bit
The sqlite3_bind_int() function takes an "int" as the argument value to
save to the database. But some fields are wider than 32 bits. Use
sqlite3_bind_int64() for the fields where we know values can exceed
4G.
Before:
# ./rasdaemon/util/ras-mc-ctl --errors
...
MCE events:
1 2014-04-04 08:50:32 -0700 error: MEMORY CONTROLLER RD_CHANNEL0_ERR Transaction: Memory read error, mcg mcgstatus= 0, mci Corrected_error, mcgcap=0x07000c16, status=0x00010090, addr=0x35fcb9c0, misc=0x5026a686, walltime=0x5342e4f9, cpu=0x0000000e, cpuid=0x000306f1, apicid=0x00000020, socketid=0x00000001, bank=0x00000008
2 2014-04-04 08:50:35 -0700 error: MEMORY CONTROLLER RD_CHANNEL0_ERR Transaction: Memory read error, mcg mcgstatus= 0, mci Corrected_error, mcgcap=0x07000c16, status=0x00010090, addr=0x4187adc0, misc=0x4274f486, walltime=0x5342e4fc, cpu=0x0000000e, cpuid=0x000306f1, apicid=0x00000020, socketid=0x00000001, bank=0x00000007
3 2014-04-04 08:50:37 -0700 error: MEMORY CONTROLLER RD_CHANNEL0_ERR Transaction: Memory read error, mcg mcgstatus= 0, mci Corrected_error, mcgcap=0x07000c16, status=0x00010090, addr=0x52efc600, misc=0x50028286, walltime=0x5342e4fd, cpu=0x0000000e, cpuid=0x000306f1, apicid=0x00000020, socketid=0x00000001, bank=0x00000008
After:
./rasdaemon/util/ras-mc-ctl --errors
...
1 2014-04-04 09:00:07 -0700 error: MEMORY CONTROLLER RD_CHANNEL0_ERR Transaction: Memory read error, mcg mcgstatus= 0, mci Corrected_error, mcgcap=0x07000c16, status=0x8c00004000010090, addr=0x45340a180, misc=0x140686886, walltime=0x5342e736, cpuid=0x000306f1, bank=0x00000008
2 2014-04-04 09:00:08 -0700 error: MEMORY CONTROLLER RD_CHANNEL0_ERR Transaction: Memory read error, mcg mcgstatus= 0, mci Corrected_error, mcgcap=0x07000c16, status=0x8c00004000010090, addr=0x44d6e4780, misc=0x15060e086, walltime=0x5342e737, cpuid=0x000306f1, bank=0x00000007
3 2014-04-04 09:00:10 -0700 error: MEMORY CONTROLLER RD_CHANNEL0_ERR Transaction: Memory read error, mcg mcgstatus= 0, mci Corrected_error, mcgcap=0x07000c16, status=0x8c00004000010090, addr=0x44cb64640, misc=0x140505086, walltime=0x5342e739, cpuid=0x000306f1, bank=0x00000008
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
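The "Before" and "After" listings are consistent with the upper half of each 64-bit value being lost: the "Before" status 0x00010090 is the low 32 bits of the "After" status 0x8c00004000010090. The sketch below reproduces that arithmetic in plain C without touching sqlite3; low32() is a hypothetical name used only for this illustration, and it assumes the usual behaviour where narrowing to a 32-bit int keeps the low-order bits.

```c
#include <stdint.h>

/* Keeps only the low 32 bits, which is what typically remains of a
 * 64-bit register value once it has been passed through a 32-bit int. */
uint32_t low32(uint64_t value)
{
	return (uint32_t)(value & 0xffffffffu);
}
```

For example, `low32(0x8c00004000010090)` yields 0x00010090, matching the truncated status seen before the fix.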
Luck, Tony [Mon, 7 Apr 2014 19:23:25 +0000 (12:23 -0700)]
rasdaemon: fix some typos and cut/paste errors in sqlite bits
The aer event has error_type as field 2 and msg as field 3, but the calls
to sqlite3_bind_text() use 3 and 4.
The mce event forgot to declare "mcastatus_msg".
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Mauro Carvalho Chehab [Thu, 3 Apr 2014 11:50:45 +0000 (08:50 -0300)]
Bump version to 0.5.2
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Jakub Filak [Wed, 2 Apr 2014 13:03:44 +0000 (15:03 +0200)]
Correct ABRT report data
Remove '\0' byte from 'PUT' message because this was superfluous.
Replaced 'BASENAME' item with 'TYPE' item because the first one is no
longer supported by abrtd and the second one is required. Basically the
latter is a substitute for the former.
Removed the closing message which is not supported by abrtd. abrtd
considers that message as a part of the problem report.
Removed a superfluous space from 'Backtrace'.
Signed-off-by: Jakub Filak <jfilak@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Mauro Carvalho Chehab [Fri, 28 Mar 2014 21:36:00 +0000 (18:36 -0300)]
Bump version to 0.5.1
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Mauro Carvalho Chehab [Fri, 28 Mar 2014 21:47:41 +0000 (18:47 -0300)]
Add two new generated files to .gitignore
The service files are now auto-generated.
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Jakub Filak [Fri, 21 Feb 2014 14:54:09 +0000 (15:54 +0100)]
Make paths in the systemd services configurable
The path to a binary depends on configuration, therefore it is better to
not use hard coded strings.
Signed-off-by: Jakub Filak <jfilak@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Betty Dall [Wed, 19 Mar 2014 21:54:56 +0000 (15:54 -0600)]
ras-mc-ctl: Print useful message when run without rasdaemon -r
The utility script ras-mc-ctl requires that rasdaemon --record be run
to create the mc_event table in the SQLite database. The current behaviour
is this:
[root@sa1 util]# ras-mc-ctl --errors
DBD::SQLite::db prepare failed: no such table: mc_event at
/usr/local/sbin/ras-mc-ctl line 914.
Can't call method "execute" on an undefined value at
/usr/local/sbin/ras-mc-ctl line 915.
With this change, the user sees:
[root@sa1 util]# ras-mc-ctl --errors
DBD::SQLite::db prepare failed: no such table: mc_event at
/usr/local/sbin/ras-mc-ctl line 914.
ras-mc-ctl: Error: mc_event table missing from
/usr/local/var/lib/rasdaemon/ras-mc_event.db. Run 'rasdaemon --record'.
Signed-off-by: Betty Dall <betty.dall@hp.com>
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Betty Dall [Wed, 19 Mar 2014 20:59:47 +0000 (14:59 -0600)]
rasdaemon: Add record option to rasdaemon man page
Add the already existing rasdaemon option 'record' to the rasdaemon man
page. This option records events via sqlite3.
Signed-off-by: Betty Dall <betty.dall@hp.com>
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Betty Dall [Wed, 19 Mar 2014 20:59:46 +0000 (14:59 -0600)]
rasdaemon: Make record option dependent on HAVE_SQLITE3
The record option in parse_opt() can be made a compile-time option guarded
by HAVE_SQLITE3, since that macro is already used in the corresponding
argp_option structure.
Signed-off-by: Betty Dall <betty.dall@hp.com>
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Mauro Carvalho Chehab [Sun, 16 Feb 2014 10:56:05 +0000 (19:56 +0900)]
Change version to 0.5.0
As this version has a new feature, name it as 0.5.0.
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Junliang Li [Thu, 13 Feb 2014 02:39:53 +0000 (10:39 +0800)]
add abrt support for rasdaemon
Adds abrt as another error reporting mechanism for rasdaemon.
This patch does the following:
1) reads ras events (mc, mce and aer)
2) sets up an abrt-server unix socket
3) writes messages following the ABRT server protocol, putting the event
info into the backtrace zone.
4) commits the report.
For now, it depends on ABRT to limit report floods.
Signed-off-by: Junliang Li <lijunliang.dna@gmail.com>
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Mauro Carvalho Chehab [Thu, 13 Feb 2014 20:11:26 +0000 (05:11 +0900)]
mce-amd-k8.c: fix a warning
mce-amd-k8.c: In function ‘bank_name’:
mce-amd-k8.c:250:22: warning: argument to ‘sizeof’ in ‘snprintf’ call is the same expression as the destination; did you mean to provide an explicit length? [-Wsizeof-pointer-memaccess]
snprintf(buf, sizeof(buf), "%s (bank=%d)", s, e->bank);
^
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
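The warning above arises because sizeof applied to a pointer gives the size of the pointer, not of the buffer it points to. One common way to avoid it is to pass the buffer length explicitly, sketched here with a hypothetical format_bank() helper; the actual fix applied to mce-amd-k8.c is not shown in this log and may differ.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical variant of a bank-name formatter: the caller supplies
 * the real buffer size, because sizeof(buf) on a "char *" parameter
 * would only give the size of the pointer. */
int format_bank(char *buf, size_t len, const char *s, int bank)
{
	return snprintf(buf, len, "%s (bank=%d)", s, bank);
}
```

A caller that owns an array can then write `format_bank(name, sizeof(name), s, e->bank)`, where sizeof is applied to the array itself.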
Mauro Carvalho Chehab [Wed, 12 Feb 2014 23:25:15 +0000 (08:25 +0900)]
README: describe the location of the main repositories
As there may be other copies of rasdaemon on the net, add the
location of the main ones.
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Mauro Carvalho Chehab [Wed, 12 Feb 2014 23:13:18 +0000 (08:13 +0900)]
Update README to reflect the patch submission process
That helps to better document how to contribute with code.
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Mauro Carvalho Chehab [Tue, 10 Sep 2013 16:22:42 +0000 (13:22 -0300)]
Bump to version 0.4.2
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Mauro Carvalho Chehab [Thu, 15 Aug 2013 20:13:43 +0000 (17:13 -0300)]
ras-mc-ctl: Fix the DIMM layout display
The items weren't being presented in the right order. Fix it.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Thu, 15 Aug 2013 16:26:03 +0000 (13:26 -0300)]
contrib/edac-tests: Make it work without edac-utils
There were a few traces of edac-utils and an older version of
the EDAC trace on this script. Remove them, and change it to
0755 mode.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Thu, 15 Aug 2013 15:58:02 +0000 (12:58 -0300)]
Add an example of labels file
This is an example of a labels file for a Dell PowerEdge T620.
For now, only DIMMs A1 and B1 are tested here.
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Mauro Carvalho Chehab [Thu, 15 Aug 2013 15:45:18 +0000 (12:45 -0300)]
ras-mc-ctl: Fix label register with 2 layers
When there aren't 3 layers, label print/register weren't working.
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Mauro Carvalho Chehab [Thu, 15 Aug 2013 15:43:02 +0000 (12:43 -0300)]
ras-mc-ctl: Improve parser
Accept either . or : as layers separator at config files.
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Mauro Carvalho Chehab [Tue, 4 Jun 2013 10:41:58 +0000 (07:41 -0300)]
Makefile.am: fix build if rpmbuild was never called before
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Mon, 3 Jun 2013 13:57:02 +0000 (10:57 -0300)]
TODO: Update it with the current issues
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Fri, 31 May 2013 19:40:40 +0000 (16:40 -0300)]
ras-mc-ctl: Fix the name of the error table data
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Fri, 31 May 2013 19:16:44 +0000 (16:16 -0300)]
ras-mc-ctl: report errors also for PCIe AER and MCE
Show also PCIe AER and MCE when used with --errors parameter.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Fri, 31 May 2013 17:57:54 +0000 (14:57 -0300)]
ras-mc-ctl: add summary for MCE and PCIe AER errors
Report the summary also for MCE and PCIe errors.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Fri, 31 May 2013 17:18:24 +0000 (14:18 -0300)]
Add support to store MCE events at the database
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Fri, 31 May 2013 16:54:11 +0000 (13:54 -0300)]
Add support to record AER events
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Fri, 31 May 2013 16:53:18 +0000 (13:53 -0300)]
ras-record: Make the code easier to add support for other tables
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Fri, 31 May 2013 16:51:55 +0000 (13:51 -0300)]
ras-record: reorder functions
No functional changes
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Fri, 31 May 2013 16:10:16 +0000 (13:10 -0300)]
ras-record: rename stmt to stmt_mc_event
This stmt is used only for mc_event. So, rename it, as we'll be
adding other stmts for the other tables.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Fri, 31 May 2013 15:41:01 +0000 (12:41 -0300)]
ras-record: make the code more generic
Now that we're ready to add more tables to the database, make
the code that creates and inserts data into the table more
generic.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Thu, 30 May 2013 00:53:58 +0000 (21:53 -0300)]
ras-mc-ctl: Improve error summary to show label and mc
Both pieces of information are useful to users, even in the summary.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Wed, 29 May 2013 15:04:29 +0000 (12:04 -0300)]
Update rasdaemon.spec.in
This is exactly what should be used for Fedora.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Wed, 29 May 2013 14:57:21 +0000 (11:57 -0300)]
Create directories via install target
As the dirs will be created via install target, we may cleanup the
rpm spec model file.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Wed, 29 May 2013 14:33:11 +0000 (11:33 -0300)]
Makefile.am: honour destdir at the local install target
That avoids building errors like:
/bin/sh /builddir/build/BUILD/rasdaemon-0.4.1/install-sh -d "/var/lib/rasdaemon"
mkdir: cannot create directory '/var/lib/rasdaemon': Permission denied
mkdir: cannot create directory '/var/lib/rasdaemon': Permission denied
When building for a distro package.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Wed, 29 May 2013 14:10:44 +0000 (11:10 -0300)]
Bump to version 0.4.1
The sqlite3 bugfix is important enough to deserve a version.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Wed, 29 May 2013 14:03:04 +0000 (11:03 -0300)]
README: update to reflect the need of perl DBI sqlite
This is now needed by ras-mc-ctl.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Wed, 29 May 2013 13:59:43 +0000 (10:59 -0300)]
Makefile.am: create ${prefix}/var/lib/rasdaemon on install
rasdaemon -r requires that directory to be created, otherwise,
sql open will fail.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Wed, 29 May 2013 12:33:45 +0000 (09:33 -0300)]
ras-mc-ctl: add support for querying the errors
As the mc_event table is filled by rasdaemon, we need a tool to
extract data from it.
So, use the existing perl script for the basic queries.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Wed, 29 May 2013 10:41:30 +0000 (07:41 -0300)]
ras-record: use sqlite3_reset to allow reusing the prepared statement
Instead of using sqlite3_finalize, we should use sqlite3_reset, or
otherwise the prepared statement will be de-allocated.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Wed, 29 May 2013 10:40:46 +0000 (07:40 -0300)]
rasdaemon.spec.in: Require sqlite-devel
This library is needed on builds when --enable-sqlite3 is used.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Tony Luck [Tue, 28 May 2013 18:20:36 +0000 (11:20 -0700)]
ras-events: Fence-post error when reporting number of cpus we listen to
I see:
rasdaemon: Listening to events for cpus 0 to 64
which would be 65 total cpus - I only have 64.
Fix the log message to use "n_cpus - 1" rather than "n_cpus".
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
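The off-by-one described above comes down to the fact that CPUs are numbered from 0, so the highest index is n_cpus - 1. A small sketch of the corrected message, using a hypothetical format_cpu_range() helper rather than rasdaemon's own logging call:

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper that builds the log line for n_cpus CPUs. */
int format_cpu_range(char *buf, size_t len, int n_cpus)
{
	/* CPUs are numbered from 0, so the last one is n_cpus - 1. */
	return snprintf(buf, len, "Listening to events for cpus 0 to %d",
			n_cpus - 1);
}
```

With n_cpus set to 64, the message ends in "cpus 0 to 63".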
Mauro Carvalho Chehab [Tue, 28 May 2013 18:10:05 +0000 (15:10 -0300)]
Add a tool to automate releasing new versions
This small script automates the process of building newer
versions of the tool.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Tue, 28 May 2013 18:09:29 +0000 (15:09 -0300)]
Replace some hard-coded strings by the autotools macro names
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Tue, 28 May 2013 18:00:22 +0000 (15:00 -0300)]
Bump version to 0.4.0
There are too many changes already. Bump it to version 0.4.0.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Tue, 28 May 2013 17:58:36 +0000 (14:58 -0300)]
ras-events: parse errors at select_tracing_timestamp()
This fixes the following warnings:
ras-events.c: In function 'select_tracing_timestamp':
ras-events.c:501:6: warning: ignoring return value of 'read', declared with attribute warn_unused_result [-Wunused-result]
ras-events.c:531:8: warning: ignoring return value of 'fscanf', declared with attribute warn_unused_result [-Wunused-result]
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Tue, 28 May 2013 17:08:07 +0000 (14:08 -0300)]
Store RAS sqlite3 db file on a proper place
Instead of creating it in the directory from which rasdaemon is
called, put it in the ${prefix}/var/lib/rasdaemon directory.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Tue, 28 May 2013 14:37:50 +0000 (11:37 -0300)]
ras-events: use sysconf to get the number of CPU's
There are several "per-cpu" files in sysfs that seem to be
utterly bogus, as trying to poll them just returns POLLERR.
Let's use sysconf() instead to get the number of CPUs, avoiding
that bug.
Not sure if this would work with hotplugged CPUs, though, so
let's preserve the old code there, for now.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
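A minimal sketch of querying the CPU count through sysconf(). It assumes the _SC_NPROCESSORS_CONF selector (available on Linux/glibc); the log does not say which selector rasdaemon uses, and both the count_cpus() name and the fallback to 1 are choices made for this example only.

```c
#include <unistd.h>

/* Returns the number of configured CPUs as reported by sysconf().
 * Falls back to 1 if the query fails (an assumption of this sketch). */
long count_cpus(void)
{
	long n = sysconf(_SC_NPROCESSORS_CONF);

	return n > 0 ? n : 1;
}
```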
Mauro Carvalho Chehab [Tue, 28 May 2013 11:47:57 +0000 (08:47 -0300)]
ras-events: Only use pthreads for collect if poll() not available
Before kernel 3.10, one pthread per cpu was used, as the code
would need to run an endless loop in order to get events.
With kernel 3.10 and later, we can simply use poll() there.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Tue, 28 May 2013 11:13:17 +0000 (08:13 -0300)]
ras-mce-handler: change the test order to avoid leaked memory
As getdelim allocates memory, it is better to swap the
tests; otherwise the code will allocate some memory that
will never be de-allocated.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Tue, 28 May 2013 10:47:53 +0000 (07:47 -0300)]
ras-mce-handler: Fix /proc/cpuinfo parser
The test for the parsing completion is wrong. Fix it.
While here, change the namespace to avoid later
conflicts.
Reported-by: Chen Gong <gong.chen@intel.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Mon, 27 May 2013 21:19:08 +0000 (18:19 -0300)]
ras-mce-handler: Fix a warning
ras-mce-handler.c: In function ‘register_mce_handler’:
ras-mce-handler.c:200:13: warning: ‘mce’ may be used uninitialized in this function [-Wuninitialized]
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Mon, 27 May 2013 20:47:15 +0000 (17:47 -0300)]
Enable MCE parsing at RPM files
As this is known to work, enable it.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Mon, 27 May 2013 20:46:56 +0000 (17:46 -0300)]
README: update to reflect the current status
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Mon, 27 May 2013 20:26:04 +0000 (17:26 -0300)]
Update TODO list
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Mon, 27 May 2013 20:23:48 +0000 (17:23 -0300)]
mce-intel-sb: add memory controller decoding
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Mon, 27 May 2013 20:19:11 +0000 (17:19 -0300)]
Add support to decode memory controller data on Nehalem
The xeon75xx code can be dropped, as it isn't used in mcelog
anyway. According to the code there, the kernel lacks the support
it needs to work.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Mon, 27 May 2013 19:46:12 +0000 (16:46 -0300)]
mce-intel: Enable iMC log where available
Add code to enable iMC log where available.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Mon, 27 May 2013 18:50:51 +0000 (15:50 -0300)]
mce-intel-ivb: enable the code that parses memory controller errors
Enable the code that parses the memory controller errors.
This code assumes that iMC log is already enabled.
A later patch will add support for enabling it.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Tony Luck [Fri, 24 May 2013 16:55:40 +0000 (09:55 -0700)]
spelling: Fix spelling in ras-record.c
s/interted/inserted/
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Tony Luck [Fri, 24 May 2013 16:29:06 +0000 (09:29 -0700)]
configure: Fix help string for sqlite3
The AS_HELP_STRING has a typo and says to use "--enable-sqlite" when
it should say "--enable-sqlite3".
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Fri, 24 May 2013 14:21:32 +0000 (11:21 -0300)]
mce: Some improvements at the output format
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Fri, 24 May 2013 11:21:51 +0000 (08:21 -0300)]
ras-mce-handler: fix /proc/cpuinfo parser
The scanf parsers for /proc/cpuinfo were broken, as they
got a "mce->" prefix by mistake. Remove it to fix.
With that, MCE parser will successfully register.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Fri, 24 May 2013 11:18:48 +0000 (08:18 -0300)]
event-parse: Remove a temporary debug message
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Fri, 24 May 2013 11:16:57 +0000 (08:16 -0300)]
Don't require all tracing types to be supported
Not all systems support all 3 types of RAS (EDAC, PCIe AER, MCELOG).
Don't bail out if at least one of them is supported.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Fri, 24 May 2013 10:37:06 +0000 (07:37 -0300)]
Update edac-tests to use ras-mc-ctl instead of ./edac-ctl
All functionality previously found in my test version of
edac-ctl is present in ras-mc-ctl. So, let's rename it.
The test code still tries to run edac-util. This tool,
which is part of edac-utils, uses the edac error counters to
check the errors. For now, let's keep it, as it might be useful,
although this will likely be removed on future versions of this
testing script.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Fri, 24 May 2013 09:18:54 +0000 (06:18 -0300)]
ras-events: Fix the logic that retrieves the debugfs mount point
While on Fedora/RHEL the mount device for debugfs is called "debugfs",
it is usual to use "none" on some other distros or for manually
mounted debugfs.
So, fix the logic to look at the filesystem type instead, as it should
always be "debugfs" in both cases.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
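The idea of matching on the filesystem type rather than the device name can be sketched as below. This is an illustration, not the rasdaemon implementation: debugfs_mount_from_line() is a hypothetical helper that inspects a single /proc/mounts line, and it ignores the octal escapes that /proc/mounts uses for unusual characters in paths.

```c
#include <stdio.h>
#include <string.h>

/* Parses one /proc/mounts line ("device mountpoint fstype options ...").
 * If the filesystem type is "debugfs", copies the mount point into out
 * and returns 1. Returns 0 for any other line, for a malformed line, or
 * if out is too small to hold the mount point. */
int debugfs_mount_from_line(const char *line, char *out, size_t len)
{
	char dev[256], dir[256], type[64];

	if (sscanf(line, "%255s %255s %63s", dev, dir, type) != 3)
		return 0;
	if (strcmp(type, "debugfs") != 0)
		return 0;
	if (strlen(dir) >= len)
		return 0;

	strcpy(out, dir);
	return 1;
}
```

Because only the third field is compared, a line starting with "none" and a line starting with "debugfs" are both accepted as long as the type column reads "debugfs".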
Tony Luck [Thu, 23 May 2013 20:27:31 +0000 (13:27 -0700)]
ras-record: Avoid NULL pointer when running without sqlite
When running "rasdaemon -f" we can dereference a NULL pointer in
ras_store_mc_event() since "ras->db_priv" is NULL.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Thu, 23 May 2013 19:42:08 +0000 (16:42 -0300)]
ras-events: Fix MCE binding
The #ifdef for detecting MCE was wrong. Due to that, the MCE
handler was not being enabled.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Thu, 23 May 2013 19:37:54 +0000 (16:37 -0300)]
Make the enable function more generic
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Thu, 23 May 2013 17:58:21 +0000 (14:58 -0300)]
Get rid of ras-record warnings
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Thu, 23 May 2013 17:44:36 +0000 (14:44 -0300)]
get rid of MCE warnings
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Thu, 23 May 2013 17:26:07 +0000 (14:26 -0300)]
Cleanup warnings at ras-aer-handler.c
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Thu, 23 May 2013 16:35:07 +0000 (13:35 -0300)]
Fix event handler parser logic
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Thu, 23 May 2013 14:48:02 +0000 (11:48 -0300)]
ras-events: Add some hacks to make it work with 3.6.10-rc2
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Thu, 23 May 2013 14:07:29 +0000 (11:07 -0300)]
libtrace: sync with the latest code from trace-cmd
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Thu, 23 May 2013 13:24:03 +0000 (10:24 -0300)]
edac-fake-inject: Check if the Kernel supports error injection
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Thu, 23 May 2013 12:35:38 +0000 (09:35 -0300)]
Get rid of mc_event_error_type
Somehow, the tracing library is not finding it on some systems:
overriding event (710) ras:mc_event with new print handler
trace-cmd: File exists
function mc_event_error_type not defined
Let's just get rid of it.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Thu, 23 May 2013 12:09:19 +0000 (09:09 -0300)]
Better handle parser errors with MC events
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Thu, 23 May 2013 12:01:10 +0000 (09:01 -0300)]
edac-fake-inject: Make it more generic
The tool used to support only 2 or 3 layer memory controllers,
failing with the edac_ghes driver. Make it more generic to also work
there.
Also, don't assume that the SYSFS is mounted at /sys/kernel/debug,
but look at its mount location via /proc/mounts.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>