www.infradead.org Git - users/mchehab/rasdaemon.git/log
Mauro Carvalho Chehab [Thu, 8 Jun 2017 09:29:49 +0000 (06:29 -0300)]
Bump to version 0.5.9
Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
Mauro Carvalho Chehab [Thu, 8 Jun 2017 09:46:11 +0000 (06:46 -0300)]
rasdaemon.spec.in: update it to reflect current needs
Keep it more or less in sync with the Fedora version of it,
in order to allow it to be built with the new-ver.sh script.
Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
Aristeu Rozanski [Thu, 4 May 2017 18:02:53 +0000 (14:02 -0400)]
rasdaemon: add Knights Mill model
Knights Mill is similar to Knights Landing and can use the same code.
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
Charles.Rose@dell.com [Tue, 6 Jun 2017 21:42:21 +0000 (21:42 +0000)]
rasdaemon: Update DIMM labels for Dell Servers
Updated to include Dell PowerEdge Servers that are current.
Note the use of Product field instead of Model. Tested on
multiple Dell PowerEdge servers.
Signed-off-by: Charles Rose <charles_rose@dell.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
Mauro Carvalho Chehab [Thu, 8 Jun 2017 09:05:48 +0000 (06:05 -0300)]
configure.ac: report enabled features
We're starting to have too many optional features. Report
what options are enabled at the end of ./configure output.
Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
Mauro Carvalho Chehab [Tue, 14 Mar 2017 12:32:12 +0000 (09:32 -0300)]
Update it to point to the new repository
Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
Mauro Carvalho Chehab [Fri, 15 Apr 2016 10:07:11 +0000 (07:07 -0300)]
Bump version to 0.5.8
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Aristeu Rozanski [Fri, 8 Apr 2016 19:07:19 +0000 (15:07 -0400)]
Add Broadwell EP/EX MSCOD values
Based on mcelog commit id
32252e9c37e97ea5083d90d2cf194bb85a4a0cda.
Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Aristeu Rozanski [Fri, 8 Apr 2016 19:07:18 +0000 (15:07 -0400)]
Add Broadwell DE MSCOD values
Based on mcelog commit id
32252e9c37e97ea5083d90d2cf194bb85a4a0cda.
Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Mauro Carvalho Chehab [Fri, 5 Feb 2016 17:24:42 +0000 (15:24 -0200)]
Bump version to 0.5.7
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Mauro Carvalho Chehab [Fri, 5 Feb 2016 17:15:18 +0000 (15:15 -0200)]
mce-intel-knl: Fix CodingStyle
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Marcin Koss [Thu, 3 Dec 2015 14:19:47 +0000 (15:19 +0100)]
rasdaemon: Add support for Knights Landing processor
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Seiichi Ikarashi [Tue, 29 Sep 2015 01:46:23 +0000 (10:46 +0900)]
rasdaemon: Add model numbers for Broadwell-EP/EX and -DE
Based on mcelog code.
Signed-off-by: Seiichi Ikarashi <s.ikarashi@jp.fujitsu.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Aristeu Rozanski [Mon, 10 Aug 2015 18:24:41 +0000 (14:24 -0400)]
rasdaemon: fix typos on ras-mc-ctl man page
Fixed two markers and two typos in the documentation.
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Mauro Carvalho Chehab [Fri, 3 Jul 2015 10:35:14 +0000 (07:35 -0300)]
Bump version to 0.5.6
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Seiichi Ikarashi [Wed, 17 Jun 2015 10:56:57 +0000 (07:56 -0300)]
rasdaemon: add internal errors of IA32_MC4_STATUS for Haswell
Currently rasdaemon appears to purposely omit the internal errors of
IA32_MC4_STATUS for Haswell-family processors, which are described in
Intel SDM vol3 Table 16-20. I think it's better to show these errors
because mcelog does show them.
Signed-off-by: Seiichi Ikarashi <s.ikarashi@jp.fujitsu.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Seiichi Ikarashi [Fri, 12 Jun 2015 09:35:37 +0000 (06:35 -0300)]
rasdaemon: use MCA error msg as error_msg
In the case of machine checks which do not have a model-specific MCA
error code but only an architectural one, mce_event.error_msg
becomes empty, so you cannot tell what happened.
(snip)
MCE records summary:
1 errors
  ^
  empty!
(snip)
MCE events:
1 2015-06-12 00:21:46 +0900 error: , mcg mcgstatus= 0, mci Corrected_error Error_enabled, mcgcap=0x07000c16, status=0x9c0000000000017a, addr=0x204fffffff, misc=0x4004000000000080, walltime=0x557b0db2, cpu=0x00000001, cpuid=0x000306f3, apicid=0x00000002, bank=0x00000003
                                   ^
                                   empty!
In such a case, let's use the content of mcastatus_msg as error_msg
instead.
(snip)
MCE records summary:
1 Generic CACHE Level-2 Eviction Error errors
(snip)
MCE events:
1 2015-06-12 02:39:04 +0900 error: Generic CACHE Level-2 Eviction Error, mcg mcgstatus= 0, mci Corrected_error Error_enabled, mcgcap=0x07000c16, status=0x9c0000000000017a, addr=0x204fffffff, misc=0x4004000000000080, walltime=0x557b1f22, cpu=0x00000001, cpuid=0x000306f3, apicid=0x00000002, bank=0x00000003
Signed-off-by: Seiichi Ikarashi <s.ikarashi@jp.fujitsu.com>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
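The fallback described in the commit above can be sketched in C as follows. This is a minimal illustration, not the actual rasdaemon code: the helper name pick_error_msg() and its parameters are hypothetical; only the field names error_msg and mcastatus_msg come from the commit message.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: prefer the model-specific message when there is
 * one, otherwise fall back to the architectural MCA status message.
 */
static const char *pick_error_msg(const char *error_msg,
                                  const char *mcastatus_msg)
{
    assert(mcastatus_msg != NULL);

    if (error_msg != NULL && error_msg[0] != '\0')
        return error_msg;       /* model-specific decode is available */
    return mcastatus_msg;       /* architectural code only */
}
```

With an empty error_msg the caller gets the mcastatus_msg text back, which matches the "Generic CACHE Level-2 Eviction Error" output shown in the commit message.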
Seiichi Ikarashi [Wed, 10 Jun 2015 23:49:55 +0000 (20:49 -0300)]
rasdaemon: unnecessary comma for empty mc_location string
In /var/log/messages, rasdaemon sometimes prints an unnecessary
comma ", " between mca= and cpu_type= like below:
Jun 9 02:44:39 localhost rasdaemon: <...>-4585 [1638893312] 1031.109000: mce_record: 2015-06-08 10:07:28 +0900 bank=3, status= 9c0000000000017a, mci=Corrected_error Error_enabled, mca=Generic CACHE Level-2 Eviction Error, , cpu_type= Intel Xeon v3 (Haswell) EP/EX, cpu= 1, socketid= 0, misc= 4004000000000080, addr= 204fffffff, mcgstatus= 0, mcgcap= 7000c16, apicid= 2
That's the comma for mc_location which is printed even if mc_location is
empty due to a wrong if condition.
Signed-off-by: Seiichi Ikarashi <s.ikarashi@jp.fujitsu.com>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
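A minimal C sketch of the idea behind the commit above: emit the ", " separator only when the location string is non-empty. The function name format_mca() and the output layout are hypothetical. One common way to get such a condition wrong in C (an assumption here, the commit message does not say) is to test a char array itself, which is always true, instead of its first character.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical formatter: append ", <mc_location>" only when the
 * location string is non-empty, so that no stray ", " is printed.
 */
static int format_mca(char *buf, size_t len,
                      const char *mca, const char *mc_location)
{
    assert(buf != NULL && mca != NULL);

    if (mc_location != NULL && strlen(mc_location) > 0)
        return snprintf(buf, len, "mca=%s, %s", mca, mc_location);
    return snprintf(buf, len, "mca=%s", mca);
}
```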
Seiichi Ikarashi [Wed, 10 Jun 2015 10:29:03 +0000 (07:29 -0300)]
rasdaemon: remove a space from mcgstatus_msg
"ras-mc-ctl --errors" shows an unnecessary space character in the
mcgstatus string of MCE event, like below:
2 2015-04-04 19:57:22 +0900 error: MC_HA_IMC_RW_BLOCK_ACK_TIMEOUT, mcg mcgstatus= 0, mci Corrected_error, mcgcap=0x07000c16, status=0x8000000067000e0b, walltime=0x555da140, cpu=0x00000001, cpuid=0x000306f3, apicid=0x00000002, bank=0x00000004
Let's remove it.
Signed-off-by: Seiichi Ikarashi <s.ikarashi@jp.fujitsu.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Ashok Raj [Fri, 5 Jun 2015 16:32:47 +0000 (13:32 -0300)]
x86, rasdaemon: Add support to log Local Machine Check Exception (LMCE)
Local Machine Check Exception allows certain errors to be signaled to
only the affected logical processor. This change captures them for
rasdaemon.
log: Changes to rasdaemon to support new architectural changes to MCE
Changes to rasdaemon to support new architectural extensions in Intel
CPUs.
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Mauro Carvalho Chehab [Wed, 3 Jun 2015 13:59:55 +0000 (10:59 -0300)]
Bump version to 0.5.5
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Mauro Carvalho Chehab [Wed, 3 Jun 2015 13:42:46 +0000 (10:42 -0300)]
Improve INSTALL summary instructions
Using && ensures that each command runs only if the previous one
succeeded. So, this is the recommended way.
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Aristeu Rozanski [Mon, 1 Jun 2015 20:04:00 +0000 (17:04 -0300)]
rasdaemon: add support to match the machine by system's product name
In some cases the motherboard names will change but the mapping won't
across a line of products. This patch adds support for "Product:" to be
specified in the label files instead of Model:.
An example:
Vendor: Dell Inc.
Product: PowerEdge R610
DIMM_A1: 0.0.0; DIMM_A2: 0.0.1; DIMM_A3: 0.0.2;
DIMM_A4: 0.1.0; DIMM_A5: 0.1.1; DIMM_A6: 0.1.2;
DIMM_B1: 1.0.0; DIMM_B2: 1.0.1; DIMM_B3: 1.0.2;
DIMM_B4: 1.1.0; DIMM_B5: 1.1.1; DIMM_B6: 1.1.2;
Would match all 'PowerEdge R610' machines.
Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Seiichi Ikarashi [Tue, 26 May 2015 14:59:39 +0000 (11:59 -0300)]
rasdaemon: make sure the error is valid before handling ranks
Fix "rank" handling according to the Bit 63 description in Intel SDM Vol.3C
Table 16-23, that says "... Use this information only after there is valid
first error info indicated by bit 62".
Also fix invalid comparisons of unsigned variables "rank0" and "rank1".
Signed-off-by: Seiichi Ikarashi <s.ikarashi@jp.fujitsu.com>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
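The validity rule from the commit above can be sketched in C as below. Bit 62 comes from the SDM text quoted in the commit; treating bit 63 as the "second error valid" flag, the rank field positions, and all names are assumptions made for illustration only. Carrying explicit has_rank flags also avoids comparing unsigned values against a negative sentinel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FIRST_ERR_VALID   (UINT64_C(1) << 62)  /* per the quoted SDM text */
#define SECOND_ERR_VALID  (UINT64_C(1) << 63)  /* assumed meaning of bit 63 */

/* HYPOTHETICAL field positions, chosen only for this example. */
#define RANK0_SHIFT 46
#define RANK1_SHIFT 51
#define RANK_MASK   0x1f

struct ranks {
    bool has_rank0, has_rank1;   /* explicit flags, no -1 sentinel */
    unsigned rank0, rank1;
};

static struct ranks decode_ranks(uint64_t misc)
{
    struct ranks r = { false, false, 0, 0 };

    /* Nothing is usable unless the first error info is valid. */
    if (misc & FIRST_ERR_VALID) {
        r.has_rank0 = true;
        r.rank0 = (unsigned)((misc >> RANK0_SHIFT) & RANK_MASK);
        /* Second error info is only meaningful after the first. */
        if (misc & SECOND_ERR_VALID) {
            r.has_rank1 = true;
            r.rank1 = (unsigned)((misc >> RANK1_SHIFT) & RANK_MASK);
        }
    }
    assert(r.has_rank0 || !r.has_rank1);
    return r;
}
```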
Seiichi Ikarashi [Tue, 26 May 2015 14:59:38 +0000 (11:59 -0300)]
rasdaemon: enable IMC status usage for Haswell-E
Enable IMC status bank for Haswell-E, as described in Intel SDM Vol.3C
Table 35-27.
Signed-off-by: Seiichi Ikarashi <s.ikarashi@jp.fujitsu.com>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Seiichi Ikarashi [Tue, 26 May 2015 14:59:37 +0000 (11:59 -0300)]
rasdaemon: add missing semicolon in hsw_decode_model()
hsw_decode_model() tries to skip decode_bitfield() if IA32_MC4_STATUS indicates
some internal errors. Unfortunately, it behaves opposite to the intention
because a semicolon is missing.
Signed-off-by: Seiichi Ikarashi <s.ikarashi@jp.fujitsu.com>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Seiichi Ikarashi [Tue, 26 May 2015 14:59:36 +0000 (11:59 -0300)]
rasdaemon: properly print message strings in decode_bitfield()
Fix decode_bitfield() so that it does print message strings from the struct
field table.
Signed-off-by: Seiichi Ikarashi <s.ikarashi@jp.fujitsu.com>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Aristeu Rozanski [Mon, 18 May 2015 17:19:33 +0000 (14:19 -0300)]
rasdaemon: add support for Knights Landing
Patch based on mcelog.
Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Aristeu Rozanski [Mon, 18 May 2015 17:19:32 +0000 (14:19 -0300)]
rasdaemon: add support for Broadwell
Only basic support for now.
Based on mcelog code.
Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Aristeu Rozanski [Mon, 18 May 2015 17:19:31 +0000 (14:19 -0300)]
rasdaemon: Identify Ivy Bridge properly
This patch is based on
b29cc4d615cead87cbc163ada0645b10c5b1217d (mcelog)
mcelog: Identify Ivy Bridge properly
Uniquely identify Ivy Bridge even though the machine checks are the same
for Sandy Bridge and Ivy Bridge. This makes the output for the processor
display "Ivy Bridge".
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Cc: tony.luck@intel.com
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Aristeu Rozanski [Mon, 18 May 2015 17:19:30 +0000 (14:19 -0300)]
rasdaemon: Add missing entry to Ivy Bridge memory controller decode table
This patch is based on
2577aeb662374cb87169ee675b2e37c06f1aed99 (mcelog)
mcelog: Add missing entry to Ivy Bridge memory controller decode table
September 2013 edition of the software developer manual added an
entry that had been inadvertently omitted from earlier editions.
Add the 0x80 entry for "Corrected memory read error".
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Aristeu Rozanski [Mon, 18 May 2015 17:19:29 +0000 (14:19 -0300)]
rasdaemon: decode new simple error code number 6
This patch was based on
fa313dd0144596dfa140bd66805367250d6eae9b
(mcelog)
mcelog: Decode new simple error code number 6
Edition 050 of the Intel SDM released in late February 2014
includes a new simple error code in "Table 15-8. IA32_MCi_Status
[15:0] Simple Error Code Encoding". Code 6 (0000 0000 0000 0110)
has been allocated for the reporting of cases where the BIOS SMM
code attempts to execute code outside of the protected SMRR area.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
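As a rough C illustration of the commit above: the simple error code lives in bits 15:0 of the status value, and code 6 is the newly allocated one. The function name is hypothetical and the returned string paraphrases the commit text rather than quoting the SDM or the actual decode table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical decoder for the one new simple error code.
 * Returns a description for code 6, or NULL for anything else.
 */
static const char *decode_simple_code_6(uint64_t status)
{
    uint16_t code = (uint16_t)(status & 0xffff);   /* bits 15:0 */

    if (code == 0x0006)
        return "SMM code executed outside of the protected SMRR area";
    return NULL;
}
```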
Aristeu Rozanski [Mon, 18 May 2015 17:19:28 +0000 (14:19 -0300)]
rasdaemon: add support for Haswell
Based on mcelog code.
Acked-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Mauro Carvalho Chehab [Fri, 15 Aug 2014 22:15:47 +0000 (19:15 -0300)]
Bump version to 0.5.4
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Aristeu Rozanski [Fri, 15 Aug 2014 17:50:58 +0000 (13:50 -0400)]
rasdaemon: do not assume dimmX/ directories will be present
While finding the labels, size and location, ras-mc-ctl will search /sys for
the files and calculate the location. When it uses the location trying to map
back to files to print labels or write labels, it'll just assume dimm*
directories exist which is not correct while using drivers like amd64_edac.
This patch adds two new hashes to store the location and the label file path
so it can be used later.
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Aristeu Rozanski [Mon, 21 Jul 2014 20:23:18 +0000 (16:23 -0400)]
rasdaemon: enable recording by default in service file
This patch changes the service file to enable the tracing events after
the daemon is started and starts the daemon recording events by default.
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Aristeu Rozanski [Mon, 21 Jul 2014 19:25:40 +0000 (15:25 -0400)]
rasdaemon: correct range while parsing top, middle and lower layers
{top,middle,lower}_layer are signed char, therefore will never be 255.
Reference: https://bugzilla.redhat.com/show_bug.cgi?id=1035746
Tested in a GHES enabled machine using EINJ.
v2: no need to test ranges at all
Reported-by: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
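Why a signed char can never be 255, as the commit above notes, can be shown with a small C sketch. The helper name layer_to_int() is hypothetical. The point is that widening a signed char to int keeps its sign, so the result always lies between SCHAR_MIN and SCHAR_MAX (127 with 8-bit chars) and a comparison against 255 cannot succeed.

```c
#include <assert.h>
#include <limits.h>

/*
 * Hypothetical helper: widen a layer index to int. The conversion
 * preserves the sign, so -1 stays -1 and never turns into 255.
 */
static int layer_to_int(signed char layer)
{
    int v = layer;

    assert(v >= SCHAR_MIN && v <= SCHAR_MAX);
    return v;
}
```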
Mauro Carvalho Chehab [Sun, 10 Aug 2014 14:04:10 +0000 (11:04 -0300)]
Bump version to 0.5.3
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Mauro Carvalho Chehab [Sun, 10 Aug 2014 15:51:04 +0000 (12:51 -0300)]
Add a target to build rasdaemon with mock
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Mauro Carvalho Chehab [Sun, 10 Aug 2014 15:47:21 +0000 (12:47 -0300)]
Add an option to build the srpm
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Luck, Tony [Mon, 4 Aug 2014 20:29:01 +0000 (13:29 -0700)]
rasdaemon: Add support for extlog trace events
Linux kernel 3.17 includes a new trace event to pick up extended
error logs produced by BIOS in the Common Platform Error Record
format described in appendix N of the UEFI standard. This patch
adds support to collect that information and log it both in
readable ASCII and into the sqlite3 database that rasdaemon
uses to store all error information. In addition ras-mc-ctl
is updated to query that database for both detailed and summary
reports.
Big thanks to Aristeu for pretty much all the sqlite3 pieces,
plus testing and fixing miscellaneous issues elsewhere.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Aristeu Rozanski [Tue, 24 Jun 2014 15:01:31 +0000 (11:01 -0400)]
rasdaemon: handle failures of snprintf()
Florian Weimer found that in bitfield_msg() the return value of
snprintf() is used to calculate length ignoring that it can return a
negative number. This patch makes bitfield_msg() stop writing in such a
case.
Reference: https://bugzilla.redhat.com/show_bug.cgi?id=1035741
Reported-by: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
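A minimal C sketch of the pattern the commit above describes: check the return value of snprintf() before using it to advance a write position. The helper name append_str() and its interface are hypothetical and are not the actual bitfield_msg() code. It treats both a negative return and truncation as reasons to stop writing.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: append s to buf at offset *pos.
 * Returns 0 on success, -1 on an snprintf() error or truncation.
 */
static int append_str(char *buf, size_t len, size_t *pos, const char *s)
{
    int n;

    assert(buf != NULL && pos != NULL && s != NULL);

    if (*pos >= len)
        return -1;                   /* no room left at all */

    n = snprintf(buf + *pos, len - *pos, "%s", s);
    if (n < 0)
        return -1;                   /* output error: do not advance */
    if ((size_t)n >= len - *pos) {
        *pos = len - 1;              /* truncated: buffer is now full */
        return -1;
    }
    *pos += (size_t)n;
    return 0;
}
```

A caller would stop appending as soon as the helper returns -1, rather than adding a possibly negative length to its offset.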
Xie XiuQi [Thu, 8 May 2014 12:07:19 +0000 (20:07 +0800)]
rasdaemon: fix mce numfield decoded error
Some fields are missing in mce decode information, as below:
...
rasdaemon: register inserted at db
<...>-31568 [000] 4023.214080: mce_record:
2014-05-07 15:51:16 +0800 bank=2, status= bd000000000000c0, MEMORY
CONTROLLER MS_CHANNEL0_ERR Transaction: Memory scrubbing error %s: %Lu
%s: %Lx
%s: %Lx
%s: %Lu
%s: %Lu
%s: %Lx
, mci=Uncorrected_error Error_enabled SRAO, n_errors=0 channel=0,
dimm=0, cpu_type= Intel Xeon 5500 series / Core i3/5/7
("Nehalem/Westmere"), cpu= 0, socketid= 0, ip= 1eadbabe (INEXACT), cs= 73,
misc= 8c, addr= 62b000, mcgstatus= 5 RIPV MCIP, mcgcap= 1c09,
apicid= 0
"f->name" and "v" are not passed to the print call in decode_numfield(), so fix it.
Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
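The literal "%s: %Lu" text in the log above is what appears when a format string is emitted without its arguments. A minimal C sketch of the corrected shape follows, with the field name and value passed through to the formatter. The function name format_numfield() and its signature are hypothetical, and the standard %llu/%llx conversions are used here instead of %Lu/%Lx.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical formatter: print "<name>: <value>" in decimal or hex.
 * Both the name and the value are passed as arguments, so the format
 * specifiers are expanded instead of being printed literally.
 */
static int format_numfield(char *buf, size_t len, const char *name,
                           unsigned long long v, int hex)
{
    assert(buf != NULL && name != NULL);

    if (hex)
        return snprintf(buf, len, "%s: %llx\n", name, v);
    return snprintf(buf, len, "%s: %llu\n", name, v);
}
```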
Luck, Tony [Mon, 7 Apr 2014 18:27:47 +0000 (11:27 -0700)]
rasdaemon: sqlite truncates some MCE fields to 32-bit
The sqlite3_bind_int() function takes an "int" as the argument value to
save to the database. But some fields are wider than 32-bits. Use
sqlite3_bind_int64() for the fields where we know values can exceed
4G.
Before:
# ./rasdaemon/util/ras-mc-ctl --errors
...
MCE events:
1 2014-04-04 08:50:32 -0700 error: MEMORY CONTROLLER RD_CHANNEL0_ERR Transaction: Memory read error, mcg mcgstatus= 0, mci Corrected_error, mcgcap=0x07000c16, status=0x00010090, addr=0x35fcb9c0, misc=0x5026a686, walltime=0x5342e4f9, cpu=0x0000000e, cpuid=0x000306f1, apicid=0x00000020, socketid=0x00000001, bank=0x00000008
2 2014-04-04 08:50:35 -0700 error: MEMORY CONTROLLER RD_CHANNEL0_ERR Transaction: Memory read error, mcg mcgstatus= 0, mci Corrected_error, mcgcap=0x07000c16, status=0x00010090, addr=0x4187adc0, misc=0x4274f486, walltime=0x5342e4fc, cpu=0x0000000e, cpuid=0x000306f1, apicid=0x00000020, socketid=0x00000001, bank=0x00000007
3 2014-04-04 08:50:37 -0700 error: MEMORY CONTROLLER RD_CHANNEL0_ERR Transaction: Memory read error, mcg mcgstatus= 0, mci Corrected_error, mcgcap=0x07000c16, status=0x00010090, addr=0x52efc600, misc=0x50028286, walltime=0x5342e4fd, cpu=0x0000000e, cpuid=0x000306f1, apicid=0x00000020, socketid=0x00000001, bank=0x00000008
After:
./rasdaemon/util/ras-mc-ctl --errors
...
1 2014-04-04 09:00:07 -0700 error: MEMORY CONTROLLER RD_CHANNEL0_ERR Transaction: Memory read error, mcg mcgstatus= 0, mci Corrected_error, mcgcap=0x07000c16, status=0x8c00004000010090, addr=0x45340a180, misc=0x140686886, walltime=0x5342e736, cpuid=0x000306f1, bank=0x00000008
2 2014-04-04 09:00:08 -0700 error: MEMORY CONTROLLER RD_CHANNEL0_ERR Transaction: Memory read error, mcg mcgstatus= 0, mci Corrected_error, mcgcap=0x07000c16, status=0x8c00004000010090, addr=0x44d6e4780, misc=0x15060e086, walltime=0x5342e737, cpuid=0x000306f1, bank=0x00000007
3 2014-04-04 09:00:10 -0700 error: MEMORY CONTROLLER RD_CHANNEL0_ERR Transaction: Memory read error, mcg mcgstatus= 0, mci Corrected_error, mcgcap=0x07000c16, status=0x8c00004000010090, addr=0x44cb64640, misc=0x140505086, walltime=0x5342e739, cpuid=0x000306f1, bank=0x00000008
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
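The "Before" and "After" status values above are consistent with a plain 32-bit truncation, which the following C sketch reproduces without touching SQLite at all. The helper name low32() is hypothetical; it only models what survives when a 64-bit register value is squeezed through a 32-bit int parameter.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model of the bug: keep only the low 32 bits of a
 * 64-bit MCE field, as happens when it is passed as a 32-bit int.
 */
static uint32_t low32(uint64_t v)
{
    uint32_t kept = (uint32_t)(v & UINT32_MAX);

    assert((uint64_t)kept <= v);
    return kept;
}
```

For example, low32() maps the status 0x8c00004000010090 from the "After" output to 0x00010090, the value shown in the "Before" output. Binding such fields with sqlite3_bind_int64(), as the commit does, avoids the loss.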
Luck, Tony [Mon, 7 Apr 2014 19:23:25 +0000 (12:23 -0700)]
rasdaemon: fix some typos and cut/paste errors in sqlite bits
The aer event has error_type as field 2 and msg as field 3, but the calls
to sqlite3_bind_text use 3 and 4.
The mce event forgot to declare the "mcastatus_msg" field.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Mauro Carvalho Chehab [Thu, 3 Apr 2014 11:50:45 +0000 (08:50 -0300)]
Bump version to 0.5.2
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Jakub Filak [Wed, 2 Apr 2014 13:03:44 +0000 (15:03 +0200)]
Correct ABRT report data
Remove '\0' byte from 'PUT' message because this was superfluous.
Replaced 'BASENAME' item with 'TYPE' item because the first one is no
longer supported by abrtd and the second one is required. Basically the
latter is a substitute for the former.
Removed the closing message which is not supported by abrtd. abrtd
considers that message as a part of the problem report.
Removed a superfluous space from 'Backtrace'.
Signed-off-by: Jakub Filak <jfilak@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Mauro Carvalho Chehab [Fri, 28 Mar 2014 21:36:00 +0000 (18:36 -0300)]
Bump version to 0.5.1
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Mauro Carvalho Chehab [Fri, 28 Mar 2014 21:47:41 +0000 (18:47 -0300)]
Add two new generated files to .gitignore
The service files are now auto-generated.
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Jakub Filak [Fri, 21 Feb 2014 14:54:09 +0000 (15:54 +0100)]
Make paths in the systemd services configurable
The path to a binary depends on configuration, therefore it is better to
not use hard coded strings.
Signed-off-by: Jakub Filak <jfilak@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Betty Dall [Wed, 19 Mar 2014 21:54:56 +0000 (15:54 -0600)]
ras-mc-ctl: Print useful message when run without rasdaemon -r
The utility script ras-mc-ctl requires that rasdaemon --record be run
to create the mc_event table in the SQLite database. The current behaviour
is this:
[root@sa1 util]# ras-mc-ctl --errors
DBD::SQLite::db prepare failed: no such table: mc_event at
/usr/local/sbin/ras-mc-ctl line 914.
Can't call method "execute" on an undefined value at
/usr/local/sbin/ras-mc-ctl line 915.
With this change, the user sees:
[root@sa1 util]# ras-mc-ctl --errors
DBD::SQLite::db prepare failed: no such table: mc_event at
/usr/local/sbin/ras-mc-ctl line 914.
ras-mc-ctl: Error: mc_event table missing from
/usr/local/var/lib/rasdaemon/ras-mc_event.db. Run 'rasdaemon --record'.
Signed-off-by: Betty Dall <betty.dall@hp.com>
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Betty Dall [Wed, 19 Mar 2014 20:59:47 +0000 (14:59 -0600)]
rasdaemon: Add record option to rasdaemon man page
Add the already existing rasdaemon option 'record' to the rasdaemon man
page. This option records events via sqlite3.
Signed-off-by: Betty Dall <betty.dall@hp.com>
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Betty Dall [Wed, 19 Mar 2014 20:59:46 +0000 (14:59 -0600)]
rasdaemon: Make record option dependent on HAVE_SQLITE3
The record option in parse_opt() can be made a compile-time option
guarded by HAVE_SQLITE3, since that macro is already used in the
corresponding argp_option structure.
Signed-off-by: Betty Dall <betty.dall@hp.com>
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Mauro Carvalho Chehab [Sun, 16 Feb 2014 10:56:05 +0000 (19:56 +0900)]
Change version to 0.5.0
As this version has a new feature, name it as 0.5.0.
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Junliang Li [Thu, 13 Feb 2014 02:39:53 +0000 (10:39 +0800)]
add abrt support for rasdaemon
Adds abrt as another error reporting mechanism for rasdaemon.
This patch does:
1) read ras events (mc, mce and aer)
2) set up an abrt-server unix socket
3) write messages following the ABRT server protocol, putting the event
info into the backtrace zone.
4) commit the report.
For now, it depends on ABRT to limit report floods.
Signed-off-by: Junliang Li <lijunliang.dna@gmail.com>
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Mauro Carvalho Chehab [Thu, 13 Feb 2014 20:11:26 +0000 (05:11 +0900)]
mce-amd-k8.c: fix a warning
mce-amd-k8.c: In function ‘bank_name’:
mce-amd-k8.c:250:22: warning: argument to ‘sizeof’ in ‘snprintf’ call is the same expression as the destination; did you mean to provide an explicit length? [-Wsizeof-pointer-memaccess]
snprintf(buf, sizeof(buf), "%s (bank=%d)", s, e->bank);
^
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
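The warning above fires because sizeof(buf) on a pointer yields the size of the pointer, not of the buffer it points to. A minimal C sketch of the usual remedy is to pass the buffer length explicitly. The function name format_bank() and its signature are hypothetical and do not claim to be the fix applied in mce-amd-k8.c.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical variant of the snprintf() call from the warning:
 * the caller supplies the real buffer length, because inside this
 * function buf is only a char pointer.
 */
static void format_bank(char *buf, size_t len, const char *s, int bank)
{
    assert(buf != NULL && s != NULL && len > 0);
    snprintf(buf, len, "%s (bank=%d)", s, bank);
}
```

A caller that owns a real array can still write format_bank(name, sizeof(name), ...), since sizeof is evaluated where the array type is visible.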
Mauro Carvalho Chehab [Wed, 12 Feb 2014 23:25:15 +0000 (08:25 +0900)]
README: describe the location of the main repositories
As there may be other copies of rasdaemon on the net, add the
location of the main ones.
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Mauro Carvalho Chehab [Wed, 12 Feb 2014 23:13:18 +0000 (08:13 +0900)]
Update README to reflect the patch submission process
That helps to better document how to contribute code.
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Mauro Carvalho Chehab [Tue, 10 Sep 2013 16:22:42 +0000 (13:22 -0300)]
Bump to version 0.4.2
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Mauro Carvalho Chehab [Thu, 15 Aug 2013 20:13:43 +0000 (17:13 -0300)]
ras-mc-ctl: Fix the DIMM layout display
The items weren't being presented in the right order. Fix it.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Thu, 15 Aug 2013 16:26:03 +0000 (13:26 -0300)]
contrib/edac-tests: Make it work without edac-utils
There were a few traces of edac-utils and an older version of
the EDAC trace on this script. Remove them, and change it to
0755 mode.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Thu, 15 Aug 2013 15:58:02 +0000 (12:58 -0300)]
Add an example of labels file
This is an example of a labels file for a Dell PowerEdge T620.
For now, only DIMMs A1 and B1 are tested here.
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Mauro Carvalho Chehab [Thu, 15 Aug 2013 15:45:18 +0000 (12:45 -0300)]
ras-mc-ctl: Fix label register with 2 layers
When there aren't 3 layers, label print/register weren't working.
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Mauro Carvalho Chehab [Thu, 15 Aug 2013 15:43:02 +0000 (12:43 -0300)]
ras-mc-ctl: Improve parser
Accept either . or : as the layer separator in config files.
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
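ras-mc-ctl itself is a Perl script, so the following C sketch is only an illustration of the parsing rule in the commit above: a location such as "0.1.2" or "0:1:2" is split on either separator. The function name parse_location() and its interface are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Hypothetical parser: read up to max layer indices separated by
 * '.' or ':'. Returns the number of layers parsed, or -1 when the
 * string is malformed or holds more than max fields.
 */
static int parse_location(const char *s, int *layers, int max)
{
    int n = 0;

    assert(s != NULL && layers != NULL && max > 0);

    while (n < max) {
        char *end;
        long v = strtol(s, &end, 10);

        if (end == s || v < 0 || v > INT_MAX)
            return -1;           /* no digits, or index out of range */
        layers[n++] = (int)v;
        if (*end == '\0')
            return n;            /* clean end of string */
        if (*end != '.' && *end != ':')
            return -1;           /* unknown separator */
        s = end + 1;
    }
    return -1;                   /* more fields than max */
}
```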
Mauro Carvalho Chehab [Tue, 4 Jun 2013 10:41:58 +0000 (07:41 -0300)]
Makefile.am: fix build if rpmbuild was never called before
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Mon, 3 Jun 2013 13:57:02 +0000 (10:57 -0300)]
TODO: Update it with the current issues
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Fri, 31 May 2013 19:40:40 +0000 (16:40 -0300)]
ras-mc-ctl: Fix the name of the error table data
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Fri, 31 May 2013 19:16:44 +0000 (16:16 -0300)]
ras-mc-ctl: report errors also for PCIe AER and MCE
Show also PCIe AER and MCE when used with --errors parameter.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Fri, 31 May 2013 17:57:54 +0000 (14:57 -0300)]
ras-mc-ctl: add summary for MCE and PCIe AER errors
Report the summary also for MCE and PCIe errors.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Fri, 31 May 2013 17:18:24 +0000 (14:18 -0300)]
Add support to store MCE events at the database
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Fri, 31 May 2013 16:54:11 +0000 (13:54 -0300)]
Add support to record AER events
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Fri, 31 May 2013 16:53:18 +0000 (13:53 -0300)]
ras-record: Make the code easier to add support for other tables
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Fri, 31 May 2013 16:51:55 +0000 (13:51 -0300)]
ras-record: reorder functions
No functional changes
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Fri, 31 May 2013 16:10:16 +0000 (13:10 -0300)]
ras-record: rename stmt to stmt_mc_event
This stmt is used only for mc_event. So, rename it, as we'll be
adding other stmts for the other tables.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Fri, 31 May 2013 15:41:01 +0000 (12:41 -0300)]
ras-record: make the code more generic
Now that we're ready to add more tables to the database, make
the code that creates and inserts data into the table more
generic.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Thu, 30 May 2013 00:53:58 +0000 (21:53 -0300)]
ras-mc-ctl: Improve error summary to show label and mc
Both pieces of information are useful for the users, even in the summary.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Wed, 29 May 2013 15:04:29 +0000 (12:04 -0300)]
Update rasdaemon.spec.in
This is exactly what should be used for Fedora.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Wed, 29 May 2013 14:57:21 +0000 (11:57 -0300)]
Create directories via install target
As the dirs will be created via install target, we may clean up the
rpm spec model file.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Wed, 29 May 2013 14:33:11 +0000 (11:33 -0300)]
Makefile.am: honour destdir at the local install target
That avoids building errors like:
/bin/sh /builddir/build/BUILD/rasdaemon-0.4.1/install-sh -d "/var/lib/rasdaemon"
mkdir: cannot create directory '/var/lib/rasdaemon': Permission denied
mkdir: cannot create directory '/var/lib/rasdaemon': Permission denied
When building for a distro package.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Wed, 29 May 2013 14:10:44 +0000 (11:10 -0300)]
Bump to version 0.4.1
The sqlite3 bugfix is important enough to deserve a version.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Wed, 29 May 2013 14:03:04 +0000 (11:03 -0300)]
README: update to reflect the need of perl DBI sqlite
This is now needed by ras-mc-ctl.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Wed, 29 May 2013 13:59:43 +0000 (10:59 -0300)]
Makefile.am: create ${prefix}/var/lib/rasdaemon on install
rasdaemon -r requires that directory to be created, otherwise,
sql open will fail.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Wed, 29 May 2013 12:33:45 +0000 (09:33 -0300)]
ras-mc-ctl: add support for querying the errors
As the mc_event table is filled by rasdaemon, we need a tool to
extract data from it.
So, use the existing perl script for the basic queries.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Wed, 29 May 2013 10:41:30 +0000 (07:41 -0300)]
ras-record: use sqlite3_reset to allow reusing the prepared statement
Instead of using sqlite3_finalize, we should use sqlite3_reset, or
otherwise the prepared statement will be de-allocated.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Wed, 29 May 2013 10:40:46 +0000 (07:40 -0300)]
rasdaemon.spec.in: Require sqlite-devel
This library is needed on builds when --enable-sqlite3 is used.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Tony Luck [Tue, 28 May 2013 18:20:36 +0000 (11:20 -0700)]
ras-events: Fence-post error when reporting number of cpus we listen to
I see:
rasdaemon: Listening to events for cpus 0 to 64
which would be 65 total cpus - I only have 64.
Fix the log message to use "n_cpus - 1" rather than "n_cpus".
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
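The off-by-one in the commit above comes from CPUs being numbered from 0, so the last index is n_cpus - 1. A minimal C sketch of the corrected message follows; the function name format_cpu_range() is hypothetical and the message text is taken from the log line quoted in the commit.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical formatter for the log line: with n_cpus CPUs numbered
 * from 0, the highest CPU index is n_cpus - 1.
 */
static int format_cpu_range(char *buf, size_t len, int n_cpus)
{
    assert(buf != NULL && n_cpus > 0);
    return snprintf(buf, len, "Listening to events for cpus 0 to %d",
                    n_cpus - 1);
}
```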
Mauro Carvalho Chehab [Tue, 28 May 2013 18:10:05 +0000 (15:10 -0300)]
Add a tool to automate releasing new versions
This small script automates the process of building newer
versions of the tool.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Tue, 28 May 2013 18:09:29 +0000 (15:09 -0300)]
Replace some hard-coded strings by the autotools macro names
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Tue, 28 May 2013 18:00:22 +0000 (15:00 -0300)]
Bump version to 0.4.0
There are too many changes already. Bump it to version 0.4.0.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Tue, 28 May 2013 17:58:36 +0000 (14:58 -0300)]
ras-events: parse errors at select_tracing_timestamp()
This fixes the following warnings:
ras-events.c: In function 'select_tracing_timestamp':
ras-events.c:501:6: warning: ignoring return value of 'read', declared with attribute warn_unused_result [-Wunused-result]
ras-events.c:531:8: warning: ignoring return value of 'fscanf', declared with attribute warn_unused_result [-Wunused-result]
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Tue, 28 May 2013 17:08:07 +0000 (14:08 -0300)]
Store RAS sqlite3 db file on a proper place
Instead of creating it on the same directory as when it
is called, put it at ${prefix}/var/lib/rasdaemon directory.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Tue, 28 May 2013 14:37:50 +0000 (11:37 -0300)]
ras-events: use sysconf to get the number of CPUs
There are several "per-cpu" files at sysfs that seem to be
utterly bogus, as trying to poll from them just returns POLLERR.
Let's use, instead, sysconf() to get the number of CPUs, avoiding
this bug.
Not sure if this would work with hotplugged CPUs, though, so
let's preserve the old code there, for now.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
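A minimal C sketch of asking sysconf() for the CPU count, as the commit above describes. The helper name count_cpus() is hypothetical, and the choice of _SC_NPROCESSORS_CONF (configured CPUs) over _SC_NPROCESSORS_ONLN (online CPUs) is an assumption; the commit message does not say which one is used. Both names are widely available extensions (for example in glibc) rather than part of the POSIX base specification.

```c
#include <assert.h>
#include <unistd.h>

/*
 * Hypothetical helper: number of configured CPUs, or -1 when the
 * system cannot report it.
 */
static long count_cpus(void)
{
    long n = sysconf(_SC_NPROCESSORS_CONF);   /* assumed selector */

    return n > 0 ? n : -1;
}
```

As the commit notes, a count taken once at startup may not reflect CPUs that are hotplugged later.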
Mauro Carvalho Chehab [Tue, 28 May 2013 11:47:57 +0000 (08:47 -0300)]
ras-events: Only use pthreads for collect if poll() not available
Before kernel 3.10, one pthread per cpu was used, as the code
would need to run an endless loop, in order to get events.
With kernel 3.10 and later, we can simply use poll() there.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Tue, 28 May 2013 11:13:17 +0000 (08:13 -0300)]
ras-mce-handler: change the test order to avoid leaked memory
As getdelim allocates memory, it is better to swap the
tests; otherwise the code will allocate some memory that
will never be de-allocated.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Tue, 28 May 2013 10:47:53 +0000 (07:47 -0300)]
ras-mce-handler: Fix /proc/cpuinfo parser
The test for the parsing completion is wrong. Fix it.
While here, change the namespace to avoid later
conflicts.
Reported-by: Chen Gong <gong.chen@intel.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Mon, 27 May 2013 21:19:08 +0000 (18:19 -0300)]
ras-mce-handler: Fix a warning
ras-mce-handler.c: In function ‘register_mce_handler’:
ras-mce-handler.c:200:13: warning: ‘mce’ may be used uninitialized in this function [-Wuninitialized]
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Mon, 27 May 2013 20:47:15 +0000 (17:47 -0300)]
Enable MCE parsing at RPM files
As this is known to work, enable it.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Mon, 27 May 2013 20:46:56 +0000 (17:46 -0300)]
README: update to reflect the current status
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Mon, 27 May 2013 20:26:04 +0000 (17:26 -0300)]
Update TODO list
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Mon, 27 May 2013 20:23:48 +0000 (17:23 -0300)]
mce-intel-sb: add memory controller decoding
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>