Shiju Jose [Mon, 19 Aug 2024 10:33:08 +0000 (11:33 +0100)]
rasdaemon: Fix for compilation warning in ras-memory-failure-handler.c
Fix the following compilation warning:
ras-memory-failure-handler.c:120:6: warning: implicit declaration of function ‘asprintf’; did you mean ‘vsprintf’? [-Wimplicit-function-declaration]
if (asprintf(&env[ei++], "PATH=%s", getenv("PATH") ?: "/sbin:/usr/sbin:/bin:/usr/bin") < 0)
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
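The warning appears because asprintf() is a GNU extension that glibc declares only when _GNU_SOURCE is defined before <stdio.h> is included. A minimal sketch of the usual remedy follows. Whether the actual commit defines the macro in the source file or through build flags is an assumption here, and build_path_env() is a hypothetical helper, not a rasdaemon function.

```c
/* Must come before any libc header so that glibc declares asprintf(). */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: returns a malloc'ed "PATH=..." string, or NULL. */
static char *build_path_env(void)
{
	const char *path = getenv("PATH");
	char *entry = NULL;

	/* Same fallback value as in the warning text above. */
	if (!path)
		path = "/sbin:/usr/sbin:/bin:/usr/bin";

	/* asprintf() returns -1 on failure; entry is then undefined. */
	if (asprintf(&entry, "PATH=%s", path) < 0)
		return NULL;

	return entry;	/* the caller must free() it */
}
```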
Avadhut Naik [Fri, 16 Aug 2024 18:10:40 +0000 (18:10 +0000)]
rasdaemon: Fix mem_fail_event build breakage
Commit 566a52622b1d ("add mem_fail_event trigger") introduces an event
trigger for a memory failure event.
However, if the rasdaemon is not configured with enable-memory-failure,
the setup function of the trigger, mem_fail_event_trigger_setup(), will
result in an undefined reference linker error when called through
setup_event_trigger().
Ensure that the setup function for the trigger is called only when the
rasdaemon has been configured with enable-memory-failure.
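A rough sketch of the guard described above, assuming the configure switch ends up as a preprocessor macro. The macro name HAVE_MEMORY_FAILURE and the function setup_event_trigger_sketch() are illustrative assumptions; the stand-in setup function only mimics the real mem_fail_event_trigger_setup().

```c
/* Assumption: configuring with enable-memory-failure defines this macro. */
/* #define HAVE_MEMORY_FAILURE 1 */

#ifdef HAVE_MEMORY_FAILURE
static int mem_fail_trigger_ready;

/* Stand-in for the real setup function, built only in this configuration. */
static void mem_fail_event_trigger_setup(void)
{
	mem_fail_trigger_ready = 1;
}
#endif

/* Returns the number of triggers that were set up. */
static int setup_event_trigger_sketch(void)
{
	int count = 0;

#ifdef HAVE_MEMORY_FAILURE
	/* Referenced only when the object defining it is actually built. */
	mem_fail_event_trigger_setup();
	count++;
#endif
	return count;
}
```

With the macro undefined, the call site disappears together with the definition, so the linker never sees an unresolved symbol.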
Tomohiro Misono [Thu, 8 Aug 2024 09:21:17 +0000 (09:21 +0000)]
ras-events: fix -d option to work again
It seems commit 3e9a59a184ca ("Add dynamic switch of ras events support.")
inadvertently introduced a change that ignores the -d option.
Fix this so that -d disables all trace events at once, as before.
- Rework the returned code logic to be more consistent;
- error codes will be using negative values;
- positive values indicate special return codes.
- Don't bloat the logs with lots of error messages due to
unsupported traces;
- Ensure that the number of CPUs is properly retrieved, or bail out;
- Don't bail if it can't set up a monotonic clock: it is better
to have a wrong timestamp than no log at all.
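The return-code convention from the list above could look roughly like this. It is only a sketch: EVENT_UNSUPPORTED, enable_event_sketch() and enable_all_sketch() are hypothetical names, not taken from the rasdaemon sources.

```c
#include <errno.h>

/* Hypothetical special code: positive, so it never collides with -errno. */
#define EVENT_UNSUPPORTED 1

/*
 * Convention sketched here:
 *   < 0  error (negative errno value)
 *   = 0  success
 *   > 0  special, non-fatal condition
 */
static int enable_event_sketch(int kernel_has_event, int have_permission)
{
	if (!have_permission)
		return -EACCES;
	if (!kernel_has_event)
		return EVENT_UNSUPPORTED;
	return 0;
}

/* Enables every supported event; stores how many were enabled in *enabled. */
static int enable_all_sketch(const int *supported, int n,
			     int have_permission, int *enabled)
{
	*enabled = 0;

	for (int i = 0; i < n; i++) {
		int rc = enable_event_sketch(supported[i], have_permission);

		if (rc < 0)
			return rc;	/* hard error: propagate to the caller */
		if (rc > 0)
			continue;	/* unsupported trace: skip without logging */
		(*enabled)++;
	}
	return 0;
}
```

Keeping the special codes positive lets callers distinguish "skip quietly" from "fail" with a single sign check.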
ras-arm-handler: cope with latest upstream changes
Unfortunately, rasdaemon support for the firmware-first
CPER ARM processor extended trace was added years before
it was merged upstream. That's bad, especially since the
upstream revision requested a change to some fields.
Fix support for it by aligning with latest upstream version:
https://lore.kernel.org/linux-edac/3853853f820a666253ca8ed6c7c724dc3d50044a.1720679234.git.mchehab+huawei@kernel.org/T/#m17003e47912b228e91e57ac6e4f90ea30061aa3b
Backward-compatible logic was added to avoid breaking
existing out-of-tree (OOT) support.
scripts/checkpatch.pl: some improvements to reduce false positives
- camelcase is OK for printk inttypes.h;
- strncpy is OK;
- accept up to 120 chars on lines without warnings;
- stop complaining about "BACKTRACE=" strings split on multiple lines;
- remove PREFER_DEFINED_ATTRIBUTE_MACRO, as this is kernel-specific;
- remove MACRO_ARG_REUSE, as this applies mostly to multithreading;
- don't warn on using do{} while(0) with single line statements;
This version contains many changes. The summary in
ChangeLog contains a sanitized version of them.
Note that the next version will likely bring
a uAPI-incompatible change. Unfortunately, the UEFI CPER record
trace for ARM processors is currently incomplete upstream.
Rasdaemon gained support for an extended ARM trace event that
supports all fields of the CPER record, but it depends on a
patch that has not been upstreamed yet.
While reviewing those patches, some changes turned out to be needed
to get them merged, meaning that future versions of rasdaemon
may no longer be compatible with the downstream patch.
scripts/checkpatch.pl: add a script to check coding style
We mostly follow the Kernel coding style. Import a version of the script,
making it compatible with the rasdaemon coding style by removing
stuff that doesn't fit here.
Ruidong Tian [Thu, 23 Nov 2023 09:47:25 +0000 (17:47 +0800)]
rasdaemon: add mc_event trigger
Allow users to run a trigger when a RAS mc_event occurs. The mc_event
trigger is split into a CE trigger and a UE trigger because
CEs are more frequent than UEs, so the CE trigger has a larger
performance impact. Users can choose different triggers for CE and UE to
reduce this effect.
Users can configure the triggers in /etc/sysconfig/rasdaemon:
TRIGGER_DIR: The trigger directory
MC_CE_TRIGGER: The script executed when a corrected error occurs.
MC_UE_TRIGGER: The script executed when an uncorrected error occurs.
No script is executed if MC_CE_TRIGGER/MC_UE_TRIGGER is null.
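As an illustration of how the three settings could combine, the sketch below builds the script path for a CE or UE event. It assumes the settings reach the daemon as environment variables; mc_trigger_path() is a hypothetical helper, not the function added by this commit.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: build the full path of the script to run for a
 * corrected (CE) or uncorrected (UE) error. Returns 0 and fills buf
 * when a trigger is configured, -1 when no script should run.
 */
static int mc_trigger_path(int is_corrected, char *buf, size_t len)
{
	const char *dir = getenv("TRIGGER_DIR");
	const char *script = getenv(is_corrected ? "MC_CE_TRIGGER"
						 : "MC_UE_TRIGGER");
	int n;

	/* An unset or empty variable means "run nothing". */
	if (!dir || !script || strlen(script) == 0)
		return -1;

	n = snprintf(buf, len, "%s/%s", dir, script);
	if (n < 0 || (size_t)n >= len)
		return -1;	/* the path would not fit in buf */

	return 0;
}
```

Keeping separate variables for CE and UE lets an administrator leave the frequent CE path empty while still reacting to UEs.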
util/arm_einj.py: add an utility for ARM error injection via QEMU
Testing rasdaemon is not easy, as it depends on either having
real hardware producing events or a test BIOS. This is usually
not available and/or not too reliable.
So, take a different approach by adding a QEMU QAPI designed for
doing hardware error injection. The QEMU patches are at:
ras-arm-handler: be compatible with upstream Kernel
Changeset e37eb2f11a82 ("Add code to decode Ampere specific error")
broke ARM event records with the upstream Kernel, as it requires a different
trace event than the one in the upstream Kernel; the required event is
part of a pending pull request:
Restore the previous behavior by making the parsing of the UEFI 2.6+ N.17 and N.16
table extra fields optional. That should make it compatible
with current upstream Kernels again.
Fixes: e37eb2f11a82 ("Add code to decode Ampere specific error")
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Avadhut Naik [Fri, 10 May 2024 18:20:19 +0000 (13:20 -0500)]
rasdaemon: Update SMCA bank error descriptions
Update error descriptions of SMCA bank types to support AMD's new Family
1Ah-based processors.
Also, modify some existing error descriptions to better reflect the error
received.
Shiju Jose [Wed, 20 Mar 2024 12:16:05 +0000 (12:16 +0000)]
rasdaemon: Fix for vendor errors are not recorded in the SQLite database if some cpus are offline
Fix vendor errors not being recorded in the SQLite database if some CPUs
are offline at system start.
Issue:
The issue can be reproduced by taking some CPUs offline, running
./rasdaemon -f --record &, and then
injecting a vendor-specific error supported by rasdaemon.
Reason:
When the system starts with some of the CPUs offline and rasdaemon is then
run, read_ras_event_all_cpus() exits with an error and rasdaemon switches to
the multi-threaded mode. However, read() in read_ras_event() returns an error in
the thread of each offline CPU, and each such thread cleans up, including calling
ras_ns_finalize_vendor_tables(), which invokes sqlite3_finalize() on the vendor
tables created. Thus the vendor error data is not stored in the SQLite
database when such an error is reported next time.
Solution:
In ras_ns_add_vendor_tables() and ras_ns_finalize_vendor_tables(), use a
reference count, and close the vendor tables created in ras_ns_add_vendor_tables()
based on that reference count.
Reported-by: Junhao He <hejunhao3@huawei.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
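The reference-counting idea can be sketched as follows. The struct and the get/put helpers are hypothetical stand-ins, not the real ras_ns_add_vendor_tables() and ras_ns_finalize_vendor_tables(); the sketch also ignores locking, which code shared between reader threads would need.

```c
/* Hypothetical stand-in for the prepared statements of the vendor tables. */
struct vendor_tables_sketch {
	int refcount;	/* number of readers currently using the tables */
	int open;	/* 1 while the underlying statements are valid */
};

/* Called by each reader; only the first caller creates the tables. */
static void vendor_tables_get(struct vendor_tables_sketch *t)
{
	if (t->refcount++ == 0)
		t->open = 1;	/* where the tables would be created */
}

/* Called on clean-up; only the last user finalizes the tables. */
static void vendor_tables_put(struct vendor_tables_sketch *t)
{
	if (t->refcount > 0 && --t->refcount == 0)
		t->open = 0;	/* where sqlite3_finalize() would run */
}
```

A thread that exits early for an offline CPU then only drops its own reference, and the tables stay usable for the remaining readers.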
mce-amd-smca: update smca_hwid to use smca_bank_types
bank_type is used as smca_bank_types everywhere, so there's no point in
declaring it as unsigned int. It also upsets covscan:
3. rasdaemon-0.6.7/mce-amd-smca.c:914: assignment: Assigning: "bank_type" = "s_hwid->bank_type".
7. rasdaemon-0.6.7/mce-amd-smca.c:926: cond_at_most: Checking "bank_type >= 64U" implies that "bank_type" and "s_hwid->bank_type" may be up to 63 on the false branch.
14. rasdaemon-0.6.7/mce-amd-smca.c:942: overrun-local: Overrunning array "smca_mce_descs" of 38 16-byte elements at element index 63 (byte offset 1023) using index "bank_type" (which evaluates to 63).
# 940| /* Only print the descriptor of valid extended error code */
# 941| if (xec < smca_mce_descs[bank_type].num_descs)
# 942|-> mce_snprintf(e->mcastatus_msg,
# 943| "%s. Ext Err Code: %d",
# 944| smca_mce_descs[bank_type].descs[xec],
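The overrun reported above comes from indexing a 38-element table with a value that was only checked against 64. One way to keep index and table in step is sketched below, with an enum-typed index checked against the real table size. The enum, the table and the placeholder strings are illustrative assumptions, not the contents of mce-amd-smca.c.

```c
#include <stddef.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Illustrative bank types; the real enum has many more entries. */
enum bank_type_sketch { BANK_A, BANK_B, N_BANK_TYPES_SKETCH };

struct mce_desc_sketch {
	const char * const *descs;
	unsigned int num_descs;
};

/* Placeholder descriptions, one table per bank type. */
static const char * const bank_a_descs[] = { "bank A error 0", "bank A error 1" };
static const char * const bank_b_descs[] = { "bank B error 0" };

static const struct mce_desc_sketch desc_table[] = {
	[BANK_A] = { bank_a_descs, ARRAY_SIZE(bank_a_descs) },
	[BANK_B] = { bank_b_descs, ARRAY_SIZE(bank_b_descs) },
};

/* Returns the description, or NULL when either index is out of range. */
static const char *lookup_desc(enum bank_type_sketch bank_type, unsigned int xec)
{
	/* Check against the table's real size, not an unrelated constant. */
	if ((size_t)bank_type >= ARRAY_SIZE(desc_table))
		return NULL;
	if (xec >= desc_table[bank_type].num_descs)
		return NULL;
	return desc_table[bank_type].descs[xec];
}
```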
rasdaemon: Add support to parse microcode field of mce tracepoint
Support for exporting the Microcode Revision is being added to the
mce_record tracepoint.
Add the required, corresponding support in the rasdaemon for the field
to be parsed and logged or added to the database and viewed later through
ras-mc-ctl utility.
rasdaemon: Add support to parse the PPIN field of mce tracepoint
Support for exporting the PPIN (Protected Processor Inventory Number)
is being added to the mce_record tracepoint.
Add the required, corresponding support in the rasdaemon for the field
to be parsed and logged or added to the database and viewed later through
ras-mc-ctl utility.
Avadhut Naik [Tue, 26 Mar 2024 04:06:08 +0000 (23:06 -0500)]
rasdaemon: ras-mc-ctl: Add support to display mcastatus_msg string
Currently, the mcastatus_msg string of struct mce_event is added to the
SQLite database by the rasdaemon when it is recording errors. It is
not, however, output by the ras-mc-ctl utility.
The string provides important error information relating to the received
MCE. For example, on AMD SMCA systems, the string outputs extended error
code and description. As such, the string should be present in the
output of ras-mc-ctl utility.
Add support to output the string through the ras-mc-ctl utility.
rasdaemon: Add error decoding for MCA_CTL_SMU extended bits
Enable error decoding support for the newly added extended
error bit descriptions from MCA_CTL_SMU.
b'0:11 can be decoded from the existing array smca_smu2_mce_desc.
Define a function to append the newly defined b'58:62 to
smca_smu2_mce_desc. This avoids maintaining the Reserved bits
b'12:57 in the code.
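One possible way to read this layout is a mapping from the extended error bit to an index in a compact description array, so the reserved range never needs entries of its own. The constants follow the bit ranges quoted above; smu2_desc_index() is a hypothetical helper, and how the actual commit appends the entries is not shown here.

```c
#define SMU2_BASE_COUNT	12	/* b'0:11, already in the existing array */
#define SMU2_EXT_FIRST	58	/* first newly defined bit */
#define SMU2_EXT_LAST	62	/* last newly defined bit */

/*
 * Hypothetical helper: map an extended error bit to its position in a
 * compact array that stores b'0:11 followed directly by b'58:62.
 * Returns -1 for the reserved range b'12:57 and anything above b'62.
 */
static int smu2_desc_index(unsigned int xec)
{
	if (xec < SMU2_BASE_COUNT)
		return (int)xec;
	if (xec >= SMU2_EXT_FIRST && xec <= SMU2_EXT_LAST)
		return (int)(SMU2_BASE_COUNT + (xec - SMU2_EXT_FIRST));
	return -1;
}
```

With this mapping the array needs 17 entries instead of 63, and no placeholder strings for reserved bits.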
Walter Sonius [Sun, 11 Feb 2024 22:30:25 +0000 (23:30 +0100)]
rasdaemon: labels/apple add MacPro 1,1 and 2,1 models
For the Apple MacPro 1,1 (Mac-F4208DC8) and MacPro 2,1 (Mac-F4208DA9)
these are the correct labels for the DIMM numbers 1-4 on each DIMM Riser
A&B for a total of 8 DIMMS. The MacPro 1,1 vendor is actually called
"Apple Computer, Inc." vs "Apple Inc." for the MacPro 2,1 and 3,1.
Another note is that the MacPro 1,1 and 2,1 require the kernel parameter
noefi for their efi32 firmware to boot a 64bit kernel using the
debian-12.4.0-amd64-netinst.iso.
The upper Riser is called A; the lower Riser is called B. However,
compared to the MacPro 3,1, the riser labels A & B are branch-swapped on the
memory controller on the MacPro 1,1 and 2,1, not in their physical location in the
case (double-checked)! The so-called slot 2 and slot 3 found by
ras-mc-ctl --layout are not available as slots or risers on the
motherboard. ras-mc-ctl --guess-labels showed the right labels, but the
DIMM numbers are indistinguishable, so this commit is needed to
link them to the right memory location.
Signed-off-by: Walter Sonius <walterav1984@gmail.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
We haven't used Fedorahosted for a long time; the URL was updated,
but right now it is far more common to receive patches via github
than from other repositories, so change the repository order.
Walter Sonius [Mon, 29 Jan 2024 17:21:37 +0000 (18:21 +0100)]
apple macpro 2008 3,1 dimm1-4 labels riser A&B
For the Apple Mac Pro 3,1 (2008) Mac-F42C88C8 these are the correct labels for the DIMM numbers 1-4 on each DIMM Riser A&B for a total of 8 DIMMS.
The upper Riser is called A; the lower Riser is called B. The so-called `slot 2` and `slot 3` found by `ras-mc-ctl --layout` are not available as slots or risers on the motherboard. `ras-mc-ctl --guess-labels` showed the right labels, but the DIMM numbers are indistinguishable, so this commit is needed to link them to the right memory location.
```
$ ras-mc-ctl --guess-labels
memory stick 'DIMM 1' is located at 'DIMM Riser B'
memory stick 'DIMM 2' is located at 'DIMM Riser B'
memory stick 'DIMM 1' is located at 'DIMM Riser A'
memory stick 'DIMM 2' is located at 'DIMM Riser A'
memory stick 'DIMM 3' is located at 'DIMM Riser B'
memory stick 'DIMM 4' is located at 'DIMM Riser B'
memory stick 'DIMM 3' is located at 'DIMM Riser A'
memory stick 'DIMM 4' is located at 'DIMM Riser A'
```
Signed-off-by: Walter Sonius <walterav1984@gmail.com>
zhuofeng [Thu, 7 Dec 2023 02:26:56 +0000 (10:26 +0800)]
Fix potential overflow with some arrays at page-isolation logic
Overflows may happen in the `threshold_string` and `cycle_string` arrays.
If the PAGE_CE_THRESHOLD value in page isolation is set to a value of 50 or
more characters, there is a risk of array overflow. Because sprintf does not
bound its writes, use snprintf instead.
An error is reported when the AddressSanitizer is used.
rasdaemon: Improper PAGE_CE_ACTION, set to default soft
rasdaemon: Page offline choice on Corrected Errors is soft
=================================================================
==221920==ERROR: AddressSanitizer: stack-buffer-overflow on address 0xffffdd91d932 at pc 0xffffa24071c4 bp 0xffffdd91d720 sp 0xffffdd91ced8
WRITE of size 55 at 0xffffdd91d932 thread T0
#0 0xffffa24071c0 in vsprintf (/usr/lib64/libasan.so.6+0x5c1c0)
#1 0xffffa24073cc in sprintf (/usr/lib64/libasan.so.6+0x5c3cc)
#2 0x459558 in parse_env_string /home/rasdaemon/ras-page-isolation.c:185
#3 0x4596f4 in page_isolation_init /home/rasdaemon/ras-page-isolation.c:202
#4 0x459934 in ras_page_account_init /home/rasdaemon/ras-page-isolation.c:211
#5 0x40f700 in handle_ras_events /home/rasdaemon/ras-events.c:902
#6 0x405b8c in main /home/rasdaemon/rasdaemon.c:211
#7 0xffffa20b6f38 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
#8 0xffffa20b7004 in __libc_start_main_impl ../csu/libc-start.c:409
#9 0x4038ec in _start (/home/rasdaemon/rasdaemon+0x4038ec)
Address 0xffffdd91d932 is located in stack of thread T0 at offset 82 in frame
#0 0x459574 in page_isolation_init /home/rasdaemon/ras-page-isolation.c:190
This frame has 2 object(s):
[32, 82) 'threshold_string' (line 191)
[128, 178) 'cycle_string' (line 192) <== Memory access at offset 82 partially underflows this variable
HINT: this may be a false positive if your program uses some custom stack unwind mechanism, swapcontext or vfork
(longjmp and C++ exceptions *are* supported)
SUMMARY: AddressSanitizer: stack-buffer-overflow (/usr/lib64/libasan.so.6+0x5c1c0) in vsprintf
Shadow bytes around the buggy address:
0x200ffbb23ad0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x200ffbb23ae0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x200ffbb23af0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x200ffbb23b00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x200ffbb23b10: 00 00 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1
=>0x200ffbb23b20: 00 00 00 00 00 00[02]f2 f2 f2 f2 f2 00 00 00 00
0x200ffbb23b30: 00 00 02 f3 f3 f3 f3 f3 00 00 00 00 00 00 00 00
0x200ffbb23b40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x200ffbb23b50: f1 f1 f1 f1 f1 f1 04 f2 00 00 f2 f2 00 00 00 00
0x200ffbb23b60: 00 00 00 f2 f2 f2 f2 f2 00 00 00 00 00 00 00 f2
0x200ffbb23b70: f2 f2 f2 f2 00 00 00 00 00 00 00 00 f2 f2 f2 f2
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable: 00
Partially addressable: 01 02 03 04 05 06 07
Heap left redzone: fa
Freed heap region: fd
Stack left redzone: f1
Stack mid redzone: f2
Stack right redzone: f3
Stack after return: f5
Stack use after scope: f8
Global redzone: f9
Global init order: f6
Poisoned by user: f7
Container overflow: fc
Array cookie: ac
Intra object redzone: bb
ASan internal: fe
Left alloca redzone: ca
Right alloca redzone: cb
Shadow gap: cc
==221920==ABORTING
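The report above shows a 55-byte write into the 50-byte threshold_string. A minimal sketch of the bounded alternative follows; copy_env_value() is a hypothetical helper, and the real parse_env_string() may format its buffers differently.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Hypothetical helper: copy a configuration value into a fixed-size
 * buffer. Returns 0 on success and -1 if the value had to be truncated.
 */
static int copy_env_value(char *dst, size_t len, const char *value)
{
	/* snprintf() never writes more than len bytes, terminator included. */
	int n = snprintf(dst, len, "%s", value);

	/* A return value >= len means the output did not fit. */
	if (n < 0 || (size_t)n >= len)
		return -1;

	return 0;
}
```

Checking the return value matters as much as the bound itself: silent truncation of a threshold would otherwise turn a crash into a wrong setting.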
Hunter He [Wed, 6 Dec 2023 06:52:03 +0000 (14:52 +0800)]
rasdaemon:Add support for creating vendor tables at startup.
When rasdaemon runs without receiving any non-standard error, the vendor
tables are not created in the database file. The ras-mc-ctl
script then breaks when trying to query data from the non-existent tables.
Add support for creating vendor tables at startup.
Signed-off-by: Hunter He <hunter.he@jaguarmicro.com>
caixiaomeng 00662745 [Wed, 29 Nov 2023 06:31:46 +0000 (14:31 +0800)]
Add dynamic switch of ras events support.
Rasdaemon does not provide a way to disable specific events through configuration.
If a user wants to disable a specific event (e.g. block_rq_complete), they
have to recompile rasdaemon, which is inconvenient.
This patch adds support for dynamically switching RAS events. Add the
events you want to disable to /etc/sysconfig/rasdaemon. For example,
`DISABLE="ras:mc_event,block:block_rq_complete"`. Then restart
rasdaemon, and these two events will be disabled without recompilation.
[mchehab: make is_disabled_event() static]
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
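To illustrate how a DISABLE value of this form could be matched, here is a small sketch that looks up a "group:event" pair in a comma-separated list. The function name and signature are assumptions for illustration; the actual is_disabled_event() in rasdaemon may be written differently.

```c
#include <string.h>

/*
 * Sketch: returns 1 if "<group>:<event>" appears as a complete entry in
 * the comma-separated list, 0 otherwise (including a NULL list).
 */
static int is_disabled_event_sketch(const char *list, const char *group,
				    const char *event)
{
	size_t glen = strlen(group), elen = strlen(event);
	const char *p = list;

	while (p && *p) {
		const char *end = strchr(p, ',');
		size_t len = end ? (size_t)(end - p) : strlen(p);

		/* The entry must match exactly, not just as a prefix. */
		if (len == glen + 1 + elen &&
		    !strncmp(p, group, glen) && p[glen] == ':' &&
		    !strncmp(p + glen + 1, event, elen))
			return 1;

		p = end ? end + 1 : NULL;
	}
	return 0;
}
```

The list is scanned in place, so no copy of the configuration string has to be allocated or modified.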
Avadhut Naik [Tue, 21 Nov 2023 20:04:19 +0000 (14:04 -0600)]
rasdaemon: Add support for vendor-specific machine check error information
Some CPU vendors may provide additional vendor-specific machine check
error information. AMD, for example, provides FRU Text through SYND 1/2
registers if BIT 9 of SMCA_CONFIG register is set.
Add support to display the additional vendor-specific error information,
if any.
rasdaemon: log non_standard_event at just one line
It is more reasonable to log a non_standard_event on one line, excluding the
error dump. That way, you can easily get the decoded non_standard_event log on
one line if you implement a decoder, as for other events.
Avadhut Naik [Thu, 31 Aug 2023 07:23:48 +0000 (02:23 -0500)]
rasdaemon: Fix SMCA bank type decoding
On AMD systems with Scalable MCA (SMCA), the (HWID, MCATYPE) tuple from
the MCA_IPID MSR, bits 43:32 and 63:48 respectively, are used for SMCA
bank type decoding. On occurrence of an SMCA error, the cached tuples are
compared against the tuple read from the MCA_IPID MSR to determine the
SMCA bank type.
Currently however, all high 32 bits of the MCA_IPID register are cached in
the rasdaemon for all SMCA bank types. Bits 47:44 which do not play a part
in bank type decoding are zeroed out. Likewise, when an SMCA error occurs,
all high 32 bits of the MCA_IPID register are read and compared against
the cached values in smca_hwid_mcatypes array.
This can lead to erroneous bank type decoding since the bits 47:44 are
not guaranteed to be zero. They are either reserved or, on some modern
AMD systems viz. Genoa, denote the InstanceIdHi value. The bits therefore,
should not be associated with SMCA bank type decoding.
Import the HWID_MCATYPE macro from the kernel to ensure that only the
relevant fields i.e. (HWID, MCATYPE) tuples are used for SMCA bank type
decoding on occurrence of an SMCA error.
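The masking described above can be sketched as follows. The macro has the same shape as the kernel's HWID_MCATYPE; the helper ipid_to_hwid_mcatype() and its field extraction are an illustration based on the bit positions quoted in this message, not a copy of the rasdaemon code.

```c
#include <stdint.h>

/* HWID goes into the upper 16 bits of the key, MCATYPE into the lower 16. */
#define HWID_MCATYPE(hwid, mcatype) (((hwid) << 16) | (mcatype))

/* Hypothetical helper: build the lookup key from a raw MCA_IPID value. */
static uint32_t ipid_to_hwid_mcatype(uint64_t ipid)
{
	uint32_t hwid = (uint32_t)((ipid >> 32) & 0xFFF);	/* bits 43:32 */
	uint32_t mcatype = (uint32_t)((ipid >> 48) & 0xFFFF);	/* bits 63:48 */

	/* Bits 47:44 (reserved or InstanceIdHi) never reach the key. */
	return HWID_MCATYPE(hwid, mcatype);
}
```

Two MCA_IPID values that differ only in bits 47:44 or in the low 32 bits therefore produce the same key and decode to the same bank type.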
Muralidhara M K [Thu, 27 Jul 2023 10:18:12 +0000 (10:18 +0000)]
rasdaemon: Identify the DIe Number in multidie system
Some AMD systems have 4 dies in each socket, and the Die ID indicates
whether the error occurred on a CPU die or a GPU die.
The respective Die is also used for FRU identification.
Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>