ras-page-isolation.h: fix most coding style issues
Fix several checkpatch.pl warnings:
ras-page-isolation.h:50: WARNING:SPACING: missing space after enum definition
ras-page-isolation.h:79: ERROR:COMPLEX_MACRO: Macros with complex values should be enclosed in parentheses
ras-page-isolation.h:80: ERROR:COMPLEX_MACRO: Macros with complex values should be enclosed in parentheses
ras-page-isolation.h:96: WARNING:CONSTANT_COMPARISON: Comparisons should place the constant on the right side of the test
ras-page-isolation.h:119: WARNING:UNSPECIFIED_INT: Prefer 'unsigned int' to bare use of 'unsigned'
Note that this warning was not addressed,
as it seems to be a false positive:
ras-page-isolation.h:96: WARNING:CONSTANT_COMPARISON: Comparisons should place the constant on the right side of the test
The location_fields is used for both APEI and DSM data.
The logic there defines 7 values for APEI and 9 for DSM,
but, with the current logic, it allocates only 7 elements.
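The sizing concern above can be illustrated with a minimal sketch. The macro and struct names below are hypothetical and are not taken from ras-page-isolation.h; only the field counts (7 and 9) come from the text. The sketch shows one way to dimension a shared array for the larger of two layouts, with the compound macro value enclosed in parentheses as checkpatch requests.

```c
#include <stddef.h>

/* Field counts from the commit text: 7 for APEI, 9 for DSM (names are hypothetical). */
#define APEI_FIELD_NUM 7
#define DSM_FIELD_NUM  9

/* Parenthesized so the macro expands safely inside larger expressions. */
#define MAX_FIELD_NUM \
	((APEI_FIELD_NUM) > (DSM_FIELD_NUM) ? (APEI_FIELD_NUM) : (DSM_FIELD_NUM))

/* Hypothetical row record: one array shared by both data sources. */
struct row_location {
	int location_fields[MAX_FIELD_NUM];
};

/* Number of elements the shared array can hold. */
static size_t row_location_capacity(void)
{
	return sizeof(((struct row_location *)0)->location_fields) / sizeof(int);
}
```

With these assumed values the array holds 9 elements, so neither layout can overrun it.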
mce-intel-ivb/mce-intel-sb: remove code commented with #if 0
The dead code has been there for a long time without any attempt
to actually implement it for SB/IVB. As these CPUs were released
a long time ago, it is unlikely that anyone will address the
comments there.
So, drop the dead code. If needed, this patch can be reverted
later.
rasdaemon: don't use braces for single statement blocks
Solve those checkpatch warnings:
WARNING: braces {} are not necessary for single statement blocks
+ if (clock_gettime(clk_id, &ts) == 0 && !strcmp(ev.error_type, "Corrected")) {
+ ras_record_row_error(ev.driver_detail, ev.error_count, ts.tv_sec, ev.address);
+ }
total: 0 errors, 1 warnings, 0 checks, 304 lines checked
WARNING: braces {} are not necessary for single statement blocks
+ if (!matched) {
+ log(TERM, LOG_INFO, "Improper %s, set to default off\n", env);
+ }
WARNING: braces {} are not necessary for any arm of this statement
+ if (rr1->type == GHES) {
[...]
+ } else {
[...]
WARNING: braces {} are not necessary for single statement blocks
+ for (int i = 0; i < ROW_LOCATION_FIELDS_NUM; i++) {
+ dst->location_fields[i] = src->location_fields[i];
+ }
zhuofeng [Sat, 29 Jun 2024 09:27:56 +0000 (17:27 +0800)]
The rasdaemon service may fail to be started for the first time.
The rasdaemon creates a separate instance virtual directory on first startup, like `/sys/kernel/debug/tracing/instances/rasdaemon`.
After the directory is created, the kernel generates virtual files such as `trace_clock` and `set_event` in `/sys/kernel/debug/tracing/instances/rasdaemon`.
The kernel generates these virtual files while rasdaemon is already trying to access them, so a file may not exist yet when rasdaemon opens it.
So wait up to 30 seconds to give the kernel enough time to generate the files.
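A bounded wait of this kind can be sketched as a simple polling loop. This is an illustration only, not the code from the commit: the helper name wait_for_file() and its parameters are hypothetical, and the real change may wait in a different way.

```c
#include <time.h>
#include <unistd.h>

/*
 * Poll for `path` until it exists or `max_tries` attempts have been made,
 * sleeping `interval_ms` milliseconds between attempts.
 * Returns 0 when the file is visible, -1 otherwise.
 */
static int wait_for_file(const char *path, int max_tries, long interval_ms)
{
	struct timespec delay = {
		.tv_sec = interval_ms / 1000,
		.tv_nsec = (interval_ms % 1000) * 1000000L,
	};

	for (int i = 0; i < max_tries; i++) {
		if (access(path, F_OK) == 0)
			return 0;
		/* No need to sleep after the final attempt. */
		if (i + 1 < max_tries)
			nanosleep(&delay, NULL);
	}
	return -1;
}
```

Under these assumptions, a call such as wait_for_file("/sys/kernel/debug/tracing/instances/rasdaemon/trace_clock", 30, 1000) would check once per second for up to 30 attempts and return as soon as the file appears.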
zhuofeng [Mon, 4 Mar 2024 13:04:42 +0000 (21:04 +0800)]
New feature: support memory row CE threshold policy
- Introduction: Identify memory row faults in memory CE faults and
isolate the physical memory pages where row faults occur. This method
can effectively prevent CE storms or memory UCE faults caused by memory
row failures.
- Implementation: The system counts the number of CE faults in the same
memory row within a specified period. If the number of CE faults exceeds
the configured threshold, the system considers that the memory row may
fail and isolates all physical pages recorded in the memory row.
Notes:
1. This function is disabled by default. You can enable it by
configuring the 'ROW_CE_ACTION' field in the '/etc/sysconfig/rasdaemon' configuration file.
2. If both row isolation and page isolation are enabled, page isolation is automatically
disabled by default.
3. If the error count in the DIMM CE fault information received by rasdaemon
is 0, the BIOS did not correctly fill in the error count when parsing the fault information.
In that case, rasdaemon treats the error count as 1 by default,
which matches the kernel's behavior.
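The counting policy described above can be sketched as follows. This is a simplified illustration that assumes a fixed time window per row; the struct, the function name, and the windowing scheme are hypothetical and may differ from what rasdaemon actually implements.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical per-row bookkeeping; not the rasdaemon data structures. */
struct row_counter {
	time_t window_start;	/* start of the current counting window */
	unsigned long count;	/* CEs seen in the current window */
};

/*
 * Account for `reported` corrected errors at time `now`. Returns true when
 * the count within `window` seconds exceeds `threshold`, i.e. when the row
 * would be handed over for isolation.
 */
static bool row_ce_exceeds(struct row_counter *rc, unsigned long reported,
			   time_t now, time_t window, unsigned long threshold)
{
	/* A reported count of 0 is treated as a single error (note 3). */
	if (reported == 0)
		reported = 1;

	/* Start a new window once the old one has expired. */
	if (now - rc->window_start >= window) {
		rc->window_start = now;
		rc->count = 0;
	}

	rc->count += reported;
	return rc->count > threshold;
}
```

In this sketch the caller would isolate every page recorded for the row once the function returns true.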
Avadhut Naik [Thu, 7 Nov 2024 06:24:44 +0000 (06:24 +0000)]
rasdaemon: Modify support for vendor-specific machine check error information
Commit 83a3ced797256d ("rasdaemon: Add support for vendor-specific
machine check error information") assumes that MCA_CONFIG MSR will be
exported as part of vendor-specific error information through the MCE
tracepoint.
The same, however, is not true anymore. MCA_CONFIG MSR will not be
exported through the MCE tracepoint. Instead, the data from MCA_SYND1/2
MSRs, exported as vendor-specific error information on newer AMD SOCs,
should always be interpreted as FRUText.
Modify the error decoding support accordingly.
Fixes: 83a3ced797256d ("rasdaemon: Add support for vendor-specific machine check error information")
Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
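Interpreting two 64-bit register values as FRUText amounts to viewing their 16 raw bytes as a string. The sketch below is an assumption about how that might look, not the code from the commit; synd_to_frutext() and FRUTEXT_LEN are hypothetical names, and the host byte order is assumed to match the order in which the text was stored.

```c
#include <stdint.h>
#include <string.h>

#define FRUTEXT_LEN 16	/* two 64-bit registers */

/*
 * Copy the raw bytes of two 64-bit values into `out` and NUL-terminate.
 * `out` must hold at least FRUTEXT_LEN + 1 bytes. The byte order is the
 * host's, which is an assumption of this sketch.
 */
static void synd_to_frutext(uint64_t synd1, uint64_t synd2, char *out)
{
	memcpy(out, &synd1, sizeof(synd1));
	memcpy(out + sizeof(synd1), &synd2, sizeof(synd2));
	out[FRUTEXT_LEN] = '\0';
}
```

The explicit terminator matters because the registers may carry a full 16 characters with no trailing NUL.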
Shiju Jose [Thu, 16 May 2024 14:23:44 +0000 (15:23 +0100)]
rasdaemon: ras-mc-ctl: Log hpa and region info from cxl_general_media and cxl_dram tables
Add support for reading and logging hpa and region info from the cxl_general_media and
cxl_dram tables.
Note: This change is not backward compatible, because
the select command with the newly added columns fails on previous
CXL tables where those columns are not present.
The issue could be solved by renaming the CXL tables to v2,
but ras-mc-ctl would still not be backward compatible when listing errors:
it fails when only the previous version of the CXL table is present in the
database, as it cannot find v2 of the table.
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Shiju Jose [Wed, 15 May 2024 11:18:36 +0000 (12:18 +0100)]
rasdaemon: CXL: Extract, log and record region info from cxl_general_media and cxl_dram events
Extract, log and record region info from the cxl_general_media and
cxl_dram events.
The corresponding kernel changes:
https://lore.kernel.org/all/cover.1711598777.git.alison.schofield@intel.com/T/#m6fd773b5477fc44b875848e053708a1c8996c4e4
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Shiju Jose [Mon, 19 Aug 2024 10:56:15 +0000 (11:56 +0100)]
rasdaemon: ras-events: Fix warning ‘filter_ras_mc_event’ defined but not used
Fix the following compilation warning:
ras-events.c:318:12: warning: ‘filter_ras_mc_event’ defined but not used [-Wunused-function]
static int filter_ras_mc_event(struct ras_events *ras, char *group, char *event,
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Fix the following checkpatch warning in ras-arm-handler:
+ trace_seq_printf(s, " Program execution can be restarted reliably at the PC associated with the error");
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Shiju Jose [Mon, 19 Aug 2024 10:33:08 +0000 (11:33 +0100)]
rasdaemon: Fix for compilation warning in ras-memory-failure-handler.c
Fix the following compilation warning:
ras-memory-failure-handler.c:120:6: warning: implicit declaration of function ‘asprintf’; did you mean ‘vsprintf’? [-Wimplicit-function-declaration]
if (asprintf(&env[ei++], "PATH=%s", getenv("PATH") ?: "/sbin:/usr/sbin:/bin:/usr/bin") < 0)
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
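asprintf() is a GNU extension, and a common way to obtain its declaration on glibc is to define _GNU_SOURCE before the first system header is included. The commit text does not say how the warning was fixed, so the sketch below only shows that general approach; build_path_env() is a hypothetical helper modeled on the line quoted in the warning.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* must precede the first system header */
#endif
#include <stdio.h>
#include <stdlib.h>

/*
 * Build a "PATH=..." environment entry, falling back to a default search
 * path when PATH is unset. Returns a malloc()ed string or NULL on failure.
 */
static char *build_path_env(void)
{
	const char *path = getenv("PATH");
	char *entry;

	if (!path)
		path = "/sbin:/usr/sbin:/bin:/usr/bin";
	if (asprintf(&entry, "PATH=%s", path) < 0)
		return NULL;
	return entry;
}
```

The caller owns the returned string and releases it with free().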
Avadhut Naik [Fri, 16 Aug 2024 18:10:40 +0000 (18:10 +0000)]
rasdaemon: Fix mem_fail_event build breakage
Commit 566a52622b1d ("add mem_fail_event trigger") introduces an event
trigger for a memory failure event.
However, if the rasdaemon is not configured with enable-memory-failure,
the setup function of the trigger, mem_fail_event_trigger_setup(), will
result in an undefined reference linker error when called through
setup_event_trigger().
Ensure that the setup function for the trigger is called only when the
rasdaemon has been configured with enable-memory-failure.
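The usual way to keep an optional feature from causing an undefined reference is to guard both the definition and the call site with the same build-time macro. In the sketch below, HAVE_MEMORY_FAILURE is an assumed macro name for the configure switch, setup_event_trigger_sketch() is a hypothetical stand-in for the real caller, and the body of mem_fail_event_trigger_setup() is a placeholder.

```c
#include <stdio.h>

/* Assumed to be defined by the build system when the feature is enabled. */
/* #define HAVE_MEMORY_FAILURE 1 */

#ifdef HAVE_MEMORY_FAILURE
/* In a real build this would live in the memory-failure handler object. */
static void mem_fail_event_trigger_setup(void)
{
	puts("memory failure trigger configured");
}
#endif

/* Returns the number of triggers that were set up. */
static int setup_event_trigger_sketch(void)
{
	int configured = 0;

#ifdef HAVE_MEMORY_FAILURE
	/* Referenced only when the symbol is guaranteed to be linked in. */
	mem_fail_event_trigger_setup();
	configured++;
#endif
	return configured;
}
```

When the macro is undefined, the call disappears at preprocessing time, so the linker never looks for the missing symbol.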
Tomohiro Misono [Thu, 8 Aug 2024 09:21:17 +0000 (09:21 +0000)]
ras-events: fix -d option to work again
It seems commit 3e9a59a184ca ("Add dynamic switch of ras events support.")
inadvertently introduced a change that ignores the -d option.
Fix this so that -d disables all trace events at once, as before.
- Rework the returned code logic to be more consistent;
- error codes will be using negative values;
- positive values indicate special return codes.
- Don't bloat the logs with lots of error messages due to
unsupported traces;
- Ensure that the number of CPUs is properly retrieved, or bail out;
- Don't bail if it can't setup a monotone clock: it is better
to have a wrong timestamp than no log at all.
ras-arm-handler: cope with latest upstream changes
Unfortunately, rasdaemon support for the firmware-first
CPER ARM processor extended trace was added years before
it was merged upstream. That's bad, especially since the
upstream revision requested a change to some fields.
Fix support for it by aligning with latest upstream version:
https://lore.kernel.org/linux-edac/3853853f820a666253ca8ed6c7c724dc3d50044a.1720679234.git.mchehab+huawei@kernel.org/T/#m17003e47912b228e91e57ac6e4f90ea30061aa3b
A backward-compatible logic was added to avoid breaking with
existing OOT support.
scripts/checkpatch.pl: some improvements to reduce false positives
- camelcase is OK for printk inttypes.h;
- strncpy is OK;
- accept up to 120 chars on lines without warnings;
- stop complaining about "BACKTRACE=" strings split on multiple lines;
- remove PREFER_DEFINED_ATTRIBUTE_MACRO, as this is kernel-specific;
- remove MACRO_ARG_REUSE, as this applies mostly to multithreading;
- don't warn on using do{} while(0) with single line statements;
There were lots of changes in this version. The summary in
ChangeLog contains a sanitized version of them.
Note that the next version will likely bring
a uAPI-incompatible change. Unfortunately, the UEFI CPER record
trace for ARM processors is currently incomplete upstream.
Rasdaemon gained support for an extended ARM trace event that
supports all fields of the CPER record, but it depends on a
patch that has not been upstreamed yet.
While reviewing those patches, it became clear that some changes are needed
to get them merged, meaning that future versions of rasdaemon
may no longer be compatible with the downstream patch.
scripts/checkpatch.pl: add a script to check coding style
We sort of follow the Kernel coding style. Import a version of the script,
making it compatible with the rasdaemon coding style by removing
stuff that doesn't fit here.
Ruidong Tian [Thu, 23 Nov 2023 09:47:25 +0000 (17:47 +0800)]
rasdaemon: add mc_event trigger
Allow users to run a trigger when a RAS mc_event occurs. The mc_event
trigger is split into a CE trigger and a UE trigger because
CEs are more frequent than UEs, so the CE trigger has a larger
performance impact. Users can choose different triggers for CE and UE to
reduce this effect.
Users can configure triggers in /etc/sysconfig/rasdaemon:
TRIGGER_DIR: The trigger directory
MC_CE_TRIGGER: The script executed when corrected error occurs.
MC_UE_TRIGGER: The script executed when uncorrected error occurs.
No script will be executed if MC_CE_TRIGGER/MC_UE_TRIGGER is null.
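How these settings might be combined can be sketched as below. This assumes the values reach the daemon as environment variables and that the script name is resolved relative to TRIGGER_DIR; both points, along with the helper name resolve_trigger(), are assumptions of this sketch rather than details confirmed by the commit text.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Resolve the script configured in environment variable `var` relative to
 * TRIGGER_DIR. Returns 0 and fills `out` on success, -1 when the variable
 * or TRIGGER_DIR is unset or empty, or when the path does not fit.
 */
static int resolve_trigger(const char *var, char *out, size_t len)
{
	const char *dir = getenv("TRIGGER_DIR");
	const char *script = getenv(var);
	int n;

	if (!dir || !*dir || !script || !*script)
		return -1;	/* nothing configured: no script is run */

	n = snprintf(out, len, "%s/%s", dir, script);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Calling resolve_trigger("MC_CE_TRIGGER", ...) and resolve_trigger("MC_UE_TRIGGER", ...) separately keeps the two triggers independent, so a frequent CE script does not have to be paired with a UE one.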
util/arm_einj.py: add an utility for ARM error injection via QEMU
Testing rasdaemon is not easy, as it depends on either having
real hardware producing events or a test BIOS. This is usually
not available and/or not too reliable.
So, take a different approach by adding a QEMU QAPI designed for
doing hardware error injection. The QEMU patches are at:
ras-arm-handler: be compatible with upstream Kernel
Changeset e37eb2f11a82 ("Add code to decode Ampere specific error")
broke ARM event record with upstream Kernel, as it requires a different
trace event than the one that it is on upstream Kernel, and it is
part of a pending pull request:
Restore its behavior by making parsing the UEFI 2.6+ N.17 and N.16
table extra fields to be optional. That should make it compatible
with current upstream Kernels again.
Fixes: e37eb2f11a82 ("Add code to decode Ampere specific error")
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Avadhut Naik [Fri, 10 May 2024 18:20:19 +0000 (13:20 -0500)]
rasdaemon: Update SMCA bank error descriptions
Update error descriptions of SMCA bank types to support AMD's new Family
1Ah-based processors.
Also, modify some existing error descriptions to better reflect the error
received.
Shiju Jose [Wed, 20 Mar 2024 12:16:05 +0000 (12:16 +0000)]
rasdaemon: Fix for vendor errors are not recorded in the SQLite database if some cpus are offline
Fix vendor errors not being recorded in the SQLite database if some CPUs
are offline at system start.
Issue:
This issue is reproducible by taking some CPUs offline, running
./rasdaemon -f --record & and
injecting a vendor-specific error supported by rasdaemon.
Reason:
When the system starts with some of the CPUs offline and rasdaemon is then
run, read_ras_event_all_cpus() exits with an error and rasdaemon switches to
the multi-threaded path. However, read() in read_ras_event() returns an error in
the threads for each of the offline CPUs, and the cleanup includes calling
ras_ns_finalize_vendor_tables(), which invokes sqlite3_finalize() on the vendor
tables that were created. Thus the vendor error data is not stored in the SQLite
database when such an error is reported next time.
Solution:
In ras_ns_add_vendor_tables() and ras_ns_finalize_vendor_tables(), use a
reference count and close the vendor tables created in ras_ns_add_vendor_tables()
based on that reference count.
Reported-by: Junhao He <hejunhao3@huawei.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
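The reference-counting idea in the solution can be sketched in isolation. The struct and function names below are hypothetical stand-ins, not the rasdaemon code; the boolean field stands in for the prepared SQLite statements, and the sketch is single-threaded.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-vendor SQLite statement state. */
struct vendor_tables {
	int refcount;	/* number of users that called the add function */
	bool open;	/* whether the underlying statements are prepared */
};

/* First caller opens the tables; later callers only take a reference. */
static void vendor_tables_get(struct vendor_tables *vt)
{
	if (vt->refcount++ == 0)
		vt->open = true;	/* real code would prepare statements here */
}

/* Last caller closes the tables; earlier callers only drop a reference. */
static void vendor_tables_put(struct vendor_tables *vt)
{
	if (vt->refcount > 0 && --vt->refcount == 0)
		vt->open = false;	/* real code would finalize statements here */
}
```

With this pattern, a thread that exits early for an offline CPU drops only its own reference, and the tables stay usable for the remaining threads. Code shared between threads would additionally need a lock or atomic operations around the counter.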
mce-amd-smca: update smca_hwid to use smca_bank_types
bank_type is used as smca_bank_types everywhere, there's no point in
declaring it as unsigned int. It also upsets covscan:
3. rasdaemon-0.6.7/mce-amd-smca.c:914: assignment: Assigning: "bank_type" = "s_hwid->bank_type".
7. rasdaemon-0.6.7/mce-amd-smca.c:926: cond_at_most: Checking "bank_type >= 64U" implies that "bank_type" and "s_hwid->bank_type" may be up to 63 on the false branch.
14. rasdaemon-0.6.7/mce-amd-smca.c:942: overrun-local: Overrunning array "smca_mce_descs" of 38 16-byte elements at element index 63 (byte offset 1023) using index "bank_type" (which evaluates to 63).
# 940| /* Only print the descriptor of valid extended error code */
# 941| if (xec < smca_mce_descs[bank_type].num_descs)
# 942|-> mce_snprintf(e->mcastatus_msg,
# 943| "%s. Ext Err Code: %d",
# 944| smca_mce_descs[bank_type].descs[xec],