www.infradead.org Git - users/mchehab/rasdaemon.git/log

Add labels for ASRock X570 Creator

Signed-off-by: Dezponia <150628177+dezponia@users.noreply.github.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

Add labels for ASRock X570S PG Riptide

Signed-off-by: Dezponia <150628177+dezponia@users.noreply.github.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

rasdaemon: mce: decode io port for bus error

mcelog decodes bus errors with the io port like:

  ...
  MCA: BUS error: 0 0 Level-3 Generic IO Request-did-not-timeout
  IO MCA reported by root port 0:7b:07.0
  ...

Introduce the code into rasdaemon.

Signed-off-by: Ruidong Tian <tianruidong@linux.alibaba.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

rasdaemon: arm: do not print error msg if field not found

Fix output from:
2024-12-19 13:52:02 +0800 affinity: 0 MPIDR: 0x810c0200 MIDR: 0x481fd010 running_state: 1 psci_state: 0<CANT FIND FIELD pei_len>
to:
2024-12-19 13:52:02 +0800 affinity: 0 MPIDR: 0x810c0200 MIDR: 0x481fd010 running_state: 1 psci_state: 0

Signed-off-by: Ruidong Tian <tianruidong@linux.alibaba.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

rasdaemon: add DE error type for AMD

hw_event_mc_err_type in the kernel includes HW_EVENT_ERR_DEFERRED for AMD;
add this error type to rasdaemon.

Signed-off-by: Ruidong Tian <tianruidong@linux.alibaba.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

rasdaemon: Fix the display format of JaguarMicro vendor non-standard errors

1) Extra whitespace was added by mce_snprintf. Remove the redundant
whitespace.
2) Print logs on the same line.

Signed-off-by: "Hunter.He" <hunter.he@jaguarmicro.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

rasdaemon: bump to version 0.8.2

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

ras-page-isolation.h: remove extra parenthesis

An extra parenthesis was accidentally added to the
ROW_LOCATION_FIELDS_NUM macro. Remove it.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

rasdaemon: check if sscanf() processed all arguments on dev_name

Ensure that all arguments are parsed by sscanf() when dealing
with dev_name.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
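
A minimal sketch of the kind of check the commit above describes: compare
the return value of sscanf() with the number of conversions requested.
The helper name and the "domain:bus:slot.func" layout are assumptions made
for illustration, not the code from the commit.

```c
#include <stdio.h>

/*
 * Hypothetical helper: parse a "domain:bus:slot.func" device name.
 * Returns 0 only when sscanf() converted all four fields.
 */
static int parse_dev_name(const char *dev_name, unsigned int *domain,
			  unsigned int *bus, unsigned int *slot,
			  unsigned int *func)
{
	int n = sscanf(dev_name, "%x:%x:%x.%x", domain, bus, slot, func);

	/* sscanf() returns the number of successful conversions, or EOF */
	if (n != 4)
		return -1;

	return 0;
}
```

With this shape, a truncated name such as "0000:7b" is rejected instead of
leaving slot and func uninitialized.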

unified-sel.h: convert license boilerplate to SPDX

Use SPDX for GPLv2.0 or later, instead of a license
boilerplate.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

ras-page-isolation.h: fix most coding style issues

Fix several checkpatch.pl warnings:

ras-page-isolation.h:50: WARNING:SPACING: missing space after enum definition
ras-page-isolation.h:79: ERROR:COMPLEX_MACRO: Macros with complex values should be enclosed in parentheses
ras-page-isolation.h:80: ERROR:COMPLEX_MACRO: Macros with complex values should be enclosed in parentheses
ras-page-isolation.h:96: WARNING:CONSTANT_COMPARISON: Comparisons should place the constant on the right side of the test
ras-page-isolation.h:119: WARNING:UNSPECIFIED_INT: Prefer 'unsigned int' to bare use of 'unsigned'

Note that the following warning was not addressed,
as it seems to be a false positive:
ras-page-isolation.h:96: WARNING:CONSTANT_COMPARISON: Comparisons should place the constant on the right side of the test

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

ras-page-isolation: fix location_fields size

The location_fields is used for both APEI and DSM data.
The logic there defines 7 values for APEI and 9 for DSM,
but, with the current logic, it allocates only 7 elements.

This is likely due to a typo. Fix it.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

unified-sel: fix most coding style issues

Solve several issues pointed out by checkpatch.pl.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

unified-sel: replace license boilerplate with SPDX

Use GPL-2.0-or-later SPDX tag instead of a license boilerplate
to GPL-2.0+.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

ras-page-isolation: fix additional coding style issues

Fix some indentation issues and an unneeded parenthesis
inside ras-page-isolation code.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

ras-page-isolation: make memory_location_field static

Those structures are used only internally inside the
ras-page-isolation code. They're not public. Make them static.

While here, fix coding style.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

ras-page-isolation: drop some unneeded prototypes

There is a prototype not used and two other prototypes that
are used only internally inside the function.

Make such functions static and drop the unused one.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

ras-page-isolation: use snprintf() instead of sprintf()

Use the safer snprintf() call to avoid the risk of going past the
buffer.

While here, make row_record_get_id() static.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
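
The sprintf() to snprintf() conversion described above can be sketched as
follows. The helper name and the output format are hypothetical; the point
is only that snprintf() takes the buffer size and that its return value
reveals truncation.

```c
#include <stdio.h>

/*
 * Hypothetical helper: format a row identifier into a caller-provided
 * buffer. Returns 0 on success, -1 if the output did not fit.
 */
static int format_row_id(char *buf, size_t len, unsigned int node,
			 unsigned int bank, unsigned int row)
{
	int n = snprintf(buf, len, "node:%u bank:%u row:%u", node, bank, row);

	/* snprintf() returns the length it needed, excluding the NUL */
	if (n < 0 || (size_t)n >= len)
		return -1;

	return 0;
}
```

Even when the helper reports truncation, the buffer stays NUL-terminated
and nothing is written past its end.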

mce-intel: drop a code commented a long time ago with an action

There is commented-out code in mce-intel that has been in
rasdaemon for a long time.

Remove it.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

mce-intel-ivb/mce-intel-sb: remove code commented with #if 0

The dead code has been there for a long time without any attempts
to actually implement it for SB/IVB. As such CPUs were released
a long time ago, it is unlikely that someone would address the
comments there.

So, drop the dead code. If needed, this patch can be reversed
later.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

rasdaemon: don't use braces for single statement blocks

Solve those checkpatch warnings:

WARNING: braces {} are not necessary for single statement blocks
+ if (clock_gettime(clk_id, &ts) == 0 && !strcmp(ev.error_type, "Corrected")) {
+ ras_record_row_error(ev.driver_detail, ev.error_count, ts.tv_sec, ev.address);
+ }

total: 0 errors, 1 warnings, 0 checks, 304 lines checked
WARNING: braces {} are not necessary for single statement blocks
+ if (!matched) {
+ log(TERM, LOG_INFO, "Improper %s, set to default off\n", env);
+ }

WARNING: braces {} are not necessary for any arm of this statement
+ if (rr1->type == GHES) {
[...]
+ } else {
[...]

WARNING: braces {} are not necessary for single statement blocks
+ for (int i = 0; i < ROW_LOCATION_FIELDS_NUM; i++) {
+ dst->location_fields[i] = src->location_fields[i];
+ }

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

ras-page-isolation: don't use "/**" for normal comments

The usage of /** is reserved for doxygen markups. Don't use
it where it doesn't belong.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

rasdaemon: use __func__ instead of the name of the function

Solve some checkpatch warnings about not using __func__ in the
rasdaemon logs.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

rasdaemon: fix some coding style issues

Use checkpatch to fix some trivial coding style issues with:

$ ./scripts/checkpatch.pl -f *.c -q --strict --fix-inplace

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

ipmitool SEL logging of AER CEs on OpenBMC platforms

Signed-off-by: Krishna Dhulipala <krishnad@meta.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>

The rasdaemon service may fail to be started for the first time.

On first startup, rasdaemon creates a separate tracing instance directory, such as `/sys/kernel/debug/tracing/instances/rasdaemon`.

After the directory is created, the kernel generates virtual files such as `trace_clock` and `set_event` in `/sys/kernel/debug/tracing/instances/rasdaemon`.

The kernel generates these virtual files while rasdaemon is already trying to access them. Therefore, a file may not exist yet when rasdaemon accesses it.

So wait up to 30 seconds to give the kernel enough time to generate the files.

Signed-off-by: zhuofeng <zhuofeng2@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
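
The bounded wait described above could look roughly like the sketch below.
It is an illustration under assumptions, not the code from the commit: the
helper name is hypothetical and the one-second polling interval is an
arbitrary choice.

```c
#include <unistd.h>

/*
 * Hypothetical helper: poll until `path` exists or `timeout_sec` seconds
 * have passed. Returns 0 if the file showed up, -1 on timeout.
 */
static int wait_for_file(const char *path, unsigned int timeout_sec)
{
	unsigned int waited;

	for (waited = 0; ; waited++) {
		if (access(path, F_OK) == 0)
			return 0;
		if (waited >= timeout_sec)
			return -1;
		sleep(1);	/* give the kernel time to create the file */
	}
}
```

A caller would pass the path of a file inside the instance directory and a
timeout of 30, and only open the file once the helper returns 0.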

Makefile: only enable rbtree if needed

Don't enable rbtree and ras-page-isolation code unconditionally.
Only enable it if PFA is compiled.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

New feature: support memory row CE threshold policy

- Introduction: Identify memory row faults in memory CE faults and
isolate the physical memory pages where row faults occur. This method
can effectively prevent CE storms or memory UCE faults caused by memory
row failures.

- Implementation: The system counts the number of CE faults in the same
memory row within a specified period. If the number of CE faults exceeds
the configured threshold, the system considers that the memory row may
fail and isolates all physical pages recorded in the memory row.

Notes:
1. This function is disabled by default. You can enable it by
configuring the 'ROW_CE_ACTION' field in the '/etc/sysconfig/rasdaemon' configuration file.
2. If both row isolation and page isolation are enabled, page isolation is
automatically disabled.
3. If the fault count in the DIMM CE fault information received by rasdaemon
is 0, the BIOS did not correctly fill in the count when parsing the fault information.
In that case, rasdaemon assumes a fault count of 1,
which is the same as what the kernel does.

Signed-off-by: zhuofeng <zhuofeng2@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>

ras-page-isolation: drop an unused variable

There's no need to store the value of strtoul() during the
overflow check. Remove it, as this is causing a warning:

ras-page-isolation.c: In function ‘parse_isolation_env’:
ras-page-isolation.c:166:47: warning: unused variable ‘converted_value’ [-Wunused-variable]
166 | unsigned long converted_value = strtoul(config->env, &endptr, 10);
| ^~~~~~~~~~~~~~~

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

Fix the bug that `config->env` is greater than `ulong_max` when units->val=1

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
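
A sketch of an overflow-safe parse in the spirit of the check this commit
touches. The helper name and its parameters are assumptions; the idea is
that strtoul() reports out-of-range input through errno even when the unit
multiplier is 1, so the range check cannot rely on the multiplication
alone.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse a decimal value and scale it by `unit`.
 * Returns 0 on success, -1 on a malformed string or on overflow.
 */
static int parse_scaled_ulong(const char *str, unsigned long unit,
			      unsigned long *out)
{
	char *endptr;
	unsigned long val;

	errno = 0;
	val = strtoul(str, &endptr, 10);

	/* strtoul() flags out-of-range input with ERANGE */
	if (errno == ERANGE || endptr == str || *endptr != '\0')
		return -1;

	/* Reject values whose product with the unit would wrap around */
	if (unit != 0 && val > ULONG_MAX / unit)
		return -1;

	*out = val * unit;
	return 0;
}
```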

rasdaemon: Modify support for vendor-specific machine check error information

Commit 83a3ced797256d ("rasdaemon: Add support for vendor-specific
machine check error information") assumes that MCA_CONFIG MSR will be
exported as part of vendor-specific error information through the MCE
tracepoint.

The same, however, is not true anymore. MCA_CONFIG MSR will not be
exported through the MCE tracepoint. Instead, the data from MCA_SYND1/2
MSRs, exported as vendor-specific error information on newer AMD SOCs,
should always be interpreted as FRUText.

Modify the error decoding support accordingly.

Fixes: 83a3ced797256d ("rasdaemon: Add support for vendor-specific
machine check error information")
Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

rasdaemon: ras-mc-ctl: Log hpa and region info from cxl_general_media and cxl_dram tables

Add support for reading and logging hpa and region info from the
cxl_general_media and cxl_dram tables.

Note: This change is not backward compatible, because
the select command with the newly added columns fails with previous
CXL tables where those columns are not present.
The issue could be solved by renaming the CXL tables to v2,
but ras-mc-ctl would still not be backward compatible: listing errors
fails when only the previous version of a CXL table is present in the
database, as it cannot find v2 of the table.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>

rasdaemon: CXL: Extract, log and record region info from cxl_general_media and cxl_dram events

Extract, log and record region info from cxl_general_media and
cxl_dram events.

The corresponding kernel changes:
https://lore.kernel.org/all/cover.1711598777.git.alison.schofield@intel.com/T/#m6fd773b5477fc44b875848e053708a1c8996c4e4

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>

rasdaemon: CXL: Fix uncorrectable macro spelling

Fix the spelling of the macro CXL_GMER_EVT_DESC_UNCORECTABLE_EVENT.
Uncorrectable is spelled with two r's.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>

rasdaemon: ras-non-standard-handler: Fix checkpatch warning

Fix the following checkpatch warning:
CHECK: spaces preferred around that '*' (ctx:WxV)
+ sqlite3_stmt *stmt_dec_record;

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>

rasdaemon: ras-events: Fix warning ‘filter_ras_mc_event’ defined but not used

Fix the following compilation warning:
ras-events.c:318:12: warning: ‘filter_ras_mc_event’ defined but not used [-Wunused-function]
static int filter_ras_mc_event(struct ras_events *ras, char *group, char *event,

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>

rasdaemon: ras-arm-handler: Fix checkpatch warning length exceeds 120 columns

Fix the following checkpatch warning in ras-arm-handler:
+ trace_seq_printf(s, " Program execution can be restarted reliably at the PC associated with the error");

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>

rasdaemon: ras-events: removed obsolete code under #if 0

Remove unused code enclosed under #if 0 to fix the checkpatch
warnings.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>

rasdaemon: ras-mce-handler: Fix checkpatch errors

Fix the following checkpatch warnings in ras-mce-handler.c

Delete the obsolete code under #if 0 ... #endif
WARNING: Consider removing the code enclosed by this #if 0 and its #endif

WARNING: Consider removing the code enclosed by this #if 0 and its #endif

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>

rasdaemon: rbtree: removed unused definition for RB_ROOT

Removed unused definition for RB_ROOT from rbtree.h

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>

rasdaemon: Fix for compilation warning in ras-memory-failure-handler.c

Fix the following compilation warning:
ras-memory-failure-handler.c:120:6: warning: implicit declaration of function ‘asprintf’; did you mean ‘vsprintf’? [-Wimplicit-function-declaration]
if (asprintf(&env[ei++], "PATH=%s", getenv("PATH") ?: "/sbin:/usr/sbin:/bin:/usr/bin") < 0)

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>

rasdaemon: Fix mem_fail_event build breakage

Commit 566a52622b1d ("add mem_fail_event trigger") introduces an event
trigger for a memory failure event.

However, if the rasdaemon is not configured with enable-memory-failure,
the setup function of the trigger, mem_fail_event_trigger_setup(), will
result in an undefined reference linker error when called through
setup_event_trigger().

Ensure that the setup function for the trigger is called only when the
rasdaemon has been configured with enable-memory-failure.

Fixes: 566a52622b1d ("add mem_fail_event trigger")
Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

ras-events: fix -d option to work again

It seems commit 3e9a59a184ca ("Add dynamic switch of ras events support.")
inadvertently introduced a change that ignores the -d option.
Fix this so that -d will disable all trace events at once like before.

Signed-off-by: Tomohiro Misono <misono.tomohiro@fujitsu.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

ChangeLog: fix 0.8.1 release date

2023 -> 2024.

Signed-off-by: Baruch Siach <baruch@tkos.co.il>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

ci.yml: Change the name of the second job

Using the same name seems to cause troubles

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

ci.yml: place checkpatch check in separate

This doesn't need to run for all 3 architectures.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

ci.yml: run checkpatch when doing tests

That helps detect new problems in the code.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

Makefile.am: add types.h to the list of headers

Without that, make mock won't work properly.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

scripts/checkpatch.pl: add support for checking SPDX

Now that rasdaemon files have SPDX tags, enforce it via checkpatch
script.

The code was imported from the Linux Kernel, with some changes.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

rasdaemon: enforce SPDX license tags

Replace license text comments with SPDX tags. For files that don't
have any license, use the COPYING license (GPL-2.0).

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

ras-events: demote log information about trace being enabled/disabled

There is already enough information outside __toggle_ras_mc_event()
to identify if a feature was enabled or disabled.

So, this is mostly for debugging purposes.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

rasdaemon: cleanup coding style

Solve a series of coding style warnings:

mce-amd.c:132: WARNING:RETURN_VOID: void function return statements are not generally useful
mce-amd-smca.c:984: CHECK:UNNECESSARY_PARENTHESES: Unnecessary parentheses around 'm->family == 0x19'
non-standard-ampere.c:743: CHECK:UNNECESSARY_PARENTHESES: Unnecessary parentheses around 'err->subtype == 0x01'
non-standard-ampere.c:743: CHECK:UNNECESSARY_PARENTHESES: Unnecessary parentheses around 'err->subtype == 0x02'
non-standard-jaguarmicro.c:382: CHECK:UNNECESSARY_PARENTHESES: Unnecessary parentheses around 'mod_id >= tbl_size'
non-standard-jaguarmicro.c:382: CHECK:UNNECESSARY_PARENTHESES: Unnecessary parentheses around '!module'
non-standard-jaguarmicro.c:425: CHECK:UNNECESSARY_PARENTHESES: Unnecessary parentheses around 'sub_id >= tbl_size'
non-standard-jaguarmicro.c:425: CHECK:UNNECESSARY_PARENTHESES: Unnecessary parentheses around '!sub_module'
ras-cxl-handler.c:408: CHECK:UNNECESSARY_PARENTHESES: Unnecessary parentheses around 'i > 0'
ras-cxl-handler.c:705: CHECK:UNNECESSARY_PARENTHESES: Unnecessary parentheses around 'i > 0'
ras-mce-handler.c:251: WARNING:USE_NEGATIVE_ERRNO: return of an errno should typically be negative (ie: return -ENOMEM)

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

ras-events: make returned error code consistent

- Rework the returned code logic to be more consistent;
  - error codes will be using negative values;
  - positive values indicate special return codes.
- Don't bloat the logs with lots of error messages due to
  unsupported traces;
- Ensure that the number of CPUs will be properly retrieved or bail out;
- Don't bail if it can't setup a monotone clock: it is better
  to have a wrong timestamp than no log at all.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

rasdaemon: add .editorconfig file to follow our coding style

That helps keeping the coding style, as lots of editors support
this file.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

ras-report.h: avoid long lines

Better format the stubs on this file to avoid long lines.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

types.h: remove whitespaces

Cut-and-pasting it from /usr/include/linux/bits.h ended up adding
unwanted whitespaces. Remove those.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

types.h: don't depend on linux/bits.h

Such include would require Kernel sources to be installed.
We don't really need that: just copy the two GENMASK macros
and be done with it.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
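
For readers unfamiliar with GENMASK, the sketch below shows what such a
macro computes. The DEMO_ names are hypothetical and the definition is a
simplified illustration, not a copy of what types.h contains.

```c
#include <limits.h>

#define DEMO_BITS_PER_LONG	(sizeof(unsigned long) * CHAR_BIT)

/* Mask with bits h..l (inclusive) set, for 0 <= l <= h < bits per long */
#define DEMO_GENMASK(h, l) \
	(((~0UL) << (l)) & (~0UL >> (DEMO_BITS_PER_LONG - 1 - (h))))
```

For instance, DEMO_GENMASK(7, 4) selects bits 4 to 7, which is 0xf0.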

ras-events: don't use extern inside a C file

Fix a checkpatch warning:

ras-events.c:66: WARNING:AVOID_EXTERNS: externs should be avoided in .c files

by improving how the checks_inside var is handled.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

rasdaemon: don't use unsafe strcpy, strcat and sprintf

Remove all occurrences of those calls.

While here, also fix a couple missing whitespace warnings.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

types.h: add an implementation for strscpy() and strscat()

Do our own implementation for such routines, as the Kernel
implementation is a lot more complex than what is needed
here.

With that, change checkpatch.pl to request usage of such functions
instead of unsafe strcpy()/strcat().

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
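
To illustrate what a simple bounded copy and append can look like, here is
a minimal sketch. The demo_ functions are hypothetical and are not the
implementation added to types.h; in particular, the return convention
(-1 on truncation) is an assumption.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical bounded copy: always NUL-terminates when size > 0.
 * Returns the number of bytes copied, or -1 if `src` was truncated.
 */
static long demo_strscpy(char *dst, const char *src, size_t size)
{
	size_t len;

	if (size == 0)
		return -1;

	len = strlen(src);
	if (len >= size) {
		memcpy(dst, src, size - 1);
		dst[size - 1] = '\0';
		return -1;
	}

	memcpy(dst, src, len + 1);	/* includes the terminating NUL */
	return (long)len;
}

/* Hypothetical bounded append built on top of demo_strscpy() */
static long demo_strscat(char *dst, const char *src, size_t size)
{
	const char *end = memchr(dst, '\0', size);

	if (!end)
		return -1;	/* dst is not terminated within size */

	return demo_strscpy(dst + (end - dst), src, size - (size_t)(end - dst));
}
```

Unlike strcpy() and strcat(), both helpers take the total size of the
destination buffer, so they cannot write past its end.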

ras-events: drop a dead code to check number of CPUs

Just use sysconf(_SC_NPROCESSORS_ONLN) here.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
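
A small sketch of the sysconf() call mentioned above, with the error case
handled. The wrapper name is hypothetical.

```c
#include <unistd.h>

/* Returns the number of online CPUs, or -1 if it cannot be determined */
static long online_cpus(void)
{
	long n = sysconf(_SC_NPROCESSORS_ONLN);

	return n > 0 ? n : -1;
}
```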

ras-report: fix coding style and string fill issues

Don't use unsafe sprintf(). Instead, re-implement the logic in
a way that buffer overflows won't occur.

While here, also avoid lines longer than 80 columns when possible.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

non-standard-jaguarmicro: avoid CamelCase

Coding-style: no need to use CamelCase here. So, use lowercase.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

checkpatch.pl: warn also about strcat and sprintf usages

strcpy, strncpy and sprintf aren't safe, as they don't check
buffer overflows. Change the checkpatch logic to warn about
such usages.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

rasdaemon: alphabetically sort includes

Reorder includes to ensure that they'll all be alphabetically
sorted.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

ras-arm-handler: use GENMASK() macro

Now that we have the macro defined on types.h, use it.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

rasdaemon: move type macros to a separate header (types.h)

That makes it easier to use/maintain, without needing to include
ras-record.h when all that is needed are common macros.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

rasdaemon: fix a coding style issue

Comment block indentation was wrong. Fix it.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

ras-arm-handler: Parse and log ARM Processor Error Info table

Parse and log ARM Processor Error Info table data, UEFI 2.9A/2.10
specs section N2.4.4.1.

[mchehab: fix a typo]
Suggested-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

rasdaemon: fix some typos and correct spelling

With the help of checkpatch.pl --codespell, fix some typos.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

scripts/checkpatch.pl: set default mode to strict

There aren't many false positives. So, change default to strict
mode.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

ras-arm-handler: cope with latest upstream changes

Unfortunately, rasdaemon support for the firmware first
CPER ARM processor extended trace was added years before
having it merged upstream. That's bad, especially since
upstream revision requested a change on some fields.

Fix support for it by aligning with latest upstream version:
https://lore.kernel.org/linux-edac/3853853f820a666253ca8ed6c7c724dc3d50044a.1720679234.git.mchehab+huawei@kernel.org/T/#m17003e47912b228e91e57ac6e4f90ea30061aa3b

A backward-compatible logic was added to avoid breaking with
existing OOT support.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

scripts/checkpatch.pl: some improvements to reduce false positives

- camelcase is OK for printk inttypes.h;
- strncpy is OK;
- accept up to 120 chars on lines without warnings;
- stop complaining about "BACKTRACE=" strings split on multiple lines;
- remove PREFER_DEFINED_ATTRIBUTE_MACRO, as this is kernel-specific;
- remove MACRO_ARG_REUSE, as this applies mostly to multithreading;
- don't warn on using do{} while(0) with single line statements;

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

rasdaemon: coding style cleanup

Solve lots of coding style issues reported by:

./scripts/checkpatch.pl --terse --show-types --strict \
-f $(git ls-files|grep -E '\.[ch]$') \
--ignore MACRO_ARG_REUSE,STRCPY,IF_0,UNNECESSARY_PARENTHESES,CAMELCASE,STRNCPY; done

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

scripts/checkpatch.pl: do some additional cleanups

Remove more things that won't make sense for rasdaemon.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

Bump version to 0.8.1

There were lots of changes on this version. The summary at
ChangeLog contains a sanitized version of it.

Note that the next version will likely bring
a uAPI-incompatible change. Unfortunately, the UEFI CPER record
trace for ARM processors is currently incomplete upstream.

Rasdaemon gained support for an extended arm trace event that
supports all fields of the CPER record, but it depends on a
patch that is not upstreamed yet.

While looking at such patches, some changes turned out to be needed
to get it merged, meaning that future versions of rasdaemon
may not be compatible with the downstream patch anymore.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

rasdaemon: adjust install targets for the spec to be built

We use Fedora spec file to check if everything is OK. Do some
changes to make it happy.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

ras-aer-handler: handle errors when running ipmitool

Without that, Fedora build will produce warnings.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

rbtree.h: Fix an issue introduced by checkpatch logic

Checkpatch actually broke RB_EMPTY_ROOT macro. It was defined
as:

#define RB_EMPTY_ROOT(root) ((root)->rb_node == NULL)

It ended replacing it by:
((root)->!rb_node)

Which is not what we expect. Weirdly enough, this was compiling.
Anyway, what we want, instead, is:

#define RB_EMPTY_ROOT(root) (!(root)->rb_node)

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

rasdaemon: do some coding style cleanups

Adjust coding style on some files to somewhat match the Kernel's
coding style, with the help of scripts/checkpatch.pl.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

scripts/checkpatch.pl: add a script to check coding style

We sort of follow Kernel coding style. Import a version of it,
making it compatible with rasdaemon coding style by removing
stuff that doesn't fit here.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

rasdaemon: mce-amd-smca: Optimizing decoding of MCA_CTL_SMU bits

Improve on the smca_smu2_mce_desc handling introduced by commit ced615c.

Update existing array with extended error descriptions instead
of creating new array, simplifying the code.

Signed-off-by: Sathya Priya Kumar <sathyapriya.k@amd.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

Add labels for TRX50 WS

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

Cleanup MCE error log on non-x86 args

We can only register for MCE on x86 arch.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

contrib/qemu_einj.py: make it more generic to allow other einj types

Currently, the einj logic handles just ARM processor CPER events, but
it is easy to change it to support other types as well.

Rename the script and make it more generic to accept new subparsers
for different types of EINJ.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

rasdaemon: add mem_fail_event trigger

This event is somewhat similar to mc_event, except that this one
occurs on ARM platforms and the fields are different.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

trigger: parse only once TRIGGER_DIR env variable

Instead of parsing TRIGGER_DIR every time a new event happens,
store the trigger full path, simplifying the logic and avoiding
memory leaks.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
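
The "resolve once, reuse later" idea described above can be sketched as
follows. The helper name is hypothetical and the executable check is an
assumption; the caller would run this at setup time and keep the returned
path for every later event.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: build "<dir>/<script>" once and verify that it is
 * executable. Returns a malloc()ed path owned by the caller, or NULL.
 */
static char *resolve_trigger(const char *dir, const char *script)
{
	size_t len;
	char *path;

	if (!dir || !script || !*script)
		return NULL;

	len = strlen(dir) + strlen(script) + 2;	/* '/' and NUL */
	path = malloc(len);
	if (!path)
		return NULL;

	snprintf(path, len, "%s/%s", dir, script);

	if (access(path, X_OK) != 0) {
		free(path);
		return NULL;
	}

	return path;
}
```

Because the path is allocated once at startup instead of per event, there
is no per-event allocation that could leak.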

ras-mc-handler: cleanup trigger logic

- Only setup mc_ce_trigger/mc_ue_trigger if the trigger is
  valid;

- Check if the trigger is there before doing strcmp, as
  checking if a pointer is not null is faster than strcmp();

- Ensure that the trigger env vars will be const, as we don't
  want to accidentally override those env vars;

- Print trigger enabled messages when rasdaemon runs with -f;

- ensure that trigger variables will initialize to NULL;

- coding style cleanups.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

rasdaemon: add mc_event trigger

Allow users to run a trigger when a RAS mc_event occurs. The mc_event
trigger is split into a CE trigger and a UE trigger because
CEs are more frequent than UEs, so the CE trigger has a larger
performance impact. Users can choose different triggers for CE/UE to
reduce this effect.

Users can configure the triggers in /etc/sysconfig/rasdaemon:

    TRIGGER_DIR: The trigger directory
    MC_CE_TRIGGER: The script executed when a corrected error occurs.
    MC_UE_TRIGGER: The script executed when an uncorrected error occurs.

No script will be executed if MC_CE_TRIGGER/MC_UE_TRIGGER is null.

Signed-off-by: Ruidong Tian <tianruidong@linux.alibaba.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

util/arm_einj.py: fix a typo at virt-addr

Typo: QAPI parameter is virt-addr.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

util/arm_einj.py: remove a debug print

This was meant only for testing argument handling. Remove it.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

util/arm_einj.py: add a utility for ARM error injection via QEMU

Testing rasdaemon is not easy, as it depends on either having
real hardware producing events or a test BIOS. This is usually
not available and/or not too reliable.

So, take a different approach by adding a QEMU QAPI designed for
doing hardware error injection. The QEMU patches are at:

https://gitlab.com/mchehab_kernel/qemu/-/tree/arm-error-inject-v2

And some instructions about how to use it are at rasdaemon wiki
pages at github:

https://github.com/mchehab/rasdaemon/wiki

Add the error injection tool to rasdaemon sources.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

ras-arm-handler: be compatible with upstream Kernel

Changeset e37eb2f11a82 ("Add code to decode Ampere specific error")
broke ARM event recording with the upstream Kernel, as it requires a
different trace event than the one that is in the upstream Kernel; that
event is part of a pending pull request:

https://lore.kernel.org/all/20240321-b4-arm-ras-error-vendor-info-v5-rc3-v5-0-850f9bfb97a8@os.amperecomputing.com/

Restore its behavior by making the parsing of the UEFI 2.6+ N.17 and N.16
table extra fields optional. That should make it compatible
with current upstream Kernels again.

Fixes: e37eb2f11a82 ("Add code to decode Ampere specific error")
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

Do a coding style cleanup with regards to tabs and white spaces

Use tabs instead of spaces and remove blank ending whitespaces.

No functional changes.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

rasdaemon: Add Corrected Internal Error for aer_cor_errors

Add "Corrected Internal Error" for aer_cor_errors to decode
the error reported in status register in bit 14.

Signed-off-by: Jesus Esquivel <jesus.esquivel@intel.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

rasdaemon: Update SMCA bank error descriptions

Update error descriptions of SMCA bank types to support AMD's new Family
1Ah-based processors.
Also, modify some existing error descriptions to better reflect the error
received.

Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

Add Lenovo P920 DIMM labels

This adds the labels entry for the Lenovo ThinkStation P920.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

rasdaemon: Fix for vendor errors are not recorded in the SQLite database if some cpus are offline

Fix vendor errors not being recorded in the SQLite database if some CPUs
are offline at system start.

Issue:

This issue is reproducible by offlining some CPUs, running
./rasdaemon -f --record & and
injecting a vendor-specific error supported by rasdaemon.

Reason:

When the system starts with some of the CPUs offline and rasdaemon is then
run, read_ras_event_all_cpus() exits with an error and rasdaemon switches to
the multi-threaded path. However, read() in read_ras_event() returns an error
in the thread of each offline CPU, and that thread then does cleanup, including
calling ras_ns_finalize_vendor_tables(), which invokes sqlite3_finalize() on the
vendor tables created. Thus the vendor error data is not stored in the SQLite
database when such an error is reported next time.

Solution:

In ras_ns_add_vendor_tables() and ras_ns_finalize_vendor_tables(), use a
reference count and close the vendor tables created in
ras_ns_add_vendor_tables() based on that reference count.

Reported-by: Junhao He <hejunhao3@huawei.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
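
The reference-counting scheme described in the solution can be sketched as
below. The struct and function names are hypothetical stand-ins, the
open/close steps are reduced to a flag, and the sketch ignores locking,
which real multi-threaded code would need.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the shared vendor-table state */
struct vendor_tables {
	unsigned int refcount;
	bool open;
};

/* Called by each reader thread; only the first caller opens the tables */
static void vendor_tables_get(struct vendor_tables *t)
{
	if (t->refcount++ == 0)
		t->open = true;		/* e.g. prepare SQLite statements */
}

/* Called on thread cleanup; only the last caller closes the tables */
static void vendor_tables_put(struct vendor_tables *t)
{
	if (t->refcount == 0)
		return;			/* unbalanced put, nothing to do */

	if (--t->refcount == 0)
		t->open = false;	/* e.g. finalize SQLite statements */
}
```

With this pattern, a thread that exits early because its CPU is offline
drops only its own reference, and the tables stay usable for the remaining
threads.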

mce-amd-smca: update smca_hwid to use smca_bank_types

bank_type is used as smca_bank_types everywhere, there's no point in
declaring it as unsigned int. It also upsets covscan:

3. rasdaemon-0.6.7/mce-amd-smca.c:914: assignment: Assigning: "bank_type" = "s_hwid->bank_type".
7. rasdaemon-0.6.7/mce-amd-smca.c:926: cond_at_most: Checking "bank_type >= 64U" implies that "bank_type" and "s_hwid->bank_type" may be up to 63 on the false branch.
14. rasdaemon-0.6.7/mce-amd-smca.c:942: overrun-local: Overrunning array "smca_mce_descs" of 38 16-byte elements at element index 63 (byte offset 1023) using index "bank_type" (which evaluates to 63).
#   940|        /* Only print the descriptor of valid extended error code */
#   941|        if (xec < smca_mce_descs[bank_type].num_descs)
#   942|->              mce_snprintf(e->mcastatus_msg,
#   943|                             "%s. Ext Err Code: %d",
#   944|                             smca_mce_descs[bank_type].descs[xec],

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
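
The class of overrun covscan reports above (an index validated against one
limit and then used on a smaller array) can be illustrated with the sketch
below. The table and names are hypothetical; the commit itself addresses
the report by changing the type of bank_type, whereas this sketch only
shows the bound being tied to the array's real size.

```c
#include <stddef.h>

#define DEMO_ARRAY_SIZE(a)	(sizeof(a) / sizeof((a)[0]))

/* Hypothetical table, much smaller than the real descriptor array */
static const char * const demo_descs[] = { "LS", "IF", "L2_CACHE" };

/* Return the description for `bank_type`, or NULL if it is out of range */
static const char *demo_bank_desc(unsigned int bank_type)
{
	/* Check against the table's real size, not an unrelated constant */
	if (bank_type >= DEMO_ARRAY_SIZE(demo_descs))
		return NULL;

	return demo_descs[bank_type];
}
```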

labels/asrock: Add DIMM labels for ASRock Rack X570D4U

Signed-off-by: Ivan Mironov <mironov.ivan@gmail.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>