www.infradead.org Git - users/mchehab/rasdaemon.git/log
users/mchehab/rasdaemon.git
5 months agorasdaemon: ras-events: Fix warning ‘filter_ras_mc_event’ defined but not used
Shiju Jose [Mon, 19 Aug 2024 10:56:15 +0000 (11:56 +0100)]
rasdaemon: ras-events: Fix warning ‘filter_ras_mc_event’ defined but not used

Fix the following compilation warning:
ras-events.c:318:12: warning: ‘filter_ras_mc_event’ defined but not used [-Wunused-function]
 static int filter_ras_mc_event(struct ras_events *ras, char *group, char *event,

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
5 months agorasdaemon: ras-arm-handler: Fix checkpatch warning length exceeds 120 columns
Shiju Jose [Mon, 19 Aug 2024 10:51:30 +0000 (11:51 +0100)]
rasdaemon: ras-arm-handler: Fix checkpatch warning length exceeds 120 columns

Fix the following checkpatch warning in ras-arm-handler:
+ trace_seq_printf(s, " Program execution can be restarted reliably at the PC associated with the error");

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
5 months agorasdaemon: ras-events: removed obselete code under #if 0
Shiju Jose [Mon, 19 Aug 2024 10:48:06 +0000 (11:48 +0100)]
rasdaemon: ras-events: removed obselete code under #if 0

Remove unused code enclosed under #if 0 to fix the checkpatch
warnings.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
5 months agorasdaemon: ras-mce-handler: Fix checkpatch errors
Shiju Jose [Mon, 19 Aug 2024 10:44:58 +0000 (11:44 +0100)]
rasdaemon: ras-mce-handler: Fix checkpatch errors

Fix the following checkpatch warnings in ras-mce-handler.c by deleting
the obsolete code under #if 0 ... #endif:
WARNING: Consider removing the code enclosed by this #if 0 and its #endif

WARNING: Consider removing the code enclosed by this #if 0 and its #endif

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
5 months agorasdaemon: rbtree: removed unused definition for RB_ROOT
Shiju Jose [Mon, 19 Aug 2024 10:38:16 +0000 (11:38 +0100)]
rasdaemon: rbtree: removed unused definition for RB_ROOT

Removed unused definition for RB_ROOT from rbtree.h

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
5 months agorasdaemon: Fix for compilation warning in ras-memory-failure-handler.c
Shiju Jose [Mon, 19 Aug 2024 10:33:08 +0000 (11:33 +0100)]
rasdaemon: Fix for compilation warning in ras-memory-failure-handler.c

Fix the following compilation warning:
ras-memory-failure-handler.c:120:6: warning: implicit declaration of function ‘asprintf’; did you mean ‘vsprintf’? [-Wimplicit-function-declaration]
  if (asprintf(&env[ei++], "PATH=%s", getenv("PATH") ?: "/sbin:/usr/sbin:/bin:/usr/bin") < 0)

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
5 months agorasdaemon: Fix mem_fail_event build breakage
Avadhut Naik [Fri, 16 Aug 2024 18:10:40 +0000 (18:10 +0000)]
rasdaemon: Fix mem_fail_event build breakage

Commit 566a52622b1d ("add mem_fail_event trigger") introduces an event
trigger for a memory failure event.

However, if the rasdaemon is not configured with enable-memory-failure,
the setup function of the trigger, mem_fail_event_trigger_setup(), will
result in an undefined reference linker error when called through
setup_event_trigger().

Ensure that the setup function for the trigger is called only when the
rasdaemon has been configured with enable-memory-failure.
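
A minimal sketch of this kind of guard, assuming the configure option is
exposed to the C code as a HAVE_MEMORY_FAILURE macro and that the event is
named "memory_failure_event" (both are assumptions here, and
setup_event_trigger_sketch() is a hypothetical stand-in for the real
setup_event_trigger()):

```c
#include <string.h>

#ifdef HAVE_MEMORY_FAILURE
/* Only declared and linked when the feature is configured in. */
void mem_fail_event_trigger_setup(void);
#endif

/*
 * Hypothetical stand-in for setup_event_trigger().
 * Returns 1 if a setup hook ran for the event, 0 otherwise.
 */
int setup_event_trigger_sketch(const char *event)
{
#ifdef HAVE_MEMORY_FAILURE
	/* The call only exists in builds that also provide the symbol. */
	if (!strcmp(event, "memory_failure_event")) {
		mem_fail_event_trigger_setup();
		return 1;
	}
#endif
	(void)event;	/* silence unused-parameter warnings in the other build */
	return 0;
}
```

With the macro undefined, the reference to mem_fail_event_trigger_setup()
never reaches the linker, which is the property the fix relies on.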

Fixes: 566a52622b1d ("add mem_fail_event trigger")
Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
5 months agoras-events: fix -d option to work again
Tomohiro Misono [Thu, 8 Aug 2024 09:21:17 +0000 (09:21 +0000)]
ras-events: fix -d option to work again

It seems commit 3e9a59a184ca ("Add dynamic switch of ras events support.")
inadvertently introduced a change that ignores the -d option.
Fix this so that -d disables all trace events at once, as before.

Signed-off-by: Tomohiro Misono <misono.tomohiro@fujitsu.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
5 months agoChangeLog: fix 0.8.1 release date
Baruch Siach [Tue, 30 Jul 2024 07:02:10 +0000 (10:02 +0300)]
ChangeLog: fix 0.8.1 release date

2023 -> 2024.

Signed-off-by: Baruch Siach <baruch@tkos.co.il>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agoci.yml: Change the name of the second job
Mauro Carvalho Chehab [Fri, 19 Jul 2024 09:11:05 +0000 (11:11 +0200)]
ci.yml: Change the name of the second job

Using the same name for both jobs seems to cause trouble.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agoci.yml: place checkpatch check in separate
Mauro Carvalho Chehab [Fri, 19 Jul 2024 08:58:07 +0000 (10:58 +0200)]
ci.yml: place checkpatch check in separate

This doesn't need to run for all 3 architectures.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agoci.yml: run checkpatch when doing tests
Mauro Carvalho Chehab [Fri, 19 Jul 2024 08:54:56 +0000 (10:54 +0200)]
ci.yml: run checkpatch when doing tests

That helps detect new problems in the code.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agoMakefile.am: add types.h to the list of headers
Mauro Carvalho Chehab [Fri, 19 Jul 2024 08:42:25 +0000 (10:42 +0200)]
Makefile.am: add types.h to the list of headers

Without that, make mock won't work properly.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agoscripts/checkpatch.pl: add support for checking SPDX
Mauro Carvalho Chehab [Fri, 19 Jul 2024 08:38:58 +0000 (10:38 +0200)]
scripts/checkpatch.pl: add support for checking SPDX

Now that rasdaemon files have SPDX tags, enforce it via checkpatch
script.

The code was imported from the Linux Kernel, with some changes.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agorasdaemon: enforce SPDX license tags
Mauro Carvalho Chehab [Fri, 19 Jul 2024 07:53:41 +0000 (09:53 +0200)]
rasdaemon: enforce SPDX license tags

Replace license text comments with SPDX tags. For files that don't
have any license, use the COPYING license (GPL-2.0).

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agoras-events: demote log information about trace being enabled/disabled
Mauro Carvalho Chehab [Fri, 19 Jul 2024 07:38:52 +0000 (09:38 +0200)]
ras-events: demote log information about trace being enabled/disabled

There is already enough information outside __toggle_ras_mc_event()
to identify if a feature was enabled or disabled.

So, this is mostly for debugging purposes.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agorasdaemon: cleanup coding style
Mauro Carvalho Chehab [Fri, 19 Jul 2024 07:29:51 +0000 (09:29 +0200)]
rasdaemon: cleanup coding style

Solve a series of coding style warnings:

mce-amd.c:132: WARNING:RETURN_VOID: void function return statements are not generally useful
mce-amd-smca.c:984: CHECK:UNNECESSARY_PARENTHESES: Unnecessary parentheses around 'm->family == 0x19'
non-standard-ampere.c:743: CHECK:UNNECESSARY_PARENTHESES: Unnecessary parentheses around 'err->subtype == 0x01'
non-standard-ampere.c:743: CHECK:UNNECESSARY_PARENTHESES: Unnecessary parentheses around 'err->subtype == 0x02'
non-standard-jaguarmicro.c:382: CHECK:UNNECESSARY_PARENTHESES: Unnecessary parentheses around 'mod_id >= tbl_size'
non-standard-jaguarmicro.c:382: CHECK:UNNECESSARY_PARENTHESES: Unnecessary parentheses around '!module'
non-standard-jaguarmicro.c:425: CHECK:UNNECESSARY_PARENTHESES: Unnecessary parentheses around 'sub_id >= tbl_size'
non-standard-jaguarmicro.c:425: CHECK:UNNECESSARY_PARENTHESES: Unnecessary parentheses around '!sub_module'
ras-cxl-handler.c:408: CHECK:UNNECESSARY_PARENTHESES: Unnecessary parentheses around 'i > 0'
ras-cxl-handler.c:705: CHECK:UNNECESSARY_PARENTHESES: Unnecessary parentheses around 'i > 0'
ras-mce-handler.c:251: WARNING:USE_NEGATIVE_ERRNO: return of an errno should typically be negative (ie: return -ENOMEM)

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agoras-events: make returned error code consistent
Mauro Carvalho Chehab [Fri, 19 Jul 2024 06:24:25 +0000 (08:24 +0200)]
ras-events: make returned error code consistent

- Rework the returned code logic to be more consistent;
  - error codes will be using negative values;
  - positive values indicate special return codes.
- Don't bloat the logs with lots of error messages due to
  unsupported traces;
- Ensure that the number of CPUs is properly retrieved, or bail out;
- Don't bail if it can't set up a monotonic clock: it is better
  to have a wrong timestamp than no log at all.
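
The return-code convention above can be sketched as follows. The names
enable_event_sketch(), setup_event_sketch() and EVENT_UNSUPPORTED are
hypothetical and only illustrate the sign convention, not the actual
ras-events code:

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical positive "special" code: not an error, just unsupported. */
enum { EVENT_UNSUPPORTED = 1 };

/* 0 on success, negative errno on failure, positive for special cases. */
int enable_event_sketch(const char *name, int kernel_has_event)
{
	if (!name)
		return -EINVAL;
	if (!kernel_has_event)
		return EVENT_UNSUPPORTED;
	return 0;
}

/* Caller: only negative values are reported and propagated as errors. */
int setup_event_sketch(const char *name, int kernel_has_event)
{
	int rc = enable_event_sketch(name, kernel_has_event);

	if (rc < 0) {
		fprintf(stderr, "can't enable event: error %d\n", rc);
		return rc;
	}
	/* rc > 0: unsupported trace, skipped without flooding the log */
	return 0;
}
```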

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agorasdaemon: add .editorconfig file to follow our coding style
Mauro Carvalho Chehab [Fri, 19 Jul 2024 05:36:17 +0000 (07:36 +0200)]
rasdaemon: add .editorconfig file to follow our coding style

That helps keep the coding style consistent, as lots of editors support
this file.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agoras-report.h: avoid long lines
Mauro Carvalho Chehab [Thu, 18 Jul 2024 16:06:49 +0000 (18:06 +0200)]
ras-report.h: avoid long lines

Better format the stubs on this file to avoid long lines.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agotypes.h: remove whitespaces
Mauro Carvalho Chehab [Thu, 18 Jul 2024 16:02:43 +0000 (18:02 +0200)]
types.h: remove whitespaces

Cutting and pasting it from /usr/include/linux/bits.h ended up adding
unwanted whitespace. Remove it.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agotypes.h: don't depend on linux/bits.h
Mauro Carvalho Chehab [Thu, 18 Jul 2024 15:51:28 +0000 (17:51 +0200)]
types.h: don't depend on linux/bits.h

Such an include would require Kernel sources to be installed.
We don't really need that: just copy the two GENMASK macros
and be done with it.
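
For illustration, a self-contained pair of such macros could look like the
sketch below. It follows the usual shape of the Kernel macros, but the
BITS_PER_* helpers are defined locally here and the exact text in types.h
may differ:

```c
#include <limits.h>

/* Local helpers for this sketch; not taken from types.h. */
#define BITS_PER_LONG		(sizeof(unsigned long) * CHAR_BIT)
#define BITS_PER_LONG_LONG	(sizeof(unsigned long long) * CHAR_BIT)

/* Mask with bits h..l (inclusive) set, as unsigned long. */
#define GENMASK(h, l) \
	(((~0UL) << (l)) & (~0UL >> (BITS_PER_LONG - 1 - (h))))

/* Same, as unsigned long long. */
#define GENMASK_ULL(h, l) \
	(((~0ULL) << (l)) & (~0ULL >> (BITS_PER_LONG_LONG - 1 - (h))))
```

For example, GENMASK(7, 4) selects bits 7 down to 4 of a register value.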

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agoras-events: don't use extern inside a C file
Mauro Carvalho Chehab [Thu, 18 Jul 2024 15:40:17 +0000 (17:40 +0200)]
ras-events: don't use extern inside a C file

Fix a checkpatch warning:

ras-events.c:66: WARNING:AVOID_EXTERNS: externs should be avoided in .c files

by improving how the checks_inside variable is handled.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agorasdaemon: don't use unsafe strcpy, strcat and sprintf
Mauro Carvalho Chehab [Thu, 18 Jul 2024 11:02:30 +0000 (13:02 +0200)]
rasdaemon: don't use unsafe strcpy, strcat and sprintf

Remove all occurrences of those calls.

While here, also fix a couple missing whitespace warnings.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agotypes.h: add an implementation for strscpy() and strscat()
Mauro Carvalho Chehab [Thu, 18 Jul 2024 14:44:47 +0000 (16:44 +0200)]
types.h: add an implementation for strscpy() and strscat()

Add our own implementation of these routines, as the Kernel
implementation is a lot more complex than what is needed
here.

With that, change checkpatch.pl to request usage of such functions
instead of unsafe strcpy()/strcat().
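
A simplified take on such helpers might look like the sketch below. The
sketch_ prefix marks these as illustrations; the return convention
(bytes copied, or -E2BIG on truncation) mirrors the Kernel's strscpy() and
is an assumption about what types.h does:

```c
#include <errno.h>
#include <stddef.h>

/*
 * Copy src into dst (size bytes), always NUL-terminating when size > 0.
 * Returns the number of bytes copied, or -E2BIG if src was truncated.
 */
long sketch_strscpy(char *dst, const char *src, size_t size)
{
	size_t i;

	if (!size)
		return -E2BIG;

	for (i = 0; i < size - 1 && src[i]; i++)
		dst[i] = src[i];
	dst[i] = '\0';

	return src[i] ? -E2BIG : (long)i;
}

/*
 * Append src to the string in dst (size bytes in total).
 * Returns the resulting length, or -E2BIG if it did not fit.
 */
long sketch_strscat(char *dst, const char *src, size_t size)
{
	size_t used = 0;
	long ret;

	while (used < size && dst[used])
		used++;
	if (used == size)	/* dst is not terminated within size */
		return -E2BIG;

	ret = sketch_strscpy(dst + used, src, size - used);
	return ret < 0 ? ret : (long)used + ret;
}
```

Unlike strcpy()/strcat(), both helpers take the destination size and never
write past it.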

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agoras-events: drop a dead code to check number of CPUs
Mauro Carvalho Chehab [Thu, 18 Jul 2024 12:23:43 +0000 (14:23 +0200)]
ras-events: drop a dead code to check number of CPUs

Just use sysconf(_SC_NPROCESSORS_ONLN) here.
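
As a minimal sketch, the call can be wrapped like this (online_cpus() is a
hypothetical helper name, not a function from ras-events.c):

```c
#include <unistd.h>

/* Returns the number of online CPUs, or -1 if it cannot be determined. */
long online_cpus(void)
{
	long n = sysconf(_SC_NPROCESSORS_ONLN);

	return n > 0 ? n : -1;
}
```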

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agoras-report: fix coding style and string fill issues
Mauro Carvalho Chehab [Thu, 18 Jul 2024 10:58:23 +0000 (12:58 +0200)]
ras-report: fix coding style and string fill issues

Don't use unsafe sprintf(). Instead, re-implement the logic in
a way that buffer overflows won't occur.
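
One common way to do that is to track the write position and let
vsnprintf() enforce the remaining space. The helper below is a hypothetical
sketch of the idea, not the code used in ras-report:

```c
#include <stdarg.h>
#include <stdio.h>

/*
 * Append formatted text to buf at offset *pos, never writing past size.
 * Returns 0 on success, -1 if the output was truncated or an error occurred.
 */
int buf_appendf(char *buf, size_t size, size_t *pos, const char *fmt, ...)
{
	va_list ap;
	int n;

	if (*pos >= size)
		return -1;

	va_start(ap, fmt);
	n = vsnprintf(buf + *pos, size - *pos, fmt, ap);
	va_end(ap);

	if (n < 0)
		return -1;
	if ((size_t)n >= size - *pos) {
		*pos = size - 1;	/* buffer is full, still NUL-terminated */
		return -1;
	}

	*pos += (size_t)n;
	return 0;
}
```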

While here, also avoid lines longer than 80 columns when possible.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agonon-standard-jaguarmicro: avoid CamelCase
Mauro Carvalho Chehab [Thu, 18 Jul 2024 09:54:08 +0000 (11:54 +0200)]
non-standard-jaguarmicro: avoid CamelCase

Coding-style: no need to use CamelCase here. So, use lowercase.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agocheckpatch.pl: warn also about strcat and sprintf usages
Mauro Carvalho Chehab [Thu, 18 Jul 2024 11:01:00 +0000 (13:01 +0200)]
checkpatch.pl: warn also about strcat and sprintf usages

strcpy, strcat and sprintf aren't safe, as they don't check
for buffer overflows. Change the checkpatch logic to warn about
such usages.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agorasdaemon: alphabetically sort includes
Mauro Carvalho Chehab [Thu, 18 Jul 2024 09:45:16 +0000 (11:45 +0200)]
rasdaemon: alphabetically sort includes

Reorder includes to ensure that they'll all be alphabetically
sorted.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agoras-arm-handler: use GENMASK() macro
Mauro Carvalho Chehab [Thu, 18 Jul 2024 08:44:57 +0000 (10:44 +0200)]
ras-arm-handler: use GENMASK() macro

Now that we have the macro defined on types.h, use it.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agorasdaemon: move type macros to a separate header (types.h)
Mauro Carvalho Chehab [Thu, 18 Jul 2024 08:43:17 +0000 (10:43 +0200)]
rasdaemon: move type macros to a separate header (types.h)

That makes them easier to use and maintain, without needing to include
ras-record.h when all that is needed are the common macros.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agorasdaemon: fix a coding style issue
Mauro Carvalho Chehab [Thu, 18 Jul 2024 08:43:07 +0000 (10:43 +0200)]
rasdaemon: fix a coding style issue

Comment block indentation was wrong. Fix it.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agoras-arm-handler: Parse and log ARM Processor Error Info table
Shiju Jose [Tue, 16 Jul 2024 16:36:59 +0000 (17:36 +0100)]
ras-arm-handler: Parse and log ARM Processor Error Info table

Parse and log ARM Processor Error Info table data, UEFI 2.9A/2.10
specs section N2.4.4.1.

[mchehab: fix a typo]
Suggested-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agorasdaemon: fix some typos and correct spelling
Mauro Carvalho Chehab [Wed, 17 Jul 2024 05:19:05 +0000 (07:19 +0200)]
rasdaemon: fix some typos and correct spelling

With the help of checkpatch.pl --codespell, fix some typos.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agoscripts/checkpatch.pl: set default mode to strict
Mauro Carvalho Chehab [Wed, 17 Jul 2024 05:11:44 +0000 (07:11 +0200)]
scripts/checkpatch.pl: set default mode to strict

There aren't many false positives. So, change default to strict
mode.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agoras-arm-handler: cope with latest upstream changes
Mauro Carvalho Chehab [Wed, 17 Jul 2024 05:01:29 +0000 (07:01 +0200)]
ras-arm-handler: cope with latest upstream changes

Unfortunately, rasdaemon support for the firmware-first
CPER ARM processor extended trace was added years before
it was merged upstream. That's bad, especially since the
upstream revision requested a change to some fields.

Fix support for it by aligning with latest upstream version:
        https://lore.kernel.org/linux-edac/3853853f820a666253ca8ed6c7c724dc3d50044a.1720679234.git.mchehab+huawei@kernel.org/T/#m17003e47912b228e91e57ac6e4f90ea30061aa3b

Backward-compatible logic was added to avoid breaking
existing OOT support.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agoscripts/checkpatch.pl: some improvements to reduce false positives
Mauro Carvalho Chehab [Wed, 17 Jul 2024 04:23:00 +0000 (06:23 +0200)]
scripts/checkpatch.pl: some improvements to reduce false positives

- camelcase is OK for printk inttypes.h;
- strncpy is OK;
- accept up to 120 chars on lines without warnings;
- stop complaining about "BACKTRACE=" strings split on multiple lines;
- remove PREFER_DEFINED_ATTRIBUTE_MACRO, as this is kernel-specific;
- remove MACRO_ARG_REUSE, as this applies mostly to multithreading;
- don't warn on using do{} while(0) with single line statements;

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agorasdaemon: coding style cleanup
Mauro Carvalho Chehab [Tue, 16 Jul 2024 23:06:52 +0000 (01:06 +0200)]
rasdaemon: coding style cleanup

Solve lots of coding style issues reported by:

./scripts/checkpatch.pl --terse --show-types --strict \
-f $(git ls-files|grep -E '\.[ch]$') \
--ignore MACRO_ARG_REUSE,STRCPY,IF_0,UNNECESSARY_PARENTHESES,CAMELCASE,STRNCPY; done

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agoscripts/checkpatch.pl: do some additional cleanups
Mauro Carvalho Chehab [Tue, 16 Jul 2024 23:06:21 +0000 (01:06 +0200)]
scripts/checkpatch.pl: do some additional cleanups

Remove more things that won't make sense for rasdaemon.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agoBump version to 0.8.1 v0.8.1
Mauro Carvalho Chehab [Tue, 16 Jul 2024 08:24:51 +0000 (10:24 +0200)]
Bump version to 0.8.1

There were lots of changes on this version. The summary at
ChangeLog contains a sanitized version of it.

Note that the next version will likely bring
a uAPI-incompatible change. Unfortunately, the UEFI CPER record
trace for ARM processors is currently incomplete upstream.

Rasdaemon gained support for an extended ARM trace event that
supports all fields of the CPER record, but it depends on a
patch that is not upstream yet.

Having looked at those patches, some changes are needed
to get them merged, meaning that future versions of rasdaemon
may not be compatible with the downstream patch anymore.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agorasdaemon: adjust install targets for the spec to be build
Mauro Carvalho Chehab [Tue, 16 Jul 2024 08:41:49 +0000 (10:41 +0200)]
rasdaemon: adjust install targets for the spec to be build

We use the Fedora spec file to check if everything is OK. Make some
changes to keep it happy.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agoras-aer-handler: handle errors when running ipmitool
Mauro Carvalho Chehab [Tue, 16 Jul 2024 08:37:23 +0000 (10:37 +0200)]
ras-aer-handler: handle errors when running ipmitool

Without that, Fedora build will produce warnings.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agorbtree.h: Fix an issue introduced by checkpatch logic
Mauro Carvalho Chehab [Tue, 16 Jul 2024 08:12:42 +0000 (10:12 +0200)]
rbtree.h: Fix an issue introduced by checkpatch logic

Checkpatch actually broke RB_EMPTY_ROOT macro. It was defined
as:

#define RB_EMPTY_ROOT(root)    ((root)->rb_node == NULL)

It ended up replacing it with:
((root)->!rb_node)

That is not what we expect. Oddly enough, this was compiling.
Anyway, what we want instead is:

#define RB_EMPTY_ROOT(root)    (!(root)->rb_node)
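
A macro body is only parsed when the macro is expanded, which may be why
the broken form still compiled (an assumption: the message does not say
whether the macro was in use). The corrected form can be exercised with
minimal stand-in types; the real rbtree.h structures carry more fields:

```c
#include <stddef.h>

/* Minimal stand-ins for this sketch only. */
struct rb_node {
	struct rb_node *rb_left;
	struct rb_node *rb_right;
};

struct rb_root {
	struct rb_node *rb_node;
};

/* True when the tree has no root node. */
#define RB_EMPTY_ROOT(root)	(!(root)->rb_node)
```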

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agorasdaemon: do some coding style cleanups
Mauro Carvalho Chehab [Tue, 16 Jul 2024 06:34:27 +0000 (08:34 +0200)]
rasdaemon: do some coding style cleanups

Adjust coding style on some files to somewhat match the Kernel's
coding style, with the help of scripts/checkpatch.pl.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agoscripts/checkpatch.pl: add a script to check coding style
Mauro Carvalho Chehab [Tue, 16 Jul 2024 06:47:23 +0000 (08:47 +0200)]
scripts/checkpatch.pl: add a script to check coding style

We sort of follow the Kernel coding style. Import a version of its
checkpatch script, making it compatible with the rasdaemon coding style
by removing stuff that doesn't fit here.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agorasdaemon: mce-amd-smca: Optimizing decoding of MCA_CTL_SMU bits
sathya priya kumar [Thu, 13 Jun 2024 05:29:09 +0000 (05:29 +0000)]
rasdaemon: mce-amd-smca: Optimizing decoding of MCA_CTL_SMU bits

Optimize the smca_smu2_mce_desc handling from commit ced615c.

Update existing array with extended error descriptions instead
of creating new array, simplifying the code.

Signed-off-by: Sathya Priya Kumar <sathyapriya.k@amd.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agoAdd labels for TRX50 WS
tictooc [Sat, 29 Jun 2024 14:03:42 +0000 (14:03 +0000)]
Add labels for TRX50 WS

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agoCleanup MCE error log on non-x86 args
Mauro Carvalho Chehab [Mon, 15 Jul 2024 14:40:26 +0000 (14:40 +0000)]
Cleanup MCE error log on non-x86 args

We can only register for MCE on x86 arch.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agocontrib/qemu_einj.py: make it more generic to allow other einj types
Mauro Carvalho Chehab [Mon, 15 Jul 2024 12:26:15 +0000 (14:26 +0200)]
contrib/qemu_einj.py: make it more generic to allow other einj types

Currently, the einj logic handles just ARM processor CPER events, but
it is easy to change it to support other types as well.

Rename the script and make it more generic to accept new subparsers
for different types of EINJ.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agorasdaemon: add mem_fail_event trigger
Mauro Carvalho Chehab [Tue, 16 Jul 2024 05:05:32 +0000 (05:05 +0000)]
rasdaemon: add mem_fail_event trigger

This event is somewhat similar to mc_event, except that this one
occurs on ARM platforms and the fields are different.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agotrigger: parse only once TRIGGER_DIR env variable
Mauro Carvalho Chehab [Mon, 15 Jul 2024 11:40:37 +0000 (13:40 +0200)]
trigger: parse only once TRIGGER_DIR env variable

Instead of parsing TRIGGER_DIR every time a new event happens,
store the trigger full path, simplifying the logic and avoiding
memory leaks.
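
The idea can be sketched as a helper that builds the full path once at
startup and hands ownership to the caller. build_trigger_path() is a
hypothetical name used for illustration, not the function in the rasdaemon
sources:

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Build "<dir>/<script>" once. Returns a malloc'ed string that the caller
 * keeps for the lifetime of the daemon, or NULL if no trigger is configured.
 */
char *build_trigger_path(const char *dir, const char *script)
{
	char *path;
	int len;

	if (!dir || !script || !*script)
		return NULL;

	/* First pass only measures the required length. */
	len = snprintf(NULL, 0, "%s/%s", dir, script);
	if (len < 0)
		return NULL;

	path = malloc((size_t)len + 1);
	if (!path)
		return NULL;

	snprintf(path, (size_t)len + 1, "%s/%s", dir, script);
	return path;
}
```

A caller could store the result of
build_trigger_path(getenv("TRIGGER_DIR"), getenv("MC_CE_TRIGGER")) in a
long-lived variable, so event handlers only test that pointer.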

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agoras-mc-handler: cleanup trigger logic
Mauro Carvalho Chehab [Tue, 16 Jul 2024 05:38:13 +0000 (07:38 +0200)]
ras-mc-handler: cleanup trigger logic

- Only setup mc_ce_trigger/mc_ue_trigger if the trigger is
  valid;

- Check if the trigger is there before doing strcmp, as
  checking if a pointer is not null is faster than strcmp();

- Ensure that the trigger env vars will be const, as we don't
  want to accidentally override those env vars;

- Print trigger enabled messages when rasdaemon runs with -f;

- ensure that trigger variables will initialize to NULL;

- coding style cleanups.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agorasdaemon: add mc_event trigger
Ruidong Tian [Thu, 23 Nov 2023 09:47:25 +0000 (17:47 +0800)]
rasdaemon: add mc_event trigger

Allow users to run a trigger when a RAS mc_event occurs. The mc_event
trigger is split into a CE trigger and a UE trigger because
CEs are more frequent than UEs, so a CE trigger has a larger
performance impact. Users can choose different triggers for CE/UE to
reduce this effect.

Users can configure triggers in /etc/sysconfig/rasdaemon:

    TRIGGER_DIR: The trigger directory
    MC_CE_TRIGGER: The script executed when a corrected error occurs.
    MC_UE_TRIGGER: The script executed when an uncorrected error occurs.

No script will be executed if MC_CE_TRIGGER/MC_UE_TRIGGER is null.

Signed-off-by: Ruidong Tian <tianruidong@linux.alibaba.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agoutil/arm_einj.py: fix a typo at virt-addr
Mauro Carvalho Chehab [Wed, 10 Jul 2024 13:07:44 +0000 (15:07 +0200)]
util/arm_einj.py: fix a typo at virt-addr

Typo: QAPI parameter is virt-addr.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agoutil/arm_einj.py: remove a debug print
Mauro Carvalho Chehab [Wed, 10 Jul 2024 13:07:39 +0000 (15:07 +0200)]
util/arm_einj.py: remove a debug print

This was meant only for testing argument handling. Remove it.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agoutil/arm_einj.py: add an utility for ARM error injection via QEMU
Mauro Carvalho Chehab [Wed, 10 Jul 2024 12:15:24 +0000 (14:15 +0200)]
util/arm_einj.py: add an utility for ARM error injection via QEMU

Testing rasdaemon is not easy, as it depends on either having
real hardware producing events or a test BIOS. This is usually
not available and/or not too reliable.

So, take a different approach by adding a QEMU QAPI designed for
doing hardware error injection. The QEMU patches are at:

https://gitlab.com/mchehab_kernel/qemu/-/tree/arm-error-inject-v2

Some instructions on how to use it are on the rasdaemon wiki
pages on GitHub:

https://github.com/mchehab/rasdaemon/wiki

Add the error injection tool to rasdaemon sources.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
10 months agoras-arm-handler: be compatible with upstream Kernel
Mauro Carvalho Chehab [Tue, 25 Jun 2024 08:05:45 +0000 (10:05 +0200)]
ras-arm-handler: be compatible with upstream Kernel

Changeset e37eb2f11a82 ("Add code to decode Ampere specific error")
broke ARM event recording with upstream Kernels, as it requires a different
trace event than the one in the upstream Kernel. That event is
part of a pending pull request:

https://lore.kernel.org/all/20240321-b4-arm-ras-error-vendor-info-v5-rc3-v5-0-850f9bfb97a8@os.amperecomputing.com/

Restore the previous behavior by making the parsing of the UEFI 2.6+ N.17
and N.16 table extra fields optional. That should make it compatible
with current upstream Kernels again.

Fixes: e37eb2f11a82 ("Add code to decode Ampere specific error")
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
10 months agoDo a coding style cleanup with regards to tabs and white spaces
Mauro Carvalho Chehab [Tue, 11 Jun 2024 10:01:40 +0000 (12:01 +0200)]
Do a coding style cleanup with regards to tabs and white spaces

Use tabs instead of spaces and remove trailing whitespace.

No functional changes.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
10 months agorasdaemon: Add Corrected Internal Error for aer_cor_errors
Jesus Esquivel [Mon, 3 Jun 2024 22:47:20 +0000 (16:47 -0600)]
rasdaemon: Add Corrected Internal Error for aer_cor_errors

Add "Corrected Internal Error" for aer_cor_errors to decode
the error reported in status register in bit 14.
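
A table-driven decoder for this kind of status bit could be sketched as
follows. Only the bit 14 entry comes from this commit; the other entries
follow the PCIe AER correctable error status layout as an assumption, and
cor_error_name() is a hypothetical helper:

```c
#include <stddef.h>

/* Bit position in the correctable error status register -> description. */
static const char * const cor_err_names[] = {
	[0]  = "Receiver Error",
	[6]  = "Bad TLP",
	[7]  = "Bad DLLP",
	[14] = "Corrected Internal Error",
};

/* Returns the description for a status bit, or "Unknown" if not listed. */
const char *cor_error_name(unsigned int bit)
{
	size_t n = sizeof(cor_err_names) / sizeof(cor_err_names[0]);

	if (bit >= n || !cor_err_names[bit])
		return "Unknown";
	return cor_err_names[bit];
}
```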

Signed-off-by: Jesus Esquivel <jesus.esquivel@intel.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
10 months agorasdaemon: Update SMCA bank error descriptions
Avadhut Naik [Fri, 10 May 2024 18:20:19 +0000 (13:20 -0500)]
rasdaemon: Update SMCA bank error descriptions

Update error descriptions of SMCA bank types to support AMD's new Family
1Ah-based processors.
Also, modify some existing error descriptions to better reflect the error
received.

Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
10 months agoAdd Lenovo P920 DIMM labels
Raul E Rangel [Thu, 9 May 2024 18:55:11 +0000 (18:55 +0000)]
Add Lenovo P920 DIMM labels

This adds the labels entry for the Lenovo ThinkStation P920.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
10 months agorasdaemon: Fix for vendor errors are not recorded in the SQLite database if some...
Shiju Jose [Wed, 20 Mar 2024 12:16:05 +0000 (12:16 +0000)]
rasdaemon: Fix for vendor errors are not recorded in the SQLite database if some cpus are offline

Fix vendor errors not being recorded in the SQLite database if some CPUs
are offline at system start.

Issue:

This issue can be reproduced by taking some CPUs offline, running
./rasdaemon -f --record & and then
injecting a vendor-specific error supported by rasdaemon.

Reason:

When the system starts with some of the CPUs offline and rasdaemon is then
run, read_ras_event_all_cpus() exits with an error and rasdaemon switches to
the multi-threaded path. However, read() in read_ras_event() returns an error
in the threads for each of the offline CPUs, and those threads clean up,
including calling ras_ns_finalize_vendor_tables(), which invokes
sqlite3_finalize() on the vendor tables created. Thus the vendor error data
is not stored in the SQLite database the next time such an error is reported.

Solution:

In ras_ns_add_vendor_tables() and ras_ns_finalize_vendor_tables(), use a
reference count, and close the vendor tables that were created in
ras_ns_add_vendor_tables() based on the reference count.
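
The reference-counting idea can be sketched as below. The struct and the
get/put helpers are hypothetical, a boolean stands in for the prepared
SQLite statements, and locking between threads is deliberately left out:

```c
#include <stdbool.h>

struct vendor_tables {
	int refcount;
	bool open;	/* stands in for the prepared SQLite statements */
};

/* Called once per user; only the first call creates the tables. */
void vendor_tables_get(struct vendor_tables *t)
{
	if (t->refcount++ == 0)
		t->open = true;
}

/* Called on cleanup; only the last user finalizes the tables. */
void vendor_tables_put(struct vendor_tables *t)
{
	if (t->refcount > 0 && --t->refcount == 0)
		t->open = false;
}
```

With this scheme, a thread that fails on an offline CPU drops only its own
reference, so the tables stay usable for the remaining threads.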

Reported-by: Junhao He <hejunhao3@huawei.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
10 months agomce-amd-smca: update smca_hwid to use smca_bank_types
Aristeu Rozanski [Tue, 9 Apr 2024 14:06:30 +0000 (10:06 -0400)]
mce-amd-smca: update smca_hwid to use smca_bank_types

bank_type is used as smca_bank_types everywhere, there's no point in
declaring it as unsigned int. It also upsets covscan:

3. rasdaemon-0.6.7/mce-amd-smca.c:914: assignment: Assigning: "bank_type" = "s_hwid->bank_type".
7. rasdaemon-0.6.7/mce-amd-smca.c:926: cond_at_most: Checking "bank_type >= 64U" implies that "bank_type" and "s_hwid->bank_type" may be up to 63 on the false branch.
14. rasdaemon-0.6.7/mce-amd-smca.c:942: overrun-local: Overrunning array "smca_mce_descs" of 38 16-byte elements at element index 63 (byte offset 1023) using index "bank_type" (which evaluates to 63).
#   940|        /* Only print the descriptor of valid extended error code */
#   941|        if (xec < smca_mce_descs[bank_type].num_descs)
#   942|->              mce_snprintf(e->mcastatus_msg,
#   943|                             "%s. Ext Err Code: %d",
#   944|                             smca_mce_descs[bank_type].descs[xec],
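
The overrun reported above comes from checking an index against a bound
larger than the array it indexes. A hedged sketch of the general remedy,
using hypothetical names rather than the real smca_bank_types, is to size
the table by the enum's sentinel and check against that same sentinel:

```c
#include <stddef.h>

/* Hypothetical bank types; the last entry is the sentinel count. */
enum sketch_bank_type {
	SKETCH_BANK_A,
	SKETCH_BANK_B,
	N_SKETCH_BANK_TYPES
};

/* The table is sized by the sentinel, so index and bound stay in sync. */
static const char * const sketch_descs[N_SKETCH_BANK_TYPES] = {
	[SKETCH_BANK_A] = "bank A error",
	[SKETCH_BANK_B] = "bank B error",
};

/* Returns the description, or NULL for an out-of-range bank type. */
const char *sketch_bank_desc(enum sketch_bank_type type)
{
	if ((unsigned int)type >= N_SKETCH_BANK_TYPES)
		return NULL;
	return sketch_descs[type];
}
```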

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
10 months agolabels/asrock: Add DIMM labels for ASRock Rack X570D4U
Ivan Mironov [Thu, 28 Mar 2024 00:40:13 +0000 (05:40 +0500)]
labels/asrock: Add DIMM labels for ASRock Rack X570D4U

Signed-off-by: Ivan Mironov <mironov.ivan@gmail.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
10 months agorasdaemon: Add support to parse microcode field of mce tracepoint
Avadhut Naik [Tue, 2 Apr 2024 05:07:38 +0000 (00:07 -0500)]
rasdaemon: Add support to parse microcode field of mce tracepoint

Support for exporting the Microcode Revision is being added to the
mce_record tracepoint.

Add the required, corresponding support in the rasdaemon for the field
to be parsed and logged or added to the database and viewed later through
ras-mc-ctl utility.

Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
10 months agorasdaemon: Add support to parse the PPIN field of mce tracepoint
Avadhut Naik [Tue, 2 Apr 2024 04:33:07 +0000 (23:33 -0500)]
rasdaemon: Add support to parse the PPIN field of mce tracepoint

Support for exporting the PPIN (Protected Processor Inventory Number)
is being added to the mce_record tracepoint.

Add the required, corresponding support in the rasdaemon for the field
to be parsed and logged or added to the database and viewed later through
ras-mc-ctl utility.

Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
10 months agorasdaemon: ras-mc-ctl: Add support to display mcastatus_msg string
Avadhut Naik [Tue, 26 Mar 2024 04:06:08 +0000 (23:06 -0500)]
rasdaemon: ras-mc-ctl: Add support to display mcastatus_msg string

Currently, the mcastatus_msg string of struct mce_event is added to the
SQLite database by the rasdaemon when it is recording errors. The same
string, however, is not output by the ras-mc-ctl utility.

The string provides important error information relating to the received
MCE. For example, on AMD SMCA systems, the string outputs extended error
code and description. As such, the string should be present in the
output of ras-mc-ctl utility.

Add support to output the string through the ras-mc-ctl utility.

Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
10 months agoprint logs in the same line
zhuofeng [Tue, 12 Mar 2024 06:28:55 +0000 (14:28 +0800)]
print logs in the same line

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
10 months agorasdaemon: ras-mc-ctl: Add support for CXL memory module trace events
Shiju Jose [Mon, 12 Feb 2024 11:29:13 +0000 (11:29 +0000)]
rasdaemon: ras-mc-ctl: Add support for CXL memory module trace events

Add support for CXL memory module events to the ras-mc-ctl tool.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
10 months agorasdaemon: ras-mc-ctl: Add support for CXL DRAM trace events
Shiju Jose [Mon, 12 Feb 2024 11:22:03 +0000 (11:22 +0000)]
rasdaemon: ras-mc-ctl: Add support for CXL DRAM trace events

Add support for CXL DRAM events to the ras-mc-ctl tool.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
10 months agorasdaemon: ras-mc-ctl: Add support for CXL general media trace events
Shiju Jose [Mon, 12 Feb 2024 11:14:03 +0000 (11:14 +0000)]
rasdaemon: ras-mc-ctl: Add support for CXL general media trace events

Add support for CXL general media events to the ras-mc-ctl tool.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
10 months agorasdaemon: ras-mc-ctl: Add support for CXL generic trace events
Shiju Jose [Mon, 12 Feb 2024 10:56:25 +0000 (10:56 +0000)]
rasdaemon: ras-mc-ctl: Add support for CXL generic trace events

Add support for CXL generic events to the ras-mc-ctl tool.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
10 months agorasdaemon: ras-mc-ctl: Add support for CXL poison trace events
Shiju Jose [Mon, 12 Feb 2024 10:49:10 +0000 (10:49 +0000)]
rasdaemon: ras-mc-ctl: Add support for CXL poison trace events

Add support for CXL poison events to the ras-mc-ctl tool.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
10 months agorasdaemon: ras-mc-ctl: Add support for CXL overflow trace events
Shiju Jose [Mon, 12 Feb 2024 10:38:51 +0000 (10:38 +0000)]
rasdaemon: ras-mc-ctl: Add support for CXL overflow trace events

Add support for CXL overflow events to the ras-mc-ctl tool.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
10 months agorasdaemon: ras-mc-ctl: Add support for CXL AER correctable trace events
Shiju Jose [Mon, 12 Feb 2024 10:35:25 +0000 (10:35 +0000)]
rasdaemon: ras-mc-ctl: Add support for CXL AER correctable trace events

Add support for CXL AER correctable events to the ras-mc-ctl tool.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
10 months agorasdaemon: ras-mc-ctl: Add support for CXL AER uncorrectable trace events
Shiju Jose [Mon, 12 Feb 2024 10:27:58 +0000 (10:27 +0000)]
rasdaemon: ras-mc-ctl: Add support for CXL AER uncorrectable trace events

Add support for CXL AER uncorrectable events to the ras-mc-ctl tool.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
10 months agorasdaemon: ras-memory-failure-handler: update memory failure action page types
Shiju Jose [Tue, 6 Feb 2024 12:08:00 +0000 (12:08 +0000)]
rasdaemon: ras-memory-failure-handler: update memory failure action page types

Update the memory failure action page types to match those in
mm/memory-failure.c in the kernel.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
10 months agorasdaemon: Fix build warnings unused variable if AMP RAS errors is not enabled
Shiju Jose [Mon, 4 Mar 2024 11:49:50 +0000 (11:49 +0000)]
rasdaemon: Fix build warnings unused variable if AMP RAS errors is not enabled

This patch fixes the following unused-variable build warnings, which appear
when AMP RAS error decoding is not enabled (--enable-amp-ns-decode).

==================================================
ras-aer-handler.c: In function ‘ras_aer_event_handler’:
ras-aer-handler.c:72:21: warning: unused variable ‘fn’ [-Wunused-variable]
  int seg, bus, dev, fn;
                     ^~
ras-aer-handler.c:72:16: warning: unused variable ‘dev’ [-Wunused-variable]
  int seg, bus, dev, fn;
                ^~~
ras-aer-handler.c:72:11: warning: unused variable ‘bus’ [-Wunused-variable]
  int seg, bus, dev, fn;
           ^~~
ras-aer-handler.c:72:6: warning: unused variable ‘seg’ [-Wunused-variable]
  int seg, bus, dev, fn;
      ^~~
ras-aer-handler.c:71:10: warning: variable ‘sel_data’ set but not used [-Wunused-but-set-variable]
  uint8_t sel_data[5];
          ^~~~~~~~
ras-aer-handler.c:70:7: warning: unused variable ‘ipmi_add_sel’ [-Wunused-variable]
  char ipmi_add_sel[105];
       ^~~~~~~~~~~~
==================================================

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
10 months agorasdaemon: ras-mc-ctl: Do not try to find modprobe
Ivan Mironov [Sun, 3 Mar 2024 09:51:13 +0000 (14:51 +0500)]
rasdaemon: ras-mc-ctl: Do not try to find modprobe

It is not used and prevents ras-mc-ctl.service from starting on Fedora
when SELinux is in Enforcing mode.

Resolves: rhbz#1836861
Resolves: https://github.com/fedora-selinux/selinux-policy/issues/2054
Resolves: https://github.com/mchehab/rasdaemon/issues/79
Signed-off-by: Ivan Mironov <mironov.ivan@gmail.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
10 months agolabels/asus: Add DIMM labels for Asus PRIME X570-P
Ivan Mironov [Sat, 2 Mar 2024 05:53:50 +0000 (10:53 +0500)]
labels/asus: Add DIMM labels for Asus PRIME X570-P

Signed-off-by: Ivan Mironov <mironov.ivan@gmail.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
10 months agoUse block_rq_error if RHEL >= 9.1
Etienne Champetier [Mon, 26 Feb 2024 20:02:01 +0000 (15:02 -0500)]
Use block_rq_error if RHEL >= 9.1

The commit introducing block_rq_error tracepoint
has been backported in RHEL 9.1, so improve the check
for block_rq_error presence to use it.

Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
10 months agorasdaemon: Add error decoding for MCA_CTL_SMU extended bits
Sathya Priya Kumar [Thu, 11 Jan 2024 07:20:07 +0000 (01:20 -0600)]
rasdaemon: Add error decoding for MCA_CTL_SMU extended bits

Enable error decoding support for the newly added extended
error bit descriptions from MCA_CTL_SMU.
b'0:11 can be decoded from existing array smca_smu2_mce_desc.
Define a function to append the newly defined b'58:62 to
smca_smu2_mce_desc. This avoids having to maintain entries for the
Reserved bits b'12:57 in the code.
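
A minimal sketch of the idea described above, not the code from this commit:
one table covers positions 0..11, a second small table covers 58..62, and the
reserved range 12..57 needs no entries at all. The names smu_desc(),
low_descs and ext_descs and the placeholder strings are hypothetical; the
real descriptions live in rasdaemon's own tables.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical bounds of the extended range, taken from the text above. */
#define EXT_FIRST 58
#define EXT_LAST  62

/* Placeholder descriptions for positions 0..11. */
static const char * const low_descs[] = {
	"low 0", "low 1", "low 2", "low 3", "low 4", "low 5",
	"low 6", "low 7", "low 8", "low 9", "low 10", "low 11",
};

/* Placeholder descriptions for positions 58..62. */
static const char * const ext_descs[] = {
	"ext 58", "ext 59", "ext 60", "ext 61", "ext 62",
};

/*
 * Return the description for a given position, or NULL when the
 * position falls in the reserved range or beyond the known tables.
 */
static const char *smu_desc(unsigned int xec)
{
	/* The extended table must cover exactly EXT_FIRST..EXT_LAST. */
	assert(ARRAY_SIZE(ext_descs) == EXT_LAST - EXT_FIRST + 1);

	if (xec < ARRAY_SIZE(low_descs))
		return low_descs[xec];
	if (xec >= EXT_FIRST && xec <= EXT_LAST)
		return ext_descs[xec - EXT_FIRST];

	return NULL;	/* reserved (12..57) or unknown */
}
```

With this layout, a caller only has to handle the NULL case for reserved
positions instead of carrying 46 filler strings.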

Signed-off-by: Sathya Priya Kumar <sathyapriya.k@amd.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
10 months agorasdaemon: labels/apple add MacPro 1,1 and 2,1 models
Walter Sonius [Sun, 11 Feb 2024 22:30:25 +0000 (23:30 +0100)]
rasdaemon: labels/apple add MacPro 1,1 and 2,1 models

For the Apple MacPro 1,1 (Mac-F4208DC8) and MacPro 2,1 (Mac-F4208DA9)
these are the correct labels for the DIMM numbers 1-4 on each DIMM Riser
A&B for a total of 8 DIMMS. The MacPro 1,1 vendor is actually called
"Apple Computer, Inc." vs "Apple Inc." for the MacPro 2,1 and 3,1.
Another note is that the MacPro 1,1 and 2,1 require the kernel parameter
noefi for their efi32 firmware to boot a 64bit kernel using the
debian-12.4.0-amd64-netinst.iso.

The upper Riser is called A; the lower Riser is called B. However,
compared to MacPro 3,1, the riser labels A & B are branch swapped on the
memory controller on MacPro 1,1 and 2,1, not in their physical location
in the case (double checked it)! The so-called slot 2 and slot 3 found by
ras-mc-ctl --layout are not available as slots or risers on the
motherboard. The ras-mc-ctl --guess-labels output showed the right labels
but the DIMM numbers are indistinguishable; however, this commit is needed
to link them to the right memory location.

Signed-off-by: Walter Sonius <walterav1984@gmail.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
10 months agorasdaemon: labels/intel add DQ57TM vendor and model
Walter Sonius [Thu, 8 Feb 2024 14:40:45 +0000 (15:40 +0100)]
rasdaemon: labels/intel add DQ57TM vendor and model

Add labels used on the Intel Corporation DQ57TM motherboard.

$ sudo dmesg | grep DMI | grep DQ57TM
[    0.000000] DMI:  /DQ57TM, BIOS TMIBX10H.86A.0050.2011.1207.1134 12/07/2011

Signed-off-by: Walter Sonius <walterav1984@gmail.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
10 months agoREADME.md: Fix repository information
Mauro Carvalho Chehab [Wed, 5 Jun 2024 12:47:00 +0000 (14:47 +0200)]
README.md: Fix repository information

We haven't used Fedorahosted for a long time; the URL was updated,
but right now it is far more common to receive patches via GitHub
than from other repositories, so change the repository order.

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
14 months agoapple macpro 2008 3,1 dimm1-4 labels riser A&B
Walter Sonius [Mon, 29 Jan 2024 17:21:37 +0000 (18:21 +0100)]
apple macpro 2008 3,1 dimm1-4 labels riser A&B

For the Apple Mac Pro 3,1 (2008) Mac-F42C88C8, these are the correct labels for the DIMM numbers 1-4 on each DIMM Riser A&B for a total of 8 DIMMS.

The upper Riser is called A; the lower Riser is called B. The so-called `slot 2` and `slot 3` found by `ras-mc-ctl --layout` are not available as slots or risers on the motherboard. The `ras-mc-ctl --guess-labels` output showed the right labels but the DIMM numbers are indistinguishable; however, this commit is needed to link them to the right memory location.

```
$ ras-mc-ctl --layout
       +-----------------------------------------------+
       |                      mc0                      |
       |        branch0        |        branch1        |
       | channel0  | channel1  | channel0  | channel1  |
-------+-----------------------------------------------+
slot3: |     0 MB  |     0 MB  |     0 MB  |     0 MB  |
slot2: |     0 MB  |     0 MB  |     0 MB  |     0 MB  |
-------+-----------------------------------------------+
slot1: |  2048 MB  |  2048 MB  |  2048 MB  |  2048 MB  |
slot0: |  8192 MB  |  8192 MB  |  4096 MB  |  4096 MB  |
-------+-----------------------------------------------+

$ ras-mc-ctl --guess-labels
memory stick 'DIMM 1' is located at 'DIMM Riser B'
memory stick 'DIMM 2' is located at 'DIMM Riser B'
memory stick 'DIMM 1' is located at 'DIMM Riser A'
memory stick 'DIMM 2' is located at 'DIMM Riser A'
memory stick 'DIMM 3' is located at 'DIMM Riser B'
memory stick 'DIMM 4' is located at 'DIMM Riser B'
memory stick 'DIMM 3' is located at 'DIMM Riser A'
memory stick 'DIMM 4' is located at 'DIMM Riser A'
```

Signed-off-by: Walter Sonius <walterav1984@gmail.com>
14 months agolabels/supermicro: add Supermicro X11DPi-N(T)
Werner Fischer [Wed, 31 Jan 2024 12:33:00 +0000 (13:33 +0100)]
labels/supermicro: add Supermicro X11DPi-N(T)

Add labels for Supermicro X11DPi-N and X11DPi-NT motherboards.

Signed-off-by: Werner Fischer <devlists@wefi.net>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
15 months agoC files: cleanup coding style
Mauro Carvalho Chehab [Mon, 22 Jan 2024 07:36:47 +0000 (08:36 +0100)]
C files: cleanup coding style

The rasdaemon coding style follows the Linux Kernel where it makes sense.

Yet, changes made over time ended up with some coding style non-compliances.

Adjust rasdaemon coding style by using:

   scripts/checkpatch.pl --fix-inplace --strict *.c --ignore PREFER_KERNEL_TYPES

And doing some manual fixups where the script didn't work.
As a bonus, some typos were also fixed in some rasdaemon messages.

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
15 months agorasdaemon: ras-mc-ctl: Add support to display the JaguarMicro vendor errors
Hunter He [Mon, 25 Dec 2023 09:34:56 +0000 (17:34 +0800)]
rasdaemon: ras-mc-ctl: Add support to display the JaguarMicro vendor errors

Add support to display the JaguarMicro Corsica DPU vendor errors event.

Signed-off-by: Hunter He <hunter.he@jaguarmicro.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
15 months agoSupermicro X12DPU-6 DIMM labels
DmNosachev [Tue, 19 Dec 2023 09:44:01 +0000 (12:44 +0300)]
Supermicro X12DPU-6 DIMM labels

Add labels for X12DPU-6 motherboard.

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
15 months agoFix potential overflow with some arrays at page-isolation logic
zhuofeng [Thu, 7 Dec 2023 02:26:56 +0000 (10:26 +0800)]
Fix potential overflow with some arrays at page-isolation logic

Overflows may happen in the `threshold_string` and `cycle_string` arrays.

If the PAGE_CE_THRESHOLD value in page isolation is set to 50 bits,
there is a risk of array overflow. Because sprintf is an insecure
function, use snprintf instead.
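
As a rough illustration of the sprintf-to-snprintf change described above
(not the actual patch), the sketch below copies a configuration value into a
fixed-size buffer and reports truncation instead of overflowing. The helper
name copy_config_value() and the CONFIG_STR_LEN macro are hypothetical; the
50-byte size is taken from the array sizes shown in the AddressSanitizer
report in this commit message.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical size; the ASan report in this message shows 50-byte arrays. */
#define CONFIG_STR_LEN 50

/*
 * Copy a configuration value into a fixed-size buffer without writing
 * past its end. Returns 0 on success, -1 if the value did not fit
 * (the buffer then holds a truncated, NUL-terminated copy) or if
 * formatting failed.
 */
static int copy_config_value(char *dst, size_t dst_size, const char *value)
{
	int n;

	assert(dst && dst_size > 0 && value);

	/* snprintf() writes at most dst_size bytes, including the NUL. */
	n = snprintf(dst, dst_size, "%s", value);
	if (n < 0 || (size_t)n >= dst_size)
		return -1;	/* output did not fit */

	return 0;
}
```

Checking the return value matters here: snprintf() returns the length the
full output would have had, so a result greater than or equal to the buffer
size signals that the value was cut short.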

An error is reported when the AddressSanitizer is used.

rasdaemon: Improper PAGE_CE_ACTION, set to default soft
rasdaemon: Page offline choice on Corrected Errors is soft
=================================================================
==221920==ERROR: AddressSanitizer: stack-buffer-overflow on address 0xffffdd91d932 at pc 0xffffa24071c4 bp 0xffffdd91d720 sp 0xffffdd91ced8
WRITE of size 55 at 0xffffdd91d932 thread T0
    #0 0xffffa24071c0 in vsprintf (/usr/lib64/libasan.so.6+0x5c1c0)
    #1 0xffffa24073cc in sprintf (/usr/lib64/libasan.so.6+0x5c3cc)
    #2 0x459558 in parse_env_string /home/rasdaemon/ras-page-isolation.c:185
    #3 0x4596f4 in page_isolation_init /home/rasdaemon/ras-page-isolation.c:202
    #4 0x459934 in ras_page_account_init /home/rasdaemon/ras-page-isolation.c:211
    #5 0x40f700 in handle_ras_events /home/rasdaemon/ras-events.c:902
    #6 0x405b8c in main /home/rasdaemon/rasdaemon.c:211
    #7 0xffffa20b6f38 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
    #8 0xffffa20b7004 in __libc_start_main_impl ../csu/libc-start.c:409
    #9 0x4038ec in _start (/home/rasdaemon/rasdaemon+0x4038ec)

Address 0xffffdd91d932 is located in stack of thread T0 at offset 82 in frame
    #0 0x459574 in page_isolation_init /home/rasdaemon/ras-page-isolation.c:190

  This frame has 2 object(s):
    [32, 82) 'threshold_string' (line 191)
    [128, 178) 'cycle_string' (line 192) <== Memory access at offset 82 partially underflows this variable
HINT: this may be a false positive if your program uses some custom stack unwind mechanism, swapcontext or vfork
      (longjmp and C++ exceptions *are* supported)
SUMMARY: AddressSanitizer: stack-buffer-overflow (/usr/lib64/libasan.so.6+0x5c1c0) in vsprintf
Shadow bytes around the buggy address:
  0x200ffbb23ad0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x200ffbb23ae0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x200ffbb23af0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x200ffbb23b00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x200ffbb23b10: 00 00 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1
=>0x200ffbb23b20: 00 00 00 00 00 00[02]f2 f2 f2 f2 f2 00 00 00 00
  0x200ffbb23b30: 00 00 02 f3 f3 f3 f3 f3 00 00 00 00 00 00 00 00
  0x200ffbb23b40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x200ffbb23b50: f1 f1 f1 f1 f1 f1 04 f2 00 00 f2 f2 00 00 00 00
  0x200ffbb23b60: 00 00 00 f2 f2 f2 f2 f2 00 00 00 00 00 00 00 f2
  0x200ffbb23b70: f2 f2 f2 f2 00 00 00 00 00 00 00 00 f2 f2 f2 f2
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
  Shadow gap:              cc
==221920==ABORTING

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
15 months agorasdaemon: Fix return value type compiling warning of configure Optional Features...
Hunter He [Mon, 4 Dec 2023 04:54:55 +0000 (12:54 +0800)]
rasdaemon: Fix return value type compiling warning of configure Optional Features with --enable-amp-ns-decode and without --enable-sqlite3.

Fix the return value type compile warning seen when configuring Optional
Features with --enable-amp-ns-decode and without --enable-sqlite3.

Signed-off-by: Hunter He <hunter.he@jaguarmicro.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
15 months agorasdaemon: Add support for creating vendor tables at startup.
Hunter He [Wed, 6 Dec 2023 06:52:03 +0000 (14:52 +0800)]
rasdaemon: Add support for creating vendor tables at startup.

When rasdaemon runs without receiving any non-standard error, the
vendor tables are not created in the database file. The ras-mc-ctl
script then breaks when it tries to query data from the non-existent tables.

Add support for creating vendor tables at startup.

Signed-off-by: Hunter He <hunter.he@jaguarmicro.com>
15 months agoAdd dynamic switch of ras events support.
caixiaomeng 00662745 [Wed, 29 Nov 2023 06:31:46 +0000 (14:31 +0800)]
Add dynamic switch of ras events support.

Rasdaemon does not provide a way to disable individual events through
configuration. A user who wants to disable a specific event (e.g.
block_rq_complete) has to recompile rasdaemon, which is inconvenient.

This patch adds a dynamic switch for ras events. Add the events
you want to disable to /etc/sysconfig/rasdaemon. For example:
`DISABLE="ras:mc_event,block:block_rq_complete"`. After rasdaemon is
restarted, these two events are disabled without recompilation.
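
One way such a check could look is sketched below. This is an assumption for
illustration, not the code added by this patch: the helper name
event_in_disable_list() is hypothetical, and the sketch takes the list as a
plain string rather than reading it from the configuration itself. It
matches whole "group:event" entries only and does not trim whitespace.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Return true if "group:event" appears as one whole entry in a
 * comma-separated list such as "ras:mc_event,block:block_rq_complete".
 * Hypothetical helper; not the rasdaemon implementation.
 */
static bool event_in_disable_list(const char *list, const char *group,
				  const char *event)
{
	size_t glen, elen;
	const char *p = list;

	assert(group && event);
	if (!list)
		return false;	/* nothing configured, nothing disabled */

	glen = strlen(group);
	elen = strlen(event);

	while (*p) {
		/* Length of the current entry, up to the next comma. */
		size_t len = strcspn(p, ",");

		if (len == glen + 1 + elen &&
		    !strncmp(p, group, glen) &&
		    p[glen] == ':' &&
		    !strncmp(p + glen + 1, event, elen))
			return true;

		p += len;
		if (*p == ',')
			p++;	/* skip the separator */
	}

	return false;
}
```

Comparing the entry length first avoids prefix matches, so an entry such as
"ras:mc_event" does not disable a shorter or longer event name that merely
shares its beginning.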

[mchehab: make is_disabled_event() static]
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
15 months agorasdaemon: Add support for vendor-specific machine check error information
Avadhut Naik [Tue, 21 Nov 2023 20:04:19 +0000 (14:04 -0600)]
rasdaemon: Add support for vendor-specific machine check error information

Some CPU vendors may provide additional vendor-specific machine check
error information. AMD, for example, provides FRU Text through SYND 1/2
registers if BIT 9 of SMCA_CONFIG register is set.

Add support to display the additional vendor-specific error information,
if any.
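
To make the AMD example above concrete, here is a hedged sketch of how FRU
text might be assembled from two 64-bit syndrome register values once bit 9
of the configuration register is known to be set. The function name
fru_text_from_synd(), the macros, and the byte-order handling (a plain
memcpy in host order) are assumptions for illustration and are not taken
from this commit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define FRU_TEXT_VALID_BIT 9	/* bit 9 of the config register, per the text above */
#define FRU_TEXT_LEN       16	/* two 64-bit registers, 8 bytes each */

/*
 * Build a NUL-terminated FRU text string from two syndrome register
 * values. Returns false (and leaves out untouched) when the config
 * register does not flag the FRU text as present.
 */
static bool fru_text_from_synd(uint64_t config, uint64_t synd1, uint64_t synd2,
			       char out[FRU_TEXT_LEN + 1])
{
	assert(out);

	if (!(config & (UINT64_C(1) << FRU_TEXT_VALID_BIT)))
		return false;

	/* Assumption: the text bytes are laid out in host byte order. */
	memcpy(out, &synd1, sizeof(synd1));
	memcpy(out + sizeof(synd1), &synd2, sizeof(synd2));
	out[FRU_TEXT_LEN] = '\0';

	return true;
}
```

The output buffer is one byte larger than the register payload so that the
result is always terminated, even when all 16 bytes carry text.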

Signed-off-by: Avadhut Naik <Avadhut.Naik@amd.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
18 months agorasdaemon: ras-mc-ctl: Modify check for HiSilicon KunPeng9xx error fields
Shiju Jose [Thu, 24 Aug 2023 12:07:17 +0000 (13:07 +0100)]
rasdaemon: ras-mc-ctl: Modify check for HiSilicon KunPeng9xx error fields

Modify the check for valid HiSilicon KunPeng9xx error fields.
This fixes error data not being printed when its value is 0.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
18 months agorasdaemon: Add Emerald Rapids support
Delgado Vargas, Daniel [Fri, 20 Oct 2023 16:57:11 +0000 (10:57 -0600)]
rasdaemon: Add Emerald Rapids support

Signed-off-by: Delgado Vargas, Daniel <daniel.delgado.vargas@intel.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
18 months agoAdd a space between "diskerror_event" and "store"
weidongkl [Tue, 19 Sep 2023 08:29:21 +0000 (16:29 +0800)]
Add a space between "diskerror_event" and "store"

Signed-off-by: weidongkl <weidongkl@sina.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
18 months agorasdaemon: ras-mc-ctl: Add support to display the THead vendor errors
Ruidong Tian [Thu, 7 Sep 2023 10:22:06 +0000 (18:22 +0800)]
rasdaemon: ras-mc-ctl: Add support to display the THead vendor errors

Add support for the THead YiTian DDRC register dump event.

Signed-off-by: Ruidong Tian <tianruidong@linux.alibaba.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>