scarlet-storm [Mon, 3 Mar 2025 09:46:57 +0000 (15:16 +0530)]
Use the right dev_t decoding for diskerror handler
There is a dev_t type defined by libc for makedev etc, which
is exposed by the kernel headers & also a dev_t type defined in the
internal kernel headers.
Both have similar functionality, but different encoding.
Copy the MAJOR & MINOR macros from linux/kdev_t.h for proper
decoding of the trace events.
Fixes #71
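As a rough illustration of the decoding described above, the sketch below reproduces the kernel-internal layout from linux/kdev_t.h (20 minor bits, major in the bits above). The KDEV_ prefix and the kdev_make() helper are hypothetical names used here only to avoid clashing with any MAJOR/MINOR already in scope; they are not taken from the rasdaemon sources.

```c
#include <assert.h>
#include <stdint.h>

/* Kernel-internal dev_t layout, as in linux/kdev_t.h: minor in the low 20 bits. */
#define KDEV_MINORBITS 20
#define KDEV_MINORMASK ((1U << KDEV_MINORBITS) - 1)

/* Hypothetical names: same arithmetic as the kernel's MAJOR() and MINOR(). */
#define KDEV_MAJOR(dev) ((unsigned int)((dev) >> KDEV_MINORBITS))
#define KDEV_MINOR(dev) ((unsigned int)((dev) & KDEV_MINORMASK))

/* Build a kernel-encoded value, mirroring the kernel's MKDEV(). */
static inline uint32_t kdev_make(unsigned int major, unsigned int minor)
{
	assert(minor <= KDEV_MINORMASK);
	return ((uint32_t)major << KDEV_MINORBITS) | minor;
}
```

The libc major() and minor() helpers expect a different bit layout, so applying them to a value taken from a trace event would give the wrong device numbers.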
Shiju Jose [Sun, 10 Nov 2024 20:19:16 +0000 (20:19 +0000)]
rasdaemon: ras-mc-ctl: Update logging of CXL memory module data to align with CXL spec rev 3.1
CXL spec 3.1 section 8.2.9.2.1.3 Table 8-47, Memory Module Event Record
has been updated with the following new fields and new info for the Device Event Type
and Device Health Information fields.
1. Validity Flags
2. Component Identifier
3. Device Event Sub-Type
This update modifies ras-mc-ctl to parse and log CXL memory module event
data stored in the RAS SQLite database table, reflecting the
specification changes introduced in revision 3.1.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Shiju Jose [Mon, 11 Nov 2024 12:40:25 +0000 (12:40 +0000)]
rasdaemon: ras-mc-ctl: Update logging of CXL DRAM event data to align with CXL spec rev 3.1
CXL spec 3.1 section 8.2.9.2.1.2 Table 8-46, DRAM Event Record has been updated
with the following new fields and new types for the Memory Event Type, Transaction
Type and Validity Flags fields.
1. Component Identifier
2. Sub-channel
3. Advanced Programmable Corrected Memory Error Threshold Event Flags
4. Corrected Volatile Memory Error Count at Event
5. Memory Event Sub-Type
This update modifies ras-mc-ctl to parse and log CXL DRAM event data
stored in the RAS SQLite database table, reflecting the specification
changes introduced in revision 3.1.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Shiju Jose [Fri, 8 Nov 2024 18:06:42 +0000 (18:06 +0000)]
rasdaemon: ras-mc-ctl: Update logging of CXL general media event data to align with CXL spec rev 3.1
CXL spec rev 3.1 section 8.2.9.2.1.1 Table 8-45, General Media Event
Record has been updated with the following new fields and new types for the Memory
Event Type and Transaction Type fields.
1. Advanced Programmable Corrected Memory Error Threshold Event Flags
2. Corrected Memory Error Count at Event
3. Memory Event Sub-Type
The format of component identifier has changed (CXL spec 3.1 section
8.2.9.2.1 Table 8-44).
This update modifies ras-mc-ctl to parse and log CXL general media event
data stored in the RAS SQLite database table, reflecting the specification
changes introduced in revision 3.1.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Shiju Jose [Fri, 8 Nov 2024 16:06:37 +0000 (16:06 +0000)]
rasdaemon: ras-mc-ctl: Update logging of common event data to align with CXL spec rev 3.1
The Common Event Record format in the CXL spec 3.1, section 8.2.9.2.1,
Table 8-42, has been updated to include Maintenance Operation Subclass
information.
This update modifies ras-mc-ctl to log CXL common event data in the RAS
SQLite database tables, reflecting the specification changes introduced
in revision 3.1.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Shiju Jose [Fri, 8 Nov 2024 14:43:24 +0000 (14:43 +0000)]
rasdaemon: ras-mc-ctl: Fix logging of memory event type in CXL DRAM error table
CXL spec rev 3.0 section 8.2.9.2.1.2 defines the DRAM Event Record.
Fix decoding of memory event type in the CXL DRAM error table in RAS
SQLite database.
For example, a value of 0x1 is logged as Invalid Address
(General Media Event Record - Memory Event Type) instead of Scrub Media
ECC Error (DRAM Event Record - Memory Event Type), and so on.
Fixes: c38c14afc5d7 ("rasdaemon: ras-mc-ctl: Add support for CXL DRAM trace events") Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
CXL spec 3.1 section 8.2.9.2.1.3 Table 8-47, Memory Module Event Record
has been updated with the following new fields and new info for the Device Event Type
and Device Health Information fields.
1. Validity Flags
2. Component Identifier
3. Device Event Sub-Type
Update the parsing, logging and recording of memory module event for the
above spec rev 3.1 changes.
Example rasdaemon log for CXL memory module event,
cxl_memory_module 2024-11-19 18:43:15 +0000 memdev:mem3 host:0000:0f:00.0 \
serial:0x3 log type:Fatal hdr_uuid:fe927475-dd59-4339-a586-79bab113b774 \
hdr_handle:0x2 hdr_related_handle:0x0 hdr_timestamp:1970-01-01 00:09:36 +0000 \
hdr_length:128 hdr_maint_op_class:0 hdr_maint_op_sub_class:1 \
event_type:Temperature Change event_sub_type:Unsupported Config Data \
health_status:'MAINTENANCE_NEEDED' 'REPLACEMENT_NEEDED' \
media_status:All Data Loss in Event of Power Loss as_life_used:Unknown \
as_dev_temp:Normal as_cor_vol_err_cnt:Normal as_cor_per_err_cnt:Normal \
life_used:8 device_temp:3 dirty_shutdown_cnt:33 cor_vol_err_cnt:25 \
cor_per_err_cnt:45 comp_id:02 74 c5 08 9a 1a 0b fc d2 7e 2f 31 9b 3c 81 4d \
comp_id_pldm_valid_flags:'Resource ID' Resource ID:fc d2 7e 2f
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
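The comp_id bytes in the example log above can be split as in the following sketch. The layout (flags in byte 0, a 6-byte entity ID in bytes 1..6, a 4-byte resource ID in bytes 7..10) is an assumption inferred from that log line, where "Resource ID:fc d2 7e 2f" matches bytes 7..10 and the flags byte is 0x02. The struct, macro and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define COMP_ID_SIZE            16
#define PLDM_ENTITY_ID_VALID    0x01	/* assumed: flags bit 0 */
#define PLDM_RESOURCE_ID_VALID  0x02	/* assumed: flags bit 1 */

/* Hypothetical container for the decoded PLDM-format component identifier. */
struct pldm_comp_id {
	bool entity_valid;
	bool resource_valid;
	uint8_t entity_id[6];	/* comp_id bytes 1..6 */
	uint8_t resource_id[4];	/* comp_id bytes 7..10 */
};

/* Split a raw 16-byte component identifier into flags and IDs. */
static void parse_pldm_comp_id(const uint8_t comp_id[COMP_ID_SIZE],
			       struct pldm_comp_id *out)
{
	assert(comp_id && out);
	out->entity_valid = (comp_id[0] & PLDM_ENTITY_ID_VALID) != 0;
	out->resource_valid = (comp_id[0] & PLDM_RESOURCE_ID_VALID) != 0;
	memcpy(out->entity_id, &comp_id[1], sizeof(out->entity_id));
	memcpy(out->resource_id, &comp_id[7], sizeof(out->resource_id));
}
```

A caller would print the entity or resource ID only when the matching valid flag is set, which is consistent with the log showing only the resource ID.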
Shiju Jose [Mon, 11 Nov 2024 11:54:10 +0000 (11:54 +0000)]
rasdaemon: cxl: Update CXL DRAM event to CXL spec rev 3.1
CXL spec 3.1 section 8.2.9.2.1.2 Table 8-46, DRAM Event Record has been updated
with the following new fields and new types for the Memory Event Type, Transaction
Type and Validity Flags fields.
1. Component Identifier
2. Sub-channel
3. Advanced Programmable Corrected Memory Error Threshold Event Flags
4. Corrected Memory Error Count at Event
5. Memory Event Sub-Type
Update the parsing, logging and recording of DRAM event for the above
spec rev 3.1 changes.
Shiju Jose [Mon, 11 Nov 2024 11:49:49 +0000 (11:49 +0000)]
rasdaemon: cxl: Update CXL general media event to CXL spec rev 3.1
CXL spec rev 3.1 section 8.2.9.2.1.1 Table 8-45, General Media Event
Record has been updated with the following new fields and new types for the Memory
Event Type and Transaction Type fields.
1. Advanced Programmable Corrected Memory Error Threshold Event Flags
2. Corrected Memory Error Count at Event
3. Memory Event Sub-Type
The format of component identifier has changed (CXL spec 3.1 section
8.2.9.2.1 Table 8-44).
Update the parsing, logging and recording of general media event for
the above spec changes.
Example rasdaemon log for CXL general media event,
Shiju Jose [Tue, 5 Nov 2024 17:51:29 +0000 (17:51 +0000)]
rasdaemon: cxl: Update common event to CXL spec rev 3.1
CXL spec 3.1 section 8.2.9.2.1 Table 8-42, Common Event Record format has
been updated with Maintenance Operation Subclass information.
Add updates in rasdaemon CXL event handler for the above spec change
and for the corresponding changes in kernel CXL common trace event
implementation.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Shiju Jose [Fri, 1 Nov 2024 18:57:07 +0000 (18:57 +0000)]
rasdaemon: cxl: Add automatic indexing for storing CXL fields in SQLite database
When the CXL specification adds new fields to the common header of
CXL event records, manual updates to the indexing are required to
store these CXL fields in the SQLite database. This update introduces
automatic indexing to facilitate the storage of CXL fields in the
SQLite database, eliminating the need for manual update to indexing.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
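A minimal sketch of the idea, assuming a running index is used for the bind position instead of hard-coded numbers. To stay self-contained it uses a stand-in statement type and bind function (struct fake_stmt and fake_bind_int64() are hypothetical); a real build would call sqlite3_bind_int64() on an sqlite3_stmt, which also takes 1-based positions.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_BIND_FIELDS 8

/* Stand-in for a prepared statement; slot 0 is unused (positions are 1-based). */
struct fake_stmt {
	int64_t values[MAX_BIND_FIELDS + 1];
};

/* Stand-in for sqlite3_bind_int64(): store a value at a 1-based position. */
static int fake_bind_int64(struct fake_stmt *stmt, int index, int64_t value)
{
	assert(stmt && index >= 1 && index <= MAX_BIND_FIELDS);
	stmt->values[index] = value;
	return 0;
}

/*
 * Bind common header fields with a running index. Adding a field means
 * adding one line here; no later position has to be renumbered.
 * Returns the last position used so the caller can continue from it.
 */
static int bind_common_header(struct fake_stmt *stmt, int64_t hdr_handle,
			      int64_t hdr_related_handle,
			      int64_t hdr_maint_op_class,
			      int64_t hdr_maint_op_sub_class)
{
	int idx = 0;

	fake_bind_int64(stmt, ++idx, hdr_handle);
	fake_bind_int64(stmt, ++idx, hdr_related_handle);
	fake_bind_int64(stmt, ++idx, hdr_maint_op_class);
	fake_bind_int64(stmt, ++idx, hdr_maint_op_sub_class);
	return idx;
}
```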
Shiju Jose [Wed, 20 Nov 2024 00:52:46 +0000 (00:52 +0000)]
rasdaemon: cxl: Fix mismatch in region field's name with kernel DRAM trace event
Fix mismatch in 'region' field's name with kernel DRAM trace event.
Fixes: ea224ad58b37 ("rasdaemon: CXL: Extract, log and record region info from cxl_general_media and cxl_dram events") Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Shiju Jose [Fri, 1 Nov 2024 16:38:26 +0000 (16:38 +0000)]
rasdaemon: cxl: Fix logging of memory event type of DRAM trace event
CXL spec rev 3.0 section 8.2.9.2.1.2 defines the DRAM Event Record.
Fix logging of memory event type field of DRAM trace event.
For example, a value of 0x1 is reported as Invalid Address
(General Media Event Record - Memory Event Type) instead of Scrub Media
ECC Error (DRAM Event Record - Memory Event Type), and so on.
Fixes: 9a2f6186db26 ("rasdaemon: Add support for the CXL dram events") Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
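The mismatch described above comes from decoding both record types with one table. A sketch of the fix keeps a separate table per record type. Only the 0x1 entries are stated in the commit message; the remaining entries are filled in from the CXL 3.0 tables as an assumption, and all identifiers below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* General Media Event Record - Memory Event Type (assumed beyond 0x1). */
static const char * const gmer_mem_event_type[] = {
	"Media ECC Error",
	"Invalid Address",
	"Data Path Error",
};

/* DRAM Event Record - Memory Event Type (assumed beyond 0x1). */
static const char * const der_mem_event_type[] = {
	"Media ECC Error",
	"Scrub Media ECC Error",
	"Invalid Address",
	"Data Path Error",
};

/* Bounds-checked lookup shared by both tables. */
static const char *lookup_name(const char * const *table, size_t n, uint8_t v)
{
	assert(table);
	return v < n ? table[v] : "Unknown";
}

static const char *gmer_mem_event_type_name(uint8_t v)
{
	return lookup_name(gmer_mem_event_type, ARRAY_SIZE(gmer_mem_event_type), v);
}

static const char *der_mem_event_type_name(uint8_t v)
{
	return lookup_name(der_mem_event_type, ARRAY_SIZE(der_mem_event_type), v);
}
```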
Shiju Jose [Thu, 9 Jan 2025 17:16:57 +0000 (17:16 +0000)]
rasdaemon: Fix for parsing error when trace event's format file is larger than PAGE_SIZE
When a trace event's format file is larger than PAGE_SIZE (4096 bytes),
libtraceevent reports a parsing failure when rasdaemon reads the fields.
The cause is that the tep_parse_event() call in add_event_handler()
fails internally in libtraceevent because the format file data is
incomplete. However, libtraceevent did not return an error at this stage,
which is fixed by the following libtraceevent patch:
https://lore.kernel.org/all/20250109102338.6128644d@gandalf.local.home/
When rasdaemon reads a trace event format file, the maximum data size
that can be read at once is limited to PAGE_SIZE by the seq_read() and
seq_read_iter() functions in the kernel. Userspace therefore receives
partial data if the format file is larger than PAGE_SIZE, so rasdaemon
must be fixed to read the complete data from the format file.
Fix add_event_handler() to read trace event format files larger than
PAGE_SIZE.
Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
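One common way to read such a file completely is to call read() in a loop until it returns 0, growing the buffer as needed. The sketch below shows that pattern; read_whole_file() is a hypothetical helper, not the code added to add_event_handler().

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Read an entire file by calling read() until it reports EOF.
 * Returns a NUL-terminated buffer the caller must free(), or NULL on error.
 */
static char *read_whole_file(const char *path, size_t *len_out)
{
	size_t cap = 4096, len = 0;
	char *buf, *tmp;
	ssize_t n;
	int fd;

	assert(path && len_out);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return NULL;
	buf = malloc(cap);
	if (!buf)
		goto fail;

	for (;;) {
		/* Grow when full, keeping one spare byte for the NUL. */
		if (len + 1 >= cap) {
			tmp = realloc(buf, cap * 2);
			if (!tmp)
				goto fail;
			buf = tmp;
			cap *= 2;
		}
		n = read(fd, buf + len, cap - len - 1);
		if (n < 0) {
			if (errno == EINTR)
				continue;
			goto fail;
		}
		if (n == 0)	/* EOF: the whole file has been read */
			break;
		len += (size_t)n;
	}
	close(fd);
	buf[len] = '\0';
	*len_out = len;
	return buf;

fail:
	free(buf);
	close(fd);
	return NULL;
}
```

A single read() that returns fewer bytes than the file holds is not treated as the end of the data; only a return value of 0 ends the loop.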
Srinivasulu Thanneeru [Wed, 20 Nov 2024 06:25:28 +0000 (22:25 -0800)]
rasdaemon: Add page offline support for cxl memory
A CXL Type 3 device implements a threshold for corrected errors, as described in
CXL 3.1 specification sections 8.2.9.2.1.2 and 8.2.9.9.11.3.
The device can set the threshold field in the DRAM event descriptor when
it detects corrected errors that meet or exceed the threshold value.
This patch offlines pages for corrected memory errors when the
device sets the threshold in the DRAM event descriptor.
This helps prevent corrected errors from becoming uncorrected.
Record the HPA for the given DPA, then offline the page at that HPA when the
corrected error threshold is set.
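A minimal sketch of that flow, assuming the threshold flag is bit 1 of the memory event descriptor and that the page is soft-offlined through the kernel's /sys/devices/system/memory/soft_offline_page file. The function and macro names are hypothetical, and error handling is reduced to a return code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Assumed descriptor bit: threshold event is bit 1. */
#define CXL_EVT_DESC_THRESHOLD	(1u << 1)

#define SOFT_OFFLINE_PATH "/sys/devices/system/memory/soft_offline_page"

/* True when the device reports that the corrected error threshold was hit. */
static bool cxl_threshold_reached(uint8_t descriptor)
{
	return (descriptor & CXL_EVT_DESC_THRESHOLD) != 0;
}

/* Format the host physical address as the hex string written to sysfs. */
static int format_offline_request(uint64_t hpa, char *buf, size_t size)
{
	assert(buf && size > 0);
	return snprintf(buf, size, "0x%llx", (unsigned long long)hpa);
}

/* Ask the kernel to soft-offline the page containing hpa (needs root). */
static int soft_offline_hpa(uint64_t hpa)
{
	char req[32];
	FILE *f;
	int rc;

	format_offline_request(hpa, req, sizeof(req));
	f = fopen(SOFT_OFFLINE_PATH, "w");
	if (!f)
		return -1;
	rc = fputs(req, f) < 0 ? -1 : 0;
	if (fclose(f) != 0)
		rc = -1;
	return rc;
}
```

A handler would call soft_offline_hpa() only when cxl_threshold_reached() is true for the event's descriptor and a valid HPA was recorded for the reported DPA.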
ras-page-isolation.h: fix most coding style issues
Fix several checkpatch.pl warnings:
ras-page-isolation.h:50: WARNING:SPACING: missing space after enum definition
ras-page-isolation.h:79: ERROR:COMPLEX_MACRO: Macros with complex values should be enclosed in parentheses
ras-page-isolation.h:80: ERROR:COMPLEX_MACRO: Macros with complex values should be enclosed in parentheses
ras-page-isolation.h:96: WARNING:CONSTANT_COMPARISON: Comparisons should place the constant on the right side of the test
ras-page-isolation.h:119: WARNING:UNSPECIFIED_INT: Prefer 'unsigned int' to bare use of 'unsigned'
Note that the following warning was not addressed,
as it seems to be a false positive:
ras-page-isolation.h:96: WARNING:CONSTANT_COMPARISON: Comparisons should place the constant on the right side of the test
The location_fields is used for both APEI and DSM data.
The logic there defines 7 values for APEI and 9 for DSM,
but, with the current logic, it allocates only 7 elements.
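One way to avoid this kind of under-allocation is to size the shared array from the larger of the two field counts. This is only a sketch of that idea: APEI_FIELD_NUM and DSM_FIELD_NUM are hypothetical names for the 7 and 9 values mentioned above, and the struct is a stand-in for the real one.

```c
#include <assert.h>
#include <stddef.h>

#define APEI_FIELD_NUM	7	/* hypothetical name: values used for APEI data */
#define DSM_FIELD_NUM	9	/* hypothetical name: values used for DSM data */

/* Size the shared array for whichever source needs more elements. */
#define ROW_LOCATION_FIELDS_NUM \
	(APEI_FIELD_NUM > DSM_FIELD_NUM ? APEI_FIELD_NUM : DSM_FIELD_NUM)

/* Stand-in for the structure that holds the location fields. */
struct row_location_sketch {
	int location_fields[ROW_LOCATION_FIELDS_NUM];
};

/* Fail the build if the array can no longer hold either data source. */
_Static_assert(ROW_LOCATION_FIELDS_NUM >= APEI_FIELD_NUM &&
	       ROW_LOCATION_FIELDS_NUM >= DSM_FIELD_NUM,
	       "location_fields too small");

/* Number of elements available in location_fields. */
static size_t row_location_capacity(void)
{
	return sizeof(((struct row_location_sketch *)0)->location_fields) /
	       sizeof(int);
}
```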
mce-intel-ivb/mce-intel-sb: remove code commented with #if 0
The dead code has been there for a long time without any attempt
to actually implement it for SB/IVB. As such CPUs were released
a long time ago, it is unlikely that someone would address the
comments there.
So, drop the dead code. If needed, this patch can be reverted
later.
rasdaemon: don't use braces for single statement blocks
Fix these checkpatch warnings:
WARNING: braces {} are not necessary for single statement blocks
+ if (clock_gettime(clk_id, &ts) == 0 && !strcmp(ev.error_type, "Corrected")) {
+ ras_record_row_error(ev.driver_detail, ev.error_count, ts.tv_sec, ev.address);
+ }
total: 0 errors, 1 warnings, 0 checks, 304 lines checked
WARNING: braces {} are not necessary for single statement blocks
+ if (!matched) {
+ log(TERM, LOG_INFO, "Improper %s, set to default off\n", env);
+ }
WARNING: braces {} are not necessary for any arm of this statement
+ if (rr1->type == GHES) {
[...]
+ } else {
[...]
WARNING: braces {} are not necessary for single statement blocks
+ for (int i = 0; i < ROW_LOCATION_FIELDS_NUM; i++) {
+ dst->location_fields[i] = src->location_fields[i];
+ }
zhuofeng [Sat, 29 Jun 2024 09:27:56 +0000 (17:27 +0800)]
The rasdaemon service may fail to be started for the first time.
On first startup, rasdaemon creates a separate tracing instance directory, such as `/sys/kernel/debug/tracing/instances/rasdaemon`.
After the directory is created, the kernel generates virtual files such as `trace_clock` and `set_event` in `/sys/kernel/debug/tracing/instances/rasdaemon`.
The kernel generates these virtual files while rasdaemon is already trying to access them, so a file may not exist yet when rasdaemon accesses it.
So wait up to 30 seconds to give the kernel enough time to generate the files.
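The wait can be sketched as a bounded polling loop. wait_for_file() and the 100 ms polling interval are assumptions made for illustration; the commit only states the 30 second upper bound.

```c
#include <assert.h>
#include <time.h>
#include <unistd.h>

#define POLL_INTERVAL_MS 100	/* assumed polling step */

/*
 * Poll until path exists or timeout_ms has elapsed.
 * Returns 0 once the file is visible, -1 on timeout.
 */
static int wait_for_file(const char *path, long timeout_ms)
{
	const struct timespec step = {
		.tv_sec = 0,
		.tv_nsec = POLL_INTERVAL_MS * 1000000L,
	};
	long waited_ms = 0;

	assert(path);
	for (;;) {
		if (access(path, F_OK) == 0)
			return 0;
		if (waited_ms >= timeout_ms)
			return -1;
		nanosleep(&step, NULL);
		waited_ms += POLL_INTERVAL_MS;
	}
}
```

With this helper, a call such as wait_for_file(".../instances/rasdaemon/trace_clock", 30000) returns as soon as the kernel has created the file, so the full 30 seconds are spent only in the failure case.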
zhuofeng [Mon, 4 Mar 2024 13:04:42 +0000 (21:04 +0800)]
New feature: support memory row CE threshold policy
- Introduction: Identify memory row faults in memory CE faults and
isolate the physical memory pages where row faults occur. This method
can effectively prevent CE storms or memory UCE faults caused by memory
row failures.
- Implementation: The system counts the number of CE faults in the same
memory row within a specified period. If the number of CE faults exceeds
the configured threshold, the system considers that the memory row may
fail and isolates all physical pages recorded in the memory row.
Notes:
1. This function is disabled by default. You can enable it by
configuring the 'ROW_CE_ACTION' field in the '/etc/sysconfig/rasdaemon' configuration file.
2. If both row isolation and page isolation are enabled, page isolation is automatically
disabled by default.
3. If the fault count in the DIMM CE fault information received by rasdaemon
is 0, the BIOS did not correctly parse the fault count when parsing the fault information.
In that case, rasdaemon treats the fault count as 1 by default,
which is the same as the kernel's behavior.
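The counting policy described above can be sketched as follows. This is a simplified fixed-window counter for a single row, not the rasdaemon implementation; the struct and function names are hypothetical, and the threshold and window would come from configuration.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Per-row state: when the current window began and CEs seen in it. */
struct row_ce_counter {
	time_t window_start;
	unsigned long count;
};

/*
 * Record a CE report for one memory row.
 * Returns true when the count in the current window exceeds threshold,
 * meaning the pages recorded for this row should be isolated.
 */
static bool row_ce_record(struct row_ce_counter *row, time_t now,
			  unsigned long reported, unsigned long threshold,
			  time_t window)
{
	assert(row && window > 0);

	/* A reported count of 0 is treated as one fault (see note 3). */
	if (reported == 0)
		reported = 1;

	/* Start a new window on first use or when the old one expired. */
	if (row->count == 0 || now - row->window_start >= window) {
		row->window_start = now;
		row->count = 0;
	}
	row->count += reported;
	return row->count > threshold;
}
```

For example, with a threshold of 2 and a 60 second window, three single-fault reports within one minute trigger isolation, while reports spread further apart do not.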
Avadhut Naik [Thu, 7 Nov 2024 06:24:44 +0000 (06:24 +0000)]
rasdaemon: Modify support for vendor-specific machine check error information
Commit 83a3ced797256d ("rasdaemon: Add support for vendor-specific
machine check error information") assumes that MCA_CONFIG MSR will be
exported as part of vendor-specific error information through the MCE
tracepoint.
The same, however, is not true anymore. MCA_CONFIG MSR will not be
exported through the MCE tracepoint. Instead, the data from MCA_SYND1/2
MSRs, exported as vendor-specific error information on newer AMD SOCs,
should always be interpreted as FRUText.
Modify the error decoding support accordingly.
Fixes: 83a3ced797256d ("rasdaemon: Add support for vendor-specific
machine check error information") Signed-off-by: Avadhut Naik <avadhut.naik@amd.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
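Interpreting the two syndrome registers as FRUText can be sketched as below, assuming the text is 16 bytes long, that MCA_SYND1 holds the first 8 bytes and MCA_SYND2 the next 8, and that the host byte order matches the register layout (as on x86). decode_frutext() is a hypothetical helper, not the function in the rasdaemon sources.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FRUTEXT_LEN 16	/* assumed: 8 bytes from each syndrome register */

/*
 * Concatenate the raw bytes of MCA_SYND1 and MCA_SYND2 into a
 * NUL-terminated string. out must hold FRUTEXT_LEN + 1 bytes.
 */
static void decode_frutext(uint64_t synd1, uint64_t synd2,
			   char out[FRUTEXT_LEN + 1])
{
	assert(out);
	memcpy(out, &synd1, sizeof(synd1));
	memcpy(out + sizeof(synd1), &synd2, sizeof(synd2));
	out[FRUTEXT_LEN] = '\0';
}
```

Since the MCA_CONFIG MSR is no longer available through the tracepoint, a decoder written this way has no validity bit to consult and treats the vendor data as text unconditionally.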
Shiju Jose [Thu, 16 May 2024 14:23:44 +0000 (15:23 +0100)]
rasdaemon: ras-mc-ctl: Log hpa and region info from cxl_general_media and cxl_dram tables
Add support for reading and logging hpa and region info from the cxl_general_media and
cxl_dram tables.
Note: This change is not backward compatible, because
the select command with the newly added columns fails on previous
CXL tables where those columns are not present.
The issue could be solved by updating the CXL table's name to v2,
but ras-mc-ctl would still not be backward compatible when listing errors:
it fails when only the previous version of the CXL table is present in the
database, as it cannot find v2 of the table.
Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Shiju Jose [Wed, 15 May 2024 11:18:36 +0000 (12:18 +0100)]
rasdaemon: CXL: Extract, log and record region info from cxl_general_media and cxl_dram events
Extract, log and record region info from cxl_general_media and
cxl_dram events.
The corresponding kernel changes:
https://lore.kernel.org/all/cover.1711598777.git.alison.schofield@intel.com/T/#m6fd773b5477fc44b875848e053708a1c8996c4e4
Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Shiju Jose [Mon, 19 Aug 2024 10:56:15 +0000 (11:56 +0100)]
rasdaemon: ras-events: Fix warning ‘filter_ras_mc_event’ defined but not used
Fix the following compilation warning:
ras-events.c:318:12: warning: ‘filter_ras_mc_event’ defined but not used [-Wunused-function]
static int filter_ras_mc_event(struct ras_events *ras, char *group, char *event,
Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Fix the following checkpatch warning in ras-arm-handler:
+ trace_seq_printf(s, " Program execution can be restarted reliably at the PC associated with the error");
Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Shiju Jose [Mon, 19 Aug 2024 10:33:08 +0000 (11:33 +0100)]
rasdaemon: Fix for compilation warning in ras-memory-failure-handler.c
Fix the following compilation warning:
ras-memory-failure-handler.c:120:6: warning: implicit declaration of function ‘asprintf’; did you mean ‘vsprintf’? [-Wimplicit-function-declaration]
if (asprintf(&env[ei++], "PATH=%s", getenv("PATH") ?: "/sbin:/usr/sbin:/bin:/usr/bin") < 0)
Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Avadhut Naik [Fri, 16 Aug 2024 18:10:40 +0000 (18:10 +0000)]
rasdaemon: Fix mem_fail_event build breakage
Commit 566a52622b1d ("add mem_fail_event trigger") introduces an event
trigger for a memory failure event.
However, if the rasdaemon is not configured with enable-memory-failure,
the setup function of the trigger, mem_fail_event_trigger_setup(), will
result in an undefined reference linker error when called through
setup_event_trigger().
Ensure that the setup function for the trigger is called only when the
rasdaemon has been configured with enable-memory-failure.
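The guard can be sketched with conditional compilation. HAVE_MEMORY_FAILURE is assumed here to be the macro defined when rasdaemon is configured with enable-memory-failure, the prototype of mem_fail_event_trigger_setup() is assumed, and setup_event_trigger_sketch() is a hypothetical stand-in for setup_event_trigger().

```c
#include <assert.h>
#include <string.h>

#ifdef HAVE_MEMORY_FAILURE
/* Only declared, and only linked, when memory failure support is built. */
int mem_fail_event_trigger_setup(void);
#endif

/* Hypothetical stand-in for setup_event_trigger(); returns 0 on success. */
static int setup_event_trigger_sketch(const char *event)
{
	assert(event);
#ifdef HAVE_MEMORY_FAILURE
	/* The call exists only in builds that also provide the function. */
	if (!strcmp(event, "memory_failure_event"))
		return mem_fail_event_trigger_setup();
#endif
	return 0;
}
```

Without the guard, the call site would reference a symbol that is never compiled in, which is what produces the undefined reference at link time.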
Tomohiro Misono [Thu, 8 Aug 2024 09:21:17 +0000 (09:21 +0000)]
ras-events: fix -d option to work again
It seems commit 3e9a59a184ca("Add dynamic switch of ras events support.")
inadvertently introduced the change to ignore -d option.
Fix this so that -d will disable all trace events at once like before.
- Rework the returned code logic to be more consistent;
- error codes will be using negative values;
- positive values indicate special return codes.
- Don't bloat the logs with lots of error messages due to
unsupported traces;
- Ensure that the number of CPUs is properly retrieved, or bail out;
- Don't bail if it can't set up a monotonic clock: it is better
to have a wrong timestamp than no log at all.
ras-arm-handler: cope with latest upstream changes
Unfortunately, rasdaemon support for the firmware-first
CPER ARM processor extended trace was added years before
it was merged upstream. That's bad, especially since the
upstream review requested a change to some fields.
Fix support for it by aligning with latest upstream version:
https://lore.kernel.org/linux-edac/3853853f820a666253ca8ed6c7c724dc3d50044a.1720679234.git.mchehab+huawei@kernel.org/T/#m17003e47912b228e91e57ac6e4f90ea30061aa3b
A backward-compatible logic was added to avoid breaking with
existing OOT support.
scripts/checkpatch.pl: some improvements to reduce false positives
- camelcase is OK for printk inttypes.h;
- strncpy is OK;
- accept up to 120 chars on lines without warnings;
- stop complaining about "BACKTRACE=" strings split on multiple lines;
- remove PREFER_DEFINED_ATTRIBUTE_MACRO, as this is kernel-specific;
- remove MACRO_ARG_REUSE, as this applies mostly to multithreading;
- don't warn on using do{} while(0) with single line statements;
There were lots of changes in this version. The summary in
ChangeLog contains a sanitized version of them.
Note that the next version will likely bring
a uAPI-incompatible change. Unfortunately, the UEFI CPER record
trace for ARM processors is currently incomplete upstream.
Rasdaemon gained support for an extended ARM trace event that
supports all fields of the CPER record, but it depends on a
patch that is not upstream yet.
While looking at such patches, it became clear that some changes are needed
to get them merged, meaning that future versions of rasdaemon
may not be compatible with the downstream patch anymore.