www.infradead.org Git - users/mchehab/rasdaemon.git/log
users/mchehab/rasdaemon.git
6 weeks ago rasdaemon: bump to version 0.8.3 (master, v0.8.3)
Mauro Carvalho Chehab [Mon, 10 Mar 2025 11:00:28 +0000 (12:00 +0100)]
rasdaemon: bump to version 0.8.3

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
6 weeks ago ras-diskerror-handler.h: fix checkpatch warnings
Mauro Carvalho Chehab [Mon, 10 Mar 2025 10:52:00 +0000 (11:52 +0100)]
ras-diskerror-handler.h: fix checkpatch warnings

Adjust some whitespace to make checkpatch happier.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
6 weeks ago Use the right dev_t decoding for diskerror handler
scarlet-storm [Mon, 3 Mar 2025 09:46:57 +0000 (15:16 +0530)]
Use the right dev_t decoding for diskerror handler

There is a dev_t type defined by libc for makedev etc, which
is exposed by the kernel headers & also a dev_t type defined in the
internal kernel headers.
Both have similar functionality, but different encoding.
Copy the MAJOR & MINOR macros from linux/kdev_t.h for proper
decoding of the trace events.
Fixes #71

Signed-off-by: scarlet-storm <12461256+scarlet-storm@users.noreply.github.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
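
The decoding this commit refers to can be sketched as follows. This is a minimal illustration, assuming the kernel-internal layout from linux/kdev_t.h (minor number in the low 20 bits, major number above it). The KDEV_* and kdev_* names are hypothetical and chosen here to avoid clashing with libc's macros; the commit itself copies MAJOR and MINOR.

```c
#include <stdint.h>

/* Kernel-internal dev_t layout, mirroring linux/kdev_t.h:
 * the low 20 bits hold the minor number, the bits above hold the major.
 * (Hypothetical names; rasdaemon copies the MAJOR/MINOR macros.) */
#define KDEV_MINORBITS	20
#define KDEV_MINORMASK	((1U << KDEV_MINORBITS) - 1)

static inline unsigned int kdev_major(uint32_t dev)
{
	return dev >> KDEV_MINORBITS;
}

static inline unsigned int kdev_minor(uint32_t dev)
{
	return dev & KDEV_MINORMASK;
}
```

libc's major() and minor() assume a different bit layout, so applying them to a value taken from a trace event would give the wrong device numbers.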
6 weeks ago rasdaemon: Add new modules supported by HiSilicon common section
Bing Xia [Thu, 16 May 2024 02:21:20 +0000 (10:21 +0800)]
rasdaemon: Add new modules supported by HiSilicon common section

Add new modules supported by HiSilicon common error section.

Signed-off-by: Bing Xia <xiabing14@h-partners.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
6 weeks ago rasdaemon: Fix some static check warning
Xiaofei Tan [Fri, 7 Feb 2025 21:26:01 +0000 (21:26 +0000)]
rasdaemon: Fix some static check warning

The decode_int_fields() and decode_text_fields() functions are used
to replace the original if judgment branch, reducing the cyclomatic
complexity.

Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
6 weeks ago rasdaemon: Fix few compilation warnings in non standard hisilicon code
Bing Xia [Sun, 19 Jan 2025 11:26:43 +0000 (11:26 +0000)]
rasdaemon: Fix few compilation warnings in non standard hisilicon code

Fix the problem that the type of a constant string does not match
when it is assigned to a character pointer.

Signed-off-by: Bing Xia <xiabing14@h-partners.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
6 weeks ago rasdaemon: Fix some compilation alarms in ras-record.h.
Bing Xia [Sun, 19 Jan 2025 11:08:26 +0000 (11:08 +0000)]
rasdaemon: Fix some compilation alarms in ras-record.h.

Fix the problem that the type of a constant string does not match
when it is assigned to a character pointer.

Signed-off-by: Bing Xia <xiabing14@h-partners.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
6 weeks ago rasdaemon: ras-mc-ctl: Update logging of CXL memory module data to align with CXL...
Shiju Jose [Sun, 10 Nov 2024 20:19:16 +0000 (20:19 +0000)]
rasdaemon: ras-mc-ctl: Update logging of CXL memory module data to align with CXL spec rev 3.1

CXL spec 3.1 section 8.2.9.2.1.3 Table 8-47, Memory Module Event Record,
has been updated with the following new fields and new info for the Device
Event Type and Device Health Information fields.
1. Validity Flags
2. Component Identifier
3. Device Event Sub-Type

This update modifies ras-mc-ctl to parse and log CXL memory module event
data stored in the RAS SQLite database table, reflecting the
specification changes introduced in revision 3.1.

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
6 weeks ago rasdaemon: ras-mc-ctl: Update logging of CXL DRAM event data to align with CXL spec...
Shiju Jose [Mon, 11 Nov 2024 12:40:25 +0000 (12:40 +0000)]
rasdaemon: ras-mc-ctl: Update logging of CXL DRAM event data to align with CXL spec rev 3.1

CXL spec 3.1 section 8.2.9.2.1.2 Table 8-46, DRAM Event Record, has been
updated with the following new fields and new types for the Memory Event Type,
Transaction Type and Validity Flags fields.
1. Component Identifier
2. Sub-channel
3. Advanced Programmable Corrected Memory Error Threshold Event Flags
4. Corrected Volatile Memory Error Count at Event
5. Memory Event Sub-Type

This update modifies ras-mc-ctl to parse and log CXL DRAM event data
stored in the RAS SQLite database table, reflecting the specification
changes introduced in revision 3.1.

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
6 weeks ago rasdaemon: ras-mc-ctl: Update logging of CXL general media event data to align with...
Shiju Jose [Fri, 8 Nov 2024 18:06:42 +0000 (18:06 +0000)]
rasdaemon: ras-mc-ctl: Update logging of CXL general media event data to align with CXL spec rev 3.1

CXL spec rev 3.1 section 8.2.9.2.1.1 Table 8-45, General Media Event
Record, has been updated with the following new fields and new types for
the Memory Event Type and Transaction Type fields.
1. Advanced Programmable Corrected Memory Error Threshold Event Flags
2. Corrected Memory Error Count at Event
3. Memory Event Sub-Type

The format of component identifier has changed (CXL spec 3.1 section
8.2.9.2.1 Table 8-44).

This update modifies ras-mc-ctl to parse and log CXL general media event
data stored in the RAS SQLite database table, reflecting the specification
changes introduced in revision 3.1.

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
6 weeks ago rasdaemon: ras-mc-ctl: Update logging of common event data to align with CXL spec...
Shiju Jose [Fri, 8 Nov 2024 16:06:37 +0000 (16:06 +0000)]
rasdaemon: ras-mc-ctl: Update logging of common event data to align with CXL spec rev 3.1

The Common Event Record format in the CXL spec 3.1, section 8.2.9.2.1,
Table 8-42, has been updated to include Maintenance Operation Subclass
information.

This update modifies ras-mc-ctl to log CXL common event data in the RAS
SQLite database tables, reflecting the specification changes introduced
in revision 3.1.

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
6 weeks ago rasdaemon: ras-mc-ctl: Fix logging of memory event type in CXL DRAM error table
Shiju Jose [Fri, 8 Nov 2024 14:43:24 +0000 (14:43 +0000)]
rasdaemon: ras-mc-ctl: Fix logging of memory event type in CXL DRAM error table

CXL spec rev 3.0 section 8.2.9.2.1.2 defines the DRAM Event Record.

Fix decoding of the memory event type in the CXL DRAM error table in the
RAS SQLite database.
Before this fix, a value of 0x1, for example, was logged as Invalid Address
(General Media Event Record - Memory Event Type) instead of Scrub Media
ECC Error (DRAM Event Record - Memory Event Type), and so on.

Fixes: c38c14afc5d7 ("rasdaemon: ras-mc-ctl: Add support for CXL DRAM trace events")
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
6 weeks ago rasdaemon: cxl: Update memory module event to CXL spec rev 3.1
Shiju Jose [Mon, 11 Nov 2024 11:58:49 +0000 (11:58 +0000)]
rasdaemon: cxl: Update memory module event to CXL spec rev 3.1

CXL spec 3.1 section 8.2.9.2.1.3 Table 8-47, Memory Module Event Record,
has been updated with the following new fields and new info for the Device
Event Type and Device Health Information fields.
1. Validity Flags
2. Component Identifier
3. Device Event Sub-Type

Update the parsing, logging and recording of memory module event for the
above spec rev 3.1 changes.

Example rasdaemon log for CXL memory module event,

cxl_memory_module 2024-11-19 18:43:15 +0000 memdev:mem3 host:0000:0f:00.0 \
serial:0x3 log type:Fatal hdr_uuid:fe927475-dd59-4339-a586-79bab113b774 \
hdr_handle:0x2 hdr_related_handle:0x0 hdr_timestamp:1970-01-01 00:09:36 +0000 \
hdr_length:128 hdr_maint_op_class:0 hdr_maint_op_sub_class:1 \
event_type:Temperature Change event_sub_type:Unsupported Config Data \
health_status:'MAINTENANCE_NEEDED' 'REPLACEMENT_NEEDED' \
media_status:All Data Loss in Event of Power Loss as_life_used:Unknown \
as_dev_temp:Normal as_cor_vol_err_cnt:Normal as_cor_per_err_cnt:Normal \
life_used:8 device_temp:3 dirty_shutdown_cnt:33 cor_vol_err_cnt:25 \
cor_per_err_cnt:45 comp_id:02 74 c5 08 9a 1a 0b fc d2 7e 2f 31 9b 3c 81 4d \
comp_id_pldm_valid_flags:'Resource ID' Resource ID:fc d2 7e 2f

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
6 weeks ago rasdaemon: cxl: Update CXL DRAM event to CXL spec rev 3.1
Shiju Jose [Mon, 11 Nov 2024 11:54:10 +0000 (11:54 +0000)]
rasdaemon: cxl: Update CXL DRAM event to CXL spec rev 3.1

CXL spec 3.1 section 8.2.9.2.1.2 Table 8-46, DRAM Event Record, has been
updated with the following new fields and new types for the Memory Event Type,
Transaction Type and Validity Flags fields.
1. Component Identifier
2. Sub-channel
3. Advanced Programmable Corrected Memory Error Threshold Event Flags
4. Corrected Memory Error Count at Event
5. Memory Event Sub-Type

Update the parsing, logging and recording of DRAM event for the above
spec rev 3.1 changes.

Example rasdaemon log for CXL DRAM event,

cxl_dram 2024-11-19 18:39:00 +0000 memdev:mem3 host:0000:0f:00.0 serial:0x3 \
log type:Informational hdr_uuid:601dcbb3-9c06-4eab-b8af-4e9bfb5c9624 \
hdr_handle:0x1 hdr_related_handle:0x0 hdr_timestamp:1970-01-01 00:05:21 +0000 \
hdr_length:128 hdr_maint_op_class:1 hdr_maint_op_sub_class:3 dpa:0x18680 \
dpa_flags:descriptor:'UNCORRECTABLE EVENT' 'THRESHOLD EVENT' \
memory_event_type:Data Path Error memory_event_sub_type:Media Link CRC Error \
transaction_type:Internal Media Scrub channel:3 rank:17 nibble_mask:3866802 \
bank_group:7 bank:11 row:2 column:77 correction_mask:21 00 00 00 00 00 00 00 \
2c 00 00 00 00 00 00 00 37 00 00 00 00 00 00 00 42 00 00 00 00 00 00 00 \
comp_id:01 74 c5 08 9a 1a 0b fc d2 7e 2f 31 9b 3c 81 4d \
comp_id_pldm_valid_flags:'PLDM Entity ID' PLDM Entity ID:74 c5 08 9a 1a 0b \
Advanced Programmable CME threshold Event Flags:'Corrected Memory Errors in \
Multiple Media Components' 'Exceeded Programmable Threshold' CVME Count:0x94

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
6 weeks ago rasdaemon: cxl: Update CXL general media event to CXL spec rev 3.1
Shiju Jose [Mon, 11 Nov 2024 11:49:49 +0000 (11:49 +0000)]
rasdaemon: cxl: Update CXL general media event to CXL spec rev 3.1

CXL spec rev 3.1 section 8.2.9.2.1.1 Table 8-45, General Media Event
Record, has been updated with the following new fields and new types for
the Memory Event Type and Transaction Type fields.
1. Advanced Programmable Corrected Memory Error Threshold Event Flags
2. Corrected Memory Error Count at Event
3. Memory Event Sub-Type

The format of component identifier has changed (CXL spec 3.1 section
8.2.9.2.1 Table 8-44).

Update the parsing, logging and recording of general media event for
the above spec changes.

Example rasdaemon log for CXL general media event,

cxl_general_media 2024-11-19 18:35:29 +0000 memdev:mem3 host:0000:0f:00.0 \
serial:0x3 log type:Fatal hdr_uuid:fbcd0a77-c260-417f-85a9-088b1621eba6 \
hdr_handle:0x1 hdr_related_handle:0x0 hdr_timestamp:1970-01-01 00:01:50 +0000 \
hdr_length:128 hdr_maint_op_class:2 hdr_maint_op_sub_class:4 dpa:0x30d40 \
dpa_flags:descriptor:'UNCORRECTABLE EVENT' 'THRESHOLD EVENT' 'POISON LIST OVERFLOW' \
memory_event_type:TE State Violation memory_event_sub_type:Media Link Command \
Training Error transaction_type:Host Inject Poison channel:3 rank:33 device:5 \
comp_id:03 74 c5 08 9a 1a 0b fc d2 7e 2f 31 9b 3c 81 4d \
comp_id_pldm_valid_flags:'PLDM Entity ID' 'Resource ID' \
PLDM Entity ID:74 c5 08 9a 1a 0b Resource ID:fc d2 7e 2f \
Advanced Programmable CME threshold Event Flags:'Corrected Memory Errors in Multiple \
Media Components' 'Exceeded Programmable Threshold' Corrected Memory Error Count:0x78

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
6 weeks ago rasdaemon: cxl: Add Component Identifier formatting for CXL spec rev 3.1
Shiju Jose [Mon, 11 Nov 2024 11:00:53 +0000 (11:00 +0000)]
rasdaemon: cxl: Add Component Identifier formatting for CXL spec rev 3.1

Add Component Identifier formatting for CXL spec rev 3.1, Section
8.2.9.2.1, Table 8-44.

Add helper function to print component ID, parse and log PLDM entity ID
and resource ID.

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
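
A helper of this kind might look like the sketch below. The byte layout is an assumption inferred from the example logs elsewhere in this series (byte 0 carries the valid flags, bytes 1-6 the PLDM Entity ID, bytes 7-10 the Resource ID) and should be checked against Table 8-44. The function and macro names are hypothetical, not rasdaemon's.

```c
#include <stdint.h>
#include <stdio.h>

#define COMP_ID_SIZE			16
#define COMP_ID_PLDM_ENTITY_VALID	0x01	/* assumed: bit 0 of byte 0 */
#define COMP_ID_RESOURCE_VALID		0x02	/* assumed: bit 1 of byte 0 */
#define COMP_ID_STR_MIN			64	/* longest output is 56 chars + NUL */

/*
 * Format the valid parts of a 16-byte component identifier into out.
 * Returns the number of characters written, or -1 if out is too small.
 */
int format_comp_id(const uint8_t *comp_id, char *out, size_t len)
{
	size_t off = 0;
	int i;

	if (len < COMP_ID_STR_MIN)
		return -1;
	out[0] = '\0';

	if (comp_id[0] & COMP_ID_PLDM_ENTITY_VALID) {
		off += snprintf(out + off, len - off, "PLDM Entity ID:");
		for (i = 1; i <= 6; i++)
			off += snprintf(out + off, len - off, "%s%02x",
					i > 1 ? " " : "", comp_id[i]);
	}

	if (comp_id[0] & COMP_ID_RESOURCE_VALID) {
		off += snprintf(out + off, len - off, "%sResource ID:",
				off ? " " : "");
		for (i = 7; i <= 10; i++)
			off += snprintf(out + off, len - off, "%s%02x",
					i > 7 ? " " : "", comp_id[i]);
	}

	return (int)off;
}
```

Requiring a minimum buffer size up front keeps the snprintf() bookkeeping simple, since no individual call can then truncate.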
6 weeks ago rasdaemon: cxl: Update common event to CXL spec rev 3.1
Shiju Jose [Tue, 5 Nov 2024 17:51:29 +0000 (17:51 +0000)]
rasdaemon: cxl: Update common event to CXL spec rev 3.1

CXL spec 3.1 section 8.2.9.2.1 Table 8-42, Common Event Record format, has
been updated with Maintenance Operation Subclass information.

Add updates in rasdaemon CXL event handler for the above spec change
and for the corresponding changes in kernel CXL common trace event
implementation.

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
6 weeks ago rasdaemon: cxl: Add automatic indexing for storing CXL fields in SQLite database
Shiju Jose [Fri, 1 Nov 2024 18:57:07 +0000 (18:57 +0000)]
rasdaemon: cxl: Add automatic indexing for storing CXL fields in SQLite database

When the CXL specification adds new fields to the common header of
CXL event records, manual updates to the indexing are required to
store these CXL fields in the SQLite database. This update introduces
automatic indexing to facilitate the storage of CXL fields in the
SQLite database, eliminating the need for manual update to indexing.

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
6 weeks ago rasdaemon: cxl: Fix mismatch in region field's name with kernel DRAM trace event
Shiju Jose [Wed, 20 Nov 2024 00:52:46 +0000 (00:52 +0000)]
rasdaemon: cxl: Fix mismatch in region field's name with kernel DRAM trace event

Fix mismatch in 'region' field's name with kernel DRAM trace event.

Fixes: ea224ad58b37 ("rasdaemon: CXL: Extract, log and record region info from cxl_general_media and cxl_dram events")
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
6 weeks ago rasdaemon: cxl: Fix logging of memory event type of DRAM trace event
Shiju Jose [Fri, 1 Nov 2024 16:38:26 +0000 (16:38 +0000)]
rasdaemon: cxl: Fix logging of memory event type of DRAM trace event

CXL spec rev 3.0 section 8.2.9.2.1.2 defines the DRAM Event Record.

Fix logging of the memory event type field of the DRAM trace event.
Before this fix, a value of 0x1, for example, was reported as Invalid Address
(General Media Event Record - Memory Event Type) instead of Scrub Media
ECC Error (DRAM Event Record - Memory Event Type), and so on.

Fixes: 9a2f6186db26 ("rasdaemon: Add support for the CXL dram events")
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
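
The bug class described here, decoding a field with the wrong record type's table, can be illustrated with two separate lookup tables. This is only a sketch: the 0x1 entries come from the commit message, while the remaining entries are recalled from the CXL 3.0 tables and should be verified against the spec. All identifiers are hypothetical.

```c
#include <stddef.h>

#define ARRAY_SIZE(a)	(sizeof(a) / sizeof((a)[0]))

/* General Media Event Record - Memory Event Type (assumed values) */
static const char * const gmer_mem_event_types[] = {
	"Media ECC Error",
	"Invalid Address",
	"Data Path Error",
};

/* DRAM Event Record - Memory Event Type (assumed values) */
static const char * const dram_mem_event_types[] = {
	"Media ECC Error",
	"Scrub Media ECC Error",
	"Invalid Address",
	"Data Path Error",
};

const char *gmer_mem_event_type_str(unsigned int type)
{
	return type < ARRAY_SIZE(gmer_mem_event_types) ?
	       gmer_mem_event_types[type] : "Unknown";
}

const char *dram_mem_event_type_str(unsigned int type)
{
	return type < ARRAY_SIZE(dram_mem_event_types) ?
	       dram_mem_event_types[type] : "Unknown";
}
```

With one accessor per record type, a DRAM event can no longer be decoded through the general media table by accident.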
6 weeks ago rasdaemon: Fix for parsing error when trace event's format file is larger than PAGE_SIZE
Shiju Jose [Thu, 9 Jan 2025 17:16:57 +0000 (17:16 +0000)]
rasdaemon: Fix for parsing error when trace event's format file is larger than PAGE_SIZE

When a trace event's format file is larger than PAGE_SIZE (4096),
libtraceevent reports a parsing failure when rasdaemon reads the fields.
The cause is that the tep_parse_event() call in add_event_handler()
fails internally in libtraceevent because the format file data was read
incompletely. However, libtraceevent did not return an error at this stage,
which is fixed in the following patch for libtraceevent:
https://lore.kernel.org/all/20250109102338.6128644d@gandalf.local.home/

When rasdaemon reads a trace event format file, the maximum data size
that can be read is limited to PAGE_SIZE by the seq_read() and
seq_read_iter() functions in the kernel. This results in userspace
receiving partial data if the format file is larger than PAGE_SIZE,
so rasdaemon needs a fix to read the complete data from the
format file.

Add fix for reading trace event format files larger than PAGE_SIZE
in add_event_handler().

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
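
One common way to cope with short reads of this kind is to keep calling read() until it reports end of file, growing the buffer as needed. The sketch below shows that general technique only; read_all() is a hypothetical helper, and the actual change in add_event_handler() may be structured differently.

```c
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read everything from fd until EOF, growing the buffer as needed.
 * On success, returns a malloc()ed buffer and stores its length in
 * *size_out; on failure, returns NULL. The caller frees the buffer.
 */
char *read_all(int fd, size_t *size_out)
{
	size_t cap = 4096, len = 0;
	char *buf = malloc(cap);

	if (!buf)
		return NULL;

	for (;;) {
		ssize_t n;

		/* Buffer full: double it before the next read */
		if (len == cap) {
			char *tmp = realloc(buf, cap * 2);

			if (!tmp) {
				free(buf);
				return NULL;
			}
			buf = tmp;
			cap *= 2;
		}

		n = read(fd, buf + len, cap - len);
		if (n < 0) {
			if (errno == EINTR)
				continue;
			free(buf);
			return NULL;
		}
		if (n == 0)	/* EOF: the whole file has been read */
			break;
		len += (size_t)n;
	}

	*size_out = len;
	return buf;
}
```

A single read() that returns fewer bytes than requested is not treated as the end of the data; only a return value of 0 is.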
6 weeks ago rasdaemon: Add page offline support for cxl memory
Srinivasulu Thanneeru [Wed, 20 Nov 2024 06:25:28 +0000 (22:25 -0800)]
rasdaemon: Add page offline support for cxl memory

A CXL Type 3 device implements a threshold for corrected errors as described in
the CXL 3.1 specification, sections 8.2.9.2.1.2 and 8.2.9.9.11.3.
The device can set the threshold field in the DRAM event descriptor when
it detects corrected errors that meet or exceed the threshold value.

This patch is intended to offline pages for corrected memory errors when the
device sets the threshold in the DRAM event descriptor.
This helps prevent corrected errors from becoming uncorrected.

Record the HPA for the given DPA, then offline the page at that HPA when the
corrected error threshold is set.

Signed-off-by: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
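
The decision described above can be sketched as a small predicate on the event descriptor. The bit positions are an assumption based on the CXL media event descriptor (bit 0 uncorrectable event, bit 1 threshold event). Excluding uncorrectable events is also an assumption made here because the commit talks about corrected errors only. The names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed descriptor bits; verify against the CXL 3.1 DRAM Event Record */
#define CXL_DESC_UNCORRECTABLE_EVENT	(1U << 0)
#define CXL_DESC_THRESHOLD_EVENT	(1U << 1)

/*
 * Return true when a DRAM event reports corrected errors that reached
 * the device's threshold, i.e. when the page is a candidate for offlining.
 */
bool cxl_dram_should_offline(uint8_t descriptor)
{
	return (descriptor & CXL_DESC_THRESHOLD_EVENT) &&
	       !(descriptor & CXL_DESC_UNCORRECTABLE_EVENT);
}
```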
6 weeks ago Add labels for ASRock X370 Taichi
Dezponia [Wed, 1 Jan 2025 12:55:36 +0000 (13:55 +0100)]
Add labels for ASRock X370 Taichi

Signed-off-by: Dezponia <150628177+dezponia@users.noreply.github.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
6 weeks ago Add labels for ASRock X570 Creator
Mauro Carvalho Chehab [Mon, 10 Mar 2025 10:17:17 +0000 (11:17 +0100)]
Add labels for ASRock X570 Creator

Signed-off-by: Dezponia <150628177+dezponia@users.noreply.github.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
6 weeks ago Add labels for ASRock X570S PG Riptide
Dezponia [Wed, 1 Jan 2025 01:30:50 +0000 (02:30 +0100)]
Add labels for ASRock X570S PG Riptide

Signed-off-by: Dezponia <150628177+dezponia@users.noreply.github.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
6 weeks ago rasdaemon: mce: decode io port for bus error
Ruidong Tian [Tue, 17 Dec 2024 06:39:42 +0000 (14:39 +0800)]
rasdaemon: mce: decode io port for bus error

mcelog decodes bus errors together with the I/O port, like:

  ...
  MCA: BUS error: 0 0 Level-3 Generic IO Request-did-not-timeout
  IO MCA reported by root port 0:7b:07.0
  ...

Introduce the code into rasdaemon.

Signed-off-by: Ruidong Tian <tianruidong@linux.alibaba.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
6 weeks ago rasdaemon: arm: do not print error msg if field not found
Ruidong Tian [Thu, 19 Dec 2024 06:21:39 +0000 (14:21 +0800)]
rasdaemon: arm: do not print error msg if field not found

Fix output from:
2024-12-19 13:52:02 +0800 affinity: 0 MPIDR: 0x810c0200 MIDR: 0x481fd010 running_state: 1 psci_state: 0<CANT FIND FIELD pei_len>
to:
2024-12-19 13:52:02 +0800 affinity: 0 MPIDR: 0x810c0200 MIDR: 0x481fd010 running_state: 1 psci_state: 0

Signed-off-by: Ruidong Tian <tianruidong@linux.alibaba.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
6 weeks ago rasdaemon: add DE error type for AMD
Ruidong Tian [Wed, 4 Dec 2024 02:52:09 +0000 (10:52 +0800)]
rasdaemon: add DE error type for AMD

hw_event_mc_err_type in the kernel includes HW_EVENT_ERR_DEFERRED for AMD.
Add this error type to rasdaemon.

Signed-off-by: Ruidong Tian <tianruidong@linux.alibaba.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
4 months ago rasdaemon: Fix the display format of JaguarMicro vendor no standard errors
Hunter.He [Wed, 27 Nov 2024 13:42:26 +0000 (21:42 +0800)]
rasdaemon: Fix the display format of JaguarMicro vendor no standard errors

1) Extra whitespace was added by mce_snprintf. Remove the redundant
   whitespace.
2) Print logs on the same line.

Signed-off-by: "Hunter.He" <hunter.he@jaguarmicro.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
5 months ago rasdaemon: bump to version 0.8.2 (v0.8.2)
Mauro Carvalho Chehab [Tue, 19 Nov 2024 07:42:27 +0000 (08:42 +0100)]
rasdaemon: bump to version 0.8.2

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
5 months ago ras-page-isolation.h: remove extra parenthesis
Mauro Carvalho Chehab [Tue, 19 Nov 2024 07:28:36 +0000 (08:28 +0100)]
ras-page-isolation.h: remove extra parenthesis

An extra parenthesis was accidentally added to the
ROW_LOCATION_FIELDS_NUM macro. Remove it.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
5 months ago rasdaemon: check if sscanf() processed all arguments on dev_name
Mauro Carvalho Chehab [Tue, 19 Nov 2024 07:20:38 +0000 (08:20 +0100)]
rasdaemon: check if sscanf() processed all arguments on dev_name

Ensure that all arguments are parsed by sscanf() when dealing
with dev_name.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
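
The check amounts to comparing the return value of sscanf() with the number of conversions in the format string. The sketch below assumes a PCI-style name such as 0000:0f:00.0, which is a guess at what dev_name contains; the function name and the exact format are hypothetical.

```c
#include <stdio.h>

/*
 * Parse a "domain:bus:dev.fn" style name (hexadecimal fields).
 * Returns 0 only if all four fields were converted, -1 otherwise.
 */
int parse_dev_name(const char *name, unsigned int *domain,
		   unsigned int *bus, unsigned int *dev, unsigned int *fn)
{
	/* sscanf() returns the number of successfully assigned fields */
	if (sscanf(name, "%x:%x:%x.%x", domain, bus, dev, fn) != 4)
		return -1;

	return 0;
}
```

Without the comparison, a truncated name would leave some output variables unset while the caller carries on as if parsing had succeeded.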
5 months ago unified-sel.h: convert license boilerplate to SPDX
Mauro Carvalho Chehab [Tue, 19 Nov 2024 07:00:39 +0000 (08:00 +0100)]
unified-sel.h: convert license boilerplate to SPDX

Use SPDX for GPLv2.0 or later, instead of a license
boilerplate.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
5 months ago ras-page-isolation.h: fix most coding style issues
Mauro Carvalho Chehab [Tue, 19 Nov 2024 06:55:02 +0000 (07:55 +0100)]
ras-page-isolation.h: fix most coding style issues

Fix several checkpatch.pl warnings:

ras-page-isolation.h:50: WARNING:SPACING: missing space after enum definition
ras-page-isolation.h:79: ERROR:COMPLEX_MACRO: Macros with complex values should be enclosed in parentheses
ras-page-isolation.h:80: ERROR:COMPLEX_MACRO: Macros with complex values should be enclosed in parentheses
ras-page-isolation.h:96: WARNING:CONSTANT_COMPARISON: Comparisons should place the constant on the right side of the test
ras-page-isolation.h:119: WARNING:UNSPECIFIED_INT: Prefer 'unsigned int' to bare use of 'unsigned'

It should be noted that this warning was not addressed,
as it seems to be a false positive:
ras-page-isolation.h:96: WARNING:CONSTANT_COMPARISON: Comparisons should place the constant on the right side of the test

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
5 months ago ras-page-isolation: fix location_fields size
Mauro Carvalho Chehab [Tue, 19 Nov 2024 06:51:58 +0000 (07:51 +0100)]
ras-page-isolation: fix location_fields size

The location_fields is used for both APEI and DSM data.
The logic there defines 7 values for APEI and 9 for DSM,
but, with the current logic, it allocates only 7 elements.

This is likely due to a typo. Fix it.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
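
One way to keep a shared array large enough for both data sources is to size it by the larger of the two field counts. This is a sketch under that assumption. ROW_LOCATION_FIELDS_NUM and location_fields are named in this log, whereas APEI_FIELD_NUM, DSM_FIELD_NUM and struct row_location are hypothetical names, and the real macro may be defined differently.

```c
#define APEI_FIELD_NUM	7	/* hypothetical name: fields used for APEI data */
#define DSM_FIELD_NUM	9	/* hypothetical name: fields used for DSM data */

/* Size the shared array for whichever source needs more slots */
#define ROW_LOCATION_FIELDS_NUM \
	((APEI_FIELD_NUM) > (DSM_FIELD_NUM) ? (APEI_FIELD_NUM) : (DSM_FIELD_NUM))

struct row_location {
	int location_fields[ROW_LOCATION_FIELDS_NUM];
};
```

Because both operands are integer constants, the conditional expression is itself a constant expression and can be used as an array size.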
5 months ago unified-sel: fix most coding style issues
Mauro Carvalho Chehab [Mon, 18 Nov 2024 15:59:51 +0000 (16:59 +0100)]
unified-sel: fix most coding style issues

Solve several issues pointed out by checkpatch.pl.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
5 months ago unified-sel: replace license boilerplate with SPDX
Mauro Carvalho Chehab [Mon, 18 Nov 2024 15:45:43 +0000 (16:45 +0100)]
unified-sel: replace license boilerplate with SPDX

Use the GPL-2.0-or-later SPDX tag instead of the GPL-2.0+ license
boilerplate.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
5 months ago ras-page-isolation: fix additional coding style issues
Mauro Carvalho Chehab [Mon, 18 Nov 2024 15:43:02 +0000 (16:43 +0100)]
ras-page-isolation: fix additional coding style issues

Fix some indentation issues and an unneeded parenthesis
inside the ras-page-isolation code.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
5 months ago ras-page-isolation: make memory_location_field static
Mauro Carvalho Chehab [Mon, 18 Nov 2024 15:39:21 +0000 (16:39 +0100)]
ras-page-isolation: make memory_location_field static

Those structures are used only internally inside the
ras-page-isolation code. They're not public. Make them static.

While here, fix coding style.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
5 months ago ras-page-isolation: drop some uneeded prototype
Mauro Carvalho Chehab [Mon, 18 Nov 2024 15:31:32 +0000 (16:31 +0100)]
ras-page-isolation: drop some uneeded prototype

There is a prototype not used and two other prototypes that
are used only internally inside the function.

Make such functions static and drop the unused one.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
5 months ago ras-page-isolation: use snprintf() instead of sprintf()
Mauro Carvalho Chehab [Mon, 18 Nov 2024 15:28:36 +0000 (16:28 +0100)]
ras-page-isolation: use snprintf() instead of sprintf()

Use the safer snprintf() call to avoid the risk of going past the
buffer.

While here, make row_record_get_id() static.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
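
The difference matters because snprintf() never writes more than the given size, and its return value reveals whether the output was truncated. A minimal sketch follows; build_row_id() and its ID format are made up for illustration and are not the format used by row_record_get_id().

```c
#include <stdio.h>

/*
 * Build an identifier string into buf without overrunning it.
 * Returns the string length, or -1 if it did not fit in len bytes.
 */
int build_row_id(char *buf, size_t len, unsigned int node, unsigned int row)
{
	int n = snprintf(buf, len, "node%u-row%u", node, row);

	/* n is the length that would have been written given enough room */
	if (n < 0 || (size_t)n >= len)
		return -1;

	return n;
}
```

With sprintf(), the same call would write past the end of a short buffer instead of reporting the problem.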
5 months ago mce-intel: drop a code commented a long time ago with an action
Mauro Carvalho Chehab [Mon, 18 Nov 2024 15:20:34 +0000 (16:20 +0100)]
mce-intel: drop a code commented a long time ago with an action

There is commented-out code in mce-intel that has been in
rasdaemon for a long time.

Remove it.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
5 months ago mce-intel-ivb/mce-intel-sb: remove code commented with #if 0
Mauro Carvalho Chehab [Mon, 18 Nov 2024 15:18:41 +0000 (16:18 +0100)]
mce-intel-ivb/mce-intel-sb: remove code commented with #if 0

The dead code has been there for a long time without any attempt
to actually implement it for SB/IVB. As such CPUs were released
a long time ago, it is unlikely that anyone will address the
comments there.

So, drop the dead code. If needed, this patch can be reversed
later.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
5 months ago rasdaemon: don't use braces for single statement blocks
Mauro Carvalho Chehab [Mon, 18 Nov 2024 15:13:50 +0000 (16:13 +0100)]
rasdaemon: don't use braces for single statement blocks

Solve those checkpatch warnings:

WARNING: braces {} are not necessary for single statement blocks
+ if (clock_gettime(clk_id, &ts) == 0 && !strcmp(ev.error_type, "Corrected")) {
+ ras_record_row_error(ev.driver_detail, ev.error_count, ts.tv_sec, ev.address);
+ }

total: 0 errors, 1 warnings, 0 checks, 304 lines checked
WARNING: braces {} are not necessary for single statement blocks
+ if (!matched) {
+ log(TERM, LOG_INFO, "Improper %s, set to default off\n", env);
+ }

WARNING: braces {} are not necessary for any arm of this statement
+ if (rr1->type == GHES) {
[...]
+ } else {
[...]

WARNING: braces {} are not necessary for single statement blocks
+ for (int i = 0; i < ROW_LOCATION_FIELDS_NUM; i++) {
+ dst->location_fields[i] = src->location_fields[i];
+ }

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
5 months ago ras-page-isolation: don't use "/**" for normal comments
Mauro Carvalho Chehab [Mon, 18 Nov 2024 15:11:30 +0000 (16:11 +0100)]
ras-page-isolation: don't use "/**" for normal comments

The usage of /** is reserved for doxygen markups. Don't use
it where it doesn't belong.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
5 months ago rasdaemon: use __func__ instead of the name of the function
Mauro Carvalho Chehab [Mon, 18 Nov 2024 15:09:18 +0000 (16:09 +0100)]
rasdaemon: use __func__ instead of the name of the function

Solve some checkpatch warnings about not using __func__ in the
rasdaemon logs.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
5 months ago rasdaemon: fix some coding style issues
Mauro Carvalho Chehab [Mon, 18 Nov 2024 15:01:23 +0000 (16:01 +0100)]
rasdaemon: fix some coding style issues

Use checkpatch to fix some trivial coding style issues with:

$ ./scripts/checkpatch.pl -f *.c -q --strict --fix-inplace

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
5 months ago ipmitool SEL logging of AER CEs on OpenBMC platforms
Krishna Dhulipala [Thu, 19 Sep 2024 14:58:37 +0000 (07:58 -0700)]
ipmitool SEL logging of AER CEs on OpenBMC platforms

Signed-off-by: Krishna Dhulipala <krishnad@meta.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
5 months ago The rasdaemon service may fail to be started for the first time.
zhuofeng [Sat, 29 Jun 2024 09:27:56 +0000 (17:27 +0800)]
The rasdaemon service may fail to be started for the first time.

The rasdaemon creates a separate instance virtual directory on first startup, like `/sys/kernel/debug/tracing/instances/rasdaemon`.

After the directory is created, the kernel generates virtual files such as `trace_clock` and `set_event` in `/sys/kernel/debug/tracing/instances/rasdaemon`.

The kernel generates these virtual files while rasdaemon is already trying to access them. Therefore, the files may not exist yet when rasdaemon accesses them.

So wait up to 30 seconds to give the kernel enough time to generate the files.

Signed-off-by: zhuofeng <zhuofeng2@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
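
A bounded wait of this kind can be sketched as a polling loop that checks for the file once per second. This illustrates the idea only; wait_for_file() is a hypothetical helper, and the actual patch may poll differently or at a different interval.

```c
#include <unistd.h>

/*
 * Poll once per second until path exists or timeout_sec seconds passed.
 * Returns 0 if the file showed up, -1 on timeout.
 */
int wait_for_file(const char *path, unsigned int timeout_sec)
{
	unsigned int waited;

	for (waited = 0; ; waited++) {
		if (access(path, F_OK) == 0)
			return 0;
		if (waited >= timeout_sec)
			return -1;
		sleep(1);
	}
}
```

A caller could, for example, wait for the `trace_clock` file under the instance directory with a 30-second limit before opening it.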
5 months ago Makefile: only enable rbtree if needed
Mauro Carvalho Chehab [Mon, 18 Nov 2024 13:35:28 +0000 (14:35 +0100)]
Makefile: only enable rbtree if needed

Don't enable rbtree and ras-page-isolation code unconditionally.
Only enable it if PFA is compiled.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
5 months ago New feature: support memory row CE threshold policy
zhuofeng [Mon, 4 Mar 2024 13:04:42 +0000 (21:04 +0800)]
New feature: support memory row CE threshold policy

- Introduction: Identify memory row faults in memory CE faults and
isolate the physical memory pages where row faults occur. This method
can effectively prevent CE storms or memory UCE faults caused by memory
row failures.

- Implementation: The system counts the number of CE faults in the same
memory row within a specified period. If the number of CE faults exceeds
the configured threshold, the system considers that the memory row may
fail and isolates all physical pages recorded in the memory row.

Notes:
1. This function is disabled by default. You can enable it by
configuring the 'ROW_CE_ACTION' field in the '/etc/sysconfig/rasdaemon'
configuration file.
2. If both row isolation and page isolation are enabled, page isolation is
automatically disabled by default.
3. If the error count in the DIMM CE fault information received by rasdaemon
is 0, the BIOS did not correctly parse the count when parsing the fault
information. In that case, rasdaemon treats the count as 1 by default,
which is the same as the kernel's behavior.

Signed-off-by: zhuofeng <zhuofeng2@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
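
The counting policy described above can be sketched for a single row as follows. This is a simplified fixed-window illustration with hypothetical names. The real implementation tracks many rows and may age its records differently.

```c
#include <time.h>

/* Per-row accounting state (hypothetical layout) */
struct row_ce_state {
	time_t window_start;	/* start of the current counting period */
	unsigned long count;	/* CEs seen in the current period */
};

/*
 * Account nerr corrected errors for a row at time now.
 * Returns 1 if the count within the period exceeds threshold, else 0.
 */
int row_ce_account(struct row_ce_state *st, time_t now, unsigned long nerr,
		   unsigned long threshold, time_t window)
{
	/* A reported count of 0 is treated as a single error (note 3) */
	if (nerr == 0)
		nerr = 1;

	/* Start a new period on first use or once the old one has expired */
	if (st->count == 0 || now - st->window_start > window) {
		st->window_start = now;
		st->count = 0;
	}

	st->count += nerr;
	return st->count > threshold;
}
```

When the function returns 1, the caller would go on to isolate the physical pages recorded for that row.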
5 months ago ras-page-isolation: drop an unused variable
Mauro Carvalho Chehab [Mon, 18 Nov 2024 13:03:26 +0000 (14:03 +0100)]
ras-page-isolation: drop an unused variable

There's no need to store the value of strtoul() during the
overflow check. Remove it, as this is causing a warning:

ras-page-isolation.c: In function ‘parse_isolation_env’:
ras-page-isolation.c:166:47: warning: unused variable ‘converted_value’ [-Wunused-variable]
  166 |                                 unsigned long converted_value = strtoul(config->env, &endptr, 10);
      |                                               ^~~~~~~~~~~~~~~

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
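As a rough illustration of an overflow check that does not need the converted value, the sketch below calls strtoul() only for its side effects on errno and endptr. The helper name `fits_in_ulong()` is hypothetical, and the actual check in ras-page-isolation.c may be structured differently.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * Hypothetical helper: returns true if text starts with a decimal number
 * that fits in an unsigned long.
 */
bool fits_in_ulong(const char *text)
{
	char *endptr;

	errno = 0;
	/* The return value is deliberately discarded: only errno and endptr matter. */
	(void)strtoul(text, &endptr, 10);

	/* ERANGE signals overflow; endptr == text means no digits were parsed. */
	return errno != ERANGE && endptr != text;
}
```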
5 months agoFix the bug that `config->env` is greater than `ulong_max` when units->val=1
zhuofeng [Thu, 7 Dec 2023 06:37:50 +0000 (14:37 +0800)]
Fix the bug that `config->env` is greater than `ulong_max` when units->val=1

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
5 months agorasdaemon: Modify support for vendor-specific machine check error information
Avadhut Naik [Thu, 7 Nov 2024 06:24:44 +0000 (06:24 +0000)]
rasdaemon: Modify support for vendor-specific machine check error information

Commit 83a3ced797256d ("rasdaemon: Add support for vendor-specific
machine check error information") assumes that MCA_CONFIG MSR will be
exported as part of vendor-specific error information through the MCE
tracepoint.

The same, however, is not true anymore. MCA_CONFIG MSR will not be
exported through the MCE tracepoint. Instead, the data from MCA_SYND1/2
MSRs, exported as vendor-specific error information on newer AMD SOCs,
should always be interpreted as FRUText.

Modify the error decoding support accordingly.

Fixes: 83a3ced797256d ("rasdaemon: Add support for vendor-specific
machine check error information")
Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
5 months agorasdaemon: ras-mc-ctl: Log hpa and region info from cxl_general_media and cxl_dram...
Shiju Jose [Thu, 16 May 2024 14:23:44 +0000 (15:23 +0100)]
rasdaemon: ras-mc-ctl: Log hpa and region info from cxl_general_media and cxl_dram tables

Add support for reading and logging hpa and region info from the cxl_general_media and
cxl_dram tables.

Note: This change is not backward compatible, because
the select command with the newly added columns fails on previous
CXL tables where those columns are not present.
The issue could be solved by renaming the CXL tables to v2,
but ras-mc-ctl would still not be backward compatible when listing errors:
it fails when only the previous version of a CXL table is present in the
database, as it cannot find the v2 table.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
5 months agorasdaemon: CXL: Extract, log and record region info from cxl_general_media and cxl_dr...
Shiju Jose [Wed, 15 May 2024 11:18:36 +0000 (12:18 +0100)]
rasdaemon: CXL: Extract, log and record region info from cxl_general_media and cxl_dram events

Extract, log and record region info from cxl_general_media and
cxl_dram events.

The corresponding kernel changes:
https://lore.kernel.org/all/cover.1711598777.git.alison.schofield@intel.com/T/#m6fd773b5477fc44b875848e053708a1c8996c4e4

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
5 months agorasdaemon: CXL: Fix uncorrectable macro spelling
Shiju Jose [Tue, 14 May 2024 14:22:42 +0000 (15:22 +0100)]
rasdaemon: CXL: Fix uncorrectable macro spelling

Fix the spelling of the macro (CXL_GMER_EVT_DESC_UNCORECTABLE_EVENT).
Uncorrectable is spelled with two r's.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
5 months agorasdaemon: ras-non-standard-handler: Fix checkpatch warning
Shiju Jose [Mon, 19 Aug 2024 11:11:44 +0000 (12:11 +0100)]
rasdaemon: ras-non-standard-handler: Fix checkpatch warning

Fix the following checkpatch warning:
CHECK: spaces preferred around that '*' (ctx:WxV)
+ sqlite3_stmt *stmt_dec_record;

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
5 months agorasdaemon: ras-events: Fix warning ‘filter_ras_mc_event’ defined but not used
Shiju Jose [Mon, 19 Aug 2024 10:56:15 +0000 (11:56 +0100)]
rasdaemon: ras-events: Fix warning ‘filter_ras_mc_event’ defined but not used

Fix the following compilation warning:
ras-events.c:318:12: warning: ‘filter_ras_mc_event’ defined but not used [-Wunused-function]
 static int filter_ras_mc_event(struct ras_events *ras, char *group, char *event,

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
5 months agorasdaemon: ras-arm-handler: Fix checkpatch warning length exceeds 120 columns
Shiju Jose [Mon, 19 Aug 2024 10:51:30 +0000 (11:51 +0100)]
rasdaemon: ras-arm-handler: Fix checkpatch warning length exceeds 120 columns

Fix following checkpatch warning in ras-arm-handler.
+ trace_seq_printf(s, " Program execution can be restarted reliably at the PC associated with the error");

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
5 months agorasdaemon: ras-events: removed obselete code under #if 0
Shiju Jose [Mon, 19 Aug 2024 10:48:06 +0000 (11:48 +0100)]
rasdaemon: ras-events: removed obselete code under #if 0

Remove unused code enclosed under #if 0 to fix the checkpatch
warnings.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
5 months agorasdaemon: ras-mce-handler: Fix checkpatch errors
Shiju Jose [Mon, 19 Aug 2024 10:44:58 +0000 (11:44 +0100)]
rasdaemon: ras-mce-handler: Fix checkpatch errors

Fix the following checkpatch errors in ras-mce-handler.c:

Delete the obsolete code under #if 0 ... #endif flagged below:
WARNING: Consider removing the code enclosed by this #if 0 and its #endif

WARNING: Consider removing the code enclosed by this #if 0 and its #endif

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
5 months agorasdaemon: rbtree: removed unused definition for RB_ROOT
Shiju Jose [Mon, 19 Aug 2024 10:38:16 +0000 (11:38 +0100)]
rasdaemon: rbtree: removed unused definition for RB_ROOT

Removed unused definition for RB_ROOT from rbtree.h

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
5 months agorasdaemon: Fix for compilation warning in ras-memory-failure-handler.c
Shiju Jose [Mon, 19 Aug 2024 10:33:08 +0000 (11:33 +0100)]
rasdaemon: Fix for compilation warning in ras-memory-failure-handler.c

Fix the following compilation warning:
ras-memory-failure-handler.c:120:6: warning: implicit declaration of function ‘asprintf’; did you mean ‘vsprintf’? [-Wimplicit-function-declaration]
  if (asprintf(&env[ei++], "PATH=%s", getenv("PATH") ?: "/sbin:/usr/sbin:/bin:/usr/bin") < 0)

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
5 months agorasdaemon: Fix mem_fail_event build breakage
Avadhut Naik [Fri, 16 Aug 2024 18:10:40 +0000 (18:10 +0000)]
rasdaemon: Fix mem_fail_event build breakage

Commit 566a52622b1d ("add mem_fail_event trigger") introduces an event
trigger for a memory failure event.

However, if the rasdaemon is not configured with enable-memory-failure,
the setup function of the trigger, mem_fail_event_trigger_setup(), will
result in an undefined reference linker error when called through
setup_event_trigger().

Ensure that the setup function for the trigger is called only when the
rasdaemon has been configured with enable-memory-failure.

Fixes: 566a52622b1d ("add mem_fail_event trigger")
Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
5 months agoras-events: fix -d option to work again
Tomohiro Misono [Thu, 8 Aug 2024 09:21:17 +0000 (09:21 +0000)]
ras-events: fix -d option to work again

It seems commit 3e9a59a184ca ("Add dynamic switch of ras events support.")
inadvertently introduced a change that ignores the -d option.
Fix this so that -d disables all trace events at once, as before.

Signed-off-by: Tomohiro Misono <misono.tomohiro@fujitsu.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
5 months agoChangeLog: fix 0.8.1 release date
Baruch Siach [Tue, 30 Jul 2024 07:02:10 +0000 (10:02 +0300)]
ChangeLog: fix 0.8.1 release date

2023 -> 2024.

Signed-off-by: Baruch Siach <baruch@tkos.co.il>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agoci.yml: Change the name of the second job
Mauro Carvalho Chehab [Fri, 19 Jul 2024 09:11:05 +0000 (11:11 +0200)]
ci.yml: Change the name of the second job

Using the same name for both jobs seems to cause trouble.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agoci.yml: place checkpatch check in separate
Mauro Carvalho Chehab [Fri, 19 Jul 2024 08:58:07 +0000 (10:58 +0200)]
ci.yml: place checkpatch check in separate

This doesn't need to run for all 3 architectures.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agoci.yml: run checkpatch when doing tests
Mauro Carvalho Chehab [Fri, 19 Jul 2024 08:54:56 +0000 (10:54 +0200)]
ci.yml: run checkpatch when doing tests

That helps detect new problems in the code.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agoMakefile.am: add types.h to the list of headers
Mauro Carvalho Chehab [Fri, 19 Jul 2024 08:42:25 +0000 (10:42 +0200)]
Makefile.am: add types.h to the list of headers

Without that, make mock won't work properly.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agoscripts/checkpatch.pl: add support for checking SPDX
Mauro Carvalho Chehab [Fri, 19 Jul 2024 08:38:58 +0000 (10:38 +0200)]
scripts/checkpatch.pl: add support for checking SPDX

Now that rasdaemon files have SPDX tags, enforce it via checkpatch
script.

The code was imported from the Linux Kernel, with some changes.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agorasdaemon: enforce SPDX license tags
Mauro Carvalho Chehab [Fri, 19 Jul 2024 07:53:41 +0000 (09:53 +0200)]
rasdaemon: enforce SPDX license tags

Replace license text comments with SPDX tags. For files that don't
have any license, use the COPYING license (GPL-2.0).

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agoras-events: demote log information about trace being enabled/disabled
Mauro Carvalho Chehab [Fri, 19 Jul 2024 07:38:52 +0000 (09:38 +0200)]
ras-events: demote log information about trace being enabled/disabled

There is already enough information outside __toggle_ras_mc_event()
to identify whether a feature was enabled or disabled.

So, this is mostly for debugging purposes.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agorasdaemon: cleanup coding style
Mauro Carvalho Chehab [Fri, 19 Jul 2024 07:29:51 +0000 (09:29 +0200)]
rasdaemon: cleanup coding style

Solve a series of coding style warnings:

mce-amd.c:132: WARNING:RETURN_VOID: void function return statements are not generally useful
mce-amd-smca.c:984: CHECK:UNNECESSARY_PARENTHESES: Unnecessary parentheses around 'm->family == 0x19'
non-standard-ampere.c:743: CHECK:UNNECESSARY_PARENTHESES: Unnecessary parentheses around 'err->subtype == 0x01'
non-standard-ampere.c:743: CHECK:UNNECESSARY_PARENTHESES: Unnecessary parentheses around 'err->subtype == 0x02'
non-standard-jaguarmicro.c:382: CHECK:UNNECESSARY_PARENTHESES: Unnecessary parentheses around 'mod_id >= tbl_size'
non-standard-jaguarmicro.c:382: CHECK:UNNECESSARY_PARENTHESES: Unnecessary parentheses around '!module'
non-standard-jaguarmicro.c:425: CHECK:UNNECESSARY_PARENTHESES: Unnecessary parentheses around 'sub_id >= tbl_size'
non-standard-jaguarmicro.c:425: CHECK:UNNECESSARY_PARENTHESES: Unnecessary parentheses around '!sub_module'
ras-cxl-handler.c:408: CHECK:UNNECESSARY_PARENTHESES: Unnecessary parentheses around 'i > 0'
ras-cxl-handler.c:705: CHECK:UNNECESSARY_PARENTHESES: Unnecessary parentheses around 'i > 0'
ras-mce-handler.c:251: WARNING:USE_NEGATIVE_ERRNO: return of an errno should typically be negative (ie: return -ENOMEM)

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agoras-events: make returned error code consistent
Mauro Carvalho Chehab [Fri, 19 Jul 2024 06:24:25 +0000 (08:24 +0200)]
ras-events: make returned error code consistent

- Rework the returned code logic to be more consistent;
  - error codes will be using negative values;
  - positive values indicate special return codes.
- Don't bloat the logs with lots of error messages due to
  unsupported traces;
- Ensure that the number of CPUs is properly retrieved, or bail out;
- Don't bail if it can't setup a monotone clock: it is better
  to have a wrong timestamp than no log at all.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
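The return-code convention listed in the commit above (negative values for errors, positive values for special cases, presumably zero for success) could look like the following sketch. The constant `TRACE_UNSUPPORTED` and the function `enable_trace_sketch()` are hypothetical and only illustrate the convention; they are not rasdaemon identifiers.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical positive "special" return code: the kernel lacks this trace. */
enum { TRACE_UNSUPPORTED = 1 };

/*
 * Hypothetical function following the convention:
 *   < 0  error, as a negative errno value
 *   = 0  success
 *   > 0  special, non-fatal condition
 */
int enable_trace_sketch(const char *name, int supported)
{
	if (!name)
		return -EINVAL;			/* real error */
	if (!supported)
		return TRACE_UNSUPPORTED;	/* not an error, caller may skip quietly */
	return 0;
}
```

With this split, a caller can log only when the result is negative, which avoids flooding the logs on kernels that lack some trace events.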
9 months agorasdaemon: add .editorconfig file to follow our coding style
Mauro Carvalho Chehab [Fri, 19 Jul 2024 05:36:17 +0000 (07:36 +0200)]
rasdaemon: add .editorconfig file to follow our coding style

That helps keep the coding style consistent, as lots of editors support
this file.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agoras-report.h: avoid long lines
Mauro Carvalho Chehab [Thu, 18 Jul 2024 16:06:49 +0000 (18:06 +0200)]
ras-report.h: avoid long lines

Better format the stubs on this file to avoid long lines.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agotypes.h: remove whitespaces
Mauro Carvalho Chehab [Thu, 18 Jul 2024 16:02:43 +0000 (18:02 +0200)]
types.h: remove whitespaces

Cut-and-pasting it from /usr/include/linux/bits.h ended up adding
unwanted whitespaces. Remove those.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agotypes.h: don't depend on linux/bits.h
Mauro Carvalho Chehab [Thu, 18 Jul 2024 15:51:28 +0000 (17:51 +0200)]
types.h: don't depend on linux/bits.h

Such an include would require Kernel sources to be installed.
We don't really need that: just copy the two GENMASK macros
and be done with it.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
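For readers unfamiliar with these macros, a self-contained version of the two GENMASK variants might look like the sketch below. It follows the well-known kernel definitions, but the exact text in rasdaemon's types.h may differ; the BITS_PER_* helpers here are written with sizeof() so the block compiles without kernel headers.

```c
#include <limits.h>

#define BITS_PER_LONG		(sizeof(unsigned long) * CHAR_BIT)
#define BITS_PER_LONG_LONG	(sizeof(unsigned long long) * CHAR_BIT)

/* Mask with bits l..h (inclusive) set, h >= l, for unsigned long. */
#define GENMASK(h, l) \
	(((~0UL) << (l)) & (~0UL >> (BITS_PER_LONG - 1 - (h))))

/* Same, for unsigned long long. */
#define GENMASK_ULL(h, l) \
	(((~0ULL) << (l)) & (~0ULL >> (BITS_PER_LONG_LONG - 1 - (h))))
```

For example, GENMASK(7, 4) evaluates to 0xf0, which selects the upper nibble of a byte.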
9 months agoras-events: don't use extern inside a C file
Mauro Carvalho Chehab [Thu, 18 Jul 2024 15:40:17 +0000 (17:40 +0200)]
ras-events: don't use extern inside a C file

Fix a checkpatch warning:

ras-events.c:66: WARNING:AVOID_EXTERNS: externs should be avoided in .c files

by better handling the checks_inside variable.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agorasdaemon: don't use unsafe strcpy, strcat and sprintf
Mauro Carvalho Chehab [Thu, 18 Jul 2024 11:02:30 +0000 (13:02 +0200)]
rasdaemon: don't use unsafe strcpy, strcat and sprintf

Remove all occurrences of those calls.

While here, also fix a couple missing whitespace warnings.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agotypes.h: add an implementation for strscpy() and strscat()
Mauro Carvalho Chehab [Thu, 18 Jul 2024 14:44:47 +0000 (16:44 +0200)]
types.h: add an implementation for strscpy() and strscat()

Do our own implementation for such routines, as the Kernel
implementation is a lot more complex than what is needed
here.

With that, change checkpatch.pl to request usage of such functions
instead of unsafe strcpy()/strcat().

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
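A simple bounded copy in the spirit of strscpy() can be sketched as below. It assumes kernel-like semantics (always NUL-terminate, return the copied length, return -E2BIG on truncation); the rasdaemon implementation in types.h may use a different return convention, and the name `sketch_strscpy()` is hypothetical.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/*
 * Copy src into dst, writing at most size bytes including the NUL.
 * Returns the number of characters copied (excluding the NUL),
 * or -E2BIG if size is 0 or src had to be truncated.
 */
ssize_t sketch_strscpy(char *dst, const char *src, size_t size)
{
	size_t i;

	if (size == 0)
		return -E2BIG;

	for (i = 0; i < size - 1 && src[i] != '\0'; i++)
		dst[i] = src[i];
	dst[i] = '\0';			/* the destination is always terminated */

	return src[i] == '\0' ? (ssize_t)i : -E2BIG;
}
```

Unlike strcpy(), the destination size is an explicit argument, so an oversized source results in a truncated but valid string instead of a buffer overflow.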
9 months agoras-events: drop a dead code to check number of CPUs
Mauro Carvalho Chehab [Thu, 18 Jul 2024 12:23:43 +0000 (14:23 +0200)]
ras-events: drop a dead code to check number of CPUs

Just use sysconf(_SC_NPROCESSORS_ONLN) here.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
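A minimal sketch of retrieving the CPU count with sysconf(), including a failure path, is shown below. The wrapper name `online_cpu_count()` and its negative-errno return value are assumptions for illustration, not the code in ras-events.c.

```c
#include <errno.h>
#include <unistd.h>

/*
 * Hypothetical wrapper: returns the number of online CPUs,
 * or a negative errno value if it cannot be determined.
 */
long online_cpu_count(void)
{
	long n;

	errno = 0;
	n = sysconf(_SC_NPROCESSORS_ONLN);
	if (n <= 0)
		return errno ? -errno : -EINVAL;

	return n;
}
```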
9 months agoras-report: fix coding style and string fill issues
Mauro Carvalho Chehab [Thu, 18 Jul 2024 10:58:23 +0000 (12:58 +0200)]
ras-report: fix coding style and string fill issues

Don't use unsafe sprintf(). Instead, re-implement the logic in
a way that buffer overflows won't occur.

While here, also avoid lines longer than 80 columns when possible.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
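One common way to fill a report buffer without sprintf() is to track the write offset and let snprintf() enforce the remaining space, as in this sketch. The helper `append_field()` and its "key=value" line format are hypothetical; the actual logic in ras-report may differ.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Append "key=val\n" to buf at offset off, never writing past size bytes.
 * Returns the new offset, or -1 if the field does not fit.
 */
int append_field(char *buf, size_t size, size_t off,
		 const char *key, const char *val)
{
	int n;

	if (off >= size)
		return -1;

	n = snprintf(buf + off, size - off, "%s=%s\n", key, val);
	/* snprintf() returns the length it wanted to write; detect truncation. */
	if (n < 0 || (size_t)n >= size - off)
		return -1;

	return (int)(off + (size_t)n);
}
```

Chaining calls with the returned offset builds the report incrementally, and the first field that does not fit is reported to the caller instead of overrunning the buffer.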
9 months agonon-standard-jaguarmicro: avoid CamelCase
Mauro Carvalho Chehab [Thu, 18 Jul 2024 09:54:08 +0000 (11:54 +0200)]
non-standard-jaguarmicro: avoid CamelCase

Coding-style: no need to use CamelCase here. So, use lowercase.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agocheckpatch.pl: warn also about strcat and sprintf usages
Mauro Carvalho Chehab [Thu, 18 Jul 2024 11:01:00 +0000 (13:01 +0200)]
checkpatch.pl: warn also about strcat and sprintf usages

strcpy, strcat and sprintf aren't safe, as they don't check for
buffer overflows. Change the checkpatch logic to warn about
such usages.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agorasdaemon: alphabetically sort includes
Mauro Carvalho Chehab [Thu, 18 Jul 2024 09:45:16 +0000 (11:45 +0200)]
rasdaemon: alphabetically sort includes

Reorder includes to ensure that they'll all be alphabetically
sorted.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agoras-arm-handler: use GENMASK() macro
Mauro Carvalho Chehab [Thu, 18 Jul 2024 08:44:57 +0000 (10:44 +0200)]
ras-arm-handler: use GENMASK() macro

Now that we have the macro defined on types.h, use it.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agorasdaemon: move type macros to a separate header (types.h)
Mauro Carvalho Chehab [Thu, 18 Jul 2024 08:43:17 +0000 (10:43 +0200)]
rasdaemon: move type macros to a separate header (types.h)

That makes it easier to use and maintain, without needing to include
ras-record.h when all that is needed are common macros.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agorasdaemon: fix a coding style issue
Mauro Carvalho Chehab [Thu, 18 Jul 2024 08:43:07 +0000 (10:43 +0200)]
rasdaemon: fix a coding style issue

Comment block indentation was wrong. Fix it.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agoras-arm-handler: Parse and log ARM Processor Error Info table
Shiju Jose [Tue, 16 Jul 2024 16:36:59 +0000 (17:36 +0100)]
ras-arm-handler: Parse and log ARM Processor Error Info table

Parse and log ARM Processor Error Info table data, UEFI 2.9A/2.10
specs section N2.4.4.1.

[mchehab: fix a typo]
Suggested-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agorasdaemon: fix some typos and correct spelling
Mauro Carvalho Chehab [Wed, 17 Jul 2024 05:19:05 +0000 (07:19 +0200)]
rasdaemon: fix some typos and correct spelling

With the help of checkpatch.pl --codespell, fix some typos.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agoscripts/checkpatch.pl: set default mode to strict
Mauro Carvalho Chehab [Wed, 17 Jul 2024 05:11:44 +0000 (07:11 +0200)]
scripts/checkpatch.pl: set default mode to strict

There aren't many false positives. So, change default to strict
mode.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agoras-arm-handler: cope with latest upstream changes
Mauro Carvalho Chehab [Wed, 17 Jul 2024 05:01:29 +0000 (07:01 +0200)]
ras-arm-handler: cope with latest upstream changes

Unfortunately, rasdaemon support for the firmware-first
CPER ARM processor extended trace was added years before
it was merged upstream. That's bad, especially since
the upstream revision requested a change to some fields.

Fix support for it by aligning with latest upstream version:
        https://lore.kernel.org/linux-edac/3853853f820a666253ca8ed6c7c724dc3d50044a.1720679234.git.mchehab+huawei@kernel.org/T/#m17003e47912b228e91e57ac6e4f90ea30061aa3b

A backward-compatible logic was added to avoid breaking with
existing OOT support.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agoscripts/checkpatch.pl: some improvements to reduce false positives
Mauro Carvalho Chehab [Wed, 17 Jul 2024 04:23:00 +0000 (06:23 +0200)]
scripts/checkpatch.pl: some improvements to reduce false positives

- camelcase is OK for printk inttypes.h;
- strncpy is OK;
- accept up to 120 chars on lines without warnings;
- stop complaining about "BACKTRACE=" strings split on multiple lines;
- remove PREFER_DEFINED_ATTRIBUTE_MACRO, as this is kernel-specific;
- remove MACRO_ARG_REUSE, as this applies mostly to multithreading;
- don't warn on using do{} while(0) with single line statements;

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agorasdaemon: coding style cleanup
Mauro Carvalho Chehab [Tue, 16 Jul 2024 23:06:52 +0000 (01:06 +0200)]
rasdaemon: coding style cleanup

Solve lots of coding style issues reported by:

./scripts/checkpatch.pl --terse --show-types --strict \
-f $(git ls-files|grep -E '\.[ch]$') \
--ignore MACRO_ARG_REUSE,STRCPY,IF_0,UNNECESSARY_PARENTHESES,CAMELCASE,STRNCPY

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agoscripts/checkpatch.pl: do some additional cleanups
Mauro Carvalho Chehab [Tue, 16 Jul 2024 23:06:21 +0000 (01:06 +0200)]
scripts/checkpatch.pl: do some additional cleanups

Remove more things that won't make sense for rasdaemon.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agoBump version to 0.8.1 v0.8.1
Mauro Carvalho Chehab [Tue, 16 Jul 2024 08:24:51 +0000 (10:24 +0200)]
Bump version to 0.8.1

There were lots of changes in this version. The summary in
ChangeLog contains a sanitized version of them.

It should be noted that the next version will likely bring
a uAPI incompatible change. Unfortunately, the UEFI CPER record
trace for ARM processors is currently incomplete upstream.

Rasdaemon gained support for an extended arm trace event that
supports all fields of the CPER record, but it depends on a
patch that has not been upstreamed yet.

While looking at such patches, it became clear that some changes are
needed to get them merged, meaning that future versions of rasdaemon
may no longer be compatible with the downstream patch.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months agorasdaemon: adjust install targets for the spec to be build
Mauro Carvalho Chehab [Tue, 16 Jul 2024 08:41:49 +0000 (10:41 +0200)]
rasdaemon: adjust install targets for the spec to be build

We use the Fedora spec file to check that everything is OK. Make some
changes to keep it happy.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>