www.infradead.org Git - users/mchehab/rasdaemon.git/log
9 months ago util/arm_einj.py: remove a debug print
Mauro Carvalho Chehab [Wed, 10 Jul 2024 13:07:39 +0000 (15:07 +0200)]
util/arm_einj.py: remove a debug print

This was meant only for testing argument handling. Remove it.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
9 months ago util/arm_einj.py: add an utility for ARM error injection via QEMU
Mauro Carvalho Chehab [Wed, 10 Jul 2024 12:15:24 +0000 (14:15 +0200)]
util/arm_einj.py: add an utility for ARM error injection via QEMU

Testing rasdaemon is not easy, as it depends on either having
real hardware producing events or a test BIOS. These are usually
unavailable and/or not very reliable.

So, take a different approach by adding a QEMU QAPI designed for
doing hardware error injection. The QEMU patches are at:

https://gitlab.com/mchehab_kernel/qemu/-/tree/arm-error-inject-v2

Some instructions on how to use it are on the rasdaemon wiki
pages on GitHub:

https://github.com/mchehab/rasdaemon/wiki

Add the error injection tool to rasdaemon sources.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
10 months ago ras-arm-handler: be compatible with upstream Kernel
Mauro Carvalho Chehab [Tue, 25 Jun 2024 08:05:45 +0000 (10:05 +0200)]
ras-arm-handler: be compatible with upstream Kernel

Changeset e37eb2f11a82 ("Add code to decode Ampere specific error")
broke ARM event recording with the upstream Kernel, as it requires a
different trace event than the one present upstream. That trace event
is part of a pending pull request:

https://lore.kernel.org/all/20240321-b4-arm-ras-error-vendor-info-v5-rc3-v5-0-850f9bfb97a8@os.amperecomputing.com/

Restore the previous behavior by making the parsing of the UEFI 2.6+
N.17 and N.16 table extra fields optional. That should make it
compatible with current upstream Kernels again.

Fixes: e37eb2f11a82 ("Add code to decode Ampere specific error")
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
10 months ago Do a coding style cleanup with regards to tabs and white spaces
Mauro Carvalho Chehab [Tue, 11 Jun 2024 10:01:40 +0000 (12:01 +0200)]
Do a coding style cleanup with regards to tabs and white spaces

Use tabs instead of spaces and remove trailing whitespace.

No functional changes.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
10 months ago rasdaemon: Add Corrected Internal Error for aer_cor_errors
Jesus Esquivel [Mon, 3 Jun 2024 22:47:20 +0000 (16:47 -0600)]
rasdaemon: Add Corrected Internal Error for aer_cor_errors

Add "Corrected Internal Error" for aer_cor_errors to decode
the error reported in bit 14 of the status register.
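
As an illustration only, a bit-to-name lookup of this kind could be sketched
as below. The struct, table and function names are hypothetical and are not
rasdaemon's own; the bit positions are assumed to follow the PCIe AER
Correctable Error Status register layout, and the table is deliberately
incomplete.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table type; rasdaemon's real aer_cor_errors layout may differ. */
struct bit_name {
	unsigned int bit;
	const char *name;
};

/* Assumed bit positions (PCIe AER Correctable Error Status), partial list. */
static const struct bit_name cor_status_bits[] = {
	{ 0,  "Receiver Error" },
	{ 6,  "Bad TLP" },
	{ 7,  "Bad DLLP" },
	{ 12, "Replay Timer Timeout" },
	{ 13, "Advisory Non-Fatal Error" },
	{ 14, "Corrected Internal Error" },
};

/* Return the name of the lowest set bit known to the table, or NULL. */
static const char *first_cor_error(uint32_t status)
{
	size_t i;

	for (i = 0; i < sizeof(cor_status_bits) / sizeof(cor_status_bits[0]); i++)
		if (status & (UINT32_C(1) << cor_status_bits[i].bit))
			return cor_status_bits[i].name;
	return NULL;
}
```

With such a table, a status word that has only bit 14 set resolves to
"Corrected Internal Error" instead of going undecoded.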

Signed-off-by: Jesus Esquivel <jesus.esquivel@intel.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
10 months ago rasdaemon: Update SMCA bank error descriptions
Avadhut Naik [Fri, 10 May 2024 18:20:19 +0000 (13:20 -0500)]
rasdaemon: Update SMCA bank error descriptions

Update error descriptions of SMCA bank types to support AMD's new Family
1Ah-based processors.
Also, modify some existing error descriptions to better reflect the error
received.

Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
10 months ago Add Lenovo P920 DIMM labels
Raul E Rangel [Thu, 9 May 2024 18:55:11 +0000 (18:55 +0000)]
Add Lenovo P920 DIMM labels

This adds the labels entry for the Lenovo ThinkStation P920.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
10 months ago rasdaemon: Fix for vendor errors are not recorded in the SQLite database if some...
Shiju Jose [Wed, 20 Mar 2024 12:16:05 +0000 (12:16 +0000)]
rasdaemon: Fix for vendor errors are not recorded in the SQLite database if some cpus are offline

Fix vendor errors not being recorded in the SQLite database if some CPUs
are offline at system start.

Issue:

This issue is reproducible by taking some CPUs offline, running
./rasdaemon -f --record & and then
injecting a vendor-specific error supported by rasdaemon.

Reason:

When the system starts with some of the CPUs offline and rasdaemon is
then run, read_ras_event_all_cpus() exits with an error and rasdaemon
switches to the multi-threaded mode. However, read() in read_ras_event()
returns an error in the threads for each of the offline CPUs, and each
such thread does its cleanup, including calling
ras_ns_finalize_vendor_tables(), which invokes sqlite3_finalize() on the
vendor tables created. Thus the vendor error data is not stored in the
SQLite database the next time such an error is reported.

Solution:

In ras_ns_add_vendor_tables() and ras_ns_finalize_vendor_tables(), use a
reference count, and close the vendor tables created in
ras_ns_add_vendor_tables() based on that reference count.
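
The reference-counting idea can be sketched roughly as follows. This is a
minimal illustration, not the actual patch: struct vendor_table,
vendor_table_get() and vendor_table_put() are hypothetical names, and a
boolean stands in for the prepared SQLite statement.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-table state kept by rasdaemon. */
struct vendor_table {
	int refcount;   /* number of reader threads using the table */
	bool stmt_open; /* stands in for a prepared sqlite3 statement */
};

/* Called by each reader thread; only the first caller prepares the table. */
static void vendor_table_get(struct vendor_table *t)
{
	if (t->refcount++ == 0)
		t->stmt_open = true; /* real code would prepare the statement here */
}

/* Called on thread cleanup; only the last caller finalizes the table. */
static void vendor_table_put(struct vendor_table *t)
{
	if (t->refcount > 0 && --t->refcount == 0)
		t->stmt_open = false; /* real code would call sqlite3_finalize() here */
}
```

With this scheme, a thread that exits early for an offline CPU only drops its
own reference, so the table stays usable by the remaining threads. The sketch
is not thread-safe as written; concurrent callers would need a lock or atomic
counter around the count.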

Reported-by: Junhao He <hejunhao3@huawei.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
10 months ago mce-amd-smca: update smca_hwid to use smca_bank_types
Aristeu Rozanski [Tue, 9 Apr 2024 14:06:30 +0000 (10:06 -0400)]
mce-amd-smca: update smca_hwid to use smca_bank_types

bank_type is used as smca_bank_types everywhere, so there's no point in
declaring it as unsigned int. It also upsets covscan:

3. rasdaemon-0.6.7/mce-amd-smca.c:914: assignment: Assigning: "bank_type" = "s_hwid->bank_type".
7. rasdaemon-0.6.7/mce-amd-smca.c:926: cond_at_most: Checking "bank_type >= 64U" implies that "bank_type" and "s_hwid->bank_type" may be up to 63 on the false branch.
14. rasdaemon-0.6.7/mce-amd-smca.c:942: overrun-local: Overrunning array "smca_mce_descs" of 38 16-byte elements at element index 63 (byte offset 1023) using index "bank_type" (which evaluates to 63).
#   940|        /* Only print the descriptor of valid extended error code */
#   941|        if (xec < smca_mce_descs[bank_type].num_descs)
#   942|->              mce_snprintf(e->mcastatus_msg,
#   943|                             "%s. Ext Err Code: %d",
#   944|                             smca_mce_descs[bank_type].descs[xec],
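
The overrun covscan describes comes from checking the index against a
constant (64) that is larger than the array it guards (38 elements). A
minimal sketch of a lookup that checks against the real array size instead is
shown below; the enum, tables and lookup_desc() are shortened, hypothetical
stand-ins, not the rasdaemon definitions.

```c
#include <stddef.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical, shortened enum; the real smca_bank_types list is longer. */
enum bank_type_demo { DEMO_LS, DEMO_IF, DEMO_L2_CACHE, DEMO_N_TYPES };

struct desc_list {
	const char *const *descs;
	unsigned int num_descs;
};

/* Placeholder description strings, for illustration only. */
static const char *const ls_descs[] = { "demo LS error 0", "demo LS error 1" };
static const char *const if_descs[] = { "demo IF error 0" };
static const char *const l2_descs[] = { "demo L2 error 0" };

static const struct desc_list demo_descs[] = {
	[DEMO_LS]       = { ls_descs, ARRAY_SIZE(ls_descs) },
	[DEMO_IF]       = { if_descs, ARRAY_SIZE(if_descs) },
	[DEMO_L2_CACHE] = { l2_descs, ARRAY_SIZE(l2_descs) },
};

/* Return the description, or NULL if either index is out of range. */
static const char *lookup_desc(enum bank_type_demo type, unsigned int xec)
{
	if ((size_t)type >= ARRAY_SIZE(demo_descs))
		return NULL;
	if (xec >= demo_descs[type].num_descs)
		return NULL;
	return demo_descs[type].descs[xec];
}
```

Tying the bound to ARRAY_SIZE() keeps the check correct when entries are added
to or removed from the table.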

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
10 months ago labels/asrock: Add DIMM labels for ASRock Rack X570D4U
Ivan Mironov [Thu, 28 Mar 2024 00:40:13 +0000 (05:40 +0500)]
labels/asrock: Add DIMM labels for ASRock Rack X570D4U

Signed-off-by: Ivan Mironov <mironov.ivan@gmail.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
10 months ago rasdaemon: Add support to parse microcode field of mce tracepoint
Avadhut Naik [Tue, 2 Apr 2024 05:07:38 +0000 (00:07 -0500)]
rasdaemon: Add support to parse microcode field of mce tracepoint

Support for exporting the Microcode Revision is being added to the
mce_record tracepoint.

Add the required, corresponding support in the rasdaemon for the field
to be parsed and logged or added to the database and viewed later through
ras-mc-ctl utility.

Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
10 months ago rasdaemon: Add support to parse the PPIN field of mce tracepoint
Avadhut Naik [Tue, 2 Apr 2024 04:33:07 +0000 (23:33 -0500)]
rasdaemon: Add support to parse the PPIN field of mce tracepoint

Support for exporting the PPIN (Protected Processor Inventory Number)
is being added to the mce_record tracepoint.

Add the required, corresponding support in the rasdaemon for the field
to be parsed and logged or added to the database and viewed later through
ras-mc-ctl utility.

Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
10 months ago rasdaemon: ras-mc-ctl: Add support to display mcastatus_msg string
Avadhut Naik [Tue, 26 Mar 2024 04:06:08 +0000 (23:06 -0500)]
rasdaemon: ras-mc-ctl: Add support to display mcastatus_msg string

Currently, the mcastatus_msg string of struct mce_event is added to the
SQLite database by the rasdaemon when it is recording errors. The string,
however, is not output by the ras-mc-ctl utility.

The string provides important error information relating to the received
MCE. For example, on AMD SMCA systems, the string outputs extended error
code and description. As such, the string should be present in the
output of ras-mc-ctl utility.

Add support to output the string through the ras-mc-ctl utility.

Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
10 months ago print logs in the same line
zhuofeng [Tue, 12 Mar 2024 06:28:55 +0000 (14:28 +0800)]
print logs in the same line

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
10 months ago rasdaemon: ras-mc-ctl: Add support for CXL memory module trace events
Shiju Jose [Mon, 12 Feb 2024 11:29:13 +0000 (11:29 +0000)]
rasdaemon: ras-mc-ctl: Add support for CXL memory module trace events

Add support for CXL memory module events to the ras-mc-ctl tool.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
10 months ago rasdaemon: ras-mc-ctl: Add support for CXL DRAM trace events
Shiju Jose [Mon, 12 Feb 2024 11:22:03 +0000 (11:22 +0000)]
rasdaemon: ras-mc-ctl: Add support for CXL DRAM trace events

Add support for CXL DRAM events to the ras-mc-ctl tool.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
10 months ago rasdaemon: ras-mc-ctl: Add support for CXL general media trace events
Shiju Jose [Mon, 12 Feb 2024 11:14:03 +0000 (11:14 +0000)]
rasdaemon: ras-mc-ctl: Add support for CXL general media trace events

Add support for CXL general media events to the ras-mc-ctl tool.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
10 months ago rasdaemon: ras-mc-ctl: Add support for CXL generic trace events
Shiju Jose [Mon, 12 Feb 2024 10:56:25 +0000 (10:56 +0000)]
rasdaemon: ras-mc-ctl: Add support for CXL generic trace events

Add support for CXL generic events to the ras-mc-ctl tool.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
10 months ago rasdaemon: ras-mc-ctl: Add support for CXL poison trace events
Shiju Jose [Mon, 12 Feb 2024 10:49:10 +0000 (10:49 +0000)]
rasdaemon: ras-mc-ctl: Add support for CXL poison trace events

Add support for CXL poison events to the ras-mc-ctl tool.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
10 months ago rasdaemon: ras-mc-ctl: Add support for CXL overflow trace events
Shiju Jose [Mon, 12 Feb 2024 10:38:51 +0000 (10:38 +0000)]
rasdaemon: ras-mc-ctl: Add support for CXL overflow trace events

Add support for CXL overflow events to the ras-mc-ctl tool.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
10 months ago rasdaemon: ras-mc-ctl: Add support for CXL AER correctable trace events
Shiju Jose [Mon, 12 Feb 2024 10:35:25 +0000 (10:35 +0000)]
rasdaemon: ras-mc-ctl: Add support for CXL AER correctable trace events

Add support for CXL AER correctable events to the ras-mc-ctl tool.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
10 months ago rasdaemon: ras-mc-ctl: Add support for CXL AER uncorrectable trace events
Shiju Jose [Mon, 12 Feb 2024 10:27:58 +0000 (10:27 +0000)]
rasdaemon: ras-mc-ctl: Add support for CXL AER uncorrectable trace events

Add support for CXL AER uncorrectable events to the ras-mc-ctl tool.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
10 months ago rasdaemon: ras-memory-failure-handler: update memory failure action page types
Shiju Jose [Tue, 6 Feb 2024 12:08:00 +0000 (12:08 +0000)]
rasdaemon: ras-memory-failure-handler: update memory failure action page types

Update memory failure action page types corresponding to the same in
mm/memory-failure.c in the kernel.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
10 months ago rasdaemon: Fix build warnings unused variable if AMP RAS errors is not enabled
Shiju Jose [Mon, 4 Mar 2024 11:49:50 +0000 (11:49 +0000)]
rasdaemon: Fix build warnings unused variable if AMP RAS errors is not enabled

This patch fixes the following unused-variable build warnings, seen when
AMP RAS errors are not enabled (--enable-amp-ns-decode).

==================================================
ras-aer-handler.c: In function ‘ras_aer_event_handler’:
ras-aer-handler.c:72:21: warning: unused variable ‘fn’ [-Wunused-variable]
  int seg, bus, dev, fn;
                     ^~
ras-aer-handler.c:72:16: warning: unused variable ‘dev’ [-Wunused-variable]
  int seg, bus, dev, fn;
                ^~~
ras-aer-handler.c:72:11: warning: unused variable ‘bus’ [-Wunused-variable]
  int seg, bus, dev, fn;
           ^~~
ras-aer-handler.c:72:6: warning: unused variable ‘seg’ [-Wunused-variable]
  int seg, bus, dev, fn;
      ^~~
ras-aer-handler.c:71:10: warning: variable ‘sel_data’ set but not used [-Wunused-but-set-variable]
  uint8_t sel_data[5];
          ^~~~~~~~
ras-aer-handler.c:70:7: warning: unused variable ‘ipmi_add_sel’ [-Wunused-variable]
  char ipmi_add_sel[105];
       ^~~~~~~~~~~~
==================================================

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
10 months ago rasdaemon: ras-mc-ctl: Do not try to find modprobe
Ivan Mironov [Sun, 3 Mar 2024 09:51:13 +0000 (14:51 +0500)]
rasdaemon: ras-mc-ctl: Do not try to find modprobe

It is not used and prevents ras-mc-ctl.service from starting on Fedora
when SELinux is in Enforcing mode.

Resolves: rhbz#1836861
Resolves: https://github.com/fedora-selinux/selinux-policy/issues/2054
Resolves: https://github.com/mchehab/rasdaemon/issues/79
Signed-off-by: Ivan Mironov <mironov.ivan@gmail.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
10 months ago labels/asus: Add DIMM labels for Asus PRIME X570-P
Ivan Mironov [Sat, 2 Mar 2024 05:53:50 +0000 (10:53 +0500)]
labels/asus: Add DIMM labels for Asus PRIME X570-P

Signed-off-by: Ivan Mironov <mironov.ivan@gmail.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
10 months ago Use block_rq_error if RHEL >= 9.1
Etienne Champetier [Mon, 26 Feb 2024 20:02:01 +0000 (15:02 -0500)]
Use block_rq_error if RHEL >= 9.1

The commit introducing block_rq_error tracepoint
has been backported in RHEL 9.1, so improve the check
for block_rq_error presence to use it.

Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
10 months ago rasdaemon: Add error decoding for MCA_CTL_SMU extended bits
Sathya Priya Kumar [Thu, 11 Jan 2024 07:20:07 +0000 (01:20 -0600)]
rasdaemon: Add error decoding for MCA_CTL_SMU extended bits

Enable error decoding support for the newly added extended
error bit descriptions from MCA_CTL_SMU.
b'0:11 can be decoded from the existing array smca_smu2_mce_desc.
Define a function to append the newly defined b'58:62 to
smca_smu2_mce_desc. This avoids maintaining entries for the Reserved
bits b'12:57 in the code.
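
One way to picture the split between the two defined ranges is the sketch
below. It is an assumption-laden illustration, not the patch itself:
smu_desc() and the two caller-supplied tables are hypothetical, and the patch
appends to smca_smu2_mce_desc rather than keeping a second table.

```c
#include <stddef.h>

#define SMU_BASE_BITS 12 /* b'0:11, covered by the existing table */
#define SMU_EXT_FIRST 58 /* b'58:62, the newly defined bits */
#define SMU_EXT_LAST  62

/*
 * Map a bit index to its description.
 * base must hold SMU_BASE_BITS entries; ext must hold one entry per
 * bit in SMU_EXT_FIRST..SMU_EXT_LAST. Reserved bits b'12:57 map to NULL.
 */
static const char *smu_desc(unsigned int bit,
			    const char *const *base,
			    const char *const *ext)
{
	if (bit < SMU_BASE_BITS)
		return base[bit];
	if (bit >= SMU_EXT_FIRST && bit <= SMU_EXT_LAST)
		return ext[bit - SMU_EXT_FIRST];
	return NULL;
}
```

Handling the reserved range in code, rather than as 46 "Reserved" table
entries, is what keeps the description table short.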

Signed-off-by: Sathya Priya Kumar <sathyapriya.k@amd.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
10 months ago rasdaemon: labels/apple add MacPro 1,1 and 2,1 models
Walter Sonius [Sun, 11 Feb 2024 22:30:25 +0000 (23:30 +0100)]
rasdaemon: labels/apple add MacPro 1,1 and 2,1 models

For the Apple MacPro 1,1 (Mac-F4208DC8) and MacPro 2,1 (Mac-F4208DA9)
these are the correct labels for the DIMM numbers 1-4 on each DIMM Riser
A&B for a total of 8 DIMMs. The MacPro 1,1 vendor is actually called
"Apple Computer, Inc." vs "Apple Inc." for the MacPro 2,1 and 3,1.
Note also that the MacPro 1,1 and 2,1 require the kernel parameter
noefi for their efi32 firmware to boot a 64-bit kernel using the
debian-12.4.0-amd64-netinst.iso.

The upper Riser is called A and the lower Riser is called B. However,
compared to the MacPro 3,1, the riser labels A & B are swapped between
branches on the memory controller of the MacPro 1,1 and 2,1; their
physical location in the case is unchanged (double checked it)! The
so-called slot 2 and slot 3 found by ras-mc-ctl --layout are not
available as slots or risers on the motherboard. ras-mc-ctl
--guess-labels showed the right labels, but the DIMM numbers are
indistinguishable, so this commit is needed to link them to the right
memory location.

Signed-off-by: Walter Sonius <walterav1984@gmail.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
10 months ago rasdaemon: labels/intel add DQ57TM vendor and model
Walter Sonius [Thu, 8 Feb 2024 14:40:45 +0000 (15:40 +0100)]
rasdaemon: labels/intel add DQ57TM vendor and model

Add labels used on the Intel Corporation DQ57TM motherboard.

$ sudo dmesg | grep DMI | grep DQ57TM
[    0.000000] DMI:  /DQ57TM, BIOS TMIBX10H.86A.0050.2011.1207.1134 12/07/2011

Signed-off-by: Walter Sonius <walterav1984@gmail.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
10 months ago README.md: Fix repository information
Mauro Carvalho Chehab [Wed, 5 Jun 2024 12:47:00 +0000 (14:47 +0200)]
README.md: Fix repository information

We haven't used Fedorahosted for a long time; the URL was updated,
but right now it is way more common to receive patches via GitHub
than from other repositories, so change the repository order.

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
14 months ago apple macpro 2008 3,1 dimm1-4 labels riser A&B
Walter Sonius [Mon, 29 Jan 2024 17:21:37 +0000 (18:21 +0100)]
apple macpro 2008 3,1 dimm1-4 labels riser A&B

For the Apple Mac Pro 3,1 (2008) Mac-F42C88C8 these are the correct labels for the DIMM numbers 1-4 on each DIMM Riser A&B for a total of 8 DIMMs.

The upper Riser is called A and the lower Riser is called B. The so-called `slot 2` and `slot 3` found by `ras-mc-ctl --layout` are not available as slots or risers on the motherboard. `ras-mc-ctl --guess-labels` showed the right labels, but the DIMM numbers are indistinguishable, so this commit is needed to link them to the right memory location.

```
$ ras-mc-ctl --layout
       +-----------------------------------------------+
       |                      mc0                      |
       |        branch0        |        branch1        |
       | channel0  | channel1  | channel0  | channel1  |
-------+-----------------------------------------------+
slot3: |     0 MB  |     0 MB  |     0 MB  |     0 MB  |
slot2: |     0 MB  |     0 MB  |     0 MB  |     0 MB  |
-------+-----------------------------------------------+
slot1: |  2048 MB  |  2048 MB  |  2048 MB  |  2048 MB  |
slot0: |  8192 MB  |  8192 MB  |  4096 MB  |  4096 MB  |
-------+-----------------------------------------------+

$ ras-mc-ctl --guess-labels
memory stick 'DIMM 1' is located at 'DIMM Riser B'
memory stick 'DIMM 2' is located at 'DIMM Riser B'
memory stick 'DIMM 1' is located at 'DIMM Riser A'
memory stick 'DIMM 2' is located at 'DIMM Riser A'
memory stick 'DIMM 3' is located at 'DIMM Riser B'
memory stick 'DIMM 4' is located at 'DIMM Riser B'
memory stick 'DIMM 3' is located at 'DIMM Riser A'
memory stick 'DIMM 4' is located at 'DIMM Riser A'
```

Signed-off-by: Walter Sonius <walterav1984@gmail.com>
14 months ago labels/supermicro: add Supermicro X11DPi-N(T)
Werner Fischer [Wed, 31 Jan 2024 12:33:00 +0000 (13:33 +0100)]
labels/supermicro: add Supermicro X11DPi-N(T)

Add labels for Supermicro X11DPi-N and X11DPi-NT motherboards.

Signed-off-by: Werner Fischer <devlists@wefi.net>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
15 months ago C files: cleanup coding style
Mauro Carvalho Chehab [Mon, 22 Jan 2024 07:36:47 +0000 (08:36 +0100)]
C files: cleanup coding style

The rasdaemon coding style follows the Linux Kernel's where it makes sense.

Yet, changes made over time ended up introducing some coding style violations.

Adjust rasdaemon coding style by using:

   scripts/checkpatch.pl --fix-inplace --strict *.c --ignore PREFER_KERNEL_TYPES

And doing some manual fixups where the script didn't work.
As a bonus, some typos were also fixed on some rasdaemon messages.

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
15 months ago rasdaemon: ras-mc-ctl: Add support to display the JaguarMicro vendor errors
Hunter He [Mon, 25 Dec 2023 09:34:56 +0000 (17:34 +0800)]
rasdaemon: ras-mc-ctl: Add support to display the JaguarMicro vendor errors

Add support to display the JaguarMicro Corsica DPU vendor errors event.

Signed-off-by: Hunter He <hunter.he@jaguarmicro.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
15 months ago Supermicro X12DPU-6 DIMM labels
DmNosachev [Tue, 19 Dec 2023 09:44:01 +0000 (12:44 +0300)]
Supermicro X12DPU-6 DIMM labels

Add labels for X12DPU-6 motherboard.

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
15 months ago Fix potential overflow with some arrays at page-isolation logic
zhuofeng [Thu, 7 Dec 2023 02:26:56 +0000 (10:26 +0800)]
Fix potential overflow with some arrays at page-isolation logic

Overflows may happen in the `threshold_string` and `cycle_string` arrays.

If the PAGE_CE_THRESHOLD value in page isolation is set to 50 bits,
there is a risk of array overflow. Because sprintf is an insecure
function, use snprintf instead.
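
A minimal sketch of the safer pattern is shown below, assuming a 50-byte
destination as in the AddressSanitizer report that follows. copy_value() and
DEMO_BUF_LEN are hypothetical names, not taken from ras-page-isolation.c.

```c
#include <stddef.h>
#include <stdio.h>

#define DEMO_BUF_LEN 50 /* matches the 50-byte arrays in the ASan report */

/*
 * Copy a configuration value into a fixed-size buffer.
 * snprintf() never writes more than dst_len bytes and always terminates
 * the string when dst_len > 0. Returns 0 on success, or -1 if the value
 * was truncated or an output error occurred.
 */
static int copy_value(char *dst, size_t dst_len, const char *value)
{
	int n = snprintf(dst, dst_len, "%s", value);

	return (n < 0 || (size_t)n >= dst_len) ? -1 : 0;
}
```

Checking the return value matters here: snprintf() prevents the overflow, but
a silently truncated threshold would still be parsed as a different number.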

An error is reported when the AddressSanitizer is used.

rasdaemon: Improper PAGE_CE_ACTION, set to default soft
rasdaemon: Page offline choice on Corrected Errors is soft
=================================================================
==221920==ERROR: AddressSanitizer: stack-buffer-overflow on address 0xffffdd91d932 at pc 0xffffa24071c4 bp 0xffffdd91d720 sp 0xffffdd91ced8
WRITE of size 55 at 0xffffdd91d932 thread T0
    #0 0xffffa24071c0 in vsprintf (/usr/lib64/libasan.so.6+0x5c1c0)
    #1 0xffffa24073cc in sprintf (/usr/lib64/libasan.so.6+0x5c3cc)
    #2 0x459558 in parse_env_string /home/rasdaemon/ras-page-isolation.c:185
    #3 0x4596f4 in page_isolation_init /home/rasdaemon/ras-page-isolation.c:202
    #4 0x459934 in ras_page_account_init /home/rasdaemon/ras-page-isolation.c:211
    #5 0x40f700 in handle_ras_events /home/rasdaemon/ras-events.c:902
    #6 0x405b8c in main /home/rasdaemon/rasdaemon.c:211
    #7 0xffffa20b6f38 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
    #8 0xffffa20b7004 in __libc_start_main_impl ../csu/libc-start.c:409
    #9 0x4038ec in _start (/home/rasdaemon/rasdaemon+0x4038ec)

Address 0xffffdd91d932 is located in stack of thread T0 at offset 82 in frame
    #0 0x459574 in page_isolation_init /home/rasdaemon/ras-page-isolation.c:190

  This frame has 2 object(s):
    [32, 82) 'threshold_string' (line 191)
    [128, 178) 'cycle_string' (line 192) <== Memory access at offset 82 partially underflows this variable
HINT: this may be a false positive if your program uses some custom stack unwind mechanism, swapcontext or vfork
      (longjmp and C++ exceptions *are* supported)
SUMMARY: AddressSanitizer: stack-buffer-overflow (/usr/lib64/libasan.so.6+0x5c1c0) in vsprintf
Shadow bytes around the buggy address:
  0x200ffbb23ad0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x200ffbb23ae0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x200ffbb23af0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x200ffbb23b00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x200ffbb23b10: 00 00 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1
=>0x200ffbb23b20: 00 00 00 00 00 00[02]f2 f2 f2 f2 f2 00 00 00 00
  0x200ffbb23b30: 00 00 02 f3 f3 f3 f3 f3 00 00 00 00 00 00 00 00
  0x200ffbb23b40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x200ffbb23b50: f1 f1 f1 f1 f1 f1 04 f2 00 00 f2 f2 00 00 00 00
  0x200ffbb23b60: 00 00 00 f2 f2 f2 f2 f2 00 00 00 00 00 00 00 f2
  0x200ffbb23b70: f2 f2 f2 f2 00 00 00 00 00 00 00 00 f2 f2 f2 f2
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
  Shadow gap:              cc
==221920==ABORTING

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
15 months ago rasdaemon: Fix return value type compiling warnning of configure Optional Features...
Hunter He [Mon, 4 Dec 2023 04:54:55 +0000 (12:54 +0800)]
rasdaemon: Fix return value type compiling warnning of configure Optional Features with --enable-amp-ns-decode and without --enable-sqlite3.

Fix the return value type compile warning seen when configuring Optional Features
with --enable-amp-ns-decode and without --enable-sqlite3.

Signed-off-by: Hunter He <hunter.he@jaguarmicro.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
15 months ago rasdaemon:Add support for creating vendor tables at startup.
Hunter He [Wed, 6 Dec 2023 06:52:03 +0000 (14:52 +0800)]
rasdaemon:Add support for creating vendor tables at startup.

When rasdaemon runs without receiving any non-standard error, the vendor
tables are not created in the database file. The ras-mc-ctl
script then breaks when trying to query data from the non-existent tables.

Add support for creating vendor tables at startup.

Signed-off-by: Hunter He <hunter.he@jaguarmicro.com>
15 months ago Add dynamic switch of ras events support.
caixiaomeng 00662745 [Wed, 29 Nov 2023 06:31:46 +0000 (14:31 +0800)]
Add dynamic switch of ras events support.

Rasdaemon does not support a way to disable some events via configuration.
If a user wants to disable a specific event (e.g. block_rq_complete), they
have to recompile rasdaemon, which is inconvenient.

This patch adds support for switching ras events dynamically. You can add
the events you want to disable to /etc/sysconfig/rasdaemon. For example,
`DISABLE="ras:mc_event,block:block_rq_complete"`. Then restart
rasdaemon, and these two events will be disabled without recompilation.
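
The list matching this implies could look roughly like the sketch below. It
is only an illustration: event_is_listed() is a hypothetical helper with an
assumed signature, and it is not the is_disabled_event() function from the
patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Return true if "group:event" appears as a whole entry in a
 * comma-separated list such as the DISABLE value shown above.
 * The list is scanned in place and is not modified.
 */
static bool event_is_listed(const char *list, const char *group,
			    const char *event)
{
	char name[128];
	size_t name_len;
	const char *p = list;
	int n;

	if (!list)
		return false;

	n = snprintf(name, sizeof(name), "%s:%s", group, event);
	if (n < 0 || (size_t)n >= sizeof(name))
		return false;
	name_len = (size_t)n;

	while (*p) {
		size_t tok_len = strcspn(p, ",");

		/* Compare whole entries only, so prefixes do not match. */
		if (tok_len == name_len && strncmp(p, name, name_len) == 0)
			return true;
		p += tok_len;
		if (*p == ',')
			p++;
	}
	return false;
}
```

Comparing entry lengths before the contents avoids treating
"block:block_rq" as a match for "block:block_rq_complete".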

[mchehab: make is_disabled_event() static]
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
15 months ago rasdaemon: Add support for vendor-specific machine check error information
Avadhut Naik [Tue, 21 Nov 2023 20:04:19 +0000 (14:04 -0600)]
rasdaemon: Add support for vendor-specific machine check error information

Some CPU vendors may provide additional vendor-specific machine check
error information. AMD, for example, provides FRU Text through SYND 1/2
registers if BIT 9 of SMCA_CONFIG register is set.

Add support to display the additional vendor-specific error information,
if any.

Signed-off-by: Avadhut Naik <Avadhut.Naik@amd.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
18 months ago rasdaemon: ras-mc-ctl: Modify check for HiSilicon KunPeng9xx error fields
Shiju Jose [Thu, 24 Aug 2023 12:07:17 +0000 (13:07 +0100)]
rasdaemon: ras-mc-ctl: Modify check for HiSilicon KunPeng9xx error fields

Modify the check for valid HiSilicon KunPeng9xx error fields.
This fixes error data not being printed when its value is 0.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
18 months ago rasdaemon: Add Emerald Rapids support
Delgado Vargas, Daniel [Fri, 20 Oct 2023 16:57:11 +0000 (10:57 -0600)]
rasdaemon: Add Emerald Rapids support

Signed-off-by: Delgado Vargas, Daniel <daniel.delgado.vargas@intel.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
18 months ago Add a space between "diskerror_event" and "store"
weidongkl [Tue, 19 Sep 2023 08:29:21 +0000 (16:29 +0800)]
Add a space between "diskerror_event" and "store"

Signed-off-by: weidongkl <weidongkl@sina.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
18 months ago rasdaemon: ras-mc-ctl: Add support to display the THead vendor errors
Ruidong Tian [Thu, 7 Sep 2023 10:22:06 +0000 (18:22 +0800)]
rasdaemon: ras-mc-ctl: Add support to display the THead vendor errors

Add support for the THead YiTian DDRC register dump event.

Signed-off-by: Ruidong Tian <tianruidong@linux.alibaba.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
18 months ago rasdaemon: add support for THead Yitian non-standard error decoder
Ruidong Tian [Thu, 7 Sep 2023 10:21:05 +0000 (18:21 +0800)]
rasdaemon: add support for THead Yitian non-standard error decoder

Add a new non-standard error decoder to decode the THead YiTian error
section. Put all related code in a new source file.

Signed-off-by: Ruidong Tian <tianruidong@linux.alibaba.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
18 months ago rasdaemon: log non_standard_event at just one line
Ruidong Tian [Thu, 7 Sep 2023 10:19:40 +0000 (18:19 +0800)]
rasdaemon: log non_standard_event at just one line

It is more reasonable to log non_standard_event in one line, excluding the
error dump. That way, you can easily get the decoded non_standard_event log in
one line if you implement a decoder, as with other events.

Signed-off-by: Ruidong Tian <tianruidong@linux.alibaba.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
18 months ago rasdaemon: Fix SMCA bank type decoding
Avadhut Naik [Thu, 31 Aug 2023 07:23:48 +0000 (02:23 -0500)]
rasdaemon: Fix SMCA bank type decoding

On AMD systems with Scalable MCA (SMCA), the (HWID, MCATYPE) tuple from
the MCA_IPID MSR, bits 43:32 and 63:48 respectively, are used for SMCA
bank type decoding. On occurrence of an SMCA error, the cached tuples are
compared against the tuple read from the MCA_IPID MSR to determine the
SMCA bank type.

Currently however, all high 32 bits of the MCA_IPID register are cached in
the rasdaemon for all SMCA bank types. Bits 47:44 which do not play a part
in bank type decoding are zeroed out. Likewise, when an SMCA error occurs,
all high 32 bits of the MCA_IPID register are read and compared against
the cached values in smca_hwid_mcatypes array.

This can lead to erroneous bank type decoding since the bits 47:44 are
not guaranteed to be zero. They are either reserved or, on some modern
AMD systems viz. Genoa, denote the InstanceIdHi value. The bits therefore,
should not be associated with SMCA bank type decoding.

Import the HWID_MCATYPE macro from the kernel to ensure that only the
relevant fields i.e. (HWID, MCATYPE) tuples are used for SMCA bank type
decoding on occurrence of an SMCA error.
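
The masking described above can be sketched as follows. The macro body is
modeled on the kernel's HWID_MCATYPE as best understood and should be treated
as an assumption; ipid_to_hwid_mcatype() is a hypothetical helper, and the
bit ranges are the ones stated in this message.

```c
#include <stdint.h>

/* Assumed to mirror the kernel macro: HWID in the upper half, McaType below. */
#define HWID_MCATYPE(hwid, mcatype) \
	(((uint32_t)(hwid) << 16) | (uint32_t)(mcatype))

/* Build the bank-type key from a raw 64-bit MCA_IPID value. */
static uint32_t ipid_to_hwid_mcatype(uint64_t ipid)
{
	uint32_t hwid = (uint32_t)((ipid >> 32) & 0xFFF);     /* bits 43:32 */
	uint32_t mcatype = (uint32_t)((ipid >> 48) & 0xFFFF); /* bits 63:48 */

	/* Bits 47:44 (reserved or InstanceIdHi) are deliberately dropped. */
	return HWID_MCATYPE(hwid, mcatype);
}
```

Two MCA_IPID values that differ only in bits 47:44 then yield the same key,
so they decode to the same bank type.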

Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
18 months ago rasdaemon: Identify the DIe Number in multidie system
Muralidhara M K [Thu, 27 Jul 2023 10:18:12 +0000 (10:18 +0000)]
rasdaemon: Identify the DIe Number in multidie system

Some AMD systems have 4 dies in each socket, and the Die ID indicates
whether the error occurred on a CPU die or a GPU die.
The respective Die is also used for FRU identification.

Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
18 months ago rasdaemon: Handle reassigned bit definitions for UMC bank
Muralidhara M K [Fri, 30 Jun 2023 11:19:42 +0000 (11:19 +0000)]
rasdaemon: Handle reassigned bit definitions for UMC bank

On some AMD systems some of the existing bit definitions in the
CTL register of SMCA bank type are reassigned without defining
new HWID and McaType. Consequently, the errors whose bit
definitions have been reassigned in the CTL register are being
erroneously decoded.

Add a new error description structure to compensate for the
reassigned bit definitions, through a new software-defined SMCA bank
type that utilizes the hardware-reserved values for HWID.
The new SMCA bank type will only be employed for UMC error
decoding on affected models; the existing error description
structure for the UMC bank type is still valid.

Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
18 months agorasdaemon: Add new MA_LLC, USR_DP, and USR_CP bank types.
Muralidhara M K [Fri, 30 Jun 2023 10:36:53 +0000 (10:36 +0000)]
rasdaemon: Add new MA_LLC, USR_DP, and USR_CP bank types.

Add HWID and McaType values for new SMCA bank types
and error decoding for those new SMCA banks.

Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
18 months agorasdaemon: Add support for post-processing MCA errors
Avadhut Naik [Mon, 22 May 2023 22:13:17 +0000 (22:13 +0000)]
rasdaemon: Add support for post-processing MCA errors

Currently, rasdaemon performs detailed error decoding of received
MCA errors on the system only while it is running, either as a daemon
or in the foreground.

As such, error decoding cannot be undertaken for any MCA errors received
while rasdaemon was not running. Additionally, if error decoding
modules like edac_mce_amd have not been loaded either, error records in the
dmesg buffer might correspond to raw values in the associated MSRs, compelling
users to undertake decoding manually. The scenario seems more plausible on
AMD systems with Scalable MCA (SMCA), as there are plans to remove SMCA
Extended Error Descriptions from the edac_mce_amd module in an effort to
offload SMCA error decoding to rasdaemon.

As such, add support to post-process and decode MCA Errors received on AMD
SMCA systems from raw MSR values. Support for post-processing and decoding
of MCA Errors received on CPUs of other vendors can be added in the future,
as needed.

Suggested-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
18 months agorasdaemon: Handle reassigned bit definitions for CS SMCA
Avadhut Naik [Mon, 24 Apr 2023 20:35:56 +0000 (20:35 +0000)]
rasdaemon: Handle reassigned bit definitions for CS SMCA

Currently, on AMD systems with Scalable MCA (SMCA), each machine check
error of a SMCA bank type has an associated bit position in the bank's
control (CTL) register used for enabling / disabling reporting of the
very error. An error's bit position in the CTL register is also used
during error decoding for offsetting into the corresponding bank's error
description structure. As new errors are being added in newer AMD systems
for existing SMCA bank types, the underlying SMCA architecture guarantees
that the bit positions of existing errors are not altered.

However, on some AMD systems viz. Genoa, some of the existing bit
definitions in the CTL register of the Coherent Slave (CS) SMCA bank type
are reassigned without defining new HWID and McaType. Consequently, the
very errors whose bit definitions have been reassigned in the CTL register
are being erroneously decoded.

As a solution, create a new software defined SMCA bank type by utilizing
one of the hardware-reserved values for HWID. The new SMCA bank type will
only be employed for CS error decoding on affected CPU models.

Additionally, since the existing error description structure for the CS
SMCA bank type is still valid, add new error description structure to
compensate for the reassigned bit definitions.

Signed-off-by: Avadhut Naik <avadnaik@amd.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
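
The approach in the commit above (a CTL bit position indexes into a per-bank-type description table, and affected models are routed to a software-defined bank type with its own table) can be sketched roughly as follows. All names and strings here are hypothetical placeholders, not the actual rasdaemon tables.

```c
#include <stddef.h>

#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))

/* Hypothetical bank types: the second is the software-defined variant. */
enum bank_type { BANK_CS, BANK_CS_SW_DEFINED };

/* Placeholder strings, not real error descriptions. */
static const char * const cs_errors[] = {
	"error 0", "error 1", "error 2",
};

/* Table for models where bit 1 was reassigned (illustrative only). */
static const char * const cs_sw_errors[] = {
	"error 0", "reassigned error 1", "error 2",
};

/* The error's bit position in the CTL register is the table index. */
static const char *describe(enum bank_type type, unsigned int bit)
{
	const char * const *table = cs_errors;
	size_t n = ARRAY_SIZE(cs_errors);

	if (type == BANK_CS_SW_DEFINED) {
		table = cs_sw_errors;
		n = ARRAY_SIZE(cs_sw_errors);
	}
	return bit < n ? table[bit] : NULL;
}
```

Keeping the original table untouched means unaffected models decode exactly as before.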
18 months agorasdaemon: Update SMCA bank error descriptions
Avadhut Naik [Tue, 18 Apr 2023 18:24:21 +0000 (18:24 +0000)]
rasdaemon: Update SMCA bank error descriptions

Update and reword some existing SMCA bank type error descriptions to extend
SMCA error decoding functionality for modern AMD processors. Additionally,
add new error descriptions for missing SMCA bank types.

Signed-off-by: Avadhut Naik <avadnaik@amd.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
18 months agoadd ':' before error output
weidong [Tue, 8 Aug 2023 08:59:12 +0000 (08:59 +0000)]
add ':' before error output

All prints except disk are preceded by a colon

Signed-off-by: weidong <weidongkl@sina.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
18 months agoAdd label for mainboard: ASUSTeK COMPUTER INC. Model: Z9PH-D16 Series
garadar [Fri, 14 Jul 2023 17:45:28 +0000 (19:45 +0200)]
Add label for mainboard: ASUSTeK COMPUTER INC. Model: Z9PH-D16 Series

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
18 months agoAdd label for mainboard: GIGABYTE model MZ62-HD0-00
alberta [Fri, 14 Jul 2023 16:19:11 +0000 (18:19 +0200)]
Add label for mainboard: GIGABYTE model MZ62-HD0-00

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
18 months agoCheck CPUs online, not configured.
Zeph / Liz Loss-Cutler-Hull [Sun, 9 Jul 2023 11:57:19 +0000 (04:57 -0700)]
Check CPUs online, not configured.

When the number of CPUs detected is greater than the number of CPUs in
the system, rasdaemon will crash when it receives some events.

Looking deeper, we also fail to use the poll method for similar reasons
in this case.

All of this can be prevented by checking to see how many CPUs are
currently online (sysconf(_SC_NPROCESSORS_ONLN)) instead of how many
CPUs the current kernel was configured to support
(sysconf(_SC_NPROCESSORS_CONF)).

For the kernel side of the discussion, see https://lore.kernel.org/lkml/CAM6Wdxft33zLeeXHhmNX5jyJtfGTLiwkQSApc=10fqf+rQh9DA@mail.gmail.com/T/
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
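
The difference between the two sysconf() queries named in the commit above can be seen with a small sketch. The helper names are hypothetical, and the fallback to 1 on failure is an assumption made for this example.

```c
#include <unistd.h>

/* CPUs currently online; falls back to 1 if sysconf() fails (assumption). */
static long online_cpus(void)
{
	long n = sysconf(_SC_NPROCESSORS_ONLN);

	return n > 0 ? n : 1;
}

/* CPUs the system is configured for; may exceed the online count. */
static long configured_cpus(void)
{
	long n = sysconf(_SC_NPROCESSORS_CONF);

	return n > 0 ? n : 1;
}
```

Sizing per-CPU resources from online_cpus() avoids touching per-CPU files of CPUs that are configured but not online.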
18 months agorasdaemon: Add support for the CXL memory module events
Shiju Jose [Wed, 5 Apr 2023 15:16:19 +0000 (16:16 +0100)]
rasdaemon: Add support for the CXL memory module events

Add support to log and record the CXL memory module events.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
18 months agorasdaemon: Add support for the CXL dram events
Shiju Jose [Wed, 5 Apr 2023 12:28:20 +0000 (13:28 +0100)]
rasdaemon: Add support for the CXL dram events

Add support to log and record the CXL dram events.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
18 months agorasdaemon: Add support for the CXL general media events
Shiju Jose [Wed, 5 Apr 2023 10:54:41 +0000 (11:54 +0100)]
rasdaemon: Add support for the CXL general media events

Add support to log and record the CXL general media events.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
18 months agorasdaemon: Add support for the CXL generic events
Shiju Jose [Tue, 4 Apr 2023 17:49:09 +0000 (18:49 +0100)]
rasdaemon: Add support for the CXL generic events

Add support to log and record the CXL generic events.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
18 months agorasdaemon: Add support for the CXL overflow events
Shiju Jose [Tue, 4 Apr 2023 15:50:50 +0000 (16:50 +0100)]
rasdaemon: Add support for the CXL overflow events

Add support to log and record the CXL overflow events.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
18 months agorasdaemon: Add common function to get timestamp for the event
Shiju Jose [Tue, 4 Apr 2023 15:07:21 +0000 (16:07 +0100)]
rasdaemon: Add common function to get timestamp for the event

Add common function to get the timestamp for the event
reported.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
18 months agorasdaemon: Add common function to convert timestamp in the CXL event records to the...
Shiju Jose [Tue, 4 Apr 2023 13:40:42 +0000 (14:40 +0100)]
rasdaemon: Add common function to convert timestamp in the CXL event records to the broken-down time format

Add common function to convert the timestamp in the CXL event records
in nanoseconds to the broken-down time format.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
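
A minimal sketch of the conversion described in the commit above, assuming the timestamp counts nanoseconds since the Unix epoch. The function name ns_to_tm() is hypothetical, and UTC (gmtime_r) is used here only to keep the example deterministic.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000ULL

/* Convert a nanosecond timestamp to broken-down UTC time; 0 on success. */
static int ns_to_tm(uint64_t ts_ns, struct tm *out)
{
	time_t secs = (time_t)(ts_ns / NSEC_PER_SEC);

	return gmtime_r(&secs, out) ? 0 : -1;
}
```

The resulting struct tm can then be passed to strftime() to format the event time for logging.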
18 months agorasdaemon: Add support for creating the vendor error tables at startup
Shiju Jose [Wed, 31 May 2023 15:24:36 +0000 (16:24 +0100)]
rasdaemon: Add support for creating the vendor error tables at startup

1. Add support for creating/opening the vendor error tables at rasdaemon startup.
2. Make changes in the HiSilicon error handling code for the same.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
18 months agorasdaemon: fix issue of signed and unsigned integer comparison and remove redundant...
Xiaofei Tan [Tue, 30 May 2023 10:44:12 +0000 (11:44 +0100)]
rasdaemon: fix issue of signed and unsigned integer comparison and remove redundant header file

1. The return value of ARRAY_SIZE() is an unsigned integer. It isn't right to
compare it with a signed integer. Fix these comparisons.

2. Remove a redundant header file and adjust the order of the header files.

Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
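
To illustrate the first point of the commit above: ARRAY_SIZE() yields a size_t, so comparing it with a negative int converts the int to a very large unsigned value. A sketch of a lookup that avoids the mixed comparison follows; the table and function names are hypothetical.

```c
#include <stddef.h>

#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))

/* Placeholder table for the example. */
static const char * const names[] = { "a", "b", "c" };

/* Index is unsigned, so it compares cleanly against ARRAY_SIZE(). */
static const char *lookup(size_t idx)
{
	return idx < ARRAY_SIZE(names) ? names[idx] : NULL;
}

/* For callers holding a signed value: reject negatives before converting. */
static const char *lookup_signed(int idx)
{
	if (idx < 0)
		return NULL;
	return lookup((size_t)idx);
}
```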
18 months agorasdaemon: fix return value type issue of read/write function from unistd.h
Xiaofei Tan [Thu, 11 May 2023 02:54:26 +0000 (10:54 +0800)]
rasdaemon: fix return value type issue of read/write function from unistd.h

The return type of the read/write functions from unistd.h is ssize_t.
It is a signed type, and the functions return -1 on error. Fix the
incorrect use in the function read_ras_event_all_cpus().

Also, move the setting of buffer_percent into a separate function.

Fixes: 94750bcf9309 ("rasdaemon: Fix poll() on per_cpu trace_pipe_raw blocks indefinitely")
Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
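
The point about ssize_t in the commit above can be sketched as follows. The helper read_retry() is hypothetical; retrying on EINTR is an addition made for this example and is not something the commit describes.

```c
#include <errno.h>
#include <unistd.h>

/*
 * Returns the number of bytes read (>= 0), or -1 on error.
 * The result is kept in a ssize_t: storing it in an unsigned type
 * would turn the -1 error value into a huge positive count.
 */
static ssize_t read_retry(int fd, void *buf, size_t len)
{
	ssize_t n;

	do {
		n = read(fd, buf, len);
	} while (n < 0 && errno == EINTR);

	return n;
}
```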
18 months agoRasdaemon: Fix autoreconf build error
Ayush Jain [Tue, 23 May 2023 06:55:36 +0000 (12:25 +0530)]
Rasdaemon: Fix autoreconf build error

When building rasdaemon with autoreconf, on certain distros
we see the following error message.
Makefile.am: error: required file './README' not found
Autoreconf looks for README file instead of README.md
Fix this by passing 'foreign' to AM_INIT_AUTOMAKE.

Signed-off-by: Ayush Jain <ayush.jain3@amd.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
18 months agoras-events: quit loop in read_ras_event when kbuf data is broken
hubin [Thu, 18 May 2023 08:14:41 +0000 (16:14 +0800)]
ras-events: quit loop in read_ras_event when kbuf data is broken

When kbuf data is broken, kbuffer_next_event() may move kbuf->index back to
the current kbuf->index position, causing an infinite loop.

In this situation, rasdaemon will repeatedly parse an invalid event and
print warnings like "ug! negative record size -8!", pushing CPU utilization
to 100%.

When kbuf data is broken, discard the current page and continue reading the
next kbuf page.

Signed-off-by: hubin <hubin73@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
23 months agorasdaemon: Add support for the CXL AER correctable errors
Shiju Jose [Fri, 17 Mar 2023 13:07:01 +0000 (13:07 +0000)]
rasdaemon: Add support for the CXL AER correctable errors

Add support to log and record the CXL AER correctable errors.

The corresponding Kernel patches are here:
https://lore.kernel.org/linux-cxl/166974401763.1608150.5424589924034481387.stgit@djiang5-desk3.ch.intel.com/T/#t
https://lore.kernel.org/linux-cxl/63e5ed38d77d9_138fbc2947a@iweiny-mobl.notmuch/T/#t

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
23 months agorasdaemon: Add support for the CXL AER uncorrectable errors
Shiju Jose [Fri, 17 Mar 2023 12:51:02 +0000 (12:51 +0000)]
rasdaemon: Add support for the CXL AER uncorrectable errors

Add support to log and record the CXL AER uncorrectable errors.

The corresponding Kernel patches are here:
https://lore.kernel.org/linux-cxl/166974401763.1608150.5424589924034481387.stgit@djiang5-desk3.ch.intel.com/T/#t
https://lore.kernel.org/lkml/63eeb2a8c9e3f_32d612941f@dwillia2-xfh.jf.intel.com.notmuch/T/

It was found that the header log data needs to be converted to
big-endian format to be stored correctly in the SQLite DB, likely
because the SQLite database seems to use big-endian storage.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
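
A minimal sketch of the byte-order conversion mentioned in the commit above, assuming the header log is an array of 32-bit words and that glibc's <endian.h> is available. The function name is hypothetical.

```c
#define _DEFAULT_SOURCE
#include <endian.h>
#include <stddef.h>
#include <stdint.h>

/* Convert each 32-bit header log word from host order to big-endian, in place. */
static void header_log_to_be32(uint32_t *log, size_t n_words)
{
	for (size_t i = 0; i < n_words; i++)
		log[i] = htobe32(log[i]);
}
```

After the conversion, the most significant byte of each word comes first in memory regardless of the host byte order.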
23 months agorasdaemon: Add support for the CXL poison events
Shiju Jose [Fri, 31 Mar 2023 12:35:13 +0000 (13:35 +0100)]
rasdaemon: Add support for the CXL poison events

Add support to log and record the CXL poison events.

The corresponding Kernel patches are here:
https://lore.kernel.org/linux-cxl/64457d30bae07_2028294ac@dwillia2-xfh.jf.intel.com.notmuch/

Presently this is for logging only. It could be extended to policy-based
recovery actions for frequent poison events, depending on the above
kernel patches.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
23 months agorasdaemon: Move definition for BIT and BIT_ULL to a common file
Shiju Jose [Mon, 16 Jan 2023 17:13:32 +0000 (17:13 +0000)]
rasdaemon: Move definition for BIT and BIT_ULL to a common file

Move definition for BIT() and BIT_ULL() to the
common file ras-record.h

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
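
For reference, macros of this kind conventionally look like the sketch below. These are the usual Linux-style definitions and are assumed here rather than copied from ras-record.h; the helper test_bit64() is hypothetical.

```c
#include <stdint.h>

/* Conventional definitions (assumed): a single bit set at position nr. */
#define BIT(nr)     (1UL << (nr))
#define BIT_ULL(nr) (1ULL << (nr))

/* Check whether bit nr is set in a 64-bit register value. */
static int test_bit64(uint64_t value, unsigned int nr)
{
	return (value & BIT_ULL(nr)) != 0;
}
```

BIT_ULL() is the safer choice for 64-bit register values, since unsigned long is only 32 bits wide on some platforms.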
23 months agoras-mc-ctl: add option to exclude old events from reports
Marcus Sundman [Thu, 20 Apr 2023 15:17:17 +0000 (18:17 +0300)]
ras-mc-ctl: add option to exclude old events from reports

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
23 months agorasdaemon: fix table create if some cpus are offline
Shiju Jose [Sun, 5 Mar 2023 23:14:42 +0000 (23:14 +0000)]
rasdaemon: fix table create if some cpus are offline

Fix for regression in ras_mc_create_table() if some cpus are offline
at the system start

Issue:

A regression occurs in ras_mc_create_table() if some of the CPUs are
offline at system start when rasdaemon is run.

This issue is reproducible in ras_mc_create_table() when decoding and
recording non-standard events, and is sometimes reproducible with
ras_mc_create_table() for the standard events.

Also, in the multi-thread mode, there is a memory leak in
ras_mc_event_opendb(), as struct sqlite3_priv *priv and sqlite3 *db are
allocated/initialized per thread but stored in the common
struct ras_events ras in the pthread data, which is shared across the threads.

Reason:

When the system starts with some of the CPUs offline and rasdaemon is then
run, read_ras_event_all_cpus() exits with an error and rasdaemon switches to
the multi-thread mode. However, read() in read_ras_event() returns an error
in the threads for each of the offline CPUs, and those threads do clean-up,
including calling ras_mc_event_closedb().

Since the 'struct ras_events ras' passed in the pthread_data to each of the
threads is common, the struct sqlite3_priv *priv and sqlite3 *db, which are
allocated/initialized per thread and stored in the common
'struct ras_events ras', get overwritten in each ras_mc_event_opendb() call
(which is made from the per-CPU pthread), resulting in a memory leak.

Also, when ras_mc_event_closedb() is called in the above error case from
the threads corresponding to the offline CPUs, it closes the sqlite3 *db and
frees the sqlite3_priv *priv stored in the common 'struct ras_events ras',
resulting in a regression when priv->db is accessed in ras_mc_create_table()
from another context later.

Solution:

In ras_mc_event_opendb(), allocate struct sqlite3_priv *priv,
init sqlite3 *db and create the tables once for all threads sharing
'struct ras_events ras', based on a reference count, and free them in the
same way.

Also protect the critical code in ras_mc_event_opendb() and
ras_mc_event_closedb() using a mutex in the multi-thread case, to avoid any
regression caused by thread pre-emption.

Reported-by: Lei Feng <fenglei47@h-partners.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
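
The reference-counted open/close scheme described in the commit above can be sketched as follows. This is not the rasdaemon code: struct shared_db and both functions are hypothetical stand-ins, and an opaque allocation takes the place of the SQLite handle and private data.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for the state shared by all per-CPU threads. */
struct shared_db {
	pthread_mutex_t lock;
	int refcount;
	void *priv;	/* stands in for the sqlite3_priv / sqlite3 handle */
};

/* The first caller allocates; later callers only take a reference. */
static int shared_db_open(struct shared_db *s)
{
	int rc = 0;

	pthread_mutex_lock(&s->lock);
	if (s->refcount == 0) {
		s->priv = calloc(1, 64);
		if (!s->priv)
			rc = -1;
	}
	if (rc == 0)
		s->refcount++;
	pthread_mutex_unlock(&s->lock);

	return rc;
}

/* The last caller to drop its reference frees the shared state. */
static void shared_db_close(struct shared_db *s)
{
	pthread_mutex_lock(&s->lock);
	if (s->refcount > 0 && --s->refcount == 0) {
		free(s->priv);
		s->priv = NULL;
	}
	pthread_mutex_unlock(&s->lock);
}
```

With this scheme, a thread for an offline CPU that closes early only drops its own reference, so the other threads keep a valid handle.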
23 months agoconfigure.ac: fix bashisms
Sam James [Sun, 19 Feb 2023 18:33:20 +0000 (18:33 +0000)]
configure.ac: fix bashisms

configure scripts need to be runnable with a POSIX-compliant /bin/sh.

On many (but not all!) systems, /bin/sh is provided by Bash, so errors
like this aren't spotted. Notably Debian defaults to /bin/sh provided
by dash which doesn't tolerate such bashisms as '=='.

This retains compatibility with bash.

Fixes configure warnings/errors like:
```
checking for libtraceevent... yes
./configure: 13430: test: x: unexpected operator
./configure: 13439: test: x: unexpected operator
```

Signed-off-by: Sam James <sam@gentoo.org>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years agoFix create release workflow
Mauro Carvalho Chehab [Sat, 18 Feb 2023 17:26:33 +0000 (18:26 +0100)]
Fix create release workflow

make dist-bzip2 requires configure to work, which, in turn, depends
on having some tools installed.

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years agoci.yml: fix workflow to build rasdaemon
Mauro Carvalho Chehab [Sat, 18 Feb 2023 13:04:51 +0000 (14:04 +0100)]
ci.yml: fix workflow to build rasdaemon

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years agoChangeLog: do some minor updates
Mauro Carvalho Chehab [Sat, 18 Feb 2023 13:04:05 +0000 (14:04 +0100)]
ChangeLog: do some minor updates

It is missing an entry about new labels. Also, version is at the
wrong place.

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years agoBump version to 0.8.0 v0.8.0
Mauro Carvalho Chehab [Sat, 18 Feb 2023 08:45:50 +0000 (09:45 +0100)]
Bump version to 0.8.0

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years agolabels/asrock: add X399D8A-2T
tictooc [Sat, 11 Feb 2023 17:40:29 +0000 (17:40 +0000)]
labels/asrock: add X399D8A-2T

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years agoConvert README to markdown format
Mauro Carvalho Chehab [Sat, 18 Feb 2023 08:45:56 +0000 (09:45 +0100)]
Convert README to markdown format

That allows git??b to better parse it.

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years agomisc/rasdaemon.spec.in: add libtraceevent requirement
Mauro Carvalho Chehab [Sat, 18 Feb 2023 08:15:07 +0000 (09:15 +0100)]
misc/rasdaemon.spec.in: add libtraceevent requirement

As we're no longer bundling libtraceevent inside RASdaemon, packaging
it now requires it.

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years agoMakefile.am: fix mock build target
Mauro Carvalho Chehab [Sat, 18 Feb 2023 08:08:08 +0000 (09:08 +0100)]
Makefile.am: fix mock build target

Mock now makes it mandatory to add the install dir, otherwise it
refuses to build. So, add it.

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years agorasdaemon: Fix poll() on per_cpu trace_pipe_raw blocks indefinitely
Shiju Jose [Sat, 4 Feb 2023 19:15:55 +0000 (19:15 +0000)]
rasdaemon: Fix poll() on per_cpu trace_pipe_raw blocks indefinitely

Error events have not been received by rasdaemon since kernel 6.1-rc6.
This issue was first detected and reported when testing the CXL error
events in rasdaemon.

Debugging showed that poll() on trace_pipe_raw in ras-events.c does not
return, and that this issue appeared after the commit
42fb0a1e84ff525ebe560e2baf9451ab69127e2b ("tracing/ring-buffer: Have
polling block on watermark").

This issue was also verified using a test application for poll()
and select() on per_cpu trace_pipe_raw.

There is also a bug report on this issue:
https://lore.kernel.org/all/31eb3b12-3350-90a4-a0d9-d1494db7cf74@oracle.com/

This issue occurs for the per_cpu case, which calls ring_buffer_poll_wait()
in kernel/trace/ring_buffer.c with buffer_percent > 0 and then waits until
that percentage of pages is available. The default value of buffer_percent
is 50, set in kernel/trace/trace.c. However, poll() does not return even
when the percentage-of-pages condition is met.

As a fix, rasdaemon sets buffer_percent to 0 through
/sys/kernel/debug/tracing/instances/rasdaemon/buffer_percent, so that the
task wakes up as soon as data is added to any of the specific CPU
buffers and poll() on per_cpu/cpuX/trace_pipe_raw does not block
indefinitely.

This depends on the kernel fix commit
3e46d910d8acf94e5360126593b68bf4fee4c4a1 ("tracing: Fix poll() and select()
do not work on per_cpu trace_pipe and trace_pipe_raw").

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
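
The workaround in the commit above amounts to writing "0" into the buffer_percent file of the tracing instance. A minimal sketch follows; the path is the one given in the commit message, while the helper names are hypothetical and the real rasdaemon code may differ.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Path as given in the commit message. */
#define BUFFER_PERCENT_PATH \
	"/sys/kernel/debug/tracing/instances/rasdaemon/buffer_percent"

/* Write a short string to an existing file; 0 on success, -1 on failure. */
static int write_value(const char *path, const char *value)
{
	size_t len = strlen(value);
	ssize_t n;
	int fd;

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;

	n = write(fd, value, len);
	close(fd);

	return n == (ssize_t)len ? 0 : -1;
}

/* Ask the ring buffer to wake pollers as soon as any data is available. */
static int set_buffer_percent_zero(void)
{
	return write_value(BUFFER_PERCENT_PATH, "0");
}
```

Calling set_buffer_percent_zero() requires the tracing instance to exist and sufficient privileges; on failure the caller can log a warning and continue.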
2 years agoREADME: Update instructions about how to contribute
Mauro Carvalho Chehab [Mon, 23 Jan 2023 14:29:33 +0000 (15:29 +0100)]
README: Update instructions about how to contribute

Nowadays, we're only using github in practice for development.
Make this clearer in the documentation.

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years agoMakefile.am: enable all options on make distcheck
Mauro Carvalho Chehab [Sat, 21 Jan 2023 13:06:54 +0000 (14:06 +0100)]
Makefile.am: enable all options on make distcheck

Ensure that all modules are enabled on "make distcheck".

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years agoconfigure.ac: get rid of obsolete macros
Mauro Carvalho Chehab [Sat, 21 Jan 2023 13:04:19 +0000 (14:04 +0100)]
configure.ac: get rid of obsolete macros

Use autoupdate 2.71, in order to get rid of obsoleted macros.

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years agoci.yml: add libtraceevent-dev dependency
Mauro Carvalho Chehab [Sat, 21 Jan 2023 12:41:59 +0000 (13:41 +0100)]
ci.yml: add libtraceevent-dev dependency

This is needed to build newest version of rasdaemon.

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years agoRemove the old libtrace
Mauro Carvalho Chehab [Sat, 21 Jan 2023 08:23:57 +0000 (09:23 +0100)]
Remove the old libtrace

Now that rasdaemon is using the libtraceevent library, we
can get rid of our own fork.

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years agoAdjust indentations
Mauro Carvalho Chehab [Sat, 21 Jan 2023 08:59:57 +0000 (09:59 +0100)]
Adjust indentations

With the function rename due to the usage of libtraceevent
library, adjust some indentations.

No functional changes.

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years agoConvert to use libtraceevent
Mauro Carvalho Chehab [Sat, 21 Jan 2023 08:23:57 +0000 (09:23 +0100)]
Convert to use libtraceevent

For a long time, rasdaemon used an early version of this library,
with the code embedded directly into its sources. The rationale was
that the library had not been officially released at that time, but
this has long changed.

So, instead, just use the library directly.

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years agoon_tag.yml: use a different approach to upload artifact v0.7.0
Mauro Carvalho Chehab [Sun, 22 Jan 2023 06:23:22 +0000 (07:23 +0100)]
on_tag.yml: use a different approach to upload artifact

Use my own upload release asset logic, as it is known to work
already on ZBar.

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years agoAdd a release workflow
Mauro Carvalho Chehab [Sat, 21 Jan 2023 13:49:40 +0000 (14:49 +0100)]
Add a release workflow

It should auto-fill the release information and upload
a source distribution tarball.

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years agoBump version to 0.7.0 libtrace
Mauro Carvalho Chehab [Sat, 21 Jan 2023 06:52:14 +0000 (07:52 +0100)]
Bump version to 0.7.0

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years ago.gitignore: add the auto-generated "compile" file
Mauro Carvalho Chehab [Sat, 21 Jan 2023 06:55:05 +0000 (07:55 +0100)]
.gitignore: add the auto-generated "compile" file

autoreconf is producing a compile file. Ignore it on git status.

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years agoINSTALL: update from latest version of it
Mauro Carvalho Chehab [Sat, 21 Jan 2023 06:54:30 +0000 (07:54 +0100)]
INSTALL: update from latest version of it

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years agoconfigure.ac: fix bashisms
Sam James [Thu, 29 Dec 2022 17:23:47 +0000 (17:23 +0000)]
configure.ac: fix bashisms

configure scripts need to be runnable with a POSIX-compliant /bin/sh.

On many (but not all!) systems, /bin/sh is provided by Bash, so errors
like this aren't spotted. Notably Debian defaults to /bin/sh provided
by dash which doesn't tolerate such bashisms as '=='.

This retains compatibility with bash.

Signed-off-by: Sam James <sam@gentoo.org>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2 years agolabels/asus: add ASUS TUF GAMING B450-PLUS II
dgcampea [Mon, 19 Dec 2022 18:53:13 +0000 (18:53 +0000)]
labels/asus: add ASUS TUF GAMING B450-PLUS II

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>