From a26fe287eed112b4e21e854f173c8918a6a8596d Mon Sep 17 00:00:00 2001 From: Daniel Gomez Date: Fri, 28 Mar 2025 14:28:37 +0000 Subject: [PATCH 01/16] kconfig: merge_config: use an empty file as initfile The scripts/kconfig/merge_config.sh script requires an existing $INITFILE (or the $1 argument) as a base file for merging Kconfig fragments. However, an empty $INITFILE can serve as an initial starting point, later referenced by the KCONFIG_ALLCONFIG Makefile variable if -m is not used. This variable can point to any configuration file containing preset config symbols (the merged output) as stated in Documentation/kbuild/kconfig.rst. When -m is used $INITFILE will contain just the merge output requiring the user to run make (i.e. KCONFIG_ALLCONFIG=<$INITFILE> make or make olddefconfig). Instead of failing when `$INITFILE` is missing, create an empty file and use it as the starting point for merges. Signed-off-by: Daniel Gomez Signed-off-by: Masahiro Yamada --- scripts/kconfig/merge_config.sh | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/scripts/kconfig/merge_config.sh b/scripts/kconfig/merge_config.sh index 0b7952471c18..79c09b378be8 100755 --- a/scripts/kconfig/merge_config.sh +++ b/scripts/kconfig/merge_config.sh @@ -112,8 +112,8 @@ INITFILE=$1 shift; if [ ! -r "$INITFILE" ]; then - echo "The base file '$INITFILE' does not exist. Exit." >&2 - exit 1 + echo "The base file '$INITFILE' does not exist. Creating one..." >&2 + touch "$INITFILE" fi MERGE_LIST=$* -- 2.51.0 From a7c699d090a1f3795c3271c2b399230e182db06e Mon Sep 17 00:00:00 2001 From: Uday Shankar Date: Mon, 31 Mar 2025 16:46:32 -0600 Subject: [PATCH 02/16] kbuild: rpm-pkg: build a debuginfo RPM The rpm-pkg make target currently suffers from a few issues related to debuginfo: 1. debuginfo for things built into the kernel (vmlinux) is not available in any RPM produced by make rpm-pkg. This makes using tools like systemtap against a make rpm-pkg kernel impossible. 2. debug source for the kernel is not available. This means that commands like 'disas /s' in gdb, which display source intermixed with assembly, can only print file names/line numbers which then must be painstakingly resolved to actual source in a separate editor. 3. debuginfo for modules is available, but it remains bundled with the .ko files that contain module code, in the main kernel RPM. This is a waste of space for users who do not need to debug the kernel (i.e. most users). Address all of these issues by additionally building a debuginfo RPM when the kernel configuration allows for it, in line with standard patterns followed by RPM distributors. With these changes: 1. systemtap now works (when these changes are backported to 6.11, since systemtap lags a bit behind in compatibility), as verified by the following simple test script: # stap -e 'probe kernel.function("do_sys_open").call { printf("%s\n", $$parms); }' dfd=0xffffffffffffff9c filename=0x7fe18800b160 flags=0x88800 mode=0x0 ... 2. disas /s works correctly in gdb, with source and disassembly interspersed: # gdb vmlinux --batch -ex 'disas /s blk_op_str' Dump of assembler code for function blk_op_str: block/blk-core.c: 125 { 0xffffffff814c8740 <+0>: endbr64 127 128 if (op < ARRAY_SIZE(blk_op_name) && blk_op_name[op]) 0xffffffff814c8744 <+4>: mov $0xffffffff824a7378,%rax 0xffffffff814c874b <+11>: cmp $0x23,%edi 0xffffffff814c874e <+14>: ja 0xffffffff814c8768 0xffffffff814c8750 <+16>: mov %edi,%edi 126 const char *op_str = "UNKNOWN"; 0xffffffff814c8752 <+18>: mov $0xffffffff824a7378,%rdx 127 128 if (op < ARRAY_SIZE(blk_op_name) && blk_op_name[op]) 0xffffffff814c8759 <+25>: mov -0x7dfa0160(,%rdi,8),%rax 126 const char *op_str = "UNKNOWN"; 0xffffffff814c8761 <+33>: test %rax,%rax 0xffffffff814c8764 <+36>: cmove %rdx,%rax 129 op_str = blk_op_name[op]; 130 131 return op_str; 132 } 0xffffffff814c8768 <+40>: jmp 0xffffffff81d01360 <__x86_return_thunk> End of assembler dump. 3. The size of the main kernel package goes down substantially, especially if many modules are built (quite typical). Here is a comparison of installed size of the kernel package (configured with allmodconfig, dwarf4 debuginfo, and module compression turned off) before and after this patch: # rpm -qi kernel-6.13* | grep -E '^(Version|Size)' Version : 6.13.0postpatch+ Size : 1382874089 Version : 6.13.0prepatch+ Size : 17870795887 This is a ~92% size reduction. Note that a debuginfo package can only be produced if the following configs are set: - CONFIG_DEBUG_INFO=y - CONFIG_MODULE_COMPRESS=n - CONFIG_DEBUG_INFO_SPLIT=n The first of these is obvious - we can't produce debuginfo if the build does not generate it. The second two requirements can in principle be removed, but doing so is difficult with the current approach, which uses a generic rpmbuild script find-debuginfo.sh that processes all packaged executables. If we want to remove those requirements the best path forward is likely to add some debuginfo extraction/installation logic to the modules_install target (controllable by flags). That way, it's easier to operate on modules before they're compressed, and the logic can be reused by all packaging targets. Signed-off-by: Uday Shankar Signed-off-by: Masahiro Yamada --- scripts/package/kernel.spec | 46 +++++++++++++++++++++++++++++++++++-- scripts/package/mkspec | 10 ++++++++ 2 files changed, 54 insertions(+), 2 deletions(-) diff --git a/scripts/package/kernel.spec b/scripts/package/kernel.spec index ac3e5ac01d8a..726f34e11960 100644 --- a/scripts/package/kernel.spec +++ b/scripts/package/kernel.spec @@ -2,8 +2,6 @@ %{!?_arch: %define _arch dummy} %{!?make: %define make make} %define makeflags %{?_smp_mflags} ARCH=%{ARCH} -%define __spec_install_post /usr/lib/rpm/brp-compress || : -%define debug_package %{nil} Name: kernel Summary: The Linux Kernel @@ -46,6 +44,36 @@ This package provides kernel headers and makefiles sufficient to build modules against the %{version} kernel package. %endif +%if %{with_debuginfo} +# list of debuginfo-related options taken from distribution kernel.spec +# files +%undefine _include_minidebuginfo +%undefine _find_debuginfo_dwz_opts +%undefine _unique_build_ids +%undefine _unique_debug_names +%undefine _unique_debug_srcs +%undefine _debugsource_packages +%undefine _debuginfo_subpackages +%global _find_debuginfo_opts -r +%global _missing_build_ids_terminate_build 1 +%global _no_recompute_build_ids 1 +%{debug_package} +%endif +# some (but not all) versions of rpmbuild emit %%debug_package with +# %%install. since we've already emitted it manually, that would cause +# a package redefinition error. ensure that doesn't happen +%define debug_package %{nil} + +# later, we make all modules executable so that find-debuginfo.sh strips +# them up. but they don't actually need to be executable, so remove the +# executable bit, taking care to do it _after_ find-debuginfo.sh has run +%define __spec_install_post \ + %{?__debug_package:%{__debug_install_post}} \ + %{__arch_install_post} \ + %{__os_install_post} \ + find %{buildroot}/lib/modules/%{KERNELRELEASE} -name "*.ko" -type f \\\ + | xargs --no-run-if-empty chmod u-x + %prep %setup -q -n linux cp %{SOURCE1} .config @@ -89,8 +117,22 @@ ln -fns /usr/src/kernels/%{KERNELRELEASE} %{buildroot}/lib/modules/%{KERNELRELEA echo "%exclude /lib/modules/%{KERNELRELEASE}/build" } > %{buildroot}/kernel.list +# make modules executable so that find-debuginfo.sh strips them. this +# will be undone later in %%__spec_install_post +find %{buildroot}/lib/modules/%{KERNELRELEASE} -name "*.ko" -type f \ + | xargs --no-run-if-empty chmod u+x + +%if %{with_debuginfo} +# copying vmlinux directly to the debug directory means it will not get +# stripped (but its source paths will still be collected + fixed up) +mkdir -p %{buildroot}/usr/lib/debug/lib/modules/%{KERNELRELEASE} +cp vmlinux %{buildroot}/usr/lib/debug/lib/modules/%{KERNELRELEASE} +%endif + %clean rm -rf %{buildroot} +rm -f debugfiles.list debuglinks.list debugsourcefiles.list debugsources.list \ + elfbins.list %post if [ -x /usr/bin/kernel-install ]; then diff --git a/scripts/package/mkspec b/scripts/package/mkspec index 4dc1466dfc81..c7375bfc25a9 100755 --- a/scripts/package/mkspec +++ b/scripts/package/mkspec @@ -23,6 +23,16 @@ else echo '%define with_devel 0' fi +# debuginfo package generation uses find-debuginfo.sh under the hood, +# which only works on uncompressed modules that contain debuginfo +if grep -q CONFIG_DEBUG_INFO=y include/config/auto.conf && + (! grep -q CONFIG_MODULE_COMPRESS=y include/config/auto.conf) && + (! grep -q CONFIG_DEBUG_INFO_SPLIT=y include/config/auto.conf); then +echo '%define with_debuginfo %{?_without_debuginfo: 0} %{?!_without_debuginfo: 1}' +else +echo '%define with_debuginfo 0' +fi + cat< Date: Wed, 19 Mar 2025 15:27:31 -0500 Subject: [PATCH 03/16] tools/power turbostat: Increase CPU_SUBSET_MAXCPUS to 8192 On systems with >= 1024 cpus (in my case 1152), turbostat fails with the error output: "turbostat: /sys/fs/cgroup/cpuset.cpus.effective: cpu str malformat 0-1151" A similar error appears with the use of turbostat --cpu when the inputted cpu range contains a cpu number >= 1024: # turbostat -c 1100-1151 "--cpu 1100-1151" malformed ... Both errors are caused by parse_cpu_str() reaching its limit of CPU_SUBSET_MAXCPUS. It's a good idea to limit the maximum cpu number being parsed, but 1024 is too low. For a small increase in compute and allocated memory, increasing CPU_SUBSET_MAXCPUS brings support for parsing cpu numbers >= 1024. Increase CPU_SUBSET_MAXCPUS to 8192, a common setting for CONFIG_NR_CPUS on x86_64. Signed-off-by: Justin Ernst Signed-off-by: Len Brown --- tools/power/x86/turbostat/turbostat.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/power/x86/turbostat/turbostat.c b/tools/power/x86/turbostat/turbostat.c index f29e47fe4249..218aca958923 100644 --- a/tools/power/x86/turbostat/turbostat.c +++ b/tools/power/x86/turbostat/turbostat.c @@ -1121,7 +1121,7 @@ end: int backwards_count; char *progname; -#define CPU_SUBSET_MAXCPUS 1024 /* need to use before probe... */ +#define CPU_SUBSET_MAXCPUS 8192 /* need to use before probe... */ cpu_set_t *cpu_present_set, *cpu_possible_set, *cpu_effective_set, *cpu_allowed_set, *cpu_affinity_set, *cpu_subset; size_t cpu_present_setsize, cpu_possible_setsize, cpu_effective_setsize, cpu_allowed_setsize, cpu_affinity_setsize, cpu_subset_size; #define MAX_ADDED_THREAD_COUNTERS 24 -- 2.51.0 From f729775f79a9c942c6c82ed6b44bd030afe10423 Mon Sep 17 00:00:00 2001 From: Len Brown Date: Sun, 6 Apr 2025 11:18:39 -0400 Subject: [PATCH 04/16] tools/power turbostat: report CoreThr per measurement interval The CoreThr column displays total thermal throttling events since boot time. Change it to report events during the measurement interval. This is more useful for showing a user the current conditions. Total events since boot time are still available to the user via /sys/devices/system/cpu/cpu*/thermal_throttle/* Document CoreThr on turbostat.8 Fixes: eae97e053fe30 ("turbostat: Support thermal throttle count print") Reported-by: Arjan van de Ven Signed-off-by: Len Brown Cc: Chen Yu --- tools/power/x86/turbostat/turbostat.8 | 2 ++ tools/power/x86/turbostat/turbostat.c | 2 +- 2 files changed, 3 insertions(+), 1 deletion(-) diff --git a/tools/power/x86/turbostat/turbostat.8 b/tools/power/x86/turbostat/turbostat.8 index 52d727e29ea7..144565151e1e 100644 --- a/tools/power/x86/turbostat/turbostat.8 +++ b/tools/power/x86/turbostat/turbostat.8 @@ -172,6 +172,8 @@ The system configuration dump (if --quiet is not used) is followed by statistics .PP \fBPkgTmp\fP Degrees Celsius reported by the per-package Package Thermal Monitor. .PP +\fBCoreThr\fP Core Thermal Throttling events during the measurement interval. Note that events since boot can be find in /sys/devices/system/cpu/cpu*/thermal_throttle/* +.PP \fBGFX%rc6\fP The percentage of time the GPU is in the "render C6" state, rc6, during the measurement interval. From /sys/class/drm/card0/power/rc6_residency_ms or /sys/class/drm/card0/gt/gt0/rc6_residency_ms or /sys/class/drm/card0/device/tile0/gtN/gtidle/idle_residency_ms depending on the graphics driver being used. .PP \fBGFXMHz\fP Instantaneous snapshot of what sysfs presents at the end of the measurement interval. From /sys/class/graphics/fb0/device/drm/card0/gt_cur_freq_mhz or /sys/class/drm/card0/gt_cur_freq_mhz or /sys/class/drm/card0/gt/gt0/rps_cur_freq_mhz or /sys/class/drm/card0/device/tile0/gtN/freq0/cur_freq depending on the graphics driver being used. diff --git a/tools/power/x86/turbostat/turbostat.c b/tools/power/x86/turbostat/turbostat.c index 218aca958923..70e17d4ad9b6 100644 --- a/tools/power/x86/turbostat/turbostat.c +++ b/tools/power/x86/turbostat/turbostat.c @@ -3485,7 +3485,7 @@ void delta_core(struct core_data *new, struct core_data *old) old->c6 = new->c6 - old->c6; old->c7 = new->c7 - old->c7; old->core_temp_c = new->core_temp_c; - old->core_throt_cnt = new->core_throt_cnt; + old->core_throt_cnt = new->core_throt_cnt - old->core_throt_cnt; old->mc6_us = new->mc6_us - old->mc6_us; DELTA_WRAP32(new->core_energy.raw_value, old->core_energy.raw_value); -- 2.51.0 From 3ae8508663372b93c5556a887e96ed0ca5df0711 Mon Sep 17 00:00:00 2001 From: Len Brown Date: Sun, 6 Apr 2025 12:23:22 -0400 Subject: [PATCH 05/16] tools/power turbostat: Document GNR UncMHz domain convention Document that on Intel Granite Rapids Systems, Uncore domains 0-2 are CPU domains, and uncore domains 3-4 are IO domains. Signed-off-by: Len Brown --- tools/power/x86/turbostat/turbostat.8 | 1 + 1 file changed, 1 insertion(+) diff --git a/tools/power/x86/turbostat/turbostat.8 b/tools/power/x86/turbostat/turbostat.8 index 144565151e1e..e86493880c16 100644 --- a/tools/power/x86/turbostat/turbostat.8 +++ b/tools/power/x86/turbostat/turbostat.8 @@ -205,6 +205,7 @@ The system configuration dump (if --quiet is not used) is followed by statistics \fBUncMHz\fP per-package uncore MHz, instantaneous sample. .PP \fBUMHz1.0\fP per-package uncore MHz for domain=1 and fabric_cluster=0, instantaneous sample. System summary is the average of all packages. +Intel Granite Rapids systems use domains 0-2 for CPUs, and 3-4 for IO, with cluster always 0. For the "--show" and "--hide" options, use "UncMHz" to operate on all UMHz*.* as a group. .SH TOO MUCH INFORMATION EXAMPLE By default, turbostat dumps all possible information -- a system configuration header, followed by columns for all counters. -- 2.51.0 From f8b136ef2605c1bf62020462d10e35228760aa19 Mon Sep 17 00:00:00 2001 From: Zhang Rui Date: Wed, 19 Mar 2025 08:53:07 +0800 Subject: [PATCH 06/16] tools/power turbostat: Restore GFX sysfs fflush() call Do fflush() to discard the buffered data, before each read of the graphics sysfs knobs. Fixes: ba99a4fc8c24 ("tools/power turbostat: Remove unnecessary fflush() call") Signed-off-by: Zhang Rui Signed-off-by: Len Brown --- tools/power/x86/turbostat/turbostat.c | 1 + 1 file changed, 1 insertion(+) diff --git a/tools/power/x86/turbostat/turbostat.c b/tools/power/x86/turbostat/turbostat.c index 70e17d4ad9b6..c9a34c16c7a8 100644 --- a/tools/power/x86/turbostat/turbostat.c +++ b/tools/power/x86/turbostat/turbostat.c @@ -6039,6 +6039,7 @@ int snapshot_graphics(int idx) int retval; rewind(gfx_info[idx].fp); + fflush(gfx_info[idx].fp); switch (idx) { case GFX_rc6: -- 2.51.0 From 994633894f208a0151baaee1688ab3c431912553 Mon Sep 17 00:00:00 2001 From: Len Brown Date: Sun, 6 Apr 2025 12:53:18 -0400 Subject: [PATCH 07/16] tools/power turbostat: re-factor sysfs code Probe cpuidle "sysfs" residency and counts separately, since soon we will make one disabled on, and the other disabled off. Clarify that some BIC (build-in-counters) are actually "groups". since we're about to re-name some of those groups. no functional change. Signed-off-by: Len Brown --- tools/power/x86/turbostat/turbostat.c | 31 ++++++++++++++++++--------- 1 file changed, 21 insertions(+), 10 deletions(-) diff --git a/tools/power/x86/turbostat/turbostat.c b/tools/power/x86/turbostat/turbostat.c index c9a34c16c7a8..df0391bedcde 100644 --- a/tools/power/x86/turbostat/turbostat.c +++ b/tools/power/x86/turbostat/turbostat.c @@ -273,10 +273,10 @@ struct msr_counter bic[] = { #define BIC_NMI (1ULL << 61) #define BIC_CPU_c1e (1ULL << 62) -#define BIC_TOPOLOGY (BIC_Package | BIC_Node | BIC_CoreCnt | BIC_PkgCnt | BIC_Core | BIC_CPU | BIC_Die) -#define BIC_THERMAL_PWR (BIC_CoreTmp | BIC_PkgTmp | BIC_PkgWatt | BIC_CorWatt | BIC_GFXWatt | BIC_RAMWatt | BIC_PKG__ | BIC_RAM__ | BIC_SysWatt) -#define BIC_FREQUENCY (BIC_Avg_MHz | BIC_Busy | BIC_Bzy_MHz | BIC_TSC_MHz | BIC_GFXMHz | BIC_GFXACTMHz | BIC_SAMMHz | BIC_SAMACTMHz | BIC_UNCORE_MHZ) -#define BIC_IDLE (BIC_Busy | BIC_sysfs | BIC_CPU_c1 | BIC_CPU_c3 | BIC_CPU_c6 | BIC_CPU_c7 | BIC_GFX_rc6 | BIC_Pkgpc2 | BIC_Pkgpc3 | BIC_Pkgpc6 | BIC_Pkgpc7 | BIC_Pkgpc8 | BIC_Pkgpc9 | BIC_Pkgpc10 | BIC_CPU_LPI | BIC_SYS_LPI | BIC_Mod_c6 | BIC_Totl_c0 | BIC_Any_c0 | BIC_GFX_c0 | BIC_CPUGFX | BIC_SAM_mc6 | BIC_Diec6) +#define BIC_GROUP_TOPOLOGY (BIC_Package | BIC_Node | BIC_CoreCnt | BIC_PkgCnt | BIC_Core | BIC_CPU | BIC_Die) +#define BIC_GROUP_THERMAL_PWR (BIC_CoreTmp | BIC_PkgTmp | BIC_PkgWatt | BIC_CorWatt | BIC_GFXWatt | BIC_RAMWatt | BIC_PKG__ | BIC_RAM__ | BIC_SysWatt) +#define BIC_GROUP_FREQUENCY (BIC_Avg_MHz | BIC_Busy | BIC_Bzy_MHz | BIC_TSC_MHz | BIC_GFXMHz | BIC_GFXACTMHz | BIC_SAMMHz | BIC_SAMACTMHz | BIC_UNCORE_MHZ) +#define BIC_GROUP_IDLE (BIC_Busy | BIC_sysfs | BIC_CPU_c1 | BIC_CPU_c3 | BIC_CPU_c6 | BIC_CPU_c7 | BIC_GFX_rc6 | BIC_Pkgpc2 | BIC_Pkgpc3 | BIC_Pkgpc6 | BIC_Pkgpc7 | BIC_Pkgpc8 | BIC_Pkgpc9 | BIC_Pkgpc10 | BIC_CPU_LPI | BIC_SYS_LPI | BIC_Mod_c6 | BIC_Totl_c0 | BIC_Any_c0 | BIC_GFX_c0 | BIC_CPUGFX | BIC_SAM_mc6 | BIC_Diec6) #define BIC_OTHER (BIC_IRQ | BIC_NMI | BIC_SMI | BIC_ThreadC | BIC_CoreTmp | BIC_IPC) #define BIC_DISABLED_BY_DEFAULT (BIC_USEC | BIC_TOD | BIC_APIC | BIC_X2APIC) @@ -2354,16 +2354,16 @@ unsigned long long bic_lookup(char *name_list, enum show_hide_mode mode) retval |= ~0; break; } else if (!strcmp(name_list, "topology")) { - retval |= BIC_TOPOLOGY; + retval |= BIC_GROUP_TOPOLOGY; break; } else if (!strcmp(name_list, "power")) { - retval |= BIC_THERMAL_PWR; + retval |= BIC_GROUP_THERMAL_PWR; break; } else if (!strcmp(name_list, "idle")) { - retval |= BIC_IDLE; + retval |= BIC_GROUP_IDLE; break; } else if (!strcmp(name_list, "frequency")) { - retval |= BIC_FREQUENCY; + retval |= BIC_GROUP_FREQUENCY; break; } else if (!strcmp(name_list, "other")) { retval |= BIC_OTHER; @@ -10260,7 +10260,7 @@ int is_deferred_skip(char *name) return 0; } -void probe_sysfs(void) +void probe_cpuidle_residency(void) { char path[64]; char name_buf[16]; @@ -10304,6 +10304,16 @@ void probe_sysfs(void) if (state < min_state) min_state = state; } +} + +void probe_cpuidle_counts(void) +{ + char path[64]; + char name_buf[16]; + FILE *input; + int state; + int min_state = 1024, max_state = 0; + char *sp; for (state = 10; state >= 0; --state) { @@ -10602,7 +10612,8 @@ skip_cgroup_setting: print_bootcmd(); } - probe_sysfs(); + probe_cpuidle_residency(); + probe_cpuidle_counts(); if (!getuid()) set_rlimit(); -- 2.51.0 From 6f110a5e4f9977c31ce76fefbfef6fd4eab6bfb7 Mon Sep 17 00:00:00 2001 From: Linus Torvalds Date: Sun, 6 Apr 2025 10:00:04 -0700 Subject: [PATCH 08/16] Disable SLUB_TINY for build testing ... and don't error out so hard on missing module descriptions. Before commit 6c6c1fc09de3 ("modpost: require a MODULE_DESCRIPTION()") we used to warn about missing module descriptions, but only when building with extra warnigns (ie 'W=1'). After that commit the warning became an unconditional hard error. And it turns out not all modules have been converted despite the claims to the contrary. As reported by Damian Tometzki, the slub KUnit test didn't have a module description, and apparently nobody ever really noticed. The reason nobody noticed seems to be that the slub KUnit tests get disabled by SLUB_TINY, which also ends up disabling a lot of other code, both in tests and in slub itself. And so anybody doing full build tests didn't actually see this failre. So let's disable SLUB_TINY for build-only tests, since it clearly ends up limiting build coverage. Also turn the missing module descriptions error back into a warning, but let's keep it around for non-'W=1' builds. Reported-by: Damian Tometzki Link: https://lore.kernel.org/all/01070196099fd059-e8463438-7b1b-4ec8-816d-173874be9966-000000@eu-central-1.amazonses.com/ Cc: Masahiro Yamada Cc: Jeff Johnson Fixes: 6c6c1fc09de3 ("modpost: require a MODULE_DESCRIPTION()") Signed-off-by: Linus Torvalds --- mm/Kconfig | 2 +- scripts/mod/modpost.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/Kconfig b/mm/Kconfig index d3fb3762887b..e113f713b493 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -201,7 +201,7 @@ config KVFREE_RCU_BATCHED config SLUB_TINY bool "Configure for minimal memory footprint" - depends on EXPERT + depends on EXPERT && !COMPILE_TEST select SLAB_MERGE_DEFAULT help Configures the slab allocator in a way to achieve minimal memory diff --git a/scripts/mod/modpost.c b/scripts/mod/modpost.c index 92627e8d0e16..be89921d60b6 100644 --- a/scripts/mod/modpost.c +++ b/scripts/mod/modpost.c @@ -1603,7 +1603,7 @@ static void read_symbols(const char *modname) } if (!get_modinfo(&info, "description")) - error("missing MODULE_DESCRIPTION() in %s\n", modname); + warn("missing MODULE_DESCRIPTION() in %s\n", modname); } for (sym = info.symtab_start; sym < info.symtab_stop; sym++) { -- 2.51.0 From ec4acd3166d8a7a03b059d01b9c6f11a658e833f Mon Sep 17 00:00:00 2001 From: Len Brown Date: Sun, 6 Apr 2025 14:29:57 -0400 Subject: [PATCH 09/16] tools/power turbostat: disable "cpuidle" invocation counters, by default Create "pct_idle" counter group, the sofware notion of residency so it can now be singled out, independent of other counter groups. Create "cpuidle" group, the cpuidle invocation counts. Disable "cpuidle", by default. Create "swidle" = "cpuidle" + "pct_idle". Undocument "sysfs", the old name for "swidle", but keep it working for backwards compatibilty. Create "hwidle", all the HW idle counters Modify "idle", enabled by default "idle" = "hwidle" + "pct_idle" (and now excludes "cpuidle") Signed-off-by: Len Brown --- tools/power/x86/turbostat/turbostat.8 | 12 +++++----- tools/power/x86/turbostat/turbostat.c | 34 +++++++++++++++++++++------ 2 files changed, 33 insertions(+), 13 deletions(-) diff --git a/tools/power/x86/turbostat/turbostat.8 b/tools/power/x86/turbostat/turbostat.8 index e86493880c16..b74ed916057e 100644 --- a/tools/power/x86/turbostat/turbostat.8 +++ b/tools/power/x86/turbostat/turbostat.8 @@ -100,7 +100,7 @@ The column name "all" can be used to enable all disabled-by-default built-in cou .PP \fB--show column\fP show only the specified built-in columns. May be invoked multiple times, or with a comma-separated list of column names. .PP -\fB--show CATEGORY --hide CATEGORY\fP Show and hide also accept a single CATEGORY of columns: "all", "topology", "idle", "frequency", "power", "sysfs", "other". +\fB--show CATEGORY --hide CATEGORY\fP Show and hide also accept a single CATEGORY of columns: "all", "topology", "idle", "frequency", "power", "cpuidle", "hwidle", "swidle", "other". "idle" (enabled by default), includes "hwidle" and "idle_pct". "cpuidle" (default disabled) includes cpuidle software invocation counters. "swidle" includes "cpuidle" plus "idle_pct". "hwidle" includes only hardware based idle residency counters. Older versions of turbostat used the term "sysfs" for what is now "swidle". .PP \fB--Dump\fP displays the raw counter values. .PP @@ -158,15 +158,15 @@ The system configuration dump (if --quiet is not used) is followed by statistics .PP \fBSMI\fP The number of System Management Interrupts serviced CPU during the measurement interval. While this counter is actually per-CPU, SMI are triggered on all processors, so the number should be the same for all CPUs. .PP -\fBC1, C2, C3...\fP The number times Linux requested the C1, C2, C3 idle state during the measurement interval. The system summary line shows the sum for all CPUs. These are C-state names as exported in /sys/devices/system/cpu/cpu*/cpuidle/state*/name. While their names are generic, their attributes are processor specific. They the system description section of output shows what MWAIT sub-states they are mapped to on each system. +\fBC1, C2, C3...\fP The number times Linux requested the C1, C2, C3 idle state during the measurement interval. The system summary line shows the sum for all CPUs. These are C-state names as exported in /sys/devices/system/cpu/cpu*/cpuidle/state*/name. While their names are generic, their attributes are processor specific. They the system description section of output shows what MWAIT sub-states they are mapped to on each system. These counters are in the "cpuidle" group, which is disabled, by default. .PP -\fBC1+, C2+, C3+...\fP The idle governor idle state misprediction statistics. Inidcates the number times Linux requested the C1, C2, C3 idle state during the measurement interval, but should have requested a deeper idle state (if it exists and enabled). These statistics come from the /sys/devices/system/cpu/cpu*/cpuidle/state*/below file. +\fBC1+, C2+, C3+...\fP The idle governor idle state misprediction statistics. Inidcates the number times Linux requested the C1, C2, C3 idle state during the measurement interval, but should have requested a deeper idle state (if it exists and enabled). These statistics come from the /sys/devices/system/cpu/cpu*/cpuidle/state*/below file. These counters are in the "cpuidle" group, which is disabled, by default. .PP -\fBC1-, C2-, C3-...\fP The idle governor idle state misprediction statistics. Inidcates the number times Linux requested the C1, C2, C3 idle state during the measurement interval, but should have requested a shallower idle state (if it exists and enabled). These statistics come from the /sys/devices/system/cpu/cpu*/cpuidle/state*/above file. +\fBC1-, C2-, C3-...\fP The idle governor idle state misprediction statistics. Inidcates the number times Linux requested the C1, C2, C3 idle state during the measurement interval, but should have requested a shallower idle state (if it exists and enabled). These statistics come from the /sys/devices/system/cpu/cpu*/cpuidle/state*/above file. These counters are in the "cpuidle" group, which is disabled, by default. .PP -\fBC1%, C2%, C3%\fP The residency percentage that Linux requested C1, C2, C3.... The system summary is the average of all CPUs in the system. Note that these are software, reflecting what was requested. The hardware counters reflect what was actually achieved. +\fBC1%, C2%, C3%\fP The residency percentage that Linux requested C1, C2, C3.... The system summary is the average of all CPUs in the system. Note that these are software, reflecting what was requested. The hardware counters reflect what was actually achieved. These counters are in the "pct_idle" group, which is enabled by default. .PP -\fBCPU%c1, CPU%c3, CPU%c6, CPU%c7\fP show the percentage residency in hardware core idle states. These numbers are from hardware residency counters. +\fBCPU%c1, CPU%c3, CPU%c6, CPU%c7\fP show the percentage residency in hardware core idle states. These numbers are from hardware residency counters and are in the "hwidle" group, which is enabled, by default. .PP \fBCoreTmp\fP Degrees Celsius reported by the per-core Digital Thermal Sensor. .PP diff --git a/tools/power/x86/turbostat/turbostat.c b/tools/power/x86/turbostat/turbostat.c index df0391bedcde..ab184f95cdaf 100644 --- a/tools/power/x86/turbostat/turbostat.c +++ b/tools/power/x86/turbostat/turbostat.c @@ -153,7 +153,7 @@ struct msr_counter bic[] = { { 0x0, "TSC_MHz", NULL, 0, 0, 0, NULL, 0 }, { 0x0, "IRQ", NULL, 0, 0, 0, NULL, 0 }, { 0x0, "SMI", NULL, 32, 0, FORMAT_DELTA, NULL, 0 }, - { 0x0, "sysfs", NULL, 0, 0, 0, NULL, 0 }, + { 0x0, "cpuidle", NULL, 0, 0, 0, NULL, 0 }, { 0x0, "CPU%c1", NULL, 0, 0, 0, NULL, 0 }, { 0x0, "CPU%c3", NULL, 0, 0, 0, NULL, 0 }, { 0x0, "CPU%c6", NULL, 0, 0, 0, NULL, 0 }, @@ -206,6 +206,7 @@ struct msr_counter bic[] = { { 0x0, "Sys_J", NULL, 0, 0, 0, NULL, 0 }, { 0x0, "NMI", NULL, 0, 0, 0, NULL, 0 }, { 0x0, "CPU%c1e", NULL, 0, 0, 0, NULL, 0 }, + { 0x0, "pct_idle", NULL, 0, 0, 0, NULL, 0 }, }; #define MAX_BIC (sizeof(bic) / sizeof(struct msr_counter)) @@ -219,7 +220,7 @@ struct msr_counter bic[] = { #define BIC_TSC_MHz (1ULL << 7) #define BIC_IRQ (1ULL << 8) #define BIC_SMI (1ULL << 9) -#define BIC_sysfs (1ULL << 10) +#define BIC_cpuidle (1ULL << 10) #define BIC_CPU_c1 (1ULL << 11) #define BIC_CPU_c3 (1ULL << 12) #define BIC_CPU_c6 (1ULL << 13) @@ -272,17 +273,20 @@ struct msr_counter bic[] = { #define BIC_Sys_J (1ULL << 60) #define BIC_NMI (1ULL << 61) #define BIC_CPU_c1e (1ULL << 62) +#define BIC_pct_idle (1ULL << 63) #define BIC_GROUP_TOPOLOGY (BIC_Package | BIC_Node | BIC_CoreCnt | BIC_PkgCnt | BIC_Core | BIC_CPU | BIC_Die) #define BIC_GROUP_THERMAL_PWR (BIC_CoreTmp | BIC_PkgTmp | BIC_PkgWatt | BIC_CorWatt | BIC_GFXWatt | BIC_RAMWatt | BIC_PKG__ | BIC_RAM__ | BIC_SysWatt) #define BIC_GROUP_FREQUENCY (BIC_Avg_MHz | BIC_Busy | BIC_Bzy_MHz | BIC_TSC_MHz | BIC_GFXMHz | BIC_GFXACTMHz | BIC_SAMMHz | BIC_SAMACTMHz | BIC_UNCORE_MHZ) -#define BIC_GROUP_IDLE (BIC_Busy | BIC_sysfs | BIC_CPU_c1 | BIC_CPU_c3 | BIC_CPU_c6 | BIC_CPU_c7 | BIC_GFX_rc6 | BIC_Pkgpc2 | BIC_Pkgpc3 | BIC_Pkgpc6 | BIC_Pkgpc7 | BIC_Pkgpc8 | BIC_Pkgpc9 | BIC_Pkgpc10 | BIC_CPU_LPI | BIC_SYS_LPI | BIC_Mod_c6 | BIC_Totl_c0 | BIC_Any_c0 | BIC_GFX_c0 | BIC_CPUGFX | BIC_SAM_mc6 | BIC_Diec6) +#define BIC_GROUP_HW_IDLE (BIC_Busy | BIC_CPU_c1 | BIC_CPU_c3 | BIC_CPU_c6 | BIC_CPU_c7 | BIC_GFX_rc6 | BIC_Pkgpc2 | BIC_Pkgpc3 | BIC_Pkgpc6 | BIC_Pkgpc7 | BIC_Pkgpc8 | BIC_Pkgpc9 | BIC_Pkgpc10 | BIC_CPU_LPI | BIC_SYS_LPI | BIC_Mod_c6 | BIC_Totl_c0 | BIC_Any_c0 | BIC_GFX_c0 | BIC_CPUGFX | BIC_SAM_mc6 | BIC_Diec6) +#define BIC_GROUP_SW_IDLE (BIC_Busy | BIC_cpuidle | BIC_pct_idle ) +#define BIC_GROUP_IDLE (BIC_GROUP_HW_IDLE | BIC_pct_idle) #define BIC_OTHER (BIC_IRQ | BIC_NMI | BIC_SMI | BIC_ThreadC | BIC_CoreTmp | BIC_IPC) -#define BIC_DISABLED_BY_DEFAULT (BIC_USEC | BIC_TOD | BIC_APIC | BIC_X2APIC) +#define BIC_DISABLED_BY_DEFAULT (BIC_USEC | BIC_TOD | BIC_APIC | BIC_X2APIC | BIC_cpuidle) unsigned long long bic_enabled = (0xFFFFFFFFFFFFFFFFULL & ~BIC_DISABLED_BY_DEFAULT); -unsigned long long bic_present = BIC_USEC | BIC_TOD | BIC_sysfs | BIC_APIC | BIC_X2APIC; +unsigned long long bic_present = BIC_USEC | BIC_TOD | BIC_cpuidle | BIC_pct_idle | BIC_APIC | BIC_X2APIC; #define DO_BIC(COUNTER_NAME) (bic_enabled & bic_present & COUNTER_NAME) #define DO_BIC_READ(COUNTER_NAME) (bic_present & COUNTER_NAME) @@ -2362,6 +2366,15 @@ unsigned long long bic_lookup(char *name_list, enum show_hide_mode mode) } else if (!strcmp(name_list, "idle")) { retval |= BIC_GROUP_IDLE; break; + } else if (!strcmp(name_list, "swidle")) { + retval |= BIC_GROUP_SW_IDLE; + break; + } else if (!strcmp(name_list, "sysfs")) { /* legacy compatibility */ + retval |= BIC_GROUP_SW_IDLE; + break; + } else if (!strcmp(name_list, "hwidle")) { + retval |= BIC_GROUP_HW_IDLE; + break; } else if (!strcmp(name_list, "frequency")) { retval |= BIC_GROUP_FREQUENCY; break; @@ -2372,6 +2385,7 @@ unsigned long long bic_lookup(char *name_list, enum show_hide_mode mode) } if (i == MAX_BIC) { + fprintf(stderr, "deferred %s\n", name_list); if (mode == SHOW_LIST) { deferred_add_names[deferred_add_index++] = name_list; if (deferred_add_index >= MAX_DEFERRED) { @@ -10269,6 +10283,9 @@ void probe_cpuidle_residency(void) int min_state = 1024, max_state = 0; char *sp; + if (!DO_BIC(BIC_pct_idle)) + return; + for (state = 10; state >= 0; --state) { sprintf(path, "/sys/devices/system/cpu/cpu%d/cpuidle/state%d/name", base_cpu, state); @@ -10291,7 +10308,7 @@ void probe_cpuidle_residency(void) sprintf(path, "cpuidle/state%d/time", state); - if (!DO_BIC(BIC_sysfs) && !is_deferred_add(name_buf)) + if (!DO_BIC(BIC_pct_idle) && !is_deferred_add(name_buf)) continue; if (is_deferred_skip(name_buf)) @@ -10315,6 +10332,9 @@ void probe_cpuidle_counts(void) int min_state = 1024, max_state = 0; char *sp; + if (!DO_BIC(BIC_cpuidle)) + return; + for (state = 10; state >= 0; --state) { sprintf(path, "/sys/devices/system/cpu/cpu%d/cpuidle/state%d/name", base_cpu, state); @@ -10327,7 +10347,7 @@ void probe_cpuidle_counts(void) remove_underbar(name_buf); - if (!DO_BIC(BIC_sysfs) && !is_deferred_add(name_buf)) + if (!DO_BIC(BIC_cpuidle) && !is_deferred_add(name_buf)) continue; if (is_deferred_skip(name_buf)) -- 2.51.0 From 03e00e373cab981ad808271b2650700cfa0fbda6 Mon Sep 17 00:00:00 2001 From: Len Brown Date: Sun, 6 Apr 2025 14:49:20 -0400 Subject: [PATCH 10/16] tools/power turbostat: v2025.05.06 Support up to 8192 processors Add cpuidle governor debug telemetry, disabled by default Update default output to exclude cpuidle invocation counts Bug fixes Signed-off-by: Len Brown --- tools/power/x86/turbostat/turbostat.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/power/x86/turbostat/turbostat.c b/tools/power/x86/turbostat/turbostat.c index ab184f95cdaf..1b9fdc1a7ee8 100644 --- a/tools/power/x86/turbostat/turbostat.c +++ b/tools/power/x86/turbostat/turbostat.c @@ -9594,7 +9594,7 @@ int get_and_dump_counters(void) void print_version() { - fprintf(outf, "turbostat version 2025.02.02 - Len Brown \n"); + fprintf(outf, "turbostat version 2025.04.06 - Len Brown \n"); } #define COMMAND_LINE_SIZE 2048 -- 2.51.0 From 0efdedb3358aa78102967f242379686f94315830 Mon Sep 17 00:00:00 2001 From: =?utf8?q?Thomas=20Wei=C3=9Fschuh?= Date: Wed, 2 Apr 2025 21:21:57 +0100 Subject: [PATCH 11/16] tools/include: make uapi/linux/types.h usable from assembly MIME-Version: 1.0 Content-Type: text/plain; charset=utf8 Content-Transfer-Encoding: 8bit The "real" linux/types.h UAPI header gracefully degrades to a NOOP when included from assembly code. Mirror this behaviour in the tools/ variant. Test for __ASSEMBLER__ over __ASSEMBLY__ as the former is provided by the toolchain automatically. Reported-by: Mark Brown Closes: https://lore.kernel.org/lkml/af553c62-ca2f-4956-932c-dd6e3a126f58@sirena.org.uk/ Fixes: c9fbaa879508 ("selftests: vDSO: parse_vdso: Use UAPI headers instead of libc headers") Signed-off-by: Thomas Weißschuh Link: https://patch.msgid.link/20250321-uapi-consistency-v1-1-439070118dc0@linutronix.de Signed-off-by: Mark Brown Reviewed-by: Mark Brown Signed-off-by: Linus Torvalds --- tools/include/uapi/linux/types.h | 3 +++ 1 file changed, 3 insertions(+) diff --git a/tools/include/uapi/linux/types.h b/tools/include/uapi/linux/types.h index 91fa51a9c31d..85aa327245c6 100644 --- a/tools/include/uapi/linux/types.h +++ b/tools/include/uapi/linux/types.h @@ -4,6 +4,8 @@ #include +#ifndef __ASSEMBLER__ + /* copied from linux:include/uapi/linux/types.h */ #define __bitwise typedef __u16 __bitwise __le16; @@ -20,4 +22,5 @@ typedef __u32 __bitwise __wsum; #define __aligned_be64 __be64 __attribute__((aligned(8))) #define __aligned_le64 __le64 __attribute__((aligned(8))) +#endif /* __ASSEMBLER__ */ #endif /* _UAPI_LINUX_TYPES_H */ -- 2.51.0 From 0af2f6be1b4281385b618cb86ad946eded089ac8 Mon Sep 17 00:00:00 2001 From: Linus Torvalds Date: Sun, 6 Apr 2025 13:11:33 -0700 Subject: [PATCH 12/16] Linux 6.15-rc1 --- Makefile | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/Makefile b/Makefile index e55726a71d95..38689a0c3605 100644 --- a/Makefile +++ b/Makefile @@ -1,8 +1,8 @@ # SPDX-License-Identifier: GPL-2.0 VERSION = 6 -PATCHLEVEL = 14 +PATCHLEVEL = 15 SUBLEVEL = 0 -EXTRAVERSION = +EXTRAVERSION = -rc1 NAME = Baby Opossum Posse # *DOCUMENTATION* -- 2.51.0 From 539d33455f96532ac88115c35b1769db966003c6 Mon Sep 17 00:00:00 2001 From: Colin Ian King Date: Wed, 19 Mar 2025 11:30:10 +0000 Subject: [PATCH 13/16] f2fs: remove redundant assignment to variable err The variable err is being assigned a value zero and then the following goto page_hit reassigns err a new value. The zero assignment is redundant and can be removed. Signed-off-by: Colin Ian King [Jaegeuk Kim: clean up braces and if condition, suggested by Dan Carpenter] Reviewed-by: Chao Yu Signed-off-by: Jaegeuk Kim --- fs/f2fs/node.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c index 5f15c224bf78..3f6f5e54759c 100644 --- a/fs/f2fs/node.c +++ b/fs/f2fs/node.c @@ -1494,12 +1494,10 @@ repeat: return folio; err = read_node_page(&folio->page, 0); - if (err < 0) { + if (err < 0) goto out_put_err; - } else if (err == LOCKED_PAGE) { - err = 0; + if (err == LOCKED_PAGE) goto page_hit; - } if (parent) f2fs_ra_node_pages(parent, start + 1, MAX_RA_NODE); -- 2.51.0 From e073e92789839d10a8828c13f9fe47ee517aa8e6 Mon Sep 17 00:00:00 2001 From: Chao Yu Date: Thu, 20 Mar 2025 10:22:29 +0800 Subject: [PATCH 14/16] f2fs: add a proc entry show inject stats This patch adds a proc entry named inject_stats to show total injected count for each fault type. cat /proc/fs/f2fs//inject_stats fault_type injected_count kmalloc 0 kvmalloc 0 page alloc 0 page get 0 alloc bio(obsolete) 0 alloc nid 0 orphan 0 no more block 0 too big dir depth 0 evict_inode fail 0 truncate fail 0 read IO error 0 checkpoint error 0 discard error 0 write IO error 0 slab alloc 0 dquot initialize 0 lock_op 0 invalid blkaddr 0 inconsistent blkaddr 0 no free segment 0 inconsistent footer 0 Signed-off-by: Chao Yu Signed-off-by: Jaegeuk Kim --- fs/f2fs/f2fs.h | 3 +++ fs/f2fs/super.c | 1 + fs/f2fs/sysfs.c | 22 ++++++++++++++++++++++ 3 files changed, 26 insertions(+) diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h index f1576dc6ec67..986ee5b9326d 100644 --- a/fs/f2fs/f2fs.h +++ b/fs/f2fs/f2fs.h @@ -73,6 +73,8 @@ struct f2fs_fault_info { atomic_t inject_ops; int inject_rate; unsigned int inject_type; + /* Used to account total count of injection for each type */ + unsigned int inject_count[FAULT_MAX]; }; extern const char *f2fs_fault_name[FAULT_MAX]; @@ -1902,6 +1904,7 @@ static inline bool __time_to_inject(struct f2fs_sb_info *sbi, int type, atomic_inc(&ffi->inject_ops); if (atomic_read(&ffi->inject_ops) >= ffi->inject_rate) { atomic_set(&ffi->inject_ops, 0); + ffi->inject_count[type]++; f2fs_info_ratelimited(sbi, "inject %s in %s of %pS", f2fs_fault_name[type], func, parent_func); return true; diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c index f087b2b71c89..dfe0604ab558 100644 --- a/fs/f2fs/super.c +++ b/fs/f2fs/super.c @@ -47,6 +47,7 @@ const char *f2fs_fault_name[FAULT_MAX] = { [FAULT_KVMALLOC] = "kvmalloc", [FAULT_PAGE_ALLOC] = "page alloc", [FAULT_PAGE_GET] = "page get", + [FAULT_ALLOC_BIO] = "alloc bio(obsolete)", [FAULT_ALLOC_NID] = "alloc nid", [FAULT_ORPHAN] = "orphan", [FAULT_BLOCK] = "no more block", diff --git a/fs/f2fs/sysfs.c b/fs/f2fs/sysfs.c index c69161366467..46fa94db08a8 100644 --- a/fs/f2fs/sysfs.c +++ b/fs/f2fs/sysfs.c @@ -1679,6 +1679,24 @@ static int __maybe_unused disk_map_seq_show(struct seq_file *seq, return 0; } +#ifdef CONFIG_F2FS_FAULT_INJECTION +static int __maybe_unused inject_stats_seq_show(struct seq_file *seq, + void *offset) +{ + struct super_block *sb = seq->private; + struct f2fs_sb_info *sbi = F2FS_SB(sb); + struct f2fs_fault_info *ffi = &F2FS_OPTION(sbi).fault_info; + int i; + + seq_puts(seq, "fault_type injected_count\n"); + + for (i = 0; i < FAULT_MAX; i++) + seq_printf(seq, "%-24s%-10u\n", f2fs_fault_name[i], + ffi->inject_count[i]); + return 0; +} +#endif + int __init f2fs_init_sysfs(void) { int ret; @@ -1770,6 +1788,10 @@ int f2fs_register_sysfs(struct f2fs_sb_info *sbi) discard_plist_seq_show, sb); proc_create_single_data("disk_map", 0444, sbi->s_proc, disk_map_seq_show, sb); +#ifdef CONFIG_F2FS_FAULT_INJECTION + proc_create_single_data("inject_stats", 0444, sbi->s_proc, + inject_stats_seq_show, sb); +#endif return 0; put_feature_list_kobj: kobject_put(&sbi->s_feature_list_kobj); -- 2.51.0 From 2be96c2147e25d5845c1b06dc20521ab9e0eeeb0 Mon Sep 17 00:00:00 2001 From: Chao Yu Date: Thu, 20 Mar 2025 10:22:30 +0800 Subject: [PATCH 15/16] f2fs: fix to update injection attrs according to fault_option When we update inject type via sysfs, it shows wrong rate value as below, there is a same problem when we update inject rate, fix it. Before: F2FS-fs (vdd): build fault injection attr: rate: 0, type: 0xffff F2FS-fs (vdd): build fault injection attr: rate: 1, type: 0x0 After: F2FS-fs (vdd): build fault injection type: 0x1 F2FS-fs (vdd): build fault injection rate: 1 Meanwhile, let's avoid turning on all fault types when we enable fault injection via fault_injection mount option, it will lead to shutdown filesystem or fail the mount() easily. mount -o fault_injection=4 /dev/vdd /mnt/f2fs F2FS-fs (vdd): build fault injection attr: rate: 4, type: 0x7fffff F2FS-fs (vdd): inject kmalloc in f2fs_kmalloc of f2fs_fill_super+0xbdf/0x27c0 Signed-off-by: Chao Yu Signed-off-by: Jaegeuk Kim --- fs/f2fs/checkpoint.c | 2 +- fs/f2fs/f2fs.h | 14 ++++++++++---- fs/f2fs/super.c | 26 +++++++++++++------------- fs/f2fs/sysfs.c | 4 ++-- 4 files changed, 26 insertions(+), 20 deletions(-) diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c index cf77987d0698..85b7141f0d89 100644 --- a/fs/f2fs/checkpoint.c +++ b/fs/f2fs/checkpoint.c @@ -29,7 +29,7 @@ struct kmem_cache *f2fs_inode_entry_slab; void f2fs_stop_checkpoint(struct f2fs_sb_info *sbi, bool end_io, unsigned char reason) { - f2fs_build_fault_attr(sbi, 0, 0); + f2fs_build_fault_attr(sbi, 0, 0, FAULT_ALL); if (!end_io) f2fs_flush_merged_writes(sbi); f2fs_handle_critical_error(sbi, reason); diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h index 986ee5b9326d..ca884e39a5ff 100644 --- a/fs/f2fs/f2fs.h +++ b/fs/f2fs/f2fs.h @@ -66,9 +66,14 @@ enum { FAULT_MAX, }; -#ifdef CONFIG_F2FS_FAULT_INJECTION -#define F2FS_ALL_FAULT_TYPE (GENMASK(FAULT_MAX - 1, 0)) +/* indicate which option to update */ +enum fault_option { + FAULT_RATE = 1, /* only update fault rate */ + FAULT_TYPE = 2, /* only update fault type */ + FAULT_ALL = 4, /* reset all fault injection options/stats */ +}; +#ifdef CONFIG_F2FS_FAULT_INJECTION struct f2fs_fault_info { atomic_t inject_ops; int inject_rate; @@ -4765,10 +4770,11 @@ static inline bool f2fs_need_verity(const struct inode *inode, pgoff_t idx) #ifdef CONFIG_F2FS_FAULT_INJECTION extern int f2fs_build_fault_attr(struct f2fs_sb_info *sbi, unsigned long rate, - unsigned long type); + unsigned long type, enum fault_option fo); #else static inline int f2fs_build_fault_attr(struct f2fs_sb_info *sbi, - unsigned long rate, unsigned long type) + unsigned long rate, unsigned long type, + enum fault_option fo) { return 0; } diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c index dfe0604ab558..011925ee54f8 100644 --- a/fs/f2fs/super.c +++ b/fs/f2fs/super.c @@ -68,29 +68,30 @@ const char *f2fs_fault_name[FAULT_MAX] = { }; int f2fs_build_fault_attr(struct f2fs_sb_info *sbi, unsigned long rate, - unsigned long type) + unsigned long type, enum fault_option fo) { struct f2fs_fault_info *ffi = &F2FS_OPTION(sbi).fault_info; - if (rate) { + if (fo & FAULT_ALL) { + memset(ffi, 0, sizeof(struct f2fs_fault_info)); + return 0; + } + + if (fo & FAULT_RATE) { if (rate > INT_MAX) return -EINVAL; atomic_set(&ffi->inject_ops, 0); ffi->inject_rate = (int)rate; + f2fs_info(sbi, "build fault injection rate: %lu", rate); } - if (type) { + if (fo & FAULT_TYPE) { if (type >= BIT(FAULT_MAX)) return -EINVAL; ffi->inject_type = (unsigned int)type; + f2fs_info(sbi, "build fault injection type: 0x%lx", type); } - if (!rate && !type) - memset(ffi, 0, sizeof(struct f2fs_fault_info)); - else - f2fs_info(sbi, - "build fault injection attr: rate: %lu, type: 0x%lx", - rate, type); return 0; } #endif @@ -897,8 +898,7 @@ static int parse_options(struct f2fs_sb_info *sbi, char *options, bool is_remoun case Opt_fault_injection: if (args->from && match_int(args, &arg)) return -EINVAL; - if (f2fs_build_fault_attr(sbi, arg, - F2FS_ALL_FAULT_TYPE)) + if (f2fs_build_fault_attr(sbi, arg, 0, FAULT_RATE)) return -EINVAL; set_opt(sbi, FAULT_INJECTION); break; @@ -906,7 +906,7 @@ static int parse_options(struct f2fs_sb_info *sbi, char *options, bool is_remoun case Opt_fault_type: if (args->from && match_int(args, &arg)) return -EINVAL; - if (f2fs_build_fault_attr(sbi, 0, arg)) + if (f2fs_build_fault_attr(sbi, 0, arg, FAULT_TYPE)) return -EINVAL; set_opt(sbi, FAULT_INJECTION); break; @@ -2209,7 +2209,7 @@ static void default_options(struct f2fs_sb_info *sbi, bool remount) set_opt(sbi, POSIX_ACL); #endif - f2fs_build_fault_attr(sbi, 0, 0); + f2fs_build_fault_attr(sbi, 0, 0, FAULT_ALL); } #ifdef CONFIG_QUOTA diff --git a/fs/f2fs/sysfs.c b/fs/f2fs/sysfs.c index 46fa94db08a8..3a3485622691 100644 --- a/fs/f2fs/sysfs.c +++ b/fs/f2fs/sysfs.c @@ -494,12 +494,12 @@ out: return ret; #ifdef CONFIG_F2FS_FAULT_INJECTION if (a->struct_type == FAULT_INFO_TYPE) { - if (f2fs_build_fault_attr(sbi, 0, t)) + if (f2fs_build_fault_attr(sbi, 0, t, FAULT_TYPE)) return -EINVAL; return count; } if (a->struct_type == FAULT_INFO_RATE) { - if (f2fs_build_fault_attr(sbi, t, 0)) + if (f2fs_build_fault_attr(sbi, t, 0, FAULT_RATE)) return -EINVAL; return count; } -- 2.51.0 From db03c20c0850dc8d2bcabfa54b9438f7d666c863 Mon Sep 17 00:00:00 2001 From: Chao Yu Date: Thu, 27 Mar 2025 13:56:06 +0800 Subject: [PATCH 16/16] f2fs: fix to set atomic write status more clear 1. After we start atomic write in a database file, before committing all data, we'd better not set inode w/ vfs dirty status to avoid redundant updates, instead, we only set inode w/ atomic dirty status. 2. After we commit all data, before committing metadata, we need to clear atomic dirty status, and set vfs dirty status to allow vfs flush dirty inode. Cc: Daeho Jeong Reported-by: Zhiguo Niu Signed-off-by: Chao Yu Reviewed-by: Daeho Jeong Reviewed-by: Zhiguo Niu Signed-off-by: Jaegeuk Kim --- fs/f2fs/inode.c | 4 +++- fs/f2fs/segment.c | 6 ++++++ fs/f2fs/super.c | 4 +++- 3 files changed, 12 insertions(+), 2 deletions(-) diff --git a/fs/f2fs/inode.c b/fs/f2fs/inode.c index 83f862578fc8..fa5097da7c88 100644 --- a/fs/f2fs/inode.c +++ b/fs/f2fs/inode.c @@ -34,7 +34,9 @@ void f2fs_mark_inode_dirty_sync(struct inode *inode, bool sync) if (f2fs_inode_dirtied(inode, sync)) return; - if (f2fs_is_atomic_file(inode)) + /* only atomic file w/ FI_ATOMIC_COMMITTED can be set vfs dirty */ + if (f2fs_is_atomic_file(inode) && + !is_inode_flag_set(inode, FI_ATOMIC_COMMITTED)) return; mark_inode_dirty_sync(inode); diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c index 396ef71f41e3..3536fbdc0d91 100644 --- a/fs/f2fs/segment.c +++ b/fs/f2fs/segment.c @@ -376,7 +376,13 @@ out: } else { sbi->committed_atomic_block += fi->atomic_write_cnt; set_inode_flag(inode, FI_ATOMIC_COMMITTED); + + /* + * inode may has no FI_ATOMIC_DIRTIED flag due to no write + * before commit. + */ if (is_inode_flag_set(inode, FI_ATOMIC_DIRTIED)) { + /* clear atomic dirty status and set vfs dirty status */ clear_inode_flag(inode, FI_ATOMIC_DIRTIED); f2fs_mark_inode_dirty_sync(inode, true); } diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c index 011925ee54f8..22f26871b7aa 100644 --- a/fs/f2fs/super.c +++ b/fs/f2fs/super.c @@ -1532,7 +1532,9 @@ int f2fs_inode_dirtied(struct inode *inode, bool sync) } spin_unlock(&sbi->inode_lock[DIRTY_META]); - if (!ret && f2fs_is_atomic_file(inode)) + /* if atomic write is not committed, set inode w/ atomic dirty */ + if (!ret && f2fs_is_atomic_file(inode) && + !is_inode_flag_set(inode, FI_ATOMIC_COMMITTED)) set_inode_flag(inode, FI_ATOMIC_DIRTIED); return ret; -- 2.51.0