From b0430f39de089920e3aab3f4a9c35c35110bdbea Mon Sep 17 00:00:00 2001 From: Eric Biggers Date: Thu, 23 Jan 2025 13:29:03 -0800 Subject: [PATCH 01/16] lib/crc: simplify the kconfig options for CRC implementations Make the following simplifications to the kconfig options for choosing CRC implementations for CRC32 and CRC_T10DIF: 1. Make the option to disable the arch-optimized code be visible only when CONFIG_EXPERT=y. 2. Make a single option control the inclusion of the arch-optimized code for all enabled CRC variants. 3. Make CRC32_SARWATE (a.k.a. slice-by-1 or byte-by-byte) be the only generic CRC32 implementation. The result is there is now just one option, CRC_OPTIMIZATIONS, which is default y and can be disabled only when CONFIG_EXPERT=y. Rationale: 1. Enabling the arch-optimized code is nearly always the right choice. However, people trying to build the tiniest kernel possible would find some use in disabling it. Anything we add to CRC32 is de facto unconditional, given that CRC32 gets selected by something in nearly all kernels. And unfortunately enabling the arch CRC code does not eliminate the need to build the generic CRC code into the kernel too, due to CPU feature dependencies. The size of the arch CRC code will also increase slightly over time as more CRC variants get added and more implementations targeting different instruction set extensions get added. Thus, it seems worthwhile to still provide an option to disable it, but it should be considered an expert-level tweak. 2. Considering the use case described in (1), there doesn't seem to be sufficient value in making the arch-optimized CRC code be independently configurable for different CRC variants. Note also that multiple variants were already grouped together, e.g. CONFIG_CRC32 actually enables three different variants of CRC32. 3. The bit-by-bit implementation is uselessly slow, whereas slice-by-n for n=4 and n=8 use tables that are inconveniently large: 4096 bytes and 8192 bytes respectively, compared to 1024 bytes for n=1. Higher n gives higher instruction-level parallelism, so higher n easily wins on traditional microbenchmarks on most CPUs. However, the larger tables, which are accessed randomly, can be harmful in real-world situations where the dcache may be cold or useful data may need be evicted from the dcache. Meanwhile, today most architectures have much faster CRC32 implementations using dedicated CRC32 instructions or carryless multiplication instructions anyway, which make the generic code obsolete in most cases especially on long messages. Another reason for going with n=1 is that this is already what is used by all the other CRC variants in the kernel. CRC32 was unique in having support for larger tables. But as per the above this can be considered an outdated optimization. The standardization on slice-by-1 a.k.a. CRC32_SARWATE makes much of the code in lib/crc32.c unused. A later patch will clean that up. Link: https://lore.kernel.org/r/20250123212904.118683-2-ebiggers@kernel.org Reviewed-by: Ard Biesheuvel Reviewed-by: Martin K. Petersen Signed-off-by: Eric Biggers --- lib/Kconfig | 116 +++++++--------------------------------------------- 1 file changed, 14 insertions(+), 102 deletions(-) diff --git a/lib/Kconfig b/lib/Kconfig index a78d22c6507f..e08b26e8e03f 100644 --- a/lib/Kconfig +++ b/lib/Kconfig @@ -164,34 +164,9 @@ config CRC_T10DIF config ARCH_HAS_CRC_T10DIF bool -choice - prompt "CRC-T10DIF implementation" - depends on CRC_T10DIF - default CRC_T10DIF_IMPL_ARCH if ARCH_HAS_CRC_T10DIF - default CRC_T10DIF_IMPL_GENERIC if !ARCH_HAS_CRC_T10DIF - help - This option allows you to override the default choice of CRC-T10DIF - implementation. - -config CRC_T10DIF_IMPL_ARCH - bool "Architecture-optimized" if ARCH_HAS_CRC_T10DIF - help - Use the optimized implementation of CRC-T10DIF for the selected - architecture. It is recommended to keep this enabled, as it can - greatly improve CRC-T10DIF performance. - -config CRC_T10DIF_IMPL_GENERIC - bool "Generic implementation" - help - Use the generic table-based implementation of CRC-T10DIF. Selecting - this will reduce code size slightly but can greatly reduce CRC-T10DIF - performance. - -endchoice - config CRC_T10DIF_ARCH tristate - default CRC_T10DIF if CRC_T10DIF_IMPL_ARCH + default CRC_T10DIF if ARCH_HAS_CRC_T10DIF && CRC_OPTIMIZATIONS config CRC64_ROCKSOFT tristate "CRC calculation for the Rocksoft model CRC64" @@ -214,6 +189,7 @@ config CRC32 tristate "CRC32/CRC32c functions" default y select BITREVERSE + select CRC32_SARWATE help This option is provided for the case where no in-kernel-tree modules require CRC32/CRC32c functions, but a module built outside @@ -223,87 +199,12 @@ config CRC32 config ARCH_HAS_CRC32 bool -choice - prompt "CRC32 implementation" - depends on CRC32 - default CRC32_IMPL_ARCH_PLUS_SLICEBY8 if ARCH_HAS_CRC32 - default CRC32_IMPL_SLICEBY8 if !ARCH_HAS_CRC32 - help - This option allows you to override the default choice of CRC32 - implementation. Choose the default unless you know that you need one - of the others. - -config CRC32_IMPL_ARCH_PLUS_SLICEBY8 - bool "Arch-optimized, with fallback to slice-by-8" if ARCH_HAS_CRC32 - help - Use architecture-optimized implementation of CRC32. Fall back to - slice-by-8 in cases where the arch-optimized implementation cannot be - used, e.g. if the CPU lacks support for the needed instructions. - - This is the default when an arch-optimized implementation exists. - -config CRC32_IMPL_ARCH_PLUS_SLICEBY1 - bool "Arch-optimized, with fallback to slice-by-1" if ARCH_HAS_CRC32 - help - Use architecture-optimized implementation of CRC32, but fall back to - slice-by-1 instead of slice-by-8 in order to reduce the binary size. - -config CRC32_IMPL_SLICEBY8 - bool "Slice by 8 bytes" - help - Calculate checksum 8 bytes at a time with a clever slicing algorithm. - This is much slower than the architecture-optimized implementation of - CRC32 (if the selected arch has one), but it is portable and is the - fastest implementation when no arch-optimized implementation is - available. It uses an 8KiB lookup table. Most modern processors have - enough cache to hold this table without thrashing the cache. - -config CRC32_IMPL_SLICEBY4 - bool "Slice by 4 bytes" - help - Calculate checksum 4 bytes at a time with a clever slicing algorithm. - This is a bit slower than slice by 8, but has a smaller 4KiB lookup - table. - - Only choose this option if you know what you are doing. - -config CRC32_IMPL_SLICEBY1 - bool "Slice by 1 byte (Sarwate's algorithm)" - help - Calculate checksum a byte at a time using Sarwate's algorithm. This - is not particularly fast, but has a small 1KiB lookup table. - - Only choose this option if you know what you are doing. - -config CRC32_IMPL_BIT - bool "Classic Algorithm (one bit at a time)" - help - Calculate checksum one bit at a time. This is VERY slow, but has - no lookup table. This is provided as a debugging option. - - Only choose this option if you are debugging crc32. - -endchoice - config CRC32_ARCH tristate - default CRC32 if CRC32_IMPL_ARCH_PLUS_SLICEBY8 || CRC32_IMPL_ARCH_PLUS_SLICEBY1 - -config CRC32_SLICEBY8 - bool - default y if CRC32_IMPL_SLICEBY8 || CRC32_IMPL_ARCH_PLUS_SLICEBY8 - -config CRC32_SLICEBY4 - bool - default y if CRC32_IMPL_SLICEBY4 + default CRC32 if ARCH_HAS_CRC32 && CRC_OPTIMIZATIONS config CRC32_SARWATE bool - default y if CRC32_IMPL_SLICEBY1 || CRC32_IMPL_ARCH_PLUS_SLICEBY1 - -config CRC32_BIT - bool - default y if CRC32_IMPL_BIT config CRC64 tristate "CRC64 functions" @@ -343,6 +244,17 @@ config CRC8 when they need to do cyclic redundancy check according CRC8 algorithm. Module will be called crc8. +config CRC_OPTIMIZATIONS + bool "Enable optimized CRC implementations" if EXPERT + default y + help + Disabling this option reduces code size slightly by disabling the + architecture-optimized implementations of any CRC variants that are + enabled. CRC checksumming performance may get much slower. + + Keep this enabled unless you're really trying to minimize the size of + the kernel. + config XXHASH tristate -- 2.51.0 From 5e3c1c48fac3793c173567df735890d4e29cbb64 Mon Sep 17 00:00:00 2001 From: Eric Biggers Date: Thu, 23 Jan 2025 13:29:04 -0800 Subject: [PATCH 02/16] lib/crc32: remove other generic implementations Now that we've standardized on the byte-by-byte implementation of CRC32 as the only generic implementation (see previous commit for the rationale), remove the code for the other implementations. Tested with crc_kunit. Link: https://lore.kernel.org/r/20250123212904.118683-3-ebiggers@kernel.org Reviewed-by: Ard Biesheuvel Reviewed-by: Martin K. Petersen Signed-off-by: Eric Biggers --- lib/Kconfig | 4 - lib/crc32.c | 225 ++----------------------------------------- lib/crc32defs.h | 59 ------------ lib/gen_crc32table.c | 113 ++++++---------------- 4 files changed, 40 insertions(+), 361 deletions(-) delete mode 100644 lib/crc32defs.h diff --git a/lib/Kconfig b/lib/Kconfig index e08b26e8e03f..dccb61b7d698 100644 --- a/lib/Kconfig +++ b/lib/Kconfig @@ -189,7 +189,6 @@ config CRC32 tristate "CRC32/CRC32c functions" default y select BITREVERSE - select CRC32_SARWATE help This option is provided for the case where no in-kernel-tree modules require CRC32/CRC32c functions, but a module built outside @@ -203,9 +202,6 @@ config CRC32_ARCH tristate default CRC32 if ARCH_HAS_CRC32 && CRC_OPTIMIZATIONS -config CRC32_SARWATE - bool - config CRC64 tristate "CRC64 functions" help diff --git a/lib/crc32.c b/lib/crc32.c index 47151624332e..ede6131f66fc 100644 --- a/lib/crc32.c +++ b/lib/crc32.c @@ -30,20 +30,6 @@ #include #include #include -#include -#include "crc32defs.h" - -#if CRC_LE_BITS > 8 -# define tole(x) ((__force u32) cpu_to_le32(x)) -#else -# define tole(x) (x) -#endif - -#if CRC_BE_BITS > 8 -# define tobe(x) ((__force u32) cpu_to_be32(x)) -#else -# define tobe(x) (x) -#endif #include "crc32table.h" @@ -51,157 +37,20 @@ MODULE_AUTHOR("Matt Domsch "); MODULE_DESCRIPTION("Various CRC32 calculations"); MODULE_LICENSE("GPL"); -#if CRC_LE_BITS > 8 || CRC_BE_BITS > 8 - -/* implements slicing-by-4 or slicing-by-8 algorithm */ -static inline u32 __pure -crc32_body(u32 crc, unsigned char const *buf, size_t len, const u32 (*tab)[256]) -{ -# ifdef __LITTLE_ENDIAN -# define DO_CRC(x) crc = t0[(crc ^ (x)) & 255] ^ (crc >> 8) -# define DO_CRC4 (t3[(q) & 255] ^ t2[(q >> 8) & 255] ^ \ - t1[(q >> 16) & 255] ^ t0[(q >> 24) & 255]) -# define DO_CRC8 (t7[(q) & 255] ^ t6[(q >> 8) & 255] ^ \ - t5[(q >> 16) & 255] ^ t4[(q >> 24) & 255]) -# else -# define DO_CRC(x) crc = t0[((crc >> 24) ^ (x)) & 255] ^ (crc << 8) -# define DO_CRC4 (t0[(q) & 255] ^ t1[(q >> 8) & 255] ^ \ - t2[(q >> 16) & 255] ^ t3[(q >> 24) & 255]) -# define DO_CRC8 (t4[(q) & 255] ^ t5[(q >> 8) & 255] ^ \ - t6[(q >> 16) & 255] ^ t7[(q >> 24) & 255]) -# endif - const u32 *b; - size_t rem_len; -# ifdef CONFIG_X86 - size_t i; -# endif - const u32 *t0=tab[0], *t1=tab[1], *t2=tab[2], *t3=tab[3]; -# if CRC_LE_BITS != 32 - const u32 *t4 = tab[4], *t5 = tab[5], *t6 = tab[6], *t7 = tab[7]; -# endif - u32 q; - - /* Align it */ - if (unlikely((long)buf & 3 && len)) { - do { - DO_CRC(*buf++); - } while ((--len) && ((long)buf)&3); - } - -# if CRC_LE_BITS == 32 - rem_len = len & 3; - len = len >> 2; -# else - rem_len = len & 7; - len = len >> 3; -# endif - - b = (const u32 *)buf; -# ifdef CONFIG_X86 - --b; - for (i = 0; i < len; i++) { -# else - for (--b; len; --len) { -# endif - q = crc ^ *++b; /* use pre increment for speed */ -# if CRC_LE_BITS == 32 - crc = DO_CRC4; -# else - crc = DO_CRC8; - q = *++b; - crc ^= DO_CRC4; -# endif - } - len = rem_len; - /* And the last few bytes */ - if (len) { - u8 *p = (u8 *)(b + 1) - 1; -# ifdef CONFIG_X86 - for (i = 0; i < len; i++) - DO_CRC(*++p); /* use pre increment for speed */ -# else - do { - DO_CRC(*++p); /* use pre increment for speed */ - } while (--len); -# endif - } - return crc; -#undef DO_CRC -#undef DO_CRC4 -#undef DO_CRC8 -} -#endif - - -/** - * crc32_le_generic() - Calculate bitwise little-endian Ethernet AUTODIN II - * CRC32/CRC32C - * @crc: seed value for computation. ~0 for Ethernet, sometimes 0 for other - * uses, or the previous crc32/crc32c value if computing incrementally. - * @p: pointer to buffer over which CRC32/CRC32C is run - * @len: length of buffer @p - * @tab: little-endian Ethernet table - * @polynomial: CRC32/CRC32c LE polynomial - */ -static inline u32 __pure crc32_le_generic(u32 crc, unsigned char const *p, - size_t len, const u32 (*tab)[256], - u32 polynomial) +u32 __pure crc32_le_base(u32 crc, const u8 *p, size_t len) { -#if CRC_LE_BITS == 1 - int i; - while (len--) { - crc ^= *p++; - for (i = 0; i < 8; i++) - crc = (crc >> 1) ^ ((crc & 1) ? polynomial : 0); - } -# elif CRC_LE_BITS == 2 - while (len--) { - crc ^= *p++; - crc = (crc >> 2) ^ tab[0][crc & 3]; - crc = (crc >> 2) ^ tab[0][crc & 3]; - crc = (crc >> 2) ^ tab[0][crc & 3]; - crc = (crc >> 2) ^ tab[0][crc & 3]; - } -# elif CRC_LE_BITS == 4 - while (len--) { - crc ^= *p++; - crc = (crc >> 4) ^ tab[0][crc & 15]; - crc = (crc >> 4) ^ tab[0][crc & 15]; - } -# elif CRC_LE_BITS == 8 - /* aka Sarwate algorithm */ - while (len--) { - crc ^= *p++; - crc = (crc >> 8) ^ tab[0][crc & 255]; - } -# else - crc = (__force u32) __cpu_to_le32(crc); - crc = crc32_body(crc, p, len, tab); - crc = __le32_to_cpu((__force __le32)crc); -#endif + while (len--) + crc = (crc >> 8) ^ crc32table_le[(crc & 255) ^ *p++]; return crc; } +EXPORT_SYMBOL(crc32_le_base); -#if CRC_LE_BITS == 1 -u32 __pure crc32_le_base(u32 crc, const u8 *p, size_t len) -{ - return crc32_le_generic(crc, p, len, NULL, CRC32_POLY_LE); -} -u32 __pure crc32c_le_base(u32 crc, const u8 *p, size_t len) -{ - return crc32_le_generic(crc, p, len, NULL, CRC32C_POLY_LE); -} -#else -u32 __pure crc32_le_base(u32 crc, const u8 *p, size_t len) -{ - return crc32_le_generic(crc, p, len, crc32table_le, CRC32_POLY_LE); -} u32 __pure crc32c_le_base(u32 crc, const u8 *p, size_t len) { - return crc32_le_generic(crc, p, len, crc32ctable_le, CRC32C_POLY_LE); + while (len--) + crc = (crc >> 8) ^ crc32ctable_le[(crc & 255) ^ *p++]; + return crc; } -#endif -EXPORT_SYMBOL(crc32_le_base); EXPORT_SYMBOL(crc32c_le_base); /* @@ -277,64 +126,10 @@ u32 __attribute_const__ __crc32c_le_shift(u32 crc, size_t len) EXPORT_SYMBOL(crc32_le_shift); EXPORT_SYMBOL(__crc32c_le_shift); -/** - * crc32_be_generic() - Calculate bitwise big-endian Ethernet AUTODIN II CRC32 - * @crc: seed value for computation. ~0 for Ethernet, sometimes 0 for - * other uses, or the previous crc32 value if computing incrementally. - * @p: pointer to buffer over which CRC32 is run - * @len: length of buffer @p - * @tab: big-endian Ethernet table - * @polynomial: CRC32 BE polynomial - */ -static inline u32 __pure crc32_be_generic(u32 crc, unsigned char const *p, - size_t len, const u32 (*tab)[256], - u32 polynomial) -{ -#if CRC_BE_BITS == 1 - int i; - while (len--) { - crc ^= *p++ << 24; - for (i = 0; i < 8; i++) - crc = - (crc << 1) ^ ((crc & 0x80000000) ? polynomial : - 0); - } -# elif CRC_BE_BITS == 2 - while (len--) { - crc ^= *p++ << 24; - crc = (crc << 2) ^ tab[0][crc >> 30]; - crc = (crc << 2) ^ tab[0][crc >> 30]; - crc = (crc << 2) ^ tab[0][crc >> 30]; - crc = (crc << 2) ^ tab[0][crc >> 30]; - } -# elif CRC_BE_BITS == 4 - while (len--) { - crc ^= *p++ << 24; - crc = (crc << 4) ^ tab[0][crc >> 28]; - crc = (crc << 4) ^ tab[0][crc >> 28]; - } -# elif CRC_BE_BITS == 8 - while (len--) { - crc ^= *p++ << 24; - crc = (crc << 8) ^ tab[0][crc >> 24]; - } -# else - crc = (__force u32) __cpu_to_be32(crc); - crc = crc32_body(crc, p, len, tab); - crc = __be32_to_cpu((__force __be32)crc); -# endif - return crc; -} - -#if CRC_BE_BITS == 1 -u32 __pure crc32_be_base(u32 crc, const u8 *p, size_t len) -{ - return crc32_be_generic(crc, p, len, NULL, CRC32_POLY_BE); -} -#else u32 __pure crc32_be_base(u32 crc, const u8 *p, size_t len) { - return crc32_be_generic(crc, p, len, crc32table_be, CRC32_POLY_BE); + while (len--) + crc = (crc << 8) ^ crc32table_be[(crc >> 24) ^ *p++]; + return crc; } -#endif EXPORT_SYMBOL(crc32_be_base); diff --git a/lib/crc32defs.h b/lib/crc32defs.h deleted file mode 100644 index 0c8fb5923e7e..000000000000 --- a/lib/crc32defs.h +++ /dev/null @@ -1,59 +0,0 @@ -/* SPDX-License-Identifier: GPL-2.0 */ - -/* Try to choose an implementation variant via Kconfig */ -#ifdef CONFIG_CRC32_SLICEBY8 -# define CRC_LE_BITS 64 -# define CRC_BE_BITS 64 -#endif -#ifdef CONFIG_CRC32_SLICEBY4 -# define CRC_LE_BITS 32 -# define CRC_BE_BITS 32 -#endif -#ifdef CONFIG_CRC32_SARWATE -# define CRC_LE_BITS 8 -# define CRC_BE_BITS 8 -#endif -#ifdef CONFIG_CRC32_BIT -# define CRC_LE_BITS 1 -# define CRC_BE_BITS 1 -#endif - -/* - * How many bits at a time to use. Valid values are 1, 2, 4, 8, 32 and 64. - * For less performance-sensitive, use 4 or 8 to save table size. - * For larger systems choose same as CPU architecture as default. - * This works well on X86_64, SPARC64 systems. This may require some - * elaboration after experiments with other architectures. - */ -#ifndef CRC_LE_BITS -# ifdef CONFIG_64BIT -# define CRC_LE_BITS 64 -# else -# define CRC_LE_BITS 32 -# endif -#endif -#ifndef CRC_BE_BITS -# ifdef CONFIG_64BIT -# define CRC_BE_BITS 64 -# else -# define CRC_BE_BITS 32 -# endif -#endif - -/* - * Little-endian CRC computation. Used with serial bit streams sent - * lsbit-first. Be sure to use cpu_to_le32() to append the computed CRC. - */ -#if CRC_LE_BITS > 64 || CRC_LE_BITS < 1 || CRC_LE_BITS == 16 || \ - CRC_LE_BITS & CRC_LE_BITS-1 -# error "CRC_LE_BITS must be one of {1, 2, 4, 8, 32, 64}" -#endif - -/* - * Big-endian CRC computation. Used with serial bit streams sent - * msbit-first. Be sure to use cpu_to_be32() to append the computed CRC. - */ -#if CRC_BE_BITS > 64 || CRC_BE_BITS < 1 || CRC_BE_BITS == 16 || \ - CRC_BE_BITS & CRC_BE_BITS-1 -# error "CRC_BE_BITS must be one of {1, 2, 4, 8, 32, 64}" -#endif diff --git a/lib/gen_crc32table.c b/lib/gen_crc32table.c index f755b997b967..6d03425b849e 100644 --- a/lib/gen_crc32table.c +++ b/lib/gen_crc32table.c @@ -2,30 +2,11 @@ #include #include "../include/linux/crc32poly.h" #include "../include/generated/autoconf.h" -#include "crc32defs.h" #include -#define ENTRIES_PER_LINE 4 - -#if CRC_LE_BITS > 8 -# define LE_TABLE_ROWS (CRC_LE_BITS/8) -# define LE_TABLE_SIZE 256 -#else -# define LE_TABLE_ROWS 1 -# define LE_TABLE_SIZE (1 << CRC_LE_BITS) -#endif - -#if CRC_BE_BITS > 8 -# define BE_TABLE_ROWS (CRC_BE_BITS/8) -# define BE_TABLE_SIZE 256 -#else -# define BE_TABLE_ROWS 1 -# define BE_TABLE_SIZE (1 << CRC_BE_BITS) -#endif - -static uint32_t crc32table_le[LE_TABLE_ROWS][256]; -static uint32_t crc32table_be[BE_TABLE_ROWS][256]; -static uint32_t crc32ctable_le[LE_TABLE_ROWS][256]; +static uint32_t crc32table_le[256]; +static uint32_t crc32table_be[256]; +static uint32_t crc32ctable_le[256]; /** * crc32init_le() - allocate and initialize LE table data @@ -34,25 +15,17 @@ static uint32_t crc32ctable_le[LE_TABLE_ROWS][256]; * fact that crctable[i^j] = crctable[i] ^ crctable[j]. * */ -static void crc32init_le_generic(const uint32_t polynomial, - uint32_t (*tab)[256]) +static void crc32init_le_generic(const uint32_t polynomial, uint32_t tab[256]) { unsigned i, j; uint32_t crc = 1; - tab[0][0] = 0; + tab[0] = 0; - for (i = LE_TABLE_SIZE >> 1; i; i >>= 1) { + for (i = 128; i; i >>= 1) { crc = (crc >> 1) ^ ((crc & 1) ? polynomial : 0); - for (j = 0; j < LE_TABLE_SIZE; j += 2 * i) - tab[0][i + j] = crc ^ tab[0][j]; - } - for (i = 0; i < LE_TABLE_SIZE; i++) { - crc = tab[0][i]; - for (j = 1; j < LE_TABLE_ROWS; j++) { - crc = tab[0][crc & 0xff] ^ (crc >> 8); - tab[j][i] = crc; - } + for (j = 0; j < 256; j += 2 * i) + tab[i + j] = crc ^ tab[j]; } } @@ -74,34 +47,22 @@ static void crc32init_be(void) unsigned i, j; uint32_t crc = 0x80000000; - crc32table_be[0][0] = 0; + crc32table_be[0] = 0; - for (i = 1; i < BE_TABLE_SIZE; i <<= 1) { + for (i = 1; i < 256; i <<= 1) { crc = (crc << 1) ^ ((crc & 0x80000000) ? CRC32_POLY_BE : 0); for (j = 0; j < i; j++) - crc32table_be[0][i + j] = crc ^ crc32table_be[0][j]; - } - for (i = 0; i < BE_TABLE_SIZE; i++) { - crc = crc32table_be[0][i]; - for (j = 1; j < BE_TABLE_ROWS; j++) { - crc = crc32table_be[0][(crc >> 24) & 0xff] ^ (crc << 8); - crc32table_be[j][i] = crc; - } + crc32table_be[i + j] = crc ^ crc32table_be[j]; } } -static void output_table(uint32_t (*table)[256], int rows, int len, char *trans) +static void output_table(const uint32_t table[256]) { - int i, j; - - for (j = 0 ; j < rows; j++) { - printf("{"); - for (i = 0; i < len - 1; i++) { - if (i % ENTRIES_PER_LINE == 0) - printf("\n"); - printf("%s(0x%8.8xL), ", trans, table[j][i]); - } - printf("%s(0x%8.8xL)},\n", trans, table[j][len - 1]); + int i; + + for (i = 0; i < 256; i += 4) { + printf("\t0x%08x, 0x%08x, 0x%08x, 0x%08x,\n", + table[i], table[i + 1], table[i + 2], table[i + 3]); } } @@ -109,34 +70,20 @@ int main(int argc, char** argv) { printf("/* this file is generated - do not edit */\n\n"); - if (CRC_LE_BITS > 1) { - crc32init_le(); - printf("static const u32 ____cacheline_aligned " - "crc32table_le[%d][%d] = {", - LE_TABLE_ROWS, LE_TABLE_SIZE); - output_table(crc32table_le, LE_TABLE_ROWS, - LE_TABLE_SIZE, "tole"); - printf("};\n"); - } + crc32init_le(); + printf("static const u32 ____cacheline_aligned crc32table_le[256] = {\n"); + output_table(crc32table_le); + printf("};\n"); - if (CRC_BE_BITS > 1) { - crc32init_be(); - printf("static const u32 ____cacheline_aligned " - "crc32table_be[%d][%d] = {", - BE_TABLE_ROWS, BE_TABLE_SIZE); - output_table(crc32table_be, LE_TABLE_ROWS, - BE_TABLE_SIZE, "tobe"); - printf("};\n"); - } - if (CRC_LE_BITS > 1) { - crc32cinit_le(); - printf("static const u32 ____cacheline_aligned " - "crc32ctable_le[%d][%d] = {", - LE_TABLE_ROWS, LE_TABLE_SIZE); - output_table(crc32ctable_le, LE_TABLE_ROWS, - LE_TABLE_SIZE, "tole"); - printf("};\n"); - } + crc32init_be(); + printf("static const u32 ____cacheline_aligned crc32table_be[256] = {\n"); + output_table(crc32table_be); + printf("};\n"); + + crc32cinit_le(); + printf("static const u32 ____cacheline_aligned crc32ctable_le[256] = {\n"); + output_table(crc32ctable_le); + printf("};\n"); return 0; } -- 2.51.0 From 1c7b17cf0594f33c898004ac1b5576c032f266e2 Mon Sep 17 00:00:00 2001 From: liuye Date: Tue, 19 Nov 2024 14:08:42 +0800 Subject: [PATCH 03/16] mm/vmscan: fix hard LOCKUP in function isolate_lru_folios MIME-Version: 1.0 Content-Type: text/plain; charset=utf8 Content-Transfer-Encoding: 8bit This fixes the following hard lockup in isolate_lru_folios() during memory reclaim. If the LRU mostly contains ineligible folios this may trigger watchdog. watchdog: Watchdog detected hard LOCKUP on cpu 173 RIP: 0010:native_queued_spin_lock_slowpath+0x255/0x2a0 Call Trace: _raw_spin_lock_irqsave+0x31/0x40 folio_lruvec_lock_irqsave+0x5f/0x90 folio_batch_move_lru+0x91/0x150 lru_add_drain_per_cpu+0x1c/0x40 process_one_work+0x17d/0x350 worker_thread+0x27b/0x3a0 kthread+0xe8/0x120 ret_from_fork+0x34/0x50 ret_from_fork_asm+0x1b/0x30 lruvec->lru_lock owner: PID: 2865 TASK: ffff888139214d40 CPU: 40 COMMAND: "kswapd0" #0 [fffffe0000945e60] crash_nmi_callback at ffffffffa567a555 #1 [fffffe0000945e68] nmi_handle at ffffffffa563b171 #2 [fffffe0000945eb0] default_do_nmi at ffffffffa6575920 #3 [fffffe0000945ed0] exc_nmi at ffffffffa6575af4 #4 [fffffe0000945ef0] end_repeat_nmi at ffffffffa6601dde [exception RIP: isolate_lru_folios+403] RIP: ffffffffa597df53 RSP: ffffc90006fb7c28 RFLAGS: 00000002 RAX: 0000000000000001 RBX: ffffc90006fb7c60 RCX: ffffea04a2196f88 RDX: ffffc90006fb7c60 RSI: ffffc90006fb7c60 RDI: ffffea04a2197048 RBP: ffff88812cbd3010 R8: ffffea04a2197008 R9: 0000000000000001 R10: 0000000000000000 R11: 0000000000000001 R12: ffffea04a2197008 R13: ffffea04a2197048 R14: ffffc90006fb7de8 R15: 0000000003e3e937 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #5 [ffffc90006fb7c28] isolate_lru_folios at ffffffffa597df53 #6 [ffffc90006fb7cf8] shrink_active_list at ffffffffa597f788 #7 [ffffc90006fb7da8] balance_pgdat at ffffffffa5986db0 #8 [ffffc90006fb7ec0] kswapd at ffffffffa5987354 #9 [ffffc90006fb7ef8] kthread at ffffffffa5748238 crash> Scenario: User processe are requesting a large amount of memory and keep page active. Then a module continuously requests memory from ZONE_DMA32 area. Memory reclaim will be triggered due to ZONE_DMA32 watermark alarm reached. However pages in the LRU(active_anon) list are mostly from the ZONE_NORMAL area. Reproduce: Terminal 1: Construct to continuously increase pages active(anon). mkdir /tmp/memory mount -t tmpfs -o size=1024000M tmpfs /tmp/memory dd if=/dev/zero of=/tmp/memory/block bs=4M tail /tmp/memory/block Terminal 2: vmstat -a 1 active will increase. procs ---memory--- ---swap-- ---io---- -system-- ---cpu--- ... r b swpd free inact active si so bi bo 1 0 0 1445623076 45898836 83646008 0 0 0 1 0 0 1445623076 43450228 86094616 0 0 0 1 0 0 1445623076 41003480 88541364 0 0 0 1 0 0 1445623076 38557088 90987756 0 0 0 1 0 0 1445623076 36109688 93435156 0 0 0 1 0 0 1445619552 33663256 95881632 0 0 0 1 0 0 1445619804 31217140 98327792 0 0 0 1 0 0 1445619804 28769988 100774944 0 0 0 1 0 0 1445619804 26322348 103222584 0 0 0 1 0 0 1445619804 23875592 105669340 0 0 0 cat /proc/meminfo | head Active(anon) increase. MemTotal: 1579941036 kB MemFree: 1445618500 kB MemAvailable: 1453013224 kB Buffers: 6516 kB Cached: 128653956 kB SwapCached: 0 kB Active: 118110812 kB Inactive: 11436620 kB Active(anon): 115345744 kB Inactive(anon): 945292 kB When the Active(anon) is 115345744 kB, insmod module triggers the ZONE_DMA32 watermark. perf record -e vmscan:mm_vmscan_lru_isolate -aR perf script isolate_mode=0 classzone=1 order=1 nr_requested=32 nr_scanned=2 nr_skipped=2 nr_taken=0 lru=active_anon isolate_mode=0 classzone=1 order=1 nr_requested=32 nr_scanned=0 nr_skipped=0 nr_taken=0 lru=active_anon isolate_mode=0 classzone=1 order=0 nr_requested=32 nr_scanned=28835844 nr_skipped=28835844 nr_taken=0 lru=active_anon isolate_mode=0 classzone=1 order=1 nr_requested=32 nr_scanned=28835844 nr_skipped=28835844 nr_taken=0 lru=active_anon isolate_mode=0 classzone=1 order=0 nr_requested=32 nr_scanned=29 nr_skipped=29 nr_taken=0 lru=active_anon isolate_mode=0 classzone=1 order=0 nr_requested=32 nr_scanned=0 nr_skipped=0 nr_taken=0 lru=active_anon See nr_scanned=28835844. 28835844 * 4k = 115343376KB approximately equal to 115345744 kB. If increase Active(anon) to 1000G then insmod module triggers the ZONE_DMA32 watermark. hard lockup will occur. In my device nr_scanned = 0000000003e3e937 when hard lockup. Convert to memory size 0x0000000003e3e937 * 4KB = 261072092 KB. [ffffc90006fb7c28] isolate_lru_folios at ffffffffa597df53 ffffc90006fb7c30: 0000000000000020 0000000000000000 ffffc90006fb7c40: ffffc90006fb7d40 ffff88812cbd3000 ffffc90006fb7c50: ffffc90006fb7d30 0000000106fb7de8 ffffc90006fb7c60: ffffea04a2197008 ffffea0006ed4a48 ffffc90006fb7c70: 0000000000000000 0000000000000000 ffffc90006fb7c80: 0000000000000000 0000000000000000 ffffc90006fb7c90: 0000000000000000 0000000000000000 ffffc90006fb7ca0: 0000000000000000 0000000003e3e937 ffffc90006fb7cb0: 0000000000000000 0000000000000000 ffffc90006fb7cc0: 8d7c0b56b7874b00 ffff88812cbd3000 About the Fixes: Why did it take eight years to be discovered? The problem requires the following conditions to occur: 1. The device memory should be large enough. 2. Pages in the LRU(active_anon) list are mostly from the ZONE_NORMAL area. 3. The memory in ZONE_DMA32 needs to reach the watermark. If the memory is not large enough, or if the usage design of ZONE_DMA32 area memory is reasonable, this problem is difficult to detect. notes: The problem is most likely to occur in ZONE_DMA32 and ZONE_NORMAL, but other suitable scenarios may also trigger the problem. Link: https://lkml.kernel.org/r/20241119060842.274072-1-liuye@kylinos.cn Fixes: b2e18757f2c9 ("mm, vmscan: begin reclaiming pages on a per-node basis") Signed-off-by: liuye Cc: Hugh Dickins Cc: Mel Gorman Cc: Yang Shi Signed-off-by: Andrew Morton --- include/linux/swap.h | 1 + mm/vmscan.c | 6 +++++- 2 files changed, 6 insertions(+), 1 deletion(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index a5f475335aea..b13b72645db3 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -222,6 +222,7 @@ enum { }; #define SWAP_CLUSTER_MAX 32UL +#define SWAP_CLUSTER_MAX_SKIPPED (SWAP_CLUSTER_MAX << 10) #define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX /* Bit flag in swap_map */ diff --git a/mm/vmscan.c b/mm/vmscan.c index 683ec56d4f60..5b2f12425b28 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1692,6 +1692,7 @@ static unsigned long isolate_lru_folios(unsigned long nr_to_scan, unsigned long nr_skipped[MAX_NR_ZONES] = { 0, }; unsigned long skipped = 0; unsigned long scan, total_scan, nr_pages; + unsigned long max_nr_skipped = 0; LIST_HEAD(folios_skipped); total_scan = 0; @@ -1706,9 +1707,12 @@ static unsigned long isolate_lru_folios(unsigned long nr_to_scan, nr_pages = folio_nr_pages(folio); total_scan += nr_pages; - if (folio_zonenum(folio) > sc->reclaim_idx) { + /* Using max_nr_skipped to prevent hard LOCKUP*/ + if (max_nr_skipped < SWAP_CLUSTER_MAX_SKIPPED && + (folio_zonenum(folio) > sc->reclaim_idx)) { nr_skipped[folio_zonenum(folio)] += nr_pages; move_to = &folios_skipped; + max_nr_skipped++; goto move; } -- 2.51.0 From 6e8e04291d81867680552b49f40fa5c746b641d8 Mon Sep 17 00:00:00 2001 From: Hyeonggon Yoo <42.hyeyoo@gmail.com> Date: Tue, 28 Jan 2025 08:16:31 +0900 Subject: [PATCH 04/16] mm/zsmalloc: add __maybe_unused attribute for is_first_zpdesc() Commit c1b3bb73d55e ("mm/zsmalloc: use zpdesc in trylock_zspage()/lock_zspage()") introduces is_first_zpdesc() function. However, the function is only used when CONFIG_DEBUG_VM=y. When building with LLVM=1 and W=1 option, the following warning is generated: $ make -j12 W=1 LLVM=1 mm/zsmalloc.o mm/zsmalloc.c:455:20: error: function 'is_first_zpdesc' is not needed and will not be emitted [-Werror,-Wunneeded-internal-declaration] 455 | static inline bool is_first_zpdesc(struct zpdesc *zpdesc) | ^~~~~~~~~~~~~~~ 1 error generated. Fix the warning by adding __maybe_unused attribute to the function. No functional change intended. Link: https://lkml.kernel.org/r/20250127231631.4363-1-42.hyeyoo@gmail.com Fixes: c1b3bb73d55e ("mm/zsmalloc: use zpdesc in trylock_zspage()/lock_zspage()") Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reported-by: kernel test robot Closes: https://lore.kernel.org/oe-kbuild-all/202501240958.4ILzuBrH-lkp@intel.com/ Cc: Alex Shi Cc: Minchan Kim Cc: Sergey Senozhatsky Signed-off-by: Andrew Morton --- mm/zsmalloc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c index 817626a351f8..6d0e47f7ae33 100644 --- a/mm/zsmalloc.c +++ b/mm/zsmalloc.c @@ -452,7 +452,7 @@ static DEFINE_PER_CPU(struct mapping_area, zs_map_area) = { .lock = INIT_LOCAL_LOCK(lock), }; -static inline bool is_first_zpdesc(struct zpdesc *zpdesc) +static inline bool __maybe_unused is_first_zpdesc(struct zpdesc *zpdesc) { return PagePrivate(zpdesc_page(zpdesc)); } -- 2.51.0 From f921da2c34692dfec5f72b5ae347b1bea22bb369 Mon Sep 17 00:00:00 2001 From: Heming Zhao Date: Tue, 21 Jan 2025 19:22:03 +0800 Subject: [PATCH 05/16] ocfs2: fix incorrect CPU endianness conversion causing mount failure Commit 23aab037106d ("ocfs2: fix UBSAN warning in ocfs2_verify_volume()") introduced a regression bug. The blksz_bits value is already converted to CPU endian in the previous code; therefore, the code shouldn't use le32_to_cpu() anymore. Link: https://lkml.kernel.org/r/20250121112204.12834-1-heming.zhao@suse.com Fixes: 23aab037106d ("ocfs2: fix UBSAN warning in ocfs2_verify_volume()") Signed-off-by: Heming Zhao Reviewed-by: Joseph Qi Cc: Mark Fasheh Cc: Joel Becker Cc: Junxiao Bi Cc: Changwei Ge Cc: Jun Piao Cc: Signed-off-by: Andrew Morton --- fs/ocfs2/super.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c index e0b91dbaa0ac..8bb5022f3082 100644 --- a/fs/ocfs2/super.c +++ b/fs/ocfs2/super.c @@ -2285,7 +2285,7 @@ static int ocfs2_verify_volume(struct ocfs2_dinode *di, mlog(ML_ERROR, "found superblock with incorrect block " "size bits: found %u, should be 9, 10, 11, or 12\n", blksz_bits); - } else if ((1 << le32_to_cpu(blksz_bits)) != blksz) { + } else if ((1 << blksz_bits) != blksz) { mlog(ML_ERROR, "found superblock with incorrect block " "size: found %u, should be %u\n", 1 << blksz_bits, blksz); } else if (le16_to_cpu(di->id2.i_super.s_major_rev_level) != -- 2.51.0 From a479b078fddb0ad7f9e3c6da22d9cf8f2b5c7799 Mon Sep 17 00:00:00 2001 From: Li Zhijian Date: Fri, 10 Jan 2025 20:21:32 +0800 Subject: [PATCH 06/16] mm/vmscan: accumulate nr_demoted for accurate demotion statistics In shrink_folio_list(), demote_folio_list() can be called 2 times. Currently stat->nr_demoted will only store the last nr_demoted( the later nr_demoted is always zero, the former nr_demoted will get lost), as a result number of demoted pages is not accurate. Accumulate the nr_demoted count across multiple calls to demote_folio_list(), ensuring accurate reporting of demotion statistics. [lizhijian@fujitsu.com: introduce local nr_demoted to fix nr_reclaimed double counting] Link: https://lkml.kernel.org/r/20250111015253.425693-1-lizhijian@fujitsu.com Link: https://lkml.kernel.org/r/20250110122133.423481-1-lizhijian@fujitsu.com Fixes: f77f0c751478 ("mm,memcg: provide per-cgroup counters for NUMA balancing operations") Signed-off-by: Li Zhijian Acked-by: Kaiyang Zhao Tested-by: Donet Tom Reviewed-by: Donet Tom Cc: Signed-off-by: Andrew Morton --- mm/vmscan.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 5b2f12425b28..c767d71c43d7 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1086,7 +1086,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, struct folio_batch free_folios; LIST_HEAD(ret_folios); LIST_HEAD(demote_folios); - unsigned int nr_reclaimed = 0; + unsigned int nr_reclaimed = 0, nr_demoted = 0; unsigned int pgactivate = 0; bool do_demote_pass; struct swap_iocb *plug = NULL; @@ -1550,8 +1550,9 @@ keep: /* 'folio_list' is always empty here */ /* Migrate folios selected for demotion */ - stat->nr_demoted = demote_folio_list(&demote_folios, pgdat); - nr_reclaimed += stat->nr_demoted; + nr_demoted = demote_folio_list(&demote_folios, pgdat); + nr_reclaimed += nr_demoted; + stat->nr_demoted += nr_demoted; /* Folios that could not be demoted are still in @demote_folios */ if (!list_empty(&demote_folios)) { /* Folios which weren't demoted go back on @folio_list */ -- 2.51.0 From 4ebc417ef9cb34010a71270421fe320ec5d88aa2 Mon Sep 17 00:00:00 2001 From: Jan Kiszka Date: Fri, 10 Jan 2025 11:36:33 +0100 Subject: [PATCH 07/16] scripts/gdb: fix aarch64 userspace detection in get_current_task At least recent gdb releases (seen with 14.2) return SP_EL0 as signed long which lets the right-shift always return 0. Link: https://lkml.kernel.org/r/dcd2fabc-9131-4b48-8419-6444e2d67454@siemens.com Signed-off-by: Jan Kiszka Cc: Barry Song Cc: Kieran Bingham Cc: Signed-off-by: Andrew Morton --- scripts/gdb/linux/cpus.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/scripts/gdb/linux/cpus.py b/scripts/gdb/linux/cpus.py index 2f11c4f9c345..13eb8b3901b8 100644 --- a/scripts/gdb/linux/cpus.py +++ b/scripts/gdb/linux/cpus.py @@ -167,7 +167,7 @@ def get_current_task(cpu): var_ptr = gdb.parse_and_eval("&pcpu_hot.current_task") return per_cpu(var_ptr, cpu).dereference() elif utils.is_target_arch("aarch64"): - current_task_addr = gdb.parse_and_eval("$SP_EL0") + current_task_addr = gdb.parse_and_eval("(unsigned long)$SP_EL0") if (current_task_addr >> 63) != 0: current_task = current_task_addr.cast(task_ptr_type) return current_task.dereference() -- 2.51.0 From bc404701349fbd3d94e389a06ccfc0948129ab24 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Thu, 23 Jan 2025 23:13:44 +0000 Subject: [PATCH 08/16] MAINTAINERS: mailmap: update Yosry Ahmed's email address Moving to a linux.dev email address. Link: https://lkml.kernel.org/r/20250123231344.817358-1-yosry.ahmed@linux.dev Signed-off-by: Yosry Ahmed Cc: Chengming Zhou Cc: Johannes Weiner Cc: Nhat Pham Signed-off-by: Andrew Morton --- .mailmap | 1 + MAINTAINERS | 2 +- 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/.mailmap b/.mailmap index 17dd8eb2630e..b2a0050b1397 100644 --- a/.mailmap +++ b/.mailmap @@ -761,6 +761,7 @@ Wolfram Sang Yakir Yang Yanteng Si Ying Huang +Yosry Ahmed Yusuke Goda Zack Rusin Zhu Yanjun diff --git a/MAINTAINERS b/MAINTAINERS index bc8ce7af3303..d269d3c6e317 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -26213,7 +26213,7 @@ K: zstd ZSWAP COMPRESSED SWAP CACHING M: Johannes Weiner -M: Yosry Ahmed +M: Yosry Ahmed M: Nhat Pham R: Chengming Zhou L: linux-mm@kvack.org -- 2.51.0 From c3d8ced37e6f724cdcc5382b40858a2136c47591 Mon Sep 17 00:00:00 2001 From: Hamza Mahfooz Date: Mon, 20 Jan 2025 15:56:59 -0500 Subject: [PATCH 09/16] mailmap: add an entry for Hamza Mahfooz Map my previous work email to my current one. Link: https://lkml.kernel.org/r/20250120205659.139027-1-hamzamahfooz@linux.microsoft.com Signed-off-by: Hamza Mahfooz Cc: David S. Miller Cc: Hans verkuil Cc: Matthieu Baerts Signed-off-by: Andrew Morton --- .mailmap | 1 + 1 file changed, 1 insertion(+) diff --git a/.mailmap b/.mailmap index b2a0050b1397..8d721d390e9d 100644 --- a/.mailmap +++ b/.mailmap @@ -261,6 +261,7 @@ Guo Ren Guru Das Srinagesh Gustavo Padovan Gustavo Padovan +Hamza Mahfooz Hanjun Guo Hans Verkuil Hans Verkuil -- 2.51.0 From 488b5b9eca68497b533ced059be5eff19578bbca Mon Sep 17 00:00:00 2001 From: Catalin Marinas Date: Mon, 27 Jan 2025 18:42:33 +0000 Subject: [PATCH 10/16] mm: kmemleak: fix upper boundary check for physical address objects Memblock allocations are registered by kmemleak separately, based on their physical address. During the scanning stage, it checks whether an object is within the min_low_pfn and max_low_pfn boundaries and ignores it otherwise. With the recent addition of __percpu pointer leak detection (commit 6c99d4eb7c5e ("kmemleak: enable tracking for percpu pointers")), kmemleak started reporting leaks in setup_zone_pageset() and setup_per_cpu_pageset(). These were caused by the node_data[0] object (initialised in alloc_node_data()) ending on the PFN_PHYS(max_low_pfn) boundary. The non-strict upper boundary check introduced by commit 84c326299191 ("mm: kmemleak: check physical address when scan") causes the pg_data_t object to be ignored (not scanned) and the __percpu pointers it contains to be reported as leaks. Make the max_low_pfn upper boundary check strict when deciding whether to ignore a physical address object and not scan it. Link: https://lkml.kernel.org/r/20250127184233.2974311-1-catalin.marinas@arm.com Fixes: 84c326299191 ("mm: kmemleak: check physical address when scan") Signed-off-by: Catalin Marinas Reported-by: Jakub Kicinski Tested-by: Matthieu Baerts (NGI0) Cc: Patrick Wang Cc: [6.0.x] Signed-off-by: Andrew Morton --- mm/kmemleak.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/kmemleak.c b/mm/kmemleak.c index 982bb5ef3233..c6ed68604136 100644 --- a/mm/kmemleak.c +++ b/mm/kmemleak.c @@ -1689,7 +1689,7 @@ static void kmemleak_scan(void) unsigned long phys = object->pointer; if (PHYS_PFN(phys) < min_low_pfn || - PHYS_PFN(phys + object->size) >= max_low_pfn) + PHYS_PFN(phys + object->size) > max_low_pfn) __paint_it(object, KMEMLEAK_BLACK); } -- 2.51.0 From 4c80187001d3e2876dfe7e011b9eac3b6270156f Mon Sep 17 00:00:00 2001 From: Bruno Faccini Date: Mon, 27 Jan 2025 09:16:23 -0800 Subject: [PATCH 11/16] mm/fake-numa: handle cases with no SRAT info MIME-Version: 1.0 Content-Type: text/plain; charset=utf8 Content-Transfer-Encoding: 8bit Handle more gracefully cases where no SRAT information is available, like in VMs with no Numa support, and allow fake-numa configuration to complete successfully in these cases Link: https://lkml.kernel.org/r/20250127171623.1523171-1-bfaccini@nvidia.com Fixes: 63db8170bf34 (“mm/fake-numa: allow later numa node hotplug”) Signed-off-by: Bruno Faccini Cc: David Hildenbrand Cc: Hyeonggon Yoo Cc: John Hubbard Cc: Len Brown Cc: "Mike Rapoport (IBM)" Cc: "Rafael J. Wysocki" Cc: Zi Yan Signed-off-by: Andrew Morton --- drivers/acpi/numa/srat.c | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/drivers/acpi/numa/srat.c b/drivers/acpi/numa/srat.c index 59fffe34c9d0..00ac0d7bb8c9 100644 --- a/drivers/acpi/numa/srat.c +++ b/drivers/acpi/numa/srat.c @@ -95,9 +95,13 @@ int __init fix_pxm_node_maps(int max_nid) int i, j, index = -1, count = 0; nodemask_t nodes_to_enable; - if (numa_off || srat_disabled()) + if (numa_off) return -1; + /* no or incomplete node/PXM mapping set, nothing to do */ + if (srat_disabled()) + return 0; + /* find fake nodes PXM mapping */ for (i = 0; i < MAX_NUMNODES; i++) { if (node_to_pxm_map[i] != PXM_INVAL) { @@ -117,6 +121,11 @@ int __init fix_pxm_node_maps(int max_nid) } } } + if (index == -1) { + pr_debug("No node/PXM mapping has been set\n"); + /* nothing more to be done */ + return 0; + } if (WARN(index != max_nid, "%d max nid when expected %d\n", index, max_nid)) return -1; -- 2.51.0 From 64c37e134b120fb462fb4a80694bfb8e7be77b14 Mon Sep 17 00:00:00 2001 From: "Liam R. Howlett" Date: Mon, 27 Jan 2025 12:02:21 -0500 Subject: [PATCH 12/16] kernel: be more careful about dup_mmap() failures and uprobe registering If a memory allocation fails during dup_mmap(), the maple tree can be left in an unsafe state for other iterators besides the exit path. All the locks are dropped before the exit_mmap() call (in mm/mmap.c), but the incomplete mm_struct can be reached through (at least) the rmap finding the vmas which have a pointer back to the mm_struct. Up to this point, there have been no issues with being able to find an mm_struct that was only partially initialised. Syzbot was able to make the incomplete mm_struct fail with recent forking changes, so it has been proven unsafe to use the mm_struct that hasn't been initialised, as referenced in the link below. Although 8ac662f5da19f ("fork: avoid inappropriate uprobe access to invalid mm") fixed the uprobe access, it does not completely remove the race. This patch sets the MMF_OOM_SKIP to avoid the iteration of the vmas on the oom side (even though this is extremely unlikely to be selected as an oom victim in the race window), and sets MMF_UNSTABLE to avoid other potential users from using a partially initialised mm_struct. When registering vmas for uprobe, skip the vmas in an mm that is marked unstable. Modifying a vma in an unstable mm may cause issues if the mm isn't fully initialised. Link: https://lore.kernel.org/all/6756d273.050a0220.2477f.003d.GAE@google.com/ Link: https://lkml.kernel.org/r/20250127170221.1761366-1-Liam.Howlett@oracle.com Fixes: d24062914837 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()") Signed-off-by: Liam R. Howlett Reviewed-by: Lorenzo Stoakes Cc: Oleg Nesterov Cc: Masami Hiramatsu Cc: Jann Horn Cc: Peter Zijlstra Cc: Michal Hocko Cc: Peng Zhang Cc: Matthew Wilcox Signed-off-by: Andrew Morton --- kernel/events/uprobes.c | 4 ++++ kernel/fork.c | 17 ++++++++++++++--- 2 files changed, 18 insertions(+), 3 deletions(-) diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index e421a5f2ec7d..2ca797cbe465 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -28,6 +28,7 @@ #include #include #include +#include /* check_stable_address_space */ #include @@ -1260,6 +1261,9 @@ register_for_each_vma(struct uprobe *uprobe, struct uprobe_consumer *new) * returns NULL in find_active_uprobe_rcu(). */ mmap_write_lock(mm); + if (check_stable_address_space(mm)) + goto unlock; + vma = find_vma(mm, info->vaddr); if (!vma || !valid_vma(vma, is_register) || file_inode(vma->vm_file) != uprobe->inode) diff --git a/kernel/fork.c b/kernel/fork.c index cba5ede2c639..735405a9c5f3 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -760,7 +760,8 @@ loop_out: mt_set_in_rcu(vmi.mas.tree); ksm_fork(mm, oldmm); khugepaged_fork(mm, oldmm); - } else if (mpnt) { + } else { + /* * The entire maple tree has already been duplicated. If the * mmap duplication fails, mark the failure point with @@ -768,8 +769,18 @@ loop_out: * stop releasing VMAs that have not been duplicated after this * point. */ - mas_set_range(&vmi.mas, mpnt->vm_start, mpnt->vm_end - 1); - mas_store(&vmi.mas, XA_ZERO_ENTRY); + if (mpnt) { + mas_set_range(&vmi.mas, mpnt->vm_start, mpnt->vm_end - 1); + mas_store(&vmi.mas, XA_ZERO_ENTRY); + /* Avoid OOM iterating a broken tree */ + set_bit(MMF_OOM_SKIP, &mm->flags); + } + /* + * The mm_struct is going to exit, but the locks will be dropped + * first. Set the mm_struct as unstable is advisable as it is + * not fully initialised. + */ + set_bit(MMF_UNSTABLE, &mm->flags); } out: mmap_write_unlock(mm); -- 2.51.0 From 6268f0a166ebcf5a31577036f4c1e613d5ab4fb1 Mon Sep 17 00:00:00 2001 From: yangge Date: Sat, 25 Jan 2025 14:53:57 +0800 Subject: [PATCH 13/16] mm: compaction: use the proper flag to determine watermarks There are 4 NUMA nodes on my machine, and each NUMA node has 32GB of memory. I have configured 16GB of CMA memory on each NUMA node, and starting a 32GB virtual machine with device passthrough is extremely slow, taking almost an hour. Long term GUP cannot allocate memory from CMA area, so a maximum of 16 GB of no-CMA memory on a NUMA node can be used as virtual machine memory. There is 16GB of free CMA memory on a NUMA node, which is sufficient to pass the order-0 watermark check, causing the __compaction_suitable() function to consistently return true. For costly allocations, if the __compaction_suitable() function always returns true, it causes the __alloc_pages_slowpath() function to fail to exit at the appropriate point. This prevents timely fallback to allocating memory on other nodes, ultimately resulting in excessively long virtual machine startup times. Call trace: __alloc_pages_slowpath if (compact_result == COMPACT_SKIPPED || compact_result == COMPACT_DEFERRED) goto nopage; // should exit __alloc_pages_slowpath() from here We could use the real unmovable allocation context to have __zone_watermark_unusable_free() subtract CMA pages, and thus we won't pass the order-0 check anymore once the non-CMA part is exhausted. There is some risk that in some different scenario the compaction could in fact migrate pages from the exhausted non-CMA part of the zone to the CMA part and succeed, and we'll skip it instead. But only __GFP_NORETRY allocations should be affected in the immediate "goto nopage" when compaction is skipped, others will attempt with DEF_COMPACT_PRIORITY anyway and won't fail without trying to compact-migrate the non-CMA pageblocks into CMA pageblocks first, so it should be fine. After this fix, it only takes a few tens of seconds to start a 32GB virtual machine with device passthrough functionality. Link: https://lore.kernel.org/lkml/1736335854-548-1-git-send-email-yangge1116@126.com/ Link: https://lkml.kernel.org/r/1737788037-8439-1-git-send-email-yangge1116@126.com Signed-off-by: yangge Acked-by: Vlastimil Babka Reviewed-by: Baolin Wang Acked-by: Johannes Weiner Cc: Barry Song <21cnbao@gmail.com> Cc: David Hildenbrand Signed-off-by: Andrew Morton --- mm/compaction.c | 29 +++++++++++++++++++++++++---- 1 file changed, 25 insertions(+), 4 deletions(-) diff --git a/mm/compaction.c b/mm/compaction.c index bcc0df0066dc..12ed8425fa17 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -2491,7 +2491,8 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order, */ static enum compact_result compaction_suit_allocation_order(struct zone *zone, unsigned int order, - int highest_zoneidx, unsigned int alloc_flags) + int highest_zoneidx, unsigned int alloc_flags, + bool async) { unsigned long watermark; @@ -2500,6 +2501,23 @@ compaction_suit_allocation_order(struct zone *zone, unsigned int order, alloc_flags)) return COMPACT_SUCCESS; + /* + * For unmovable allocations (without ALLOC_CMA), check if there is enough + * free memory in the non-CMA pageblocks. Otherwise compaction could form + * the high-order page in CMA pageblocks, which would not help the + * allocation to succeed. However, limit the check to costly order async + * compaction (such as opportunistic THP attempts) because there is the + * possibility that compaction would migrate pages from non-CMA to CMA + * pageblock. + */ + if (order > PAGE_ALLOC_COSTLY_ORDER && async && + !(alloc_flags & ALLOC_CMA)) { + watermark = low_wmark_pages(zone) + compact_gap(order); + if (!__zone_watermark_ok(zone, 0, watermark, highest_zoneidx, + 0, zone_page_state(zone, NR_FREE_PAGES))) + return COMPACT_SKIPPED; + } + if (!compaction_suitable(zone, order, highest_zoneidx)) return COMPACT_SKIPPED; @@ -2535,7 +2553,8 @@ compact_zone(struct compact_control *cc, struct capture_control *capc) if (!is_via_compact_memory(cc->order)) { ret = compaction_suit_allocation_order(cc->zone, cc->order, cc->highest_zoneidx, - cc->alloc_flags); + cc->alloc_flags, + cc->mode == MIGRATE_ASYNC); if (ret != COMPACT_CONTINUE) return ret; } @@ -3038,7 +3057,8 @@ static bool kcompactd_node_suitable(pg_data_t *pgdat) ret = compaction_suit_allocation_order(zone, pgdat->kcompactd_max_order, - highest_zoneidx, ALLOC_WMARK_MIN); + highest_zoneidx, ALLOC_WMARK_MIN, + false); if (ret == COMPACT_CONTINUE) return true; } @@ -3079,7 +3099,8 @@ static void kcompactd_do_work(pg_data_t *pgdat) continue; ret = compaction_suit_allocation_order(zone, - cc.order, zoneid, ALLOC_WMARK_MIN); + cc.order, zoneid, ALLOC_WMARK_MIN, + false); if (ret != COMPACT_CONTINUE) continue; -- 2.51.0 From 6438ef381c183444f7f9d1de18f22661cba1e946 Mon Sep 17 00:00:00 2001 From: Nikita Zhandarovich Date: Sat, 25 Jan 2025 07:20:53 +0900 Subject: [PATCH 14/16] nilfs2: fix possible int overflows in nilfs_fiemap() Since nilfs_bmap_lookup_contig() in nilfs_fiemap() calculates its result by being prepared to go through potentially maxblocks == INT_MAX blocks, the value in n may experience an overflow caused by left shift of blkbits. While it is extremely unlikely to occur, play it safe and cast right hand expression to wider type to mitigate the issue. Found by Linux Verification Center (linuxtesting.org) with static analysis tool SVACE. Link: https://lkml.kernel.org/r/20250124222133.5323-1-konishi.ryusuke@gmail.com Fixes: 622daaff0a89 ("nilfs2: fiemap support") Signed-off-by: Nikita Zhandarovich Signed-off-by: Ryusuke Konishi Cc: Signed-off-by: Andrew Morton --- fs/nilfs2/inode.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/fs/nilfs2/inode.c b/fs/nilfs2/inode.c index e8015d24a82c..6613b8fcceb0 100644 --- a/fs/nilfs2/inode.c +++ b/fs/nilfs2/inode.c @@ -1186,7 +1186,7 @@ int nilfs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo, if (size) { if (phys && blkphy << blkbits == phys + size) { /* The current extent goes on */ - size += n << blkbits; + size += (u64)n << blkbits; } else { /* Terminate the current extent */ ret = fiemap_fill_next_extent( @@ -1199,14 +1199,14 @@ int nilfs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo, flags = FIEMAP_EXTENT_MERGED; logical = blkoff << blkbits; phys = blkphy << blkbits; - size = n << blkbits; + size = (u64)n << blkbits; } } else { /* Start a new extent */ flags = FIEMAP_EXTENT_MERGED; logical = blkoff << blkbits; phys = blkphy << blkbits; - size = n << blkbits; + size = (u64)n << blkbits; } blkoff += n; } -- 2.51.0 From e64f81946adf68cd75e2207dd9a51668348a4af8 Mon Sep 17 00:00:00 2001 From: Marco Elver Date: Fri, 24 Jan 2025 13:01:38 +0100 Subject: [PATCH 15/16] kfence: skip __GFP_THISNODE allocations on NUMA systems On NUMA systems, __GFP_THISNODE indicates that an allocation _must_ be on a particular node, and failure to allocate on the desired node will result in a failed allocation. Skip __GFP_THISNODE allocations if we are running on a NUMA system, since KFENCE can't guarantee which node its pool pages are allocated on. Link: https://lkml.kernel.org/r/20250124120145.410066-1-elver@google.com Fixes: 236e9f153852 ("kfence: skip all GFP_ZONEMASK allocations") Signed-off-by: Marco Elver Reported-by: Vlastimil Babka Acked-by: Vlastimil Babka Cc: Christoph Lameter Cc: Alexander Potapenko Cc: Chistoph Lameter Cc: Dmitriy Vyukov Cc: Signed-off-by: Andrew Morton --- mm/kfence/core.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/mm/kfence/core.c b/mm/kfence/core.c index 67fc321db79b..102048821c22 100644 --- a/mm/kfence/core.c +++ b/mm/kfence/core.c @@ -21,6 +21,7 @@ #include #include #include +#include #include #include #include @@ -1084,6 +1085,7 @@ void *__kfence_alloc(struct kmem_cache *s, size_t size, gfp_t flags) * properties (e.g. reside in DMAable memory). */ if ((flags & GFP_ZONEMASK) || + ((flags & __GFP_THISNODE) && num_online_nodes() > 1) || (s->flags & (SLAB_CACHE_DMA | SLAB_CACHE_DMA32))) { atomic_long_inc(&counters[KFENCE_COUNTER_SKIP_INCOMPAT]); return NULL; -- 2.51.0 From 1ccae30ecd98671325fa6954f9934bad298b56a2 Mon Sep 17 00:00:00 2001 From: Christopher Obbard Date: Wed, 22 Jan 2025 12:04:27 +0000 Subject: [PATCH 16/16] .mailmap: update email address for Christopher Obbard Update my email address. Link: https://lkml.kernel.org/r/20250122-wip-obbardc-update-email-v2-1-12bde6b79ad0@linaro.org Signed-off-by: Christopher Obbard Signed-off-by: Andrew Morton --- .mailmap | 1 + 1 file changed, 1 insertion(+) diff --git a/.mailmap b/.mailmap index 8d721d390e9d..fec6b455b576 100644 --- a/.mailmap +++ b/.mailmap @@ -165,6 +165,7 @@ Christian Brauner Christian Brauner Christian Marangi Christophe Ricard +Christopher Obbard Christoph Hellwig Chuck Lever Chuck Lever -- 2.51.0