From fcd807a03b864e2c7b2aa5eaade185127c4e2414 Mon Sep 17 00:00:00 2001 From: Marcelo Moreira Date: Mon, 17 Feb 2025 18:54:31 -0300 Subject: [PATCH 01/16] Docs/mm/damon: fix spelling and grammar in monitoring_intervals_tuning_example.rst This patch fixes some spelling and grammar mistakes in the documentation, improving the readability. - multipled -> multiplied - idential -> identical - minuts -> minutes - efficieny -> efficiency Link: https://lkml.kernel.org/r/20250217215512.12833-1-marcelomoreira1905@gmail.com Signed-off-by: Marcelo Moreira Reviewed-by: SeongJae Park Cc: Shuah khan Signed-off-by: Andrew Morton --- .../mm/damon/monitoring_intervals_tuning_example.rst | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/Documentation/mm/damon/monitoring_intervals_tuning_example.rst b/Documentation/mm/damon/monitoring_intervals_tuning_example.rst index 334a854efb40..7207cbed591f 100644 --- a/Documentation/mm/damon/monitoring_intervals_tuning_example.rst +++ b/Documentation/mm/damon/monitoring_intervals_tuning_example.rst @@ -36,7 +36,7 @@ Then, list the DAMON-found regions of different access patterns, sorted by the "access temperature". "Access temperature" is a metric representing the access-hotness of a region. It is calculated as a weighted sum of the access frequency and the age of the region. If the access frequency is 0 %, the -temperature is multipled by minus one. That is, if a region is not accessed, +temperature is multiplied by minus one. That is, if a region is not accessed, it gets minus temperature and it gets lower as not accessed for longer time. The sorting is in temperature-ascendint order, so the region at the top of the list is the coldest, and the one at the bottom is the hottest one. :: @@ -58,11 +58,11 @@ list is the coldest, and the one at the bottom is the hottest one. :: The list shows not seemingly hot regions, and only minimum access pattern diversity. Every region has zero access frequency. The number of region is 10, which is the default ``min_nr_regions value``. Size of each region is also -nearly idential. We can suspect this is because “adaptive regions adjustment” +nearly identical. We can suspect this is because “adaptive regions adjustment” mechanism was not well working. As the guide suggested, we can get relative hotness of regions using ``age`` as the recency information. That would be better than nothing, but given the fact that the longest age is only about 6 -seconds while we waited about ten minuts, it is unclear how useful this will +seconds while we waited about ten minutes, it is unclear how useful this will be. The temperature ranges to total size of regions of each range histogram @@ -190,7 +190,7 @@ for sampling and aggregation intervals, respectively). :: The number of regions having different access patterns has significantly increased. Size of each region is also more varied. Total size of non-zero access frequency regions is also significantly increased. Maybe this is already -good enough to make some meaningful memory management efficieny changes. +good enough to make some meaningful memory management efficiency changes. 800ms/16s intervals: Another bias ================================= -- 2.51.0 From 63a23847dc47113b879a5f53cc0ca5cedc881ffd Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Mon, 17 Feb 2025 19:20:06 +0000 Subject: [PATCH 02/16] fs: convert block_commit_write() to take a folio All callers now have a folio, so pass it in instead of converting folio->page->folio. 
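As an illustrative sketch only (not part of this patch), a typical caller ends
up looking like the snippet below; example_mkwrite_commit() is a hypothetical
helper, but the block_commit_write() call matches the folio-based signature
introduced here, and the surrounding calls mirror the converted callers in the
diff:

	#include <linux/mm.h>
	#include <linux/pagemap.h>
	#include <linux/buffer_head.h>

	/* Hypothetical tail of a ->page_mkwrite handler, for illustration only. */
	static vm_fault_t example_mkwrite_commit(struct folio *folio, size_t len)
	{
		/* Before this patch: block_commit_write(&folio->page, 0, len); */
		block_commit_write(folio, 0, len);	/* folio passed straight through */
		folio_mark_dirty(folio);
		folio_wait_stable(folio);
		return VM_FAULT_LOCKED;
	}
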
Link: https://lkml.kernel.org/r/20250217192009.437916-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Andrew Morton --- fs/buffer.c | 14 ++++---------- fs/ext4/inline.c | 2 +- fs/ext4/move_extent.c | 2 +- fs/iomap/buffered-io.c | 2 +- fs/ocfs2/aops.c | 4 ++-- fs/ocfs2/file.c | 2 +- fs/udf/file.c | 2 +- include/linux/buffer_head.h | 2 +- 8 files changed, 12 insertions(+), 18 deletions(-) diff --git a/fs/buffer.c b/fs/buffer.c index cc8452f60251..c66a59bb068b 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -2166,7 +2166,7 @@ int __block_write_begin(struct folio *folio, loff_t pos, unsigned len, } EXPORT_SYMBOL(__block_write_begin); -static void __block_commit_write(struct folio *folio, size_t from, size_t to) +void block_commit_write(struct folio *folio, size_t from, size_t to) { size_t block_start, block_end; bool partial = false; @@ -2204,6 +2204,7 @@ static void __block_commit_write(struct folio *folio, size_t from, size_t to) if (!partial) folio_mark_uptodate(folio); } +EXPORT_SYMBOL(block_commit_write); /* * block_write_begin takes care of the basic task of block allocation and @@ -2262,7 +2263,7 @@ int block_write_end(struct file *file, struct address_space *mapping, flush_dcache_folio(folio); /* This could be a short (even 0-length) commit */ - __block_commit_write(folio, start, start + copied); + block_commit_write(folio, start, start + copied); return copied; } @@ -2578,13 +2579,6 @@ int cont_write_begin(struct file *file, struct address_space *mapping, } EXPORT_SYMBOL(cont_write_begin); -void block_commit_write(struct page *page, unsigned from, unsigned to) -{ - struct folio *folio = page_folio(page); - __block_commit_write(folio, from, to); -} -EXPORT_SYMBOL(block_commit_write); - /* * block_page_mkwrite() is not allowed to change the file size as it gets * called from a page fault handler when a page is first dirtied. 
Hence we must @@ -2630,7 +2624,7 @@ int block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf, if (unlikely(ret)) goto out_unlock; - __block_commit_write(folio, 0, end); + block_commit_write(folio, 0, end); folio_mark_dirty(folio); folio_wait_stable(folio); diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c index 3536ca7e4fcc..0af474c8b260 100644 --- a/fs/ext4/inline.c +++ b/fs/ext4/inline.c @@ -637,7 +637,7 @@ retry: goto retry; if (folio) - block_commit_write(&folio->page, from, to); + block_commit_write(folio, from, to); out: if (folio) { folio_unlock(folio); diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c index 898443e98efc..48649be64d6a 100644 --- a/fs/ext4/move_extent.c +++ b/fs/ext4/move_extent.c @@ -399,7 +399,7 @@ data_copy: bh = bh->b_this_page; } - block_commit_write(&folio[0]->page, from, from + replaced_size); + block_commit_write(folio[0], from, from + replaced_size); /* Even in case of data=writeback it is reasonable to pin * inode to transaction, to prevent unexpected data loss */ diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c index d303e6c8900c..f3904d13cda4 100644 --- a/fs/iomap/buffered-io.c +++ b/fs/iomap/buffered-io.c @@ -1484,7 +1484,7 @@ static loff_t iomap_folio_mkwrite_iter(struct iomap_iter *iter, &iter->iomap); if (ret) return ret; - block_commit_write(&folio->page, 0, length); + block_commit_write(folio, 0, length); } else { WARN_ON_ONCE(!folio_test_uptodate(folio)); folio_mark_dirty(folio); diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c index 5bbeb6fbb1ac..ee1d92ed950f 100644 --- a/fs/ocfs2/aops.c +++ b/fs/ocfs2/aops.c @@ -920,7 +920,7 @@ static void ocfs2_write_failure(struct inode *inode, ocfs2_jbd2_inode_add_write(wc->w_handle, inode, user_pos, user_len); - block_commit_write(&folio->page, from, to); + block_commit_write(folio, from, to); } } } @@ -2012,7 +2012,7 @@ int ocfs2_write_end_nolock(struct address_space *mapping, loff_t pos, ocfs2_jbd2_inode_add_write(handle, inode, start_byte, length); } - block_commit_write(&folio->page, from, to); + block_commit_write(folio, from, to); } } diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c index e54f2c4b5a90..2056cf08ac1e 100644 --- a/fs/ocfs2/file.c +++ b/fs/ocfs2/file.c @@ -813,7 +813,7 @@ static int ocfs2_write_zero_page(struct inode *inode, u64 abs_from, /* must not update i_size! 
*/ - block_commit_write(&folio->page, block_start + 1, block_start + 1); + block_commit_write(folio, block_start + 1, block_start + 1); } /* diff --git a/fs/udf/file.c b/fs/udf/file.c index 412fe7c4d348..0d76c4f37b3e 100644 --- a/fs/udf/file.c +++ b/fs/udf/file.c @@ -69,7 +69,7 @@ static vm_fault_t udf_page_mkwrite(struct vm_fault *vmf) goto out_unlock; } - block_commit_write(&folio->page, 0, end); + block_commit_write(folio, 0, end); out_dirty: folio_mark_dirty(folio); folio_wait_stable(folio); diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h index 932139c5d46f..6672e1a5031c 100644 --- a/include/linux/buffer_head.h +++ b/include/linux/buffer_head.h @@ -271,7 +271,7 @@ int cont_write_begin(struct file *, struct address_space *, loff_t, unsigned, struct folio **, void **, get_block_t *, loff_t *); int generic_cont_expand_simple(struct inode *inode, loff_t size); -void block_commit_write(struct page *page, unsigned int from, unsigned int to); +void block_commit_write(struct folio *folio, size_t from, size_t to); int block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf, get_block_t get_block); sector_t generic_block_bmap(struct address_space *, sector_t, get_block_t *); -- 2.51.0 From 52d671a1a36a16f3a0dd9a2beff964e75bce9787 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Mon, 17 Feb 2025 19:20:07 +0000 Subject: [PATCH 03/16] fs: remove page_file_mapping() This wrapper has no more callers. Delete it. Link: https://lkml.kernel.org/r/20250217192009.437916-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Andrew Morton --- include/linux/pagemap.h | 5 ----- 1 file changed, 5 deletions(-) diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index 47bfc6b1b632..975c56fb4f85 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -575,11 +575,6 @@ static inline struct address_space *folio_flush_mapping(struct folio *folio) return folio_mapping(folio); } -static inline struct address_space *page_file_mapping(struct page *page) -{ - return folio_file_mapping(page_folio(page)); -} - /** * folio_inode - Get the host inode for this folio. * @folio: The folio. -- 2.51.0 From 0d40cfe63a2f19b9d375382e6d90b9ebd412901e Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Mon, 17 Feb 2025 19:20:08 +0000 Subject: [PATCH 04/16] fs: remove folio_file_mapping() No callers of this function remain as filesystems no longer see swapfile pages through their normal read/write paths. Link: https://lkml.kernel.org/r/20250217192009.437916-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Andrew Morton --- include/linux/pagemap.h | 20 -------------------- 1 file changed, 20 deletions(-) diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index 975c56fb4f85..ad7c0f615e9b 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -535,26 +535,6 @@ static inline void filemap_nr_thps_dec(struct address_space *mapping) struct address_space *folio_mapping(struct folio *); struct address_space *swapcache_mapping(struct folio *); -/** - * folio_file_mapping - Find the mapping this folio belongs to. - * @folio: The folio. - * - * For folios which are in the page cache, return the mapping that this - * page belongs to. Folios in the swap cache return the mapping of the - * swap file or swap device where the data is stored. This is different - * from the mapping returned by folio_mapping(). The only reason to - * use it is if, like NFS, you return 0 from ->activate_swapfile. 
- * - * Do not call this for folios which aren't in the page cache or swap cache. - */ -static inline struct address_space *folio_file_mapping(struct folio *folio) -{ - if (unlikely(folio_test_swapcache(folio))) - return swapcache_mapping(folio); - - return folio->mapping; -} - /** * folio_flush_mapping - Find the file mapping this folio belongs to. * @folio: The folio. -- 2.51.0 From 8e4909d693224134fb14ae082c379168b43112c9 Mon Sep 17 00:00:00 2001 From: Suchit K Date: Sat, 15 Feb 2025 22:30:42 +0530 Subject: [PATCH 05/16] Documentation/mm: fix spelling mistake The word watermark was misspelled as "watemark". Link: https://lkml.kernel.org/r/CAO9wTFhe4sf1eVVgijt2cdLPPsUHBj7B=HN-380_JSpve5KbvQ@mail.gmail.com Signed-off-by: Suchit Cc: Shuah Khan Signed-off-by: Andrew Morton --- Documentation/mm/balance.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Documentation/mm/balance.rst b/Documentation/mm/balance.rst index abaa78561c31..c4962c89a7f5 100644 --- a/Documentation/mm/balance.rst +++ b/Documentation/mm/balance.rst @@ -81,7 +81,7 @@ Page stealing from process memory and shm is done if stealing the page would alleviate memory pressure on any zone in the page's node that has fallen below its watermark. -watemark[WMARK_MIN/WMARK_LOW/WMARK_HIGH]/low_on_memory/zone_wake_kswapd: These +watermark[WMARK_MIN/WMARK_LOW/WMARK_HIGH]/low_on_memory/zone_wake_kswapd: These are per-zone fields, used to determine when a zone needs to be balanced. When the number of pages falls below watermark[WMARK_MIN], the hysteric field low_on_memory gets set. This stays set till the number of free pages becomes -- 2.51.0 From af3b45aac5c9d0bdaebf4e93e3fecddc3f363857 Mon Sep 17 00:00:00 2001 From: Ujwal Kundur Date: Sat, 15 Feb 2025 13:48:03 +0530 Subject: [PATCH 06/16] selftests/mm: fix spelling Fix misspelling flagged by codespell. Link: https://lkml.kernel.org/r/20250215081803.1793-1-ujwal.kundur@gmail.com Signed-off-by: Ujwal Kundur Signed-off-by: Andrew Morton --- tools/testing/selftests/mm/uffd-common.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/testing/selftests/mm/uffd-common.c b/tools/testing/selftests/mm/uffd-common.c index 31e0c8a3110d..5457a078690d 100644 --- a/tools/testing/selftests/mm/uffd-common.c +++ b/tools/testing/selftests/mm/uffd-common.c @@ -348,7 +348,7 @@ int uffd_test_ctx_init(uint64_t features, const char **errmsg) /* * After initialization of area_src, we must explicitly release pages * for area_dst to make sure it's fully empty. Otherwise we could have - * some area_dst pages be errornously initialized with zero pages, + * some area_dst pages be erroneously initialized with zero pages, * hence we could hit memory corruption later in the test. * * One example is when THP is globally enabled, above allocate_area() -- 2.51.0 From 86758b504864913233f6a16076184ba784cd4466 Mon Sep 17 00:00:00 2001 From: Ryan Roberts Date: Tue, 18 Feb 2025 15:49:54 +0530 Subject: [PATCH 07/16] mm/ioremap: pass pgprot_t to ioremap_prot() instead of unsigned long ioremap_prot() currently accepts pgprot_val parameter as an unsigned long, thus implicitly assuming that pgprot_val and pgprot_t could never be bigger than unsigned long. But this assumption soon will not be true on arm64 when using D128 pgtables. In 128 bit page table configuration, unsigned long is 64 bit, but pgprot_t is 128 bit. Passing platform abstracted pgprot_t argument is better as compared to size based data types. 
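To illustrate with a hedged sketch (not taken from this series), a caller that
previously collapsed the protection value through pgprot_val() simply hands
the pgprot_t through unchanged, so nothing can be truncated when pgprot_t is
wider than unsigned long; example_map_mmio() below is a hypothetical helper:

	#include <linux/io.h>
	#include <linux/pgtable.h>

	/* Hypothetical driver mapping helper, for illustration only. */
	static void __iomem *example_map_mmio(phys_addr_t base, size_t size)
	{
		/*
		 * Old style:
		 *   ioremap_prot(base, size, pgprot_val(pgprot_noncached(PAGE_KERNEL)));
		 * where the unsigned long parameter would silently drop the
		 * upper bits of a 128-bit pgprot_t.
		 */
		return ioremap_prot(base, size, pgprot_noncached(PAGE_KERNEL));
	}
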
Let's change the parameter to directly pass pgprot_t like another similar helper generic_ioremap_prot(). Without this change in place, D128 configuration does not work on arm64 as the top 64 bits gets silently stripped when passing the protection value to this function. Link: https://lkml.kernel.org/r/20250218101954.415331-1-anshuman.khandual@arm.com Signed-off-by: Ryan Roberts Co-developed-by: Anshuman Khandual Signed-off-by: Anshuman Khandual Acked-by: Catalin Marinas [arm64] Signed-off-by: Andrew Morton --- arch/arc/mm/ioremap.c | 6 ++---- arch/arm64/include/asm/io.h | 6 +++--- arch/arm64/kernel/acpi.c | 2 +- arch/arm64/mm/ioremap.c | 3 +-- arch/csky/include/asm/io.h | 2 +- arch/loongarch/include/asm/io.h | 10 +++++----- arch/mips/include/asm/io.h | 8 ++++---- arch/mips/mm/ioremap.c | 4 ++-- arch/mips/mm/ioremap64.c | 4 ++-- arch/parisc/include/asm/io.h | 2 +- arch/parisc/mm/ioremap.c | 4 ++-- arch/powerpc/include/asm/io.h | 2 +- arch/powerpc/mm/ioremap.c | 4 ++-- arch/powerpc/platforms/ps3/spu.c | 4 ++-- arch/riscv/include/asm/io.h | 2 +- arch/riscv/kernel/acpi.c | 2 +- arch/s390/include/asm/io.h | 4 ++-- arch/s390/pci/pci.c | 4 ++-- arch/sh/boards/mach-landisk/setup.c | 2 +- arch/sh/boards/mach-lboxre2/setup.c | 2 +- arch/sh/boards/mach-sh03/setup.c | 2 +- arch/sh/include/asm/io.h | 2 +- arch/sh/mm/ioremap.c | 3 +-- arch/x86/include/asm/io.h | 2 +- arch/x86/mm/ioremap.c | 4 ++-- arch/xtensa/include/asm/io.h | 6 +++--- arch/xtensa/mm/ioremap.c | 4 ++-- include/asm-generic/io.h | 4 ++-- mm/ioremap.c | 4 ++-- mm/memory.c | 6 +++--- 30 files changed, 55 insertions(+), 59 deletions(-) diff --git a/arch/arc/mm/ioremap.c b/arch/arc/mm/ioremap.c index b07004d53267..fd8897a0e52c 100644 --- a/arch/arc/mm/ioremap.c +++ b/arch/arc/mm/ioremap.c @@ -32,7 +32,7 @@ void __iomem *ioremap(phys_addr_t paddr, unsigned long size) return (void __iomem *)(u32)paddr; return ioremap_prot(paddr, size, - pgprot_val(pgprot_noncached(PAGE_KERNEL))); + pgprot_noncached(PAGE_KERNEL)); } EXPORT_SYMBOL(ioremap); @@ -44,10 +44,8 @@ EXPORT_SYMBOL(ioremap); * might need finer access control (R/W/X) */ void __iomem *ioremap_prot(phys_addr_t paddr, size_t size, - unsigned long flags) + pgprot_t prot) { - pgprot_t prot = __pgprot(flags); - /* force uncached */ return generic_ioremap_prot(paddr, size, pgprot_noncached(prot)); } diff --git a/arch/arm64/include/asm/io.h b/arch/arm64/include/asm/io.h index 76ebbdc6ffdd..9b96840fb979 100644 --- a/arch/arm64/include/asm/io.h +++ b/arch/arm64/include/asm/io.h @@ -270,9 +270,9 @@ int arm64_ioremap_prot_hook_register(const ioremap_prot_hook_t hook); #define _PAGE_IOREMAP PROT_DEVICE_nGnRE #define ioremap_wc(addr, size) \ - ioremap_prot((addr), (size), PROT_NORMAL_NC) + ioremap_prot((addr), (size), __pgprot(PROT_NORMAL_NC)) #define ioremap_np(addr, size) \ - ioremap_prot((addr), (size), PROT_DEVICE_nGnRnE) + ioremap_prot((addr), (size), __pgprot(PROT_DEVICE_nGnRnE)) /* * io{read,write}{16,32,64}be() macros @@ -293,7 +293,7 @@ static inline void __iomem *ioremap_cache(phys_addr_t addr, size_t size) if (pfn_is_map_memory(__phys_to_pfn(addr))) return (void __iomem *)__phys_to_virt(addr); - return ioremap_prot(addr, size, PROT_NORMAL); + return ioremap_prot(addr, size, __pgprot(PROT_NORMAL)); } /* diff --git a/arch/arm64/kernel/acpi.c b/arch/arm64/kernel/acpi.c index e6f66491fbe9..b9a66fc146c9 100644 --- a/arch/arm64/kernel/acpi.c +++ b/arch/arm64/kernel/acpi.c @@ -379,7 +379,7 @@ void __iomem *acpi_os_ioremap(acpi_physical_address phys, acpi_size size) prot = 
__acpi_get_writethrough_mem_attribute(); } } - return ioremap_prot(phys, size, pgprot_val(prot)); + return ioremap_prot(phys, size, prot); } /* diff --git a/arch/arm64/mm/ioremap.c b/arch/arm64/mm/ioremap.c index 6cc0b7e7eb03..10e246f11271 100644 --- a/arch/arm64/mm/ioremap.c +++ b/arch/arm64/mm/ioremap.c @@ -15,10 +15,9 @@ int arm64_ioremap_prot_hook_register(ioremap_prot_hook_t hook) } void __iomem *ioremap_prot(phys_addr_t phys_addr, size_t size, - unsigned long prot) + pgprot_t pgprot) { unsigned long last_addr = phys_addr + size - 1; - pgprot_t pgprot = __pgprot(prot); /* Don't allow outside PHYS_MASK */ if (last_addr & ~PHYS_MASK) diff --git a/arch/csky/include/asm/io.h b/arch/csky/include/asm/io.h index ed53f0b47388..536d3bf32ff1 100644 --- a/arch/csky/include/asm/io.h +++ b/arch/csky/include/asm/io.h @@ -36,7 +36,7 @@ */ #define ioremap_wc(addr, size) \ ioremap_prot((addr), (size), \ - (_PAGE_IOREMAP & ~_CACHE_MASK) | _CACHE_UNCACHED) + __pgprot((_PAGE_IOREMAP & ~_CACHE_MASK) | _CACHE_UNCACHED)) #include diff --git a/arch/loongarch/include/asm/io.h b/arch/loongarch/include/asm/io.h index e77a56eaf906..eaff72b38dc8 100644 --- a/arch/loongarch/include/asm/io.h +++ b/arch/loongarch/include/asm/io.h @@ -23,9 +23,9 @@ extern void __init early_iounmap(void __iomem *addr, unsigned long size); #ifdef CONFIG_ARCH_IOREMAP static inline void __iomem *ioremap_prot(phys_addr_t offset, unsigned long size, - unsigned long prot_val) + pgprot_t prot) { - switch (prot_val & _CACHE_MASK) { + switch (pgprot_val(prot) & _CACHE_MASK) { case _CACHE_CC: return (void __iomem *)(unsigned long)(CACHE_BASE + offset); case _CACHE_SUC: @@ -38,7 +38,7 @@ static inline void __iomem *ioremap_prot(phys_addr_t offset, unsigned long size, } #define ioremap(offset, size) \ - ioremap_prot((offset), (size), pgprot_val(PAGE_KERNEL_SUC)) + ioremap_prot((offset), (size), PAGE_KERNEL_SUC) #define iounmap(addr) ((void)(addr)) @@ -55,10 +55,10 @@ static inline void __iomem *ioremap_prot(phys_addr_t offset, unsigned long size, */ #define ioremap_wc(offset, size) \ ioremap_prot((offset), (size), \ - pgprot_val(wc_enabled ? PAGE_KERNEL_WUC : PAGE_KERNEL_SUC)) + wc_enabled ? PAGE_KERNEL_WUC : PAGE_KERNEL_SUC) #define ioremap_cache(offset, size) \ - ioremap_prot((offset), (size), pgprot_val(PAGE_KERNEL)) + ioremap_prot((offset), (size), PAGE_KERNEL) #define mmiowb() wmb() diff --git a/arch/mips/include/asm/io.h b/arch/mips/include/asm/io.h index 0bddb568af7c..4dacf40ebefd 100644 --- a/arch/mips/include/asm/io.h +++ b/arch/mips/include/asm/io.h @@ -126,7 +126,7 @@ static inline unsigned long isa_virt_to_bus(volatile void *address) } void __iomem *ioremap_prot(phys_addr_t offset, unsigned long size, - unsigned long prot_val); + pgprot_t prot); void iounmap(const volatile void __iomem *addr); /* @@ -141,7 +141,7 @@ void iounmap(const volatile void __iomem *addr); * address. */ #define ioremap(offset, size) \ - ioremap_prot((offset), (size), _CACHE_UNCACHED) + ioremap_prot((offset), (size), __pgprot(_CACHE_UNCACHED)) /* * ioremap_cache - map bus memory into CPU space @@ -159,7 +159,7 @@ void iounmap(const volatile void __iomem *addr); * memory-like regions on I/O busses. */ #define ioremap_cache(offset, size) \ - ioremap_prot((offset), (size), _page_cachable_default) + ioremap_prot((offset), (size), __pgprot(_page_cachable_default)) /* * ioremap_wc - map bus memory into CPU space @@ -180,7 +180,7 @@ void iounmap(const volatile void __iomem *addr); * _CACHE_UNCACHED option (see cpu_probe() method). 
*/ #define ioremap_wc(offset, size) \ - ioremap_prot((offset), (size), boot_cpu_data.writecombine) + ioremap_prot((offset), (size), __pgprot(boot_cpu_data.writecombine)) #if defined(CONFIG_CPU_CAVIUM_OCTEON) #define war_io_reorder_wmb() wmb() diff --git a/arch/mips/mm/ioremap.c b/arch/mips/mm/ioremap.c index d8243d61ef32..c6c4576cd4a8 100644 --- a/arch/mips/mm/ioremap.c +++ b/arch/mips/mm/ioremap.c @@ -44,9 +44,9 @@ static int __ioremap_check_ram(unsigned long start_pfn, unsigned long nr_pages, * ioremap_prot gives the caller control over cache coherency attributes (CCA) */ void __iomem *ioremap_prot(phys_addr_t phys_addr, unsigned long size, - unsigned long prot_val) + pgprot_t prot) { - unsigned long flags = prot_val & _CACHE_MASK; + unsigned long flags = pgprot_val(prot) & _CACHE_MASK; unsigned long offset, pfn, last_pfn; struct vm_struct *area; phys_addr_t last_addr; diff --git a/arch/mips/mm/ioremap64.c b/arch/mips/mm/ioremap64.c index 15e7820d6a5f..acc03ba20098 100644 --- a/arch/mips/mm/ioremap64.c +++ b/arch/mips/mm/ioremap64.c @@ -3,9 +3,9 @@ #include void __iomem *ioremap_prot(phys_addr_t offset, unsigned long size, - unsigned long prot_val) + pgprot_t prot) { - unsigned long flags = prot_val & _CACHE_MASK; + unsigned long flags = pgprot_val(prot) & _CACHE_MASK; u64 base = (flags == _CACHE_UNCACHED ? IO_BASE : UNCAC_BASE); void __iomem *addr; diff --git a/arch/parisc/include/asm/io.h b/arch/parisc/include/asm/io.h index 3143cf29ce27..04b783e2a6d1 100644 --- a/arch/parisc/include/asm/io.h +++ b/arch/parisc/include/asm/io.h @@ -131,7 +131,7 @@ static inline void gsc_writeq(unsigned long long val, unsigned long addr) _PAGE_ACCESSED | _PAGE_NO_CACHE) #define ioremap_wc(addr, size) \ - ioremap_prot((addr), (size), _PAGE_IOREMAP) + ioremap_prot((addr), (size), __pgprot(_PAGE_IOREMAP)) #define pci_iounmap pci_iounmap diff --git a/arch/parisc/mm/ioremap.c b/arch/parisc/mm/ioremap.c index fd996472dfe7..0b65c4b3baee 100644 --- a/arch/parisc/mm/ioremap.c +++ b/arch/parisc/mm/ioremap.c @@ -14,7 +14,7 @@ #include void __iomem *ioremap_prot(phys_addr_t phys_addr, size_t size, - unsigned long prot) + pgprot_t prot) { #ifdef CONFIG_EISA unsigned long end = phys_addr + size - 1; @@ -41,6 +41,6 @@ void __iomem *ioremap_prot(phys_addr_t phys_addr, size_t size, } } - return generic_ioremap_prot(phys_addr, size, __pgprot(prot)); + return generic_ioremap_prot(phys_addr, size, prot); } EXPORT_SYMBOL(ioremap_prot); diff --git a/arch/powerpc/include/asm/io.h b/arch/powerpc/include/asm/io.h index fd92ac450169..0436cdc7cfcc 100644 --- a/arch/powerpc/include/asm/io.h +++ b/arch/powerpc/include/asm/io.h @@ -895,7 +895,7 @@ void __iomem *ioremap_wt(phys_addr_t address, unsigned long size); void __iomem *ioremap_coherent(phys_addr_t address, unsigned long size); #define ioremap_cache(addr, size) \ - ioremap_prot((addr), (size), pgprot_val(PAGE_KERNEL)) + ioremap_prot((addr), (size), PAGE_KERNEL) #define iounmap iounmap diff --git a/arch/powerpc/mm/ioremap.c b/arch/powerpc/mm/ioremap.c index 7b0afcabd89f..0d6615620ada 100644 --- a/arch/powerpc/mm/ioremap.c +++ b/arch/powerpc/mm/ioremap.c @@ -41,9 +41,9 @@ void __iomem *ioremap_coherent(phys_addr_t addr, unsigned long size) return __ioremap_caller(addr, size, prot, caller); } -void __iomem *ioremap_prot(phys_addr_t addr, size_t size, unsigned long flags) +void __iomem *ioremap_prot(phys_addr_t addr, size_t size, pgprot_t prot) { - pte_t pte = __pte(flags); + pte_t pte = __pte(pgprot_val(prot)); void *caller = __builtin_return_address(0); /* writeable implies 
dirty for kernel addresses */ diff --git a/arch/powerpc/platforms/ps3/spu.c b/arch/powerpc/platforms/ps3/spu.c index 4a2520ec6d7f..61b37c9400b2 100644 --- a/arch/powerpc/platforms/ps3/spu.c +++ b/arch/powerpc/platforms/ps3/spu.c @@ -190,10 +190,10 @@ static void spu_unmap(struct spu *spu) static int __init setup_areas(struct spu *spu) { struct table {char* name; unsigned long addr; unsigned long size;}; - unsigned long shadow_flags = pgprot_val(pgprot_noncached_wc(PAGE_KERNEL_RO)); spu_pdata(spu)->shadow = ioremap_prot(spu_pdata(spu)->shadow_addr, - sizeof(struct spe_shadow), shadow_flags); + sizeof(struct spe_shadow), + pgprot_noncached_wc(PAGE_KERNEL_RO)); if (!spu_pdata(spu)->shadow) { pr_debug("%s:%d: ioremap shadow failed\n", __func__, __LINE__); goto fail_ioremap; diff --git a/arch/riscv/include/asm/io.h b/arch/riscv/include/asm/io.h index 1c5c641075d2..0536846db9b6 100644 --- a/arch/riscv/include/asm/io.h +++ b/arch/riscv/include/asm/io.h @@ -137,7 +137,7 @@ __io_writes_outs(outs, u64, q, __io_pbr(), __io_paw()) #ifdef CONFIG_MMU #define arch_memremap_wb(addr, size) \ - ((__force void *)ioremap_prot((addr), (size), _PAGE_KERNEL)) + ((__force void *)ioremap_prot((addr), (size), __pgprot(_PAGE_KERNEL))) #endif #endif /* _ASM_RISCV_IO_H */ diff --git a/arch/riscv/kernel/acpi.c b/arch/riscv/kernel/acpi.c index 2fd29695a788..3f6d5a6789e8 100644 --- a/arch/riscv/kernel/acpi.c +++ b/arch/riscv/kernel/acpi.c @@ -305,7 +305,7 @@ void __iomem *acpi_os_ioremap(acpi_physical_address phys, acpi_size size) } } - return ioremap_prot(phys, size, pgprot_val(prot)); + return ioremap_prot(phys, size, prot); } #ifdef CONFIG_PCI diff --git a/arch/s390/include/asm/io.h b/arch/s390/include/asm/io.h index fc9933a743d6..82f1043a4fc3 100644 --- a/arch/s390/include/asm/io.h +++ b/arch/s390/include/asm/io.h @@ -33,9 +33,9 @@ void unxlate_dev_mem_ptr(phys_addr_t phys, void *addr); #define _PAGE_IOREMAP pgprot_val(PAGE_KERNEL) #define ioremap_wc(addr, size) \ - ioremap_prot((addr), (size), pgprot_val(pgprot_writecombine(PAGE_KERNEL))) + ioremap_prot((addr), (size), pgprot_writecombine(PAGE_KERNEL)) #define ioremap_wt(addr, size) \ - ioremap_prot((addr), (size), pgprot_val(pgprot_writethrough(PAGE_KERNEL))) + ioremap_prot((addr), (size), pgprot_writethrough(PAGE_KERNEL)) static inline void __iomem *ioport_map(unsigned long port, unsigned int nr) { diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c index 88f72745fa59..9fdcd733d40e 100644 --- a/arch/s390/pci/pci.c +++ b/arch/s390/pci/pci.c @@ -255,7 +255,7 @@ resource_size_t pcibios_align_resource(void *data, const struct resource *res, } void __iomem *ioremap_prot(phys_addr_t phys_addr, size_t size, - unsigned long prot) + pgprot_t prot) { /* * When PCI MIO instructions are unavailable the "physical" address @@ -265,7 +265,7 @@ void __iomem *ioremap_prot(phys_addr_t phys_addr, size_t size, if (!static_branch_unlikely(&have_mio)) return (void __iomem *)phys_addr; - return generic_ioremap_prot(phys_addr, size, __pgprot(prot)); + return generic_ioremap_prot(phys_addr, size, prot); } EXPORT_SYMBOL(ioremap_prot); diff --git a/arch/sh/boards/mach-landisk/setup.c b/arch/sh/boards/mach-landisk/setup.c index 2c44b94f82fb..1b3f43c3ac46 100644 --- a/arch/sh/boards/mach-landisk/setup.c +++ b/arch/sh/boards/mach-landisk/setup.c @@ -58,7 +58,7 @@ static int __init landisk_devices_setup(void) /* open I/O area window */ paddrbase = virt_to_phys((void *)PA_AREA5_IO); prot = PAGE_KERNEL_PCC(1, _PAGE_PCC_IO16); - cf_ide_base = ioremap_prot(paddrbase, PAGE_SIZE, 
pgprot_val(prot)); + cf_ide_base = ioremap_prot(paddrbase, PAGE_SIZE, prot); if (!cf_ide_base) { printk("allocate_cf_area : can't open CF I/O window!\n"); return -ENOMEM; diff --git a/arch/sh/boards/mach-lboxre2/setup.c b/arch/sh/boards/mach-lboxre2/setup.c index 20d01b430f2a..e95bde207adb 100644 --- a/arch/sh/boards/mach-lboxre2/setup.c +++ b/arch/sh/boards/mach-lboxre2/setup.c @@ -53,7 +53,7 @@ static int __init lboxre2_devices_setup(void) paddrbase = virt_to_phys((void*)PA_AREA5_IO); psize = PAGE_SIZE; prot = PAGE_KERNEL_PCC(1, _PAGE_PCC_IO16); - cf0_io_base = (u32)ioremap_prot(paddrbase, psize, pgprot_val(prot)); + cf0_io_base = (u32)ioremap_prot(paddrbase, psize, prot); if (!cf0_io_base) { printk(KERN_ERR "%s : can't open CF I/O window!\n" , __func__ ); return -ENOMEM; diff --git a/arch/sh/boards/mach-sh03/setup.c b/arch/sh/boards/mach-sh03/setup.c index 3901b6031ad5..5c9312f334d3 100644 --- a/arch/sh/boards/mach-sh03/setup.c +++ b/arch/sh/boards/mach-sh03/setup.c @@ -75,7 +75,7 @@ static int __init sh03_devices_setup(void) /* open I/O area window */ paddrbase = virt_to_phys((void *)PA_AREA5_IO); prot = PAGE_KERNEL_PCC(1, _PAGE_PCC_IO16); - cf_ide_base = ioremap_prot(paddrbase, PAGE_SIZE, pgprot_val(prot)); + cf_ide_base = ioremap_prot(paddrbase, PAGE_SIZE, prot); if (!cf_ide_base) { printk("allocate_cf_area : can't open CF I/O window!\n"); return -ENOMEM; diff --git a/arch/sh/include/asm/io.h b/arch/sh/include/asm/io.h index cf5eab840d57..531ec49b878d 100644 --- a/arch/sh/include/asm/io.h +++ b/arch/sh/include/asm/io.h @@ -299,7 +299,7 @@ unsigned long long poke_real_address_q(unsigned long long addr, #define _PAGE_IOREMAP pgprot_val(PAGE_KERNEL_NOCACHE) #define ioremap_cache(addr, size) \ - ioremap_prot((addr), (size), pgprot_val(PAGE_KERNEL)) + ioremap_prot((addr), (size), PAGE_KERNEL) #endif /* CONFIG_MMU */ #include diff --git a/arch/sh/mm/ioremap.c b/arch/sh/mm/ioremap.c index 33d20f34560f..5bbde53fb32d 100644 --- a/arch/sh/mm/ioremap.c +++ b/arch/sh/mm/ioremap.c @@ -73,10 +73,9 @@ __ioremap_29bit(phys_addr_t offset, unsigned long size, pgprot_t prot) #endif /* CONFIG_29BIT */ void __iomem __ref *ioremap_prot(phys_addr_t phys_addr, size_t size, - unsigned long prot) + pgprot_t pgprot) { void __iomem *mapped; - pgprot_t pgprot = __pgprot(prot); mapped = __ioremap_trapped(phys_addr, size); if (mapped) diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h index ed580c7f9d0a..0794936ec187 100644 --- a/arch/x86/include/asm/io.h +++ b/arch/x86/include/asm/io.h @@ -170,7 +170,7 @@ extern void __iomem *ioremap_uc(resource_size_t offset, unsigned long size); #define ioremap_uc ioremap_uc extern void __iomem *ioremap_cache(resource_size_t offset, unsigned long size); #define ioremap_cache ioremap_cache -extern void __iomem *ioremap_prot(resource_size_t offset, unsigned long size, unsigned long prot_val); +extern void __iomem *ioremap_prot(resource_size_t offset, unsigned long size, pgprot_t prot); #define ioremap_prot ioremap_prot extern void __iomem *ioremap_encrypted(resource_size_t phys_addr, unsigned long size); #define ioremap_encrypted ioremap_encrypted diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c index 38ff7791a9c7..d501f0871aa5 100644 --- a/arch/x86/mm/ioremap.c +++ b/arch/x86/mm/ioremap.c @@ -440,10 +440,10 @@ void __iomem *ioremap_cache(resource_size_t phys_addr, unsigned long size) EXPORT_SYMBOL(ioremap_cache); void __iomem *ioremap_prot(resource_size_t phys_addr, unsigned long size, - unsigned long prot_val) + pgprot_t prot) { return 
__ioremap_caller(phys_addr, size, - pgprot2cachemode(__pgprot(prot_val)), + pgprot2cachemode(prot), __builtin_return_address(0), false); } EXPORT_SYMBOL(ioremap_prot); diff --git a/arch/xtensa/include/asm/io.h b/arch/xtensa/include/asm/io.h index 934e58399c8c..7cdcc2deab3e 100644 --- a/arch/xtensa/include/asm/io.h +++ b/arch/xtensa/include/asm/io.h @@ -29,7 +29,7 @@ * I/O memory mapping functions. */ void __iomem *ioremap_prot(phys_addr_t phys_addr, size_t size, - unsigned long prot); + pgprot_t prot); #define ioremap_prot ioremap_prot #define iounmap iounmap @@ -40,7 +40,7 @@ static inline void __iomem *ioremap(unsigned long offset, unsigned long size) return (void*)(offset-XCHAL_KIO_PADDR+XCHAL_KIO_BYPASS_VADDR); else return ioremap_prot(offset, size, - pgprot_val(pgprot_noncached(PAGE_KERNEL))); + pgprot_noncached(PAGE_KERNEL)); } #define ioremap ioremap @@ -51,7 +51,7 @@ static inline void __iomem *ioremap_cache(unsigned long offset, && offset - XCHAL_KIO_PADDR < XCHAL_KIO_SIZE) return (void*)(offset-XCHAL_KIO_PADDR+XCHAL_KIO_CACHED_VADDR); else - return ioremap_prot(offset, size, pgprot_val(PAGE_KERNEL)); + return ioremap_prot(offset, size, PAGE_KERNEL); } #define ioremap_cache ioremap_cache diff --git a/arch/xtensa/mm/ioremap.c b/arch/xtensa/mm/ioremap.c index 8ca660b7ab49..26f238fa9d0d 100644 --- a/arch/xtensa/mm/ioremap.c +++ b/arch/xtensa/mm/ioremap.c @@ -11,12 +11,12 @@ #include void __iomem *ioremap_prot(phys_addr_t phys_addr, size_t size, - unsigned long prot) + pgprot_t prot) { unsigned long pfn = __phys_to_pfn((phys_addr)); WARN_ON(pfn_valid(pfn)); - return generic_ioremap_prot(phys_addr, size, __pgprot(prot)); + return generic_ioremap_prot(phys_addr, size, prot); } EXPORT_SYMBOL(ioremap_prot); diff --git a/include/asm-generic/io.h b/include/asm-generic/io.h index a5cbbf3e26ec..402020b23423 100644 --- a/include/asm-generic/io.h +++ b/include/asm-generic/io.h @@ -1111,7 +1111,7 @@ void __iomem *generic_ioremap_prot(phys_addr_t phys_addr, size_t size, pgprot_t prot); void __iomem *ioremap_prot(phys_addr_t phys_addr, size_t size, - unsigned long prot); + pgprot_t prot); void iounmap(volatile void __iomem *addr); void generic_iounmap(volatile void __iomem *addr); @@ -1120,7 +1120,7 @@ void generic_iounmap(volatile void __iomem *addr); static inline void __iomem *ioremap(phys_addr_t addr, size_t size) { /* _PAGE_IOREMAP needs to be supplied by the architecture */ - return ioremap_prot(addr, size, _PAGE_IOREMAP); + return ioremap_prot(addr, size, __pgprot(_PAGE_IOREMAP)); } #endif #endif /* !CONFIG_MMU || CONFIG_GENERIC_IOREMAP */ diff --git a/mm/ioremap.c b/mm/ioremap.c index 3e049dfb28bd..c36dd9f62fd5 100644 --- a/mm/ioremap.c +++ b/mm/ioremap.c @@ -50,9 +50,9 @@ void __iomem *generic_ioremap_prot(phys_addr_t phys_addr, size_t size, #ifndef ioremap_prot void __iomem *ioremap_prot(phys_addr_t phys_addr, size_t size, - unsigned long prot) + pgprot_t prot) { - return generic_ioremap_prot(phys_addr, size, __pgprot(prot)); + return generic_ioremap_prot(phys_addr, size, prot); } EXPORT_SYMBOL(ioremap_prot); #endif diff --git a/mm/memory.c b/mm/memory.c index 39bceed7448f..270c9357475d 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -6727,7 +6727,7 @@ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr, void *buf, int len, int write) { resource_size_t phys_addr; - unsigned long prot = 0; + pgprot_t prot = __pgprot(0); void __iomem *maddr; int offset = offset_in_page(addr); int ret = -EINVAL; @@ -6737,7 +6737,7 @@ int generic_access_phys(struct vm_area_struct *vma, 
unsigned long addr, retry: if (follow_pfnmap_start(&args)) return -EINVAL; - prot = pgprot_val(args.pgprot); + prot = args.pgprot; phys_addr = (resource_size_t)args.pfn << PAGE_SHIFT; writable = args.writable; follow_pfnmap_end(&args); @@ -6752,7 +6752,7 @@ retry: if (follow_pfnmap_start(&args)) goto out_unmap; - if ((prot != pgprot_val(args.pgprot)) || + if ((pgprot_val(prot) != pgprot_val(args.pgprot)) || (phys_addr != (args.pfn << PAGE_SHIFT)) || (writable != args.writable)) { follow_pfnmap_end(&args); -- 2.51.0 From 381ff0341ac63d245035f274cacd3f1b8068d388 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Tue, 18 Feb 2025 14:37:04 -0800 Subject: [PATCH 08/16] Docs/mm/damon/design: fix typo on DAMOS filters usage doc link Patch series "Docs/mm/damon: misc DAMOS filters documentation fixes and improves". Fix and improve DAMOS filters documentation by fixing a copy-paste typo, adding hugepage_size filter documentation on design doc, moving logic details from usage to design, clarify DAMOS filters handling sequence based on handling layer, and re-organizing the filters type list for easier understanding of the handling sequence. This patch (of 5): The link from DAMOS filters design doc to usage doc has a typo calling filters as watermarks. Fix it. Link: https://lkml.kernel.org/r/20250218223708.53437-1-sj@kernel.org Link: https://lkml.kernel.org/r/20250218223708.53437-2-sj@kernel.org Fixes: d31f5626a0e1 ("Docs/mm/damon/design: add links to sections of DAMON sysfs interface usage doc") Signed-off-by: SeongJae Park Cc: Jonathan Corbet Signed-off-by: Andrew Morton --- Documentation/mm/damon/design.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Documentation/mm/damon/design.rst b/Documentation/mm/damon/design.rst index e28c6a1b40ae..12ae7e1209c8 100644 --- a/Documentation/mm/damon/design.rst +++ b/Documentation/mm/damon/design.rst @@ -617,7 +617,7 @@ Below ``type`` of filters are currently supported. - Applied to pages that belonging to a given DAMON monitoring target. - Handled by the core logic. -To know how user-space can set the watermarks via :ref:`DAMON sysfs interface +To know how user-space can set the filters via :ref:`DAMON sysfs interface `, refer to :ref:`filters ` part of the documentation. -- 2.51.0 From e52a942b47c8b32072e7f342aca600016bcfca33 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Tue, 18 Feb 2025 14:37:05 -0800 Subject: [PATCH 09/16] Docs/mm/damon/design: document hugepage_size filter 'hugepage_size' DAMOS filter type is not documented on the design doc. Add a description of the type. Link: https://lkml.kernel.org/r/20250218223708.53437-3-sj@kernel.org Signed-off-by: SeongJae Park Cc: Jonathan Corbet Signed-off-by: Andrew Morton --- Documentation/mm/damon/design.rst | 3 +++ 1 file changed, 3 insertions(+) diff --git a/Documentation/mm/damon/design.rst b/Documentation/mm/damon/design.rst index 12ae7e1209c8..a959c081bc59 100644 --- a/Documentation/mm/damon/design.rst +++ b/Documentation/mm/damon/design.rst @@ -610,6 +610,9 @@ Below ``type`` of filters are currently supported. - Applied to pages that are accessed after the last access check from the scheme. - Handled by operations set layer. Supported by only ``paddr`` set. +- pages that managed in a given size range + - Applied to pages that managed in a given size range. + - Handled by operations set layer. Supported by only ``paddr`` set. - address range - Applied to pages that belonging to a given address range. - Handled by the core logic. 
-- 2.51.0 From 0f28583b28d84a5eaaf299e01b7301922ae8da46 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Tue, 18 Feb 2025 14:37:06 -0800 Subject: [PATCH 10/16] Docs/damon: move DAMOS filter type names and meaning to design doc DAMON sysfs usage doc is describing DAMOS filter type names and their meanings in short. The design doc is providing the short meaning and detailed descriptions, too. This is unnecessary duplicates and confuses where to document new DAMOS filter types and features. Move the details from usage to design doc. Link: https://lkml.kernel.org/r/20250218223708.53437-4-sj@kernel.org Signed-off-by: SeongJae Park Cc: Jonathan Corbet Signed-off-by: Andrew Morton --- Documentation/admin-guide/mm/damon/usage.rst | 28 +++++++++----------- Documentation/mm/damon/design.rst | 12 ++++----- 2 files changed, 19 insertions(+), 21 deletions(-) diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst index 51af66c208c5..dc37bba96273 100644 --- a/Documentation/admin-guide/mm/damon/usage.rst +++ b/Documentation/admin-guide/mm/damon/usage.rst @@ -408,21 +408,19 @@ in the numeric order. Each filter directory contains nine files, namely ``type``, ``matching``, ``allow``, ``memcg_path``, ``addr_start``, ``addr_end``, ``min``, ``max`` -and ``target_idx``. To ``type`` file, you can write one of six special -keywords: ``anon`` for anonymous pages, ``memcg`` for specific memory cgroup, -``young`` for young pages, ``addr`` for specific address range (an open-ended -interval), ``hugepage_size`` for large folios of a specific size range [``min``, -``max``] or ``target`` for specific DAMON monitoring target filtering. Meaning -of the types are same to the description on the :ref:`design doc -`. - -In case of the memory cgroup filtering, you can specify the memory cgroup of -the interest by writing the path of the memory cgroup from the cgroups mount -point to ``memcg_path`` file. In case of the address range filtering, you can -specify the start and end address of the range to ``addr_start`` and -``addr_end`` files, respectively. For the DAMON monitoring target filtering, -you can specify the index of the target between the list of the DAMON context's -monitoring targets list to ``target_idx`` file. +and ``target_idx``. To ``type`` file, you can write the type of the filter. +Refer to :ref:`the design doc ` for available type +names and their meanings. + +For ``memcg`` type, you can specify the memory cgroup of the interest by +writing the path of the memory cgroup from the cgroups mount point to +``memcg_path`` file. For ``addr`` type, you can specify the start and end +address of the range (open-ended interval) to ``addr_start`` and ``addr_end`` +files, respectively. For ``hugepage_size`` type, you can specify the minimum +and maximum size of the range (closed interval) to ``min`` and ``max`` files, +respectively. For ``target`` type, you can specify the index of the target +between the list of the DAMON context's monitoring targets list to +``target_idx`` file. You can write ``Y`` or ``N`` to ``matching`` file to specify whether the filter is for memory that matches the ``type``. You can write ``Y`` or ``N`` to diff --git a/Documentation/mm/damon/design.rst b/Documentation/mm/damon/design.rst index a959c081bc59..7360e5ac0d06 100644 --- a/Documentation/mm/damon/design.rst +++ b/Documentation/mm/damon/design.rst @@ -600,23 +600,23 @@ counted as the scheme has tried. This difference affects the statistics. 
Below ``type`` of filters are currently supported. -- anonymous page +- anon - Applied to pages that containing data that not stored in files. - Handled by operations set layer. Supported by only ``paddr`` set. -- memory cgroup +- memcg - Applied to pages that belonging to a given cgroup. - Handled by operations set layer. Supported by only ``paddr`` set. -- young page +- young - Applied to pages that are accessed after the last access check from the scheme. - Handled by operations set layer. Supported by only ``paddr`` set. -- pages that managed in a given size range +- hugepage_size - Applied to pages that managed in a given size range. - Handled by operations set layer. Supported by only ``paddr`` set. -- address range +- addr - Applied to pages that belonging to a given address range. - Handled by the core logic. -- DAMON monitoring target +- target - Applied to pages that belonging to a given DAMON monitoring target. - Handled by the core logic. -- 2.51.0 From 4a4d8e792506432270e27516cf03a8208cbbec8b Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Tue, 18 Feb 2025 14:37:07 -0800 Subject: [PATCH 11/16] Docs/mm/damon/design: clarify handling layer based filters evaluation sequence If an element of memory matches a DAMOS filter, filters that installed after that get no chance to make any effect to the element. Hence in what order DAMOS filters are handled is important, if both allow filters and reject filters are used together. The ordering is affected by both the installation order and which layter the filters are handled. The design document is not clearly documenting the latter part. Clarify it on the design doc. Link: https://lkml.kernel.org/r/20250218223708.53437-5-sj@kernel.org Signed-off-by: SeongJae Park Cc: Jonathan Corbet Signed-off-by: Andrew Morton --- Documentation/mm/damon/design.rst | 28 +++++++++++++++------------- 1 file changed, 15 insertions(+), 13 deletions(-) diff --git a/Documentation/mm/damon/design.rst b/Documentation/mm/damon/design.rst index 7360e5ac0d06..8b9727d91434 100644 --- a/Documentation/mm/damon/design.rst +++ b/Documentation/mm/damon/design.rst @@ -569,11 +569,21 @@ number of filters for each scheme. Each filter specifies - whether it is to allow (include) or reject (exclude) applying the scheme's action to the memory (``allow``). -When multiple filters are installed, each filter is evaluated in the installed -order. If a part of memory is matched to one of the filter, next filters are -ignored. If the memory passes through the filters evaluation stage because it -is not matched to any of the filters, applying the scheme's action to it is -allowed, same to the behavior when no filter exists. +For efficient handling of filters, some types of filters are handled by the +core layer, while others are handled by operations set. In the latter case, +hence, support of the filter types depends on the DAMON operations set. In +case of the core layer-handled filters, the memory regions that excluded by the +filter are not counted as the scheme has tried to the region. In contrast, if +a memory regions is filtered by an operations set layer-handled filter, it is +counted as the scheme has tried. This difference affects the statistics. + +When multiple filters are installed, the group of filters that handled by the +core layer are evaluated first. After that, the group of filters that handled +by the operations layer are evaluated. Filters in each of the groups are +evaluated in the installed order. 
If a part of memory is matched to one of the +filter, next filters are ignored. If the memory passes through the filters +evaluation stage because it is not matched to any of the filters, applying the +scheme's action to it is allowed, same to the behavior when no filter exists. For example, let's assume 1) a filter for allowing anonymous pages and 2) another filter for rejecting young pages are installed in the order. If a page @@ -590,14 +600,6 @@ filter-allowed or filters evaluation stage passed. It means that installing allow-filters at the end of the list makes no practical change but only filters-checking overhead. -For efficient handling of filters, some types of filters are handled by the -core layer, while others are handled by operations set. In the latter case, -hence, support of the filter types depends on the DAMON operations set. In -case of the core layer-handled filters, the memory regions that excluded by the -filter are not counted as the scheme has tried to the region. In contrast, if -a memory regions is filtered by an operations set layer-handled filter, it is -counted as the scheme has tried. This difference affects the statistics. - Below ``type`` of filters are currently supported. - anon -- 2.51.0 From edab6ffd792a7774e7d5fd7eb1d6251e452010f5 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Tue, 18 Feb 2025 14:37:08 -0800 Subject: [PATCH 12/16] Docs/mm/damon/design: categorize DAMOS filter types based on handling layer On what DAMON layer a DAMOS filter is handled is important to expect in what order filters will be evaluated. Re-organize the DAMOS filter types list on the design doc to categorize types based on the handling layer, to let users more easily understand the handling order. Link: https://lkml.kernel.org/r/20250218223708.53437-6-sj@kernel.org Signed-off-by: SeongJae Park Cc: Jonathan Corbet Signed-off-by: Andrew Morton --- Documentation/mm/damon/design.rst | 34 ++++++++++++++----------------- 1 file changed, 15 insertions(+), 19 deletions(-) diff --git a/Documentation/mm/damon/design.rst b/Documentation/mm/damon/design.rst index 8b9727d91434..6a66aa0833fd 100644 --- a/Documentation/mm/damon/design.rst +++ b/Documentation/mm/damon/design.rst @@ -602,25 +602,21 @@ filters-checking overhead. Below ``type`` of filters are currently supported. -- anon - - Applied to pages that containing data that not stored in files. - - Handled by operations set layer. Supported by only ``paddr`` set. -- memcg - - Applied to pages that belonging to a given cgroup. - - Handled by operations set layer. Supported by only ``paddr`` set. -- young - - Applied to pages that are accessed after the last access check from the - scheme. - - Handled by operations set layer. Supported by only ``paddr`` set. -- hugepage_size - - Applied to pages that managed in a given size range. - - Handled by operations set layer. Supported by only ``paddr`` set. -- addr - - Applied to pages that belonging to a given address range. - - Handled by the core logic. -- target - - Applied to pages that belonging to a given DAMON monitoring target. - - Handled by the core logic. +- Core layer handled + - addr + - Applied to pages that belonging to a given address range. + - target + - Applied to pages that belonging to a given DAMON monitoring target. +- Operations layer handled, supported by only ``paddr`` operations set. + - anon + - Applied to pages that containing data that not stored in files. + - memcg + - Applied to pages that belonging to a given cgroup. 
+ - young + - Applied to pages that are accessed after the last access check from the + scheme. + - hugepage_size + - Applied to pages that managed in a given size range. To know how user-space can set the filters via :ref:`DAMON sysfs interface `, refer to :ref:`filters ` part of the -- 2.51.0 From 7365ff2c8eef4ea50b5f3ae2349fa180e3782ef1 Mon Sep 17 00:00:00 2001 From: Frank van der Linden Date: Fri, 28 Feb 2025 18:29:02 +0000 Subject: [PATCH 13/16] mm/cma: export total and free number of pages for CMA areas Patch series "hugetlb/CMA improvements for large systems", v5. On large systems, we observed some issues with hugetlb and CMA: 1) When specifying a large number of hugetlb boot pages (hugepages= on the commandline), the kernel may run out of memory before it even gets to HVO. For example, if you have a 3072G system, and want to use 3024 1G hugetlb pages for VMs, that should leave you plenty of space for the hypervisor, provided you have the hugetlb vmemmap optimization (HVO) enabled. However, since the vmemmap pages are always allocated first, and then later in boot freed, you will actually run yourself out of memory before you can do HVO. This means not getting all the hugetlb pages you want, and worse, failure to boot if there is an allocation failure in the system from which it can't recover. 2) There is a system setup where you might want to use hugetlb_cma with a large value (say, again, 3024 out of 3072G like above), and then lower that if system usage allows it, to make room for non-hugetlb processes. For this, a variation of the problem above applies: the kernel runs out of unmovable space to allocate from before you finish boot, since your CMA area takes up all the space. 3) CMA wants to use one big contiguous area for allocations. Which fails if you have the aforementioned 3T system with a gap in the middle of physical memory (like the < 40bits BIOS DMA area seen on some AMD systems). You then won't be able to set up a CMA area for one of the NUMA nodes, leading to loss of half of your hugetlb CMA area. 4) Under the scenario mentioned in 2), when trying to grow the number of hugetlb pages after dropping it for a while, new CMA allocations may fail occasionally. This is not unexpected, some transient references on pages may prevent cma_alloc from succeeding under memory pressure. However, the hugetlb code then falls back to a normal contiguous alloc, which may end up succeeding. This is not always desired behavior. If you have a large CMA area, then the kernel has a restricted amount of memory it can do unmovable allocations from (a well known issue). A normal contiguous alloc may eat further in to this space. To resolve these issues, do the following: * Add hooks to the section init code to do custom initialization of memmap pages. Hugetlb bootmem (memblock) allocated pages can then be pre-HVOed. This avoids allocating a large number of vmemmap pages early in boot, only to have them be freed again later, and also avoids running out of memory as described under 1). Using these hooks for hugetlb is optional. It requires moving hugetlb bootmem allocation to an earlier spot by the architecture. This has been enabled on x86. * hugetlb_cma doesn't care about the CMA area it uses being one large contiguous range. Multiple smaller ranges are fine. The only requirements are that the areas should be on one NUMA node, and individual gigantic pages should be allocatable from them. So, implement multi-range support for CMA, avoiding issue 3). 
* Introduce a hugetlb_cma_only option on the commandline. This only allows allocations from CMA for gigantic pages, if hugetlb_cma= is also specified. * With hugetlb_cma_only active, it also makes sense to be able to pre-allocate gigantic hugetlb pages at boot time from the CMA area(s). Add a rudimentary early CMA allocation interface, that just grabs a piece of memblock-allocated space from the CMA area, which gets marked as allocated in the CMA bitmap when the CMA area is initialized. With this, hugepages= can be supported with hugetlb_cma=, making scenario 2) work. Additionally, fix some minor bugs, with one worth mentioning: since hugetlb gigantic bootmem pages are allocated by memblock, they may span multiple zones, as memblock doesn't (and mostly can't) know about zones. This can cause problems. A hugetlb page spanning multiple zones is bad, and it's worse with HVO, when the de-HVO step effectively sneakily re-assigns pages to a different zone than originally configured, since the tail pages all inherit the zone from the first 60 tail pages. This condition is not common, but can be easily reproduced using ZONE_MOVABLE. To fix this, add checks to see if gigantic bootmem pages intersect with multiple zones, and do not use them if they do, giving them back to the page allocator instead. The first patch is kind of along for the ride, except that maintaining an available_count for a CMA area is convenient for the multiple range support. This patch (of 27): In addition to the number of allocations and releases, system management software may like to be aware of the size of CMA areas, and how many pages are available in it. This information is currently not available, so export it in total_page and available_pages, respectively. The name 'available_pages' was picked over 'free_pages' because 'free' implies that the pages are unused. But they might not be, they just haven't been used by cma_alloc The number of available pages is tracked regardless of CONFIG_CMA_SYSFS, allowing for a few minor shortcuts in the code, avoiding bitmap operations. Link: https://lkml.kernel.org/r/20250228182928.2645936-2-fvdl@google.com Signed-off-by: Frank van der Linden Reviewed-by: Oscar Salvador Cc: David Hildenbrand Cc: Joao Martins Cc: Muchun Song Cc: Roman Gushchin (Cruise) Cc: Usama Arif Cc: Yu Zhao Cc: Zi Yan Cc: Alexander Gordeev Cc: Andy Lutomirski Cc: Arnd Bergmann Cc: Dan Carpenter Cc: Dave Hansen Cc: Heiko Carstens Cc: Johannes Weiner Cc: Madhavan Srinivasan Cc: Michael Ellerman Cc: Peter Zijlstra Cc: Vasily Gorbik Signed-off-by: Andrew Morton --- Documentation/ABI/testing/sysfs-kernel-mm-cma | 13 +++++++++++ mm/cma.c | 22 ++++++++++++++----- mm/cma.h | 1 + mm/cma_debug.c | 5 +---- mm/cma_sysfs.c | 20 +++++++++++++++++ 5 files changed, 51 insertions(+), 10 deletions(-) diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-cma b/Documentation/ABI/testing/sysfs-kernel-mm-cma index dfd755201142..aaf2a5d8b13b 100644 --- a/Documentation/ABI/testing/sysfs-kernel-mm-cma +++ b/Documentation/ABI/testing/sysfs-kernel-mm-cma @@ -29,3 +29,16 @@ Date: Feb 2024 Contact: Anshuman Khandual Description: the number of pages CMA API succeeded to release + +What: /sys/kernel/mm/cma//total_pages +Date: Jun 2024 +Contact: Frank van der Linden +Description: + The size of the CMA area in pages. + +What: /sys/kernel/mm/cma//available_pages +Date: Jun 2024 +Contact: Frank van der Linden +Description: + The number of pages in the CMA area that are still + available for CMA allocation. 
diff --git a/mm/cma.c b/mm/cma.c index de5bc0c81fc2..95a8788e54d3 100644 --- a/mm/cma.c +++ b/mm/cma.c @@ -86,6 +86,7 @@ static void cma_clear_bitmap(struct cma *cma, unsigned long pfn, spin_lock_irqsave(&cma->lock, flags); bitmap_clear(cma->bitmap, bitmap_no, bitmap_count); + cma->available_count += count; spin_unlock_irqrestore(&cma->lock, flags); } @@ -133,7 +134,7 @@ out_error: free_reserved_page(pfn_to_page(pfn)); } totalcma_pages -= cma->count; - cma->count = 0; + cma->available_count = cma->count = 0; pr_err("CMA area %s could not be activated\n", cma->name); } @@ -206,7 +207,7 @@ int __init cma_init_reserved_mem(phys_addr_t base, phys_addr_t size, snprintf(cma->name, CMA_MAX_NAME, "cma%d\n", cma_area_count); cma->base_pfn = PFN_DOWN(base); - cma->count = size >> PAGE_SHIFT; + cma->available_count = cma->count = size >> PAGE_SHIFT; cma->order_per_bit = order_per_bit; *res_cma = cma; cma_area_count++; @@ -390,7 +391,7 @@ static void cma_debug_show_areas(struct cma *cma) { unsigned long next_zero_bit, next_set_bit, nr_zero; unsigned long start = 0; - unsigned long nr_part, nr_total = 0; + unsigned long nr_part; unsigned long nbits = cma_bitmap_maxno(cma); spin_lock_irq(&cma->lock); @@ -402,12 +403,12 @@ static void cma_debug_show_areas(struct cma *cma) next_set_bit = find_next_bit(cma->bitmap, nbits, next_zero_bit); nr_zero = next_set_bit - next_zero_bit; nr_part = nr_zero << cma->order_per_bit; - pr_cont("%s%lu@%lu", nr_total ? "+" : "", nr_part, + pr_cont("%s%lu@%lu", start ? "+" : "", nr_part, next_zero_bit); - nr_total += nr_part; start = next_zero_bit + nr_zero; } - pr_cont("=> %lu free of %lu total pages\n", nr_total, cma->count); + pr_cont("=> %lu free of %lu total pages\n", cma->available_count, + cma->count); spin_unlock_irq(&cma->lock); } @@ -444,6 +445,14 @@ static struct page *__cma_alloc(struct cma *cma, unsigned long count, for (;;) { spin_lock_irq(&cma->lock); + /* + * If the request is larger than the available number + * of pages, stop right away. + */ + if (count > cma->available_count) { + spin_unlock_irq(&cma->lock); + break; + } bitmap_no = bitmap_find_next_zero_area_off(cma->bitmap, bitmap_maxno, start, bitmap_count, mask, offset); @@ -452,6 +461,7 @@ static struct page *__cma_alloc(struct cma *cma, unsigned long count, break; } bitmap_set(cma->bitmap, bitmap_no, bitmap_count); + cma->available_count -= count; /* * It's safe to drop the lock here. We've marked this region for * our exclusive use. 
If the migration fails we will take the diff --git a/mm/cma.h b/mm/cma.h index 8485ef893e99..3dd3376ae980 100644 --- a/mm/cma.h +++ b/mm/cma.h @@ -13,6 +13,7 @@ struct cma_kobject { struct cma { unsigned long base_pfn; unsigned long count; + unsigned long available_count; unsigned long *bitmap; unsigned int order_per_bit; /* Order of pages represented by one bit */ spinlock_t lock; diff --git a/mm/cma_debug.c b/mm/cma_debug.c index 602fff89b15f..89236f22230a 100644 --- a/mm/cma_debug.c +++ b/mm/cma_debug.c @@ -34,13 +34,10 @@ DEFINE_DEBUGFS_ATTRIBUTE(cma_debugfs_fops, cma_debugfs_get, NULL, "%llu\n"); static int cma_used_get(void *data, u64 *val) { struct cma *cma = data; - unsigned long used; spin_lock_irq(&cma->lock); - /* pages counter is smaller than sizeof(int) */ - used = bitmap_weight(cma->bitmap, (int)cma_bitmap_maxno(cma)); + *val = cma->count - cma->available_count; spin_unlock_irq(&cma->lock); - *val = (u64)used << cma->order_per_bit; return 0; } diff --git a/mm/cma_sysfs.c b/mm/cma_sysfs.c index f50db3973171..97acd3e5a6a5 100644 --- a/mm/cma_sysfs.c +++ b/mm/cma_sysfs.c @@ -62,6 +62,24 @@ static ssize_t release_pages_success_show(struct kobject *kobj, } CMA_ATTR_RO(release_pages_success); +static ssize_t total_pages_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + struct cma *cma = cma_from_kobj(kobj); + + return sysfs_emit(buf, "%lu\n", cma->count); +} +CMA_ATTR_RO(total_pages); + +static ssize_t available_pages_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + struct cma *cma = cma_from_kobj(kobj); + + return sysfs_emit(buf, "%lu\n", cma->available_count); +} +CMA_ATTR_RO(available_pages); + static void cma_kobj_release(struct kobject *kobj) { struct cma *cma = cma_from_kobj(kobj); @@ -75,6 +93,8 @@ static struct attribute *cma_attrs[] = { &alloc_pages_success_attr.attr, &alloc_pages_fail_attr.attr, &release_pages_success_attr.attr, + &total_pages_attr.attr, + &available_pages_attr.attr, NULL, }; ATTRIBUTE_GROUPS(cma); -- 2.51.0 From c009da4258f9885c5a3749fc004870db9c0e7a99 Mon Sep 17 00:00:00 2001 From: Frank van der Linden Date: Fri, 28 Feb 2025 18:29:03 +0000 Subject: [PATCH 14/16] mm, cma: support multiple contiguous ranges, if requested Currently, CMA manages one range of physically contiguous memory. Creation of larger CMA areas with hugetlb_cma may run in to gaps in physical memory, so that they are not able to allocate that contiguous physical range from memblock when creating the CMA area. This can happen, for example, on an AMD system with > 1TB of memory, where there will be a gap just below the 1TB (40bit DMA) line. If you have set aside most of memory for potential hugetlb CMA allocation, cma_declare_contiguous_nid will fail. hugetlb_cma doesn't need the entire area to be one physically contiguous range. It just cares about being able to get physically contiguous chunks of a certain size (e.g. 1G), and it is fine to have the CMA area backed by multiple physical ranges, as long as it gets 1G contiguous allocations. Multi-range support is implemented by introducing an array of ranges, instead of just one big one. Each range has its own bitmap. Effectively, the allocate and release operations work as before, just per-range. So, instead of going through one large bitmap, they now go through a number of smaller ones. The maximum number of supported ranges is 8, as defined in CMA_MAX_RANGES. 
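To make the per-range bookkeeping described above concrete, here is a small userspace toy model; it is not the kernel code, and the struct and function names are invented for the illustration. One area is backed by two disjoint ranges, each with its own occupancy map, there is a shared available counter, and the allocator simply tries the ranges in order:

	#include <stdio.h>
	#include <string.h>

	#define MAX_RANGES	8
	#define RANGE_PAGES	16

	struct memrange {			/* one physically contiguous range */
		unsigned long base_pfn;
		unsigned long count;
		unsigned char used[RANGE_PAGES];	/* toy stand-in for the per-range bitmap */
	};

	struct area {				/* one logical CMA-like area */
		int nranges;
		unsigned long available;
		struct memrange ranges[MAX_RANGES];
	};

	static long range_alloc(struct memrange *r, unsigned long count)
	{
		unsigned long i, j;

		for (i = 0; i + count <= r->count; i++) {
			for (j = 0; j < count && !r->used[i + j]; j++)
				;
			if (j == count) {	/* found a free run, claim it */
				memset(&r->used[i], 1, count);
				return (long)(r->base_pfn + i);
			}
		}
		return -1;
	}

	static long area_alloc(struct area *a, unsigned long count)
	{
		long pfn;
		int r;

		if (count > a->available)	/* cheap cutoff, like available_count */
			return -1;
		for (r = 0; r < a->nranges; r++) {	/* try the ranges one by one */
			pfn = range_alloc(&a->ranges[r], count);
			if (pfn >= 0) {
				a->available -= count;
				return pfn;
			}
		}
		return -1;
	}

	int main(void)
	{
		/* Two disjoint physical ranges backing one logical area. */
		struct area a = {
			.nranges = 2,
			.available = 2 * RANGE_PAGES,
			.ranges = {
				{ .base_pfn = 0x1000, .count = RANGE_PAGES },
				{ .base_pfn = 0x9000, .count = RANGE_PAGES },
			},
		};

		printf("alloc 12 pages -> pfn %#lx\n", (unsigned long)area_alloc(&a, 12));
		printf("alloc 12 pages -> pfn %#lx\n", (unsigned long)area_alloc(&a, 12));
		printf("available: %lu pages\n", a.available);
		return 0;
	}

The second allocation does not fit in the remainder of the first range and spills over to the second one, which is exactly the behavior the real per-range allocator gives a caller such as hugetlb_cma.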
Since some current users of CMA expect a CMA area to just use one physically contiguous range, only allow for multiple ranges if a new interface, cma_declare_contiguous_multi, is used. The other interfaces will work like before, creating only CMA areas with 1 range. cma_declare_contiguous_multi works as follows, mimicking the default "bottom-up, above 4G" reservation approach: 0) Try cma_declare_contiguous_nid, which will use only one region. If this succeeds, return. This makes sure that for all the cases that currently work, the behavior remains unchanged even if the caller switches from cma_declare_contiguous_nid to cma_declare_contiguous_multi. 1) Select the largest free memblock ranges above 4G, with a maximum number of CMA_MAX_RANGES. 2) If we did not find at most CMA_MAX_RANGES ranges that add up to the total size requested, return -ENOMEM. 3) Sort the selected ranges by base address. 4) Reserve them bottom-up until we get what we wanted. Link: https://lkml.kernel.org/r/20250228182928.2645936-3-fvdl@google.com Signed-off-by: Frank van der Linden Cc: Arnd Bergmann Cc: Alexander Gordeev Cc: Andy Lutomirski Cc: Dan Carpenter Cc: Dave Hansen Cc: David Hildenbrand Cc: Heiko Carstens Cc: Joao Martins Cc: Johannes Weiner Cc: Madhavan Srinivasan Cc: Michael Ellerman Cc: Muchun Song Cc: Oscar Salvador Cc: Peter Zijlstra Cc: Roman Gushchin (Cruise) Cc: Usama Arif Cc: Vasily Gorbik Cc: Yu Zhao Cc: Zi Yan Signed-off-by: Andrew Morton --- Documentation/admin-guide/mm/cma_debugfs.rst | 10 +- include/linux/cma.h | 3 + mm/cma.c | 594 +++++++++++++++---- mm/cma.h | 27 +- mm/cma_debug.c | 56 +- 5 files changed, 550 insertions(+), 140 deletions(-) diff --git a/Documentation/admin-guide/mm/cma_debugfs.rst b/Documentation/admin-guide/mm/cma_debugfs.rst index 7367e6294ef6..4120e9cb0cd5 100644 --- a/Documentation/admin-guide/mm/cma_debugfs.rst +++ b/Documentation/admin-guide/mm/cma_debugfs.rst @@ -12,10 +12,16 @@ its CMA name like below: The structure of the files created under that directory is as follows: - - [RO] base_pfn: The base PFN (Page Frame Number) of the zone. + - [RO] base_pfn: The base PFN (Page Frame Number) of the CMA area. + This is the same as ranges/0/base_pfn. - [RO] count: Amount of memory in the CMA area. - [RO] order_per_bit: Order of pages represented by one bit. - - [RO] bitmap: The bitmap of page states in the zone. + - [RO] bitmap: The bitmap of allocated pages in the area. + This is the same as ranges/0/bitmap. + - [RO] ranges/N/base_pfn: The base PFN of contiguous range N + in the CMA area. + - [RO] ranges/N/bitmap: The bitmap of allocated pages in + range N in the CMA area. - [WO] alloc: Allocate N pages from that CMA area.
For example:: echo 5 > /cma//alloc diff --git a/include/linux/cma.h b/include/linux/cma.h index d15b64f51336..863427c27dc2 100644 --- a/include/linux/cma.h +++ b/include/linux/cma.h @@ -40,6 +40,9 @@ static inline int __init cma_declare_contiguous(phys_addr_t base, return cma_declare_contiguous_nid(base, size, limit, alignment, order_per_bit, fixed, name, res_cma, NUMA_NO_NODE); } +extern int __init cma_declare_contiguous_multi(phys_addr_t size, + phys_addr_t align, unsigned int order_per_bit, + const char *name, struct cma **res_cma, int nid); extern int cma_init_reserved_mem(phys_addr_t base, phys_addr_t size, unsigned int order_per_bit, const char *name, diff --git a/mm/cma.c b/mm/cma.c index 95a8788e54d3..34caa6b29c99 100644 --- a/mm/cma.c +++ b/mm/cma.c @@ -18,6 +18,7 @@ #include #include +#include #include #include #include @@ -35,9 +36,16 @@ struct cma cma_areas[MAX_CMA_AREAS]; unsigned int cma_area_count; static DEFINE_MUTEX(cma_mutex); +static int __init __cma_declare_contiguous_nid(phys_addr_t base, + phys_addr_t size, phys_addr_t limit, + phys_addr_t alignment, unsigned int order_per_bit, + bool fixed, const char *name, struct cma **res_cma, + int nid); + phys_addr_t cma_get_base(const struct cma *cma) { - return PFN_PHYS(cma->base_pfn); + WARN_ON_ONCE(cma->nranges != 1); + return PFN_PHYS(cma->ranges[0].base_pfn); } unsigned long cma_get_size(const struct cma *cma) @@ -63,9 +71,10 @@ static unsigned long cma_bitmap_aligned_mask(const struct cma *cma, * The value returned is represented in order_per_bits. */ static unsigned long cma_bitmap_aligned_offset(const struct cma *cma, + const struct cma_memrange *cmr, unsigned int align_order) { - return (cma->base_pfn & ((1UL << align_order) - 1)) + return (cmr->base_pfn & ((1UL << align_order) - 1)) >> cma->order_per_bit; } @@ -75,46 +84,57 @@ static unsigned long cma_bitmap_pages_to_bits(const struct cma *cma, return ALIGN(pages, 1UL << cma->order_per_bit) >> cma->order_per_bit; } -static void cma_clear_bitmap(struct cma *cma, unsigned long pfn, - unsigned long count) +static void cma_clear_bitmap(struct cma *cma, const struct cma_memrange *cmr, + unsigned long pfn, unsigned long count) { unsigned long bitmap_no, bitmap_count; unsigned long flags; - bitmap_no = (pfn - cma->base_pfn) >> cma->order_per_bit; + bitmap_no = (pfn - cmr->base_pfn) >> cma->order_per_bit; bitmap_count = cma_bitmap_pages_to_bits(cma, count); spin_lock_irqsave(&cma->lock, flags); - bitmap_clear(cma->bitmap, bitmap_no, bitmap_count); + bitmap_clear(cmr->bitmap, bitmap_no, bitmap_count); cma->available_count += count; spin_unlock_irqrestore(&cma->lock, flags); } static void __init cma_activate_area(struct cma *cma) { - unsigned long base_pfn = cma->base_pfn, pfn; + unsigned long pfn, base_pfn; + int allocrange, r; struct zone *zone; + struct cma_memrange *cmr; + + for (allocrange = 0; allocrange < cma->nranges; allocrange++) { + cmr = &cma->ranges[allocrange]; + cmr->bitmap = bitmap_zalloc(cma_bitmap_maxno(cma, cmr), + GFP_KERNEL); + if (!cmr->bitmap) + goto cleanup; + } - cma->bitmap = bitmap_zalloc(cma_bitmap_maxno(cma), GFP_KERNEL); - if (!cma->bitmap) - goto out_error; + for (r = 0; r < cma->nranges; r++) { + cmr = &cma->ranges[r]; + base_pfn = cmr->base_pfn; - /* - * alloc_contig_range() requires the pfn range specified to be in the - * same zone. Simplify by forcing the entire CMA resv range to be in the - * same zone. 
- */ - WARN_ON_ONCE(!pfn_valid(base_pfn)); - zone = page_zone(pfn_to_page(base_pfn)); - for (pfn = base_pfn + 1; pfn < base_pfn + cma->count; pfn++) { - WARN_ON_ONCE(!pfn_valid(pfn)); - if (page_zone(pfn_to_page(pfn)) != zone) - goto not_in_zone; - } + /* + * alloc_contig_range() requires the pfn range specified + * to be in the same zone. Simplify by forcing the entire + * CMA resv range to be in the same zone. + */ + WARN_ON_ONCE(!pfn_valid(base_pfn)); + zone = page_zone(pfn_to_page(base_pfn)); + for (pfn = base_pfn + 1; pfn < base_pfn + cmr->count; pfn++) { + WARN_ON_ONCE(!pfn_valid(pfn)); + if (page_zone(pfn_to_page(pfn)) != zone) + goto cleanup; + } - for (pfn = base_pfn; pfn < base_pfn + cma->count; - pfn += pageblock_nr_pages) - init_cma_reserved_pageblock(pfn_to_page(pfn)); + for (pfn = base_pfn; pfn < base_pfn + cmr->count; + pfn += pageblock_nr_pages) + init_cma_reserved_pageblock(pfn_to_page(pfn)); + } spin_lock_init(&cma->lock); @@ -125,13 +145,19 @@ static void __init cma_activate_area(struct cma *cma) return; -not_in_zone: - bitmap_free(cma->bitmap); -out_error: +cleanup: + for (r = 0; r < allocrange; r++) + bitmap_free(cma->ranges[r].bitmap); + /* Expose all pages to the buddy, they are useless for CMA. */ if (!cma->reserve_pages_on_error) { - for (pfn = base_pfn; pfn < base_pfn + cma->count; pfn++) - free_reserved_page(pfn_to_page(pfn)); + for (r = 0; r < allocrange; r++) { + cmr = &cma->ranges[r]; + for (pfn = cmr->base_pfn; + pfn < cmr->base_pfn + cmr->count; + pfn++) + free_reserved_page(pfn_to_page(pfn)); + } } totalcma_pages -= cma->count; cma->available_count = cma->count = 0; @@ -154,6 +180,43 @@ void __init cma_reserve_pages_on_error(struct cma *cma) cma->reserve_pages_on_error = true; } +static int __init cma_new_area(const char *name, phys_addr_t size, + unsigned int order_per_bit, + struct cma **res_cma) +{ + struct cma *cma; + + if (cma_area_count == ARRAY_SIZE(cma_areas)) { + pr_err("Not enough slots for CMA reserved regions!\n"); + return -ENOSPC; + } + + /* + * Each reserved area must be initialised later, when more kernel + * subsystems (like slab allocator) are available. 
+ */ + cma = &cma_areas[cma_area_count]; + cma_area_count++; + + if (name) + snprintf(cma->name, CMA_MAX_NAME, "%s", name); + else + snprintf(cma->name, CMA_MAX_NAME, "cma%d\n", cma_area_count); + + cma->available_count = cma->count = size >> PAGE_SHIFT; + cma->order_per_bit = order_per_bit; + *res_cma = cma; + totalcma_pages += cma->count; + + return 0; +} + +static void __init cma_drop_area(struct cma *cma) +{ + totalcma_pages -= cma->count; + cma_area_count--; +} + /** * cma_init_reserved_mem() - create custom contiguous area from reserved memory * @base: Base address of the reserved area @@ -172,13 +235,9 @@ int __init cma_init_reserved_mem(phys_addr_t base, phys_addr_t size, struct cma **res_cma) { struct cma *cma; + int ret; /* Sanity checks */ - if (cma_area_count == ARRAY_SIZE(cma_areas)) { - pr_err("Not enough slots for CMA reserved regions!\n"); - return -ENOSPC; - } - if (!size || !memblock_is_region_reserved(base, size)) return -EINVAL; @@ -195,25 +254,261 @@ int __init cma_init_reserved_mem(phys_addr_t base, phys_addr_t size, if (!IS_ALIGNED(base | size, CMA_MIN_ALIGNMENT_BYTES)) return -EINVAL; + ret = cma_new_area(name, size, order_per_bit, &cma); + if (ret != 0) + return ret; + + cma->ranges[0].base_pfn = PFN_DOWN(base); + cma->ranges[0].count = cma->count; + cma->nranges = 1; + + *res_cma = cma; + + return 0; +} + +/* + * Structure used while walking physical memory ranges and finding out + * which one(s) to use for a CMA area. + */ +struct cma_init_memrange { + phys_addr_t base; + phys_addr_t size; + struct list_head list; +}; + +/* + * Work array used during CMA initialization. + */ +static struct cma_init_memrange memranges[CMA_MAX_RANGES] __initdata; + +static bool __init revsizecmp(struct cma_init_memrange *mlp, + struct cma_init_memrange *mrp) +{ + return mlp->size > mrp->size; +} + +static bool __init basecmp(struct cma_init_memrange *mlp, + struct cma_init_memrange *mrp) +{ + return mlp->base < mrp->base; +} + +/* + * Helper function to create sorted lists. + */ +static void __init list_insert_sorted( + struct list_head *ranges, + struct cma_init_memrange *mrp, + bool (*cmp)(struct cma_init_memrange *lh, struct cma_init_memrange *rh)) +{ + struct list_head *mp; + struct cma_init_memrange *mlp; + + if (list_empty(ranges)) + list_add(&mrp->list, ranges); + else { + list_for_each(mp, ranges) { + mlp = list_entry(mp, struct cma_init_memrange, list); + if (cmp(mlp, mrp)) + break; + } + __list_add(&mrp->list, mlp->list.prev, &mlp->list); + } +} + +/* + * Create CMA areas with a total size of @total_size. A normal allocation + * for one area is tried first. If that fails, the biggest memblock + * ranges above 4G are selected, and allocated bottom up. + * + * The complexity here is not great, but this function will only be + * called during boot, and the lists operated on have fewer than + * CMA_MAX_RANGES elements (default value: 8). + */ +int __init cma_declare_contiguous_multi(phys_addr_t total_size, + phys_addr_t align, unsigned int order_per_bit, + const char *name, struct cma **res_cma, int nid) +{ + phys_addr_t start, end; + phys_addr_t size, sizesum, sizeleft; + struct cma_init_memrange *mrp, *mlp, *failed; + struct cma_memrange *cmrp; + LIST_HEAD(ranges); + LIST_HEAD(final_ranges); + struct list_head *mp, *next; + int ret, nr = 1; + u64 i; + struct cma *cma; + /* - * Each reserved area must be initialised later, when more kernel - * subsystems (like slab allocator) are available. + * First, try it the normal way, producing just one range. 
*/ - cma = &cma_areas[cma_area_count]; + ret = __cma_declare_contiguous_nid(0, total_size, 0, align, + order_per_bit, false, name, res_cma, nid); + if (ret != -ENOMEM) + goto out; - if (name) - snprintf(cma->name, CMA_MAX_NAME, name); - else - snprintf(cma->name, CMA_MAX_NAME, "cma%d\n", cma_area_count); + /* + * Couldn't find one range that fits our needs, so try multiple + * ranges. + * + * No need to do the alignment checks here, the call to + * cma_declare_contiguous_nid above would have caught + * any issues. With the checks, we know that: + * + * - @align is a power of 2 + * - @align is >= pageblock alignment + * - @size is aligned to @align and to @order_per_bit + * + * So, as long as we create ranges that have a base + * aligned to @align, and a size that is aligned to + * both @align and @order_to_bit, things will work out. + */ + nr = 0; + sizesum = 0; + failed = NULL; - cma->base_pfn = PFN_DOWN(base); - cma->available_count = cma->count = size >> PAGE_SHIFT; - cma->order_per_bit = order_per_bit; + ret = cma_new_area(name, total_size, order_per_bit, &cma); + if (ret != 0) + goto out; + + align = max_t(phys_addr_t, align, CMA_MIN_ALIGNMENT_BYTES); + /* + * Create a list of ranges above 4G, largest range first. + */ + for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &start, &end, NULL) { + if (upper_32_bits(start) == 0) + continue; + + start = ALIGN(start, align); + if (start >= end) + continue; + + end = ALIGN_DOWN(end, align); + if (end <= start) + continue; + + size = end - start; + size = ALIGN_DOWN(size, (PAGE_SIZE << order_per_bit)); + if (!size) + continue; + sizesum += size; + + pr_debug("consider %016llx - %016llx\n", (u64)start, (u64)end); + + /* + * If we don't yet have used the maximum number of + * areas, grab a new one. + * + * If we can't use anymore, see if this range is not + * smaller than the smallest one already recorded. If + * not, re-use the smallest element. + */ + if (nr < CMA_MAX_RANGES) + mrp = &memranges[nr++]; + else { + mrp = list_last_entry(&ranges, + struct cma_init_memrange, list); + if (size < mrp->size) + continue; + list_del(&mrp->list); + sizesum -= mrp->size; + pr_debug("deleted %016llx - %016llx from the list\n", + (u64)mrp->base, (u64)mrp->base + size); + } + mrp->base = start; + mrp->size = size; + + /* + * Now do a sorted insert. + */ + list_insert_sorted(&ranges, mrp, revsizecmp); + pr_debug("added %016llx - %016llx to the list\n", + (u64)mrp->base, (u64)mrp->base + size); + pr_debug("total size now %llu\n", (u64)sizesum); + } + + /* + * There is not enough room in the CMA_MAX_RANGES largest + * ranges, so bail out. + */ + if (sizesum < total_size) { + cma_drop_area(cma); + ret = -ENOMEM; + goto out; + } + + /* + * Found ranges that provide enough combined space. + * Now, sorted them by address, smallest first, because we + * want to mimic a bottom-up memblock allocation. + */ + sizesum = 0; + list_for_each_safe(mp, next, &ranges) { + mlp = list_entry(mp, struct cma_init_memrange, list); + list_del(mp); + list_insert_sorted(&final_ranges, mlp, basecmp); + sizesum += mlp->size; + if (sizesum >= total_size) + break; + } + + /* + * Walk the final list, and add a CMA range for + * each range, possibly not using the last one fully. + */ + nr = 0; + sizeleft = total_size; + list_for_each(mp, &final_ranges) { + mlp = list_entry(mp, struct cma_init_memrange, list); + size = min(sizeleft, mlp->size); + if (memblock_reserve(mlp->base, size)) { + /* + * Unexpected error. Could go on to + * the next one, but just abort to + * be safe. 
+ */ + failed = mlp; + break; + } + + pr_debug("created region %d: %016llx - %016llx\n", + nr, (u64)mlp->base, (u64)mlp->base + size); + cmrp = &cma->ranges[nr++]; + cmrp->base_pfn = PHYS_PFN(mlp->base); + cmrp->count = size >> PAGE_SHIFT; + + sizeleft -= size; + if (sizeleft == 0) + break; + } + + if (failed) { + list_for_each(mp, &final_ranges) { + mlp = list_entry(mp, struct cma_init_memrange, list); + if (mlp == failed) + break; + memblock_phys_free(mlp->base, mlp->size); + } + cma_drop_area(cma); + ret = -ENOMEM; + goto out; + } + + cma->nranges = nr; *res_cma = cma; - cma_area_count++; - totalcma_pages += cma->count; - return 0; +out: + if (ret != 0) + pr_err("Failed to reserve %lu MiB\n", + (unsigned long)total_size / SZ_1M); + else + pr_info("Reserved %lu MiB in %d range%s\n", + (unsigned long)total_size / SZ_1M, nr, + nr > 1 ? "s" : ""); + + return ret; } /** @@ -241,6 +536,26 @@ int __init cma_declare_contiguous_nid(phys_addr_t base, phys_addr_t alignment, unsigned int order_per_bit, bool fixed, const char *name, struct cma **res_cma, int nid) +{ + int ret; + + ret = __cma_declare_contiguous_nid(base, size, limit, alignment, + order_per_bit, fixed, name, res_cma, nid); + if (ret != 0) + pr_err("Failed to reserve %ld MiB\n", + (unsigned long)size / SZ_1M); + else + pr_info("Reserved %ld MiB at %pa\n", + (unsigned long)size / SZ_1M, &base); + + return ret; +} + +static int __init __cma_declare_contiguous_nid(phys_addr_t base, + phys_addr_t size, phys_addr_t limit, + phys_addr_t alignment, unsigned int order_per_bit, + bool fixed, const char *name, struct cma **res_cma, + int nid) { phys_addr_t memblock_end = memblock_end_of_DRAM(); phys_addr_t highmem_start; @@ -273,10 +588,9 @@ int __init cma_declare_contiguous_nid(phys_addr_t base, /* Sanitise input arguments. */ alignment = max_t(phys_addr_t, alignment, CMA_MIN_ALIGNMENT_BYTES); if (fixed && base & (alignment - 1)) { - ret = -EINVAL; pr_err("Region at %pa must be aligned to %pa bytes\n", &base, &alignment); - goto err; + return -EINVAL; } base = ALIGN(base, alignment); size = ALIGN(size, alignment); @@ -294,10 +608,9 @@ int __init cma_declare_contiguous_nid(phys_addr_t base, * low/high memory boundary. 
*/ if (fixed && base < highmem_start && base + size > highmem_start) { - ret = -EINVAL; pr_err("Region at %pa defined on low/high memory boundary (%pa)\n", &base, &highmem_start); - goto err; + return -EINVAL; } /* @@ -309,18 +622,16 @@ int __init cma_declare_contiguous_nid(phys_addr_t base, limit = memblock_end; if (base + size > limit) { - ret = -EINVAL; pr_err("Size (%pa) of region at %pa exceeds limit (%pa)\n", &size, &base, &limit); - goto err; + return -EINVAL; } /* Reserve memory */ if (fixed) { if (memblock_is_region_reserved(base, size) || memblock_reserve(base, size) < 0) { - ret = -EBUSY; - goto err; + return -EBUSY; } } else { phys_addr_t addr = 0; @@ -357,10 +668,8 @@ int __init cma_declare_contiguous_nid(phys_addr_t base, if (!addr) { addr = memblock_alloc_range_nid(size, alignment, base, limit, nid, true); - if (!addr) { - ret = -ENOMEM; - goto err; - } + if (!addr) + return -ENOMEM; } /* @@ -373,75 +682,67 @@ int __init cma_declare_contiguous_nid(phys_addr_t base, ret = cma_init_reserved_mem(base, size, order_per_bit, name, res_cma); if (ret) - goto free_mem; - - pr_info("Reserved %ld MiB at %pa on node %d\n", (unsigned long)size / SZ_1M, - &base, nid); - return 0; + memblock_phys_free(base, size); -free_mem: - memblock_phys_free(base, size); -err: - pr_err("Failed to reserve %ld MiB on node %d\n", (unsigned long)size / SZ_1M, - nid); return ret; } static void cma_debug_show_areas(struct cma *cma) { unsigned long next_zero_bit, next_set_bit, nr_zero; - unsigned long start = 0; + unsigned long start; unsigned long nr_part; - unsigned long nbits = cma_bitmap_maxno(cma); + unsigned long nbits; + int r; + struct cma_memrange *cmr; spin_lock_irq(&cma->lock); pr_info("number of available pages: "); - for (;;) { - next_zero_bit = find_next_zero_bit(cma->bitmap, nbits, start); - if (next_zero_bit >= nbits) - break; - next_set_bit = find_next_bit(cma->bitmap, nbits, next_zero_bit); - nr_zero = next_set_bit - next_zero_bit; - nr_part = nr_zero << cma->order_per_bit; - pr_cont("%s%lu@%lu", start ? "+" : "", nr_part, - next_zero_bit); - start = next_zero_bit + nr_zero; + for (r = 0; r < cma->nranges; r++) { + cmr = &cma->ranges[r]; + + start = 0; + nbits = cma_bitmap_maxno(cma, cmr); + + pr_info("range %d: ", r); + for (;;) { + next_zero_bit = find_next_zero_bit(cmr->bitmap, + nbits, start); + if (next_zero_bit >= nbits) + break; + next_set_bit = find_next_bit(cmr->bitmap, nbits, + next_zero_bit); + nr_zero = next_set_bit - next_zero_bit; + nr_part = nr_zero << cma->order_per_bit; + pr_cont("%s%lu@%lu", start ? "+" : "", nr_part, + next_zero_bit); + start = next_zero_bit + nr_zero; + } + pr_info("\n"); } pr_cont("=> %lu free of %lu total pages\n", cma->available_count, cma->count); spin_unlock_irq(&cma->lock); } -static struct page *__cma_alloc(struct cma *cma, unsigned long count, - unsigned int align, gfp_t gfp) +static int cma_range_alloc(struct cma *cma, struct cma_memrange *cmr, + unsigned long count, unsigned int align, + struct page **pagep, gfp_t gfp) { unsigned long mask, offset; unsigned long pfn = -1; unsigned long start = 0; unsigned long bitmap_maxno, bitmap_no, bitmap_count; - unsigned long i; + int ret = -EBUSY; struct page *page = NULL; - int ret = -ENOMEM; - const char *name = cma ? 
cma->name : NULL; - - trace_cma_alloc_start(name, count, align); - - if (!cma || !cma->count || !cma->bitmap) - return page; - - pr_debug("%s(cma %p, name: %s, count %lu, align %d)\n", __func__, - (void *)cma, cma->name, count, align); - - if (!count) - return page; mask = cma_bitmap_aligned_mask(cma, align); - offset = cma_bitmap_aligned_offset(cma, align); - bitmap_maxno = cma_bitmap_maxno(cma); + offset = cma_bitmap_aligned_offset(cma, cmr, align); + bitmap_maxno = cma_bitmap_maxno(cma, cmr); bitmap_count = cma_bitmap_pages_to_bits(cma, count); if (bitmap_count > bitmap_maxno) - return page; + goto out; for (;;) { spin_lock_irq(&cma->lock); @@ -453,14 +754,14 @@ static struct page *__cma_alloc(struct cma *cma, unsigned long count, spin_unlock_irq(&cma->lock); break; } - bitmap_no = bitmap_find_next_zero_area_off(cma->bitmap, + bitmap_no = bitmap_find_next_zero_area_off(cmr->bitmap, bitmap_maxno, start, bitmap_count, mask, offset); if (bitmap_no >= bitmap_maxno) { spin_unlock_irq(&cma->lock); break; } - bitmap_set(cma->bitmap, bitmap_no, bitmap_count); + bitmap_set(cmr->bitmap, bitmap_no, bitmap_count); cma->available_count -= count; /* * It's safe to drop the lock here. We've marked this region for @@ -469,7 +770,7 @@ static struct page *__cma_alloc(struct cma *cma, unsigned long count, */ spin_unlock_irq(&cma->lock); - pfn = cma->base_pfn + (bitmap_no << cma->order_per_bit); + pfn = cmr->base_pfn + (bitmap_no << cma->order_per_bit); mutex_lock(&cma_mutex); ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA, gfp); mutex_unlock(&cma_mutex); @@ -478,7 +779,7 @@ static struct page *__cma_alloc(struct cma *cma, unsigned long count, break; } - cma_clear_bitmap(cma, pfn, count); + cma_clear_bitmap(cma, cmr, pfn, count); if (ret != -EBUSY) break; @@ -490,6 +791,38 @@ static struct page *__cma_alloc(struct cma *cma, unsigned long count, /* try again with a bit different memory target */ start = bitmap_no + mask + 1; } +out: + *pagep = page; + return ret; +} + +static struct page *__cma_alloc(struct cma *cma, unsigned long count, + unsigned int align, gfp_t gfp) +{ + struct page *page = NULL; + int ret = -ENOMEM, r; + unsigned long i; + const char *name = cma ? cma->name : NULL; + + trace_cma_alloc_start(name, count, align); + + if (!cma || !cma->count) + return page; + + pr_debug("%s(cma %p, name: %s, count %lu, align %d)\n", __func__, + (void *)cma, cma->name, count, align); + + if (!count) + return page; + + for (r = 0; r < cma->nranges; r++) { + page = NULL; + + ret = cma_range_alloc(cma, &cma->ranges[r], count, align, + &page, gfp); + if (ret != -EBUSY || page) + break; + } /* * CMA can allocate multiple page blocks, which results in different @@ -508,7 +841,8 @@ static struct page *__cma_alloc(struct cma *cma, unsigned long count, } pr_debug("%s(): returned %p\n", __func__, page); - trace_cma_alloc_finish(name, pfn, page, count, align, ret); + trace_cma_alloc_finish(name, page ? 
page_to_pfn(page) : 0, + page, count, align, ret); if (page) { count_vm_event(CMA_ALLOC_SUCCESS); cma_sysfs_account_success_pages(cma, count); @@ -551,20 +885,31 @@ struct folio *cma_alloc_folio(struct cma *cma, int order, gfp_t gfp) bool cma_pages_valid(struct cma *cma, const struct page *pages, unsigned long count) { - unsigned long pfn; + unsigned long pfn, end; + int r; + struct cma_memrange *cmr; + bool ret; - if (!cma || !pages) + if (!cma || !pages || count > cma->count) return false; pfn = page_to_pfn(pages); + ret = false; - if (pfn < cma->base_pfn || pfn >= cma->base_pfn + cma->count) { - pr_debug("%s(page %p, count %lu)\n", __func__, - (void *)pages, count); - return false; + for (r = 0; r < cma->nranges; r++) { + cmr = &cma->ranges[r]; + end = cmr->base_pfn + cmr->count; + if (pfn >= cmr->base_pfn && pfn < end) { + ret = pfn + count <= end; + break; + } } - return true; + if (!ret) + pr_debug("%s(page %p, count %lu)\n", + __func__, (void *)pages, count); + + return ret; } /** @@ -580,19 +925,32 @@ bool cma_pages_valid(struct cma *cma, const struct page *pages, bool cma_release(struct cma *cma, const struct page *pages, unsigned long count) { - unsigned long pfn; + struct cma_memrange *cmr; + unsigned long pfn, end_pfn; + int r; + + pr_debug("%s(page %p, count %lu)\n", __func__, (void *)pages, count); if (!cma_pages_valid(cma, pages, count)) return false; - pr_debug("%s(page %p, count %lu)\n", __func__, (void *)pages, count); - pfn = page_to_pfn(pages); + end_pfn = pfn + count; + + for (r = 0; r < cma->nranges; r++) { + cmr = &cma->ranges[r]; + if (pfn >= cmr->base_pfn && + pfn < (cmr->base_pfn + cmr->count)) { + VM_BUG_ON(end_pfn > cmr->base_pfn + cmr->count); + break; + } + } - VM_BUG_ON(pfn + count > cma->base_pfn + cma->count); + if (r == cma->nranges) + return false; free_contig_range(pfn, count); - cma_clear_bitmap(cma, pfn, count); + cma_clear_bitmap(cma, cmr, pfn, count); cma_sysfs_account_release_pages(cma, count); trace_cma_release(cma->name, pfn, pages, count); diff --git a/mm/cma.h b/mm/cma.h index 3dd3376ae980..5f39dd1aac91 100644 --- a/mm/cma.h +++ b/mm/cma.h @@ -10,19 +10,35 @@ struct cma_kobject { struct cma *cma; }; +/* + * Multi-range support. This can be useful if the size of the allocation + * is not expected to be larger than the alignment (like with hugetlb_cma), + * and the total amount of memory requested, while smaller than the total + * amount of memory available, is large enough that it doesn't fit in a + * single physical memory range because of memory holes. 
+ */ +struct cma_memrange { + unsigned long base_pfn; + unsigned long count; + unsigned long *bitmap; +#ifdef CONFIG_CMA_DEBUGFS + struct debugfs_u32_array dfs_bitmap; +#endif +}; +#define CMA_MAX_RANGES 8 + struct cma { - unsigned long base_pfn; unsigned long count; unsigned long available_count; - unsigned long *bitmap; unsigned int order_per_bit; /* Order of pages represented by one bit */ spinlock_t lock; #ifdef CONFIG_CMA_DEBUGFS struct hlist_head mem_head; spinlock_t mem_head_lock; - struct debugfs_u32_array dfs_bitmap; #endif char name[CMA_MAX_NAME]; + int nranges; + struct cma_memrange ranges[CMA_MAX_RANGES]; #ifdef CONFIG_CMA_SYSFS /* the number of CMA page successful allocations */ atomic64_t nr_pages_succeeded; @@ -39,9 +55,10 @@ struct cma { extern struct cma cma_areas[MAX_CMA_AREAS]; extern unsigned int cma_area_count; -static inline unsigned long cma_bitmap_maxno(struct cma *cma) +static inline unsigned long cma_bitmap_maxno(struct cma *cma, + struct cma_memrange *cmr) { - return cma->count >> cma->order_per_bit; + return cmr->count >> cma->order_per_bit; } #ifdef CONFIG_CMA_SYSFS diff --git a/mm/cma_debug.c b/mm/cma_debug.c index 89236f22230a..fdf899532ca0 100644 --- a/mm/cma_debug.c +++ b/mm/cma_debug.c @@ -46,17 +46,26 @@ DEFINE_DEBUGFS_ATTRIBUTE(cma_used_fops, cma_used_get, NULL, "%llu\n"); static int cma_maxchunk_get(void *data, u64 *val) { struct cma *cma = data; + struct cma_memrange *cmr; unsigned long maxchunk = 0; - unsigned long start, end = 0; - unsigned long bitmap_maxno = cma_bitmap_maxno(cma); + unsigned long start, end; + unsigned long bitmap_maxno; + int r; spin_lock_irq(&cma->lock); - for (;;) { - start = find_next_zero_bit(cma->bitmap, bitmap_maxno, end); - if (start >= bitmap_maxno) - break; - end = find_next_bit(cma->bitmap, bitmap_maxno, start); - maxchunk = max(end - start, maxchunk); + for (r = 0; r < cma->nranges; r++) { + cmr = &cma->ranges[r]; + bitmap_maxno = cma_bitmap_maxno(cma, cmr); + end = 0; + for (;;) { + start = find_next_zero_bit(cmr->bitmap, + bitmap_maxno, end); + if (start >= bitmap_maxno) + break; + end = find_next_bit(cmr->bitmap, bitmap_maxno, + start); + maxchunk = max(end - start, maxchunk); + } } spin_unlock_irq(&cma->lock); *val = (u64)maxchunk << cma->order_per_bit; @@ -159,24 +168,41 @@ DEFINE_DEBUGFS_ATTRIBUTE(cma_alloc_fops, NULL, cma_alloc_write, "%llu\n"); static void cma_debugfs_add_one(struct cma *cma, struct dentry *root_dentry) { - struct dentry *tmp; + struct dentry *tmp, *dir, *rangedir; + int r; + char rdirname[12]; + struct cma_memrange *cmr; tmp = debugfs_create_dir(cma->name, root_dentry); debugfs_create_file("alloc", 0200, tmp, cma, &cma_alloc_fops); debugfs_create_file("free", 0200, tmp, cma, &cma_free_fops); - debugfs_create_file("base_pfn", 0444, tmp, - &cma->base_pfn, &cma_debugfs_fops); debugfs_create_file("count", 0444, tmp, &cma->count, &cma_debugfs_fops); debugfs_create_file("order_per_bit", 0444, tmp, &cma->order_per_bit, &cma_debugfs_fops); debugfs_create_file("used", 0444, tmp, cma, &cma_used_fops); debugfs_create_file("maxchunk", 0444, tmp, cma, &cma_maxchunk_fops); - cma->dfs_bitmap.array = (u32 *)cma->bitmap; - cma->dfs_bitmap.n_elements = DIV_ROUND_UP(cma_bitmap_maxno(cma), - BITS_PER_BYTE * sizeof(u32)); - debugfs_create_u32_array("bitmap", 0444, tmp, &cma->dfs_bitmap); + rangedir = debugfs_create_dir("ranges", tmp); + for (r = 0; r < cma->nranges; r++) { + cmr = &cma->ranges[r]; + snprintf(rdirname, sizeof(rdirname), "%d", r); + dir = debugfs_create_dir(rdirname, rangedir); + 
debugfs_create_file("base_pfn", 0444, dir, + &cmr->base_pfn, &cma_debugfs_fops); + cmr->dfs_bitmap.array = (u32 *)cmr->bitmap; + cmr->dfs_bitmap.n_elements = + DIV_ROUND_UP(cma_bitmap_maxno(cma, cmr), + BITS_PER_BYTE * sizeof(u32)); + debugfs_create_u32_array("bitmap", 0444, dir, + &cmr->dfs_bitmap); + } + + /* + * Backward compatible symlinks to range 0 for base_pfn and bitmap. + */ + debugfs_create_symlink("base_pfn", tmp, "ranges/0/base_pfn"); + debugfs_create_symlink("bitmap", tmp, "ranges/0/bitmap"); } static int __init cma_debugfs_init(void) -- 2.51.0 From 624ab90b7b8758090654520b03c97cecc438a414 Mon Sep 17 00:00:00 2001 From: Frank van der Linden Date: Fri, 28 Feb 2025 18:29:04 +0000 Subject: [PATCH 15/16] mm/cma: introduce cma_intersects function Now that CMA areas can have multiple physical ranges, code can't assume a CMA struct represents a base_pfn plus a size, as returned from cma_get_base. Most cases are ok though, since they all explicitly refer to CMA areas that were created using existing interfaces (cma_declare_contiguous_nid or cma_init_reserved_mem), which guarantees they have just one physical range. An exception is the s390 code, which walks all CMA ranges to see if they intersect with a range of memory that is about to be hotremoved. So, in the future, it might run in to multi-range areas. To keep this check working, define a cma_intersects function. This just checks if a physaddr range intersects any of the ranges. Use it in the s390 check. Link: https://lkml.kernel.org/r/20250228182928.2645936-4-fvdl@google.com Signed-off-by: Frank van der Linden Acked-by: Alexander Gordeev Cc: Heiko Carstens Cc: Vasily Gorbik Cc: Alexander Gordeev Cc: Andy Lutomirski Cc: Arnd Bergmann Cc: Dan Carpenter Cc: Dave Hansen Cc: David Hildenbrand Cc: Joao Martins Cc: Johannes Weiner Cc: Madhavan Srinivasan Cc: Michael Ellerman Cc: Muchun Song Cc: Oscar Salvador Cc: Peter Zijlstra Cc: Roman Gushchin (Cruise) Cc: Usama Arif Cc: Yu Zhao Cc: Zi Yan Signed-off-by: Andrew Morton --- arch/s390/mm/init.c | 13 +++++-------- include/linux/cma.h | 1 + mm/cma.c | 21 +++++++++++++++++++++ 3 files changed, 27 insertions(+), 8 deletions(-) diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c index f2298f7a3f21..d88cb1c13f7d 100644 --- a/arch/s390/mm/init.c +++ b/arch/s390/mm/init.c @@ -239,16 +239,13 @@ struct s390_cma_mem_data { static int s390_cma_check_range(struct cma *cma, void *data) { struct s390_cma_mem_data *mem_data; - unsigned long start, end; mem_data = data; - start = cma_get_base(cma); - end = start + cma_get_size(cma); - if (end < mem_data->start) - return 0; - if (start >= mem_data->end) - return 0; - return -EBUSY; + + if (cma_intersects(cma, mem_data->start, mem_data->end)) + return -EBUSY; + + return 0; } static int s390_cma_mem_notifier(struct notifier_block *nb, diff --git a/include/linux/cma.h b/include/linux/cma.h index 863427c27dc2..03d85c100dcc 100644 --- a/include/linux/cma.h +++ b/include/linux/cma.h @@ -53,6 +53,7 @@ extern bool cma_pages_valid(struct cma *cma, const struct page *pages, unsigned extern bool cma_release(struct cma *cma, const struct page *pages, unsigned long count); extern int cma_for_each_area(int (*it)(struct cma *cma, void *data), void *data); +extern bool cma_intersects(struct cma *cma, unsigned long start, unsigned long end); extern void cma_reserve_pages_on_error(struct cma *cma); diff --git a/mm/cma.c b/mm/cma.c index 34caa6b29c99..8dc46bfa3819 100644 --- a/mm/cma.c +++ b/mm/cma.c @@ -978,3 +978,24 @@ int cma_for_each_area(int (*it)(struct cma *cma, 
void *data), void *data) return 0; } + +bool cma_intersects(struct cma *cma, unsigned long start, unsigned long end) +{ + int r; + struct cma_memrange *cmr; + unsigned long rstart, rend; + + for (r = 0; r < cma->nranges; r++) { + cmr = &cma->ranges[r]; + + rstart = PFN_PHYS(cmr->base_pfn); + rend = PFN_PHYS(cmr->base_pfn + cmr->count); + if (end < rstart) + continue; + if (start >= rend) + continue; + return true; + } + + return false; +} -- 2.51.0 From 3dda0103e8ea4923d1a2f9d3c6b75aadc08aee73 Mon Sep 17 00:00:00 2001 From: Frank van der Linden Date: Fri, 28 Feb 2025 18:29:05 +0000 Subject: [PATCH 16/16] mm, hugetlb: use cma_declare_contiguous_multi hugetlb_cma is fine with using multiple CMA ranges, as long as it can get its gigantic pages allocated from them. So, use cma_declare_contiguous_multi to allow for multiple ranges, increasing the chances of getting what we want on systems with gaps in physical memory. Link: https://lkml.kernel.org/r/20250228182928.2645936-5-fvdl@google.com Signed-off-by: Frank van der Linden Cc: Alexander Gordeev Cc: Andy Lutomirski Cc: Arnd Bergmann Cc: Dan Carpenter Cc: Dave Hansen Cc: David Hildenbrand Cc: Heiko Carstens Cc: Joao Martins Cc: Johannes Weiner Cc: Madhavan Srinivasan Cc: Michael Ellerman Cc: Muchun Song Cc: Oscar Salvador Cc: Peter Zijlstra Cc: Roman Gushchin (Cruise) Cc: Usama Arif Cc: Vasily Gorbik Cc: Yu Zhao Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/hugetlb.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 318624c96584..2585d4da6c45 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -7750,9 +7750,8 @@ void __init hugetlb_cma_reserve(int order) * may be returned to CMA allocator in the case of * huge page demotion. */ - res = cma_declare_contiguous_nid(0, size, 0, - PAGE_SIZE << order, - HUGETLB_PAGE_ORDER, false, name, + res = cma_declare_contiguous_multi(size, PAGE_SIZE << order, + HUGETLB_PAGE_ORDER, name, &hugetlb_cma[nid], nid); if (res) { pr_warn("hugetlb_cma: reservation failed: err %d, node %d", -- 2.51.0
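For other boot-time users that might want to make the same switch, the conversion has the same shape as the hugetlb hunk above. The fragment below is a hypothetical sketch only: "mydrv", the 512 MiB size and the zero order_per_bit are invented, and the change amounts to calling cma_declare_contiguous_multi() with the base/limit arguments of the old interface dropped.

	/*
	 * Hypothetical sketch of a boot-time caller using the multi-range
	 * interface; "mydrv" and the 512 MiB size are made up for illustration.
	 */
	static struct cma *mydrv_cma;

	static void __init mydrv_cma_reserve(int nid)
	{
		int ret;

		/*
		 * Same arguments as cma_declare_contiguous_nid() minus the fixed
		 * base/limit pair: the area may now be stitched together from
		 * several physical ranges, so a single base address no longer
		 * makes sense.
		 */
		ret = cma_declare_contiguous_multi(SZ_512M, CMA_MIN_ALIGNMENT_BYTES,
						   0, "mydrv", &mydrv_cma, nid);
		if (ret)
			pr_warn("mydrv: CMA reservation failed: %d\n", ret);
	}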