Liam R. Howlett [Sat, 30 Nov 2024 04:54:58 +0000 (23:54 -0500)]
maple_tree: Combine mas_parent_gap() into mas_update_gap()
mas_parent_gap() is used in only one location, and much of what it needs already
exists in the calling function. Inlining the function and dropping the
duplication simplifies the code and reduces the instruction count.
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Liam R. Howlett [Wed, 23 Aug 2023 16:13:52 +0000 (12:13 -0400)]
tools/testing/radix-tree: Add maple tree fuzzer
This is the introduction of the maple tree fuzzer into the radix-tree
test suite, based heavily on the work of Vasily Gorbik sent in March
2022.
The tester uses the LLVM fuzzer to automate testing of the maple tree by
randomly inserting, storing, deleting, and resetting the tree. Testing
has been expanded to test both allocation and basic trees.
After building the fuzz-maple target with clang, just run the resulting
executable. The LLVM libFuzzer supports minimizing the steps needed to
reproduce a crash from a crash file.
Using V=1 on the minimized crash will result in a testcase that can be
added to the lib/test_maple_tree.c test suite. Using V=2 can help
figure out what is happening to cause the crash.
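For reference, libFuzzer drives a harness through a single fixed entry point; a minimal sketch of that shape is below. LLVMFuzzerTestOneInput() is libFuzzer's standard hook, while maple_tree_replay() is a hypothetical stand-in for the harness's operation decoder, not code from this patch.

  #include <stddef.h>
  #include <stdint.h>

  /* Hypothetical helper: decode the fuzz input into maple tree operations
   * (store/insert/erase/reset) and apply them to a test tree. */
  void maple_tree_replay(const uint8_t *data, size_t len);

  int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
  {
          if (size < 2)
                  return 0;               /* too short to encode an operation */
          maple_tree_replay(data, size);
          return 0;                       /* non-zero return values are reserved */
  }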
Cc: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Hamza Mahfooz [Mon, 20 Jan 2025 20:56:59 +0000 (15:56 -0500)]
mailmap: add an entry for Hamza Mahfooz
Map my previous work email to my current one.
Link: https://lkml.kernel.org/r/20250120205659.139027-1-hamzamahfooz@linux.microsoft.com Signed-off-by: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com> Cc: David S. Miller <davem@davemloft.net> Cc: Hans verkuil <hverkuil@xs4all.nl> Cc: Matthieu Baerts <matttbe@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Zhaoyang Huang [Tue, 21 Jan 2025 02:01:59 +0000 (10:01 +0800)]
mm: gup: fix infinite loop within __get_longterm_locked
We can run into an infinite loop in __get_longterm_locked() when
collect_longterm_unpinnable_folios() finds only folios that are isolated
from the LRU or were never added to the LRU. This can happen when all
folios to be pinned are never added to the LRU, for example when
vm_ops->fault allocated pages using cma_alloc() and never added them to
the LRU.
We incorrectly update the "collected" variable even if nothing was
collected. Fix it by incrementing "collected" only when we isolated a
folio and added it to the list of folios to migrate.
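Below is a minimal sketch of the counting pattern being described, not the literal hunk; the helper name and the use of folio_isolate_lru() are assumptions for illustration.

  /* Only count folios that were actually isolated and queued for migration;
   * previously the counter was bumped even when isolation did nothing. */
  static unsigned long collect_one_folio(struct folio *folio,
                                         struct list_head *movable_folio_list)
  {
          unsigned long collected = 0;

          if (folio_isolate_lru(folio)) {
                  list_add_tail(&folio->lru, movable_folio_list);
                  collected++;
          }
          return collected;
  }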
Link: https://lkml.kernel.org/r/20250121020159.3636477-1-zhaoyang.huang@unisoc.com Fixes: 67e139b02d99 ("mm/gup.c: refactor check_and_migrate_movable_pages()") Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: David Hildenbrand <david@redhat.com> Suggested-by: David Hildenbrand <david@redhat.com> Cc: Aijun Sun <aijun.sun@unisoc.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Heming Zhao [Tue, 21 Jan 2025 11:22:03 +0000 (19:22 +0800)]
ocfs2: fix incorrect CPU endianness conversion causing mount failure
Commit 23aab037106d ("ocfs2: fix UBSAN warning in ocfs2_verify_volume()")
introduced a regression. The blksz_bits value is already converted to CPU
endianness by the preceding code; therefore, the code should not apply
le32_to_cpu() to it again.
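A hedged sketch of the double-conversion pattern being removed; the variable names are illustrative, not the actual ocfs2 fields.

  __le32 disk_bits = sb_blocksize_bits_on_disk;     /* illustrative on-disk field */
  u32 blksz_bits = le32_to_cpu(disk_bits);          /* converted once: now CPU-endian */

  /* Wrong: le32_to_cpu() applied again to an already-native value. */
  if (le32_to_cpu(blksz_bits) != expected_bits)
          return -EINVAL;

  /* Right: compare the native value directly. */
  if (blksz_bits != expected_bits)
          return -EINVAL;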
Link: https://lkml.kernel.org/r/20250121112204.12834-1-heming.zhao@suse.com Fixes: 23aab037106d ("ocfs2: fix UBSAN warning in ocfs2_verify_volume()") Signed-off-by: Heming Zhao <heming.zhao@suse.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
As has_unmovable_pages() calls folio_hstate() without holding hugetlb_lock,
there is a race in which the HugeTLB page can be freed between PageHuge() and
folio_hstate(). There is no need to take hugetlb_lock here, as the HugeTLB
page can be freed from a lot of places; it is enough to unfold folio_hstate()
and add a check to avoid a NULL pointer dereference in
hugepage_migration_supported().
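As a hedged sketch of the idea (not the literal hunk), unfolding folio_hstate() and adding the check could look roughly like this:

  if (folio_test_hugetlb(folio)) {
          /* Open-coded folio_hstate(): without hugetlb_lock the folio may be
           * freed concurrently, so the hstate lookup can come back NULL. */
          struct hstate *h = size_to_hstate(folio_size(folio));

          if (h && !hugepage_migration_supported(h))
                  goto unmovable;
  }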
Link: https://lkml.kernel.org/r/20250122061151.578768-1-liushixin2@huawei.com Fixes: 464c7ffbcb16 ("mm/hugetlb: filter out hugetlb pages if HUGEPAGE migration is not supported.") Signed-off-by: Liu Shixin <liushixin2@huawei.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nanyong Sun <sunnanyong@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Carlos Bilbao [Sat, 11 Jan 2025 16:11:06 +0000 (10:11 -0600)]
mailmap, docs: update email to carlos.bilbao@kernel.org
Update .mailmap to reflect my new (and final) primary email address,
carlos.bilbao@kernel.org. This ensures consistent attribution in Git
history. Also update my contact information in file
Documentation/translations/sp_SP/index.rst to help contributors reach out
for Spanish translations.
introduce local nr_demoted to fix nr_reclaimed double counting
Link: https://lkml.kernel.org/r/20250111015253.425693-1-lizhijian@fujitsu.com Fixes: f77f0c751478 ("mm,memcg: provide per-cgroup counters for NUMA balancing operations") Signed-off-by: Li Zhijian <lizhijian@fujitsu.com> Cc: Kaiyang Zhao <kaiyang2@cs.cmu.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Li Zhijian [Fri, 10 Jan 2025 12:21:32 +0000 (20:21 +0800)]
mm/vmscan: accumulate nr_demoted for accurate demotion statistics
In shrink_folio_list(), demote_folio_list() can be called twice.
Currently, stat->nr_demoted only stores the last nr_demoted (the later
nr_demoted is always zero, so the earlier value is lost); as a result,
the number of demoted pages is not accurate.
Accumulate the nr_demoted count across multiple calls to
demote_folio_list(), ensuring accurate reporting of demotion statistics.
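A minimal sketch of the fix, with the surrounding shrink_folio_list() code paraphrased:

  /* Before: the second demote_folio_list() call overwrote the first count. */
  stat->nr_demoted = demote_folio_list(&demote_folios, pgdat);

  /* After: accumulate across both calls so the statistic stays accurate. */
  stat->nr_demoted += demote_folio_list(&demote_folios, pgdat);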
Link: https://lkml.kernel.org/r/20250110122133.423481-1-lizhijian@fujitsu.com Fixes: f77f0c751478 ("mm,memcg: provide per-cgroup counters for NUMA balancing operations") Signed-off-by: Li Zhijian <lizhijian@fujitsu.com> Acked-by: Kaiyang Zhao <kaiyang2@cs.cmu.edu> Tested-by: Donet Tom <donettom@linux.ibm.com> Reviewed-by: Donet Tom <donettom@linux.ibm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Yu Zhao [Wed, 8 Jan 2025 07:48:21 +0000 (00:48 -0700)]
mm/hugetlb_vmemmap: fix memory loads ordering
Using x86_64 as an example, for a 32KB struct page[] area describing a 2MB
hugeTLB, HVO reduces the area to 4KB by the following steps:
1. Split the (r/w vmemmap) PMD mapping the area into 512 (r/w) PTEs;
2. For the 8 PTEs mapping the area, remap PTE 1-7 to the page mapped
by PTE 0, and at the same time change the permission from r/w to
r/o;
3. Free the pages PTE 1-7 used to map, hence the reduction from 32KB
to 4KB.
However, the following race can happen due to improper memory load
ordering:
CPU 1 (HVO)                          CPU 2 (speculative PFN walker)

                                     atomic_add_unless(&page->_refcount)
                                     XXX: try to modify r/o struct page[]
Specifically, page_is_fake_head() must be ordered after page_ref_count()
on CPU 2 so that it can only return true for this case, to avoid the later
attempt to modify r/o struct page[].
This patch adds the missing memory barrier and ensures that the tests on
page_is_fake_head() and page_ref_count() are done in the proper order.
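A hedged sketch of the required ordering on the PFN-walker side; smp_rmb() stands in for whatever barrier or acquire primitive the patch actually uses, and the surrounding logic is illustrative only.

  /* Speculative PFN walker (sketch): */
  if (!page_ref_count(page))      /* load _refcount first; zero means frozen */
          return false;

  smp_rmb();                      /* order the _refcount load before the next load */

  if (page_is_fake_head(page))    /* only now is this test meaningful */
          return false;           /* part of an HVO-optimized, r/o struct page[] */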
Link: https://lkml.kernel.org/r/20250108074822.722696-1-yuzhao@google.com Fixes: bd225530a4c7 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers") Signed-off-by: Yu Zhao <yuzhao@google.com> Reported-by: Will Deacon <will@kernel.org> Closes: https://lore.kernel.org/20241128142028.GA3506@willie-the-truck/ Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
liuye [Tue, 19 Nov 2024 06:08:42 +0000 (14:08 +0800)]
mm/vmscan: fix hard LOCKUP in function isolate_lru_folios
This fixes the following hard lockup in isolate_lru_folios() during memory
reclaim. If the LRU mostly contains ineligible folios, the scan can run long
enough to trigger the watchdog.
watchdog: Watchdog detected hard LOCKUP on cpu 173
RIP: 0010:native_queued_spin_lock_slowpath+0x255/0x2a0
Call Trace:
_raw_spin_lock_irqsave+0x31/0x40
folio_lruvec_lock_irqsave+0x5f/0x90
folio_batch_move_lru+0x91/0x150
lru_add_drain_per_cpu+0x1c/0x40
process_one_work+0x17d/0x350
worker_thread+0x27b/0x3a0
kthread+0xe8/0x120
ret_from_fork+0x34/0x50
ret_from_fork_asm+0x1b/0x30
Scenario:
User processes request a large amount of memory and keep the pages active.
A module then continuously requests memory from the ZONE_DMA32 area.
Memory reclaim is triggered because the ZONE_DMA32 watermark is reached.
However, pages in the LRU (active_anon) list are mostly from
the ZONE_NORMAL area.
About the Fixes:
Why did it take eight years to be discovered?
The problem requires the following conditions to occur:
1. The device memory should be large enough.
2. Pages in the LRU (active_anon) list are mostly from the ZONE_NORMAL area.
3. The memory in ZONE_DMA32 needs to reach the watermark.
If the memory is not large enough, or if the use of ZONE_DMA32 memory is
reasonable, this problem is difficult to detect.
Notes:
The problem is most likely to occur in ZONE_DMA32 and ZONE_NORMAL,
but other suitable scenarios may also trigger the problem.
Link: https://lkml.kernel.org/r/20241119060842.274072-1-liuye@kylinos.cn Fixes: b2e18757f2c9 ("mm, vmscan: begin reclaiming pages on a per-node basis") Signed-off-by: liuye <liuye@kylinos.cn> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Yang Shi <yang@os.amperecomputing.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
syzkaller reported a UBSAN shift-out-of-bounds warning for (1UL << order)
in isolate_freepages_block(). The bogus compound_order can be any value
because it is in a union with flags. Add back the MAX_PAGE_ORDER check to fix
the warning.
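A hedged sketch of the kind of bound check being restored in isolate_freepages_block() (paraphrased, relying on && short-circuiting so the shift never sees a bogus order):

  const unsigned int order = compound_order(page);

  /* compound_order() can be garbage here because it overlaps page flags;
   * reject it before it is ever used in a (1UL << order) shift. */
  if (order <= MAX_PAGE_ORDER && blockpfn + (1UL << order) <= end_pfn) {
          /* ... treat as a valid free compound page ... */
  }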
Link: https://lkml.kernel.org/r/20250123021029.2826736-1-liushixin2@huawei.com Fixes: 3da0272a4c7d ("mm/compaction: correctly return failure with bogus compound_order in strict mode") Signed-off-by: Liu Shixin <liushixin2@huawei.com> Reviewed-by: Kemeng Shi <shikemeng@huaweicloud.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Alexander Gordeev [Thu, 23 Jan 2025 16:03:49 +0000 (17:03 +0100)]
s390/mm: add missing ctor/dtor on page table upgrade
Commit 78966b550289 ("s390: pgtable: add statistics for PUD and P4D level
page table") misses the call to pagetable_p4d_ctor() against a newly
allocated P4D table in crst_table_upgrade().
Commit 68c601de75d8 ("mm: introduce ctor/dtor at PGD level") misses the
call to pagetable_pgd_ctor() against a newly allocated PGD and the call to
pagetable_dtor() against a newly allocated P4D that is about to be freed
on the crst_table_upgrade() PGD upgrade failure path.
The missed constructors and destructor break (at least) the page table
accounting when a process memory space is upgraded.
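A hedged, heavily simplified sketch of the pairing described above; the real crst_table_upgrade() does considerably more and the exact placement differs.

  p4d = crst_table_alloc(mm);
  if (p4d)
          pagetable_p4d_ctor(virt_to_ptdesc(p4d));        /* missed by 78966b550289 */

  pgd = crst_table_alloc(mm);
  if (!pgd) {
          /* PGD upgrade fail path */
          pagetable_dtor(virt_to_ptdesc(p4d));            /* missed by 68c601de75d8 */
          crst_table_free(mm, p4d);
          goto err;
  }
  pagetable_pgd_ctor(virt_to_ptdesc(pgd));                /* missed by 68c601de75d8 */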
Link: https://lkml.kernel.org/r/20250123160349.200154-1-agordeev@linux.ibm.com Fixes: 78966b550289 ("s390: pgtable: add statistics for PUD and P4D level page table") Fixes: 68c601de75d8 ("mm: introduce ctor/dtor at PGD level") Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Reported-by: Heiko Carstens <hca@linux.ibm.com> Closes: https://lore.kernel.org/all/20250122074954.8685-A-hca@linux.ibm.com/ Suggested-by: Heiko Carstens <hca@linux.ibm.com> Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Acked-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jim Zhao [Thu, 21 Nov 2024 10:05:39 +0000 (18:05 +0800)]
mm/page-writeback: consolidate wb_thresh bumping logic into __wb_calc_thresh
Address the feedback from 39ac99852fca ("mm/page-writeback: raise
wb_thresh to prevent write blocking with strictlimit"). The wb_thresh
bumping logic is scattered across wb_position_ratio, __wb_calc_thresh, and
wb_update_dirty_ratelimit. For consistency, consolidate all wb_thresh
bumping logic into __wb_calc_thresh.
Link: https://lkml.kernel.org/r/20241121100539.605818-1-jimzhao.ai@gmail.com Signed-off-by: Jim Zhao <jimzhao.ai@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Yuntao Wang [Wed, 15 Jan 2025 04:16:34 +0000 (12:16 +0800)]
mm/page_alloc: remove the incorrect and misleading comment
The comment removed in this patch originally belonged to the
build_zonelists_in_zone_order() function, which was introduced by commit f0c0b2b808f2 ("change zonelist order: zonelist order selection logic").
Later, commit c9bff3eebc09 ("mm, page_alloc: rip out ZONELIST_ORDER_ZONE")
removed build_zonelists_in_zone_order() but left its comment behind.
Subsequently, commit 9d3be21bf9c0 ("mm, page_alloc: simplify zonelist
initialization") moved the node_order variable into build_zonelists(),
making the comment that originally belonged to build_zonelists_in_zone_order()
appear as if it were part of build_zonelists().
Sergey Senozhatsky [Wed, 15 Jan 2025 07:19:16 +0000 (16:19 +0900)]
zram: remove zcomp_stream_put() from write_incompressible_page()
We cannot and should not put the per-CPU compression stream in
write_incompressible_page(), because that function never gets any
per-CPU stream in the first place. It is zram_write_page() that
puts the stream before it calls write_incompressible_page().
Byungchul Park [Thu, 8 Aug 2024 06:53:58 +0000 (15:53 +0900)]
mm: separate move/undo parts from migrate_pages_batch()
Functionally, no change. This is preparation for the luf mechanism, which
requires separate folio lists for its own handling during migration.
Refactor migrate_pages_batch() so that the move/undo parts are separated
out of it.
Thomas Weißschuh [Tue, 14 Jan 2025 16:06:48 +0000 (17:06 +0100)]
selftests/mm: virtual_address_range: avoid reading from VM_IO mappings
The virtual_address_range selftest reads from the start of each mapping
listed in /proc/self/maps. However, not all mappings are valid to access
arbitrarily.
For example the vvar data used for virtual clocks on x86 [vvar_vclock] can
only be accessed if 1) the kernel configuration enables virtual clocks and
2) the hypervisor provided the data for it. Only the VDSO itself has the
necessary information to know this. Since commit e93d2521b27f ("x86/vdso:
Split virtual clock pages into dedicated mapping") the virtual clock data
was split out into its own mapping, leading to EFAULT from read() during
the validation.
Check for the VM_IO flag as a proxy. It is present for the VVAR mappings,
and MMIO ranges can be dangerous to access arbitrarily.
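For context, userspace can see VM_IO as the "io" mnemonic on the VmFlags line of /proc/self/smaps; a hedged sketch of that kind of check is below, showing one way to detect the flag rather than necessarily the exact mechanism the patch uses.

  #include <stdbool.h>
  #include <string.h>

  /* Return true if an smaps "VmFlags:" line lists the "io" mnemonic (VM_IO). */
  static bool vmflags_has_io(char *vmflags_line)
  {
          char *tok, *save = NULL;

          tok = strtok_r(vmflags_line, " \t\n", &save);   /* skip "VmFlags:" */
          while ((tok = strtok_r(NULL, " \t\n", &save)))
                  if (strcmp(tok, "io") == 0)
                          return true;
          return false;
  }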
Link: https://lkml.kernel.org/r/20250114-virtual_address_range-tests-v4-4-6fd7269934a5@linutronix.de Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202412271148.2656e485-lkp@intel.com Fixes: e93d2521b27f ("x86/vdso: Split virtual clock pages into dedicated mapping") Fixes: 010409649885 ("selftests/mm: confirm VA exhaustion without reliance on correctness of mmap()") Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Suggested-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/lkml/e97c2a5d-c815-4936-a767-ac42a3220a90@redhat.com/ Acked-by: David Hildenbrand <david@redhat.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Shuah Khan (Samsung OSG) <shuah@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Thomas Weißschuh [Tue, 14 Jan 2025 16:06:46 +0000 (17:06 +0100)]
selftests/mm: virtual_address_range: unmap chunks after validation
For each accessed chunk a PTE is created. More than 1GiB of PTEs is used
in this way. Remove each PTE after validating a chunk to reduce peak
memory usage.
It is important to only unmap memory that was previously mmap()ed by the test,
as unmapping other mappings, like the stack, heap, or executable mappings,
will crash the process.
The mappings read from /proc/self/maps and the return values from mmap()
don't allow a simple correlation, due to merging and the lack of a guaranteed
order.
To correlate the pointers and mappings use prctl(PR_SET_VMA_ANON_NAME).
While it introduces a test dependency, other alternatives would introduce
runtime or development overhead.
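A minimal sketch of naming an anonymous mapping with prctl() so it can later be matched against /proc/self/maps; the name string is illustrative, and CONFIG_ANON_VMA_NAME must be enabled for the prctl() to succeed.

  #include <stddef.h>
  #include <sys/mman.h>
  #include <sys/prctl.h>
  #include <linux/prctl.h>

  int main(void)
  {
          size_t len = 1UL << 20;
          void *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

          /* The range then shows up as "[anon:va_range_test]" in /proc/self/maps,
           * which lets the test correlate it with its own mmap() return value. */
          prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME,
                (unsigned long)p, len, (unsigned long)"va_range_test");

          munmap(p, len);         /* only unmap what we mmap()ed ourselves */
          return 0;
  }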
Link: https://lkml.kernel.org/r/20250114-virtual_address_range-tests-v4-2-6fd7269934a5@linutronix.de Fixes: 010409649885 ("selftests/mm: confirm VA exhaustion without reliance on correctness of mmap()") Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Acked-by: David Hildenbrand <david@redhat.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Dev Jain <dev.jain@arm.com> Cc: kernel test robot <oliver.sang@intel.com> Cc: Shuah Khan (Samsung OSG) <shuah@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Thomas Weißschuh [Tue, 14 Jan 2025 16:06:45 +0000 (17:06 +0100)]
selftests/mm: virtual_address_range: mmap() without PROT_WRITE
Patch series "selftests/mm: virtual_address_range: Reduce memory", v4.
The selftest started failing after commit e93d2521b27f ("x86/vdso: Split
virtual clock pages into dedicated mapping") was merged. While debugging it,
I stumbled upon some memory usage optimizations.
With these, the test now runs on a VM with only 60MiB of memory.
This patch (of 4):
When mapping a chunk larger than the available physical memory with
PROT_WRITE while overcommit is disabled, the mapping will fail. This
prevents the test from running on systems with less than ~1GiB of memory
and triggers an inscrutable test failure. As the mappings are never
written to anyway, the flag can be removed.
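A hedged sketch of the change from the caller's point of view; read-only private anonymous mappings are not charged against the overcommit limit, so the large reservation succeeds even on small-memory systems.

  #include <stddef.h>
  #include <sys/mman.h>

  /* Reserve a large virtual range for probing only; with PROT_READ | PROT_WRITE
   * this mmap() can fail when overcommit is disabled and memory is small. */
  static void *reserve_chunk(size_t chunk_size)
  {
          return mmap(NULL, chunk_size, PROT_READ,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  }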
Jens Axboe [Fri, 20 Dec 2024 15:47:50 +0000 (08:47 -0700)]
mm: add FGP_DONTCACHE folio creation flag
Callers can pass this in for uncached folio creation, in which case if a
folio is newly created it gets marked as uncached. If a folio exists for
this index and lookup succeeds, then it will not get marked as uncached.
If an !uncached lookup finds a cached folio, clear the flag. In that
case, there are competing uncached and cached users of the folio, and it
should not get pruned.
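A hedged sketch of how a caller might request this behavior; the call shape follows __filemap_get_folio(), and FGP_DONTCACHE only exists in trees carrying this series.

  struct folio *folio;

  folio = __filemap_get_folio(mapping, index,
                              FGP_LOCK | FGP_CREAT | FGP_DONTCACHE,
                              mapping_gfp_mask(mapping));
  /* Newly created: marked uncached and pruned after use.
   * Already present in the page cache: left as a normal cached folio. */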
Link: https://lkml.kernel.org/r/20241220154831.1086649-13-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk> Cc: Brian Foster <bfoster@redhat.com> Cc: Chris Mason <clm@meta.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jens Axboe [Fri, 20 Dec 2024 15:47:49 +0000 (08:47 -0700)]
mm: call filemap_fdatawrite_range_kick() after IOCB_DONTCACHE issue
When a buffered write with IOCB_DONTCACHE set has been successfully
submitted, call filemap_fdatawrite_range_kick() to kick off the IO. File
systems call generic_write_sync() for any successful buffered write
submission, hence add the logic here rather than needing to modify the
file system.
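A hedged sketch of the shape of that hook; the condition and range arithmetic are approximations of what the series does.

  /* in generic_write_sync(), after the existing iocb_is_dsync() handling: */
  } else if (count > 0 && (iocb->ki_flags & IOCB_DONTCACHE)) {
          struct address_space *mapping = iocb->ki_filp->f_mapping;

          /* kick off non-integrity writeback of the just-written range */
          filemap_fdatawrite_range_kick(mapping, iocb->ki_pos - count,
                                        iocb->ki_pos - 1);
  }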
Link: https://lkml.kernel.org/r/20241220154831.1086649-12-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk> Cc: Brian Foster <bfoster@redhat.com> Cc: Chris Mason <clm@meta.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Works like filemap_fdatawrite_range(), except it's a non-integrity data
writeback and hence only starts writeback on the specified range. Will
help facilitate generically starting uncached writeback from
generic_write_sync(), as header dependencies preclude doing this inline
from fs.h.
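A hedged approximation of what such a non-integrity range kick can look like, built on the existing filemap_fdatawrite_wbc() machinery; the helper itself is what this patch introduces, so treat this as a sketch rather than the final code.

  int filemap_fdatawrite_range_kick(struct address_space *mapping,
                                    loff_t start, loff_t end)
  {
          struct writeback_control wbc = {
                  .sync_mode   = WB_SYNC_NONE,    /* non-integrity writeback */
                  .nr_to_write = LONG_MAX,
                  .range_start = start,
                  .range_end   = end,
          };

          return filemap_fdatawrite_wbc(mapping, &wbc);
  }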
Link: https://lkml.kernel.org/r/20241220154831.1086649-11-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk> Cc: Brian Foster <bfoster@redhat.com> Cc: Chris Mason <clm@meta.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jens Axboe [Fri, 20 Dec 2024 15:47:47 +0000 (08:47 -0700)]
mm/filemap: drop streaming/uncached pages when writeback completes
If the folio is marked as streaming, drop pages when writeback completes.
Intended to be used with RWF_DONTCACHE, to avoid needing sync writes for
uncached IO.
Link: https://lkml.kernel.org/r/20241220154831.1086649-10-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk> Cc: Brian Foster <bfoster@redhat.com> Cc: Chris Mason <clm@meta.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jens Axboe [Fri, 20 Dec 2024 15:47:46 +0000 (08:47 -0700)]
mm/filemap: add read support for RWF_DONTCACHE
Add RWF_DONTCACHE as a read operation flag, which means that any data read
will be removed from the page cache upon completion. Uses the page cache
to synchronize, and simply prunes folios that were instantiated when the
operation completes. While it would be possible to use private pages for
this, using the page cache as synchronization is handy for a variety of
reasons:
1) No special truncate magic is needed
2) Async buffered reads need some place to serialize, using the page
cache is a lot easier than writing extra code for this
3) The pruning cost is pretty reasonable
and the code to support this is much simpler as a result.
You can think of uncached buffered IO as being the much more attractive
cousin of O_DIRECT - it has none of the restrictions of O_DIRECT. Yes, it
will copy the data, but unlike regular buffered IO, it doesn't run into
the unpredictability of the page cache in terms of reclaim. As an
example, on a test box with 32 drives, reading them with buffered IO looks
as follows:
where it's quite easy to see where the page cache filled up, and
performance went from good to erratic, and finally settles at a much
lower rate. Looking at top while this is ongoing, we see:
which is just chugging along at ~155GB/sec of read performance. Looking
at top, we see:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7961 root 20 0 267004 0 0 S 3180 0.0 5:37.95 uncached
8024 axboe 20 0 14292 4096 0 R 1.0 0.0 0:00.13 top
where just the test app is using CPU, no reclaim is taking place outside
of the main thread. Not only is performance 65% better, it's also using
half the CPU to do it.
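As a hedged illustration from the userspace side, such a read would be issued with preadv2(); the RWF_DONTCACHE constant and its value here are assumptions that only hold on kernels and headers carrying this series.

  #define _GNU_SOURCE
  #include <sys/uio.h>
  #include <unistd.h>

  #ifndef RWF_DONTCACHE
  #define RWF_DONTCACHE 0x00000080        /* assumed value; prefer the uapi header */
  #endif

  ssize_t read_dontcache(int fd, void *buf, size_t len, off_t off)
  {
          struct iovec iov = { .iov_base = buf, .iov_len = len };

          /* The data still flows through the page cache for synchronization,
           * but the folios are pruned once the read completes. */
          return preadv2(fd, &iov, 1, off, RWF_DONTCACHE);
  }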
Link: https://lkml.kernel.org/r/20241220154831.1086649-9-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk> Cc: Brian Foster <bfoster@redhat.com> Cc: Chris Mason <clm@meta.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jens Axboe [Fri, 20 Dec 2024 15:47:45 +0000 (08:47 -0700)]
fs: add RWF_DONTCACHE iocb and FOP_DONTCACHE file_operations flag
If a file system supports uncached buffered IO, it may set FOP_DONTCACHE
and enable support for RWF_DONTCACHE. If RWF_DONTCACHE is attempted
on a file system that does not support it, the request fails with -EOPNOTSUPP.
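A hedged sketch of the gate described above; the placement (per-IO flag setup) and field usage are approximations, with FOP_DONTCACHE/RWF_DONTCACHE coming from this series and fop_flags being the existing file_operations flag word.

  /* e.g. while translating RWF_* flags into IOCB_* flags for a kiocb */
  if (flags & RWF_DONTCACHE) {
          /* only honored when the filesystem explicitly opted in */
          if (!(ki->ki_filp->f_op->fop_flags & FOP_DONTCACHE))
                  return -EOPNOTSUPP;
          kiocb_flags |= IOCB_DONTCACHE;
  }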
Link: https://lkml.kernel.org/r/20241220154831.1086649-8-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk> Cc: Brian Foster <bfoster@redhat.com> Cc: Chris Mason <clm@meta.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jens Axboe [Fri, 20 Dec 2024 15:47:44 +0000 (08:47 -0700)]
mm/truncate: add folio_unmap_invalidate() helper
Add a folio_unmap_invalidate() helper, which unmaps and invalidates a
given folio. The caller must already have locked the folio. Embed the
old invalidate_complete_folio2() helper in there as well, as nobody else
calls it.
Use this new helper in invalidate_inode_pages2_range(), rather than
duplicating the code there.
In preparation for using this elsewhere as well, have it take a gfp_t mask
rather than assuming GFP_KERNEL is the right choice. This bubbles back to
invalidate_complete_folio2() as well.
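A hedged sketch of the helper's contract as described above (signature and usage approximated from the text, not copied from the patch):

  /* Caller must hold the folio lock: unmap the folio from page tables,
   * then invalidate it from the page cache, honoring the given gfp mask. */
  int folio_unmap_invalidate(struct address_space *mapping, struct folio *folio,
                             gfp_t gfp);

  /* e.g. from a locked-folio context in invalidate_inode_pages2_range(): */
  ret = folio_unmap_invalidate(mapping, folio, GFP_KERNEL);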
Link: https://lkml.kernel.org/r/20241220154831.1086649-7-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk> Cc: Brian Foster <bfoster@redhat.com> Cc: Chris Mason <clm@meta.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>