]> www.infradead.org Git - users/jedix/linux-maple.git/log
users/jedix/linux-maple.git
9 days agocompiles, ship it. split_w_structs
Liam R. Howlett [Wed, 16 Apr 2025 01:02:35 +0000 (21:02 -0400)]
compiles, ship it.

The copy and pasting of code does not work out without an overall plan

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
10 days agobroken for now
Liam R. Howlett [Mon, 14 Apr 2025 19:39:23 +0000 (15:39 -0400)]
broken for now

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
2 weeks agorename alloc to is_alloc
Liam R. Howlett [Thu, 10 Apr 2025 19:12:52 +0000 (15:12 -0400)]
rename alloc to is_alloc

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
2 weeks agoma_part -> part and comments
Liam R. Howlett [Thu, 10 Apr 2025 18:43:55 +0000 (14:43 -0400)]
ma_part -> part and comments

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
2 weeks agofinished conversion
Liam R. Howlett [Thu, 10 Apr 2025 18:34:08 +0000 (14:34 -0400)]
finished conversion

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
2 weeks agotry_rebalance converted
Liam R. Howlett [Thu, 10 Apr 2025 18:19:39 +0000 (14:19 -0400)]
try_rebalance converted

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
2 weeks agoconverged converted
Liam R. Howlett [Thu, 10 Apr 2025 18:13:16 +0000 (14:13 -0400)]
converged converted

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
2 weeks agouse split_data struct
Liam R. Howlett [Thu, 10 Apr 2025 15:58:47 +0000 (11:58 -0400)]
use split_data struct

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
2 weeks agowip, cleanup
Liam R. Howlett [Tue, 8 Apr 2025 17:50:28 +0000 (13:50 -0400)]
wip, cleanup

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
2 weeks agomas_wr_split cleanup and some comments
Liam R. Howlett [Tue, 8 Apr 2025 15:07:39 +0000 (11:07 -0400)]
mas_wr_split cleanup and some comments

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
2 weeks agoRevert "remove dead code"
Liam R. Howlett [Tue, 8 Apr 2025 02:02:54 +0000 (22:02 -0400)]
Revert "remove dead code"

This reverts commit b3062ee59ae87e44f81fd309b79c8fbf5660f2db.

2 weeks agomore printk and dead code
Liam R. Howlett [Tue, 8 Apr 2025 02:00:03 +0000 (22:00 -0400)]
more printk and dead code

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
2 weeks agoremove dead code
Liam R. Howlett [Tue, 8 Apr 2025 01:53:20 +0000 (21:53 -0400)]
remove dead code

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
2 weeks agoremove printk
Liam R. Howlett [Tue, 8 Apr 2025 01:53:01 +0000 (21:53 -0400)]
remove printk

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
2 weeks agoworking new code
Liam R. Howlett [Tue, 8 Apr 2025 01:49:42 +0000 (21:49 -0400)]
working new code

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
3 weeks agodebug
Liam R. Howlett [Wed, 2 Apr 2025 13:36:49 +0000 (09:36 -0400)]
debug

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
7 weeks agomas_wr_try_rebalance: Drop dead code
Liam R. Howlett [Fri, 28 Feb 2025 22:28:38 +0000 (17:28 -0500)]
mas_wr_try_rebalance: Drop dead code

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
7 weeks agomas_split: Drop dead code
Liam R. Howlett [Fri, 28 Feb 2025 22:27:39 +0000 (17:27 -0500)]
mas_split: Drop dead code

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
7 weeks agomaple_tree: use mt_wr_split_data() in split and try_rebalance
Liam R. Howlett [Fri, 28 Feb 2025 22:27:01 +0000 (17:27 -0500)]
maple_tree: use mt_wr_split_data() in split and try_rebalance

Signed-off-by: Liam R. Howlett <howlett@gmail.com>
8 weeks agofix rewind for ma_part
Liam R. Howlett [Thu, 27 Feb 2025 18:57:47 +0000 (13:57 -0500)]
fix rewind for ma_part

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
8 weeks agomas_wr_split using mt_wr_split_data
Liam R. Howlett [Thu, 27 Feb 2025 03:02:30 +0000 (22:02 -0500)]
mas_wr_split using mt_wr_split_data

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
8 weeks agomt_wr_split_data()
Liam R. Howlett [Thu, 27 Feb 2025 02:20:19 +0000 (21:20 -0500)]
mt_wr_split_data()

Signed-off-by: Liam R. Howlett <howlett@gmail.com>
2 months agoprogress, I guess Split and reblance testing
Liam R. Howlett [Wed, 19 Feb 2025 02:45:42 +0000 (21:45 -0500)]
progress, I guess  Split and reblance testing

Signed-off-by: Liam R. Howlett <howlett@gmail.com>
2 months agoworking on split rebalance now
Liam R. Howlett [Fri, 14 Feb 2025 20:42:56 +0000 (15:42 -0500)]
working on split rebalance now

Signed-off-by: Liam R. Howlett <howlett@gmail.com>
2 months agowip, moving to struct splitting
Liam R. Howlett [Tue, 11 Feb 2025 17:18:30 +0000 (12:18 -0500)]
wip, moving to struct splitting

Signed-off-by: Liam R. Howlett <howlett@gmail.com>
2 months agorebalance wip
Liam R. Howlett [Tue, 4 Feb 2025 18:05:35 +0000 (13:05 -0500)]
rebalance wip

Signed-off-by: Liam R. Howlett <howlett@gmail.com>
2 months agomas_wr_rebalance() - it compiles new_split_2025
Liam R. Howlett [Mon, 3 Feb 2025 21:04:12 +0000 (16:04 -0500)]
mas_wr_rebalance() - it compiles

Signed-off-by: Liam R. Howlett <howlett@gmail.com>
2 months agomas_destroy_rebalance() whitespace fix
Liam R. Howlett [Thu, 30 Jan 2025 18:42:29 +0000 (13:42 -0500)]
mas_destroy_rebalance() whitespace fix

Signed-off-by: Liam R. Howlett <howlett@gmail.com>
2 months agomas_wr_rebalance_nodes: fix line length
Liam R. Howlett [Mon, 27 Jan 2025 21:10:36 +0000 (16:10 -0500)]
mas_wr_rebalance_nodes: fix line length

Signed-off-by: Liam R. Howlett <howlett@gmail.com>
2 months agotesting/raix-tree/maple: Increase readers and reduce delay for faster machines
Liam R. Howlett [Mon, 27 Jan 2025 20:59:53 +0000 (15:59 -0500)]
testing/raix-tree/maple: Increase readers and reduce delay for faster machines

Signed-off-by: Liam R. Howlett <howlett@gmail.com>
2 months agomas_wr_rebalance for insufficient data
Liam R. Howlett [Mon, 27 Jan 2025 20:56:40 +0000 (15:56 -0500)]
mas_wr_rebalance for insufficient data

Signed-off-by: Liam R. Howlett <howlett@gmail.com>
2 months agodrop mas_wr_rebalance to be re-added clean
Liam R. Howlett [Mon, 27 Jan 2025 20:56:11 +0000 (15:56 -0500)]
drop mas_wr_rebalance to be re-added clean

Signed-off-by: Liam R. Howlett <howlett@gmail.com>
2 months agodrop meaningless return from mas_wr_split
Liam R. Howlett [Thu, 5 Dec 2024 20:49:17 +0000 (15:49 -0500)]
drop meaningless return from mas_wr_split

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
2 months agomaple_tree: Combine mas_parent_gap() into mas_update_gap()
Liam R. Howlett [Sat, 30 Nov 2024 04:54:58 +0000 (23:54 -0500)]
maple_tree: Combine mas_parent_gap() into mas_update_gap()

mas_parent_gap() is used in one location and a lot of what is needed already
exists in the calling function.  Inline the function and dropping the
duplication simplifies the code and reduces the instruction count.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
2 months agomaple_tree: Inline ma_max_gap() in mas_update_gap()
Liam R. Howlett [Sat, 30 Nov 2024 04:30:52 +0000 (23:30 -0500)]
maple_tree: Inline ma_max_gap() in mas_update_gap()

ma_max_gap is called from a single location and can benefit from the
setup in mas_update_gap, so inline it.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
2 months agomaple_tree: Fix ma_dead_node() comment
Liam R. Howlett [Wed, 30 Oct 2024 15:28:08 +0000 (11:28 -0400)]
maple_tree: Fix ma_dead_node() comment

Update the arguemnt description to be accurate.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
2 months agoinline three small functions
Liam R. Howlett [Sat, 30 Nov 2024 04:02:41 +0000 (23:02 -0500)]
inline three small functions

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
2 months agomas_wr_split() Avoid mas_wr_new_end() call
Liam R. Howlett [Sat, 30 Nov 2024 04:01:13 +0000 (23:01 -0500)]
mas_wr_split() Avoid mas_wr_new_end() call

ma_part has the size set to what is needed by examining the same data.
Use ma_part.size instead.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
2 months agotools/testing/radix-tree: Add maple tree fuzzer
Liam R. Howlett [Wed, 23 Aug 2023 16:13:52 +0000 (12:13 -0400)]
tools/testing/radix-tree: Add maple tree fuzzer

This is the introduction of the maple tree fuzzer into the radix-tree
test suite, based heavily on the work of Vasily Gorbik sent in March
2022.

The tester uses the LLVM fuzzer to automate testing of the maple tree by
randomly inserting, storing, deleting, and resetting the tree.  Testing
has been expanded to test both allocation and basic trees.

After building the fuzz-maple target with clang, just run the resulting
executable.  The llvm libfuzzer supports minimizing the steps to
reproduce a crash from a crash file.

Using V=1 on the minimized crash will result in a testcase that can be
added to the lib/test_maple_tree.c test suite.  Using V=2 can help
figure out what is happening to cause the crash.

Cc: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
2 months agomaple_tree: Use local variable
Liam R. Howlett [Sat, 30 Nov 2024 01:34:34 +0000 (20:34 -0500)]
maple_tree: Use local variable

the maple state variable has already been set up, so use it.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
2 months agoinlineing
Liam R. Howlett [Sat, 30 Nov 2024 01:33:38 +0000 (20:33 -0500)]
inlineing

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
2 months agoalways inlined and such
Liam R. Howlett [Fri, 29 Nov 2024 23:26:03 +0000 (18:26 -0500)]
always inlined and such

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
2 months agomaple_tree: Add skipping slot support
Liam R. Howlett [Fri, 29 Nov 2024 22:15:53 +0000 (17:15 -0500)]
maple_tree: Add skipping slot support

Allow maple node state to jump entire slots when necessary.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
2 months agooptimize a bunch of junk
Liam R. Howlett [Fri, 29 Nov 2024 18:25:18 +0000 (13:25 -0500)]
optimize a bunch of junk

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
2 months agorebalance starts now
Liam R. Howlett [Fri, 29 Nov 2024 15:31:43 +0000 (10:31 -0500)]
rebalance starts now

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
2 months agomaple_tree: Optimise mas_wr_store_entry() and mas_prealloc_calc()
Liam R. Howlett [Fri, 29 Nov 2024 16:28:33 +0000 (11:28 -0500)]
maple_tree: Optimise mas_wr_store_entry() and mas_prealloc_calc()

Rearrange the switch statements so that the more likely code paths are
checked first.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
2 months agomaple_tree: Create new mas_wr_split()
Liam R. Howlett [Mon, 4 Nov 2024 15:32:03 +0000 (10:32 -0500)]
maple_tree: Create new mas_wr_split()

Stop using the large struct big_node and use logic with two allocated
nodes.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
2 months agofoo
Andrew Morton [Sun, 26 Jan 2025 23:05:03 +0000 (15:05 -0800)]
foo

2 months agomailmap: add an entry for Hamza Mahfooz
Hamza Mahfooz [Mon, 20 Jan 2025 20:56:59 +0000 (15:56 -0500)]
mailmap: add an entry for Hamza Mahfooz

Map my previous work email to my current one.

Link: https://lkml.kernel.org/r/20250120205659.139027-1-hamzamahfooz@linux.microsoft.com
Signed-off-by: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Hans verkuil <hverkuil@xs4all.nl>
Cc: Matthieu Baerts <matttbe@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomm: gup: fix infinite loop within __get_longterm_locked
Zhaoyang Huang [Tue, 21 Jan 2025 02:01:59 +0000 (10:01 +0800)]
mm: gup: fix infinite loop within __get_longterm_locked

We can run into an infinite loop in __get_longterm_locked() when
collect_longterm_unpinnable_folios() finds only folios that are isolated
from the LRU or were never added to the LRU.  This can happen when all
folios to be pinned are never added to the LRU, for example when
vm_ops->fault allocated pages using cma_alloc() and never added them to
the LRU.

We incorrectly update the "collected" variable even if nothing was
collected.  Fix it by incrementing "collected" only when we isolated a
folio and added it to the list of folios to migrate.

Link: https://lkml.kernel.org/r/20250121020159.3636477-1-zhaoyang.huang@unisoc.com
Fixes: 67e139b02d99 ("mm/gup.c: refactor check_and_migrate_movable_pages()")
Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Cc: Aijun Sun <aijun.sun@unisoc.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agoocfs2: fix incorrect CPU endianness conversion causing mount failure
Heming Zhao [Tue, 21 Jan 2025 11:22:03 +0000 (19:22 +0800)]
ocfs2: fix incorrect CPU endianness conversion causing mount failure

Commit 23aab037106d ("ocfs2: fix UBSAN warning in ocfs2_verify_volume()")
introduced a regression bug.  The blksz_bits value is already converted to
CPU endian in the previous code; therefore, the code shouldn't use
le32_to_cpu() anymore.

Link: https://lkml.kernel.org/r/20250121112204.12834-1-heming.zhao@suse.com
Fixes: 23aab037106d ("ocfs2: fix UBSAN warning in ocfs2_verify_volume()")
Signed-off-by: Heming Zhao <heming.zhao@suse.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomm: page_isolation: avoid calling folio_hstate() without hugetlb_lock
Liu Shixin [Wed, 22 Jan 2025 06:11:51 +0000 (14:11 +0800)]
mm: page_isolation: avoid calling folio_hstate() without hugetlb_lock

I found a NULL pointer dereference as followed:

 BUG: kernel NULL pointer dereference, address: 0000000000000028
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 0 P4D 0
 Oops: Oops: 0000 [#1] SMP PTI
 CPU: 5 UID: 0 PID: 5964 Comm: sh Kdump: loaded Not tainted 6.13.0-dirty #20
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.
 RIP: 0010:has_unmovable_pages+0x184/0x360
 ...
 Call Trace:
  <TASK>
  set_migratetype_isolate+0xd1/0x180
  start_isolate_page_range+0xd2/0x170
  alloc_contig_range_noprof+0x101/0x660
  alloc_contig_pages_noprof+0x238/0x290
  alloc_gigantic_folio.isra.0+0xb6/0x1f0
  only_alloc_fresh_hugetlb_folio.isra.0+0xf/0x60
  alloc_pool_huge_folio+0x80/0xf0
  set_max_huge_pages+0x211/0x490
  __nr_hugepages_store_common+0x5f/0xe0
  nr_hugepages_store+0x77/0x80
  kernfs_fop_write_iter+0x118/0x200
  vfs_write+0x23c/0x3f0
  ksys_write+0x62/0xe0
  do_syscall_64+0x5b/0x170
  entry_SYSCALL_64_after_hwframe+0x76/0x7e

As has_unmovable_pages() call folio_hstate() without hugetlb_lock, there
is a race to free the HugeTLB page between PageHuge() and folio_hstate().
There is no need to add hugetlb_lock here as the HugeTLB page can be freed
in lot of places.  So it's enough to unfold folio_hstate() and add a check
to avoid NULL pointer dereference for hugepage_migration_supported().

Link: https://lkml.kernel.org/r/20250122061151.578768-1-liushixin2@huawei.com
Fixes: 464c7ffbcb16 ("mm/hugetlb: filter out hugetlb pages if HUGEPAGE migration is not supported.")
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nanyong Sun <sunnanyong@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agoMAINTAINERS: mailmap: update Yosry Ahmed's email address
Yosry Ahmed [Thu, 23 Jan 2025 23:13:44 +0000 (23:13 +0000)]
MAINTAINERS: mailmap: update Yosry Ahmed's email address

Moving to a linux.dev email address.

Link: https://lkml.kernel.org/r/20250123231344.817358-1-yosry.ahmed@linux.dev
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomailmap, docs: update email to carlos.bilbao@kernel.org
Carlos Bilbao [Sat, 11 Jan 2025 16:11:06 +0000 (10:11 -0600)]
mailmap, docs: update email to carlos.bilbao@kernel.org

Update .mailmap to reflect my new (and final) primary email address,
carlos.bilbao@kernel.org.  This ensures consistent attribution in Git
history.  Also update my contact information in file
Documentation/translations/sp_SP/index.rst to help contributors reach out
for Spanish translations.

Link: https://lkml.kernel.org/r/20250111161110.862131-1-carlos.bilbao@kernel.org
Signed-off-by: Carlos Bilbao <carlos.bilbao@kernel.org>
Cc: Avadhut Naik <avadhut.naik@amd.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agoscripts/gdb: fix aarch64 userspace detection in get_current_task
Jan Kiszka [Fri, 10 Jan 2025 10:36:33 +0000 (11:36 +0100)]
scripts/gdb: fix aarch64 userspace detection in get_current_task

At least recent gdb releases (seen with 14.2) return SP_EL0 as signed long
which lets the right-shift always return 0.

Link: https://lkml.kernel.org/r/dcd2fabc-9131-4b48-8419-6444e2d67454@siemens.com
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Kieran Bingham <kbingham@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomm-vmscan-accumulate-nr_demoted-for-accurate-demotion-statistics-v2
Li Zhijian [Sat, 11 Jan 2025 01:52:53 +0000 (09:52 +0800)]
mm-vmscan-accumulate-nr_demoted-for-accurate-demotion-statistics-v2

introduce local nr_demoted to fix nr_reclaimed double counting

Link: https://lkml.kernel.org/r/20250111015253.425693-1-lizhijian@fujitsu.com
Fixes: f77f0c751478 ("mm,memcg: provide per-cgroup counters for NUMA balancing operations")
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Cc: Kaiyang Zhao <kaiyang2@cs.cmu.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomm/vmscan: accumulate nr_demoted for accurate demotion statistics
Li Zhijian [Fri, 10 Jan 2025 12:21:32 +0000 (20:21 +0800)]
mm/vmscan: accumulate nr_demoted for accurate demotion statistics

In shrink_folio_list(), demote_folio_list() can be called 2 times.
Currently stat->nr_demoted will only store the last nr_demoted( the later
nr_demoted is always zero, the former nr_demoted will get lost), as a
result number of demoted pages is not accurate.

Accumulate the nr_demoted count across multiple calls to
demote_folio_list(), ensuring accurate reporting of demotion statistics.

Link: https://lkml.kernel.org/r/20250110122133.423481-1-lizhijian@fujitsu.com
Fixes: f77f0c751478 ("mm,memcg: provide per-cgroup counters for NUMA balancing operations")
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Acked-by: Kaiyang Zhao <kaiyang2@cs.cmu.edu>
Tested-by: Donet Tom <donettom@linux.ibm.com>
Reviewed-by: Donet Tom <donettom@linux.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomm/hugetlb_vmemmap: fix memory loads ordering
Yu Zhao [Wed, 8 Jan 2025 07:48:21 +0000 (00:48 -0700)]
mm/hugetlb_vmemmap: fix memory loads ordering

Using x86_64 as an example, for a 32KB struct page[] area describing a 2MB
hugeTLB, HVO reduces the area to 4KB by the following steps:

1. Split the (r/w vmemmap) PMD mapping the area into 512 (r/w) PTEs;
2. For the 8 PTEs mapping the area, remap PTE 1-7 to the page mapped
   by PTE 0, and at the same time change the permission from r/w to
   r/o;
3. Free the pages PTE 1-7 used to map, hence the reduction from 32KB
   to 4KB.

However, the following race can happen due to improperly memory loads
ordering:
  CPU 1 (HVO)                     CPU 2 (speculative PFN walker)

  page_ref_freeze()
  synchronize_rcu()
                                  rcu_read_lock()
                                  page_is_fake_head() is false
  vmemmap_remap_pte()
  XXX: struct page[] becomes r/o

  page_ref_unfreeze()
                                  page_ref_count() is not zero

                                  atomic_add_unless(&page->_refcount)
                                  XXX: try to modify r/o struct page[]

Specifically, page_is_fake_head() must be ordered after page_ref_count()
on CPU 2 so that it can only return true for this case, to avoid the later
attempt to modify r/o struct page[].

This patch adds the missing memory barrier and makes the tests on
page_is_fake_head() and page_ref_count() done in the proper order.

Link: https://lkml.kernel.org/r/20250108074822.722696-1-yuzhao@google.com
Fixes: bd225530a4c7 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers")
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reported-by: Will Deacon <will@kernel.org>
Closes: https://lore.kernel.org/20241128142028.GA3506@willie-the-truck/
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomm/vmscan: fix hard LOCKUP in function isolate_lru_folios
liuye [Tue, 19 Nov 2024 06:08:42 +0000 (14:08 +0800)]
mm/vmscan: fix hard LOCKUP in function isolate_lru_folios

This fixes the following hard lockup in isolate_lru_folios() during memory
reclaim.  If the LRU mostly contains ineligible folios this may trigger
watchdog.

watchdog: Watchdog detected hard LOCKUP on cpu 173
RIP: 0010:native_queued_spin_lock_slowpath+0x255/0x2a0
Call Trace:
_raw_spin_lock_irqsave+0x31/0x40
folio_lruvec_lock_irqsave+0x5f/0x90
folio_batch_move_lru+0x91/0x150
lru_add_drain_per_cpu+0x1c/0x40
process_one_work+0x17d/0x350
worker_thread+0x27b/0x3a0
kthread+0xe8/0x120
ret_from_fork+0x34/0x50
ret_from_fork_asm+0x1b/0x30

lruvec->lru_lock owner:

PID: 2865     TASK: ffff888139214d40  CPU: 40   COMMAND: "kswapd0"
 #0 [fffffe0000945e60] crash_nmi_callback at ffffffffa567a555
 #1 [fffffe0000945e68] nmi_handle at ffffffffa563b171
 #2 [fffffe0000945eb0] default_do_nmi at ffffffffa6575920
 #3 [fffffe0000945ed0] exc_nmi at ffffffffa6575af4
 #4 [fffffe0000945ef0] end_repeat_nmi at ffffffffa6601dde
    [exception RIP: isolate_lru_folios+403]
    RIP: ffffffffa597df53  RSP: ffffc90006fb7c28  RFLAGS: 00000002
    RAX: 0000000000000001  RBX: ffffc90006fb7c60  RCX: ffffea04a2196f88
    RDX: ffffc90006fb7c60  RSI: ffffc90006fb7c60  RDI: ffffea04a2197048
    RBP: ffff88812cbd3010   R8: ffffea04a2197008   R9: 0000000000000001
    R10: 0000000000000000  R11: 0000000000000001  R12: ffffea04a2197008
    R13: ffffea04a2197048  R14: ffffc90006fb7de8  R15: 0000000003e3e937
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
    <NMI exception stack>
 #5 [ffffc90006fb7c28] isolate_lru_folios at ffffffffa597df53
 #6 [ffffc90006fb7cf8] shrink_active_list at ffffffffa597f788
 #7 [ffffc90006fb7da8] balance_pgdat at ffffffffa5986db0
 #8 [ffffc90006fb7ec0] kswapd at ffffffffa5987354
 #9 [ffffc90006fb7ef8] kthread at ffffffffa5748238
crash>

Scenario:
User processe are requesting a large amount of memory and keep page active.
Then a module continuously requests memory from ZONE_DMA32 area.
Memory reclaim will be triggered due to ZONE_DMA32 watermark alarm reached.
However pages in the LRU(active_anon) list are mostly from
the ZONE_NORMAL area.

Reproduce:
Terminal 1: Construct to continuously increase pages active(anon).
mkdir /tmp/memory
mount -t tmpfs -o size=1024000M tmpfs /tmp/memory
dd if=/dev/zero of=/tmp/memory/block bs=4M
tail /tmp/memory/block

Terminal 2:
vmstat -a 1
active will increase.
procs ---memory--- ---swap-- ---io---- -system-- ---cpu--- ...
 r  b   swpd   free  inact active   si   so    bi    bo
 1  0   0 1445623076 45898836 83646008    0    0     0
 1  0   0 1445623076 43450228 86094616    0    0     0
 1  0   0 1445623076 41003480 88541364    0    0     0
 1  0   0 1445623076 38557088 90987756    0    0     0
 1  0   0 1445623076 36109688 93435156    0    0     0
 1  0   0 1445619552 33663256 95881632    0    0     0
 1  0   0 1445619804 31217140 98327792    0    0     0
 1  0   0 1445619804 28769988 100774944    0    0     0
 1  0   0 1445619804 26322348 103222584    0    0     0
 1  0   0 1445619804 23875592 105669340    0    0     0

cat /proc/meminfo | head
Active(anon) increase.
MemTotal:       1579941036 kB
MemFree:        1445618500 kB
MemAvailable:   1453013224 kB
Buffers:            6516 kB
Cached:         128653956 kB
SwapCached:            0 kB
Active:         118110812 kB
Inactive:       11436620 kB
Active(anon):   115345744 kB
Inactive(anon):   945292 kB

When the Active(anon) is 115345744 kB, insmod module triggers
the ZONE_DMA32 watermark.

perf record -e vmscan:mm_vmscan_lru_isolate -aR
perf script
isolate_mode=0 classzone=1 order=1 nr_requested=32 nr_scanned=2
nr_skipped=2 nr_taken=0 lru=active_anon
isolate_mode=0 classzone=1 order=1 nr_requested=32 nr_scanned=0
nr_skipped=0 nr_taken=0 lru=active_anon
isolate_mode=0 classzone=1 order=0 nr_requested=32 nr_scanned=28835844
nr_skipped=28835844 nr_taken=0 lru=active_anon
isolate_mode=0 classzone=1 order=1 nr_requested=32 nr_scanned=28835844
nr_skipped=28835844 nr_taken=0 lru=active_anon
isolate_mode=0 classzone=1 order=0 nr_requested=32 nr_scanned=29
nr_skipped=29 nr_taken=0 lru=active_anon
isolate_mode=0 classzone=1 order=0 nr_requested=32 nr_scanned=0
nr_skipped=0 nr_taken=0 lru=active_anon

See nr_scanned=28835844.
28835844 * 4k = 115343376KB approximately equal to 115345744 kB.

If increase Active(anon) to 1000G then insmod module triggers
the ZONE_DMA32 watermark. hard lockup will occur.

In my device nr_scanned = 0000000003e3e937 when hard lockup.
Convert to memory size 0x0000000003e3e937 * 4KB = 261072092 KB.

   [ffffc90006fb7c28] isolate_lru_folios at ffffffffa597df53
    ffffc90006fb7c300000000000000020 0000000000000000
    ffffc90006fb7c40ffffc90006fb7d40 ffff88812cbd3000
    ffffc90006fb7c50ffffc90006fb7d30 0000000106fb7de8
    ffffc90006fb7c60ffffea04a2197008 ffffea0006ed4a48
    ffffc90006fb7c700000000000000000 0000000000000000
    ffffc90006fb7c800000000000000000 0000000000000000
    ffffc90006fb7c900000000000000000 0000000000000000
    ffffc90006fb7ca00000000000000000 0000000003e3e937
    ffffc90006fb7cb00000000000000000 0000000000000000
    ffffc90006fb7cc08d7c0b56b7874b00 ffff88812cbd3000

About the Fixes:
Why did it take eight years to be discovered?

The problem requires the following conditions to occur:
1. The device memory should be large enough.
2. Pages in the LRU(active_anon) list are mostly from the ZONE_NORMAL area.
3. The memory in ZONE_DMA32 needs to reach the watermark.

If the memory is not large enough, or if the usage design of ZONE_DMA32
area memory is reasonable, this problem is difficult to detect.

notes:
The problem is most likely to occur in ZONE_DMA32 and ZONE_NORMAL,
but other suitable scenarios may also trigger the problem.

Link: https://lkml.kernel.org/r/20241119060842.274072-1-liuye@kylinos.cn
Fixes: b2e18757f2c9 ("mm, vmscan: begin reclaiming pages on a per-node basis")
Signed-off-by: liuye <liuye@kylinos.cn>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Yang Shi <yang@os.amperecomputing.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomm/compaction: fix UBSAN shift-out-of-bounds warning
Liu Shixin [Thu, 23 Jan 2025 02:10:29 +0000 (10:10 +0800)]
mm/compaction: fix UBSAN shift-out-of-bounds warning

syzkaller reported a UBSAN shift-out-of-bounds warning of (1UL << order)
in isolate_freepages_block().  The bogus compound_order can be any value
because it is union with flags.  Add back the MAX_PAGE_ORDER check to fix
the warning.

Link: https://lkml.kernel.org/r/20250123021029.2826736-1-liushixin2@huawei.com
Fixes: 3da0272a4c7d ("mm/compaction: correctly return failure with bogus compound_order in strict mode")
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Kemeng Shi <shikemeng@huaweicloud.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Nanyong Sun <sunnanyong@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agos390/mm: add missing ctor/dtor on page table upgrade
Alexander Gordeev [Thu, 23 Jan 2025 16:03:49 +0000 (17:03 +0100)]
s390/mm: add missing ctor/dtor on page table upgrade

Commit 78966b550289 ("s390: pgtable: add statistics for PUD and P4D level
page table") misses the call to pagetable_p4d_ctor() against a newly
allocated P4D table in crst_table_upgrade();

Commit 68c601de75d8 ("mm: introduce ctor/dtor at PGD level") misses the
call to pagetable_pgd_ctor() against a newly allocated PGD and the call to
pagetable_dtor() against a newly allocated P4D that is about to be freed
on crst_table_upgrade() PGD upgrade fail path.

The missed constructors and destructor break (at least) the page table
accounting when a process memory space is upgraded.

Link: https://lkml.kernel.org/r/20250123160349.200154-1-agordeev@linux.ibm.com
Fixes: 78966b550289 ("s390: pgtable: add statistics for PUD and P4D level page table")
Fixes: 68c601de75d8 ("mm: introduce ctor/dtor at PGD level")
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reported-by: Heiko Carstens <hca@linux.ibm.com>
Closes: https://lore.kernel.org/all/20250122074954.8685-A-hca@linux.ibm.com/
Suggested-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Acked-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agokasan: sw_tags: use str_on_off() helper in kasan_init_sw_tags()
Thorsten Blum [Thu, 16 Jan 2025 06:24:04 +0000 (07:24 +0100)]
kasan: sw_tags: use str_on_off() helper in kasan_init_sw_tags()

Remove hard-coded strings by using the str_on_off() helper function.

Link: https://lkml.kernel.org/r/20250116062403.2496-2-thorsten.blum@linux.dev
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Suggested-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agotools: add VM_WARN_ON_VMG definition
Suren Baghdasaryan [Thu, 16 Jan 2025 18:15:38 +0000 (10:15 -0800)]
tools: add VM_WARN_ON_VMG definition

vma tests compilation yields the following error:

vma.c:732:9: error: implicit declaration of function â€˜VM_WARN_ON_VMG’

Fix it by adding missing VM_WARN_ON_VMG() definition.

Link: https://lkml.kernel.org/r/20250116181538.759469-1-surenb@google.com
Fixes: e3a7ae85f87c ("mm/debug: prefer VM_WARN_ON_VMG() to report VMG debug warnings")
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomm/damon/core: use str_high_low() helper in damos_wmark_wait_us()
Thorsten Blum [Thu, 16 Jan 2025 20:42:16 +0000 (21:42 +0100)]
mm/damon/core: use str_high_low() helper in damos_wmark_wait_us()

Remove hard-coded strings by using the str_high_low() helper function.

Link: https://lkml.kernel.org/r/20250116204216.106999-2-thorsten.blum@linux.dev
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agoseqlock: add missing parameter documentation for raw_seqcount_try_begin()
Suren Baghdasaryan [Thu, 16 Jan 2025 18:27:30 +0000 (10:27 -0800)]
seqlock: add missing parameter documentation for raw_seqcount_try_begin()

Add missing documentation for raw_seqcount_try_begin() start parameter.

Link: https://lkml.kernel.org/r/20250116182730.801497-1-surenb@google.com
Fixes: dba4761a3e40 ("seqlock: add raw_seqcount_try_begin")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Closes: https://lore.kernel.org/all/20250116170522.23e884d5@canb.auug.org.au/
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomm/page-writeback: consolidate wb_thresh bumping logic into __wb_calc_thresh
Jim Zhao [Thu, 21 Nov 2024 10:05:39 +0000 (18:05 +0800)]
mm/page-writeback: consolidate wb_thresh bumping logic into __wb_calc_thresh

Address the feedback from 39ac99852fca ("mm/page-writeback: raise
wb_thresh to prevent write blocking with strictlimit)".  The wb_thresh
bumping logic is scattered across wb_position_ratio, __wb_calc_thresh, and
wb_update_dirty_ratelimit.  For consistency, consolidate all wb_thresh
bumping logic into __wb_calc_thresh.

Link: https://lkml.kernel.org/r/20241121100539.605818-1-jimzhao.ai@gmail.com
Signed-off-by: Jim Zhao <jimzhao.ai@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomm/page_alloc: remove the incorrect and misleading comment
Yuntao Wang [Wed, 15 Jan 2025 04:16:34 +0000 (12:16 +0800)]
mm/page_alloc: remove the incorrect and misleading comment

The comment removed in this patch originally belonged to the
build_zonelists_in_zone_order() function, which was introduced by commit
f0c0b2b808f2 ("change zonelist order: zonelist order selection logic").

Later, commit c9bff3eebc09 ("mm, page_alloc: rip out ZONELIST_ORDER_ZONE")
removed build_zonelists_in_zone_order() but left its comment behind.

Subsequently, commit 9d3be21bf9c0 ("mm, page_alloc: simplify zonelist
initialization") moved the node_order variable into build_zonelists(),
making the comment originally belonged to build_zonelists_in_zone_order()
appear as if it were part of build_zonelists().

Remove this misleading comment.

Link: https://lkml.kernel.org/r/20250115041634.63387-1-yuntao.wang@linux.dev
Signed-off-by: Yuntao Wang <yuntao.wang@linux.dev>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agozram: remove zcomp_stream_put() from write_incompressible_page()
Sergey Senozhatsky [Wed, 15 Jan 2025 07:19:16 +0000 (16:19 +0900)]
zram: remove zcomp_stream_put() from write_incompressible_page()

We cannot and should not put per-CPU compression stream in
write_incompressible_page() because that function never gets any
per-CPU streams in the first place.  It's zram_write_page() that
puts the stream before it calls write_incompressible_page().

Link: https://lkml.kernel.org/r/20250115072003.380567-1-senozhatsky@chromium.org
Fixes: 485d11509d6d ("zram: factor out ZRAM_HUGE write")
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomm: separate move/undo parts from migrate_pages_batch()
Byungchul Park [Thu, 8 Aug 2024 06:53:58 +0000 (15:53 +0900)]
mm: separate move/undo parts from migrate_pages_batch()

Functionally, no change.  This is a preparation for luf mechanism that
requires to use separated folio lists for its own handling during
migration.  Refactored migrate_pages_batch() so as to separate move/undo
parts from migrate_pages_batch().

Link: https://lkml.kernel.org/r/20250115103403.11882-1-byungchul@sk.com
Signed-off-by: Byungchul Park <byungchul@sk.com>
Reviewed-by: Shivank Garg <shivankg@amd.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomm/kfence: use str_write_read() helper in get_access_type()
Thorsten Blum [Wed, 15 Jan 2025 15:55:12 +0000 (16:55 +0100)]
mm/kfence: use str_write_read() helper in get_access_type()

Remove hard-coded strings by using the str_write_read() helper function.

Link: https://lkml.kernel.org/r/20250115155511.954535-2-thorsten.blum@linux.dev
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Suggested-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Marco Elver <elver@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agoselftests/mm/mkdirty: fix memory leak in test_uffdio_copy()
liuye [Tue, 14 Jan 2025 02:38:38 +0000 (10:38 +0800)]
selftests/mm/mkdirty: fix memory leak in test_uffdio_copy()

Release memory before exception branch returns to prevent memory leaks

Checking tools/testing/selftests/mm/mkdirty.c ...
tools/testing/selftests/mm/mkdirty.c:283:3: error: Memory leak: src [memleak]
  return;
  ^

Link: https://lkml.kernel.org/r/20250114023838.48589-1-liuye@kylinos.cn
Signed-off-by: liuye <liuye@kylinos.cn>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agokasan: hw_tags: Use str_on_off() helper in kasan_init_hw_tags()
Thorsten Blum [Tue, 14 Jan 2025 15:09:35 +0000 (16:09 +0100)]
kasan: hw_tags: Use str_on_off() helper in kasan_init_hw_tags()

Remove hard-coded strings by using the str_on_off() helper function.

Link: https://lkml.kernel.org/r/20250114150935.780869-2-thorsten.blum@linux.dev
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Suggested-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agoselftests/mm: virtual_address_range: avoid reading from VM_IO mappings
Thomas Weißschuh [Tue, 14 Jan 2025 16:06:48 +0000 (17:06 +0100)]
selftests/mm: virtual_address_range: avoid reading from VM_IO mappings

The virtual_address_range selftest reads from the start of each mapping
listed in /proc/self/maps.  However not all mappings are valid to be
arbitrarily accessed.

For example the vvar data used for virtual clocks on x86 [vvar_vclock] can
only be accessed if 1) the kernel configuration enables virtual clocks and
2) the hypervisor provided the data for it.  Only the VDSO itself has the
necessary information to know this.  Since commit e93d2521b27f ("x86/vdso:
Split virtual clock pages into dedicated mapping") the virtual clock data
was split out into its own mapping, leading to EFAULT from read() during
the validation.

Check for the VM_IO flag as a proxy.  It is present for the VVAR mappings
and MMIO ranges can be dangerous to access arbitrarily.

Link: https://lkml.kernel.org/r/20250114-virtual_address_range-tests-v4-4-6fd7269934a5@linutronix.de
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202412271148.2656e485-lkp@intel.com
Fixes: e93d2521b27f ("x86/vdso: Split virtual clock pages into dedicated mapping")
Fixes: 010409649885 ("selftests/mm: confirm VA exhaustion without reliance on correctness of mmap()")
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Suggested-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/lkml/e97c2a5d-c815-4936-a767-ac42a3220a90@redhat.com/
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Shuah Khan (Samsung OSG) <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agoselftests/mm: vm_util: split up /proc/self/smaps parsing
Thomas Weißschuh [Tue, 14 Jan 2025 16:06:47 +0000 (17:06 +0100)]
selftests/mm: vm_util: split up /proc/self/smaps parsing

Upcoming changes want to reuse the /proc/self/smaps parsing logic to parse
the VmFlags field.

As that works differently from the currently parsed HugePage counters,
split up the logic so common functionality can be shared.

While reworking this code, also use the correct sscanf placeholder for the
"uint64_t thp" variable.

Link: https://lkml.kernel.org/r/20250114-virtual_address_range-tests-v4-3-6fd7269934a5@linutronix.de
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: kernel test robot <oliver.sang@intel.com>
Cc: Shuah Khan (Samsung OSG) <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agoselftests/mm: virtual_address_range: unmap chunks after validation
Thomas Weißschuh [Tue, 14 Jan 2025 16:06:46 +0000 (17:06 +0100)]
selftests/mm: virtual_address_range: unmap chunks after validation

For each accessed chunk a PTE is created.  More than 1GiB of PTEs is used
in this way.  Remove each PTE after validating a chunk to reduce peak
memory usage.

It is important to only unmap memory that previously mmap()ed, as
unmapping other mappings like the stack, heap or executable mappings will
crash the process.

The mappings read from /proc/self/maps and the return values from mmap()
don't allow a simple correlation due to merging and no guaranteed order.
To correlate the pointers and mappings use prctl(PR_SET_VMA_ANON_NAME).
While it introduces a test dependency, other alternatives would introduce
runtime or development overhead.

Link: https://lkml.kernel.org/r/20250114-virtual_address_range-tests-v4-2-6fd7269934a5@linutronix.de
Fixes: 010409649885 ("selftests/mm: confirm VA exhaustion without reliance on correctness of mmap()")
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: kernel test robot <oliver.sang@intel.com>
Cc: Shuah Khan (Samsung OSG) <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agoselftests/mm: virtual_address_range: mmap() without PROT_WRITE
Thomas Weißschuh [Tue, 14 Jan 2025 16:06:45 +0000 (17:06 +0100)]
selftests/mm: virtual_address_range: mmap() without PROT_WRITE

Patch series "selftests/mm: virtual_address_range: Reduce memory", v4.

The selftest started failing since commit e93d2521b27f ("x86/vdso: Split
virtual clock pages into dedicated mapping") was merged.  While debugging
I stumbled upon some memory usage optimizations.

With these test now runs on a VM with only 60MiB of memory.

This patch (of 4):

When mapping a larger chunk than physical memory is available with
PROT_WRITE and overcommit is disabled, the mapping will fail.  This will
prevent the test from running on systems with less then ~1GiB of memory
and triggering an inscrutinable test failure.  As the mappings are never
written to anyways, the flag can be removed.

Link: https://lkml.kernel.org/r/20250114-virtual_address_range-tests-v4-0-6fd7269934a5@linutronix.de
Link: https://lkml.kernel.org/r/20250114-virtual_address_range-tests-v4-1-6fd7269934a5@linutronix.de
Fixes: 4e5ce33ceb32 ("selftests/vm: add a test for virtual address range mapping")
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Dev Jain <dev.jain@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Shuah Khan (Samsung OSG) <shuah@kernel.org>
Cc: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agoselftests/memfd/memfd_test: fix possible NULL pointer dereference
liuye [Tue, 14 Jan 2025 03:21:15 +0000 (11:21 +0800)]
selftests/memfd/memfd_test: fix possible NULL pointer dereference

If `name' is NULL, a NULL pointer may be accessed in printf.

Link: https://lkml.kernel.org/r/20250114032115.58638-1-liuye@kylinos.cn
Signed-off-by: liuye <liuye@kylinos.cn>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: "Isaac J. Manjarres" <isaacmanjarres@google.com>
Cc: Jeff Xu <jeffxu@google.com>
Cc: Saurav Shah <sauravshah.31@gmail.com>
Cc: Shuah Khan (Samsung OSG) <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomm: add FGP_DONTCACHE folio creation flag
Jens Axboe [Fri, 20 Dec 2024 15:47:50 +0000 (08:47 -0700)]
mm: add FGP_DONTCACHE folio creation flag

Callers can pass this in for uncached folio creation, in which case if a
folio is newly created it gets marked as uncached.  If a folio exists for
this index and lookup succeeds, then it will not get marked as uncached.
If an !uncached lookup finds a cached folio, clear the flag.  For that
case, there are competeting uncached and cached users of the folio, and it
should not get pruned.

Link: https://lkml.kernel.org/r/20241220154831.1086649-13-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Chris Mason <clm@meta.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomm: call filemap_fdatawrite_range_kick() after IOCB_DONTCACHE issue
Jens Axboe [Fri, 20 Dec 2024 15:47:49 +0000 (08:47 -0700)]
mm: call filemap_fdatawrite_range_kick() after IOCB_DONTCACHE issue

When a buffered write submitted with IOCB_DONTCACHE has been successfully
submitted, call filemap_fdatawrite_range_kick() to kick off the IO.  File
systems call generic_write_sync() for any successful buffered write
submission, hence add the logic here rather than needing to modify the
file system.

Link: https://lkml.kernel.org/r/20241220154831.1086649-12-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Chris Mason <clm@meta.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomm/filemap: add filemap_fdatawrite_range_kick() helper
Jens Axboe [Fri, 20 Dec 2024 15:47:48 +0000 (08:47 -0700)]
mm/filemap: add filemap_fdatawrite_range_kick() helper

Works like filemap_fdatawrite_range(), except it's a non-integrity data
writeback and hence only starts writeback on the specified range.  Will
help facilitate generically starting uncached writeback from
generic_write_sync(), as header dependencies preclude doing this inline
from fs.h.

Link: https://lkml.kernel.org/r/20241220154831.1086649-11-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Chris Mason <clm@meta.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomm/filemap: drop streaming/uncached pages when writeback completes
Jens Axboe [Fri, 20 Dec 2024 15:47:47 +0000 (08:47 -0700)]
mm/filemap: drop streaming/uncached pages when writeback completes

If the folio is marked as streaming, drop pages when writeback completes.
Intended to be used with RWF_DONTCACHE, to avoid needing sync writes for
uncached IO.

Link: https://lkml.kernel.org/r/20241220154831.1086649-10-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Chris Mason <clm@meta.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomm/filemap: add read support for RWF_DONTCACHE
Jens Axboe [Fri, 20 Dec 2024 15:47:46 +0000 (08:47 -0700)]
mm/filemap: add read support for RWF_DONTCACHE

Add RWF_DONTCACHE as a read operation flag, which means that any data read
wil be removed from the page cache upon completion.  Uses the page cache
to synchronize, and simply prunes folios that were instantiated when the
operation completes.  While it would be possible to use private pages for
this, using the page cache as synchronization is handy for a variety of
reasons:

1) No special truncate magic is needed
2) Async buffered reads need some place to serialize, using the page
   cache is a lot easier than writing extra code for this
3) The pruning cost is pretty reasonable

and the code to support this is much simpler as a result.

You can think of uncached buffered IO as being the much more attractive
cousin of O_DIRECT - it has none of the restrictions of O_DIRECT.  Yes, it
will copy the data, but unlike regular buffered IO, it doesn't run into
the unpredictability of the page cache in terms of reclaim.  As an
example, on a test box with 32 drives, reading them with buffered IO looks
as follows:

Reading bs 65536, uncached 0
  1s: 145945MB/sec
  2s: 158067MB/sec
  3s: 157007MB/sec
  4s: 148622MB/sec
  5s: 118824MB/sec
  6s: 70494MB/sec
  7s: 41754MB/sec
  8s: 90811MB/sec
  9s: 92204MB/sec
 10s: 95178MB/sec
 11s: 95488MB/sec
 12s: 95552MB/sec
 13s: 96275MB/sec

where it's quite easy to see where the page cache filled up, and
performance went from good to erratic, and finally settles at a much
lower rate. Looking at top while this is ongoing, we see:

 PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
7535 root      20   0  267004      0      0 S  3199   0.0   8:40.65 uncached
3326 root      20   0       0      0      0 R 100.0   0.0   0:16.40 kswapd4
3327 root      20   0       0      0      0 R 100.0   0.0   0:17.22 kswapd5
3328 root      20   0       0      0      0 R 100.0   0.0   0:13.29 kswapd6
3332 root      20   0       0      0      0 R 100.0   0.0   0:11.11 kswapd10
3339 root      20   0       0      0      0 R 100.0   0.0   0:16.25 kswapd17
3348 root      20   0       0      0      0 R 100.0   0.0   0:16.40 kswapd26
3343 root      20   0       0      0      0 R 100.0   0.0   0:16.30 kswapd21
3344 root      20   0       0      0      0 R 100.0   0.0   0:11.92 kswapd22
3349 root      20   0       0      0      0 R 100.0   0.0   0:16.28 kswapd27
3352 root      20   0       0      0      0 R  99.7   0.0   0:11.89 kswapd30
3353 root      20   0       0      0      0 R  96.7   0.0   0:16.04 kswapd31
3329 root      20   0       0      0      0 R  96.4   0.0   0:11.41 kswapd7
3345 root      20   0       0      0      0 R  96.4   0.0   0:13.40 kswapd23
3330 root      20   0       0      0      0 S  91.1   0.0   0:08.28 kswapd8
3350 root      20   0       0      0      0 S  86.8   0.0   0:11.13 kswapd28
3325 root      20   0       0      0      0 S  76.3   0.0   0:07.43 kswapd3
3341 root      20   0       0      0      0 S  74.7   0.0   0:08.85 kswapd19
3334 root      20   0       0      0      0 S  71.7   0.0   0:10.04 kswapd12
3351 root      20   0       0      0      0 R  60.5   0.0   0:09.59 kswapd29
3323 root      20   0       0      0      0 R  57.6   0.0   0:11.50 kswapd1
[...]

which is just showing a partial list of the 32 kswapd threads that are
running mostly full tilt, burning ~28 full CPU cores.

If the same test case is run with RWF_DONTCACHE set for the buffered read,
the output looks as follows:

Reading bs 65536, uncached 0
  1s: 153144MB/sec
  2s: 156760MB/sec
  3s: 158110MB/sec
  4s: 158009MB/sec
  5s: 158043MB/sec
  6s: 157638MB/sec
  7s: 157999MB/sec
  8s: 158024MB/sec
  9s: 157764MB/sec
 10s: 157477MB/sec
 11s: 157417MB/sec
 12s: 157455MB/sec
 13s: 157233MB/sec
 14s: 156692MB/sec

which is just chugging along at ~155GB/sec of read performance. Looking
at top, we see:

 PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
7961 root      20   0  267004      0      0 S  3180   0.0   5:37.95 uncached
8024 axboe     20   0   14292   4096      0 R   1.0   0.0   0:00.13 top

where just the test app is using CPU, no reclaim is taking place outside
of the main thread.  Not only is performance 65% better, it's also using
half the CPU to do it.

Link: https://lkml.kernel.org/r/20241220154831.1086649-9-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Chris Mason <clm@meta.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agofs: add RWF_DONTCACHE iocb and FOP_DONTCACHE file_operations flag
Jens Axboe [Fri, 20 Dec 2024 15:47:45 +0000 (08:47 -0700)]
fs: add RWF_DONTCACHE iocb and FOP_DONTCACHE file_operations flag

If a file system supports uncached buffered IO, it may set FOP_DONTCACHE
and enable support for RWF_DONTCACHE.  If RWF_DONTCACHE is attempted
without the file system supporting it, it'll get errored with -EOPNOTSUPP.

Link: https://lkml.kernel.org/r/20241220154831.1086649-8-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Chris Mason <clm@meta.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomm/truncate: add folio_unmap_invalidate() helper
Jens Axboe [Fri, 20 Dec 2024 15:47:44 +0000 (08:47 -0700)]
mm/truncate: add folio_unmap_invalidate() helper

Add a folio_unmap_invalidate() helper, which unmaps and invalidates a
given folio.  The caller must already have locked the folio.  Embed the
old invalidate_complete_folio2() helper in there as well, as nobody else
calls it.

Use this new helper in invalidate_inode_pages2_range(), rather than
duplicate the code there.

In preparation for using this elsewhere as well, have it take a gfp_t mask
rather than assume GFP_KERNEL is the right choice.  This bubbles back to
invalidate_complete_folio2() as well.

Link: https://lkml.kernel.org/r/20241220154831.1086649-7-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Chris Mason <clm@meta.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomm/readahead: add readahead_control->dropbehind member
Jens Axboe [Fri, 20 Dec 2024 15:47:43 +0000 (08:47 -0700)]
mm/readahead: add readahead_control->dropbehind member

If ractl->dropbehind is set to true, then folios created are marked as
dropbehind as well.

Link: https://lkml.kernel.org/r/20241220154831.1086649-6-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Chris Mason <clm@meta.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomm: add PG_dropbehind folio flag
Jens Axboe [Fri, 20 Dec 2024 15:47:42 +0000 (08:47 -0700)]
mm: add PG_dropbehind folio flag

Add a folio flag that file IO can use to indicate that the cached IO being
done should be dropped from the page cache upon completion.

Link: https://lkml.kernel.org/r/20241220154831.1086649-5-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Chris Mason <clm@meta.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomm/readahead: add folio allocation helper
Jens Axboe [Fri, 20 Dec 2024 15:47:41 +0000 (08:47 -0700)]
mm/readahead: add folio allocation helper

Just a wrapper around filemap_alloc_folio() for now, but add it in
preparation for modifying the folio based on the 'ractl' being passed in.

No functional changes in this patch.

Link: https://lkml.kernel.org/r/20241220154831.1086649-4-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Chris Mason <clm@meta.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomm/filemap: use page_cache_sync_ra() to kick off read-ahead
Jens Axboe [Fri, 20 Dec 2024 15:47:40 +0000 (08:47 -0700)]
mm/filemap: use page_cache_sync_ra() to kick off read-ahead

Rather than use the page_cache_sync_readahead() helper, define our own
ractl and use page_cache_sync_ra() directly.  In preparation for needing
to modify ractl inside filemap_get_pages().

No functional changes in this patch.

Link: https://lkml.kernel.org/r/20241220154831.1086649-3-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Chris Mason <clm@meta.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomm/filemap: change filemap_create_folio() to take a struct kiocb
Jens Axboe [Fri, 20 Dec 2024 15:47:39 +0000 (08:47 -0700)]
mm/filemap: change filemap_create_folio() to take a struct kiocb

Patch series "Uncached buffered IO", v8.

5 years ago I posted patches adding support for RWF_UNCACHED, as a way to
do buffered IO that isn't page cache persistent.  The approach back then
was to have private pages for IO, and then get rid of them once IO was
done.  But that then runs into all the issues that O_DIRECT has, in terms
of synchronizing with the page cache.

So here's a new approach to the same concent, but using the page cache as
synchronization.  Due to excessive bike shedding on the naming, this is
now named RWF_DONTCACHE, and is less special in that it's just page cache
IO, except it prunes the ranges once IO is completed.

Why do this, you may ask?  The tldr is that device speeds are only getting
faster, while reclaim is not.  Doing normal buffered IO can be very
unpredictable, and suck up a lot of resources on the reclaim side.  This
leads people to use O_DIRECT as a work-around, which has its own set of
restrictions in terms of size, offset, and length of IO.  It's also
inherently synchronous, and now you need async IO as well.  While the
latter isn't necessarily a big problem as we have good options available
there, it also should not be a requirement when all you want to do is read
or write some data without caching.

Even on desktop type systems, a normal NVMe device can fill the entire
page cache in seconds.  On the big system I used for testing, there's a
lot more RAM, but also a lot more devices.  As can be seen in some of the
results in the following patches, you can still fill RAM in seconds even
when there's 1TB of it.  Hence this problem isn't solely a "big
hyperscaler system" issue, it's common across the board.

Common for both reads and writes with RWF_DONTCACHE is that they use the
page cache for IO.  Reads work just like a normal buffered read would,
with the only exception being that the touched ranges will get pruned
after data has been copied.  For writes, the ranges will get writeback
kicked off before the syscall returns, and then writeback completion will
prune the range.  Hence writes aren't synchronous, and it's easy to
pipeline writes using RWF_DONTCACHE.  Folios that aren't instantiated by
RWF_DONTCACHE IO are left untouched.  This means you that uncached IO will
take advantage of the page cache for uptodate data, but not leave anything
it instantiated/created in cache.

File systems need to support this.  This patchset adds support for the
generic read path, which covers file systems like ext4.  Patches exist to
add support for iomap/XFS and btrfs as well, which sit on top of this
series.  If RWF_DONTCACHE IO is attempted on a file system that doesn't
support it, -EOPNOTSUPP is returned.  Hence the user can rely on it either
working as designed, or flagging and error if that's not the case.  The
intent here is to give the application a sensible fallback path - eg, it
may fall back to O_DIRECT if appropriate, or just live with the fact that
uncached IO isn't available and do normal buffered IO.

Adding "support" to other file systems should be trivial, most of the time
just a one-liner adding FOP_DONTCACHE to the fop_flags in the
file_operations struct, if the file system is using either iomap or the
generic filemap helpers for reading and writing.

Performance results are in patch 8 for reads, and you can find the write
side results in the XFS patch adding support for DONTCACHE writes for XFS:

https://git.kernel.dk/cgit/linux/commit/?h=buffered-uncached-fs.10&id=257e92de795fdff7d7e256501e024fac6da6a7f4

with the tldr being that I see about a 65% improvement in performance for
both, with fully predictable IO times.  CPU reduction is substantial as
well, with no kswapd activity at all for reclaim when using uncached IO.

Using it from applications is trivial - just set RWF_DONTCACHE for the
read or write, using pwritev2(2) or preadv2(2).  For io_uring, same thing,
just set RWF_DONTCACHE in sqe->rw_flags for a buffered read/write
operation.  And that's it.

Patches 1..7 are just prep patches, and should have no functional changes
at all.  Patch 8 adds support for the filemap path for RWF_DONTCACHE
reads, and patches 9..12 are just prep patches for supporting the write
side of uncached writes.  In the below mentioned branch, there are then
patches to adopt uncached reads and writes for xfs, btrfs, and ext4.  The
latter currently relies on bit of a hack for passing whether this is an
uncached write or not through ->write_begin(), which can hopefully go away
once ext4 adopts iomap for buffered writes.  I say this is a hack as it's
not the prettiest way to do it, however it is fully solid and will work
just fine.

Passes full xfstests and fsx overnight runs, no issues observed.  That
includes the vm running the testing also using RWF_DONTCACHE on the host.
I'll post fsstress and fsx patches for RWF_DONTCACHE separately.  As far
as I'm concerned, no further work needs doing here.

And git tree for the patches is here:

https://git.kernel.dk/cgit/linux/log/?h=buffered-uncached.10

with the file system patches on top adding support for xfs/btrfs/ext4
here:

https://git.kernel.dk/cgit/linux/log/?h=buffered-uncached-fs.10

This patch (of 12):

Rather than pass in both the file and position directly from the kiocb,
just take a struct kiocb instead.  With the kiocb being passed in, skip
passing in the address_space separately as well.  While doing so, move the
ki_flags checking into filemap_create_folio() as well.  In preparation for
actually needing the kiocb in the function.

No functional changes in this patch.

Link: https://lkml.kernel.org/r/20241220154831.1086649-1-axboe@kernel.dk
Link: https://lkml.kernel.org/r/20241220154831.1086649-2-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Chris Mason <clm@meta.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomm/hugetlb: use folio->lru int demote_free_hugetlb_folios()
David Hildenbrand [Mon, 13 Jan 2025 13:16:11 +0000 (14:16 +0100)]
mm/hugetlb: use folio->lru int demote_free_hugetlb_folios()

We are demoting hugetlb folios to smaller hugetlb folios; let's avoid
messing with pages where avoidable and handle it more similar to
__split_huge_page_tail().

Link: https://lkml.kernel.org/r/20250113131611.2554758-7-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomm/hugetlb-cgroup: convert hugetlb_cgroup_css_offline() to work on folios
David Hildenbrand [Mon, 13 Jan 2025 13:16:10 +0000 (14:16 +0100)]
mm/hugetlb-cgroup: convert hugetlb_cgroup_css_offline() to work on folios

Let's convert hugetlb_cgroup_css_offline() and
hugetlb_cgroup_move_parent() to work on folios.  hugepage_activelist
contains folios, not pages.

While at it, rename page_hcg simply to hcg, removing most of the "page"
terminology.

This removes an unnecessary call to compound_head().

Link: https://lkml.kernel.org/r/20250113131611.2554758-6-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomm/hugetlb: rename folio_putback_active_hugetlb() to folio_putback_hugetlb()
David Hildenbrand [Mon, 13 Jan 2025 13:16:09 +0000 (14:16 +0100)]
mm/hugetlb: rename folio_putback_active_hugetlb() to folio_putback_hugetlb()

Now that folio_putback_hugetlb() is only called on folios that were
previously isolated through folio_isolate_hugetlb(), let's rename it to
match folio_putback_lru().

Add some kernel doc to clarify how this function is supposed to be used.

Link: https://lkml.kernel.org/r/20250113131611.2554758-5-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomm/migrate: don't call folio_putback_active_hugetlb() on dst hugetlb folio
David Hildenbrand [Mon, 13 Jan 2025 13:16:08 +0000 (14:16 +0100)]
mm/migrate: don't call folio_putback_active_hugetlb() on dst hugetlb folio

We replaced a simple put_page() by a putback_active_hugepage() call in
commit 3aaa76e125c1 ("mm: migrate: hugetlb: putback destination hugepage
to active list"), to set the "active" flag on the dst hugetlb folio.

Nowadays, we decoupled the "active" list from the flag, by calling the
flag "migratable".

Calling "putback" on something that wasn't allocated is weird and not
future proof, especially if we might reach that path when migration failed
and we just want to free the freshly allocated hugetlb folio.

Let's simply handle the migratable flag and the active list flag in
move_hugetlb_state(), where we know that allocation succeeded and already
handle the temporary flag; use a simple folio_put() to return our
reference.

Link: https://lkml.kernel.org/r/20250113131611.2554758-4-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomm/hugetlb: rename isolate_hugetlb() to folio_isolate_hugetlb()
David Hildenbrand [Mon, 13 Jan 2025 13:16:07 +0000 (14:16 +0100)]
mm/hugetlb: rename isolate_hugetlb() to folio_isolate_hugetlb()

Let's make the function name match "folio_isolate_lru()", and add some
kernel doc.

Link: https://lkml.kernel.org/r/20250113131611.2554758-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomm/huge_memory: convert has_hwpoisoned into a pure folio flag
David Hildenbrand [Mon, 13 Jan 2025 13:16:06 +0000 (14:16 +0100)]
mm/huge_memory: convert has_hwpoisoned into a pure folio flag

Patch series "mm: hugetlb+THP folio and migration cleanups", v2.

Some cleanups around more folio conversion and migration handling that I
collected working on random stuff.

This patch (of 6):

Let's stop setting it on pages, there is no need to anymore.

Link: https://lkml.kernel.org/r/20250113131611.2554758-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomm/damon/paddr: improve readability of damon_pa_stat
Joshua Hahn [Mon, 13 Jan 2025 21:01:56 +0000 (13:01 -0800)]
mm/damon/paddr: improve readability of damon_pa_stat

damon_pa_stat contains an unnecessary goto statement, and the if/else can
be re-written to be more readable.

This patch is written on top of SJ's patch series [1], which in turn is
written on top of another one of his series [2].

[1] https://lore.kernel.org/all/20241219040327.61902-1-sj@kernel.org/
[2] https://lore.kernel.org/all/20241213215306.54778-1-sj@kernel.org/

Link: https://lkml.kernel.org/r/20250113210201.446051-1-joshua.hahnjy@gmail.com
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomm/damon/paddr: increment pa_stat damon address range by folio size
Usama Arif [Mon, 13 Jan 2025 19:07:38 +0000 (19:07 +0000)]
mm/damon/paddr: increment pa_stat damon address range by folio size

This is to avoid going through all the pages in a folio.  For folio_size >
PAGE_SIZE, damon_get_folio will return NULL for tail pages, so the for
loop in those instances will be a nop.  Have a more efficient loop by just
incrementing the address by folio_size.

Link: https://lkml.kernel.org/r/20250113190738.1156381-1-usamaarif642@gmail.com
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agoselftests/mm/cow: modify the incorrect checking parameters
Hao Ge [Mon, 13 Jan 2025 03:28:58 +0000 (11:28 +0800)]
selftests/mm/cow: modify the incorrect checking parameters

In run_with_memfd_hugetlb(), some error handle have passed incorrect
parameters.  It should be "smem", but it was mistakenly written as "mem".

Let's fix it.

[gehao@kylinos.cn: fix other errant sites, per Anshuman]
Link: https://lkml.kernel.org/r/20250113050908.93638-1-hao.ge@linux.dev
Link: https://lkml.kernel.org/r/20250113032858.63670-1-hao.ge@linux.dev
Fixes: f8664f3c4a08 ("selftests/vm: cow: basic COW tests for non-anonymous pages")
Signed-off-by: Hao Ge <gehao@kylinos.cn>
Cc: SeongJae Park <sj@kernel.org>
Cc: Shuah Khan (Samsung OSG) <shuah@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agokasan: use correct kernel-doc format
Randy Dunlap [Sat, 11 Jan 2025 06:32:49 +0000 (22:32 -0800)]
kasan: use correct kernel-doc format

Use the correct kernel-doc character following function parameters or
struct members (':' instead of '-') to eliminate kernel-doc warnings.

kasan.h:509: warning: Function parameter or struct member 'addr' not described in 'kasan_poison'
kasan.h:509: warning: Function parameter or struct member 'size' not described in 'kasan_poison'
kasan.h:509: warning: Function parameter or struct member 'value' not described in 'kasan_poison'
kasan.h:509: warning: Function parameter or struct member 'init' not described in 'kasan_poison'
kasan.h:522: warning: Function parameter or struct member 'addr' not described in 'kasan_unpoison'
kasan.h:522: warning: Function parameter or struct member 'size' not described in 'kasan_unpoison'
kasan.h:522: warning: Function parameter or struct member 'init' not described in 'kasan_unpoison'
kasan.h:539: warning: Function parameter or struct member 'address' not described in 'kasan_poison_last_granule'
kasan.h:539: warning: Function parameter or struct member 'size' not described in 'kasan_poison_last_granule'

Link: https://lkml.kernel.org/r/20250111063249.910975-1-rdunlap@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agoselftests/mm: add tests for splitting pmd THPs to all lower orders
Zi Yan [Fri, 10 Jan 2025 23:50:28 +0000 (18:50 -0500)]
selftests/mm: add tests for splitting pmd THPs to all lower orders

Kernel already supports splitting a folio to any lower order. Test it.

[ziy@nvidia.com: no need to test splitting to order-1]
Link: https://lkml.kernel.org/r/DDA202EA-4664-4F50-A7FD-B00CBB7A624B@nvidia.com
Link: https://lkml.kernel.org/r/20250110235028.96824-2-ziy@nvidia.com
Signed-off-by: Zi Yan <ziy@nvidia.com>
Cc: Alexander Zhu <alexlzhu@fb.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Usama Arif <usamaarif642@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>