Mike Kravetz [Wed, 21 Jun 2023 21:24:03 +0000 (14:24 -0700)]
hugetlb: revert use of page_cache_next_miss()
Ackerley Tng reported an issue with hugetlbfs fallocate as noted in the
Closes tag. The issue showed up after the conversion of hugetlb page
cache lookup code to use page_cache_next_miss. User visible effects are:
- hugetlbfs fallocate incorrectly returns -EEXIST if pages are presnet
in the file.
- hugetlb pages will not be included in core dumps if they need to be
brought in via GUP.
- userfaultfd UFFDIO_COPY will not notice pages already present in the
cache. It may try to allocate a new page and potentially return
ENOMEM as opposed to EEXIST.
Revert the use page_cache_next_miss() in hugetlb code.
IMPORTANT NOTE FOR STABLE BACKPORTS:
This patch will apply cleanly to v6.3. However, due to the change of
filemap_get_folio() return values, it will not function correctly. This
patch must be modified for stable backports.
[dan.carpenter@linaro.org: fix hugetlbfs_pagecache_present()] Link: https://lkml.kernel.org/r/efa86091-6a2c-4064-8f55-9b44e1313015@moroto.mountain Link: https://lkml.kernel.org/r/20230621212403.174710-2-mike.kravetz@oracle.com Fixes: d0ce0e47b323 ("mm/hugetlb: convert hugetlb fault paths to use alloc_hugetlb_folio()") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reported-by: Ackerley Tng <ackerleytng@google.com> Closes: https://lore.kernel.org/linux-mm/cover.1683069252.git.ackerleytng@google.com Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Erdem Aktas <erdemaktas@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Vishal Annapurve <vannapurve@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The reverted commit fixed up routines primarily used by readahead code
such that they could also be used by hugetlb. Unfortunately, this
caused a performance regression as pointed out by the Closes: tag.
The hugetlb code which uses page_cache_next_miss will be addressed in
a subsequent patch.
Link: https://lkml.kernel.org/r/20230621212403.174710-1-mike.kravetz@oracle.com Fixes: 9425c591e06a ("page cache: fix page_cache_next/prev_miss off by one") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202306211346.1e9ff03e-oliver.sang@intel.com Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Ackerley Tng <ackerleytng@google.com> Cc: Erdem Aktas <erdemaktas@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Vishal Annapurve <vannapurve@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When memory.reclaim was introduced, it became the first case where
cgroup_reclaim() is true for the root cgroup. Johannes concluded [1] that
for most cases this is okay, except for one case. Historically, kswapd
would throttle reclaim on a node if a lot of pages marked for reclaim are
under writeback (aka the node is congested). This occurred by setting
LRUVEC_CONGESTED bit in lruvec->flags. The bit would be cleared when the
node is balanced.
Similarly, cgroup reclaim would set the same bit when an lruvec is
congested, and clear it on the way out of reclaim (to throttle local
reclaimers).
Before the introduction of memory.reclaim, the root memcg was the only
target of kswapd reclaim, and non-root memcgs were the only targets of
cgroup reclaim, so they would never interfere. Using the same bit for
both was fine. After memory.reclaim, it is possible for cgroup reclaim on
the root cgroup to clear the bit set by kswapd. This would result in
reclaim on the node to be unthrottled before the node is balanced.
Fix this by introducing separate bits for cgroup-level and node-level
congestion. kswapd can unthrottle an lruvec that is marked as congested
by cgroup reclaim (as the entire node should no longer be congested), but
not vice versa (to prevent premature unthrottling before the entire node
is balanced).
Yosry Ahmed [Wed, 21 Jun 2023 02:30:53 +0000 (02:30 +0000)]
mm: memcg: rename and document global_reclaim()
Evidently, global_reclaim() can be a confusing name. Especially that it
used to exist before with a subtly different definition (removed by commit b5ead35e7e1d ("mm: vmscan: naming fixes: global_reclaim() and
sane_reclaim()"). It can be interpreted as non-cgroup reclaim, even
though it returns true for cgroup reclaim on the root memcg (through
memory.reclaim).
Rename it to root_reclaim() in an attempt to make it less ambiguous, and
add documentation to it as well as cgroup_reclaim.
Link: https://lkml.kernel.org/r/20230621023053.432374-1-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Reported-by: Johannes Weiner <hannes@cmpxchg.org> Closes: https://lore.kernel.org/lkml/20230405200150.GA35884@cmpxchg.org/ Acked-by: Yu Zhao <yuzhao@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kefeng Wang [Mon, 19 Jun 2023 11:07:18 +0000 (19:07 +0800)]
mm: kill [add|del]_page_to_lru_list()
Now no one call [add|del]_page_to_lru_list(), let's drop unused page
interfaces.
Link:https://lkml.kernel.org/r/20230619110718.65679-2-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: Yu Zhao <yuzhao@google.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: James Gowans <jgowans@amazon.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kefeng Wang [Mon, 19 Jun 2023 11:07:17 +0000 (19:07 +0800)]
mm: compaction: convert to use a folio in isolate_migratepages_block()
Directly use a folio instead of page_folio() when page successfully
isolated (hugepage and movable page) and after folio_get_nontail_page(),
which removes several calls to compound_head().
Link: https://lkml.kernel.org/r/20230619110718.65679-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: James Gowans <jgowans@amazon.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Yosry Ahmed [Wed, 21 Jun 2023 09:30:09 +0000 (09:30 +0000)]
mm: zswap: fix double invalidate with exclusive loads
If exclusive loads are enabled for zswap, we invalidate the entry before
returning from zswap_frontswap_load(), after dropping the local reference.
However, the tree lock is dropped during decompression after the local
reference is acquired, so the entry could be invalidated before we drop
the local ref. If this happens, the entry is freed once we drop the local
ref, and zswap_invalidate_entry() tries to invalidate an already freed
entry.
Fix this by:
(a) Making sure zswap_invalidate_entry() is always called with a local
ref held, to avoid being called on a freed entry.
(b) Making sure zswap_invalidate_entry() only drops the ref if the entry
was actually on the rbtree. Otherwise, another invalidation could
have already happened, and the initial ref is already dropped.
With these changes, there is no need to check that there is no need to
make sure the entry still exists in the tree in zswap_reclaim_entry()
before invalidating it, as zswap_reclaim_entry() will make this check
internally.
Link: https://lkml.kernel.org/r/20230621093009.637544-1-yosryahmed@google.com Fixes: b9c91c43412f ("mm: zswap: support exclusive loads") Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Reported-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Seth Jennings <sjenning@redhat.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Wilcox (Oracle) [Wed, 21 Jun 2023 16:45:55 +0000 (17:45 +0100)]
mm: rename invalidate_mapping_pagevec to mapping_try_invalidate
We don't use pagevecs for the LRU cache any more, and we don't know that
the failed invalidations were due to the folio being in an LRU cache. So
rename it to be more accurate.
Matthew Wilcox (Oracle) [Wed, 21 Jun 2023 16:45:53 +0000 (17:45 +0100)]
net: convert sunrpc from pagevec to folio_batch
Remove the last usage of pagevecs. There is a slight change here; we now
free the folio_batch as soon as it fills up instead of freeing the
folio_batch when we try to add a page to a full batch. This should have
no effect in practice.
Matthew Wilcox (Oracle) [Wed, 21 Jun 2023 16:45:48 +0000 (17:45 +0100)]
i915: convert shmem_sg_free_table() to use a folio_batch
Remove a few hidden compound_head() calls by converting the returned page
to a folio once and using the folio APIs. We also only increment the
refcount on the folio once instead of once for each page. Ideally, we
would have a for_each_sgt_folio macro, but until then this will do.
Matthew Wilcox (Oracle) [Wed, 21 Jun 2023 16:45:47 +0000 (17:45 +0100)]
scatterlist: add sg_set_folio()
This wrapper for sg_set_page() lets drivers add folios to a scatterlist
more easily. We could, perhaps, do better by using a different page in
the folio if offset is larger than UINT_MAX, but let's hope we get a
better data structure than this before we need to care about such large
folios.
Baolin Wang [Wed, 21 Jun 2023 08:14:28 +0000 (16:14 +0800)]
mm: page_alloc: use the correct type of list for free pages
Commit bf75f200569d ("mm/page_alloc: add page->buddy_list and
page->pcp_list") introduces page->buddy_list and page->pcp_list as a union
with page->lru, but missed to change get_page_from_free_area() to use
page->buddy_list to clarify the correct type of list for a free page.
Ivan Orlov [Tue, 20 Jun 2023 18:33:15 +0000 (20:33 +0200)]
mm: backing-dev: make bdi_class a static const structure
Now that the driver core allows for struct class to be in read-only
memory, move the bdi_class structure to be declared at build time placing
it into read-only memory, instead of having to be dynamically allocated at
load time.
Link: https://lkml.kernel.org/r/20230620183314.682822-2-gregkh@linuxfoundation.org Signed-off-by: Ivan Orlov <ivan.orlov0322@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Delete a triply out-of-date comment from add_swap_count_continuation():
1. vmalloc_to_page() changed from pte_offset_map() to pte_offset_kernel()
2. pte_offset_map() changed from using kmap_atomic() to kmap_local_page()
3. kmap_atomic() changed from using fixed FIX_KMAP addresses in 2.6.37.
Yajun Deng [Mon, 19 Jun 2023 02:34:06 +0000 (10:34 +0800)]
mm: pass nid to reserve_bootmem_region()
early_pfn_to_nid() is called frequently in init_reserved_page(), it
returns the node id of the PFN. These PFN are probably from the same
memory region, they have the same node id. It's not necessary to call
early_pfn_to_nid() for each PFN.
Pass nid to reserve_bootmem_region() and drop the call to
early_pfn_to_nid() in init_reserved_page(). Also, set nid on all reserved
pages before doing this, as some reserved memory regions may not be set
nid.
The most beneficial function is memmap_init_reserved_pages() if
CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled.
The following data was tested on an x86 machine with 190GB of RAM.
Jason Gunthorpe [Mon, 19 Jun 2023 18:27:25 +0000 (15:27 -0300)]
mm/gup: do not return 0 from pin_user_pages_fast() for bad args
These routines are not intended to return zero, the callers cannot do
anything sane with a 0 return. They should return an error which means
future calls to GUP will not succeed, or they should return some non-zero
number of pinned pages which means GUP should be called again.
If start + nr_pages overflows it should return -EOVERFLOW to signal the
arguments are invalid.
Syzkaller keeps tripping on this when fuzzing GUP arguments.
Link: https://lkml.kernel.org/r/0-v1-3d5ed1f20d50+104-gup_overflow_jgg@nvidia.com Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Reported-by: syzbot+353c7be4964c6253f24a@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/000000000000094fdd05faa4d3a4@google.com Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Haifeng Xu [Mon, 19 Jun 2023 12:47:35 +0000 (12:47 +0000)]
selftests: cgroup: fix unexpected failure on test_memcg_sock
Before server got a client connection, there were some memory allocations
in the test memcg, such as user stack. So do not count those allocations
which are not related to socket when checking socket memory accounting.
Link: https://lkml.kernel.org/r/20230619124735.2124-1-haifeng.xu@shopee.com Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Haifeng Xu [Mon, 19 Jun 2023 13:04:42 +0000 (13:04 +0000)]
mm/memcontrol: do not tweak node in mem_cgroup_init()
mem_cgroup_init() request for allocations from each possible node, and
it's used to be a problem because NODE_DATA is not allocated for offline
node. Things have already changed since commit 09f49dca570a9 ("mm: handle
uninitialized numa nodes gracefully"), so it's unnecessary to check for
!node_online nodes here.
Tetsuo Handa [Sat, 27 May 2023 15:25:31 +0000 (00:25 +0900)]
kasan,kmsan: remove __GFP_KSWAPD_RECLAIM usage from kasan/kmsan
syzbot is reporting lockdep warning in __stack_depot_save(), for
the caller of __stack_depot_save() (i.e. __kasan_record_aux_stack() in
this report) is responsible for masking __GFP_KSWAPD_RECLAIM flag in
order not to wake kswapd which in turn wakes kcompactd.
Since kasan/kmsan functions might be called with arbitrary locks held,
mask __GFP_KSWAPD_RECLAIM flag from all GFP_NOWAIT/GFP_ATOMIC allocations
in kasan/kmsan.
Note that kmsan_save_stack_with_flags() is changed to mask both
__GFP_DIRECT_RECLAIM flag and __GFP_KSWAPD_RECLAIM flag, for
wakeup_kswapd() from wake_all_kswapds() from __alloc_pages_slowpath()
calls wakeup_kcompactd() if __GFP_KSWAPD_RECLAIM flag is set and
__GFP_DIRECT_RECLAIM flag is not set.
Baolin Wang [Wed, 14 Jun 2023 08:40:20 +0000 (16:40 +0800)]
mm: compaction: skip memory hole rapidly when isolating migratable pages
On some machines, the normal zone can have a large memory hole like below
memory layout, and we can see the range from 0x100000000 to 0x1800000000
is a hole. So when isolating some migratable pages, the scanner can meet
the hole and it will take more time to skip the large hole. From my
measurement, I can see the isolation scanner will take 80us ~ 100us to
skip the large hole [0x100000000 - 0x1800000000].
So adding a new helper to fast search next online memory section to skip
the large hole can help to find next suitable pageblock efficiently. With
this patch, I can see the large hole scanning only takes < 1us.
Yu Zhao [Mon, 19 Jun 2023 19:38:21 +0000 (13:38 -0600)]
mm/mglru: make memcg_lru->lock irq safe
lru_gen_rotate_memcg() can happen in softirq if memory.soft_limit_in_bytes
is set. This requires memcg_lru->lock to be irq safe. Lockdep warns on
this.
SeongJae Park [Fri, 16 Jun 2023 19:17:42 +0000 (19:17 +0000)]
Docs/admin-guide/mm/damon/usage: update the ways for getting monitoring results
The recommended ways for getting DAMON monitoring results are using
tried_regions sysfs directory for partial snapshot of the results, and
DAMON tracepoint for full record of the results. However, the
tried_regions sysfs directory usage has not sufficiently updated on some
sections of the DAMON usage document. Update those.
SeongJae Park [Fri, 16 Jun 2023 19:17:40 +0000 (19:17 +0000)]
Docs/admin-guide/mm/damon/usage: link design document for DAMOS
The background and concept of DAMOS is redundantly documented, in the
design document and the usage document. Replace the duplicated ones in
usage document with links to the design document.
SeongJae Park [Fri, 16 Jun 2023 19:17:39 +0000 (19:17 +0000)]
Docs/admin-guide/mm/damon/usage: remove unnecessary sentences about supported address spaces
Brief explanation of DAMON user space tool and sysfs interface are
unnecessarily and repeatedly mentioning the list of address spaces that
DAMON is supporting. Remove those.
SeongJae Park [Fri, 16 Jun 2023 19:17:37 +0000 (19:17 +0000)]
Docs/admin-guide/mm/damon/start: update DAMOS example command
DAMON user-space tool, damo, has deprecated[1] its old DAMOS schemes
specification format. However, an example of DAMON documentation is still
using it. Update the example to use one of the alternative options.
SeongJae Park [Fri, 16 Jun 2023 19:17:36 +0000 (19:17 +0000)]
Docs/mm/damon/design: document 'age' of region
Patch series "Docs/{mm,admin-guide}damon: update design and usage docs".
Update DAMON design and usage documents for outdated and unnecessarily
duplicated parts.
This patch (of 7):
The 'age' of each region in DAMON monitoring results is an important
concept for both monitoring part and DAMOS. And DAMOS section of the
design document is mentioning it. However, the age itself is not
explained in the document. Add a section for that.
Mathieu Desnoyers [Mon, 15 May 2023 14:35:36 +0000 (10:35 -0400)]
mm: move mm_count into its own cache line
The mm_struct mm_count field is frequently updated by mmgrab/mmdrop
performed by context switch. This causes false-sharing for surrounding
mm_struct fields which are read-mostly.
This has been observed on a 2sockets/112core/224cpu Intel Sapphire Rapids
server running hackbench, and by the kernel test robot will-it-scale
testcase.
Move the mm_count field into its own cache line to prevent false-sharing
with other mm_struct fields.
Move mm_count to the first field of mm_struct to minimize the amount of
padding required: rather than adding padding before and after the mm_count
field, padding is only added after mm_count.
Note that I noticed this odd comment in mm_struct:
commit 2e3025434a6b ("mm: relocate 'write_protect_seq' in struct mm_struct")
/*
* With some kernel config, the current mmap_lock's offset
* inside 'mm_struct' is at 0x120, which is very optimal, as
* its two hot fields 'count' and 'owner' sit in 2 different
* cachelines, and when mmap_lock is highly contended, both
* of the 2 fields will be accessed frequently, current layout
* will help to reduce cache bouncing.
*
* So please be careful with adding new fields before
* mmap_lock, which can easily push the 2 fields into one
* cacheline.
*/
struct rw_semaphore mmap_lock;
This comment is rather odd for a few reasons:
- It requires addition/removal of mm_struct fields to carefully consider
field alignment of _other_ fields,
- It expresses the wish to keep an "optimal" alignment for a specific
kernel config.
I suspect that the author of this comment may want to revisit this topic
and perhaps introduce a split-struct approach for struct rw_semaphore,
if the need is to place various fields of this structure in different
cache lines.
SeongJae Park [Thu, 15 Jun 2023 18:33:22 +0000 (18:33 +0000)]
mm/damon/core-test: add a test for damon_set_attrs()
Commit 5ff6e2fff88e ("mm/damon/core: fix divide error in
damon_nr_accesses_to_accesses_bp()") fixed a bug by adding arguments
validation in damon_set_attrs(). Add a unit test for the added validation
to ensure the bug cannot occur again.
mm: remove is_longterm_pinnable_page() and reimplement folio_is_longterm_pinnable()
folio_is_longterm_pinnable() already exists as a wrapper function. Now
that the whole implementation of is_longterm_pinnable_page() can be
implemented using folios, folio_is_longterm_pinnable() can be made its own
standalone function - and we can remove is_longterm_pinnable_page().
Link: https://lkml.kernel.org/r/20230614021312.34085-6-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
try_get_folio() takes in a page, then chooses to do some folio operations
based on the flags (either FOLL_GET or FOLL_PIN). We can rewrite this
function to be more purpose oriented.
After calling try_get_folio(), if neither FOLL_GET nor FOLL_PIN are set,
warn and fail. If FOLL_GET is set we can return the result. If FOLL_GET
is not set then FOLL_PIN is set, so we pin the folio.
This change assists with folio conversions, and makes the function more
readable.
Introduce folio_migratetype() as a folio equivalent for
get_pageblock_migratetype(). This function intends to return the
migratetype the folio is located in, hence the name choice.
Marco Elver [Wed, 14 Jun 2023 09:51:16 +0000 (11:51 +0200)]
kasan: add support for kasan.fault=panic_on_write
KASAN's boot time kernel parameter 'kasan.fault=' currently supports
'report' and 'panic', which results in either only reporting bugs or also
panicking on reports.
However, some users may wish to have more control over when KASAN reports
result in a kernel panic: in particular, KASAN reported invalid _writes_
are of special interest, because they have greater potential to corrupt
random kernel memory or be more easily exploited.
To panic on invalid writes only, introduce 'kasan.fault=panic_on_write',
which allows users to choose to continue running on invalid reads, but
panic only on invalid writes.
Link: https://lkml.kernel.org/r/20230614095158.1133673-1-elver@google.com Signed-off-by: Marco Elver <elver@google.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Aleksandr Nogikh <nogikh@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Taras Madan <tarasmadan@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Sergey Senozhatsky [Wed, 14 Jun 2023 14:13:12 +0000 (23:13 +0900)]
zram: further limit recompression threshold
Recompression threshold should be below huge-size-class watermark. Any
object larger than huge-size-class is a "huge object" and occupies a
whole physical page on the zsmalloc side, in other words it's
incompressible, as far as zsmalloc is concerned.
Link: https://lkml.kernel.org/r/20230614141338.3480029-1-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Suggested-by: Brian Geffon <bgeffon@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Minchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Domenico Cerasuolo [Wed, 14 Jun 2023 14:31:22 +0000 (16:31 +0200)]
mm: zswap: invaldiate entry after writeback
When an entry started writeback, it used to be invalidated with ref count
logic alone, meaning that it would stay on the tree until all references
were put. The problem with this behavior is that as soon as the writeback
started, the ownership of the data held by the entry is passed to the
swapcache and should not be left in zswap too. Currently there are no
known issues because of this, but this change explicitly invalidates an
entry that started writeback to reduce opportunities for future bugs.
This patch is a follow up on the series titled "mm: zswap: move writeback
LRU from zpool to zswap" + commit f090b7949768("mm: zswap: support
exclusive loads").
Link: https://lkml.kernel.org/r/20230614143122.74471-1-cerasuolodomenico@gmail.com Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Seth Jennings <sjenning@redhat.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kefeng Wang [Wed, 14 Jun 2023 14:36:12 +0000 (22:36 +0800)]
mm: kill lock|unlock_page_memcg()
Since commit c7c3dec1c9db ("mm: rmap: remove lock_page_memcg()"),
no more user, kill lock_page_memcg() and unlock_page_memcg().
Link: https://lkml.kernel.org/r/20230614143612.62575-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Wilcox (Oracle) [Mon, 12 Jun 2023 21:01:40 +0000 (22:01 +0100)]
buffer: use a folio in __find_get_block_slow()
Saves a call to compound_head() and may be needed to support block size >
PAGE_SIZE.
Link: https://lkml.kernel.org/r/20230612210141.730128-14-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Bob Peterson <rpeterso@redhat.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Wilcox (Oracle) [Mon, 12 Jun 2023 21:01:39 +0000 (22:01 +0100)]
buffer: convert link_dev_buffers to take a folio
Its one caller already has a folio, so switch it to use the folio API.
Removes a hidden call to compound_head().
Link: https://lkml.kernel.org/r/20230612210141.730128-13-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Bob Peterson <rpeterso@redhat.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Wilcox (Oracle) [Mon, 12 Jun 2023 21:01:38 +0000 (22:01 +0100)]
buffer: convert init_page_buffers() to folio_init_buffers()
Use the folio API and pass the folio from both callers. Saves a hidden
call to compound_head().
Link: https://lkml.kernel.org/r/20230612210141.730128-12-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Bob Peterson <rpeterso@redhat.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Wilcox (Oracle) [Mon, 12 Jun 2023 21:01:37 +0000 (22:01 +0100)]
buffer: convert grow_dev_page() to use a folio
Get a folio from the page cache instead of a page, then use the folio API
throughout. Removes a few calls to compound_head() and may be needed to
support block size > PAGE_SIZE.
Link: https://lkml.kernel.org/r/20230612210141.730128-11-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Bob Peterson <rpeterso@redhat.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Wilcox (Oracle) [Mon, 12 Jun 2023 21:01:36 +0000 (22:01 +0100)]
buffer: convert page_zero_new_buffers() to folio_zero_new_buffers()
Most of the callers already have a folio; convert reiserfs_write_end() to
have a folio. Removes a couple of hidden calls to compound_head().
Link: https://lkml.kernel.org/r/20230612210141.730128-10-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Bob Peterson <rpeterso@redhat.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Wilcox (Oracle) [Mon, 12 Jun 2023 21:01:35 +0000 (22:01 +0100)]
buffer: convert __block_commit_write() to take a folio
This removes a hidden call to compound_head() inside
__block_commit_write() and moves it to those callers which are still page
based. Also make block_write_end() safe for large folios.
Link: https://lkml.kernel.org/r/20230612210141.730128-9-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Bob Peterson <rpeterso@redhat.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Wilcox (Oracle) [Mon, 12 Jun 2023 21:01:34 +0000 (22:01 +0100)]
buffer: convert block_page_mkwrite() to use a folio
If any page in a folio is dirtied, dirty the entire folio. Removes a
number of hidden calls to compound_head() and references to page->mapping
and page->index. Fixes a pre-existing bug where we could mark a folio as
dirty if the file is truncated to a multiple of the page size just as we
take the page fault. I don't believe this bug has any bad effect, it's
just inefficient.
Link: https://lkml.kernel.org/r/20230612210141.730128-8-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Bob Peterson <rpeterso@redhat.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Wilcox (Oracle) [Mon, 12 Jun 2023 21:01:33 +0000 (22:01 +0100)]
buffer: make block_write_full_page() handle large folios correctly
Keep the interface as struct page, but work entirely on the folio
internally. Removes several PAGE_SIZE assumptions and removes some
references to page->index and page->mapping.
Link: https://lkml.kernel.org/r/20230612210141.730128-7-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Bob Peterson <rpeterso@redhat.com> Reviewed-by: Bob Peterson <rpeterso@redhat.com> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Wilcox (Oracle) [Mon, 12 Jun 2023 21:01:32 +0000 (22:01 +0100)]
gfs2: support ludicrously large folios in gfs2_trans_add_databufs()
We may someday support folios larger than 4GB, so use a size_t for the
byte count within a folio to prevent unpleasant truncations.
Link: https://lkml.kernel.org/r/20230612210141.730128-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Bob Peterson <rpeterso@redhat.com> Reviewed-by: Bob Peterson <rpeterso@redhat.com> Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Wilcox (Oracle) [Mon, 12 Jun 2023 21:01:31 +0000 (22:01 +0100)]
buffer: convert __block_write_full_page() to __block_write_full_folio()
Remove nine hidden calls to compound_head() by using a folio instead of a
page.
Link: https://lkml.kernel.org/r/20230612210141.730128-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Bob Peterson <rpeterso@redhat.com> Reviewed-by: Bob Peterson <rpeterso@redhat.com> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Wilcox (Oracle) [Mon, 12 Jun 2023 21:01:30 +0000 (22:01 +0100)]
gfs2: convert gfs2_write_jdata_page() to gfs2_write_jdata_folio()
Add support for large folios and remove some accesses to page->mapping and
page->index.
Link: https://lkml.kernel.org/r/20230612210141.730128-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Bob Peterson <rpeterso@redhat.com> Reviewed-by: Bob Peterson <rpeterso@redhat.com> Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Wilcox (Oracle) [Mon, 12 Jun 2023 21:01:29 +0000 (22:01 +0100)]
gfs2: pass a folio to __gfs2_jdata_write_folio()
Remove a couple of folio->page conversions in the callers, and two calls
to compound_head() in the function itself. Rename it from
__gfs2_jdata_writepage() to __gfs2_jdata_write_folio().
Link: https://lkml.kernel.org/r/20230612210141.730128-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Bob Peterson <rpeterso@redhat.com> Reviewed-by: Bob Peterson <rpeterso@redhat.com> Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Wilcox (Oracle) [Mon, 12 Jun 2023 21:01:28 +0000 (22:01 +0100)]
gfs2: use a folio inside gfs2_jdata_writepage()
Patch series "gfs2/buffer folio changes for 6.5", v3.
This kind of started off as a gfs2 patch series, then became entwined with
buffer heads once I realised that gfs2 was the only remaining caller of
__block_write_full_page(). For those not in the gfs2 world, the big point
of this series is that block_write_full_page() should now handle large
folios correctly.
This patch (of 14):
Replace a few implicit calls to compound_head() with one explicit one.
Yu Ma [Sat, 10 Jun 2023 03:07:30 +0000 (23:07 -0400)]
percpu-internal/pcpu_chunk: re-layout pcpu_chunk structure to reduce false sharing
When running UnixBench/Execl throughput case, false sharing is observed
due to frequent read on base_addr and write on free_bytes, chunk_md.
UnixBench/Execl represents a class of workload where bash scripts are
spawned frequently to do some short jobs. It will do system call on execl
frequently, and execl will call mm_init to initialize mm_struct of the
process. mm_init will call __percpu_counter_init for percpu_counters
initialization. Then pcpu_alloc is called to read the base_addr of
pcpu_chunk for memory allocation. Inside pcpu_alloc, it will call
pcpu_alloc_area to allocate memory from a specified chunk. This function
will update "free_bytes" and "chunk_md" to record the rest free bytes and
other meta data for this chunk. Correspondingly, pcpu_free_area will also
update these 2 members when free memory.
In current pcpu_chunk layout, `base_addr' is in the same cache line with
`free_bytes' and `chunk_md', and `base_addr' is at the last 8 bytes. This
patch moves `bound_map' up to `base_addr', to let `base_addr' locate in a
new cacheline.
With this change, on Intel Sapphire Rapids 112C/224T platform, based on
v6.4-rc4, the 160 parallel score improves by 24%.
The pcpu_chunk struct is a backing data structure per chunk, so the
additional memory should not be dramatic. A chunk covers ballpark
between 64kb and 512kb memory depending on some config and boot time
stuff, so I believe the additional memory used here is nominal at best.
Working the #s on my desktop:
Percpu: 58624 kB
28 cores -> ~2.1MB of percpu memory.
At say ~128KB per chunk -> 33 chunks, generously 40 chunks.
Adding alignment might bump the chunk size ~64 bytes, so in total ~2KB
of overhead?
I believe we can do a little better to avoid eating that full padding,
so likely less than that.
[dennis@kernel.org: changelog details] Link: https://lkml.kernel.org/r/20230610030730.110074-1-yu.ma@intel.com Signed-off-by: Yu Ma <yu.ma@intel.com> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Acked-by: Dennis Zhou <dennis@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Thu, 1 Jun 2023 01:54:02 +0000 (21:54 -0400)]
userfaultfd: fix regression in userfaultfd_unmap_prep()
Android reported a performance regression in the userfaultfd unmap path.
A closer inspection on the userfaultfd_unmap_prep() change showed that a
second tree walk would be necessary in the reworked code.
Fix the regression by passing each VMA that will be unmapped through to
the userfaultfd_unmap_prep() function as they are added to the unmap list,
instead of re-walking the tree for the VMA.
Link: https://lkml.kernel.org/r/20230601015402.2819343-1-Liam.Howlett@oracle.com Fixes: 69dbe6daf104 ("userfaultfd: use maple tree iterator to iterate VMAs") Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reported-by: Suren Baghdasaryan <surenb@google.com> Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tarun Sahu [Mon, 12 Jun 2023 09:35:14 +0000 (15:05 +0530)]
mm/folio: replace set_compound_order with folio_set_order
The patch ("mm/folio: Avoid special handling for order value 0 in
folio_set_order") [1] removed the need for special handling of order = 0
in folio_set_order. Now, folio_set_order and set_compound_order becomes
similar function. This patch removes the set_compound_order and uses
folio_set_order instead.
Domenico Cerasuolo [Mon, 12 Jun 2023 09:38:15 +0000 (11:38 +0200)]
mm: zswap: remove zswap_header
Previously, zswap_header served the purpose of storing the swpentry within
zpool pages. This allowed zpool implementations to pass relevant
information to the writeback function. However, with the current
implementation, writeback is directly handled within zswap. Consequently,
there is no longer a necessity for zswap_header, as the swp_entry_t can be
stored directly in zswap_entry.
Link: https://lkml.kernel.org/r/20230612093815.133504-8-cerasuolodomenico@gmail.com Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Tested-by: Yosry Ahmed <yosryahmed@google.com> Suggested-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Seth Jennings <sjenning@redhat.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Domenico Cerasuolo [Mon, 12 Jun 2023 09:38:14 +0000 (11:38 +0200)]
mm: zswap: simplify writeback function
zswap_writeback_entry() used to be a callback for the backends, which
don't know about struct zswap_entry.
Now that the only user is the generic zswap LRU reclaimer, it can be
simplified: pass the pinned zswap_entry directly, and consolidate the
refcount management in the shrink function.
Link: https://lkml.kernel.org/r/20230612093815.133504-7-cerasuolodomenico@gmail.com Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Tested-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Seth Jennings <sjenning@redhat.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Domenico Cerasuolo [Mon, 12 Jun 2023 09:38:13 +0000 (11:38 +0200)]
mm: zswap: remove shrink from zpool interface
Now that all three zswap backends have removed their shrink code, it is
no longer necessary for the zpool interface to include shrink/writeback
endpoints.
Link: https://lkml.kernel.org/r/20230612093815.133504-6-cerasuolodomenico@gmail.com Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Seth Jennings <sjenning@redhat.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Domenico Cerasuolo [Mon, 12 Jun 2023 09:38:09 +0000 (11:38 +0200)]
mm: zswap: add pool shrinking mechanism
Patch series "mm: zswap: move writeback LRU from zpool to zswap", v3.
This series aims to improve the zswap reclaim mechanism by reorganizing
the LRU management. In the current implementation, the LRU is maintained
within each zpool driver, resulting in duplicated code across the three
drivers. The proposed change consists in moving the LRU management from
the individual implementations up to the zswap layer.
The primary objective of this refactoring effort is to simplify the
codebase. By unifying the reclaim loop and consolidating LRU handling
within zswap, we can eliminate redundant code and improve
maintainability. Additionally, this change enables the reclamation of
stored pages in their actual LRU order. Presently, the zpool drivers
link backing pages in an LRU, causing compressed pages with different
LRU positions to be written back simultaneously.
The series consists of several patches. The first patch implements the
LRU and the reclaim loop in zswap, but it is not used yet because all
three driver implementations are marked as zpool_evictable.
The following three commits modify each zpool driver to be not
zpool_evictable, allowing the use of the reclaim loop in zswap.
As the drivers removed their shrink functions, the zpool interface is
then trimmed by removing zpool_evictable, zpool_ops, and zpool_shrink.
Finally, the code in zswap is further cleaned up by simplifying the
writeback function and removing the now unnecessary zswap_header.
This patch (of 7):
Each zpool driver (zbud, z3fold and zsmalloc) implements its own shrink
function, which is called from zpool_shrink. However, with this commit, a
unified shrink function is added to zswap. The ultimate goal is to
eliminate the need for zpool_shrink once all zpool implementations have
dropped their shrink code.
To ensure the functionality of each commit, this change focuses solely on
adding the mechanism itself. No modifications are made to the backends,
meaning that functionally, there are no immediate changes. The zswap
mechanism will only come into effect once the backends have removed their
shrink code. The subsequent commits will address the modifications needed
in the backends.
Link: https://lkml.kernel.org/r/20230612093815.133504-1-cerasuolodomenico@gmail.com Link: https://lkml.kernel.org/r/20230612093815.133504-2-cerasuolodomenico@gmail.com Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Acked-by: Nhat Pham <nphamcs@gmail.com> Tested-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Seth Jennings <sjenning@redhat.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
It is wrong to include unprocessed user header files directly. They are
processed to "<source_tree>/usr/include" by running "make headers" and
they are included in selftests by kselftest makefiles automatically with
help of KHDR_INCLUDES variable. These headers should always bulilt first
before building kselftests.
Link: https://lkml.kernel.org/r/20230612095347.996335-1-usama.anjum@collabora.com Fixes: 07115fcc15b4 ("selftests/mm: add new selftests for KSM") Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Stefan Roesch <shr@devkernel.io> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ryan Roberts [Mon, 12 Jun 2023 15:15:45 +0000 (16:15 +0100)]
mm: ptep_get() conversion
Convert all instances of direct pte_t* dereferencing to instead use
ptep_get() helper. This means that by default, the accesses change from a
C dereference to a READ_ONCE(). This is technically the correct thing to
do since where pgtables are modified by HW (for access/dirty) they are
volatile and therefore we should always ensure READ_ONCE() semantics.
But more importantly, by always using the helper, it can be overridden by
the architecture to fully encapsulate the contents of the pte. Arch code
is deliberately not converted, as the arch code knows best. It is
intended that arch code (arm64) will override the default with its own
implementation that can (e.g.) hide certain bits from the core code, or
determine young/dirty status by mixing in state from another source.
Then reviewed and hand-edited to avoid multiple unnecessary calls to
ptep_get(), instead opting to store the result of a single call in a
variable, where it is correct to do so. This aims to negate any cost of
READ_ONCE() and will benefit arch-overrides that may be more complex.
Included is a fix for an issue in an earlier version of this patch that
was pointed out by kernel test robot. The issue arose because config
MMU=n elides definition of the ptep helper functions, including
ptep_get(). HUGETLB_PAGE=n configs still define a simple
huge_ptep_clear_flush() for linking purposes, which dereferences the ptep.
So when both configs are disabled, this caused a build error because
ptep_get() is not defined. Fix by continuing to do a direct dereference
when MMU=n. This is safe because for this config the arch code cannot be
trying to virtualize the ptes because none of the ptep helpers are
defined.
Link: https://lkml.kernel.org/r/20230612151545.3317766-4-ryan.roberts@arm.com Reported-by: kernel test robot <lkp@intel.com> Link: https://lore.kernel.org/oe-kbuild-all/202305120142.yXsNEo6H-lkp@intel.com/ Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Potapenko <glider@google.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Dave Airlie <airlied@gmail.com> Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ian Rogers <irogers@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: SeongJae Park <sj@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ryan Roberts [Mon, 12 Jun 2023 15:15:44 +0000 (16:15 +0100)]
mm: move ptep_get() and pmdp_get() helpers
There are many call sites that directly dereference a pte_t pointer. This
makes it very difficult to properly encapsulate a page table in the arch
code without having to allocate shadow page tables.
We will shortly solve this by replacing all the call sites with ptep_get()
calls. But there are call sites above the function definition in the
header file, so let's move ptep_get() to an earlier location to solve that
problem. And move pmdp_get() at the same time to keep it close to
ptep_get().
Link: https://lkml.kernel.org/r/20230612151545.3317766-3-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Potapenko <glider@google.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Dave Airlie <airlied@gmail.com> Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ian Rogers <irogers@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: kernel test robot <lkp@intel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: SeongJae Park <sj@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ryan Roberts [Mon, 12 Jun 2023 15:15:43 +0000 (16:15 +0100)]
mm: ptdump should use ptep_get_lockless()
Patch series "Encapsulate PTE contents from non-arch code", v3.
A series to improve the encapsulation of pte entries by disallowing
non-arch code from directly dereferencing pte_t pointers.
This means that by default, the accesses change from a C dereference to a
READ_ONCE(). This is technically the correct thing to do since where
pgtables are modified by HW (for access/dirty) they are volatile and
therefore we should always ensure READ_ONCE() semantics.
But more importantly, by always using the helper, it can be overridden by
the architecture to fully encapsulate the contents of the pte. Arch code
is deliberately not converted, as the arch code knows best. It is
intended that arch code (arm64) will override the default with its own
implementation that can (e.g.) hide certain bits from the core code, or
determine young/dirty status by mixing in state from another source.
This patch (of 3):
The page table dumper uses walk_page_range_novma() to walk the page
tables, which does not lock the PTL before calling the pte_entry()
callback. Therefore, the page table dumper's callback must use
ptep_get_lockless() rather than ptep_get() to ensure that the pte it reads
is not torn or otherwise corrupt when racing with writers.
Link: https://lkml.kernel.org/r/20230612151545.3317766-1-ryan.roberts@arm.com Link: https://lkml.kernel.org/r/20230612151545.3317766-2-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Potapenko <glider@google.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Dave Airlie <airlied@gmail.com> Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ian Rogers <irogers@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: SeongJae Park <sj@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Yu Zhao <yuzhao@google.com> Cc: kernel test robot <lkp@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Catalin Marinas [Tue, 13 Jun 2023 15:52:45 +0000 (16:52 +0100)]
sh: move the ARCH_DMA_MINALIGN definition to asm/cache.h
The sh architecture defines ARCH_DMA_MINALIGN in asm/page.h. Move it to
asm/cache.h to allow a generic ARCH_DMA_MINALIGN definition in
linux/cache.h without redefine errors/warnings.
Link: https://lkml.kernel.org/r/20230613155245.1228274-4-catalin.marinas@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Rich Felker <dalias@libc.org> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: kernel test robot <lkp@intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Catalin Marinas [Tue, 13 Jun 2023 15:52:44 +0000 (16:52 +0100)]
microblaze: move the ARCH_{DMA,SLAB}_MINALIGN definitions to asm/cache.h
The microblaze architecture defines ARCH_DMA_MINALIGN in asm/page.h. Move
it to asm/cache.h to allow a generic ARCH_DMA_MINALIGN definition in
linux/cache.h without redefine errors/warnings.
While at it, also move ARCH_SLAB_MINALIGN to asm/cache.h for
consistency.
Link: https://lkml.kernel.org/r/20230613155245.1228274-3-catalin.marinas@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Michal Simek <monstr@monstr.eu> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: kernel test robot <lkp@intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Rich Felker <dalias@libc.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Unfortunately, this causes a duplicate definition warning for
microblaze, powerpc (32-bit only) and sh as these architectures define
ARCH_DMA_MINALIGN in a different file than asm/cache.h. Move the macro
to asm/cache.h to avoid this issue and also bring them in line with the
other architectures.
This patch (of 3):
The powerpc architecture defines ARCH_DMA_MINALIGN in asm/page_32.h and
only if CONFIG_NOT_COHERENT_CACHE is enabled (32-bit platforms only).
Move this macro to asm/cache.h to allow a generic ARCH_DMA_MINALIGN
definition in linux/cache.h without redefine errors/warnings.
Link: https://lkml.kernel.org/r/20230613155245.1228274-1-catalin.marinas@arm.com Link: https://lkml.kernel.org/r/20230613155245.1228274-2-catalin.marinas@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202306131053.1ybvRRhO-lkp@intel.com/ Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Michal Simek <monstr@monstr.eu> Cc: Rich Felker <dalias@libc.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Catalin Marinas [Mon, 12 Jun 2023 15:32:01 +0000 (16:32 +0100)]
arm64: enable ARCH_WANT_KMALLOC_DMA_BOUNCE for arm64
With the DMA bouncing of unaligned kmalloc() buffers now in place, enable
it for arm64 to allow the kmalloc-{8,16,32,48,96} caches. In addition,
always create the swiotlb buffer even when the end of RAM is within the
32-bit physical address range (the swiotlb buffer can still be disabled on
the kernel command line).
Link: https://lkml.kernel.org/r/20230612153201.554742-18-catalin.marinas@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Tested-by: Isaac J. Manjarres <isaacmanjarres@google.com> Cc: Will Deacon <will@kernel.org> Cc: Alasdair Kergon <agk@redhat.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Jerry Snitselaar <jsnitsel@redhat.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Jonathan Cameron <jic23@kernel.org> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Lars-Peter Clausen <lars@metafoo.de> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Mike Snitzer <snitzer@kernel.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Saravana Kannan <saravanak@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Catalin Marinas [Mon, 12 Jun 2023 15:32:00 +0000 (16:32 +0100)]
mm: slab: reduce the kmalloc() minimum alignment if DMA bouncing possible
If an architecture opted in to DMA bouncing of unaligned kmalloc() buffers
(ARCH_WANT_KMALLOC_DMA_BOUNCE), reduce the minimum kmalloc() cache
alignment below cache-line size to ARCH_KMALLOC_MINALIGN.
Link: https://lkml.kernel.org/r/20230612153201.554742-17-catalin.marinas@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Isaac J. Manjarres <isaacmanjarres@google.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Alasdair Kergon <agk@redhat.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Jerry Snitselaar <jsnitsel@redhat.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Jonathan Cameron <jic23@kernel.org> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Lars-Peter Clausen <lars@metafoo.de> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Mike Snitzer <snitzer@kernel.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Saravana Kannan <saravanak@google.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Catalin Marinas [Mon, 12 Jun 2023 15:31:59 +0000 (16:31 +0100)]
iommu/dma: force bouncing if the size is not cacheline-aligned
Similarly to the direct DMA, bounce small allocations as they may have
originated from a kmalloc() cache not safe for DMA. Unlike the direct
DMA, iommu_dma_map_sg() cannot call iommu_dma_map_sg_swiotlb() for all
non-coherent devices as this would break some cases where the iova is
expected to be contiguous (dmabuf). Instead, scan the scatterlist for
any small sizes and only go the swiotlb path if any element of the list
needs bouncing (note that iommu_dma_map_page() would still only bounce
those buffers which are not DMA-aligned).
To avoid scanning the scatterlist on the 'sync' operations, introduce an
SG_DMA_SWIOTLB flag set by iommu_dma_map_sg_swiotlb(). The
dev_use_swiotlb() function together with the newly added
dev_use_sg_swiotlb() now check for both untrusted devices and unaligned
kmalloc() buffers (suggested by Robin Murphy).
Link: https://lkml.kernel.org/r/20230612153201.554742-16-catalin.marinas@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Robin Murphy <robin.murphy@arm.com> Tested-by: Isaac J. Manjarres <isaacmanjarres@google.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Alasdair Kergon <agk@redhat.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Jerry Snitselaar <jsnitsel@redhat.com> Cc: Jonathan Cameron <jic23@kernel.org> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Lars-Peter Clausen <lars@metafoo.de> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Mike Snitzer <snitzer@kernel.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Saravana Kannan <saravanak@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Catalin Marinas [Mon, 12 Jun 2023 15:31:58 +0000 (16:31 +0100)]
dma-mapping: force bouncing if the kmalloc() size is not cache-line-aligned
For direct DMA, if the size is small enough to have originated from a
kmalloc() cache below ARCH_DMA_MINALIGN, check its alignment against
dma_get_cache_alignment() and bounce if necessary. For larger sizes, it
is the responsibility of the DMA API caller to ensure proper alignment.
At this point, the kmalloc() caches are properly aligned but this will
change in a subsequent patch.
Architectures can opt in by selecting DMA_BOUNCE_UNALIGNED_KMALLOC.
Link: https://lkml.kernel.org/r/20230612153201.554742-15-catalin.marinas@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Robin Murphy <robin.murphy@arm.com> Tested-by: Isaac J. Manjarres <isaacmanjarres@google.com> Cc: Alasdair Kergon <agk@redhat.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Jerry Snitselaar <jsnitsel@redhat.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Jonathan Cameron <jic23@kernel.org> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Lars-Peter Clausen <lars@metafoo.de> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Mike Snitzer <snitzer@kernel.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Saravana Kannan <saravanak@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Robin Murphy [Mon, 12 Jun 2023 15:31:57 +0000 (16:31 +0100)]
dma-mapping: name SG DMA flag helpers consistently
sg_is_dma_bus_address() is inconsistent with the naming pattern of its
corresponding setters and its own kerneldoc, so take the majority vote and
rename it sg_dma_is_bus_address() (and fix up the missing underscores in
the kerneldoc too). This gives us a nice clear pattern where SG DMA flags
are SG_DMA_<NAME>, and the helpers for acting on them are
sg_dma_<action>_<name>().
Link: https://lkml.kernel.org/r/20230612153201.554742-14-catalin.marinas@arm.com Signed-off-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Link: https://lore.kernel.org/r/fa2eca2862c7ffc41b50337abffb2dfd2864d3ea.1685036694.git.robin.murphy@arm.com Tested-by: Isaac J. Manjarres <isaacmanjarres@google.com> Cc: Alasdair Kergon <agk@redhat.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Joerg Roedel <joro@8bytes.org> Cc: Jonathan Cameron <jic23@kernel.org> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Lars-Peter Clausen <lars@metafoo.de> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Mike Snitzer <snitzer@kernel.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Saravana Kannan <saravanak@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Catalin Marinas [Mon, 12 Jun 2023 15:31:55 +0000 (16:31 +0100)]
arm64: allow kmalloc() caches aligned to the smaller cache_line_size()
On arm64, ARCH_DMA_MINALIGN is 128, larger than the cache line size on
most of the current platforms (typically 64). Define
ARCH_KMALLOC_MINALIGN to 8 (the default for architectures without their
own ARCH_DMA_MINALIGN) and override dma_get_cache_alignment() to return
cache_line_size(), probed at run-time. The kmalloc() caches will be
limited to the cache line size. This will allow the additional
kmalloc-{64,192} caches on most arm64 platforms.
Link: https://lkml.kernel.org/r/20230612153201.554742-12-catalin.marinas@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Tested-by: Isaac J. Manjarres <isaacmanjarres@google.com> Cc: Will Deacon <will@kernel.org> Cc: Alasdair Kergon <agk@redhat.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Jerry Snitselaar <jsnitsel@redhat.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Jonathan Cameron <jic23@kernel.org> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Lars-Peter Clausen <lars@metafoo.de> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Mike Snitzer <snitzer@kernel.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Saravana Kannan <saravanak@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>