Hugh Dickins [Mon, 23 Aug 2021 23:59:12 +0000 (09:59 +1000)]
huge tmpfs: SGP_NOALLOC to stop collapse_file() on race
khugepaged's collapse_file() currently uses SGP_NOHUGE to tell
shmem_getpage() not to try allocating a huge page, in the very unlikely
event that a racing hole-punch removes the swapped or fallocated page as
soon as i_pages lock is dropped.
We want to consolidate shmem's huge decisions, removing SGP_HUGE and
SGP_NOHUGE; but cannot quite persuade ourselves that it's okay to regress
the protection in this case - Yang Shi points out that the huge page would
remain indefinitely, charged to root instead of the intended memcg.
collapse_file() should not even allocate a small page in this case: why
proceed if someone is punching a hole? SGP_READ is almost the right flag
here, except that it optimizes away from a fallocated page, returning NULL to
tell the caller to fill with zeroes (like a hole); whereas collapse_file()'s
sequence relies on using a cache page. Add SGP_NOALLOC just for this.
There are too many consecutive "if (page"s there in shmem_getpage_gfp():
group it better; and fix the outdated "bring it back from swap" comment.
Link: https://lkml.kernel.org/r/1355343b-acf-4653-ef79-6aee40214ac5@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Rik van Riel <riel@surriel.com> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
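For illustration, a minimal sketch of the call-site shape described above (not the actual diff; the helper name is made up, though shmem_getpage() and SGP_NOALLOC are the interfaces the entry refers to): with SGP_NOALLOC, shmem_getpage() hands back an existing cache page but refuses to allocate one, so khugepaged aborts when a hole has been punched under it.

#include <linux/shmem_fs.h>

/*
 * Hypothetical helper modelling collapse_file()'s lookup: if the page
 * was removed by a racing hole-punch, give up on the collapse instead
 * of allocating a page that would end up charged to the wrong memcg.
 */
static int collapse_file_get_page_sketch(struct address_space *mapping,
                                         pgoff_t index, struct page **pagep)
{
        if (shmem_getpage(mapping->host, index, pagep, SGP_NOALLOC))
                return -EBUSY;  /* hole punched under us: abort collapse */
        return 0;
}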
Hugh Dickins [Mon, 23 Aug 2021 23:59:12 +0000 (09:59 +1000)]
huge tmpfs: move shmem_huge_enabled() upwards
shmem_huge_enabled() is about to be enhanced into shmem_is_huge(), so that
it can be used more widely throughout: before making functional changes,
shift it to its final position (to avoid forward declaration).
Link: https://lkml.kernel.org/r/16fec7b7-5c84-415a-8586-69d8bf6a6685@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Rik van Riel <riel@surriel.com> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Hugh Dickins [Mon, 23 Aug 2021 23:59:12 +0000 (09:59 +1000)]
huge tmpfs: revert shmem's use of transhuge_vma_enabled()
5.14 commit e6be37b2e7bd ("mm/huge_memory.c: add missing read-only THP
checking in transparent_hugepage_enabled()") added transhuge_vma_enabled()
as a wrapper for two very different checks (one check is whether the app
has marked its address range not to use THPs, the other check is whether
the app is running in a hierarchy that has been marked never to use THPs).
shmem_huge_enabled() prefers to show those two checks explicitly, as
before.
Link: https://lkml.kernel.org/r/45e5338-18d-c6f9-c17e-34f510bc1728@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Rik van Riel <riel@surriel.com> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Hugh Dickins [Mon, 23 Aug 2021 23:59:12 +0000 (09:59 +1000)]
huge tmpfs: remove shrinklist addition from shmem_setattr()
There's a block of code in shmem_setattr() to add the inode to
shmem_unused_huge_shrink()'s shrinklist when lowering i_size: it dates
from before 5.7 changed truncation to do split_huge_page() for itself, and
should have been removed at that time.
I am over-stating that: split_huge_page() can fail (notably if there's an
extra reference to the page at that time), so there might be value in
retrying. But there were already retries as truncation worked through the
tails, and this addition risks repeating unsuccessful retries
indefinitely: I'd rather remove it now, and work on reducing the chance of
split_huge_page() failures separately, if we need to.
Link: https://lkml.kernel.org/r/b73b3492-8822-18f9-83e2-938528cdde94@google.com Fixes: 71725ed10c40 ("mm: huge tmpfs: try to split_huge_page() when punching hole") Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Rik van Riel <riel@surriel.com> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Hugh Dickins [Mon, 23 Aug 2021 23:59:12 +0000 (09:59 +1000)]
huge tmpfs: fix split_huge_page() after FALLOC_FL_KEEP_SIZE
A successful shmem_fallocate() guarantees that the extent has been
reserved, even beyond i_size when the FALLOC_FL_KEEP_SIZE flag was used.
But that guarantee is broken by shmem_unused_huge_shrink()'s attempts to
split huge pages and free their excess beyond i_size; and by other uses of
split_huge_page() near i_size.
It's sad to add a shmem inode field just for this, but I did not find a
better way to keep the guarantee. A flag to say KEEP_SIZE has been used
would be cheaper, but I'm averse to unclearable flags. The fallocend
field is not perfect either (many disjoint ranges might be fallocated),
but good enough; and gains another use later on.
Link: https://lkml.kernel.org/r/ca9a146-3a59-6cd3-7f28-e9a044bb1052@google.com Fixes: 779750d20b93 ("shmem: split huge pages beyond i_size under memory pressure") Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Rik van Riel <riel@surriel.com> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
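As a rough sketch of the approach (the field and helper shown here approximate the patch and are assumptions, not the exact code): the shmem inode records the furthest fallocated end, and code that splits huge pages and frees excess beyond EOF clamps its notion of EOF to that mark.

#include <linux/shmem_fs.h>
#include <linux/minmax.h>

/*
 * Sketch: pages fallocated with FALLOC_FL_KEEP_SIZE may lie beyond
 * i_size, so splitting paths must treat the recorded fallocated end,
 * not i_size, as the effective end-of-file.
 */
static pgoff_t shmem_fallocend_sketch(struct inode *inode, pgoff_t eof)
{
        return max(eof, SHMEM_I(inode)->fallocend);     /* assumed field */
}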
Hugh Dickins [Mon, 23 Aug 2021 23:59:12 +0000 (09:59 +1000)]
huge tmpfs: fix fallocate(vanilla) advance over huge pages
Patch series "huge tmpfs: shmem_is_huge() fixes and cleanups".
A series of huge tmpfs fixes and cleanups.
This patch (of 9):
shmem_fallocate() goes to a lot of trouble to leave its newly allocated
pages !Uptodate, partly to identify and undo them on failure, partly to
leave the overhead of clearing them until later. But the huge page case
did not skip to the end of the extent: it walked through the tail pages
one by one and appeared to work just fine; but in doing so it cleared and
marked the huge page Uptodate, so there was no way to undo it on failure.
And by setting Uptodate too soon, it messed up both its nr_falloced and
nr_unswapped counts, so that the intended "time to give up" heuristic did
not work at all.
Now advance immediately to the end of the huge extent, with a comment on
why this is more than just an optimization. But although this speeds up
huge tmpfs fallocation, it does leave the clearing until first use, and
some users may have come to appreciate slow fallocate but fast first use:
if they complain, then we can consider adding a pass to clear at the end.
Link: https://lkml.kernel.org/r/da632211-8e3e-6b1-aee-ab24734429a0@google.com Link: https://lkml.kernel.org/r/16201bd2-70e-37e2-e89b-5f929430da@google.com Fixes: 800d8c63b2e9 ("shmem: add huge pages support") Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Rik van Riel <riel@surriel.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Miaohe Lin [Mon, 23 Aug 2021 23:59:11 +0000 (09:59 +1000)]
shmem: remove unneeded header file
Since commit bf6ebd97aba0 ("userfaultfd/shmem: modify
shmem_mfill_atomic_pte to use install_pte()"), mfill_atomic_install_pte()
is used to install the pte and update the mmu cache. So we can remove the
tlbflush.h include, as update_mmu_cache() is no longer called here.
Miaohe Lin [Mon, 23 Aug 2021 23:59:11 +0000 (09:59 +1000)]
shmem: remove unneeded variable ret
Patch series "Cleanups for shmem".
This series contains cleanups to remove unneeded variable, header file,
function forward declaration and so on. More details can be found in the
respective changelogs.
This patch (of 4):
The local variable ret is always equal to -ENOMEM and never touched. So
remove it and return -ENOMEM directly to simplify the code.
Sebastian Andrzej Siewior [Mon, 23 Aug 2021 23:59:11 +0000 (09:59 +1000)]
shmem: use raw_spinlock_t for ->stat_lock
Each CPU has SHMEM_INO_BATCH inodes available in `->ino_batch' which is
per-CPU. Access here is serialized by disabling preemption. If the pool
is empty, it gets reloaded from `->next_ino'. Access here is serialized
by ->stat_lock which is a spinlock_t and can not be acquired with disabled
preemption.
One way around this would be to make the per-CPU ino_batch a struct
containing the inode number and a local_lock_t.
Another solution is to promote ->stat_lock to a raw_spinlock_t. The
critical sections are short. The mpol_put() must be moved outside of the
critical section to avoid invoking the destructor with disabled
preemption.
Link: https://lkml.kernel.org/r/20210806142916.jdwkb5bx62q5fwfo@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
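For illustration only, a minimal sketch of the resulting pattern (the function name is hypothetical; the fields mirror struct shmem_sb_info after the change): the raw lock covers just the pointer swap, and mpol_put() runs after the lock is dropped, so the destructor is never invoked with preemption disabled.

#include <linux/spinlock.h>
#include <linux/mempolicy.h>
#include <linux/shmem_fs.h>

/* Sketch: replace the default mempolicy under the raw lock, but call
 * the destructor only once the critical section has ended. */
static void shmem_set_mpol_sketch(struct shmem_sb_info *sbinfo,
                                  struct mempolicy *new)
{
        struct mempolicy *old;

        raw_spin_lock(&sbinfo->stat_lock);      /* short critical section */
        old = sbinfo->mpol;
        sbinfo->mpol = new;
        raw_spin_unlock(&sbinfo->stat_lock);

        mpol_put(old);  /* destructor outside, never with preemption off */
}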
John Hubbard [Mon, 23 Aug 2021 23:59:10 +0000 (09:59 +1000)]
mm: delete unused get_kernel_page()
get_kernel_page() was added in 2012 by [1]. It was used for a while for
NFS, but then in 2014, a refactoring [2] removed all callers, and it has
apparently not been used since.
Remove get_kernel_page() because it has no callers.
[1] commit 18022c5d8627 ("mm: add get_kernel_page[s] for pinning of
kernel addresses for I/O")
[2] commit 91f79c43d1b5 ("new helper: iov_iter_get_pages_alloc()")
Link: https://lkml.kernel.org/r/20210729221847.1165665-1-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: David S. Miller <davem@davemloft.net> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Paris <eparis@redhat.com> Cc: James Morris <jmorris@namei.org> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Neil Brown <neilb@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Xiaotian Feng <dfeng@redhat.com> Cc: Mark Salter <msalter@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Hugh Dickins [Mon, 23 Aug 2021 23:59:10 +0000 (09:59 +1000)]
fs, mm: fix race in unlinking swapfile
We had a recurring situation in which admin procedures setting up
swapfiles would race with test preparation clearing away swapfiles; and
just occasionally that got stuck on a swapfile "(deleted)" which could
never be swapped off. That is not supposed to be possible.
2.6.28 commit f9454548e17c ("don't unlink an active swapfile") admitted
that it was leaving a race window open: now close it.
may_delete() makes the IS_SWAPFILE check (amongst many others) before
inode_lock has been taken on target: now repeat just that simple check in
vfs_unlink() and vfs_rename(), after taking inode_lock.
Which goes most of the way to fixing the race, but swapon() must also
check after it acquires inode_lock, that the file just opened has not
already been unlinked.
Link: https://lkml.kernel.org/r/e17b91ad-a578-9a15-5e3-4989e0f999b5@google.com Fixes: f9454548e17c ("don't unlink an active swapfile") Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
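A hedged sketch of the shape of that recheck (the helper name is made up; in the patch the test sits inline in vfs_unlink() and vfs_rename() once inode_lock is held):

#include <linux/fs.h>

/* Sketch: repeat the cheap IS_SWAPFILE() test under inode_lock, so a
 * concurrent swapon() can no longer slip past may_delete(). */
static int check_not_swapfile_locked(struct inode *target)
{
        lockdep_assert_held(&target->i_rwsem);

        if (IS_SWAPFILE(target))
                return -EPERM;  /* refuse to unlink/rename an active swapfile */
        return 0;
}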
John Hubbard [Mon, 23 Aug 2021 23:59:10 +0000 (09:59 +1000)]
mm/gup: remove try_get_page(), call try_get_compound_head() directly
try_get_page() is very similar to try_get_compound_head(), and in fact
try_get_page() has fallen a little behind in terms of maintenance:
try_get_compound_head() handles speculative page references more
thoroughly.
There are only two try_get_page() callsites, so just call
try_get_compound_head() directly from those, and remove try_get_page()
entirely.
Also, seeing as how this changes try_get_compound_head() into a non-static
function, provide some kerneldoc documentation for it.
Link: https://lkml.kernel.org/r/20210813044133.1536842-4-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
John Hubbard [Mon, 23 Aug 2021 23:59:10 +0000 (09:59 +1000)]
mm/gup: small refactoring: simplify try_grab_page()
try_grab_page() does the same thing as try_grab_compound_head(..., refs=1,
...), just with a different API. So there is a lot of code duplication
there.
Change try_grab_page() to call try_grab_compound_head(), while keeping the
API contract identical for callers.
Also, now that try_grab_compound_head() always has a caller, remove the
__maybe_unused annotation.
Link: https://lkml.kernel.org/r/20210813044133.1536842-3-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
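The resulting shape is roughly the following (a paraphrase of the refactoring described above, not the exact diff):

#include <linux/mm.h>

/*
 * Sketch: try_grab_page() keeps its caller-visible contract but simply
 * forwards to try_grab_compound_head() with refs == 1.
 */
bool __must_check try_grab_page(struct page *page, unsigned int flags)
{
        if (!(flags & (FOLL_GET | FOLL_PIN)))
                return true;    /* nothing to grab for plain lookups */

        return try_grab_compound_head(page, 1, flags);
}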
John Hubbard [Mon, 23 Aug 2021 23:59:10 +0000 (09:59 +1000)]
mm/gup: documentation corrections for gup/pup
Patch series "A few gup refactorings and documentation updates", v3.
While reviewing some of the other things going on around gup.c, I noticed
that the documentation was wrong for a few of the routines that I wrote.
And then I noticed that there was some significant code duplication too.
So this fixes those issues.
This is not entirely risk-free, but after looking closely at this, I think
it's actually a useful improvement, getting rid of the code duplication
here.
This patch (of 3):
The documentation for try_grab_compound_head() and try_grab_page() has
fallen a little out of date. Update and clarify a few points.
Also make it kerneldoc-correct, by adding @args documentation.
Miaohe Lin [Mon, 23 Aug 2021 23:59:09 +0000 (09:59 +1000)]
mm: gup: use helper PAGE_ALIGNED in populate_vma_page_range()
Use helper PAGE_ALIGNED to check if address is aligned to PAGE_SIZE.
Minor readability improvement.
Link: https://lkml.kernel.org/r/20210807093620.21347-6-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Jan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
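As a tiny self-contained illustration (a user-space re-implementation of the macros, simplified from the kernel's definitions, just to show what the helper expresses):

#include <stdio.h>
#include <stdbool.h>

#define PAGE_SIZE          4096UL
#define PAGE_MASK          (~(PAGE_SIZE - 1))
#define IS_ALIGNED(x, a)   (((x) & ((a) - 1)) == 0)
#define PAGE_ALIGNED(addr) IS_ALIGNED((unsigned long)(addr), PAGE_SIZE)

int main(void)
{
        unsigned long addrs[] = { 0x1000, 0x1234, 0x20000 };

        for (int i = 0; i < 3; i++) {
                /* Open-coded mask test vs. the named helper: same result,
                 * but the helper states the intent directly. */
                bool old_way = !(addrs[i] & ~PAGE_MASK);
                bool new_way = PAGE_ALIGNED(addrs[i]);

                printf("%#lx aligned: %d (open-coded: %d)\n",
                       addrs[i], new_way, old_way);
        }
        return 0;
}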
Miaohe Lin [Mon, 23 Aug 2021 23:59:09 +0000 (09:59 +1000)]
mm: gup: fix potential pgmap refcnt leak in __gup_device_huge()
When try_grab_page() fails, put_dev_pagemap() is missed, so the pgmap
refcount leaks in this case. Also remove the check of pgmap against NULL,
as that is already done inside put_dev_pagemap().
Link: https://lkml.kernel.org/r/20210807093620.21347-5-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Fixes: 3faa52c03f44 ("mm/gup: track FOLL_PIN pages") Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Jan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Miaohe Lin [Mon, 23 Aug 2021 23:59:09 +0000 (09:59 +1000)]
mm: gup: remove useless BUG_ON in __get_user_pages()
Indeed, this BUG_ON couldn't catch anything useful. We are sure that
ret == 0 here, because we would already have bailed out if ret != 0, and
ret is untouched until this point.
Link: https://lkml.kernel.org/r/20210807093620.21347-4-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Jan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Miaohe Lin [Mon, 23 Aug 2021 23:59:08 +0000 (09:59 +1000)]
mm: gup: remove unneeded local variable orig_refs
Remove the unneeded local variable orig_refs, since refs is unchanged now.
Link: https://lkml.kernel.org/r/20210807093620.21347-3-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Miaohe Lin [Mon, 23 Aug 2021 23:59:08 +0000 (09:59 +1000)]
mm: gup: remove set but unused local variable major
Patch series "Cleanups and fixup for gup".
This series contains cleanups to remove unneeded variable, useless BUG_ON
and use helper to improve readability. Also we fix a potential pgmap
refcnt leak. More details can be found in the respective changelogs.
This patch (of 5):
Since commit a2beb5f1efed ("mm: clean up the last pieces of page fault
accountings"), the local variable major is unused. Remove it.
Shakeel Butt [Mon, 23 Aug 2021 23:59:08 +0000 (09:59 +1000)]
writeback: memcg: simplify cgroup_writeback_by_id
Currently cgroup_writeback_by_id calls mem_cgroup_wb_stats() to get dirty
pages for a memcg. However mem_cgroup_wb_stats() does a lot more than
just get the number of dirty pages. Just directly get the number of dirty
pages instead of calling mem_cgroup_wb_stats(). Also,
cgroup_writeback_by_id() is only called for best-effort dirty flushing, so
remove the unused 'nr' parameter; there is then no need to explicitly
flush memcg stats.
Link: https://lkml.kernel.org/r/20210722182627.2267368-1-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Johannes Weiner [Mon, 23 Aug 2021 23:59:08 +0000 (09:59 +1000)]
vfs: keep inodes with page cache off the inode shrinker LRU
Historically (pre-2.5), the inode shrinker used to reclaim only empty
inodes and skip over those that still contained page cache. This caused
problems on highmem hosts: struct inode could fill lowmem zones before
the cache was getting reclaimed in the highmem zones.
To address this, the inode shrinker started to strip page cache to
facilitate reclaiming lowmem. However, this comes with its own set of
problems: the shrinkers may drop actively used page cache just because the
inodes are not currently open or dirty - think working with a large git
tree. It further doesn't respect cgroup memory protection settings and
can cause priority inversions between containers.
Nowadays, the page cache also holds non-resident info for evicted cache
pages in order to detect refaults. We've come to rely heavily on this
data inside reclaim for protecting the cache workingset and driving swap
behavior. We also use it to quantify and report workload health through
psi. The latter in turn is used for fleet health monitoring, as well as
driving automated memory sizing of workloads and containers, proactive
reclaim and memory offloading schemes.
The consequence of dropping page cache prematurely is that we're seeing
subtle and not-so-subtle failures in all of the above-mentioned scenarios,
with the workload generally entering unexpected thrashing states while
losing the ability to reliably detect it.
To fix this on non-highmem systems at least, going back to rotating inodes
on the LRU isn't feasible. We've tried (commit a76cf1a474d7 ("mm: don't
reclaim inodes with many attached pages")) and failed (commit 69056ee6a8a3
("Revert "mm: don't reclaim inodes with many attached pages"")). The
issue is mostly that shrinker pools attract pressure based on their size,
and when objects get skipped the shrinkers remember this as deferred
reclaim work. This accumulates excessive pressure on the remaining
inodes, and we can quickly eat into heavily used ones, or dirty ones that
require IO to reclaim, when there potentially is plenty of cold, clean
cache around still.
Instead, this patch keeps populated inodes off the inode LRU in the first
place - just like an open file or dirty state would. An otherwise clean
and unused inode then gets queued when the last cache entry disappears.
This solves the problem without reintroducing the reclaim issues, and
generally is a bit more scalable than having to wade through potentially
hundreds of thousands of busy inodes.
Locking is a bit tricky because the locks protecting the inode state
(i_lock) and the inode LRU (lru_list.lock) don't nest inside the irq-safe
page cache lock (i_pages.xa_lock). Page cache deletions are serialized
through i_lock, taken before the i_pages lock, to make sure depopulated
inodes are queued reliably. Additions may race with deletions, but we'll
check again in the shrinker. If additions race with the shrinker itself,
we're protected by the i_lock: if find_inode() or iput() win, the shrinker
will bail on the elevated i_count or I_REFERENCED; if the shrinker wins
and goes ahead with the inode, it will set I_FREEING and inhibit further
igets(), which will cause the other side to create a new instance of the
inode instead.
Link: https://lkml.kernel.org/r/20210614211904.14420-4-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Cc: Tejun Heo <tj@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Johannes Weiner [Mon, 23 Aug 2021 23:59:08 +0000 (09:59 +1000)]
fs: inode: count invalidated shadow pages in pginodesteal
pginodesteal is supposed to capture the impact that inode reclaim has on
the page cache state. Currently, it doesn't consider shadow pages that
get dropped this way, even though this can have a significant impact on
paging behavior, memory pressure calculations etc.
To improve visibility into these effects, make sure shadow pages get
counted when they get dropped through inode reclaim.
This changes the return value semantics of invalidate_mapping_pages()
slightly, but the only two users are the inode shrinker itself and a usb
driver that logs it for debugging purposes.
Johannes Weiner [Mon, 23 Aug 2021 23:59:07 +0000 (09:59 +1000)]
fs: drop_caches: fix skipping over shadow cache inodes
When drop_caches truncates the page cache in an inode it also includes any
shadow entries for evicted pages. However, there is a preliminary check
on whether the inode has pages: if it has *only* shadow entries, it will
skip running truncation on the inode and leave it behind.
Fix the check to mapping_empty(), such that it runs truncation on any
inode that has cache entries at all.
Link: https://lkml.kernel.org/r/20210614211904.14420-2-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Roman Gushchin <guro@fb.com> Acked-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
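A minimal sketch of the corrected skip logic (the helper is hypothetical; in the patch the condition sits inline in drop_pagecache_sb()):

#include <linux/fs.h>
#include <linux/pagemap.h>
#include <linux/sched.h>

/*
 * Sketch: mapping_empty() is false while shadow entries remain, so an
 * inode holding only shadows is no longer skipped by drop_caches.
 */
static bool should_skip_inode_sketch(struct inode *inode)
{
        if (inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW))
                return true;
        return mapping_empty(inode->i_mapping) && !need_resched();
}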
Johannes Weiner [Mon, 23 Aug 2021 23:59:07 +0000 (09:59 +1000)]
mm: remove irqsave/restore locking from contexts with irqs enabled
The page cache deletion paths all have interrupts enabled, so no need to
use irqsafe/irqrestore locking variants.
They used to have irqs disabled by the memcg lock added in commit
c4843a7593a9 ("memcg: add per cgroup dirty page accounting"), but that has
since been replaced by memcg taking the page lock instead; see commit
0a31bc97c80c ("mm: memcontrol: rewrite uncharge API").
Jan Kara [Mon, 23 Aug 2021 23:59:07 +0000 (09:59 +1000)]
writeback: use READ_ONCE for unlocked reads of writeback stats
We do some unlocked reads of writeback statistics like
avg_write_bandwidth, dirty_ratelimit, or bw_time_stamp. Generally we are
fine with getting somewhat out-of-date values, but actually getting
different values in different parts of a function, because the compiler
decided to reload the value from its original memory location, could
confuse the calculations. Use READ_ONCE for these unlocked accesses and
WRITE_ONCE
for the updates to be on the safe side.
Link: https://lkml.kernel.org/r/20210713104716.22868-5-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Cc: Michael Stapelberg <stapelberg+linux@google.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
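A self-contained user-space analogue of the pattern (READ_ONCE/WRITE_ONCE are simplified re-implementations here; the struct and field are invented for the demo): each function takes one snapshot of the shared value and uses it consistently, instead of letting the compiler reload it mid-calculation.

#include <stdio.h>

#define READ_ONCE(x)     (*(const volatile typeof(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile typeof(x) *)&(x) = (v))

struct wb_stats {
        unsigned long avg_write_bandwidth;      /* updated by another thread */
};

static unsigned long estimate_pause(struct wb_stats *wb, unsigned long dirty)
{
        /* Snapshot once; every step of the formula sees the same value. */
        unsigned long bw = READ_ONCE(wb->avg_write_bandwidth);

        if (!bw)
                return 0;
        return dirty / bw;
}

int main(void)
{
        struct wb_stats wb;

        WRITE_ONCE(wb.avg_write_bandwidth, 100UL << 20);        /* ~100 MB/s */
        printf("pause ~ %lu\n", estimate_pause(&wb, 400UL << 20));
        return 0;
}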
Jan Kara [Mon, 23 Aug 2021 23:59:07 +0000 (09:59 +1000)]
writeback: rename domain_update_bandwidth()
Rename domain_update_bandwidth() to domain_update_dirty_limit(). The
original name is a misnomer. The function has nothing to do with a
bandwidth, it updates dirty limits.
Link: https://lkml.kernel.org/r/20210713104716.22868-4-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Cc: Michael Stapelberg <stapelberg+linux@google.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Jan Kara [Mon, 23 Aug 2021 23:59:07 +0000 (09:59 +1000)]
writeback: avoid division by 0 in wb_update_dirty_ratelimit()
Fixup patch "writeback: Fix bandwidth estimate for spiky workload" which
introduced possibility of __wb_update_bandwidth() getting called at a
moment when 'elapsed' evaluates to 0.
Cc: Jan Kara <jack@suse.cz> Cc: Michael Stapelberg <stapelberg+linux@google.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Jan Kara [Mon, 23 Aug 2021 23:59:06 +0000 (09:59 +1000)]
writeback: fix bandwidth estimate for spiky workload
Michael Stapelberg has reported that for workloads with short big spikes
of writes (the GCC linker seems to trigger this frequently) the write
throughput is heavily underestimated and tends to steadily sink until it
reaches zero. This has a rather bad impact on writeback throttling
(causing
stalls). The problem is that writeback throughput estimate gets updated
at most once per 200 ms. One update happens early after we submit pages
for writeback (at that point writeout of only a small fraction of pages is
completed and thus the observed throughput is tiny). The next update
happens only during the next write spike (updates happen only from inode
writeback and dirty throttling code), and if that is more than 1s after
the previous spike, we decide the system was idle and just ignore whatever
was written until this moment.
Fix the problem by making sure the writeback throughput estimate is also
updated shortly after writeback completes, to get a reasonable estimate of
throughput for spiky workloads.
Jan Kara [Mon, 23 Aug 2021 23:59:06 +0000 (09:59 +1000)]
writeback: reliably update bandwidth estimation
Currently we trigger writeback bandwidth estimation from
balance_dirty_pages() and from wb_writeback(). However, neither of these
needs to trigger when the system is relatively idle and writeback is
triggered e.g. from fsync(2). Make sure writeback estimates happen
reliably by triggering them from do_writepages().
Link: https://lkml.kernel.org/r/20210713104716.22868-2-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Cc: Michael Stapelberg <stapelberg+linux@google.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Jan Kara [Mon, 23 Aug 2021 23:59:06 +0000 (09:59 +1000)]
writeback: track number of inodes under writeback
Patch series "writeback: Fix bandwidth estimates", v4.
Fix the estimate of writeback throughput when the device is not fully busy
doing writeback. Michael Stapelberg has reported that such a workload
(e.g. generated by linking) tends to push the estimated throughput down to
0 and, as a result, writeback on the device is practically stalled.
The first three patches fix the reported issue, the remaining two patches
are unrelated cleanups of problems I've noticed when reading the code.
This patch (of 4):
Track number of inodes under writeback for each bdi_writeback structure.
We will use this to decide whether wb does any IO and so we can estimate
its writeback throughput. In principle we could use the number of pages
under writeback (the WB_WRITEBACK counter) for this; however, normal
percpu counter reads are too inaccurate for our purposes and summing the
counter is too expensive.
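A sketch of what such a counter looks like (names are illustrative, not necessarily the patch's): a plain atomic_t is exact and cheap to read, unlike summing a percpu counter.

#include <linux/atomic.h>

/* Sketch: count inodes currently under writeback for one bdi_writeback. */
struct wb_sketch {
        atomic_t writeback_inodes;
};

static inline void wb_inode_writeback_start(struct wb_sketch *wb)
{
        atomic_inc(&wb->writeback_inodes);      /* first page of an inode queued */
}

static inline void wb_inode_writeback_end(struct wb_sketch *wb)
{
        atomic_dec(&wb->writeback_inodes);      /* last page of an inode completed */
}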
Matthew Wilcox (Oracle) [Mon, 23 Aug 2021 23:59:06 +0000 (09:59 +1000)]
mm: mark idle page tracking as BROKEN
In discussion with other MM developers around how idle page tracking
should be fixed for transparent huge pages, several expressed the opinion
that it should be removed, as it is inefficient at accomplishing the job
it is supposed to do, and we have better mechanisms (e.g. uffd) for
accomplishing the same goals these days.
Mark the feature as BROKEN for now and we can remove it entirely in a few
months if nobody complains. It is not enabled by Android, ChromeOS,
Debian, Fedora or SUSE. Red Hat enabled it with RHEL-8.1 and UEK followed
suit, but I have been unable to find why RHEL enabled it.
Link: https://lkml.kernel.org/r/20210612000714.775825-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Yu Zhao <yuzhao@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
SeongJae Park <sj38.park@gmail.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Rafael Aquini <aquini@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
liuhailong [Mon, 23 Aug 2021 23:59:06 +0000 (09:59 +1000)]
mm: add kernel_misc_reclaimable in show_free_areas
Print the NR_KERNEL_MISC_RECLAIMABLE stat from show_free_areas() so users
can check whether the shrinker is working correctly and see the current
memory usage.
While it was not useful to that bug report to know where the reclaim lock
had been acquired, it might be useful under other circumstances. Allow
the caller of __fs_reclaim_acquire to specify the instruction pointer to
use.
Link: https://lkml.kernel.org/r/20210719185709.1755149-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Omar Sandoval <osandov@fb.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Gavin Shan [Mon, 23 Aug 2021 23:59:05 +0000 (09:59 +1000)]
mm/debug_vm_pgtable: fix corrupted page flag
In the page table entry modifying tests, set_xxx_at() is used to populate
the page table entries. On ARM64, the PG_arch_1 (PG_dcache_clean) flag is
set on the target page if execution permission is given. This logic has
existed since commit 4f04d8f00545 ("arm64: MMU definitions"). The page
flag is kept when the page is freed to the buddy's free area list.
However, it triggers a page checking failure when the page is pulled from
the buddy's free area list, as the following warning messages indicate.
This fixes the issue by clearing PG_arch_1 through flush_dcache_page()
after set_xxx_at() is called. For architectures other than ARM64, the
unexpected overhead of cache flushing is acceptable.
Gavin Shan [Mon, 23 Aug 2021 23:59:05 +0000 (09:59 +1000)]
mm/debug_vm_pgtable: remove unused code
The variables used by the old implementation aren't needed now that we
have switched to "struct pgtable_debug_args". Let's remove them and the
related code in debug_vm_pgtable().
Gavin Shan [Mon, 23 Aug 2021 23:59:05 +0000 (09:59 +1000)]
mm/debug_vm_pgtable: use struct pgtable_debug_args in PGD and P4D modifying tests
This uses struct pgtable_debug_args in the PGD/P4D modifying tests. No
allocated huge page is used in these tests. Besides, the unused variables
@saved_p4dp and @saved_pudp are dropped.
Gavin Shan [Mon, 23 Aug 2021 23:59:05 +0000 (09:59 +1000)]
mm/debug_vm_pgtable: use struct pgtable_debug_args in PUD modifying tests
This uses struct pgtable_debug_args in PUD modifying tests. The allocated
huge page is used when set_pud_at() is used. The corresponding tests are
skipped if the huge page doesn't exist. Besides, the following unused
variables in debug_vm_pgtable() are dropped: @prot, @paddr, @pud_aligned.
Gavin Shan [Mon, 23 Aug 2021 23:59:04 +0000 (09:59 +1000)]
mm/debug_vm_pgtable: use struct pgtable_debug_args in PMD modifying tests
This uses struct pgtable_debug_args in PMD modifying tests. The allocated
huge page is used when set_pmd_at() is used. The corresponding tests are
skipped if the huge page doesn't exist. Besides, the unused variable
@pmd_aligned in debug_vm_pgtable() is dropped.
Gavin Shan [Mon, 23 Aug 2021 23:59:04 +0000 (09:59 +1000)]
mm/debug_vm_pgtable: use struct pgtable_debug_args in PTE modifying tests
This uses struct pgtable_debug_args in PTE modifying tests. The allocated
page is used as set_pte_at() is used there. The tests are skipped if the
allocated page doesn't exist. It's notable that args->ptep needs to be
mapped before the tests. The reason we don't map args->ptep at the
beginning is that the PTE entry is only mapped and accessible in atomic
context when CONFIG_HIGHPTE is enabled, so we avoid doing that unless it
is actually needed.
Besides, the unused variables @pte_aligned and @ptep in debug_vm_pgtable()
are dropped.
Gavin Shan [Mon, 23 Aug 2021 23:59:04 +0000 (09:59 +1000)]
mm/debug_vm_pgtable: use struct pgtable_debug_args in migration and thp tests
This uses struct pgtable_debug_args in the migration and thp test
functions. It's notable that the pre-allocated page is used in
swap_migration_tests() as set_pte_at() is used there.
Patch series "mm/debug_vm_pgtable: Enhancements", v6.
There are a couple of issues with current implementations and this series
tries to resolve the issues:
(a) All needed information is scattered in variables passed to the
various test functions. The code is organized in a pretty relaxed
fashion.
(b) The page isn't allocated from buddy during the page table entry
modifying tests. The page can be invalid, conflicting with the
implementations of set_xxx_at() on ARM64. The target page is accessed
so that the iCache can be flushed when execution permission is given
on ARM64. Besides, the target page can be unmapped, and accessing it
causes a kernel crash.
"struct pgtable_debug_args" is introduced to address issue (a). For issue
(b), the used page is allocated from buddy in the page table entry
modifying tests. The corresponding tests will be skipped if we fail to
allocate the (huge) page. For other test cases, the original page around
the kernel symbol (@start_kernel) is still used.
The patches are organized as below. PATCH[2-10] could be combined into
one patch, but that would make the review harder:
PATCH[1] introduces "struct pgtable_debug_args" as place holder of all
needed information. With it, the old and new implementation
can coexist.
PATCH[2-10] uses "struct pgtable_debug_args" in various test functions.
PATCH[11] removes the unused code for old implementation.
PATCH[12] fixes the issue of corrupted page flag for ARM64
This patch (of 6):
In debug_vm_pgtable(), there are many local variables introduced to track
the needed information, and they are passed to the functions for the
various test cases. It'd be better to introduce a struct as a placeholder
for this information. With it, all the test functions need is the struct.
In this way, the code is simplified and easier to maintain.
Besides, set_xxx_at() could access the data on the corresponding pages in
the page table modifying tests. So the accessed pages in the tests should
have been allocated from buddy. Otherwise, we're accessing pages that
aren't owned by us. This causes issues like page flag corruption or a
kernel crash on accessing an unmapped page when CONFIG_DEBUG_PAGEALLOC is
enabled.
This introduces "struct pgtable_debug_args". The struct is initialized
and destroyed, but the information in the struct isn't used yet. It will
be used in subsequent patches.
As pointed out by Sebastian Andrzej Siewior, this_cpu_cmpxchg_double()
needs the freelist and tid fields to be aligned to the sum of their sizes
(16 bytes), but they are not in this configuration. This didn't happen
with non-debug RT and !RT configs, nor with lockdep.
To fix this, move the lock field below the partial field, so that it
doesn't affect the layout.
This is a fixup for mmotm patch
mm-slub-convert-kmem_cpu_slab-protection-to-local_lock.patch
Link: https://lkml.kernel.org/r/7e9ccf34-57d1-786b-2dfd-3b9ba78e1b32@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Mon, 23 Aug 2021 23:59:03 +0000 (09:59 +1000)]
mm, slub: convert kmem_cpu_slab protection to local_lock
Embed local_lock into struct kmem_cache_cpu and use the irq-safe versions
of local_lock instead of plain local_irq_save/restore. On !PREEMPT_RT
that's equivalent, with better lockdep visibility. On PREEMPT_RT that
means better preemption.
However, the cost on PREEMPT_RT is the loss of lockless fast paths which
only work with cpu freelist. Those are designed to detect and recover
from being preempted by other conflicting operations (both fast or slow
path), but the slow path operations assume they cannot be preempted by a
fast path operation, which is guaranteed naturally with disabled irqs.
With local locks on PREEMPT_RT, the fast paths now also need to take the
local lock to avoid races.
In the allocation fastpath slab_alloc_node() we can just defer to the
slowpath __slab_alloc() which also works with cpu freelist, but under the
local lock. In the free fastpath do_slab_free() we have to add a new
local lock protected version of freeing to the cpu freelist, as the
existing slowpath only works with the page freelist.
Also update the comment about locking scheme in SLUB to reflect changes
done by this series.
[efault@gmx.de: use local_lock() without irq in PREEMPT_RT scope, debugging of RT crashes resulting in put_cpu_partial() locking changes] Link: https://lkml.kernel.org/r/20210805152000.12817-36-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Pekka Enberg <penberg@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
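As a rough illustration of the conversion (the struct and function here are sketches, not the kernel's definitions): the per-CPU slab state embeds a local_lock_t, and both fast and slow paths take it with the irq-saving local lock API.

#include <linux/local_lock.h>

/* Sketch of per-CPU slab state protected by an embedded local lock. */
struct kmem_cache_cpu_sketch {
        local_lock_t lock;      /* protects the fields below */
        void **freelist;        /* per-CPU list of free objects */
        unsigned long tid;      /* transaction id for cmpxchg fast paths */
        struct page *page;      /* current cpu slab */
};

static void cpu_slab_critical_section(struct kmem_cache_cpu_sketch __percpu *c)
{
        unsigned long flags;

        /* On !PREEMPT_RT this is local_irq_save(); on PREEMPT_RT it is a
         * per-CPU spinlock, so the section stays preemptible. */
        local_lock_irqsave(&c->lock, flags);
        /* ... manipulate freelist / tid / page here ... */
        local_unlock_irqrestore(&c->lock, flags);
}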
Vlastimil Babka [Mon, 23 Aug 2021 23:59:02 +0000 (09:59 +1000)]
mm, slub: use migrate_disable() on PREEMPT_RT
We currently use preempt_disable() (directly or via get_cpu_ptr()) to
stabilize the pointer to kmem_cache_cpu. On PREEMPT_RT this would be
incompatible with the list_lock spinlock. We can use migrate_disable()
instead, but that increases overhead on !PREEMPT_RT as it's an
unconditional function call.
In order to get the best available mechanism on both PREEMPT_RT and
!PREEMPT_RT, introduce private slub_get_cpu_ptr() and slub_put_cpu_ptr()
wrappers and use them.
Link: https://lkml.kernel.org/r/20210805152000.12817-35-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Galbraith <efault@gmx.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Mon, 23 Aug 2021 23:59:02 +0000 (09:59 +1000)]
mm, slub: protect put_cpu_partial() with disabled irqs instead of cmpxchg
Jann Horn reported [1] the following theoretically possible race:
task A: put_cpu_partial() calls preempt_disable()
task A: oldpage = this_cpu_read(s->cpu_slab->partial)
interrupt: kfree() reaches unfreeze_partials() and discards the page
task B (on another CPU): reallocates page as page cache
task A: reads page->pages and page->pobjects, which are actually
halves of the pointer page->lru.prev
task B (on another CPU): frees page
interrupt: allocates page as SLUB page and places it on the percpu partial list
task A: this_cpu_cmpxchg() succeeds
which would cause page->pages and page->pobjects to end up containing
halves of pointers that would then influence when put_cpu_partial()
happens and show up in root-only sysfs files. Maybe that's acceptable,
I don't know. But there should probably at least be a comment for now
to point out that we're reading union fields of a page that might be
in a completely different state.
Additionally, the this_cpu_cmpxchg() approach in put_cpu_partial() is only
safe against s->cpu_slab->partial manipulation in ___slab_alloc() if the
latter disables irqs, otherwise a __slab_free() in an irq handler could
call put_cpu_partial() in the middle of ___slab_alloc() manipulating
->partial and corrupt it. This becomes an issue on RT after a local_lock
is introduced in later patch. The fix means taking the local_lock also in
put_cpu_partial() on RT.
After debugging this issue, Mike Galbraith suggested [2] that to avoid
different locking schemes on RT and !RT, we can just protect
put_cpu_partial() with disabled irqs (to be converted to
local_lock_irqsave() later) everywhere. This should be acceptable as it's
not a fast path, and moving the actual partial unfreezing outside of the
irq disabled section makes it short, and with the retry loop gone the code
can be also simplified. In addition, the race reported by Jann should no
longer be possible.
Vlastimil Babka [Mon, 23 Aug 2021 23:59:02 +0000 (09:59 +1000)]
mm, slub: make slab_lock() disable irqs with PREEMPT_RT
We need to disable irqs around slab_lock() (a bit spinlock) to make it
irq-safe. The calls to slab_lock() are nested under spin_lock_irqsave()
which doesn't disable irqs on PREEMPT_RT, so add explicit disabling with
PREEMPT_RT.
We also distinguish cmpxchg_double_slab(), where we do the disabling
explicitly, from __cmpxchg_double_slab() for contexts with already
disabled irqs. However, these contexts are also typically
spin_lock_irqsave(), and thus insufficient on PREEMPT_RT. So change
__cmpxchg_double_slab() to be the same as cmpxchg_double_slab() on
PREEMPT_RT.
Link: https://lkml.kernel.org/r/20210805152000.12817-33-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Galbraith <efault@gmx.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Mon, 23 Aug 2021 23:59:02 +0000 (09:59 +1000)]
mm, slub: optionally save/restore irqs in slab_[un]lock()/
For PREEMPT_RT we will need to disable irqs for this bit spinlock. As a
preparation, add a flags parameter, and an internal version that takes
additional bool parameter to control irq saving/restoring (the flags
parameter is compile-time unused if the bool is a constant false).
Convert ___cmpxchg_double_slab(), which also comes with the same bool
parameter, to use the internal version.
Link: https://lkml.kernel.org/r/20210805152000.12817-32-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Galbraith <efault@gmx.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Mon, 23 Aug 2021 23:59:01 +0000 (09:59 +1000)]
mm, slub: fix memory and cpu hotplug related lock ordering issues
Qian Cai reported [1] a lockdep splat on memory offline.
[ 91.374541] WARNING: possible circular locking dependency detected
[ 91.381411] 5.14.0-rc5-next-20210809+ #84 Not tainted
[ 91.387149] ------------------------------------------------------
[ 91.394016] lsbug/1523 is trying to acquire lock:
[ 91.399406] ffff800018e76530 (flush_lock){+.+.}-{3:3}, at: flush_all+0x50/0x1c8
[ 91.407425] but task is already holding lock:
[ 91.414638] ffff800018e48468 (slab_mutex){+.+.}-{3:3}, at: slab_memory_callback+0x44/0x280
[ 91.423603] which lock already depends on the new lock.
To fix it, we need to change the order in flush_all() so that
cpus_read_lock() is first and mutex_lock(&flush_lock) second.
Also when called from slab_mem_going_offline_callback() we are already
under cpus_read_lock() and cannot take it again, so create a
flush_all_cpus_locked() variant and decouple flushing from actual
shrinking for this call path.
Additionally, Mike Galbraith reported [2] wrong order of cpus_read_lock()
and slab_mutex in kmem_cache_destroy() path and proposed a fix to reverse
it.
This patch is a fixup for the mmotm patch
mm-slub-move-flush_cpu_slab-invocations-__free_slab-invocations-out-of-irq-context.patch
Sebastian Andrzej Siewior [Mon, 23 Aug 2021 23:59:01 +0000 (09:59 +1000)]
mm: slub: move flush_cpu_slab() invocations __free_slab() invocations out of IRQ context
flush_all() flushes a specific SLAB cache on each CPU (where the cache is
present). The deactivate_slab()/__free_slab() invocation happens within
IPI handler and is problematic for PREEMPT_RT.
The flush operation is not a frequent operation or a hot path. The
per-CPU flush operation can be moved to within a workqueue.
[vbabka@suse.cz: adapt to new SLUB changes] Link: https://lkml.kernel.org/r/20210805152000.12817-30-vbabka@suse.cz Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Galbraith <efault@gmx.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
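A hedged sketch of the mechanism (names are illustrative and the real code differs in detail; the mutex-after-cpus_read_lock() ordering matches the fixup entry above): queue one work item per CPU instead of flushing from an IPI handler, then wait for all of them.

#include <linux/cpu.h>
#include <linux/mutex.h>
#include <linux/percpu.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

struct slub_flush_work_sketch {
        struct work_struct work;
        struct kmem_cache *s;
};

static DEFINE_PER_CPU(struct slub_flush_work_sketch, slub_flush_sketch);
static DEFINE_MUTEX(flush_lock_sketch);

static void flush_cpu_slab_workfn(struct work_struct *w)
{
        struct slub_flush_work_sketch *sfw =
                container_of(w, struct slub_flush_work_sketch, work);

        /* ... deactivate sfw->s's cpu slab and unfreeze its partial list,
         * now in process context rather than in an IPI handler ... */
        (void)sfw;
}

static void flush_all_sketch(struct kmem_cache *s)
{
        unsigned int cpu;

        cpus_read_lock();                       /* cpus lock first */
        mutex_lock(&flush_lock_sketch);         /* then the flush mutex */

        for_each_online_cpu(cpu) {
                struct slub_flush_work_sketch *sfw =
                        per_cpu_ptr(&slub_flush_sketch, cpu);

                INIT_WORK(&sfw->work, flush_cpu_slab_workfn);
                sfw->s = s;
                schedule_work_on(cpu, &sfw->work);
        }
        for_each_online_cpu(cpu)
                flush_work(&per_cpu_ptr(&slub_flush_sketch, cpu)->work);

        mutex_unlock(&flush_lock_sketch);
        cpus_read_unlock();
}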
Vlastimil Babka [Mon, 23 Aug 2021 23:59:01 +0000 (09:59 +1000)]
mm, slab: make flush_slab() possible to call with irqs enabled
Currently flush_slab() is always called with disabled IRQs if it's needed,
but the following patches will change that, so add a parameter to control
IRQ disabling within the function, which only protects the kmem_cache_cpu
manipulation and not the call to deactivate_slab() which doesn't need it.
Link: https://lkml.kernel.org/r/20210805152000.12817-29-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Galbraith <efault@gmx.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Mon, 23 Aug 2021 23:59:01 +0000 (09:59 +1000)]
mm, slub: don't disable irqs in slub_cpu_dead()
slub_cpu_dead() cleans up for an offlined cpu from another cpu and calls
only functions that are now irq safe, so we don't need to disable irqs
anymore.
Link: https://lkml.kernel.org/r/20210805152000.12817-28-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Galbraith <efault@gmx.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Mon, 23 Aug 2021 23:59:00 +0000 (09:59 +1000)]
mm, slub: only disable irq with spin_lock in __unfreeze_partials()
__unfreeze_partials() no longer needs to have irqs disabled, except for
making the spin_lock operations irq-safe, so convert the spin_lock
operations and remove the separate irq handling.
Link: https://lkml.kernel.org/r/20210805152000.12817-27-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Galbraith <efault@gmx.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Mon, 23 Aug 2021 23:59:00 +0000 (09:59 +1000)]
mm, slub: separate detaching of partial list in unfreeze_partials() from unfreezing
Unfreezing the partial list can be split into two phases - detaching the
list from struct kmem_cache_cpu, and processing the list. The whole
operation
does not need to be protected by disabled irqs. Restructure the code to
separate the detaching (with disabled irqs) and unfreezing (with irq
disabling to be reduced in the next patch).
Also, unfreeze_partials() can be called from another cpu on behalf of a
cpu that is being offlined, where disabling irqs on the local cpu has no
sense, so restructure the code as follows:
- __unfreeze_partials() is the bulk of unfreeze_partials() that processes the
detached percpu partial list
- unfreeze_partials() detaches list from current cpu with irqs disabled and
calls __unfreeze_partials()
- unfreeze_partials_cpu() is to be called for the offlined cpu so it needs no
irq disabling, and is called from __flush_cpu_slab()
- flush_cpu_slab() is for the local cpu thus it needs to call
unfreeze_partials(). So it can't simply call
__flush_cpu_slab(smp_processor_id()) anymore and we have to open-code the
proper calls.
Link: https://lkml.kernel.org/r/20210805152000.12817-26-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Galbraith <efault@gmx.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Mon, 23 Aug 2021 23:59:00 +0000 (09:59 +1000)]
mm, slub: detach whole partial list at once in unfreeze_partials()
Instead of iterating through the live percpu partial list, detach it from
the kmem_cache_cpu at once. This is simpler and will allow further
optimization.
Link: https://lkml.kernel.org/r/20210805152000.12817-25-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Galbraith <efault@gmx.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Mon, 23 Aug 2021 23:58:59 +0000 (09:58 +1000)]
mm, slub: move irq control into unfreeze_partials()
unfreeze_partials() can be optimized so that it doesn't need irqs disabled
for the whole time. As the first step, move irq control into the function
and remove it from the put_cpu_partial() caller.
Link: https://lkml.kernel.org/r/20210805152000.12817-23-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Galbraith <efault@gmx.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Mon, 23 Aug 2021 23:58:59 +0000 (09:58 +1000)]
mm, slub: call deactivate_slab() without disabling irqs
The function is now safe to be called with irqs enabled, so move the calls
outside of irq disabled sections.
When called from ___slab_alloc() -> flush_slab() we have irqs disabled, so
to reenable them before deactivate_slab() we need to open-code
flush_slab() in ___slab_alloc() and reenable irqs after modifying the
kmem_cache_cpu fields. But that means an IRQ handler might meanwhile have
assigned a new page to kmem_cache_cpu.page, so we have to retry the whole
check.
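Schematically, the open-coded sequence in ___slab_alloc() then looks like this sketch (labels and exact ordering are approximations of the description, not the literal patch):

  local_irq_save(flags);
  if (page != c->page) {
      /* an IRQ handler already installed a new cpu slab, start over */
      local_irq_restore(flags);
      goto reread_page;
  }
  freelist = c->freelist;
  c->page = NULL;
  c->freelist = NULL;
  local_irq_restore(flags);

  /* now safe to run with irqs enabled */
  deactivate_slab(s, page, freelist);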
The remaining callers of flush_slab() are the IPI handler which has
disabled irqs anyway, and slub_cpu_dead() which will be dealt with in the
following patch.
Link: https://lkml.kernel.org/r/20210805152000.12817-22-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Galbraith <efault@gmx.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Mon, 23 Aug 2021 23:58:59 +0000 (09:58 +1000)]
mm, slub: make locking in deactivate_slab() irq-safe
deactivate_slab() now no longer touches the kmem_cache_cpu structure, so it
will be possible to call it with irqs enabled. Just convert the spin_lock
calls to their irq saving/restoring variants to make it irq-safe.
Note we now have to use cmpxchg_double_slab() for irq-safe slab_lock(),
because in some situations we don't take the list_lock, which would
disable irqs.
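The list_lock part of the conversion is mechanical, along these lines:

  unsigned long flags;

  /* was: spin_lock(&n->list_lock); ... spin_unlock(&n->list_lock); */
  spin_lock_irqsave(&n->list_lock, flags);
  /* move the slab to/from the node partial or full list */
  spin_unlock_irqrestore(&n->list_lock, flags);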
Link: https://lkml.kernel.org/r/20210805152000.12817-21-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Galbraith <efault@gmx.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Mon, 23 Aug 2021 23:58:58 +0000 (09:58 +1000)]
mm, slub: move reset of c->page and freelist out of deactivate_slab()
deactivate_slab() removes the cpu slab by merging the cpu freelist with
slab's freelist and putting the slab on the proper node's list. It also
sets the respective kmem_cache_cpu pointers to NULL.
By extracting the kmem_cache_cpu operations from the function, we can make
it not dependent on disabled irqs.
Also if we return a single free pointer from ___slab_alloc, we no longer
have to assign kmem_cache_cpu.page before deactivation or care if somebody
preempted us and assigned a different page to our kmem_cache_cpu in the
process.
Link: https://lkml.kernel.org/r/20210805152000.12817-20-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Galbraith <efault@gmx.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Mon, 23 Aug 2021 23:58:58 +0000 (09:58 +1000)]
mm, slub: stop disabling irqs around get_partial()
The function get_partial() does not need to have irqs disabled as a whole.
It's sufficient to convert spin_lock operations to their irq
saving/restoring versions.
As a result, it's now possible to reach the page allocator from the slab
allocator without disabling and re-enabling interrupts on the way.
Link: https://lkml.kernel.org/r/20210805152000.12817-19-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Galbraith <efault@gmx.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Mon, 23 Aug 2021 23:58:58 +0000 (09:58 +1000)]
mm, slub: check new pages with restored irqs
Building on top of the previous patch, re-enable irqs before checking new
pages. alloc_debug_processing() is now called with irqs enabled, so we
need to remove the VM_BUG_ON(!irqs_disabled()); in check_slab() - there
doesn't seem to be a need for it anyway.
Link: https://lkml.kernel.org/r/20210805152000.12817-18-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Galbraith <efault@gmx.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Mon, 23 Aug 2021 23:58:58 +0000 (09:58 +1000)]
mm, slub: validate slab from partial list or page allocator before making it cpu slab
When we obtain a new slab page from node partial list or page allocator,
we assign it to kmem_cache_cpu, perform some checks, and if they fail, we
undo the assignment.
In order to allow doing the checks without irq disabled, restructure the
code so that the checks are done first, and kmem_cache_cpu.page assignment
only after they pass.
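Schematically (a sketch of the restructuring; the check below is a hypothetical placeholder for the pfmemalloc and debug checks):

  /* before: publish first, undo on failure */
  c->page = page;
  if (!new_slab_checks_pass(s, page, gfpflags))   /* hypothetical */
      c->page = NULL;

  /* after: check first, publish as the cpu slab only on success */
  if (!new_slab_checks_pass(s, page, gfpflags))
      goto retry_or_deactivate;
  c->page = page;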
Link: https://lkml.kernel.org/r/20210805152000.12817-17-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Galbraith <efault@gmx.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Mon, 23 Aug 2021 23:58:57 +0000 (09:58 +1000)]
mm, slub: restore irqs around calling new_slab()
allocate_slab() currently re-enables irqs before calling to the page
allocator. It depends on gfpflags_allow_blocking() to determine if it's
safe to do so. Now we can instead simply restore irqs before calling it
through new_slab(). The other caller early_kmem_cache_node_alloc() is
unaffected by this.
Link: https://lkml.kernel.org/r/20210805152000.12817-16-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Galbraith <efault@gmx.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Mon, 23 Aug 2021 23:58:57 +0000 (09:58 +1000)]
mm, slub: move disabling irqs closer to get_partial() in ___slab_alloc()
Continue reducing the irq disabled scope. Check for per-cpu partial slabs
first with irqs enabled, and then recheck with irqs disabled before grabbing
the slab page. Mostly preparatory for the following patches.
Link: https://lkml.kernel.org/r/20210805152000.12817-15-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Galbraith <efault@gmx.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Link: https://lkml.kernel.org/r/ec98bce0-fef4-0fbc-2067-e358510e0321@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Clark Williams <williams@redhat.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
The problem is that we are opportunistically checking flags on a page in
irq enabled section. If we are interrupted and the page is freed, it's
not an issue as we detect it after disabling irqs. But on kernels with
CONFIG_DEBUG_VM, the check for the PageSlab flag in PageSlabPfmemalloc() can
fail.
Fix this by creating an "unsafe" version of the check that doesn't check
PageSlab.
Link: https://lkml.kernel.org/r/f4756ee5-a7e9-ab02-3aba-1355f77b7c79@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Clark Williams <williams@redhat.com> Tested-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Mon, 23 Aug 2021 23:58:56 +0000 (09:58 +1000)]
mm, slub: do initial checks in ___slab_alloc() with irqs enabled
As another step of shortening irq disabled sections in ___slab_alloc(),
delay disabling irqs until we pass the initial checks if there is a cached
percpu slab and it's suitable for our allocation.
Now we have to recheck c->page after actually disabling irqs as an
allocation in irq handler might have replaced it.
Because we call pfmemalloc_match() as one of the checks, we might hit
VM_BUG_ON_PAGE(!PageSlab(page)) in PageSlabPfmemalloc in case we get
interrupted and the page is freed. Thus introduce a
pfmemalloc_match_unsafe() variant that lacks the PageSlab check.
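A sketch of such a variant, assuming a __PageSlabPfmemalloc() helper that reads the flag without asserting PageSlab (names may differ from the final patch):

  static inline bool pfmemalloc_match_unsafe(struct page *page, gfp_t gfpflags)
  {
      /* opportunistic check: the page may have been freed under us,
       * so skip the PageSlab assertion done by PageSlabPfmemalloc() */
      if (unlikely(__PageSlabPfmemalloc(page)))
          return gfp_pfmemalloc_allowed(gfpflags);

      return true;
  }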
Link: https://lkml.kernel.org/r/20210805152000.12817-14-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Mon, 23 Aug 2021 23:58:56 +0000 (09:58 +1000)]
mm, slub: move disabling/enabling irqs to ___slab_alloc()
Currently __slab_alloc() disables irqs around the whole ___slab_alloc().
This includes cases where this is not needed, such as when the allocation
ends up in the page allocator and has to awkwardly enable irqs back based
on gfp flags. Also the whole kmem_cache_alloc_bulk() is executed with
irqs disabled even when it hits the __slab_alloc() slow path, and long
periods with disabled interrupts are undesirable.
As a first step towards reducing irq disabled periods, move irq handling
into ___slab_alloc(). Callers will instead prevent the s->cpu_slab percpu
pointer from becoming invalid via get_cpu_ptr(), thus preempt_disable().
This does not protect against modification by an irq handler, which is
still done by disabled irq for most of ___slab_alloc(). As a small
immediate benefit, slab_out_of_memory() from ___slab_alloc() is now called
with irqs enabled.
kmem_cache_alloc_bulk() disables irqs for its fastpath and then re-enables
them before calling ___slab_alloc(), which then disables them at its
discretion. The whole kmem_cache_alloc_bulk() operation also disables
preemption.
When ___slab_alloc() calls new_slab() to allocate a new page, re-enable
preemption, because new_slab() will re-enable interrupts in contexts that
allow blocking (this will be improved by later patches).
The patch itself will thus increase overhead a bit due to disabled
preemption (on configs where it matters) and increased disabling/enabling
irqs in kmem_cache_alloc_bulk(), but that will be gradually improved in
the following patches.
Note in __slab_alloc() we need to change the #ifdef CONFIG_PREEMPT guard
to CONFIG_PREEMPT_COUNT to make sure preempt disable/enable is properly
paired in all configurations. On configs without involuntary preemption
and debugging the re-read of kmem_cache_cpu pointer is still compiled out
as it was before.
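The caller side then looks roughly like the following sketch, based on the description above:

  static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
                            unsigned long addr, struct kmem_cache_cpu *c)
  {
      void *p;

  #ifdef CONFIG_PREEMPT_COUNT
      /*
       * We may have been preempted and rescheduled on a different cpu
       * before disabling preemption; reload the percpu area pointer.
       */
      c = get_cpu_ptr(s->cpu_slab);
  #endif

      p = ___slab_alloc(s, gfpflags, node, addr, c);

  #ifdef CONFIG_PREEMPT_COUNT
      put_cpu_ptr(s->cpu_slab);
  #endif
      return p;
  }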
[efault@gmx.de: fix kmem_cache_alloc_bulk() error path] Link: https://lkml.kernel.org/r/20210805152000.12817-13-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Pekka Enberg <penberg@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Mon, 23 Aug 2021 23:58:56 +0000 (09:58 +1000)]
mm, slub: simplify kmem_cache_cpu and tid setup
In slab_alloc_node() and do_slab_free() fastpaths we need to guarantee
that our kmem_cache_cpu pointer is from the same cpu as the tid value.
Currently that's done by reading the tid first using this_cpu_read(), then
the kmem_cache_cpu pointer and verifying we read the same tid using the
pointer and plain READ_ONCE().
This can be simplified to just fetching kmem_cache_cpu pointer and then
reading tid using the pointer. That guarantees they are from the same
cpu. We don't need to read the tid using this_cpu_read() because the
value will be validated by this_cpu_cmpxchg_double(), making sure we are
on the correct cpu and that the freelist wasn't changed by anyone preempting
us since the tid was read.
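In other words, roughly (a before/after sketch, not the exact diff):

  /* before: loop until tid and c are guaranteed to come from the same cpu */
  do {
      tid = this_cpu_read(s->cpu_slab->tid);
      c = raw_cpu_ptr(s->cpu_slab);
  } while (IS_ENABLED(CONFIG_PREEMPTION) &&
           unlikely(tid != READ_ONCE(c->tid)));

  /* after: read tid through the pointer so they always match;
   * this_cpu_cmpxchg_double() later catches any cpu migration */
  c = raw_cpu_ptr(s->cpu_slab);
  tid = READ_ONCE(c->tid);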
Link: https://lkml.kernel.org/r/20210805152000.12817-12-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Mon, 23 Aug 2021 23:58:55 +0000 (09:58 +1000)]
mm, slub: restructure new page checks in ___slab_alloc()
When we allocate slab object from a newly acquired page (from node's
partial list or page allocator), we usually also retain the page as a new
percpu slab. There are two exceptions - when pfmemalloc status of the
page doesn't match our gfp flags, or when the cache has debugging enabled.
The current code for these decisions is not easy to follow, so restructure
it and add comments. The new structure will also help with the following
changes. No functional change.
Link: https://lkml.kernel.org/r/20210805152000.12817-11-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Mon, 23 Aug 2021 23:58:55 +0000 (09:58 +1000)]
mm, slub: return slab page from get_partial() and set c->page afterwards
The function get_partial() finds a suitable page on a partial list,
acquires and returns its freelist and assigns the page pointer to
kmem_cache_cpu. In a later patch we will need more control over the
kmem_cache_cpu.page assignment, so instead of passing a kmem_cache_cpu
pointer, pass a pointer to a pointer to a page that get_partial() can fill;
the caller can then assign the kmem_cache_cpu.page pointer. No functional
change as all of this still happens with disabled IRQs.
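Schematically, the interface change (a sketch; actual parameters differ in detail):

  /* before: get_partial() assigned c->page internally */
  freelist = get_partial(s, gfpflags, node, c);

  /* after: get_partial() only reports the page; the caller decides when,
   * and under which protection, to publish it as the cpu slab */
  freelist = get_partial(s, gfpflags, node, &page);
  if (freelist)
      c->page = page;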
Link: https://lkml.kernel.org/r/20210805152000.12817-10-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Galbraith <efault@gmx.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Mon, 23 Aug 2021 23:58:55 +0000 (09:58 +1000)]
mm, slub: dissolve new_slab_objects() into ___slab_alloc()
The later patches will need more fine grained control over individual
actions in ___slab_alloc(), the only caller of new_slab_objects(), so
dissolve it there. This is a preparatory step with no functional change.
The only minor change is moving WARN_ON_ONCE() for using a constructor
together with __GFP_ZERO to new_slab(), which makes it somewhat less
frequent, but still able to catch a development change introducing a
systematic misuse.
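The moved check is essentially a one-liner at the top of new_slab() (sketch):

  /* catch a systematic misuse of a constructor together with __GFP_ZERO */
  WARN_ON_ONCE(s->ctor && (flags & __GFP_ZERO));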
Link: https://lkml.kernel.org/r/20210805152000.12817-9-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Mon, 23 Aug 2021 23:58:55 +0000 (09:58 +1000)]
mm, slub: extract get_partial() from new_slab_objects()
The later patches will need more fine grained control over individual
actions in ___slab_alloc(), the only caller of new_slab_objects(), so this
is a first preparatory step with no functional change.
This adds a goto label that appears unnecessary at this point, but will be
useful for later changes.
Link: https://lkml.kernel.org/r/20210805152000.12817-8-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Galbraith <efault@gmx.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Mon, 23 Aug 2021 23:58:55 +0000 (09:58 +1000)]
mm, slub: unify cmpxchg_double_slab() and __cmpxchg_double_slab()
These functions differ only in irq disabling in the slow path. We can
create a common function with an extra bool parameter to control the irq
disabling. As the functions are inline and the parameter compile-time
constant, there will be no runtime overhead due to this change.
Also change the DEBUG_VM based irqs disable assert to the more standard
lockdep_assert based one.
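A sketch of the unified slow path; the bool parameter is a compile-time constant in the two inline wrappers, so the branches are optimized away (function and variable names are approximations):

  if (disable_irqs)
      local_irq_save(flags);
  slab_lock(page);
  if (page->freelist == freelist_old && page->counters == counters_old) {
      page->freelist = freelist_new;
      page->counters = counters_new;
      slab_unlock(page);
      if (disable_irqs)
          local_irq_restore(flags);
      return true;
  }
  slab_unlock(page);
  if (disable_irqs)
      local_irq_restore(flags);
  return false;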
Link: https://lkml.kernel.org/r/20210805152000.12817-7-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Galbraith <efault@gmx.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Mon, 23 Aug 2021 23:58:54 +0000 (09:58 +1000)]
mm, slub: remove redundant unfreeze_partials() from put_cpu_partial()
Commit d6e0b7fa1186 ("slub: make dead caches discard free slabs
immediately") introduced cpu partial flushing for kmemcg caches, based on
setting the target cpu_partial to 0 and adding a flushing check in
put_cpu_partial(). This code that sets cpu_partial to 0 was later moved
by c9fc586403e7 ("slab: introduce __kmemcg_cache_deactivate()") and
ultimately removed by 9855609bde03 ("mm: memcg/slab: use a single set of
kmem_caches for all accounted allocations"). However the check and flush
in put_cpu_partial() was never removed, although it's effectively dead
code. This patch removes it.
Note that d6e0b7fa1186 also added preempt_disable()/enable() to
unfreeze_partials() which could be thus also considered unnecessary. But
further patches will rely on it, so keep it.
Link: https://lkml.kernel.org/r/20210805152000.12817-6-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Galbraith <efault@gmx.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Mon, 23 Aug 2021 23:58:54 +0000 (09:58 +1000)]
mm, slub: don't disable irq for debug_check_no_locks_freed()
In slab_free_hook() we disable irqs around the
debug_check_no_locks_freed() call, which is unnecessary, as irqs are
already being disabled inside the call. This seems to be a leftover from
the past, when there were more calls inside the irq disabled sections.
Remove the irq disable/enable operations.
Mel noted:
> Looks like it was needed for kmemcheck which went away back in 4.15
Link: https://lkml.kernel.org/r/20210805152000.12817-5-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Mon, 23 Aug 2021 23:58:54 +0000 (09:58 +1000)]
mm, slub: allocate private object map for validate_slab_cache()
validate_slab_cache() is called either to handle a sysfs write, or from a
self-test context. In both situations it's straightforward to preallocate
a private object bitmap instead of grabbing the shared static one meant
for critical sections, so let's do that.
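For instance, along these lines (a sketch using the generic bitmap API):

  unsigned long *obj_map;

  obj_map = bitmap_alloc(oo_objects(s->oo), GFP_KERNEL);
  if (!obj_map)
      return -ENOMEM;

  /* walk the slabs, marking free objects in the private obj_map */

  bitmap_free(obj_map);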
Link: https://lkml.kernel.org/r/20210805152000.12817-4-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Mon, 23 Aug 2021 23:58:54 +0000 (09:58 +1000)]
mm, slub: allocate private object map for debugfs listings
Slub has a static spinlock protected bitmap for marking which objects are
on freelist when it wants to list them, for situations where dynamically
allocating such map can lead to recursion or locking issues, and on-stack
bitmap would be too large.
The handlers of debugfs files alloc_traces and free_traces also currently
use this shared bitmap, but their syscall context makes it straightforward
to allocate a private map before entering locked sections, so switch these
processing paths to use a private bitmap.
Link: https://lkml.kernel.org/r/20210805152000.12817-3-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Mon, 23 Aug 2021 23:58:54 +0000 (09:58 +1000)]
mm, slub: don't call flush_all() from slab_debug_trace_open()
Patch series "SLUB: reduce irq disabled scope and make it RT compatible", v4.
This series was initially inspired by Mel's pcplist local_lock rewrite,
and also interest to better understand SLUB's locking and the new
primitives and RT variants and implications. It should make SLUB more
preemption-friendly, especially for RT, hopefully without noticeable
regressions, as the fast paths are not affected.
The RFC/v1 version got basic performance screening by Mel that didn't show
major regressions. Mike's testing with hackbench of v2 on !RT reported
negligible differences [6]:
RT configs showed some throughput regressions, but that's an expected
tradeoff for the preemption improvements through the RT mutex. It didn't
prevent v2 from being incorporated into the 5.13 RT tree [7], leading to
testing exposure and bugfixes.
Before the series, SLUB is lockless in both allocation and free fast
paths, but elsewhere, it's disabling irqs for considerable periods of time
- especially in allocation slowpath and the bulk allocation, where IRQs
are re-enabled only when a new page from the page allocator is needed, and
the context allows blocking. The irq disabled sections can then include
deactivate_slab() which walks a full freelist and frees the slab back to
page allocator or unfreeze_partials() going through a list of percpu
partial slabs. The RT tree currently has some patches mitigating these,
but we can do much better in mainline too.
Patches 1-6 are straightforward improvements or cleanups that could exist
outside of this series too, but are prerequisites.
Patches 7-10 are also preparatory code changes without functional changes,
but not so useful without the rest of the series.
Patch 11 simplifies the fast paths on systems with preemption, based on
the (hopefully correct) observation that the current loops to verify tid are
unnecessary.
Patches 12-21 focus on reducing irq disabled scope in the allocation
slowpath.
Patch 12 moves disabling of irqs into ___slab_alloc() from its callers,
which are the allocation slowpath, and bulk allocation. Instead these
callers only disable preemption to stabilize the cpu. The following
patches then gradually reduce the scope of disabled irqs in
___slab_alloc() and the functions called from there. As of patch 15, the
re-enabling of irqs based on gfp flags before calling the page allocator
is removed from allocate_slab(). As of patch 18, it's possible to reach
the page allocator (in case the existing slabs are depleted) without
disabling and re-enabling irqs a single time.
Patches 22-27 reduce the scope of disabled irqs in functions related to
unfreezing percpu partial slabs.
Patch 28 is preparatory. Patch 29 is adopted from the RT tree and
converts the flushing of percpu slabs on all cpus from using IPI to
workqueue, so that the processing isn't happening with irqs disabled in
the IPI handler. The flushing is not performance critical so it should be
acceptable.
Patch 30 also comes from RT tree and makes object_map_lock RT compatible.
Patches 31-32 make slab_lock irq-safe on RT, where we cannot rely on having
irqs disabled from the list_lock spin lock usage.
Patch 33 changes kmem_cache_cpu->partial handling in put_cpu_partial()
from cmpxchg loop to a short irq disabled section, which is used by all
other code modifying the field. This addresses a theoretical race
scenario pointed out by Jann, and makes the critical section safe with
regard to RT local_lock semantics after the conversion in patch 35.
Patch 34 changes preempt disable to migrate disable, so that the nested
list_lock spinlock is safe to take on RT. Because migrate_disable() is a
function call even on !RT, a small set of private wrappers is introduced
to keep using the cheaper preempt_disable() on !PREEMPT_RT configurations.
As of this patch, SLUB should be compatible with RT's lock semantics, to
the best of my knowledge.
Finally, patch 35 changes irq disabled sections that protect
kmem_cache_cpu fields in the slow paths, with a local lock. However on
PREEMPT_RT it means the lockless fast paths can now preempt slow paths
which don't expect that, so the local lock has to be taken also in the
fast paths and they are no longer lockless. It's up to RT folks to decide
if this is a good tradeoff. The patch also updates the locking
documentation in the file's comment.
The main results of this series:
* irq disabling is only done for the minimum amount of time needed to
protect the kmem_cache_cpu data and as part of spin lock, local lock and
bit lock operations to make them irq-safe
* SLUB should be fully PREEMPT_RT compatible
This should have obvious implications for better preemptibility,
especially on RT.
Some details are different than how the previous SLUB RT tree patches were
implemented:
mm: sl[au]b: Change list_lock to raw_spinlock_t [2] - the SLAB part
can be dropped as a different patch restricts RT to SLUB anyway. And
after this series the list_lock in SLUB is never used with irqs disabled
before taking the lock so it doesn't have to be converted to
raw_spinlock_t.
mm: slub: Move discard_slab() invocations out of IRQ-off sections [3]
should be unnecessary as this series does move these invocations outside
irq disabled sections in a different way.
The remaining patches to upstream from the RT tree are small ones related
to KConfig. The patch that restricts PREEMPT_RT to SLUB (not SLAB or
SLOB) makes sense. The patch that disables CONFIG_SLUB_CPU_PARTIAL with
PREEMPT_RT could perhaps be re-evaluated as the series addresses some
latency issues with it.
slab_debug_trace_open() can only be called on caches with the SLAB_STORE_USER
flag, and as with all slub debugging flags, such caches avoid cpu or percpu
partial slabs altogether, so there's nothing to flush.
Link: https://lkml.kernel.org/r/20210805152000.12817-1-vbabka@suse.cz Link: https://lkml.kernel.org/r/20210805152000.12817-2-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Jann Horn <jannh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Wangyan [Mon, 23 Aug 2021 23:58:53 +0000 (09:58 +1000)]
ocfs2: fix ocfs2 corrupt when iputting an inode
In this condition, it will trigger a BUG on error.
ocfs2_mkdir()
->ocfs2_mknod()
->ocfs2_mknod_locked()
->__ocfs2_mknod_locked()
//Assume inode->i_generation is genN.
->inode->i_generation = osb->s_next_generation++;
// The inode lockres has been initialized.
->ocfs2_populate_inode()
->ocfs2_create_new_inode_locks()
->An error happened, returned value is non-zero
// free the start_bit x in bg_blkno
->ocfs2_free_suballoc_bits()
->... /* Another process execute mkdir success in this place,
and it occupied the start_bit x in bg_blkno
which has been freed before. Its inode->i_generation
is genN + 1 */
->iput(inode)
->evict()
->ocfs2_evict_inode()
->ocfs2_delete_inode()
->ocfs2_inode_lock()
->ocfs2_inode_lock_update()
/* Bug on here, genN != genN + 1 */
->mlog_bug_on_msg(inode->i_generation !=
le32_to_cpu(fe->i_generation))
So, we should not reclaim the inode when inode->ip_inode_lockres has
already been initialized; it will be freed in iput().
Link: http://lkml.kernel.org/r/ef080ca3-5d74-e276-17a1-d9e7c7e662c9@huawei.com Fixes: b1529a41f777 ("ocfs2: should reclaim the inode if '__ocfs2_mknod_locked' returns an error") Signed-off-by: Yan Wang <wangyan122@huawei.com> Reviewed-by: Jun Piao <piaojun@huawei.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Wangyan [Mon, 23 Aug 2021 23:58:53 +0000 (09:58 +1000)]
ocfs2: clear links count in ocfs2_mknod() if an error occurs
In this condition, the inode cannot be wiped when an error happens.
ocfs2_mkdir()
->ocfs2_mknod()
->ocfs2_mknod_locked()
->__ocfs2_mknod_locked()
->ocfs2_set_links_count() // i_links_count is 2
-> ... // an error occurs, goto roll_back or leave.
->ocfs2_commit_trans()
->iput(inode)
->evict()
->ocfs2_evict_inode()
->ocfs2_delete_inode()
->ocfs2_inode_lock()
->ocfs2_inode_lock_update()
->ocfs2_refresh_inode()
->set_nlink(); // inode->i_nlink is 2 now.
/* if wipe is 0, it will goto bail_unlock_inode */
->ocfs2_query_inode_wipe()
->if (inode->i_nlink) return; // wipe is 0.
/* inode can not be wiped */
->ocfs2_wipe_inode()
So, we need to clear the links count before the transaction is committed.
Link: http://lkml.kernel.org/r/d8147c41-fb2b-bdf7-b660-1f3c8448c33f@huawei.com Signed-off-by: Yan Wang <wangyan122@huawei.com> Reviewed-by: Jun Piao <piaojun@huawei.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Gang He [Mon, 23 Aug 2021 23:58:53 +0000 (09:58 +1000)]
ocfs2: reflink deadlock when clone file to the same directory simultaneously
Running reflink from multiple nodes simultaneously to clone a file to the
same directory can trigger a deadlock. For example, there is
a three node ocfs2 cluster, each node mounts the ocfs2 file system to
/mnt/shared, and run the reflink command from each node repeatedly, like
reflink "/mnt/shared/test" \
"/mnt/shared/.snapshots/test.`date +%m%d%H%M%S`.`hostname`"
then the reflink processes hang on each node, and you
can no longer list this file system directory.
The problematic reflink command process is blocked at one node,
task:reflink state:D stack: 0 pid: 1283 ppid: 4154
Call Trace:
__schedule+0x2fd/0x750
schedule+0x2f/0xa0
schedule_timeout+0x1cc/0x310
? ocfs2_control_cfu+0x50/0x50 [ocfs2_stack_user]
? 0xffffffffc0e3e000
wait_for_completion+0xba/0x140
? wake_up_q+0xa0/0xa0
__ocfs2_cluster_lock.isra.41+0x3b5/0x820 [ocfs2]
? ocfs2_inode_lock_full_nested+0x1fc/0x960 [ocfs2]
ocfs2_inode_lock_full_nested+0x1fc/0x960 [ocfs2]
ocfs2_init_security_and_acl+0xbe/0x1d0 [ocfs2]
ocfs2_reflink+0x436/0x4c0 [ocfs2]
? ocfs2_reflink_ioctl+0x2ca/0x360 [ocfs2]
ocfs2_reflink_ioctl+0x2ca/0x360 [ocfs2]
ocfs2_ioctl+0x25e/0x670 [ocfs2]
do_vfs_ioctl+0xa0/0x680
ksys_ioctl+0x70/0x80
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x5b/0x1e0
The other reflink command processes are blocked at other nodes,
task:reflink state:D stack: 0 pid:29759 ppid: 4088
Call Trace:
__schedule+0x2fd/0x750
schedule+0x2f/0xa0
schedule_timeout+0x1cc/0x310
? ocfs2_control_cfu+0x50/0x50 [ocfs2_stack_user]
? 0xffffffffc0b19000
wait_for_completion+0xba/0x140
? wake_up_q+0xa0/0xa0
__ocfs2_cluster_lock.isra.41+0x3b5/0x820 [ocfs2]
? ocfs2_inode_lock_full_nested+0x1fc/0x960 [ocfs2]
ocfs2_inode_lock_full_nested+0x1fc/0x960 [ocfs2]
ocfs2_mv_orphaned_inode_to_new+0x87/0x7e0 [ocfs2]
ocfs2_reflink+0x335/0x4c0 [ocfs2]
? ocfs2_reflink_ioctl+0x2ca/0x360 [ocfs2]
ocfs2_reflink_ioctl+0x2ca/0x360 [ocfs2]
ocfs2_ioctl+0x25e/0x670 [ocfs2]
do_vfs_ioctl+0xa0/0x680
ksys_ioctl+0x70/0x80
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x5b/0x1e0
or
task:reflink state:D stack: 0 pid:18465 ppid: 4156
Call Trace:
__schedule+0x302/0x940
? usleep_range+0x80/0x80
schedule+0x46/0xb0
schedule_timeout+0xff/0x140
? ocfs2_control_cfu+0x50/0x50 [ocfs2_stack_user]
? 0xffffffffc0c3b000
__wait_for_common+0xb9/0x170
__ocfs2_cluster_lock.constprop.0+0x1d6/0x860 [ocfs2]
? ocfs2_wait_for_recovery+0x49/0xd0 [ocfs2]
? ocfs2_inode_lock_full_nested+0x30f/0xa50 [ocfs2]
ocfs2_inode_lock_full_nested+0x30f/0xa50 [ocfs2]
ocfs2_inode_lock_tracker+0xf2/0x2b0 [ocfs2]
? dput+0x32/0x2f0
ocfs2_permission+0x45/0xe0 [ocfs2]
inode_permission+0xcc/0x170
link_path_walk.part.0.constprop.0+0x2a2/0x380
? path_init+0x2c1/0x3f0
path_parentat+0x3c/0x90
filename_parentat+0xc1/0x1d0
? filename_lookup+0x138/0x1c0
filename_create+0x43/0x160
ocfs2_reflink_ioctl+0xe6/0x380 [ocfs2]
ocfs2_ioctl+0x1ea/0x2c0 [ocfs2]
? do_sys_openat2+0x81/0x150
__x64_sys_ioctl+0x82/0xb0
do_syscall_64+0x61/0xb0
The deadlock is caused by acquiring the destination directory inode dlm
lock multiple times in the ocfs2_reflink function; we should acquire this
directory inode dlm lock once at the beginning, and hold it until the end
of the function.
Link: https://lkml.kernel.org/r/20210729110230.18983-1-ghe@suse.com Signed-off-by: Gang He <ghe@suse.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Tuo Li [Mon, 23 Aug 2021 23:58:53 +0000 (09:58 +1000)]
ocfs2: quota_local: fix possible uninitialized-variable access in ocfs2_local_read_info()
A memory block is allocated through kmalloc(), and its return value is
assigned to the pointer oinfo. However, oinfo->dqi_gqinode is not
initialized before it is accessed in:
iput(oinfo->dqi_gqinode);
To fix this possible uninitialized-variable access, assign NULL to
oinfo->dqi_gqinode, and add ocfs2_qinfo_lock_res_init() after that
assignment in ocfs2_local_read_info(). Remove ocfs2_qinfo_lock_res_init()
from ocfs2_global_read_info().
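A sketch of the resulting ordering in ocfs2_local_read_info() (field and function names taken from the description; the exact fix may differ):

  oinfo = kmalloc(sizeof(struct ocfs2_mem_dqinfo), GFP_NOFS);
  if (!oinfo) {
      status = -ENOMEM;
      goto out_err;
  }
  /* make the error paths safe: iput(NULL) is a no-op */
  oinfo->dqi_gqinode = NULL;
  ocfs2_qinfo_lock_res_init(&oinfo->dqi_gqlock, oinfo);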
Link: https://lkml.kernel.org/r/20210804031832.57154-1-islituo@gmail.com Signed-off-by: Tuo Li <islituo@gmail.com> Reported-by: TOTE Robot <oslab@tsinghua.edu.cn> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Dan Carpenter [Mon, 23 Aug 2021 23:58:52 +0000 (09:58 +1000)]
ocfs2: remove an unnecessary condition
The case where "tmp_oh" is NULL is handled at the start of the function.
At this point we know it's non-NULL so this will always return 1.
Link: https://lkml.kernel.org/r/YOcItgIXtisi3MaO@mwanda Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Cc: Larry Chen <lchen@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Kalesh Singh [Mon, 23 Aug 2021 23:58:52 +0000 (09:58 +1000)]
procfs: prevent unprivileged processes accessing fdinfo dir
The file permissions on the fdinfo dir were changed from
S_IRUSR|S_IXUSR to S_IRUGO|S_IXUGO, and a PTRACE_MODE_READ check was added
for opening the fdinfo files [1]. However, the ptrace permission check
was not added to the directory, allowing anyone to get the open FD numbers
by reading the fdinfo directory.
Add the missing ptrace permission check for opening the fdinfo directory.
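A minimal sketch of such a check on the directory open path (the helper name and exact hook are illustrative, not necessarily what the patch uses):

  static int proc_fdinfo_dir_open(struct inode *inode, struct file *file)
  {
      struct task_struct *task = get_proc_task(inode);
      bool allowed;

      if (!task)
          return -ESRCH;
      /* same PTRACE_MODE_READ-style check already applied to fdinfo files */
      allowed = ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS);
      put_task_struct(task);

      return allowed ? 0 : -EACCES;
  }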
The reason for the panic is that stable_page_flags(), which parses the page
flags, uses uninitialized struct pages reserved by the ZONE_DEVICE driver.
Earlier approach to fix this was discussed here:
https://marc.info/?l=linux-mm&m=152964770000672&w=2
This is another approach. To avoid using the uninitialized struct page,
immediately return with KPF_RESERVED at the beginning of
stable_page_flags() if the page is reserved by a ZONE_DEVICE driver.
Dan said:
: The nvdimm implementation uses vmem_altmap to arrange for the 'struct
: page' array to be allocated from a reservation of a pmem namespace. A
: namespace in this mode contains an info-block that consumes the first
: 8K of the namespace capacity, capacity designated for page mapping,
: capacity for padding the start of data to optionally 4K, 2MB, or 1GB
: (on x86), and then the namespace data itself. The implementation
: specifies a section aligned (now sub-section aligned) address to
: arch_add_memory() to establish the linear mapping to map the metadata,
: and then vmem_altmap indicates to memmap_init_zone() which pfns
: represent data. The implementation only specifies enough 'struct page'
: capacity for pfn_to_page() to operate on the data space, not the
: namespace metadata space.
:
: The proposal to validate ZONE_DEVICE pfns against the altmap seems the
: right approach to me.
Link: http://lkml.kernel.org/r/20190725023100.31141-3-t-fukasawa@vx.jp.nec.com Signed-off-by: Toshiki Fukasawa <t-fukasawa@vx.jp.nec.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Junichi Nomura <j-nomura@ce.jp.nec.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Toshiki Fukasawa [Mon, 23 Aug 2021 23:58:52 +0000 (09:58 +1000)]
/proc/kpageflags: prevent an integer overflow in stable_page_flags()
stable_page_flags() returns kpageflags info in u64, but it uses "1 <<
KPF_*" internally, which is evaluated as int. This type mismatch causes
no visible problem now, but it will if bit 32 or higher is set, as done in a
subsequent patch. So use BIT_ULL in order to avoid future overflow
issues.
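The difference, illustrated with a hypothetical flag bit above 31 (KPF_EXAMPLE is not a real constant):

  u64 u = 0;

  u |= 1 << KPF_EXAMPLE;      /* int shift: wrong (undefined) once the bit is >= 32 */
  u |= BIT_ULL(KPF_EXAMPLE);  /* 1ULL << nr: correct for bits 0..63 */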
Link: http://lkml.kernel.org/r/20190725023100.31141-2-t-fukasawa@vx.jp.nec.com Signed-off-by: Toshiki Fukasawa <t-fukasawa@vx.jp.nec.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Junichi Nomura <j-nomura@ce.jp.nec.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>