If the -Wno-maybe-uninitialized gcc option is not specified, compilation
of memcontrol.c may generate the following warnings:
mm/memcontrol.c: In function `refill_obj_stock':
./arch/x86/include/asm/irqflags.h:127:17: warning: `flags' may be used uninitialized in this function [-Wmaybe-uninitialized]
return !(flags & X86_EFLAGS_IF);
~~~~~~~^~~~~~~~~~~~~~~~
mm/memcontrol.c:3216:16: note: `flags' was declared here
unsigned long flags;
^~~~~
In file included from mm/memcontrol.c:29:
mm/memcontrol.c: In function `uncharge_page':
./include/linux/memcontrol.h:797:2: warning: `objcg' may be used uninitialized in this function [-Wmaybe-uninitialized]
percpu_ref_put(&objcg->refcnt);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Fix that by properly initializing *pflags in get_obj_stock() and
introducing a use_objcg bool variable in uncharge_page() to avoid
potentially accessing the struct page data twice.
Link: https://lkml.kernel.org/r/20210526193602.8742-1-longman@redhat.com Signed-off-by: Waiman Long <longman@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
WARNING: else is not generally useful after a break or return
#138: FILE: mm/memcontrol.c:2121:
+ return &stock->task_obj;
+ } else {
total: 0 errors, 1 warnings, 193 lines checked
NOTE: For some of the reported defects, checkpatch may be able to
mechanically convert to the typical style using --fix or --fix-inplace.
./patches/mm-memcg-optimize-user-context-object-stock-access.patch has style problems, please review.
NOTE: If any of the errors are false positives, please report
them to the maintainer, see CHECKPATCH in MAINTAINERS.
Please run checkpatch prior to sending patches
Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Chris Down <chris@chrisdown.name> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Masayoshi Mizuma <msys.mizuma@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Waiman Long <longman@redhat.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Xing Zhengjun <zhengjun.xing@linux.intel.com> Cc: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Waiman Long [Wed, 2 Jun 2021 03:52:13 +0000 (13:52 +1000)]
mm/memcg: optimize user context object stock access
Most kmem_cache_alloc() calls are from user context. With instrumentation
enabled, the measured amount of kmem_cache_alloc() calls from non-task
context was about 0.01% of the total.
The irq disable/enable sequence used in this case to access content from
object stock is slow. To optimize for user context access, there are now
two sets of object stocks (in the new obj_stock structure) for task
context and interrupt context access respectively.
The task context object stock can be accessed after disabling preemption
which is cheap in non-preempt kernel. The interrupt context object stock
can only be accessed after disabling interrupt. User context code can
access interrupt object stock, but not vice versa.
The downside of this change is that there are more data stored in local
object stocks and not reflected in the charge counter and the vmstat
arrays. However, this is a small price to pay for better performance.
Link: https://lkml.kernel.org/r/20210506150007.16288-5-longman@redhat.com Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Roman Gushchin <guro@fb.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Roman Gushchin <guro@fb.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Chris Down <chris@chrisdown.name> Cc: Yafang Shao <laoar.shao@gmail.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Masayoshi Mizuma <msys.mizuma@gmail.com> Cc: Xing Zhengjun <zhengjun.xing@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Waiman Long [Wed, 2 Jun 2021 03:52:12 +0000 (13:52 +1000)]
mm/memcg: improve refill_obj_stock() performance
There are two issues with the current refill_obj_stock() code. First of
all, when nr_bytes reaches over PAGE_SIZE, it calls drain_obj_stock() to
atomically flush out remaining bytes to obj_cgroup, clear cached_objcg and
do a obj_cgroup_put(). It is likely that the same obj_cgroup will be used
again which leads to another call to drain_obj_stock() and
obj_cgroup_get() as well as atomically retrieve the available byte from
obj_cgroup. That is costly. Instead, we should just uncharge the excess
pages, reduce the stock bytes and be done with it. The drain_obj_stock()
function should only be called when obj_cgroup changes.
Secondly, when charging an object of size not less than a page in
obj_cgroup_charge(), it is possible that the remaining bytes to be
refilled to the stock will overflow a page and cause refill_obj_stock() to
uncharge 1 page. To avoid the additional uncharge in this case, a new
allow_uncharge flag is added to refill_obj_stock() which will be set to
false when called from obj_cgroup_charge() so that an uncharge_pages()
call won't be issued right after a charge_pages() call unless the objcg
changes.
A multithreaded kmalloc+kfree microbenchmark on a 2-socket 48-core
96-thread x86-64 system with 96 testing threads were run. Before this
patch, the total number of kilo kmalloc+kfree operations done for a 4k
large object by all the testing threads per second were 4,304 kops/s
(cgroup v1) and 8,478 kops/s (cgroup v2). After applying this patch, the
number were 4,731 (cgroup v1) and 418,142 (cgroup v2) respectively. This
represents a performance improvement of 1.10X (cgroup v1) and 49.3X
(cgroup v2).
Link: https://lkml.kernel.org/r/20210506150007.16288-4-longman@redhat.com Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Chris Down <chris@chrisdown.name> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Masayoshi Mizuma <msys.mizuma@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Xing Zhengjun <zhengjun.xing@linux.intel.com> Cc: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Waiman Long [Wed, 2 Jun 2021 03:52:12 +0000 (13:52 +1000)]
mm/memcg: cache vmstat data in percpu memcg_stock_pcp
Before the new slab memory controller with per object byte charging,
charging and vmstat data update happen only when new slab pages are
allocated or freed. Now they are done with every kmem_cache_alloc() and
kmem_cache_free(). This causes additional overhead for workloads that
generate a lot of alloc and free calls.
The memcg_stock_pcp is used to cache byte charge for a specific obj_cgroup
to reduce that overhead. To further reducing it, this patch makes the
vmstat data cached in the memcg_stock_pcp structure as well until it
accumulates a page size worth of update or when other cached data change.
Caching the vmstat data in the per-cpu stock eliminates two writes to
non-hot cachelines for memcg specific as well as memcg-lruvecs specific
vmstat data by a write to a hot local stock cacheline.
On a 2-socket Cascade Lake server with instrumentation enabled and this
patch applied, it was found that about 20% (634400 out of 3243830) of the
time when mod_objcg_state() is called leads to an actual call to
__mod_objcg_state() after initial boot. When doing parallel kernel build,
the figure was about 17% (24329265 out of 142512465). So caching the
vmstat data reduces the number of calls to __mod_objcg_state() by more
than 80%.
Link: https://lkml.kernel.org/r/20210506150007.16288-3-longman@redhat.com Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Chris Down <chris@chrisdown.name> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Masayoshi Mizuma <msys.mizuma@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Xing Zhengjun <zhengjun.xing@linux.intel.com> Cc: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Waiman Long [Wed, 2 Jun 2021 03:52:12 +0000 (13:52 +1000)]
mm/memcg: move mod_objcg_state() to memcontrol.c
Patch series "mm/memcg: Reduce kmemcache memory accounting overhead", v6.
With the recent introduction of the new slab memory controller, we
eliminate the need for having separate kmemcaches for each memory cgroup
and reduce overall kernel memory usage. However, we also add additional
memory accounting overhead to each call of kmem_cache_alloc() and
kmem_cache_free().
For workloads that require a lot of kmemcache allocations and
de-allocations, they may experience performance regression as illustrated
in [1] and [2].
A simple kernel module that performs repeated loop of 100,000,000
kmem_cache_alloc() and kmem_cache_free() of either a small 32-byte object
or a big 4k object at module init time with a batch size of 4 (4 kmalloc's
followed by 4 kfree's) is used for benchmarking. The benchmarking tool
was run on a kernel based on linux-next-20210419. The test was run on a
CascadeLake server with turbo-boosting disable to reduce run-to-run
variation.
The small object test exercises mainly the object stock charging and
vmstat update code paths. The large object test also exercises the
refill_obj_stock() and __memcg_kmem_charge()/__memcg_kmem_uncharge() code
paths.
With memory accounting disabled, the run time was 3.130s with both small
object big object tests.
With memory accounting enabled, both cgroup v1 and v2 showed similar
results in the small object test. The performance results of the large
object test, however, differed between cgroup v1 and v2.
The execution times with the application of various patches in the
patchset were:
Applied patches Run time Accounting overhead %age 1 %age 2
--------------- -------- ------------------- ------ ------
Patch 2 (vmstat data stock caching) helps in both the small object test
and the large v2 object test. It doesn't help much in v1 big object test.
Patch 3 (refill_obj_stock improvement) does help the small object test
but offer significant performance improvement for the large object test
(both v1 and v2).
Patch 4 (eliminating irq disable/enable) helps in all test cases.
To test for the extreme case, a multi-threaded kmalloc/kfree
microbenchmark was run on the 2-socket 48-core 96-thread system with
96 testing threads in the same memcg doing kmalloc+kfree of a 4k object
with accounting enabled for 10s. The total number of kmalloc+kfree done
in kilo operations per second (kops/s) were as follows:
With memory accounting disabled, the kmalloc/kfree rate was 1,481,291
kop/s. This test shows how significant the memory accouting overhead
can be in some extreme situations.
For this multithreaded test, the improvement from patch 2 mainly
comes from the conditional atomic xchg of objcg->nr_charged_bytes in
mod_objcg_state(). By using an unconditional xchg, the operation rates
were similar to the unpatched kernel.
Patch 3 elminates the single highly contended cacheline of
objcg->nr_charged_bytes for cgroup v2 leading to a huge performance
improvement. Cgroup v1, however, still has another highly contended
cacheline in the shared page counter &memcg->kmem. So the improvement
is only modest.
Patch 4 helps in cgroup v2, but performs worse in cgroup v1 as
eliminating the irq_disable/irq_enable overhead seems to aggravate the
cacheline contention.
mod_objcg_state() is moved from mm/slab.h to mm/memcontrol.c so that
further optimization can be done to it in later patches without exposing
unnecessary details to other mm components.
Link: https://lkml.kernel.org/r/20210506150007.16288-1-longman@redhat.com Link: https://lkml.kernel.org/r/20210506150007.16288-2-longman@redhat.com Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <guro@fb.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Chris Down <chris@chrisdown.name> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Masayoshi Mizuma <msys.mizuma@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Xing Zhengjun <zhengjun.xing@linux.intel.com> Cc: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Huang Ying [Wed, 2 Jun 2021 03:52:12 +0000 (13:52 +1000)]
mm: free idle swap cache page after COW
With commit 09854ba94c6a ("mm: do_wp_page() simplification"), after COW,
the idle swap cache page (neither the page nor the corresponding swap
entry is mapped by any process) will be left in the LRU list, even if it's
in the active list or the head of the inactive list. So, the page
reclaimer may take quite some overhead to reclaim these actually unused
pages.
To help the page reclaiming, in this patch, after COW, the idle swap cache
page will be tried to be freed. To avoid to introduce much overhead to
the hot COW code path,
a) there's almost zero overhead for non-swap case via checking
PageSwapCache() firstly.
b) the page lock is acquired via trylock only.
To test the patch, we used pmbench memory accessing benchmark with
working-set larger than available memory on a 2-socket Intel server with a
NVMe SSD as swap device. Test results shows that the pmbench score
increases up to 23.8% with the decreased size of swap cache and swapin
throughput.
Link: https://lkml.kernel.org/r/20210601053143.1380078-1-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> [use free_swap_cache()] Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Xu <peterx@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@surriel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Tim Chen <tim.c.chen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Huang Ying [Wed, 2 Jun 2021 03:52:12 +0000 (13:52 +1000)]
mm, swap: remove unnecessary smp_rmb() in swap_type_to_swap_info()
Before commit c10d38cc8d3e ("mm, swap: bounds check swap_info array
accesses to avoid NULL derefs"), the typical code to reference the
swap_info[] is as follows,
type = swp_type(swp_entry);
if (type >= nr_swapfiles)
/* handle invalid swp_entry */;
p = swap_info[type];
/* access fields of *p. OOPS! p may be NULL! */
Because the ordering isn't guaranteed, it's possible that swap_info[type]
is read before "nr_swapfiles". And that may result in NULL pointer
dereference.
/* users */
type = swp_type(swp_entry);
p = swap_type_to_swap_info(type);
if (!p)
/* handle invalid swp_entry */;
/* dereference p */
Where the value of swap_info[type] (that is, "p") is checked to be
non-zero before being dereferenced. So, the NULL deferencing becomes
impossible even if "nr_swapfiles" is read after swap_info[type].
Therefore, the "smp_rmb()" becomes unnecessary.
And, we don't even need to read "nr_swapfiles" here. Because the non-zero
checking for "p" is sufficient. We just need to make sure we will not
access out of the boundary of the array. With the change, nr_swapfiles
will only be accessed with swap_lock held, except in
swapcache_free_entries(). Where the absolute correctness of the value
isn't needed, as described in the comments.
We still need to guarantee swap_info[type] is read before being
dereferenced. That can be satisfied via the data dependency ordering
enforced by READ_ONCE(swap_info[type]). This needs to be paired with
proper write barriers. So smp_store_release() is used in
alloc_swap_info() to guarantee the fields of *swap_info[type] is
initialized before swap_info[type] itself being written. Note that the
fields of *swap_info[type] is initialized to be 0 via kvzalloc() firstly.
The assignment and deferencing of swap_info[type] is like
rcu_assign_pointer() and rcu_dereference().
Link: https://lkml.kernel.org/r/20210520073301.1676294-1-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Andrea Parri <andrea.parri@amarulasolutions.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Omar Sandoval <osandov@fb.com> Cc: Paul McKenney <paulmck@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
deactivate_swap_slots_cache() and reactivate_swap_slots_cache() are only
called below their implementations. So these forward declarations are
meaningless and should be removed.
Link: https://lkml.kernel.org/r/20210520134022.1370406-4-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Miaohe Lin [Wed, 2 Jun 2021 03:52:11 +0000 (13:52 +1000)]
mm/swapfile: move scan_swap_map() under CONFIG_HIBERNATION
We should move scan_swap_map() under CONFIG_HIBERNATION since the only
caller of this function is get_swap_page_of_type() which is also under
CONFIG_HIBERNATION. And this fixes the unused-function warning of
scan_swap_map().
Miaohe Lin [Wed, 2 Jun 2021 03:52:10 +0000 (13:52 +1000)]
mm/swapfile: move get_swap_page_of_type() under CONFIG_HIBERNATION
Patch series "Cleanups for swap", v2.
This series contains just cleanups to remove some unused variables, delete
meaningless forward declarations and so on. More details can be found in
the respective changelogs.
This patch (of 4):
We should move get_swap_page_of_type() under CONFIG_HIBERNATION since the
only caller of this function is now suspend routine.
Miaohe Lin [Wed, 2 Jun 2021 03:52:10 +0000 (13:52 +1000)]
mm/shmem: fix shmem_swapin() race with swapoff
When I was investigating the swap code, I found the below possible race
window:
CPU 1 CPU 2
----- -----
shmem_swapin
swap_cluster_readahead
if (likely(si->flags & (SWP_BLKDEV | SWP_FS_OPS))) {
swapoff
..
si->swap_file = NULL;
..
struct inode *inode = si->swap_file->f_mapping->host;[oops!]
Close this race window by using get/put_swap_device() to guard against
concurrent swapoff.
Link: https://lkml.kernel.org/r/20210426123316.806267-5-linmiaohe@huawei.com Fixes: 8fd2e0b505d1 ("mm: swap: check if swap backing device is congested or not") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Alex Shi <alexs@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Yang Shi <shy828301@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Miaohe Lin [Wed, 2 Jun 2021 03:52:10 +0000 (13:52 +1000)]
mm/swap: remove confusing checking for non_swap_entry() in swap_ra_info()
The non_swap_entry() was used for working with VMA based swap readahead
via commit ec560175c0b6 ("mm, swap: VMA based swap readahead"). At that
time, the non_swap_entry() checking is necessary because the function is
called before checking that in do_swap_page(). Then it's moved to
swap_ra_info() since commit eaf649ebc3ac ("mm: swap: clean up swap
readahead"). After that, the non_swap_entry() checking is unnecessary,
because swap_ra_info() is called after non_swap_entry() has been checked
already. The resulting code is confusing as the non_swap_entry() check
looks racy now because while we released the pte lock, somebody else might
have faulted in this pte. So we should check whether it's swap pte first
to guard against such race or swap_type will be unexpected. But the race
isn't important because it will not cause problem. We would have enough
checking when we really operate the PTE entries later. So we remove the
non_swap_entry() check here to avoid confusion.
Link: https://lkml.kernel.org/r/20210426123316.806267-4-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Alex Shi <alexs@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Miaohe Lin [Wed, 2 Jun 2021 03:52:10 +0000 (13:52 +1000)]
swap: fix do_swap_page() race with swapoff
When I was investigating the swap code, I found the below possible race
window:
CPU 1 CPU 2
----- -----
do_swap_page
if (data_race(si->flags & SWP_SYNCHRONOUS_IO)
swap_readpage
if (data_race(sis->flags & SWP_FS_OPS)) {
swapoff
..
p->swap_file = NULL;
..
struct file *swap_file = sis->swap_file;
struct address_space *mapping = swap_file->f_mapping;[oops!]
Note that for the pages that are swapped in through swap cache, this isn't
an issue. Because the page is locked, and the swap entry will be marked
with SWAP_HAS_CACHE, so swapoff() can not proceed until the page has been
unlocked.
Fix this race by using get/put_swap_device() to guard against concurrent
swapoff.
Link: https://lkml.kernel.org/r/20210426123316.806267-3-linmiaohe@huawei.com Fixes: 0bcac06f27d7 ("mm,swap: skip swapcache for swapin of synchronous device") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Alex Shi <alexs@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Miaohe Lin [Wed, 2 Jun 2021 03:52:10 +0000 (13:52 +1000)]
mm/swapfile: use percpu_ref to serialize against concurrent swapoff
Patch series "close various race windows for swap", v6.
When I was investigating the swap code, I found some possible race
windows. This series aims to fix all these races. But using current
get/put_swap_device() to guard against concurrent swapoff for
swap_readpage() looks terrible because swap_readpage() may take really
long time. And to reduce the performance overhead on the hot-path as much
as possible, it appears we can use the percpu_ref to close this race
window(as suggested by Huang, Ying). The patch 1 adds percpu_ref support
for swap and most of the remaining patches try to use this to close
various race windows. More details can be found in the respective
changelogs.
This patch (of 4):
Using current get/put_swap_device() to guard against concurrent swapoff
for some swap ops, e.g. swap_readpage(), looks terrible because they
might take really long time. This patch adds the percpu_ref support to
serialize against concurrent swapoff(as suggested by Huang, Ying). Also
we remove the SWP_VALID flag because it's used together with RCU solution.
Link: https://lkml.kernel.org/r/20210426123316.806267-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20210426123316.806267-2-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Alex Shi <alexs@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Peter Xu [Wed, 2 Jun 2021 03:52:09 +0000 (13:52 +1000)]
fixup! mm: gup: pack has_pinned in MMF_HAS_PINNED
This fixes build issue with !CONFIG_MMU.
Link: https://lkml.kernel.org/r/YJqWESqyxa8OZA+2@t490s Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
WARNING: Possible unwrapped commit description (prefer a maximum 75 chars per line)
#25:
[peterx@redhat.com: fix build for task_mmu.c, introduce mm_set_has_pinned_flag, fix comments]
WARNING: please, no spaces at the start of a line
#130: FILE: mm/gup.c:1280:
+ if (!test_bit(MMF_HAS_PINNED, mm_flags))$
WARNING: suspect code indent for conditional statements (7, 15)
#130: FILE: mm/gup.c:1280:
+ if (!test_bit(MMF_HAS_PINNED, mm_flags))
+ set_bit(MMF_HAS_PINNED, mm_flags);
ERROR: code indent should use tabs where possible
#131: FILE: mm/gup.c:1281:
+ set_bit(MMF_HAS_PINNED, mm_flags);$
WARNING: please, no spaces at the start of a line
#131: FILE: mm/gup.c:1281:
+ set_bit(MMF_HAS_PINNED, mm_flags);$
total: 1 errors, 4 warnings, 90 lines checked
NOTE: For some of the reported defects, checkpatch may be able to
mechanically convert to the typical style using --fix or --fix-inplace.
NOTE: Whitespace errors detected.
You may wish to use scripts/cleanpatch or scripts/cleanfile
./patches/mm-gup-pack-has_pinned-in-mmf_has_pinned.patch has style problems, please review.
NOTE: If any of the errors are false positives, please report
them to the maintainer, see CHECKPATCH in MAINTAINERS.
Please run checkpatch prior to sending patches
Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrea Arcangeli [Wed, 2 Jun 2021 03:52:09 +0000 (13:52 +1000)]
mm: gup: pack has_pinned in MMF_HAS_PINNED
has_pinned 32bit can be packed in the MMF_HAS_PINNED bit as a noop
cleanup.
Any atomic_inc/dec to the mm cacheline shared by all threads in pin-fast
would reintroduce a loss of SMP scalability to pin-fast, so there's no
future potential usefulness to keep an atomic in the mm for this.
set_bit(MMF_HAS_PINNED) will be theoretically a bit slower than WRITE_ONCE
(atomic_set is equivalent to WRITE_ONCE), but the set_bit (just like
atomic_set after this commit) has to be still issued only once per "mm",
so the difference between the two will be lost in the noise.
will-it-scale "mmap2" shows no change in performance with enterprise
config as expected.
will-it-scale "pin_fast" retains the > 4000% SMP scalability performance
improvement against upstream as expected.
This is a noop as far as overall performance and SMP scalability are
concerned.
[peterx@redhat.com: fix build for task_mmu.c, introduce mm_set_has_pinned_flag, fix comments] Link: https://lkml.kernel.org/r/20210507150553.208763-4-peterx@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Kirill Shutemov <kirill@shutemov.name> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrea Arcangeli [Wed, 2 Jun 2021 03:52:09 +0000 (13:52 +1000)]
mm: gup: allow FOLL_PIN to scale in SMP
has_pinned cannot be written by each pin-fast or it won't scale in SMP.
This isn't "false sharing" strictly speaking (it's more like "true
non-sharing"), but it creates the same SMP scalability bottleneck of
"false sharing".
To verify the improvement, below test is done on 40 cpus host with
Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz (must be with
CONFIG_GUP_TEST=y):
$ sudo chrt -f 1 ./gup_test -a -m 512 -j 40
Where we can get (average value for 40 threads):
Old kernel: 477729.97 (+- 3.79%)
New kernel: 89144.65 (+-11.76%)
On a similar condition with 256 cpus, this commits increases the SMP
scalability of pin_user_pages_fast() executed by different threads of the
same process by more than 4000%.
[peterx@redhat.com: rewrite commit message, add parentheses against "(A & B)"] Link: https://lkml.kernel.org/r/20210507150553.208763-3-peterx@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Kirill Shutemov <kirill@shutemov.name> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Peter Xu [Wed, 2 Jun 2021 03:52:09 +0000 (13:52 +1000)]
mm/gup_benchmark: support threading
Patch series "mm/gup: Fix pin page write cache bouncing on has_pinned", v2.
This series contains 3 patches, the 1st one enables threading for
gup_benchmark in the kselftest. The latter two patches are collected from
Andrea's local branch which can fix write cache bouncing issue with
pinning fast-gup.
To be explicit on the latter two patches:
- the 2nd patch fixes the perf degrade when introducing has_pinned, then
- the last patch tries to remove the has_pinned with a bit in mm->flags
For patch 3: originally I think we had a plan to reuse has_pinned into a
counter very soon, however that's not happening at least until today, so
maybe it proves that we can remove it until we really want such a counter
for whatever reason. As the commit message stated, it saves 4 bytes for
each mm without observable regressions.
Regarding testing: we can reference to the commit message of patch 2 for
some detailed testing with will-is-scale. Meanwhile I did patch 1 just
because then we can even easily verify the patchset using the existing
kselftest facilities or even regress test it in the future with the repo
if we want.
Below numbers are extra verification tests that I did besides commit
message of patch 2 using the new gup_benchmark and 256 cpus. Below test
is done on 40 cpus host with Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz,
and I can get similar result (of course the write cache bouncing get
severe with even more cores).
After patch 1 applied (only test patch, so using old kernel):
$ sudo chrt -f 1 ./gup_test -a -m 512 -j 40
PIN_FAST_BENCHMARK: Time: get:459632 put:5990 us
PIN_FAST_BENCHMARK: Time: get:461967 put:5840 us
PIN_FAST_BENCHMARK: Time: get:464521 put:6140 us
PIN_FAST_BENCHMARK: Time: get:465176 put:7100 us
PIN_FAST_BENCHMARK: Time: get:465960 put:6733 us
PIN_FAST_BENCHMARK: Time: get:465324 put:6781 us
PIN_FAST_BENCHMARK: Time: get:466018 put:7130 us
PIN_FAST_BENCHMARK: Time: get:466362 put:7118 us
PIN_FAST_BENCHMARK: Time: get:465118 put:6975 us
PIN_FAST_BENCHMARK: Time: get:466422 put:6602 us
PIN_FAST_BENCHMARK: Time: get:465791 put:6818 us
PIN_FAST_BENCHMARK: Time: get:467091 put:6298 us
PIN_FAST_BENCHMARK: Time: get:467694 put:5432 us
PIN_FAST_BENCHMARK: Time: get:469575 put:5581 us
PIN_FAST_BENCHMARK: Time: get:468124 put:6055 us
PIN_FAST_BENCHMARK: Time: get:468877 put:6720 us
PIN_FAST_BENCHMARK: Time: get:467212 put:4961 us
PIN_FAST_BENCHMARK: Time: get:467834 put:6697 us
PIN_FAST_BENCHMARK: Time: get:470778 put:6398 us
PIN_FAST_BENCHMARK: Time: get:469788 put:6310 us
PIN_FAST_BENCHMARK: Time: get:488277 put:7113 us
PIN_FAST_BENCHMARK: Time: get:486613 put:7085 us
PIN_FAST_BENCHMARK: Time: get:486940 put:7202 us
PIN_FAST_BENCHMARK: Time: get:488728 put:7101 us
PIN_FAST_BENCHMARK: Time: get:487570 put:7327 us
PIN_FAST_BENCHMARK: Time: get:489260 put:7027 us
PIN_FAST_BENCHMARK: Time: get:488846 put:6866 us
PIN_FAST_BENCHMARK: Time: get:488521 put:6745 us
PIN_FAST_BENCHMARK: Time: get:489950 put:6459 us
PIN_FAST_BENCHMARK: Time: get:489777 put:6617 us
PIN_FAST_BENCHMARK: Time: get:488224 put:6591 us
PIN_FAST_BENCHMARK: Time: get:488644 put:6477 us
PIN_FAST_BENCHMARK: Time: get:488754 put:6711 us
PIN_FAST_BENCHMARK: Time: get:488875 put:6743 us
PIN_FAST_BENCHMARK: Time: get:489290 put:6657 us
PIN_FAST_BENCHMARK: Time: get:490264 put:6684 us
PIN_FAST_BENCHMARK: Time: get:489631 put:6737 us
PIN_FAST_BENCHMARK: Time: get:488434 put:6655 us
PIN_FAST_BENCHMARK: Time: get:492213 put:6297 us
PIN_FAST_BENCHMARK: Time: get:491124 put:6173 us
After the whole series applied (new fixed kernel):
$ sudo chrt -f 1 ./gup_test -a -m 512 -j 40
PIN_FAST_BENCHMARK: Time: get:82038 put:7041 us
PIN_FAST_BENCHMARK: Time: get:82144 put:6817 us
PIN_FAST_BENCHMARK: Time: get:83417 put:6674 us
PIN_FAST_BENCHMARK: Time: get:82540 put:6594 us
PIN_FAST_BENCHMARK: Time: get:83214 put:6681 us
PIN_FAST_BENCHMARK: Time: get:83444 put:6889 us
PIN_FAST_BENCHMARK: Time: get:83194 put:7499 us
PIN_FAST_BENCHMARK: Time: get:84876 put:7369 us
PIN_FAST_BENCHMARK: Time: get:86092 put:10289 us
PIN_FAST_BENCHMARK: Time: get:86153 put:10415 us
PIN_FAST_BENCHMARK: Time: get:85026 put:7751 us
PIN_FAST_BENCHMARK: Time: get:85458 put:7944 us
PIN_FAST_BENCHMARK: Time: get:85735 put:8154 us
PIN_FAST_BENCHMARK: Time: get:85851 put:8299 us
PIN_FAST_BENCHMARK: Time: get:86323 put:9617 us
PIN_FAST_BENCHMARK: Time: get:86288 put:10496 us
PIN_FAST_BENCHMARK: Time: get:87697 put:9346 us
PIN_FAST_BENCHMARK: Time: get:87980 put:8382 us
PIN_FAST_BENCHMARK: Time: get:88719 put:8400 us
PIN_FAST_BENCHMARK: Time: get:87616 put:8588 us
PIN_FAST_BENCHMARK: Time: get:86730 put:9563 us
PIN_FAST_BENCHMARK: Time: get:88167 put:8673 us
PIN_FAST_BENCHMARK: Time: get:86844 put:9777 us
PIN_FAST_BENCHMARK: Time: get:88068 put:11774 us
PIN_FAST_BENCHMARK: Time: get:86170 put:15676 us
PIN_FAST_BENCHMARK: Time: get:87967 put:12827 us
PIN_FAST_BENCHMARK: Time: get:95773 put:7652 us
PIN_FAST_BENCHMARK: Time: get:87734 put:13650 us
PIN_FAST_BENCHMARK: Time: get:89833 put:14237 us
PIN_FAST_BENCHMARK: Time: get:96186 put:8029 us
PIN_FAST_BENCHMARK: Time: get:95532 put:8886 us
PIN_FAST_BENCHMARK: Time: get:95351 put:5826 us
PIN_FAST_BENCHMARK: Time: get:96401 put:8407 us
PIN_FAST_BENCHMARK: Time: get:96473 put:8287 us
PIN_FAST_BENCHMARK: Time: get:97177 put:8430 us
PIN_FAST_BENCHMARK: Time: get:98120 put:5263 us
PIN_FAST_BENCHMARK: Time: get:96271 put:7757 us
PIN_FAST_BENCHMARK: Time: get:99628 put:10467 us
PIN_FAST_BENCHMARK: Time: get:99344 put:10045 us
PIN_FAST_BENCHMARK: Time: get:94212 put:15485 us
Summary:
Old kernel: 477729.97 (+-3.79%)
New kernel: 89144.65 (+-11.76%)
This patch (of 3):
Add a new parameter "-j N" to support concurrent gup test.
Link: https://lkml.kernel.org/r/20210507150553.208763-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20210507150553.208763-2-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.com> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Kirill Shutemov <kirill@shutemov.name> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Chi Wu [Wed, 2 Jun 2021 03:52:08 +0000 (13:52 +1000)]
mm/page-writeback: Fix performance when BDI's share of ratio is 0.
Fix performance when BDI's share of ratio is 0.
The issue is similar to commit 74d369443325 ("writeback: Fix
performance regression in wb_over_bg_thresh()").
Balance_dirty_pages and the writeback worker will also disagree on
whether writeback when a BDI uses BDI_CAP_STRICTLIMIT and BDI's share
of the thresh ratio is zero.
For example, A thread on cpu0 writes 32 pages and then
balance_dirty_pages, it will wake up background writeback and pauses
because wb_dirty > wb->wb_thresh = 0 (share of thresh ratio is zero).
A thread may runs on cpu0 again because scheduler prefers pre_cpu.
Then writeback worker may runs on other cpus(1,2..) which causes the
value of wb_stat(wb, WB_RECLAIMABLE) in wb_over_bg_thresh is 0 and does
not writeback and returns.
Thus, balance_dirty_pages keeps looping, sleeping and then waking up the
worker who will do nothing. It remains stuck in this state until the
writeback worker hit the right dirty cpu or the dirty pages expire.
The fix that we should get the wb_stat_sum radically when thresh is low.
Link: https://lkml.kernel.org/r/20210428225046.16301-1-wuchi.zero@gmail.com Signed-off-by: Chi Wu <wuchi.zero@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Miklos Szeredi <mszeredi@redhat.com> Cc: Sedat Dilek <sedat.dilek@gmail.com> Cc: Jens Axboe <axboe@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Anshuman Khandual [Wed, 2 Jun 2021 03:52:08 +0000 (13:52 +1000)]
mm/debug_vm_pgtable: ensure THP availability via has_transparent_hugepage()
On certain platforms, THP support could not just be validated via the
build option CONFIG_TRANSPARENT_HUGEPAGE. Instead
has_transparent_hugepage() also needs to be called upon to verify THP
runtime support. Otherwise the debug test will just run into unusable THP
helpers like in the case of a 4K hash config on powerpc platform [1].
This just moves all pfn_pmd() and pfn_pud() after THP runtime validation
with has_transparent_hugepage() which prevents the mentioned problem.
Reported-by: kernel test robot <lkp@intel.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Joe Perches <joe@perches.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Stephen Boyd <swboyd@chromium.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Those addresses are hashed for kernel security reasons. If we're trying
to be secure with slub_debug on the commandline we have some big problems
given that we dump whole chunks of kernel memory to the kernel logs.
Let's force on the no_hash_pointers commandline flag when slub_debug is on
the commandline. This makes slub debug messages more meaningful and if by
chance a kernel address is in some slub debug object dump we will have a
better chance of figuring out what went wrong.
Note that we don't use %px in the slub code because we want to reduce the
number of places that %px is used in the kernel. This also nicely prints
a big fat warning at kernel boot if slub_debug is on the commandline so
that we know that this kernel shouldn't be used on production systems.
Link: https://lkml.kernel.org/r/20210601182202.3011020-5-swboyd@chromium.org Signed-off-by: Stephen Boyd <swboyd@chromium.org> Cc: Joe Perches <joe@perches.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Petr Mladek <pmladek@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Joe Perches [Wed, 2 Jun 2021 03:52:07 +0000 (13:52 +1000)]
slub: indicate slab_fix() uses printf formats
Ideally, slab_fix() would be marked with __printf and the format here
would not use \n as that's emitted by the slab_fix(). Make these changes.
Link: https://lkml.kernel.org/r/20210601182202.3011020-4-swboyd@chromium.org Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Stephen Boyd <swboyd@chromium.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Petr Mladek <pmladek@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Stephen Boyd [Wed, 2 Jun 2021 03:52:07 +0000 (13:52 +1000)]
slub: actually use 'message' in restore_bytes()
The message argument isn't used here. Let's pass the string to the printk
message so that the developer can figure out what's happening, instead of
guessing that a redzone is being restored, etc.
Link: https://lkml.kernel.org/r/20210601182202.3011020-3-swboyd@chromium.org Signed-off-by: Stephen Boyd <swboyd@chromium.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Christoph Lameter <cl@linux.com> Cc: Joe Perches <joe@perches.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Petr Mladek <pmladek@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Stephen Boyd [Wed, 2 Jun 2021 03:52:06 +0000 (13:52 +1000)]
slub: restore slub_debug=- behavior
Petch series "slub: Print non-hashed pointers in slub debugging", v3.
I was doing some debugging recently and noticed that my pointers were
being hashed while slub_debug was on the kernel commandline. Let's force
on the no hash pointer option when slub_debug is on the kernel commandline
so that the prints are more meaningful.
The first two patches are something else I noticed while looking at the
code. The message argument is never used so the debugging messages are
not as clear as they could be and the slub_debug=- behavior seems to be
busted. Then there's a printf fixup from Joe and the final patch is the
one that force disables pointer hashing.
This patch (of 4):
Passing slub_debug=- on the kernel commandline is supposed to disable slub
debugging. This is especially useful with CONFIG_SLUB_DEBUG_ON where the
default is to have slub debugging enabled in the build. Due to some code
reorganization this behavior was dropped, but the code to make it work
mostly stuck around. Restore the previous behavior by disabling the
static key when we parse the commandline and see that we're trying to
disable slub debugging.
Link: https://lkml.kernel.org/r/20210601182202.3011020-1-swboyd@chromium.org Link: https://lkml.kernel.org/r/20210601182202.3011020-2-swboyd@chromium.org Fixes: ca0cab65ea2b ("mm, slub: introduce static key for slub_debug()") Signed-off-by: Stephen Boyd <swboyd@chromium.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Joe Perches <joe@perches.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Petr Mladek <pmladek@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Hyeonggon Yoo [Wed, 2 Jun 2021 03:52:06 +0000 (13:52 +1000)]
mm, slub: fix support for clang 10
Previously in 'commit ff3daafe3fd3 ("mm, slub: change run-time assertion
in kmalloc_index() to compile-time")', changed kmalloc_index's run-time
assertion to compile-time assertion.
But clang 10 has a bug misevaluating __builtin_constant_p() as true,
making it unable to compile. This bug was fixed in clang 11.
To support clang 10, introduce a macro to do run-time assertion if clang
version is less than 11, even if the size is constant. Might revert this
commit later if we choose not to support clang 10.
Link: https://lkml.kernel.org/r/20210518181247.GA10062@hyeyoo Fixes: ff3daafe3fd3 ("mm, slub: change run-time assertion in kmalloc_index() to compile-time") Link: https://lore.kernel.org/r/CA+G9fYvYxqVhUTkertjZjcrUq8LWPnO7qC==Wum3gYCwWF9D6Q@mail.gmail.com/ Link: https://lkml.org/lkml/2021/5/11/872 Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Suggested-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Marco Elver <elver@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Marco Elver [Wed, 2 Jun 2021 03:52:06 +0000 (13:52 +1000)]
kfence: test: fix for "mm, slub: change run-time assertion in kmalloc_index() to compile-time"
Enable using kmalloc_index() in allocator test modules again where the
size may be non-constant, while ensuring normal usage always passes a
constant size.
Split the definition into __kmalloc_index(size, size_is_constant), and a
definition of kmalloc_index(s), matching the old kmalloc_index()
interface, but that still requires size_is_constant==true. This ensures
that normal usage of kmalloc_index() always passes a constant size.
While the __-prefix should make it clearer that the function is to be used
with care, also rewrite the "Note" to highlight the restriction (and add a
hint to kmalloc_slab()).
The alternative considered here is to export kmalloc_slab(), but given it
is internal to mm/ and not in <linux/slab.h>, we should probably avoid
exporting it. Allocator test modules will work just fine by using
__kmalloc_index(s, false).
Hyeonggon Yoo [Wed, 2 Jun 2021 03:52:06 +0000 (13:52 +1000)]
mm, slub: change run-time assertion in kmalloc_index() to compile-time
Currently when size is not supported by kmalloc_index, compiler will
generate a run-time BUG() while compile-time error is also possible, and
better. So change BUG to BUILD_BUG_ON_MSG to make compile-time check
possible.
Also remove code that allocates more than 32MB because current
implementation supports only up to 32MB.
Link: https://lkml.kernel.org/r/20210511173448.GA54466@hyeyoo Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Oliver Glitta [Wed, 2 Jun 2021 03:52:06 +0000 (13:52 +1000)]
slub: remove resiliency_test() function
Function resiliency_test() is hidden behind #ifdef SLUB_RESILIENCY_TEST
that is not part of Kconfig, so nobody runs it.
This function is replaced with KUnit test for SLUB added by the previous
patch "selftests: add a KUnit test for SLUB debugging functionality".
Link: https://lkml.kernel.org/r/20210511150734.3492-3-glittao@gmail.com Signed-off-by: Oliver Glitta <glittao@gmail.com> Reviewed-by: Marco Elver <elver@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Oliver Glitta <glittao@gmail.com> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: Daniel Latypov <dlatypov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Oliver Glitta <glittao@gmail.com> Cc: Marco Elver <elver@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Oliver Glitta [Wed, 2 Jun 2021 03:52:05 +0000 (13:52 +1000)]
mm/slub, kunit: add a KUnit test for SLUB debugging functionality-fix
Remove unused function test_exit(), from SLUB KUnit test.
Link: https://lkml.kernel.org/r/20210512140656.12083-1-glittao@gmail.com Signed-off-by: Oliver Glitta <glittao@gmail.com> Reported-by: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Oliver Glitta [Wed, 2 Jun 2021 03:52:05 +0000 (13:52 +1000)]
mm/slub, kunit: add a KUnit test for SLUB debugging functionality
SLUB has resiliency_test() function which is hidden behind #ifdef
SLUB_RESILIENCY_TEST that is not part of Kconfig, so nobody runs it.
KUnit should be a proper replacement for it.
Try changing byte in redzone after allocation and changing pointer to next
free node, first byte, 50th byte and redzone byte. Check if validation
finds errors.
There are several differences from the original resiliency test: Tests
create own caches with known state instead of corrupting shared kmalloc
caches.
The corruption of freepointer uses correct offset, the original resiliency
test got broken with freepointer changes.
Scratch changing random byte test, because it does not have meaning in
this form where we need deterministic results.
Add new option CONFIG_SLUB_KUNIT_TEST in Kconfig. Tests next_pointer,
first_word and clobber_50th_byte do not run with KASAN option on. Because
the test deliberately modifies non-allocated objects.
Use kunit_resource to count errors in cache and silence bug reports.
Count error whenever slab_bug() or slab_fix() is called or when the count
of pages is wrong.
Link: https://lkml.kernel.org/r/20210511150734.3492-2-glittao@gmail.com Signed-off-by: Oliver Glitta <glittao@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Daniel Latypov <dlatypov@google.com> Acked-by: Marco Elver <elver@google.com> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Wed, 2 Jun 2021 03:52:05 +0000 (13:52 +1000)]
kunit: make test->lock irq safe
The upcoming SLUB kunit test will be calling kunit_find_named_resource()
from a context with disabled interrupts. That means kunit's test->lock
needs to be IRQ safe to avoid potential deadlocks and lockdep splats.
This patch therefore changes the test->lock usage to spin_lock_irqsave()
and spin_unlock_irqrestore().
Link: https://lkml.kernel.org/r/20210511150734.3492-1-glittao@gmail.com Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Oliver Glitta <glittao@gmail.com> Reviewed-by: Brendan Higgins <brendanhiggins@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Daniel Latypov <dlatypov@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Marco Elver <elver@google.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Wangyan [Wed, 2 Jun 2021 03:52:04 +0000 (13:52 +1000)]
ocfs2: fix ocfs2 corrupt when iputting an inode
In this condition, it will cause an bug on error.
ocfs2_mkdir()
->ocfs2_mknod()
->ocfs2_mknod_locked()
->__ocfs2_mknod_locked()
//Assume inode->i_generation is genN.
->inode->i_generation = osb->s_next_generation++;
// The inode lockres has been initialized.
->ocfs2_populate_inode()
->ocfs2_create_new_inode_locks()
->An error happened, returned value is non-zero
// free the start_bit x in bg_blkno
->ocfs2_free_suballoc_bits()
->... /* Another process execute mkdir success in this place,
and it occupied the start_bit x in bg_blkno
which has been freed before. Its inode->i_generation
is genN + 1 */
->iput(inode)
->evict()
->ocfs2_evict_inode()
->ocfs2_delete_inode()
->ocfs2_inode_lock()
->ocfs2_inode_lock_update()
/* Bug on here, genN != genN + 1 */
->mlog_bug_on_msg(inode->i_generation !=
le32_to_cpu(fe->i_generation))
So, we need not to reclaim the inode when the inode->ip_inode_lockres
has been initialized. It will be freed in iput().
Link: http://lkml.kernel.org/r/ef080ca3-5d74-e276-17a1-d9e7c7e662c9@huawei.com Fixes: b1529a41f777 ("ocfs2: should reclaim the inode if '__ocfs2_mknod_locked' returns an error") Signed-off-by: Yan Wang <wangyan122@huawei.com> Reviewed-by: Jun Piao <piaojun@huawei.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Wangyan [Wed, 2 Jun 2021 03:52:04 +0000 (13:52 +1000)]
ocfs2: clear links count in ocfs2_mknod() if an error occurs
In this condition, the inode can not be wiped when error happened.
ocfs2_mkdir()
->ocfs2_mknod()
->ocfs2_mknod_locked()
->__ocfs2_mknod_locked()
->ocfs2_set_links_count() // i_links_count is 2
-> ... // an error accrue, goto roll_back or leave.
->ocfs2_commit_trans()
->iput(inode)
->evict()
->ocfs2_evict_inode()
->ocfs2_delete_inode()
->ocfs2_inode_lock()
->ocfs2_inode_lock_update()
->ocfs2_refresh_inode()
->set_nlink(); // inode->i_nlink is 2 now.
/* if wipe is 0, it will goto bail_unlock_inode */
->ocfs2_query_inode_wipe()
->if (inode->i_nlink) return; // wipe is 0.
/* inode can not be wiped */
->ocfs2_wipe_inode()
So, we need clear links before the transaction committed.
Link: http://lkml.kernel.org/r/d8147c41-fb2b-bdf7-b660-1f3c8448c33f@huawei.com Signed-off-by: Yan Wang <wangyan122@huawei.com> Reviewed-by: Jun Piao <piaojun@huawei.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Chen Huang [Wed, 2 Jun 2021 03:52:04 +0000 (13:52 +1000)]
ocfs2: replace simple_strtoull() with kstrtoull()
simple_strtoull() is deprecated in some situation since it does not check
for the range overflow, use kstrtoull() instead.
Link: https://lkml.kernel.org/r/20210526092020.554341-3-chenhuang5@huawei.com Signed-off-by: Chen Huang <chenhuang5@huawei.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Wan Jiabing [Wed, 2 Jun 2021 03:52:03 +0000 (13:52 +1000)]
ocfs2: remove repeated uptodate check for buffer
In commit 60f91826ca62 ("buffer: Avoid setting buffer bits that are
already set"), function set_buffer_##name was added a test_bit() to check
buffer, which is the same as function buffer_##name. The
!buffer_uptodate(bh) here is a repeated check. Remove it.
Link: https://lkml.kernel.org/r/20210425025702.13628-1-wanjiabing@vivo.com Signed-off-by: Wan Jiabing <wanjiabing@vivo.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Colin Ian King [Wed, 2 Jun 2021 03:52:03 +0000 (13:52 +1000)]
ocfs2: remove redundant assignment to pointer queue
The pointer queue is being initialized with a value that is never read and
it is being updated later with a new value. The initialization is
redundant and can be removed.
Addresses-Coverity: ("Unused value") Link: https://lkml.kernel.org/r/20210513113957.57539-1-colin.king@canonical.com Signed-off-by: Colin Ian King <colin.king@canonical.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Dan Carpenter [Wed, 2 Jun 2021 03:52:03 +0000 (13:52 +1000)]
ocfs2: fix snprintf() checking
The snprintf() function returns the number of bytes which would have been
printed if the buffer was large enough. In other words it can return ">=
remain" but this code assumes it returns "== remain".
The run time impact of this bug is not very severe. The next iteration
through the loop would trigger a WARN() when we pass a negative limit to
snprintf(). We would then return success instead of -E2BIG.
The kernel implementation of snprintf() will never return negatives so
there is no need to check and I have deleted that dead code.
Link: https://lkml.kernel.org/r/20210511135350.GV1955@kadam Fixes: a860f6eb4c6a ("ocfs2: sysfile interfaces for online file check") Fixes: 74ae4e104dfc ("ocfs2: Create stack glue sysfs files.") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Yang Yingliang [Wed, 2 Jun 2021 03:52:03 +0000 (13:52 +1000)]
ocfs2: remove unnecessary INIT_LIST_HEAD()
The list_head o2hb_node_events is initialized statically. It is
unnecessary to initialize by INIT_LIST_HEAD().
Link: https://lkml.kernel.org/r/20210511115847.3817395-1-yangyingliang@huawei.com Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Reported-by: Hulk Robot <hulkci@huawei.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vincent Whitchurch [Wed, 2 Jun 2021 03:52:03 +0000 (13:52 +1000)]
squashfs: add option to panic on errors
Add an errors=panic mount option to make squashfs trigger a panic when
errors are encountered, similar to several other filesystems. This allows
a kernel dump to be saved using which the corruption can be analysed and
debugged.
Inspired by a pre-fs_context patch by Anton Eliasson.
Link: https://lkml.kernel.org/r/20210527125019.14511-1-vincent.whitchurch@axis.com Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com> Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Steven Rostedt (VMware) [Wed, 2 Jun 2021 03:52:02 +0000 (13:52 +1000)]
streamline_config.pl: add softtabstop=4 for vim users
The tab stop for Perl files is by default (at least in emacs) to be 4
spaces, where a tab is used for all 8 spaces. Add a local variable
comment to make vim do the same by default, and this will help keep the
file consistent in the future when others edit it via vim and not emacs.
Link: https://lkml.kernel.org/r/20210322214032.293992979@goodmis.org Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: "John (Warthog9) Hawley" <warthog9@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Steven Rostedt (VMware) [Wed, 2 Jun 2021 03:52:02 +0000 (13:52 +1000)]
streamline_config.pl: make spacing consistent
Patch series "streamline_config.pl: Fix Perl spacing".
Talking with John Hawley about how vim and emacs deal with Perl files with
respect to tabs and spaces, I found that some of my Perl code in the
kernel had inconsistent spacing. The way emacs handles Perl by default is
to use 4 spaces per indent, but make all 8 spaces into a single tab. Vim
does not do this by default. But if you add the vim variable control:
# vim: softtabstop=4
to a perl file, it makes vim behave the same way as emacs.
The first patch is to change all 8 spaces into a single tab (mostly from
people editing the file with vim). The next patch adds the softtabstop
variable to make vim act like emacs by default.
This patch (of 2):
As Perl code tends to have 4 space indentation, but uses tabs for every 8
spaces, make that consistent in the streamline_config.pl code. Replace
all 8 spaces with a single tab.
Link: https://lkml.kernel.org/r/20210322214032.133596267@goodmis.org Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: "John (Warthog9) Hawley" <warthog9@kernel.org> Cc: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
gcc points out a mistake in the mca driver that goes back to before the
git history:
arch/ia64/kernel/mca_drv.c: In function 'init_record_index_pools':
arch/ia64/kernel/mca_drv.c:346:54: error: expression does not compute the number of elements in this array; element typ
e is 'int', not 'size_t' {aka 'long unsigned int'} [-Werror=sizeof-array-div]
346 | for (i = 1; i < sizeof sal_log_sect_min_sizes/sizeof(size_t); i++)
| ^
This is the same as sizeof(size_t), which is two shorter than the actual
array. Use the ARRAY_SIZE() macro to get the correct calculation instead.
Link: https://lkml.kernel.org/r/20210514214123.875971-1-arnd@kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
The reason for the panic is that stable_page_flags() which parses the page
flags uses uninitialized struct pages reserved by the ZONE_DEVICE driver.
Earlier approach to fix this was discussed here:
https://marc.info/?l=linux-mm&m=152964770000672&w=2
This is another approach. To avoid using the uninitialized struct page,
immediately return with KPF_RESERVED at the beginning of
stable_page_flags() if the page is reserved by ZONE_DEVICE driver.
Dan said:
: The nvdimm implementation uses vmem_altmap to arrange for the 'struct
: page' array to be allocated from a reservation of a pmem namespace. A
: namespace in this mode contains an info-block that consumes the first
: 8K of the namespace capacity, capacity designated for page mapping,
: capacity for padding the start of data to optionally 4K, 2MB, or 1GB
: (on x86), and then the namespace data itself. The implementation
: specifies a section aligned (now sub-section aligned) address to
: arch_add_memory() to establish the linear mapping to map the metadata,
: and then vmem_altmap indicates to memmap_init_zone() which pfns
: represent data. The implementation only specifies enough 'struct page'
: capacity for pfn_to_page() to operate on the data space, not the
: namespace metadata space.
:
: The proposal to validate ZONE_DEVICE pfns against the altmap seems the
: right approach to me.
Link: http://lkml.kernel.org/r/20190725023100.31141-3-t-fukasawa@vx.jp.nec.com Signed-off-by: Toshiki Fukasawa <t-fukasawa@vx.jp.nec.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Junichi Nomura <j-nomura@ce.jp.nec.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Toshiki Fukasawa [Wed, 2 Jun 2021 03:52:01 +0000 (13:52 +1000)]
/proc/kpageflags: prevent an integer overflow in stable_page_flags()
stable_page_flags() returns kpageflags info in u64, but it uses "1 <<
KPF_*" internally which is considered as int. This type mismatch causes
no visible problem now, but it will if you set bit 32 or more as done in a
subsequent patch. So use BIT_ULL in order to avoid future overflow
issues.
Link: http://lkml.kernel.org/r/20190725023100.31141-2-t-fukasawa@vx.jp.nec.com Signed-off-by: Toshiki Fukasawa <t-fukasawa@vx.jp.nec.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Junichi Nomura <j-nomura@ce.jp.nec.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Junxiao Bi [Wed, 2 Jun 2021 03:52:01 +0000 (13:52 +1000)]
ocfs2: fix data corruption by fallocate
When fallocate punches holes out of inode size, if original isize is in
the middle of last cluster, then the part from isize to the end of the
cluster will be zeroed with buffer write, at that time isize is not yet
updated to match the new size, if writeback is kicked in, it will invoke
ocfs2_writepage()->block_write_full_page() where the pages out of inode
size will be dropped. That will cause file corruption. Fix this by zero
out eof blocks when extending the inode size.
Running the following command with qemu-image 4.2.1 can get a corrupted
coverted image file easily.
Link: https://lkml.kernel.org/r/20210528210648.9124-1-junxiao.bi@oracle.com Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Jan Kara <jack@suse.cz> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
YueHaibing [Wed, 2 Jun 2021 03:52:01 +0000 (13:52 +1000)]
lib: crc64: fix kernel-doc warning
Fix W=1 kernel build warning:
lib/crc64.c:40: warning:
bad line: or the previous crc64 value if computing incrementally.
Link: https://lkml.kernel.org/r/20210601135851.15444-1-yuehaibing@huawei.com Signed-off-by: YueHaibing <yuehaibing@huawei.com> Reviewed-by: Coly Li <colyli@suse.de> Acked-by: Randy Dunlap <rdunlap@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Mina Almasry [Wed, 2 Jun 2021 03:52:00 +0000 (13:52 +1000)]
mm, hugetlb: fix simple resv_huge_pages underflow on UFFDIO_COPY
The userfaultfd hugetlb tests cause a resv_huge_pages underflow. This
happens when hugetlb_mcopy_atomic_pte() is called with !is_continue on an
index for which we already have a page in the cache. When this happens,
we allocate a second page, double consuming the reservation, and then fail
to insert the page into the cache and return -EEXIST.
To fix this, we first check if there is a page in the cache which already
consumed the reservation, and return -EEXIST immediately if so.
There is still a rare condition where we fail to copy the page contents
AND race with a call for hugetlb_no_page() for this index and again we
will underflow resv_huge_pages. That is fixed in a more complicated patch
not targeted for -stable.
Test:
Hacked the code locally such that resv_huge_pages underflows produce
a warning, then:
./tools/testing/selftests/vm/userfaultfd hugetlb_shared 10
2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success
./tools/testing/selftests/vm/userfaultfd hugetlb 10
2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success
Both tests succeed and produce no warnings. After the
test runs number of free/resv hugepages is correct.
[mike.kravetz@oracle.com: changelog fixes] Link: https://lkml.kernel.org/r/20210528004649.85298-1-almasrymina@google.com Fixes: 8fb5debc5fcd ("userfaultfd: hugetlbfs: add hugetlb_mcopy_atomic_pte for userfaultfd support") Signed-off-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Naoya Horiguchi [Wed, 2 Jun 2021 03:52:00 +0000 (13:52 +1000)]
hugetlb: pass head page to remove_hugetlb_page()
When memory_failure() or soft_offline_page() is called on a tail page of
some hugetlb page, "BUG: unable to handle page fault" error can be
triggered.
remove_hugetlb_page() dereferences page->lru, so it's assumed that the
page points to a head page, but one of the caller,
dissolve_free_huge_page(), provides remove_hugetlb_page() with 'page'
which could be a tail page. So pass 'head' to it, instead.
Link: https://lkml.kernel.org/r/20210526235257.2769473-1-nao.horiguchi@gmail.com Fixes: 6eb4e88a6d27 ("hugetlb: create remove_hugetlb_page() to separate functionality") Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
David Hildenbrand [Wed, 2 Jun 2021 03:52:00 +0000 (13:52 +1000)]
drivers/base/memory: fix trying offlining memory blocks with memory holes on aarch64
offline_pages() properly checks for memory holes and bails out. However,
we do a page_zone(pfn_to_page(start_pfn)) before calling offline_pages()
when offlining a memory block. We should not unconditionally call
page_zone(pfn_to_page(start_pfn)) on aarch64 in offlining code, otherwise
we can trigger a BUG when hitting a memory hole:
If nr_vmemmap_pages is set, we know that we are dealing with hotplugged
memory that doesn't have any holes. So call
page_zone(pfn_to_page(start_pfn)) only when really necessary -- when
nr_vmemmap_pages is set and we actually adjust the present pages.
Link: https://lkml.kernel.org/r/20210526075226.5572-1-david@redhat.com Fixes: a08a2ae34613 ("mm,memory_hotplug: allocate memmap from the added memory range") Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: Qian Cai (QUIC) <quic_qiancai@quicinc.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Mike Rapoport <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Ding Hui [Wed, 2 Jun 2021 03:52:00 +0000 (13:52 +1000)]
mm/page_alloc: fix counting of free pages after take off from buddy
Recently we found that there is a lot MemFree left in /proc/meminfo after
do a lot of pages soft offline, it's not quite correct.
Before Oscar rework soft offline for free pages [1], if we soft offline
free pages, these pages are left in buddy with HWPoison flag, and
NR_FREE_PAGES is not updated immediately. So the difference between
NR_FREE_PAGES and real number of available free pages is also even big at
the beginning.
However, with the workload running, when we catch HWPoison page in any
alloc functions subsequently, we will remove it from buddy, meanwhile
update the NR_FREE_PAGES and try again, so the NR_FREE_PAGES will get more
and more closer to the real number of available free pages. (regardless
of unpoison_memory())
Now, for offline free pages, after a successful call
take_page_off_buddy(), the page is no longer belong to buddy allocator,
and will not be used any more, but we missed accounting NR_FREE_PAGES in
this situation, and there is no chance to be updated later.
Do update in take_page_off_buddy() like rmqueue() does, but avoid double
counting if some one already set_migratetype_isolate() on the page.
[1]: commit 06be6ff3d2ec ("mm,hwpoison: rework soft offline for free pages")
Link: https://lkml.kernel.org/r/20210526075247.11130-1-dinghui@sangfor.com.cn Fixes: 06be6ff3d2ec ("mm,hwpoison: rework soft offline for free pages") Signed-off-by: Ding Hui <dinghui@sangfor.com.cn> Suggested-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Gerald Schaefer [Wed, 2 Jun 2021 03:52:00 +0000 (13:52 +1000)]
mm/debug_vm_pgtable: fix alignment for pmd/pud_advanced_tests()
In pmd/pud_advanced_tests(), the vaddr is aligned up to the next pmd/pud
entry, and so it does not match the given pmdp/pudp and (aligned down) pfn
any more.
For s390, this results in memory corruption, because the IDTE instruction
used e.g. in xxx_get_and_clear() will take the vaddr for some calculations,
in combination with the given pmdp. It will then end up with a wrong table
origin, ending on ...ff8, and some of those wrongly set low-order bits will
also select a wrong pagetable level for the index addition. IDTE could
therefore invalidate (or 0x20) something outside of the page tables,
depending on the wrongly picked index, which in turn depends on the random
vaddr.
As result, we sometimes see "BUG task_struct (Not tainted): Padding
overwritten" on s390, where one 0x5a padding value got overwritten with
0x7a.
Fix this by aligning down, similar to how the pmd/pud_aligned pfns are
calculated.
Mark Rutland [Wed, 2 Jun 2021 03:51:59 +0000 (13:51 +1000)]
pid: take a reference when initializing `cad_pid`
During boot, kernel_init_freeable() initializes `cad_pid` to the init
task's struct pid. Later on, we may change `cad_pid` via a sysctl, and
when this happens proc_do_cad_pid() will increment the refcount on the new
pid via get_pid(), and will decrement the refcount on the old pid via
put_pid(). As we never called get_pid() when we initialized `cad_pid`, we
decrement a reference we never incremented, can therefore free the init
task's struct pid early. As there can be dangling references to the
struct pid, we can later encounter a use-after-free (e.g. when delivering
signals).
This was spotted when fuzzing v5.13-rc3 with Syzkaller, but seems to have
been around since the conversion of `cad_pid` to struct pid in commit:
Fix this by getting a reference to the init task's struct pid when we
assign it to `cad_pid`.
Full KASAN splat below.
==================================================================
BUG: KASAN: use-after-free in ns_of_pid include/linux/pid.h:153 [inline]
BUG: KASAN: use-after-free in task_active_pid_ns+0xc0/0xc8 kernel/pid.c:509
Read of size 4 at addr ffff23794dda0004 by task syz-executor.0/273
Memory state around the buggy address: ffff23794dd9ff00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff23794dd9ff80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff23794dda0000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^ ffff23794dda0080: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc ffff23794dda0100: fc fc fc fc fc fc fc fc 00 00 00 00 00 00 00 00
==================================================================
Link: https://lkml.kernel.org/r/20210524172230.38715-1-mark.rutland@arm.com Fixes: 9ec52099e4b8678a ("[PATCH] replace cad_pid by a struct pid") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: Christian Brauner <christian@brauner.io> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Kees Cook <keescook@chromium.org Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Paul Mackerras <paulus@samba.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Marco Elver [Wed, 2 Jun 2021 03:51:59 +0000 (13:51 +1000)]
kfence: use TASK_IDLE when awaiting allocation
Since wait_event() uses TASK_UNINTERRUPTIBLE by default, waiting for an
allocation counts towards load. However, for KFENCE, this does not make
any sense, since there is no busy work we're awaiting.
Instead, use TASK_IDLE via wait_event_idle() to not count towards load.
BugLink: https://bugzilla.suse.com/show_bug.cgi?id=1185565 Link: https://lkml.kernel.org/r/20210521083209.3740269-1-elver@google.com Fixes: 407f1d8c1b5f ("kfence: await for allocation using wait_event") Signed-off-by: Marco Elver <elver@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: David Laight <David.Laight@ACULAB.COM> Cc: Hillf Danton <hdanton@sina.com> Cc: <stable@vger.kernel.org> [5.12+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
MIPS cache flush logic needs to know whether the mapping was already
established to decide how to flush caches. This is done by checking the
valid bit in the PTE. The commit above breaks this logic by setting the
valid in the PTE in new mappings, which causes kernel crashes.
Link: https://lkml.kernel.org/r/20210526094335.92948-1-tsbogend@alpha.franken.de Fixes: f685a533a7f ("MIPS: make userspace mapping young by default") Reported-by: Zhou Yanjie <zhouyanjie@wanyeetech.com> Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Huang Pei <huangpei@loongson.cn> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Linus Torvalds [Mon, 31 May 2021 15:57:22 +0000 (05:57 -1000)]
Merge tag 'gfs2-v5.13-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 fixes from Andreas Gruenbacher:
"Various gfs2 fixes"
* tag 'gfs2-v5.13-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: Fix use-after-free in gfs2_glock_shrink_scan
gfs2: Fix mmap locking for write faults
gfs2: Clean up revokes on normal withdraws
gfs2: fix a deadlock on withdraw-during-mount
gfs2: fix scheduling while atomic bug in glocks
gfs2: Fix I_NEW check in gfs2_dinode_in
gfs2: Prevent direct-I/O write fallback errors from getting lost
Linus Torvalds [Mon, 31 May 2021 15:52:22 +0000 (05:52 -1000)]
Merge tag 'fsnotify_for_v5.13-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fsnotify fixes from Jan Kara:
"A fix for permission checking with fanotify unpriviledged groups.
Also there's a small update in MAINTAINERS file for fanotify"
* tag 'fsnotify_for_v5.13-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
fanotify: fix permission model of unprivileged group
MAINTAINERS: Add Matthew Bobrowski as a reviewer
Hillf Danton [Tue, 18 May 2021 08:46:25 +0000 (16:46 +0800)]
gfs2: Fix use-after-free in gfs2_glock_shrink_scan
The GLF_LRU flag is checked under lru_lock in gfs2_glock_remove_from_lru() to
remove the glock from the lru list in __gfs2_glock_put().
On the shrink scan path, the same flag is cleared under lru_lock but because
of cond_resched_lock(&lru_lock) in gfs2_dispose_glock_lru(), progress on the
put side can be made without deleting the glock from the lru list.
Keep GLF_LRU across the race window opened by cond_resched_lock(&lru_lock) to
ensure correct behavior on both sides - clear GLF_LRU after list_del under
lru_lock.
Linus Torvalds [Sun, 30 May 2021 04:24:00 +0000 (18:24 -1000)]
Merge branch 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:
"This is a bit larger than usual at rc4 time. The reason is due to
Lee's work of fixing newly reported build warnings.
The rest is fixes as usual"
* 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: (22 commits)
MAINTAINERS: adjust to removing i2c designware platform data
i2c: s3c2410: fix possible NULL pointer deref on read message after write
i2c: mediatek: Disable i2c start_en and clear intr_stat brfore reset
i2c: i801: Don't generate an interrupt on bus reset
i2c: mpc: implement erratum A-004447 workaround
powerpc/fsl: set fsl,i2c-erratum-a004447 flag for P1010 i2c controllers
powerpc/fsl: set fsl,i2c-erratum-a004447 flag for P2041 i2c controllers
dt-bindings: i2c: mpc: Add fsl,i2c-erratum-a004447 flag
i2c: busses: i2c-stm32f4: Remove incorrectly placed ' ' from function name
i2c: busses: i2c-st: Fix copy/paste function misnaming issues
i2c: busses: i2c-pnx: Provide descriptions for 'alg_data' data structure
i2c: busses: i2c-ocores: Place the expected function names into the documentation headers
i2c: busses: i2c-eg20t: Fix 'bad line' issue and provide description for 'msgs' param
i2c: busses: i2c-designware-master: Fix misnaming of 'i2c_dw_init_master()'
i2c: busses: i2c-cadence: Fix incorrectly documented 'enum cdns_i2c_slave_mode'
i2c: busses: i2c-ali1563: File headers are not good candidates for kernel-doc
i2c: muxes: i2c-arb-gpio-challenge: Demote non-conformant kernel-doc headers
i2c: busses: i2c-nomadik: Fix formatting issue pertaining to 'timeout'
i2c: sh_mobile: Use new clock calculation formulas for RZ/G2E
i2c: I2C_HISI should depend on ACPI
...
Linus Torvalds [Sun, 30 May 2021 04:16:09 +0000 (18:16 -1000)]
Merge tag 'seccomp-fixes-v5.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull seccomp fixes from Kees Cook:
"This fixes a hard-to-hit race condition in the addfd user_notif
feature of seccomp, visible since v5.9.
And a small documentation fix"
* tag 'seccomp-fixes-v5.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
seccomp: Refactor notification handler to prepare for new semantics
Documentation: seccomp: Fix user notification documentation
Linus Torvalds [Sun, 30 May 2021 03:47:19 +0000 (17:47 -1000)]
Merge tag 'xfs-5.13-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
"This week's pile mitigates some decades-old problems in how extent
size hints interact with realtime volumes, fixes some failures in
online shrink, and fixes a problem where directory and symlink
shrinking on extremely fragmented filesystems could fail.
The most user-notable change here is to point users at our (new) IRC
channel on OFTC. Freedom isn't free, it costs folks like you and me;
and if you don't kowtow, they'll expel everyone and take over your
channel. (Ok, ok, that didn't fit the song lyrics...)
Summary:
- Fix a bug where unmapping operations end earlier than expected,
which can cause chaos on multi-block directory and symlink shrink
operations.
- Fix an erroneous assert that can trigger if we try to transition a
bmap structure from btree format to extents format with zero
extents. This was exposed by xfs/538"
* tag 'xfs-5.13-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: bunmapi has unnecessary AG lock ordering issues
xfs: btree format inode forks can have zero extents
xfs: add new IRC channel to MAINTAINERS
xfs: validate extsz hints against rt extent size when rtinherit is set
xfs: standardize extent size hint validation
xfs: check free AG space when making per-AG reservations
Sargun Dhillon [Mon, 17 May 2021 19:39:06 +0000 (12:39 -0700)]
seccomp: Refactor notification handler to prepare for new semantics
This refactors the user notification code to have a do / while loop around
the completion condition. This has a small change in semantic, in that
previously we ignored addfd calls upon wakeup if the notification had been
responded to, but instead with the new change we check for an outstanding
addfd calls prior to returning to userspace.
Rodrigo Campos also identified a bug that can result in addfd causing
an early return, when the supervisor didn't actually handle the
syscall [1].
Linus Torvalds [Sat, 29 May 2021 16:55:55 +0000 (06:55 -1000)]
Merge tag 'thermal-v5.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/thermal/linux
Pull thermal fixes from Daniel Lezcano:
- Fix uninitialized error code value for the SPMI adc driver (Yang
Yingliang)
- Fix kernel doc warning (Yang Li)
- Fix wrong read-write thermal trip point initialization (Srinivas
Pandruvada)
* tag 'thermal-v5.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/thermal/linux:
thermal/drivers/qcom: Fix error code in adc_tm5_get_dt_channel_data()
thermal/ti-soc-thermal: Fix kernel-doc
thermal/drivers/intel: Initialize RW trip to THERMAL_TEMP_INVALID
Linus Torvalds [Sat, 29 May 2021 16:33:28 +0000 (06:33 -1000)]
Merge tag 'driver-core-5.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core fixes from Greg KH:
"Here are three small driver core / debugfs fixes for 5.13-rc4:
- debugfs fix for incorrect "lockdown" mode for selinux accesses
- two device link changes, one bugfix and one cleanup
All of these have been in linux-next for over a week with no reported
problems"
* tag 'driver-core-5.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
drivers: base: Reduce device link removal code duplication
drivers: base: Fix device link removal
debugfs: fix security_locked_down() call for SELinux
Linus Torvalds [Sat, 29 May 2021 16:29:13 +0000 (06:29 -1000)]
Merge tag 'staging-5.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
Pull staging and IIO driver fixes from Greg KH:
"Here are some small IIO and staging driver fixes for reported issues
for 5.13-rc4.
Nothing major here, tiny changes for reported problems, full details
are in the shortlog if people are curious.
All have been in linux-next for a while with no reported problems"
* tag 'staging-5.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
iio: adc: ad7793: Add missing error code in ad7793_setup()
iio: adc: ad7923: Fix undersized rx buffer.
iio: adc: ad7768-1: Fix too small buffer passed to iio_push_to_buffers_with_timestamp()
iio: dac: ad5770r: Put fwnode in error case during ->probe()
iio: gyro: fxas21002c: balance runtime power in error path
staging: emxx_udc: fix loop in _nbu2ss_nuke()
staging: iio: cdc: ad7746: avoid overwrite of num_channels
iio: adc: ad7192: handle regulator voltage error first
iio: adc: ad7192: Avoid disabling a clock that was never enabled.
iio: adc: ad7124: Fix potential overflow due to non sequential channel numbers
iio: adc: ad7124: Fix missbalanced regulator enable / disable on error.
Linus Torvalds [Sat, 29 May 2021 16:25:16 +0000 (06:25 -1000)]
Merge tag 'tty-5.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty / serial driver fixes from Greg KH:
"Here are some small fixes for reported problems for tty and serial
drivers for 5.13-rc4.
They consist of:
- 8250 bugfixes and new device support
- lockdown security mode fixup
- syzbot found problems fixed
- 8250_omap fix for interrupt storm
- revert of 8250_omap driver fix as it caused worse problem than the
original issue
All but the last patch have been in linux-next for a while, the last
one is a revert of a problem found in linux-next with the 8250_omap
driver change"
* tag 'tty-5.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
Revert "serial: 8250: 8250_omap: Fix possible interrupt storm"
serial: 8250_pci: handle FL_NOIRQ board flag
serial: rp2: use 'request_firmware' instead of 'request_firmware_nowait'
serial: 8250_pci: Add support for new HPE serial device
serial: 8250: 8250_omap: Fix possible interrupt storm
serial: 8250: Use BIT(x) for UART_{CAP,BUG}_*
serial: 8250: Add UART_BUG_TXRACE workaround for Aspeed VUART
serial: 8250_dw: Add device HID for new AMD UART controller
serial: sh-sci: Fix off-by-one error in FIFO threshold register setting
serial: core: fix suspicious security_locked_down() call
serial: tegra: Fix a mask operation that is always true
Linus Torvalds [Sat, 29 May 2021 16:11:21 +0000 (06:11 -1000)]
Merge tag 'usb-5.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB / Thunderbolt fixes from Greg KH:
"Here are a number of tiny USB and Thunderbolt driver fixes for
5.13-rc4.
They consist of:
- thunderbolt fixes for some NVM bound issues
- xhci fixes for reported problems
- control-request fixups
- documentation build warning fixes
- new usb-serial driver device ids
- typec bugfixes for reported issues
- usbfs warning fixups (could be triggered from userspace)
- other tiny fixes for reported problems.
All of these have been in linux-next with no reported issues"
* tag 'usb-5.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (22 commits)
xhci: Fix 5.12 regression of missing xHC cache clearing command after a Stall
xhci: fix giving back URB with incorrect status regression in 5.12
usb: gadget: udc: renesas_usb3: Fix a race in usb3_start_pipen()
usb: typec: tcpm: Respond Not_Supported if no snk_vdo
usb: typec: tcpm: Properly interrupt VDM AMS
USB: trancevibrator: fix control-request direction
usb: Restore the usb_header label
usb: typec: tcpm: Use LE to CPU conversion when accessing msg->header
usb: typec: ucsi: Clear pending after acking connector change
usb: typec: mux: Fix matching with typec_altmode_desc
misc/uss720: fix memory leak in uss720_probe
usb: dwc3: gadget: Properly track pending and queued SG
USB: usbfs: Don't WARN about excessively large memory allocations
thunderbolt: usb4: Fix NVM read buffer bounds and offset issue
thunderbolt: dma_port: Fix NVM read buffer bounds and offset issue
usb: chipidea: udc: assign interrupt number to USB gadget structure
usb: cdnsp: Fix lack of removing request from pending list.
usb: cdns3: Fix runtime PM imbalance on error
USB: serial: pl2303: add device id for ADLINK ND-6530 GC
USB: serial: ti_usb_3410_5052: add startech.com device id
...
Linus Torvalds [Sat, 29 May 2021 16:02:25 +0000 (06:02 -1000)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Paolo Bonzini:
"ARM fixes:
- Another state update on exit to userspace fix
- Prevent the creation of mixed 32/64 VMs
- Fix regression with irqbypass not restarting the guest on failed
connect
- Fix regression with debug register decoding resulting in
overlapping access
- Commit exception state on exit to usrspace
- Fix the MMU notifier return values
- Add missing 'static' qualifiers in the new host stage-2 code
x86 fixes:
- fix guest missed wakeup with assigned devices
- fix WARN reported by syzkaller
- do not use BIT() in UAPI headers
- make the kvm_amd.avic parameter bool
PPC fixes:
- make halt polling heuristics consistent with other architectures
selftests:
- various fixes
- new performance selftest memslot_perf_test
- test UFFD minor faults in demand_paging_test"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (44 commits)
selftests: kvm: fix overlapping addresses in memslot_perf_test
KVM: X86: Kill off ctxt->ud
KVM: X86: Fix warning caused by stale emulation context
KVM: X86: Use kvm_get_linear_rip() in single-step and #DB/#BP interception
KVM: x86/mmu: Fix comment mentioning skip_4k
KVM: VMX: update vcpu posted-interrupt descriptor when assigning device
KVM: rename KVM_REQ_PENDING_TIMER to KVM_REQ_UNBLOCK
KVM: x86: add start_assignment hook to kvm_x86_ops
KVM: LAPIC: Narrow the timer latency between wait_lapic_expire and world switch
selftests: kvm: do only 1 memslot_perf_test run by default
KVM: X86: Use _BITUL() macro in UAPI headers
KVM: selftests: add shared hugetlbfs backing source type
KVM: selftests: allow using UFFD minor faults for demand paging
KVM: selftests: create alias mappings when using shared memory
KVM: selftests: add shmem backing source type
KVM: selftests: refactor vm_mem_backing_src_type flags
KVM: selftests: allow different backing source types
KVM: selftests: compute correct demand paging size
KVM: selftests: simplify setup_demand_paging error handling
KVM: selftests: Print a message if /dev/kvm is missing
...
Linus Torvalds [Sat, 29 May 2021 15:51:53 +0000 (05:51 -1000)]
Merge tag 's390-5.13-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 fixes from Vasily Gorbik:
"Fix races in vfio-ccw request handling"
* tag 's390-5.13-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
vfio-ccw: Serialize FSM IDLE state with I/O completion
vfio-ccw: Reset FSM state to IDLE inside FSM
vfio-ccw: Check initialized flag in cp_init()
Paolo Bonzini [Fri, 28 May 2021 19:10:58 +0000 (15:10 -0400)]
selftests: kvm: fix overlapping addresses in memslot_perf_test
vm_create allocates memory and maps it close to GPA. This memory
is separate from what is allocated in subsequent calls to
vm_userspace_mem_region_add, so it is incorrect to pass the
test memory size to vm_create_default. Just pass a small
fixed amount of memory which can be used later for page table,
otherwise GPAs are already allocated at MEM_GPA and the
test aborts.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Linus Torvalds [Sat, 29 May 2021 00:47:48 +0000 (14:47 -1000)]
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Ten small fixes, all in drivers"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: target: qla2xxx: Wait for stop_phase1 at WWN removal
scsi: hisi_sas: Drop free_irq() of devm_request_irq() allocated irq
scsi: vmw_pvscsi: Set correct residual data length
scsi: bnx2fc: Return failure if io_req is already in ABTS processing
scsi: aic7xxx: Remove multiple definition of globals
scsi: aic7xxx: Restore several defines for aic7xxx firmware build
scsi: target: iblock: Fix smp_processor_id() BUG messages
scsi: libsas: Use _safe() loop in sas_resume_port()
scsi: target: tcmu: Fix xarray RCU warning
scsi: target: core: Avoid smp_processor_id() in preemptible code
Linus Torvalds [Sat, 29 May 2021 00:42:37 +0000 (14:42 -1000)]
Merge tag 'block-5.13-2021-05-28' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
- NVMe pull request (Christoph):
- fix a memory leak in nvme_cdev_add (Guoqing Jiang)
- fix inline data size comparison in nvmet_tcp_queue_response (Hou
Pu)
- fix false keep-alive timeout when a controller is torn down
(Sagi Grimberg)
- fix a nvme-tcp Kconfig dependency (Sagi Grimberg)
- short-circuit reconnect retries for FC (Hannes Reinecke)
- decode host pathing error for connect (Hannes Reinecke)
* tag 'block-5.13-2021-05-28' of git://git.kernel.dk/linux-block:
nvmet: fix false keep-alive timeout when a controller is torn down
nvmet-tcp: fix inline data size comparison in nvmet_tcp_queue_response
nvme-tcp: remove incorrect Kconfig dep in BLK_DEV_NVME
md/raid5: remove an incorrect assert in in_chunk_boundary
s390/dasd: add missing discipline function
nvme-fabrics: decode host pathing error for connect
nvme-fc: short-circuit reconnect retries
nvme: fix potential memory leaks in nvme_cdev_add
Linus Torvalds [Sat, 29 May 2021 00:35:55 +0000 (14:35 -1000)]
Merge tag 'io_uring-5.13-2021-05-28' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
"A few minor fixes:
- Fix an issue with hashed wait removal on exit (Zqiang, Pavel)
- Fix a recent data race introduced in this series (Marco)"
* tag 'io_uring-5.13-2021-05-28' of git://git.kernel.dk/linux-block:
io_uring: fix data race to avoid potential NULL-deref
io-wq: Fix UAF when wakeup wqe in hash waitqueue
io_uring/io-wq: close io-wq full-stop gap
* tag 'drm-fixes-2021-05-29' of git://anongit.freedesktop.org/drm/drm:
drm/ttm: Skip swapout if ttm object is not populated
drm/i915: Reenable LTTPR non-transparent LT mode for DPCD_REV<1.4
drm/meson: fix shutdown crash when component not probed
drm/amdgpu/jpeg3: add cancel_delayed_work_sync before power gate
drm/amdgpu/jpeg2.5: add cancel_delayed_work_sync before power gate
drm/amdgpu/jpeg2.0: add cancel_delayed_work_sync before power gate
drm/amdgpu/vcn3: add cancel_delayed_work_sync before power gate
drm/amdgpu/vcn2.5: add cancel_delayed_work_sync before power gate
drm/amdgpu/vcn2.0: add cancel_delayed_work_sync before power gate
drm/amdgpu/vcn1: add cancel_delayed_work_sync before power gate
drm/amdkfd: correct sienna_cichlid SDMA RLC register offset error
drm/amd/pm: correct MGpuFanBoost setting
Linus Torvalds [Sat, 29 May 2021 00:15:47 +0000 (14:15 -1000)]
Merge tag '5.13-rc4-smb3' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Three SMB3 fixes.
Two for stable, and the other fixes a problem pointed out with a
recently added ioctl"
* tag '5.13-rc4-smb3' of git://git.samba.org/sfrench/cifs-2.6:
cifs: change format of CIFS_FULL_KEY_DUMP ioctl
cifs: fix string declarations and assignments in tracepoints
cifs: set server->cipher_type to AES-128-CCM for SMB3.0
Linus Torvalds [Fri, 28 May 2021 18:53:19 +0000 (08:53 -1000)]
Merge tag 'nfs-for-5.13-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
"Stable fixes:
- Fix v4.0/v4.1 SEEK_DATA return -ENOTSUPP when set NFS_V4_2 config
- Fix Oops in xs_tcp_send_request() when transport is disconnected
- Fix a NULL pointer dereference in pnfs_mark_matching_lsegs_return()
Bugfixes:
- Fix instances where signal_pending() should be fatal_signal_pending()
- fix an incorrect limit in filelayout_decode_layout()
- Fixes for the SUNRPC backlogged RPC queue
- Don't corrupt the value of pg_bytes_written in nfs_do_recoalesce()
- Revert commit 586a0787ce35 ("Clean up rpcrdma_prepare_readch()")"
* tag 'nfs-for-5.13-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
nfs: Remove trailing semicolon in macros
xprtrdma: Revert 586a0787ce35
NFSv4: Fix v4.0/v4.1 SEEK_DATA return -ENOTSUPP when set NFS_V4_2 config
NFS: Clean up reset of the mirror accounting variables
NFS: Don't corrupt the value of pg_bytes_written in nfs_do_recoalesce()
NFS: Fix an Oopsable condition in __nfs_pageio_add_request()
SUNRPC: More fixes for backlog congestion
SUNRPC: Fix Oops in xs_tcp_send_request() when transport is disconnected
NFSv4: Fix a NULL pointer dereference in pnfs_mark_matching_lsegs_return()
SUNRPC in case of backlog, hand free slots directly to waiting task
pNFS/NFSv4: Remove redundant initialization of 'rd_size'
NFS: fix an incorrect limit in filelayout_decode_layout()
fs/nfs: Use fatal_signal_pending instead of signal_pending
Linus Torvalds [Fri, 28 May 2021 18:47:50 +0000 (08:47 -1000)]
Merge tag 'sound-5.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"A slightly high volume at this time due to pending ASoC fixes.
While there are a few generic simple-card fixes for regressions, most
of the changes are device-specific fixes: ASoC Intel SOF, codec
clocks, other codec / platform fixes as well as usual HD-audio and
USB-audio"
* tag 'sound-5.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (37 commits)
ALSA: hda/realtek: fix mute/micmute LEDs and speaker for HP Zbook Fury 17 G8
ALSA: hda/realtek: fix mute/micmute LEDs and speaker for HP Zbook Fury 15 G8
ALSA: hda/realtek: fix mute/micmute LEDs and speaker for HP Zbook G8
ALSA: hda/realtek: fix mute/micmute LEDs for HP 855 G8
ALSA: hda/realtek: Chain in pop reduction fixup for ThinkStation P340
ALSA: usb-audio: scarlett2: snd_scarlett_gen2_controls_create() can be static
ALSA: hda/realtek: the bass speaker can't output sound on Yoga 9i
ALSA: hda/realtek: Headphone volume is controlled by Front mixer
ALSA: usb-audio: scarlett2: Improve driver startup messages
ALSA: usb-audio: scarlett2: Fix device hang with ehci-pci
ALSA: usb-audio: fix control-request direction
ASoC: qcom: lpass-cpu: Use optional clk APIs
ASoC: cs35l33: fix an error code in probe()
ASoC: SOF: Intel: hda: don't send DAI_CONFIG IPC for older firmware
ASoC: fsl: fix SND_SOC_IMX_RPMSG dependency
ASoC: cs42l52: Minor tidy up of error paths
ASoC: cs35l32: Add missing regmap use_single config
ASoC: cs35l34: Add missing regmap use_single config
ASoC: cs42l73: Add missing regmap use_single config
ASoC: cs53l30: Add missing regmap use_single config
...
- Avoid CFI mismatches by checking initcall_t types (Marco Elver)
* tag 'clang-features-v5.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
Makefile: LTO: have linker check -Wframe-larger-than
init: verify that function is initcall_t at compile-time
Linus Torvalds [Fri, 28 May 2021 18:24:13 +0000 (08:24 -1000)]
Merge tag 'mips-fixes_5.13_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux
Pull MIPS fixes from Thomas Bogendoerfer:
- fix function/preempt trace hangs
- a few build fixes
* tag 'mips-fixes_5.13_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux:
MIPS: Fix kernel hang under FUNCTION_GRAPH_TRACER and PREEMPT_TRACER
MIPS: ralink: export rt_sysc_membase for rt2880_wdt.c
MIPS: launch.h: add include guard to prevent build errors
MIPS: alchemy: xxs1500: add gpio-au1000.h header file