Cc: Yang Shi <yang.shi@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Yang Shi [Wed, 5 Dec 2018 00:13:32 +0000 (11:13 +1100)]
mm: ksm: do not block on page lock when searching stable tree
ksmd needs to search the stable tree to look for a suitable KSM page, but
the KSM page might be locked for a long time due to the KSM page rmap
walk.
It is not worth waiting for the lock; the page can be skipped and we can
then try to merge it in the next scan to avoid long stalls if its content
is still intact.
Introduce an async mode to get_ksm_page() to not block on the page lock,
as try_to_merge_one_page() does.
Return -EBUSY if the trylock fails, since a NULL means failure to find a
suitable KSM page, which is a valid case.
Link: http://lkml.kernel.org/r/1541618201-120667-2-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
ksmd found a suitable KSM page on the stable tree and is trying to lock
it. But it is locked by the direct reclaim path which is walking the
page's rmap to get the number of referenced PTEs.
The KSM page rmap walk needs to iterate all rmap_items of the page and all
rmap anon_vmas of each rmap_item. So it may take (# rmap_item * #
children processes) loops. This number of loops might be very large in
the worst case, and may take a long time.
Typically, direct reclaim will not intend to reclaim too many pages, and
it is latency sensitive. So it is not worth doing the long ksm page rmap
walk to reclaim just one page.
Skip KSM pages in direct reclaim if the reclaim priority is low, but still
try to reclaim KSM pages with high priority.
Link: http://lkml.kernel.org/r/1541618201-120667-1-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Anders Roxell [Wed, 5 Dec 2018 00:13:31 +0000 (11:13 +1100)]
writeback: don't decrement wb->refcnt if !wb->bdi
This happened while running in qemu-system-aarch64, the AMBA PL011 UART
driver when enabling CONFIG_DEBUG_TEST_DRIVER_REMOVE.
arch_initcall(pl011_init) came before subsys_initcall(default_bdi_init),
devtmpfs' handle_remove() crashes because the reference count is a NULL
pointer only because wb->bdi hasn't been initialized yet.
Rework so that wb_put have an extra check if wb->bdi before decrement
wb->refcnt and also add a WARN_ON_ONCE to get a warning if it happens
again in other drivers.
Link: http://lkml.kernel.org/r/20181030113545.30999-2-anders.roxell@linaro.org Fixes: 52ebea749aae ("writeback: make backing_dev_info host cgroup-specific bdi_writebacks") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Anders Roxell <anders.roxell@linaro.org> Co-developed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Contrary to its name, mmu_notifier_synchronize() does not synchronize the
notifier's SRCU instance, but rather waits for RCU callbacks to finished,
i.e. it invokes rcu_barrier(). The RCU documentation is quite clear on
this matter, explicitly calling out that rcu_barrier() does not imply
synchronize_rcu().
As there are no callers of mmu_notifier_synchronize() and it's unclear
whether any user of mmu_notifier_call_srcu() will ever want to barrier on
their callbacks, simply remove the function.
Link: http://lkml.kernel.org/r/20181106134705.14197-1-sean.j.christopherson@intel.com Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Jérôme Glisse <jglisse@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Balbir Singh [Wed, 5 Dec 2018 00:13:30 +0000 (11:13 +1100)]
mm/hotplug: optimize clear_hwpoisoned_pages()
In hot remove, we try to clear poisoned pages, but a small optimization to
check if num_poisoned_pages is 0 helps remove the iteration through
nr_pages.
Link: http://lkml.kernel.org/r/20181102120001.4526-1-bsingharora@gmail.com Signed-off-by: Balbir Singh <bsingharora@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrew Morton [Wed, 5 Dec 2018 00:13:30 +0000 (11:13 +1100)]
mm-page_owner-clamp-read-count-to-page_size-fix
use min_t()
Cc: Joe Perches <joe@perches.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Miles Chen <miles.chen@mediatek.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Miles Chen [Wed, 5 Dec 2018 00:13:30 +0000 (11:13 +1100)]
mm/page_owner: clamp read count to PAGE_SIZE
The (root-only) page owner read might allocate a large size of memory with
a large read count. Allocation fails can easily occur when doing high
order allocations.
Clamp buffer size to PAGE_SIZE to avoid arbitrary size allocation
and avoid allocation fails due to high order allocation.
Link: http://lkml.kernel.org/r/1541091607-27402-1-git-send-email-miles.chen@mediatek.com Signed-off-by: Miles Chen <miles.chen@mediatek.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Joe Perches <joe@perches.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Wed, 5 Dec 2018 00:13:30 +0000 (11:13 +1100)]
include/linux/slab.h: fix sparse warning in kmalloc_type()
Multiple people have reported the following sparse warning:
./include/linux/slab.h:332:43: warning: dubious: x & !y
The minimal fix would be to change the logical & to boolean &&, which
emits the same code, but Andrew has suggested that the branch-avoiding
tricks are maybe not worthwile. David Laight provided a nice comparison
of disassembly of multiple variants, which shows that the current version
produces a 4 deep dependency chain, and fixing the sparse warning by
changing logical and to multiplication emits an IMUL, making it even more
expensive.
The code as rewritten by this patch yielded the best disassembly, with a
single predictable branch for the most common case, and a ternary operator
for the rest, which gcc seems to compile without a branch or cmov by
itself.
The result should be more readable, without a sparse warning and probably
also faster for the common case.
Link: http://lkml.kernel.org/r/80340595-d7c5-97b9-4f6c-23fa893a91e9@suse.cz Fixes: 1291523f2c1d ("mm, slab/slub: introduce kmalloc-reclaimable caches") Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Bart Van Assche <bvanassche@acm.org> Reported-by: Darryl T. Agostinelli <dagostinelli@gmail.com> Reported-by: Masahiro Yamada <yamada.masahiro@socionext.com> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Suggested-by: David Laight <David.Laight@ACULAB.COM> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Wei Yang [Wed, 5 Dec 2018 00:13:29 +0000 (11:13 +1100)]
mm/slub.c: improve performance by skipping checked node in get_any_partial()
1. Background
Current slub has three layers:
* cpu_slab
* percpu_partial
* per node partial list
Slub allocator tries to get an object from top to bottom. When it
can't get an object from the upper two layers, it will search the per
node partial list. The is done in get_partial().
The idea behind this is: first try a local node, then try other nodes
if caller doesn't specify a node.
2. Room for Improvement
When we look one step deeper in get_any_partial(), it tries to get a
proper node by for_each_zone_zonelist(), which iterates on the
node_zonelists.
This behavior would introduce some redundant check on the same node.
Because:
* the local node is already checked in get_partial_node()
* one node may have several zones on node_zonelists
3. Solution Proposed in Patch
We could reduce these redundant check by record the last unsuccessful
node and then skip it.
4. Tests & Result
After some tests, the result shows this may improve the system a little,
especially on a machine with only one node.
4.1 Test Description
There are two cases for two system configurations.
Test Cases:
1. counter comparison
2. kernel build test
System Configuration:
1. One node machine with 4G
2. Four node machine with 8G
4.2 Result for Test 1
Test 1: counter comparison
This is a test with hacked kernel to record times function
get_any_partial() is invoked and times the inner loop iterates. By
comparing the ratio of two counters, we get to know how many inner
loops we skipped.
Here is a snip of the test patch.
---
static void *get_any_partial() {
get_partial_count++;
do {
for_each_zone_zonelist() {
get_partial_try_count++;
}
} while();
return NULL;
}
---
The result of (get_partial_count / get_partial_try_count):
Link: http://lkml.kernel.org/r/20181120033119.30013-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Wei Yang [Wed, 5 Dec 2018 00:13:29 +0000 (11:13 +1100)]
mm/slub.c: record final state of slub action in deactivate_slab()
If __cmpxchg_double_slab() fails and (l != m), current code records
transition states of slub action.
Update the action after __cmpxchg_double_slab() success to record the
final state.
Link: http://lkml.kernel.org/r/20181107013119.3816-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Wei Yang [Wed, 5 Dec 2018 00:13:29 +0000 (11:13 +1100)]
mm/slub.c: page is always non-NULL in node_match()
node_match() is a static function and is only invoked in slub.c.
In all three places, `page' is ensured to be valid.
Link: http://lkml.kernel.org/r/20181106150245.1668-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Wei Yang [Wed, 5 Dec 2018 00:13:28 +0000 (11:13 +1100)]
mm/slub.c: remove validation on cpu_slab in __flush_cpu_slab()
cpu_slab is a per cpu variable which is allocated in all or none. If a
cpu_slab failed to be allocated, the slub is not usable.
We could use cpu_slab without validation in __flush_cpu_slab().
Link: http://lkml.kernel.org/r/20181103141218.22844-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Josh Hunt [Wed, 5 Dec 2018 00:13:28 +0000 (11:13 +1100)]
block: restore /proc/partitions to not display non-partitionable removable devices
We found with newer kernels we started seeing the cdrom device showing
up in /proc/partitions, but it was not there before.
Looking into this I found that commit d27769ec ("block: add
GENHD_FL_NO_PART_SCAN") introduces this change in behavior. It's not
clear to me from the commit's changelog if this change was intentional or
not. This comment still remains: /* Don't show non-partitionable
removeable devices or empty devices */ so I've decided to send a patch to
restore the behavior of not printing unpartitionable removable devices.
Signed-off-by: Josh Hunt <johunt@akamai.com> Cc: Tejun Heo <tj@kernel.org> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
WARNING: Prefer 'unsigned int' to bare use of 'unsigned'
#42: FILE: fs/ocfs2/aops.c:2155:
+ unsigned i_blkbits = inode->i_sb->s_blocksize_bits;
ERROR: code indent should use tabs where possible
#53: FILE: fs/ocfs2/aops.c:2166:
+ ^I * "pos" and "end", we need map twice to return different buffer state:$
WARNING: please, no space before tabs
#53: FILE: fs/ocfs2/aops.c:2166:
+ ^I * "pos" and "end", we need map twice to return different buffer state:$
ERROR: code indent should use tabs where possible
#54: FILE: fs/ocfs2/aops.c:2167:
+ ^I * 1. area in file size, not set NEW;$
WARNING: please, no space before tabs
#54: FILE: fs/ocfs2/aops.c:2167:
+ ^I * 1. area in file size, not set NEW;$
ERROR: code indent should use tabs where possible
#55: FILE: fs/ocfs2/aops.c:2168:
+ ^I * 2. area out file size, set NEW.$
WARNING: please, no space before tabs
#55: FILE: fs/ocfs2/aops.c:2168:
+ ^I * 2. area out file size, set NEW.$
ERROR: code indent should use tabs where possible
#56: FILE: fs/ocfs2/aops.c:2169:
+ ^I *$
WARNING: please, no space before tabs
#56: FILE: fs/ocfs2/aops.c:2169:
+ ^I *$
ERROR: code indent should use tabs where possible
#57: FILE: fs/ocfs2/aops.c:2170:
+ ^I *^I^I iblock endblk$
WARNING: please, no space before tabs
#57: FILE: fs/ocfs2/aops.c:2170:
+ ^I *^I^I iblock endblk$
ERROR: code indent should use tabs where possible
#58: FILE: fs/ocfs2/aops.c:2171:
+ ^I * |--------|---------|---------|---------$
WARNING: please, no space before tabs
#58: FILE: fs/ocfs2/aops.c:2171:
+ ^I * |--------|---------|---------|---------$
ERROR: code indent should use tabs where possible
#59: FILE: fs/ocfs2/aops.c:2172:
+ ^I * |<-------area in file------->|$
WARNING: please, no space before tabs
#59: FILE: fs/ocfs2/aops.c:2172:
+ ^I * |<-------area in file------->|$
ERROR: code indent should use tabs where possible
#60: FILE: fs/ocfs2/aops.c:2173:
+ ^I */$
WARNING: please, no space before tabs
#60: FILE: fs/ocfs2/aops.c:2173:
+ ^I */$
total: 8 errors, 9 warnings, 40 lines checked
NOTE: For some of the reported defects, checkpatch may be able to
mechanically convert to the typical style using --fix or --fix-inplace.
NOTE: Whitespace errors detected.
You may wish to use scripts/cleanpatch or scripts/cleanfile
./patches/ocfs2-clear-zero-in-unaligned-direct-io.patch has style problems, please review.
NOTE: If any of the errors are false positives, please report
them to the maintainer, see CHECKPATCH in MAINTAINERS.
Please run checkpatch prior to sending patches
Cc: Jia Guo <guojia12@huawei.com> Cc: Yiwen Jiang <jiangyiwen@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Jia Guo [Wed, 5 Dec 2018 00:13:27 +0000 (11:13 +1100)]
ocfs2: clear zero in unaligned direct IO
Unused portion of a part-written fs-block-sized block is not set to zero
in unaligned append direct write.This can lead to serious data
inconsistencies.
Ocfs2 manage disk with cluster size(for example, 1M), part-written in one
cluster will change the cluster state from UN-WRITTEN to WRITTEN,
VFS(function dio_zero_block) doesn't do the cleaning because bh's state is
not set to NEW in function ocfs2_dio_wr_get_block when we write a WRITTEN
cluster. For example, the cluster size is 1M, file size is 8k and we
direct write from 14k to 15k, then 12k~14k and 15k~16k will contain dirty
data.
We have to deal with two cases:
1.The starting position of direct write is outside the file.
2.The starting position of direct write is located in the file.
We need set bh's state to NEW in the first case. In the second case, we
need mapped twice because bh's state of area out file should be set to NEW
while area in file not.
Link: http://lkml.kernel.org/r/5292e287-8f1a-fd4a-1a14-661e555e0bed@huawei.com Signed-off-by: Jia Guo <guojia12@huawei.com> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com> Cc: Mark Fasheh <mfasheh@versity.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Junxiao Bi [Wed, 5 Dec 2018 00:13:27 +0000 (11:13 +1100)]
ocfs2: don't clear bh uptodate for block read
For sync io read in ocfs2_read_blocks_sync(), first clear bh uptodate flag
and submit the io, second wait io done, last check whether bh uptodate, if
not return io error.
If two sync io for the same bh were issued, it could be the first io done
and set uptodate flag, but just before check that flag, the second io came
in and cleared uptodate, then ocfs2_read_blocks_sync() for the first io
will return IO error.
Indeed it's not necessary to clear uptodate flag, as the io end handler
end_buffer_read_sync() will set or clear it based on io succeed or failed.
The following message was found from a nfs server but the underlying
storage returned no error.
[4106438.567376] (nfsd,7146,3):ocfs2_get_suballoc_slot_bit:2780 ERROR: read block 1238823695 failed -5
[4106438.567569] (nfsd,7146,3):ocfs2_get_suballoc_slot_bit:2812 ERROR: status = -5
[4106438.567611] (nfsd,7146,3):ocfs2_test_inode_bit:2894 ERROR: get alloc slot and bit failed -5
[4106438.567643] (nfsd,7146,3):ocfs2_test_inode_bit:2932 ERROR: status = -5
[4106438.567675] (nfsd,7146,3):ocfs2_get_dentry:94 ERROR: test inode bit failed -5
Same issue in non sync read ocfs2_read_blocks(), fixed it as well.
Link: http://lkml.kernel.org/r/20181121020023.3034-4-junxiao.bi@oracle.com Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Changwei Ge <ge.changwei@h3c.com> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Jun Piao <piaojun@huawei.com> Cc: Mark Fasheh <mfasheh@versity.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Junxiao Bi [Wed, 5 Dec 2018 00:13:27 +0000 (11:13 +1100)]
ocfs2: clear journal dirty flag after shutdown journal
Dirty flag of the journal should be cleared at the last stage of umount,
if do it before jbd2_journal_destroy(), then some metadata in uncommitted
transaction could be lost due to io error, but as dirty flag of journal
was already cleared, we can't find that until run a full fsck. This may
cause system panic or other corruption.
Link: http://lkml.kernel.org/r/20181121020023.3034-3-junxiao.bi@oracle.com Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com> Reviewed-by: Joseph Qi <jiangqi903@gmail.com> Cc: Jun Piao <piaojun@huawei.com> Cc: Changwei Ge <ge.changwei@h3c.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@versity.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Junxiao Bi [Wed, 5 Dec 2018 00:13:27 +0000 (11:13 +1100)]
ocfs2: fix panic due to unrecovered local alloc
mount.ocfs2 ignore the inconsistent error that journal is clean but local
alloc is unrecovered. After mount, local alloc not empty, then reserver
cluster didn't alloc a new local alloc window, reserveration map is
empty(ocfs2_reservation_map.m_bitmap_len = 0), that triggered the
following panic.
This issue was reported at
https://oss.oracle.com/pipermail/ocfs2-devel/2015-May/010854.html and was
advised to fixed during mount. But this is a very unusual inconsistent
state, usually journal dirty flag should be cleared at the last stage of
umount until every other things go right. We may need do further debug to
check that. Any way to avoid possible futher corruption, mount should be
abort and fsck should be run.
Link: http://lkml.kernel.org/r/20181121020023.3034-2-junxiao.bi@oracle.com Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com> Acked-by: Joseph Qi <jiangqi903@gmail.com> Cc: Jun Piao <piaojun@huawei.com> Cc: Mark Fasheh <mfasheh@versity.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Larry Chen [Wed, 5 Dec 2018 00:13:27 +0000 (11:13 +1100)]
ocfs2: improve ocfs2 Makefile
Included file path was hard-wired in the ocfs2 makefile, which might
causes some confusion when compiling ocfs2 as an external module.
Say if we compile ocfs2 module as following.
cp -r /kernel/tree/fs/ocfs2 /other/dir/ocfs2
cd /other/dir/ocfs2
make -C /path/to/kernel_source M=`pwd` modules
Acutally, the compiler wil try to find included file in
/kernel/tree/fs/ocfs2, rather than the directory /other/dir/ocfs2.
To fix this little bug, we introduce the var $(src) provided by kbuild.
$(src) means the absolute path of the running kbuild file.
Link: http://lkml.kernel.org/r/20181108085546.15149-1-lchen@suse.com Signed-off-by: Larry Chen <lchen@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
zhong jiang [Wed, 5 Dec 2018 00:13:26 +0000 (11:13 +1100)]
ocfs2: remove set but not used variable 'lastzero'
lastzero is not used after setting its value. It is safe to remove the
unused variable.
Link: http://lkml.kernel.org/r/1540296942-24533-1-git-send-email-zhongjiang@huawei.com Signed-off-by: zhong jiang <zhongjiang@huawei.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
zhong jiang [Wed, 5 Dec 2018 00:13:26 +0000 (11:13 +1100)]
ocfs2: dlmfs: remove set but not used variable 'status'
status is not used after setting its value. It is safe to remove the
unused variable.
Link: http://lkml.kernel.org/r/1540300179-26697-1-git-send-email-zhongjiang@huawei.com Signed-off-by: zhong jiang <zhongjiang@huawei.com> Reviewed-by: Jun Piao <piaojun@huawei.com> Acked-by: Joseph Qi <jiangqi903@gmail.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Jia Guo [Wed, 5 Dec 2018 00:13:26 +0000 (11:13 +1100)]
ocfs2: optimize the reading of heartbeat data
Reading heartbeat data from lowest node rather than from zero, in cases
where the node is not defined from zero, can reduce the number of sectors
read.
Here is a simple test data obtained with 'iostat -dmx dm-5 2', with
two nodes in the cluster, node number 10, 20, respectively.
Qian Cai [Wed, 5 Dec 2018 00:13:26 +0000 (11:13 +1100)]
debugobjects: call debug_objects_mem_init eariler
The current value of the early boot static pool size, 1024 is not big
enough for systems with large number of CPUs with timer or/and workqueue
objects selected. As the results, systems have 60+ CPUs with both timer
and workqueue objects enabled could trigger "ODEBUG: Out of memory.
ODEBUG disabled".
Some debug objects are allocated during the early boot. Enabling some
options like timers or workqueue objects may increase the size required
significantly with large number of CPUs. For example,
plus No. CPUs x 6 (workqueue) objects:
workqueue_init_early
alloc_workqueue
__alloc_workqueue_key
alloc_and_link_pwqs
init_pwq
Also, plus No. CPUs objects:
perf_event_init
__init_srcu_struct
init_srcu_struct_fields
init_srcu_struct_nodes
__init_work
However, none of the things are actually used or required before
debug_objects_mem_init() is invoked, so just move the call right before
vmalloc_init().
According to tglx, "the reason why the call is at this place in
start_kernel() is historical. It's because back in the days when
debugobjects were added the memory allocator was enabled way later than
today."
Link: http://lkml.kernel.org/r/20181126102407.1836-1-cai@gmx.us Signed-off-by: Qian Cai <cai@gmx.us> Suggested-by: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman Long <longman@redhat.com> Cc: Yang Shi <yang.shi@linux.alibaba.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
arch/sh/boards/mach-kfr2r09/setup.c does not need to #include
<mtd/onenand.h>, and doing so causes a build warning, so drop that header
file.
In file included from ../arch/sh/boards/mach-kfr2r09/setup.c:28:
../include/linux/mtd/onenand.h:225:12: warning: 'struct mtd_oob_ops' declared inside parameter list will not be visible outside of this definition or declaration
struct mtd_oob_ops *ops);
Link: http://lkml.kernel.org/r/702f0a25-c63e-6912-4640-6ab0f00afbc7@infradead.org Fixes: f3590dc32974 ("media: arch: sh: kfr2r09: Use new renesas-ceu camera driver") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Suggested-by: Miquel Raynal <miquel.raynal@bootlin.com> Reviewed-by: Miquel Raynal <miquel.raynal@bootlin.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Rich Felker <dalias@libc.org> Cc: Jacopo Mondi <jacopo+renesas@jmondi.org> Cc: Magnus Damm <magnus.damm@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrey Konovalov [Wed, 5 Dec 2018 00:13:24 +0000 (11:13 +1100)]
kasan, mm, arm64: tag non slab memory allocated via pagealloc
Tag-based KASAN doesn't check memory accesses through pointers tagged with
0xff. When page_address is used to get pointer to memory that corresponds
to some page, the tag of the resulting pointer gets set to 0xff, even
though the allocated memory might have been tagged differently.
For slab pages it's impossible to recover the correct tag to return from
page_address, since the page might contain multiple slab objects tagged
with different values, and we can't know in advance which one of them is
going to get accessed. For non slab pages however, we can recover the tag
in page_address, since the whole page was marked with the same tag.
This patch adds tagging to non slab memory allocated with pagealloc. To
set the tag of the pointer returned from page_address, the tag gets stored
to page->flags when the memory gets allocated.
Link: http://lkml.kernel.org/r/6c1004acf28880f6a5cc7d2f974ba08adb2853ea.1543337629.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrey Konovalov [Wed, 5 Dec 2018 00:13:24 +0000 (11:13 +1100)]
kasan, arm64: add brk handler for inline instrumentation
Tag-based KASAN inline instrumentation mode (which embeds checks of shadow
memory into the generated code, instead of inserting a callback) generates
a brk instruction when a tag mismatch is detected.
This commit adds a tag-based KASAN specific brk handler, that decodes the
immediate value passed to the brk instructions (to extract information
about the memory access that triggered the mismatch), reads the register
values (x0 contains the guilty address) and reports the bug.
Link: http://lkml.kernel.org/r/e825441eda1dbbbb7f583f826a66c94e6f88316a.1543337629.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrey Konovalov [Wed, 5 Dec 2018 00:13:24 +0000 (11:13 +1100)]
kasan: add hooks implementation for tag-based mode
This commit adds tag-based KASAN specific hooks implementation and
adjusts common generic and tag-based KASAN ones.
1. When a new slab cache is created, tag-based KASAN rounds up the size of
the objects in this cache to KASAN_SHADOW_SCALE_SIZE (== 16).
2. On each kmalloc tag-based KASAN generates a random tag, sets the shadow
memory, that corresponds to this object to this tag, and embeds this
tag value into the top byte of the returned pointer.
3. On each kfree tag-based KASAN poisons the shadow memory with a random
tag to allow detection of use-after-free bugs.
The rest of the logic of the hook implementation is very much similar to
the one provided by generic KASAN. Tag-based KASAN saves allocation and
free stack metadata to the slab object the same way generic KASAN does.
Link: http://lkml.kernel.org/r/b10d44bace6a7e9279b9b5a5b4c2a9c4c58cbf4f.1543337629.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrey Konovalov [Wed, 5 Dec 2018 00:13:23 +0000 (11:13 +1100)]
mm: move obj_to_index to include/linux/slab_def.h
While with SLUB we can actually preassign tags for caches with contructors
and store them in pointers in the freelist, SLAB doesn't allow that since
the freelist is stored as an array of indexes, so there are no pointers to
store the tags.
Instead we compute the tag twice, once when a slab is created before
calling the constructor and then again each time when an object is
allocated with kmalloc. Tag is computed simply by taking the lowest byte of
the index that corresponds to the object. However in kasan_kmalloc we only
have access to the objects pointer, so we need a way to find out which
index this object corresponds to.
This patch moves obj_to_index from slab.c to include/linux/slab_def.h to
be reused by KASAN.
Link: http://lkml.kernel.org/r/b68796c554fba66d5285274ea6356e642e18a9e5.1543337629.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrey Konovalov [Wed, 5 Dec 2018 00:13:23 +0000 (11:13 +1100)]
kasan: add bug reporting routines for tag-based mode
This commit adds rountines, that print tag-based KASAN error reports.
Those are quite similar to generic KASAN, the difference is:
1. The way tag-based KASAN finds the first bad shadow cell (with a
mismatching tag). Tag-based KASAN compares memory tags from the shadow
memory to the pointer tag.
2. Tag-based KASAN reports all bugs with the "KASAN: invalid-access"
header.
Also simplify generic KASAN find_first_bad_addr.
Link: http://lkml.kernel.org/r/996c09ec2c8f11294c106973f3b1a211417fa74e.1543337629.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrey Konovalov [Wed, 5 Dec 2018 00:13:23 +0000 (11:13 +1100)]
kasan: split out generic_report.c from report.c
This patch moves generic KASAN specific error reporting routines to
generic_report.c without any functional changes, leaving common error
reporting code in report.c to be later reused by tag-based KASAN.
Link: http://lkml.kernel.org/r/9030fe246a786be1348f8b08089f30e52be23ec4.1543337629.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrey Konovalov [Wed, 5 Dec 2018 00:13:23 +0000 (11:13 +1100)]
kasan, mm: perform untagged pointers comparison in krealloc
The krealloc function checks where the same buffer was reused or a new one
allocated by comparing kernel pointers. Tag-based KASAN changes memory tag
on the krealloc'ed chunk of memory and therefore also changes the pointer
tag of the returned pointer. Therefore we need to perform comparison on
untagged (with tags reset) pointers to check whether it's the same memory
region or not.
Link: http://lkml.kernel.org/r/5045db8a8e249a1eda3199b952120035eacb3bd4.1543337629.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrey Konovalov [Wed, 5 Dec 2018 00:13:23 +0000 (11:13 +1100)]
kasan, arm64: enable top byte ignore for the kernel
Tag-based KASAN uses the Top Byte Ignore feature of arm64 CPUs to store a
pointer tag in the top byte of each pointer. This commit enables the
TCR_TBI1 bit, which enables Top Byte Ignore for the kernel, when tag-based
KASAN is used.
Link: http://lkml.kernel.org/r/1ed03d53ee679cba52ba7118d2acbef948d21fcc.1543337629.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrey Konovalov [Wed, 5 Dec 2018 00:13:22 +0000 (11:13 +1100)]
kasan, arm64: fix up fault handling logic
Right now arm64 fault handling code removes pointer tags from addresses
covered by TTBR0 in faults taken from both EL0 and EL1, but doesn't do
that for pointers covered by TTBR1.
This patch adds two helper functions is_ttbr0_addr() and is_ttbr1_addr(),
where the latter one accounts for the fact that TTBR1 pointers might be
tagged when tag-based KASAN is in use, and uses these helper functions to
perform pointer checks in arch/arm64/mm/fault.c.
Link: http://lkml.kernel.org/r/a54fe8c8c11948b0ac8c8b285fb36f845217c84a.1543337629.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Suggested-by: Mark Rutland <mark.rutland@arm.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrey Konovalov [Wed, 5 Dec 2018 00:13:22 +0000 (11:13 +1100)]
kasan: preassign tags to objects with ctors or SLAB_TYPESAFE_BY_RCU
An object constructor can initialize pointers within this objects based on
the address of the object. Since the object address might be tagged, we
need to assign a tag before calling constructor.
The implemented approach is to assign tags to objects with constructors
when a slab is allocated and call constructors once as usual. The
downside is that such object would always have the same tag when it is
reallocated, so we won't catch use-after-frees on it.
Also pressign tags for objects from SLAB_TYPESAFE_BY_RCU caches, since
they can be validy accessed after having been freed.
Link: http://lkml.kernel.org/r/b2c17b6674f1737f981ffa6dca7fdfc059a88435.1543337629.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrey Konovalov [Wed, 5 Dec 2018 00:13:22 +0000 (11:13 +1100)]
kasan, arm64: untag address in _virt_addr_is_linear
virt_addr_is_linear (which is used by virt_addr_valid) assumes that the
top byte of the address is 0xff, which isn't always the case with
tag-based KASAN.
This patch resets the tag in this macro.
Link: http://lkml.kernel.org/r/dd9cda296c70ca6b1839cf4de3ee3137cf5030e7.1543337629.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrey Konovalov [Wed, 5 Dec 2018 00:13:22 +0000 (11:13 +1100)]
kasan: add tag related helper functions
This commit adds a few helper functions, that are meant to be used to
work with tags embedded in the top byte of kernel pointers: to set, to
get or to reset the top byte.
Andrey Konovalov [Wed, 5 Dec 2018 00:13:22 +0000 (11:13 +1100)]
arm64: move untagged_addr macro from uaccess.h to memory.h
Move the untagged_addr() macro from arch/arm64/include/asm/uaccess.h
to arch/arm64/include/asm/memory.h to be later reused by KASAN.
Also make the untagged_addr() macro accept all kinds of address types
(void *, unsigned long, etc.). This allows not to specify type casts in
each place where the macro is used. This is done by using __typeof__.
Andrey Konovalov [Wed, 5 Dec 2018 00:13:21 +0000 (11:13 +1100)]
kasan: initialize shadow to 0xff for tag-based mode
A tag-based KASAN shadow memory cell contains a memory tag, that
corresponds to the tag in the top byte of the pointer, that points to that
memory. The native top byte value of kernel pointers is 0xff, so with
tag-based KASAN we need to initialize shadow memory to 0xff.
Link: http://lkml.kernel.org/r/9004fd16d56d8772775cf671a8fa66e54ed138dd.1543337629.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrey Konovalov [Wed, 5 Dec 2018 00:13:21 +0000 (11:13 +1100)]
kasan: rename kasan_zero_page to kasan_early_shadow_page
With tag based KASAN mode the early shadow value is 0xff and not 0x00,
so this patch renames kasan_zero_(page|pte|pmd|pud|p4d) to
kasan_early_shadow_(page|pte|pmd|pud|p4d) to avoid confusion.
Link: http://lkml.kernel.org/r/1aa5a8e562380e0bfb1d58c8ec3130436cff8b47.1543337629.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Suggested-by: Mark Rutland <mark.rutland@arm.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrey Konovalov [Wed, 5 Dec 2018 00:13:21 +0000 (11:13 +1100)]
kasan: add CONFIG_KASAN_GENERIC and CONFIG_KASAN_SW_TAGS
This commit splits the current CONFIG_KASAN config option into two:
1. CONFIG_KASAN_GENERIC, that enables the generic KASAN mode (the one
that exists now);
2. CONFIG_KASAN_SW_TAGS, that enables the software tag-based KASAN mode.
The name CONFIG_KASAN_SW_TAGS is chosen as in the future we will have
another hardware tag-based KASAN mode, that will rely on hardware memory
tagging support in arm64.
With CONFIG_KASAN_SW_TAGS enabled, compiler options are changed to
instrument kernel files with -fsantize=kernel-hwaddress (except the ones
for which KASAN_SANITIZE := n is set).
Both CONFIG_KASAN_GENERIC and CONFIG_KASAN_SW_TAGS support both
CONFIG_KASAN_INLINE and CONFIG_KASAN_OUTLINE instrumentation modes.
This commit also adds empty placeholder (for now) implementation of
tag-based KASAN specific hooks inserted by the compiler and adjusts
common hooks implementation.
While this commit adds the CONFIG_KASAN_SW_TAGS config option, this option
is not selectable, as it depends on HAVE_ARCH_KASAN_SW_TAGS, which we will
enable once all the infrastracture code has been added.
Link: http://lkml.kernel.org/r/20728567aae93b5eb88a6636c94c1af73db7cdbc.1543337629.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrey Konovalov [Wed, 5 Dec 2018 00:13:20 +0000 (11:13 +1100)]
kasan: rename source files to reflect the new naming scheme
We now have two KASAN modes: generic KASAN and tag-based KASAN. Rename
kasan.c to generic.c to reflect that. Also rename kasan_init.c to init.c
as it contains initialization code for both KASAN modes.
Link: http://lkml.kernel.org/r/a6cd5acdc334d532754279a7d2d770e2be96419d.1543337629.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrey Konovalov [Wed, 5 Dec 2018 00:13:20 +0000 (11:13 +1100)]
kasan, slub: handle pointer tags in early_kmem_cache_node_alloc()
The previous patch updated KASAN hooks signatures and their usage in SLAB
and SLUB code, except for the early_kmem_cache_node_alloc function. This
patch handles that function separately, as it requires to reorder some of
the initialization code to correctly propagate a tagged pointer in case a
tag is assigned by kasan_kmalloc.
Link: http://lkml.kernel.org/r/3459b9f5d4daf96668d9579da2cf15eb5523284b.1543337629.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrey Konovalov [Wed, 5 Dec 2018 00:13:20 +0000 (11:13 +1100)]
kasan, mm: change hooks signatures
Patch series "kasan: add software tag-based mode for arm64", v12.
This patchset adds a new software tag-based mode to KASAN [1].
(Initially this mode was called KHWASAN, but it got renamed,
see the naming rationale at the end of this section).
The plan is to implement HWASan [2] for the kernel with the incentive,
that it's going to have comparable to KASAN performance, but in the same
time consume much less memory, trading that off for somewhat imprecise
bug detection and being supported only for arm64.
The underlying ideas of the approach used by software tag-based KASAN are:
1. By using the Top Byte Ignore (TBI) arm64 CPU feature, we can store
pointer tags in the top byte of each kernel pointer.
2. Using shadow memory, we can store memory tags for each chunk of kernel
memory.
3. On each memory allocation, we can generate a random tag, embed it into
the returned pointer and set the memory tags that correspond to this
chunk of memory to the same value.
4. By using compiler instrumentation, before each memory access we can add
a check that the pointer tag matches the tag of the memory that is being
accessed.
5. On a tag mismatch we report an error.
With this patchset the existing KASAN mode gets renamed to generic KASAN,
with the word "generic" meaning that the implementation can be supported
by any architecture as it is purely software.
The new mode this patchset adds is called software tag-based KASAN. The
word "tag-based" refers to the fact that this mode uses tags embedded into
the top byte of kernel pointers and the TBI arm64 CPU feature that allows
to dereference such pointers. The word "software" here means that shadow
memory manipulation and tag checking on pointer dereference is done in
software. As it is the only tag-based implementation right now, "software
tag-based" KASAN is sometimes referred to as simply "tag-based" in this
patchset.
A potential expansion of this mode is a hardware tag-based mode, which would
use hardware memory tagging support (announced by Arm [3]) instead of
compiler instrumentation and manual shadow memory manipulation.
Same as generic KASAN, software tag-based KASAN is strictly a debugging
feature.
On mobile devices generic KASAN's memory usage is significant problem. One
of the main reasons to have tag-based KASAN is to be able to perform a
similar set of checks as the generic one does, but with lower memory
requirements.
Comment from Vishwath Mohan <vishwath@google.com>:
I don't have data on-hand, but anecdotally both ASAN and KASAN have proven
problematic to enable for environments that don't tolerate the increased
memory pressure well. This includes,
(a) Low-memory form factors - Wear, TV, Things, lower-tier phones like Go,
(c) Connected components like Pixel's visual core [1].
These are both places I'd love to have a low(er) memory footprint option at
my disposal.
Comment from Evgenii Stepanov <eugenis@google.com>:
Looking at a live Android device under load, slab (according to
/proc/meminfo) + kernel stack take 8-10% available RAM (~350MB). KASAN's
overhead of 2x - 3x on top of it is not insignificant.
Not having this overhead enables near-production use - ex. running
KASAN/KHWASAN kernel on a personal, daily-use device to catch bugs that do
not reproduce in test configuration. These are the ones that often cost
the most engineering time to track down.
CPU overhead is bad, but generally tolerable. RAM is critical, in our
experience. Once it gets low enough, OOM-killer makes your life miserable.
Software tag-based KASAN mode is implemented in a very similar way to the
generic one. This patchset essentially does the following:
1. TCR_TBI1 is set to enable Top Byte Ignore.
2. Shadow memory is used (with a different scale, 1:16, so each shadow
byte corresponds to 16 bytes of kernel memory) to store memory tags.
3. All slab objects are aligned to shadow scale, which is 16 bytes.
4. All pointers returned from the slab allocator are tagged with a random
tag and the corresponding shadow memory is poisoned with the same value.
5. Compiler instrumentation is used to insert tag checks. Either by
calling callbacks or by inlining them (CONFIG_KASAN_OUTLINE and
CONFIG_KASAN_INLINE flags are reused).
6. When a tag mismatch is detected in callback instrumentation mode
KASAN simply prints a bug report. In case of inline instrumentation,
clang inserts a brk instruction, and KASAN has it's own brk handler,
which reports the bug.
7. The memory in between slab objects is marked with a reserved tag, and
acts as a redzone.
8. When a slab object is freed it's marked with a reserved tag.
Bug detection is imprecise for two reasons:
1. We won't catch some small out-of-bounds accesses, that fall into the
same shadow cell, as the last byte of a slab object.
2. We only have 1 byte to store tags, which means we have a 1/256
probability of a tag match for an incorrect access (actually even
slightly less due to reserved tag values).
Despite that there's a particular type of bugs that tag-based KASAN can
detect compared to generic KASAN: use-after-free after the object has been
allocated by someone else.
====== Testing
Some kernel developers voiced a concern that changing the top byte of
kernel pointers may lead to subtle bugs that are difficult to discover.
To address this concern deliberate testing has been performed.
It doesn't seem feasible to do some kind of static checking to find
potential issues with pointer tagging, so a dynamic approach was taken.
All pointer comparisons/subtractions have been instrumented in an LLVM
compiler pass and a kernel module that would print a bug report whenever
two pointers with different tags are being compared/subtracted (ignoring
comparisons with NULL pointers and with pointers obtained by casting an
error code to a pointer type) has been used. Then the kernel has been
booted in QEMU and on an Odroid C2 board and syzkaller has been run.
This yielded the following results.
The two places that look interesting are:
is_vmalloc_addr in include/linux/mm.h
is_kernel_rodata in mm/util.c
Here we compare a pointer with some fixed untagged values to make sure
that the pointer lies in a particular part of the kernel address space.
Since tag-based KASAN doesn't add tags to pointers that belong to rodata
or vmalloc regions, this should work as is. To make sure debug checks to
those two functions that check that the result doesn't change whether
we operate on pointers with or without untagging has been added.
A few other cases that don't look that interesting:
Comparing pointers to achieve unique sorting order of pointee objects
(e.g. sorting locks addresses before performing a double lock):
tty_ldisc_lock_pair_timeout in drivers/tty/tty_ldisc.c
pipe_double_lock in fs/pipe.c
unix_state_double_lock in net/unix/af_unix.c
lock_two_nondirectories in fs/inode.c
mutex_lock_double in kernel/events/core.c
ep_cmp_ffd in fs/eventpoll.c
fsnotify_compare_groups fs/notify/mark.c
Nothing needs to be done here, since the tags embedded into pointers
don't change, so the sorting order would still be unique.
Checks that a pointer belongs to some particular allocation:
is_sibling_entry in lib/radix-tree.c
object_is_on_stack in include/linux/sched/task_stack.h
Nothing needs to be done here either, since two pointers can only belong
to the same allocation if they have the same tag.
Overall, since the kernel boots and works, there are no critical bugs.
As for the rest, the traditional kernel testing way (use until fails) is
the only one that looks feasible.
Another point here is that tag-based KASAN is available under a separate
config option that needs to be deliberately enabled. Even though it might
be used in a "near-production" environment to find bugs that are not found
during fuzzing or running tests, it is still a debug tool.
====== Benchmarks
The following numbers were collected on Odroid C2 board. Both generic and
tag-based KASAN were used in inline instrumentation mode.
Boot time [1]:
* ~1.7 sec for clean kernel
* ~5.0 sec for generic KASAN
* ~5.0 sec for tag-based KASAN
Network performance [2]:
* 8.33 Gbits/sec for clean kernel
* 3.17 Gbits/sec for generic KASAN
* 2.85 Gbits/sec for tag-based KASAN
Slab memory usage after boot [3]:
* ~40 kb for clean kernel
* ~105 kb (~260% overhead) for generic KASAN
* ~47 kb (~20% overhead) for tag-based KASAN
KASAN memory overhead consists of three main parts:
1. Increased slab memory usage due to redzones.
2. Shadow memory (the whole reserved once during boot).
3. Quaratine (grows gradually until some preset limit; the more the limit,
the more the chance to detect a use-after-free).
Comparing tag-based vs generic KASAN for each of these points:
1. 20% vs 260% overhead.
2. 1/16th vs 1/8th of physical memory.
3. Tag-based KASAN doesn't require quarantine.
[1] Time before the ext4 driver is initialized.
[2] Measured as `iperf -s & iperf -c 127.0.0.1 -t 30`.
[3] Measured as `cat /proc/meminfo | grep Slab`.
====== Some notes
A few notes:
1. The patchset can be found here:
https://github.com/xairy/kasan-prototype/tree/khwasan
2. Building requires a recent Clang version (7.0.0 or later).
3. Stack instrumentation is not supported yet and will be added later.
This patch (of 25):
Tag-based KASAN changes the value of the top byte of pointers returned
from the kernel allocation functions (such as kmalloc). This patch
updates KASAN hooks signatures and their usage in SLAB and SLUB code to
reflect that.
Link: http://lkml.kernel.org/r/10068968c5dbdd1913bd60f89262e3c1a50f3d38.1543337629.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
VM_DATA_DEFAULT_FLAGS uses READ_IMPLIES_EXEC, so page.h should include
personality.h to provide this.
This fixes no known bugs and can be safely ignored ;)
Cc: Russell King <linux@arm.linux.org.uk> Cc: Will Deacon <will.deacon@arm.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Piotr Jaroszynski [Wed, 5 Dec 2018 00:13:19 +0000 (11:13 +1100)]
fs/iomap.c: get/put the page in iomap_page_create/release()
migrate_page_move_mapping() expects pages with private data set to have a
page_count elevated by 1. This is what used to happen for xfs through the
buffer_heads code before the switch to iomap in commit 82cb14175e7d ("xfs:
add support for sub-pagesize writeback without buffer_heads"). Not having
the count elevated causes move_pages() to fail on memory mapped files
coming from xfs.
Make iomap compatible with the migrate_page_move_mapping() assumption by
elevating the page count as part of iomap_page_create() and lowering it in
iomap_page_release().
Link: http://lkml.kernel.org/r/20181115184140.1388751-1-pjaroszynski@nvidia.com Fixes: 82cb14175e7d ("xfs: add support for sub-pagesize writeback without buffer_heads") Signed-off-by: Piotr Jaroszynski <pjaroszynski@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Brian Foster <bfoster@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Yongkai Wu [Wed, 5 Dec 2018 00:13:19 +0000 (11:13 +1100)]
hugetlbfs: call VM_BUG_ON_PAGE earlier in free_huge_page()
A stack trace was triggered by VM_BUG_ON_PAGE(page_mapcount(page), page)
in free_huge_page(). Unfortunately, the page->mapping field was set to
NULL before this test. This made it more difficult to determine the root
cause of the problem.
Move the VM_BUG_ON_PAGE tests earlier in the function so that if they do
trigger more information is present in the page struct.
Link: http://lkml.kernel.org/r/1543491843-23438-1-git-send-email-nic_w@163.com Signed-off-by: Yongkai Wu <nic_w@163.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Yueyi Li [Wed, 5 Dec 2018 00:13:19 +0000 (11:13 +1100)]
memblock: annotate memblock_is_reserved() with __init_memblock
Found warning:
WARNING: EXPORT symbol "gsi_write_channel_scratch" [vmlinux] version generation failed, symbol will not be versioned.
WARNING: vmlinux.o(.text+0x1e0a0): Section mismatch in reference from the function valid_phys_addr_range() to the function .init.text:memblock_is_reserved()
The function valid_phys_addr_range() references
the function __init memblock_is_reserved().
This is often because valid_phys_addr_range lacks a __init
annotation or the annotation of memblock_is_reserved is wrong.
Baruch Siach [Wed, 5 Dec 2018 00:13:19 +0000 (11:13 +1100)]
psi: fix reference to kernel commandline enable
The kernel commandline parameter named in CONFIG_PSI_DEFAULT_DISABLED
help text contradicts the documentation in kernel-parameters.txt, and
the code. Fix that.
Link: http://lkml.kernel.org/r/20181203213416.GA12627@cmpxchg.org Fixes: e0c274472d ("psi: make disabling/enabling easier for vendor kernels") Signed-off-by: Baruch Siach <baruch@tkos.co.il> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Jian Wang [Wed, 5 Dec 2018 00:13:19 +0000 (11:13 +1100)]
ocfs2/dlm: return DLM_CANCELGRANT if the lock is on granted list and the operation is canceled
In dlm_move_lockres_to_recovery_list(), if the lock is in the granted
queue and cancel_pending is set, it will encounter a BUG. I think this is
a meaningless BUG, so be prepared to remove it. A scenario that causes
this BUG will be given below.
At the beginning, Node 1 is the master and has NL lock, Node 2 has PR
lock, Node 3 has PR lock too.
Node 1 Node 2 Node 3
want to get EX lock.
want to get EX lock.
Node 3 hinder
Node 2 to get
EX lock, send
Node 3 a BAST.
receive BAST from
Node 1. downconvert
thread begin to
cancel PR to EX conversion.
In dlmunlock_common function,
downconvert thread has set
lock->cancel_pending,
but did not enter
dlm_send_remote_unlock_request
function.
Node2 dies because
the host is powered down.
In recovery process,
clean the lock that
related to Node2.
then finish Node 3
PR to EX request.
give Node 3 a AST.
receive AST from Node 1.
change lock level to EX,
move lock to granted list.
Node1 dies because
the host is powered down.
In dlm_move_lockres_to_recovery_list
function. the lock is in the
granted queue and cancel_pending
is set. BUG_ON.
But after clearing this BUG, process will encounter
the second BUG in the ocfs2_unlock_ast function.
Here is a scenario that will cause the second BUG
in ocfs2_unlock_ast as follows:
At the beginning, Node 1 is the master and has NL lock,
Node 2 has PR lock, Node 3 has PR lock too.
Node 1 Node 2 Node 3
want to get EX lock.
want to get EX lock.
Node 3 hinder
Node 2 to get
EX lock, send
Node 3 a BAST.
receive BAST from
Node 1. downconvert
thread begin to
cancel PR to EX conversion.
In dlmunlock_common function,
downconvert thread has released
lock->spinlock and res->spinlock,
but did not enter
dlm_send_remote_unlock_request
function.
Node2 dies because
the host is powered down.
In recovery process,
clean the lock that
related to Node2.
then finish Node 3
PR to EX request.
give Node 3 a AST.
receive AST from Node 1.
change lock level to EX,
move lock to granted list,
set lockres->l_unlock_action
as OCFS2_UNLOCK_INVALID
in ocfs2_locking_ast function.
Node2 dies because
the host is powered down.
Node 3 realize that Node 1
is dead, remove Node 1 from
domain_map. downconvert thread
get DLM_NORMAL from
dlm_send_remote_unlock_request
function and set *call_ast as 1.
Then downconvert thread meet
BUG in ocfs2_unlock_ast function.
To avoid meet the second BUG, dlmunlock_common() should return
DLM_CANCELGRANT if the lock is on granted list and the operation is
canceled.
Link: http://lkml.kernel.org/r/98f0e80c-9c13-dbb6-047c-b40e100082b1@huawei.com Signed-off-by: Jian Wang <wangjian161@huawei.com> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Jian Wang [Wed, 5 Dec 2018 00:13:18 +0000 (11:13 +1100)]
ocfs2/dlm: clean DLM_LKSB_GET_LVB and DLM_LKSB_PUT_LVB when the cancel_pending is set
dlm_move_lockres_to_recovery_list() should clean DLM_LKSB_GET_LVB and
DLM_LKSB_PUT_LVB when the cancel_pending is set. Otherwise node may panic
in dlm_proxy_ast_handler.
Here is the situation: At the beginning, Node1 is the master of the lock
resource and has NL lock, Node2 has PR lock, Node3 has PR lock, Node4 has
NL lock.
Node1 Node2 Node3 Node4
convert lock_2 from
PR to EX.
the mode of lock_3 is
PR, which blocks the
conversion request of
Node2. move lock_2 to
conversion list.
convert lock_3 from
PR to EX.
move lock_3 to conversion
list. send BAST to Node3.
receive BAST from Node1.
downconvert thread execute
canceling convert operation.
Node2 dies because
the host is powered down.
in dlmunlock_common function,
the downconvert thread set
cancel_pending. at the same
time, Node 3 realized that
Node 1 is dead, so move lock_3
back to granted list in
dlm_move_lockres_to_recovery_list
function and remove Node 1 from
the domain_map in
__dlm_hb_node_down function.
then downconvert thread failed
to send the lock cancellation
request to Node1 and return
DLM_NORMAL from
dlm_send_remote_unlock_request
function.
become recovery master.
during the recovery
process, send
lock_2 that is
converting form
PR to EX to Node4.
during the recovery process,
send lock_3 in the granted list and
cantain the DLM_LKSB_GET_LVB
flag to Node4. Then downconvert thread
delete DLM_LKSB_GET_LVB flag in
dlmunlock_common function.
Node4 finish recovery.
the mode of lock_3 is
PR, which blocks the
conversion request of
Node2, so send BAST
to Node3.
receive BAST from Node4.
convert lock_3 from PR to NL.
change the mode of lock_3
from PR to NL and send
message to Node3.
receive message from
Node4. The message contain
LKM_GET_LVB flag, but the
lock->lksb->flags does not
contain DLM_LKSB_GET_LVB,
BUG_ON in dlm_proxy_ast_handler
function.
Link: http://lkml.kernel.org/r/bffbe5a5-acb6-652d-eb8a-99fb051e6631@huawei.com Signed-off-by: Jian Wang <wangjian161@huawei.com> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Mike Kravetz [Wed, 5 Dec 2018 00:13:18 +0000 (11:13 +1100)]
hugetlbfs: Use i_mmap_rwsem to fix page fault/truncate race
hugetlbfs page faults can race with truncate and hole punch operations.
Current code in the page fault path attempts to handle this by 'backing
out' operations if we encounter the race. One obvious omission in the
current code is removing a page newly added to the page cache. This is
pretty straight forward to address, but there is a more subtle and
difficult issue of backing out hugetlb reservations. To handle this
correctly, the 'reservation state' before page allocation needs to be
noted so that it can be properly backed out. There are four distinct
possibilities for reservation state: shared/reserved, shared/no-resv,
private/reserved and private/no-resv. Backing out a reservation may
require memory allocation which could fail so that needs to be taken into
account as well.
Instead of writing the required complicated code for this rare occurrence,
just eliminate the race. i_mmap_rwsem is now held in read mode for the
duration of page fault processing. Hold i_mmap_rwsem longer in truncation
and hold punch code to cover the call to remove_inode_hugepages.
Link: http://lkml.kernel.org/r/20181203200850.6460-3-mike.kravetz@oracle.com Fixes: ebed4bfc8da8 ("hugetlb: fix absurd HugePages_Rsvd") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Prakash Sangappa <prakash.sangappa@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Mike Kravetz [Wed, 5 Dec 2018 00:13:18 +0000 (11:13 +1100)]
hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
Patch series "hugetlbfs: use i_mmap_rwsem for better synchronization".
These patches are a follow up to the RFC,
http://lkml.kernel.org/r/20181024045053.1467-1-mike.kravetz@oracle.com
Comments made by Naoya were addressed.
There are two primary issues addressed here:
1) For shared pmds, huge PE pointers returned by huge_pte_alloc can
become invalid via a call to huge_pmd_unshare by another thread.
2) hugetlbfs page faults can race with truncation causing invalid
global reserve counts and state.
Both issues are addressed by expanding the use of i_mmap_rwsem.
These issues have existed for a long time. They can be recreated with a
test program that causes page fault/truncation races. For simple
mappings, this results in a negative HugePages_Rsvd count. If racing with
mappings that contain shared pmds, we can hit "BUG at
fs/hugetlbfs/inode.c:444!" or Oops! as the result of an invalid memory
reference.
I broke up the larger RFC into separate patches addressing each issue.
Hopefully, this is easier to understand/review.
This patch (of 3):
While looking at BUGs associated with invalid huge page map counts, it was
discovered and observed that a huge pte pointer could become 'invalid' and
point to another task's page table. Consider the following:
A task takes a page fault on a shared hugetlbfs file and calls
huge_pte_alloc to get a ptep. Suppose the returned ptep points to a
shared pmd.
Now, another task truncates the hugetlbfs file. As part of truncation, it
unmaps everyone who has the file mapped. If the range being truncated is
covered by a shared pmd, huge_pmd_unshare will be called. For all but the
last user of the shared pmd, huge_pmd_unshare will clear the pud pointing
to the pmd. If the task in the middle of the page fault is not the last
user, the ptep returned by huge_pte_alloc now points to another task's
page table or worse. This leads to bad things such as incorrect page
map/reference counts or invalid memory references.
To fix, expand the use of i_mmap_rwsem as follows:
- i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
huge_pmd_share is only called via huge_pte_alloc, so callers of
huge_pte_alloc take i_mmap_rwsem before calling. In addition, callers
of huge_pte_alloc continue to hold the semaphore until finished with the
ptep.
- i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is called.
Link: http://lkml.kernel.org/r/20181203200850.6460-2-mike.kravetz@oracle.com Fixes: 39dde65c9940 ("shared page table for hugetlb page") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Prakash Sangappa <prakash.sangappa@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
David Rientjes [Wed, 5 Dec 2018 00:13:17 +0000 (11:13 +1100)]
mm, thp: always specify disabled vmas as nh in smaps
1860033237d4 ("mm: make PR_SET_THP_DISABLE immediately active") introduced
a regression in that userspace cannot always determine the set of vmas
where thp is disabled.
Userspace relies on the "nh" flag being emitted as part of /proc/pid/smaps
to determine if a vma has been disabled from being backed by hugepages.
Previous to this commit, prctl(PR_SET_THP_DISABLE, 1) would cause thp to
be disabled and emit "nh" as a flag for the corresponding vmas as part of
/proc/pid/smaps. After the commit, thp is disabled by means of an mm flag
and "nh" is not emitted.
This causes smaps parsing libraries to assume a vma is enabled for thp and
ends up puzzling the user on why its memory is not backed by thp.
This also clears the "hg" flag to make the behavior of MADV_HUGEPAGE and
PR_SET_THP_DISABLE definitive.
Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1809251449060.96762@chino.kir.corp.google.com Fixes: 1860033237d4 ("mm: make PR_SET_THP_DISABLE immediately active") Signed-off-by: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Mark Brown [Wed, 5 Dec 2018 00:13:17 +0000 (11:13 +1100)]
arch/sh/include/asm/io.h: provide prototypes for PCI I/O mapping in asm/io.h
Most architectures provide prototypes for the PCI I/O mapping operations
when asm/io.h is included but SH doesn't currently do that, leading to for
example warnings in sound/pci/hda/patch_ca0132.c when pci_iomap() is used
on current -next. Make SH more consistent with other architectures by
including asm-generic/pci_iomap.h in asm/io.h.
Link: http://lkml.kernel.org/r/20181106175142.27988-1-broonie@kernel.org Signed-off-by: Mark Brown <broonie@kernel.org> Reported-by: kbuild test robot <lkp@intel.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Rich Felker <dalias@libc.org> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Mike Rapoport [Wed, 5 Dec 2018 00:13:17 +0000 (11:13 +1100)]
alpha: fix hang caused by the bootmem removal
The conversion of alpha to memblock as the early memory manager caused
boot to hang as described at [1].
The issue is caused because for CONFIG_DISCTONTIGMEM=y case,
memblock_add() is called using memory start PFN that had been rounded down
to the nearest 8Mb and it caused memblock to see more memory that is
actually present in the system.
Besides, memblock allocates memory from high addresses while bootmem was
using low memory, which broke the assumption that early allocations are
always accessible by the hardware.
This patch ensures that memblock_add() is using the correct PFN for the
memory start and forces memblock to use bottom-up allocations.
[1] https://lkml.org/lkml/2018/11/22/1032
Link: http://lkml.kernel.org/r/1543233216-25833-1-git-send-email-rppt@linux.ibm.com Reported-by: Meelis Roos <mroos@linux.ee> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Tested-by: Meelis Roos <mroos@linux.ee> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Matt Turner <mattst88@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Yangtao Li [Wed, 21 Nov 2018 15:22:54 +0000 (10:22 -0500)]
drivers/tty: add missing of_node_put()
of_find_node_by_path() acquires a reference to the node
returned by it and that reference needs to be dropped by its caller.
This place is not doing this, so fix it.
Signed-off-by: Yangtao Li <tiny.windzz@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Sun, 2 Dec 2018 20:19:44 +0000 (12:19 -0800)]
Merge tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
Pull ARM SoC fixes from Olof Johansson:
"Volume is a little higher than usual due to a set of gpio fixes for
Davinci platforms that's been around a while, still seemed appropriate
to not hold off until next merge window.
Besides that it's the usual mix of minor fixes, mostly corrections of
small stuff in device trees.
Major stability-related one is the removal of a regulator from DT on
Rock960, since DVFS caused undervoltage. I expect it'll be restored
once they figure out the underlying issue"
* tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (28 commits)
MAINTAINERS: Remove unused Qualcomm SoC mailing list
ARM: davinci: dm644x: set the GPIO base to 0
ARM: davinci: da830: set the GPIO base to 0
ARM: davinci: dm355: set the GPIO base to 0
ARM: davinci: dm646x: set the GPIO base to 0
ARM: davinci: dm365: set the GPIO base to 0
ARM: davinci: da850: set the GPIO base to 0
gpio: davinci: restore a way to manually specify the GPIO base
ARM: davinci: dm644x: define gpio interrupts as separate resources
ARM: davinci: dm355: define gpio interrupts as separate resources
ARM: davinci: dm646x: define gpio interrupts as separate resources
ARM: davinci: dm365: define gpio interrupts as separate resources
ARM: davinci: da8xx: define gpio interrupts as separate resources
ARM: dts: at91: sama5d2: use the divided clock for SMC
ARM: dts: imx51-zii-rdu1: Remove EEPROM node
ARM: dts: rockchip: Remove @0 from the veyron memory node
arm64: dts: rockchip: Fix PCIe reset polarity for rk3399-puma-haikou.
arm64: dts: qcom: msm8998: Reserve gpio ranges on MTP
arm64: dts: sdm845-mtp: Reserve reserved gpios
arm64: dts: ti: k3-am654: Fix wakeup_uart reg address
...
Linus Torvalds [Sun, 2 Dec 2018 20:15:55 +0000 (12:15 -0800)]
Merge tag 'for-linus-4.20a-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
- A revert of a previous commit as it is no longer necessary and has
shown to cause problems in some memory hotplug cases.
- Some small fixes and a minor cleanup.
- A patch for adding better diagnostic data in a very rare failure
case.
* tag 'for-linus-4.20a-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
pvcalls-front: fixes incorrect error handling
Revert "xen/balloon: Mark unallocated host memory as UNUSABLE"
xen: xlate_mmu: add missing header to fix 'W=1' warning
xen/x86: add diagnostic printout to xen_mc_flush() in case of error
x86/xen: cleanup includes in arch/x86/xen/spinlock.c
Linus Torvalds [Sun, 2 Dec 2018 20:07:27 +0000 (12:07 -0800)]
Merge tag 'dmaengine-fix-4.20-rc5' of git://git.infradead.org/users/vkoul/slave-dma
Pull dmaengine fixes from Vinod Koul:
"This contains two fixes to at_hdmac which fixes long standing bus
reported recently on serial transfers causing memory leak. These fixes
were done by Richard Genoud"
* tag 'dmaengine-fix-4.20-rc5' of git://git.infradead.org/users/vkoul/slave-dma:
dmaengine: at_hdmac: fix module unloading
dmaengine: at_hdmac: fix memory leak in at_dma_xlate()
Linus Torvalds [Sat, 1 Dec 2018 20:35:48 +0000 (12:35 -0800)]
Merge branch 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull STIBP fallout fixes from Thomas Gleixner:
"The performance destruction department finally got it's act together
and came up with a cure for the STIPB regression:
- Provide a command line option to control the spectre v2 user space
mitigations. Default is either seccomp or prctl (if seccomp is
disabled in Kconfig). prctl allows mitigation opt-in, seccomp
enables the migitation for sandboxed processes.
- Rework the code to handle the conditional STIBP/IBPB control and
remove the now unused ptrace_may_access_sched() optimization
attempt
- Disable STIBP automatically when SMT is disabled
- Optimize the switch_to() logic to avoid MSR writes and invocations
of __switch_to_xtra().
- Make the asynchronous speculation TIF updates synchronous to
prevent stale mitigation state.
As a general cleanup this also makes retpoline directly depend on
compiler support and removes the 'minimal retpoline' option which just
pretended to provide some form of security while providing none"
* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (31 commits)
x86/speculation: Provide IBPB always command line options
x86/speculation: Add seccomp Spectre v2 user space protection mode
x86/speculation: Enable prctl mode for spectre_v2_user
x86/speculation: Add prctl() control for indirect branch speculation
x86/speculation: Prepare arch_smt_update() for PRCTL mode
x86/speculation: Prevent stale SPEC_CTRL msr content
x86/speculation: Split out TIF update
ptrace: Remove unused ptrace_may_access_sched() and MODE_IBRS
x86/speculation: Prepare for conditional IBPB in switch_mm()
x86/speculation: Avoid __switch_to_xtra() calls
x86/process: Consolidate and simplify switch_to_xtra() code
x86/speculation: Prepare for per task indirect branch speculation control
x86/speculation: Add command line control for indirect branch speculation
x86/speculation: Unify conditional spectre v2 print functions
x86/speculataion: Mark command line parser data __initdata
x86/speculation: Mark string arrays const correctly
x86/speculation: Reorder the spec_v2 code
x86/l1tf: Show actual SMT state
x86/speculation: Rework SMT state change
sched/smt: Expose sched_smt_present static key
...
Linus Torvalds [Sat, 1 Dec 2018 19:36:32 +0000 (11:36 -0800)]
Merge tag 'for-linus-20181201' of git://git.kernel.dk/linux-block
Pull block layer fixes from Jens Axboe:
- Single range elevator discard merge fix, that caused crashes (Ming)
- Fix for a regression in O_DIRECT, where we could potentially lose the
error value (Maximilian Heyne)
- NVMe pull request from Christoph, with little fixes all over the map
for NVMe.
* tag 'for-linus-20181201' of git://git.kernel.dk/linux-block:
block: fix single range discard merge
nvme-rdma: fix double freeing of async event data
nvme: flush namespace scanning work just before removing namespaces
nvme: warn when finding multi-port subsystems without multipathing enabled
fs: fix lost error code in dio_complete
nvme-pci: fix surprise removal
nvme-fc: initialize nvme_req(rq)->ctrl after calling __nvme_fc_init_request()
nvme: Free ctrl device name on init failure
* tag 'pci-v4.20-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
PCI: Fix incorrect value returned from pcie_get_speed_cap()
PCI: dwc: Fix MSI-X EP framework address calculation bug
PCI: layerscape: Fix wrong invocation of outbound window disable accessor
PCI: imx6: Fix link training status detection in link up check
* lorenzo/pci/controller-fixes:
PCI: dwc: Fix MSI-X EP framework address calculation bug
PCI: layerscape: Fix wrong invocation of outbound window disable accessor
PCI: imx6: Fix link training status detection in link up check
Mikulas Patocka [Mon, 26 Nov 2018 16:37:13 +0000 (10:37 -0600)]
PCI: Fix incorrect value returned from pcie_get_speed_cap()
The macros PCI_EXP_LNKCAP_SLS_*GB are values, not bit masks. We must mask
the register and compare it against them.
This fixes errors like this:
amdgpu: [powerplay] failed to send message 261 ret is 0
when a PCIe-v3 card is plugged into a PCIe-v1 slot, because the slot is
being incorrectly reported as PCIe-v3 capable.
6cf57be0f78e, which appeared in v4.17, added pcie_get_speed_cap() with the
incorrect test of PCI_EXP_LNKCAP_SLS as a bitmask. 5d9a63304032, which
appeared in v4.19, changed amdgpu to use pcie_get_speed_cap(), so the
amdgpu bug reports below are regressions in v4.19.
Fixes: 6cf57be0f78e ("PCI: Add pcie_get_speed_cap() to find max supported link speed") Fixes: 5d9a63304032 ("drm/amdgpu: use pcie functions for link width and speed") Link: https://bugs.freedesktop.org/show_bug.cgi?id=108704 Link: https://bugs.freedesktop.org/show_bug.cgi?id=108778 Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
[bhelgaas: update comment, remove use of PCI_EXP_LNKCAP_SLS_8_0GB and
PCI_EXP_LNKCAP_SLS_16_0GB since those should be covered by PCI_EXP_LNKCAP2,
remove test of PCI_EXP_LNKCAP for zero, since that register is required] Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org # v4.17+
Linus Torvalds [Sat, 1 Dec 2018 02:45:49 +0000 (18:45 -0800)]
Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
"31 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (31 commits)
ocfs2: fix potential use after free
mm/khugepaged: fix the xas_create_range() error path
mm/khugepaged: collapse_shmem() do not crash on Compound
mm/khugepaged: collapse_shmem() without freezing new_page
mm/khugepaged: minor reorderings in collapse_shmem()
mm/khugepaged: collapse_shmem() remember to clear holes
mm/khugepaged: fix crashes due to misaccounted holes
mm/khugepaged: collapse_shmem() stop if punched or truncated
mm/huge_memory: fix lockdep complaint on 32-bit i_size_read()
mm/huge_memory: splitting set mapping+index before unfreeze
mm/huge_memory: rename freeze_page() to unmap_page()
initramfs: clean old path before creating a hardlink
kernel/kcov.c: mark funcs in __sanitizer_cov_trace_pc() as notrace
psi: make disabling/enabling easier for vendor kernels
proc: fixup map_files test on arm
debugobjects: avoid recursive calls with kmemleak
userfaultfd: shmem: UFFDIO_COPY: set the page dirty if VM_WRITE is not set
userfaultfd: shmem: add i_size checks
userfaultfd: shmem/hugetlbfs: only allow to register VM_MAYWRITE vmas
userfaultfd: shmem: allocate anonymous memory for MAP_PRIVATE shmem
...
Linus Torvalds [Sat, 1 Dec 2018 02:39:07 +0000 (18:39 -0800)]
Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Catalin Marinas:
- Cortex-A76 erratum workaround
- ftrace fix to enable syscall events on arm64
- Fix uninitialised pointer in iort_get_platform_device_domain()
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
ACPI/IORT: Fix iort_get_platform_device_domain() uninitialized pointer value
arm64: ftrace: Fix to enable syscall events on arm64
arm64: Add workaround for Cortex-A76 erratum 1286807
Linus Torvalds [Sat, 1 Dec 2018 02:36:30 +0000 (18:36 -0800)]
Merge tag 'gcc-plugins-v4.20-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull stackleak plugin fix from Kees Cook:
"Fix crash by not allowing kprobing of stackleak_erase() (Alexander
Popov)"
* tag 'gcc-plugins-v4.20-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
stackleak: Disable function tracing and kprobes for stackleak_erase()
Linus Torvalds [Sat, 1 Dec 2018 02:32:33 +0000 (18:32 -0800)]
Merge tag 'fscache-fixes-20181130' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull fscache and cachefiles fixes from David Howells:
"Misc fixes:
- Fix an assertion failure at fs/cachefiles/xattr.c:138 caused by a
race between a cache object lookup failing and someone attempting
to reenable that object, thereby triggering an update of the
object's attributes.
- Fix an assertion failure at fs/fscache/operation.c:449 caused by a
split atomic subtract and atomic read that allows a race to happen.
- Fix a leak of backing pages when simultaneously reading the same
page from the same object from two or more threads.
- Fix a hang due to a race between a cache object being discarded and
the corresponding cookie being reenabled.
There are also some minor cleanups:
- Cast an enum value to a different enum type to prevent clang from
generating a warning. This shouldn't cause any sort of change in
the emitted code.
- Use ktime_get_real_seconds() instead of get_seconds(). This is just
used to uniquify a filename for an object to be placed in the
graveyard. Objects placed there are deleted by cachfilesd in
userspace immediately thereafter.
- Remove an initialised, but otherwise unused variable. This should
have been entirely optimised away anyway"
* tag 'fscache-fixes-20181130' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
fscache, cachefiles: remove redundant variable 'cache'
cachefiles: avoid deprecated get_seconds()
cachefiles: Explicitly cast enumerated type in put_object
fscache: fix race between enablement and dropping of object
cachefiles: Fix page leak in cachefiles_read_backing_file while vmscan is active
fscache: Fix race in fscache_op_complete() due to split atomic_sub & read
cachefiles: Fix an assertion failure when trying to update a failed object
Paul Burton [Fri, 30 Nov 2018 19:57:22 +0000 (11:57 -0800)]
MAINTAINERS: Update linux-mips mailing list address
The linux-mips.org infrastructure has been unreliable recently & nobody
with sufficient access to fix it is around to do so. As a result we're
moving away from it, and part of this is migrating our mailing list to
kernel.org.
Replace all instances of linux-mips@linux-mips.org in MAINTAINERS with
the shiny new linux-mips@vger.kernel.org address.
The new list is now being archived on kernel.org at
https://lore.kernel.org/linux-mips/ which also holds the history of the
old linux-mips.org list.
Signed-off-by: Paul Burton <paul.burton@mips.com> Cc: linux-mips@vger.kernel.org Cc: linux-mips@linux-mips.org
Pan Bian [Fri, 30 Nov 2018 22:10:54 +0000 (14:10 -0800)]
ocfs2: fix potential use after free
ocfs2_get_dentry() calls iput(inode) to drop the reference count of
inode, and if the reference count hits 0, inode is freed. However, in
this function, it then reads inode->i_generation, which may result in a
use after free bug. Move the put operation later.
Link: http://lkml.kernel.org/r/1543109237-110227-1-git-send-email-bianpan2016@163.com Fixes: 781f200cb7a("ocfs2: Remove masklog ML_EXPORT.") Signed-off-by: Pan Bian <bianpan2016@163.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Fri, 30 Nov 2018 22:10:50 +0000 (14:10 -0800)]
mm/khugepaged: fix the xas_create_range() error path
collapse_shmem()'s xas_nomem() is very unlikely to fail, but it is
rightly given a failure path, so move the whole xas_create_range() block
up before __SetPageLocked(new_page): so that it does not need to
remember to unlock_page(new_page).
Add the missing mem_cgroup_cancel_charge(), and set (currently unused)
result to SCAN_FAIL rather than SCAN_SUCCEED.
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261531200.2275@eggly.anvils Fixes: 77da9389b9d5 ("mm: Convert collapse_shmem to XArray") Signed-off-by: Hugh Dickins <hughd@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Fri, 30 Nov 2018 22:10:47 +0000 (14:10 -0800)]
mm/khugepaged: collapse_shmem() do not crash on Compound
collapse_shmem()'s VM_BUG_ON_PAGE(PageTransCompound) was unsafe: before
it holds page lock of the first page, racing truncation then extension
might conceivably have inserted a hugepage there already. Fail with the
SCAN_PAGE_COMPOUND result, instead of crashing (CONFIG_DEBUG_VM=y) or
otherwise mishandling the unexpected hugepage - though later we might
code up a more constructive way of handling it, with SCAN_SUCCESS.
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261529310.2275@eggly.anvils Fixes: f3f0e1d2150b2 ("khugepaged: add support of collapse for tmpfs/shmem pages") Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Matthew Wilcox <willy@infradead.org> Cc: <stable@vger.kernel.org> [4.8+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Fri, 30 Nov 2018 22:10:43 +0000 (14:10 -0800)]
mm/khugepaged: collapse_shmem() without freezing new_page
khugepaged's collapse_shmem() does almost all of its work, to assemble
the huge new_page from 512 scattered old pages, with the new_page's
refcount frozen to 0 (and refcounts of all old pages so far also frozen
to 0). Including shmem_getpage() to read in any which were out on swap,
memory reclaim if necessary to allocate their intermediate pages, and
copying over all the data from old to new.
Imagine the frozen refcount as a spinlock held, but without any lock
debugging to highlight the abuse: it's not good, and under serious load
heads into lockups - speculative getters of the page are not expecting
to spin while khugepaged is rescheduled.
One can get a little further under load by hacking around elsewhere; but
fortunately, freezing the new_page turns out to have been entirely
unnecessary, with no hacks needed elsewhere.
The huge new_page lock is already held throughout, and guards all its
subpages as they are brought one by one into the page cache tree; and
anything reading the data in that page, without the lock, before it has
been marked PageUptodate, would already be in the wrong. So simply
eliminate the freezing of the new_page.
Each of the old pages remains frozen with refcount 0 after it has been
replaced by a new_page subpage in the page cache tree, until they are
all unfrozen on success or failure: just as before. They could be
unfrozen sooner, but cause no problem once no longer visible to
find_get_entry(), filemap_map_pages() and other speculative lookups.
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261527570.2275@eggly.anvils Fixes: f3f0e1d2150b2 ("khugepaged: add support of collapse for tmpfs/shmem pages") Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Matthew Wilcox <willy@infradead.org> Cc: <stable@vger.kernel.org> [4.8+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Fri, 30 Nov 2018 22:10:39 +0000 (14:10 -0800)]
mm/khugepaged: minor reorderings in collapse_shmem()
Several cleanups in collapse_shmem(): most of which probably do not
really matter, beyond doing things in a more familiar and reassuring
order. Simplify the failure gotos in the main loop, and on success
update stats while interrupts still disabled from the last iteration.
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261526400.2275@eggly.anvils Fixes: f3f0e1d2150b2 ("khugepaged: add support of collapse for tmpfs/shmem pages") Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Matthew Wilcox <willy@infradead.org> Cc: <stable@vger.kernel.org> [4.8+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Fri, 30 Nov 2018 22:10:35 +0000 (14:10 -0800)]
mm/khugepaged: collapse_shmem() remember to clear holes
Huge tmpfs testing reminds us that there is no __GFP_ZERO in the gfp
flags khugepaged uses to allocate a huge page - in all common cases it
would just be a waste of effort - so collapse_shmem() must remember to
clear out any holes that it instantiates.
The obvious place to do so, where they are put into the page cache tree,
is not a good choice: because interrupts are disabled there. Leave it
until further down, once success is assured, where the other pages are
copied (before setting PageUptodate).
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261525080.2275@eggly.anvils Fixes: f3f0e1d2150b2 ("khugepaged: add support of collapse for tmpfs/shmem pages") Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Matthew Wilcox <willy@infradead.org> Cc: <stable@vger.kernel.org> [4.8+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>